Hi Group,
I am working with Unicode (UTF-8 encoded) text and am facing a problem with a regular expression:

    s/(\p{HinNumerals})\s+($tokenize_string)+\s+(\p{HinNumerals})/$1$2$3/g;

My HinNumerals property is defined as:

    sub HinNumerals {
        return <<END;
    0966\t096F
    END
    }

$tokenize_string is a set of punctuation marks joined into one string with '|' between them for alternation. Here it is:

    $tokenize_string = "\x{0021}|\x{0022}|\x{0023}|\x{0025}|\x{0026}|\x{0027}|\x{0028}|\x{0029}|\x{002A}|\x{002B}|\x{002C}|\x{002D}|\x{002E}|\x{002F}|\x{003A}|\x{003B}|\x{003C}|\x{003D}|\x{003E}|\x{003F}|\x{0040}|\x{005B}|\x{005C}|\x{005D}|\x{005E}|\x{005F}|\x{007B}|\x{007C}|\x{007D}|\x{007E}|\x{0964}|\x{0965}";

Initially $_ contains:

    इस बिन्दु पर लाभ बहुत ही कम , रु .२ , ००० – ४ , ००० रु प्रति कार हैं ।

With the substitution I am trying to remove the blank spaces between the Hindi numerals and the number separators. So I want the substring २ , ००० – ४ , ००० to become २,०००–४,०००.

HinNumerals defines the range of the Hindi numerals. The regular expression looks for punctuation marks that sit between two Hindi numerals with spaces on both sides, and removes those spaces. The /g modifier should make sure it is applied at every such place in $_.

But the result is odd. I get:

    बिन्दु पर लाभ बहुत ही कम , रु .२,०००–४ , ००० रु प्रति कार हैं ।

The regexp works for the first two instances (२ , ० and ० – ४) but not for the last one, ४ , ०. I am puzzled why this happens. Any clue or solution would be useful.

Baskaran
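
P.S. A cut-down script along the following lines should reproduce the symptom without the rest of the program. It is only a sketch under a few assumptions: the explicit range [\x{0966}-\x{096F}] stands in for \p{HinNumerals}, the separator set is reduced to comma and ASCII hyphen, the file is saved as UTF-8, and the names $digit, $sep and squeeze are made up for this example.

```perl
use strict;
use warnings;
use utf8;                                # this file contains literal Devanagari text
binmode(STDOUT, ':encoding(UTF-8)');     # print characters as UTF-8

# Assumption: explicit range instead of the user-defined \p{HinNumerals}.
my $digit = qr/[\x{0966}-\x{096F}]/;
# Assumption: only comma and hyphen, instead of the full $tokenize_string.
my $sep   = qr/[,\-]/;

# Same shape as the substitution in the question:
# digit, spaces, separator(s), spaces, digit.
sub squeeze {
    my ($text) = @_;
    $text =~ s/($digit)\s+($sep)+\s+($digit)/$1$2$3/g;
    return $text;
}

print squeeze("२ , ००० - ४ , ०००"), "\n";
```

Under these assumptions the first two separators lose their spaces and the last one keeps them, which matches the behaviour described above.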