To match a given script in Unicode, you can use Unicode character
properties, which in Perl regexes look like /\p{Bengali}/, for example.
See the following link for more explanation:
http://perldoc.perl.org/perlunicode.html#*Scripts*
So an oversimplified example:
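echo "ভারত is a vast country" | perl -CSD -pe 's/(\p{Bengali})/$1 /g'
(-CSD makes perl read and write UTF-8; without it, perl by default sees
raw bytes and \p{Bengali} does not match.)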
Unicode Character Properties are very useful for many kinds of text
preprocessing.
Best,
-Jon
On Wed, Dec 04, 2013 at 01:41:44PM +0100, Prasanth K wrote:
> @Thomas, I don't think his intention was to search for a given UTF-8
> string; it was more to identify strings in a different language.
> @Pranjal, when you say unicode words, understand that everything (even
> the English words) is encoded in UTF-8. It's more like you want to pick
> the Bengali (read as foreign-language) fragments from your sentence so
> that they are skipped when given to the decoder. I recall doing
> something like this by defining a range of valid characters (the
> Unicode code points for Bengali characters in this case) and
> writing a filter to mark such characters. Maybe not a cool solution,
> but one that will work for ILs, given that they have different scripts
> and code values in Unicode.
> - Regards,
> Prasanth
>
> On Wed, Dec 4, 2013 at 11:12 AM, Thomas Meyer
> <[1][email protected]> wrote:
>
> Hi,
> echo "ভারত is a vast country" | perl -ne 'print "$1\n" if /(ভারত)/'
> if you want to replace it with something:
> echo "ভারত is a vast country" | perl -pe 's/ভারত/something/g'
> But this is normally not the list to ask Perl questions on...
> Best,
> Thomas
> On 04/12/13 10:55, Pranjal Das wrote:
>
> Can anyone help with a Perl script to extract Unicode words from a
> sentence? E.g., I want to extract the word ভারত from the
> sentence "ভারত is a vast country".
> Pranjal Das
> Department of Information Technology,
> Institute of Science and Technology,
> Gauhati University,Guwahati,Assam
> Phone- [2]+91-8399879454
>
> _______________________________________________
> Moses-support mailing list
> [3][email protected]
> [4]http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> "Theories have four stages of acceptance. i) this is worthless
> nonsense; ii) this is an interesting, but perverse, point of view, iii)
> this is true, but quite unimportant; iv) I always said so."
> --- J.B.S. Haldane
>
> References
>
> 1. mailto:[email protected]
> 2. tel:%2B91-8399879454
> 3. mailto:[email protected]
> 4. http://mailman.mit.edu/mailman/listinfo/moses-support
> 5. mailto:[email protected]
> 6. http://mailman.mit.edu/mailman/listinfo/moses-support