To match a given script in Unicode, you can use Unicode character
properties, which in Perl regexes look like /\p{Bengali}/, for example.
See the following link for more explanation:
http://perldoc.perl.org/perlunicode.html#*Scripts*

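So an oversimplified example (-CSD makes perl treat input and output as
UTF-8, so that \p{Bengali} matches characters rather than raw bytes):
echo "ভারত is a vast country" | perl -CSD -pe 's/(\p{Bengali})/$1 /g'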


Unicode Character Properties are very useful for many kinds of text
preprocessing.

Best,
-Jon

On Wed, Dec 04, 2013 at 01:41:44PM +0100, Prasanth K wrote:
>    @Thomas, I don't think his intention was to search for a given UTF-8
>    string; it was more to identify strings in a different language.
>    @Pranjal, when you say unicode words, understand that everything (even
>    the English words) is encoded in UTF-8. It's more that you want to pick
>    the Bengali (read: foreign-language) fragments from your sentence so
>    that they are skipped when given to the decoder. I recall doing
>    something like this by defining a range of valid characters (which is
>    the Unicode code points for characters in Bengali in this case) and
>    writing a filter to mark such characters. Maybe not a cool solution,
>    but one that will work for ILs, given that they have different scripts
>    and code values in Unicode.
>    - Regards,
>    Prasanth
> 
>    On Wed, Dec 4, 2013 at 11:12 AM, Thomas Meyer
>    <[1][email protected]> wrote:
> 
>    Hi,
>    echo "ভারত is a vast country" | perl -pe '/(ভারত)/; print $1."\n"'
>    if you want to replace it with something:
>    echo "ভারত is a vast country" | perl -pe 's/ভারত/something/g'
>    But this is normally not the list to ask perl questions to...
>    Best,
>    Thomas
>    On 04/12/13 10:55, Pranjal Das wrote:
> 
>    Can anyone help with a perl script to extract unicode words from a
>    sentence? E.g., I want to extract the word ভারত from the
>    sentence "ভারত is a vast country".
>    Pranjal Das
>    Department of Information Technology,
>    Institute of Science and Technology,
>    Gauhati University,Guwahati,Assam
>    Phone- [2]+91-8399879454
> 
> _______________________________________________
> Moses-support mailing list
> [3][email protected]
> [4]http://mailman.mit.edu/mailman/listinfo/moses-support
> 
>      _______________________________________________
>      Moses-support mailing list
>      [5][email protected]
>      [6]http://mailman.mit.edu/mailman/listinfo/moses-support
> 
>    --
>    "Theories have four stages of acceptance. i) this is worthless
>    nonsense; ii) this is an interesting, but perverse, point of view, iii)
>    this is true, but quite unimportant; iv) I always said so."
>      --- J.B.S. Haldane
> 
> References
> 
>    1. mailto:[email protected]
>    2. tel:%2B91-8399879454
>    3. mailto:[email protected]
>    4. http://mailman.mit.edu/mailman/listinfo/moses-support
>    5. mailto:[email protected]
>    6. http://mailman.mit.edu/mailman/listinfo/moses-support
