Tharaka,
Two general questions about Sinhalese:

1) When written, is a sentence normally segmented with spaces between the words?
2) Is the end-of-sentence marker a standard Latin full stop (period), with the other standard Latin punctuation marks (comma, semicolon, etc.)?

If both answers are yes, then tokenizer.perl can work with the addition of a new nonbreaking_prefix file. If not, you'll need the help of a linguist with Sinhalese experience. In that case, you might consider training a Sinhalese tokenizer with the Stanford Segmenter:

http://nlp.stanford.edu/software/segmenter.shtml

Another point to consider: is the character encoding of your Sinhalese text UTF-8? The tokenizer.perl script must have UTF-8 text, and this might explain why "some characters" are not read by the tokenizer.

Tom

On Wed, 30 May 2012 19:38:28 +0530, tharaka weheragoda wrote:

Thanks, Lane. Some characters in Sinhala are not read by this tokenizer. That's the problem I'm having. It gives me some abnormal characters instead of the unread characters. Do you know how to overcome this problem?

On Wed, May 30, 2012 at 6:57 PM, Lane Schwartz wrote:

Tharaka,

If you take a look at the file for English, you can see some good examples of what constitutes a nonbreaking prefix. The file is commented, explaining what's going on.

The basic idea is that the tokenizer needs to know when a token that ends in a period should be split, separating the period into its own token. For example:

Mr. Smith spoke to Dr. Jones.

For the sentence above, we would like the tokenizer to return the following:

Mr. Smith spoke to Dr. Jones .

The tokens "Mr." and "Dr." are abbreviations. Thus "Mr" and "Dr" represent prefixes which, when followed by a period, should be considered nonbreaking. That is, if you encounter "Mr." or "Dr." the tokenizer should leave them intact. Contrast this with the token "Jones." The token "Jones" is not found in the nonbreaking prefixes list for English, and so the tokenizer should assume that "Jones."
should be split into two tokens: "Jones" and "."

The rules about when a period should be split into a separate token are language-specific. If you want to write a nonbreaking prefixes file for a new language, you will probably need some knowledge of that language in order to write a sensible file.

If you don't have access to someone with knowledge of the language of interest, you could try to develop an unsupervised statistical technique that attempts to learn nonbreaking prefixes from untokenized text.

In the worst case, you can just let the tokenizer run without a language-specific nonbreaking prefixes file.

Hope that helps,
Lane

On Wed, May 30, 2012 at 9:09 AM, Tom Hoar wrote:
> When you compiled moses, it created a scripts folder. In there, you'll find
> the subfolders "scripts/tokenizer/nonbreaking_prefixes". The files in this
> folder all have the same name with a 2-letter language code extension. These
> files have language-specific rules for how the tokenizer & detokenizer work.
>
> Anyone, is there a better resource than reading the existing files to learn
> how the files work?
>
> Tom
>
> On Wed, 30 May 2012 18:22:52 +0530, tharaka weheragoda wrote:
>
> Thank you very much for your answer. But I'm new to this field and I'm not
> aware of how to create nonbreaking_prefix files. Is there any particular
> way of doing this? Can you explain a bit more?
>
> On Wed, May 30, 2012 at 6:13 PM, Tom Hoar wrote:
>>
>> Build your own nonbreaking_prefixes file. Name it with the extension you
>> want to use and save it in the nonbreaking_prefixes subfolder under the
>> moses scripts/tokenizer folder. The existing files are commented with
>> instructions to help you.
>>
>> Tom
>>
>> On Wed, 30 May 2012 17:37:19 +0530, tharaka weheragoda wrote:
>>
>> Hi everybody,
>>
>> When I'm trying to tokenize my Sinhala dataset it gives me a warning
>> message like this:
>> "WARNING: No known abbreviations for language 'si', attempting fall-back
>> to English version..."
>>
>> And my letters have changed a bit. Is there any way to tokenize Sinhala
>> data with this tokenizer.perl?
>>
>> I'm looking forward to your help.
>>
>> Thanks in advance!
>> Tharaka
>
> _______________________________________________
> Moses-support mailing list
> http://mailman.mit.edu/mailman/listinfo/moses-support

--
When a place gets crowded enough to require ID's, social collapse is not
far away. It is time to go elsewhere. The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"
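
The nonbreaking-prefix rule described in this thread can be illustrated with a few lines of Python. This is only a rough sketch of the idea, not the logic tokenizer.perl actually uses: the name `split_final_periods` and the in-memory set `NONBREAKING_PREFIXES` are hypothetical stand-ins for the script and for a nonbreaking_prefix file, and the real tokenizer handles many more cases.

```python
# Illustrative sketch only; this is NOT how tokenizer.perl is implemented.
# NONBREAKING_PREFIXES is a hypothetical stand-in for the contents of a
# nonbreaking_prefix file (one abbreviation per entry, without the period).
NONBREAKING_PREFIXES = {"Mr", "Dr"}


def split_final_periods(sentence, prefixes=NONBREAKING_PREFIXES):
    """Split a trailing period into its own token, unless the text before
    the period is a known nonbreaking prefix."""
    tokens = []
    for word in sentence.split():
        stem = word[:-1]
        if word.endswith(".") and stem and stem not in prefixes:
            # Not a listed abbreviation: the period becomes a separate token.
            tokens.extend([stem, "."])
        else:
            # Either no trailing period, or a nonbreaking prefix such as "Mr."
            tokens.append(word)
    return tokens


print(" ".join(split_final_periods("Mr. Smith spoke to Dr. Jones.")))
# prints: Mr. Smith spoke to Dr. Jones .
```

With an empty prefix set, "Mr." and "Dr." would be split as well, which is why a language without its own prefix list tends to get abbreviations tokenized incorrectly.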
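
On the UTF-8 question raised above: one quick way to test whether a corpus really is UTF-8 before running the tokenizer is to try decoding its raw bytes. The sketch below assumes the file is small enough to read into memory; the helper name `first_invalid_utf8_offset` is made up for illustration.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence in `data`,
    or None if the whole byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start
    return None


# Sinhala text encoded as UTF-8 passes the check.
print(first_invalid_utf8_offset("සිංහල".encode("utf-8")))

# The same text saved as UTF-16 is reported as invalid from the first byte.
print(first_invalid_utf8_offset("සිංහල".encode("utf-16")))
```

To check a file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the function. A result other than None suggests the text should be converted to UTF-8 before tokenizing.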
