Tharaka,

If you take a look at the nonbreaking prefixes file for English, you can see some good examples of what constitutes a nonbreaking prefix. The file is commented, explaining what's going on.
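As a rough illustration of how such a prefix list gets used, here is a minimal Python sketch. It is not the logic of tokenizer.perl itself, which does considerably more; the two-entry prefix set and the function name split_final_periods are made up for this example. A token ending in a period keeps its period only when the part before the period is in the list.

```python
# Hypothetical, tiny prefix list for illustration only;
# the real English nonbreaking prefixes file is much longer.
NONBREAKING_PREFIXES = {"Mr", "Dr"}


def split_final_periods(text, prefixes=NONBREAKING_PREFIXES):
    """Split a trailing period into its own token unless the word is a listed prefix."""
    tokens = []
    for word in text.split():
        stem = word[:-1]
        if word.endswith(".") and stem and stem not in prefixes:
            # Not a known abbreviation: the period becomes a separate token.
            tokens.extend([stem, "."])
        else:
            # Known nonbreaking prefix, a bare ".", or no trailing period: keep as-is.
            tokens.append(word)
    return tokens


print(" ".join(split_final_periods("Mr. Smith spoke to Dr. Jones.")))
# prints: Mr. Smith spoke to Dr. Jones .
```

Passing an empty set as prefixes in this sketch splits every trailing period, including the ones after "Mr" and "Dr", which is the kind of over-splitting a language-specific list is meant to prevent.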
The basic idea is that the tokenizer needs to know when a token that ends in a period should be split, separating the period into its own token. For example:

Mr. Smith spoke to Dr. Jones.

For the sentence above, we would like the tokenizer to return the following:

Mr. Smith spoke to Dr. Jones .

The tokens "Mr." and "Dr." are abbreviations. Thus "Mr" and "Dr" are prefixes which, when followed by a period, should be considered nonbreaking. That is, if the tokenizer encounters "Mr." or "Dr.", it should leave them intact.

Contrast this with the token "Jones." The token "Jones" is not found in the nonbreaking prefixes list for English, and so the tokenizer should assume that "Jones." should be split into two tokens: "Jones" and "."

The rules about when a period should be split into a separate token are language-specific. If you want to write a nonbreaking prefixes file for a new language, you will probably need some knowledge of that language in order to write a sensible one.

If you don't have access to someone with knowledge of the language of interest, you could try to develop an unsupervised statistical technique that attempts to learn nonbreaking prefixes from untokenized text.

In the worst case, you can just let the tokenizer run without a language-specific nonbreaking prefixes file.

Hope that helps,
Lane

On Wed, May 30, 2012 at 9:09 AM, Tom Hoar <[email protected]> wrote:
> When you compiled moses, it created a scripts folder. In there, you'll find
> the subfolder "scripts/tokenizer/nonbreaking_prefixes". The files in this
> folder all have the same name with a 2-letter language code extension. These
> files have language-specific rules for how the tokenizer & detokenizer work.
>
> Anyone, is there a better resource than reading the existing files to learn
> how the files work?
>
> Tom
>
>
>
> On Wed, 30 May 2012 18:22:52 +0530, tharaka weheragoda
> <[email protected]> wrote:
>
> Thank you very much for your answer. But I'm new to this field and I'm not
> aware of how to create nonbreaking_prefix files. Is there any particular
> way of doing this? Can you explain a bit more?
>
> On Wed, May 30, 2012 at 6:13 PM, Tom Hoar
> <[email protected]> wrote:
>>
>> Build your own nonbreaking_prefixes file. Name it with the extension you
>> want to use and save it in the nonbreaking_prefixes subfolder under the
>> moses scripts/tokenizer folder. The existing files are commented with
>> instructions to help you.
>>
>> Tom
>>
>>
>>
>> On Wed, 30 May 2012 17:37:19 +0530, tharaka weheragoda
>> <[email protected]> wrote:
>>
>> Hi everybody,
>>
>> When I try to tokenize my Sinhala dataset, it gives me a warning
>> message like this:
>> "WARNING: No known abbreviations for language 'si', attempting fall-back
>> to English version..."
>>
>> And my letters have changed a bit. Is there any way to tokenize Sinhala
>> data with this tokenizer.perl?
>>
>> I'm looking forward to your help.
>>
>> Thanks in advance!
>> Tharaka
>
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

--
When a place gets crowded enough to require ID's, social collapse is not
far away. It is time to go elsewhere. The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"
