Re: [Moses-support] tokenizer problem

Lane Schwartz Wed, 30 May 2012 06:29:22 -0700

Tharaka,

If you take a look at the file for English, you can see some good
examples of what constitutes a nonbreaking prefix. The file is
commented, explaining what's going on.

The basic idea is that the tokenizer needs to know when a token that
ends in a period should be split, separating the period into its own
token. For example:

Mr. Smith spoke to Dr. Jones.

For the sentence above, we would like the tokenizer to return the following:

Mr. Smith spoke to Dr. Jones .

The tokens "Mr." and "Dr." are abbreviations. Thus "Mr" and "Dr"
represent prefixes which, when followed by a period, should be
considered nonbreaking. That is, if the tokenizer encounters "Mr." or
"Dr." it should leave them intact rather than splitting off the period.

Contrast this with the token "Jones." The token "Jones" is not found
in the nonbreaking prefixes list for English, and so the tokenizer
should assume that "Jones." should be split into two tokens: "Jones"
and "."
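
As a rough illustration of that rule, here is a minimal Python sketch. It
is not the logic of tokenizer.perl itself, which handles many more cases;
the two-entry prefix set and the function name split_final_periods are
made up for this example.

```python
# Hypothetical, tiny prefix list; the real English file is much longer.
NONBREAKING_PREFIXES = {"Mr", "Dr"}

def split_final_periods(sentence, prefixes=NONBREAKING_PREFIXES):
    """Split a trailing period off each token unless the token is a known prefix."""
    out = []
    for token in sentence.split():
        stem = token[:-1]
        if token.endswith(".") and stem and stem not in prefixes:
            # Not a listed prefix: the period becomes its own token.
            out.append(stem)
            out.append(".")
        else:
            # Listed prefix (or no trailing period): keep the token whole.
            out.append(token)
    return " ".join(out)

print(split_final_periods("Mr. Smith spoke to Dr. Jones."))
# prints: Mr. Smith spoke to Dr. Jones .
```

"Mr." and "Dr." pass through unchanged because their stems are in the
set, while "Jones." is split because "Jones" is not.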

The rules about when a period should be split into a separate token
are language-specific. To write a sensible nonbreaking prefixes file
for a new language, you will probably need some knowledge of that
language.
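
To give a feel for what such a file boils down to, the sketch below reads
a prefix list in Python. It assumes the layout used by the English file
(one prefix per line, with lines starting with "#" treated as comments);
check the comments in that file before relying on this. The function name
parse_prefix_lines and the sample entries are illustrative only.

```python
def parse_prefix_lines(lines):
    """Collect prefixes from lines, skipping blank lines and '#' comments."""
    prefixes = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Keep only the first whitespace-separated field, so any
        # trailing annotation on the line is ignored in this sketch.
        prefixes.add(line.split()[0])
    return prefixes

# Illustrative content, not a copy of any real prefixes file.
sample = """\
# Titles
Mr
Dr

# Months
Jan
"""

print(sorted(parse_prefix_lines(sample.splitlines())))
# prints: ['Dr', 'Jan', 'Mr']
```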

If you don't have access to someone with knowledge of the language of
interest, you could try to develop an unsupervised statistical
technique that attempts to learn nonbreaking prefixes from untokenized
text.
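
One very simple heuristic along those lines, sketched in Python: treat a
word as a candidate prefix if it almost always appears with a trailing
period. This is only a starting point under that assumption, not an
established method from Moses; the function name candidate_prefixes and
the thresholds min_count and min_ratio are arbitrary choices for the
example, and any output would still need checking by hand.

```python
from collections import Counter

def candidate_prefixes(text, min_count=2, min_ratio=0.9):
    """Return words that nearly always appear with a trailing period."""
    with_period = Counter()
    without_period = Counter()
    for token in text.split():
        if token.endswith(".") and len(token) > 1:
            with_period[token[:-1]] += 1
        else:
            without_period[token] += 1
    candidates = []
    for word, n in with_period.items():
        # Share of this word's occurrences that carry a trailing period.
        ratio = n / (n + without_period[word])
        if n >= min_count and ratio >= min_ratio:
            candidates.append(word)
    return sorted(candidates)

text = "Dr. Jones met Dr. Smith. Smith thanked Jones. Dr. Jones left."
print(candidate_prefixes(text))
# prints: ['Dr']
```

Because it relies only on counts and not on capitalization, the same idea
can be tried on scripts that have no letter case.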

In the worst case, you can just let the tokenizer run without a
language-specific nonbreaking prefixes file. As the warning you saw
says, it will then attempt to fall back to the English version.

Hope that helps,
Lane

On Wed, May 30, 2012 at 9:09 AM, Tom Hoar
<[email protected]> wrote:
> When you compiled moses, it created a scripts folder. In there, you'll find
> the subfolders "scripts/tokenizer/nonbreaking_prefixes". The files in this
> folder all have the same name with a 2-letter language code extension. These
> files have language-specific rules for how the tokenizer & detokenizer work.
>
> Anyone, is there a better resource than reading the existing files to learn
> how the files work?
>
> Tom
>
>
>
> On Wed, 30 May 2012 18:22:52 +0530, tharaka weheragoda
> <[email protected]> wrote:
>
> Thank you very much for your answer. But I'm new to this field and I'm not
> aware of how to create nonbreaking_prefixes files. Is there any particular
> way of doing this? Can you explain a bit more?
>
> On Wed, May 30, 2012 at 6:13 PM, Tom Hoar
> <[email protected]> wrote:
>>
>> Build your own nonbreaking_prefixes file. Name it with the extension you
>> want to use and save it in the nonbreaking_prefixes subfolder under the
>> moses scripts/tokenizer folder. The existing files are commented with
>> instructions to help you.
>>
>> Tom
>>
>>
>>
>> On Wed, 30 May 2012 17:37:19 +0530, tharaka weheragoda
>> <[email protected]> wrote:
>>
>> Hi everybody,
>>
>>   When I try to tokenize my Sinhala dataset, it gives me a warning
>> message like this:
>>  "WARNING: No known abbreviations for language 'si', attempting fall-back
>> to English version..."
>>
>> And my letters have changed a bit. Is there any way to tokenize Sinhala
>> data with this tokenizer.perl?
>>
>> I'm looking forward to your help.
>>
>> Thanks in advance!
>> Tharaka
>
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

-- 
When a place gets crowded enough to require ID's, social collapse is not
far away.  It is time to go elsewhere.  The best thing about space travel
is that it made it possible to go elsewhere.
                -- R.A. Heinlein, "Time Enough For Love"

