Re: [Moses-support] tokenizer problem

tharaka weheragoda Wed, 30 May 2012 07:12:02 -0700

Thanks, Lane. Some characters in Sinhala are not read by this tokenizer.
That's the problem I'm having. It gives me some abnormal characters instead
of the unread characters. Do you know how to overcome this problem?


On Wed, May 30, 2012 at 6:57 PM, Lane Schwartz <[email protected]> wrote:

> Tharaka,
>
> If you take a look at the file for English, you can see some good
> examples of what constitutes a nonbreaking prefix. The file is
> commented, explaining what's going on.
>
> The basic idea is that the tokenizer needs to know when a token that
> ends in a period should be split, separating the period into its own
> token. For example:
>
> Mr. Smith spoke to Dr. Jones.
>
> For the sentence above, we would like the tokenizer to return the
> following:
>
> Mr. Smith spoke to Dr. Jones .
>
> The tokens "Mr." and "Dr." are abbreviations. Thus "Mr" and "Dr"
> represent prefixes which, when followed by a period, should be
> considered nonbreaking. That is, if you encounter "Mr." or "Dr." the
> tokenizer should leave them intact.
>
> Contrast this with the token "Jones." The token "Jones" is not found
> in the nonbreaking prefixes list for English, and so the tokenizer
> should assume that "Jones." should be split into two tokens: "Jones"
> and "."
>
> The rules about when a period should be split into a separate token
> are language-specific. If you want to write a nonbreaking prefixes
> file for a new language, you will probably need some knowledge of the
> language in order to write a sensible new nonbreaking prefixes file.
>
> If you don't have access to someone with knowledge of the language of
> interest, you could try to develop an unsupervised statistical
> technique that attempts to learn nonbreaking prefixes from untokenized
> text.
>
> In the worst case, you can just let the tokenizer run without a
> language-specific nonbreaking prefixes file.
>
> Hope that helps,
> Lane
>
> On Wed, May 30, 2012 at 9:09 AM, Tom Hoar
> <[email protected]> wrote:
> > When you compiled moses, it created a scripts folder. In there, you'll find
> > the subfolder "scripts/tokenizer/nonbreaking_prefixes". The files in this
> > folder all have the same name with a 2-letter language code extension. These
> > files have language-specific rules for how the tokenizer & detokenizer work.
> >
> > Anyone, is there a better resource than reading the existing files to learn
> > how the files work?
> >
> > Tom
> >
> >
> >
> > On Wed, 30 May 2012 18:22:52 +0530, tharaka weheragoda
> > <[email protected]> wrote:
> >
> > Thank you very much for your answer. But I'm new to this field and I'm not
> > aware of how to create nonbreaking_prefix files. Is there any particular
> > way of doing this? Can you explain a bit more?
> >
> > On Wed, May 30, 2012 at 6:13 PM, Tom Hoar
> > <[email protected]> wrote:
> >>
> >> Build your own nonbreaking_prefixes file. Name it with the extension you
> >> want to use and save it in the nonbreaking_prefixes subfolder under the
> >> moses scripts/tokenizer folder. The existing files are commented with
> >> instructions to help you.
> >>
> >> Tom
> >>
> >>
> >>
> >> On Wed, 30 May 2012 17:37:19 +0530, tharaka weheragoda
> >> <[email protected]> wrote:
> >>
> >> Hi everybody,
> >>
> >>   When I try to tokenize my Sinhala dataset, it gives me a warning
> >> message like this:
> >>  "WARNING: No known abbreviations for language 'si', attempting fall-back
> >> to English version..."
> >>
> >> And my letters have changed a bit. Is there any way to tokenize Sinhala
> >> data with this tokenizer.perl?
> >>
> >> I'm looking forward to your help.
> >>
> >> Thanks in advance!
> >> Tharaka
> >
> >
> >
> > _______________________________________________
> > Moses-support mailing list
> > [email protected]
> > http://mailman.mit.edu/mailman/listinfo/moses-support
> >
>
>
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away.  It is time to go elsewhere.  The best thing about space travel
> is that it made it possible to go elsewhere.
>                 -- R.A. Heinlein, "Time Enough For Love"
>

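For anyone reading this thread later, the rule Lane describes above (keep
the period attached when the word before it is a listed prefix, otherwise
split it off) can be sketched in a few lines of Python. This is only a
simplified illustration, not the logic of tokenizer.perl itself, which
handles more cases. The function names and the two-entry sample list are
made up for the example; the sketch assumes a prefix file with one prefix
per line and "#" comment lines.

```python
def load_prefixes(lines):
    """Collect prefixes, one per line, skipping blank lines and '#' comments."""
    prefixes = set()
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            prefixes.add(entry)
    return prefixes


def split_periods(sentence, prefixes):
    """Split a trailing period into its own token unless the word is a known prefix."""
    tokens = []
    for word in sentence.split():
        stem = word[:-1]
        if word.endswith(".") and stem and stem not in prefixes:
            # Not a listed abbreviation: the period becomes a separate token.
            tokens.extend([stem, "."])
        else:
            # Either no trailing period, or a nonbreaking prefix such as "Mr."
            tokens.append(word)
    return " ".join(tokens)


# Hypothetical sample content, not the English file shipped with Moses.
SAMPLE_PREFIX_FILE = """\
# Titles that should keep their period
Mr
Dr
"""

prefixes = load_prefixes(SAMPLE_PREFIX_FILE.splitlines())
print(split_periods("Mr. Smith spoke to Dr. Jones.", prefixes))
# prints: Mr. Smith spoke to Dr. Jones .
```

A file for a new language would list that language's own abbreviations in
place of "Mr" and "Dr"; with an empty list, every trailing period is split
off.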
