Re: [Moses-support] tokenizer problem

Tom Hoar Wed, 30 May 2012 08:09:16 -0700


Tharaka,

Two general questions about Sinhalese. 1) When written, is a sentence
normally segmented with spaces between the words? 2) Is the
end-of-sentence marker a standard Latin full stop (period), used alongside
the other standard Latin punctuation marks (comma, semicolon, etc.)?

If both answers are yes, then tokenizer.perl can work with the addition
of a new nonbreaking_prefix file. If not, you'll need the help of a
linguist with Sinhalese experience. In this case, you might consider
training a Sinhalese tokenizer with the Stanford Segmenter.
http://nlp.stanford.edu/software/segmenter.shtml

Another point to consider: is the character encoding of your Sinhalese
text UTF-8? The tokenizer.perl script requires UTF-8 input, and a
different encoding might explain why "some characters" are not read by
the tokenizer.

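As a quick way to test the encoding question above, here is a minimal Python sketch that reports where a file stops being valid UTF-8. The function names and the file name in the comment are made up for illustration; nothing here is part of Moses.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def check_file(path):
    """Read a file as raw bytes and check whether it decodes as UTF-8."""
    with open(path, "rb") as handle:
        return first_invalid_utf8_offset(handle.read())


# Hypothetical usage: check_file("corpus.si") returns None for a clean file.
```

If this reports an offset, the file is probably in some other encoding and would need converting to UTF-8 before running tokenizer.perl.
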
Tom 

On Wed, 30 May 2012 19:38:28 +0530, tharaka weheragoda wrote:

Thanks, Lane. Some characters in Sinhala are not read by this tokenizer.
That's the problem I'm having. It gives me some abnormal characters
instead of the unread characters. Do you know how to overcome this
problem?

On Wed, May 30, 2012 at 6:57 PM, Lane Schwartz wrote:
 Tharaka,

 If you take a look at the file for English, you can see some good
 examples of what constitutes a nonbreaking prefix. The file is
 commented, explaining what's going on.
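
To make the file layout concrete, here is a rough Python sketch of how such a file can be read. It assumes the layout used by the English file: one prefix per line, lines starting with # are comments, and a trailing #NUMERIC_ONLY# marks a prefix that is nonbreaking only when a number follows. The function name and the sample lines are illustrative and not taken from the Moses scripts.

```python
NUMERIC_MARKER = "#NUMERIC_ONLY#"


def parse_prefix_lines(lines):
    """Return (always_nonbreaking, numeric_only) prefix sets from file lines."""
    always, numeric_only = set(), set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.endswith(NUMERIC_MARKER):
            numeric_only.add(line[: -len(NUMERIC_MARKER)].strip())
        else:
            always.add(line)
    return always, numeric_only


# Illustrative sample, not the contents of any real file.
sample = ["# Honorifics", "Mr", "Dr", "", "No #NUMERIC_ONLY#"]
always, numeric_only = parse_prefix_lines(sample)
```

A file for a new language would follow the same shape, with that language's abbreviations in place of the English ones.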

 The basic idea is that the tokenizer needs to know when a token that
 ends in a period should be split, separating the period into its own
 token. For example:

 Mr. Smith spoke to Dr. Jones.

 For the sentence above, we would like the tokenizer to return the
 following:

 Mr. Smith spoke to Dr. Jones .

 The tokens "Mr." and "Dr." are abbreviations. Thus "Mr" and "Dr"
 represent prefixes which, when followed by a period, should be
 considered nonbreaking. That is, if you encounter "Mr." or "Dr.", the
 tokenizer should leave them intact.

 Contrast this with the token "Jones." The token "Jones" is not found
 in the nonbreaking prefixes list for English, and so the tokenizer
 should assume that "Jones." should be split into two tokens: "Jones"
 and "."
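
The rule described above can be sketched in a few lines of Python. This is a deliberate simplification for illustration, not the logic of tokenizer.perl itself, which handles many more cases; the prefix set and the function name are assumptions.

```python
# Hypothetical prefix set; the real list lives in the language-specific
# nonbreaking prefixes file.
NONBREAKING_PREFIXES = {"Mr", "Dr"}


def split_periods(sentence, prefixes=NONBREAKING_PREFIXES):
    """Split a trailing period into its own token unless the word is a known prefix."""
    tokens = []
    for word in sentence.split():
        stem = word[:-1]
        if word.endswith(".") and stem and stem not in prefixes:
            tokens.extend([stem, "."])  # "Jones." becomes "Jones" and "."
        else:
            tokens.append(word)  # "Mr." and "Dr." are left alone
    return " ".join(tokens)


print(split_periods("Mr. Smith spoke to Dr. Jones."))
# prints: Mr. Smith spoke to Dr. Jones .
```

With an empty prefix set, "Mr." and "Dr." would be split as well, which is the kind of damage a missing language-specific file can cause.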

 The rules about when a period should be split into a separate token
 are language-specific. If you want to write a nonbreaking prefixes
 file for a new language, you will probably need some knowledge of the
 language in order to write a sensible one.

 If you don't have access to someone with knowledge of the language of
 interest, you could try to develop an unsupervised statistical
 technique that attempts to learn nonbreaking prefixes from untokenized
 text.
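
One crude way to approach that idea, sketched in Python under stated assumptions: the input has one sentence per line, and a word that nearly always appears with a trailing period in the middle of a line is treated as a likely abbreviation. The function name and the thresholds are arbitrary choices for illustration. This is only one possible heuristic and its guesses would need manual review.

```python
from collections import Counter


def guess_nonbreaking_prefixes(lines, min_count=2, min_ratio=0.9):
    """Guess abbreviation-like stems: words that nearly always carry a trailing period."""
    with_period = Counter()
    without_period = Counter()
    for line in lines:
        words = line.split()
        # Skip the last word of each line: its period is most likely a real sentence end.
        for word in words[:-1]:
            if word.endswith(".") and len(word) > 1:
                with_period[word[:-1]] += 1
            else:
                without_period[word] += 1
        if words:
            without_period[words[-1].rstrip(".")] += 1
    guesses = []
    for stem, dotted in with_period.items():
        total = dotted + without_period[stem]
        if dotted >= min_count and dotted / total >= min_ratio:
            guesses.append(stem)
    return sorted(guesses)


corpus = [
    "Mr. Smith spoke to Dr. Jones.",
    "Dr. Jones thanked Mr. Smith.",
    "Jones left.",
]
print(guess_nonbreaking_prefixes(corpus))
# prints: ['Dr', 'Mr']
```

The sketch does not rely on capitalization, which matters for a script without upper and lower case.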

 In the worst case, you can just let the tokenizer run without a
 language-specific nonbreaking prefixes file.

 Hope that helps,
 Lane

On Wed, May 30, 2012 at 9:09 AM, Tom Hoar wrote:
 > When you compiled moses, it created a scripts folder. In there, you'll find
 > the subfolder "scripts/tokenizer/nonbreaking_prefixes". The files in this
 > folder all have the same name with a 2-letter language code extension. These
 > files have language-specific rules for how the tokenizer & detokenizer work.
 >
 > Anyone, is there a better resource than reading the existing files to learn
 > how the files work?
 >
 > Tom
 >
 > On Wed, 30 May 2012 18:22:52 +0530, tharaka weheragoda wrote:
 >
 > Thank you very much for your answer. But I'm new to this field and I'm not
 > aware of how to create nonbreaking_prefix files. Is there any particular
 > way of doing this? Can you explain a bit more?
 >
 >
 > On Wed, May 30, 2012 at 6:13 PM, Tom Hoar wrote:
 >>
 >> Build your own nonbreaking_prefixes file. Name it with the extension you
 >> want to use and save it in the nonbreaking_prefixes subfolder under the
 >> moses scripts/tokenizer folder. The existing files are commented with
 >> instructions to help you.
 >>
 >> Tom
 >>
 >> On Wed, 30 May 2012 17:37:19 +0530, tharaka weheragoda wrote:
 >>
 >> Hi everybody,
 >>
 >> When I try to tokenize my Sinhala dataset, it gives me a warning
 >> message like this:
 >> "WARNING: No known abbreviations for language 'si', attempting fall-back
 >> to English version..."
 >>
 >> And my letters have changed a bit. Is there any way to tokenize Sinhala
 >> data with this tokenizer.perl?
 >>
 >> I'm looking forward to your help.
 >>
 >> Thanks in advance!
 >> Tharaka

 >
 > _______________________________________________
 > Moses-support mailing list
 > [email protected] [6]
 > http://mailman.mit.edu/mailman/listinfo/moses-support [7]
 >

 --
 When a place gets crowded enough to require ID's, social collapse is not
 far away. It is time to go elsewhere. The best thing about space travel
 is that it made it possible to go elsewhere.
 -- R.A. Heinlein, "Time Enough For Love"

Links:
------
[1] mailto:[email protected]
[2] mailto:[email protected]
[3] mailto:[email protected]
[4] mailto:[email protected]
[5] mailto:[email protected]
[6] mailto:[email protected]
[7] http://mailman.mit.edu/mailman/listinfo/moses-support

