Re: [lingu-dev] About the process of language guessing

Thomas Lange - Sun Germany - ham02 - Hamburg Wed, 02 Jul 2008 03:40:36 -0700

Hi,

Guillaume Guerra wrote:


> I'm interested in the language guesser of your project. I took a look at the
> sources of SimpleGuesser.
> I'm actually looking for an open source language guesser that is able to
> detect that a document is multilingual (a first part could be in English,
> another one in French, etc.). I also need it to tell me the boundaries of
> these parts.
> 
> According to the way your guesser is used in OOo, I guess it is able to do
> so. Still, I don't understand how that is possible, as SimpleGuesser only
> detects one language for the whole text it is given.
>
> Is it called by OOo only for the current sentence / paragraph?
> In fact, I'd like some help on how this component can be used in a
> multilingual context.

In the UI it is currently called on the text of a whole paragraph, since the
algorithm fails if the text is too short. The text should contain at least
a few words; about 30-50 usually seem to be enough.
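The length requirement can be checked before calling the guesser at all. A minimal sketch in Python, assuming a threshold of 30 words taken from the rule of thumb above; the name has_enough_text and the whitespace-based word count are illustrative only and are not part of OOo:

```python
# Assumed lower bound, taken from the "about 30-50 words" rule of thumb.
MIN_WORDS = 30


def has_enough_text(text, min_words=MIN_WORDS):
    """Return True if text has at least min_words whitespace-separated words."""
    return len(text.split()) >= min_words
```

Text that fails this check is probably better left unguessed than passed on.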

In older versions of OOo, language guessing for single words was implemented
by checking the word with all available spell checkers. The first one that
did not complain about the word defined the 'guessed language' for it.
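That older approach can be sketched roughly as follows. This is not the OOo code: the function name, the (language, callable) pairs and the toy word lists are all made up for illustration, and real spell checkers would take the place of the lambdas.

```python
def guess_word_language(word, spell_checkers):
    """Return the language of the first checker that accepts the word.

    spell_checkers is an ordered list of (language, is_correct) pairs,
    where is_correct is a callable taking a word and returning a bool.
    Returns None if every checker complains about the word.
    """
    for language, is_correct in spell_checkers:
        if is_correct(word):
            return language
    return None


# Toy word lists standing in for real dictionaries (hypothetical data).
ENGLISH = {"house", "the", "hand"}
GERMAN = {"haus", "das", "hand"}

checkers = [
    ("en", lambda w: w.lower() in ENGLISH),
    ("de", lambda w: w.lower() in GERMAN),
]
```

Note that the result depends on the order of the checkers: a word such as "hand", which is valid in both toy lists, is reported as "en" simply because that checker is asked first.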


The function to use in the UNO API is guessPrimaryLanguage. I think
getAvailableLanguages still does not have a really usable
implementation. Both functions should work with the sub-string
defined by the function parameters. Thus the size of the text to be
guessed is up to you, e.g. a whole paragraph or a single word. The latter
does not work well with the currently implemented algorithm, though.
A sensible unit of text would of course be a whole sentence. After all, a
sentence should usually use just one language. Thus you may use the
XBreakIterator interface to get the sentence start and end positions and
then pass exactly that sentence to the language guesser.
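The sentence-by-sentence approach could look roughly like the Python sketch below. It does not call the UNO API: the regular expression is only a naive stand-in for XBreakIterator, and guess is a hypothetical callable taking (text, start, length), modelled on the sub-string parameters mentioned above.

```python
import re

# Naive stand-in for a real sentence break iterator: a sentence ends at
# '.', '!' or '?' followed by whitespace or the end of the text.
# Abbreviations such as "e.g." will be split incorrectly.
_SENTENCE = re.compile(r"\S.*?(?:[.!?](?=\s|$)|$)", re.DOTALL)


def sentence_spans(text):
    """Return one (start, length) pair per sentence found in text."""
    return [(m.start(), m.end() - m.start()) for m in _SENTENCE.finditer(text)]


def guess_per_sentence(text, guess):
    """Call guess(text, start, length) once per sentence.

    Returns a list of (start, length, language) tuples.
    """
    return [(start, length, guess(text, start, length))
            for start, length in sentence_spans(text)]
```

Passing the full text together with a start position and a length, rather than a copied slice, keeps the reported positions valid for the original document.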

Also, with the currently implemented algorithm you must NEVER pass a
sub-string that mixes different languages to the guesser! That just results
in rather random guesses that usually are of no use at all. This is another
reason why sticking to sentence boundaries is recommended.
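If each sentence is guessed on its own, the boundaries of the language parts of a multilingual document follow from merging neighbouring sentences that got the same result. A small sketch, assuming the per-sentence results are available as (start, length, language) tuples; that tuple layout and the function name are assumptions made for this example:

```python
def merge_language_runs(guesses):
    """Merge adjacent sentences that were guessed as the same language.

    guesses is a list of (start, length, language) tuples in text order.
    Returns a list of (start, end, language) tuples with end exclusive.
    """
    runs = []
    for start, length, language in guesses:
        end = start + length
        if runs and runs[-1][2] == language:
            # Same language as the previous run: extend it.
            runs[-1] = (runs[-1][0], end, language)
        else:
            runs.append((start, end, language))
    return runs
```

Each run in the result then marks one single-language part of the text.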

And if you are a developer, you could rather easily improve the
current algorithm so that many more languages can be guessed. It should
not be too complicated but may take some time.
You could also add the possibility to guess the language from just a
single character for many languages.
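One possible way to guess from a single character is to look at its script, which works for scripts that are used mainly by one language. The sketch below uses Python's unicodedata module; the table of name prefixes is an assumption made for this example, is far from complete, and is not how the current guesser works.

```python
import unicodedata

# Assumed mapping from Unicode character-name prefixes to a likely language.
# Only scripts that are used mostly by a single language are listed.
SCRIPT_HINTS = {
    "HIRAGANA": "ja",
    "KATAKANA": "ja",
    "HANGUL": "ko",
    "THAI": "th",
    "GREEK": "el",
    "HEBREW": "he",
}


def guess_from_char(ch):
    """Return a language code for a single character, or None if unknown."""
    name = unicodedata.name(ch, "")
    for prefix, language in SCRIPT_HINTS.items():
        if name.startswith(prefix):
            return language
    return None
```

Characters from scripts shared by many languages, such as Latin or Cyrillic, yield None and still need the statistical guesser.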


Thomas

