Re: [lingu-dev][lang guesser]Next steps

Thomas Lange Thu, 21 Sep 2006 06:46:44 -0700

Hi Jocelyn and all,

>    Results for mixed-language texts are quite bad (general conclusion)


That is the case if the text is passed to the language guessing component
in one chunk and a single fingerprint is calculated for it.
I was hoping that the combined fingerprint would be 'close' to the
fingerprints of the two languages actually being used.
But that turned out to be too simple an approach.
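
To illustrate the single-fingerprint approach, here is a minimal Python
sketch of rank-ordered character n-gram fingerprints compared with an
"out-of-place" distance, the general technique libtextcat is based on. The
function names, the "_" padding character and the max_n/size defaults are
assumptions made for this illustration; they are not taken from the
libtextcat code.

```python
from collections import Counter


def fingerprint(text, max_n=3, size=300):
    """Return the `size` most frequent character n-grams (1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries (assumed padding character)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically, so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


def out_of_place(doc_fp, lang_fp):
    """Sum of rank differences; n-grams missing from lang_fp get the maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_fp)}
    max_penalty = len(lang_fp)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_fp)
    )
```

A fingerprint computed over a text that mixes two languages interleaves
the frequent n-grams of both, so its ranks tend to drift away from every
single-language profile. That fits the poor results reported above.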

> All these points make me doubtful about the value of using libtextcat for
> the next version of the component, because it's coded in C and the code does
> not seem to be designed for reuse or for easy modification.

Basically I see no essential problem with libTextCat here. The code and
the fingerprint data do work. Of course, the original code was not
intended to run with Unicode strings, though...

> In addition, if we want to guess the language of texts which are composed
> of several different languages, we have to find typical text parts like quoted
> or bracketed word sequences. I expect that there is a UNO component to do
> that, isn't there?

That would be the breakiterator. It can be used to identify
word boundaries and the start and end of sentences. I do not specifically
know how well it works with quoted or bracketed text though.
After all, since it is used for cursor traveling and to identify words
for spell checking, it is required to be fast rather than accurate.
What I mean to say by this is that I think in order to properly identify
sentence boundaries one would already need a grammar checker or something
of a similar level of complexity. And that would be way too much overhead
for the purposes the breakiterator is used for.


> I chatted with Thomas LEBARBE, a researcher at Grenoble
> University (France), during the OOoCon and he suggested that I use something
> he called "virgulo", which is a kind of grammatical separator. I also
> thought, at the beginning of the summer, when I was searching for a good way to
> guess multi-language texts, that language changes often occur at the beginning or
> end of grammatical blocks. So this should be a possible way to improve the
> efficiency of multi-guesses (to analyze block by block).

Remember to make it rather quick (maybe only a regular-expression-based
grammar checker?) because language guessing is likely to be used together
with the actual grammar checking, e.g. by guessing the primary language of a
text. If it is too slow it will have a severe impact on the usefulness of
grammar checking.
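
As a rough idea of how cheap a regular-expression-only block splitter
could be, here is a minimal Python sketch. It cuts a text into quoted
spans, bracketed spans and runs of text ending at sentence punctuation,
so that each block could be guessed separately. The pattern and the name
split_blocks are assumptions made for illustration; this is not the
"virgulo" tool and not the breakiterator, and it ignores nested or
unbalanced quotes and brackets.

```python
import re

# Hypothetical block pattern: a double-quoted span, a (...) span, a [...] span,
# or a run of other text optionally closed by sentence-final punctuation.
BLOCK = re.compile(r'"[^"]*"|\([^()]*\)|\[[^\[\]]*\]|[^"(\[.!?]+[.!?]?')


def split_blocks(text):
    """Return the non-empty blocks of `text` in order, with outer whitespace removed."""
    return [m.group().strip() for m in BLOCK.finditer(text) if m.group().strip()]
```

For example, split_blocks('He said "guten Morgen" and left.') returns
['He said', '"guten Morgen"', 'and left.']. A single regex pass like this
is linear in the text length for typical input, which matters if the
splitter runs ahead of every grammar check.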


> It sounds like a complete refactoring would be needed if we want to
> implement new functionality and if we want to have real multi-guess
> features.

We probably first need to identify what type of functionality we
actually require. As already mentioned, the first client in mind
for such extended functionality would likely be grammar checking.

From my current point of view it would at the very least be required
to guess the primary language of a text as accurately as possible.
Secondary tasks would be:
  - to guess all involved languages
and maybe
  - to identify the boundaries between those languages
    (or in other words: identify the language for each word)
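
The three tasks above could be layered on top of a per-block guesser
roughly as in the following Python sketch. Everything here is
hypothetical: guess stands for any function that maps a block of text to
a language code, and weighting the primary language by block length is
just one possible choice.

```python
from collections import Counter


def language_per_block(blocks, guess):
    """Boundary task: pair every block with the language `guess` returns for it."""
    return [(block, guess(block)) for block in blocks]


def all_languages(blocks, guess):
    """Secondary task: every language found, in order of first appearance."""
    seen = []
    for _, lang in language_per_block(blocks, guess):
        if lang not in seen:
            seen.append(lang)
    return seen


def primary_language(blocks, guess):
    """Primary task: the language covering the most characters, or None for no blocks."""
    weight = Counter()
    for block, lang in language_per_block(blocks, guess):
        weight[lang] += len(block)
    return weight.most_common(1)[0][0] if weight else None
```

With this layering a grammar checker that only needs the primary language
calls one function, while other clients can ask for the finer-grained
results without a different guessing backend.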


> I propose to develop a complete C++ library, of course not from
> scratch, but I will be inspired by libtextcat, especially for the fingerprint
> comparison, which has been implemented in libtextcat in a very efficient
> way. Unfortunately, this algorithm is ad hoc and I think I will have to
> really look at it. So we will have, for example, a component called
> "XFingerprintMaker" that would also be very useful for other linguistic
> usages.

Would be nice.

> Maybe it's not really worthwhile to send everybody the present version of
> the component because I think it will be modified.

> Everything that I said here is not the priority. Of course, these are
> next steps. Thomas, please, can you send me the latest component snapshot in
> case of modifications on your side. I will restart from this step.

Nothing is done yet.
To be more precise, unfortunately it is still not decided in which form
it should be integrated (as a UNO package or as a library with data files
like most of the other code).
I need to inquire about this again.


Thomas

