[lingu-dev] Component for guessing the language of a text

Jocelyn Merand Fri, 02 Jun 2006 10:50:37 -0700

Hello,

I'm proud to present the project I will be working on this summer (at least). It seems that the OOo community wants a new way to guess the language of texts (not only words and sentences but also longer texts).

In practice, it will save the writer from having to select the text's language in contextual menus (in most cases). Of course, one of the objectives is to process the text efficiently in terms of CPU and memory.

Technically, there are several ways to guess the language of a text, but the preferred method will be the N-gram one (a statistical approach). Thomas Lange suggested that I look at textcat (http://software.wise-guys.nl/libtextcat/ ), a library that implements this algorithm (under a BSD license). It could be a good base for our work and, license problems aside, why not the core of the component.
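
To make the N-gram idea concrete, here is a minimal sketch in Python of the rank-order ("out-of-place") comparison of character N-gram profiles, the general technique that textcat-style guessers are built on. It is not libtextcat's code or API: the function names, the tiny SAMPLES texts and the max_n and top_k values are all assumptions chosen for illustration, and a real component would train its profiles on much larger corpora.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k most frequent character n-grams of text, best first."""
    counts = Counter()
    for word in text.lower().split():
        # Pad each word so that word-initial and word-final n-grams are distinct.
        padded = "_" + word + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def guess_language(text, lang_profiles):
    """Return the language whose profile is closest to the profile of text."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Hypothetical, far too small training texts; used only to show the mechanics.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the "
          "language of the text that we want to guess",
    "fr": "le renard brun saute par dessus le chien paresseux et ceci est "
          "la langue du texte que nous voulons deviner",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

Calling guess_language(some_text, PROFILES) returns the key of the closest profile. Only short ranked lists of N-grams are stored per language, which is one reason this approach fits the CPU and memory goals mentioned above.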

First, this project was proposed to the Google Summer of Code, but I was not selected. Fortunately, Intel is sponsoring it now. Dhananjay V Keskar is the mentor.
On that note, I want to thank Stefan Taxhet, Thomas Lange and Dhananjay for choosing and trusting me. I'll try to show that this choice was the right one ;-)

This project will be supported until September and has already started. During this time I will produce some deliverables, including the UNO components and documentation.

My bio:
I'm a French undergraduate student at the "Ecole Polytechnique de l'Universite de Nantes". My major is in data mining and business intelligence.
If you want to see my CV: http://jocme.club.fr/cv-en.xhtml

Feel free to ask me if you have any questions about the project's progress, technical matters, etc.

Regards

(PS: I'm French and not so fluent in English, so today you may think that the mistakes I make are not a problem, but by August you may be quite tired of them. So tell me now if my sentences are too horrible.)
