Hello,
I'm proud to present the project I will be working on this summer (at least). It seems that the OOo community wants a new way to guess the language of a text (not only words and sentences but also longer texts).
In practice, it will mean the writer does not have to select the text's language in contextual menus (in most cases). Of course, one of the objectives is to process the text efficiently in terms of CPU and memory.
Technically, there are several ways to guess the language of a text, but the preferred method will be the N-gram one (a statistical approach). Thomas Lange suggested that I look at textcat (http://software.wise-guys.nl/libtextcat/), a library that implements this algorithm (under a BSD license). It could be a good base for our work and, why not, licensing issues aside, the core of the component.
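To give an idea of what an N-gram approach looks like, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not the libtextcat API and not the design of the future component: it builds a ranked profile of character n-grams per language and picks the language whose profile is closest to the text's profile by rank distance. The function names, the tiny training samples and the parameters (max_n, top_k) are all hypothetical.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..max_n),
    most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles; n-grams missing from
    the language profile get a fixed maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else penalty
        for r, gram in enumerate(doc_profile)
    )


def guess_language(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Toy training samples (hypothetical); real profiles need much larger corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the "
          "language of the text that we want to guess",
    "fr": "le renard brun saute par dessus le chien paresseux et ceci est "
          "la langue du texte que nous voulons deviner",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

With profiles built this way, guess_language(some_text, PROFILES) returns the key of the closest profile. With such small samples the guesses on short inputs will not be reliable; the point is only to show the shape of the method.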
This project was first proposed to the Google Summer of Code, but I was not selected. Fortunately, Intel is sponsoring it now. Dhananjay V Keskar is the mentor.
On that note, I want to thank Stefan Taxhet, Thomas Lange and Dhananjay for choosing and trusting me. I'll try to show that this was the right choice ;-)
The project will be supported until September and has already started. During this time I will produce several deliverables, including the UNO components and documentation.
My bio:
I'm a French undergraduate student at the "Ecole Polytechnique de l'Universite de Nantes". My major is in data mining and business intelligence.
If you want to see my CV:
http://jocme.club.fr/cv-en.xhtml
Feel free to ask me if you have any questions about the project's progress, technical matters, etc.
Regards
(PS: I'm French and not so fluent in English, so you may think today that my mistakes are not a problem, but by August you may be quite tired of them. So tell me now if my sentences are too horrible.)