Jérôme Charron wrote:
Any plan to implement this? I mean moving the LanguageIdentifier class
into Nutch core.

As I already suggested on this list, I would really like to move the
LanguageIdentifier class (and profiles) to
an independent Lucene sub-project (and the MimeType repository too).
I don't remember why, but there were some objections to this...


I think most people agree that it would be worthwhile to un-tie this component from Nutch internals. The only objections were related not to the idea itself, but to the management aspects of creating a full-blown sub-project, both with respect to the initial setup and the continuing maintenance. An alternative solution was proposed (creating a contrib/ package). This would still help to separate the code from Nutch internals, so that it can be used in other projects, but it would require much less effort to set up and maintain.

Here is a short status of what I have in mind for the next improvements to the
LanguageIdentifier / MultiLanguage support:
* Enhance the LanguageIdentifier APIs by returning something like an ordered
LangDetail[] array when guessing the language (each LangDetail should contain
the language code and its score) - I have a prototype version of this on my
disk but I haven't taken the time to finalize it

+1. Other local modifications which I use frequently:

* exporting a list of supported languages,

* exporting an NGramProfile of the analyzed text,

* allowing processing of chunks of input (i.e. LanguageIdentifier.identify(char[] buf, int start, int len) ) - this is very useful if the text to be analyzed is already present in memory, and the choice of sections (chunks) is made elsewhere, e.g. for documents with clearly outlined sections, or for multi-language documents.

* I encountered some identification problems with some specific sites (with
Blogger for instance), and I plan to investigate this.
* Another pending task: the analysis (and coding) of multilingual querying
support.
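
To make the API ideas above concrete (an ordered LangDetail[] result, a list of supported languages, and identify(char[] buf, int start, int len) on a chunk), here is a minimal sketch. It is not the actual Nutch LanguageIdentifier: the class name LangIdSketch, the addProfile and getSupportedLanguages methods, the LangDetail fields and the toy trigram-overlap scoring are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the proposed API; not the Nutch implementation. */
final class LangIdSketch {

    /** One candidate language with its score (higher means a better match). */
    public static final class LangDetail implements Comparable<LangDetail> {
        public final String lang;
        public final double score;

        LangDetail(String lang, double score) {
            this.lang = lang;
            this.score = score;
        }

        /** Sorts by descending score, so the best guess comes first. */
        @Override
        public int compareTo(LangDetail other) {
            return Double.compare(other.score, this.score);
        }
    }

    // Toy profiles (assumption): language code -> set of trigrams seen in a sample.
    private final Map<String, Set<String>> profiles = new LinkedHashMap<>();

    /** Registers a language from a sample text (stand-in for real profiles). */
    public void addProfile(String lang, String sample) {
        profiles.put(lang, trigrams(sample.toCharArray(), 0, sample.length()));
    }

    /** Exports the list of supported languages, in registration order. */
    public List<String> getSupportedLanguages() {
        return new ArrayList<>(profiles.keySet());
    }

    /** Scores only buf[start .. start+len), so callers can pick the chunk. */
    public LangDetail[] identify(char[] buf, int start, int len) {
        if (start < 0 || len < 0 || start + len > buf.length) {
            throw new IndexOutOfBoundsException("invalid chunk bounds");
        }
        Set<String> grams = trigrams(buf, start, len);
        List<LangDetail> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : profiles.entrySet()) {
            int hits = 0;
            for (String g : grams) {
                if (e.getValue().contains(g)) {
                    hits++;
                }
            }
            // Fraction of the chunk's distinct trigrams found in the profile.
            double score = grams.isEmpty() ? 0.0 : (double) hits / grams.size();
            result.add(new LangDetail(e.getKey(), score));
        }
        Collections.sort(result); // stable sort, descending score
        return result.toArray(new LangDetail[0]);
    }

    /** Collects the distinct lower-cased trigrams of the given chunk. */
    static Set<String> trigrams(char[] buf, int start, int len) {
        Set<String> grams = new HashSet<>();
        for (int i = start; i + 3 <= start + len; i++) {
            grams.add(new String(buf, i, 3).toLowerCase());
        }
        return grams;
    }
}
```

A caller that has already split a document into sections could then call identify(buf, start, len) once per section and read the best guess from element 0 of the returned array, which is one way the multi-language case mentioned above might be handled.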

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
