Hi Jerome,

This is an interesting proposal. One thought that comes to mind is that analysis involving stemming or stop-word removal does lose information.
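To make the information-loss point concrete, here is a minimal sketch in Python. The stop-word list and the suffix-stripping `toy_stem` function are invented for illustration; they are not what Nutch or Lucene actually use, and a real stemmer would behave differently in detail.

```python
# Toy analysis chain: lowercase, drop stop words, strip a few suffixes.
# STOP_WORDS and toy_stem are illustrative assumptions, not Nutch/Lucene code.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of"}


def toy_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyse(text):
    """Return the terms that would be stored for `text`."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


# Two different word forms collapse to the same stored term,
# so the index can no longer tell them apart.
print(analyse("searching"))
print(analyse("searched"))

# A phrase made only of stop words leaves nothing to match against.
print(analyse("to be or not to be"))
```

Once only the output of `analyse` is stored, the original word form and the stop words cannot be recovered from the index.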
If you store only the language-specific stemmed form of the document, you may not be able to search for specific word forms, and an exact quoted search that contains stop words may return false positives; nor can you do a language-neutral search.

There might be two ways to tackle this. One is to store both a version of the document analysed in a language-specific way and a version analysed with the standard analyser, but this would make the index a lot bigger.

Another alternative is to make no change to the index, storing just the standard-analysed document, and to do some kind of query expansion at query time: analyse the query term(s) in a language-specific way, match them against a list of terms from the index that stem to the same root, and then query using those terms rather than the original query.

I am not sure if I am being totally clear... Just my two cents.

On Fri, 2005-06-10 at 17:02 +0200, Jérôme Charron wrote:
> I was thinking about it for a while: Multi-Lingual support in Nutch.
> After looking at Nutch code, I write a proposal on the Wiki
> (http://wiki.apache.org/nutch/MultiLingualSupport).
> Since I'm not yet an expert of the Nutch core (hope to become one), this
> mail is a kind of request for comments about the proposal.
>
> Thanks,
>
> Jerome

--
nutdev2001 <[EMAIL PROTECTED]>

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
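As a postscript, the query-time expansion idea from the reply could be sketched roughly as follows, in Python. Everything here is hypothetical: `toy_stem` stands in for a real language-specific stemmer, and `build_root_map` and `expand_query` are names made up for this sketch, not Nutch or Lucene APIs. The index is assumed to hold unstemmed terms from the standard analyser.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a language-specific stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_root_map(index_terms, stem):
    """Group the unstemmed index terms by the root they stem to."""
    roots = defaultdict(set)
    for term in index_terms:
        roots[stem(term)].add(term)
    return roots


def expand_query(query_terms, roots, stem):
    """For each query term, list every index term sharing its root.

    Each inner list is meant to be OR'ed together; the original term is
    always kept so that terms unknown to the index are not dropped.
    """
    expanded = []
    for term in query_terms:
        variants = roots.get(stem(term), set()) | {term}
        expanded.append(sorted(variants))
    return expanded


# Terms as they would appear in an index built with the standard analyser.
index_terms = ["search", "searching", "searched", "index", "indexed", "nutch"]
roots = build_root_map(index_terms, toy_stem)

print(expand_query(["searching", "nutch"], roots, toy_stem))
```

Because the index itself is untouched, an exact quoted search can simply skip `expand_query` and match the stored word forms directly. The cost moves to query time, and the root map has to be kept in step with the index vocabulary.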
