Re: [Nutch-dev] Multi-Lingual support

nutdev2001 Fri, 10 Jun 2005 19:57:26 -0700

Hi Jerome,
 This is an interesting proposal. One thought that comes to mind is that
analysis which involves stemming or stop-word removal does represent a
loss of information.

If you store only the language-specific stemmed form of the document,
you might not be able to search for specific word forms, or you might get
false positives for an exact quoted search that contains stop words;
neither can you do a language-neutral search.
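
To make the loss concrete, here is a rough sketch in Python. The
suffix stripper and the stop list are toy stand-ins that I made up for
illustration; they are not Nutch or Lucene analyzers.

```python
# Toy illustration only: a crude suffix stripper and a tiny stop list,
# standing in for a real language-specific analyzer (both hypothetical).
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of"}

def toy_stem(token):
    """Strip a few English suffixes, keeping at least three characters."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Lowercase, split on whitespace, drop stop words, stem the rest."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

# Three distinct word forms collapse to one term, so an exact-form
# search can no longer tell them apart.
print(analyze("walking walks walk"))

# A quoted phrase made only of stop words leaves nothing to match on.
print(analyze("to be or not to be"))
```

Once only the analyzed terms are stored, the original forms cannot be
recovered from the index.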

There might be two ways you could tackle this. One could be to store
both a version of the document analysed in a language-specific way and a
version analysed with the standard analyser, but this would make the
index a lot bigger.
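
A minimal sketch of the first option, assuming a plain in-memory
inverted index per analysis (the names build_index, exact_index and
stem_index are mine, and toy_stem is a made-up stand-in for a real
stemmer):

```python
from collections import defaultdict

def toy_stem(token):
    """Hypothetical stemmer: strip a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def standard(text):
    """Stand-in for the standard analysis: lowercase and split."""
    return text.lower().split()

def stemmed(text):
    """Stand-in for a language-specific analysis: standard plus stemming."""
    return [toy_stem(t) for t in standard(text)]

def build_index(docs, analyzer):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyzer(text):
            index[term].add(doc_id)
    return index

def postings(index):
    """Count (term, document) pairs as a rough measure of index size."""
    return sum(len(ids) for ids in index.values())

docs = {1: "walking the dogs", 2: "he walks a dog"}
exact_index = build_index(docs, standard)  # exact word forms
stem_index = build_index(docs, stemmed)    # stemmed forms

print(sorted(exact_index.get("walks", set())))  # exact form: one document
print(sorted(stem_index.get("walk", set())))    # stemmed form: both documents
print(postings(exact_index), postings(stem_index))
```

Exact and quoted searches would go to the first index and stemmed
searches to the second, at the cost of carrying both sets of postings.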

Another alternative might be to make no change to the index, storing
just the standard-analysed document in the index, and doing some kind of
query expansion at query time: perhaps matching a language-specific
analysed version of the query term(s) against a list of terms from the
index which stem to the same root, and then querying using those terms
rather than the original query. I am not sure if I am totally clear...
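
In case it helps, here is roughly what I mean by the second option, as
a sketch only. The vocabulary stands for the term list of the
unchanged, standard-analysed index; toy_stem, build_stem_map and
expand_query are hypothetical names, not anything in Nutch.

```python
from collections import defaultdict

def toy_stem(token):
    """Hypothetical stemmer: strip a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def build_stem_map(vocabulary):
    """Group the index's existing terms by the root they stem to."""
    stem_map = defaultdict(set)
    for term in vocabulary:
        stem_map[toy_stem(term)].add(term)
    return stem_map

def expand_query(query, stem_map):
    """Replace each query term by all index terms sharing its root.

    Returns one sorted list of alternatives per query term; a search
    would OR the alternatives within a list and AND across the lists.
    A term with no match in the index is kept as it was typed.
    """
    expanded = []
    for term in query.lower().split():
        variants = stem_map.get(toy_stem(term), {term})
        expanded.append(sorted(variants))
    return expanded

# Terms as they would appear in the unchanged, standard-analysed index.
vocabulary = {"walk", "walks", "walking", "dog", "dogs", "the"}
stem_map = build_stem_map(vocabulary)

print(expand_query("walking dogs", stem_map))
# [['walk', 'walking', 'walks'], ['dog', 'dogs']]
```

The index keeps every word form, so exact and quoted searches still
work, and the stemming cost moves to query time.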

Just my two cents....

On Fri, 2005-06-10 at 17:02 +0200, Jérôme Charron wrote:
> I was thinking about it for a while: Multi-Lingual support in Nutch.
> After looking at the Nutch code, I wrote a proposal on the Wiki (
> http://wiki.apache.org/nutch/MultiLingualSupport).
> Since I'm not yet an expert of the Nutch core (hope to become one), this 
> mail is a kind of request for comments about the proposal.
> 
> Thanks,
> 
> Jerome

-- 
nutdev2001 <[EMAIL PROTECTED]>

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
