Re: Accentued characters in result

Josef Novak Fri, 10 Nov 2006 19:01:26 -0800

Hi Marc,

I'm not an expert by a long shot, so take this as a set of pointers on
where to look, rather than any kind of gospel.  However, when I have wanted
to make similar changes, the following has worked for me.


You might take a look at the JavaCC file
NutchAnalysis.jj

This file contains a Backus-Naur style grammar that Nutch uses to generate
CharStream.java, as well as a couple of other files.  I think it's where the
tokenization process really begins.  The problem with changing things here,
however, is that you might lose the distinction between your accented
characters and their unaccented counterparts - which you probably don't
want to do.

The other thing to look at would probably be
BasicQueryFilter.java, in NUTCH_HOME/src/plugin/query-basic/.../

Unless you've modified it, I think this is where your query is getting
parsed.  The problem with this, however, is that it relies on Lucene classes
like:
org.apache.lucene.search.BooleanQuery

for most of the real work.  Thus, if you want to change anything you will
need to either rewrite them from scratch yourself, or download the Lucene
source code, modify the proper files, recompile, and replace the current
Lucene-related .jar files that reside in your NUTCH_HOME/lib/ folder.
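
For what it's worth, whichever place you end up changing, the transformation
you'd want applied to the terms is accent folding.  Here is a minimal
standalone sketch of that idea in plain Java.  It is not Nutch or Lucene
code: the class name AccentFolder and the method fold are made up for
illustration, and it assumes Java 6 or later for java.text.Normalizer.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Nutch or Lucene.
final class AccentFolder {

    // Decompose each character into base letter + combining marks (NFD),
    // then drop the combining marks so only the base letters remain.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "café", written with an escape to avoid
        // source-encoding problems.
        System.out.println(fold("caf\u00e9"));
        System.out.println(fold("cafe"));
    }
}
```

If something like this were applied to tokens both when indexing and when
parsing the query, "café" and "cafe" would end up as the same term.  Letters
that have no Unicode decomposition (for example "ø") pass through unchanged,
and where exactly to hook this into Nutch is left open here.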

hope this helps,
joe

On 11/11/06, Marc DELERUE <[EMAIL PROTECTED]> wrote:


Hello,

In Nutch 0.7, I'd like to obtain results with accented characters and
non-accented characters with the same query.
e.g.: I want to make Nutch display "café" and "cafe" when I type
"cafe".

I couldn't find out how to do this, so I would be very glad if someone could
help me or just send me information about it.

Thank you very much

Kind Regards

Marc
