Hi Marc, I'm not an expert by a long shot, so take this as a set of pointers on where to look rather than as any kind of gospel. That said, when I have wanted to make similar changes, the following has worked for me.
You might take a look at the JavaCC file NutchAnalysis.jj. This file contains a Backus-Naur style grammar that Nutch uses (via JavaCC) to generate CharStream.java, as well as a couple of other files. I think it's where the tokenization process really begins. The problem with folding accents at that level, however, is that you might lose the distinction between your accented characters and their unaccented counterparts, which you probably don't want to do.

The other thing to look at would probably be BasicQueryFilter.java, in NUTCH_HOME/src/plugin/query-basic/.../ Unless you've modified it, I think this is where your query is getting parsed. The problem here, however, is that it relies on Lucene classes like org.apache.lucene.search.BooleanQuery for most of the real work. Thus, if you want to change anything at that level, you will need to either rewrite them from scratch yourself, or download the Lucene source code, modify the proper files, recompile, and replace the current Lucene-related .jar files that reside in your NUTCH_HOME/lib/ folder.

Hope this is helpful,
joe

On 11/11/06, Marc DELERUE <[EMAIL PROTECTED]> wrote:
Hello, In Nutch 0.7, I'd like to obtain results with accented and unaccented characters from the same query. E.g., I want Nutch to display both "café" and "cafe" when I type "cafe". I couldn't find out how to do this, so I would be very glad if someone could help me or just send me information about it. Thank you very much. Kind regards, Marc
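
For illustration, the accent folding discussed in this thread can be sketched in plain Java. This is only a minimal sketch of the general idea, not Nutch or Lucene code: the class name AccentFolding and the helper foldAccents are hypothetical, and it assumes Java 6 or later for java.text.Normalizer. It does not show where such a step would be hooked into NutchAnalysis.jj or BasicQueryFilter.java. For "cafe" to match "café", the same folding would presumably have to be applied both to the indexed text and to the query terms.

```java
import java.text.Normalizer;

public class AccentFolding {

    // Hypothetical helper (not part of Nutch or Lucene):
    // decompose each accented character into a base letter plus
    // combining marks (NFD), then drop the combining marks.
    static String foldAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "café", written with an escape so the source
        // file encoding does not matter.
        String accented = "caf\u00e9";
        String plain = "cafe";

        System.out.println(foldAccents(accented));
        // After folding, both spellings compare equal.
        System.out.println(foldAccents(accented).equals(foldAccents(plain)));
    }
}
```

Note that folding in this way discards the accented/unaccented distinction entirely, which is the trade-off mentioned in the reply above.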
