Lyndon Maydwell wrote:
> Ah, yes, of course. I was a bit hasty with my question.
>
> I was really referring to the results returned from the Nutch web-application.
>
> I'm also getting a lot of requests to change some of the configuration
> options relating to addresses Nutch considers equivalent. Is it

Such as?

> possible to alter the configuration files in the web-application and
> have these changes reflected in the results returned? Or are these
> options only used on crawling/indexing etc? If so, can I regenerate
> the database somehow to have new configuration options recognized?

Two subsystems in Nutch handle this: URLFilters, which simply say yes or no to each URL so that you can drop unwanted ones, and URLNormalizers, which bring URLs to their "canonical" form, whatever that means in your case. The default URL normalizer in Nutch simply resolves relative paths and removes some session id junk (see conf/regex-normalize.xml).
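
To make the split between the two concrete, here is a rough sketch in Python. It is not Nutch code and does not use the Nutch plugin API; the rules, the example.org host, and the names accept(), normalize() and process() are all made up for illustration. Running the normalizer before the filter is also just an assumption of this sketch.

```python
import re
from urllib.parse import urljoin, urlsplit, urlunsplit

# Hypothetical filter rules (not taken from any Nutch config):
# each entry is (accept?, pattern), and the first matching rule decides.
FILTER_RULES = [
    (False, re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),
    (True, re.compile(r"^https?://([a-z0-9-]+\.)*example\.org/")),
]

# Hypothetical normalization rule: strip a ";jsessionid=..." path parameter.
SESSION_ID = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)


def accept(url):
    """Filter step: a plain yes/no answer for one URL."""
    for ok, pattern in FILTER_RULES:
        if pattern.search(url):
            return ok
    return False  # nothing matched, so reject


def normalize(base, link):
    """Normalizer step: bring a link to one canonical form."""
    absolute = urljoin(base, link)            # resolve relative paths
    absolute = SESSION_ID.sub("", absolute)   # remove session id junk
    scheme, host, path, query, _fragment = urlsplit(absolute)
    # Lowercase the host, default an empty path to "/", drop the fragment.
    return urlunsplit((scheme, host.lower(), path or "/", query, ""))


def process(base, link):
    """Normalize first, then filter; return None for rejected URLs."""
    url = normalize(base, link)
    return url if accept(url) else None
```

With rules like these, two addresses that differ only in a session id or in how the relative path was written end up as the same string, which is the sense in which a normalizer decides what counts as "equivalent".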

Once you tweak these two to match your expectations, you can regenerate the crawldb by updating it once again from the already fetched segments.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
