Hi,
I have upgraded from NUTCH 9.0 to nutch-2007-09-30_04-01-28.tar.gz.
It seems the indexer is unable to update the field "TITLE" of the Lucene index
when processing specific html documents.
Please find below a brief summary of this issue:
1.- Extract the new version into a separate directory and copy across the
following configuration files:
- {nutch_home_9.0}/bin/url folder, containing the urls
- {nutch_home_9.0}/conf/nutch-site.xml
- {nutch_home_9.0}/conf/crawl-urlfilter.txt
2.- To reproduce the issue, you would need to copy the attached html document
to your webserver/filesystem.
3.- Run the crawl using the following command.
./nutch crawl urls -dir crawl -depth 22
4.- Open the index using Luke.
5.- Select the "document" tab, move through the docs until you find the above
document.
You will see that the TITLE field is empty --> INCORRECT because this html
document contains a title.
6.- Now, open the html document, add a space anywhere, then save it again.
7.- Repeat steps 3 and 4.
You will notice that this time the "TITLE" field contains the correct
information.
This problem does NOT occur using NUTCH 9.0.
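
To rule out the document itself, the title it contains can be checked independently of Nutch and Luke. Below is a minimal sketch in Python using only the standard library; the helper name extract_title and the file name page.html are hypothetical and are not part of Nutch. It only shows what title a lenient HTML parser sees in the file, not how the Nutch parser or indexer handles it.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names are delivered in lower case by HTMLParser.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace so the result is easy to compare.
        return " ".join("".join(self._parts).split())


def extract_title(html_text):
    """Return the text of the first <title>, or "" if there is none."""
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.title


# Hypothetical usage against the attached document saved as page.html:
# with open("page.html", encoding="utf-8", errors="replace") as f:
#     print(repr(extract_title(f.read())))
```

If this prints a non-empty title for the attached document while Luke shows an empty TITLE field, the title is present in the source and is being lost somewhere in the crawl or indexing.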
Any input would be appreciated,
Serg