Hi,
I have upgraded from NUTCH 9.0 to nutch-2007-09-30_04-01-28.tar.gz.
It seems the indexer is unable to update the field "TITLE" of the Lucene index
when processing specific html documents.
Please find below a brief summary of this issue:
1.- Extracted the new version into a separate directory and copied across the
following configuration files:
- {nutch_home_9.0}/bin/url folder, containing the urls
- {nutch_home_9.0}/conf/nutch-site.xml
- {nutch_home_9.0}/conf/crawl-urlfilter.txt
2.- To reproduce the issue, you would need to copy the attached html document
to your webserver/filesystem.
3.- Run the crawl using the following command:
./nutch crawl urls -dir crawl -depth 22
4.- Open the index using Luke.
5.- Select the "document" tab, move through the docs until you find the above
document.
You will see that the TITLE field is empty --> INCORRECT because this html
document contains a title.
6.- Now, open the html document, add a space anywhere then save it again.
7.- Repeat steps 3 and 4.
You will notice that this time the "TITLE" field contains the correct
information.
This problem does NOT occur using NUTCH 9.0.
Please advise,
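
Steps 5 and 6 above can also be checked and scripted outside of Nutch and Luke. The following is only a minimal sketch using the Python standard library, not anything shipped with Nutch: the helper names `extract_title` and `append_space` are hypothetical, and appending a space at the end of the file is just one way of "adding a space anywhere".

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names are lowercased by HTMLParser before they reach us.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Hypothetical helper: return the stripped title text, or '' if none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip()


def append_space(path):
    """Hypothetical helper for step 6: append one space byte to the file."""
    with open(path, "ab") as handle:
        handle.write(b" ")
```

Calling `extract_title` on the contents of the attached document should return a non-empty string if the document really contains a title, which is the expectation the empty TITLE field in the index contradicts. Calling `append_space` on the copy served to the crawler changes the file by exactly one byte before the crawl is repeated.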
Many thanks in advance for your support
Serg