Hi,
 
I have upgraded from NUTCH 9.0 to nutch-2007-09-30_04-01-28.tar.gz.
 
It seems the indexer is unable to update the field "TITLE" of the Lucene index 
when processing specific html documents.
 
 
Please find below a brief summary of this issue:
 
1.- Extracted this new version in a separate directory and copied across the 
following configuration files:
- {nutch_home_9.0}/bin/url folder, containing the urls
- {nutch_home_9.0}/conf/nutch-site.xml
- {nutch_home_9.0}/conf/crawl-urlfilter.txt
 
2.- To reproduce the issue, you would need to copy the attached html document 
to your webserver/filesystem.
 
3.- Run the crawl using the following command.
./nutch crawl urls -dir crawl -depth 22
 
4.- Open the index using Luke. 
 
5.- Select the "document" tab and move through the docs until you find the above 
document.
You will see that the TITLE field is empty --> INCORRECT, because this html 
document contains a title.
 
6.- Now, open the html document, add a space anywhere, then save it again.
 
7.- Repeat steps 3 and 4.

You will notice that this time the "TITLE" field contains the correct 
information.
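
For reference, steps 3, 6 and 7 can be sketched as a small shell script. This is only a rough sketch: the DOC path and the two function names are placeholders made up for illustration, and the only real command in it is the crawl command from step 3. It assumes it is run from the same directory as that command.

```shell
#!/bin/sh
# Sketch of steps 3, 6 and 7 from the list above.
# DOC is a hypothetical path; point it at wherever the attached
# html document was copied on the webserver/filesystem.
DOC="${DOC:-/var/www/html/test.html}"

# Step 6: append a single space so that the file content changes.
append_space() {
    printf ' ' >> "$1"
}

# Steps 3 and 7: the same crawl command as in step 3.
run_crawl() {
    ./nutch crawl urls -dir crawl -depth 22
}

# Intended sequence (left commented out so nothing runs by accident):
#   run_crawl                # step 3, then inspect TITLE in Luke (steps 4-5)
#   append_space "$DOC"      # step 6
#   run_crawl                # step 7, then inspect TITLE in Luke again
```

The index still has to be checked by hand in Luke after each crawl; the script only covers the crawl and the file change.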
 
This problem does NOT occur using NUTCH 9.0.
 
Please advise,
 
Many thanks in advance for your support
 
Serg

