Hi,
 
I have upgraded from NUTCH 9.0 to nutch-2007-09-30_04-01-28.tar.gz.
 
It seems the indexer is unable to update the field "TITLE" of the Lucene index 
when processing specific html documents.
 
 
Please find below a brief summary of this issue:
 
1.- Extracted this new version into a separate directory and copied across the 
following configuration files:
- {nutch_home_9.0}/bin/url folder, containing the urls
- {nutch_home_9.0}/conf/nutch-site.xml
- {nutch_home_9.0}/conf/crawl-urlfilter.txt
 
2.- To reproduce the issue, you would need to copy the attached html document 
to your webserver/filesystem.
 
3.- Run the crawl using the following command.
./nutch crawl urls -dir crawl -depth 22
 
4.- Open the index using Luke. 
 
5.- Select the "document" tab and move through the docs until you find the above 
document.
You will see that the TITLE field is empty  --> INCORRECT because this html 
document contains a title.
 
6.- Now, open the html document, add a space anywhere then save it again.
 
7.- Repeat steps 3 and 4.

You will notice that this time the "TITLE" field contains the correct 
information.
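
As a sanity check outside Nutch, the claim that the document really contains a 
title can be verified with a small script. The following is only a minimal 
sketch using Python's standard html.parser module; the names TitleParser and 
extract_title are made up for illustration and have nothing to do with how 
Nutch itself parses documents.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while we are between <title> and </title>
        self.done = False       # True once the first title has been closed
        self.parts = []         # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> also matches here.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the stripped title text, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip()
```

Reading the attached document into a string and passing it to extract_title 
should return a non-empty value both before and after the edit in step 6, 
which would point at the crawl/index side rather than at the markup.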
 
This problem does NOT occur using NUTCH 9.0
 
Any input would be appreciated,
 
Serg

