[ https://issues.apache.org/jira/browse/NUTCH-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-568.
-------------------------------
    Resolution: Won't Fix

> Indexer does not update the Lucene "TITLE" field
> ------------------------------------------------
>
>                 Key: NUTCH-568
>                 URL: https://issues.apache.org/jira/browse/NUTCH-568
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>         Environment: Windows XP
>            Reporter: smorales
>         Attachments: RN-071018-000024.html
>
>
> Hi,
> The indexer is unable to update the "TITLE" field of the Lucene index when
> processing specific html documents.
> This issue has been reproduced using Nutch-Nightly Build #241 (Oct 19, 2007
> 4:01:28 AM).
> The problem does not occur using NUTCH 9.0.
> Workflow:
> 1.- Extract the package and copy across the following configuration files
> from NUTCH 9.0:
>     - {nutch_home_9.0}/bin/url folder, containing the urls
>     - {nutch_home_9.0}/conf/nutch-site.xml
>     - {nutch_home_9.0}/conf/crawl-urlfilter.txt
> 2.- To reproduce the issue, copy the attached html document to your
> webserver/filesystem.
> 3.- Run the crawl.
>     For example: ./nutch crawl urls -dir crawl -depth 22
> 4.- Open the index using Luke. For this test, I used lukeall-0.7.1.jar
> 5.- In the Luke window, select the "document" tab and move through the docs
> until you find our html document.
>     You will see that the TITLE field is empty --> INCORRECT, because this
> html document contains a title.
> 6.- Now, open the html document, add a space anywhere, then save it again.
> 7.- Repeat steps 3 and 4.
>     You will notice that this time the "TITLE" field contains the correct
> information.
> Please advise,
> Many thanks in advance for your support.
> Sergio

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira