[
https://issues.apache.org/jira/browse/NUTCH-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
smorales updated NUTCH-568:
---------------------------
Attachment: RN-071018-000024.html
> Indexer does not update the Lucene "TITLE" field
> ------------------------------------------------
>
> Key: NUTCH-568
> URL: https://issues.apache.org/jira/browse/NUTCH-568
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.0.0
> Environment: Windows XP
> Reporter: smorales
> Attachments: RN-071018-000024.html
>
>
> Hi,
> The indexer is unable to update the field "TITLE" of the Lucene index when
> processing specific html documents.
> This issue has been reproduced using Nutch-Nightly Build #241 (Oct 19, 2007
> 4:01:28 AM).
> The problem does not occur using NUTCH 9.0.
> Workflow:
> 1.- Extract the package and copy across the following configuration files from
> NUTCH 9.0:
> - {nutch_home_9.0}/bin/url folder, containing the urls
> - {nutch_home_9.0}/conf/nutch-site.xml
> - {nutch_home_9.0}/conf/crawl-urlfilter.txt
> 2.- To reproduce the issue, copy the attached html document to
> your webserver/filesystem.
> 3.- Run the crawl.
> For example: ./nutch crawl urls -dir crawl -depth 22
> 4.- Open the index using Luke. For this test, I used lukeall-0.7.1.jar
> 5.- Select the "document" tab and move through the docs until you
> find our html document.
> You will see that the TITLE field is empty --> INCORRECT because this html
> document contains a title.
> 6.- Now, open the html document, add a space anywhere then save it again.
> 7.- Repeat steps 3 and 4.
> You will notice that this time the "TITLE" field contains the correct
> information.
> Please advise,
> Many thanks in advance for your support.
> Sergio
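Step 5 relies on the claim that the attached html document really does contain a title. A minimal sketch for confirming that independently of the index is shown below. It uses only the Python standard library and does not touch Nutch or Lucene; the names TitleExtractor and extract_title are hypothetical helpers introduced for illustration, not part of any Nutch tooling.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text inside the first <title> element of a document."""

    def __init__(self):
        super().__init__()
        self._in_title = False   # True while we are between <title> and </title>
        self._done = False       # True once the first title has been closed
        self._parts = []         # raw text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> matches as well.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and trim the ends.
        return " ".join("".join(self._parts).split())


def extract_title(html_text):
    """Return the document title, or an empty string if there is none."""
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.title


if __name__ == "__main__":
    sample = "<html><head><TITLE> Release  Note </TITLE></head><body>x</body></html>"
    print(repr(extract_title(sample)))
```

If extract_title returns a non-empty string for the attached file while Luke shows an empty TITLE field, the title is being lost somewhere between parsing and indexing rather than being absent from the source document.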