[ 
https://issues.apache.org/jira/browse/NUTCH-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-568.
-------------------------------

    Resolution: Won't Fix

> Indexer does not update the Lucene "TITLE" field
> ------------------------------------------------
>
>                 Key: NUTCH-568
>                 URL: https://issues.apache.org/jira/browse/NUTCH-568
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>         Environment: Windows XP
>            Reporter: smorales
>         Attachments: RN-071018-000024.html
>
>
> Hi,
> The indexer is unable to update the field "TITLE" of the Lucene index when 
> processing specific html documents.
> This issue has been reproduced using Nutch-Nightly Build #241 (Oct 19, 2007 
> 4:01:28 AM).
> The problem does not occur with NUTCH 9.0.
> Workflow:
> 1.- Extract the package and copy across the following configuration files from 
> NUTCH 9.0:
> - {nutch_home_9.0}/bin/url folder, containing the urls
> - {nutch_home_9.0}/conf/nutch-site.xml
> - {nutch_home_9.0}/conf/crawl-urlfilter.txt
> 2.- To reproduce the issue, you need to copy the attached html document to 
> your webserver/filesystem.
> 3.- Run the crawl.
> For example: ./nutch crawl urls -dir crawl -depth 22
> 4.- Open the index using Luke.  For this test, I used lukeall-0.7.1.jar
> 5.- In the Luke window, select the "document" tab and step through the docs 
> until you find our html document.
> You will see that the TITLE field is empty  --> INCORRECT because this html 
> document contains a title.
> 6.- Now, open the html document, add a space anywhere then save it again.
> 7.- Repeat steps 3 and 4.
> You will notice that this time the "TITLE" field contains the correct 
> information.
> Please advise.
> Many thanks in advance for your support.
> Sergio
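
Step 5 above relies on the claim that the attached html document really does contain a title. A quick way to check that independently of Nutch and Luke is to parse the file directly. The following is only a minimal sketch using Python's standard html.parser; the names TitleExtractor and extract_title are hypothetical helpers introduced here for illustration, and this is not how the Nutch HTML parser plugin extracts titles.

```python
import sys
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <TITLE> also matches.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and trim both ends.
        return " ".join("".join(self._parts).split())


def extract_title(html_text):
    """Return the document title, or an empty string if there is none."""
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.title


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed usage: pass the path of the attached html document.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        print(repr(extract_title(handle.read())))
```

Running the script against the attachment (for example RN-071018-000024.html) and getting a non-empty string would support the expectation that the TITLE field in the index should not be empty. The utf-8 encoding with errors="replace" is an assumption; the real encoding of the attachment is not stated in the report.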

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
