Hi Stefan,

that seems to have worked. I also verified that my patch to the
PDF parser actually prevents "unclean" IOExceptions (see
http://issues.apache.org/jira/browse/NUTCH-290).

The strange thing, however, is that I still see "garbage" (undecoded
binary data from the PDF file) in the search summaries. Could it be
that, since my plugin now returns empty content (and thereby prevents
an exception), some other place in the source thinks "no summary?
Then I'll grab the raw content instead"?

My problem is that for unparseable files I get binary data in the
summaries. The special case, in my eyes, is PDF files: previously an
exception led to a "parse failed", which the patch now prevents. So the
parse is reported as fine, but I still get binary summaries :-(
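
Just to illustrate my suspicion (everything below is made up for
illustration, it is NOT the actual Nutch code), here is the logic I
imagine, plus the kind of guard I'd hope for:

  // Sketch of the fallback I suspect, plus a guard against it.
  public class SummaryFallbackSketch {

    // Crude heuristic: treat text with many control chars as binary.
    static boolean looksBinary(String text) {
      int control = 0;
      for (int i = 0; i < text.length(); i++) {
        char c = text.charAt(i);
        if (c < 0x20 && c != '\n' && c != '\r' && c != '\t') control++;
      }
      return text.length() > 0 && control * 10 > text.length();
    }

    static String chooseSummary(String parsedText, byte[] rawContent) {
      if (parsedText != null && parsedText.length() > 0) {
        return parsedText;
      }
      // Suspected fallback: grab the raw bytes when the parse text is
      // empty -- for a PDF this would be exactly the garbage I see.
      String raw = new String(rawContent);
      return looksBinary(raw) ? "" : raw;
    }

    public static void main(String[] args) {
      byte[] fakePdf = { '%', 'P', 'D', 'F', 1, 2, 3, 4, 5 };
      // Prints an empty summary instead of binary garbage:
      System.out.println("[" + chooseSummary("", fakePdf) + "]");
    }
  }

If some summarizer really does a fallback like chooseSummary(), then a
looksBinary()-style check would at least keep raw PDF bytes out of the
summaries.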


Could you maybe have a look at the issue? A test PDF is mentioned
there as well, and I can offer more :-)
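
Following your JUnit suggestion below, this is roughly what I have in
mind for that test PDF (only a sketch: parseText() and the file name
are stand-ins I made up, not real Nutch API):

  import java.io.File;
  import java.io.FileInputStream;
  import junit.framework.TestCase;

  public class TestPdfSummary extends TestCase {

    // Stand-in for whatever parse entry point we test against.
    static String parseText(byte[] raw) {
      return "";  // my patched parse-pdf returns empty text on failure
    }

    public void testNoBinaryGarbageInParseText() throws Exception {
      File pdf = new File("broken-test.pdf");  // test PDF from the issue
      byte[] raw = new byte[(int) pdf.length()];
      FileInputStream in = new FileInputStream(pdf);
      in.read(raw);
      in.close();

      String text = parseText(raw);
      for (int i = 0; i < text.length(); i++) {
        char c = text.charAt(i);
        assertFalse("binary garbage in parse text",
            c < 0x20 && c != '\n' && c != '\r' && c != '\t');
      }
    }
  }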


Regards,
 Stefan

Stefan Groschupf wrote:
> You can just delete the parse output folders and start the parsing tool.
> Parsing a given page again only makes sense for debugging, since the
> Hadoop IO system cannot update entries.
> If you need to debug, I suggest writing yourself a JUnit test.
> 
> HTH
> Stefan
> 
> 
> On 29.05.2006 at 01:01, Stefan Neufeind wrote:
> 
>> Hi,
>>
>> what is needed to re-parse documents that were already fetched into a
>> segment? Is another "nutch index ..." run sufficient, or how could I
>> send the documents through the parse plugins again?
>>
>>
>> Regards,
>>  Stefan

