Hi Stefan, that seems to have worked. And I verified that my patch to the PDF parser actually prevents "unclean" IOExceptions (see http://issues.apache.org/jira/browse/NUTCH-290).
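
For reference, the pattern the patch follows is roughly this; a minimal sketch only, assuming a recent PDFBox, and the class/method names (SafePdfText, extractTextSafely) are made up for illustration, not the actual parse-pdf plugin code:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class SafePdfText {
      /** Returns the extracted text, or "" if the PDF cannot be parsed. */
      public static String extractTextSafely(byte[] raw) {
        PDDocument doc = null;
        try {
          doc = PDDocument.load(new ByteArrayInputStream(raw));
          return new PDFTextStripper().getText(doc);
        } catch (IOException e) {
          // Instead of letting the "unclean" IOException bubble up (and the
          // parse be marked failed), report empty content.
          return "";
        } finally {
          try { if (doc != null) doc.close(); } catch (IOException ignored) {}
        }
      }
    }

So a broken PDF now yields empty text and a successful parse instead of an exception.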
The strange thing, however, is that I still see "garbage" (undecoded binary data from the PDF file) in the search summaries. Could it be that, since my plugin now returns empty content (and thereby prevents an exception), some other place in the code thinks "no summary? Then I'll grab the raw content instead"? My problem is that for unparseable files I get binary data in the summaries. The special case, in my eyes, is PDF files, where the patch now prevents an exception that used to result in a "parse failed". Now the parse succeeds, but I still get binary summaries :-( Could you maybe have a look at the issue? A test PDF is mentioned there as well. And I can offer more :-)

Regards,
 Stefan

Stefan Groschupf wrote:
> You can just delete the parse output folders and start the parsing tool.
> Parsing a given page again only makes sense for debugging, since the
> Hadoop IO system cannot update entries.
> If you need to debug, I suggest you write a JUnit test.
>
> HTH
> Stefan
>
>
> On 29.05.2006 at 01:01, Stefan Neufeind wrote:
>
>> Hi,
>>
>> What is needed to re-parse documents that were already fetched into a
>> segment? Is another "nutch index ..." run sufficient, or how could I
>> send the documents through the parse plugins again?
>>
>>
>> Regards,
>>  Stefan
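PS: Following up on the JUnit suggestion, below is a rough sketch of what I'd start from. It calls the hypothetical extractTextSafely helper from above, and "broken.pdf" is a placeholder for the test PDF from the issue; this is not the actual Nutch plugin test harness. (And by "parse output folders" I assume you mean crawl_parse, parse_data and parse_text under the segment directory, which "bin/nutch parse <segment>" then regenerates; correct me if I have the layout wrong.)

    import static org.junit.Assert.assertEquals;

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.junit.Test;

    public class BrokenPdfTest {
      @Test
      public void brokenPdfYieldsEmptyTextNotException() throws Exception {
        // "broken.pdf" stands in for the test PDF mentioned in NUTCH-290.
        byte[] raw = Files.readAllBytes(Paths.get("broken.pdf"));
        // The parse should degrade to empty text, not throw.
        assertEquals("", SafePdfText.extractTextSafely(raw));
      }
    }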
