Re: nutch crawl issue

matthew a. grisius Sat, 01 May 2010 20:14:26 -0700

Hi Julien,

On Thu, 2010-04-29 at 18:36 +0100, Julien Nioche wrote:
> Hi Matthew,
> 
> There is an open issue with Tika (e.g.
> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
> differences between parse-html and parse-tika. Note that you can specify
> parse-(html|pdf) in order to get both HTML and PDF files.
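
For reference, a minimal sketch of how a pattern like parse-(html|pdf)
picks out plugin ids. It assumes the pattern has to match the whole
plugin id, which is how I understand the plugin.includes setting to
behave. Nutch itself is Java; Python is used here only to illustrate the
regular expression, and the names PARSE_PATTERN, PLUGIN_IDS and selected
are made up for the example.

```python
import re

# Hypothetical: only the parse portion of a plugin.includes value.
PARSE_PATTERN = re.compile(r"parse-(html|pdf)")

# Plugin ids mentioned in this thread.
PLUGIN_IDS = ["parse-html", "parse-pdf", "parse-tika"]


def selected(ids, pattern=PARSE_PATTERN):
    """Return the ids that the pattern matches in full (assumed semantics)."""
    return [plugin_id for plugin_id in ids if pattern.fullmatch(plugin_id)]


if __name__ == "__main__":
    # parse-tika is not selected by this pattern.
    print(selected(PLUGIN_IDS))
```

Under that assumption the pattern enables parse-html and parse-pdf and
leaves parse-tika out.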


The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
PDFs, but has problems with some HTML. Nutch 1.1 includes more current
PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

> 
> Could you please open an issue in JIRA
> (https://issues.apache.org/jira/browse/NUTCH) and attach the file you are
> trying to process? I'll have a look and see if it is related to TIKA-379.

I submitted NUTCH-817 (https://issues.apache.org/jira/browse/NUTCH-817)
with the file attached.

Thanks.

-m.

> 
> Thanks
> 
> Julien
