Hi Julien,

On Thu, 2010-04-29 at 18:36 +0100, Julien Nioche wrote:
> Hi Matthew,
>
> There is an open issue with Tika (e.g.
> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
> differences between parse-html and parse-tika. Note that you can specify:
> *parse-(html|pdf)* in order to get both HTML and PDF files.

The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my PDFs,
but has problems with some html. Nutch 1.1 includes more current PDFBox
jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

>
> Could you please open an issue in JIRA
> (https://issues.apache.org/jira/browse/NUTCH) and attach the file you are
> trying to process? I'll have a look and see if it is related to TIKA-379.

I submitted NUTCH-817
https://issues.apache.org/jira/browse/NUTCH-817
with the attached file

Thanks.

-m.

>
> Thanks
>
> Julien
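
For reference, a small sketch of how the parse-(html|pdf) pattern quoted in this thread selects both parsers. It assumes the plugin list is applied as a regular expression that must match a whole plugin id; Nutch itself uses Java regular expressions, and Python's `re` module is used here only because the syntax is the same for a pattern this simple. The `PLUGIN_IDS` list and the `included_plugins` helper are hypothetical names made up for the illustration, not part of Nutch.

```python
import re

# Hypothetical sample of plugin ids, for illustration only.
PLUGIN_IDS = ["parse-html", "parse-pdf", "parse-tika", "parse-text", "protocol-http"]


def included_plugins(pattern, plugin_ids):
    """Return the ids that the pattern matches in full.

    Assumption: the include pattern must match the entire plugin id,
    so "parse-(html|pdf)" does not pick up "parse-tika".
    """
    compiled = re.compile(pattern)
    return [pid for pid in plugin_ids if compiled.fullmatch(pid)]


# The alternation group selects the HTML and PDF parsers in one expression.
selected = included_plugins(r"parse-(html|pdf)", PLUGIN_IDS)
print(selected)
```

Under that assumption, the single alternation is enough to enable both parse-html and parse-pdf while leaving parse-tika out.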