Hi Matthew,

>> Hi Matthew,
>>
>> There is an open issue with Tika (e.g.
>> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
>> differences between parse-html and parse-tika. Note that you can specify:
>> *parse-(html|pdf)* in order to get both HTML and PDF files.
>
> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
> PDFs, but has problems with some html. Nutch 1.1 includes more current
> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

Interesting: well, one solution comes to mind. Can you test this out?

* Uncomment the lines:

    <mimeType name="text/html">
      <plugin id="parse-html" />
    </mimeType>

  in conf/parse-plugins.xml.

* Try your crawl again.

>
> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
> with the attached file

Thanks! Let me know what happens after you uncomment the lines above.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++