Hi Matthew,

>> Hi Matthew,
>> There is an open issue with Tika (e.g.
>> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
>> differences between parse-html and parse-tika. Note that you can specify
>> parse-(html|pdf) in order to get both HTML and PDF files.
> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
> PDFs, but has problems with some html. Nutch 1.1 includes more current
> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

Interesting. One solution comes to mind; can you test this out?

* uncomment these lines in conf/parse-plugins.xml:

        <mimeType name="text/html">
                <plugin id="parse-html" />
        </mimeType>

* try your crawl again.
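
If it helps, here is a minimal sketch for checking that the text/html
mapping is really active once it is uncommented. It uses only the Python
standard library. The function name plugins_for and the cut-down SAMPLE
document are my own illustration, not part of Nutch, and I am assuming the
file's layout is mimeType elements containing plugin elements, as in the
snippet above.

```python
import xml.etree.ElementTree as ET


def plugins_for(xml_text, mime_type):
    """Return the plugin ids mapped to mime_type, in document order.

    ElementTree's default parser drops XML comments, so a mapping that is
    still commented out simply does not show up in the result.
    """
    root = ET.fromstring(xml_text)
    return [
        plugin.get("id")
        for mapping in root.iter("mimeType")
        if mapping.get("name") == mime_type
        for plugin in mapping.findall("plugin")
    ]


# Hypothetical, cut-down stand-in for conf/parse-plugins.xml (assumption:
# the real file has many more entries than this).
SAMPLE = """<parse-plugins>
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
</parse-plugins>"""

# The same document with the text/html mapping still commented out.
SAMPLE_COMMENTED = """<parse-plugins>
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>
  <!--
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  -->
</parse-plugins>"""

if __name__ == "__main__":
    print(plugins_for(SAMPLE, "text/html"))
    print(plugins_for(SAMPLE_COMMENTED, "text/html"))
```

To run it against your own setup, read conf/parse-plugins.xml into a string
and pass that in place of SAMPLE. An empty list for text/html would suggest
the mapping is still commented out.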

> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
> with the attached file

Thanks! Let me know what happens after you uncomment the lines above.


Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
