Re: nutch crawl issue

Mattmann, Chris A (388J) Sat, 01 May 2010 21:06:58 -0700

Hi Matthew,

>> Hi Matthew,
>> 
>> There is an open issue with Tika (e.g.
>> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
>> differences between parse-html and parse-tika. Note that you can specify
>> parse-(html|pdf) in order to get both HTML and PDF files.
> 
> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
> PDFs, but has problems with some html. Nutch 1.1 includes more current
> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.


Interesting. One solution comes to mind; can you test this out?

* uncomment these lines in conf/parse-plugins.xml:

        <mimeType name="text/html">
                <plugin id="parse-html" />
        </mimeType>

* try your crawl again.
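
Before re-crawling, it may be worth confirming that the text/html mapping is really active and not still sitting inside a comment. Here is a rough sketch of such a check. The function name and the two sample documents are made up for illustration, and it assumes the file uses plain mimeType/plugin elements (no XML namespace) as in the snippet above:

```python
import xml.etree.ElementTree as ET


def active_plugins(xml_text, mime_type):
    """Return plugin ids mapped to mime_type, ignoring commented-out entries."""
    # The default ElementTree parser discards XML comments, so a mapping
    # that is still commented out simply does not appear in the tree.
    root = ET.fromstring(xml_text)
    return [
        plugin.get("id")
        for mime in root.iter("mimeType")
        if mime.get("name") == mime_type
        for plugin in mime.findall("plugin")
    ]


# Hypothetical, trimmed-down stand-in for conf/parse-plugins.xml
# with the text/html mapping still commented out.
COMMENTED = """<parse-plugins>
  <mimeType name="*"><plugin id="parse-tika" /></mimeType>
  <!--
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  -->
</parse-plugins>"""

# The same document after removing the comment markers.
UNCOMMENTED = COMMENTED.replace("<!--", "").replace("-->", "")

print(active_plugins(COMMENTED, "text/html"))    # []
print(active_plugins(UNCOMMENTED, "text/html"))  # ['parse-html']
```

To try it on a real installation, read the contents of conf/parse-plugins.xml and pass them in as the first argument; an empty list for "text/html" would suggest the lines are still commented out.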

> 
> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
> with the attached file

Thanks! Let me know what happens after you uncomment the lines above.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
