Hi Chris,

Yes, that worked. I caught up on email and noticed that Arpit also
mentioned the same thing. Sorry I missed it, thanks to both of you!

-m.

On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
> Hi Matthew,
> 
> >> Hi Matthew,
> >> 
> >> There is an open issue with Tika (e.g.
> >> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
> >> differences between parse-html and parse-tika. Note that you can specify
> >> *parse-(html|pdf)* in order to get both HTML and PDF files.
> > 
> > The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
> > rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
> > PDFs, but has problems with some HTML. Nutch 1.1 includes more current
> > PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
> 
> Interesting: well one solution comes to mind. Can you test this out?
> 
> * uncomment the lines:
> 
>         <mimeType name="text/html">
>                 <plugin id="parse-html" />
>         </mimeType>
> 
> in conf/parse-plugins.xml.
> 
> * try your crawl again.
> 
> > 
> > I submitted NUTCH-817 (https://issues.apache.org/jira/browse/NUTCH-817)
> > with the attached file.
> 
> Thanks! Let me know what happens after you uncomment the lines above.
> 
> Cheers,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.mattm...@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
