Hi Chris,

Yes, that worked. I caught up on email and noticed that Arpit also mentioned the same thing. Sorry I missed it, and thanks to both of you!
-m.

On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
> Hi Matthew,
>
> >> Hi Matthew,
> >>
> >> There is an open issue with Tika (e.g.
> >> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
> >> differences between parse-html and parse-tika. Note that you can specify:
> >> *parse-(html|pdf)* in order to get both HTML and PDF files.
> >
> > The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
> > rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
> > PDFs, but has problems with some html. Nutch 1.1 includes more current
> > PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
>
> Interesting: well one solution comes to mind. Can you test this out?
>
> * uncomment the lines:
>
> <mimeType name="text/html">
>   <plugin id="parse-html" />
> </mimeType>
>
> In conf/parse-plugins.xml.
>
> * try your crawl again.
>
> >
> > I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
> > with the attached file
>
> Thanks! Let me know what happens after you uncomment the line above.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.mattm...@jpl.nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++