Hi Matthew,

Awesome! Glad it worked. Now my next question: how often are you seeing that parse-tika doesn't work on HTML files? Is it all of the HTML that you are trying to process? Or just some of it? Or particular files (categories of them)? The reason I ask is that I'm trying to determine whether this is a systematic thing or an exception, and therefore whether I should commit the update below to 1.1 so it goes out with the 1.1 RC.

Let me know and thanks!

Cheers,
Chris

On 5/3/10 9:04 AM, "matthew a. grisius" <mgris...@comcast.net> wrote:

> Hi Chris,
>
> Yes, that worked. I caught up on email and noticed that Arpit also
> mentioned the same thing. Sorry I missed it, thanks to both of you!
>
> -m.
>
> On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
>> Hi Matthew,
>>
>>>> Hi Matthew,
>>>>
>>>> There is an open issue with Tika (e.g.
>>>> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
>>>> differences between parse-html and parse-tika. Note that you can specify
>>>> *parse-(html|pdf)* in order to get both HTML and PDF files.
>>>
>>> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
>>> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
>>> PDFs, but has problems with some html. Nutch 1.1 includes more current
>>> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
>>
>> Interesting: well one solution comes to mind. Can you test this out?
>>
>> * uncomment the lines:
>>
>> <mimeType name="text/html">
>>   <plugin id="parse-html" />
>> </mimeType>
>>
>> In conf/parse-plugins.xml.
>>
>> * try your crawl again.
>>
>>>
>>> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
>>> with the attached file
>>
>> Thanks! Let me know what happens after you uncomment the line above.
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: chris.mattm...@jpl.nasa.gov
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>