Hi Matthew,

I think Julien may have a fix for this in TIKA-379 [1]. I’ll take a look at Julien’s patch and see if there is a way to get it committed sooner rather than later.
One way to help me do that: since you already have an environment and a set of use cases where this is reproducible, can you apply TIKA-379 to a local checkout of Tika trunk (I’ll show you how) and then let me know if that fixes parse-tika for you? Here are the steps:

svn co http://svn.apache.org/repos/asf/lucene/tika/trunk ./tika
cd tika
wget "http://bit.ly/bXeLkf"
(if you don't have SSL support, then manually download the linked file)
patch -p0 < TIKA-379-3.patch
mvn install package

Then grab tika-core and tika-parsers out of the respective tika-core/target and tika-parsers/target directories and drop those jars in your parse-tika/lib folder, replacing their originals.

Then, try your nutch crawl again. See if that works. In the meanwhile, I'll inspect Julien's patch.

Thanks!

Cheers,
Chris

On 5/4/10 9:02 PM, "matthew a. grisius" <mgris...@comcast.net> wrote:

> Hi Chris,
>
> It appears to me that parse-tika has trouble with HTML FRAMESETS/FRAMES
> and/or javascript. Using the parse-html suggested work around I am able
> to process my simple test cases such as javadoc which does include
> simple embedded javascript (of course I can't verify that it is actually
> parsing it though).
> I expanded my testing to include two more complex
> examples that heavily use HTML FRAMESET/FRAME and more complex
> javascript:
>
> 134 mb, 11,269 files
> 1.9 gb, 133,978 files
>
> They both fail at the top level with the similar errors such as:
>
> fetching
> http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
> fetching
> http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocQuickRefs/DSDocBanner.htm
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=9
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> Error parsing:
> http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js:
> UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type
> text/javascript
> Attempting to finish item from unknown queue:
> org.apache.nutch.fetcher.fetcher$fetchi...@1532fc
> fetch of
> http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
> failed with: java.lang.ArrayIndexOutOfBoundsException: -56
> -finishing thread FetcherThread, activeThreads=2
>
> I tried several property settings to mimic the previous work around and
> could not solve it. Any suggestions?
>
> So, I'm not sure how to categorize the issues more accurately. I have
> many javadoc sets and lots of simple HTML that will now parse, but I
> have other examples such as the two mentioned above that won't parse and
> therefore can't be crawled. It seems to me to be systematic rather than
> exceptional. I cannot believe that I'm the only one who will experience
> these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks
> for asking.
>
> -m.
>
> On Mon, 2010-05-03 at 09:24 -0700, Mattmann, Chris A (388J) wrote:
>> Hi Matthew,
>>
>> Awesome! Glad it worked. Now my next question: how often are you seeing
>> that parse-tika doesn't work on HTML files? Is it all HTML that you are
>> trying to process? Or just some of them? Or particular ones (categories of
>> them). The reason I ask is that I'm trying to determine whether I should
>> commit the update below to 1.1 so it goes out with the 1.1 RC and if it's a
>> systematic thing versus an exception.
>>
>> Let me know and thanks!
>>
>> Cheers,
>> Chris
>>
>>
>> On 5/3/10 9:04 AM, "matthew a. grisius" <mgris...@comcast.net> wrote:
>>
>>> Hi Chris,
>>>
>>> Yes, that worked. I caught up on email and noticed that Arpit also
>>> mentioned the same thing. Sorry I missed it, thanks to both of you!
>>>
>>> -m.
>>>
>>> On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
>>>> Hi Matthew,
>>>>
>>>>>> Hi Matthew,
>>>>>>
>>>>>> There is an open issue with Tika (e.g.
>>>>>> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
>>>>>> differences between parse-html and parse-tika. Note that you can specify:
>>>>>> *parse-(html|pdf)* in order to get both HTML and PDF files.
>>>>>
>>>>> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
>>>>> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
>>>>> PDFs, but has problems with some html. Nutch 1.1 includes more current
>>>>> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
>>>>
>>>> Interesting: well one solution comes to mind. Can you test this out?
>>>>
>>>> * uncomment the lines:
>>>>
>>>> <mimeType name="text/html">
>>>>   <plugin id="parse-html" />
>>>> </mimeType>
>>>>
>>>> In conf/parse-plugins.xml.
>>>>
>>>> * try your crawl again.
>>>>
>>>>>
>>>>> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
>>>>> with the attached file
>>>>
>>>> Thanks! Let me know what happens after you uncomment the line above.
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Senior Computer Scientist
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 171-266B, Mailstop: 171-246
>>>> Email: chris.mattm...@jpl.nasa.gov
>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Assistant Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
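
For reference, the jar-swap step from the reply at the top of this thread (take the rebuilt tika-core and tika-parsers jars and drop them into parse-tika/lib, replacing the originals) could be scripted roughly as below. This is only a sketch: the function name replace_tika_jars, the jar file-name patterns, the skipping of -sources/-tests/-javadoc jars, and passing both directories as arguments are assumptions made for illustration, not something stated in the thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch or Tika): swap the Tika jars used by
# the parse-tika plugin for freshly built ones from a patched Tika checkout.
# Assumed, not from the thread: jars are named <module>-<version>.jar.

replace_tika_jars() {
    tika_dir=$1      # patched Tika checkout, already built with mvn
    plugin_lib=$2    # the parse-tika/lib folder of your Nutch install

    for module in tika-core tika-parsers; do
        new_jar=
        for candidate in "$tika_dir/$module/target/$module"-*.jar; do
            # Skip auxiliary artifacts a Maven build may also produce.
            case $candidate in
                *-sources.jar|*-tests.jar|*-javadoc.jar) continue ;;
            esac
            if [ -f "$candidate" ]; then
                new_jar=$candidate
                break
            fi
        done

        if [ -z "$new_jar" ]; then
            echo "no built jar found for $module in $tika_dir/$module/target" >&2
            return 1
        fi

        # Remove the original jar(s) so two versions never sit side by side.
        rm -f "$plugin_lib/$module"-*.jar
        cp "$new_jar" "$plugin_lib/"
    done
}
```

It would be called as, for example, replace_tika_jars ./tika /path/to/parse-tika/lib, where the second path is a placeholder for wherever the plugin lives in your Nutch installation. Keeping a backup copy of the original jars before running it makes it easy to revert if the patched build does not help.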