Hi Chris,

It appears to me that parse-tika has trouble with HTML FRAMESETs/FRAMEs and/or JavaScript. Using the suggested parse-html workaround, I am able to process my simple test cases such as javadoc, which does include simple embedded JavaScript (though of course I can't verify that it is actually being parsed). I expanded my testing to include two more complex examples that make heavy use of HTML FRAMESET/FRAME and more complex JavaScript:

134 MB, 11,269 files
1.9 GB, 133,978 files

They both fail at the top level with similar errors, such as:

fetching http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
fetching http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocQuickRefs/DSDocBanner.htm
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
Error parsing: http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js: UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type text/javascript
Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@1532fc
fetch of http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js failed with: java.lang.ArrayIndexOutOfBoundsException: -56
-finishing thread FetcherThread, activeThreads=2

I tried several property settings to mimic the previous workaround and could not solve it. Any suggestions?

So, I'm not sure how to categorize the issues more accurately. I have many javadoc sets and lots of simple HTML that will now parse, but I have other examples, such as the two mentioned above, that won't parse and therefore can't be crawled. It seems to me to be systematic rather than exceptional. I cannot believe that I'm the only one who will experience these issues with common HTML such as FRAMESET/FRAME/JavaScript.

Thanks for asking.

-m.

On Mon, 2010-05-03 at 09:24 -0700, Mattmann, Chris A (388J) wrote:
> Hi Matthew,
>
> Awesome! Glad it worked. Now my next question: how often are you seeing
> that parse-tika doesn't work on HTML files? Is it all HTML that you are
> trying to process? Or just some of them?
> Or particular ones (categories of them). The reason I ask is that I'm
> trying to determine whether I should commit the update below to 1.1 so it
> goes out with the 1.1 RC and if it's a systematic thing versus an exception.
>
> Let me know and thanks!
>
> Cheers,
> Chris
>
>
> On 5/3/10 9:04 AM, "matthew a. grisius" <mgris...@comcast.net> wrote:
>
> > Hi Chris,
> >
> > Yes, that worked. I caught up on email and noticed that Arpit also
> > mentioned the same thing. Sorry I missed it, thanks to both of you!
> >
> > -m.
> >
> > On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
> >> Hi Matthew,
> >>
> >>>> Hi Matthew,
> >>>>
> >>>> There is an open issue with Tika (e.g.
> >>>> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
> >>>> differences between parse-html and parse-tika. Note that you can specify:
> >>>> *parse-(html|pdf)* in order to get both HTML and PDF files.
> >>>
> >>> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
> >>> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
> >>> PDFs, but has problems with some html. Nutch 1.1 includes more current
> >>> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
> >>
> >> Interesting: well one solution comes to mind. Can you test this out?
> >>
> >> * uncomment the lines:
> >>
> >> <mimeType name="text/html">
> >>   <plugin id="parse-html" />
> >> </mimeType>
> >>
> >> In conf/parse-plugins.xml.
> >>
> >> * try your crawl again.
> >>
> >>>
> >>> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
> >>> with the attached file
> >>
> >> Thanks! Let me know what happens after you uncomment the line above.
> >>
> >> Cheers,
> >> Chris
> >>
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Senior Computer Scientist
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 171-266B, Mailstop: 171-246
> >> Email: chris.mattm...@jpl.nasa.gov
> >> WWW: http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Assistant Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++