Hi Matthew,

As you can see from the error messages, Tika does not know how to parse JavaScript. There is a legacy JavaScript parser in Nutch which you can activate in the usual way, i.e. by adding parse-js to plugin.includes. It generates a lot of spurious URLs, but you should give it a try and see whether it gives you the outlinks you expect. I think there have been quite a few discussions about JavaScript processing in the Nutch archives.
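For reference, here is a minimal sketch of what enabling parse-js could look like in conf/nutch-site.xml. The plugin list in the value is an assumption for illustration only: start from the plugin.includes value your installation already uses and add js to its parse-(...) group. Whether text/javascript is then actually routed to parse-js also depends on the mappings in conf/parse-plugins.xml, which is not checked here.

```xml
<!-- Goes inside the <configuration> element of conf/nutch-site.xml, -->
<!-- which overrides the defaults shipped in nutch-default.xml. -->
<!-- The plugin list below is illustrative, not a recommended set: -->
<!-- keep your own current value and only add "js" to the parse group. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|js|tika)|index-basic|scoring-opic</value>
</property>
```

The value is a regular expression matched against plugin ids, so parse-(html|js|tika) enables parse-html, parse-js and parse-tika together.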
BTW a good practice is to separate the fetching from the parsing step, so that if the parsing fails you won't need to refetch the URLs. That can be done if you call the fetch and parse commands (and not the all-in-one crawl command) and specify -noParsing while fetching.

HTH

Julien

> It appears to me that parse-tika has trouble with HTML FRAMESETS/FRAMES
> and/or javascript. Using the parse-html suggested work around I am able
> to process my simple test cases such as javadoc which does include
> simple embedded javascript (of course I can't verify that it is actually
> parsing it though). I expanded my testing to include two more complex
> examples that heavily use HTML FRAMESET/FRAME and more complex
> javascript:
>
> 134 mb, 11,269 files
> 1.9 gb, 133,978 files
>
> They both fail at the top level with the similar errors such as:
>
> fetching http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
> fetching http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocQuickRefs/DSDocBanner.htm
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=9
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> Error parsing: http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js:
> UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type text/javascript
> Attempting to finish item from unknown queue:
> org.apache.nutch.fetcher.fetcher$fetchi...@1532fc
> fetch of http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js failed
> with: java.lang.ArrayIndexOutOfBoundsException: -56
> -finishing thread FetcherThread, activeThreads=2
>
> I tried several property settings to mimic the previous work around and
> could
> not solve it. Any suggestions?
>
> So, I'm not sure how to categorize the issues more accurately. I have
> many javadoc sets and lots of simple HTML that will now parse, but I
> have other examples such as the two mentioned above that won't parse and
> therefore can't be crawled. It seems to me to be systematic rather than
> exceptional. I cannot believe that I'm the only one who will experience
> these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks
> for asking.
>
> -m.
>
> On Mon, 2010-05-03 at 09:24 -0700, Mattmann, Chris A (388J) wrote:
> > Hi Matthew,
> >
> > Awesome! Glad it worked. Now my next question: how often are you seeing
> > that parse-tika doesn't work on HTML files? Is it all HTML that you are
> > trying to process? Or just some of them? Or particular ones (categories
> > of them). The reason I ask is that I'm trying to determine whether I
> > should commit the update below to 1.1 so it goes out with the 1.1 RC and
> > if it's a systematic thing versus an exception.
> >
> > Let me know and thanks!
> >
> > Cheers,
> > Chris
> >
> >
> > On 5/3/10 9:04 AM, "matthew a. grisius" <mgris...@comcast.net> wrote:
> >
> > > Hi Chris,
> > >
> > > Yes, that worked. I caught up on email and noticed that Arpit also
> > > mentioned the same thing. Sorry I missed it, thanks to both of you!
> > >
> > > -m.
> > >
> > > On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
> > >> Hi Matthew,
> > >>
> > >>>> Hi Matthew,
> > >>>>
> > >>>> There is an open issue with Tika (e.g.
> > >>>> https://issues.apache.org/jira/browse/TIKA-379) that could explain
> > >>>> the differences between parse-html and parse-tika. Note that you can
> > >>>> specify: *parse-(html|pdf)* in order to get both HTML and PDF files.
> > >>>
> > >>> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
> > >>> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
> > >>> PDFs, but has problems with some html.
> > >>> Nutch 1.1 includes more current
> > >>> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
> > >>
> > >> Interesting: well one solution comes to mind. Can you test this out?
> > >>
> > >> * uncomment the lines:
> > >>
> > >> <mimeType name="text/html">
> > >>   <plugin id="parse-html" />
> > >> </mimeType>
> > >>
> > >> In conf/parse-plugins.xml.
> > >>
> > >> * try your crawl again.
> > >>
> > >>>
> > >>> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
> > >>> with the attached file
> > >>
> > >> Thanks! Let me know what happens after you uncomment the line above.
> > >>
> > >> Cheers,
> > >> Chris
> > >>
> > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >> Chris Mattmann, Ph.D.
> > >> Senior Computer Scientist
> > >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > >> Office: 171-266B, Mailstop: 171-246
> > >> Email: chris.mattm...@jpl.nasa.gov
> > >> WWW: http://sunset.usc.edu/~mattmann/
> > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >> Adjunct Assistant Professor, Computer Science Department
> > >> University of Southern California, Los Angeles, CA 90089 USA
> > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >>
> > >
> >
>

--
DigitalPebble Ltd
http://www.digitalpebble.com