Re: nutch crawl issue

matthew a. grisius Tue, 04 May 2010 21:08:39 -0700

Hi Chris,

It appears to me that parse-tika has trouble with HTML FRAMESET/FRAME
elements and/or JavaScript. Using the suggested parse-html workaround, I am
able to process my simple test cases such as javadoc, which does include
simple embedded JavaScript (of course, I can't verify that it is actually
being parsed). I expanded my testing to include two more complex
examples that make heavy use of HTML FRAMESET/FRAME and more complex
JavaScript:


134 MB, 11,269 files
1.9 GB, 133,978 files

They both fail at the top level with similar errors, such as:

fetching
http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
fetching
http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocQuickRefs/DSDocBanner.htm
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
Error parsing:
http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js:
 UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type text/javascript
Attempting to finish item from unknown queue:
org.apache.nutch.fetcher.fetcher$fetchi...@1532fc
fetch of
http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
 failed with: java.lang.ArrayIndexOutOfBoundsException: -56
-finishing thread FetcherThread, activeThreads=2

I tried several property settings to mimic the previous workaround but
could not solve it. Any suggestions?
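
For reference, here is a minimal, untested sketch of the kind of mapping
that would mimic the previous workaround for this mime type in
conf/parse-plugins.xml. It assumes that the parse-js plugin ships with this
Nutch build, that an alias for parse-js is present in the aliases section of
the same file, and that the plugin will accept text/javascript. None of
these assumptions is confirmed in this thread.

```xml
<!-- conf/parse-plugins.xml (fragment, placed inside <parse-plugins>) -->
<!-- Assumption: route text/javascript to parse-js instead of parse-tika -->
<mimeType name="text/javascript">
        <plugin id="parse-js" />
</mimeType>
```

A mapping like this would only take effect if parse-js is also matched by
the plugin.includes property in conf/nutch-site.xml, for example through a
parse-(html|js|tika) alternative within that property's value.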

So, I'm not sure how to categorize the issues more accurately. I have
many javadoc sets and lots of simple HTML that will now parse, but I
have other examples such as the two mentioned above that won't parse and
therefore can't be crawled. It seems to me to be systematic rather than
exceptional. I cannot believe that I'm the only one who will experience
these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks
for asking.

-m.



On Mon, 2010-05-03 at 09:24 -0700, Mattmann, Chris A (388J) wrote:
> Hi Matthew,
> 
> Awesome! Glad it worked. Now my next question: how often are you seeing
> that parse-tika doesn't work on HTML files? Is it all HTML that you are
> trying to process? Or just some of them? Or particular ones (categories of
> them). The reason I ask is that I'm trying to determine whether I should
> commit the update below to 1.1 so it goes out with the 1.1 RC and if it's a
> systematic thing versus an exception.
> 
> Let me know and thanks!
> 
> Cheers,
> Chris
> 
> 
> On 5/3/10 9:04 AM, "matthew a. grisius" <mgris...@comcast.net> wrote:
> 
> > Hi Chris,
> > 
> > Yes, that worked. I caught up on email and noticed that Arpit also
> > mentioned the same thing. Sorry I missed it, thanks to both of you!
> > 
> > -m.
> > 
> > On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
> >> Hi Matthew,
> >> 
> >>>> Hi Matthew,
> >>>> 
> >>>> There is an open issue with Tika (e.g.
> >>>> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
> >>>> differences between parse-html and parse-tika. Note that you can specify:
> >>>> *parse-(html|pdf) *in order to get both HTML and PDF files.
> >>> 
> >>> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
> >>> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
> >>> PDFs, but has problems with some html. Nutch 1.1 includes more current
> >>> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
> >> 
> >> Interesting: well one solution comes to mind. Can you test this out?
> >> 
> >> * uncomment the lines:
> >> 
> >>         <mimeType name="text/html">
> >>                 <plugin id="parse-html" />
> >>         </mimeType>
> >> 
> >> In conf/parse-plugins.xml.
> >> 
> >> * try your crawl again.
> >> 
> >>> 
> >>> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
> >>> with the attached file
> >> 
> >> Thanks! Let me know what happens after you uncomment the lines above.
> >> 
> >> Cheers,
> >> Chris
> >> 
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Senior Computer Scientist
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 171-266B, Mailstop: 171-246
> >> Email: chris.mattm...@jpl.nasa.gov
> >> WWW:   http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Assistant Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> 
> >> 
> > 
> > 
> 
> 
> 
>
