Re: nutch crawl issue

matthew a. grisius Wed, 05 May 2010 22:02:07 -0700

Hi Chris,

Running 'mvn install package' produced this error for each
target/maven-shared-archive-resources/... file.


...
[INFO] [bundle:bundle {execution: default-bundle}]
[ERROR] Error building bundle
org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not
exist: target/maven-shared-archive-resources/META-INF/NOTICE~
[ERROR] Error building bundle
org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not
exist: target/maven-shared-archive-resources/META-INF/DEPENDENCIES~
[ERROR] Error building bundle
org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not
exist: target/maven-shared-archive-resources/META-INF/LICENSE~
[ERROR] Error(s) found in bundle configuration
[INFO]
------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO]
------------------------------------------------------------------------
[INFO] Error(s) found in bundle configuration

[INFO]
------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO]
------------------------------------------------------------------------
[INFO] Total time: 1 minute 24 seconds
[INFO] Finished at: Wed May 05 23:38:56 EDT 2010
[INFO] Final Memory: 40M/271M
[INFO]
------------------------------------------------------------------------

Assuming this was the right thing to do, I renamed each file to match
the missing filename, e.g. renamed "DEPENDENCIES" to "DEPENDENCIES~" (and
likewise NOTICE and LICENSE) in each 'target' directory, then re-ran the
build to generate the new jars. That produced this:

[INFO]
------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
------------------------------------------------------------------------
[INFO] Apache Tika parent .................................... SUCCESS
[2.261s]
[INFO] Apache Tika core ...................................... SUCCESS
[14.429s]
[INFO] Apache Tika parsers ................................... SUCCESS
[32.370s]
[INFO] Apache Tika application ............................... SUCCESS
[34.179s]
[INFO] Apache Tika OSGi bundle ............................... SUCCESS
[16.081s]
[INFO] Apache Tika ........................................... SUCCESS
[0.237s]
[INFO]
------------------------------------------------------------------------
[INFO]
------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO]
------------------------------------------------------------------------
[INFO] Total time: 1 minute 41 seconds
[INFO] Finished at: Wed May 05 23:43:56 EDT 2010
[INFO] Final Memory: 37M/278M
[INFO]
------------------------------------------------------------------------
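
For anyone repeating the workaround above, here is a minimal sketch of it,
not a fix for whatever makes the bundle step look for the "~" names. The
function name add_tilde_copies is hypothetical, and it copies the files
instead of renaming them (a deviation from what I did by hand) so the
originals stay in place.

```shell
#!/bin/sh
# Sketch only: create the "~"-suffixed files the bundle step reported missing.
# add_tilde_copies is a hypothetical helper; $1 is the root of the tika checkout.
add_tilde_copies() {
    root="$1"
    # Visit every module's generated META-INF directory under a target/ folder.
    find "$root" -type d -path '*/target/maven-shared-archive-resources/META-INF' |
    while IFS= read -r dir; do
        for name in NOTICE DEPENDENCIES LICENSE; do
            # Copy (not rename) so the original file is still there afterwards.
            if [ -f "$dir/$name" ] && [ ! -e "$dir/$name~" ]; then
                cp "$dir/$name" "$dir/$name~"
            fi
        done
    done
}
```

Usage would be something like `add_tilde_copies .` from the top of the
checkout, followed by re-running `mvn install package`.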

In plugin/parse-tika I replaced parse-tika.jar and tika-parsers-0.7.jar
with tika-core-0.8-SNAPSHOT.jar and tika-parsers-0.8-SNAPSHOT.jar.

In lib/ I replaced tika-core-0.7.jar with tika-core-0.8-SNAPSHOT.jar.
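
The jar swap can be sketched roughly as below. This is an illustration
under assumptions, not exactly what I did: swap_tika_jars is a hypothetical
helper, the plugin and lib directories are passed in rather than assumed,
the old 0.7 jars are moved aside with a .bak suffix instead of being
deleted, and parse-tika.jar is left untouched.

```shell
#!/bin/sh
# Sketch only: drop freshly built Tika 0.8-SNAPSHOT jars into a Nutch install.
# swap_tika_jars is a hypothetical helper.
#   $1 = tika checkout, $2 = parse-tika plugin directory, $3 = Nutch lib directory
swap_tika_jars() {
    tika_dir="$1"; plugin_dir="$2"; lib_dir="$3"
    core="$tika_dir/tika-core/target/tika-core-0.8-SNAPSHOT.jar"
    parsers="$tika_dir/tika-parsers/target/tika-parsers-0.8-SNAPSHOT.jar"

    # Refuse to touch the Nutch install if the build output is not there.
    if [ ! -f "$core" ] || [ ! -f "$parsers" ]; then
        echo "built Tika jars not found under $tika_dir" >&2
        return 1
    fi

    # Move the 0.7 jars aside rather than deleting them, so the swap can be undone.
    for old in "$lib_dir/tika-core-0.7.jar" "$plugin_dir/tika-parsers-0.7.jar"; do
        if [ -f "$old" ]; then
            mv "$old" "$old.bak"
        fi
    done

    # tika-core goes into both places; tika-parsers only into the plugin directory.
    cp "$core" "$lib_dir/" &&
    cp "$core" "$parsers" "$plugin_dir/"
}
```

For example: `swap_tika_jars ./tika plugin/parse-tika lib`, run from the
Nutch directory, with ./tika being wherever the checkout lives.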

I ran bin/nutch crawl and it completed w/o error. All of the javascript
was fetched and appeared to be parsed w/o error. However (and I'm not
sure of the correct terminology to use), no more URLs were generated to
fetch than before. So the patch appears to be a step in the right
direction. The problem is with the FRAMESET/FRAME and how javascript is
used to generate content in the FRAMES.

As Julien suggested, I will dig deeper into the archive and look at the
legacy parse-js. I suppose as a really ugly 'brute force' workaround I
could walk the directory tree w/ a perl script and generate a 'seed
list' of HTML URLs to fetch. Ugh. If you have any more ideas please let
me know how I can help. Thanks.
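
The brute-force seed list could look something like this in shell rather
than perl. It is only a sketch: make_seed_list is a hypothetical helper, it
assumes the document root on disk maps one-to-one onto a base URL, and it
does not URL-encode spaces or other special characters in file names.

```shell
#!/bin/sh
# Sketch only: print one URL per .htm/.html file found under a document root.
# make_seed_list is a hypothetical helper.
#   $1 = directory that the web server serves, $2 = URL that directory maps to
make_seed_list() {
    doc_root="$1"
    base_url="${2%/}"    # drop one trailing slash so URLs are joined with exactly one
    (
        cd "$doc_root" || exit 1
        find . -type f \( -name '*.htm' -o -name '*.html' \) |
        LC_ALL=C sort |
        while IFS= read -r path; do
            # Strip the leading "./" that find prints and prepend the base URL.
            printf '%s/%s\n' "$base_url" "${path#./}"
        done
    )
}
```

The output would be redirected into a file inside the urls directory that
bin/nutch crawl is pointed at, e.g.
`make_seed_list /path/to/docs http://192.168.1.101:8080/technical > urls/seed.txt`
(the /path/to/docs location and the seed.txt name are made up here).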

-m.

On Tue, 2010-05-04 at 21:50 -0700, Mattmann, Chris A (388J) wrote:
> Hi Matthew,
> 
> I think Julien may have a fix for this in TIKA-379 [1]. I’ll take a look at
> Julien’s patch and see if there is a way to get it committed sooner rather
> than later.
> 
> One way to help me do that: since you already have an environment and set
> of use cases where this is reproducible, can you apply TIKA-379 to a local
> checkout of tika trunk (I’ll show you how) and then let me know if that
> fixes parse-tika for you?
> 
> Here are the steps:
> 
> svn co http://svn.apache.org/repos/asf/lucene/tika/trunk ./tika
> cd tika
> wget "http://bit.ly/bXeLkf" (if you don't have SSL support, then manually
> download the linked file)
> patch -p0 < TIKA-379-3.patch
> mvn install package
> 
> Then grab tika-parsers and tika-core out of the respective tika-core/target
> and tika-parsers/target directories and drop those jars in your
> parse-tika/lib folder, replacing their originals. Then, try your nutch crawl
> again.
> 
> See if that works. In the meanwhile, I'll inspect Julien's patch.
> 
> Thanks!
> 
> Cheers,
> Chris
> 
> On 5/4/10 9:02 PM, "matthew a. grisius" <[email protected]> wrote:
> 
> > Hi Chris,
> > 
> > It appears to me that parse-tika has trouble with HTML FRAMESETS/FRAMES
> > and/or javascript. Using the parse-html suggested work around I am able
> > to process my simple test cases such as javadoc which does include
> > simple embedded javascript (of course I can't verify that it is actually
> > parsing it though). I expanded my testing to include two more complex
> > examples that heavily use HTML FRAMESET/FRAME and more complex
> > javascript:
> > 
> > 134 mb, 11,269 files
> > 1.9 gb, 133,978 files
> > 
> > They both fail at the top level with similar errors such as:
> > 
> > fetching http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
> > fetching http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocQuickRefs/DSDocBanner.htm
> > -finishing thread FetcherThread, activeThreads=8
> > -finishing thread FetcherThread, activeThreads=7
> > -finishing thread FetcherThread, activeThreads=9
> > -finishing thread FetcherThread, activeThreads=6
> > -finishing thread FetcherThread, activeThreads=5
> > -finishing thread FetcherThread, activeThreads=4
> > -finishing thread FetcherThread, activeThreads=3
> > Error parsing: http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js: UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type text/javascript
> > Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@1532fc
> > fetch of http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js failed with: java.lang.ArrayIndexOutOfBoundsException: -56
> > -finishing thread FetcherThread, activeThreads=2
> > 
> > I tried several property settings to mimic the previous work around and
> > could not solve it. Any suggestions?
> > 
> > So, I'm not sure how to categorize the issues more accurately. I have
> > many javadoc sets and lots of simple HTML that will now parse, but I
> > have other examples such as the two mentioned above that won't parse and
> > therefore can't be crawled. It seems to me to be systematic rather than
> > exceptional. I cannot believe that I'm the only one who will experience
> > these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks
> > for asking.
> > 
> > -m.
> > 
> > 
> > 
> > On Mon, 2010-05-03 at 09:24 -0700, Mattmann, Chris A (388J) wrote:
> >> Hi Matthew,
> >> 
> >> Awesome! Glad it worked. Now my next question: how often are you seeing
> >> that parse-tika doesn't work on HTML files? Is it all HTML that you are
> >> trying to process? Or just some of them? Or particular ones (categories of
> >> them). The reason I ask is that I'm trying to determine whether I should
> >> commit the update below to 1.1 so it goes out with the 1.1 RC and if it's a
> >> systematic thing versus an exception.
> >> 
> >> Let me know and thanks!
> >> 
> >> Cheers,
> >> Chris
> >> 
> >> 
> >> On 5/3/10 9:04 AM, "matthew a. grisius" <[email protected]> wrote:
> >> 
> >>> Hi Chris,
> >>> 
> >>> Yes, that worked. I caught up on email and noticed that Arpit also
> >>> mentioned the same thing. Sorry I missed it, thanks to both of you!
> >>> 
> >>> -m.
> >>> 
> >>> On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
> >>>> Hi Matthew,
> >>>> 
> >>>>>> Hi Matthew,
> >>>>>> 
> >>>>>> There is an open issue with Tika (e.g.
> >>>>>> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
> >>>>>> differences between parse-html and parse-tika. Note that you can
> >>>>>> specify parse-(html|pdf) in order to get both HTML and PDF files.
> >>>>> 
> >>>>> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
> >>>>> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
> >>>>> PDFs, but has problems with some html. Nutch 1.1 includes more current
> >>>>> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
> >>>> 
> >>>> Interesting: well one solution comes to mind. Can you test this out?
> >>>> 
> >>>> * uncomment the lines:
> >>>> 
> >>>>         <mimeType name="text/html">
> >>>>                 <plugin id="parse-html" />
> >>>>         </mimeType>
> >>>> 
> >>>> In conf/parse-plugins.xml.
> >>>> 
> >>>> * try your crawl again.
> >>>> 
> >>>>> 
> >>>>> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
> >>>>> with the attached file
> >>>> 
> >>>> Thanks! Let me know what happens after you uncomment the line above.
> >>>> 
> >>>> Cheers,
> >>>> Chris
> >>>> 
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Chris Mattmann, Ph.D.
> >>>> Senior Computer Scientist
> >>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>>> Office: 171-266B, Mailstop: 171-246
> >>>> Email: [email protected]
> >>>> WWW:   http://sunset.usc.edu/~mattmann/
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Adjunct Assistant Professor, Computer Science Department
> >>>> University of Southern California, Los Angeles, CA 90089 USA
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> 
> >>>> 
> >>> 
> >>> 
> >> 
> >> 
> >> 
> >> 
> > 
> > 
> 
> 
> 
>
