Hi Chris,

The 'maven install package' produced this for each
target/maven-shared-archive-resources/... file.

...
[INFO] [bundle:bundle {execution: default-bundle}]
[ERROR] Error building bundle org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not exist: target/maven-shared-archive-resources/META-INF/NOTICE~
[ERROR] Error building bundle org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not exist: target/maven-shared-archive-resources/META-INF/DEPENDENCIES~
[ERROR] Error building bundle org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not exist: target/maven-shared-archive-resources/META-INF/LICENSE~
[ERROR] Error(s) found in bundle configuration
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Error(s) found in bundle configuration
[INFO] ------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1 minute 24 seconds
[INFO] Finished at: Wed May 05 23:38:56 EDT 2010
[INFO] Final Memory: 40M/271M
[INFO] ------------------------------------------------------------------------

Assuming this was the right thing to do, I renamed each file to match the
missing filename, e.g. renamed "DEPENDENCIES" to "DEPENDENCIES~" (and
likewise NOTICE and LICENSE) in each 'target', then re-ran the build to
generate the new jars, which produced this:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] ------------------------------------------------------------------------
[INFO] Apache Tika parent .................................... SUCCESS [2.261s]
[INFO] Apache Tika core ...................................... SUCCESS [14.429s]
[INFO] Apache Tika parsers ................................... SUCCESS [32.370s]
[INFO] Apache Tika application ............................... SUCCESS [34.179s]
[INFO] Apache Tika OSGi bundle ............................... SUCCESS [16.081s]
[INFO] Apache Tika ........................................... SUCCESS [0.237s]
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1 minute 41 seconds
[INFO] Finished at: Wed May 05 23:43:56 EDT 2010
[INFO] Final Memory: 37M/278M
[INFO] ------------------------------------------------------------------------

In plugin/parse-tika I replaced parse-tika.jar and tika-parsers-0.7.jar with
tika-core-0.8-SNAPSHOT.jar and tika-parsers-0.8-SNAPSHOT.jar.
In lib/ I replaced tika-core-0.7.jar with tika-core-0.8-SNAPSHOT.jar.

I ran bin/nutch crawl and it completed w/o error. All of the javascript was
fetched and appeared to be parsed w/o error. However, and I'm not sure of the
correct terminology to use, no more URLs were generated to fetch than
before. So the patch appears to be a step in the right direction. The
problem is with the FRAMESET/FRAME and how javascript is used to generate
content in the FRAMES.

As Julien suggested, I will read the archive deeper and look at the legacy
parse-js. I suppose as a really ugly 'brute force' work around I could walk
the directory tree w/ a perl script and generate a 'seed list' of html URLs
to fetch. Ugh.

If you have any more ideas please let me know how I can help. Thanks.

-m.

On Tue, 2010-05-04 at 21:50 -0700, Mattmann, Chris A (388J) wrote:
> Hi Matthew,
>
> I think Julien may have a fix for this in TIKA-379 [1]. I’ll take a look at
> Julien’s patch and see if there is a way to get it committed sooner rather
> than later.
>
> One way to help me do that: since you already have an environment and set
> of use cases where this is reproducible, can you apply TIKA-379 to a local
> checkout of tika trunk (I’ll show you how) and then let me know if that
> fixes parse-tika for you?
>
> Here are the steps:
>
> svn co http://svn.apache.org/repos/asf/lucene/tika/trunk ./tika
> cd tika
> wget "http://bit.ly/bXeLkf" (if you don't have SSL support, then manually
> download the linked file)
> patch -p0 < TIKA-379-3.patch
> mvn install package
>
> Then grab tika-parsers and tika-core out of the respective tika-core/target
> and tika-parsers/target directories and drop those jars in your
> parse-tika/lib folder, replacing their originals. Then, try your nutch crawl
> again.
>
> See if that works. In the meanwhile, I'll inspect Julien's patch.
>
> Thanks!
>
> Cheers,
> Chris
>
> On 5/4/10 9:02 PM, "matthew a. grisius" <mgris...@comcast.net> wrote:
>
> > Hi Chris,
> >
> > It appears to me that parse-tika has trouble with HTML FRAMESETS/FRAMES
> > and/or javascript. Using the parse-html suggested work around I am able
> > to process my simple test cases such as javadoc which does include
> > simple embedded javascript (of course I can't verify that it is actually
> > parsing it though). I expanded my testing to include two more complex
> > examples that heavily use HTML FRAMESET/FRAME and more complex
> > javascript:
> >
> > 134 mb, 11,269 files
> > 1.9 gb, 133,978 files
> >
> > They both fail at the top level with similar errors such as:
> >
> > fetching http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
> > fetching http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocQuickRefs/DSDocBanner.htm
> > -finishing thread FetcherThread, activeThreads=8
> > -finishing thread FetcherThread, activeThreads=7
> > -finishing thread FetcherThread, activeThreads=9
> > -finishing thread FetcherThread, activeThreads=6
> > -finishing thread FetcherThread, activeThreads=5
> > -finishing thread FetcherThread, activeThreads=4
> > -finishing thread FetcherThread, activeThreads=3
> > Error parsing: http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js: UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type text/javascript
> > Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@1532fc
> > fetch of http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js failed with: java.lang.ArrayIndexOutOfBoundsException: -56
> > -finishing thread FetcherThread, activeThreads=2
> >
> > I tried several property settings to mimic the previous work around and
> > could not solve it. Any suggestions?
> >
> > So, I'm not sure how to categorize the issues more accurately. I have
> > many javadoc sets and lots of simple HTML that will now parse, but I
> > have other examples such as the two mentioned above that won't parse and
> > therefore can't be crawled. It seems to me to be systematic rather than
> > exceptional. I cannot believe that I'm the only one who will experience
> > these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks
> > for asking.
> >
> > -m.
> >
> > On Mon, 2010-05-03 at 09:24 -0700, Mattmann, Chris A (388J) wrote:
> >> Hi Matthew,
> >>
> >> Awesome! Glad it worked. Now my next question: how often are you seeing
> >> that parse-tika doesn't work on HTML files? Is it all HTML that you are
> >> trying to process? Or just some of them? Or particular ones (categories of
> >> them)? The reason I ask is that I'm trying to determine whether I should
> >> commit the update below to 1.1 so it goes out with the 1.1 RC and if it's a
> >> systematic thing versus an exception.
> >>
> >> Let me know and thanks!
> >>
> >> Cheers,
> >> Chris
> >>
> >>
> >> On 5/3/10 9:04 AM, "matthew a. grisius" <mgris...@comcast.net> wrote:
> >>
> >>> Hi Chris,
> >>>
> >>> Yes, that worked. I caught up on email and noticed that Arpit also
> >>> mentioned the same thing. Sorry I missed it, thanks to both of you!
> >>>
> >>> -m.
> >>>
> >>> On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
> >>>> Hi Matthew,
> >>>>
> >>>>>> Hi Matthew,
> >>>>>>
> >>>>>> There is an open issue with Tika (e.g.
> >>>>>> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
> >>>>>> differences between parse-html and parse-tika. Note that you can
> >>>>>> specify: *parse-(html|pdf)* in order to get both HTML and PDF files.
> >>>>>
> >>>>> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
> >>>>> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
> >>>>> PDFs, but has problems with some html. Nutch 1.1 includes more current
> >>>>> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
> >>>>
> >>>> Interesting: well one solution comes to mind. Can you test this out?
> >>>>
> >>>> * uncomment the lines:
> >>>>
> >>>> <mimeType name="text/html">
> >>>> <plugin id="parse-html" />
> >>>> </mimeType>
> >>>>
> >>>> In conf/parse-plugins.xml.
> >>>>
> >>>> * try your crawl again.
> >>>>
> >>>>>
> >>>>> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
> >>>>> with the attached file
> >>>>
> >>>> Thanks! Let me know what happens after you uncomment the line above.
> >>>>
> >>>> Cheers,
> >>>> Chris
> >>>>
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Chris Mattmann, Ph.D.
> >>>> Senior Computer Scientist
> >>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>>> Office: 171-266B, Mailstop: 171-246
> >>>> Email: chris.mattm...@jpl.nasa.gov
> >>>> WWW: http://sunset.usc.edu/~mattmann/
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> Adjunct Assistant Professor, Computer Science Department
> >>>> University of Southern California, Los Angeles, CA 90089 USA
> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
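
For reference, a minimal sketch of the 'brute force' work around mentioned in
the newest message of this thread (walk the directory tree and generate a seed
list of html URLs). It is written in Python rather than perl, and everything
in it is an assumption for illustration: the function names, the doc_root and
base_url parameters, and the .htm/.html extension filter do not come from the
thread, and it assumes the web server maps the directory tree onto URLs
one-to-one.

```python
import os
from urllib.parse import quote


def build_seed_list(doc_root, base_url, extensions=(".htm", ".html")):
    """Return sorted URLs for every HTML file found under doc_root.

    doc_root, base_url and extensions are hypothetical parameters; adjust
    them to match how the web server exposes the directory tree.
    """
    base = base_url.rstrip("/")
    urls = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            # Case-insensitive match so that .HTM and .Html are picked up too.
            if not name.lower().endswith(extensions):
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), doc_root)
            # URLs use forward slashes regardless of the OS path separator;
            # each path segment is percent-encoded separately.
            rel_url = "/".join(quote(part) for part in rel.split(os.sep))
            urls.append(f"{base}/{rel_url}")
    return sorted(urls)


def write_seed_file(urls, seed_path):
    """Write one URL per line to seed_path."""
    with open(seed_path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
```

The resulting file would presumably go into the seed URL directory handed to
bin/nutch crawl. This only sidesteps link discovery; it does not address the
FRAMESET/FRAME parsing problem itself.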