Hi Matthew,

I think Julien may have a fix for this in TIKA-379 [1]. I’ll take a look at
Julien’s patch and see if there is a way to get it committed sooner rather
than later.

One way to help me do that: since you already have an environment and a set
of use cases where this is reproducible, can you apply TIKA-379 to a local
checkout of Tika trunk (I’ll show you how) and then let me know if that
fixes parse-tika for you?

Here are the steps:

svn co http://svn.apache.org/repos/asf/lucene/tika/trunk ./tika
cd tika
wget "http://bit.ly/bXeLkf" (if you don't have SSL support, then manually
download the linked file)
patch -p0 < TIKA-379-3.patch
mvn install package

Then grab the tika-core and tika-parsers jars out of the respective
tika-core/target and tika-parsers/target directories and drop those jars in
your parse-tika/lib folder, replacing their originals. Then try your Nutch
crawl again.
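
If it helps, the jar swap described above could be scripted roughly as follows. This is only a sketch under assumptions: the function name replace_tika_jars and both directory arguments are hypothetical, and the real jar file names depend on the version your build produces. It renames the original jars to *.bak rather than deleting them, so you can roll back.

```shell
# Sketch only: copy freshly built Tika jars into a parse-tika lib folder.
# Both arguments are hypothetical paths; adjust them to your own layout.
replace_tika_jars() {
    tika_dir=$1     # root of the patched Tika checkout, e.g. ./tika
    plugin_lib=$2   # your parse-tika/lib folder

    for module in tika-core tika-parsers; do
        # Find the main jar the build produced, skipping sources/tests/javadoc jars.
        new_jar=$(ls "$tika_dir/$module/target/$module"-*.jar 2>/dev/null \
            | grep -v -e '-sources\.jar$' -e '-tests\.jar$' -e '-javadoc\.jar$' \
            | head -n 1)
        if [ -z "$new_jar" ]; then
            echo "no $module jar under $tika_dir/$module/target" >&2
            return 1
        fi
        # Rename the originals to *.bak so they can be restored later.
        for old_jar in "$plugin_lib/$module"-*.jar; do
            if [ -e "$old_jar" ]; then
                mv "$old_jar" "$old_jar.bak"
            fi
        done
        cp "$new_jar" "$plugin_lib/" || return 1
    done
}
```

You would call it as "replace_tika_jars ./tika /path/to/parse-tika/lib", where the second argument is a placeholder for wherever that folder lives in your Nutch install.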

See if that works. In the meantime, I'll inspect Julien's patch.

[1] https://issues.apache.org/jira/browse/TIKA-379



On 5/4/10 9:02 PM, "matthew a. grisius" <mgris...@comcast.net> wrote:

> Hi Chris,
> It appears to me that parse-tika has trouble with HTML FRAMESETS/FRAMES
> and/or javascript. Using the suggested parse-html workaround, I am able
> to process my simple test cases such as javadoc which does include
> simple embedded javascript (of course I can't verify that it is actually
> parsing it though). I expanded my testing to include two more complex
> examples that heavily use HTML FRAMESET/FRAME and more complex
> javascript:
> 134 mb, 11,269 files
> 1.9 gb, 133,978 files
> They both fail at the top level with similar errors such as:
> fetching
> ocCommon.js
> fetching
> cBanner.htm
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=9
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> Error parsing:
> ocCommon.js: UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type
> text/javascript
> Attempting to finish item from unknown queue:
> org.apache.nutch.fetcher.fetcher$fetchi...@1532fc
> fetch of
> ocCommon.js failed with: java.lang.ArrayIndexOutOfBoundsException: -56
> -finishing thread FetcherThread, activeThreads=2
> I tried several property settings to mimic the previous workaround and
> could not solve it. Any suggestions?
> So, I'm not sure how to categorize the issues more accurately. I have
> many javadoc sets and lots of simple HTML that will now parse, but I
> have other examples such as the two mentioned above that won't parse and
> therefore can't be crawled. It seems to me to be systematic rather than
> exceptional. I cannot believe that I'm the only one who will experience
> these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks
> for asking.
> -m.
> On Mon, 2010-05-03 at 09:24 -0700, Mattmann, Chris A (388J) wrote:
>> Hi Matthew,
>> Awesome! Glad it worked. Now my next question: how often are you seeing
>> that parse-tika doesn't work on HTML files? Is it all HTML that you are
>> trying to process? Or just some of them? Or particular ones (categories of
>> them). The reason I ask is that I'm trying to determine whether I should
>> commit the update below to 1.1 so it goes out with the 1.1 RC and if it's a
>> systematic thing versus an exception.
>> Let me know and thanks!
>> Cheers,
>> Chris
>> On 5/3/10 9:04 AM, "matthew a. grisius" <mgris...@comcast.net> wrote:
>>> Hi Chris,
>>> Yes, that worked. I caught up on email and noticed that Arpit also
>>> mentioned the same thing. Sorry I missed it, thanks to both of you!
>>> -m.
>>> On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
>>>> Hi Matthew,
>>>>>> Hi Matthew,
>>>>>> There is an open issue with Tika (e.g.
>>>>>> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
>>>>>> differences between parse-html and parse-tika. Note that you can specify:
>>>>>> *parse-(html|pdf)* in order to get both HTML and PDF files.
>>>>> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
>>>>> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
>>>>> PDFs, but has problems with some html. Nutch 1.1 includes more current
>>>>> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
>>>> Interesting: well one solution comes to mind. Can you test this out?
>>>> * uncomment the lines:
>>>>         <mimeType name="text/html">
>>>>                 <plugin id="parse-html" />
>>>>         </mimeType>
>>>> In conf/parse-plugins.xml.
>>>> * try your crawl again.
>>>>> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
>>>>> with the attached file
>>>> Thanks! Let me know what happens after you uncomment the line above.
>>>> Cheers,
>>>> Chris
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Senior Computer Scientist
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 171-266B, Mailstop: 171-246
>>>> Email: chris.mattm...@jpl.nasa.gov
>>>> WWW:   http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Assistant Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
