Hi Matthew,

Awesome! Glad it worked. Now my next question: how often are you seeing
that parse-tika doesn't work on HTML files? Is it all of the HTML that you
are trying to process? Or just some of it? Or particular files (categories
of them)? The reason I ask is that I'm trying to determine whether I should
commit the update below to 1.1 so it goes out with the 1.1 RC, and whether
it's a systematic thing versus an exception.

Let me know and thanks!

Cheers,
Chris


On 5/3/10 9:04 AM, "matthew a. grisius" <mgris...@comcast.net> wrote:

> Hi Chris,
> 
> Yes, that worked. I caught up on email and noticed that Arpit also
> mentioned the same thing. Sorry I missed it, thanks to both of you!
> 
> -m.
> 
> On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
>> Hi Matthew,
>> 
>>>> Hi Matthew,
>>>> 
>>>> There is an open issue with Tika (e.g.
>>>> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
>>>> differences between parse-html and parse-tika. Note that you can specify
>>>> *parse-(html|pdf)* in order to get both HTML and PDF files.
>>> 
>>> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
>>> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
>>> PDFs, but has problems with some html. Nutch 1.1 includes more current
>>> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
>> 
>> Interesting: well one solution comes to mind. Can you test this out?
>> 
>> * uncomment the lines:
>> 
>>         <mimeType name="text/html">
>>                 <plugin id="parse-html" />
>>         </mimeType>
>> 
>> In conf/parse-plugins.xml.
>> 
>> * try your crawl again.
>> 
>>> 
>>> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
>>> with the attached file
>> 
>> Thanks! Let me know what happens after you uncomment the lines above.
>> 
>> Cheers,
>> Chris
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: chris.mattm...@jpl.nasa.gov
>> WWW:   http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
>> 
> 
> 

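For anyone who wants to confirm that the quoted change took effect, here is a rough sketch (not part of Nutch) that lists which plugins are actively mapped to a MIME type in a parse-plugins.xml-style document. The sample XML, the wildcard mapping in it, and the helper name plugins_for are assumptions made up for illustration; only the text/html block comes from the thread. It relies on the fact that Python's default XML parser skips comments, so a commented-out mapping does not show up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down stand-in for conf/parse-plugins.xml.
# Only the text/html block is taken from the thread; the rest is assumed.
SAMPLE = """<parse-plugins>
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>
  <!--
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  -->
</parse-plugins>"""


def plugins_for(xml_text, mime_type):
    """Return plugin ids actively mapped to mime_type (comments are ignored)."""
    root = ET.fromstring(xml_text)
    ids = []
    for mapping in root.iter("mimeType"):
        if mapping.get("name") == mime_type:
            ids.extend(p.get("id") for p in mapping.findall("plugin"))
    return ids


# Simulate "uncommenting the lines" by dropping the comment markers.
UNCOMMENTED = SAMPLE.replace("<!--", "").replace("-->", "")

print(plugins_for(SAMPLE, "text/html"))       # mapping still commented out
print(plugins_for(UNCOMMENTED, "text/html"))  # mapping active
```

To check a real install, the same function could be fed the contents of conf/parse-plugins.xml read from disk instead of SAMPLE.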


