Re: Nutch-Selenium Plugin Truncates Binary Data

Mattmann, Chris A (3980) Mon, 23 Feb 2015 22:06:07 -0800

Thank you Mohammad!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Mohammad Al-Mohsin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, February 23, 2015 at 3:13 AM
To: "[email protected]" <[email protected]>
Cc: Mohammad Al-Mohsin <[email protected]>
Subject: Re: Nutch-Selenium Plugin Truncates Binary Data

>Sure, I've just uploaded the updated patch.
>
>On Sun, Feb 22, 2015 at 4:50 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>I think this is fantastic Mohammad!
>
>Can you update the patch on NUTCH-1933 with this improvement,
>so we can get it into the sources?
>
>Cheers,
>Chris
>
>-----Original Message-----
>From: Mohammad Al-Mohsin <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Saturday, February 21, 2015 at 6:03 AM
>To: "[email protected]" <[email protected]>
>Cc: Mohammad Al-Mohsin <[email protected]>
>Subject: Nutch-Selenium Plugin Truncates Binary Data
>
>>I am using the
>>nutch-selenium <https://github.com/momer/nutch-selenium> plugin, and I
>>also have
>>Tesseract <https://wiki.apache.org/tika/TikaOCR> installed for extracting
>>text from images.
>>
>>
>>While crawling with Nutch and selenium, I noticed that binary content (e.g.
>>images, PDFs) is always truncated, so parsing is skipped or fails. Here is
>>a sample of the log:
>>Content of size 800750 was truncated to 368. Content is truncated, parse
>>may fail!
>>
>>When I turn selenium off, parsing works fine and the content is not
>>truncated.
>>
>>
>>I found that nutch-selenium returns the HTML body of whatever Firefox
>>displays. So even when you are fetching an image, selenium just gives you
>>the HTML image tag instead of the image itself,
>>e.g. <img src="xyz.png" height="400" width="600">
>>
>>
>>To get around this, I modified the selenium plugin to handle the fetch only
>>if the Content-Type header starts with 'text', i.e. to catch 'text/html'.
>>Otherwise, if the content is not textual, it just returns the content as
>>protocol-httpclient does.
>>
>>
>>Now binary data is parsed properly, and selenium still handles
>>rendering of pages with JavaScript.
>>
>>
>>Is this the proper way to tackle this? What do you think?
>>
>>
>>
>>
>>Best regards,
>>Mohammad Al-Mohsin
>>
>>
>
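
For readers of the archive, the Content-Type check described in the original
message could be sketched roughly as below. This is only an illustration
under stated assumptions, not the code from the NUTCH-1933 patch: the class
name ContentTypeDispatch, the methods isTextual and chooseHandler, and the
handler labels "selenium" and "http" are all hypothetical, and treating a
missing header as non-textual is an assumption as well.

```java
import java.util.Locale;

// Hypothetical sketch: decide whether a response should be rendered by
// selenium or returned as raw bytes, based only on the Content-Type header.
public class ContentTypeDispatch {

    // True when the header value starts with "text" (e.g. "text/html").
    // Assumption: a missing header is treated as non-textual.
    public static boolean isTextual(String contentType) {
        if (contentType == null) {
            return false;
        }
        return contentType.trim().toLowerCase(Locale.ROOT).startsWith("text");
    }

    // Returns a label for the handler that would process the response:
    // "selenium" for textual content, "http" for everything else.
    public static String chooseHandler(String contentType) {
        return isTextual(contentType) ? "selenium" : "http";
    }

    public static void main(String[] args) {
        System.out.println(chooseHandler("text/html; charset=UTF-8"));
        System.out.println(chooseHandler("image/png"));
    }
}
```

With a check like this, images and PDFs bypass the browser and keep their
original bytes, while HTML pages still go through selenium for JavaScript
rendering.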
