Re: Nutch-Selenium Plugin Truncates Binary Data

Mohammad Al-Mohsin Mon, 23 Feb 2015 07:37:30 -0800

Sure, I've just uploaded the updated patch.

On Sun, Feb 22, 2015 at 4:50 PM, Mattmann, Chris A (3980) <[email protected]> wrote:

> I think this is fantastic Mohammad!
>
> Can you update the patch on NUTCH-1933 with this improvement,
> so we can get it into the sources?
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Mohammad Al-Mohsin <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Saturday, February 21, 2015 at 6:03 AM
> To: "[email protected]" <[email protected]>
> Cc: Mohammad Al-Mohsin <[email protected]>
> Subject: Nutch-Selenium Plugin Truncates Binary Data
>
> >I am using the nutch-selenium plugin
> ><https://github.com/momer/nutch-selenium>, and I also have Tesseract
> ><https://wiki.apache.org/tika/TikaOCR> installed for parsing text off
> >images.
> >
> >
> >While crawling with Nutch and Selenium, I noticed that binary content
> >(e.g. images, PDFs) is always truncated, so parsing is skipped or fails.
> >Here is a sample of the log:
> >Content of size 800750 was truncated to 368. Content is truncated, parse
> >may fail!
> >
> >When I turn selenium off, parsing works fine and the content is not
> >truncated.
> >
> >
> >I found that nutch-selenium returns the HTML body of whatever Firefox
> >displays. So even when you are fetching an image, Selenium gives you
> >the HTML img tag that Firefox wraps it in instead of the image itself,
> >e.g. <img src='xyz.png' height="400" width="600">
> >
> >
> >To get around this, I modified the Selenium plugin to handle the fetch
> >only if the Content-Type header starts with 'text', i.e. to catch
> >'text/html'. Otherwise, if the content is not textual, it just returns
> >the content the same way protocol-httpclient does.
> >
> >
> >Now binary data is parsed properly, and Selenium still handles
> >rendering of pages that use JavaScript.
> >
> >
> >Is this the proper way to tackle this? What do you think?
> >
> >
> >
> >
> >Best regards,
> >Mohammad Al-Mohsin
> >
> >
>
>

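For readers following the thread, the workaround described in the original
message (let Selenium handle the fetch only when the Content-Type header
starts with 'text', otherwise return the raw content) could be sketched
roughly as below. This is only an illustration under stated assumptions, not
the code from the NUTCH-1933 patch: the class ContentTypeGate and the methods
isTextual and chooseContent are hypothetical names, the rendered page is
passed in as a plain Supplier rather than a real Selenium call, and treating
a missing Content-Type header as non-textual is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.function.Supplier;

// Hypothetical helper, not part of Nutch or nutch-selenium.
final class ContentTypeGate {
    private ContentTypeGate() {}

    // True when the Content-Type header value starts with "text"
    // (e.g. "text/html; charset=UTF-8"). A missing header is treated
    // as non-textual, which is an assumption of this sketch.
    static boolean isTextual(String contentType) {
        return contentType != null
            && contentType.trim().toLowerCase(Locale.ROOT).startsWith("text");
    }

    // Picks the body to hand to the parser: the browser-rendered HTML for
    // textual responses, or the untouched raw bytes for everything else.
    // The supplier is only invoked for textual content, so binary fetches
    // never go through the browser.
    static byte[] chooseContent(String contentType,
                                byte[] rawBytes,
                                Supplier<String> renderedHtml) {
        if (isTextual(contentType)) {
            return renderedHtml.get().getBytes(StandardCharsets.UTF_8);
        }
        return rawBytes;
    }
}
```

With a check like this, a response labelled image/png would keep its original
bytes, while a text/html response would be replaced by whatever the browser
rendered.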