Thank you Mohammad!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Mohammad Al-Mohsin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, February 23, 2015 at 3:13 AM
To: "[email protected]" <[email protected]>
Cc: Mohammad Al-Mohsin <[email protected]>
Subject: Re: Nutch-Selenium Plugin Truncates Binary Data

>Sure, I've just uploaded the updated patch.
>
>On Sun, Feb 22, 2015 at 4:50 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>I think this is fantastic Mohammad!
>
>Can you update the patch on NUTCH-1933 with this improvement,
>so we can get it into the sources?
>
>Cheers,
>Chris
>
>-----Original Message-----
>From: Mohammad Al-Mohsin <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Saturday, February 21, 2015 at 6:03 AM
>To: "[email protected]" <[email protected]>
>Cc: Mohammad Al-Mohsin <[email protected]>
>Subject: Nutch-Selenium Plugin Truncates Binary Data
>
>>I am using
>>nutch-selenium <https://github.com/momer/nutch-selenium> plugin and I
>>also have
>>Tesseract <https://wiki.apache.org/tika/TikaOCR> installed for parsing
>>text off images.
>>
>>While crawling with Nutch & selenium, I noticed that binary data (e.g.
>>images, pdf) are always truncated and thus skip/fail parsing. Here is a
>>sample of the log:
>>Content of size 800750 was truncated to 368. Content is truncated, parse
>>may fail!
>>
>>When I turn selenium off, parsing works fine and the content is not
>>truncated.
>>
>>I found that nutch-selenium gets the html body of whatever Firefox
>>displays. So even though you're fetching an image, selenium will just
>>give you the image html tag instead of the image itself.
>>e.g. <img src='xyz.png' height="400" width="600">
>>
>>To get around this, I modified the selenium plugin to handle the fetch
>>only if the Content-Type header starts with 'text', i.e. to catch
>>'text/html'. Otherwise, if the content is not textual, it just returns
>>the content as protocol-httpclient does.
>>
>>Now, I am getting binary data properly parsed and also getting selenium
>>to handle page rendering with javascript.
>>
>>Is this the proper way to tackle this? What do you think?
>>
>>Best regards,
>>Mohammad Al-Mohsin
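
For readers following the thread, the Content-Type check described above could look roughly like the sketch below. It is not the code from the NUTCH-1933 patch: the class and method names (ContentTypeRouter, isTextual, chooseFetcher) are hypothetical, and treating a missing header as non-textual is an assumption made here, not something stated in the message.

```java
import java.util.Locale;

// Hypothetical helper, not taken from the NUTCH-1933 patch.
// It only illustrates the rule from the message: let Selenium handle the
// response when Content-Type starts with "text", otherwise return the raw
// bytes from the plain HTTP fetch.
final class ContentTypeRouter {

    private ContentTypeRouter() {
    }

    // True when the header value starts with "text" (e.g. "text/html").
    // Assumption: a missing header is treated as non-textual.
    static boolean isTextual(String contentType) {
        if (contentType == null) {
            return false;
        }
        return contentType.trim().toLowerCase(Locale.ROOT).startsWith("text");
    }

    // Returns a label for the path that should produce the content:
    // "selenium" for rendered pages, "http" for raw (binary) content.
    static String chooseFetcher(String contentType) {
        return isTextual(contentType) ? "selenium" : "http";
    }
}
```

With this rule, "text/html; charset=UTF-8" would go through Selenium, while "image/png" or "application/pdf" would keep the bytes from the plain HTTP fetch, so the parser receives the full binary content rather than an img tag.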

