Sure, I've just uploaded the updated patch.

On Sun, Feb 22, 2015 at 4:50 PM, Mattmann, Chris A (3980) <[email protected]> wrote:
> I think this is fantastic, Mohammad!
>
> Can you update the patch on NUTCH-1933 with this improvement,
> so we can get it into the sources?
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Mohammad Al-Mohsin <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Saturday, February 21, 2015 at 6:03 AM
> To: "[email protected]" <[email protected]>
> Cc: Mohammad Al-Mohsin <[email protected]>
> Subject: Nutch-Selenium Plugin Truncates Binary Data
>
> >I am using the nutch-selenium <https://github.com/momer/nutch-selenium>
> >plugin, and I also have Tesseract <https://wiki.apache.org/tika/TikaOCR>
> >installed for parsing text off images.
> >
> >While crawling with Nutch and Selenium, I noticed that binary content
> >(e.g. images, PDFs) is always truncated, so parsing is skipped or fails.
> >Here is a sample of the log:
> >Content of size 800750 was truncated to 368. Content is truncated, parse
> >may fail!
> >
> >When I turn Selenium off, parsing works fine and the content is not
> >truncated.
> >
> >I found that nutch-selenium gets the HTML body of whatever Firefox
> >displays. So even though you're fetching an image, Selenium will just
> >give you the image's HTML tag instead of the image itself, e.g.
> ><img src='xyz.png' height="400" width="600">
> >
> >To get around this, I modified the Selenium plugin to handle the fetch
> >only if the Content-Type header starts with 'text', i.e. to catch
> >'text/html'. Otherwise, if the content is not textual, it just returns
> >the content as protocol-httpclient does.
> >
> >Now binary data is parsed properly, and Selenium still handles rendering
> >of pages that use JavaScript.
> >
> >Is this the proper way to tackle this? What do you think?
> >
> >Best regards,
> >Mohammad Al-Mohsin
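For readers following the thread, the workaround described in the quoted message (let Selenium handle the fetch only when the Content-Type header starts with 'text', otherwise return the raw content as protocol-httpclient does) could be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the NUTCH-1933 patch: the class ContentTypeDispatch, its methods, and the renderWithSelenium stand-in are hypothetical names, and no real Nutch or Selenium API is called.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper illustrating the Content-Type check from the thread.
// None of these names come from Nutch or from the nutch-selenium plugin.
final class ContentTypeDispatch {

    // True when the header value denotes textual content such as "text/html".
    // A missing header is treated as non-textual (an assumption of this sketch).
    static boolean isTextual(String contentType) {
        if (contentType == null) {
            return false;
        }
        return contentType.trim().toLowerCase(Locale.ROOT).startsWith("text");
    }

    // Stand-in for the browser-rendered page body. A real plugin would ask
    // Selenium for the rendered page here; this stub only builds a placeholder.
    static byte[] renderWithSelenium(String url) {
        String html = "<html><!-- rendered: " + url + " --></html>";
        return html.getBytes(StandardCharsets.UTF_8);
    }

    // Textual responses go through the renderer; everything else (images,
    // PDFs, ...) keeps the bytes that the plain HTTP fetch returned.
    static byte[] chooseContent(String contentType, byte[] rawBody, String url) {
        return isTextual(contentType) ? renderWithSelenium(url) : rawBody;
    }

    public static void main(String[] args) {
        byte[] rawImage = new byte[800750]; // size taken from the log sample above
        byte[] kept = chooseContent("image/png", rawImage, "http://example.com/xyz.png");
        System.out.println("image/png bytes kept: " + kept.length);
        byte[] page = chooseContent("text/html; charset=UTF-8", new byte[0], "http://example.com/");
        System.out.println("text/html bytes rendered: " + page.length);
    }
}
```

Matching on the 'text' prefix, as in the message, also sends text/plain and text/css through the renderer, and leaves types such as application/xhtml+xml on the raw path. Whether that is the desired split is a design choice for the patch, not something this sketch settles.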

