Thanks Chris. I will check it out!

On Fri, Sep 19, 2014 at 12:31 AM, Mattmann, Chris A (3980)
<[email protected]> wrote:
> Please check out NUTCH-1526 [1], which I am currently targeting for
> contribution to 1.10-trunk and the 2.x branch. I'd be happy to
> discuss. Thank you!
>
> Please try the patch out - it will dump out the web pages, images,
> etc. - all content that is stored in the segments - as the original
> files that were crawled.
>
> There is a Review Board link here:
>
> https://reviews.apache.org/r/9119/
>
> Cheers,
> Chris
>
> [1] https://issues.apache.org/jira/browse/NUTCH-1526
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
> -----Original Message-----
> From: Xavier Morera <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Thursday, September 18, 2014 3:21 PM
> To: dev <[email protected]>
> Subject: Crawling a site and saving the page html exactly as is in a
> database
>
> >Hi,
> >
> >I have a requirement to crawl a site and save the crawled HTML pages
> >into a database exactly as is. How complicated can this be? I need it
> >to keep all HTML tags.
> >
> >Also, are there any examples available that I could use as a base?
> >
> >Regards,
> >Xavier
> >
> >--
> >Xavier Morera
> >email: [email protected]
> >CR: +(506) 8849 8866
> >US: +1 (305) 600 4919
> >skype: xmorera

--
Xavier Morera
email: [email protected]
CR: +(506) 8849 8866
US: +1 (305) 600 4919
skype: xmorera
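Since the thread is about keeping the crawled HTML byte-for-byte, here is a minimal sketch of the "save into a database exactly as is" step, assuming the NUTCH-1526 patch has already dumped the segment content to disk as files. This is not part of the patch; the dump directory layout, the `.html` suffix filter, and the SQLite schema are all illustrative assumptions - adapt them to whatever the dumper actually writes. The key point is reading in binary mode and storing a BLOB, so no tags, whitespace, or encoding details are altered.

```python
# Sketch: load files from a (hypothetical) Nutch segment-dump directory
# into SQLite, storing each page's HTML byte-for-byte as a BLOB so the
# markup is preserved exactly as crawled.
import os
import sqlite3

def store_dump(dump_dir, db_path):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (path TEXT PRIMARY KEY, html BLOB)"
    )
    for root, _dirs, files in os.walk(dump_dir):
        for name in files:
            # Assumed filter; the real dump may use other names/extensions.
            if not name.endswith(".html"):
                continue
            full = os.path.join(root, name)
            with open(full, "rb") as fh:  # binary read: bytes kept exactly
                data = fh.read()
            conn.execute(
                "INSERT OR REPLACE INTO pages VALUES (?, ?)",
                (os.path.relpath(full, dump_dir), data),
            )
    conn.commit()
    conn.close()
```

Storing the raw bytes (rather than decoded text) sidesteps charset-detection issues entirely; decoding, if needed, can happen at read time using the charset recorded in the crawl metadata.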

