Please check out NUTCH-1526 [1], which I am currently targeting for contribution to 1.10-trunk and the 2.x branch. I'd be happy to discuss. Thank you!
Please try the patch out - it will dump out all content that is stored in the segments (web pages, images, etc.) as the original files that were crawled.

There is a review board link here: https://reviews.apache.org/r/9119/

Cheers,
Chris

[1] https://issues.apache.org/jira/browse/NUTCH-1526

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Xavier Morera <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, September 18, 2014 3:21 PM
To: dev <[email protected]>
Subject: Crawling a site and saving the page html exactly as is in a database

>Hi,
>
>I have a requirement to crawl a site and save the crawled html pages into
>a database exactly as is. How complicated can this be? I need it to keep
>all html tags.
>
>Also, are there any examples available that I could use as a base?
>
>Regards,
>Xavier
>
>--
>Xavier Morera
>email: [email protected]
>CR: +(506) 8849 8866
>US: +1 (305) 600 4919
>skype: xmorera
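For a concrete sense of what such a dump involves: Nutch 1.x segments keep the fetched bytes in a content/ subdirectory as Hadoop SequenceFiles of <url, Content> records, so the original HTML comes back byte-for-byte. Below is a minimal standalone sketch of that idea in Java (this is not the NUTCH-1526 patch itself; the part-00000 file name and local-filesystem layout are assumptions about a typical single-reducer crawl):

import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentContentDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // A segment's content directory holds SequenceFiles of <Text url, Content>.
    Path data = new Path(args[0], "content/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    Content content = new Content();
    int n = 0;
    while (reader.next(url, content)) {
      // Write each record's raw bytes out untouched - html tags and all.
      try (FileOutputStream out = new FileOutputStream("dump-" + (n++))) {
        out.write(content.getContent());
      }
      System.out.println(url + "\t" + content.getContentType());
    }
    reader.close();
  }
}

Swapping the FileOutputStream for an INSERT into a database (Xavier's case) is just a change to the output step; the bytes never go through a parser, so nothing is stripped or rewritten.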

