Dear Xavier, yes, please contact me. I'd be happy to help! So would some of the other devs here who have used it, such as Lewis.
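
Just to illustrate the "exactly as is" part of the requirement: the key is to keep the fetched bytes as an opaque BLOB rather than re-parsing or re-encoding them. Here's a minimal standalone sketch (not the NUTCH-1526 patch itself, and not Nutch code — table name, column names, and the example URL are all made up for illustration):

```python
# Hypothetical sketch: store a fetched page's raw HTML byte-for-byte in SQLite.
# Storing bytes as a BLOB guarantees nothing (tags, whitespace, encoding) is altered.
import sqlite3

def store_page(conn, url, raw_html):
    """Insert the raw page bytes unmodified; BLOB storage preserves them exactly."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html BLOB)")
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, raw_html))
    conn.commit()

def load_page(conn, url):
    """Return the stored bytes for a URL, or None if it was never crawled."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
html = b"<html><body><p>Hello & <b>world</b></p></body></html>"
store_page(conn, "http://example.com/", html)
assert load_page(conn, "http://example.com/") == html  # round-trips byte-for-byte
```

The same idea applies with Nutch: the segments already hold the original fetched bytes, so a dump-and-insert step like this keeps every HTML tag intact.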
Thanks!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Xavier Morera <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, December 9, 2014 at 1:35 PM
To: dev <[email protected]>
Subject: Re: Crawling a site and saving the page html exactly as is in a database

>Hi Chris Mattmann,
>
>We will soon test it out. Is it ok if I let you know if I have questions
>or comments?
>
>Thanks,
>Xavier
>
>On Fri, Sep 19, 2014 at 12:31 AM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>Please check out NUTCH-1526 [1] which I am currently targeting for
>contribution to 1.10-trunk and the 2.x branch. I'd be happy to
>discuss. Thank you!
>
>Please try the patch out - it will dump out the web pages, images,
>etc. all content that is stored in the segments as the original
>files that were crawled.
>
>There is a review board link here:
>
>https://reviews.apache.org/r/9119/
>
>Cheers,
>Chris
>
>[1] https://issues.apache.org/jira/browse/NUTCH-1526
>
>-----Original Message-----
>From: Xavier Morera <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Thursday, September 18, 2014 3:21 PM
>To: dev <[email protected]>
>Subject: Crawling a site and saving the page html exactly as is in a
>database
>
>>Hi,
>>
>>I have a requirement to crawl a site and save the crawled html pages into
>>a database exactly as is. How complicated can this be? I need for it to
>>keep all html tags.
>>
>>Also, are there any examples available that I could use as a base?
>>
>>Regards,
>>Xavier
>>
>>--
>>Xavier Morera
>>email: [email protected]
>>CR: +(506) 8849 8866
>>US: +1 (305) 600 4919
>>skype: xmorera
>
>--
>Xavier Morera
>Entrepreneur | Author & Trainer | Consultant | Developer & Scrum Master
>www.xaviermorera.com <http://www.xaviermorera.com/>
>office: (305) 600-4919
>cel: +506 8849-8866
>skype: xmorera
>Twitter <https://twitter.com/xmorera> | LinkedIn
><https://www.linkedin.com/in/xmorera> | Pluralsight
>Author <http://www.pluralsight.com/author/xavier-morera>

