Thanks a lot Markus. We are starting very soon.

On Tue, Dec 9, 2014 at 6:17 PM, Markus Jelsma <[email protected]> wrote:

> You can use NUTCH-1526 to dump segment contents and then index them into
> whatever you want, or use NUTCH-1785 to index a document's raw binary
> content directly through whatever configured or custom back-end plugin.
> Both will help you do what you need.
>
> -----Original message-----
> From: Xavier Morera <[email protected]>
> Sent: Tuesday 9th December 2014 22:37
> To: dev <[email protected]>
> Subject: Re: Crawling a site and saving the page html exactly as is in a
> database
>
> Hi Chris Mattmann,
>
> We will soon test it out. Is it ok if I let you know if I have questions
> or comments?
>
> Thanks,
> Xavier
>
> On Fri, Sep 19, 2014 at 12:31 AM, Mattmann, Chris A (3980)
> <[email protected]> wrote:
>
> > Please check out NUTCH-1526 [1], which I am currently targeting for
> > contribution to 1.10-trunk and the 2.x branch. I'd be happy to
> > discuss. Thank you!
> >
> > Please try the patch out - it will dump out the web pages, images,
> > etc. - all content that is stored in the segments, as the original
> > files that were crawled.
> >
> > There is a review board link here:
> > https://reviews.apache.org/r/9119/
> >
> > Cheers,
> > Chris
> >
> > [1] https://issues.apache.org/jira/browse/NUTCH-1526
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: [email protected]
> > WWW: http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> > -----Original Message-----
> > From: Xavier Morera <[email protected]>
> > Reply-To: "[email protected]" <[email protected]>
> > Date: Thursday, September 18, 2014 3:21 PM
> > To: dev <[email protected]>
> > Subject: Crawling a site and saving the page html exactly as is in a
> > database
> >
> > > Hi,
> > >
> > > I have a requirement to crawl a site and save the crawled html pages
> > > into a database exactly as is. How complicated can this be? I need it
> > > to keep all html tags.
> > >
> > > Also, are there any examples available that I could use as a base?
> > >
> > > Regards,
> > > Xavier
> > >
> > > --
> > > Xavier Morera
> > > email: [email protected]
> > > CR: +(506) 8849 8866
> > > US: +1 (305) 600 4919
> > > skype: xmorera
>
> --
> Xavier Morera
> Entrepreneur | Author & Trainer | Consultant | Developer & Scrum Master
> www.xaviermorera.com
> office: (305) 600-4919
> cel: +506 8849-8866
> skype: xmorera
> Twitter: https://twitter.com/xmorera | LinkedIn:
> https://www.linkedin.com/in/xmorera | Pluralsight Author:
> http://www.pluralsight.com/author/xavier-morera

--
Xavier Morera
Entrepreneur | Author & Trainer | Consultant | Developer & Scrum Master
www.xaviermorera.com
office: (305) 600-4919
cel: +506 8849-8866
skype: xmorera
Twitter: https://twitter.com/xmorera | LinkedIn: https://www.linkedin.com/in/xmorera | Pluralsight Author: http://www.pluralsight.com/author/xavier-morera
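[Editor's note] Since the thread stays at pointer level, here is roughly what "save the page html exactly as is" looks like in code. In Nutch 1.x of this era, the fetched bytes live in each segment under content/part-*/data as a Hadoop SequenceFile of Text (URL) -> Content records, which is the same data NUTCH-1526's dump tool walks. A minimal sketch, assuming a hypothetical Postgres table pages(url, content_type, raw_html) and a locally accessible segment; the JDBC URL and credentials are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;

    /**
     * Sketch: pull the raw fetched bytes out of a Nutch 1.x segment and
     * store them in a SQL table, tags and all. Segment layout and JDBC
     * details are assumptions; adjust to your setup.
     */
    public class SegmentToDb {
      public static void main(String[] args) throws Exception {
        // args[0] = path to one segment's content data file, e.g.
        // crawl/segments/20141209123456/content/part-00000/data
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path data = new Path(args[0]);

        try (Connection db = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/crawl", "user", "secret");
             PreparedStatement insert = db.prepareStatement(
                 "INSERT INTO pages (url, content_type, raw_html) VALUES (?, ?, ?)")) {
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
          try {
            Text url = new Text();
            Content content = new Content();
            while (reader.next(url, content)) {
              // Content.getContent() is the page exactly as fetched --
              // every HTML tag intact, no parsing applied.
              insert.setString(1, url.toString());
              insert.setString(2, content.getContentType());
              insert.setBytes(3, content.getContent());
              insert.executeUpdate();
            }
          } finally {
            reader.close();
          }
        }
      }
    }

Two things worth checking if the table comes back empty: fetcher.store.content must be left at its default of true, or the raw pages never make it into the segment at all; and with the NUTCH-1526 patch applied, the same walk is available from the command line as a bin/nutch dump tool that writes the original files back out to disk (flag names may vary by version).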

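[Editor's note] For the NUTCH-1785 route Markus mentions, the idea is a custom indexing back-end plugin that receives each document and persists its raw content wherever you like, instead of shipping it to Solr. The IndexWriter interface changed across Nutch releases, so the skeleton below is a hedged sketch of its general shape (open/write/commit/close), not a drop-in implementation; RawHtmlIndexWriter, the "binaryContent" field name, and the JDBC details are all assumptions for illustration:

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.indexer.IndexWriter;
    import org.apache.nutch.indexer.NutchDocument;

    /**
     * Sketch of a custom back-end plugin in the spirit of NUTCH-1785:
     * write each document's raw content to a SQL table. Method signatures
     * follow the general shape of Nutch 1.x's IndexWriter and may need
     * adjusting to the exact interface of your Nutch version.
     */
    public class RawHtmlIndexWriter implements IndexWriter {
      private Configuration conf;
      private Connection db;
      private PreparedStatement insert;

      public void open(Configuration conf, String name) throws IOException {
        try {
          db = DriverManager.getConnection(
              "jdbc:postgresql://localhost/crawl", "user", "secret");
          insert = db.prepareStatement(
              "INSERT INTO pages (url, raw_html) VALUES (?, ?)");
        } catch (Exception e) {
          throw new IOException(e);
        }
      }

      public void write(NutchDocument doc) throws IOException {
        try {
          // Assumes the raw fetched bytes were placed on the document under
          // a "binaryContent" field, as NUTCH-1785 proposes; the field name
          // here is an assumption, not a documented constant.
          insert.setString(1, (String) doc.getFieldValue("url"));
          insert.setObject(2, doc.getFieldValue("binaryContent"));
          insert.executeUpdate();
        } catch (Exception e) {
          throw new IOException(e);
        }
      }

      public void update(NutchDocument doc) throws IOException { write(doc); }
      public void delete(String key) throws IOException { /* drop the row if needed */ }
      public void commit() throws IOException { /* JDBC autocommit: nothing to do */ }
      public void close() throws IOException {
        try { db.close(); } catch (Exception e) { throw new IOException(e); }
      }
      public String describe() { return "RawHtmlIndexWriter: stores raw HTML in a SQL table"; }
      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }

The plugin would still need the usual plugin.xml descriptor and an entry in plugin.includes before Nutch's indexing job will load it.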
