Thanks a lot Markus. We are starting very soon.

On Tue, Dec 9, 2014 at 6:17 PM, Markus Jelsma <[email protected]> wrote:

> You can use NUTCH-1526 to dump segment contents and then index them into
> whatever you want, or use NUTCH-1785 to index a document's raw binary
> content directly through whatever configured or custom back-end plugin.
> Both will help you do what you need.
>
> -----Original message-----
> From: Xavier Morera <[email protected]>
> Sent: Tuesday 9th December 2014 22:37
> To: dev <[email protected]>
> Subject: Re: Crawling a site and saving the page html exactly as is in a
> database
>
> Hi Chris Mattmann,
>
> We will soon test it out. Is it ok if I let you know if I have questions
> or comments?
>
> Thanks,
> Xavier
>
> On Fri, Sep 19, 2014 at 12:31 AM, Mattmann, Chris A (3980)
> <[email protected]> wrote:
>
> > Please check out NUTCH-1526 [1], which I am currently targeting for
> > contribution to 1.10-trunk and the 2.x branch. I'd be happy to
> > discuss. Thank you!
> >
> > Please try the patch out - it will dump out the web pages, images,
> > etc. - all content that is stored in the segments, as the original
> > files that were crawled.
> >
> > There is a review board link here:
> > https://reviews.apache.org/r/9119/
> >
> > Cheers,
> > Chris
> >
> > [1] https://issues.apache.org/jira/browse/NUTCH-1526
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: [email protected]
> > WWW: http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> > -----Original Message-----
> > From: Xavier Morera <[email protected]>
> > Reply-To: "[email protected]" <[email protected]>
> > Date: Thursday, September 18, 2014 3:21 PM
> > To: dev <[email protected]>
> > Subject: Crawling a site and saving the page html exactly as is in a
> > database
> >
> > > Hi,
> > >
> > > I have a requirement to crawl a site and save the crawled html pages
> > > into a database exactly as is. How complicated can this be? I need it
> > > to keep all html tags.
> > >
> > > Also, are there any examples available that I could use as a base?
> > >
> > > Regards,
> > > Xavier
> > >
> > > --
> > > Xavier Morera
> > > email: [email protected]
> > > CR: +(506) 8849 8866
> > > US: +1 (305) 600 4919
> > > skype: xmorera
>
> --
> Xavier Morera
> Entrepreneur | Author & Trainer | Consultant | Developer & Scrum Master
> www.xaviermorera.com
> office: (305) 600-4919
> cel: +506 8849-8866
> skype: xmorera
> Twitter: https://twitter.com/xmorera | LinkedIn:
> https://www.linkedin.com/in/xmorera | Pluralsight Author:
> http://www.pluralsight.com/author/xavier-morera

--
Xavier Morera
Entrepreneur | Author & Trainer | Consultant | Developer & Scrum Master
www.xaviermorera.com
office: (305) 600-4919
cel: +506 8849-8866
skype: xmorera
Twitter: https://twitter.com/xmorera | LinkedIn: https://www.linkedin.com/in/xmorera | Pluralsight Author: http://www.pluralsight.com/author/xavier-morera
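[Editor's note] Since the thread stays at pointer level, here is roughly what "save the page html exactly as is" looks like in code. In Nutch 1.x of this era, the fetched bytes live in each segment under content/part-*/data as a Hadoop SequenceFile of Text (URL) -> Content records, which is the same data NUTCH-1526's dump tool walks. A minimal sketch, assuming a hypothetical Postgres table pages(url, content_type, raw_html) and a locally accessible segment; the JDBC URL and credentials are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;

    /**
     * Sketch: pull the raw fetched bytes out of a Nutch 1.x segment and
     * store them in a SQL table, tags and all. Segment layout and JDBC
     * details are assumptions; adjust to your setup.
     */
    public class SegmentToDb {
      public static void main(String[] args) throws Exception {
        // args[0] = path to one segment's content data file, e.g.
        // crawl/segments/20141209123456/content/part-00000/data
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path data = new Path(args[0]);

        try (Connection db = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/crawl", "user", "secret");
             PreparedStatement insert = db.prepareStatement(
                 "INSERT INTO pages (url, content_type, raw_html) VALUES (?, ?, ?)")) {
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
          try {
            Text url = new Text();
            Content content = new Content();
            while (reader.next(url, content)) {
              // Content.getContent() is the page exactly as fetched --
              // every HTML tag intact, no parsing applied.
              insert.setString(1, url.toString());
              insert.setString(2, content.getContentType());
              insert.setBytes(3, content.getContent());
              insert.executeUpdate();
            }
          } finally {
            reader.close();
          }
        }
      }
    }

Two things worth checking if the table comes back empty: fetcher.store.content must be left at its default of true, or the raw pages never make it into the segment at all; and with the NUTCH-1526 patch applied, the same walk is available from the command line as a bin/nutch dump tool that writes the original files back out to disk (flag names may vary by version).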

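[Editor's note] For the NUTCH-1785 route Markus mentions, the idea is a custom indexing back-end plugin that receives each document and persists its raw content wherever you like, instead of shipping it to Solr. The IndexWriter interface changed across Nutch releases, so the skeleton below is a hedged sketch of its general shape (open/write/commit/close), not a drop-in implementation; RawHtmlIndexWriter, the "binaryContent" field name, and the JDBC details are all assumptions for illustration:

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.indexer.IndexWriter;
    import org.apache.nutch.indexer.NutchDocument;

    /**
     * Sketch of a custom back-end plugin in the spirit of NUTCH-1785:
     * write each document's raw content to a SQL table. Method signatures
     * follow the general shape of Nutch 1.x's IndexWriter and may need
     * adjusting to the exact interface of your Nutch version.
     */
    public class RawHtmlIndexWriter implements IndexWriter {
      private Configuration conf;
      private Connection db;
      private PreparedStatement insert;

      public void open(Configuration conf, String name) throws IOException {
        try {
          db = DriverManager.getConnection(
              "jdbc:postgresql://localhost/crawl", "user", "secret");
          insert = db.prepareStatement(
              "INSERT INTO pages (url, raw_html) VALUES (?, ?)");
        } catch (Exception e) {
          throw new IOException(e);
        }
      }

      public void write(NutchDocument doc) throws IOException {
        try {
          // Assumes the raw fetched bytes were placed on the document under
          // a "binaryContent" field, as NUTCH-1785 proposes; the field name
          // here is an assumption, not a documented constant.
          insert.setString(1, (String) doc.getFieldValue("url"));
          insert.setObject(2, doc.getFieldValue("binaryContent"));
          insert.executeUpdate();
        } catch (Exception e) {
          throw new IOException(e);
        }
      }

      public void update(NutchDocument doc) throws IOException { write(doc); }
      public void delete(String key) throws IOException { /* drop the row if needed */ }
      public void commit() throws IOException { /* JDBC autocommit: nothing to do */ }
      public void close() throws IOException {
        try { db.close(); } catch (Exception e) { throw new IOException(e); }
      }
      public String describe() { return "RawHtmlIndexWriter: stores raw HTML in a SQL table"; }
      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }

The plugin would still need the usual plugin.xml descriptor and an entry in plugin.includes before Nutch's indexing job will load it.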
