Thanks Chris. I will check it out!

On Fri, Sep 19, 2014 at 12:31 AM, Mattmann, Chris A (3980) <
[email protected]> wrote:

> Please check out NUTCH-1526 [1] which I am currently targeting for
> contribution to 1.10-trunk and the 2.x branch. I'd be happy to
> discuss. Thank you!
>
> Please try the patch out - it will dump out all content stored in
> the segments (web pages, images, etc.) as the original files that
> were crawled.
>
> There is a review board link here:
>
> https://reviews.apache.org/r/9119/
>
>
> Cheers,
> Chris
>
> [1] https://issues.apache.org/jira/browse/NUTCH-1526
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Xavier Morera <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Thursday, September 18, 2014 3:21 PM
> To: dev <[email protected]>
> Subject: Crawling a site and saving the page html exactly as is in a
> database
>
> >Hi,
> >
> >
> >I have a requirement to crawl a site and save the crawled html pages into
> >a database exactly as is. How complicated can this be? I need for it to
> >keep all html tags.
> >
> >
> >Also, are there any examples available that I could use as a base?
> >
> >
> >Regards,
> >Xavier
> >
> >
> >--
> >Xavier Morera
> >email: [email protected]
> >CR: +(506) 8849 8866
> >US: +1 (305) 600 4919
> >skype: xmorera
> >
> >
> >
> >
> >
>
>


-- 
*Xavier Morera*
email: [email protected]
CR: +(506) 8849 8866
US: +1 (305) 600 4919
skype: xmorera
