Please check out NUTCH-1526 [1], which I am currently targeting for contribution to 1.10-trunk and the 2.x branch. I'd be happy to discuss. Thank you!
Please try the patch out - it will dump out all content that is stored in the segments (web pages, images, etc.) as the original files that were crawled.

There is a review board link here: https://reviews.apache.org/r/9119/

Cheers,
Chris

[1] https://issues.apache.org/jira/browse/NUTCH-1526

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Xavier Morera <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, September 18, 2014 3:21 PM
To: dev <[email protected]>
Subject: Crawling a site and saving the page html exactly as is in a database

>Hi,
>
>I have a requirement to crawl a site and save the crawled html pages into
>a database exactly as is. How complicated can this be? I need it to keep
>all html tags.
>
>Also, are there any examples available that I could use as a base?
>
>Regards,
>Xavier
>
>--
>Xavier Morera
>email: [email protected]
>CR: +(506) 8849 8866
>US: +1 (305) 600 4919
>skype: xmorera
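For a concrete sense of what such a dump involves: Nutch 1.x segments keep the fetched bytes in a content/ subdirectory as Hadoop SequenceFiles of <url, Content> records, so the original HTML comes back byte-for-byte. Below is a minimal standalone sketch of that idea in Java (this is not the NUTCH-1526 patch itself; the part-00000 file name and local-filesystem layout are assumptions about a typical single-reducer crawl):

import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentContentDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // A segment's content directory holds SequenceFiles of <Text url, Content>.
    Path data = new Path(args[0], "content/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    Content content = new Content();
    int n = 0;
    while (reader.next(url, content)) {
      // Write each record's raw bytes out untouched - html tags and all.
      try (FileOutputStream out = new FileOutputStream("dump-" + (n++))) {
        out.write(content.getContent());
      }
      System.out.println(url + "\t" + content.getContentType());
    }
    reader.close();
  }
}

Swapping the FileOutputStream for an INSERT into a database (Xavier's case) is just a change to the output step; the bytes never go through a parser, so nothing is stripped or rewritten.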

