You can use NUTCH-1526 to dump the segment contents and then index them into whatever back-end you want, or use NUTCH-1785 to index a document's raw binary content directly into any configured or custom indexing back-end plugin. Either approach will do what you need.
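If you would rather go straight at the segment data yourself, the sketch below reads the raw fetched content out of a segment and inserts it into a relational table. It is only a rough illustration, not the NUTCH-1526 tool itself: it assumes a Nutch 1.x segment layout, the Hadoop 1.x SequenceFile reader API, and a hypothetical JDBC table pages(url, content_type, raw_content) -- rename to taste and add real error handling.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;

    public class SegmentContentToDb {
      public static void main(String[] args) throws Exception {
        // Path to one segment's content data file, e.g.
        //   crawl/segments/20141209123456/content/part-00000/data
        // (the MapFile "data" part is a plain SequenceFile underneath).
        Path data = new Path(args[0]);

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // JDBC URL, credentials, and table are hypothetical -- point
        // them at your own database.
        Connection db = DriverManager.getConnection(
            "jdbc:postgresql://localhost/nutch", "user", "pass");
        PreparedStatement insert = db.prepareStatement(
            "INSERT INTO pages (url, content_type, raw_content) VALUES (?, ?, ?)");

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        try {
          Text url = new Text();           // key: the page URL
          Content content = new Content(); // value: bytes + headers as fetched
          while (reader.next(url, content)) {
            insert.setString(1, url.toString());
            insert.setString(2, content.getContentType());
            insert.setBytes(3, content.getContent()); // the HTML exactly as crawled
            insert.executeUpdate();
          }
        } finally {
          reader.close();
          db.close();
        }
      }
    }

The key point is that Content.getContent() hands back the exact bytes the fetcher stored, so the HTML keeps every tag; the NUTCH-1526 tool walks the same data but writes the objects back out as files instead of database rows.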
-----Original message-----
From: Xavier Morera <[email protected]>
Sent: Tuesday 9th December 2014 22:37
To: dev <[email protected]>
Subject: Re: Crawling a site and saving the page html exactly as is in a database

Hi Chris Mattmann,

We will soon test it out. Is it ok if I let you know if I have questions or comments?

Thanks,
Xavier

On Fri, Sep 19, 2014 at 12:31 AM, Mattmann, Chris A (3980) <[email protected]> wrote:

Please check out NUTCH-1526 [1], which I am currently targeting for contribution to 1.10-trunk and the 2.x branch. I'd be happy to discuss. Thank you!

Please try the patch out - it will dump out the web pages, images, etc., all content that is stored in the segments, as the original files that were crawled. There is a review board link here:

https://reviews.apache.org/r/9119/

Cheers,
Chris

[1] https://issues.apache.org/jira/browse/NUTCH-1526

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Xavier Morera <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, September 18, 2014 3:21 PM
To: dev <[email protected]>
Subject: Crawling a site and saving the page html exactly as is in a database

>Hi,
>
>I have a requirement to crawl a site and save the crawled html pages into
>a database exactly as is. How complicated can this be? I need for it to
>keep all html tags.
>
>Also, are there any examples available that I could use as a base?
>
>Regards,
>Xavier
>
>--
>Xavier Morera
>email: [email protected]
>CR: +(506) 8849 8866
>US: +1 (305) 600 4919
>skype: xmorera

--
Xavier Morera
Entrepreneur | Author & Trainer | Consultant | Developer & Scrum Master
www.xaviermorera.com
office: (305) 600-4919
cel: +506 8849-8866
skype: xmorera
Twitter: https://twitter.com/xmorera | LinkedIn: https://www.linkedin.com/in/xmorera | Pluralsight Author: http://www.pluralsight.com/author/xavier-morera
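For reference, once the NUTCH-1526 patch is applied, the dump tool is driven through the nutch script. A usage sketch follows; the -segment and -outputDir options match the patch as I remember it, so run the command with no arguments to confirm the exact flags for your build:

    # Dump everything stored in a segment (pages, images, ...) back out
    # as the original files that were crawled.
    # (the segment directory name below is an example)
    bin/nutch dump -segment crawl/segments/20141209123456 -outputDir dump/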

