Hi,

Maybe you can run a crawl (don't forget to filter the pages so that you keep
only .html or .htm files; you do that in conf/crawl-urlfilter.txt).
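For reference, crawl-urlfilter.txt takes one regular expression per line,
prefixed with + (accept) or - (reject), and the first matching rule wins. A
minimal sketch of the filter described above could look like this (the exact
extensions to skip and the rule letting bare directory URLs through are my
assumptions, so adjust them to your site):

  # skip common binary formats
  -\.(gif|jpg|png|css|js|pdf|zip|gz)$
  # keep .html/.htm pages, plus bare directory URLs so the crawl can continue
  +\.html?$
  +/$
  # reject everything else
  -.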
After that, go to the hadoop.log file and grep for the string
'fetcher.Fetcher - fetching http' to get all the fetched URLs. Don't forget to
sort the file and de-duplicate it (sort, then uniq; add -c if you want the
counts), because the crawler sometimes tries to fetch the same pages several
times if they don't answer the first time.
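Something along these lines should work; the log path and the exact format of
the fetcher message can differ between Nutch versions, so check the sed
pattern against your own hadoop.log first:

  grep 'fetcher.Fetcher - fetching http' logs/hadoop.log \
    | sed 's/.*fetching //' \
    | sort | uniq > fetched-urls.txt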

When you have all your URLs, you can run wget on that file and archive the
downloaded pages.
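For example (the output directory name is just a placeholder):

  wget --input-file=fetched-urls.txt --force-directories --directory-prefix=archive/

--force-directories keeps the host/path structure on disk, so each URL ends up
as its own file under archive/.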

Hope this helps.





> Date: Wed, 30 Sep 2009 20:46:50 +0000
> From: olson_...@yahoo.it
> Subject: Re: R: Using Nutch for only retriving HTML
> To: nutch-user@lucene.apache.org
> 
> Thanks Magnús and Susam for your responses and pointing me in the right 
> direction. I think I would spend time over the next few weeks trying out
> Nutch. I only need the HTML – I don't care if it is in the database or in
> separate files.
> 
> Thanks guys,
> O.O. 
> 
> 
> --- Wed 30/9/09, Magnús Skúlason <magg...@gmail.com> wrote:
> 
> > From: Magnús Skúlason <magg...@gmail.com>
> > Subject: Re: R: Using Nutch for only retriving HTML
> > To: nutch-user@lucene.apache.org
> > Date: Wednesday, 30 September 2009, 11:48
> > Actually it's quite easy to modify the parse-html filter to do this.
> > 
> > That is, saving the HTML to a file or to some database; you could then
> > configure it to skip all unnecessary plugins. I think it depends a lot on
> > the other requirements you have whether using Nutch for this task is the
> > right way to go or not. If you can get by with wget -r, then it's probably
> > overkill to use Nutch.
> > 
> > Best regards,
> > Magnus
> > 
> > On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal <susam....@gmail.com>
> > wrote:
> > 
> > > On Wed, Sep 30, 2009 at 1:39 AM, O. Olson <olson_...@yahoo.it> wrote:
> > > > Sorry for pushing this topic, but I would like to know if Nutch would
> > > > help me get the raw HTML in my situation described below.
> > > >
> > > > I am sure it would be a simple answer to those who know Nutch. If not
> > > > then I guess Nutch is the wrong tool for the job.
> > > >
> > > > Thanks,
> > > > O. O.
> > > >
> > > >
> > > > --- Thu 24/9/09, O. Olson <olson_...@yahoo.it> wrote:
> > > >
> > > >> From: O. Olson <olson_...@yahoo.it>
> > > >> Subject: Using Nutch for only retriving HTML
> > > >> To: nutch-user@lucene.apache.org
> > > >> Date: Thursday, 24 September 2009, 20:54
> > > >> Hi,
> > > >>     I am new to Nutch. I would like to completely crawl through an
> > > >> Internal Website and retrieve all the HTML Content. I don't intend to
> > > >> do further processing using Nutch.
> > > >> The Website/Content is rather huge. By crawl, I mean that I would go
> > > >> to a page, download/archive the HTML, get the links from that page,
> > > >> and then download/archive those pages. I would keep doing this till I
> > > >> don't have any new links.
> > >
> > > I don't think it is possible to retrieve pages and store them as
> > > separate files, one per page, without modifications in Nutch. I am not
> > > sure though. Someone would correct me if I am wrong here. However, it
> > > is easy to retrieve the HTML contents from the crawl DB using the
> > > Nutch API. But from your post, it seems, you don't want to do this.
> > >
> > > >>
> > > >> Is this possible? Is this the right tool for this job, or are there
> > > >> other tools out there that would be more suited for my purpose?
> > >
> > > I guess 'wget' is the tool you are looking for. You can use it with the
> > > -r option to recursively download pages and store them as separate files
> > > on the hard disk, which is exactly what you need. You might want to use
> > > the -np option too. It is available for Windows as well as Linux.
> > >
> > > Regards,
> > > Susam Pal
> > >
> > 
> 
> 
>       
                                          