Re: R: Using Nutch for only retrieving HTML

2009-10-01 Thread Andrzej Bialecki

BELLINI ADAM wrote:

Hi,
but how do I dump the content? I tried this command:



./bin/nutch readseg -dump crawl/segments/20090903121951/content/  toto

and it said:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
file:/usr/local/nutch-1.0/crawl/segments/20091001120102/content/crawl_generate
  


but crawl_generate is in this path:

/usr/local/nutch-1.0/crawl/segments/20091001120102

and not in this one:

/usr/local/nutch-1.0/crawl/segments/20091001120102/content

Can you please just give me the correct command?


This command will dump just the content part:

./bin/nutch readseg -dump crawl/segments/20090903121951 toto -nofetch 
-nogenerate -noparse -noparsedata -noparsetext
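
If you want the content of every segment, a small loop over the segments directory should do it. This is only a sketch, assuming the default layout under crawl/segments; the per-segment output directory name is arbitrary:

for seg in crawl/segments/*; do
  # dump only the content part of each segment into its own output directory
  ./bin/nutch readseg -dump "$seg" "dump_$(basename "$seg")" \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
done

Each output directory should then contain a plain-text dump of the fetched pages for that segment.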


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: R: Using Nutch for only retrieving HTML

2009-09-30 Thread Magnús Skúlason
Actually it's quite easy to modify the parse-html filter to do this.

That is, saving the HTML to a file or to some database; you could then
configure Nutch to skip all unnecessary plugins. I think it depends a lot on
your other requirements whether using Nutch for this task is the right way
to go or not. If you can get by with wget -r then it's probably overkill to
use Nutch.
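
One way to trim the plugins is to override plugin.includes in conf/nutch-site.xml. The list below is only a guess at a minimal fetch-and-parse-HTML set; adjust it to whatever your crawl actually needs, and back up any existing nutch-site.xml before overwriting it:

cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- assumed minimal set: HTTP fetching, regex URL filtering, HTML parsing, OPIC scoring, URL normalizers -->
    <value>protocol-http|urlfilter-regex|parse-html|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
EOF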

Best regards,
Magnus




Re: R: Using Nutch for only retrieving HTML

2009-09-30 Thread O. Olson
Thanks Magnús and Susam for your responses and for pointing me in the right
direction. I think I will spend time over the next few weeks trying out Nutch.
I only need the HTML; I don't care whether it is in the database or in
separate files.

Thanks guys,
O.O. 






RE: R: Using Nutch for only retrieving HTML

2009-09-30 Thread BELLINI ADAM

Hi,
Maybe you can run a crawl (don't forget to filter the pages so that only html or
htm files are kept; you do that in conf/crawl-urlfilter.txt).
After that, go to the hadoop.log file and grep for the phrase
'fetcher.Fetcher - fetching http' to get all the fetched URLs.
Don't forget to sort the file and de-duplicate it (sort it and pipe it through
uniq), because the crawl sometimes tries to fetch the same pages several times
when they don't answer the first time.

When you have all your URLs you can run wget on the file and archive the
downloaded pages.
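
A sketch of that pipeline (the log path and the awk field are assumptions; Nutch normally writes the fetched URL as the last field of the 'fetching' log line, and sort -u does the sort-plus-dedup step in one go):

grep 'fetcher.Fetcher - fetching http' logs/hadoop.log \
  | awk '{print $NF}' \
  | sort -u > fetched_urls.txt
# re-fetch the de-duplicated URL list with wget and keep the pages under archive/
wget --input-file=fetched_urls.txt --directory-prefix=archive

For the html/htm filtering, the stock conf/crawl-urlfilter.txt typically ships with a '-\.(gif|GIF|jpg|JPG|...)$'-style line that you can extend to reject non-HTML extensions.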

Hope this helps.






RE: R: Using Nutch for only retrieving HTML

2009-09-30 Thread BELLINI ADAM


Me again,

I forgot to tell you the easiest way...

Once the crawl is finished you can dump the whole db (it contains all the links
to your html pages) into a text file:

./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile

and you can perform the wget on this dump and archive the files.
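
A rough sketch of that sequence, assuming the plain-text crawldb dump (each record starts with its URL) ends up in Hadoop-style part-* files inside the output directory:

./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile
# pull the URLs out of the dump, de-duplicate them, and re-fetch them with wget
grep -oh 'http[^[:space:]]*' DBtextFile/part-* | sort -u > db_urls.txt
wget --input-file=db_urls.txt --directory-prefix=archive

Note that this re-fetches every page with wget rather than reusing what the crawl already downloaded.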




Re: R: Using Nutch for only retrieving HTML

2009-09-30 Thread Andrzej Bialecki

BELLINI ADAM wrote:


Me again,

I forgot to tell you the easiest way...

Once the crawl is finished you can dump the whole db (it contains all the links
to your html pages) into a text file:

./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile

and you can perform the wget on this dump and archive the files.


I'd argue with this advice. The goal here is to obtain the HTML pages. 
If you have crawled them, then why do it again? You already have their 
content locally.


However, page content is NOT stored in crawldb, it's stored in segments. 
So you need to dump the content from segments, and not the content of 
crawldb.


The command 'bin/nutch readseg -dump segmentName output' should do 
the trick.
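
For example, a minimal sketch assuming the default crawl layout (pick one of the timestamped directories under crawl/segments):

SEG=crawl/segments/20090903121951   # substitute one of your own segment directories
./bin/nutch readseg -dump "$SEG" segdump
less segdump/*                      # the dump comes out as plain text in this directory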



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



R: Using Nutch for only retrieving HTML

2009-09-29 Thread O. Olson
Sorry for pushing this topic, but I would like to know if Nutch would help me 
get the raw HTML in my situation described below. 

I am sure it would be a simple answer to those who know Nutch. If not then I 
guess Nutch is the wrong tool for the job.

Thanks,
O. O. 


--- Thu 24/9/09, O. Olson olson_...@yahoo.it wrote:

 From: O. Olson olson_...@yahoo.it
 Subject: Using Nutch for only retrieving HTML
 To: nutch-user@lucene.apache.org
 Date: Thursday, 24 September 2009, 20:54
 Hi,
     I am new to Nutch. I would like to
 completely crawl through an Internal Website and retrieve
 all the HTML Content. I don’t intend to do further
 processing using Nutch. 
 The Website/Content is rather huge. By crawl, I mean that I
 would go to a page, download/archive the HTML, get the links
 from that page, and then download/archive those pages. I
 would keep doing this till I don’t have any new links.
 
 Is this possible? Is this the right tool for this job, or
 are there other tools out there that would be more suited
 for my purpose?
 
 Thanks,
 O.O. 
 
 
 
 
 





Re: R: Using Nutch for only retrieving HTML

2009-09-29 Thread Susam Pal
On Wed, Sep 30, 2009 at 1:39 AM, O. Olson olson_...@yahoo.it wrote:
 Sorry for pushing this topic, but I would like to know if Nutch would help me 
 get the raw HTML in my situation described below.

 I am sure it would be a simple answer to those who know Nutch. If not then I 
 guess Nutch is the wrong tool for the job.

 Thanks,
 O. O.


 --- Thu 24/9/09, O. Olson olson_...@yahoo.it wrote:

 From: O. Olson olson_...@yahoo.it
 Subject: Using Nutch for only retrieving HTML
 To: nutch-user@lucene.apache.org
 Date: Thursday, 24 September 2009, 20:54
 Hi,
     I am new to Nutch. I would like to
 completely crawl through an Internal Website and retrieve
 all the HTML Content. I don’t intend to do further
 processing using Nutch.
 The Website/Content is rather huge. By crawl, I mean that I
 would go to a page, download/archive the HTML, get the links
 from that page, and then download/archive those pages. I
 would keep doing this till I don’t have any new links.

I don't think it is possible to retrieve pages and store them as
separate files, one per page, without modifications to Nutch. I am not
sure, though; someone will correct me if I am wrong here. However, it
is easy to retrieve the HTML contents from the crawl DB using the
Nutch API. But from your post, it seems you don't want to do this.


 Is this possible? Is this the right tool for this job, or
 are there other tools out there that would be more suited
 for my purpose?

I guess 'wget' is the tool you are looking for. You can use it with the -r
option to recursively download pages and store them as separate files
on the hard disk, which is exactly what you need. You might want to
use the -np option too. It is available for Windows as well as Linux.
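
For example, a sketch using the standard GNU wget options (the host below is just a placeholder):

wget --recursive --no-parent --accept html,htm \
     --wait=1 --directory-prefix=archive \
     http://intranet.example.com/

--accept keeps only .html/.htm files (pages served without a matching extension may be discarded), --no-parent stays below the starting path, and --wait adds a small delay between requests.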

Regards,
Susam Pal