Re: R: Using Nutch for only retrieving HTML
BELLINI ADAM wrote:
> Hi, but how do I dump the content? I tried this command:
>   ./bin/nutch readseg -dump crawl/segments/20090903121951/content/ toto
> and it said:
>   Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/nutch-1.0/crawl/segments/20091001120102/content/crawl_generate
> But crawl_generate is in this path: /usr/local/nutch-1.0/crawl/segments/20091001120102, and not in this one: /usr/local/nutch-1.0/crawl/segments/20091001120102/content. Can you please just give me the correct command?

This command will dump just the content part (note that it points at the segment directory itself, not at its content/ subdirectory):

  ./bin/nutch readseg -dump crawl/segments/20090903121951 toto -nofetch -nogenerate -noparse -noparsedata -noparsetext

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
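For reference, a minimal sketch of running the same dump over every segment of a crawl. It assumes the standard crawl/segments layout and simply reuses the -no* flags from the command above; the dump_* output directory names are only placeholders:

  # Dump only the fetched content of each segment into its own directory.
  # The -no* flags skip the crawl_generate, crawl_fetch, crawl_parse,
  # parse_data and parse_text parts, leaving just the content.
  for seg in crawl/segments/*; do
    ./bin/nutch readseg -dump "$seg" "dump_$(basename "$seg")" \
      -nofetch -nogenerate -noparse -noparsedata -noparsetext
  done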
Re: R: Using Nutch for only retrieving HTML
Actually it's quite easy to modify the parse-html plugin to do this, that is, to save the HTML to a file or to some database; you could then configure Nutch to skip all unnecessary plugins. I think it depends a lot on your other requirements whether using Nutch for this task is the right way to go or not. If you can get by with wget -r, then Nutch is probably overkill.

Best regards,
Magnus

On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal susam@gmail.com wrote:
> I guess 'wget' is the tool you are looking for. You can use it with the -r option to recursively download pages and store them as separate files on the hard disk, which is exactly what you need. You might want to use the -np option too.
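If wget is enough for the job, a minimal sketch might look like the following. The -r and -np options are the ones suggested in this thread; -A and -P are additional standard wget options, and the host name is only a placeholder:

  # Recursively mirror an internal site, keeping only HTML pages.
  # -r          : follow links recursively
  # -np         : never ascend to the parent directory
  # -A html,htm : keep only files with these suffixes (extra option, not from the thread)
  # -P archive/ : directory to store the downloaded pages in (extra option, not from the thread)
  wget -r -np -A html,htm -P archive/ http://intranet.example.com/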
Re: R: Using Nutch for only retrieving HTML
Thanks Magnús and Susam for your responses and for pointing me in the right direction. I think I will spend some time over the next few weeks trying out Nutch. I only need the HTML; I don't care whether it ends up in the database or in separate files.

Thanks guys,
O.O.
RE: R: Using Nutch for only retrieving HTML
Hi, maybe you can run a crawl (don't forget to filter the pages so that you only keep .html or .htm files; you do that in conf/crawl-urlfilter.txt). After that, go to the hadoop.log file and grep for the string 'fetcher.Fetcher - fetching http' to get all the fetched URLs. Don't forget to sort the file and remove duplicates (sort and uniq), because the crawler sometimes tries to fetch pages several times if they do not answer the first time. Once you have all your URLs you can run wget on the file and archive the downloaded pages. Hope this helps.
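A rough sketch of that pipeline, assuming the log lives at logs/hadoop.log and that the fetcher log lines end with the URL; the exact log location and line format can vary between Nutch versions, so treat both as assumptions:

  # Extract the fetched URLs from the fetcher log, de-duplicate them,
  # then archive each page with wget.
  # ASSUMPTION: the URL is the last whitespace-separated field of the log line.
  grep 'fetcher.Fetcher - fetching http' logs/hadoop.log \
    | awk '{print $NF}' \
    | sort -u > fetched-urls.txt

  # -i reads the URL list, -x recreates the directory structure on disk,
  # -P puts everything under archive/.
  wget -x -i fetched-urls.txt -P archive/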
RE: R: Using Nutch for only retrieving HTML
Me again. I forgot to tell you the easiest way: once the crawl is finished you can dump the whole crawl DB (it contains all the links to your HTML pages) into a text file:

  ./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile

and then you can run wget against that dump and archive the downloaded files.
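If you do go this route, a rough sketch of extracting the URLs from the dump and feeding them to wget. This assumes the readdb dump is written as Hadoop part-* files and that each record starts with the URL followed by whitespace and metadata; the exact dump format is an assumption here:

  # Hypothetical sketch: pull URLs out of the crawldb dump and fetch them.
  # ASSUMPTION: records start with the URL; the dump directory contains part-* files.
  cat DBtextFile/part-* | grep '^http' | awk '{print $1}' | sort -u > db-urls.txt
  wget -x -i db-urls.txt -P archive/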
Re: R: Using Nutch for only retrieving HTML
BELLINI ADAM wrote:
> Me again, I forgot to tell you the easiest way... once the crawl is finished you can dump the whole db (it contains all the links to your html pages) in a text file: ./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile, and you can perform the wget on this db and archive the files.

I'd argue with this advice. The goal here is to obtain the HTML pages. If you have already crawled them, then why download them again? You already have their content locally. However, page content is NOT stored in the crawldb; it is stored in the segments. So you need to dump the content from the segments, not the contents of the crawldb. The command 'bin/nutch readseg -dump segmentName output' should do the trick.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
R: Using Nutch for only retrieving HTML
Sorry for pushing this topic, but I would like to know whether Nutch would help me get the raw HTML in my situation, described below. I am sure it is a simple answer for those who know Nutch. If not, then I guess Nutch is the wrong tool for the job.

Thanks,
O. O.

--- On Thu 24/9/09, O. Olson olson_...@yahoo.it wrote:
From: O. Olson olson_...@yahoo.it
Subject: Using Nutch for only retrieving HTML
To: nutch-user@lucene.apache.org
Date: Thursday, 24 September 2009, 20:54

Hi, I am new to Nutch. I would like to completely crawl an internal website and retrieve all the HTML content. I don't intend to do further processing with Nutch. The website/content is rather large. By crawl, I mean that I would go to a page, download/archive the HTML, get the links from that page, and then download/archive those pages, and keep doing this until there are no new links. Is this possible? Is Nutch the right tool for this job, or are there other tools out there better suited to my purpose?

Thanks,
O.O.
Re: R: Using Nutch for only retrieving HTML
On Wed, Sep 30, 2009 at 1:39 AM, O. Olson olson_...@yahoo.it wrote:
> Hi, I am new to Nutch. I would like to completely crawl an internal website and retrieve all the HTML content. I don't intend to do further processing using Nutch. The website/content is rather large. By crawl, I mean that I would go to a page, download/archive the HTML, get the links from that page, and then download/archive those pages. I would keep doing this till I don't have any new links.

I don't think it is possible to retrieve pages and store them as separate files, one per page, without modifications to Nutch. I am not sure, though; someone will correct me if I am wrong here. However, it is easy to retrieve the HTML contents from the crawl DB using the Nutch API. But from your post, it seems, you don't want to do this.

> Is this possible? Is this the right tool for this job, or are there other tools out there that would be more suited for my purpose?

I guess 'wget' is the tool you are looking for. You can use it with the -r option to recursively download pages and store them as separate files on the hard disk, which is exactly what you need. You might want to use the -np option too. It is available for Windows as well as Linux.

Regards,
Susam Pal