Re: fetcher runs without error with no internet connection

2011-08-23 Thread lewis john mcgibbney
Hi Alex, Did you get anywhere with this? What condition led to you seeing the unknown host exception? Unless the segment gets corrupted, I would assume you could fetch again. Hopefully you can confirm this. On Tue, Aug 16, 2011 at 9:23 PM, alx...@aim.com wrote: Hello, After running bin/nutch fetch

Re: force recrawl

2011-08-23 Thread lewis john mcgibbney
Correct. There should be comprehensive documentation on the wiki for these parameters (and many more). On Fri, Aug 19, 2011 at 6:46 PM, Markus Jelsma markus.jel...@openindex.io wrote: addDays is not a crawl switch but a generator switch. You cannot use the crawl command. But if I use

Re: Empty LinkDB after invertlinks

2011-08-23 Thread Markus Jelsma
There are some peculiarities in your log: 2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException: config() at org.apache.hadoop.conf.Configuration.init(Configuration.java:211) at org.apache.hadoop.conf.Configuration.init(Configuration.java:198) at

Re: Empty LinkDB after invertlinks

2011-08-23 Thread Marek Bachmann
Hi Markus, thank you for the quick reply. I already searched for this Configuration error and found: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15397.html Where they say that This exception is innocuous - it helps to debug at which points in the code the Configuration

Re: Empty LinkDB after invertlinks

2011-08-23 Thread lewis john mcgibbney
Hi Small suggestion, but I do not see any -dir argument passed alongside your initial invertlinks command. I understand that you have multiple segment directories, which have been fetched over a recent number of days, and that the output would also suggest the process was properly executed,

Nutch crawl updates ignore cans URL

2011-08-23 Thread Ramanathapuram, Rajesh
Summary: Nutch crawl updates to Solr ignore the case-insensitive URL index key. Let me explain: the site Nutch crawls is on an Apache server and the URL is case sensitive. When updating the Solr index I use the URL as my key, and it is not getting updated with the new, different-case URL. When accessing
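One possible reading of the report above is that the index key treats URLs differing only in case as the same entry. The following is a minimal Python sketch of that effect, not a reproduction of Nutch or Solr behaviour: plain dictionaries stand in for the index, and the `index_documents` helper and the example.com URLs are hypothetical names introduced only for illustration.

```python
# Illustrative only: a dict stands in for an index keyed on URL.
# Neither Nutch nor Solr is involved; all names here are hypothetical.

def index_documents(docs, normalize_key):
    """Map normalize_key(url) -> content; a later document with the same key wins."""
    index = {}
    for url, content in docs:
        index[normalize_key(url)] = content
    return index


# Two distinct pages on a case-sensitive server, differing only in URL case.
docs = [
    ("http://example.com/Docs/Page.html", "first page"),
    ("http://example.com/docs/page.html", "second page"),
]

# Case-sensitive key: both pages keep their own entry.
case_sensitive = index_documents(docs, lambda url: url)

# Case-insensitive key (lowercased): the two URLs collapse into one entry.
case_insensitive = index_documents(docs, str.lower)

print(len(case_sensitive), len(case_insensitive))
```

If the real index behaves like the second dictionary, two pages that the web server serves separately would share a single index entry, which would be worth checking against the key field's definition in the Solr schema.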

How to store data in new column in MySQL database Nutch 2.0

2011-08-23 Thread jcoffield
Hi there, I'm a newbie with Nutch. I need to store data from a crawl in specific columns in the webpage table in the Nutch database in MySQL. I have the columns being created by changing gora-sql-mapping.xml, and changing schema and field info in org.apache.nutch.storage.WebPage. I only need

Re: readdblink not showing alllinks

2011-08-23 Thread abhayd
Hi, after doing invertlinks I see the complete link graph... THANKS. I'm a bit confused, please help me understand. I do a crawl using the crawl command. I see around 7000+ URLs when I dump the crawldb. Then I do invertlinks and I see the complete link graph. After this I do solrindex. After solr indexing

Re: readdblink not showing alllinks

2011-08-23 Thread lewis john mcgibbney
If you post your crawldb dump, we can see the structure of your crawldb and may be able to begin pinpointing the issue. It should not be necessary to undertake another crawl after inverting links for these URLs to be indexed when calling the solrindex command... there must be

Re: fetcher runs without error with no internet connection

2011-08-23 Thread alxsss
Hi Lewis, I stopped the fetcher and started it on the same segment again. But before doing that I turned off the modem, and the fetcher started giving Unknown.Host exceptions. It was not giving any error during the DSL failure, i.e. when I was not able to connect to any sites. Again, this is nutch-1.2. Thanks. Alex.

Re: readdblink not showing alllinks

2011-08-23 Thread Markus Jelsma
Hi, after doing invertlinks I see the complete link graph... THANKS. I'm a bit confused, please help me understand. I do a crawl using the crawl command. I see around 7000+ URLs when I dump the crawldb. Then I do invertlinks and I see the complete link graph. After this I do solrindex. After solr

Re: fetcher runs without error with no internet connection

2011-08-23 Thread Markus Jelsma
If you fetch too hard, your DNS server may not be able to keep up. Hi Lewis, I stopped the fetcher and started it on the same segment again. But before doing that I turned off the modem, and the fetcher started giving Unknown.Host exceptions. It was not giving any error during the DSL failure, i.e. I was not

Re: readdblink not showing alllinks

2011-08-23 Thread abhayd
Hi, I have started the crawl again. I will post the crawldb output as soon as the crawl finishes.

Re: Empty LinkDB after invertlinks

2011-08-23 Thread Marek Bachmann
Oh yes, thank you very much Sergey, that was the problem. It would have been nice if the invertlinks command had told me that it had ignored them :-) Cheers, Marek On 23.08.2011 19:26, Sergey A Volkov wrote: Hi Is it possible that you fetch documents from just one site/domain? Looks like

Re: Re: Empty LinkDB after invertlinks

2011-08-23 Thread lewis . mcgibbney
Hi Marek, You make a reasonable point. If you feel that this is something that should be integrated then maybe consider filing a JIRA with a comprehensive description of the problem and a proposed solution. If you do not actually patch this yourself then maybe someone else can provide a

How to save html source to local drive

2011-08-23 Thread dyzc
Hi, I am using Nutch within a Hadoop distributed computing environment. I tried saving HTML source to a local drive (not HDFS) via an absolute file path, but I can't find the saved contents on either the master node or the datanodes. How can I achieve this? Thanks!