Hi Alex,
Did you get anywhere with this?
What condition led to you seeing the UnknownHostException?
Unless the segment gets corrupted, I would assume you could fetch again.
Hopefully you can confirm this.
On Tue, Aug 16, 2011 at 9:23 PM, alx...@aim.com wrote:
Hello,
After running bin/nutch fetch
Correct
There should be comprehensive documentation on the wiki for these parameters
(and many more)
On Fri, Aug 19, 2011 at 6:46 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
addDays is not a crawl switch but a generator switch. You cannot use the crawl command.
But if I use
There are some peculiarities in your log:
2011-08-23 14:47:34,833 DEBUG conf.Configuration - java.io.IOException:
config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211)
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198)
at
Hi Markus,
thank you for the quick reply. I already searched for this Configuration
error and found:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15397.html
where they say: "This exception is innocuous - it helps to debug at
which points in the code the Configuration
Hi
Small suggestion, but I do not see any -dir argument passed alongside your
initial invertlinks command. I understand that you have multiple segment
directories, fetched over the last few days, and that the output would
also suggest the process was properly executed,
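For comparison, the multi-segment invocation I would expect looks roughly like this (the paths are placeholders):

```sh
# Invert links across every segment under the directory at once:
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# Or list individual segments explicitly:
bin/nutch invertlinks crawl/linkdb crawl/segments/20110823142500
```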
Summary: Nutch crawl updates to Solr ignore case differences in the URL index key.
Let me explain: the site Nutch crawls is on an Apache server and the URLs are
case sensitive. When updating the Solr index I use the URL as my key, and it is
not getting updated with the new URL that differs only in case. When accessing
Hi there,
I'm a newbie with Nutch. I need to store data from a crawl in specific
columns in the webpage table in the Nutch database in MySQL. I have the
columns being created by changing gora-sql-mapping.xml, and changing schema
and field info in org.apache.nutch.storage.WebPage.
I only need
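In case the mapping shape is the sticking point, a trimmed gora-sql-mapping.xml fragment usually looks roughly like this (only the existing title field is shown as a pattern for custom columns; the lengths are illustrative):

```xml
<!-- gora-sql-mapping.xml: the existing "title" field shown as the pattern;
     custom fields follow the same shape (lengths are illustrative). -->
<gora-orm>
  <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String"
         table="webpage">
    <primarykey column="id" length="512"/>
    <field name="title" column="title" length="2048"/>
  </class>
</gora-orm>
```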
Hi,
after doing invertlinks I see the complete link graph... THANKS.
I'm a bit confused, please help me understand.
I do the crawl using the crawl command. I see around 7000+ URLs when I dump the crawldb.
Then I do invertlinks and I see the complete link graph.
After this I do solrindex.
After solr indexing
If you post your crawldb dump then we could see the structure of your
crawldb and may be able to begin pinpointing the issue.
It should not be required for you to undertake another crawl after inverting
links for these URLs to be indexed when calling the solrindex command... there
must be
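A sketch of how such a dump is usually produced (the paths are placeholders):

```sh
# Quick statistics first (URL counts by fetch status):
bin/nutch readdb crawl/crawldb -stats

# Full plain-text dump for inspection:
bin/nutch readdb crawl/crawldb -dump crawldb-dump
less crawldb-dump/part-00000
```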
Hi Lewis,
I stopped the fetcher and started it on the same segment again.
But before doing that I turned off the modem, and the fetcher started giving an
UnknownHostException.
It was not giving any error during the DSL failure, i.e. when I was not able to
connect to any sites. Again, this is Nutch 1.2.
Thanks.
Alex.
If you fetch too hard, your DNS-server may not be able to keep up.
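One way to ease off, sketched as a nutch-site.xml fragment (the values are examples only, not recommendations; tune them for your connection):

```xml
<!-- nutch-site.xml: example values only. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>5</value>
  <description>Fewer fetcher threads means fewer parallel DNS lookups.</description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds to wait between requests to the same host.</description>
</property>
```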
Hi,
I have started the crawl again; I will post the crawldb output as soon as the
crawl finishes.
Oh yes, thank you very much Sergey, that was the problem.
It would have been nice if the invertlinks command had told me that it had
ignored them :-)
Cheers,
Marek
On 23.08.2011 19:26, Sergey A Volkov wrote:
Hi
Is it possible that you fetch documents from just one site/domain?
Looks like
Hi Marek,
You make a reasonable point. If you feel that this is something that should
be integrated then maybe consider filing a JIRA with a comprehensive
description of the problem and a proposed solution. If you do not actually
patch this yourself then maybe someone else can provide a
Hi,
I am using Nutch within a Hadoop distributed computing environment. I tried
saving the HTML source to a local drive (not HDFS) via an absolute file path,
but I can't find the saved contents on either the master node or the datanodes.
How can I achieve this?
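Map/reduce tasks run on the datanodes, so an absolute local path gets written on whichever node happened to execute the task, not on the machine you launched the job from. One possible approach is to write through the Hadoop FileSystem API to HDFS instead; a sketch (assumes the Hadoop client libraries are on the classpath, and the output path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SaveHtmlSketch {
  // Write the page source to HDFS so it is visible cluster-wide,
  // instead of to a node-local path.
  public static void save(Configuration conf, String html)
      throws java.io.IOException {
    FileSystem fs = FileSystem.get(conf);              // HDFS per the job config
    Path out = new Path("/user/nutch/html/page.html"); // placeholder path
    FSDataOutputStream os = fs.create(out, true);      // overwrite if present
    try {
      os.write(html.getBytes("UTF-8"));
    } finally {
      os.close();
    }
  }
}
```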
Thanks!