Hello everybody,
I have configured nutch-1.2 to crawl all URLs of a specific website.
It ran fine for a while, but now that the number of indexed URLs has grown
to more than 30,000, I get an exception on segment merging.
Has anybody seen this kind of error?
The exception is shown below.
I am using a hadoop cluster so I am putting my conf files into the nutch
source conf directory and building a nutch job file. I am then putting the
job file into the classpath. I thought it was working fine since it seems to
be reading the regex-urlfilter.txt from there. However, I am getting
Alex,
See http://wiki.apache.org/solr/FieldCollapsing for implementing this in
SOLR. Since the indexing and searching is delegated to SOLR as of Nutch 1.3
and 2.0 this won't be implemented directly in Nutch.
HTH
Julien
On 4 January 2011 00:10, alx...@aim.com wrote:
Hello,
I used nutch-1.2
I looked in the hadoop log, and some more details about the exception are there.
Please help me figure out what to check for this error.
Here are the details:
2011-01-04 07:40:23,999 INFO segment.SegmentMerger - Slice size: 5 URLs.
2011-01-04 07:40:36,563 INFO segment.SegmentMerger - Slice size: 5 URLs.
Hello,
Thank you for your response.
Let me give you more detail on the issue that I have.
First, some definitions. Let's say I have my own domain that I host on a dedicated
server, and call it mydomain.com.
Next, call the following subdomains: answers.mydomain.com, mail.mydomain.com,
maps.mydomain.com
Which command did you use? Merging segments is very resource-intensive, so
I try to avoid merging them.
-Original Message-
From: Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com
To: user user@nutch.apache.org
Sent: Tue, Jan 4, 2011 7:12 am
Subject: FW: Exception on segment
Hi users devs,
As you probably know, there are currently two active lines of
development for Nutch:
* Nutch trunk, a.k.a. Nutch 2.0: this is based on a completely
redesigned storage layer that uses Apache Gora, which in turn can use
various storage implementations such as HBase, Cassandra,
+1 from me. Today I committed a bunch of patches which were in 1.2 but
not in 1.3 (just one last one to do), but I haven't compared with 2.0.
Having a release based on 1.3 would be great as it would be a nice
transition towards 2.0 (delegate indexing/search, dependency management with
Ivy,
Other users have previously reported similar problems which were due to a
lack of space on disk, as suggested by this:
*Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could
not find any valid local directory for
Use the hadoop.tmp.dir setting in nutch-site.xml to point to a disk where
plenty of space is available.
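For illustration, such an override might look like the fragment below. This is only a sketch in the standard Hadoop property format, as it would appear in the Nutch site configuration file (conf/nutch-site.xml in a stock install). The path /data/hadoop-tmp is a made-up placeholder, not something from the original report; substitute a directory on a volume with enough free space for the merge.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: /data/hadoop-tmp is a placeholder path; point this at a
       directory on a disk with plenty of free space. -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop-tmp</value>
  </property>
</configuration>
```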
(cc to d...@nutch since you are addressing devs too)
Hey Andrzej:
As you probably know, there are currently two active lines of
development for Nutch:
[...snip...]
Regarding branch-1.2 (which is a maintenance branch after release 1.2),
there have been pretty much no updates there, if any.