Exception on segment merging

2011-01-04 Thread Marseld Dedgjonaj
Hello everybody, I have configured nutch-1.2 to crawl all URLs of a specific website. It ran fine for a while, but now that the number of indexed URLs has grown to more than 30,000, I get an exception on segment merging. Has anybody seen this kind of error? The exception is shown below.

Which parse-plugins.xml is being used?

2011-01-04 Thread Steve Cohen
I am using a hadoop cluster so I am putting my conf files into the nutch source conf directory and building a nutch job file. I am then putting the job file into the classpath. I thought it was working fine since it seems to be reading the regex-urlfilter.txt from there. However, I am getting

Re: unnecessary results in search

2011-01-04 Thread Julien Nioche
Alex, See http://wiki.apache.org/solr/FieldCollapsing for implementing this in Solr. Since indexing and searching are delegated to Solr as of Nutch 1.3 and 2.0, this won't be implemented directly in Nutch. HTH Julien On 4 January 2011 00:10, alx...@aim.com wrote: Hello, I used nutch-1.2

FW: Exception on segment merging

2011-01-04 Thread Marseld Dedgjonaj
I looked in the hadoop log, and some more details about the exception are there. Please help me figure out what to check for this error. Here are the details: 2011-01-04 07:40:23,999 INFO segment.SegmentMerger - Slice size: 5 URLs. 2011-01-04 07:40:36,563 INFO segment.SegmentMerger - Slice size: 5 URLs.

Re: unnecessary results in search

2011-01-04 Thread alxsss
Hello, Thank you for your response. Let me give you more detail on the issue that I have. First, some definitions. Let's say I have my own domain that I host on a dedicated server, and call it mydomain.com. Next, call the following subdomains: answers.mydomain.com, mail.mydomain.com, maps.mydomain.com

Re: Exception on segment merging

2011-01-04 Thread alxsss
Which command did you use? Merging segments is very expensive in resources, so I try to avoid merging them. -Original Message- From: Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com To: user user@nutch.apache.org Sent: Tue, Jan 4, 2011 7:12 am Subject: FW: Exception on segment

Release planning

2011-01-04 Thread Andrzej Bialecki
Hi users and devs, As you probably know, there are currently two active lines of development for Nutch: * Nutch trunk, a.k.a. Nutch 2.0: this is based on a completely redesigned storage layer that uses Apache Gora, which in turn can use various storage implementations such as HBase, Cassandra,

Re: Release planning

2011-01-04 Thread Julien Nioche
+1 from me. Today I committed a bunch of patches which were in 1.2 but not in 1.3 (just one last one to do), but I haven't compared with 2.0. Having a release based on 1.3 would be great, as it would be a nice transition towards 2.0 (delegate indexing/search, dependency management with Ivy,

Re: Exception on segment merging

2011-01-04 Thread Julien Nioche
Other users have previously reported similar problems which were due to a lack of space on disk, as suggested by this *Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for

Re: Exception on segment merging

2011-01-04 Thread Markus Jelsma
Use the hadoop.tmp.dir setting in nutch-site.xml to point to a disk where plenty of space is available. Other users have previously reported similar problems which were due to a lack of space on disk, as suggested by this *Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException:
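
For anyone finding this thread later, the suggestion above might look roughly like the fragment below. This is only a sketch: it assumes the file in question is conf/nutch-site.xml (Nutch's site configuration is an XML file), and the path /data/hadoop-tmp is a made-up placeholder, not something taken from the original poster's setup. Pick whatever disk actually has enough free space for the merge.

```xml
<!-- Sketch only: goes inside the <configuration> element of conf/nutch-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <!-- Hypothetical path; replace with a directory on a disk with plenty of free space -->
  <value>/data/hadoop-tmp</value>
</property>
```

The directory has to be writable by the user running the job, and the setting is only read when a job starts, so the merge has to be rerun after changing it.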

Re: Release planning

2011-01-04 Thread Mattmann, Chris A (388J)
(cc to d...@nutch since you are addressing devs too) Hey Andrzej: As you probably know, there are currently two active lines of development for Nutch: [...snip...] Regarding branch-1.2 (which is a maintenance branch after release 1.2) there have been pretty much no updates there, if any.