That is the same issue that I am having.

I didn't see any over-usage on the disk or CPU. I am going to set up a small Hadoop config today and will report the results.

-John

On Jun 5, 2009, at 8:44 AM, Alex Basa <alex_b...@yahoo.com> wrote:


It takes me about 6 days to crawl, parse and index 5 million documents.

I did not create an incremental index, so I have 2000 crawls in different directories. What is the best way to merge them all into one index? I've started using mergecrawls.sh, but it has been running for two weeks already and it's still not done. I've been monitoring the HD and there are no waits. CPU usage is at about 14%.

The server I'm doing the merges on has 16 cores, 32 GB of RAM, and 30 TB of ZFS storage on a SAN. Ideas, anyone?
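
For reference, a rough sketch of the individual steps that a wrapper like mergecrawls.sh drives, assuming the Nutch 0.9/1.0 command names and a hypothetical crawls/crawl-*/ directory layout (paths are illustrative; with 2000 crawls the globs may need to be batched):

# 1. Merge all crawldbs and linkdbs into one of each
bin/nutch mergedb     merged/crawldb crawls/crawl-*/crawldb
bin/nutch mergelinkdb merged/linkdb  crawls/crawl-*/linkdb

# 2. Merge the segments (add -filter to drop unwanted URLs while merging)
bin/nutch mergesegs merged/segments crawls/crawl-*/segments/*

# 3. Index, de-duplicate, and merge the part indexes into one index
bin/nutch index merged/indexes merged/crawldb merged/linkdb merged/segments/*
bin/nutch dedup merged/indexes
bin/nutch merge merged/index merged/indexes

Running the steps separately also makes it obvious which phase is the slow one.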

Thanks in advance as always,

Alex


--- On Fri, 6/5/09, Raymond Balmès <raymond.bal...@gmail.com> wrote:

From: Raymond Balmès <raymond.bal...@gmail.com>
Subject: Re: Merge taking forever
To: nutch-user@lucene.apache.org
Date: Friday, June 5, 2009, 2:38 AM
How long does it take for your 6 million URLs to be crawled/parsed/indexed? I'm curious to know because I'm about to shoot in this area, but I have no idea how long it will take.

-Ray-

2009/6/5 John Martyniak <j...@beforedawnsolutions.com>

Arkady,

I think that is the beauty of Nutch: I have built an index of a little more than 6 million URLs with "out of the box" Nutch. I would say that is pretty good for most situations before you have to start getting into Hadoop and multiple machines.

-John


On Jun 4, 2009, at 5:19 PM, <arkadi.kosmy...@csiro.au> wrote:

Hi Andrzej,

-----Original Message-----
From: Andrzej Bialecki [mailto:a...@getopt.org]
Sent: Thursday, June 04, 2009 9:47 PM
To: nutch-user@lucene.apache.org
Subject: Re: Merge taking forever

Bartosz Gadzimski wrote:

As Arkadi said, your HDD is too slow for 2 x quad-core processors. I have the same problem and am now thinking of using more boxes or very fast drives (SAS 15k).

Raymond Balmès writes:

Well, I suspect the sort function is mono-threaded, as usually they are, so only one core is used; 25% is the max you will get. I have a dual core and it only goes to 50% CPU in many of the steps... I assumed that some phases are mono-threaded.


Folks,

From your conversation I suspect that you are running Hadoop with the LocalJobTracker, i.e. in a single JVM - correct?
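
A quick way to check, assuming the Hadoop 0.19/0.20 configuration layout of the time (a single conf/hadoop-site.xml, and a mapred.job.tracker property that defaults to "local"):

# "local" - or no entry at all - means the single-JVM local job runner
grep -A 2 'mapred.job.tracker' conf/hadoop-site.xml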

While this works OK for small datasets, you don't really benefit from map-reduce parallelism (and you still pay the penalty for the overheads). As your dataset grows, you will quickly reach the scalability limits - in this case, the limit of IO throughput of a single drive during the sort phase of a large dataset. The excessive IO demands can be solved by distributing the load (over many drives, and over many machines), which is what HDFS is designed to do well.

Hadoop tasks are usually single-threaded, and additionally the LocalJobTracker implements only a primitive non-parallel model of task execution - i.e. each task is scheduled to run sequentially in turn. If you run the regular distributed JobTracker, Hadoop splits the load among many tasks running in parallel.

So, the solution is this: set up a distributed Hadoop cluster, even if it's going to consist of a single node - because then the data will be split and processed in parallel by several JVM instances. This will also help the operating system to schedule these processes over multiple CPUs. Additionally, if you still experience IO contention, consider moving to HDFS as the filesystem, and spread it over more than one machine and more than one disk in each machine.
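
As a concrete starting point, here is a minimal sketch of such a single-node setup, assuming a Hadoop 0.19/0.20-style installation (the property names follow that era; the hosts, ports and parallelism values below are only illustrative):

# Point Hadoop at HDFS and a real JobTracker instead of the in-process defaults.
cat > conf/hadoop-site.xml <<'EOF'
<configuration>
  <property><name>fs.default.name</name><value>hdfs://localhost:9000</value></property>
  <property><name>mapred.job.tracker</name><value>localhost:9001</value></property>
  <!-- single node, so no block replication -->
  <property><name>dfs.replication</name><value>1</value></property>
  <!-- allow several map/reduce tasks (separate JVMs) to run in parallel -->
  <property><name>mapred.tasktracker.map.tasks.maximum</name><value>4</value></property>
  <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>4</value></property>
</configuration>
EOF

bin/hadoop namenode -format   # one-time HDFS format
bin/start-all.sh              # starts namenode, datanode, jobtracker, tasktracker

Then copy the crawl data into HDFS (bin/hadoop dfs -put ...) and run the same bin/nutch commands; the jobs will be split across several task JVMs instead of one.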


Thank you for these recommendations.

I think that there is a large group of users (perhaps limited by budget or time they are willing to spend) that will give up on trying to use Nutch unless they can run it on a single box with simple configuration.

Regards,

Arkadi


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






