From: Raymond Balmès <raymond.bal...@gmail.com>
Subject: Re: Merge taking forever
To: nutch-user@lucene.apache.org
Date: Friday, June 5, 2009, 2:38 AM
How long does it take for your 6 million URLs to be crawled/parsed/indexed? I'm curious to know because I'm about to take a shot at this area myself, but I have no idea how long it will take.
-Ray-
2009/6/5 John Martyniak <j...@beforedawnsolutions.com>
Arkady,
I think that is the beauty of Nutch: I have built an index of a little more than 6 million URLs with "out of the box" Nutch. I would say that is pretty good for most situations before you have to start getting into Hadoop and multiple machines.
-John
On Jun 4, 2009, at 5:19 PM, <arkadi.kosmy...@csiro.au> wrote:
Hi Andrzej,
-----Original Message-----
From: Andrzej Bialecki [mailto:a...@getopt.org]
Sent: Thursday, June 04, 2009 9:47 PM
To: nutch-user@lucene.apache.org
Subject: Re: Merge taking forever
Bartosz Gadzimski wrote:
As Arkadi said, your HDD is too slow for a 2 x quad-core processor. I have the same problem and am now thinking of using more boxes or very fast drives (SAS 15k).
Raymond Balmès writes:
Well, I suspect the sort function is mono-threaded, as they usually are, so only one core is used and 25% is the maximum you will get. I have a dual core and it only goes to 50% CPU in many of the steps... I assumed that some phases are mono-threaded.
Folks,
From your conversation I suspect that you are running Hadoop with the LocalJobTracker, i.e. in a single JVM - correct?
While this works OK for small datasets, you don't really benefit from map-reduce parallelism (and you still pay the penalty for the overheads). As your dataset grows, you will quickly reach the scalability limits - in this case, the limit of IO throughput of a single drive during the sort phase of a large dataset. The excessive IO demands can be solved by distributing the load (over many drives, and over many machines), which is what HDFS is designed to do well.
Hadoop tasks are usually single-threaded, and additionally the LocalJobTracker implements only a primitive non-parallel model of task execution - i.e. each task is scheduled to run sequentially, in turn. If you run the regular distributed JobTracker, Hadoop splits the load among many tasks running in parallel.
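For reference, which of the two modes is used is controlled by the mapred.job.tracker property in conf/hadoop-site.xml: the value "local" that Nutch ships with selects the single-JVM mode described above, while a host:port value points Hadoop at a real JobTracker. A minimal sketch, for illustration only:

  <!-- conf/hadoop-site.xml -->
  <!-- "local"     = single-JVM local job runner (the default) -->
  <!-- "host:port" = submit jobs to a distributed JobTracker   -->
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>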
So, the solution is this: set up a distributed Hadoop cluster, even if it's going to consist of a single node - because then the data will be split and processed in parallel by several JVM instances. This will also help the operating system to schedule these processes over multiple CPUs. Additionally, if you still experience IO contention, consider moving to HDFS as the filesystem, and spread it over more than one machine and more than one disk in each machine.
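As a rough sketch only (the ports, paths and task counts below are placeholders, not recommendations), a single-node "distributed" setup amounts to something like the following in conf/hadoop-site.xml, plus starting the daemons with bin/start-all.sh:

  <configuration>
    <!-- Use HDFS instead of the local filesystem -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
    <!-- Run a real JobTracker instead of the single-JVM local runner -->
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>
    <!-- Allow several task JVMs per node, so multiple cores are used -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>
    <!-- Spread HDFS blocks over more than one physical disk -->
    <property>
      <name>dfs.data.dir</name>
      <value>/disk1/hadoop/data,/disk2/hadoop/data</value>
    </property>
    <!-- Single node, so do not replicate blocks -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>

Listing more than one directory in dfs.data.dir spreads the IO over several drives, and the tasktracker maximums are what let the OS schedule several task JVMs across your cores.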
Thank you for these recommendations.
I think that there is a large group of users (perhaps limited by budget or the time they are willing to spend) who will give up on trying to use Nutch unless they can run it on a single box with a simple configuration.
Regards,
Arkadi
--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com