From: Raymond Balmès <raymond.bal...@gmail.com>
Subject: Re: Merge taking forever
To: nutch-user@lucene.apache.org
Date: Friday, June 5, 2009, 2:38 AM
How long does it take for your 6 million URLs to be crawled/parsed/indexed? I'm curious to know because I'm about to take a shot at this area myself, but I have no idea how long it will take.
-Ray-
2009/6/5 John Martyniak <j...@beforedawnsolutions.com>
Arkady,
I think that is the beauty of Nutch: I have built an index of a little more than 6 million URLs with "out of the box" Nutch. I would say that is pretty good for most situations before you have to start getting into Hadoop and multiple machines.
-John
On Jun 4, 2009, at 5:19 PM, <arkadi.kosmy...@csiro.au> wrote:
Hi Andrzej,
-----Original Message-----
From: Andrzej Bialecki [mailto:a...@getopt.org]
Sent: Thursday, June 04, 2009 9:47 PM
To: nutch-user@lucene.apache.org
Subject: Re: Merge taking forever
Bartosz Gadzimski wrote:
As Arkadi said, your HDD is too slow for a 2 x quad-core processor. I have the same problem and am now thinking of using more boxes or very fast drives (SAS 15k).
Raymond Balmès writes:
Well, I suspect the sort function is mono-threaded, as they usually are, so only one core is used and 25% is the maximum you will get. I have a dual core and it only goes to 50% CPU in many of the steps... I assumed that some phases are mono-threaded.
Folks,
From your conversation I suspect that you are running Hadoop with the LocalJobTracker, i.e. in a single JVM - correct?
While this works OK for small datasets, you don't really benefit from map-reduce parallelism (and you still pay the penalty for the overheads). As your dataset grows, you will quickly reach the scalability limits - in this case, the limit of IO throughput of a single drive during the sort phase of a large dataset. The excessive IO demands can be solved by distributing the load (over many drives, and over many machines), which is what HDFS is designed to do well.
Hadoop tasks are usually single-threaded, and additionally the LocalJobTracker implements only a primitive non-parallel model of task execution - i.e. each task is scheduled to run sequentially, in turn. If you run the regular distributed JobTracker, Hadoop splits the load among many tasks running in parallel.
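For reference, which of the two modes is used is controlled by the mapred.job.tracker property in conf/hadoop-site.xml: the value "local" that Nutch ships with selects the single-JVM mode described above, while a host:port value points Hadoop at a real JobTracker. A minimal sketch, for illustration only:

  <!-- conf/hadoop-site.xml -->
  <!-- "local"     = single-JVM local job runner (the default) -->
  <!-- "host:port" = submit jobs to a distributed JobTracker   -->
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>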
So, the solution is this: set up a distributed Hadoop cluster, even if it's going to consist of a single node - because then the data will be split and processed in parallel by several JVM instances. This will also help the operating system to schedule these processes over multiple CPUs. Additionally, if you still experience IO contention, consider moving to HDFS as the filesystem, and spread it over more than one machine and more than one disk in each machine.
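As a rough sketch only (the ports, paths and task counts below are placeholders, not recommendations), a single-node "distributed" setup amounts to something like the following in conf/hadoop-site.xml, plus starting the daemons with bin/start-all.sh:

  <configuration>
    <!-- Use HDFS instead of the local filesystem -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
    <!-- Run a real JobTracker instead of the single-JVM local runner -->
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>
    <!-- Allow several task JVMs per node, so multiple cores are used -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>
    <!-- Spread HDFS blocks over more than one physical disk -->
    <property>
      <name>dfs.data.dir</name>
      <value>/disk1/hadoop/data,/disk2/hadoop/data</value>
    </property>
    <!-- Single node, so do not replicate blocks -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>

Listing more than one directory in dfs.data.dir spreads the IO over several drives, and the tasktracker maximums are what let the OS schedule several task JVMs across your cores.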
Thank you for these recommendations.
I think that there is a large group of users (perhaps limited by budget or the time they are willing to spend) who will give up on trying to use Nutch unless they can run it on a single box with a simple configuration.
Regards,
Arkadi
--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com