I have had a similar problem: the segread tool (actually, any code that
needs to read the segment) would hang forever on a
truncated segment. I think there are at least two bugs: one in the fetcher,
which generated the truncated segment without any error message, and a second
in MapFile/SequenceFile, which deadlocks when reading it. But looking at the
code it is not easy to pinpoint the bugs. Maybe someone else (like Doug) has a
better idea?
The worst part is that there is no way to repair a truncated record, because
any tool that intends to fix it needs to read it first!
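To illustrate the kind of fix a repair tool would need, here is a minimal sketch (plain java.io, not the actual Nutch segment format or API) of reading length-prefixed records so that a truncated trailing record raises EOFException and is skipped, rather than blocking the reader. The 4-byte-length-plus-payload record layout is an assumption for the example only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical sketch: count complete records in possibly-truncated data.
// Record format assumed here: 4-byte length prefix followed by the payload.
public class TruncatedRecordReader {

    // Returns the number of complete records; stops cleanly at truncation
    // instead of hanging.
    public static int countCompleteRecords(byte[] data) {
        int count = 0;
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        try {
            while (true) {
                int len = in.readInt();      // EOFException at a clean end
                byte[] payload = new byte[len];
                in.readFully(payload);       // EOFException if record is truncated
                count++;
            }
        } catch (EOFException e) {
            // Clean end-of-file or a truncated trailing record: either way,
            // return the records read so far.
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Two complete records, then a truncated third
        // (length prefix says 5, but only 2 payload bytes follow).
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(3); out.write(new byte[]{1, 2, 3});
        out.writeInt(2); out.write(new byte[]{4, 5});
        out.writeInt(5); out.write(new byte[]{6, 7}); // truncated
        System.out.println(countCompleteRecords(buf.toByteArray())); // prints 2
    }
}
```

A repair tool built this way could copy the complete records to a new file and drop the damaged tail, which avoids the chicken-and-egg problem above.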
As for parallel indexing on multiple machines, I think you need to copy
the same web db to each machine in order to do it right, and you need to
merge the segments at the end as well.

Thanks
Jay

-----Original Message-----
From: Andy Liu [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 13, 2005 6:55 AM
To: [email protected]
Subject: Re: Optimal segment size?


I've had problems trying to access truncated segments in the past:
the process would hang when I tried to read the segment.  Have you
tried using the segread tool to see whether the segment can be read
correctly?  Have you tried repairing it?  One week for 4 million records
is far too long, so I would say that something is not right.

I don't think there's an optimal segment size, although I find that
smaller segments are easier to manage.  Also, with multiple segments
you can parallelize the indexing process by indexing on separate
machines, or even by running multiple indexing processes on the same
machine.  This helps especially if you're using resource-intensive
index plugins (like language ID).
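As a rough sketch of the single-machine case, one indexing worker per segment can be driven with a standard thread pool. The SegmentIndexer interface and directory names here are hypothetical stand-ins, not the Nutch API; the per-segment indexes would still need to be merged afterward.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: index several segments in parallel on one machine
// by submitting one task per segment to a fixed-size thread pool.
public class ParallelSegmentIndexer {

    // Stand-in for whatever actually indexes one segment directory.
    interface SegmentIndexer {
        void index(String segmentDir);
    }

    public static void indexAll(List<String> segments, SegmentIndexer indexer,
                                int workers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (String seg : segments) {
            pool.submit(() -> indexer.index(seg)); // one task per segment
        }
        pool.shutdown();                           // no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS);  // wait for all segments
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> segments = Arrays.asList("segment-1", "segment-2", "segment-3");
        indexAll(segments, seg -> System.out.println("indexing " + seg), 2);
    }
}
```

The pool size bounds how many resource-intensive plugins (language ID, etc.) run at once, so it can be tuned to the machine rather than to the number of segments.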

On 4/13/05, Luke Baker <[EMAIL PROTECTED]> wrote:
> Hey,
> 
> Is there some sort of optimal or maximum segment size?  I have a segment
> with 3.9 million records and it appears to be taking a really long time
> to index.  The index process has been optimizing the index for over a
> week.  The server I'm running it on is a dual 3.0 GHz Xeon with 2 GB of
> RAM.  I've done 2 million page segments before and the optimizing has
> taken about 48 hours.
> 
> Would a truncated segment cause the optimizing process to take a really
> long time?  I would guess that the optimizing process would just be
> manipulating the index that already has been created and that nothing in
> the segment itself would cause the optimizing part to take a really long
> time.
> 
> I have confirmed that the process is still running and modifying files
> in the index directory.  Would the underlying filesystem play any role
> in all this?  I'm using ext3.
> 
> Thanks,
> 
> Luke
>


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
