Andrzej:
Thank you for your response to my comments.
The reason I said there may be a bug in the fetcher is that in our case
there was no JVM crash or OOM exception during the fetch, and according to
the log file the fetch process completed successfully. So I cannot tell
what caused the truncation (the unexpected EOF exception in the reader).

The problem with the MapFile is not that performance drops; instead, it
simply hangs on a deadlock (I looked at the thread dump). I do not
understand why segread would need a write lock on the segment.
Your proposed manual step to fix the index may not work, simply because
the unexpected EOF is in the fetch output (the sequence file), not the index.
Unfortunately I deleted the bad segment; otherwise I would give it a try.
Thanks again!

Jay


-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 13, 2005 10:42 AM
To: [email protected]
Subject: MapFile.Reader bug (Re: Optimal segment size?)


Jay Yu wrote:
> I have a similar problem when the segread tool (acutually any code that
> needs to read the seg) was just hanging there forever on a 
> truncated segment. I think there are at least 2 bugs: one in the fetcher
> which generated the truncated seg without any error message, the 2nd is
the

Well, truncated segments are created only in the case of a fatal failure,
like an OOM exception or a JVM crash. So there is really no way to produce
any message beyond the usual ones in such cases...

> MapFile/SequenceFile which generates the deadlock. But looking at the
> code it is not easy to pinpoint the bugs. Maybe someone else (like Doug)
> has a better idea?

The "bug" (or misfeature) of MapFile.Reader is that it silently assumes 
it is ok to deal with a truncated file. In reality, the tradeoff is a 
slowdown of two- or more orders of magnitude for random seeking. If the 
intended use is to process the file sequentially (as many tools do 
this), then it's ok. In other cases, if the file is used for intensive 
random seeking, then the processing performance will drop drastically.
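
To make the tradeoff concrete, here is a rough sketch of the difference 
(for illustration only; the class and method names are made up, this is 
not the actual MapFile code):

class IndexedLookupSketch {
  // Sampled entries loaded from the "index" file: every Nth key and its
  // byte offset in the "data" file.
  String[] indexKeys;
  long[] indexPositions;

  // Intact index: binary-search in memory, seek near the key, then scan
  // only a handful of records in the data file.
  long seekPosition(String key) {
    int lo = 0, hi = indexKeys.length - 1, best = 0;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (indexKeys[mid].compareTo(key) <= 0) { best = mid; lo = mid + 1; }
      else { hi = mid - 1; }
    }
    return indexPositions[best];
  }

  // Truncated or missing index: the only safe starting point is offset 0,
  // so every random lookup degenerates into a scan of the whole data file
  // -- hence the slowdown of two or more orders of magnitude.
  long seekPositionWithoutIndex(String key) {
    return 0L;
  }
}

Sequential readers never notice the problem, because they start at 
offset 0 anyway.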

I believe the correct fix is to refuse to open a truncated MapFile unless 
an "override" flag is provided. That way it will be easy to detect this 
situation and fix corrupted segments when really needed.
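
Roughly, the check I have in mind would look like this (a sketch only; 
the class and helper names are placeholders, and it assumes the reader 
can compare how many entries the index promises with how many the data 
file actually holds):

import java.io.File;
import java.io.IOException;

class TruncationCheck {
  // Sketch of the proposed open-time check; helper names are hypothetical.
  static void checkOnOpen(File mapFileDir, boolean allowTruncated)
      throws IOException {
    long promised = countIndexedEntries(new File(mapFileDir, "index"));
    long present = countDataEntries(new File(mapFileDir, "data"));
    if (present < promised && !allowTruncated) {
      throw new IOException("Truncated MapFile in " + mapFileDir
          + ": index lists " + promised + " entries but data file holds "
          + present + "; set the override flag only to repair it");
    }
  }

  // Stubs standing in for reading the real index/data file formats.
  static long countIndexedEntries(File f) { return 0L; }
  static long countDataEntries(File f) { return 0L; }
}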

If this sounds like a proper way to address this problem, I'll prepare a 
patch.

> The worst part is that there is no way to fix that truncated record
> because any tool that intends to fix it needs to read it first!

Erhm, not true. Currently this involves a bit of a manual procedure, but 
it can be done. First, you need to delete the partial "index" files from 
the affected directories. Then run the segread -fix command; it will 
create new "index" files.
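
Something along these lines would do step one (just a sketch; the segment 
path is an example, and after it runs you still need to invoke 
segread -fix to regenerate the index files):

import java.io.File;

public class DeletePartialIndexes {
  public static void main(String[] args) {
    // Example segment path -- substitute the affected segment directory.
    File segment = new File(args.length > 0 ? args[0]
        : "segments/20050413104200");
    File[] subdirs = segment.listFiles();
    if (subdirs == null) return;
    for (File dir : subdirs) {
      // Each subdirectory (content, parse_data, etc.) is a MapFile dir
      // holding "index" and "data"; remove the partial "index" only.
      File index = new File(dir, "index");
      if (index.isFile() && index.delete()) {
        System.out.println("deleted " + index);
      }
    }
  }
}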

> As for the parallel indexing on multiple machines, I think you need to
> copy the same web db over in order to do it right and you need to merge
> the segments in the end too.

Indexing doesn't use WebDB at all. However, at this moment there is no 
straightforward way to do this in parallel (unless the new MapReduce 
code can be used for that?).

-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

