Andrzej: Thank you for your response to my comments. The reason I said there may be a bug in the fetcher is that in our case there was no JVM crash or OOM exception during the fetch, and judging from the log file the fetch process completed successfully. So I cannot tell what caused the truncation (the Unexpected EOF exception in the reader).
The problem with the MapFile is not that performance drops; it simply hangs on a deadlock (I looked at the thread dump). I do not understand why segread would need a write lock on the segment. Your proposed manual step to fix the index may not work, simply because the Unexpected EOF is in the fetch output (the sequence file), not in the index. Unfortunately I deleted the bad segment, otherwise I would give it a try.

Thanks again!
Jay

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 13, 2005 10:42 AM
To: [email protected]
Subject: MapFile.Reader bug (Re: Optimal segment size?)

Jay Yu wrote:
> I have a similar problem when the segread tool (actually any code that
> needs to read the seg) was just hanging there forever on a
> truncated segment. I think there are at least 2 bugs: one in the fetcher
> which generated the truncated seg without any error message, the 2nd is the

Well, truncated segments are created only in case of a fatal failure, like an OOM exception or a JVM crash. So there is really no way to produce any message beyond the usual ones in such cases...

> MapFile/SequenceFile which generates the deadlock. But looking at the code
> it is not easy to pinpoint the bugs. Maybe someone else (like Doug) has a
> better idea?

The "bug" (or misfeature) of MapFile.Reader is that it silently assumes it is ok to deal with a truncated file. In reality, the tradeoff is a slowdown of two or more orders of magnitude for random seeking. If the intended use is to process the file sequentially (as many tools do), then it's ok. In other cases, if the file is used for intensive random seeking, the processing performance will drop drastically.

I believe the correct fix is to refuse to open a truncated MapFile unless an "override" flag is provided. This way it will be easy to detect this situation and fix corrupted segments when really needed. If this sounds like a proper way to address the problem, I'll prepare a patch.

> The worst part is that there is no way to fix that truncated record because
> any tool that intends to fix it needs to read it first!

Erhm. Not true. Currently this involves a bit of a manual procedure, but it can be done. First, you need to delete the partial "index" files from the affected directories. Then run the segread -fix command - it will create new "index" files.

> As for the parallel indexing on multiple machines, I think you need to copy
> the same web db over in order to do it right and you need to merge the
> segments in the end too.

Indexing doesn't use the WebDB at all. However, at this moment there is no straightforward way to do this in parallel (unless the new MapReduce code can be used for that?).

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
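A minimal sketch of the open-time guard Andrzej proposes, for illustration only. It assumes the standard "data"/"index" file pair inside a MapFile directory; the class itself, the looksTruncated() heuristic, and the allowTruncated flag are hypothetical and not part of the existing MapFile.Reader API:

    import java.io.File;
    import java.io.IOException;

    public class TruncatedMapFileGuard {

      /** Thrown when a MapFile looks truncated and no override was given. */
      public static class TruncatedMapFileException extends IOException {
        public TruncatedMapFileException(String msg) { super(msg); }
      }

      /**
       * Crude truncation heuristic: both the "data" and "index" files must
       * exist and be non-empty (File.length() returns 0 for a missing file).
       * A real implementation would instead compare the last index entry
       * against the actual length of the data file.
       */
      public static boolean looksTruncated(File mapFileDir) {
        File data  = new File(mapFileDir, "data");
        File index = new File(mapFileDir, "index");
        return data.length() == 0 || index.length() == 0;
      }

      /**
       * Guard to run before constructing a MapFile.Reader. With
       * allowTruncated == false, a truncated file fails fast at open time
       * instead of silently falling back to slow sequential seeking.
       */
      public static void checkBeforeOpen(File mapFileDir, boolean allowTruncated)
          throws IOException {
        if (!allowTruncated && looksTruncated(mapFileDir)) {
          throw new TruncatedMapFileException(
              "MapFile at " + mapFileDir + " appears truncated; "
              + "open with the override flag or repair it with segread -fix");
        }
      }
    }

With the override defaulting to false, a corrupted segment would be detected the moment a tool opens it, rather than manifesting as the drastic random-seek slowdown (or the hang Jay reports) described above.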
