I have a similar problem when the segread tool (actually any code that
needs to read the seg) was just hanging there forever on a truncated segment. I think there are at least 2 bugs: one in the fetcher
which generated the truncated seg without any error message, the 2nd is the
Well, truncated segments are created only in case of a fatal bug, like OOM Exception or a JVM crash. So, there is really no way to produce any message except just the usual messages in such cases...
MapFile/SequenceFile which generates the deadlock. But looking at the code, it is not easy to pinpoint the bugs. Maybe someone else (like Doug) has a better idea?
The "bug" (or misfeature) of MapFile.Reader is that it silently assumes it is OK to deal with a truncated file. In reality, the tradeoff is a slowdown of two or more orders of magnitude for random seeking. If the intended use is to process the file sequentially (as many tools do), then it's OK. In other cases, where the file is used for intensive random seeking, processing performance will drop drastically.
I believe the correct fix is to refuse to open a truncated MapFile unless an "override" flag is provided. This way, it will be easy to detect this situation and to fix corrupted segments when really needed.
If this sounds like a proper way to address this problem, I'll prepare a patch.
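To make the proposal a bit more concrete, here is a rough sketch of the kind of check I mean. It is plain Java, not the actual MapFile.Reader code, and every name in it (MapFileGuard, looksTruncated, checkOpenable, allowTruncated) is made up for illustration. It also assumes a very crude truncation test: a MapFile directory whose "index" file is missing or empty is treated as truncated. A real patch would need a stricter test, since a partial index can be non-empty.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch: illustrates "refuse unless overridden".
final class MapFileGuard {

    // Assumption: a missing or zero-length "index" file marks a truncated MapFile.
    static boolean looksTruncated(Path mapFileDir) throws IOException {
        Path index = mapFileDir.resolve("index");
        return !Files.isRegularFile(index) || Files.size(index) == 0L;
    }

    // Throws unless the MapFile looks complete or the caller explicitly overrides.
    static void checkOpenable(Path mapFileDir, boolean allowTruncated) throws IOException {
        if (looksTruncated(mapFileDir) && !allowTruncated) {
            throw new IOException("MapFile appears truncated (no usable index): "
                    + mapFileDir + "; pass the override flag to open it anyway");
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: MapFileGuard <mapfile-dir> [override]");
            return;
        }
        boolean override = args.length > 1 && "override".equals(args[1]);
        checkOpenable(Paths.get(args[0]), override);
        System.out.println("OK to open: " + args[0]);
    }
}
```

With something like this in front of the reader, tools that only scan sequentially could pass the override, while anything that relies on random seeking would fail fast instead of crawling or hanging.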
The worst part is that there is no way to fix that truncated record because any tool that intends to fix it needs to read it first!
Erhm. Not true. Currently this involves a bit of a manual procedure, but it can be done. First, you need to delete the partial "index" files from the affected directories. Then, run the segread -fix command; it will create new "index" files.
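If you have many affected directories, the first step can be scripted. The following is only a sketch under my own assumptions: the class and method names (IndexCleaner, deletePartialIndexes) are hypothetical, and you have to supply the list of affected directories yourself, because nothing here decides which ones are damaged. It removes only the file named "index" in each directory and leaves everything else alone.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: automates the "delete partial index" step.
final class IndexCleaner {

    // Deletes the "index" file, if present, from each given directory.
    // Returns the paths that were actually removed.
    static List<Path> deletePartialIndexes(List<Path> affectedDirs) throws IOException {
        List<Path> removed = new ArrayList<>();
        for (Path dir : affectedDirs) {
            Path index = dir.resolve("index");
            if (Files.deleteIfExists(index)) {
                removed.add(index);
            }
        }
        return removed;
    }
}
```

After that, run segread -fix as described above to regenerate the "index" files.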
As for the parallel indexing on multiple machines, I think you need to copy the same web db over in order to do it right and you need to merge the segments in the end too.
Indexing doesn't use WebDB at all. However, at this moment there is no straightforward way to do this in parallel (unless the new MapReduce code can be used for that?).
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
