Thanks Andrzej! Matthias
Andrzej Bialecki wrote:
Matthias Jaekle wrote:
Hi Andrzej,
I have copied the 2 segments to: http://www.iventax.de/NEW/
You might wget -r them. Both of them are around 2 GB. If I should tar them or something else, please let me know.
Thanks for your help!
Only got around to it now... I'm getting the segments now, should be able to take a look at them tomorrow afternoon CET.
Ok, I discovered what the problem was. You didn't say exactly how the mergesegs process "breaks", but I'm pretty sure that you just didn't see any progress for a long, long time... right? So it looked as if it was stuck.
The real cause of this behaviour is a non-obvious treatment of segment data with partially truncated "index" files. The pairs of {data,index} files are handled together by MapFile. The "index" file is a sort of directory of entries, and it is used to speed up random access to individual entries. However, if it's truncated, the MapFile.Reader only prints a warning and still opens the files - but now the part of the "data" file that lies past the last entry recorded in the "index" is searched sequentially every time an entry in that part needs to be found... This completely destroys the performance of MapFile.Reader.
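To make the effect concrete, here is a toy model in Python. It is not Nutch's MapFile code: the names `build_index` and `lookup`, the interval of 128 and the record count are all made up for illustration. It only assumes what is described above, i.e. a sorted "data" sequence plus a sparse "index" of (key, position) pairs, and it counts how many records have to be read to find a key near the end of the data.

```python
import bisect

def build_index(keys, interval):
    """Sparse index: one (key, position) pair for every `interval`-th record."""
    return [(keys[i], i) for i in range(0, len(keys), interval)]

def lookup(keys, index, target):
    """Find `target` in sorted `keys`; return (position, records_scanned).

    The index only picks the starting point; from there the data is
    read sequentially until the key is found or passed.
    """
    index_keys = [k for k, _ in index]
    slot = bisect.bisect_right(index_keys, target) - 1
    start = index[slot][1] if slot >= 0 else 0
    scanned = 0
    for pos in range(start, len(keys)):
        scanned += 1
        if keys[pos] == target:
            return pos, scanned
        if keys[pos] > target:
            break
    return None, scanned

keys = list(range(100_000))            # stand-in for the sorted "data" file
full_index = build_index(keys, 128)    # intact "index"
truncated_index = full_index[:10]      # "index" cut off after 10 entries

_, fast = lookup(keys, full_index, 99_999)
_, slow = lookup(keys, truncated_index, 99_999)
print(fast, slow)
```

With the intact index a lookup never reads more than one interval's worth of records; with the truncated one, every lookup past the last indexed key re-reads almost the whole tail of the data, and a merge performs such a lookup over and over.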
Arguably, the MapFile.Reader should throw an exception in such a case, because for most applications the performance loss is unacceptable.
So, the reality is that your mergesegs process is not stuck, it is just progressing at a snail's pace.
The solution to this is to remove all "index" files from the truncated segment's directory. Then simply run the mergesegs command - it uses an equivalent of "segread -fix" to re-create the "index" files.
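A small helper for the first step could look like the sketch below. It assumes that the segment directory contains subdirectories which each hold a "data" and an "index" file; the function name `remove_index_files` and the path in the usage note are made up. It only touches files named exactly "index" that sit next to a "data" file, and it defaults to a dry run.

```python
from pathlib import Path

def remove_index_files(segment_dir, dry_run=True):
    """Return the "index" files found next to a "data" file under segment_dir.

    With dry_run=False the files are also deleted. Anything else in the
    segment directory is left alone.
    """
    matched = []
    for path in sorted(Path(segment_dir).rglob("index")):
        if path.is_file() and (path.parent / "data").is_file():
            matched.append(path)
            if not dry_run:
                path.unlink()
    return matched
```

Call it first as `remove_index_files("/path/to/truncated-segment")` (a placeholder path) to see what would be removed, then again with `dry_run=False`, and run mergesegs afterwards as described above.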
BTW, I have successfully finished the "mergesegs" command on the segments that you put up for download - initially it would probably have taken years to finish the process, but after fixing them in the way I described above the performance went back to normal, i.e. roughly 300 rec/s on my hardware.
-- Best regards, Andrzej Bialecki. Information Retrieval, Semantic Web; Embedded Unix, System Integration. http://www.sigram.com Contact: info at sigram dot com
Nutch-developers mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/nutch-developers
