Hi, I'm using Python with ZeroMQ to distribute data around an HPC cluster. The results have been good apart from one issue which I am completely stuck with:
We are using marshal for serialising objects before distributing them around the cluster, and extremely occasionally a corrupted marshal is produced. The current workaround is to serialise everything twice and check that the two serialisations are the same. On the rare occasions that they are not, I have dumped the files for comparison.

It turns out that there are a few positions within the serialisation where corruption tends to occur (these positions seem to be independent of the data and of the size of the complete serialisation). These are:

4 bytes starting at 548867 (0x86003)
4 bytes starting at 4398083 (0x431c03)
4 bytes starting at 17595395 (0x10c7c03)
4 bytes starting at 19794819 (0x12e0b83)
4 bytes starting at 22269171 (0x153ccf3)
2 bytes starting at 25052819 (0x17e4693)
3 bytes starting at 28184419 (0x1ae0f63)

I note that the ratio between each of the later positions and the one before it is almost exactly 1.125. Presumably this has something to do with memory allocation somewhere?

Some datapoints:

- The phenomenon has been observed in a single-threaded process without ZeroMQ
- I think the phenomenon has been observed in pickled as well as marshalled data
- The phenomenon has been observed on different hardware

Unfortunately, after quite a lot of work, I still haven't managed to reproduce this error on a single machine.

Hopefully the above is enough information for someone to speculate as to where the problem is.

Many thanks in advance for any help.

Regards,
Graham
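
P.S. In case it helps anyone follow along, here is a minimal sketch of the two things described above: the ratio check on the corruption offsets, and the serialise-twice workaround. The names successive_ratios and dumps_checked are made up for illustration and are not our production code. The sketch also assumes that marshalling an unchanged object twice should give identical bytes, which is the same assumption the workaround itself relies on.

```python
import marshal

# Byte offsets at which corruption was observed, copied from the list above.
OFFSETS = [548867, 4398083, 17595395, 19794819, 22269171, 25052819, 28184419]


def successive_ratios(offsets):
    """Return the ratio of each offset to the one before it."""
    # float() keeps this a true division on older Python versions too.
    return [float(b) / a for a, b in zip(offsets, offsets[1:])]


def dumps_checked(obj):
    """Serialise obj twice and return the bytes only if both results agree.

    Hypothetical helper: it raises ValueError on a mismatch, whereas the
    real workaround dumps both serialisations to files for comparison.
    """
    first = marshal.dumps(obj)
    second = marshal.dumps(obj)
    if first != second:
        raise ValueError("marshal produced two different serialisations")
    return first


if __name__ == "__main__":
    for ratio in successive_ratios(OFFSETS):
        print(round(ratio, 4))
```

Only the last four ratios come out close to 1.125; the first two do not follow that pattern.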