On 08/20/2011 09:58 PM, Andras Salamon wrote:
> On Fri, Aug 19, 2011 at 11:54:46PM +0100, Pádraig Brady wrote:
>> On 08/18/2011 03:30 PM, Andras Salamon wrote:
>>> I am seeing repeated (but not reliably repeatable) segmentation faults
>>> sorting datasets in the 100MB-100GB range on a 64-bit Debian system
>>> using GNU sort 8.12 (and also 8.9). Stack traces seem to indicate
>>> problems during the merge phase, usually when the temporary files
>>> are being combined.
>
>> Andras, could you give the exact command line you're having issues with,
>> and perhaps make sort inputs available too?
>
> The sort inputs are several-gigabyte-range files containing strings,
> each typically 60 to 140 bytes long, one per line. There are
> many duplicates, and the first reason to sort is to establish the
> distribution of duplicates. I would be happy to make available data
> if I could find a reasonably sized file that causes a reproducible
> segfault. The problem seems easier to reproduce with larger files,
> unfortunately.
>
>> Do the --batch-size=NMERGE or --compress-program=PROG options change
>> anything?
>
> Thanks for the suggestion, I will try forcing smaller batches.
>
> Compressing batches was something I tried early on with no apparent
> change in likelihood of failure, but it led to much slower runtimes.
>
>> Also there were temp file handling changes made in 7.2 so could you try:
>> ftp://ftp.gnu.org/gnu/coreutils/coreutils-7.1.tar.gz
>
> Here are some of the relevant-seeming parts of a gdb session for
> coreutils-7.1.

If this happens with a 2.5-year-old sort, I'd be leaning towards a local issue.

> (gdb) bt
> #0 0x000000000040e6bc in memcoll (
>    s1=0x7800000005824d58 <Address 0x7800000005824d58 out of bounds>,
>    s1len=15564440312192434243, s2=0x2b2a1a0
> "<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066."...,
>    s2len=68)
>    at memcoll.c:50
> #1 0x000000000040af4c in xmemcoll (
>    s1=0x7800000005824d58 <Address 0x7800000005824d58 out of bounds>,
>    s1len=15564440312192434243, s2=0x2b2a1a0
> "<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066.fasta>\n<http://scinets.org/item/cria214s2ria214u225719i~files~P58066."...,
>    s2len=68)
>    at xmemcoll.c:43
> #2 0x00000000004059ee in compare (a=0x5b4a7f0, b=0x301dfc0) at sort.c:2059
> #3 0x0000000000406815 in mergefps (files=0x24063e0, ntemps=15, nfiles=15,
>    ofp=0x23ff8e0, output_file=0x24062ec "/home/a/tmp/sortcOqzkh")
>    at sort.c:2326
> #4 0x000000000040708f in merge (files=0x24063e0, ntemps=16, nfiles=32,
>    output_file=0x0) at sort.c:2567
> #5 0x000000000040766a in sort (files=0x61c660, nfiles=0, output_file=0x0)
>    at sort.c:2699
> #6 0x000000000040908c in main (argc=5, argv=0x7fff149247a8) at sort.c:3425

So the 'a' line struct is corrupted. Taking s1 and s1len from frame #0
(s1len converted from decimal to hex):

  a->text   = 0x7800000005824D58
  a->length = 0xD800000000000043

Notice the 0x78 and 0xD8 in the top byte of each value.
They should be 0x00.

Now, is this software or hardware?
It looks like hardware TBH, as there are 4 bits incorrectly set
in each of those bytes (which ECC couldn't correct, if you have that),
and each incorrect bit is beside another.

cheers,
Pádraig.
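
For anyone wanting to check the bit pattern described above, here is a minimal Python sketch. It only uses the two values printed by gdb (s1 and s1len); the helper names are made up for illustration, and the "top byte should be 0x00" premise is the assumption from the analysis above, not something the script can prove.

```python
# Values copied from frame #0 of the backtrace:
# s1 is a->text, s1len is a->length (gdb printed the latter in decimal).
text = 0x7800000005824D58
length = 15564440312192434243

def top_byte(value):
    """Return the most significant byte of a 64-bit value."""
    return (value >> 56) & 0xFF

def set_bit_positions(byte):
    """List the positions (0 = least significant) of the 1 bits in a byte."""
    return [i for i in range(8) if byte & (1 << i)]

def clear_top_byte(value):
    """Hypothetical 'repaired' value, assuming the top byte should be 0x00."""
    return value & ((1 << 56) - 1)

length_hex = f"{length:016X}"
text_top = top_byte(text)
length_top = top_byte(length)

print("a->length in hex:", length_hex)
print(f"top byte of a->text:   0x{text_top:02X} bits {set_bit_positions(text_top)}")
print(f"top byte of a->length: 0x{length_top:02X} bits {set_bit_positions(length_top)}")
print("a->length with top byte cleared:", clear_top_byte(length))
print(f"a->text with top byte cleared:   0x{clear_top_byte(text):X}")
```

With the top byte cleared, a->length comes out as 67, which is at least in the same range as s2len=68 and the 60 to 140 byte lines described earlier in the thread.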
