On 03/11/2013 03:47 PM, McFarland, Jeffrey wrote:
> I have come across some odd results regarding the sort utility in coreutils
> version 8.20. I’ve looked through the archives and don’t see any similar
> issues so it may be something specific to our systems.
>
> System: SunOS 5.10 Generic_147440-26 sun4u sparc SUNW,Sun-Fire-V890
>
> Issue: When running sort on a 22.5 GB file I found that about 1 out of 10
> times the process seems to hang (out of 100+ tests). The process is still
> running but the temp files are no longer changing and the final file either
> has not been created or is a 0 byte file. When this happens the temp files
> are never in the exact same state as a previous run. On this machine a
> complete sort normally takes about 20 minutes. On one occasion the process
> hung for over 48 hours before I killed it. Running top shows no significant
> load on the system.
>
> Command run:
>
> ./sort -t\n -S 256M --batch-size=100 -T /disk/craiwk01/prod/SORTWK -T
> /disk/craiwk02/prod/SORTWK -T /disk/craiwk03/prod/SORTWK -T
> /disk/craiwk04/prod/SORTWK -T /disk/craiwk06/prod/SORTWK -k1.1,1.10 infile -o
> infile.sorted
>
>>: ps
>   PID TTY      TIME CMD
> 16328 pts/3    5:06 sort
> 12697 pts/3    0:00 ps
>
>>: sudo truss -rall -wall -f -p 16328
> 16328: lwp_park(0x00000000, 0) (sleeping...)
>
>>: sudo pstack 16328
> 16328: /usr/local/abacus/etsort/sort -tn -S 295063 --batch-size=100 -T /disk/
> ----------------- lwp# 1 / thread# 1 --------------------
> ffffffff7d4d8818 lwp_park (0, 0, 0)
> 0000000100009c74 sortlines (111b56580, 111c56080, ffffffff7fffeab0,
> 10012a321, ffffffff7fffead0, 10012a328) + 514
> 000000010000a5cc sortlines (111558380, 2, ffffffff7fffeab0, 1121765e0, 0,
> ffffffff7fffeab0) + e6c
> 000000010000a5cc sortlines (111956f80, 4, ffffffff7fffeab0, 112176420, 0,
> ffffffff7fffeab0) + e6c
> 000000010000a5cc sortlines (112154760, 8, ffffffff7fffeab0, 1121760a0, 1,
> ffffffff7fffeab0) + e6c
> 000000010000c070 sort (10012a740, 0, ffffffff7fffead0, 23, 10012cddd,
> 112154760) + 350
> 000000010000e6e8 main (13, ffffffff7ffff148, 0, 10012c220, fffd, 10012b1e0) +
> 1ee8
> 00000001000041bc _start (0, 0, 0, 0, 0, 0) + 7c
> ----------------- lwp# 240 / thread# 240 --------------------
> 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
> ** zombie (exited, not detached, not yet joined) **
> ----------------- lwp# 241 / thread# 241 --------------------
> 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
> ** zombie (exited, not detached, not yet joined) **
> ----------------- lwp# 242 / thread# 242 --------------------
> 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
> ** zombie (exited, not detached, not yet joined) **
>
> If I change the sort to run as a single threaded process (add “--parallel=1”
> to above command) then it doesn’t hang. This makes me think that it’s most
> likely a threading issue. I ran the same tests on a LINUX machine and it did
> not have the same hanging issue so it’s most likely only an issue with
> Solaris.
>
> I initially found this issue using coreutils 8.9 and I changed to 8.20 to see
> if a fix had been made but no luck.
>
> Is this a known issue? Are there any additional tests I should run to
> further narrow down this issue?

I can't think of anything TBH.

There may possibly be some portability issues with --compress and
--parallel (due to possibly non-async-signal-safe functions being called
after a fork), but you're not using --compress, so we can discount that
at least.

Whether the bug is in coreutils or in Solaris, adding some sleeps
might help trigger the race more quickly.

BTW the `sort -t\n` looks strange. Did you mean:

  sort -t$'\n'

thanks,
Pádraig.
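
To illustrate the `-t\n` point: with an unquoted backslash the shell strips
the backslash before sort ever sees the argument, which matches the `-tn`
shown in the pstack output above. A minimal sketch, assuming a shell with
ANSI-C quoting (bash, ksh93 or zsh); the variable names are only for
illustration:

```shell
#!/usr/bin/env bash
# Unquoted backslash: the shell removes it, so sort would receive "-tn"
# and use the letter 'n' as the field separator.
plain=-t\n

# ANSI-C quoting: $'\n' expands to a real newline character
# (assumes bash, ksh93 or zsh; a plain POSIX sh may not support it).
ansi=-t$'\n'

# Dump both arguments byte by byte to see what sort would be given.
printf '%s' "$plain" | od -c
printf '%s' "$ansi"  | od -c
```

Comparing the two od dumps shows whether the separator reaching sort is the
letter n or an actual newline.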
