I have come across some odd results regarding the sort utility in coreutils
version 8.20. I've looked through the archives and don't see any similar
issues so it may be something specific to our systems.
System: SunOS 5.10 Generic_147440-26 sun4u sparc SUNW,Sun-Fire-V890
Issue: When running sort on a 22.5 GB file, the process appears to hang in
about 1 out of every 10 runs (based on 100+ tests). The process is still
running but the temp files are no longer changing and the final file either has
not been created or is a 0 byte file. When this happens the temp files are
never in the exact same state as a previous run. On this machine a complete
sort normally takes about 20 minutes. On one occasion the process hung for
over 48 hours before I killed it. Running top shows no significant load on the
system.
Command run:
./sort -t\n -S 256M --batch-size=100 -T /disk/craiwk01/prod/SORTWK -T
/disk/craiwk02/prod/SORTWK -T /disk/craiwk03/prod/SORTWK -T
/disk/craiwk04/prod/SORTWK -T /disk/craiwk06/prod/SORTWK -k1.1,1.10 infile -o
infile.sorted
>: ps
PID TTY TIME CMD
16328 pts/3 5:06 sort
12697 pts/3 0:00 ps
>: sudo truss -rall -wall -f -p 16328
16328: lwp_park(0x00000000, 0) (sleeping...)
>: sudo pstack 16328
16328: /usr/local/abacus/etsort/sort -tn -S 295063 --batch-size=100 -T /disk/
----------------- lwp# 1 / thread# 1 --------------------
ffffffff7d4d8818 lwp_park (0, 0, 0)
0000000100009c74 sortlines (111b56580, 111c56080, ffffffff7fffeab0, 10012a321,
ffffffff7fffead0, 10012a328) + 514
000000010000a5cc sortlines (111558380, 2, ffffffff7fffeab0, 1121765e0, 0,
ffffffff7fffeab0) + e6c
000000010000a5cc sortlines (111956f80, 4, ffffffff7fffeab0, 112176420, 0,
ffffffff7fffeab0) + e6c
000000010000a5cc sortlines (112154760, 8, ffffffff7fffeab0, 1121760a0, 1,
ffffffff7fffeab0) + e6c
000000010000c070 sort (10012a740, 0, ffffffff7fffead0, 23, 10012cddd,
112154760) + 350
000000010000e6e8 main (13, ffffffff7ffff148, 0, 10012c220, fffd, 10012b1e0) +
1ee8
00000001000041bc _start (0, 0, 0, 0, 0, 0) + 7c
----------------- lwp# 240 / thread# 240 --------------------
000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
** zombie (exited, not detached, not yet joined) **
----------------- lwp# 241 / thread# 241 --------------------
000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
** zombie (exited, not detached, not yet joined) **
----------------- lwp# 242 / thread# 242 --------------------
000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
** zombie (exited, not detached, not yet joined) **
If I change the sort to run as a single-threaded process (by adding
"--parallel=1" to the above command), it doesn't hang. This makes me think
that it's most likely a threading issue. I ran the same tests on a Linux
machine and it did not hang there, so the issue is most likely specific to
Solaris.
I initially found this issue using coreutils 8.9 and then switched to 8.20 to
see if a fix had been made, but the hang still occurs.
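For anyone trying to reproduce the intermittent hang, the repeated runs could
be automated with something like the sketch below. This is not the exact
procedure I used. It assumes the timeout utility from the same coreutils build
is on the PATH and that the script runs under a POSIX shell (ksh or bash rather
than the legacy Solaris /bin/sh). The function name count_hangs is made up for
illustration.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times under a wall-clock limit and
# print how many of those runs had to be killed by timeout(1).
# Usage: count_hangs RUNS LIMIT_SECONDS COMMAND [ARGS...]
count_hangs() {
    runs=$1
    limit=$2
    shift 2
    hangs=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Discard the command's own output so only the count is printed.
        timeout "$limit" "$@" >/dev/null 2>&1
        status=$?
        # timeout exits with status 124 when the time limit was reached.
        if [ "$status" -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
        i=$((i + 1))
    done
    echo "$hangs"
}
```

For example, "count_hangs 10 3600 ./sort --batch-size=100 -k1.1,1.10 infile -o
infile.sorted" (plus the remaining options from the command above) would print
how many of 10 runs were still going after an hour. The one-hour limit is an
arbitrary choice, picked only because a normal run takes about 20 minutes.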
Is this a known issue? Are there any additional tests I should run to further
narrow down this issue?
Thanks,
Jeff