Tomasz Kuzemko created TS-1648:
----------------------------------

             Summary: Segmentation fault in dir_clear_range()
                 Key: TS-1648
                 URL: https://issues.apache.org/jira/browse/TS-1648
             Project: Traffic Server
          Issue Type: Bug
          Components: Cache
            Reporter: Tomasz Kuzemko


I use ATS as a reverse proxy. I have a fairly large disk cache consisting of 2x
10TB raw disks. I do not use cache compression. After a few days of running
(this is a dev machine, not handling any traffic), ATS starts crashing with a
segfault shortly after startup:

[Jan 11 16:11:00.690] Server {0x7ffff2bb8700} DEBUG: (rusage) took rusage snap 1357917060690487000

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff20ad700 (LWP 17292)]
0x0000000000696a71 in dir_clear_range (start=640, end=17024, vol=0x16057d0) at CacheDir.cc:382
382     CacheDir.cc: No such file or directory.
        in CacheDir.cc
(gdb) p i
$1 = 214748365
(gdb) l
377     in CacheDir.cc
(gdb) p dir_index(vol, i)
$2 = (Dir *) 0x7ff997a04002
(gdb) p dir_index(vol, i-1)
$3 = (Dir *) 0x7ffa97a03ff8
(gdb) p *dir_index(vol, i-1)
$4 = {w = {0, 0, 0, 0, 0}}
(gdb) p *dir_index(vol, i-2)
$5 = {w = {0, 0, 52431, 52423, 0}}
(gdb) p *dir_index(vol, i)
Cannot access memory at address 0x7ff997a04002
(gdb) p *dir_index(vol, i+2)
Cannot access memory at address 0x7ff997a04016
(gdb) p *dir_index(vol, i+1)
Cannot access memory at address 0x7ff997a0400c
(gdb) p vol->buckets * DIR_DEPTH * vol->segments
$6 = 1246953472
(gdb) bt
#0  0x0000000000696a71 in dir_clear_range (start=640, end=17024, vol=0x16057d0) at CacheDir.cc:382
#1  0x000000000068aba2 in Vol::handle_recover_from_data (this=0x16057d0, event=3900, data=0x16058a0) at Cache.cc:1384
#2  0x00000000004e8e1c in Continuation::handleEvent (this=0x16057d0, event=3900, data=0x16058a0) at ../iocore/eventsystem/I_Continuation.h:146
#3  0x0000000000692385 in AIOCallbackInternal::io_complete (this=0x16058a0, event=1, data=0x135afc0) at ../../iocore/aio/P_AIO.h:80
#4  0x00000000004e8e1c in Continuation::handleEvent (this=0x16058a0, event=1, data=0x135afc0) at ../iocore/eventsystem/I_Continuation.h:146
#5  0x0000000000700fec in EThread::process_event (this=0x7ffff36c4010, e=0x135afc0, calling_code=1) at UnixEThread.cc:142
#6  0x00000000007011ff in EThread::execute (this=0x7ffff36c4010) at UnixEThread.cc:191
#7  0x00000000006ff8c2 in spawn_thread_internal (a=0x1356040) at Thread.cc:88
#8  0x00007ffff797e8ca in start_thread () from /lib/libpthread.so.0
#9  0x00007ffff55c6b6d in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()
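
Some arithmetic on the values above: a Dir entry looks like five 16-bit words,
i.e. 10 bytes, and the crashing index i = 214748365 is well below the total
entry count of 1246953472, so the loop bound itself seems fine. However
10 * 214748365 = 2147483650 is just past INT_MAX, while 10 * (i - 1) still
fits in 32 bits, and the address gdb prints for dir_index(vol, i) is exactly
4 GiB below where it should be relative to dir_index(vol, i - 1). My guess (I
have not checked how dir_index actually computes the address, so this is only
a suspicion) is that the byte offset is computed in 32-bit arithmetic and
wraps. A small standalone sketch with the numbers from the gdb session:

#include <cstdint>
#include <cstdio>

int main() {
  // Values taken from the gdb session above.
  const int64_t i = 214748365;   // loop index at the time of the crash
  const int64_t dir_size = 10;   // apparent sizeof(Dir): five 16-bit words

  const int64_t wanted = dir_size * i;                // 2147483650 bytes
  const int32_t wrapped = (int32_t)(uint32_t)wanted;  // result if the multiply happens in 32 bits

  printf("wanted byte offset: %lld\n", (long long)wanted);             // 2147483650
  printf("wrapped to 32 bits: %d\n", wrapped);                         // -2147483646
  printf("difference        : %lld\n", (long long)(wanted - wrapped)); // 4294967296 (4 GiB)

  // 4 GiB is exactly how far dir_index(vol, i) lands below where it should be
  // (10 bytes above dir_index(vol, i - 1)), matching the
  // 0x7ffa97a03ff8 -> 0x7ff997a04002 jump in the gdb output.
  return 0;
}

If that is what is going on, it would also explain why only a large cache hits
this: the directory here is about 1246953472 * 10 bytes = ~12 GB, so any index
past roughly 214 million entries would overflow a 32-bit byte offset.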


Clearing the cache with "traffic_server -Kk" fixes it, but after a few days the
issue reappears.

I will keep the current faulty setup as-is in case you need me to provide more
data. I made a core dump, but it comes to a couple of GB even gzipped (I can
provide it on request).
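
In case it helps, the two raw disks are handed to the cache via storage.config,
one device path per line, roughly like this (the device paths here are
placeholders for my actual disks):

/dev/sdb
/dev/sdc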
