cassandra start-up seek-bound in uninterruptable sleep
------------------------------------------------------

                 Key: CASSANDRA-1597
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1597
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Peter Schuller
            Priority: Minor


(Cassandra trunk from a few months ago, Linux 2.6.32)

I just restarted a Cassandra node with a large (a few TB:s) column family. I 
was expecting index sampling to take a while, but in this case I'm seeing 
something unexpected. First, the log has:

10/10/09 11:15:49 INFO config.DatabaseDescriptor: Auto DiskAccessMode 
determined to be mmap
10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for 
/var/lib/cassandra/data/system/Schema-c-1-<>
10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for 
/var/lib/cassandra/data/system/Migrations-c-1-<>
10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for 
/var/lib/cassandra/data/system/LocationInfo-c-1-<>
10/10/09 11:15:50 INFO config.DatabaseDescriptor: Loading schema version 
d8ad1485-a6e4-11df-beb1-d1273d6ae3d0

And it is currently stuck with iostat looking like this:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
avgqu-sz   await  svctm  %util
sdb               0.00    67.00  146.00    2.00   584.00   276.00    11.62     
0.99    6.68   6.68  98.80

I re-checked the code and the index sampling really should not be seek bound 
since it is sequential. At first I figured that perhaps some aspect of the I/O 
pattern is causing the kernel (and I guess underlying RAID+drives) to not do 
read-ahead. This could explain why lots of small reads, even though they are 
sequential in nature, could end up being seek bound.

mmap() has the problem that you cannot strace to see what is doing I/O, so I 
wanted to confirm with jstack. At this point I notice jstack won't attach at 
all (after a timeout).

Further, a SIGKILL fails to kill Java, with iostat proceeding, still seek 
bound, and the process in uninterruptable sleep (presumably constantly, or else 
the SIGKILL should have worked).

Somewhere along the way, strace has begun failing to attach as well but from my 
remembrance I didn't see any I/O calls when it did attach properly, which is 
consistent with whatever I/O is happening being triggered by mmap():ed memory 
access or else by some other long-running syscall.

A very speculative hypothesis is that perhaps the initial mmap() call can be 
long and blocking, seek-bound, due to inodes being consulted on disk (compare 
with the cost of "rm:ing" large files on ext3fs). This would probably have the 
potential to take quite a long time, if it is the case. On the other hand, if 
this was the behavior of mmap() on ext3fs, it seems unlikely that it would not 
be widely known already. It also seems unlikely in terms of prior probability 
since I don't see a good reason why mmap() would have a need to scan inodes.

I am currently waiting it out to see if/when it eventually gets killed.

I will then try restarting the node in standard I/O mode and see if that works. 
If it does, I'll re-try with mmap():ed mode to see that I can trigger it again.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to