cassandra start-up seek-bound in uninterruptable sleep
------------------------------------------------------
Key: CASSANDRA-1597
URL: https://issues.apache.org/jira/browse/CASSANDRA-1597
Project: Cassandra
Issue Type: Improvement
Reporter: Peter Schuller
Priority: Minor
(Cassandra trunk from a few months ago, Linux 2.6.32)
I just restarted a Cassandra node with a large (a few TB:s) column family. I
was expecting index sampling to take a while, but in this case I'm seeing
something unexpected. First, the log has:
10/10/09 11:15:49 INFO config.DatabaseDescriptor: Auto DiskAccessMode
determined to be mmap
10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for
/var/lib/cassandra/data/system/Schema-c-1-<>
10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for
/var/lib/cassandra/data/system/Migrations-c-1-<>
10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for
/var/lib/cassandra/data/system/LocationInfo-c-1-<>
10/10/09 11:15:50 INFO config.DatabaseDescriptor: Loading schema version
d8ad1485-a6e4-11df-beb1-d1273d6ae3d0
And it is currently stuck with iostat looking like this:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
avgqu-sz await svctm %util
sdb 0.00 67.00 146.00 2.00 584.00 276.00 11.62
0.99 6.68 6.68 98.80
I re-checked the code and the index sampling really should not be seek bound
since it is sequential. At first I figured that perhaps some aspect of the I/O
pattern is causing the kernel (and I guess underlying RAID+drives) to not do
read-ahead. This could explain why lots of small reads, even though they are
sequential in nature, could end up being seek bound.
mmap() has the problem that you cannot strace to see what is doing I/O, so I
wanted to confirm with jstack. At this point I notice jstack won't attach at
all (after a timeout).
Further, a SIGKILL fails to kill Java, with iostat proceeding, still seek
bound, and the process in uninterruptable sleep (presumably constantly, or else
the SIGKILL should have worked).
Somewhere along the way, strace has begun failing to attach as well but from my
remembrance I didn't see any I/O calls when it did attach properly, which is
consistent with whatever I/O is happening being triggered by mmap():ed memory
access or else by some other long-running syscall.
A very speculative hypothesis is that perhaps the initial mmap() call can be
long and blocking, seek-bound, due to inodes being consulted on disk (compare
with the cost of "rm:ing" large files on ext3fs). This would probably have the
potential to take quite a long time, if it is the case. On the other hand, if
this was the behavior of mmap() on ext3fs, it seems unlikely that it would not
be widely known already. It also seems unlikely in terms of prior probability
since I don't see a good reason why mmap() would have a need to scan inodes.
I am currently waiting it out to see if/when it eventually gets killed.
I will then try restarting the node in standard I/O mode and see if that works.
If it does, I'll re-try with mmap():ed mode to see that I can trigger it again.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.