[
https://issues.apache.org/jira/browse/CASSANDRA-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919495#action_12919495
]
Peter Schuller commented on CASSANDRA-1597:
-------------------------------------------
My apologies for the unnecessary traffic. The explanation was right there in my
own ramblings above (about rm). The process was in uninterruptible sleep due to
the removal of orphan files:
10/10/09 11:38:34 INFO db.ColumnFamilyStore: Removing orphan
/var/lib/cassandra/data/sporebench/sporebench-tmp-c-10204-Index.db
Service killed by signal 9
In other words, to avoid this, run Cassandra on a file system that can unlink
large files cheaply... Presumably xfs/ext4 should both be good.
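For the record, the cleanup that bit me boils down to something like the
following (a minimal sketch, not the actual ColumnFamilyStore code; the
"-tmp-" naming is inferred from the log line above). The key point is that
File.delete() maps to unlink(2), and on ext3 unlinking a huge file walks all
of its indirect blocks, leaving the thread in uninterruptible (D) sleep for
as long as that takes:

    import java.io.File;

    // Minimal sketch of startup orphan cleanup; assumes temporary sstable
    // components are identified by the "-tmp-" marker seen in the log.
    public class OrphanCleanup {
        public static void removeOrphans(File dataDir) {
            File[] files = dataDir.listFiles();
            if (files == null)
                return;
            for (File f : files) {
                if (f.getName().contains("-tmp-")) {
                    System.out.println("Removing orphan " + f.getAbsolutePath());
                    // File.delete() is unlink(2); on ext3, deleting a very
                    // large file can block here in uninterruptible sleep,
                    // during which even a SIGKILL has no visible effect.
                    if (!f.delete())
                        System.err.println("Failed to delete " + f);
                }
            }
        }
    }

Extent-based file systems (xfs, ext4) keep the unlink cost roughly constant
rather than proportional to file size, hence the suggestion above.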
> cassandra start-up seek-bound in uninterruptible sleep
> ------------------------------------------------------
>
> Key: CASSANDRA-1597
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1597
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Peter Schuller
> Priority: Minor
>
> (Cassandra trunk from a few months ago, Linux 2.6.32)
> I just restarted a Cassandra node with a large (a few TB) column family. I
> was expecting index sampling to take a while, but in this case I'm seeing
> something unexpected. First, the log has:
> 10/10/09 11:15:49 INFO config.DatabaseDescriptor: Auto DiskAccessMode
> determined to be mmap
> 10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for
> /var/lib/cassandra/data/system/Schema-c-1-<>
> 10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for
> /var/lib/cassandra/data/system/Migrations-c-1-<>
> 10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for
> /var/lib/cassandra/data/system/LocationInfo-c-1-<>
> 10/10/09 11:15:50 INFO config.DatabaseDescriptor: Loading schema version
> d8ad1485-a6e4-11df-beb1-d1273d6ae3d0
> And it is currently stuck with iostat looking like this:
> Device:  rrqm/s  wrqm/s     r/s   w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await  svctm  %util
> sdb        0.00   67.00  146.00  2.00  584.00  276.00    11.62     0.99   6.68   6.68  98.80
> I re-checked the code, and the index sampling really should not be seek bound,
> since it is sequential. At first I figured that perhaps some aspect of the
> I/O pattern was causing the kernel (and, I guess, the underlying RAID+drives)
> to skip read-ahead, which would explain how lots of small reads, sequential
> though they are, could end up being seek bound.
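> To be concrete about what I mean by sequential: the sampling pass is
> conceptually just one buffered, linear scan of the -Index.db file, keeping
> every Nth entry. The sketch below is mine, not the actual SSTableReader
> code, and it assumes a flat sequence of (key, data offset) records, which
> approximates but is not exactly the on-disk format:
>
>     import java.io.*;
>     import java.util.*;
>
>     // Sketch: one sequential, buffered pass over the index, sampling
>     // every interval-th key. With buffering plus kernel read-ahead this
>     // should never be seek bound on its own.
>     public class IndexSampler {
>         static List<String> sampleIndex(File indexFile, int interval) throws IOException {
>             List<String> sampled = new ArrayList<String>();
>             DataInputStream in = new DataInputStream(
>                     new BufferedInputStream(new FileInputStream(indexFile)));
>             try {
>                 int i = 0;
>                 while (true) {
>                     String key = in.readUTF();   // row key
>                     in.readLong();               // offset into the -Data.db file
>                     if (i++ % interval == 0)
>                         sampled.add(key);
>                 }
>             } catch (EOFException expected) {
>                 // normal termination: end of index reached
>             } finally {
>                 in.close();
>             }
>             return sampled;
>         }
>     }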
> mmap() has the problem that you cannot strace to see what is doing the I/O, so
> I wanted to confirm with jstack. At this point I noticed that jstack would not
> attach at all (after a timeout).
> Further, a SIGKILL fails to kill the JVM: iostat keeps showing the same
> seek-bound pattern, and the process stays in uninterruptible sleep (presumably
> constantly, or else the SIGKILL would have worked).
> Somewhere along the way, strace began failing to attach as well, but as far
> as I remember I did not see any I/O calls when it did attach properly, which
> is consistent with whatever I/O is happening being triggered by mmap():ed
> memory access, or else by some other long-running syscall.
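> To make the strace point concrete, here is the kind of access pattern mmap
> leads to (a sketch of my own; the class name and page-sized stride are made
> up). After the initial mmap() there is no read() left for strace to show;
> each access to a non-resident page is a page fault that the kernel services
> with disk I/O:
>
>     import java.io.RandomAccessFile;
>     import java.nio.MappedByteBuffer;
>     import java.nio.channels.FileChannel;
>
>     public class MmapTouch {
>         public static void main(String[] args) throws Exception {
>             RandomAccessFile raf = new RandomAccessFile(args[0], "r");
>             FileChannel ch = raf.getChannel();
>             // A single mapping is capped at 2GB, hence the min(); presumably
>             // this is why big files get mapped in segments.
>             long size = Math.min(ch.size(), Integer.MAX_VALUE);
>             MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
>             long sum = 0;
>             // No read(2) syscalls happen below: each get() that touches a
>             // non-resident page faults into the kernel, which is why strace
>             // shows nothing while iostat still shows the device traffic.
>             for (long i = 0; i < size; i += 4096)
>                 sum += buf.get((int) i);
>             System.out.println(sum);
>         }
>     }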
> A very speculative hypothesis is that perhaps the initial mmap() call can
> block for a long time, seek bound, due to inodes being consulted on disk
> (compare with the cost of "rm-ing" large files on ext3fs). If so, this could
> take quite a long time. On the other hand, if this were the behavior of
> mmap() on ext3fs, it seems unlikely that it would not be widely known
> already. It also seems unlikely a priori, since I don't see a good reason
> why mmap() would need to scan inodes.
> I am currently waiting it out to see if/when the process eventually gets
> killed. I will then try restarting the node in standard I/O mode and see
> whether that works. If it does, I'll re-try mmap():ed mode to see whether I
> can trigger the problem again.