[ https://issues.apache.org/jira/browse/CASSANDRA-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032294#comment-13032294 ]
Thibaut commented on CASSANDRA-2394:
------------------------------------
Another hard disk died.
This time, there were errors in the log:
ERROR [ReadStage:336] 2011-05-11 14:35:53,232 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[ReadStage:336,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: corrupt sstable
    at org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:60)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: corrupt sstable
    at org.apache.cassandra.io.sstable.SSTableScanner.seekTo(SSTableScanner.java:104)
    at org.apache.cassandra.db.RowIteratorFactory.getIterator(RowIteratorFactory.java:96)
    at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1447)
    at org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:49)
    ... 4 more
Caused by: java.io.IOException: Input/output error
    at java.io.RandomAccessFile.readBytes(Native Method)
    at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
    at org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:206)
    at org.apache.cassandra.io.util.BufferedRandomAccessFile.seek(BufferedRandomAccessFile.java:347)
    at org.apache.cassandra.io.sstable.SSTableScanner.seekTo(SSTableScanner.java:99)
    ... 7 more
Together with:
WARN [ScheduledTasks:1] 2011-05-11 12:24:35,725 MessagingService.java (line 504) Dropped 10 READ messages in the last 5000ms
WARN [ScheduledTasks:1] 2011-05-11 12:24:35,725 MessagingService.java (line 504) Dropped 17 RANGE_SLICE messages in the last 5000ms
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,726 StatusLogger.java (line 51) Pool Name Active Pending
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,726 StatusLogger.java (line 66) ReadStage 16 1310
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,726 StatusLogger.java (line 66) RequestResponseStage 0 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,726 StatusLogger.java (line 66) ReadRepairStage 0 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,726 StatusLogger.java (line 66) MutationStage 0 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,727 StatusLogger.java (line 66) GossipStage 0 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,727 StatusLogger.java (line 66) AntiEntropyStage 0 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,727 StatusLogger.java (line 66) MigrationStage 0 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,727 StatusLogger.java (line 66) StreamStage 0 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,727 StatusLogger.java (line 66) MemtablePostFlusher 0 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,727 StatusLogger.java (line 66) FILEUTILS-DELETE-POOL 0 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,728 StatusLogger.java (line 66) FlushWriter 0 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,728 StatusLogger.java (line 66) MiscStage 0 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,728 StatusLogger.java (line 66) FlushSorter 0 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,728 StatusLogger.java (line 66) InternalResponseStage 0 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,728 StatusLogger.java (line 66) HintedHandoff 0 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,728 StatusLogger.java (line 70) CompactionManager n/a 0
INFO [ScheduledTasks:1] 2011-05-11 12:24:35,729 StatusLogger.java (line 82) MessagingService n/a 0,0
> Faulty hd kills cluster performance
> -----------------------------------
>
> Key: CASSANDRA-2394
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2394
> Project: Cassandra
> Issue Type: Bug
> Affects Versions: 0.7.4
> Reporter: Thibaut
> Priority: Minor
> Fix For: 0.7.6
>
>
> Hi,
> About once a week, a node in our main cluster (>100 nodes) has a faulty hard
> disk (listing the Cassandra data storage directory triggers an input/output
> error). Whenever this occurs, I see many timeout exceptions in our application
> on various nodes, which cause everything to run very slowly. Key range scans
> just time out and sometimes never succeed. If I stop Cassandra on the faulty
> node, everything runs normally again.
> It would be great to have some kind of monitoring thread in Cassandra that
> marks a node as "down" if there are multiple read/write errors on the data
> directories. A single faulty hard disk on one node shouldn't affect global
> cluster performance.
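
To make the request above concrete, here is a minimal sketch of the kind of error counter such a monitoring thread could use. It is an illustration only and not Cassandra code: the class name DiskErrorMonitor, its parameters (maxErrors, windowMillis) and the onUnhealthy callback are all hypothetical. The idea is to count I/O errors inside a sliding time window and fire the callback once when the threshold is reached; what "marking the node down" would actually involve is left to the callback.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch, not part of Cassandra: counts I/O errors in a
 * sliding time window and fires a callback once a threshold is reached.
 */
public class DiskErrorMonitor {
    private final int maxErrors;        // assumed threshold, e.g. 3 errors
    private final long windowMillis;    // assumed window length in milliseconds
    private final Runnable onUnhealthy; // assumed hook, e.g. "stop serving reads"
    private final Deque<Long> errorTimes = new ArrayDeque<>();
    private boolean tripped = false;

    public DiskErrorMonitor(int maxErrors, long windowMillis, Runnable onUnhealthy) {
        this.maxErrors = maxErrors;
        this.windowMillis = windowMillis;
        this.onUnhealthy = onUnhealthy;
    }

    /** Record one read/write error observed at the given time (in ms). */
    public synchronized void recordError(long nowMillis) {
        errorTimes.addLast(nowMillis);
        // Forget errors that are older than the window.
        while (!errorTimes.isEmpty() && nowMillis - errorTimes.peekFirst() > windowMillis) {
            errorTimes.removeFirst();
        }
        // Fire the callback only once, the first time the threshold is reached.
        if (!tripped && errorTimes.size() >= maxErrors) {
            tripped = true;
            onUnhealthy.run();
        }
    }

    public synchronized boolean isTripped() {
        return tripped;
    }
}
```

In this sketch, the code path that catches an IOException like the one in the trace above would call recordError(System.currentTimeMillis()). A single transient error would then not take the node out of service, while repeated errors within the window would.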
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira