I will try to reproduce the problem on a smaller test cluster.
It was rather easy; the cluster contains 4 servers. Log fragment from the restarted node (10.2.3.38):
DEBUG [pool-1-thread-64] 2009-10-15 14:18:16,290 CassandraServer.java (line 214) get_slice
DEBUG [pool-1-thread-64] 2009-10-15 14:18:16,290 StorageProxy.java (line 239) weakreadlocal reading SliceFromReadCommand(table='Keyspace1', key='0000000000000000000000000000000000849706', column_parent='QueryPath(columnFamilyName='Super1', superColumnName='[...@6ca50fbe', columnName='null')', start='1', finish='0', reversed=true, count=2)
DEBUG [pool-1-thread-64] 2009-10-15 14:18:16,290 StorageProxy.java (line 251) weakreadremote reading SliceFromReadCommand(table='Keyspace1', key='0000000000000000000000000000000000849706', column_parent='QueryPath(columnFamilyName='Super1', superColumnName='[...@6ca50fbe', columnName='null')', start='1', finish='0', reversed=true, count=2) from [email protected]:7000
...
ERROR [pool-1-thread-64] 2009-10-15 14:18:21,281 Cassandra.java (line 679) Internal error processing get_slice
java.lang.RuntimeException: error reading key 0000000000000000000000000000000000849706
        at org.apache.cassandra.service.StorageProxy.weakReadRemote(StorageProxy.java:265)
        at org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:312)
        at org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:95)
        at org.apache.cassandra.service.CassandraServer.getSlice(CassandraServer.java:177)
        at org.apache.cassandra.service.CassandraServer.multigetSliceInternal(CassandraServer.java:252)
        at org.apache.cassandra.service.CassandraServer.get_slice(CassandraServer.java:215)
        at org.apache.cassandra.service.Cassandra$Processor$get_slice.process(Cassandra.java:671)
        at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:627)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:636)
Caused by: java.util.concurrent.TimeoutException: Operation timed out.
        at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97)
        at org.apache.cassandra.service.StorageProxy.weakReadRemote(StorageProxy.java:261)
        ... 11 more
Log fragment from 10.2.3.40:
DEBUG [ROW-READ-STAGE:4] 2009-10-15 14:18:16,308 ReadVerbHandler.java (line 100) Read key 0000000000000000000000000000000000849706; sending response to [email protected]:7000
....
DEBUG [CONSISTENCY-MANAGER:2] 2009-10-15 14:18:16,308 ConsistencyManager.java (line 168) Reading consistency digest for 0000000000000000000000000000000000849706 from 527...@[10.3.2.39:7000, 10.3.2.41:7000]
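The timing in the two fragments can be made explicit with a small Python sketch. It is not part of the attached scripts; the variable names are made up for illustration, and the three timestamps are copied from the log lines above (the weakreadremote request, the "Read key ... sending response" line on 10.2.3.40, and the ERROR line). Comparing the first two assumes the clocks on both nodes are reasonably synchronized, which the logs do not confirm.

```python
from datetime import datetime, timedelta

# Cassandra log timestamps look like '2009-10-15 14:18:16,290'
# (comma followed by milliseconds).
FMT = "%Y-%m-%d %H:%M:%S,%f"


def parse_ts(text):
    """Parse one log timestamp into a datetime."""
    return datetime.strptime(text, FMT)


def delta_ms(start, end):
    """Whole milliseconds elapsed between two datetimes."""
    return (end - start) // timedelta(milliseconds=1)


# Values copied from the log fragments above.
request_sent = parse_ts("2009-10-15 14:18:16,290")      # weakreadremote on 10.2.3.38
remote_read_done = parse_ts("2009-10-15 14:18:16,308")  # Read key ... on 10.2.3.40
timeout_raised = parse_ts("2009-10-15 14:18:21,281")    # ERROR on 10.2.3.38

remote_delay_ms = delta_ms(request_sent, remote_read_done)
timeout_delay_ms = delta_ms(request_sent, timeout_raised)

print("remote read logged after", remote_delay_ms, "ms")
print("timeout raised after", timeout_delay_ms, "ms")
```

If the clocks agree, this suggests the remote node read the key and logged that it was sending the response almost immediately, while the restarted node still waited about five seconds before giving up.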
I have the full logs, but they are about half a gigabyte for each node. If needed, I can put them somewhere accessible by HTTP.
How to reproduce:
- configure a cluster of 4 nodes, with these changes in storage-conf.xml:
    <ReplicationFactor>3</ReplicationFactor>
    <FlushMinThreads>8</FlushMinThreads>
    <FlushMaxThreads>16</FlushMaxThreads>
- edit the attached scripts with the correct node IPs
- run perl writecluster.pl -c 8 and wait for 10-20 minutes
- run perl readcluster.pl
- look at the error :)

--
Teodor Sigaev E-mail: [email protected]
WWW: http://www.sigaev.ru/
writecluster.pl
Description: Perl program
readcluster.pl
Description: Perl program
