[
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394474#comment-14394474
]
Sergey Maznichenko commented on CASSANDRA-9092:
-----------------------------------------------
Consistency level ONE. Clients use the DataStax Java driver.
We write only to DC1.
During the load, the logs of the nodes that do not fail contain errors and warnings:
INFO [SharedPool-Worker-5] 2015-03-31 15:48:52,534 Message.java:532 - Unexpected exception during request; channel = [id: 0x48b3ad12, /10.77.81.33:56581 :> /10.XX.XX.10:9042]
java.io.IOException: Error while read(...): Connection reset by peer
    at io.netty.channel.epoll.Native.readAddress(Native Method) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
    at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.doReadBytes(EpollSocketChannel.java:675) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
    at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:714) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:326) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
    at java.lang.Thread.run(Unknown Source) [na:1.7.0_71]
ERROR [Thrift:15] 2015-03-31 11:54:35,163 CustomTThreadPoolServer.java:221 - Error occurred during processing of message.
java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 2 responses.
    at org.apache.cassandra.auth.Auth.selectUser(Auth.java:317) ~[apache-cassandra-2.1.2.jar:2.1.2]
    at org.apache.cassandra.auth.Auth.isExistingUser(Auth.java:125) ~[apache-cassandra-2.1.2.jar:2.1.2]
    at org.apache.cassandra.service.ClientState.login(ClientState.java:171) ~[apache-cassandra-2.1.2.jar:2.1.2]
    at org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1493) ~[apache-cassandra-2.1.2.jar:2.1.2]
    at org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3579) ~[apache-cassandra-thrift-2.1.2.jar:2.1.2]
    at org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3563) ~[apache-cassandra-thrift-2.1.2.jar:2.1.2]
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.1.jar:0.9.1]
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.1.jar:0.9.1]
    at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:202) ~[apache-cassandra-2.1.2.jar:2.1.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [na:1.7.0_71]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [na:1.7.0_71]
    at java.lang.Thread.run(Unknown Source) [na:1.7.0_71]
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 2 responses.
    at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:103) ~[apache-cassandra-2.1.2.jar:2.1.2]
    at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:144) ~[apache-cassandra-2.1.2.jar:2.1.2]
    at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1263) ~[apache-cassandra-2.1.2.jar:2.1.2]
    at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1184) ~[apache-cassandra-2.1.2.jar:2.1.2]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:262) ~[apache-cassandra-2.1.2.jar:2.1.2]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:215) ~[apache-cassandra-2.1.2.jar:2.1.2]
    at org.apache.cassandra.auth.Auth.selectUser(Auth.java:306) ~[apache-cassandra-2.1.2.jar:2.1.2]
    ... 11 common frames omitted
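The ReadTimeoutException above is raised by the internal query against the auth tables during client login. A common mitigation, assuming the cluster still has the default single-replica settings for the auth keyspace (its actual replication is not shown here), is to replicate it to every datacenter and then repair; the DC names below are illustrative and must match the snitch:

```sql
-- Hypothetical: give the auth keyspace replicas in both DCs so logins
-- do not depend on cross-DC reads (adjust DC names and counts to taste).
ALTER KEYSPACE system_auth
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};
```

followed by `nodetool repair system_auth` on each node.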
I've changed the schema definition.
The workload is periodic, so I will disable hinted handoff temporarily. I have also disabled compaction for filespace.filestorage because it takes a long time and gives <1% efficiency.
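The temporary measures above can be applied at runtime with nodetool; a sketch, with the keyspace/table names taken from this comment:

```
# Pause hint delivery during the periodic load.
nodetool disablehandoff
# Stop automatic compaction on the large blob table only.
nodetool disableautocompaction filespace filestorage
# If the accumulated hints are no longer worth replaying, drop them.
nodetool truncatehints
# After the load, restore normal behaviour:
nodetool enablehandoff
nodetool enableautocompaction filespace filestorage
```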
My hint parameters are now:
hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 4
max_hint_window_in_ms: 10800000
hinted_handoff_throttle_in_kb: 10240
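As a back-of-the-envelope sketch of what these settings imply for hint replay: the 100 GB backlog and the 7 delivery targets below are made-up example numbers, while the per-node split of the throttle reflects how `hinted_handoff_throttle_in_kb` is documented to be divided across delivery targets.

```python
# Rough hint-replay estimate from the hint settings above (illustrative only).

def replay_hours(backlog_mb: float, throttle_kb_per_s: int, delivery_nodes: int) -> float:
    """Hours needed to drain a hint backlog. Cassandra divides the
    configured throttle across the nodes it is delivering hints to."""
    effective_kb_per_s = throttle_kb_per_s / delivery_nodes
    return (backlog_mb * 1024) / effective_kb_per_s / 3600

# hinted_handoff_throttle_in_kb: 10240, with 7 other nodes in the cluster
# and a hypothetical 100 GB of accumulated hints:
hours = replay_hours(backlog_mb=100 * 1024, throttle_kb_per_s=10240, delivery_nodes=7)

# max_hint_window_in_ms: 10800000 is a 3-hour window before hints stop
# being recorded for a down node:
window_hours = 10_800_000 / 1000 / 3600
```

At roughly 20 hours to replay such a backlog, it is easy to see how system.hints can grow faster than it drains under a heavy periodic load.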
I suppose Cassandra should do some kind of partial compaction when system.hints is big, or clean out old hints before compacting. Do you have an idea of the necessary changes for 2.1.5?
> Nodes in DC2 die during and after huge write workload
> -----------------------------------------------------
>
> Key: CASSANDRA-9092
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
> Project: Cassandra
> Issue Type: Bug
> Environment: CentOS 6.2 64-bit, Cassandra 2.1.2,
> java version "1.7.0_71"
> Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
> Reporter: Sergey Maznichenko
> Assignee: Sam Tunnicliffe
> Fix For: 2.1.5
>
> Attachments: cassandra_crash1.txt
>
>
> Hello,
> We have Cassandra 2.1.2 with 8 nodes: 4 in DC1 and 4 in DC2.
> Each node is a VM with 8 CPUs and 32GB RAM.
> During a heavy workload (loading several million blobs of ~3.5MB each), 1
> node in DC2 stops, and after some time the next 2 nodes in DC2 also stop.
> Currently, 2 of the nodes in DC2 do not work and stop 5-10 minutes after starting.
> I see many files in the system.hints table, and the error appears 2-3 minutes
> after the system.hints auto-compaction starts.
> "Stops" means:
> ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 - Exception in thread Thread[CompactionExecutor:1,1,main]
> java.lang.OutOfMemoryError: Java heap space
> ERROR [HintedHandoff:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 - Exception in thread Thread[HintedHandoff:1,1,main]
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
> The full error listing is attached as cassandra_crash1.txt.
> The problem exists only in DC2. We have 1GbE between DC1 and DC2.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)