Re: rebuild constantly fails, 3.11

2017-08-11 Thread kurt greaves
cc'ing user back in...

On 12 Aug. 2017 01:55, "kurt greaves"  wrote:

> How much memory do these machines have?  Typically we've found that G1
> isn't worth it until you get to around 24G heaps, and even at that it's not
> really better than CMS. You could try CMS with an 8G heap and 2G new size.
>
> However, as the OOM is only happening on one node, have you ensured there
> are no extra processes running on that node that could be consuming extra
> memory? Note that the OOM killer will kill the process with the highest OOM
> score, which generally corresponds to the process using the most memory
> but is not necessarily the process causing the problem.
>
> Also, could you run nodetool info on the problem node and one other node and
> dump the output in a gist? It would be interesting to see if there is a
> significant difference in off-heap usage.
>
> On 11 Aug. 2017 17:30, "Micha"  wrote:
>
>> It's an OOM issue: the kernel kills the Cassandra process.
>> The config was to use offheap buffers and a 20G Java heap; I changed this
>> to use heap buffers and a 16G Java heap. I added a new node yesterday,
>> which got streams from 4 other nodes. They all succeeded except on the
>> one node which failed before. This time the db was again killed by the
>> kernel. At the moment I don't know the reason, since the nodes are
>> identical.
>>
>> To me it seems that G1 GC is not able to free the memory fast enough.
>> The settings were MaxGCPauseMillis=600, ParallelGCThreads=10 and
>> ConcGCThreads=10, which may be too high since the node has only 8
>> cores.
>> I changed this to ParallelGCThreads=8 and ConcGCThreads=2, as mentioned
>> in the comments of jvm.options.
>>
>> Since the bootstrap of the fifth node did not complete, I will start it
>> again and check whether the available memory still decreases over time.
>>
>>
>>
>>  Michael
>>
>>
>>
>> On 11.08.2017 01:25, Jeff Jirsa wrote:
>> >
>> >
>> > On 2017-08-08 01:00 (-0700), Micha  wrote:
>> >> Hi,
>> >>
>> >> it seems I'm not able to add a 3-node DC to a 3-node DC. After
>> >> starting the rebuild on a new node, nodetool netstats shows it will
>> >> receive 1200 files from node-1 and 5000 from node-2. The stream from
>> >> node-1 completes but the stream from node-2 always fails, after
>> >> sending ca. 4000 files.
>> >>
>> >> After restarting the rebuild it again starts to send the 5000 files.
>> >> The whole cluster is connected via a single switch with no firewall
>> >> in between, and the network shows no errors.
>> >> The machines have 8 cores, 32GB RAM and two 1TB disks as RAID 0.
>> >> The logs show no errors. The size of the data is ca. 1TB.
>> >
>> > Is there anything in `dmesg`? System logs? Nothing? Is node2 running?
>> > Is node3 running?
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> > For additional commands, e-mail: dev-h...@cassandra.apache.org
>> >
>>
>>
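
The point in this message about the OOM killer choosing the process with the
highest OOM score can be checked directly on a Linux node, because the kernel
exposes each process's score in /proc/<pid>/oom_score. What follows is a
minimal sketch, not something posted in the thread: the function name
rank_by_oom_score and its parameters are hypothetical, and it assumes a Linux
/proc layout with the oom_score and comm files present.

```python
from pathlib import Path


def rank_by_oom_score(proc_root="/proc", top=5):
    """Return (score, pid, command) tuples, highest oom_score first.

    proc_root is a parameter only so the function can be pointed at a
    test directory; on a real node the default "/proc" is what you want.
    """
    ranked = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric directory names are processes
        try:
            score = int((entry / "oom_score").read_text().strip())
            command = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # process exited meanwhile, or files are unreadable
        ranked.append((score, int(entry.name), command))
    ranked.sort(reverse=True)
    return ranked[:top]
```

Calling rank_by_oom_score() on the failing node and on a healthy one would
show whether something other than the Cassandra JVM ranks unexpectedly high.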

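The request above to compare nodetool info between the problem node and one
other node comes down to comparing two numbers per node. The sketch below is
illustrative only: it assumes the output contains lines labelled
"Heap Memory (MB)" and "Off Heap Memory (MB)", as in the Cassandra 3.x output
format, and the helper names memory_mb and off_heap_delta are hypothetical.

```python
import re

# Assumed line shapes from `nodetool info`:
#   Heap Memory (MB)       : 2745.41 / 8032.00
#   Off Heap Memory (MB)   : 0.36
MEMORY_LINE = re.compile(
    r"^[ \t]*(Heap|Off Heap) Memory \(MB\)[ \t]*:[ \t]*([\d.]+)",
    re.MULTILINE,
)


def memory_mb(nodetool_info):
    """Extract used heap and off-heap memory (MB) from `nodetool info` text."""
    result = {}
    for kind, value in MEMORY_LINE.findall(nodetool_info):
        key = kind.lower().replace(" ", "_") + "_mb"
        result[key] = float(value)
    return result


def off_heap_delta(problem_info, healthy_info):
    """Off-heap MB on the problem node minus off-heap MB on the healthy node."""
    return memory_mb(problem_info)["off_heap_mb"] - memory_mb(healthy_info)["off_heap_mb"]
```

Feeding it the saved output from both nodes gives a single figure for the
off-heap difference, which is the comparison asked for in the message.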

Re: rebuild constantly fails, 3.11

2017-08-08 Thread Micha
The logs didn't show an error.
I have started it again with a higher log level, although errors should be
logged regardless of the log level. If it breaks again, I will share the log
with the possible error in it.
The only error output I got was on the console:


Cassandra has shutdown.
error: null
-- StackTrace --
java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:222)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:161)
at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
at javax.management.remote.rmi.RMIConnectionImpl_Stub.invoke(Unknown Source)
at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.invoke(RMIConnector.java:1020)
at javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:298)
at com.sun.proxy.$Proxy7.rebuild(Unknown Source)
at org.apache.cassandra.tools.NodeProbe.rebuild(NodeProbe.java:1190)
at org.apache.cassandra.tools.nodetool.Rebuild.execute(Rebuild.java:58)
at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:254)
at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:168)

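An EOFException on the JMX call together with "Cassandra has shutdown" is
consistent with the server process disappearing while nodetool was still
waiting, and the later replies in this thread trace that to the kernel OOM
killer. When the Cassandra logs are silent, the kernel log is the place to
look. The following is a minimal sketch for scanning kernel log text for OOM
kills. The "Killed process ..." wording it matches is an assumption based on
Linux kernels of that period and differs between kernel versions, and the
function name find_oom_kills is hypothetical.

```python
import re

# Assumed kernel message shape (varies by kernel version):
#   Killed process 4242 (java) total-vm:...kB, anon-rss:...kB, file-rss:...kB
KILLED = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r"(?: total-vm:\d+kB, anon-rss:(?P<anon_rss_kb>\d+)kB)?"
)


def find_oom_kills(kernel_log):
    """Return one dict per 'Killed process' line found in kernel log text."""
    kills = []
    for line in kernel_log.splitlines():
        match = KILLED.search(line)
        if match is None:
            continue
        rss = match.group("anon_rss_kb")
        kills.append({
            "pid": int(match.group("pid")),
            "name": match.group("name"),
            # anon-rss is None when the line does not carry memory figures
            "anon_rss_kb": int(rss) if rss is not None else None,
        })
    return kills
```

Passing it the captured output of dmesg (or the relevant system log file)
lists which processes the kernel killed and how much anonymous memory each
held at the time.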

On 08.08.2017 17:03, ZAIDI, ASAD A wrote:
> Without the exact failure text, it is really hard to guess what may be going
> on. Can you please share a log file excerpt detailing the failure, so we can
> get a better idea of its nature?
> Adjusting phi_convict_threshold may be yet another shot in the dark when we
> don’t know what is causing the failure and the network is supposedly stable.
> 
> ~Asad
> 
> 
> 
> -----Original Message-----
> From: Micha [mailto:mich...@fantasymail.de] 
> Sent: Tuesday, August 08, 2017 8:35 AM
> To: user@cassandra.apache.org; ZAIDI, ASAD A <az1...@att.com>; 
> user@cassandra.apache.org
> Subject: Re: rebuild constantly fails, 3.11
> 
> No, I have left it at the default value of 24 hours.
> 
> I've read about adjusting phi_convict_threshold, but I haven't done this yet,
> as the network is stable. Maybe I will set this to 10.
> 
> 
> On 08.08.2017 15:24, ZAIDI, ASAD A wrote:
>> Is there any chance you've set the streaming_socket_timeout_in_ms parameter
>> too low on the failing node?
>>
>>
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 



RE: rebuild constantly fails, 3.11

2017-08-08 Thread ZAIDI, ASAD A
Without the exact failure text, it is really hard to guess what may be going
on. Can you please share a log file excerpt detailing the failure, so we can
get a better idea of its nature?
Adjusting phi_convict_threshold may be yet another shot in the dark when we
don’t know what is causing the failure and the network is supposedly stable.

~Asad



-----Original Message-----
From: Micha [mailto:mich...@fantasymail.de] 
Sent: Tuesday, August 08, 2017 8:35 AM
To: user@cassandra.apache.org; ZAIDI, ASAD A <az1...@att.com>; 
user@cassandra.apache.org
Subject: Re: rebuild constantly fails, 3.11

No, I have left it at the default value of 24 hours.

I've read about adjusting phi_convict_threshold, but I haven't done this yet,
as the network is stable. Maybe I will set this to 10.


On 08.08.2017 15:24, ZAIDI, ASAD A wrote:
> Is there any chance you've set the streaming_socket_timeout_in_ms parameter
> too low on the failing node?
> 
> 



Re: rebuild constantly fails, 3.11

2017-08-08 Thread Micha
No, I have left it at the default value of 24 hours.

I've read about adjusting phi_convict_threshold, but I haven't done this
yet, as the network is stable. Maybe I will set this to 10.


On 08.08.2017 15:24, ZAIDI, ASAD A wrote:
> Is there any chance you've set the streaming_socket_timeout_in_ms parameter
> too low on the failing node?
> 
> 




RE: rebuild constantly fails, 3.11

2017-08-08 Thread ZAIDI, ASAD A
Is there any chance you've set the streaming_socket_timeout_in_ms parameter
too low on the failing node?


-----Original Message-----
From: Micha [mailto:mich...@fantasymail.de] 
Sent: Tuesday, August 08, 2017 3:01 AM
To: user@cassandra.apache.org; d...@cassandra.apache.org
Subject: rebuild constantly fails, 3.11

Hi,

it seems I'm not able to add a 3-node DC to a 3-node DC. After starting the
rebuild on a new node, nodetool netstats shows it will receive 1200 files from
node-1 and 5000 from node-2. The stream from node-1 completes but the stream
from node-2 always fails, after sending ca. 4000 files.

After restarting the rebuild it again starts to send the 5000 files.
The whole cluster is connected via a single switch with no firewall in between,
and the network shows no errors.
The machines have 8 cores, 32GB RAM and two 1TB disks as RAID 0.
The logs show no errors. The size of the data is ca. 1TB.


Any help is really welcome,

cheers
 Michael

The error is:

Cassandra has shutdown.
error: null
-- StackTrace --
java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:222)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:161)
at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
at javax.management.remote.rmi.RMIConnectionImpl_Stub.invoke(Unknown Source)
at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.invoke(RMIConnector.java:1020)
at javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:298)
at com.sun.proxy.$Proxy7.rebuild(Unknown Source)
at org.apache.cassandra.tools.NodeProbe.rebuild(NodeProbe.java:1190)
at org.apache.cassandra.tools.nodetool.Rebuild.execute(Rebuild.java:58)
at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:254)
at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:168)



Re: rebuild constantly fails, 3.11

2017-08-08 Thread kurt greaves
If the error is reproducible, can you upload the logs from the time period
in which the error occurs to a gist?


rebuild constantly fails, 3.11

2017-08-08 Thread Micha
Hi,

it seems I'm not able to add a 3-node DC to a 3-node DC. After
starting the rebuild on a new node, nodetool netstats shows it will
receive 1200 files from node-1 and 5000 from node-2. The stream from
node-1 completes but the stream from node-2 always fails, after sending
ca. 4000 files.

After restarting the rebuild it again starts to send the 5000 files.
The whole cluster is connected via a single switch with no firewall
in between, and the network shows no errors.
The machines have 8 cores, 32GB RAM and two 1TB disks as RAID 0.
The logs show no errors. The size of the data is ca. 1TB.


Any help is really welcome,

cheers
 Michael

The error is:

Cassandra has shutdown.
error: null
-- StackTrace --
java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:222)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:161)
at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
at javax.management.remote.rmi.RMIConnectionImpl_Stub.invoke(Unknown Source)
at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.invoke(RMIConnector.java:1020)
at javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:298)
at com.sun.proxy.$Proxy7.rebuild(Unknown Source)
at org.apache.cassandra.tools.NodeProbe.rebuild(NodeProbe.java:1190)
at org.apache.cassandra.tools.nodetool.Rebuild.execute(Rebuild.java:58)
at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:254)
at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:168)
