[
https://issues.apache.org/jira/browse/CASSANDRA-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15052692#comment-15052692
]
Paulo Motta edited comment on CASSANDRA-10797 at 12/11/15 12:46 PM:
--------------------------------------------------------------------
As mentioned before, I was able to reproduce the OOM with 1000 small sstables
and a 50M heap. I attached a [ccm
cluster|https://issues.apache.org/jira/secure/attachment/12777032/dtest.tar.gz]
with 2 nodes. To reproduce, extract {{dtest.tar.gz}} into the {{~/.ccm}} folder
and update the following properties in {{dtest/node*/conf/cassandra.yaml}} to
match your local directories: {{commitlog_directory}},
{{data_file_directories}} and {{saved_caches_directory}}. After that, run the
following commands:
{noformat}
ccm switch dtest
ccm node1 start
sleep 10
ccm node2 start (will throw OOM)
{noformat}
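For reference, the three directory properties mentioned above look roughly like
this in each node's {{cassandra.yaml}} (the paths below are just placeholders
for your local setup):
{noformat}
commitlog_directory: /home/<user>/.ccm/dtest/node1/commitlogs
data_file_directories:
    - /home/<user>/.ccm/dtest/node1/data
saved_caches_directory: /home/<user>/.ccm/dtest/node1/saved_caches
{noformat}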
The main problem is that all {{SSTableWriters}} remain open until the end of
the stream receive task, and these objects are quite large, since the indexes
and stats they hold are only written to disk when the {{SSTableWriters}} are
closed. Before CASSANDRA-6503, {{SSTableWriters}} were closed as soon as they
were received, and the stream receive task kept only the {{SSTableReaders}},
which have a much smaller memory footprint. The main reason to defer closing
the {{SSTableWriter}} to the end of the stream receive task was to keep the
sstables temporary (with the {{-tmp}} infix), preventing stale sstables from
reappearing if the machines are restarted after a failed repair session. A
discussed alternative was to close the {{SSTableWriter}} without removing the
{{-tmp}} infix and perform an atomic rename at the end of the stream task.
However, this alternative was disregarded because the {{SSTableReader}} would
need to be closed and reopened in order to perform the atomic rename on
non-POSIX systems such as Windows.
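To make the two lifecycles concrete, here is a minimal, self-contained sketch;
the types and method names below are illustrative stand-ins, not the actual
streaming classes:
{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins only; the real SSTableWriter/SSTableReader and the
// stream receive task have different, richer APIs.
interface IncomingFile {}
interface SSTableReader {}
@FunctionalInterface
interface SSTableWriter {
    // Flushes index/stats to disk, releases the writer's memory and returns a
    // lightweight reader.
    SSTableReader closeAndOpenReader();
}

class ReceiveTaskSketch {
    private SSTableWriter writeToDisk(IncomingFile file) {
        return () -> new SSTableReader() {}; // placeholder for the actual disk write
    }

    // Post-CASSANDRA-6503 shape: every writer stays open (indexes, stats and
    // buffers on heap) until the whole receive task finishes.
    List<SSTableWriter> receiveKeepingWritersOpen(List<IncomingFile> files) {
        List<SSTableWriter> writers = new ArrayList<>();
        for (IncomingFile file : files)
            writers.add(writeToDisk(file)); // one open writer per received sstable
        return writers;                     // closed only at the end of the task
    }

    // Pre-CASSANDRA-6503 shape (restored by this patch): close each writer as
    // soon as its file is received and keep only the much smaller reader.
    List<SSTableReader> receiveClosingWritersEarly(List<IncomingFile> files) {
        List<SSTableReader> readers = new ArrayList<>();
        for (IncomingFile file : files)
            readers.add(writeToDisk(file).closeAndOpenReader());
        return readers;
    }
}
{code}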
CASSANDRA-6503 also introduced the {{StreamLockFile}} to remove already-closed
{{SSTableWriters}} if the node goes down before those files are processed at
the end of the stream receive task. The proposed solution basically returns to
the previous behavior of closing {{SSTableWriters}} as soon as they are
received, while adding the already-closed-but-not-yet-live files to the
{{StreamLockFile}}. As soon as the sstables are added to the data tracker, the
{{StreamLockFile}} is removed. If the stream session fails before that, the
already-closed-but-not-yet-live sstables are cleaned up. If there is a failure
while adding files to the data tracker, only the files that were not yet added
are removed, since the files already added to the tracker are live. If the
node goes down during a stream session, the already-closed-but-not-yet-live
sstables present in the {{StreamLockFile}} are removed on the next startup, as
is done today.
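To illustrate that lifecycle, here is a rough, self-contained sketch using
plain {{java.nio.file}}; the class and method names mirror the operations
discussed but are hypothetical, not the actual {{StreamLockFile}}
implementation or on-disk format:
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;

// Hypothetical, simplified model of the lifecycle described above.
class StreamLockFileSketch {
    private final Path lockFile;

    StreamLockFileSketch(Path lockFile) {
        this.lockFile = lockFile;
    }

    // Called as soon as an SSTableWriter is closed: record the
    // already-closed-but-not-yet-live sstable in the lock file.
    void append(Path sstable) throws IOException {
        Files.write(lockFile, Collections.singletonList(sstable.toString()),
                    StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Called during operations such as nodetool refresh: an sstable listed in
    // a lock file was never made live and should be ignored.
    boolean skip(Path sstable) throws IOException {
        return Files.exists(lockFile)
               && Files.readAllLines(lockFile, StandardCharsets.UTF_8)
                       .contains(sstable.toString());
    }

    // Called once all sstables of the task were added to the data tracker
    // (i.e. they are live): the lock file is no longer needed.
    void delete() throws IOException {
        Files.deleteIfExists(lockFile);
    }

    // Called when the stream session fails, or on startup after a crash:
    // remove the closed-but-not-yet-live sstables, then the lock file itself.
    void cleanup() throws IOException {
        if (!Files.exists(lockFile))
            return;
        for (String line : Files.readAllLines(lockFile, StandardCharsets.UTF_8))
            Files.deleteIfExists(Paths.get(line.trim()));
        Files.deleteIfExists(lockFile);
    }
}
{code}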
Since the {{StreamLockFile}} becomes a much more critical component with this
approach, I added unit tests to verify that {{append}}, {{cleanup}}, {{skip}}
and {{delete}} work correctly. We also need to ignore sstables that are present
in a {{StreamLockFile}} during {{nodetool refresh}}; I will do that after a
first review, if this approach is validated.
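Purely as an illustration, a toy test against the sketch above (not the actual
unit tests added in the patch) would exercise those operations roughly like
this:
{code:java}
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.nio.file.Files;
import java.nio.file.Path;

import org.junit.Test;

// Toy test mirroring the kind of append/skip/cleanup checks described above.
public class StreamLockFileSketchTest {
    @Test
    public void closedButNotLiveSstablesAreSkippedAndCleanedUp() throws Exception {
        Path dir = Files.createTempDirectory("streamlockfile-test");
        Path sstable = Files.createFile(dir.resolve("ks-cf-ka-1-Data.db"));
        StreamLockFileSketch lock = new StreamLockFileSketch(dir.resolve("stream.lock"));

        lock.append(sstable);
        assertTrue(lock.skip(sstable));     // e.g. nodetool refresh would ignore it

        lock.cleanup();                     // simulates a failed session or restart
        assertFalse(Files.exists(sstable)); // the not-yet-live sstable is gone
        assertFalse(lock.skip(sstable));    // and so is the lock file
    }
}
{code}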
Below are some test results with and without the patch, with constrained (50M)
and unconstrained (500M) memory.
||*||unpatched||patched||
||constrained|!10797-nonpatched.png!|!10797-patched.png!|
||unconstrained|!10798-nonpatched-500M.png!|!10798-patched-500M.png!|
In the constrained case, the unpatched version OOMed soon after starting
bootstrap, while the patched version finished bootstrap successfully. In the
unconstrained case, the memory footprint is between 1/3 and 1/2 smaller, and
the difference is probably much larger in the case of large sstables.
Below are the initial patch and tests:
||2.1||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-10797]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10797-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10797-dtest/lastCompletedBuild/testReport/]|
I will provide 2.2+ versions after review.
> Bootstrap new node fails with OOM when streaming nodes contain thousands of
> sstables
> -------------------------------------------------------------------------------------
>
> Key: CASSANDRA-10797
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10797
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Environment: Cassandra 2.1.8.621 w/G1GC
> Reporter: Jose Martinez Poblete
> Assignee: Paulo Motta
> Fix For: 2.1.x
>
> Attachments: 10797-nonpatched.png, 10797-patched.png,
> 10798-nonpatched-500M.png, 10798-patched-500M.png, 112415_system.log,
> Heapdump_OOM.zip, Screen Shot 2015-12-01 at 7.34.40 PM.png, dtest.tar.gz
>
>
> When adding a new node to an existing DC, it runs OOM after 25-45 minutes.
> Upon heap dump review, it was found that the sending nodes are streaming
> thousands of sstables, which in turn blows up the bootstrapping node's heap:
> {noformat}
> ERROR [RMI Scheduler(0)] 2015-11-24 10:10:44,585 JVMStabilityInspector.java:94 - JVM state determined to be unstable. Exiting forcefully due to:
> java.lang.OutOfMemoryError: Java heap space
> ERROR [STREAM-IN-/173.36.28.148] 2015-11-24 10:10:44,585 StreamSession.java:502 - [Stream #0bb13f50-92cb-11e5-bc8d-f53b7528ffb4] Streaming error occurred
> java.lang.IllegalStateException: Shutdown in progress
>         at java.lang.ApplicationShutdownHooks.remove(ApplicationShutdownHooks.java:82) ~[na:1.8.0_65]
>         at java.lang.Runtime.removeShutdownHook(Runtime.java:239) ~[na:1.8.0_65]
>         at org.apache.cassandra.service.StorageService.removeShutdownHook(StorageService.java:747) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.utils.JVMStabilityInspector$Killer.killCurrentJVM(JVMStabilityInspector.java:95) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.utils.JVMStabilityInspector.inspectThrowable(JVMStabilityInspector.java:64) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:66) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:38) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:55) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:250) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_65]
> ERROR [RMI TCP Connection(idle)] 2015-11-24 10:10:44,585 JVMStabilityInspector.java:94 - JVM state determined to be unstable. Exiting forcefully due to:
> java.lang.OutOfMemoryError: Java heap space
> ERROR [OptionalTasks:1] 2015-11-24 10:10:44,585 CassandraDaemon.java:223 - Exception in thread Thread[OptionalTasks:1,5,main]
> java.lang.IllegalStateException: Shutdown in progress
> {noformat}
> Attached is the Eclipse MAT report as a zipped web page