[
https://issues.apache.org/jira/browse/CASSANDRA-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15052692#comment-15052692
]
Paulo Motta edited comment on CASSANDRA-10797 at 12/11/15 12:46 PM:
--------------------------------------------------------------------
As mentioned before, I was able to reproduce the OOM with 1000 small sstables
and a 50M heap. I attached a [ccm
cluster|https://issues.apache.org/jira/secure/attachment/12777032/dtest.tar.gz]
with 2 nodes. To reproduce, extract {{dtest.tar.gz}} into the {{~/.ccm}} folder
and update the following properties in {{dtest/node*/conf/cassandra.yaml}} to
match your local directories: {{commitlog_directory}},
{{data_file_directories}} and {{saved_caches_directory}}. After that, run the
following commands:
{noformat}
ccm switch dtest
ccm node1 start
sleep 10
ccm node2 start (will throw OOM)
{noformat}
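For reference, the three directory properties mentioned above look roughly like
this in each node's {{cassandra.yaml}} (the paths below are just placeholders
for your local setup):
{noformat}
commitlog_directory: /home/<user>/.ccm/dtest/node1/commitlogs
data_file_directories:
    - /home/<user>/.ccm/dtest/node1/data
saved_caches_directory: /home/<user>/.ccm/dtest/node1/saved_caches
{noformat}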
The main problem is that all {{SSTableWriters}} remain open until the end of
the stream receive task, and these objects are quite large, since the indexes
and stats they hold are only written to disk when the {{SSTableWriters}} are
closed. Before CASSANDRA-6503, {{SSTableWriters}} were closed as soon as they
were received, and the stream receive task kept only the {{SSTableReaders}},
which have a much smaller memory footprint. The main reason to defer closing
the {{SSTableWriter}} to the end of the stream receive task was to keep the
sstables temporary (with the {{-tmp}} infix), preventing stale sstables from
reappearing if the machines are restarted after a failed repair session. A
discussed alternative was to close the {{SSTableWriter}} without removing the
{{-tmp}} infix and perform an atomic rename at the end of the stream task.
However, this alternative was disregarded because the {{SSTableReader}} would
need to be closed and reopened in order to perform the atomic rename on
non-POSIX systems such as Windows.
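To make the two lifecycles concrete, here is a minimal, self-contained sketch;
the types and method names below are illustrative stand-ins, not the actual
streaming classes:
{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins only; the real SSTableWriter/SSTableReader and the
// stream receive task have different, richer APIs.
interface IncomingFile {}
interface SSTableReader {}
@FunctionalInterface
interface SSTableWriter {
    // Flushes index/stats to disk, releases the writer's memory and returns a
    // lightweight reader.
    SSTableReader closeAndOpenReader();
}

class ReceiveTaskSketch {
    private SSTableWriter writeToDisk(IncomingFile file) {
        return () -> new SSTableReader() {}; // placeholder for the actual disk write
    }

    // Post-CASSANDRA-6503 shape: every writer stays open (indexes, stats and
    // buffers on heap) until the whole receive task finishes.
    List<SSTableWriter> receiveKeepingWritersOpen(List<IncomingFile> files) {
        List<SSTableWriter> writers = new ArrayList<>();
        for (IncomingFile file : files)
            writers.add(writeToDisk(file)); // one open writer per received sstable
        return writers;                     // closed only at the end of the task
    }

    // Pre-CASSANDRA-6503 shape (restored by this patch): close each writer as
    // soon as its file is received and keep only the much smaller reader.
    List<SSTableReader> receiveClosingWritersEarly(List<IncomingFile> files) {
        List<SSTableReader> readers = new ArrayList<>();
        for (IncomingFile file : files)
            readers.add(writeToDisk(file).closeAndOpenReader());
        return readers;
    }
}
{code}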
CASSANDRA-6503 also introduced the {{StreamLockFile}} to remove already-closed
{{SSTableWriters}} if the node goes down before those files are processed at
the end of the stream receive task. The proposed solution basically returns to
the previous behavior of closing {{SSTableWriters}} as soon as they are
received, while adding the already-closed-but-not-yet-live files to the
{{StreamLockFile}}. As soon as the sstables are added to the data tracker, the
{{StreamLockFile}} is removed. If the stream session fails before that, the
already-closed-but-not-yet-live sstables are cleaned up. If there is a failure
while adding files to the data tracker, only the files that were not yet added
are removed, since the files already added to the tracker are live. If the
node goes down during a stream session, the already-closed-but-not-yet-live
sstables present in the {{StreamLockFile}} are removed on the next startup, as
is done today.
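To illustrate that lifecycle, here is a rough, self-contained sketch using
plain {{java.nio.file}}; the class and method names mirror the operations
discussed but are hypothetical, not the actual {{StreamLockFile}}
implementation or on-disk format:
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;

// Hypothetical, simplified model of the lifecycle described above.
class StreamLockFileSketch {
    private final Path lockFile;

    StreamLockFileSketch(Path lockFile) {
        this.lockFile = lockFile;
    }

    // Called as soon as an SSTableWriter is closed: record the
    // already-closed-but-not-yet-live sstable in the lock file.
    void append(Path sstable) throws IOException {
        Files.write(lockFile, Collections.singletonList(sstable.toString()),
                    StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Called during operations such as nodetool refresh: an sstable listed in
    // a lock file was never made live and should be ignored.
    boolean skip(Path sstable) throws IOException {
        return Files.exists(lockFile)
               && Files.readAllLines(lockFile, StandardCharsets.UTF_8)
                       .contains(sstable.toString());
    }

    // Called once all sstables of the task were added to the data tracker
    // (i.e. they are live): the lock file is no longer needed.
    void delete() throws IOException {
        Files.deleteIfExists(lockFile);
    }

    // Called when the stream session fails, or on startup after a crash:
    // remove the closed-but-not-yet-live sstables, then the lock file itself.
    void cleanup() throws IOException {
        if (!Files.exists(lockFile))
            return;
        for (String line : Files.readAllLines(lockFile, StandardCharsets.UTF_8))
            Files.deleteIfExists(Paths.get(line.trim()));
        Files.deleteIfExists(lockFile);
    }
}
{code}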
Since the {{StreamLockFile}} becomes a much more critical component with this
approach, I added unit tests to verify that {{append}}, {{cleanup}}, {{skip}}
and {{delete}} work correctly. We also need to ignore sstables that are present
in a {{StreamLockFile}} during {{nodetool refresh}}; I will do that after a
first review, if this approach is validated.
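Purely as an illustration, a toy test against the sketch above (not the actual
unit tests added in the patch) would exercise those operations roughly like
this:
{code:java}
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.nio.file.Files;
import java.nio.file.Path;

import org.junit.Test;

// Toy test mirroring the kind of append/skip/cleanup checks described above.
public class StreamLockFileSketchTest {
    @Test
    public void closedButNotLiveSstablesAreSkippedAndCleanedUp() throws Exception {
        Path dir = Files.createTempDirectory("streamlockfile-test");
        Path sstable = Files.createFile(dir.resolve("ks-cf-ka-1-Data.db"));
        StreamLockFileSketch lock = new StreamLockFileSketch(dir.resolve("stream.lock"));

        lock.append(sstable);
        assertTrue(lock.skip(sstable));     // e.g. nodetool refresh would ignore it

        lock.cleanup();                     // simulates a failed session or restart
        assertFalse(Files.exists(sstable)); // the not-yet-live sstable is gone
        assertFalse(lock.skip(sstable));    // and so is the lock file
    }
}
{code}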
Below are some test results with and without the patch, with constrained (50M)
and unconstrained (500M) memory.
||*||unpatched||patched||
||constrained|!10797-nonpatched.png!|!10797-patched.png!|
||unconstrained|!10798-nonpatched-500M.png!|!10798-patched-500M.png!|
In the constrained case, the unpatched version OOMed soon after starting
bootstrap, while the patched version finished bootstrap successfully. In the
unconstrained case, the memory footprint is between 1/3 and 1/2 smaller, and
the difference is probably much larger in the case of large sstables.
Below are the initial patch and tests:
||2.1||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-10797]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10797-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10797-dtest/lastCompletedBuild/testReport/]|
I will provide 2.2+ versions after review.
> Bootstrap new node fails with OOM when streaming nodes contain thousands of
> sstables
> -------------------------------------------------------------------------------------
>
> Key: CASSANDRA-10797
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10797
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Environment: Cassandra 2.1.8.621 w/G1GC
> Reporter: Jose Martinez Poblete
> Assignee: Paulo Motta
> Fix For: 2.1.x
>
> Attachments: 10797-nonpatched.png, 10797-patched.png,
> 10798-nonpatched-500M.png, 10798-patched-500M.png, 112415_system.log,
> Heapdump_OOM.zip, Screen Shot 2015-12-01 at 7.34.40 PM.png, dtest.tar.gz
>
>
> When adding a new node to an existing DC, it runs OOM after 25-45 minutes.
> Upon heap dump review, it was found that the sending nodes are streaming
> thousands of sstables, which in turn blows up the bootstrapping node's heap:
> {noformat}
> ERROR [RMI Scheduler(0)] 2015-11-24 10:10:44,585 JVMStabilityInspector.java:94 - JVM state determined to be unstable. Exiting forcefully due to:
> java.lang.OutOfMemoryError: Java heap space
> ERROR [STREAM-IN-/173.36.28.148] 2015-11-24 10:10:44,585 StreamSession.java:502 - [Stream #0bb13f50-92cb-11e5-bc8d-f53b7528ffb4] Streaming error occurred
> java.lang.IllegalStateException: Shutdown in progress
>         at java.lang.ApplicationShutdownHooks.remove(ApplicationShutdownHooks.java:82) ~[na:1.8.0_65]
>         at java.lang.Runtime.removeShutdownHook(Runtime.java:239) ~[na:1.8.0_65]
>         at org.apache.cassandra.service.StorageService.removeShutdownHook(StorageService.java:747) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.utils.JVMStabilityInspector$Killer.killCurrentJVM(JVMStabilityInspector.java:95) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.utils.JVMStabilityInspector.inspectThrowable(JVMStabilityInspector.java:64) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:66) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:38) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:55) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:250) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_65]
> ERROR [RMI TCP Connection(idle)] 2015-11-24 10:10:44,585 JVMStabilityInspector.java:94 - JVM state determined to be unstable. Exiting forcefully due to:
> java.lang.OutOfMemoryError: Java heap space
> ERROR [OptionalTasks:1] 2015-11-24 10:10:44,585 CassandraDaemon.java:223 - Exception in thread Thread[OptionalTasks:1,5,main]
> java.lang.IllegalStateException: Shutdown in progress
> {noformat}
> Attached is the Eclipse MAT report as a zipped web page