[ 
https://issues.apache.org/jira/browse/CASSANDRA-12699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Heiko Sommer updated CASSANDRA-12699:
-------------------------------------
    Attachment: 201610-TimelinePlots.png
                201610-CorrelationPlot.png

Comment summary: New C* memory data confirms the page table (PTE) issue and may 
also be helpful for analyzing other memory issues. I am hoping to get the 
attention of people more knowledgeable about Cassandra and Linux memory issues.

During early October I collected more data on the same 12 GB node as before, 
using the "ps" command, /proc/<pid>/status, and /proc/<pid>/smaps dumps.
As labeled in the plots, first there was an ongoing node repair plus a large 
compaction. Then an OOM-killer action occurred and C* was restarted. After that 
we saw mainly one large compaction (6.6 TB uncompressed) running over 9 days.
During the whole time the cluster was used mostly for writing data, with very 
few reads.
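
For reference, the sketch below shows in Java roughly the kind of sampling that 
the attached cassandraMemoryLog.sh does with shell tools (the actual script may 
differ in detail): it dumps the Vm* lines of /proc/<pid>/status and cross-checks 
against the RSS reported by "ps". This is hypothetical illustration code, not 
part of the attachments.

{code:java}
// Hypothetical sampling sketch: print the Vm* lines of /proc/<pid>/status
// (VmRSS, VmData, VmPTE, ...) plus the RSS that "ps" reports for the same pid.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Instant;

public class StatusSampler
{
    public static void main(String[] args) throws IOException
    {
        String pid = args[0];

        System.out.println("### " + Instant.now());
        for (String line : Files.readAllLines(Paths.get("/proc/" + pid + "/status")))
            if (line.startsWith("Vm"))
                System.out.println(line);   // values are reported in kB

        // Cross-check: resident set size (kB) as "ps" sees it
        Process ps = new ProcessBuilder("ps", "-o", "rss=", "-p", pid).start();
        try (BufferedReader out = new BufferedReader(new InputStreamReader(ps.getInputStream())))
        {
            String rss = out.readLine();
            System.out.println("ps RSS: " + (rss == null ? "?" : rss.trim()) + " kB");
        }
    }
}
{code}

Running something like this every minute or so against the C* pid yields the 
kind of time series shown in the attached plots.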

The two diagrams in the attached file "201610-TimelinePlots.png" show various 
Cassandra and system memory measurements. Both show the same timeline and 
should be viewed together. Findings:
- Page table (VmPTE) memory grows linearly during compactions, up to more than 
4 GB, and eventually gets released. System memory follows this pattern.
- VmData increases in discrete steps and does not go down again when PTE / 
system memory are released. I don't know if this indicates a problem, or simply 
means that VmData is not an interesting measure.
- Cassandra RSS measurements done with "ps" and with "smaps" (summed over memory 
regions) are in good agreement (a sketch of this summing is shown after this 
list).
- Total RSS decreases under the pressure caused by the growing PTE, at the 
expense of mmap'd Cassandra data files, while the anonymous portion of RSS 
increases (though at a lower rate than PTE).
- Anonymous memory starts at about twice the heap size; I did not check whether 
this is coincidental.
- It is unclear why the anonymous RSS (and therefore also the used system 
memory) is larger on 2016-10-01 before the OOM kill than on 2016-10-10 before 
the end of the large compaction, while PTE memory is about the same at both 
times. Could it be a "memory leftover" from the first phase of the repair, 
before anticompaction? Maybe this is the same anonymous memory issue that Ariel 
mentioned, even if VmData turns out not to be the best indicator for it.
- I am not sure what the "Referenced" memory curve is telling us, for example 
why it increasingly deviates from the total RSS.
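
To illustrate the smaps summing mentioned in the list above, here is a minimal 
(hypothetical) sketch that totals the per-region Rss, Anonymous and Referenced 
fields of /proc/<pid>/smaps; the field names and the kB unit are as printed by 
the kernel listed in the environment section.

{code:java}
// Hypothetical sketch: sum selected per-region fields of /proc/<pid>/smaps to get
// process-wide totals comparable to the RSS that "ps" reports. All values are in kB.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class SmapsTotals
{
    public static void main(String[] args) throws IOException
    {
        String pid = args[0];
        Map<String, Long> totalsKb = new LinkedHashMap<>();

        for (String line : Files.readAllLines(Paths.get("/proc/" + pid + "/smaps")))
        {
            // Per-region value lines look like "Rss:        1234 kB"
            String[] parts = line.split("\\s+");
            if (parts.length == 3 && "kB".equals(parts[2]) && parts[0].endsWith(":"))
            {
                String key = parts[0].substring(0, parts[0].length() - 1);
                if (key.equals("Rss") || key.equals("Anonymous") || key.equals("Referenced"))
                    totalsKb.merge(key, Long.parseLong(parts[1]), Long::sum);
            }
        }
        totalsKb.forEach((k, v) -> System.out.println(k + ": " + v + " kB"));
    }
}
{code}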

Correlations in "201610-CorrelationPlot.png":
- Used system memory increases linearly with page table memory, but with a 
smaller slope after the C* restart. (This was roughly visible already in the 
timeline plots above.)
- The most linear correlation found (black curve) is between used system memory 
and C* PTE memory once the contribution of anonymous memory to system memory is 
subtracted (see the fit sketch after this list).
- The scaling of used system memory with C* anonymous memory, or with C* 
VmData, is less clear.
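
For completeness, the black-curve fit is just an ordinary least-squares fit 
where x is C* VmPTE and y is used system memory minus C* anonymous RSS (how the 
logged samples are fed in is my assumption; all values in kB). A small 
hypothetical helper:

{code:java}
// Hypothetical helper: ordinary least-squares fit y = slope * x + intercept.
// For the black curve: x[i] = VmPTE, y[i] = used system memory - C* anonymous RSS.
public class LinearFit
{
    /** Returns { slope, intercept } of the least-squares line through (x[i], y[i]). */
    public static double[] fit(double[] x, double[] y)
    {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++)
        {
            sx  += x[i];
            sy  += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        return new double[] { slope, intercept };
    }
}
{code}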

Result of configuring disk_access_mode: On other nodes that have even less RAM 
(8 GB), I configured {{disk_access_mode: mmap_index_only}}. All previously seen 
OOM-killer issues disappeared after that. This confirms that our memory issues 
are caused by the mmap'ing of data files.
Especially since we are currently reading very little data, it is difficult to 
predict what performance price one would pay for using this disk_access_mode as 
a workaround for the mmap/OOM-kill issue in a production system.

Intermediate conclusions:
- The mmap'ing of data files for (anti-)compactions can cause OOM-killer issues 
on a 12 GB RAM node with several TB of data on disk. 
- Cassandra should, as a minimum, document that substantially more RAM is 
required for such nodes, not only for decent operational performance but, even 
on test systems, simply to avoid crashes.
- Perhaps C* should revisit whether mmap'ing data files really brings an 
advantage for the linear file access done during compactions and 
anticompactions. Is there a test setup that could compare it with sequential IO?
- If sequential IO turns out to be equally good in these special cases, then C* 
could offer a new disk_access_mode such as "mmap_except_compactions", or even 
make this the default. Otherwise, if using mmap is clearly better also for 
(anti-)compactions, then C* could perhaps implement the "running working 
section" with eager "munmap"ing, as described in my previous comment (a rough 
sketch of the idea follows below).
- If anyone is interested in investigating these memory issues, I'd be happy to 
provide more data or diagrams. Unfortunately, at the moment I'm unable to 
contribute changes to Cassandra myself, but testing changes provided by others 
should be possible.
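
To make the "running working section" idea from the list above more concrete, 
here is a rough, hypothetical sketch (not Cassandra code) of sequential access 
through a bounded mmap window with eager unmapping, so that the page table 
entries of already-processed chunks can be released long before the whole file 
has been read. The unmap() helper uses the same JDK-internal cleaner mechanism 
that Cassandra itself uses to unmap buffers on Java 8 (FileUtils.clean); JDK 9+ 
would need a different call.

{code:java}
// Hypothetical sketch: read a large file sequentially through a 64 MB mmap window,
// unmapping each window eagerly so that its PTEs can be freed right away.
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class WindowedMmapReader
{
    private static final long WINDOW = 64L << 20; // window size is illustrative

    public static void main(String[] args) throws IOException
    {
        try (FileChannel channel = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ))
        {
            long size = channel.size();
            long checksum = 0;
            for (long offset = 0; offset < size; offset += WINDOW)
            {
                long length = Math.min(WINDOW, size - offset);
                MappedByteBuffer window = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
                while (window.hasRemaining())   // "process" the chunk sequentially
                    checksum += window.get();
                unmap(window);                  // eager munmap: PTEs for this chunk go away now
            }
            System.out.println("done, checksum=" + checksum);
        }
    }

    // Same JDK-internal trick as Cassandra's FileUtils.clean() on Java 8.
    private static void unmap(MappedByteBuffer buffer)
    {
        sun.misc.Cleaner cleaner = ((sun.nio.ch.DirectBuffer) buffer).cleaner();
        if (cleaner != null)
            cleaner.clean();
    }
}
{code}

With a tool like this one could also compare mmap'd windowed reads against plain 
sequential IO (e.g. FileChannel.read into a reused ByteBuffer) on the same 
SSTable-sized files, which is essentially the comparison asked for above.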


> Excessive use of "hidden" Linux page table memory
> -------------------------------------------------
>
>                 Key: CASSANDRA-12699
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12699
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cassandra 2.2.7 on Red Hat 6.7, kernel 
> 2.6.32-573.18.1.el6.x86_64, with Java 1.8.0_73. Probably others. 
>            Reporter: Heiko Sommer
>         Attachments: 201610-CorrelationPlot.png, 201610-TimelinePlots.png, 
> PageTableMemoryExample.png, cassandra-env.sh, cassandra.yaml, 
> cassandraMemoryLog.sh
>
>
> The cassandra JVM process uses many gigabytes of page table memory during 
> certain activities, which can lead to oom-killer action with 
> "java.lang.OutOfMemoryError: null" logs.
> Page table memory is not reported by Linux tools such as "top" or "ps" and 
> therefore might be responsible also for other spurious Cassandra issues with 
> "memory eating" or crashes, e.g. CASSANDRA-8723.
> The problem happens especially (or only?) during large compactions and 
> anticompactions. 
> Eventually all memory gets released, which means there is no real leak. Still 
> I suspect that the memory mappings that fill the page table could be released 
> much sooner, to keep the page table size at a small fraction of the total 
> Cassandra process memory. 
> How to reproduce: Record the memory use on a Cassandra node, including page 
> table memory, for example using the attached script cassandraMemoryLog.sh. 
> Even when there is no crash, the ramping up and sudden release of page table 
> memory is visible. 
> A stacked area plot for the memory on one of our crashed nodes is attached 
> (PageTableMemoryExample.png). The page table memory used by Cassandra is 
> shown in red ("VmPTE").
> (In the plot we also see that the sum of measured memory portions sometimes 
> exceeds the total memory. This is probably an issue of how RSS memory is 
> measured, perhaps including some buffers/cache memory that also counts toward 
> available memory. It does not invalidate the finding that page table memory 
> is growing to enormous sizes.) 
> Shortly before the crash, /proc/$PID/status reported 
>                 VmPeak: 6989760944 kB
>                 VmSize: 5742400572 kB
>                 VmLck:   4735036 kB
>                 VmHWM:   8589972 kB
>                 VmRSS:   7022036 kB
>                 VmData: 10019732 kB
>                 VmStk:        92 kB
>                 VmExe:         4 kB
>                 VmLib:     17584 kB
>                 VmPTE:   3965856 kB
>                 VmSwap:        0 kB
> The files cassandra.yaml and cassandra-env.sh used on the node where the data 
> was taken are attached. 
> Please let me know if I should provide any other data or descriptions to help 
> with this ticket. 
> Known workarounds: Use more RAM, or limit the amount of Java heap memory. In 
> the above crash, MAX_HEAP_SIZE was not set, so that the default heap size for 
> 12 GB RAM was used (-Xms2976M, -Xmx2976M). 
> We have not tried yet if variations of heap vs. offheap config choices make a 
> difference. 
> Perhaps there are other workarounds using -XX:+UseLargePages or related Linux 
> settings to reduce the size of the process page table?
> I believe that we see these crashes more often than other projects because we 
> have a test system with not much RAM but with a lot of data (compressed ~3 TB 
> per node), while the CPUs are slow so that anti-/compactions overlap a lot. 
> Ideally Cassandra (native) code should be changed to release memory in 
> smaller chunks, so that page table size cannot cause an otherwise stable 
> system to crash.



