[ https://issues.apache.org/jira/browse/KUDU-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adar Dembo resolved KUDU-2466.
------------------------------
Resolution: Fixed
Fix Version/s: 1.10.0
Fixed in commit 1567dec0865586ef93d8a8f555eea30a21228347.
> Fault tolerant scanners can over-allocate memory and crash a cluster
> --------------------------------------------------------------------
>
> Key: KUDU-2466
> URL: https://issues.apache.org/jira/browse/KUDU-2466
> Project: Kudu
> Issue Type: Bug
> Components: tserver
> Reporter: Grant Henke
> Assignee: Adar Dembo
> Priority: Critical
> Fix For: 1.10.0
>
>
> When testing a Spark job with fault tolerant scanners enabled, reading a
> large table (~1.5 TB replicated) with many columns used up all of the
> memory on the tablet servers. 400 GB of total memory was consumed even
> though the memory limit was configured as 60 GB. This impacted all services
> on the machines, making the cluster effectively unusable. Killing the job
> running the scans did not free the memory. However, restarting the tablet
> servers resulted in a healthy cluster.
>
> Based on a chat with [~tlipcon], [~jdcryans], and [~mpercy], it looks like we
> are not lazy in MergeIterator initialization, and we could fix this by
> initializing the merger lazily based on rowset bounds, limiting the number of
> concurrently open scanners to O(rowset height).
>
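The lazy-merge idea in the description can be illustrated with a small sketch. This is not the code from the fix commit and not Kudu's C++ MergeIterator. It is a simplified Python model in which `RowSet`, `lazy_merge`, and the `stats` dictionary are hypothetical names, and each rowset is assumed to be a non-empty sorted list of integer keys with known bounds. A rowset's iterator is opened only once its lower bound could precede the smallest key currently in the merge heap, so the number of simultaneously open iterators tracks how many rowsets overlap at a given key rather than the total number of rowsets.

```python
import heapq
from typing import Dict, Iterator, List


class RowSet:
    """Hypothetical stand-in for a rowset: sorted keys plus their bounds.

    Assumption: every rowset is non-empty.
    """

    def __init__(self, keys: List[int]) -> None:
        self.keys = sorted(keys)
        self.min_key = self.keys[0]
        self.max_key = self.keys[-1]


_DONE = object()  # sentinel marking an exhausted iterator


def lazy_merge(rowsets: List[RowSet], stats: Dict[str, int]) -> Iterator[int]:
    """Yield all keys in sorted order, opening rowsets only when needed.

    stats["peak_open"] records the largest number of iterators that were
    open at the same time.
    """
    pending = sorted(rowsets, key=lambda r: r.min_key)
    next_idx = 0
    heap = []  # entries: (current_key, rowset_index, iterator)
    open_count = 0
    stats["peak_open"] = 0

    while next_idx < len(pending) or heap:
        # Open each pending rowset whose lower bound could precede (or tie
        # with) the smallest key among the already-open iterators.
        while next_idx < len(pending) and (
            not heap or pending[next_idx].min_key <= heap[0][0]
        ):
            it = iter(pending[next_idx].keys)
            heapq.heappush(heap, (next(it), next_idx, it))
            next_idx += 1
            open_count += 1
            stats["peak_open"] = max(stats["peak_open"], open_count)

        # Every unopened rowset now starts after the heap's head, so the
        # head is the global minimum of all remaining keys.
        key, idx, it = heapq.heappop(heap)
        yield key

        nxt = next(it, _DONE)
        if nxt is _DONE:
            open_count -= 1  # iterator exhausted; treat it as closed
        else:
            heapq.heappush(heap, (nxt, idx, it))
```

For example, merging four rowsets with keys `[1, 2, 3]`, `[10, 11, 12]`, `[20, 21]`, and `[2, 5]` never holds more than two iterators open at once in this model, whereas an eager merge would open all four up front.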