[ 
https://issues.apache.org/jira/browse/IMPALA-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17935355#comment-17935355
 ] 

ASF subversion and git services commented on IMPALA-13333:
----------------------------------------------------------

Commit 7a5adc15ca0c3c374967a82d98675eacc1826916 in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=7a5adc15c ]

IMPALA-13333: Limit memory estimation if PlanNode can spill

SortNode, AggregationNode, and HashJoinNode (the build side) can spill
to disk. However, their memory estimation does not consider this
capability and assumes they will hold all rows in memory. This causes
memory overestimation when cardinality is also overestimated. In
reality, the whole query execution on a single host is often subject
to a much lower memory upper bound that it is not allowed to exceed.

This upper bound is dictated by, among other settings (see the sketch
after this list):
- MEM_LIMIT
- MEM_LIMIT_COORDINATORS
- MEM_LIMIT_EXECUTORS
- MAX_MEM_ESTIMATE_FOR_ADMISSION
- impala.admission-control.max-query-mem-limit.<pool_name>
  from admission control.
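
For illustration, the following self-contained Java sketch shows one
way such a per-host bound could be derived by taking the minimum over
whichever limits are configured. The names and the zero-means-unset
convention are assumptions for this sketch, not the patch's actual
code.

  // Hypothetical sketch only: combine whichever memory limits are set
  // (0 meaning "unset") into a single per-host upper bound.
  public class PerHostBoundSketch {
    static long minConfigured(long... limits) {
      long bound = Long.MAX_VALUE;
      for (long limit : limits) {
        if (limit > 0) bound = Math.min(bound, limit);
      }
      return bound;
    }

    public static void main(String[] args) {
      long memLimit = 0;                      // MEM_LIMIT unset
      long memLimitExecutors = 8L << 30;      // MEM_LIMIT_EXECUTORS = 8 GB
      long maxMemEstimateForAdmission = 0;    // unset
      long poolMaxQueryMemLimit = 12L << 30;  // pool limit = 12 GB
      // Prints 8589934592 (8 GB), the tightest configured limit.
      System.out.println(minConfigured(memLimit, memLimitExecutors,
          maxMemEstimateForAdmission, poolMaxQueryMemLimit));
    }
  }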

This patch adds a SpillableOperator interface that defines an
alternative to either PlanNode.computeNodeResourceProfile() or
DataSink.computeResourceProfile() when a lower memory upper bound can
be reasoned about from the configs mentioned above. The interface is
applied to SortNode, AggregationNode, HashJoinNode, and JoinBuildSink.
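
As a rough illustration of the idea (the interface name comes from the
patch, but the method signature below is a hypothetical simplification,
not the actual code), a spill-aware estimate could look like this:

  // Hypothetical sketch of a spill-aware estimation hook. The real
  // SpillableOperator interface may differ in methods and parameters.
  interface SpillableOperatorSketch {
    // Returns a memory estimate in bytes, capped by the (scaled)
    // per-host bound when such a bound can be derived from the options.
    long computeSpillAwareEstimate(
        long unboundedEstimate, long perHostBound, double scale);
  }

  class SortNodeSketch implements SpillableOperatorSketch {
    @Override
    public long computeSpillAwareEstimate(
        long unboundedEstimate, long perHostBound, double scale) {
      if (perHostBound <= 0) return unboundedEstimate;  // no usable bound
      return Math.min(unboundedEstimate, (long) (scale * perHostBound));
    }
  }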

The in-memory vs. spill-to-disk bias is controlled through the
MEM_ESTIMATE_SCALE_FOR_SPILLING_OPERATOR option, a scale in [0.0, 1.0]
that controls the estimated peak memory of query operators with
spill-to-disk capability. A value closer to 1.0 biases the planner
towards keeping as many rows as possible in memory, while a value
closer to 0.0 biases it towards spilling rows to disk under memory
pressure. Note that lowering MEM_ESTIMATE_SCALE_FOR_SPILLING_OPERATOR
can make a query that was previously rejected by the admission
controller admittable, but it may also spill to disk more and carries
a higher risk of exhausting scratch space than before.
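
As a worked example of this bias, using the same hypothetical capping
rule as in the interface sketch above (not the patch's exact
arithmetic): with an 8 GB per-host bound, a scale of 1.0 caps the
operator estimate at the full 8 GB, while 0.25 caps it at 2 GB and
plans for earlier spilling.

  // Hypothetical arithmetic only; mirrors the capping rule sketched above.
  public class ScaleExample {
    static long cappedEstimate(long unbounded, long bound, double scale) {
      if (bound <= 0) return unbounded;
      return Math.min(unbounded, (long) (scale * bound));
    }

    public static void main(String[] args) {
      long unbounded = 2L << 40;      // 2 TB overestimated operator memory
      long perHostBound = 8L << 30;   // 8 GB per-host bound
      // Scale 1.0: capped at 8589934592 (8 GB), favor in-memory.
      System.out.println(cappedEstimate(unbounded, perHostBound, 1.0));
      // Scale 0.25: capped at 2147483648 (2 GB), favor spilling.
      System.out.println(cappedEstimate(unbounded, perHostBound, 0.25));
    }
  }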

There are some caveats to this memory-bounding patch:
- It checks whether spill-to-disk is enabled in the coordinator, but
  individual backend executors might not have it configured. A mismatch
  of spill-to-disk configs between the coordinator and backend
  executors, however, is rare and can be considered a misconfiguration.
- It does not check the actual total scratch space available for
  spill-to-disk. However, query execution will be forced to spill
  anyway if memory usage exceeds the memory configs above. Raising the
  MEM_LIMIT / MEM_LIMIT_EXECUTORS option can help increase the memory
  estimate and the likelihood that the query is assigned to a larger
  executor group set, which usually has more total scratch space.
- The memory bound is divided evenly among all instances of a fragment
  kind on a single host (see the sketch after this list). In theory,
  though, they should be able to share the bound and grow their memory
  usage independently beyond the memory estimate as long as the max
  memory reservation is not set.
- It does not consider other memory-related configs such as the
  clamp_query_mem_limit_backend_mem_limit or disable_pool_mem_limits
  flags, but the admission controller will still enforce them if set.
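
A small worked illustration of the even division mentioned above, with
hypothetical numbers that are not taken from the patch:

  // Hypothetical example of dividing a per-host memory bound evenly
  // among the instances of one fragment kind on a host.
  public class PerInstanceBoundExample {
    public static void main(String[] args) {
      long perHostBound = 8L << 30;  // 8 GB per-host memory bound
      int instancesPerHost = 4;      // 4 instances of the fragment
      // Each instance is bounded at 2147483648 (2 GB).
      System.out.println(perHostBound / instancesPerHost);
    }
  }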

Testing:
- Pass FE and custom cluster tests with core exploration.

Change-Id: I290c4e889d4ab9e921e356f0f55a9c8b11d0854e
Reviewed-on: http://gerrit.cloudera.org:8080/21762
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Riza Suminto <[email protected]>


> Curb memory estimation for SORT node
> ------------------------------------
>
>                 Key: IMPALA-13333
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13333
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Riza Suminto
>            Priority: Major
>
> High cardinality overestimation can lead to severe memory overestimation for 
> the SORT node, even in a parallel plan. The TPC-DS Q31 and Q51 plans against 
> a synthetic 3 TB scale workload show such huge overestimation:
> [https://github.com/apache/impala/blob/ae6a3b9ec058dfea4b4f93d4828761f792f0b55e/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q31.test#L1319-L1323]
> [https://github.com/apache/impala/blob/ae6a3b9ec058dfea4b4f93d4828761f792f0b55e/testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q51.test#L511-L515]
> The planner should avoid estimating terabytes/petabytes of memory for the 
> SORT node, knowing that the SORT node can spill to disk under memory 
> pressure. The planner can also take the SORT_RUN_BYTES_LIMIT or 
> MAX_SORT_RUN_SIZE option values into account to come up with a lower memory 
> estimate.


