GitHub user wzhfy opened a pull request:
https://github.com/apache/spark/pull/19506
[SPARK-22285] [SQL] Change implementation of
ApproxCountDistinctForIntervals to TypedImperativeAggregate
## What changes were proposed in this pull request?
The current implementation of `ApproxCountDistinctForIntervals` is an
`ImperativeAggregate`. Its number of `aggBufferAttributes` equals the total
number of words across the `hllppHelper` array, and each `hllppHelper` uses
52 words at the default `relativeSD`.
Since this aggregate function is used in equi-height histogram generation,
and the number of buckets in a histogram is usually in the hundreds, the
number of `aggBufferAttributes` can easily reach tens of thousands or more.
This leads to a huge method in codegen and causes errors such as
`org.codehaus.janino.JaninoRuntimeException: Code of method
"apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
of class
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
grows beyond 64 KB`. Huge generated methods also degrade performance.
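To make the buffer-size arithmetic above concrete, here is a minimal standalone sketch. It assumes the HLL++ layout used by Spark's `HyperLogLogPlusPlusHelper` as we understand it (6-bit registers packed into 64-bit words, and one helper per interval, i.e. `endpoints - 1` helpers). The object and method names are hypothetical and not part of this patch.

```scala
// Hypothetical sketch, not code from this PR.
object BufferSizeSketch {
  // Assumption: 6-bit registers packed into 64-bit words (10 per word).
  val RegisterSize = 6
  val RegistersPerWord: Int = 64 / RegisterSize

  // Words needed by one HLL++ sketch for a given relative standard deviation.
  def numWords(relativeSD: Double): Int = {
    val p = math.ceil(2.0 * math.log(1.106 / relativeSD) / math.log(2.0)).toInt
    val m = 1 << p // number of registers
    m / RegistersPerWord + 1
  }

  // Assumption: one sketch per interval, so (endpoints - 1) sketches in total.
  def numBufferAttributes(numEndpoints: Int, relativeSD: Double = 0.05): Int =
    (numEndpoints - 1) * numWords(relativeSD)
}
```

Under these assumptions, `numWords(0.05)` gives 52, and 500 endpoints give 499 * 52 = 25948 buffer attributes, which is consistent with the "tens of thousands" figure above.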
In this PR, we change its implementation to `TypedImperativeAggregate`.
After the fix, `ApproxCountDistinctForIntervals` can handle thousands of
endpoints without throwing a codegen error, and run time improves
from `20 sec` to `2 sec` in a test case with 500 endpoints.
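The reason this helps is that a `TypedImperativeAggregate` keeps its buffer as an ordinary object and exposes it to the engine as a single serialized binary value, rather than one buffer attribute per word. The sketch below illustrates only that idea with plain Scala and `java.nio.ByteBuffer`. The object name and the choice of `Array[Long]` as the buffer type are assumptions for illustration, not the code in this patch.

```scala
import java.nio.ByteBuffer

// Hypothetical sketch, not code from this PR.
object TypedBufferSketch {
  // Pack all words into one byte array (a single binary buffer column).
  def serialize(words: Array[Long]): Array[Byte] = {
    val buf = ByteBuffer.allocate(words.length * 8)
    words.foreach(w => buf.putLong(w))
    buf.array()
  }

  // Restore the word array from its binary form.
  def deserialize(bytes: Array[Byte]): Array[Long] = {
    val buf = ByteBuffer.wrap(bytes)
    Array.fill(bytes.length / 8)(buf.getLong())
  }
}
```

However many words the buffer holds, the generated projection only has to copy one binary field, so its size no longer grows with the number of endpoints.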
## How was this patch tested?
Tested with an added test case and existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wzhfy/spark change_forIntervals_typedAgg
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19506.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19506
----
commit f66239469f2be030f23d86ff8686b59e99033b6c
Author: Zhenhua Wang <[email protected]>
Date: 2017-10-13T07:04:22Z
implement ApproxCountDistinctForIntervals as TypedImperativeAggregate
commit 1c3e18ada5870b5d2e2a3ecb54ae2312f27af09c
Author: Zhenhua Wang <[email protected]>
Date: 2017-10-14T03:34:44Z
remove offset
commit 792b58a2a473a899025311d37afc788db087c68e
Author: Zhenhua Wang <[email protected]>
Date: 2017-10-16T05:31:38Z
fix withOffset return type
commit 652b301bc032be6fe66664526f3c1c316eb981b6
Author: Zhenhua Wang <[email protected]>
Date: 2017-10-16T08:27:38Z
add test for large number of endpoints
----