GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/19506

    [SPARK-22285] [SQL] Change implementation of 
ApproxCountDistinctForIntervals to TypedImperativeAggregate

    ## What changes were proposed in this pull request?
    
    `ApproxCountDistinctForIntervals` is currently implemented as an 
`ImperativeAggregate`. Its number of `aggBufferAttributes` equals the total 
number of words in the hllppHelper array, and each hllppHelper has 52 words 
with the default relativeSD.
    
    Since this aggregate function is used in equi-height histogram generation, 
and a histogram usually has hundreds of buckets, the number of 
`aggBufferAttributes` can easily reach tens of thousands or more.
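
    To make the scale concrete, here is a rough back-of-the-envelope sketch 
of the arithmetic. It assumes the default relativeSD is 0.05, that the word 
count follows the usual HyperLogLog++ sizing (6-bit registers packed into 
64-bit words), and that N endpoints give N - 1 intervals; the function names 
are made up for illustration and are not Spark APIs.

```python
import math

# Assumed layout: 6-bit registers packed into 64-bit words.
REGISTER_SIZE = 6
WORD_SIZE = 64
REGISTERS_PER_WORD = WORD_SIZE // REGISTER_SIZE  # 10 registers per word


def words_per_helper(relative_sd=0.05):
    """Words needed by one HLL++ sketch for a given relative standard deviation."""
    # Precision p: the number of registers is m = 2^p.
    p = math.ceil(2.0 * math.log(1.106 / relative_sd) / math.log(2.0))
    m = 1 << p
    return m // REGISTERS_PER_WORD + 1


def total_buffer_attributes(num_endpoints, relative_sd=0.05):
    """Total words (one buffer attribute each) across all intervals."""
    num_intervals = num_endpoints - 1  # assumption: N endpoints -> N - 1 intervals
    return num_intervals * words_per_helper(relative_sd)


print(words_per_helper())            # 52
print(total_buffer_attributes(500))  # 25948
```

    Under these assumptions, 500 endpoints already require about 26 thousand 
buffer attributes, which matches the "tens of thousands" estimate above.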
    
    This leads to a huge method in codegen and causes errors such as 
`org.codehaus.janino.JaninoRuntimeException: Code of method 
"apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB`. Huge generated methods also degrade 
performance.
    
    In this PR, we change its implementation to `TypedImperativeAggregate`. 
After the fix, `ApproxCountDistinctForIntervals` can handle thousands of 
endpoints without throwing a codegen error, and the running time drops 
from `20 sec` to `2 sec` in a test case with 500 endpoints.
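
    The idea behind the change can be sketched outside of Spark: instead of 
exposing every 64-bit word as its own buffer attribute, the whole word array 
is carried as one opaque binary value that is serialized and deserialized as 
a unit. The snippet below is only a conceptual illustration with hypothetical 
helper names and an illustrative buffer size (499 intervals of 52 words 
each); it is not the code in this PR.

```python
import struct


def serialize(words):
    """Pack a list of signed 64-bit words into a single bytes value."""
    return struct.pack(f"<{len(words)}q", *words)


def deserialize(buf):
    """Unpack a bytes value produced by serialize() back into a list of words."""
    n = len(buf) // 8  # 8 bytes per 64-bit word
    return list(struct.unpack(f"<{n}q", buf))


# Illustrative size: 499 intervals * 52 words = 25,948 words.
words = [0] * (499 * 52)
words[7] = 42

buf = serialize(words)  # one binary field instead of 25,948 separate fields
assert deserialize(buf) == words
```

    With a single binary buffer, the generated projection code no longer 
grows with the number of endpoints.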
    
    ## How was this patch tested?
    
    Tested with an added test case and the existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark change_forIntervals_typedAgg

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19506.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19506
    
----
commit f66239469f2be030f23d86ff8686b59e99033b6c
Author: Zhenhua Wang <[email protected]>
Date:   2017-10-13T07:04:22Z

    implement ApproxCountDistinctForIntervals as TypedImperativeAggregate

commit 1c3e18ada5870b5d2e2a3ecb54ae2312f27af09c
Author: Zhenhua Wang <[email protected]>
Date:   2017-10-14T03:34:44Z

    remove offset

commit 792b58a2a473a899025311d37afc788db087c68e
Author: Zhenhua Wang <[email protected]>
Date:   2017-10-16T05:31:38Z

    fix withOffset return type

commit 652b301bc032be6fe66664526f3c1c316eb981b6
Author: Zhenhua Wang <[email protected]>
Date:   2017-10-16T08:27:38Z

    add test for large number of endpoints

----

