Jungtaek Lim created SPARK-24763:
------------------------------------

             Summary: Remove redundant key data from value in streaming aggregation
                 Key: SPARK-24763
                 URL: https://issues.apache.org/jira/browse/SPARK-24763
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jungtaek Lim


The key and value of state in streaming aggregation are formatted as follows:
 * key: UnsafeRow containing the group-by fields
 * value: UnsafeRow containing the key fields plus the other fields for the
   aggregation results

As a result, the key data is stored in both the key and the value.

The intent is to boost performance by avoiding two steps: projecting the row 
down to the value when storing, and joining key and value to restore the 
original row when reading. However, in a simple benchmark test I found that 
this does not help much compared to "project and join". (I will paste the test 
results in a comment.)
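
To make the two layouts concrete, here is a minimal sketch in plain Python. It 
is not Spark code: tuples stand in for UnsafeRow, a dict stands in for the 
state store, and NUM_KEY_FIELDS and all function names are hypothetical names 
chosen for illustration only.

```python
from typing import Dict, Tuple

Row = Tuple  # stand-in for UnsafeRow (the real one is a binary row format)

NUM_KEY_FIELDS = 2  # assumption: the first two columns are the group-by fields


def put_redundant(store: Dict[Row, Row], row: Row) -> None:
    """Current layout: the value is the whole row, key fields included."""
    store[row[:NUM_KEY_FIELDS]] = row


def get_redundant(store: Dict[Row, Row], key: Row) -> Row:
    """Current layout: the stored value already is the original row."""
    return store[key]


def put_projected(store: Dict[Row, Row], row: Row) -> None:
    """Proposed layout: project the key fields out of the value before storing."""
    store[row[:NUM_KEY_FIELDS]] = row[NUM_KEY_FIELDS:]


def get_joined(store: Dict[Row, Row], key: Row) -> Row:
    """Proposed layout: join key and value to restore the original row."""
    return key + store[key]


# Illustrative row: two group-by fields followed by two aggregation results.
row = ("user-1", "2018-07-09", 42, 3.5)
current: Dict[Row, Row] = {}
proposed: Dict[Row, Row] = {}
put_redundant(current, row)
put_projected(proposed, row)
```

Both layouts give back the same row on read. The proposed one stores fewer 
fields per value, at the cost of one projection on write and one join on read.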

So I propose a new option: remove the redundant key data from the value in 
stateful aggregation. I'm avoiding changing the default behavior of stateful 
aggregation, because the state value will not be compatible between the 
current format and the format used when the option is enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

