NEUpanning commented on issue #8184:
URL: https://github.com/apache/incubator-gluten/issues/8184#issuecomment-2534703555

   > I think we could preserve the sort as long as vanilla Spark plan has it with a hash agg.
   
   @zhztheplayer I see this feature is implemented in the Gluten 1.2 branch, but the main branch doesn't include it for some reason. For this issue, the CollectList function is replaced by the VeloxCollectList function in the logical optimization phase. Here is the Spark plan:
   ```
   == Parsed Logical Plan ==
   Aggregate [id#0], [id#0, collect_list(value#1, 0, 0) AS values_list#13]
   +- SubqueryAlias t
      +- Project [id#0, value#1]
         +- RepartitionByExpression [_nondeterministic#15], 20
            +- Project [id#0, value#1, rand(5386921442550703776) AS _nondeterministic#15]
               +- Project [id#0, value#1]
                  +- SubqueryAlias temp_table
                     +- Project [id#0, value#1]
                        +- SubqueryAlias t
                           +- LocalRelation [id#0, value#1]
   
   == Analyzed Logical Plan ==
   id: int, values_list: array<string>
   Aggregate [id#0], [id#0, collect_list(value#1, 0, 0) AS values_list#13]
   +- SubqueryAlias t
      +- Project [id#0, value#1]
         +- RepartitionByExpression [_nondeterministic#15], 20
            +- Project [id#0, value#1, rand(5386921442550703776) AS _nondeterministic#15]
               +- Project [id#0, value#1]
                  +- SubqueryAlias temp_table
                     +- Project [id#0, value#1]
                        +- SubqueryAlias t
                           +- LocalRelation [id#0, value#1]
   
   == Optimized Logical Plan ==
   Aggregate [id#0], [id#0, velox_collect_list(value#1) AS values_list#13]
   +- Project [id#0, value#1]
      +- RepartitionByExpression [_nondeterministic#15], 20
         +- LocalRelation [id#0, value#1, _nondeterministic#15]
   
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=true
   +- == Final Plan ==
      VeloxColumnarToRowExec
      +- ^(3) HashAggregateTransformer(keys=[id#0], functions=[velox_collect_list(value#1)], isStreamingAgg=false, output=[id#0, values_list#13])
         +- ^(3) InputIteratorTransformer[id#0, buffer#20]
            +- CustomShuffleReader coalesced
               +- ShuffleQueryStage 1
                  +- ColumnarExchange hashpartitioning(id#0, 20), ENSURE_REQUIREMENTS, [id#0, buffer#20], [id=#1048], [id=#1048], [OUTPUT] List(id:IntegerType, buffer:ArrayType(StringType,false)), [OUTPUT] List(id:IntegerType, buffer:ArrayType(StringType,false))
                     +- VeloxAppendBatches 3276
                        +- ^(2) ProjectExecTransformer [hash(id#0, 42) AS hash_partition_key#22, id#0, buffer#20]
                           +- ^(2) FlushableHashAggregateTransformer(keys=[id#0], functions=[partial_velox_collect_list(value#1)], isStreamingAgg=false, output=[id#0, buffer#20])
                              +- ^(2) ProjectExecTransformer [id#0, value#1]
                                 +- ^(2) InputIteratorTransformer[id#0, value#1, _nondeterministic#15]
                                    +- ShuffleQueryStage 0
                                       +- ColumnarExchange hashpartitioning(_nondeterministic#15, 20), REPARTITION_WITH_NUM, [id#0, value#1, _nondeterministic#15], [id=#966], [id=#966], [OUTPUT] List(id:IntegerType, value:StringType, _nondeterministic:DoubleType), [OUTPUT] List(id:IntegerType, value:StringType, _nondeterministic:DoubleType)
                                          +- VeloxAppendBatches 3276
                                             +- ^(1) ProjectExecTransformer [hash(_nondeterministic#15, 42) AS hash_partition_key#21, id#0, value#1, _nondeterministic#15]
                                                +- ^(1) InputIteratorTransformer[id#0, value#1, _nondeterministic#15]
                                                   +- RowToVeloxColumnar
                                                      +- LocalTableScan [id#0, value#1, _nondeterministic#15]
   +- == Initial Plan ==
      SortAggregate(key=[id#0], functions=[velox_collect_list(value#1)], output=[id#0, values_list#13])
      +- Sort [id#0 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(id#0, 20), ENSURE_REQUIREMENTS, [id=#936]
            +- SortAggregate(key=[id#0], functions=[partial_velox_collect_list(value#1)], output=[id#0, buffer#20])
               +- Sort [id#0 ASC NULLS FIRST], false, 0
                  +- Project [id#0, value#1]
                     +- Exchange hashpartitioning(_nondeterministic#15, 20), REPARTITION_WITH_NUM, [id=#928]
                        +- LocalTableScan [id#0, value#1, _nondeterministic#15]
   ```
   This leads Spark to use SortAggregateExec instead of ObjectHashAggregateExec as the aggregate operator. So I think the sort should also be preserved when the aggregate expressions include VeloxCollectList or VeloxCollectSet.
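   For context, vanilla Spark's planner only picks ObjectHashAggregateExec when at least one aggregate function is a TypedImperativeAggregate (which vanilla collect_list is), and it only picks HashAggregateExec when every aggregation buffer field has a mutable fixed-size type. A function like velox_collect_list, which satisfies neither condition, falls through to SortAggregateExec. A simplified pseudocode sketch of that selection, paraphrased from Spark's AggUtils.createAggregate (isMutableType stands in for Spark's internal mutable-buffer-type check, and the spark.sql.execution.useObjectHashAggregateExec flag is omitted):
   ```
   // Simplified paraphrase of Spark's physical aggregate selection, not the exact source.
   def chooseAggregate(funcs: Seq[AggregateFunction]): String = {
     if (funcs.flatMap(_.aggBufferAttributes).forall(a => isMutableType(a.dataType))) {
       "HashAggregateExec"          // all buffer fields are mutable fixed-size types
     } else if (funcs.exists(_.isInstanceOf[TypedImperativeAggregate[_]])) {
       "ObjectHashAggregateExec"    // vanilla collect_list lands here
     } else {
       "SortAggregateExec"          // velox_collect_list: ArrayType buffer and not a
     }                              // TypedImperativeAggregate, so it lands here
   }
   ```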
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

