peter-toth commented on PR #54330:
URL: https://github.com/apache/spark/pull/54330#issuecomment-3999793601

   > Also can we add more tests:
   > 
   > * Empty grouped partitions: Plan that yields groupedPartitions.isEmpty 
(ie, partitioned table but no partition values inserted)
   
   I added an empty partitioned table test in 
https://github.com/apache/spark/pull/54330/commits/4a904ad98a3870e99136cf88a797903feb6496b8,
 but it seems we prevent returing `KeyedPartitioning` without partitions. This 
is not new behaviour, it worked the same way with `KeyGroupedPartitioning` 
before this PR.
   If we removed that `inputPartitions.nonEmpty` guard from `BatchScanExec` and 
then the 2 shuffles would disappear, but no `GroupPartitionsExec` is added as 
those are not needed. Maybe the only way to get `GroupPartitionsExec` with 
empty `groupedPartitions` is to enable 
`spark.sql.sources.v2.bucketing.partition.filter.enabled` and use disjoint set 
of keys in join legs to get empty `expectedPartitionKeys`.
   
   > * Runtime filtering + KeyedPartitioning: Keyed scan with runtime filtering 
and Option[InputPartition] (including None for filtered-out keys); check result 
correctness and that no partition is read for filtered keys.
   
   I've extended the 2 existing "dynamic partition filtering" tests in 
https://github.com/apache/spark/pull/54330/commits/5a4ecd1815b2f5af5700d3a1d65107d2ac67b928.
   
   > * OrderedDistribution: Require ordering on a keyed scan and assert that a 
single GroupPartitionsExec is inserted with the expected keys.
   
   I already added `GroupPartitionsExec` check to the existing `SPARK-48655: 
order by on partition keys should not introduce additional shuffle` test. It is 
an extensive tests with 5 different orderings, which seems to render expected 
key checks pointless.
   
   > * Multi table joins (looks like there's already tests, but maybe we can 
check the plan shape for the GroupPartitionExec)
   
   The 2 new "Multi table join" join tests do that, but I can add more or 
extend them if you miss something.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to