[GitHub] [spark] peter-toth commented on pull request #42223: [SPARK-44571][SQL] Eliminate the Join by combine multiple Aggregates

via GitHub Thu, 03 Aug 2023 03:21:00 -0700


peter-toth commented on PR #42223:
URL: https://github.com/apache/spark/pull/42223#issuecomment-1663724814


   > > But my point is that it doesn't matter what the filter looks like (whether it is an OR condition or not). I enabled merging if only
   > > FileSourceScanExec.dataFilters differ between the 2 scans. If FileSourceScanExec.partitionFilters or
   > > FileSourceScanExec.optionalBucketSet differ, then merging is disabled because partitioning and bucketing filters can be
   > > more selective in terms of which files to scan...
   > 
   > In theory, whether it is data filters or partition filters, there is a possibility of data overlap when the filters are connected with `or`. Before merging the filters (e.g. `p = 1`, `p = 2`), assuming each partition has one file, we need to read two partition files. After merging the two filters, we still need to read two partition files. I think the overhead of scanning the partition files is the same. The difference is that each filter needs to evaluate more rows, e.g. `p = 1` also needs to process the data that comes from `p = 2`. So, personally, I think the computation overhead is similar, no matter which kind of filter it is.
   > 
   > The main consideration for filter merging is the amount of overlapping data. For example, `F1` obtains 100 rows of data, and `F2` obtains 50 rows of data. If the 100 rows and the 50 rows completely overlap, this is the best case: `F1` on `Aggregate1` still processes 100 rows of data, while `F2` on `Aggregate2` processes an additional 50 rows, for a total of 100 rows. The worst case is that the two do not overlap at all. Then `F1` on `Aggregate1` needs to process an additional 50 rows, a total of 150 rows; `F2` on `Aggregate2` processes an additional 100 rows, also totaling 150 rows.
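
For reference, the row counts in the quoted example can be reproduced with a small sketch. This is plain Python, not Spark code; the function name `rows_after_merge` and its result keys are made up for illustration, and it assumes the merged scan returns the union of the rows matched by `F1` and `F2`.

```python
def rows_after_merge(f1_rows: int, f2_rows: int, overlap: int) -> dict:
    """Rows each aggregate has to process once the two scans are merged.

    Assumption (not taken from Spark): the merged scan returns the union
    of the rows matched by F1 and F2, and each aggregate then re-applies
    its own filter to every row of that union.
    """
    if not 0 <= overlap <= min(f1_rows, f2_rows):
        raise ValueError("overlap must be between 0 and the smaller row count")
    merged = f1_rows + f2_rows - overlap  # size of the union
    return {
        "merged_scan_rows": merged,
        "extra_rows_aggregate1": merged - f1_rows,  # rows F1 has to reject
        "extra_rows_aggregate2": merged - f2_rows,  # rows F2 has to reject
    }


best = rows_after_merge(100, 50, overlap=50)   # complete overlap
worst = rows_after_merge(100, 50, overlap=0)   # no overlap at all
```

With complete overlap each aggregate sees 100 rows (0 extra for `Aggregate1`, 50 extra for `Aggregate2`); with no overlap each sees 150 rows (50 and 100 extra), which matches the numbers in the quote.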
   
   Sorry @beliefer, I didn't explain all the reasoning behind my heuristics in https://github.com/apache/spark/pull/37630. I've updated https://github.com/apache/spark/pull/42223#discussion_r1282023520, please see the details there.
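
As a rough illustration of the merge condition described in the quoted text above (merging is allowed when only the data filters differ, and disabled when partition filters or the bucket set differ), here is a minimal sketch. It is a plain-Python model, not the actual Spark implementation: `ScanInfo` and `can_merge` are hypothetical names, and the fields only mirror `FileSourceScanExec.dataFilters`, `partitionFilters` and `optionalBucketSet` by name.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ScanInfo:
    """Hypothetical stand-in for the filter-related fields of a file scan."""
    data_filters: FrozenSet[str]
    partition_filters: FrozenSet[str]
    optional_bucket_set: Optional[FrozenSet[int]] = None


def can_merge(a: ScanInfo, b: ScanInfo) -> bool:
    """Allow merging only when the scans differ in data filters at most.

    Partition filters and bucket sets decide which files are read, so any
    difference there disables merging in this sketch.
    """
    return (
        a.partition_filters == b.partition_filters
        and a.optional_bucket_set == b.optional_bucket_set
    )


same_partition = can_merge(
    ScanInfo(frozenset({"a > 1"}), frozenset({"p = 1"})),
    ScanInfo(frozenset({"b < 5"}), frozenset({"p = 1"})),
)
different_partition = can_merge(
    ScanInfo(frozenset({"a > 1"}), frozenset({"p = 1"})),
    ScanInfo(frozenset({"a > 1"}), frozenset({"p = 2"})),
)
```

Here `same_partition` is `True` because only the data filters differ, while `different_partition` is `False` because the partition filters differ.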


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

