[GitHub] [spark] beliefer commented on pull request #42223: [SPARK-44571][SQL] Eliminate the Join by combine multiple Aggregates

via GitHub Thu, 03 Aug 2023 02:15:29 -0700


beliefer commented on PR #42223:
URL: https://github.com/apache/spark/pull/42223#issuecomment-1663601053


   > But my point is that it doesn't matter how the filter looks like (is it an OR condition or not). I enabled merging if only
   > FileSourceScanExec.dataFilters differ between the 2 scans. If FileSourceScanExec.partitionFilters or
   > FileSourceScanExec.optionalBucketSet differ then merging is disabled because partitioning and bucketing filters can be
   > more selective in terms what files to scan...
   
   In theory, whether the filters are data filters or partition filters, the data they select can overlap once they are combined with `or`.
   Before merging the filters (e.g. `p = 1`, `p = 2`), assume each partition has one file, so we need to read two partition files.
   After merging the two filters, we still need to read two partition files.
   I think the overhead of scanning the partition files is the same. The difference is that the filters have more to evaluate, e.g. `p = 1` must also be evaluated against the data that comes from `p = 2`.
   So, personally, I think the extra computation cost is similar no matter which kind of filter it is.
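
   To make the counting above concrete, here is a rough sketch in plain Python. It is not Spark code and does not touch `FileSourceScanExec`; the one-file-per-partition layout, the three rows per file, and the helper name `scan` are all assumptions made up for illustration.

```python
# Hypothetical table layout: one file per partition value, 3 rows per file.
files = {p: [{"p": p, "v": v} for v in range(3)] for p in (1, 2, 3)}

def scan(partition_values):
    """Read only the files of the given partitions; return (rows, files_read)."""
    rows = [row for p in partition_values for row in files[p]]
    return rows, len(partition_values)

# Before merging: two scans, each pruned to a single partition.
# Pruning already selected the rows, so no per-row filter is evaluated.
rows_p1, n1 = scan([1])
rows_p2, n2 = scan([2])
separate_files_read = n1 + n2
separate_counts = (len(rows_p1), len(rows_p2))
separate_filter_evals = 0

# After merging: one scan with `p = 1 OR p = 2`, then each aggregate
# re-applies its own filter to every row the merged scan returns.
merged_rows, merged_files_read = scan([1, 2])
merged_filter_evals = 0
merged = []
for wanted in (1, 2):
    matched = 0
    for row in merged_rows:
        merged_filter_evals += 1  # `p = wanted` is checked on rows of both partitions
        if row["p"] == wanted:
            matched += 1
    merged.append(matched)
merged_counts = tuple(merged)

print(separate_files_read, merged_files_read)
print(separate_counts, merged_counts)
print(separate_filter_evals, merged_filter_evals)
```

   In this toy model the number of files read is the same before and after merging, and the only extra cost is the per-row filter evaluation. It says nothing about cases where the two scans share files, or about how the real cost compares in Spark.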


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


