beliefer commented on PR #42223: URL: https://github.com/apache/spark/pull/42223#issuecomment-1663601053
> But my point is that it doesn't matter how the filter looks like (is it an OR condition or not). I enabled merging if only
> FileSourceScanExec.dataFilters differ between the 2 scans. If FileSourceScanExec.partitionFilters or
> FileSourceScanExec.optionalBucketSet differ then merging is disabled because partitioning and bucketing filters can be
> more selective in terms what files to scan...

In theory, whether the filters are data filters or partition filters, the data they select can overlap once the filters are combined with `or`.

Take the partition filters `p = 1` and `p = 2`, and assume each partition has one file. Before merging, the two scans read two partition files in total. After merging the two filters, the single scan still reads two partition files, so I think the overhead of scanning partition files is the same. The difference is that the filters have more to compute: for example, `p = 1` must now also be evaluated against the data that comes from `p = 2`.

So, personally, I think the computation overhead is similar no matter which kind of filter it is.
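The `p = 1` / `p = 2` example can be sketched as a toy model. This is only an illustration of the counting argument, not Spark code: the `files` layout, the `scan` helper, and the three-rows-per-file size are all assumptions made up for the sketch.

```python
# Toy model (hypothetical, not Spark): one file per partition value,
# each file holding three rows.
files = {p: [{"p": p, "v": i} for i in range(3)] for p in (1, 2, 3)}


def scan(partition_values, row_filters):
    """Read one file per partition value and evaluate every filter on every row read.

    Returns (files_read, filter_evaluations, matches_per_filter).
    """
    files_read = 0
    evaluations = 0
    matches = [0] * len(row_filters)
    for p in partition_values:
        files_read += 1
        for row in files[p]:
            for i, row_filter in enumerate(row_filters):
                evaluations += 1
                if row_filter(row):
                    matches[i] += 1
    return files_read, evaluations, matches


def is_p1(row):
    return row["p"] == 1


def is_p2(row):
    return row["p"] == 2


# Two separate scans: each one is pruned to its own partition file.
separate = [scan([1], [is_p1]), scan([2], [is_p2])]
separate_files = sum(result[0] for result in separate)
separate_evals = sum(result[1] for result in separate)

# One merged scan: `p = 1 OR p = 2` selects both files, and each original
# filter is then evaluated on the rows of both files.
merged_files, merged_evals, merged_matches = scan([1, 2], [is_p1, is_p2])

print(separate_files, merged_files)  # files read: separate vs. merged
print(separate_evals, merged_evals)  # filter evaluations: separate vs. merged
print(merged_matches)                # rows matched by each original filter
```

In this model both plans read the same number of files, while the merged scan evaluates each original filter on rows from the other partition as well, which is the extra computation described above.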
