peter-toth commented on PR #42223: URL: https://github.com/apache/spark/pull/42223#issuecomment-1663724814
> > But my point is that it doesn't matter what the filter looks like (whether it is an OR condition or not). I enabled merging only if `FileSourceScanExec.dataFilters` differ between the 2 scans. If `FileSourceScanExec.partitionFilters` or `FileSourceScanExec.optionalBucketSet` differ then merging is disabled, because partitioning and bucketing filters can be more selective in terms of what files to scan...
>
> In theory, whether it is data filters or partition filters, there is a possibility of data overlap when the filters are combined with `or`. Before merging the filters (e.g. `p = 1`, `p = 2`), assume each partition has one file, so we need to read two partition files. After merging the two filters, we still need to read two partition files. I think the overhead of scanning the partition files is the same. The difference is that each filter needs to evaluate more rows, e.g. `p = 1` also needs to process the data coming from `p = 2`. So, personally, I think the computation overhead is similar, no matter which kind of filter it is.
>
> The main reason for filter merging is the amount of overlapping data. For example, `F1` obtains 100 rows of data, and `F2` obtains 50 rows of data. If the 100 rows and 50 rows completely overlap, this is the best situation. `F1` on `Aggregate1` still processes 100 rows of data, while `F2` on `Aggregate2` processes an additional 50 rows, resulting in a total of 100 rows of data. The worst case scenario is that the two do not overlap at all. So `F1` on `Aggregate1` needs to process an additional 50 rows, a total of 150 rows; `F2` on `Aggregate2` processes an additional 100 rows, totaling 150 rows.

Sorry @beliefer, I didn't explain all the reasoning behind my heuristics in https://github.com/apache/spark/pull/37630. I've updated https://github.com/apache/spark/pull/42223#discussion_r1282023520, please see the details there.
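For readers following the row-count arithmetic in the quoted example, here is a minimal Python sketch of it. This is not Spark code: it assumes each filter's output can be modelled as a set of row ids and that the merged scan returns the union of the two sets, and the function name `rows_after_merge` is made up for illustration.

```python
def rows_after_merge(f1_rows, f2_rows):
    """Count the rows each aggregate sees once the two filters are merged.

    Assumption: the merged scan outputs every row matching F1 OR F2,
    and both aggregates then consume that whole output.
    """
    merged = f1_rows | f2_rows  # rows produced by the merged scan
    return {
        "merged_scan_output": len(merged),
        # Rows Aggregate1 has to look at that F1 alone would not have produced.
        "extra_for_aggregate1": len(merged - f1_rows),
        # Rows Aggregate2 has to look at that F2 alone would not have produced.
        "extra_for_aggregate2": len(merged - f2_rows),
    }


# Best case from the example: F2's 50 rows are fully contained in F1's 100 rows.
best = rows_after_merge(set(range(100)), set(range(50)))

# Worst case from the example: the 100 rows and the 50 rows do not overlap.
worst = rows_after_merge(set(range(100)), set(range(100, 150)))

print(best)
print(worst)
```

Under these assumptions the best case gives a merged output of 100 rows with 0 and 50 extra rows for the two aggregates, and the worst case gives 150 rows with 50 and 100 extra rows, matching the numbers quoted above.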
