Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21143

Thanks for working on this, @cloud-fan! I was thinking about needing this just recently, so that data sources can delegate filtering to Spark when needed. I'll have a thorough look at it tomorrow, but one quick high-level question: should we make these residuals per input split instead? Different input splits may need different residual filters applied.

For example, if you have a time range query, `ts > X`, and are storing data by day, then you know that when `day(ts) > day(X)`, `ts > X` *must* be true, but when `day(ts) = day(X)`, `ts > X` *might* be true. So only the splits for the boundary day need to re-run the original filter; no other splits do.

Another use case for a per-split residual is when splits are in different file formats. Parquet allows pushing down filters, but Avro doesn't. In a mixed-format table it would be great for Avro splits to return the entire expression as a residual, while Parquet splits do the filtering themselves. What do you think?
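To make the boundary-day reasoning concrete, here is a minimal sketch of per-split residual planning for `ts > X` on a day-partitioned table. The `Split` class and `residual_for` function are hypothetical illustrations, not Spark's actual DataSourceV2 API:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Split:
    day: date  # partition value: day(ts) for every row in this split

def residual_for(split: Split, x_day: date):
    """Residual filter this split must still apply for the predicate ts > X,
    where x_day = day(X). Returns None when no per-row filtering is needed,
    the residual expression for the boundary day, or 'PRUNE' when the split
    can be skipped entirely."""
    if split.day > x_day:
        return None        # day(ts) > day(X)  =>  ts > X always holds
    elif split.day == x_day:
        return "ts > X"    # boundary day: ts > X might hold, re-check each row
    else:
        return "PRUNE"     # day(ts) < day(X)  =>  ts > X never holds
```

The point is that the residual is a property of each split, not of the scan as a whole: with a scan-level residual, every split would pay the cost of re-evaluating `ts > X`, even though only the boundary day's splits need it.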