Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21143
  
    Thanks for working on this, @cloud-fan! I was thinking about this 
just recently, since it would let data sources delegate filtering back to 
Spark when needed.
    
    I'll have a thorough look at it tomorrow, but one quick high-level 
question: should we make these residuals per input split instead?
    
    Different input splits may need different residual filters. For 
example, if you have a time-range query, `ts > X`, and the data is stored 
by day, then whenever `day(ts) > day(X)` you know `ts > X` *must* be true, 
but when `day(ts) = day(X)`, `ts > X` only *might* be true. So the original 
filter needs to run only on the splits for the boundary day, not on any of 
the others.
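    
    To make that concrete, here's a rough sketch of a per-split residual 
(the types and `residualFor` are made up to illustrate the idea, not a 
proposed API):
    
    ```scala
    import java.time.LocalDate
    
    // Hypothetical types, just to illustrate.
    case class TsGreaterThan(ts: Long)                 // pushed filter: ts > X (epoch millis, UTC)
    case class DaySplit(day: LocalDate, path: String)  // one split per stored day
    
    // Per-split residual for `ts > X` on a day-partitioned table:
    //  - split day after day(X):  every row satisfies ts > X, no residual
    //  - split day equals day(X): boundary day, Spark must re-apply ts > X
    //  (splits before day(X) would be pruned and never planned at all)
    def residualFor(split: DaySplit, filter: TsGreaterThan): Option[TsGreaterThan] = {
      val filterDay = LocalDate.ofEpochDay(filter.ts / 86400000L)
      if (split.day.isAfter(filterDay)) None
      else Some(filter)
    }
    ```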
    
    Another use case for a per-split residual is when splits might be in 
different file formats. Parquet allows pushing down filters, but Avro 
doesn't. In a table with mixed file formats, it would be great for Avro 
splits to return the entire expression as a residual while Parquet splits 
do the filtering themselves.
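    
    Sketched the same way (hypothetical split types around Spark's 
existing `Filter` class):
    
    ```scala
    import org.apache.spark.sql.sources.Filter
    
    // Hypothetical split types; Filter is Spark's existing pushdown filter class.
    sealed trait FormatSplit { def path: String }
    case class ParquetSplit(path: String) extends FormatSplit
    case class AvroSplit(path: String) extends FormatSplit
    
    // Per-split residual in a mixed-format table: Parquet evaluates the
    // pushed filters itself, so its residual is empty; Avro can't, so
    // everything pushed comes back as a residual for Spark to evaluate.
    def residualFor(split: FormatSplit, pushed: Seq[Filter]): Seq[Filter] = split match {
      case _: ParquetSplit => Seq.empty
      case _: AvroSplit    => pushed
    }
    ```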
    
    What do you think?

