Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22357
  
    I just read @mallman's comment. Thanks for that. Roughly, my two cents:
    
    > IMO, we can get closer to settling the question of relative
    > performance/behavior by pushing down Parquet reader filters just for the
    > columns we need, e.g. IsNotNull(employer.id) in this case above. Neither
    > patch (currently) does that, however I think my patch is closer to
    > achieving that because it already identifies isnotnull(employer#4445.id)
    > as a filter predicate in the query plan. We just need to push it down.
    
    I'm not sure that `IsNotNull(employer.id)` is actually better than 
`IsNotNull(employer)` for predicate pushdown. `IsNotNull(employer)` doesn't 
mean the reader has to read all of the contents of the `employer` struct. IMO, 
readers usually have some null-bit optimization that lets them do the null 
check without reading the real content. Comparatively, `IsNotNull(employer.id)` 
may need to read the `id` field from `employer` to do the null check, so it 
might even be worse.
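
    For concreteness, here's a minimal Scala sketch of the query shape under 
discussion (the schema, the `/tmp/contacts` path, and the names are made up 
for illustration and are not the PR's test data). Running `explain(true)` 
shows which filters actually reach the Parquet scan in `PushedFilters`, which 
is the easiest way to compare what the two approaches push down:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    // Hypothetical, simplified schema; the PR's actual test data is richer.
    case class Employer(id: Option[Int], company: String)
    case class Contact(name: String, employer: Employer)

    object NestedIsNotNullDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("nested-isnotnull-demo")
          .getOrCreate()
        import spark.implicits._

        // Write a tiny Parquet file with a nested struct column.
        Seq(Contact("jane", Employer(Some(0), "abc")))
          .toDF()
          .write.mode("overwrite").parquet("/tmp/contacts")

        // Select only employer.id and null-check it. Whether the scan ends up
        // with IsNotNull on the whole `employer` struct or (as proposed above)
        // on `employer.id` is visible in PushedFilters of the physical plan.
        val q = spark.read.parquet("/tmp/contacts")
          .select(col("employer.id"))
          .where(col("employer.id").isNotNull)

        q.explain(true)
        spark.stop()
      }
    }
    ```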
    
    


