Github user viirya commented on the issue:
https://github.com/apache/spark/pull/22357
I just read @mallman's comment. Thanks for that. Roughly, my two cents:
> IMO, we can get closer to settling the question of relative
> performance/behavior by pushing down Parquet reader filters just for the
> columns we need, e.g. IsNotNull(employer.id) in this case above. Neither patch
> (currently) does that, however I think my patch is closer to achieving that
> because it already identifies isnotnull(employer#4445.id) as a filter predicate
> in the query plan. We just need to push it down.
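For context, the shape of query being discussed is roughly the following; the parquet path and schema are illustrative, not taken from the PR:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("nested-filter-pushdown").getOrCreate()

// Illustrative data path and schema: a parquet table with a struct column
// `employer` that has an `id` field.
val df = spark.read.parquet("/tmp/contacts")

// Filter on the nested field and select only that field. The optimized plan
// identifies isnotnull(employer.id) as a filter predicate; whether it is also
// handed to the Parquet reader as a pushed filter is the open question here.
df.filter(col("employer.id").isNotNull)
  .select("employer.id")
  .explain(true)
```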
I'm not sure that `IsNotNull(employer.id)` is actually better than
`IsNotNull(employer)` for predicate pushdown. `IsNotNull(employer)` doesn't
mean the reader has to read all of the contents of the `employer` struct. IMO,
readers usually have some null-bit optimization that lets them do the null
check without reading the actual values. By comparison, `IsNotNull(employer.id)`
may need to read the `id` field from `employer` to do the null check, so it
might even be worse.
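A rough sketch of what pushing the nested-field predicate would mean at the parquet-mr level; the `FilterApi` calls below mirror how Spark expresses `IsNotNull` (as `notEq(column, null)`), but the nested column path and its integer type are assumptions for illustration:

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}

// Parquet expresses "is not null" as notEq(column, null). A predicate on the
// nested leaf column means the reader must at least consult employer.id's
// column chunk (its definition levels) to evaluate the null check, whereas a
// null check on the struct itself can potentially be answered from
// nullability info alone, without touching the leaf values.
val isNotNullOnNestedField: FilterPredicate =
  FilterApi.notEq(FilterApi.intColumn("employer.id"), null.asInstanceOf[Integer])
```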