Github user mallman commented on the issue:
https://github.com/apache/spark/pull/22357
> @mallman It will be great that we can have this fix in 2.4 release as this can dramatically reduce the data being read in many applications which is the purpose of the original work.
I agree it would be great to have this capability in 2.4. But I don't know
that this PR is the right way to accomplish our intended goal. I'm also not
sure this patch accomplishes its intended goal. And I would like time to
complete my review; I'm still running tests against this patch.
I would also like to submit my patch as an alternative for review, because
the approaches taken by this PR and by my patch are incompatible. Even though
it's incomplete, I'm willing to submit it as-is with some notes on how it's
incomplete and what needs to be done. However, I can say for certain there is
no way it would be accepted for Spark 2.4. The earliest I could get it
submitted is tomorrow morning (EDT).
To give you a sense of how my patch works, here is the gist of how I see
the problem. Basically, constraint propagation as defined in
`QueryPlanConstraints.scala` inhibits schema pruning. Indeed, if you turn off
constraint propagation (by setting `spark.sql.constraintPropagation.enabled` to
`false`), the following query
`select employer.id from contacts where employer.id = 0`
produces the following physical plan
```
== Physical Plan ==
*(1) Project [employer#36.id AS id#47]
+- *(1) Filter (employer#36.id = 0)
   +- *(1) FileScan parquet [employer#36,p#37] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<employer:struct<id:int>>
```
without applying _either_ patch. (FYI I ran this on the master branch,
commit 12e3e9f17dca11a2cddf0fb99d72b4b97517fb56). The only column read in this
plan is `employer.id`, just as we'd like.
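For anyone who wants to reproduce this, the experiment amounts to the following in a `spark-sql` session against master (a sketch; it assumes a `contacts` table with a nested `employer` struct column, e.g. `struct<id:int,...>`, along the lines of the schema pruning test fixtures):

```
-- Disable constraint propagation, then inspect the plan.
SET spark.sql.constraintPropagation.enabled=false;
EXPLAIN SELECT employer.id FROM contacts WHERE employer.id = 0;
```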
Aside from the difference in approach, I have some other concerns around
this PR. I don't think we should push down `IsNotNull(employer)` to the reader
unless we need to. This PR includes that pushed down filter for both of the
sample queries I provided in my previous comment
https://github.com/apache/spark/pull/22357#issuecomment-419612555. The
question is: how does that pushdown affect the reader's behavior?
That leads me to a concern about how we test this functionality. Our
intent is to read only the columns the query needs. In the query
`select employer.id from contacts where employer.id = 0`
we need only read from the `employer.id` column. And we can tell the reader
to only read that column. But how do we know that pushing down
`IsNotNull(employer)` does not negate that instruction? One way to be certain
is to not push that filter down in the first place. That is the approach my
patch currently takes. Of course, this removes the pushdown. Identifying
which plan leads to a faster scan requires a more robust testing
capability. However, one thing is certain: the `FileScan` in my patch's plan
gives no reason to believe that it is reading anything other than that one
column.
IMO, we can get closer to settling the question of relative
performance/behavior by pushing down Parquet reader filters just for the
columns we need, e.g. `IsNotNull(employer.id)` in the case above. Neither
patch (currently) does that; however, I think my patch is closer to achieving
that because it already identifies `isnotnull(employer#4445.id)` as a filter
predicate in the query plan. We just need to push it down.
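To make the idea concrete, here is a toy sketch of the selection rule I have in mind: push a filter down only if every column it references survives schema pruning. This is plain Python, not Spark's actual filter API; the names `IsNotNull` and `prune_pushdown` are illustrative only.

```python
# Toy model of the pushdown choice discussed above; not Spark code.
from dataclasses import dataclass

@dataclass(frozen=True)
class IsNotNull:
    attribute: str  # dotted column path, e.g. "employer.id"

    def references(self):
        return [self.attribute]

def prune_pushdown(filters, pruned_columns):
    """Keep only filters whose referenced columns are all in the pruned schema,
    so the scan never touches a column the query does not need."""
    return [f for f in filters
            if all(ref in pruned_columns for ref in f.references())]

filters = [IsNotNull("employer"), IsNotNull("employer.id")]
print(prune_pushdown(filters, {"employer.id"}))
# With only employer.id in the read schema, IsNotNull("employer") is dropped.
```

Under this rule, the broad `IsNotNull(employer)` filter would never reach the reader when the read schema has been pruned to `employer.id`, which is exactly the behavior in question.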
As I mentioned, I'll endeavor to have my patch posted as a PR by tomorrow
morning, but I can't promise that.
I'm sorry for the delay. I really wasn't expecting we'd work on this
functionality for Spark 2.4. We do have a known bug in the schema pruning
functionality that's in Spark 2.4, one that throws an error. We identified it
in #21320 (look for the "ignored" test in `ParquetSchemaPruningSuite.scala`),
but I don't think we have an issue in Jira for it. I'll try to take care of
that by tomorrow morning as well; I'm hoping we can prioritize that fix. I
have a patch for that bug that is code complete but missing proper code
documentation.
Thanks all.