Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> @gatorsmile @HyukjinKwon @ajacques I'm seeing incorrect results from the
following simple projection query, given @jainaks' test file:
>
>```
>select page.url, page from temptable
>```
I believe I have identified and fixed this bug. I have added a commit with
a fix and additional test cases that cover it and related failure scenarios.
The underlying problem was that when merging the struct fields derived from
the projections `page.url` and `page` into a single pruned schema, the merge
function we are using does not necessarily respect the order of the fields in
the schema of `temptable`. In the above and similar scenarios, the merge
function merged `page.url` and `page` into a struct consisting of the
`page.url` field followed by the other fields in `page`. While this produces a
schema that has a subset of the fields of `temptable`'s schema, the fields are
in the wrong order.
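To make the ordering problem concrete, here is a minimal, language-neutral model (in Python for brevity; the `merge` helper and field names are illustrative, not Spark's actual `StructType` merge):

```python
# Hypothetical minimal model of the ordering bug. Each "struct" is a list
# of field names; merge keeps the left operand's fields and appends any
# fields from the right operand that the left is missing.
def merge(left, right):
    return left + [f for f in right if f not in left]

# Projecting `page.url` selects only the `url` field; projecting `page`
# selects the whole struct. Suppose `page` is (id, url, title):
from_page_url = ["url"]
from_page = ["id", "url", "title"]

merged = merge(from_page_url, from_page)
print(merged)  # ['url', 'id', 'title'] -- a subset of the original
               # fields, but `url` now incorrectly precedes `id`.
```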
I considered two high-level approaches to fixing this problem. The first
was to rework the way the pruned schema was constructed in the first place.
That is, rather than construct the pruned schema by merging fields together,
construct the merged schema by directly filtering the original schema. I think
that approach would involve replacing the `SelectedField` extractor, which
works on a single expression, with a `SelectedStruct` extractor that extracts a
whole struct from a sequence of expressions. Those expressions would
consist of the projection and filtering expressions of a `PhysicalOperation`. I
did not go further in exploring that route, as it would involve a substantial
rewrite of the patch. However, in the end it may be the cleaner, more natural
route.
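To illustrate why filtering the original schema preserves field order by construction, here is a hedged sketch (in Python, with invented names; this is not the actual `SelectedStruct` design, just the general idea of pruning by filtering rather than merging):

```python
# Hypothetical sketch: prune a schema by walking the *original* schema and
# keeping only fields covered by the selected paths. Because we iterate
# the original schema, the output order is the original order by construction.
# A struct is modeled as a dict of field name -> subtype (None for leaves);
# `paths` is a set of tuples, e.g. {("page", "url"), ("page",)}.
def prune(original, paths, prefix=()):
    out = {}
    for name, sub in original.items():
        p = prefix + (name,)
        # Keep the whole field if some selected path ends at or above it.
        if any(path == p[:len(path)] for path in paths):
            out[name] = sub
        # Otherwise recurse if some selected path descends into it.
        elif isinstance(sub, dict) and any(path[:len(p)] == p for path in paths):
            out[name] = prune(sub, paths, p)
    return out

original = {"id": None, "page": {"id": None, "url": None, "title": None}}
print(prune(original, {("page", "url")}))
# {'page': {'url': None}}
print(prune(original, {("page", "url"), ("page",)}))
# {'page': {'id': None, 'url': None, 'title': None}} -- original order kept.
```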
The second approach I considered, and adopted, was to sort the fields of
the merged schema, recursively, so that the order of the fields in the merged
schema respected the order of their namesakes in the original schema. This adds
more complexity to the patch, undesirable for something already so complex. But
it appeared to be the quickest route to a correct solution.
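A sketch of that recursive reordering, again in Python with toy types rather than Spark's `StructType` (the names here are hypothetical, not the patch's actual code):

```python
# Hypothetical sketch of the adopted fix: reorder the merged schema's
# fields to match the order of their namesakes in the original schema,
# recursing into nested structs. A struct is modeled as a dict of field
# name -> subtype (None for leaves). Since a pruned schema's fields are a
# subset of the original's, iterating the original covers every merged field.
def sort_by_original(merged, original):
    out = {}
    for name, orig_sub in original.items():
        if name in merged:
            sub = merged[name]
            # Recurse so that deeply nested field order is fixed as well.
            if isinstance(sub, dict) and isinstance(orig_sub, dict):
                sub = sort_by_original(sub, orig_sub)
            out[name] = sub
    return out

original = {"id": None, "url": None, "title": None}
merged = {"url": None, "id": None, "title": None}
print(list(sort_by_original(merged, original)))  # ['id', 'url', 'title']
```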
Given more time, I would probably explore rewriting the patch with a
`SelectedStruct` extractor as described above. I don't know whether it would
actually lead to something less complex. It's just a thought.
I added three additional test "scenarios" to
`ParquetSchemaPruningSuite.scala` (each "scenario" is tested four ways by the
`testSchemaPruning` function). They test three distinct scenarios that fail
without the fix. These scenarios consist of a field and its parent struct, an
array-based variation, and a map-based variation. I added some additional array
and map data to the test data to ensure proper test coverage.
Incidentally, I also added an integer `id` field to the test contact types
so that the results of queries on the contact tables can be ordered
deterministically. This should have been part of the tests all along, but I
forgot to incorporate it.