Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/21320
  
    > @gatorsmile @HyukjinKwon @ajacques I'm seeing incorrect results from the following simple projection query, given @jainaks' test file:
    >
    >```
    >select page.url, page from temptable
    >```
    
    I believe I have identified and fixed this bug. I have added a commit with a fix, along with additional test cases covering this bug and related failure scenarios.
    
    The underlying problem was that when merging the struct fields derived from 
the projections `page.url` and `page` into a single pruned schema, the merge 
function we are using does not necessarily respect the order of the fields in 
the schema of `temptable`. In the above and similar scenarios, the merge 
function merged `page.url` and `page` into a struct consisting of the 
`page.url` field followed by the other fields in `page`. While this produces a 
schema that has a subset of the fields of `temptable`'s schema, the fields are 
in the wrong order.
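    
    To make the reordering concrete, here is a minimal, self-contained illustration. The `naiveMerge` helper below is hypothetical, not the merge function the patch actually uses, but it exhibits the same appending behavior:
    
    ```scala
    import org.apache.spark.sql.types._
    
    // Hypothetical merge in the spirit of the pruning code: fields from the
    // right side that are not already present are appended after the left's.
    def naiveMerge(left: StructType, right: StructType): StructType =
      StructType(left.fields ++
        right.fields.filterNot(f => left.fieldNames.contains(f.name)))
    
    // Schema of `page` in `temptable`: (title, url)
    val page = StructType(Seq(
      StructField("title", StringType),
      StructField("url", StringType)))
    
    // Schema fragment derived from the projection `page.url`
    val urlOnly = StructType(Seq(StructField("url", StringType)))
    
    val merged = naiveMerge(urlOnly, page)
    // merged.fieldNames == Array("url", "title"): a subset of `page`'s
    // fields, but `url` now precedes `title`, unlike in the original schema.
    ```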
    
    I considered two high-level approaches to fixing this problem. The first was to rework how the pruned schema is constructed in the first place. That is, rather than constructing the pruned schema by merging fields together, construct it by directly filtering the original schema. I think that approach would involve turning the `SelectedField` extractor, which extracts a single field from an expression, into a `SelectedStruct` extractor that extracts a whole struct from a sequence of expressions, namely the projection and filter expressions of a `PhysicalOperation`. I did not explore that route further, as it would involve a substantial rewrite of the patch. However, in the end it may be the cleaner, more natural route.
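    
    For the record, here is a rough sketch of what that filtering approach might look like. The path representation and the `prune` helper are my own assumptions for illustration, not code from the patch:
    
    ```scala
    import org.apache.spark.sql.types._
    
    // Prune a schema by filtering the original fields against a set of
    // selected field paths, e.g. Seq(Seq("page", "url"), Seq("page")).
    // Because we iterate over the original fields, their order is preserved
    // by construction. (Case-insensitive resolution and array/map element
    // types are omitted for brevity.)
    def prune(schema: StructType, paths: Seq[Seq[String]]): StructType =
      StructType(schema.fields.flatMap { field =>
        val matching = paths.filter(_.headOption.contains(field.name))
        if (matching.isEmpty) None
        else if (matching.exists(_.size == 1)) Some(field) // whole field selected
        else field.dataType match {
          case struct: StructType =>
            Some(field.copy(dataType = prune(struct, matching.map(_.tail))))
          case _ => Some(field)
        }
      })
    ```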
    
    The second approach I considered, and adopted, was to sort the fields of the merged schema recursively, so that the order of the fields in the merged schema respects the order of their namesakes in the original schema. This adds complexity to a patch that is already quite complex, which is undesirable, but it appeared to be the quickest route to a correct solution.
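    
    In outline, the reordering works along these lines (a simplified sketch, not the exact code from the commit):
    
    ```scala
    import org.apache.spark.sql.types._
    
    // Reorder the fields of the merged (pruned) schema so that they follow
    // their namesakes in the original schema, recursing into structs, arrays
    // and maps. Fields absent from the merged schema are simply skipped.
    def sortFieldsByOriginal(merged: DataType, original: DataType): DataType =
      (merged, original) match {
        case (m: StructType, o: StructType) =>
          StructType(o.fields.flatMap { oField =>
            m.fields.find(_.name == oField.name).map { mField =>
              mField.copy(dataType =
                sortFieldsByOriginal(mField.dataType, oField.dataType))
            }
          })
        case (m: ArrayType, o: ArrayType) =>
          m.copy(elementType =
            sortFieldsByOriginal(m.elementType, o.elementType))
        case (m: MapType, o: MapType) =>
          m.copy(
            keyType = sortFieldsByOriginal(m.keyType, o.keyType),
            valueType = sortFieldsByOriginal(m.valueType, o.valueType))
        case _ => merged
      }
    ```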
    
    Given more time, I would probably explore rewriting the patch with a `SelectedStruct` extractor as described above. I don't know whether it would actually lead to something less complex. It's just a thought.
    
    I added three additional test "scenarios" to `ParquetSchemaPruningSuite.scala` (each "scenario" is tested four ways by the `testSchemaPruning` function). Each tests a distinct case that fails without the fix: selecting a field together with its parent struct, an array-based variation, and a map-based variation. I added some additional array and map data to the test data to ensure proper test coverage.
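    
    To give a sense of their shape, the three scenarios boil down to queries like the following (the column names here are illustrative, in the style of the suite's test data, and not copied verbatim from the new tests):
    
    ```scala
    // A field together with its parent struct:
    spark.sql("select name.middle, name from contacts")
    // The array-based variation, where `friends` is an array of structs:
    spark.sql("select friends.middle, friends from contacts")
    // The map-based variation, where `relatives` maps strings to structs:
    spark.sql("select relatives['brother'].middle, relatives from contacts")
    ```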
    
    Incidentally, I also added an integer `id` field to the test contact types 
so that the results of queries on the contact tables can be ordered 
deterministically. This should have been part of the tests all along, but I 
forgot to incorporate it.

