Github user mallman commented on the issue:
https://github.com/apache/spark/pull/21320
> Could we move the changes made in ParquetReadSupport.scala to a separate
> PR? Then, we can merge this PR very quickly.
If I remove the changes to `ParquetReadSupport.scala`, then four tests fail
in `ParquetSchemaPruningSuite.scala`.
I don't think we can or should proceed without addressing the issue of reading
from two parquet files with identical column names and types but a different
ordering of those columns in their respective file schemas. Personally, I think
the fact that the Spark parquet reader appears to assume the same column order
in otherwise compatible schemas across files is a bug. I think column selection
should be by name, not index. The parquet-mr reader behaves that way.
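
To make the concern concrete, here is a toy sketch in plain Python. It is not Spark or parquet-mr code: the dictionaries `file_a` and `file_b` and the helpers `read_by_index` and `read_by_name` are hypothetical stand-ins, assumed only for illustration, for two files whose schemas have the same columns in a different order.

```python
# Hypothetical stand-ins for two Parquet files: same column names and types,
# but a different column order in each file schema.
file_a = {"schema": ["id", "name"], "rows": [(1, "alice")]}
file_b = {"schema": ["name", "id"], "rows": [("bob", 2)]}

# The columns the query asks for, in the order the query expects them.
requested = ["id", "name"]


def read_by_index(file, requested):
    """Positional selection: assumes each file stores columns in the requested order."""
    return [tuple(row[i] for i in range(len(requested))) for row in file["rows"]]


def read_by_name(file, requested):
    """Name-based selection: resolves each column against this file's own schema."""
    positions = [file["schema"].index(col) for col in requested]
    return [tuple(row[p] for p in positions) for row in file["rows"]]


# Positional reads silently swap the values coming from file_b.
by_index = read_by_index(file_a, requested) + read_by_index(file_b, requested)

# Name-based reads return consistent (id, name) tuples from both files.
by_name = read_by_name(file_a, requested) + read_by_name(file_b, requested)
```

In this sketch the positional read yields a row from `file_b` with `name` in the `id` slot, while the name-based read lines both files up correctly. That is the behavior difference I would want resolved before splitting the PR.
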
As a stop-gap alternative, I suppose we could disable the built-in reader
if parquet schema pruning is turned on. But I think that would be a rather
ugly, invasive and confusing hack.
Of course I'm open to other ideas as well.