Github user rxin commented on the issue:
https://github.com/apache/spark/pull/16281
@nsync you raised an excellent question on test coverage. The kinds of bugs
we have seen in the past weren't really integration bugs, but bugs in
parquet-mr itself. Technically it should be the job of parquet-mr to verify
correctness and guard against performance regressions. If we were to introduce
a much broader set of regression tests in Spark, then to me it makes even more
sense to just move the Parquet code into Spark and fix issues found there.
Also, I have spent some time understanding the Parquet codec, and I have to
say it is powerful but complicated, and as a result fairly difficult to
implement correctly. The Dremel format optimizes for sparse nested data, but
it is much harder to get right than a simpler dense format.
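To illustrate why the Dremel-style encoding is tricky: even a single `repeated optional int` column needs repetition and definition levels to distinguish an empty list, a present-but-null element, and an actual value. Below is a minimal, hypothetical sketch of that idea; the class and method names are made up for illustration and are not parquet-mr's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: flatten a `repeated optional int` column into
// Dremel-style (repetitionLevel, definitionLevel, value) triples.
// Max repetition level is 1 (one repeated level), max definition
// level is 2 (list present + element non-null).
public class DremelSketch {

    // Each record is a list of Integer; a null entry means the element
    // exists but its value is null. Returns int[]{rep, def, value}
    // triples (value is 0 when undefined).
    static List<int[]> encode(List<List<Integer>> records) {
        List<int[]> out = new ArrayList<>();
        for (List<Integer> rec : records) {
            if (rec.isEmpty()) {
                // Empty list: nothing is defined below the record level.
                out.add(new int[] {0, 0, 0});
                continue;
            }
            for (int i = 0; i < rec.size(); i++) {
                int rep = (i == 0) ? 0 : 1;    // 0 marks the start of a new record
                Integer v = rec.get(i);
                int def = (v != null) ? 2 : 1; // 1 = element present but null
                out.add(new int[] {rep, def, v != null ? v : 0});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> records = Arrays.asList(
            Arrays.asList(1, null, 3),   // record with a null element
            Arrays.<Integer>asList(),    // empty record
            Arrays.asList(7));
        for (int[] t : encode(records)) {
            System.out.println(t[0] + "," + t[1] + "," + t[2]);
        }
    }
}
```

Note how three distinct "missing" cases (empty list, null element, end of record) all collapse into level bookkeeping rather than stored values; a dense format would not need any of this, which is exactly what makes a correct implementation harder.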
FWIW, the ideal scenario I can think of is to have parquet-mr publish bug-fix
versions that don't include new features. That would make updates easier to
audit and lower risk. E.g. parquet-mr 2.x adds new features, while 1.x
releases are just bug fixes.