Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/20511
I agree with what @omalley said. The new reader based on ORC 1.4 is better
than the old reader. That is why we chose the new reader as the default at the
beginning. We also saw a performance improvement in the micro-benchmark.
However, at RC3 of Spark 2.3, we realized that the new reader introduces a
regression. After RC3, we reverted many PRs that caused regressions and
also rejected many bug fixes that could introduce new regressions. We are very
conservative about merging bug fixes at this stage. Thus, I also think the
suggestion from @marmbrus is very reasonable. That is the strategy we have
followed in previous Spark releases.
Regarding this specific case, we did not revert the new reader; we only
changed the default values. Users can still try the new reader. We just want to
avoid breaking existing workloads when they upgrade to the upcoming Spark 2.3
release. In the next release, Spark 2.4, I believe we can be more confident in
choosing the new ORC reader as the default.
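For anyone who wants to opt in, here is a minimal sketch. It assumes the default being discussed is the `spark.sql.orc.impl` setting, where `native` selects the new ORC 1.4 based reader and `hive` selects the old one. The helper names `orc_reader_conf` and `build_session` are hypothetical and only for illustration.

```python
# Assumption: the ORC reader is selected by `spark.sql.orc.impl`,
# with "native" = new ORC 1.4 based reader and "hive" = old reader.
ORC_IMPL_KEY = "spark.sql.orc.impl"


def orc_reader_conf(use_new_reader):
    """Return the Spark SQL settings that select an ORC reader (hypothetical helper)."""
    return {ORC_IMPL_KEY: "native" if use_new_reader else "hive"}


def build_session(use_new_reader):
    """Create a SparkSession with the chosen ORC reader; requires pyspark."""
    # Imported lazily so orc_reader_conf() can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("orc-reader-opt-in")
    for key, value in orc_reader_conf(use_new_reader).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session(True)` would opt a single application into the new reader, while existing workloads that set nothing keep whatever default the release ships with.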
@dongjoon-hyun Could you submit a PR against the master branch to turn
them on by default? Please also add a migration guide.