Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/18640
Thank you again for coming and reviewing this PR, @rxin, @kiszk, @mridulm,
and @omalley.
So far, we have discussed the following.
1. `Why are we adding this to core? Why not just the hive module?` (@rxin)
- Adding it to the `sql/core` module gives more benefit than adding it
to `sql/hive`.
- The Apache ORC library (`no-hive` version) is a general and reasonably
small library designed for non-Hive apps.
2. `Can we add smaller amount of new code to use this, too?` (@kiszk)
- The previous #17980, #17924, and #17943 are the complete examples
that contain this PR.
- This PR focuses on the dependency only.
3. `Why don't we then create a separate orc module? Just copy a few of the
files over?` (@rxin)
- ORC is like most of the other data sources (CSV, JDBC, JSON, PARQUET,
TEXT), which live inside `sql/core`.
- It's better to use ORC as a library than to copy ORC files, because the
Apache ORC shaded jar contains many files. We should rely on the Apache ORC
community's effort until there is an unavoidable reason to copy.
4. `I do worry in the future whether ORC would bring in a lot more jars`
(@rxin)
- The ORC core library's dependency tree is aggressively kept as small
as possible. I've gone through and excluded unnecessary jars from our
dependencies. I also kick back pull requests that add unnecessary new
dependencies. (@omalley)
I tried to include and summarize all the advice here, but please let me know
if I missed any concerns.