[
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032497#comment-16032497
]
Josh Rosen commented on HIVE-16391:
-----------------------------------
I tried to see whether Spark can consume the existing Hive 1.2.1 artifacts, but
it looks like neither the regular nor the {{core}}-classified {{hive-exec}}
artifact will work:
* We can't use the regular Hive uber-JAR artifacts because they bundle many
transitive dependencies without relocating those dependencies' classes into a
private namespace, so multiple versions of the same class end up on the
classpath. To see this, compare the long list of bundled artifacts at
https://github.com/apache/hive/blob/release-1.2.1/ql/pom.xml#L685 with the
single relocation pattern (for Kryo); a relocation sketch follows this list.
* We can't use the {{core}}-classified artifact:
** We actually need Kryo to be shaded in {{hive-exec}} because Spark now uses
Kryo 3 (which is needed by Chill 0.8.x, which is needed for Scala 2.12) while
Hive uses Kryo 2.
** In addition, I think that Spark needs to shade Hive's
{{com.google.protobuf:protobuf-java}} dependency.
** The published {{hive-exec}} POM is a "dependency-reduced" POM which doesn't
declare {{hive-exec}}'s transitive dependencies. To see this, compare the
declared dependencies in the published POM on Maven Central
(http://central.maven.org/maven2/org/apache/hive/hive-exec/1.2.1/hive-exec-1.2.1.pom)
to the dependencies in the source repo's POM:
https://github.com/apache/hive/blob/release-1.2.1/ql/pom.xml. The lack of
declared dependencies creates an additional layer of pain when consuming the
{{core}} JAR: since the transitive dependencies are no longer bundled into an
uber JAR, we have to shoulder the burden of declaring explicit dependencies on
all of {{hive-exec}}'s transitive dependencies ourselves, which makes it harder
to use tools like Maven's {{dependency:tree}} to spot potential dependency
conflicts.
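For reference, the relocation mechanism in question is the maven-shade-plugin's
{{<relocations>}} configuration. The snippet below is a minimal sketch (not
Hive's actual build configuration; the shaded package prefix is illustrative)
showing what relocating Kryo and Protobuf into a private namespace looks like:
{code:xml}
<!-- Minimal maven-shade-plugin sketch: relocate Kryo and Protobuf into a
     private namespace so the bundled copies can't clash with the versions
     on Spark's classpath. The shaded prefix here is illustrative. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.esotericsoftware.kryo</pattern>
            <shadedPattern>org.apache.hive.com.esotericsoftware.kryo</shadedPattern>
          </relocation>
          <relocation>
            <pattern>com.google.protobuf</pattern>
            <shadedPattern>org.apache.hive.com.google.protobuf</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}
The stock 1.2.1 build applies this kind of relocation only to Kryo; every other
bundled dependency keeps its original package names.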
Spark's current custom Hive fork effectively makes three changes relative to
Hive 1.2.1 in order to work around the above problems, plus some legacy issues
which are no longer relevant:
* Remove the shading/bundling of most non-Hive classes, with the exception of
Kryo and Protobuf. This makes the published POM non-dependency-reduced, which
eases the dependency-management story in Spark's POMs, while still relocating
the classes that conflict with Spark.
* Package the hive-shims into the hive-exec JAR. I don't think that this is
strictly necessary.
* Downgrade Kryo to 2.21. This isn't necessary anymore: at an earlier point we
purposely _unshaded_ Kryo and pinned Hive's version to match Spark's, and the
only reason the change is still present today is to minimize the diff between
versions 1 and 2 of Spark's Hive fork.
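To illustrate the practical effect of the first change: because the fork's
published POM is not dependency-reduced, Spark's build today only needs a
single dependency on the forked artifact and Maven resolves the transitive
dependencies for it. The snippet below is a sketch of the consuming side
(coordinates are the fork's as I recall them, not necessarily Spark's exact
POM, which also carries a number of exclusions):
{code:xml}
<!-- Sketch of the consuming side today: one dependency on the forked
     hive-exec; its transitive dependencies flow in through the published
     (non-dependency-reduced) POM, so mvn dependency:tree still sees them. -->
<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>1.2.1.spark2</version>
</dependency>
{code}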
For the full details, see
https://github.com/apache/hive/compare/release-1.2.1...JoshRosen:release-1.2.1-spark2,
which compares the current Version 2 of our Hive fork to stock Hive 1.2.1.
Maven classifiers do not allow a classified artifact to declare its own set of
dependencies (a classified artifact shares the main artifact's POM), so if we
wanted to publish a {{core}}-like {{hive-exec}} artifact which does declare its
transitive dependencies then this would need to be done under a new Maven
artifact name or a new version (e.g. Hive 1.2.2-spark).
That said, proper declaration of transitive dependencies isn't a hard blocker
for us: a long, long, long time ago, I think that Spark may have actually built
with a stock {{core}} artifact and explicitly declared the transitive deps, so
if we've handled that dependency declaration before then we can do it again at
the cost of some pain in the future if we want to bump to Hive 2.x.
Therefore, I think the minimal change needed in Hive's build is to add a new
classifier, say {{core-spark}}, which behaves like {{core}} except that it
shades and relocates Kryo and Protobuf. If this artifact existed then I think
Spark could use that classified artifact, declare an explicit dependency on the
shim artifacts (assuming Kryo and Protobuf don't need to be shaded there) and
explicitly pull in all of {{hive-exec}}'s transitive dependencies. This avoids
the need to publish separate _versions_ for Spark: instead, Spark would just
consume a differently-packaged/differently-classified version of a stock Hive
release.
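Concretely, the Spark side of that arrangement would look roughly like the
following. This is only a sketch: {{core-spark}} is the hypothetical classifier
proposed above, the 1.2.3 version is the hypothetical release that would carry
it, and the explicitly-declared transitive dependencies are an illustrative
subset rather than the full list:
{code:xml}
<!-- Sketch: consume the hypothetical core-spark classified artifact plus the
     shims, and declare hive-exec's transitive dependencies ourselves since
     the core-style POM is dependency-reduced. Version and dependency list
     are illustrative. -->
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>1.2.3</version>
  <classifier>core-spark</classifier>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-shims</artifactId>
  <version>1.2.3</version>
</dependency>
<!-- ...plus explicit declarations of hive-exec's transitive dependencies,
     e.g. hive-metastore, commons-lang, antlr-runtime, etc. -->
{code}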
If we go with this classifier-based approach, then I guess Hive would need to
publish 1.2.3 or 1.2.2.1 in order to introduce the new classified artifact.
Does this sound like a reasonable approach? Or would it make more sense to have
a separate Hive branch and versioning scheme for Spark (e.g.
{{branch-1.2-spark}} and Hive {{1.2.1-spark}})? I lean towards the former
approach (releasing 1.2.3 with an additional Spark-specific classifier),
especially if we want to fix bugs or make functional / non-packaging changes
later down the road (I think [[email protected]] had a few changes / fixes he
wanted to make).
> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -----------------------------------------------------------------------------
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
> Issue Type: Task
> Components: Build Infrastructure
> Reporter: Reynold Xin
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the
> only change in the fork is to work around the issue that Hive publishes only
> two sets of jars: one set with no dependencies declared, and another with all
> the dependencies included in the published uber jar. That is to say, Hive
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked
> Hive.
> The change in the forked version is recorded here
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become
> unnecessary.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)