[
https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996178#comment-13996178
]
Patrick Wendell edited comment on SPARK-1802 at 5/13/14 8:18 AM:
-----------------------------------------------------------------
This protobuf thing is very troubling. The options here are pretty limited
since the Hive project publishes this assembly jar. I see a few:

1. Publish a Hive 0.12 that uses our shaded protobuf 2.4.1 (we already
published a shaded version of protobuf 2.4.1). I actually have this working in
a local build of Hive 0.12, but I haven't tried to push it to Sonatype yet:
https://github.com/pwendell/hive/commits/branch-0.12-shaded-protobuf

2. Upgrade our use of Hive to 0.13 (which bumps to protobuf 2.5.0) and only
support Spark SQL with Hadoop 2+ - that is, versions of Hadoop that have also
bumped to protobuf 2.5.0. I'm not sure how big an effort the code changes
between 0.12 and 0.13 would be; Spark didn't recompile trivially against
0.13. I can talk to Michael Armbrust tomorrow morning about this.
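For context on option 1, "shading" means relocating the protobuf classes under a Spark-owned package at build time, so Hive's protobuf 2.4.1 bytecode can never collide with Hadoop's protobuf 2.5.0 on the classpath. A hypothetical maven-shade-plugin sketch of the kind of relocation involved (the shaded package name here is illustrative; the actual coordinates in the branch above may differ):

```xml
<!-- Hypothetical relocation config: rewrite com.google.protobuf under a
     Spark-owned package so Hive 0.12's protobuf 2.4.1 cannot collide
     with Hadoop 2's protobuf 2.5.0 in the same JVM. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.protobuf</pattern>
            <shadedPattern>org.spark.shaded.protobuf</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```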
One thing I don't totally understand is how Hive itself deals with this
conflict - for instance, when someone runs Hive 0.12 on Hadoop 2.
Presumably both Hive's protobuf 2.4.1 and the HDFS client's protobuf 2.5.0
will be in the JVM at the same time, and I'm not sure how they are isolated
from each other. HDP 2.1, for instance, seems to ship both
(http://hortonworks.com/hdp/whats-new/).
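On the isolation question: a flat JVM classpath provides no isolation at all - the first copy of a class wins and silently shadows the rest, which is why distros typically rely on separate classloaders or shading (as in option 1). Python's sys.path resolves the same first-match-wins way, so a rough analogue, with hypothetical stub modules standing in for the two protobuf copies:

```python
import os
import sys
import tempfile

# Two "jars", each providing the same package name with a different
# version: stand-ins for protobuf 2.4.1 (bundled in hive-exec) and
# protobuf 2.5.0 (needed by the Hadoop 2 HDFS client).
hive_jar = tempfile.mkdtemp()
hadoop_jar = tempfile.mkdtemp()
with open(os.path.join(hive_jar, "protobuf_stub.py"), "w") as f:
    f.write("VERSION = '2.4.1'\n")
with open(os.path.join(hadoop_jar, "protobuf_stub.py"), "w") as f:
    f.write("VERSION = '2.5.0'\n")

# A flat search path offers no isolation: whichever copy appears first
# shadows the other, just like first-wins resolution on a JVM classpath.
sys.path.insert(0, hadoop_jar)
sys.path.insert(0, hive_jar)  # Hive's copy ends up first
import protobuf_stub

print(protobuf_stub.VERSION)  # -> 2.4.1; the 2.5.0 copy is invisible
```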
> Audit dependency graph when Spark is built with -Phive
> ------------------------------------------------------
>
> Key: SPARK-1802
> URL: https://issues.apache.org/jira/browse/SPARK-1802
> Project: Spark
> Issue Type: Bug
> Reporter: Patrick Wendell
> Assignee: Sean Owen
> Priority: Blocker
> Fix For: 1.0.0
>
> Attachments: hive-exec-jar-problems.txt
>
>
> I'd like the binary release for 1.0 to include Hive support. Since this
> isn't enabled by default in the build, I don't think it's as well tested, so
> we should dig around a bit and decide if we need to e.g. add any excludes.
> {code}
> $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl
> assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }'
> | sort > without_hive.txt
> $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl
> assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }'
> | sort > with_hive.txt
> $ diff without_hive.txt with_hive.txt
> < antlr-2.7.7.jar
> < antlr-3.4.jar
> < antlr-runtime-3.4.jar
> 10,14d6
> < avro-1.7.4.jar
> < avro-ipc-1.7.4.jar
> < avro-ipc-1.7.4-tests.jar
> < avro-mapred-1.7.4.jar
> < bonecp-0.7.1.RELEASE.jar
> 22d13
> < commons-cli-1.2.jar
> 25d15
> < commons-compress-1.4.1.jar
> 33,34d22
> < commons-logging-1.1.1.jar
> < commons-logging-api-1.0.4.jar
> 38d25
> < commons-pool-1.5.4.jar
> 46,49d32
> < datanucleus-api-jdo-3.2.1.jar
> < datanucleus-core-3.2.2.jar
> < datanucleus-rdbms-3.2.1.jar
> < derby-10.4.2.0.jar
> 53,57d35
> < hive-common-0.12.0.jar
> < hive-exec-0.12.0.jar
> < hive-metastore-0.12.0.jar
> < hive-serde-0.12.0.jar
> < hive-shims-0.12.0.jar
> 60,61d37
> < httpclient-4.1.3.jar
> < httpcore-4.1.3.jar
> 68d43
> < JavaEWAH-0.3.2.jar
> 73d47
> < javolution-5.5.1.jar
> 76d49
> < jdo-api-3.0.1.jar
> 78d50
> < jetty-6.1.26.jar
> 87d58
> < jetty-util-6.1.26.jar
> 93d63
> < json-20090211.jar
> 98d67
> < jta-1.1.jar
> 103,104d71
> < libfb303-0.9.0.jar
> < libthrift-0.9.0.jar
> 112d78
> < mockito-all-1.8.5.jar
> 136d101
> < servlet-api-2.5-20081211.jar
> 139d103
> < snappy-0.2.jar
> 144d107
> < spark-hive_2.10-1.0.0.jar
> 151d113
> < ST4-4.0.4.jar
> 153d114
> < stringtemplate-3.2.1.jar
> 156d116
> < velocity-1.7.jar
> 158d117
> < xz-1.0.jar
> {code}
> Some initial investigation suggests we may need to take some precaution
> surrounding (a) jetty and (b) servlet-api.
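The jar-basename diff in the shell pipeline above can also be sketched in Python; a small illustration using made-up classpath strings in place of the `mvn dependency:build-classpath` output:

```python
def jar_names(classpath):
    # Equivalent of the shell pipeline: split the colon-separated
    # classpath, keep each entry's basename, and sort.
    return sorted(p.rsplit("/", 1)[-1] for p in classpath.split(":") if p)

# Hypothetical classpaths standing in for the two build-classpath runs.
without_hive = "/repo/commons-cli-1.2.jar:/repo/guava-14.0.1.jar"
with_hive = without_hive + ":/repo/org/apache/hive/hive-exec-0.12.0.jar"

# Jars that only appear when -Phive is enabled (the `<` lines above).
hive_only = sorted(set(jar_names(with_hive)) - set(jar_names(without_hive)))
print(hive_only)  # -> ['hive-exec-0.12.0.jar']
```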
--
This message was sent by Atlassian JIRA
(v6.2#6252)