liyunzhang_intel commented on HIVE-14240:

bq. In Pig, they don't require Spark distribution since they only test Spark 
standalone mode in their integration test.

In Pig on Spark, we don't need to download a Spark distribution to run unit 
tests because we currently enable only "local" mode (SPARK_MASTER). We don't 
yet support standalone, yarn-client, or yarn-cluster mode. We just [copy all 
Spark dependency jars published in the Maven repository to the run-time 
classpath|https://github.com/apache/pig/blob/spark/bin/pig#L399] when running 
unit tests.
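
To illustrate the idea of putting resolved dependency jars on the run-time classpath, here is a minimal sketch. It is not the logic in bin/pig; the function name `build_classpath` and the assumption that the jars have already been copied into a single directory are hypothetical.

```python
import os
from pathlib import Path


def build_classpath(jar_dir, existing=""):
    """Return a classpath string with every *.jar in jar_dir appended.

    jar_dir is assumed to be a directory that already holds the Spark
    dependency jars resolved from a Maven repository (hypothetical layout).
    """
    # Sort the jars so the resulting classpath is deterministic.
    jars = sorted(str(p) for p in Path(jar_dir).glob("*.jar"))
    # Keep any existing classpath entries first.
    parts = [existing] if existing else []
    parts.extend(jars)
    # os.pathsep is ":" on Unix-like systems and ";" on Windows.
    return os.pathsep.join(parts)
```

Only the top level of the directory is scanned, and non-jar files are ignored.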

> HoS itests shouldn't depend on a Spark distribution
> ---------------------------------------------------
>                 Key: HIVE-14240
>                 URL: https://issues.apache.org/jira/browse/HIVE-14240
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>    Affects Versions: 2.0.0, 2.1.0, 2.0.1
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
> The HoS integration tests download a full Spark distribution (a tar-ball) 
> from CloudFront. They use this distribution to run Spark locally: a few 
> tests run with Spark in embedded mode, and some run against a local Spark 
> on YARN cluster. The {{itests/pom.xml}} actually contains scripts to download 
> the tar-ball from a pre-defined location.
> This is problematic because the Spark Distribution shades all its 
> dependencies, including Hadoop dependencies. This can cause problems when 
> upgrading the Hadoop version for Hive (ref: HIVE-13930).
> Removing it will also avoid having to download the tar-ball during every 
> build, and simplify the build process for the itests module.
> The Hive itests should instead directly depend on Spark artifacts published 
> in Maven Central. It will require some effort to get this working. The 
> current Hive Spark Client uses a launch script in the Spark installation to 
> run Spark jobs. The script basically does some setup work and invokes 
> org.apache.spark.deploy.SparkSubmit. It is possible to invoke this class 
> directly, which avoids the need to have a full Spark distribution available 
> locally (in fact this option already exists, but isn't tested).
> There may be other issues around classpath conflicts between Hive and Spark. 
> For example, Hive and Spark require different versions of Kryo. One solution 
> to this would be to take Spark artifacts and shade Kryo inside them.
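
As a rough sketch of what invoking org.apache.spark.deploy.SparkSubmit directly (rather than through the launch script of a Spark installation) could look like, the snippet below assembles a plain `java` command line. It is not the Hive Spark Client's actual code; the function `spark_submit_command` and its parameters are hypothetical, and it assumes the Spark jars are already available as classpath entries.

```python
import os

SPARK_SUBMIT_CLASS = "org.apache.spark.deploy.SparkSubmit"


def spark_submit_command(classpath_entries, app_jar, main_class,
                         master="local", app_args=(), java_bin="java"):
    """Build an argv list that runs SparkSubmit straight from the classpath.

    classpath_entries: jars or directories containing Spark and its
    dependencies (assumed to be resolved already, e.g. from Maven).
    """
    classpath = os.pathsep.join(classpath_entries)
    return [
        java_bin, "-cp", classpath,
        SPARK_SUBMIT_CLASS,
        "--master", master,      # e.g. "local" for an in-process run
        "--class", main_class,   # application entry point
        app_jar,
        *app_args,               # arguments passed to the application
    ]
```

The returned list can be passed to `subprocess.run` as-is; nothing here depends on a Spark distribution being unpacked locally.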

This message was sent by Atlassian JIRA