Sahil Takiar commented on HIVE-14240:

I looked into this today and tried to get something working, but I don't think 
it's possible without making some modifications to Spark.

* The HoS integration tests run with {{spark.master=local-cluster[2,2,1024]}}
** Basically, the {{TestSparkCliDriver}} JVM runs the SparkSubmit command (which 
spawns a new process); the SparkSubmit process then creates 2 more 
processes (the Spark Executors, which do the actual work) with 2 cores and 
1024 MB of memory each
** The {{local-cluster}} option is not present in the Spark docs because it is 
mainly used for integration testing within the Spark project itself; it 
basically provides a way of deploying a mini cluster locally
** The advantage of {{local-cluster}} is that it does not require Spark 
Masters or Workers to be running
*** Spark Workers are basically like NodeManagers; a Spark Master is basically 
like HS2
* Looked through the Spark code that launches actual Spark Executors and they 
more or less require a {{SPARK_HOME}} directory to be present (ref: 
** {{SPARK_HOME}} is supposed to point to a directory containing a Spark 

Thus, we would need to modify the {{AbstractCommandBuilder.java}} class in 
Spark so that it doesn't require {{SPARK_HOME}} to be set. However, I'm not 
sure how difficult this will be to do in Spark.

We could change {{spark.master}} from {{local-cluster}} to {{local}}, in 
which case everything will be run locally. However, I think this removes some 
functionality from the HoS tests since running locally isn't the same as 
running against a real mini-cluster.
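
For reference, here is a minimal sketch of how the {{local-cluster[2,2,1024]}} master string breaks down, as I understand it: number of workers, cores per worker, and memory per worker in MB. The class and field names below are hypothetical and only illustrate the format; this is not Spark's own parsing code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark or Hive: splits a
// "local-cluster[workers,coresPerWorker,memoryPerWorkerMb]" master string.
class LocalClusterMaster {
    private static final Pattern FORMAT = Pattern.compile(
        "local-cluster\\[\\s*(\\d+)\\s*,\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\]");

    final int workers;
    final int coresPerWorker;
    final int memoryPerWorkerMb;

    private LocalClusterMaster(int workers, int coresPerWorker, int memoryPerWorkerMb) {
        this.workers = workers;
        this.coresPerWorker = coresPerWorker;
        this.memoryPerWorkerMb = memoryPerWorkerMb;
    }

    // Returns the three fields, or throws if the string is not a
    // local-cluster master (for example plain "local").
    static LocalClusterMaster parse(String master) {
        Matcher m = FORMAT.matcher(master);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a local-cluster master: " + master);
        }
        return new LocalClusterMaster(
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)));
    }
}
```

With the value used by the HoS itests, {{LocalClusterMaster.parse("local-cluster[2,2,1024]")}} yields 2 workers with 2 cores and 1024 MB each, whereas {{local}} has no such fields because everything runs inside a single JVM.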

> HoS itests shouldn't depend on a Spark distribution
> ---------------------------------------------------
>                 Key: HIVE-14240
>                 URL: https://issues.apache.org/jira/browse/HIVE-14240
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>    Affects Versions: 2.0.0, 2.1.0, 2.0.1
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
> The HoS integration tests download a full Spark Distribution (a tar-ball) 
> from CloudFront. They use this distribution to run Spark locally. They run a 
> few tests with Spark in embedded mode, and some tests against a local Spark 
> on YARN cluster. The {{itests/pom.xml}} actually contains scripts to download 
> the tar-ball from a pre-defined location.
> This is problematic because the Spark Distribution shades all its 
> dependencies, including Hadoop dependencies. This can cause problems when 
> upgrading the Hadoop version for Hive (ref: HIVE-13930).
> Removing it will also avoid having to download the tar-ball during every 
> build, and simplify the build process for the itests module.
> The Hive itests should instead directly depend on Spark artifacts published 
> in Maven Central. It will require some effort to get this working. The 
> current Hive Spark Client uses a launch script in the Spark installation to 
> run Spark jobs. The script basically does some setup work and invokes 
> org.apache.spark.deploy.SparkSubmit. It is possible to invoke this class 
> directly, which avoids the need to have a full Spark distribution available 
> locally (in fact this option already exists, but isn't tested).
> There may be other issues around classpath conflicts between Hive and Spark. 
> For example, Hive and Spark require different versions of Kryo. One solution 
> to this would be to take Spark artifacts and shade Kryo inside them.
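
As a rough illustration of the "invoke {{org.apache.spark.deploy.SparkSubmit}} directly" idea in the description above, the sketch below assembles a plain {{java}} command line instead of calling a launch script from a Spark installation. It only builds the argument list and does not start a process. The class name, method, and parameters are hypothetical, and it assumes the supplied classpath already contains the Spark jars; {{--master}} and {{--class}} are the standard spark-submit options.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Hive Spark Client implementation.
class DirectSparkSubmitCommand {
    static List<String> build(String classpath, String master,
                              String mainClass, String appJar) {
        List<String> cmd = new ArrayList<>();
        // Use the java binary of the currently running JVM.
        cmd.add(System.getProperty("java.home")
            + File.separator + "bin" + File.separator + "java");
        cmd.add("-cp");
        cmd.add(classpath); // assumed to include the Spark artifacts
        cmd.add("org.apache.spark.deploy.SparkSubmit");
        cmd.add("--master");
        cmd.add(master);
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add(appJar);
        return cmd;
    }
}
```

The resulting list could be handed to {{java.lang.ProcessBuilder}}. Whether the executors can then start without {{SPARK_HOME}} is a separate question, as discussed in the comment above.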
