The Spark shell (and pyspark) create the Spark session with Hive support by default 
(this is also true when the session is created using getOrCreate, at least in pyspark).
At a minimum there should be a way to configure this using spark-defaults.conf.
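For example, something along these lines (a sketch; I'm assuming the internal 
spark.sql.catalogImplementation setting, which takes "hive" or "in-memory", is 
what governs this):

    # Hypothetical spark-defaults.conf entry (assumes this internal setting
    # is honored at session startup):
    #
    #   spark.sql.catalogImplementation   in-memory
    #
    # Checking from pyspark which catalog implementation the session got:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # If spark-defaults.conf set the key, it shows up in the SparkConf;
    # "hive" is the default the shells appear to use.
    print(spark.sparkContext.getConf().get(
        "spark.sql.catalogImplementation", "hive"))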
Assaf.

From: rxin [via Apache Spark Developers List]
Sent: Tuesday, November 15, 2016 9:46 AM
To: Mendelson, Assaf
Subject: Re: separate spark and hive

If you just start a SparkSession without calling enableHiveSupport, it won't use 
the Hive catalog support.
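In pyspark terms, roughly:

    from pyspark.sql import SparkSession

    # Plain session: backed by the in-memory catalog, no Hive involved.
    spark = SparkSession.builder.appName("no-hive").getOrCreate()

    # The Hive catalog is strictly opt-in through this one builder call:
    # spark = (SparkSession.builder.appName("with-hive")
    #          .enableHiveSupport().getOrCreate())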


On Mon, Nov 14, 2016 at 11:44 PM, Mendelson, Assaf <[hidden email]> wrote:
The default Spark context that gets created is actually a Hive context.
I tried to find documentation on the differences between the Hive context and the 
SQL context and couldn't find any for Spark 2.0 (I know that in previous versions 
a couple of functions, such as window functions, required a Hive context, but 
those all seem to have been addressed in Spark 2.0).
Furthermore, I can't find a way to configure Spark not to use Hive; I can only 
find how to compile it without Hive (and having to build from source each time is 
not a good idea for a production system).

I would suggest that working without Hive should be either a simple configuration 
option or even the default, and that any missing functionality should be 
documented.
Assaf.


From: Reynold Xin [mailto:[hidden email]]
Sent: Tuesday, November 15, 2016 9:31 AM
To: Mendelson, Assaf
Cc: [hidden email]
Subject: Re: separate spark and hive

I agree with the high-level idea, and thus 
SPARK-15691<https://issues.apache.org/jira/browse/SPARK-15691>.

In reality, it's a huge amount of work to create and maintain a custom catalog. 
It might actually make sense to do, but it seems like a lot of work to do right 
now, and it would take a toll on interoperability.

If you don't need a persistent catalog, you can just run Spark without Hive mode, 
can't you?
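Temporary views, for example, work fine against the in-memory catalog; a minimal 
pyspark sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # no enableHiveSupport()

    # Temp views live in the session's in-memory catalog: no metastore is
    # touched, but they disappear when the session ends.
    spark.range(10).createOrReplaceTempView("numbers")
    spark.sql("SELECT count(*) AS n FROM numbers").show()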




On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson <[hidden email]> wrote:
Hi,
Today we basically force people to use Hive if they want to get full use of 
Spark SQL.
With the default installation this means that a derby.log file and a metastore_db 
directory are created in whatever directory we run from. The problem is that if 
we run multiple scripts from the same working directory, they collide: the 
embedded Derby metastore can be held by only one process at a time.
The solution we employ locally is to always run from a different directory, since 
we ignore Hive in practice (this of course means we lose the ability to use some 
of the catalog options on the Spark session). The only other solution is to set 
up a full-blown Hive installation with proper configuration (probably with a 
JDBC-backed metastore).
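A middle-ground workaround would be to give each application its own Derby 
metastore and warehouse directory so concurrent scripts don't collide. A sketch, 
assuming the spark.hadoop.* prefix propagates the JDO setting through to Hive as 
usual; the paths are illustrative:

    from pyspark.sql import SparkSession

    # Per-application Derby metastore and warehouse (illustrative paths).
    spark = (SparkSession.builder
             .config("spark.hadoop.javax.jdo.option.ConnectionURL",
                     "jdbc:derby:;databaseName=/tmp/app1/metastore_db;create=true")
             .config("spark.sql.warehouse.dir", "/tmp/app1/warehouse")
             .enableHiveSupport()
             .getOrCreate())

This still drags in all the metastore machinery, though, which is the point of 
this thread.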

I would propose that in most cases there shouldn't be any Hive use at all. Even 
for catalog features such as saving a permanent table, we should be able to 
configure a target directory and simply write to it, doing everything file-based 
to avoid the need for locking. Hive should be reserved for those who actually use 
it (probably for backward compatibility).
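The path-addressed part of this already works today without any metastore, e.g.:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100)

    # Persist by path: no metastore entry, no locking, just files
    # (the path is illustrative).
    df.write.mode("overwrite").parquet("/tmp/tables/numbers")

    # Any later session can read it back by path alone.
    numbers = spark.read.parquet("/tmp/tables/numbers")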

Am I missing something here?
Assaf.





