Spark 2.0 preview - How to configure warehouse for Catalyst? always pointing to /user/hive/warehouse

2016-06-17 Thread Andrew Lee
From branch-2.0, the Spark 2.0.0 preview:

I found it interesting that no matter how you configure


spark.sql.warehouse.dir


it always falls back to the default path, /user/hive/warehouse.
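For reference, here is a minimal sketch (using the SparkSession API from the 2.0
preview; the app name and path are illustrative) of how I would expect to set it:

import org.apache.spark.sql.SparkSession

// Hypothetical example: set the warehouse location before the first
// SparkSession is created (the path is a placeholder).
val spark = SparkSession.builder()
  .appName("warehouse-config-test")
  .config("spark.sql.warehouse.dir", "/tmp/my-warehouse")
  .enableHiveSupport()
  .getOrCreate()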


In the code, I notice that at line 45 of

./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala


object SimpleAnalyzer extends Analyzer(
  new SessionCatalog(
    new InMemoryCatalog,
    EmptyFunctionRegistry,
    new SimpleCatalystConf(caseSensitiveAnalysis = true)),
  new SimpleCatalystConf(caseSensitiveAnalysis = true))


It will always initialize with SimpleCatalystConf, which applies the hardcoded
default value defined at line 58 of


./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystConf.scala


case class SimpleCatalystConf(
    caseSensitiveAnalysis: Boolean,
    orderByOrdinal: Boolean = true,
    groupByOrdinal: Boolean = true,
    optimizerMaxIterations: Int = 100,
    optimizerInSetConversionThreshold: Int = 10,
    maxCaseBranchesForCodegen: Int = 20,
    runSQLonFile: Boolean = true,
    warehousePath: String = "/user/hive/warehouse")
  extends CatalystConf


I couldn't find any other way to get around this.


It looks like this was fixed by SPARK-15387 in the following commit:


https://github.com/apache/spark/commit/9c817d027713859cac483b4baaaf8b53c040ad93

[SPARK-15387][SQL] SessionCatalog in SimpleAnalyzer does not need to make
database directory (apache/spark@9c817d0)


Just want to confirm this was the root cause and the PR that fixed it. Thanks.






Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-09 Thread Andrew Lee
In fact, it does require the ojdbc JAR from Oracle, which in turn requires a
username and password to download. This was added as part of the test scope for
the Oracle Docker integration tests.


I noticed this PR and commit in branch-2.0 via
https://issues.apache.org/jira/browse/SPARK-12941.

In the comment, I'm not sure what is meant by installing the JAR locally for the
Spark QA test run. If that is the case, it means someone downloaded the JAR from
Oracle and manually added it to the local build machine that builds Spark
branch-2.0, or to an internal Maven repository that serves this ojdbc JAR.




commit 8afe49141d9b6a603eb3907f32dce802a3d05172

Author: thomastechs 

Date:   Thu Feb 25 22:52:25 2016 -0800


[SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map 
string datatypes to Oracle VARCHAR datatype



## What changes were proposed in this pull request?



This pull request is used for the fix for SPARK-12941, creating a data type
mapping to Oracle for the corresponding data type "StringType" from a
DataFrame. This PR is for the master branch fix, whereas another PR has already
been tested with branch 1.4.



## How was this patch tested?



(Please explain how this patch was tested. E.g. unit tests, integration
tests, manual tests)

This patch was tested using the Oracle Docker setup. A new integration suite
was created for it. The oracle.jdbc JAR was to be downloaded from the Maven
repository. Since there was no JDBC JAR available in the Maven repository, the
JAR was downloaded from the Oracle site manually and installed locally, and
tested that way. So, for the SparkQA test run, the ojdbc JAR might be manually
placed in the local Maven repository (com/oracle/ojdbc6/11.2.0.2.0) while the
Spark QA tests run.



Author: thomastechs 



Closes #11306 from thomastechs/master.




Meanwhile, I also notice that the ojdbc groupId provided by Oracle (official
website: https://blogs.oracle.com/dev2dev/entry/how_to_get_oracle_jdbc) is
different:




<dependency>
  <groupId>com.oracle.jdbc</groupId>
  <artifactId>ojdbc6</artifactId>
  <version>11.2.0.4</version>
  <scope>test</scope>
</dependency>




as opposed to the one in Spark branch-2.0
(external/docker-integration-tests/pom.xml):




<dependency>
  <groupId>com.oracle</groupId>
  <artifactId>ojdbc6</artifactId>
  <version>11.2.0.1.0</version>
  <scope>test</scope>
</dependency>





The version is out of date and no longer available from the Oracle Maven
repository. The PR was created a while back, so it may simply have crossed
paths with Oracle's Maven release announcement.


This is just my inference based on what I see from git and JIRA; however, I do
see that a fix is required to patch pom.xml with the correct groupId and
version number for the ojdbc6 driver.


Thoughts?












From: Mich Talebzadeh 
Sent: Tuesday, May 3, 2016 1:04 AM
To: Luciano Resende
Cc: Hien Luu; ☼ R Nair (रविशंकर नायर); user
Subject: Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

which version of Spark are you using?


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 3 May 2016 at 02:13, Luciano Resende <luckbr1...@gmail.com> wrote:
You might have a settings.xml that is forcing your internal Maven repository to 
be the mirror of external repositories and thus not finding the dependency.

On Mon, May 2, 2016 at 6:11 PM, Hien Luu <hien...@gmail.com> wrote:
No, I am not. I am considering downloading it manually and placing it in my
local repository.

On Mon, May 2, 2016 at 5:54 PM, ☼ R Nair (रविशंकर नायर) <ravishankar.n...@gmail.com> wrote:

The Oracle JDBC driver is not in the public Maven repository; are you keeping a
downloaded file in your local repo?

Best, RS

On May 2, 2016 8:51 PM, "Hien Luu" 
mailto:hien...@gmail.com>> wrote:
Hi all,

I am running into a build problem with com.oracle:ojdbc6:jar:11.2.0.1.0. It
keeps getting "Operation timed out" while building the Spark Project Docker
Integration Tests module (see the error below).

Has anyone run into this problem before? If so, how did you resolve it?

[INFO] Reactor Summary:

[INFO]

[INFO] Spark Project Parent POM ... SUCCESS [  2.423 s]

[INFO] Spark Project Test Tags  SUCCESS [  0.712 s]

[INFO] Spark Project Sketch ... SUCCESS [  0.498 s]

[INFO] Spark Project Networking ... SUCCESS [  1.743 s]

[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  0.587 s]

[INFO] Spark Project Unsafe ... SUCCESS [  0.503 s]

[INFO] Spark Project Launcher . SUCCESS [  4.894 s]

[INFO] Spark Project Core . SUCCESS [ 17.953 s]

[INFO]

RE: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Lee
Hi Andrew,
Thanks for the advice. I didn't see the log lines in the NodeManager, so
apparently something was wrong with the yarn-site.xml configuration.
After digging in more, I realized it was a user error. I'm sharing this with
other people so others may know what mistake I made.
When I reviewed the configurations, I noticed that there was another property
setting "yarn.nodemanager.aux-services" in mapred-site.xml. It turns out that
mapred-site.xml overrides the property "yarn.nodemanager.aux-services" in
yarn-site.xml; because of this, the spark_shuffle service was never enabled. :(

After deleting the redundant, invalid properties in mapred-site.xml, it started
working. I see the following logs from the NodeManager:

2015-07-21 21:24:44,046 INFO org.apache.spark.network.yarn.YarnShuffleService: 
Initializing YARN shuffle service for Spark
2015-07-21 21:24:44,046 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Adding 
auxiliary service spark_shuffle, "spark_shuffle"
2015-07-21 21:24:44,264 INFO org.apache.spark.network.yarn.YarnShuffleService: 
Started YARN shuffle service for Spark on port 7337. Authentication is not 
enabled.

I appreciate all the pointers on where to look. Thanks, problem solved.



Date: Tue, 21 Jul 2015 09:31:50 -0700
Subject: Re: The auxService:spark_shuffle does not exist
From: and...@databricks.com
To: alee...@hotmail.com
CC: zjf...@gmail.com; rp...@njit.edu; user@spark.apache.org

Hi Andrew,
Based on your driver logs, it seems the issue is that the shuffle service is 
actually not running on the NodeManagers, but your application is trying to 
provide a "spark_shuffle" secret anyway. One way to verify whether the shuffle 
service is actually started is to look at the NodeManager logs for the 
following lines:
Initializing YARN shuffle service for Spark
Started YARN shuffle service for Spark on port X

These should be logged under the INFO level. Also, could you verify whether all 
the executors have this problem, or just a subset? If even one of the NM 
doesn't have the shuffle service, you'll see the stack trace that you ran into. 
It would be good to confirm whether the yarn-site.xml change is actually 
reflected on all NMs if the log statements above are missing.

Let me know if you can get it working. I've run the shuffle service myself on 
the master branch (which will become Spark 1.5.0) recently following the 
instructions and have not encountered any problems.
-Andrew   

RE: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Lee
Hi Andrew Or,
Yes, the NodeManager was restarted. I also checked the logs to see if the JARs
appear in the CLASSPATH.
I have also downloaded the binary distribution and used the JAR
"spark-1.4.1-bin-hadoop2.4/lib/spark-1.4.1-yarn-shuffle.jar" without success.
Has anyone successfully enabled spark_shuffle via the documentation at
https://spark.apache.org/docs/1.4.1/job-scheduling.html ?
I'm testing it on Hadoop 2.4.1.
Any feedback or suggestions are appreciated, thanks.

Date: Fri, 17 Jul 2015 15:35:29 -0700
Subject: Re: The auxService:spark_shuffle does not exist
From: and...@databricks.com
To: alee...@hotmail.com
CC: zjf...@gmail.com; rp...@njit.edu; user@spark.apache.org

Hi all,
Did you forget to restart the node managers after editing yarn-site.xml by any 
chance?
-Andrew
2015-07-17 8:32 GMT-07:00 Andrew Lee :



I have encountered the same problem after following the document.
Here's my spark-defaults.conf:

spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled  true
spark.dynamicAllocation.executorIdleTimeout 60
spark.dynamicAllocation.cachedExecutorIdleTimeout 120
spark.dynamicAllocation.initialExecutors 2
spark.dynamicAllocation.maxExecutors 8
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.schedulerBacklogTimeout 10

and yarn-site.xml configured:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle,mapreduce_shuffle</value>
</property>
...
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
and deployed the two JARs to the NodeManager's classpath,
/opt/hadoop/share/hadoop/mapreduce/ (I also checked the NodeManager log and the
JARs appear in the classpath). I notice that the JAR location is not the same
as in the 1.4 documentation; I found them under network/yarn/target/ and
network/shuffle/target/ after building with "-Phadoop-2.4 -Psparkr -Pyarn
-Phive -Phive-thriftserver" in Maven:

spark-network-yarn_2.10-1.4.1.jar
spark-network-shuffle_2.10-1.4.1.jar


and still getting the following exception.
Exception in thread "ContainerLauncher #0" java.lang.Error: 
org.apache.spark.SparkException: Exception while starting container 
container_1437141440985_0003_01_02 on host alee-ci-2058-slave-2.test.foo.com
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.spark.SparkException: Exception while starting container 
container_1437141440985_0003_01_02 on host alee-ci-2058-slave-2.test.foo.com
at 
org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:116)
at 
org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:67)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
... 2 more
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The 
auxService:spark_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
Not sure what else I am missing here or doing wrong.
Appreciate any insights or feedback, thanks.

Date: Wed, 8 Jul 2015 09:25:39 +0800
Subject: Re: The auxService:spark_shuffle does not exist
From: zjf...@gmail.com
To: rp...@njit.edu
CC: user@spark.apache.org

Did you enable dynamic resource allocation? You can refer to this page for how
to configure the Spark shuffle service for YARN:
https://spark.apache.org/docs/1.4.0/job-scheduling.html
On Tue, Jul 7, 2015 at 10:55 PM, roy  wrote:
we tried "--master yarn-client" with no different result.







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/The-auxService-spark-shuffle-does-not-exist-tp23662p23689.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.








-- 
Best Regards

Jeff Zhang
  

  

RE: The auxService:spark_shuffle does not exist

2015-07-17 Thread Andrew Lee
I have encountered the same problem after following the document.
Here's my spark-defaults.conf:

spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled  true
spark.dynamicAllocation.executorIdleTimeout 60
spark.dynamicAllocation.cachedExecutorIdleTimeout 120
spark.dynamicAllocation.initialExecutors 2
spark.dynamicAllocation.maxExecutors 8
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.schedulerBacklogTimeout 10

and yarn-site.xml configured:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle,mapreduce_shuffle</value>
</property>
...
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
and deployed the two JARs to the NodeManager's classpath,
/opt/hadoop/share/hadoop/mapreduce/ (I also checked the NodeManager log and the
JARs appear in the classpath). I notice that the JAR location is not the same
as in the 1.4 documentation; I found them under network/yarn/target/ and
network/shuffle/target/ after building with "-Phadoop-2.4 -Psparkr -Pyarn
-Phive -Phive-thriftserver" in Maven:

spark-network-yarn_2.10-1.4.1.jar
spark-network-shuffle_2.10-1.4.1.jar

and still getting the following exception.
Exception in thread "ContainerLauncher #0" java.lang.Error: 
org.apache.spark.SparkException: Exception while starting container 
container_1437141440985_0003_01_02 on host 
alee-ci-2058-slave-2.test.altiscale.com
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.spark.SparkException: Exception while starting container 
container_1437141440985_0003_01_02 on host 
alee-ci-2058-slave-2.test.altiscale.com
at 
org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:116)
at 
org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:67)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
... 2 more
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The 
auxService:spark_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
at 
org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
Not sure what else I am missing here or doing wrong.
Appreciate any insights or feedback, thanks.

Date: Wed, 8 Jul 2015 09:25:39 +0800
Subject: Re: The auxService:spark_shuffle does not exist
From: zjf...@gmail.com
To: rp...@njit.edu
CC: user@spark.apache.org

Did you enable dynamic resource allocation? You can refer to this page for how
to configure the Spark shuffle service for YARN:
https://spark.apache.org/docs/1.4.0/job-scheduling.html
On Tue, Jul 7, 2015 at 10:55 PM, roy  wrote:
we tried "--master yarn-client" with no different result.







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/The-auxService-spark-shuffle-does-not-exist-tp23662p23689.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.








-- 
Best Regards

Jeff Zhang
  

RE: [Spark 1.3.1 on YARN on EMR] Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-06-20 Thread Andrew Lee
Hi Roberto,
I'm not an EMR person, but it looks like option -h is deploying the necessary
datanucleus JARs for you. The requirements for HiveContext are hive-site.xml
and the datanucleus JARs. As long as those two are there, and Spark is compiled
with -Phive, it should work.
spark-shell runs in yarn-client mode. I'm not sure whether your other
application runs under the same mode or a different one. Try specifying
yarn-client mode and see if you get the same result as in spark-shell.
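For reference, a minimal sketch (Spark 1.3-era API; the app name is
illustrative) of forcing yarn-client mode from the driver itself, to compare
against spark-shell:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: run the same HiveContext code in yarn-client mode
// to compare against spark-shell's behavior.
val conf = new SparkConf()
  .setAppName("hivecontext-yarn-client-test")
  .setMaster("yarn-client")
val sc = new SparkContext(conf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)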
From: roberto.coluc...@gmail.com
Date: Wed, 10 Jun 2015 14:32:04 +0200
Subject: [Spark 1.3.1 on YARN on EMR] Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
To: user@spark.apache.org

Hi!
I'm struggling with an issue with Spark 1.3.1 running on YARN, running on an 
AWS EMR cluster. Such cluster is based on AMI 3.7.0 (hence Amazon Linux 
2015.03, Hive 0.13 already installed and configured on the cluster, Hadoop 2.4, 
etc...). I make use of the AWS emr-bootstrap-action "install-spark" 
(https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark) with the 
option/version "-v1.3.1e" so to get the latest Spark for EMR installed and 
available.
I also have a simple Spark Streaming driver in my project. Such driver is part 
of a larger Maven project: in the pom.xml I'm currently using   
[...]
2.10
2.10.4
1.7
1.3.1
2.4.1
[...]

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
  <scope>provided</scope>
</dependency>

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>

In fact, at compile and build time everything works just fine if, in my driver, 
I have:
-
val sparkConf = new SparkConf()
  .setAppName(appName)
  .set("spark.local.dir", "/tmp/" + appName)
  .set("spark.streaming.unpersist", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[java.net.URI], classOf[String]))

val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, config.batchDuration)
import org.apache.spark.streaming.StreamingContext._

ssc.checkpoint(sparkConf.get("spark.local.dir") + checkpointRelativeDir)

< some input reading actions >
< some input transformation actions >

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

sqlContext.sql(<query>)

ssc.start()
ssc.awaitTerminationOrTimeout(config.timeout)

--- 
What happens is that, right after having been launched, the driver fails with
this exception:
15/06/10 11:38:18 ERROR yarn.ApplicationMaster: User class threw exception: 
java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
at 
org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:239)
at org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:235)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
at  myDriver.scala: < line of the sqlContext.sql(query) >
Caused by < some stuff >
Caused by: javax.jdo.JDOFatalUserException: Class 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
Caused by: java.lang.ClassNotFoundException: 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
Thinking about a wrong Hive installation/configuration or libs/classpath 
definition, I SSHed into the cluster and launched a spark-shell. Excluding the 
app configuration and StreamingContext usage/definition, I then carried out all 
the actions listed in the driver implementation, in particular all the 
Hive-related ones and they all went through smoothly!

I also tried to use the optional "-h" argument 
(https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md#arguments-optional)
 in the install-spark emr-bootstrap-action, but the driver failed the very same 
way. Furthermore, when launching a spark-shell (on the EMR cluster with Spark 
installed with the "-h" option), I also got:
15/06/09 14:20:51 WARN conf.HiveConf: hive-default.xml not found on CLASSPATH
15/06/09 14:20:52 INFO metastore.HiveMetaStore: 0: Opening raw store with 
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/06/09 14:20:52 INFO metastore.ObjectStore: ObjectS

RE: GSSException when submitting Spark job in yarn-cluster mode with HiveContext APIs on Kerberos cluster

2015-04-20 Thread Andrew Lee
Hi Marcelo,
Exactly what I need to track, thanks for the JIRA pointer.

> Date: Mon, 20 Apr 2015 14:03:55 -0700
> Subject: Re: GSSException when submitting Spark job in yarn-cluster mode with 
> HiveContext APIs on Kerberos cluster
> From: van...@cloudera.com
> To: alee...@hotmail.com
> CC: user@spark.apache.org
> 
> I think you want to take a look at:
> https://issues.apache.org/jira/browse/SPARK-6207
> 
> On Mon, Apr 20, 2015 at 1:58 PM, Andrew Lee  wrote:
> > Hi All,
> >
> > Affected version: spark 1.2.1 / 1.2.2 / 1.3-rc1
> >
> > Posting this problem to user group first to see if someone is encountering
> > the same problem.
> >
> > When submitting spark jobs that invokes HiveContext APIs on a Kerberos
> > Hadoop + YARN (2.4.1) cluster,
> > I'm getting this error.
> >
> > javax.security.sasl.SaslException: GSS initiate failed [Caused by
> > GSSException: No valid credentials provided (Mechanism level: Failed to find
> > any Kerberos tgt)]
> >
> > Apparently, the Kerberos ticket is not on the remote data node nor computing
> > node since we don't
> > deploy Kerberos tickets, and that is not a good practice either. On the
> > other hand, we can't just SSH to every machine and run kinit for that users.
> > This is not practical and it is insecure.
> >
> > The point here is that shouldn't there be a delegation token during the doAs
> > to use the token instead of the ticket ?
> > I'm trying to understand what is missing in Spark's HiveContext API while a
> > normal MapReduce job that invokes Hive APIs will work, but not in Spark SQL.
> > Any insights or feedback are appreciated.
> >
> > Anyone got this running without pre-deploying (pre-initializing) all tickets
> > node by node? Is this worth filing a JIRA?
> >
> >
> >
> > 15/03/25 18:59:08 INFO hive.metastore: Trying to connect to metastore with
> > URI thrift://alee-cluster.test.testserver.com:9083
> > 15/03/25 18:59:08 ERROR transport.TSaslTransport: SASL negotiation failure
> > javax.security.sasl.SaslException: GSS initiate failed [Caused by
> > GSSException: No valid credentials provided (Mechanism level: Failed to find
> > any Kerberos tgt)]
> > at
> > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
> > at
> > org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
> > at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
> > at
> > org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
> > at
> > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
> > at
> > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:415)
> > at
> > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
> > at
> > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
> > at
> > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:336)
> > at
> > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:214)
> > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > at
> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > at
> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > at
> > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
> > at
> > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:62)
> > at
> > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
> > at
> > org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
> > at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
> > at
> > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
> > at
> > org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.scala:235)
> > at
> > org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.scala:231)
> > at scala.Option.orElse(Option.scala:257)
> > at
> > org.apache.spark.sql.hiv

GSSException when submitting Spark job in yarn-cluster mode with HiveContext APIs on Kerberos cluster

2015-04-20 Thread Andrew Lee
Hi All,
Affected version: spark 1.2.1 / 1.2.2 / 1.3-rc1
Posting this problem to user group first to see if someone is encountering the 
same problem. 
When submitting spark jobs that invokes HiveContext APIs on a Kerberos Hadoop + 
YARN (2.4.1) cluster, I'm getting this error. 
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]
Apparently, the Kerberos ticket is not on the remote data node nor the compute
node, since we don't deploy Kerberos tickets, and that is not a good practice
either. On the other hand, we can't just SSH to every machine and run kinit for
those users. This is not practical and it is insecure.
The point here is: shouldn't there be a delegation token during the doAs, so
that the token is used instead of the ticket? I'm trying to understand what is
missing in Spark's HiveContext API, given that a normal MapReduce job that
invokes Hive APIs works, but Spark SQL does not. Any insights or feedback are
appreciated.
Has anyone got this running without pre-deploying (pre-initializing) all
tickets node by node? Is this worth filing a JIRA?


15/03/25 18:59:08 INFO hive.metastore: Trying to connect to metastore with URI thrift://alee-cluster.test.testserver.com:9083
15/03/25 18:59:08 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
  at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
  at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
  at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
  at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
  at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
  at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
  at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
  at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:336)
  at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:214)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
  at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
  at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
  at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
  at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
  at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
  at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.scala:235)
  at org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.scala:231)
  at scala.Option.orElse(Option.scala:257)
  at org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scala:231)
  at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext.scala:229)
  at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:229)
  at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:229)
  at org.apache.spark.sql.hive.HiveMetastoreCatalog.<init>(HiveMetastoreCatalog.scala:55)
  at org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:253)
  at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:253)
  at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:253)
  at org.apache.spark.sql.hive.HiveContext$$anon$4.<init>(HiveContext.scala:263)
  at org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:263)
  at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:262)
  at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
  at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
  at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
  at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:108)

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2015-02-17 Thread Andrew Lee
Hi All,
Just want to give everyone an update on what worked for me. Thanks for Cheng's
comment and other people's help.
What I had misunderstood was --driver-class-path and how it relates to --files.
I put /etc/hive/hive-site.xml in both --files and --driver-class-path when I
started in yarn-cluster mode:

./bin/spark-submit --verbose --queue research --driver-java-options
"-XX:MaxPermSize=8192M" --files /etc/hive/hive-site.xml --driver-class-path
/etc/hive/hive-site.xml --master yarn --deploy-mode cluster 

The problem here is that --files only looks for local files and distributes
them onto HDFS. The --driver-class-path is what goes onto the CLASSPATH at
runtime, and as you can see, it tries to look at /etc/hive/hive-site.xml on the
container on the remote nodes, which apparently doesn't exist. For some people
it may work fine because they deploy the Hive configuration and JARs across
their entire cluster, so every node looks the same. But that wasn't my case in
a multi-tenant environment or a restricted, secured cluster. So my parameters
look like this when I launch it:
./bin/spark-submit --verbose --queue research --driver-java-options
"-XX:MaxPermSize=8192M" --files /etc/hive/hive-site.xml --driver-class-path
hive-site.xml --master yarn --deploy-mode cluster 

So --driver-class-path here only looks at ./hive-site.xml on the remote
container, which was already pre-deployed there by --files.
This worked for me, and the HiveContext API can talk to the Hive metastore and
vice versa. Thanks.


Date: Thu, 5 Feb 2015 16:59:12 -0800
From: lian.cs@gmail.com
To: linlin200...@gmail.com; huaiyin@gmail.com
CC: user@spark.apache.org
Subject: Re: Spark sql failed in yarn-cluster mode when connecting to 
non-default hive database


  

  
  

Hi Jenny,

You may try to use --files $SPARK_HOME/conf/hive-site.xml --driver-class-path
hive-site.xml when submitting your application. The problem is that when
running in cluster mode, the driver is actually running in a random container
directory on a random executor node. By using --files, you upload hive-site.xml
to the container directory; by using --driver-class-path hive-site.xml, you add
the file to the classpath (the path is relative to the container directory).

When running in cluster mode, have you tried to check the tables inside the
default database? If my guess is right, this should be an empty default
database inside the default Derby metastore created by HiveContext when
hive-site.xml is missing.

Best,
Cheng

On 8/12/14 5:38 PM, Jenny Zhao wrote:
Hi Yin,

hive-site.xml was copied to spark/conf and is the same as the one under
$HIVE_HOME/conf.

Through the hive CLI, I don't see any problem. But for Spark on YARN in
yarn-cluster mode, I am not able to switch to a database other than the default
one; in yarn-client mode, it works fine.

Thanks!

Jenny

On Tue, Aug 12, 2014 at 12:53 PM, Yin Huai  wrote:

  
Hi Jenny,

Have you copied hive-site.xml to the spark/conf directory? If not, can you put
it in conf/ and try again?

Thanks,

Yin

On Mon, Aug 11, 2014 at 8:57 PM, Jenny Zhao  wrote:

Thanks Yin!

Here is my hive-site.xml, which I copied from $HIVE_HOME/conf; I didn't
experience problems connecting to the metastore through hive, which uses DB2 as
the metastore database.

RE: SparkSQL + Tableau Connector

2015-02-17 Thread Andrew Lee
NFO Driver: Semantic Analysis Completed
15/02/11 19:25:35 INFO Driver: Returning Hive schema: Schema(fieldSchemas:null, 
properties:null)
15/02/11 19:25:35 INFO Driver: Starting command: use `default`
15/02/11 19:25:35 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr 
cmd=get_database: default
15/02/11 19:25:35 INFO HiveMetaStore: 3: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:35 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:36 INFO Query: Reading in results for query 
"org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
closing
15/02/11 19:25:36 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:36 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:36 INFO audit: ugi=anonymous ip=unknown-ip-addr 
cmd=get_database: default
15/02/11 19:25:36 INFO Driver: OK
15/02/11 19:25:36 INFO SparkExecuteStatementOperation: Running query 'create 
temporary table test
using org.apache.spark.sql.json
options (path '/data/json/*')'

15/02/11 19:25:38 INFO Driver: Starting command: use `default`
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr 
cmd=get_database: default
15/02/11 19:25:38 INFO HiveMetaStore: 4: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:38 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:38 INFO Query: Reading in results for query 
"org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
closing
15/02/11 19:25:38 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr 
cmd=get_database: default
15/02/11 19:25:38 INFO Driver: OK
15/02/11 19:25:38 INFO SparkExecuteStatementOperation: Running query '
cache table test '
15/02/11 19:25:38 INFO MemoryStore: ensureFreeSpace(211383) called with 
curMem=101514, maxMem=278019440
15/02/11 19:25:38 INFO MemoryStore: Block broadcast_2 stored as values in 
memory (estimated size 206.4 KB, free 264.8 MB)
I see no way in Tableau to see the cached table "test". I think I am missing a
step to associate the temporary table generated by Spark SQL with the
metastore. Any guidance or insights on what I'm missing here would be
appreciated.
Thanks for the assistance.
-Todd
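For reference, a minimal sketch (Spark 1.2-era API; the table name and path are
illustrative) of persisting the data as a metastore-backed table instead of a
session-scoped temporary table, so that JDBC clients such as Tableau can see it:

import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext.
// Hypothetical example: load the JSON data and save it as a permanent table
// registered in the Hive metastore (table name and path are placeholders).
val hiveContext = new HiveContext(sc)
val json = hiveContext.jsonFile("/data/json/")
json.saveAsTable("test_persistent")  // now visible through the metastore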

On Wed, Feb 11, 2015 at 3:20 PM, Andrew Lee  wrote:



Sorry folks, it is executing Spark jobs instead of Hive jobs. I mis-read the 
logs since there were other activities going on on the cluster.

From: alee...@hotmail.com
To: ar...@sigmoidanalytics.com; tsind...@gmail.com
CC: user@spark.apache.org
Subject: RE: SparkSQL + Tableau Connector
Date: Wed, 11 Feb 2015 11:56:44 -0800




I'm using MySQL as the metastore DB with Spark 1.2. I simply copied
hive-site.xml to /etc/spark/ and added the MySQL JDBC JAR to spark-env.sh in
/etc/spark/, and everything works fine now.
My setup looks like this:
Tableau => Spark ThriftServer2 => HiveServer2
It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive table,
it still sends Hive queries to HiveServer2, which runs the MR or Tez engine.
Is this expected?
I thought it should at least use the Catalyst engine and talk to the underlying
HDFS, like the HiveContext API does to pull the data into an RDD. Did I
misunderstand the purpose of Spark ThriftServer2?


Date: Wed, 11 Feb 2015 16:07:40 +0530
Subject: Re: SparkSQL + Tableau Connector
From: ar...@sigmoidanalytics.com
To: tsind...@gmail.com
CC: user@spark.apache.org

Hi,
I used this, though it's using an embedded driver and is not a good approach.
It works. You can configure some other metastore type as well; I have not tried
the metastore URIs.















<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true</value>
  <description>URL for the DB</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>










On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist  wrote:
Hi Arush,
So yes, I want to create the tables through Spark SQL. I have placed the
hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that was
all I should need to do to have the thriftserver use it. Perhaps my
hive-site.xml is wrong; it currently looks like this:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://sandbox.hortonworks.com:9083</value>
  <description>URI for client to contact metastore server</description>
</property>

Which leads me to believe it is going to pull from the thriftserver from
Hortonworks? I will go look at the docs to see if this is right; it is what
Hortonworks says to do. Do you have an example hive-site.xml by chance that
works with Spark SQL?
I am using Tableau 8.3 with the SparkSQL Connector.
Thanks for the assistance.
-Todd
On Wed, Feb 11, 2015 at 2:34 AM, Arush 

RE: SparkSQL + Tableau Connector

2015-02-11 Thread Andrew Lee
Sorry folks, it is executing Spark jobs instead of Hive jobs. I mis-read the 
logs since there were other activities going on on the cluster.

From: alee...@hotmail.com
To: ar...@sigmoidanalytics.com; tsind...@gmail.com
CC: user@spark.apache.org
Subject: RE: SparkSQL + Tableau Connector
Date: Wed, 11 Feb 2015 11:56:44 -0800




I'm using MySQL as the metastore DB with Spark 1.2. I simply copied
hive-site.xml to /etc/spark/ and added the MySQL JDBC JAR to spark-env.sh in
/etc/spark/, and everything works fine now.
My setup looks like this:
Tableau => Spark ThriftServer2 => HiveServer2
It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive table,
it still sends Hive queries to HiveServer2, which runs the MR or Tez engine.
Is this expected?
I thought it should at least use the Catalyst engine and talk to the underlying
HDFS, like the HiveContext API does to pull the data into an RDD. Did I
misunderstand the purpose of Spark ThriftServer2?


Date: Wed, 11 Feb 2015 16:07:40 +0530
Subject: Re: SparkSQL + Tableau Connector
From: ar...@sigmoidanalytics.com
To: tsind...@gmail.com
CC: user@spark.apache.org

Hi,
I used this, though it's using an embedded driver and is not a good approach.
It works. You can configure some other metastore type as well; I have not tried
the metastore URIs.















<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true</value>
  <description>URL for the DB</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>










On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist  wrote:
Hi Arush,
So yes, I want to create the tables through Spark SQL. I have placed the
hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that was
all I should need to do to have the thriftserver use it. Perhaps my
hive-site.xml is wrong; it currently looks like this:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://sandbox.hortonworks.com:9083</value>
  <description>URI for client to contact metastore server</description>
</property>

Which leads me to believe it is going to pull from the thriftserver from
Hortonworks? I will go look at the docs to see if this is right; it is what
Hortonworks says to do. Do you have an example hive-site.xml by chance that
works with Spark SQL?
I am using Tableau 8.3 with the SparkSQL Connector.
Thanks for the assistance.
-Todd
On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda  
wrote:
BTW what tableau connector are you using?
On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda  
wrote:
I am a little confused here: why do you want to create the tables in Hive? You
want to create the tables in spark-sql, right?
If you are not able to find the same tables through Tableau, then thrift is
connecting to a different metastore than your spark-shell.
One way to specify a metastore to thrift is to provide the path to
hive-site.xml while starting thrift, using --files hive-site.xml.
Similarly, you can specify the same metastore to your spark-submit or
spark-shell using the same option.



On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist  wrote:
Arush,
As for #2 do you mean something like this from the docs:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

Or did you have something else in mind?
-Todd

On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist  wrote:
Arush,
Thank you will take a look at that approach in the morning.  I sort of figured 
the answer to #1 was NO and that I would need to do 2 and 3 thanks for 
clarifying it for me.
-Todd
On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda  
wrote:
1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON files?
No.

2. Do I need to do something to expose these via hive / metastore other than
creating a table in hive? Create a table in Spark SQL to expose it via Spark
SQL.

3. Does the thriftserver need to be configured to expose these in some fashion?
Sort of related to question 2: you would need to configure thrift to read from
the metastore you expect it to read from; by default it reads from the
metastore_db directory present in the directory used to launch the thrift
server.


On 11 Feb 2015 01:35, "Todd Nist"  wrote:
Hi,
I'm trying to understand how and what the Tableau connector to SparkSQL is able 
to access.  My understanding is it needs to connect to the thriftserver and I 
am not sure how or if it exposes parquet, json, schemaRDDs, or does it only 
expose schemas defined in the metastore / hive.  
For example, I do the following from the spark-shell which generates a 
schemaRDD from a csv file and saves it as a JSON file as well as a parquet file.
import org.apache.spark.sql.SQLContext
import com.databricks.spark.csv._

val sqlContext = new SQLContext(sc)
val test = sq

RE: Is the Thrift server right for me?

2015-02-11 Thread Andrew Lee
Thanks Judy.
You are right; the query is going to Spark ThriftServer2. I have it set up on a
different port number.
I got the wrong impression because there were other jobs running at the same
time. They were Spark jobs, not Hive jobs.
From: judyn...@exchange.microsoft.com
To: alee...@hotmail.com; sjbru...@uwaterloo.ca; user@spark.apache.org
Subject: RE: Is the Thrift server right for me?
Date: Wed, 11 Feb 2015 20:12:03 +









It should relay the queries to Spark (i.e. you shouldn't see any MR job on
Hadoop, and you should see activity in the Spark app on the headnode UI).

Check your hive-site.xml. Are you pointing to the HiveServer2 port instead of
the Spark thrift port?

Their default ports are both 10000.

 


From: Andrew Lee [mailto:alee...@hotmail.com]


Sent: Wednesday, February 11, 2015 12:00 PM

To: sjbrunst; user@spark.apache.org

Subject: RE: Is the Thrift server right for me?


 

I have ThriftServer2 up and running; however, I notice that it relays the query
to HiveServer2 when I pass hive-site.xml to it.

I'm not sure if this is the expected behavior, but based on what I have up and
running, ThriftServer2 invokes HiveServer2, which results in a MapReduce or Tez
query. In this case, I could just connect directly to HiveServer2 if Hive is
all you need.

If you are a programmer and want to mash up data from Hive with other tables
and data in Spark, then Spark ThriftServer2 seems to be a good integration
point for some use cases.

Please correct me if I misunderstood the purpose of Spark ThriftServer2.


> Date: Thu, 8 Jan 2015 14:49:00 -0700

> From: sjbru...@uwaterloo.ca

> To: user@spark.apache.org

> Subject: Is the Thrift server right for me?

> 

> I'm building a system that collects data using Spark Streaming, does some

> processing with it, then saves the data. I want the data to be queried by

> multiple applications, and it sounds like the Thrift JDBC/ODBC server might

> be the right tool to handle the queries. However, the documentation for the

> Thrift server

> <http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server>


> seems to be written for Hive users who are moving to Spark. I never used

> Hive before I started using Spark, so it is not clear to me how best to use

> this.

> 

> I've tried putting data into Hive, then serving it with the Thrift server.

> But I have not been able to update the data in Hive without first shutting

> down the server. This is a problem because new data is always being streamed

> in, and so the data must continuously be updated.

> 

> The system I'm building is supposed to replace a system that stores the data

> in MongoDB. The dataset has now grown so large that the database index does

> not fit in memory, which causes major performance problems in MongoDB.

> 

> If the Thrift server is the right tool for me, how can I set it up for my

> application? If it is not the right tool, what else can I use?

> 

> 

> 

> --

> View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-the-Thrift-server-right-for-me-tp21044.html

> Sent from the Apache Spark User List mailing list archive at Nabble.com.

> 

> -

> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

> For additional commands, e-mail: user-h...@spark.apache.org

> 



  

RE: Is the Thrift server right for me?

2015-02-11 Thread Andrew Lee
I have ThriftServer2 up and running; however, I notice that it relays the query
to HiveServer2 when I pass hive-site.xml to it.
I'm not sure if this is the expected behavior, but based on what I have up and
running, ThriftServer2 invokes HiveServer2, which results in a MapReduce or Tez
query. In this case, I could just connect directly to HiveServer2 if Hive is
all you need.
If you are a programmer and want to mash up data from Hive with other tables
and data in Spark, then Spark ThriftServer2 seems to be a good integration
point for some use cases.
Please correct me if I misunderstood the purpose of Spark ThriftServer2.

> Date: Thu, 8 Jan 2015 14:49:00 -0700
> From: sjbru...@uwaterloo.ca
> To: user@spark.apache.org
> Subject: Is the Thrift server right for me?
> 
> I'm building a system that collects data using Spark Streaming, does some
> processing with it, then saves the data. I want the data to be queried by
> multiple applications, and it sounds like the Thrift JDBC/ODBC server might
> be the right tool to handle the queries. However,  the documentation for the
> Thrift server
> 
>   
> seems to be written for Hive users who are moving to Spark. I never used
> Hive before I started using Spark, so it is not clear to me how best to use
> this.
> 
> I've tried putting data into Hive, then serving it with the Thrift server.
> But I have not been able to update the data in Hive without first shutting
> down the server. This is a problem because new data is always being streamed
> in, and so the data must continuously be updated.
> 
> The system I'm building is supposed to replace a system that stores the data
> in MongoDB. The dataset has now grown so large that the database index does
> not fit in memory, which causes major performance problems in MongoDB.
> 
> If the Thrift server is the right tool for me, how can I set it up for my
> application? If it is not the right tool, what else can I use?
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-the-Thrift-server-right-for-me-tp21044.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
  

RE: SparkSQL + Tableau Connector

2015-02-11 Thread Andrew Lee
I'm using MySQL as the metastore DB with Spark 1.2. I simply copied
hive-site.xml to /etc/spark/ and added the MySQL JDBC JAR to spark-env.sh in
/etc/spark/, and everything works fine now.
My setup looks like this:
Tableau => Spark ThriftServer2 => HiveServer2
It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive table,
it still sends Hive queries to HiveServer2, which runs the MR or Tez engine.
Is this expected?
I thought it should at least use the Catalyst engine and talk to the underlying
HDFS, like the HiveContext API does to pull the data into an RDD. Did I
misunderstand the purpose of Spark ThriftServer2?
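For illustration, a minimal sketch (Spark 1.2 API; the table name is a
placeholder) of what I mean by pulling the data in through the HiveContext API:

import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext.
// Hypothetical example: HiveContext reads the table's files from HDFS and
// materializes the result as a SchemaRDD, without going through HiveServer2.
val hiveContext = new HiveContext(sc)
val rows = hiveContext.sql("SELECT * FROM some_hive_table LIMIT 10").collect()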


Date: Wed, 11 Feb 2015 16:07:40 +0530
Subject: Re: SparkSQL + Tableau Connector
From: ar...@sigmoidanalytics.com
To: tsind...@gmail.com
CC: user@spark.apache.org

Hi,
I used this, though it's using an embedded driver and is not a good approach.
It works. You can configure some other metastore type as well; I have not tried
the metastore URIs.















<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true</value>
  <description>URL for the DB</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>










On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist  wrote:
Hi Arush,
So yes, I want to create the tables through Spark SQL. I have placed the
hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that was
all I should need to do to have the thriftserver use it. Perhaps my
hive-site.xml is wrong; it currently looks like this:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://sandbox.hortonworks.com:9083</value>
  <description>URI for client to contact metastore server</description>
</property>

Which leads me to believe it is going to pull from the thriftserver from
Hortonworks? I will go look at the docs to see if this is right; it is what
Hortonworks says to do. Do you have an example hive-site.xml by chance that
works with Spark SQL?
I am using Tableau 8.3 with the SparkSQL Connector.
Thanks for the assistance.
-Todd
On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda  
wrote:
BTW what tableau connector are you using?
On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda  
wrote:
I am a little confused here: why do you want to create the tables in Hive? You
want to create the tables in spark-sql, right?
If you are not able to find the same tables through Tableau, then thrift is
connecting to a different metastore than your spark-shell.
One way to specify a metastore to thrift is to provide the path to
hive-site.xml while starting thrift, using --files hive-site.xml.
Similarly, you can specify the same metastore to your spark-submit or
spark-shell using the same option.



On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist  wrote:
Arush,
As for #2 do you mean something like this from the docs:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

Or did you have something else in mind?
-Todd

On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist  wrote:
Arush,
Thank you will take a look at that approach in the morning.  I sort of figured 
the answer to #1 was NO and that I would need to do 2 and 3 thanks for 
clarifying it for me.
-Todd
On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda  
wrote:
1. Can the connector fetch or query schemaRDDs saved to Parquet or JSON files?
No.

2. Do I need to do something to expose these via hive / metastore other than
creating a table in hive? Create a table in Spark SQL to expose it via Spark
SQL.

3. Does the thriftserver need to be configured to expose these in some fashion?
Sort of related to question 2: you would need to configure thrift to read from
the metastore you expect it to read from; by default it reads from the
metastore_db directory present in the directory used to launch the thrift
server.


On 11 Feb 2015 01:35, "Todd Nist"  wrote:
Hi,
I'm trying to understand how and what the Tableau connector to SparkSQL is able 
to access.  My understanding is it needs to connect to the thriftserver and I 
am not sure how or if it exposes parquet, json, schemaRDDs, or does it only 
expose schemas defined in the metastore / hive.  
For example, I do the following from the spark-shell which generates a 
schemaRDD from a csv file and saves it as a JSON file as well as a parquet file.
import org.apache.spark.sql.SQLContext
import com.databricks.spark.csv._

val sqlContext = new SQLContext(sc)
val test = sqlContext.csvFile("/data/test.csv")
test.toJSON().saveAsTextFile("/data/out")
test.saveAsParquetFile("/data/out")









When I connect from Tableau, the only thing I see is the "default" schema and 
nothing in the tables section.
So my questions are:

1.  Can the connector fetch or query schemaRDD's saved to Parquet or JSON fi

RE: hadoopConfiguration for StreamingContext

2015-02-10 Thread Andrew Lee
It looks like this is related to the underlying Hadoop configuration.
Try to deploy the Hadoop configuration with your job with --files and 
--driver-class-path, or to the default /etc/hadoop/conf core-site.xml.
If that is not an option (depending on how your Hadoop cluster is set up), then 
hard-code the value via -Dkey=value to see if it works. The downside is that your 
credentials are exposed in plaintext in the java commands. Alternatively, define it 
in the spark-defaults.conf property "spark.executor.extraJavaOptions", e.g.:
for s3n:
spark.executor.extraJavaOptions "-Dfs.s3n.awsAccessKeyId=X -Dfs.s3n.awsSecretAccessKey="

for s3:
spark.executor.extraJavaOptions "-Dfs.s3.awsAccessKeyId=X -Dfs.s3.awsSecretAccessKey="
Hope this works. Or embed them in the s3n path. Not good security practice 
though.

From: mslimo...@gmail.com
Date: Tue, 10 Feb 2015 10:57:47 -0500
Subject: Re: hadoopConfiguration for StreamingContext
To: ak...@sigmoidanalytics.com
CC: u...@spark.incubator.apache.org

Thanks, Akhil.  I had high hopes for #2, but tried all and no luck.  
I was looking at the source and found something interesting.  The Stack Trace 
(below) directs me to FileInputDStream.scala (line 141).  This is version 
1.1.1, btw.  Line 141 has:
  private def fs: FileSystem = {
    if (fs_ == null) fs_ = directoryPath.getFileSystem(new Configuration())
    fs_
  }
So it looks to me like it doesn't make any attempt to use a configured 
HadoopConf.
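
A hedged illustration of why the settings get lost (this is only a sketch of the 1.1.x 
behaviour): a fresh Configuration sees core-default.xml / core-site.xml on the classpath, 
but nothing that was set programmatically on sc.hadoopConfiguration, so only classpath 
config or credentials embedded in the URL survive the call above.

val freshConf = new org.apache.hadoop.conf.Configuration()   // what FileInputDStream builds internally
println(freshConf.get("fs.s3n.awsAccessKeyId"))               // null unless core-site.xml on the classpath defines it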
Here is the StackTrace:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
must be specified as the username or password (respectively) of a s3n URL, or 
by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties 
(respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy5.initialize(Unknown Source)
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at 
org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$fs(FileInputDStream.scala:141)
at 
org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:107)
at 
org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:75)
...

On Tue, Feb 10, 2015 at 10:28 AM, Akhil Das  wrote:
Try the following:
1. Set the access key and secret key in the sparkContext:
ssc.sparkContext.hadoopConfiguration.set("AWS_ACCESS_KEY_ID",yourAccessKey)
ssc.sparkContext.hadoopConfiguration.set("AWS_SECRET_ACCESS_KEY",yourSecretKey)

2. Set the access key and secret key in the environment before startingyour 
application:
​export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=​

3. Set the access key and secret key inside the hadoop configurations
val 
hadoopConf=ssc.sparkContext.hadoopConfiguration;hadoopConf.set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")hadoopConf.set("fs.s3.awsAccessKeyId",yourAccessKey)hadoopConf.set("fs.s3.awsSecretAccessKey",yourSecretKey)
4. You can also try:
val stream = 
ssc.textFileStream("s3n://yourAccessKey:yourSecretKey@/path/")ThanksBest
 Regards

On Tue, Feb 10, 2015 at 8:27 PM, Marc Limotte  wrote:
I see that StreamingContext has a hadoopConfiguration() method, which can be 
used like this sample I found:
 sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");
sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "XX");
But StreamingContext doesn't have the same thing.  I want to use a 
StreamingContext with s3n: text file input, but can't find a way to set the AWS 
credentials.  I also tried (with no success):
- adding the properties to conf/spark-defaults.conf
- $HADOOP_HOME/conf/hdfs-site.xml
- ENV variables
- embedded as user:password in s3n://user:password@... (w/ url encoding)
- setting the conf as above on a new SparkContext and passing that to the StreamingContext 
  constructor: StreamingContext(sparkContext: SparkContext, batchDuration: Duration)

Can someone point me in the right direction for setting AWS creds (hadoop

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2015-02-03 Thread Andrew Lee
Hi All,
In Spark 1.2.0-rc1, I have tried to set hive.metastore.warehouse.dir to share the 
Hive warehouse location on HDFS; however, it does NOT work in yarn-cluster mode. In 
the Namenode audit log, I see that Spark is trying to access the default hive 
warehouse location, which is 
/user/hive/warehouse/spark_hive_test_yarn_cluster_table as opposed to 
/hive/spark_hive_test_yarn_cluster_table.
A tweaked code snippet from the example looks like this. It was compiled, built, and 
submitted in yarn-cluster mode. (It works in yarn-client mode since the driver machine 
can find hive-site.xml, but we don't deploy hive-site.xml to all data nodes; that is 
not a standard practice. It should instead be shipped via --jars or --files, yet it 
still fails when I do so.)

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive._


object SparkSQLTestCase2HiveContextYarnClusterApp {
 def main(args: Array[String]) {


  val conf = new SparkConf().setAppName("Spark SQL Hive Context TestCase 
Application")
  val sc = new SparkContext(conf)
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)


  import hiveContext._


  // Set default hive warehouse that aligns with /etc/hive/conf/hive-site.xml
  hiveContext.hql("SET hive.metastore.warehouse.dir=hdfs://hive")


  // Create table and clean up data
  hiveContext.hql("CREATE TABLE IF NOT EXISTS 
spark_hive_test_yarn_cluster_table (key INT, value STRING)")


  // load sample data from HDFS, need to be uploaded first
  hiveContext.hql("LOAD DATA INPATH 'spark/test/resources/kv1.txt' INTO TABLE 
spark_hive_test_yarn_cluster_table")


  // Queries are expressed in HiveQL, use collect(), results go into memory, be 
careful. This is just
  // a test case. Do NOT use the following line for production, store results 
to HDFS.
  hiveContext.hql("FROM spark_hive_test_yarn_cluster_table SELECT key, 
value").collect().foreach(println)


  }
}

From: huaiyin@gmail.com
Date: Wed, 13 Aug 2014 16:56:13 -0400
Subject: Re: Spark sql failed in yarn-cluster mode when connecting to 
non-default hive database
To: linlin200...@gmail.com
CC: lian.cs@gmail.com; user@spark.apache.org

I think the problem is that when you are using yarn-cluster mode, because the 
Spark driver runs inside the application master, the hive-conf is not 
accessible by the driver. Can you try to set those confs by using 
hiveContext.set(...)? Or, maybe you can copy hive-site.xml to spark/conf in the 
node running the application master.
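
For the first option, the method on the released API is HiveContext.setConf rather than set; 
a minimal hedged sketch (the warehouse path and metastore URI below are placeholders, and 
whether this happens early enough for metastore connection properties is exactly what gets 
debated elsewhere in this thread):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs:///hive")
hiveContext.setConf("hive.metastore.uris", "thrift://metastore-host:9083")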



On Tue, Aug 12, 2014 at 8:38 PM, Jenny Zhao  wrote:



Hi Yin,

hive-site.xml was copied to spark/conf and the same as the one under 
$HIVE_HOME/conf. 



through hive cli, I don't see any problem. but for spark on yarn-cluster mode, 
I am not able to switch to a database other than the default one, for 
Yarn-client mode, it works fine.  


Thanks!

Jenny




On Tue, Aug 12, 2014 at 12:53 PM, Yin Huai  wrote:

Hi Jenny,
Have you copied hive-site.xml to spark/conf directory? If not, can you put it 
in conf/ and try again?





Thanks,
Yin






On Mon, Aug 11, 2014 at 8:57 PM, Jenny Zhao  wrote:






Thanks Yin! 

here is my hive-site.xml, which I copied from $HIVE_HOME/conf, didn't experience 
problem connecting to the metastore through hive, which uses DB2 as metastore 
database.

<property><name>hive.hwi.listen.port</name><value></value></property>
<property><name>hive.querylog.location</name><value>/var/ibm/biginsights/hive/query/${user.name}</value></property>
<property><name>hive.metastore.warehouse.dir</name><value>/biginsights/hive/warehouse</value></property>
<property><name>hive.hwi.war.file</name><value>lib/hive-hwi-0.12.0.war</value></property>
<property><name>hive.metastore.metrics.enabled</name><value>true</value></property>
<property><name>javax.jdo.option.ConnectionURL</name><value>jdbc:db2://hdtest022.svl.ibm.com:50001/BIDB</value></property>
<property><name>javax.jdo.option.ConnectionDriverName</name><value>com.ibm.db2.jcc.DB2Driver</value></property>
<property><name>hive.stats.autogather</name><value>false</value></property>
<property><name>javax.jdo.mapping.Schema</name><value>HIVE</value></property>
<property><name>javax.jdo.option.ConnectionUserName</name><value>catalog</value></property>
<property><name>javax.jdo.option.ConnectionPassword</name><value>V2pJNWMxbFlVbWhaZHowOQ==</value></property>
<property><name>hive.metastore.password.encrypt</name><value>true</value></property>
<property><name>org.jpox.autoCreateSchema</name><value>true</value></property>
<property><name>hive.server2.thrift.min.worker.threads</name><value>5</value></property>
<property><name>hive.server2.thrift.max.worker.threads</name><value>100</value></property>
<property><name>hive.server2.thrift.port</name><value>1</value></property>
<property><name>hive.server2.thrift.bind.host</name><value>hdtest022.svl.ibm.com</value></property>
<property><name>hive.server2.authentication</name><value>CUSTOM</value></property>
<property><name>hive.server2.custom.authentication.class</name><value>org.apache.hive.service.auth.WebConsoleAuthenticationProviderImpl</value></property>
<property><name>hive.server2.enable.impersonation</name><value>true</value></property>
<property><name>hive.security.webconsole.url</name><value>http://hdtest022.svl.ibm.com:8080</value></property>
<property><name>hive.security.authorization.enabled</name><value>true</value></property>
<property><name>hive.security.authorization.createtable.owner.grants</name><value>ALL</value></property>










On Mon, Aug 11, 2014 at 4:29 PM, Yin Huai  wrote:






Hi Jenny,

How's your metastore configured for both Hive and Spark S

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-12-29 Thread Andrew Lee
A follow-up on the hive-site.xml:
1. If you specify it in spark/conf, then you can NOT also apply it via the 
--driver-class-path option; otherwise, you will get the following exception 
when initializing the SparkContext.








org.apache.spark.SparkException: Found both spark.driver.extraClassPath and 
SPARK_CLASSPATH. Use only the former.
2. If you use the --driver-class-path, then you need to unset SPARK_CLASSPATH. 
However, the flip side is that you will need to provide all the related JARs 
(hadoop-yarn, hadoop-common, hdfs, etc) that are part of the "hadoop-provided" 
if you built your JARs with -Phadoop-provided, and other common libraries that 
are required.

From: alee...@hotmail.com
To: user@spark.apache.org
CC: lian.cs@gmail.com; linlin200...@gmail.com; huaiyin@gmail.com
Subject: RE: Spark sql failed in yarn-cluster mode when connecting to 
non-default hive database
Date: Mon, 29 Dec 2014 16:01:26 -0800




Hi All,
I have tried to pass the properties via SparkContext.setLocalProperty and 
HiveContext.setConf; both failed. Based on the results (I haven't had a chance to 
look into the code yet), HiveContext tries to initiate the JDBC connection right 
away, so I couldn't set other properties dynamically prior to any SQL statement.
The only way to get it to work is to put these properties in hive-site.xml, which 
did work for me. I'm wondering if there's a better way to dynamically specify 
these Hive configurations, such as --hiveconf or a per-user hive-site.xml?
On a shared cluster, hive-site.xml is shared and cannot be managed in a 
multi-user mode on the same edge server, especially when it contains personal 
passwords for metastore access. What would be the best way to pass these 3 
properties to spark-shell?

javax.jdo.option.ConnectionUserName
javax.jdo.option.ConnectionPassword
javax.jdo.option.ConnectionURL

According to the HiveContext documentation, hive-site.xml is picked up from the 
classpath. Is there any way to specify this dynamically for each spark-shell session?
"An instance of the Spark SQL execution engine that integrates with data stored 
in Hive. Configuration for Hive is read from hive-site.xml on the classpath."

Here are the test cases I ran.
Spark 1.2.0

Test Case 1








import org.apache.spark.SparkContext
import org.apache.spark.sql.hive._


sc.setLocalProperty("javax.jdo.option.ConnectionUserName","foo")
sc.setLocalProperty("javax.jdo.option.ConnectionPassword","xx")
sc.setLocalProperty("javax.jdo.option.ConnectionURL","jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true")


val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)


import hiveContext._


// Create table and clean up data
hiveContext.hql("CREATE TABLE IF NOT EXISTS spark_hive_test_table (key INT, 
value STRING)")
// Encounter error, picking up default user 'APP'@'localhost' and creating 
metastore_db in current local directory, not honoring the JDBC settings for the 
metastore on mysql.

Test Case 2
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive._

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("javax.jdo.option.ConnectionUserName","foo")
// Encounter error right here, it looks like HiveContext tries to initiate the JDBC
// connection prior to any settings from setConf.
hiveContext.setConf("javax.jdo.option.ConnectionPassword","xxx")
hiveContext.setConf("javax.jdo.option.ConnectionURL","jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true")




From: huaiyin@gmail.com
Date: Wed, 13 Aug 2014 16:56:13 -0400
Subject: Re: Spark sql failed in yarn-cluster mode when connecting to 
non-default hive database
To: linlin200...@gmail.com
CC: lian.cs@gmail.com; user@spark.apache.org

I think the problem is that when you are using yarn-cluster mode, because the 
Spark driver runs inside the application master, the hive-conf is not 
accessible by the driver. Can you try to set those confs by using 
hiveContext.set(...)? Or, maybe you can copy hive-site.xml to spark/conf in the 
node running the application master.



On Tue, Aug 12, 2014 at 8:38 PM, Jenny Zhao  wrote:



Hi Yin,

hive-site.xml was copied to spark/conf and the same as the one under 
$HIVE_HOME/conf. 



through hive cli, I don't see any problem. but for spark on yarn-cluster mode, 
I am not able to switch to a database other than the default one, for 
Yarn-client mode, it works fine.  


Thanks!

Jenny




On Tue, Aug 12, 2014 at 12:53 PM, Yin Huai  wrote:

Hi Jenny,
Have you copied hive-site.xml to spark/conf directory? If not, can you put it 
in conf/ and try again?





Thanks,
Yin






On Mon, Aug 11, 2014 at 8:57 PM, Jenny Zhao  wrote:






Thanks Yin! 

here is my hive-site.xml, which I copied from $HIVE_HOME/conf, didn't experience 
problem connecting to the metastore through hive, which uses DB2 as metastore 
database.

<property><name>hive.hwi.listen.port</name><value></value></property>
<property><name>hive.querylog.location</name><value>/var/ibm/bigins

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-12-29 Thread Andrew Lee
Hi All,
I have tried to pass the properties via SparkContext.setLocalProperty and 
HiveContext.setConf; both failed. Based on the results (I haven't had a chance to 
look into the code yet), HiveContext tries to initiate the JDBC connection right 
away, so I couldn't set other properties dynamically prior to any SQL statement.
The only way to get it to work is to put these properties in hive-site.xml, which 
did work for me. I'm wondering if there's a better way to dynamically specify 
these Hive configurations, such as --hiveconf or a per-user hive-site.xml?
On a shared cluster, hive-site.xml is shared and cannot be managed in a 
multi-user mode on the same edge server, especially when it contains personal 
passwords for metastore access. What would be the best way to pass these 3 
properties to spark-shell?

javax.jdo.option.ConnectionUserName
javax.jdo.option.ConnectionPassword
javax.jdo.option.ConnectionURL

According to the HiveContext documentation, hive-site.xml is picked up from the 
classpath. Is there any way to specify this dynamically for each spark-shell session?
"An instance of the Spark SQL execution engine that integrates with data stored 
in Hive. Configuration for Hive is read from hive-site.xml on the classpath."

Here are the test cases I ran.
Spark 1.2.0

Test Case 1








import org.apache.spark.SparkContext
import org.apache.spark.sql.hive._


sc.setLocalProperty("javax.jdo.option.ConnectionUserName","foo")
sc.setLocalProperty("javax.jdo.option.ConnectionPassword","xx")
sc.setLocalProperty("javax.jdo.option.ConnectionURL","jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true")


val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)


import hiveContext._


// Create table and clean up data
hiveContext.hql("CREATE TABLE IF NOT EXISTS spark_hive_test_table (key INT, 
value STRING)")
// Encounter error, picking up default user 'APP'@'localhost' and creating 
metastore_db in current local directory, not honoring the JDBC settings for the 
metastore on mysql.

Test Case 2
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive._

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("javax.jdo.option.ConnectionUserName","foo")
// Encounter error right here, it looks like HiveContext tries to initiate the JDBC
// connection prior to any settings from setConf.
hiveContext.setConf("javax.jdo.option.ConnectionPassword","xxx")
hiveContext.setConf("javax.jdo.option.ConnectionURL","jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true")




From: huaiyin@gmail.com
Date: Wed, 13 Aug 2014 16:56:13 -0400
Subject: Re: Spark sql failed in yarn-cluster mode when connecting to 
non-default hive database
To: linlin200...@gmail.com
CC: lian.cs@gmail.com; user@spark.apache.org

I think the problem is that when you are using yarn-cluster mode, because the 
Spark driver runs inside the application master, the hive-conf is not 
accessible by the driver. Can you try to set those confs by using 
hiveContext.set(...)? Or, maybe you can copy hive-site.xml to spark/conf in the 
node running the application master.



On Tue, Aug 12, 2014 at 8:38 PM, Jenny Zhao  wrote:



Hi Yin,

hive-site.xml was copied to spark/conf and the same as the one under 
$HIVE_HOME/conf. 



through hive cli, I don't see any problem. but for spark on yarn-cluster mode, 
I am not able to switch to a database other than the default one, for 
Yarn-client mode, it works fine.  


Thanks!

Jenny




On Tue, Aug 12, 2014 at 12:53 PM, Yin Huai  wrote:

Hi Jenny,
Have you copied hive-site.xml to spark/conf directory? If not, can you put it 
in conf/ and try again?





Thanks,
Yin






On Mon, Aug 11, 2014 at 8:57 PM, Jenny Zhao  wrote:






Thanks Yin! 

here is my hive-site.xml, which I copied from $HIVE_HOME/conf, didn't experience 
problem connecting to the metastore through hive, which uses DB2 as metastore 
database.

<property><name>hive.hwi.listen.port</name><value></value></property>
<property><name>hive.querylog.location</name><value>/var/ibm/biginsights/hive/query/${user.name}</value></property>
<property><name>hive.metastore.warehouse.dir</name><value>/biginsights/hive/warehouse</value></property>
<property><name>hive.hwi.war.file</name><value>lib/hive-hwi-0.12.0.war</value></property>
<property><name>hive.metastore.metrics.enabled</name><value>true</value></property>
<property><name>javax.jdo.option.ConnectionURL</name><value>jdbc:db2://hdtest022.svl.ibm.com:50001/BIDB</value></property>
<property><name>javax.jdo.option.ConnectionDriverName</name><value>com.ibm.db2.jcc.DB2Driver</value></property>
<property><name>hive.stats.autogather</name><value>false</value></property>
<property><name>javax.jdo.mapping.Schema</name><value>HIVE</value></property>
<property><name>javax.jdo.option.ConnectionUserName</name><value>catalog</value></property>
<property><name>javax.jdo.option.ConnectionPassword</name><value>V2pJNWMxbFlVbWhaZHowOQ==</value></property>
<property><name>hive.metastore.password.encrypt</name><value>true</value></property>
<property><name>org.jpox.autoCreateSchema</name><value>true</value></property>
<property><name>hive.server2.thrift.min.worker.threads</name><value>5</value></property>
<property><name>hive.server2.thrift.max.worker.threads</name><value>100</value></property>
<property><name>hive.server2.thrift.port</name><value>1</value></property>
<property><name>hive.server2.thrift.bind.host</name><value>hdtest022.svl.ibm.com</value></property>
<property><name>hive.server2.authentication</name><value>CUSTOM</value></property>

RE: Hive From Spark

2014-08-25 Thread Andrew Lee
Hi Du,
I didn't notice the ticket was updated recently. SPARK-2848 is a sub-task of 
SPARK-2420, and it's already resolved in Spark 1.1.0. It looks like SPARK-2420 
will be released in Spark 1.2.0 according to the current JIRA status.
I'm tracking branch-1.1 instead of master and haven't seen the results merged. 
I'm still seeing guava 14.0.1, so I don't think SPARK-2848 has been merged yet.
Will be great to have someone to confirm or clarify the expectation.
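
A hedged way to double-check which Guava a given build actually resolves at runtime, from a 
spark-shell on that build (this only works for classes loaded from a jar, and HashFunction is 
the class whose missing hashInt triggers the error quoted below):

val loc = classOf[com.google.common.hash.HashFunction]
  .getProtectionDomain.getCodeSource.getLocation
println(loc)   // which jar supplied HashFunction, e.g. the Spark assembly or a guava-11.x from the Hadoop classpath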
> From: l...@yahoo-inc.com.INVALID
> To: van...@cloudera.com; alee...@hotmail.com
> CC: user@spark.apache.org
> Subject: Re: Hive From Spark
> Date: Sat, 23 Aug 2014 00:08:47 +
> 
> I thought the fix had been pushed to the apache master ref. commit
> "[SPARK-2848] Shade Guava in uber-jars" By Marcelo Vanzin on 8/20. So my
> previous email was based on own build of the apache master, which turned
> out not working yet.
> 
> Marcelo: Please correct me if I got that commit wrong.
> 
> Thanks,
> Du
> 
> 
> 
> On 8/22/14, 11:41 AM, "Marcelo Vanzin"  wrote:
> 
> >SPARK-2420 is fixed. I don't think it will be in 1.1, though - might
> >be too risky at this point.
> >
> >I'm not familiar with spark-sql.
> >
> >On Fri, Aug 22, 2014 at 11:25 AM, Andrew Lee  wrote:
> >> Hopefully there could be some progress on SPARK-2420. It looks like
> >>shading
> >> may be the voted solution among downgrading.
> >>
> >> Any idea when this will happen? Could it happen in Spark 1.1.1 or Spark
> >> 1.1.2?
> >>
> >> By the way, regarding bin/spark-sql? Is this more of a debugging tool
> >>for
> >> Spark job integrating with Hive?
> >> How does people use spark-sql? I'm trying to understand the rationale
> >>and
> >> motivation behind this script, any idea?
> >>
> >>
> >>> Date: Thu, 21 Aug 2014 16:31:08 -0700
> >>
> >>> Subject: Re: Hive From Spark
> >>> From: van...@cloudera.com
> >>> To: l...@yahoo-inc.com.invalid
> >>> CC: user@spark.apache.org; u...@spark.incubator.apache.org;
> >>> pwend...@gmail.com
> >>
> >>>
> >>> Hi Du,
> >>>
> >>> I don't believe the Guava change has made it to the 1.1 branch. The
> >>> Guava doc says "hashInt" was added in 12.0, so what's probably
> >>> happening is that you have and old version of Guava in your classpath
> >>> before the Spark jars. (Hadoop ships with Guava 11, so that may be the
> >>> source of your problem.)
> >>>
> >>> On Thu, Aug 21, 2014 at 4:23 PM, Du Li 
> >>>wrote:
> >>> > Hi,
> >>> >
> >>> > This guava dependency conflict problem should have been fixed as of
> >>> > yesterday according to
> >>>https://issues.apache.org/jira/browse/SPARK-2420
> >>> >
> >>> > However, I just got java.lang.NoSuchMethodError:
> >>> >
> >>> > 
> >>>com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/Ha
> >>>shCode;
> >>> > by the following code snippet and "mvn3 test" on Mac. I built the
> >>>latest
> >>> > version of spark (1.1.0-SNAPSHOT) and installed the jar files to the
> >>> > local
> >>> > maven repo. From my pom file I explicitly excluded guava from almost
> >>>all
> >>> > possible dependencies, such as spark-hive_2.10-1.1.0.SNAPSHOT, and
> >>> > hadoop-client. This snippet is abstracted from a larger project. So
> >>>the
> >>> > pom.xml includes many dependencies although not all are required by
> >>>this
> >>> > snippet. The pom.xml is attached.
> >>> >
> >>> > Anybody knows what to fix it?
> >>> >
> >>> > Thanks,
> >>> > Du
> >>> > ---
> >>> >
> >>> > package com.myself.test
> >>> >
> >>> > import org.scalatest._
> >>> > import org.apache.hadoop.io.{NullWritable, BytesWritable}
> >>> > import org.apache.spark.{SparkContext, SparkConf}
> >>> > import org.apache.spark.SparkContext._
> >>> >
> >>> > class MyRecord(name: String) extends Serializable {
> >>> > def getWritable(): BytesWritable = {
> >>> > new
> >>> > 
> >>>BytesWritable(Option(name).getOrElse("\\N").toString.getBytes(

RE: Hive From Spark

2014-08-22 Thread Andrew Lee
Hopefully there will be some progress on SPARK-2420. It looks like shading may be 
the favored solution over downgrading.
Any idea when this will happen? Could it happen in Spark 1.1.1 or Spark 1.1.2? 
By the way, regarding bin/spark-sql: is this more of a debugging tool for Spark 
jobs integrating with Hive? How do people use spark-sql? I'm trying to 
understand the rationale and motivation behind this script; any idea?

> Date: Thu, 21 Aug 2014 16:31:08 -0700
> Subject: Re: Hive From Spark
> From: van...@cloudera.com
> To: l...@yahoo-inc.com.invalid
> CC: user@spark.apache.org; u...@spark.incubator.apache.org; pwend...@gmail.com
> 
> Hi Du,
> 
> I don't believe the Guava change has made it to the 1.1 branch. The
> Guava doc says "hashInt" was added in 12.0, so what's probably
> happening is that you have and old version of Guava in your classpath
> before the Spark jars. (Hadoop ships with Guava 11, so that may be the
> source of your problem.)
> 
> On Thu, Aug 21, 2014 at 4:23 PM, Du Li  wrote:
> > Hi,
> >
> > This guava dependency conflict problem should have been fixed as of
> > yesterday according to https://issues.apache.org/jira/browse/SPARK-2420
> >
> > However, I just got java.lang.NoSuchMethodError:
> > com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
> > by the following code snippet and “mvn3 test” on Mac. I built the latest
> > version of spark (1.1.0-SNAPSHOT) and installed the jar files to the local
> > maven repo. From my pom file I explicitly excluded guava from almost all
> > possible dependencies, such as spark-hive_2.10-1.1.0.SNAPSHOT, and
> > hadoop-client. This snippet is abstracted from a larger project. So the
> > pom.xml includes many dependencies although not all are required by this
> > snippet. The pom.xml is attached.
> >
> > Anybody knows what to fix it?
> >
> > Thanks,
> > Du
> > ---
> >
> > package com.myself.test
> >
> > import org.scalatest._
> > import org.apache.hadoop.io.{NullWritable, BytesWritable}
> > import org.apache.spark.{SparkContext, SparkConf}
> > import org.apache.spark.SparkContext._
> >
> > class MyRecord(name: String) extends Serializable {
> >   def getWritable(): BytesWritable = {
> > new
> > BytesWritable(Option(name).getOrElse("\\N").toString.getBytes("UTF-8"))
> >   }
> >
> >   final override def equals(that: Any): Boolean = {
> > if( !that.isInstanceOf[MyRecord] )
> >   false
> > else {
> >   val other = that.asInstanceOf[MyRecord]
> >   this.getWritable == other.getWritable
> > }
> >   }
> > }
> >
> > class MyRecordTestSuite extends FunSuite {
> >   // construct an MyRecord by Consumer.schema
> >   val rec: MyRecord = new MyRecord("James Bond")
> >
> >   test("generated SequenceFile should be readable from spark") {
> > val path = "./testdata/"
> >
> > val conf = new SparkConf(false).setMaster("local").setAppName("test data
> > exchange with Hive")
> > conf.set("spark.driver.host", "localhost")
> > val sc = new SparkContext(conf)
> > val rdd = sc.makeRDD(Seq(rec))
> > rdd.map((x: MyRecord) => (NullWritable.get(), x.getWritable()))
> >   .saveAsSequenceFile(path)
> >
> > val bytes = sc.sequenceFile(path, classOf[NullWritable],
> > classOf[BytesWritable]).first._2
> > assert(rec.getWritable() == bytes)
> >
> > sc.stop()
> > System.clearProperty("spark.driver.port")
> >   }
> > }
> >
> >
> > From: Andrew Lee 
> > Reply-To: "user@spark.apache.org" 
> > Date: Monday, July 21, 2014 at 10:27 AM
> > To: "user@spark.apache.org" ,
> > "u...@spark.incubator.apache.org" 
> >
> > Subject: RE: Hive From Spark
> >
> > Hi All,
> >
> > Currently, if you are running Spark HiveContext API with Hive 0.12, it won't
> > work due to the following 2 libraries which are not consistent with Hive
> > 0.12 and Hadoop as well. (Hive libs aligns with Hadoop libs, and as a common
> > practice, they should be consistent to work inter-operable).
> >
> > These are under discussion in the 2 JIRA tickets:
> >
> > https://issues.apache.org/jira/browse/HIVE-7387
> >
> > https://issues.apache.org/jira/browse/SPARK-2420
> >
> > When I ran the command by tweaking the classpath and build for Spark
> > 1.0.1-rc3, I was able to cre

RE: Spark SQL, Parquet and Impala

2014-08-02 Thread Andrew Lee
Hi Patrick,
In Impala 1.3.1, when you update tables and metadata, do you still need to run 
'invalidate metadata' in impala-shell? My understanding is that Impala uses a pull 
architecture to refresh the metadata cached on catalogd; I'm not sure whether that 
still applies here, since you are updating the Hive metastore when creating the 
external tables.
If 'invalidate metadata' still applies, I would point this to an Impala problem, 
since HiveContext is passive and depends on when and by whom the command is 
invoked. The underlying driver is still HiveServer2 to Hive (I haven't looked 
into the Spark code, so I'm not sure whether it uses the ql.Driver class; I'm 
assuming it is HS2 here in Spark), and Impala needs to fetch the metadata from 
the Hive metastore. HiveContext should update the Hive metastore when you create 
the table, but that doesn't mean it will trigger Impala's catalogd to pull in 
the latest metadata, which is cached on catalogd.
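
As a concrete hedged sketch of that hand-off (assuming the Parquet files were already written 
by Spark, a Hive version whose DDL understands STORED AS PARQUET, and placeholder table and 
column names; on older Hive the Parquet SerDe and input/output formats would have to be 
spelled out instead):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("""CREATE EXTERNAL TABLE IF NOT EXISTS parq_table (key INT, value STRING)
  STORED AS PARQUET
  LOCATION 'hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt'""")
// Impala's catalogd keeps its own cached copy of the metadata, so afterwards in impala-shell:
//   INVALIDATE METADATA parq_table;   (or REFRESH parq_table; once the table is already known)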
This is probably not a Parquet-related answer, but more background on how Impala 
works with Hive and how Spark updates data into Hive.
AL

Date: Sat, 2 Aug 2014 10:30:27 +0200
Subject: Re: Spark SQL, Parquet and Impala
From: mcgloin.patr...@gmail.com
To: user@spark.apache.org

Hi Michael,
Thanks for your reply.  Is this the correct way to load data from Spark into 
Parquet?  Somehow it doesn't feel right.  When we followed the steps described 
for storing the data into Hive tables everything was smooth, we used 
HiveContext and the table is automatically recognised by Hive (and Impala).

When we loaded the data into Parquet using the method I described we used both 
SQLContext and HiveContext.  We had to manually define the table using the 
CREATE EXTERNAL in Hive.  Then we have to refresh to see changes.

So the problem isn't just the refresh; it's that we're unsure of the best 
practice for loading data into Parquet tables. Is the way we are doing the 
Spark part correct in your opinion?

Best regards,Patrick





On 1 August 2014 19:32, Michael Armbrust  wrote:

So is the only issue that impala does not see changes until you refresh the 
table?  This sounds like a configuration that needs to be changed on the impala 
side.





On Fri, Aug 1, 2014 at 7:20 AM, Patrick McGloin  
wrote:




Sorry, sent early, wasn't finished typing.
CREATE EXTERNAL TABLE 

Then we can select the data using Impala.  But this is registered as an 
external table and must be refreshed if new data is inserted.





Obviously this doesn't seem good and doesn't seem like the correct solution.
How should we insert data from SparkSQL into a Parquet table which can be 
directly queried by Impala?





Best regards,Patrick

On 1 August 2014 16:18, Patrick McGloin  wrote:





Hi,
We would like to use Spark SQL to store data in Parquet format and then query 
that data using Impala.





We've tried to come up with a solution and it is working but it doesn't seem 
good.  So I was wondering if you guys could tell us what is the correct way to 
do this.  We are using Spark 1.0 and Impala 1.3.1.






First we are registering our tables using SparkSQL:
val sqlContext = new SQLContext(sc)
sqlContext.createParquetFile[ParqTable]("hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt", true)







Then we are using the HiveContext to register the table and do the insert:
val hiveContext = new HiveContext(sc)
import hiveContext._
hiveContext.parquetFile("hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt").registerAsTable("ParqTable")





eventsDStream.foreachRDD(event=>event.insertInto("ParqTable"))
Now we have the data stored in a Parquet file.  To access it in Hive or Impala 
we run 












  

Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-31 Thread Andrew Lee
Could you enable the HistoryServer and provide the properties and CLASSPATH for the 
spark-shell? And run the 'env' command to list your environment variables?

By the way, what do the Spark logs say? Enable debug mode to see what's 
going on in spark-shell when it tries to interact with and init HiveContext.



> On Jul 31, 2014, at 19:09, "chenjie"  wrote:
> 
> Hi, Yin and Andrew, thank you for your reply.
> When I create table in hive cli, it works correctly and the table will be
> found in hdfs. I forgot start hiveserver2 before and I started it today.
> Then I run the command below:
>spark-shell --master spark://192.168.40.164:7077  --driver-class-path
> conf/hive-site.xml
> Furthermore, I added the following command:
>hiveContext.hql("SET
> hive.metastore.warehouse.dir=hdfs://192.168.40.164:8020/user/hive/warehouse")
> But then didn't work for me. I got the same exception as before and found
> the table file in local directory instead of hdfs.
> 
> 
> Yin Huai-2 wrote
>> Another way is to set "hive.metastore.warehouse.dir" explicitly to the
>> HDFS
>> dir storing Hive tables by using SET command. For example:
>> 
>> hiveContext.hql("SET
>> hive.metastore.warehouse.dir=hdfs://localhost:54310/user/hive/warehouse")
>> 
>> 
>> 
>> 
>> On Thu, Jul 31, 2014 at 8:05 AM, Andrew Lee <
> 
>> alee526@
> 
>> > wrote:
>> 
>>> Hi All,
>>> 
>>> It has been awhile, but what I did to make it work is to make sure the
>>> followings:
>>> 
>>> 1. Hive is working when you run Hive CLI and JDBC via Hiveserver2
>>> 
>>> 2. Make sure you have the hive-site.xml from above Hive configuration.
>>> The
>>> problem here is that you want the hive-site.xml from the Hive metastore.
>>> The one for Hive and HCatalog may be different files. Make sure you check
>>> the xml properties in that file, pick the one that has the warehouse
>>> property configured and the JDO setup.
>>> 
>>> 3. Make sure hive-site.xml from step 2 is included in $SPARK_HOME/conf,
>>> and in your runtime CLASSPATH when you run spark-shell
>>> 
>>> 4. Use the history server to check the runtime CLASSPATH and order to
>>> ensure hive-site.xml is included.
>>> 
>>> HiveContext should pick up the hive-site.xml and talk to your running
>>> hive
>>> service.
>>> 
>>> Hope these tips help.
>>> 
>>>> On Jul 30, 2014, at 22:47, "chenjie" <
> 
>> chenjie2001@
> 
>> > wrote:
>>>> 
>>>> Hi, Michael. I Have the same problem. My warehouse directory is always
>>>> created locally. I copied the default hive-site.xml into the
>>>> $SPARK_HOME/conf directory on each node. After I executed the code
>>> below,
>>>>   val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>>   hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value
>>>> STRING)")
>>>>   hiveContext.hql("LOAD DATA LOCAL INPATH
>>>> '/extdisk2/tools/spark/examples/src/main/resources/kv1.txt' INTO TABLE
>>> src")
>>>>   hiveContext.hql("FROM src SELECT key, value").collect()
>>>> 
>>>> I got the exception below:
>>>> java.io.FileNotFoundException: File
>>> file:/user/hive/warehouse/src/kv1.txt
>>>> does not exist
>>>>   at
>>> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
>>>>   at
>>> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
>>>>   at
>>> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.
>> 
>> (ChecksumFileSystem.java:137)
>>>>   at
>>> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>>>>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
>>>>   at
>>> org.apache.hadoop.mapred.LineRecordReader.
>> 
>> (LineRecordReader.java:106)
>>>>   at
>>> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>>>>   at org.apache.spark.rdd.HadoopRDD$$anon$1.
>> 
>> (HadoopRDD.scala:193)
>>>> 
>>>> At last, I found /user/hive/warehouse/src/kv1.txt was created on the
>>> node
>>>> where I start spark-shell.
>>>> 
>>>> The spark that I used is pre-built spark1.0.1 for hadoop2.
>>>> 
>>>> Thanks in advance.
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/HiveContext-is-creating-metastore-warehouse-locally-instead-of-in-hdfs-tp10838p1.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark Deployment Patterns - Automated Deployment & Performance Testing

2014-07-31 Thread Andrew Lee
You should be able to use either SBT or Maven to create your JAR files (not a 
fat jar), and deploy only that JAR with spark-submit.

1. Sync the Spark libs and versions with your development env and CLASSPATH in your 
IDE (unfortunately this needs to be hard-copied, and may result in split-brain 
syndrome and version inconsistency if you don't manage this part with your 
Spark Jenkins pipeline, assuming you are building Spark yourself; if not, it's 
easier: just make sure you have the same copy on HDFS or S3 for reuse).

2. Copy only your jar and reuse the assembly jar for Spark core. Either manually 
copy it to HDFS, or let spark-submit pick up your jars and deploy them into 
.sparkStaging; both will work.

You don't need to rebuild Spark every time. 
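
A hedged build.sbt fragment for points 1 and 2 (the version string is only an example for 
that era): marking Spark as "provided" keeps the artifact you hand to spark-submit a thin 
application jar, while the cluster's Spark assembly supplies the core classes at runtime.

// build.sbt (sketch)
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1" % "provided"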

Hope this helps.

AL

> On Jul 30, 2014, at 16:52, "nightwolf"  wrote:
> 
> Hi all,
> 
> We are developing an application which uses Spark & Hive to do static and
> ad-hoc reporting. For these static reports, they take a number of parameters
> and then run over a data set. We would like to make it easier to test
> performance of these reports on a cluster.
> 
> If we have a test cluster running with a sufficient sample data set which
> developers can share. To speed up development time, what is the best way to
> deploy a Spark application to a Spark cluster (in standalone) via an IDE?
> 
> I'm thinking we would create an SBT task which would run the spark submit
> script. Is there a better way?
> 
> Eventually this will feed into some automated performance testing which we
> plan to run as a twice daily Jenkins job. If its an SBT deploy task, it
> makes it easy to call in Jenkins. Is there a better way to do this?
> 
> Posted on StackOverflow as well;
> http://stackoverflow.com/questions/25048784/spark-automated-deployment-performance-testing
>  
> 
> Any advice/experience appreciated!
> 
> Cheers!
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Deployment-Patterns-Automated-Deployment-Performance-Testing-tp11000.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-31 Thread Andrew Lee
Hi All,

It has been a while, but what I did to make it work is to make sure of the 
following:

1. Hive is working when you run Hive CLI and JDBC via Hiveserver2

2. Make sure you have the hive-site.xml from above Hive configuration. The 
problem here is that you want the hive-site.xml from the Hive metastore. The 
one for Hive and HCatalog may be different files. Make sure you check the xml 
properties in that file, pick the one that has the warehouse property 
configured and the JDO setup.

3. Make sure hive-site.xml from step 2 is included in $SPARK_HOME/conf, and in 
your runtime CLASSPATH when you run spark-shell

4. Use the history server to check the runtime CLASSPATH and order to ensure 
hive-site.xml is included.

HiveContext should pick up the hive-site.xml and talk to your running hive 
service.

Hope these tips help.
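
To check that the hive-site.xml from steps 2 and 3 was actually picked up, a hedged 
spark-shell sanity check ("SET <key>" with no value should echo the resolved value, so this 
shows which warehouse dir the context ended up with):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("SET hive.metastore.warehouse.dir").collect().foreach(println)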

> On Jul 30, 2014, at 22:47, "chenjie"  wrote:
> 
> Hi, Michael. I Have the same problem. My warehouse directory is always
> created locally. I copied the default hive-site.xml into the
> $SPARK_HOME/conf directory on each node. After I executed the code below,
>val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value
> STRING)")
>hiveContext.hql("LOAD DATA LOCAL INPATH
> '/extdisk2/tools/spark/examples/src/main/resources/kv1.txt' INTO TABLE src")
>hiveContext.hql("FROM src SELECT key, value").collect()
> 
> I got the exception below:
> java.io.FileNotFoundException: File file:/user/hive/warehouse/src/kv1.txt
> does not exist
>at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
>at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
>at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
>at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
>at
> org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:106)
>at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:193)
> 
> At last, I found /user/hive/warehouse/src/kv1.txt was created on the node
> where I start spark-shell.
> 
> The spark that I used is pre-built spark1.0.1 for hadoop2.
> 
> Thanks in advance.
> 
> 
> Michael Armbrust wrote
>> The warehouse and the metastore directories are two different things.  The
>> metastore holds the schema information about the tables and will by
>> default
>> be a local directory.  With javax.jdo.option.ConnectionURL you can
>> configure it to be something like mysql.  The warehouse directory is the
>> default location where the actual contents of the tables is stored.  What
>> directory are seeing created locally?
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/HiveContext-is-creating-metastore-warehouse-locally-instead-of-in-hdfs-tp10838p11024.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


RE: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-28 Thread Andrew Lee
Hi Jianshi,
My understanding is 'no', based on how Spark is designed, even with your own 
log4j.properties in Spark's conf folder.
In YARN mode, the Application Master runs inside the cluster and all logs are part 
of the container logs, which are governed by another log4j.properties file from the 
Hadoop/YARN environment. Spark can't override that unless it can get its own log4j 
ahead of YARN's in the classpath. So the only way is to log in to the 
ResourceManager UI and click on the job itself to read the container logs. 
(Others, please correct me if my understanding is wrong.)
You may be thinking: why can't I stream the logs to an external service (e.g. 
Flume, syslogd) with a different appender in log4j? I don't consider this a good 
practice since:
1. you need two infrastructures to operate the entire cluster;
2. you will need to open firewall ports between the two services to transfer/stream 
logs;
3. the traffic is unpredictable: the YARN cluster may bring down the logging 
service/infra (DDoS-style) when someone accidentally changes the logging level from 
WARN to INFO, or worse, DEBUG.
I was thinking maybe we can suggest that the community enhance the Spark 
HistoryServer to capture the last failure exception from the container logs in the 
last failed stage? Not sure if this is a good idea, since it may complicate the 
event model. I'm not sure whether the Akka model can support this, or whether some 
other component in Spark could help capture these exceptions, pass them back to the 
AM, and eventually store them somewhere for later troubleshooting. I'm not clear 
how this path is constructed without reading the source code, so I can't give a 
better answer.
AL

From: jianshi.hu...@gmail.com
Date: Mon, 28 Jul 2014 13:32:05 +0800
Subject: Re: Need help, got java.lang.ExceptionInInitializerError in 
Yarn-Client/Cluster mode
To: user@spark.apache.org

Hi Andrew,
Thanks for the reply, I figured out the cause of the issue. Some resource files 
were missing in JARs. A class initialization depends on the resource files so 
it got that exception.


I appended the resource files explicitly to --jars option and it worked fine.
The "Caused by..." messages were actually found in the YARN logs. I think it might 
be useful if I could see them from the console that runs spark-submit. Would 
that be possible?


Jianshi


On Sat, Jul 26, 2014 at 7:08 AM, Andrew Lee  wrote:





Hi Jianshi,
Could you provide which HBase version you're using?
By the way, a quick sanity check on whether the Workers can access HBase?


Were you able to manually write one record to HBase with the serialize 
function? Hardcode and test it ?

From: jianshi.hu...@gmail.com


Date: Fri, 25 Jul 2014 15:12:18 +0800
Subject: Re: Need help, got java.lang.ExceptionInInitializerError in 
Yarn-Client/Cluster mode
To: user@spark.apache.org



I nailed it down to a union operation, here's my code snippet:
val properties: RDD[((String, String, String), Externalizer[KeyValue])] = vertices.map { ve =>
  val (vertices, dsName) = ve
  val rval = GraphConfig.getRval(datasetConf, Constants.VERTICES, dsName)
  val (_, rvalAsc, rvalType) = rval
  println(s"Table name: $dsName, Rval: $rval")
  println(vertices.toDebugString)

  vertices.map { v =>
    val rk = appendHash(boxId(v.id)).getBytes
    val cf = PROP_BYTES
    val cq = boxRval(v.rval, rvalAsc, rvalType).getBytes
    val value = Serializer.serialize(v.properties)
    ((new String(rk), new String(cf), new String(cq)),
     Externalizer(put(rk, cf, cq, value)))
  }
}.reduce(_.union(_)).sortByKey(numPartitions = 32)

Basically I read data from multiple tables (Seq[RDD[(key, value)]]) and they're 
transformed to a KeyValue to be inserted in HBase, so I need to do a 
.reduce(_.union(_)) to combine them into one RDD[(key, value)].
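
A hedged side note, not a fix for the exception itself: SparkContext.union does the same 
combination in one call and keeps the lineage flat, which can make it easier to see which 
input RDD a failure comes from. A toy illustration with placeholder data:

import org.apache.spark.SparkContext._

val rddA = sc.parallelize(Seq(("k1", 1)))
val rddB = sc.parallelize(Seq(("k2", 2)))
val combined = sc.union(Seq(rddA, rddB)).sortByKey(numPartitions = 32)   // instead of Seq(rddA, rddB).reduce(_.union(_))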




I cannot see what's wrong in my code.
Jianshi


On Fri, Jul 25, 2014 at 12:24 PM, Jianshi Huang  wrote:




I can successfully run my code in local mode using spark-submit (--master 
local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode.




Any hints what is the problem? Is it a closure serialization problem? How can I 
debug it? Your answers would be very helpful. 

14/07/25 11:48:14 WARN scheduler.TaskSetManager: Loss was due to java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
    at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:40)
    at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:36)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
    at org.apache.spark.r

RE: Issues on spark-shell and spark-submit behave differently on spark-defaults.conf parameter spark.eventLog.dir

2014-07-28 Thread Andrew Lee
Hi Andrew,
Thanks for re-confirming the problem. I thought it only happened with my own build. :)
By the way, we have multiple users using spark-shell to explore their datasets, and 
we are continuously looking into ways to isolate their job histories. In the current 
situation, we can't really ask them to create their own spark-defaults.conf since it 
is set to read-only. A workaround is to set it to a shared folder, e.g. 
/user/spark/logs, with permissions 1777. This isn't really ideal since other people 
can see what other jobs are running on the shared cluster.
It would be nice to have better security here, so people aren't exposing their 
algorithms (which are usually embedded in the job's name) to other users.
Is there, or will there be, a JIRA ticket to keep track of this? Any plan to enhance 
this part for spark-shell?


Date: Mon, 28 Jul 2014 13:54:56 -0700
Subject: Re: Issues on spark-shell and spark-submit behave differently on 
spark-defaults.conf parameter spark.eventLog.dir
From: and...@databricks.com
To: user@spark.apache.org

Hi Andrew,
It's definitely not bad practice to use spark-shell with HistoryServer. The 
issue here is not with spark-shell, but the way we pass Spark configs to the 
application. spark-defaults.conf does not currently support embedding 
environment variables, but instead interprets everything as a string literal. 
You will have to manually specify "test" instead of "$USER" in the path you 
provide to spark.eventLog.dir.
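
A hedged sketch of that manual expansion for a packaged application (spark-shell builds its 
own context, so there the already-expanded literal still has to go into spark-defaults.conf 
or the launch command): resolve the user on the driver side and pass the expanded value.

import org.apache.spark.{SparkConf, SparkContext}

val user = sys.props.getOrElse("user.name", "unknown")
val conf = new SparkConf()
  .setAppName("eventlog-example")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", s"hdfs:///user/$user/spark/logs")   // expanded client-side, no $USER literal
val sc = new SparkContext(conf)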

-Andrew

2014-07-28 12:40 GMT-07:00 Andrew Lee :




Hi All,
Not sure if anyone has ran into this problem, but this exist in spark 1.0.0 
when you specify the location in conf/spark-defaults.conf for

spark.eventLog.dir hdfs:///user/$USER/spark/logs
to use the $USER env variable. 

For example, I'm running the command with user 'test'.
In spark-submit, the folder will be created on-the-fly and you will see the 
event logs created on HDFS /user/test/spark/logs/spark-pi-1405097484152

but in spark-shell, the user 'test' folder is not created, and you will see 
this /user/$USER/spark/logs on HDFS. It will try to create 
/user/$USER/spark/logs instead of /user/test/spark/logs.

It looks like spark-shell couldn't pick up the env variable $USER to apply for 
the eventLog directory for the running user 'test'.

Is this considered a bug or bad practice to use spark-shell with Spark's 
HistoryServer?









  

  

Issues on spark-shell and spark-submit behave differently on spark-defaults.conf parameter spark.eventLog.dir

2014-07-28 Thread Andrew Lee
Hi All,
Not sure if anyone has ran into this problem, but this exist in spark 1.0.0 
when you specify the location in conf/spark-defaults.conf for
spark.eventLog.dir hdfs:///user/$USER/spark/logs
to use the $USER env variable. 
For example, I'm running the command with user 'test'.
In spark-submit, the folder will be created on-the-fly and you will see the 
event logs created on HDFS /user/test/spark/logs/spark-pi-1405097484152
but in spark-shell, the user 'test' folder is not created, and you will see 
this /user/$USER/spark/logs on HDFS. It will try to create 
/user/$USER/spark/logs instead of /user/test/spark/logs.
It looks like spark-shell couldn't pick up the env variable $USER to apply for 
the eventLog directory for the running user 'test'.
Is this considered a bug or bad practice to use spark-shell with Spark's 
HistoryServer?








  

RE: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-25 Thread Andrew Lee
Hi Jianshi,
Could you provide which HBase version you're using?
By the way, a quick sanity check on whether the Workers can access HBase?
Were you able to manually write one record to HBase with the serialize 
function? Hardcode and test it ?
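
For that sanity check, a minimal hedged sketch against the 0.94/0.98-era HBase client API 
(table name, column family, and values are placeholders; it assumes hbase-site.xml is on the 
classpath):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()              // picks up hbase-site.xml
val table = new HTable(hbaseConf, "test_table")
val put = new Put(Bytes.toBytes("row1"))
put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"))
table.put(put)
table.close()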

From: jianshi.hu...@gmail.com
Date: Fri, 25 Jul 2014 15:12:18 +0800
Subject: Re: Need help, got java.lang.ExceptionInInitializerError in 
Yarn-Client/Cluster mode
To: user@spark.apache.org

I nailed it down to a union operation, here's my code snippet:
val properties: RDD[((String, String, String), Externalizer[KeyValue])] = vertices.map { ve =>
  val (vertices, dsName) = ve
  val rval = GraphConfig.getRval(datasetConf, Constants.VERTICES, dsName)
  val (_, rvalAsc, rvalType) = rval
  println(s"Table name: $dsName, Rval: $rval")
  println(vertices.toDebugString)

  vertices.map { v =>
    val rk = appendHash(boxId(v.id)).getBytes
    val cf = PROP_BYTES
    val cq = boxRval(v.rval, rvalAsc, rvalType).getBytes
    val value = Serializer.serialize(v.properties)
    ((new String(rk), new String(cf), new String(cq)),
     Externalizer(put(rk, cf, cq, value)))
  }
}.reduce(_.union(_)).sortByKey(numPartitions = 32)

Basically I read data from multiple tables (Seq[RDD[(key, value)]]) and they're 
transformed to a KeyValue to be inserted in HBase, so I need to do a 
.reduce(_.union(_)) to combine them into one RDD[(key, value)].


I cannot see what's wrong in my code.
Jianshi


On Fri, Jul 25, 2014 at 12:24 PM, Jianshi Huang  wrote:


I can successfully run my code in local mode using spark-submit (--master 
local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode.


Any hints what is the problem? Is it a closure serialization problem? How can I 
debug it? Your answers would be very helpful. 

14/07/25 11:48:14 WARN scheduler.TaskSetManager: Loss was due to java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
    at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:40)
    at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:36)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)



-- 
Jianshi Huang

LinkedIn: jianshi

Twitter: @jshuang
Github & Blog: http://huangjs.github.com/




-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/



  

RE: Spark SQL and Hive tables

2014-07-25 Thread Andrew Lee
Hi Michael,
If I understand correctly, the assembly JAR file is deployed onto HDFS under the 
/user/$USER/.sparkStaging folder and is used by all computing (worker) nodes when 
people run in yarn-cluster mode.
Could you elaborate on what the document means by this? It is a bit misleading, 
and I guess it only applies to standalone mode?
Andrew L

Date: Fri, 25 Jul 2014 15:25:42 -0700
Subject: RE: Spark SQL and Hive tables
From: ssti...@live.com
To: user@spark.apache.org






Thanks!  Will do.







Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone





 Original message 
From: Michael Armbrust 
Date:07/25/2014 3:24 PM (GMT-08:00) 
To: user@spark.apache.org 
Subject: Re: Spark SQL and Hive tables 






[S]ince Hive has a large number of dependencies, it is not included in the 
default Spark assembly. In order to use Hive
 you must first run ‘SPARK_HIVE=true sbt/sbt assembly/assembly’
 (or use -Phive for
 maven). This command builds a new assembly jar that includes Hive. Note that 
this Hive assembly jar must also be present on all of the worker nodes, as they 
will need access to the Hive serialization and deserialization libraries 
(SerDes) in order to acccess
 data stored in Hive.





On Fri, Jul 25, 2014 at 3:20 PM, Sameer Tilak 
 wrote:



Hi Jerry,




I am having trouble with this. May be something wrong with my import or version 
etc. 



scala> import org.apache.spark.sql._;
import org.apache.spark.sql._



scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
:24: error: object hive is not a member of package org.apache.spark.sql
   val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  ^
Here is what I see for autocompletion:



scala> org.apache.spark.sql.
Row SQLContext  SchemaRDD   SchemaRDDLike   api
catalystcolumnarexecution   package parquet
test







Date: Fri, 25 Jul 2014 17:48:27 -0400


Subject: Re: Spark SQL and Hive tables


From: chiling...@gmail.com

To: user@spark.apache.org





Hi Sameer,



The blog post you referred to is about Spark SQL. I don't think the intent of 
the article is meant to guide you how to read data from Hive via Spark SQL. So 
don't worry too much about the blog post. 



The programming guide I referred to demonstrate how to read data from Hive 
using Spark SQL. It is a good starting point.



Best Regards,



Jerry





On Fri, Jul 25, 2014 at 5:38 PM, Sameer Tilak  wrote:



Hi Michael,
Thanks. I am not creating HiveContext, I am creating SQLContext. I am using CDH 
5.1. Can you please let me know which conf/ directory you are talking about? 





From: mich...@databricks.com

Date: Fri, 25 Jul 2014 14:34:53 -0700


Subject: Re: Spark SQL and Hive tables


To: user@spark.apache.org





In particular, have you put your hive-site.xml in the conf/ directory?  Also, 
are you creating a HiveContext instead of a SQLContext?




On Fri, Jul 25, 2014 at 2:27 PM, Jerry Lam  wrote:


Hi Sameer,



Maybe this page will help you: 
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables



Best Regards,



Jerry










On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak  wrote:




Hi All,
I am trying to load data from Hive tables using Spark SQL. I am using 
spark-shell. Here is what I see: 



val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender, 
demographics.birth_year, demographics.income_group  FROM prod p JOIN 
demographics d ON d.user_id = p.user_id""")



14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch 
MultiInstanceRelations
14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch 
CaseInsensitiveAttributeReferences
java.lang.RuntimeException: Table Not Found: prod.



I have these tables in hive. I used show tables command to confirm this. Can 
someone please let me know how do I make them accessible here? 



RE: Hive From Spark

2014-07-22 Thread Andrew Lee
Hi Sean,
Thanks for clarifying. I re-read SPARK-2420 and now have a better understanding.
From a user perspective, what would you recommend for building Spark with Hive 
0.12 / 0.13+ libraries moving forward and deploying it to a production cluster that 
runs an older version of Hadoop (e.g. 2.2 or 2.4)?
My concern is that there's going to be a lag in technology adoption, and since 
Spark is moving fast, its libraries may always be newer. Protobuf is one good 
example, along with shading. From a business point of view, if there is no benefit 
to upgrading a library, the chance that the upgrade happens with high priority is 
low, due to stability concerns and the cost of re-running the entire test suite. 
Just by observation, there are still a lot of people running Hadoop 2.2 instead of 
2.4 or 2.5, and releases and upgrades depend on other big players such as Cloudera, 
Hortonworks, etc. for their distros, not to mention the process of upgrading itself.
Is there any benefit to using Guava 14 in Spark? I assume there was a reason Spark 
chose Guava 14, but I'm not sure anyone has raised that in the conversation, so I 
don't know whether it is necessary.
Looking forward to seeing Hive on Spark to work soon. Please let me know if 
there's any help or feedback I can provide.
Thanks Sean.


> From: so...@cloudera.com
> Date: Mon, 21 Jul 2014 18:36:10 +0100
> Subject: Re: Hive From Spark
> To: user@spark.apache.org
> 
> I haven't seen anyone actively 'unwilling' -- I hope not. See
> discussion at https://issues.apache.org/jira/browse/SPARK-2420 where I
> sketch what a downgrade means. I think it just hasn't gotten a looking
> over.
> 
> Contrary to what I thought earlier, the conflict does in fact cause
> problems in theory, and you show it causes a problem in practice. Not
> to mention it causes issues for Hive-on-Spark now.
> 
> On Mon, Jul 21, 2014 at 6:27 PM, Andrew Lee  wrote:
> > Hive and Hadoop are using an older version of guava libraries (11.0.1) where
> > Spark Hive is using guava 14.0.1+.
> > The community isn't willing to downgrade to 11.0.1 which is the current
> > version for Hadoop 2.2 and Hive 0.12.
  

RE: Hive From Spark

2014-07-21 Thread Andrew Lee
Hi All,
Currently, if you are running the Spark HiveContext API with Hive 0.12, it won't 
work due to the following two libraries, which are not consistent with Hive 0.12 
and Hadoop. (Hive libs align with Hadoop libs, and as a common practice they should 
be kept consistent to remain interoperable.)
These are under discussion in the 2 JIRA tickets:
https://issues.apache.org/jira/browse/HIVE-7387
https://issues.apache.org/jira/browse/SPARK-2420
When I ran the command after tweaking the classpath and build for Spark 1.0.1-rc3, 
I was able to create a table through HiveContext; however, when I fetch the data, 
it breaks due to incompatible API calls in Guava. This is critical since it needs 
to map the columns to the RDD schema.
Hive and Hadoop are using an older version of the Guava libraries (11.0.1), whereas 
Spark Hive is using Guava 14.0.1+. The community isn't willing to downgrade to 
11.0.1, which is the current version for Hadoop 2.2 and Hive 0.12. Be aware of the 
protobuf version as well in Hive 0.12 (it uses protobuf 2.4).
scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext

scala> import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive._

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@34bee01a

scala> hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
res0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:104
== Query Plan ==

scala> hiveContext.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
res1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[3] at RDD at SchemaRDD.scala:104
== Query Plan ==

scala> // Queries are expressed in HiveQL
scala> hiveContext.hql("FROM src SELECT key, value").collect().foreach(println)
java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
at 
org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
at 
org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
at 
org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
at 
org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
at 
org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
at 
org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:75)
at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:92)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:661)
at org.apache.spark.storage.BlockManager.put(BlockManager.scala:546)
at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:812)
at org.apache.spark.broadcast.HttpBroadcast.(HttpBroadcast.scala:52)
at 
org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:35)
at 
org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:29)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:776)
at org.apache.spark.sql.hive.HadoopTableReader.(TableReader.scala:60)
at 
org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:70)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$4.apply(HiveStrategies.scala:73)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$4.apply(HiveStrategies.scala:73)
at 
org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:280)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:69)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:316)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:316)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:319)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:319)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:420)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:24)
at $iwC$$iwC$$iwC$$iwC
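For reference, a quick way to check which Guava jar actually wins on the driver's 
classpath is to ask the JVM where it loaded HashFunction from. This is a hedged 
diagnostic sketch (run from spark-shell or any Scala REPL with the same classpath), 
not something from the original thread:

// Print the location of the jar that provided com.google.common.hash.HashFunction.
val guavaSource = classOf[com.google.common.hash.HashFunction]
  .getProtectionDomain.getCodeSource.getLocation
println(guavaSource)  // a guava-11.0.1 path here would explain the NoSuchMethodError above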

RE: spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file option?

2014-07-11 Thread Andrew Lee
Ok, I found it on JIRA SPARK-2390:
https://issues.apache.org/jira/browse/SPARK-2390
So it looks like this is a known issue.

From: alee...@hotmail.com
To: user@spark.apache.org
Subject: spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file 
option?
Date: Tue, 8 Jul 2014 15:17:00 -0700




Build: Spark 1.0.0 rc11 (git commit tag: 
2f1dc868e5714882cf40d2633fb66772baf34789)








Hi All,
When I enabled spark-defaults.conf for the event log, spark-shell broke while 
spark-submit works.
I'm trying to create a separate directory per user to keep track of their own 
Spark job event logs, using the env variable $USER in spark-defaults.conf.
Here's the spark-defaults.conf I specified so that the HistoryServer can start 
picking up these event logs from HDFS. As you can see, I was trying to create a 
directory for each user so they can store the event logs on a per-user basis. 
However, when I launch spark-shell, it didn't pick up $USER as the current login 
user, whereas this works for spark-submit.
Here's more details.
/opt/spark/ is SPARK_HOME








[test@ ~]$ cat /opt/spark/conf/spark-defaults.conf
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.


# Example:
# spark.master              spark://master:7077
spark.eventLog.enabled      true
spark.eventLog.dir          hdfs:///user/$USER/spark/logs/
# spark.serializer          org.apache.spark.serializer.KryoSerializer

and I tried to create a separate config file to override the default one:







[test@ ~]$ SPARK_SUBMIT_OPTS="-XX:MaxPermSize=256m" /opt/spark/bin/spark-shell --master yarn --driver-class-path /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar --properties-file /home/test/spark-defaults.conf

[test@ ~]$ cat /home/test/spark-defaults.conf
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master              spark://master:7077
spark.eventLog.enabled      true
spark.eventLog.dir          hdfs:///user/test/spark/logs/
# spark.serializer          org.apache.spark.serializer.KryoSerializer
But it didn't work either; it is still looking at 
/opt/spark/conf/spark-defaults.conf. According to the documentation 
(http://spark.apache.org/docs/latest/configuration.html), the precedence is: 
hardcoded properties in SparkConf > spark-submit / spark-shell > conf/spark-defaults.conf.
2 problems here:
1. In repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala, the SparkConf 
instance doesn't look for the user-specified spark-defaults.conf anywhere.
I don't see anything that pulls in the file given by the --properties-file option; 
it just builds the conf directly:

val conf = new SparkConf()
  .setMaster(getMaster())
  .setAppName("Spark shell")
  .setJars(jars)
  .set("spark.repl.class.uri", intp.classServer.uri)
2. The $USER variable isn't expanded in spark-shell. This may be a separate problem 
that could be fixed at the same time if spark-shell reuses the way SparkSubmit.scala 
populates SparkConf?
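For what it's worth, pulling the properties file in roughly amounts to the hedged 
sketch below. The loadDefaults helper is hypothetical (not part of Spark) and only 
helps where you build the SparkConf yourself, before the SparkContext is created:

import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.spark.SparkConf

// Hypothetical helper: copy spark.* keys from a spark-defaults-style file into a
// SparkConf, only where they are not already set.
def loadDefaults(path: String, conf: SparkConf): SparkConf = {
  val props = new Properties()
  val in = new FileInputStream(path)
  try props.load(in) finally in.close()
  props.asScala
    .filter { case (k, _) => k.startsWith("spark.") }
    .foreach { case (k, v) => conf.setIfMissing(k, v.trim) }
  conf
}

// e.g. loadDefaults("/home/test/spark-defaults.conf", new SparkConf())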


RE: SPARK_CLASSPATH Warning

2014-07-11 Thread Andrew Lee
As mentioned, deprecated in Spark 1.0+.
Try to use the --driver-class-path:
 ./bin/spark-shell --driver-class-path yourlib.jar:abc.jar:xyz.jar

Don't use a glob (*); specify the JARs one by one, separated by colons.
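The same thing can also be expressed through configuration properties rather than 
the command line; a hedged sketch with placeholder jar paths, using the two 
properties the deprecation warning below points to:

import org.apache.spark.SparkConf

// List each jar explicitly, separated by colons (no globs).
val extraJars = "/path/to/yourlib.jar:/path/to/abc.jar:/path/to/xyz.jar"

val conf = new SparkConf()
  .set("spark.driver.extraClassPath", extraJars)
  .set("spark.executor.extraClassPath", extraJars)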

Date: Wed, 9 Jul 2014 13:45:07 -0700
From: kat...@cs.pitt.edu
Subject: SPARK_CLASSPATH Warning
To: user@spark.apache.org

Hello,

I have installed Apache Spark v1.0.0 on a machine with a proprietary Hadoop 
distribution installed (v2.2.0 without YARN). Because the Hadoop distribution I am 
using relies on a list of jars, I make the following changes to conf/spark-env.sh:


#!/usr/bin/env bash

export HADOOP_CONF_DIR=/path-to-hadoop-conf/hadoop-conf
export SPARK_LOCAL_IP=impl41
export 
SPARK_CLASSPATH="/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*"

...

Also, to make sure that I have everything working I execute the Spark shell as 
follows:

[biadmin@impl41 spark]$ ./bin/spark-shell --jars 
/path-to-proprietary-hadoop-lib/lib/*.jar


14/07/09 13:37:28 INFO spark.SecurityManager: Changing view acls to: biadmin
14/07/09 13:37:28 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(biadmin)

14/07/09 13:37:28 INFO spark.HttpServer: Starting HTTP Server
14/07/09 13:37:29 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:29 INFO server.AbstractConnector: Started 
SocketConnector@0.0.0.0:44292

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0
      /_/

Using Scala version 2.10.4 (IBM J9 VM, Java 1.7.0)

Type in expressions to have them evaluated.
Type :help for more information.
14/07/09 13:37:36 WARN spark.SparkConf: 
SPARK_CLASSPATH was detected (set to 
'path-to-proprietary-hadoop-lib/*:/path-to-proprietary-hadoop-lib/lib/*').

This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath


14/07/09 13:37:36 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' 
to '/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*' as 
a work-around.
14/07/09 13:37:36 WARN spark.SparkConf: Setting 'spark.driver.extraClassPath' 
to '/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*' as 
a work-around.

14/07/09 13:37:36 INFO spark.SecurityManager: Changing view acls to: biadmin
14/07/09 13:37:36 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(biadmin)

14/07/09 13:37:37 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/07/09 13:37:37 INFO Remoting: Starting remoting
14/07/09 13:37:37 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://spark@impl41:46081]

14/07/09 13:37:37 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://spark@impl41:46081]
14/07/09 13:37:37 INFO spark.SparkEnv: Registering MapOutputTracker
14/07/09 13:37:37 INFO spark.SparkEnv: Registering BlockManagerMaster

14/07/09 13:37:37 INFO storage.DiskBlockManager: Created local directory at 
/tmp/spark-local-20140709133737-798b
14/07/09 13:37:37 INFO storage.MemoryStore: MemoryStore started with capacity 
307.2 MB.
14/07/09 13:37:38 INFO network.ConnectionManager: Bound socket to port 16685 
with id = ConnectionManagerId(impl41,16685)

14/07/09 13:37:38 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
14/07/09 13:37:38 INFO storage.BlockManagerInfo: Registering block manager 
impl41:16685 with 307.2 MB RAM
14/07/09 13:37:38 INFO storage.BlockManagerMaster: Registered BlockManager

14/07/09 13:37:38 INFO spark.HttpServer: Starting HTTP Server
14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:38 INFO server.AbstractConnector: Started 
SocketConnector@0.0.0.0:21938

14/07/09 13:37:38 INFO broadcast.HttpBroadcast: Broadcast server started at 
http://impl41:21938
14/07/09 13:37:38 INFO spark.HttpFileServer: HTTP File server directory is 
/tmp/spark-91e8e040-f2ca-43dd-b574-805033f476c7

14/07/09 13:37:38 INFO spark.HttpServer: Starting HTTP Server
14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:38 INFO server.AbstractConnector: Started 
SocketConnector@0.0.0.0:52678

14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:38 INFO server.AbstractConnector: Started 
SelectChannelConnector@0.0.0.0:4040
14/07/09 13:37:38 INFO ui.SparkUI: Started SparkUI at http://impl41:4040

14/07/09 13:37:39 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
14/07/09 13:37:39 INFO spark.SparkContext: Added JAR 
file:/opt/ibm/biginsights/IHC/lib/adaptive-mr.jar at 
http://impl41:52678/jars/adaptive-mr.jar with timestamp 1404938259526

14/07/09 13:37:39 INFO executor.Executor: Using REPL class URI: 
http://impl41:44292
14/07/09 13:37

spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file option?

2014-07-08 Thread Andrew Lee
Build: Spark 1.0.0 rc11 (git commit tag: 
2f1dc868e5714882cf40d2633fb66772baf34789)








Hi All,
When I enabled spark-defaults.conf for the event log, spark-shell broke while 
spark-submit works.
I'm trying to create a separate directory per user to keep track of their own 
Spark job event logs, using the env variable $USER in spark-defaults.conf.
Here's the spark-defaults.conf I specified so that the HistoryServer can start 
picking up these event logs from HDFS. As you can see, I was trying to create a 
directory for each user so they can store the event logs on a per-user basis. 
However, when I launch spark-shell, it didn't pick up $USER as the current login 
user, whereas this works for spark-submit.
Here's more details.
/opt/spark/ is SPARK_HOME








[test@ ~]$ cat /opt/spark/conf/spark-defaults.conf
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.


# Example:
# spark.master              spark://master:7077
spark.eventLog.enabled      true
spark.eventLog.dir          hdfs:///user/$USER/spark/logs/
# spark.serializer          org.apache.spark.serializer.KryoSerializer

and I tried to create a separate config file to override the default one:







[test@ ~]$ SPARK_SUBMIT_OPTS="-XX:MaxPermSize=256m" /opt/spark/bin/spark-shell --master yarn --driver-class-path /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar --properties-file /home/test/spark-defaults.conf

[test@ ~]$ cat /home/test/spark-defaults.conf
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master              spark://master:7077
spark.eventLog.enabled      true
spark.eventLog.dir          hdfs:///user/test/spark/logs/
# spark.serializer          org.apache.spark.serializer.KryoSerializer
But it didn't work either; it is still looking at 
/opt/spark/conf/spark-defaults.conf. According to the documentation 
(http://spark.apache.org/docs/latest/configuration.html), the precedence is: 
hardcoded properties in SparkConf > spark-submit / spark-shell > conf/spark-defaults.conf.
2 problems here:
1. In repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala, the SparkConf 
instance doesn't look for the user-specified spark-defaults.conf anywhere.
I don't see anything that pulls in the file given by the --properties-file option; 
it just builds the conf directly:

val conf = new SparkConf()
  .setMaster(getMaster())
  .setAppName("Spark shell")
  .setJars(jars)
  .set("spark.repl.class.uri", intp.classServer.uri)
2. The $USER variable isn't expanded in spark-shell. This may be a separate problem 
that could be fixed at the same time if spark-shell reuses the way SparkSubmit.scala 
populates SparkConf?
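One hedged workaround sketch for the $USER issue: resolve the login user in driver 
code and set the event log directory on the SparkConf before the SparkContext is 
created. The HDFS path layout mirrors the one used in this thread; everything else 
is an assumption:

import org.apache.spark.SparkConf

// Resolve the login user from the JVM instead of relying on $USER expansion
// in spark-defaults.conf.
val user = sys.props("user.name")

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", s"hdfs:///user/$user/spark/logs/")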


RE: Spark logging strategy on YARN

2014-07-07 Thread Andrew Lee
Hi Kudryavtsev,
Here's what I am doing as a common practice and reference. I don't want to call it 
best practice, since that requires a lot of user experience and feedback, but from 
a development and operations standpoint it is great to separate the YARN container 
logs from the Spark logs.
Event Log - Use HistoryServer to take a look at the workflow, overall resource 
usage, etc for the Job.

Spark Log - Provide readable info on settings and configuration, and is covered 
by the event logs. You can customize this in the 'conf' folder with your own 
log4j.properties file. This won't be picked up by your YARN container since 
your Hadoop may be referring to a different log4j file somewhere else.
Stderr/Stdout log - This is actually picked up by the YARN container and you 
won't be able to override this unless you override the one in the resource 
folder (yarn/common/src/main/resources/log4j-spark-container.properties) during 
the build process and include it in your build (JAR file).
One thing I haven't tried yet is to separate that resource file into its own JAR 
and include it in the extra-jar options on HDFS to suppress the log. This is more 
about exploiting the CLASSPATH search behavior to override the YARN log4j settings 
without rebuilding JARs to include the YARN container log4j settings. I don't know 
if this is a good practice though; it's just an idea that gives people flexibility, 
but probably not a good practice.
Anyone else have ideas? thoughts?








> From: kudryavtsev.konstan...@gmail.com
> Subject: Spark logging strategy on YARN
> Date: Thu, 3 Jul 2014 22:26:48 +0300
> To: user@spark.apache.org
> 
> Hi all,
> 
> Could you please share your the best practices on writing logs in Spark? I’m 
> running it on YARN, so when I check logs I’m bit confused… 
> Currently, I’m writing System.err.println to put a message in log and access 
> it via YARN history server. But, I don’t like this way… I’d like to use 
> log4j/slf4j and write them to more concrete place… any practices?
> 
> Thank you in advance
  

RE: Enable Parsing Failed or Incompleted jobs on HistoryServer (YARN mode)

2014-07-07 Thread Andrew Lee
Hi Suren,
It showed up after a while when I touched the APPLICATION_COMPLETE file in the 
event log folders.
I checked the source code, and it looks like it is re-scanning (polling) the 
folders every 10 seconds (configurable)?
Not sure what exactly triggers that 'refresh'; may need to do more digging.
Thanks.


Date: Thu, 3 Jul 2014 06:56:46 -0400
Subject: Re: Enable Parsing Failed or Incompleted jobs on HistoryServer (YARN 
mode)
From: suren.hira...@velos.io
To: user@spark.apache.org

I've had some odd behavior with jobs showing up in the history server in 1.0.0. 
Failed jobs do show up but it seems they can show up minutes or hours later. I 
see in the history server logs messages about bad task ids. But then eventually 
the jobs show up.

This might be your situation.
Anecdotally, if you click on the job in the Spark Master GUI after it is done, 
this may help it show up in the history server faster. Haven't reliably tested 
this though. May just be a coincidence of timing.

-Suren


On Wed, Jul 2, 2014 at 8:01 PM, Andrew Lee  wrote:




Hi All,
I have HistoryServer up and running, and it is great.
Is it possible to also enable the HistoryServer to parse failed jobs' events by 
default as well?

I get "No Completed Applications Found" if a job fails.

=====
Event Log Location: hdfs:///user/test01/spark/logs/
No Completed Applications Found
=====
The reason is that it is good to run the HistoryServer to keep track of performance 
and resource usage for each completed job, but I find it even more useful when a 
job fails. I can identify which stage failed, etc., instead of sifting through the 
logs from the Resource Manager. The same event log is only available while the 
Application Master is still active; once the job fails, the Application Master is 
killed and I lose GUI access. Even though I have the event log in JSON format, I 
can't open it with the HistoryServer.
This is very helpful especially for long running jobs that last for 2-18 hours 
that generates Gigabytes of logs.
So I have 2 questions:

1. Any reason why we only render completed jobs? Why can't we bring in all jobs 
and choose from the GUI? Like a time machine to restore the status from the 
Application Master?








./core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala 








val logInfos = logDirs

  .sortBy { dir => getModificationTime(dir) }

  .map { dir => (dir, 
EventLoggingListener.parseLoggingInfo(dir.getPath, fileSystem)) }

  .filter { case (dir, info) => info.applicationComplete }




2. If I force to touch a file "APPLICATION_COMPLETE" in the failed job event 
log folder, will this cause any problem?

  


-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063

E: suren.hira...@velos.io
W: www.velos.io



  

Enable Parsing Failed or Incompleted jobs on HistoryServer (YARN mode)

2014-07-02 Thread Andrew Lee
Hi All,
I have HistoryServer up and running, and it is great.
Is it possible to also enable the HistoryServer to parse failed jobs' events by 
default as well?
I get "No Completed Applications Found" if a job fails.

=====
Event Log Location: hdfs:///user/test01/spark/logs/
No Completed Applications Found
=====
The reason is that it is good to run the HistoryServer to keep track of performance 
and resource usage for each completed job, but I find it even more useful when a 
job fails. I can identify which stage failed, etc., instead of sifting through the 
logs from the Resource Manager. The same event log is only available while the 
Application Master is still active; once the job fails, the Application Master is 
killed and I lose GUI access. Even though I have the event log in JSON format, I 
can't open it with the HistoryServer.
This is very helpful especially for long running jobs that last for 2-18 hours 
that generates Gigabytes of logs.
So I have 2 questions:
1. Any reason why we only render completed jobs? Why can't we bring in all jobs 
and choose from the GUI? Like a time machine to restore the status from the 
Application Master?








./core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala 







val logInfos = logDirs
  .sortBy { dir => getModificationTime(dir) }
  .map { dir => (dir, 
EventLoggingListener.parseLoggingInfo(dir.getPath, fileSystem)) }
  .filter { case (dir, info) => info.applicationComplete }

2. If I force to touch a file "APPLICATION_COMPLETE" in the failed job event 
log folder, will this cause any problem?
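Regarding question 2, touching the marker file can also be done with the Hadoop 
FileSystem API. A hedged sketch (the event log directory name is hypothetical), 
with the caveat that it only makes the applicationComplete filter shown above 
accept the directory:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
// Hypothetical event log directory of the failed application.
val eventLogDir = new Path("hdfs:///user/test01/spark/logs/app-example")
val fs = eventLogDir.getFileSystem(hadoopConf)

// Create an empty APPLICATION_COMPLETE marker so the HistoryServer treats
// the run as complete and lists it.
fs.create(new Path(eventLogDir, "APPLICATION_COMPLETE")).close()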


RE: write event logs with YARN

2014-07-02 Thread Andrew Lee
Hi Christophe,
Make sure you have 3 slashes in the hdfs scheme.
e.g.
hdfs:///:9000/user//spark-events
and in the spark-defaults.conf as well:
spark.eventLog.dir=hdfs:///:9000/user//spark-events

> Date: Thu, 19 Jun 2014 11:18:51 +0200
> From: christophe.pre...@kelkoo.com
> To: user@spark.apache.org
> Subject: write event logs with YARN
> 
> Hi,
> 
> I am trying to use the new Spark history server in 1.0.0 to view finished 
> applications (launched on YARN), without success so far.
> 
> Here are the relevant configuration properties in my spark-defaults.conf:
> 
> spark.yarn.historyServer.address=:18080
> spark.ui.killEnabled=false
> spark.eventLog.enabled=true
> spark.eventLog.compress=true
> spark.eventLog.dir=hdfs://:9000/user//spark-events
> 
> And the history server has been launched with the command below:
> 
> /opt/spark/sbin/start-history-server.sh 
> hdfs://:9000/user//spark-events
> 
> 
> However, the finished application do not appear in the history server UI 
> (though the UI itself works correctly).
> Apparently, the problem is that the APPLICATION_COMPLETE file is not created:
> 
> hdfs dfs -stat %n spark-events/-1403166516102/*
> COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec
> EVENT_LOG_2
> SPARK_VERSION_1.0.0
> 
> Indeed, if I manually create an empty APPLICATION_COMPLETE file in the above 
> directory, the application can now be viewed normally in the history server.
> 
> Finally, here is the relevant part of the YARN application log, which seems 
> to imply that
> the DFS Filesystem is already closed when the APPLICATION_COMPLETE file is 
> created:
> 
> (...)
> 14/06/19 08:29:29 INFO ApplicationMaster: finishApplicationMaster with 
> SUCCEEDED
> 14/06/19 08:29:29 INFO AMRMClientImpl: Waiting for application to be 
> successfully unregistered.
> 14/06/19 08:29:29 INFO ApplicationMaster: AppMaster received a signal.
> 14/06/19 08:29:29 INFO ApplicationMaster: Deleting staging directory 
> .sparkStaging/application_1397477394591_0798
> 14/06/19 08:29:29 INFO ApplicationMaster$$anon$1: Invoking sc stop from 
> shutdown hook
> 14/06/19 08:29:29 INFO SparkUI: Stopped Spark web UI at 
> http://dc1-ibd-corp-hadoop-02.corp.dc1.kelkoo.net:54877
> 14/06/19 08:29:29 INFO DAGScheduler: Stopping DAGScheduler
> 14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Shutting down all 
> executors
> 14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Asking each executor to 
> shut down
> 14/06/19 08:29:30 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor 
> stopped!
> 14/06/19 08:29:30 INFO ConnectionManager: Selector thread was interrupted!
> 14/06/19 08:29:30 INFO ConnectionManager: ConnectionManager stopped
> 14/06/19 08:29:30 INFO MemoryStore: MemoryStore cleared
> 14/06/19 08:29:30 INFO BlockManager: BlockManager stopped
> 14/06/19 08:29:30 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
> 14/06/19 08:29:30 INFO BlockManagerMaster: BlockManagerMaster stopped
> Exception in thread "Thread-44" java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1365)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1307)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:384)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:380)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:380)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:324)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)
> at org.apache.spark.util.FileLogger.createWriter(FileLogger.scala:117)
> at org.apache.spark.util.FileLogger.newFile(FileLogger.scala:181)
> at 
> org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:129)
> at 
> org.apache.spark.SparkContext$$anonfun$stop$2.apply(SparkContext.scala:989)
> at 
> org.apache.spark.SparkContext$$anonfun$stop$2.apply(SparkContext.scala:989)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:989)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:443)
> 14/06/19 08:29:30 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 
> 
> Am I missing something, or is it a bug?
> 
> Thanks,
> Christophe.
> 
> Kelkoo SAS
> Société par Actions Simplifiée
> Au capital de € 4.168.964,30
> Siège social : 8, rue du Sentier 75002 Paris
> 425 093 069 RCS Paris
> 
> Ce message et les pièces jointe

RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-23 Thread Andrew Lee
I checked the source code, it looks like it was re-added back based on JIRA 
SPARK-1588, but I don't know if there's any test case associated with this?









  SPARK-1588.  Restore SPARK_YARN_USER_ENV and SPARK_JAVA_OPTS for YARN.
  Sandy Ryza 
  2014-04-29 12:54:02 -0700
  Commit: 5f48721, github.com/apache/spark/pull/586


From: alee...@hotmail.com
To: user@spark.apache.org
Subject: RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn 
mode
Date: Wed, 18 Jun 2014 11:24:36 -0700




Forgot to mention that I am using spark-submit to submit jobs; a verbose-mode 
printout looks like this with the SparkPi example. The .sparkStaging folder won't 
be deleted. My thought is that this should be part of the staging and should be 
cleaned up as well when the SparkContext gets terminated.









[test@ spark]$ SPARK_YARN_USER_ENV="spark.yarn.preserve.staging.files=false" 
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.2.0.jar 
./bin/spark-submit --verbose --master yarn --deploy-mode cluster --class 
org.apache.spark.examples.SparkPi --driver-memory 512M --driver-library-path 
/opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar --executor-memory 512M 
--executor-cores 1 --queue research --num-executors 2 
examples/target/spark-examples_2.10-1.0.0.jar 

Using properties file: null
Using properties file: null
Parsed arguments:
  master  yarn
  deployMode  cluster
  executorMemory  512M
  executorCores   1
  totalExecutorCores  null
  propertiesFile  null
  driverMemory            512M
  driverCores null
  driverExtraClassPathnull
  driverExtraLibraryPath  /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar
  driverExtraJavaOptions  null
  supervise   false
  queue   research
  numExecutors            2
  files   null
  pyFiles null
  archivesnull
  mainClass   org.apache.spark.examples.SparkPi
  primaryResource 
file:/opt/spark/examples/target/spark-examples_2.10-1.0.0.jar
  name                    org.apache.spark.examples.SparkPi
  childArgs               []
  jars                    null
  verbose true


Default properties from null:
  



Using properties file: null
Main class:
org.apache.spark.deploy.yarn.Client
Arguments:
--jar
file:/opt/spark/examples/target/spark-examples_2.10-1.0.0.jar
--class
org.apache.spark.examples.SparkPi
--name
org.apache.spark.examples.SparkPi
--driver-memory
512M
--queue
research
--num-executors
2
--executor-memory
512M
--executor-cores
1
System properties:
spark.driver.extraLibraryPath -> 
/opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar
SPARK_SUBMIT -> true
spark.app.name -> org.apache.spark.examples.SparkPi
Classpath elements:








From: alee...@hotmail.com
To: user@spark.apache.org
Subject: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode
Date: Wed, 18 Jun 2014 11:05:12 -0700




Hi All,
Has anyone run into the same problem? Looking at the source code in the official 
release (rc11), this property setting is set to false by default; however, I'm 
seeing that the .sparkStaging folder remains on HDFS and fills up the disk pretty 
fast, since SparkContext deploys the fat JAR file (~115MB) for every job and it is 
not cleaned up.








yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
  val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", 
"false").toBoolean
[test@spark ~]$ hdfs dfs -ls .sparkStaging
Found 46 items
drwx--   - test users  0 2014-05-01 01:42 .sparkStaging/application_1398370455828_0050
drwx--   - test users  0 2014-05-01 02:03 .sparkStaging/application_1398370455828_0051
drwx--   - test users  0 2014-05-01 02:04 .sparkStaging/application_1398370455828_0052
drwx--   - test users  0 2014-05-01 05:44 .sparkStaging/application_1398370455828_0053
drwx--   - test users  0 2014-05-01 05:45 .sparkStaging/application_1398370455828_0055
drwx--   - test users  0 2014-05-01 05:46 .sparkStaging/application_1398370455828_0056
drwx--   - test users  0 2014-05-01 05:49 .sparkStaging/application_1398370455828_0057
drwx--   - test users  0 2014-05-01 05:52 .sparkStaging/application_1398370455828_0058
drwx--   - test users  0 2014-05-01 05:58 .sparkStaging/application_1398370455828_0059
drwx--   - test users  0 2014-05-01 07:38 .sparkStaging/application_1398370455828_0060
drwx--   - test users  0 2014-05-01 07:41 .sparkStaging/application_1398370455828_0061
….
drwx--   - test users  0 2014-06-16 14:45 .sparkStaging/application_1402001910637_0131
drwx--   - test users  0 2014-06-16 15:03 .sparkStaging/application_1402001910637_0135
drwx--   - test users  0 2014-06-16 15:16 .sparkStaging/

RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-18 Thread Andrew Lee
Forgot to mention that I am using spark-submit to submit jobs; a verbose-mode 
printout looks like this with the SparkPi example. The .sparkStaging folder won't 
be deleted. My thought is that this should be part of the staging and should be 
cleaned up as well when the SparkContext gets terminated.









[test@ spark]$ SPARK_YARN_USER_ENV="spark.yarn.preserve.staging.files=false" 
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.2.0.jar 
./bin/spark-submit --verbose --master yarn --deploy-mode cluster --class 
org.apache.spark.examples.SparkPi --driver-memory 512M --driver-library-path 
/opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar --executor-memory 512M 
--executor-cores 1 --queue research --num-executors 2 
examples/target/spark-examples_2.10-1.0.0.jar 

Using properties file: null
Using properties file: null
Parsed arguments:
  master  yarn
  deployMode  cluster
  executorMemory  512M
  executorCores   1
  totalExecutorCores  null
  propertiesFile  null
  driverMemory            512M
  driverCores null
  driverExtraClassPathnull
  driverExtraLibraryPath  /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar
  driverExtraJavaOptions  null
  supervise   false
  queue   research
  numExecutors            2
  files   null
  pyFiles null
  archivesnull
  mainClass   org.apache.spark.examples.SparkPi
  primaryResource 
file:/opt/spark/examples/target/spark-examples_2.10-1.0.0.jar
  name                    org.apache.spark.examples.SparkPi
  childArgs               []
  jars                    null
  verbose true


Default properties from null:
  



Using properties file: null
Main class:
org.apache.spark.deploy.yarn.Client
Arguments:
--jar
file:/opt/spark/examples/target/spark-examples_2.10-1.0.0.jar
--class
org.apache.spark.examples.SparkPi
--name
org.apache.spark.examples.SparkPi
--driver-memory
512M
--queue
research
--num-executors
2
--executor-memory
512M
--executor-cores
1
System properties:
spark.driver.extraLibraryPath -> 
/opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo.jar
SPARK_SUBMIT -> true
spark.app.name -> org.apache.spark.examples.SparkPi
Classpath elements:








From: alee...@hotmail.com
To: user@spark.apache.org
Subject: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode
Date: Wed, 18 Jun 2014 11:05:12 -0700




Hi All,
Has anyone run into the same problem? Looking at the source code in the official 
release (rc11), this property setting is set to false by default; however, I'm 
seeing that the .sparkStaging folder remains on HDFS and fills up the disk pretty 
fast, since SparkContext deploys the fat JAR file (~115MB) for every job and it is 
not cleaned up.








yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
  val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", 
"false").toBoolean
[test@spark ~]$ hdfs dfs -ls .sparkStaging
Found 46 items
drwx--   - test users  0 2014-05-01 01:42 .sparkStaging/application_1398370455828_0050
drwx--   - test users  0 2014-05-01 02:03 .sparkStaging/application_1398370455828_0051
drwx--   - test users  0 2014-05-01 02:04 .sparkStaging/application_1398370455828_0052
drwx--   - test users  0 2014-05-01 05:44 .sparkStaging/application_1398370455828_0053
drwx--   - test users  0 2014-05-01 05:45 .sparkStaging/application_1398370455828_0055
drwx--   - test users  0 2014-05-01 05:46 .sparkStaging/application_1398370455828_0056
drwx--   - test users  0 2014-05-01 05:49 .sparkStaging/application_1398370455828_0057
drwx--   - test users  0 2014-05-01 05:52 .sparkStaging/application_1398370455828_0058
drwx--   - test users  0 2014-05-01 05:58 .sparkStaging/application_1398370455828_0059
drwx--   - test users  0 2014-05-01 07:38 .sparkStaging/application_1398370455828_0060
drwx--   - test users  0 2014-05-01 07:41 .sparkStaging/application_1398370455828_0061
….
drwx--   - test users  0 2014-06-16 14:45 .sparkStaging/application_1402001910637_0131
drwx--   - test users  0 2014-06-16 15:03 .sparkStaging/application_1402001910637_0135
drwx--   - test users  0 2014-06-16 15:16 .sparkStaging/application_1402001910637_0136
drwx--   - test users  0 2014-06-16 15:46 .sparkStaging/application_1402001910637_0138
drwx--   - test users  0 2014-06-16 23:57 .sparkStaging/application_1402001910637_0157
drwx--   - test users  0 2014-06-17 05:55 .sparkStaging/application_1402001910637_0161
Is this something that needs to be explicitly set via:
SPARK_YARN_USER_ENV="spark.yarn.preserve.staging.files=false"
per http://spark.apache.org/docs/latest/running-on-yarn.html
spark.

HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-18 Thread Andrew Lee
Hi All,
Has anyone run into the same problem? Looking at the source code in the official 
release (rc11), this property setting is set to false by default; however, I'm 
seeing that the .sparkStaging folder remains on HDFS and fills up the disk pretty 
fast, since SparkContext deploys the fat JAR file (~115MB) for every job and it is 
not cleaned up.








yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
  val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", 
"false").toBoolean
[test@spark ~]$ hdfs dfs -ls .sparkStaging
Found 46 items
drwx--   - test users  0 2014-05-01 01:42 .sparkStaging/application_1398370455828_0050
drwx--   - test users  0 2014-05-01 02:03 .sparkStaging/application_1398370455828_0051
drwx--   - test users  0 2014-05-01 02:04 .sparkStaging/application_1398370455828_0052
drwx--   - test users  0 2014-05-01 05:44 .sparkStaging/application_1398370455828_0053
drwx--   - test users  0 2014-05-01 05:45 .sparkStaging/application_1398370455828_0055
drwx--   - test users  0 2014-05-01 05:46 .sparkStaging/application_1398370455828_0056
drwx--   - test users  0 2014-05-01 05:49 .sparkStaging/application_1398370455828_0057
drwx--   - test users  0 2014-05-01 05:52 .sparkStaging/application_1398370455828_0058
drwx--   - test users  0 2014-05-01 05:58 .sparkStaging/application_1398370455828_0059
drwx--   - test users  0 2014-05-01 07:38 .sparkStaging/application_1398370455828_0060
drwx--   - test users  0 2014-05-01 07:41 .sparkStaging/application_1398370455828_0061
….
drwx--   - test users  0 2014-06-16 14:45 .sparkStaging/application_1402001910637_0131
drwx--   - test users  0 2014-06-16 15:03 .sparkStaging/application_1402001910637_0135
drwx--   - test users  0 2014-06-16 15:16 .sparkStaging/application_1402001910637_0136
drwx--   - test users  0 2014-06-16 15:46 .sparkStaging/application_1402001910637_0138
drwx--   - test users  0 2014-06-16 23:57 .sparkStaging/application_1402001910637_0157
drwx--   - test users  0 2014-06-17 05:55 .sparkStaging/application_1402001910637_0161
Is this something that needs to be explicitly set via:
SPARK_YARN_USER_ENV="spark.yarn.preserve.staging.files=false"
Per http://spark.apache.org/docs/latest/running-on-yarn.html:
spark.yarn.preserve.staging.files    false    Set to true to preserve the staged files (Spark jar, app jar, distributed cache files) at the end of the job rather than delete them.
Or is this a bug where the default value is not honored and is overridden to true somewhere?
Thanks.
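Until the cleanup behaves as documented, a hedged sketch of a manual sweep over the 
leftover staging directories (it assumes the Hadoop configuration on the classpath 
points at the cluster, only lists the directories, and leaves the delete commented 
out; a real cleanup should first confirm the application is no longer running):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
val staging = new Path(fs.getHomeDirectory, ".sparkStaging")

fs.listStatus(staging).foreach { status =>
  println(s"leftover staging dir: ${status.getPath}")
  // fs.delete(status.getPath, true)  // recursive delete, uncomment once verified
}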


  

RE: Is spark 1.0.0 "spark-shell --master=yarn" running in yarn-cluster mode or yarn-client mode?

2014-05-21 Thread Andrew Lee
Ah, forgot the --verbose option. Thanks Andrew. That is very helpful. 

Date: Wed, 21 May 2014 11:07:55 -0700
Subject: Re: Is spark 1.0.0 "spark-shell --master=yarn" running in yarn-cluster 
mode or yarn-client mode?
From: and...@databricks.com
To: user@spark.apache.org

The answer is actually yarn-client. A quick way to find out:
$ bin/spark-shell --master yarn --verbose
From the system properties you can see spark.master is set to "yarn-client."
From the code, this is because args.deployMode is null, and so it's not equal to 
"cluster" and so it falls into the second "if" case you mentioned:

if (args.deployMode != "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-client"
}
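A hedged follow-up check from inside the running shell: the resolved value can also 
be read back from the SparkContext's configuration.

// sc is the SparkContext pre-created by spark-shell.
println(sc.getConf.get("spark.master"))  // expected to print yarn-client, per the explanation above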

2014-05-21 10:57 GMT-07:00 Andrew Lee :




Does anyone know if:
./bin/spark-shell --master yarn 
is running yarn-cluster or yarn-client by default?

Base on source code:







./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala


if (args.deployMode == "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-cluster"
}
if (args.deployMode != "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-client"
}

It looks like the answer is yarn-cluster mode.
I want to confirm this with the community, thanks.  
  


  

Is spark 1.0.0 "spark-shell --master=yarn" running in yarn-cluster mode or yarn-client mode?

2014-05-21 Thread Andrew Lee
Does anyone know if:
./bin/spark-shell --master yarn 
is running yarn-cluster or yarn-client by default?
Base on source code:







./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
if (args.deployMode == "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-cluster"
}
if (args.deployMode != "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-client"
}
It looks like the answer is yarn-cluster mode.
I want to confirm this with the community, thanks.  
  

RE: run spark0.9.1 on yarn with hadoop CDH4

2014-05-06 Thread Andrew Lee
Please check JAVA_HOME. Usually it should point to /usr/java/default on 
CentOS/Linux.
or FYI: http://stackoverflow.com/questions/1117398/java-home-directory


> Date: Tue, 6 May 2014 00:23:02 -0700
> From: sln-1...@163.com
> To: u...@spark.incubator.apache.org
> Subject: run spark0.9.1 on yarn with hadoop CDH4
> 
> Hi all,
>  I have make HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which
> contains the (client side) configuration files for the hadoop cluster. 
> The command to launch the YARN Client which I run is like this:
> 
> #
> SPARK_JAR=./~/spark-0.9.1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
> ./bin/spark-class org.apache.spark.deploy.yarn.Client\--jar
> examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar\--class
> org.apache.spark.examples.SparkPi\--args yarn-standalone \--num-workers 3
> \--master-memory 2g \--worker-memory 2g \--worker-cores 1
> ./bin/spark-class: line 152: /usr/lib/jvm/java-7-sun/bin/java: No such file
> or directory
> ./bin/spark-class: line 152: exec: /usr/lib/jvm/java-7-sun/bin/java: cannot
> execute: No such file or directory
> How to make it runs well?
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/run-spark0-9-1-on-yarn-with-hadoop-CDH4-tp5426.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
  

RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-06 Thread Andrew Lee
Hi Jacob,
I agree, we need to address both driver and workers bidirectionally.
If the subnet is isolated and self-contained, and only limited ports are configured 
to access the driver via a dedicated gateway for the user, could you explain your 
concern, or what doesn't satisfy the security criteria?
Are you referring to any security certification or regulatory requirement that a 
separate subnet with a configurable policy couldn't satisfy?
What I mean by a subnet is one that includes both the driver and the Workers. See 
the following example setup (254 max nodes, for example):
Hadoop / HDFS => 10.5.5.0/24 (GW 10.5.5.1) on eth0
Spark Driver and Workers bind to => 10.10.10.0/24 on eth1, with routing to 
10.5.5.0/24 on specific ports for the NameNode and DataNode.
So basically the driver and Workers are bound to the same subnet, which is 
separated from the others. iptables rules for 10.10.10.0/24 can allow SSH (port 22) 
login (or port forwarding) onto the Spark Driver machine to launch a shell or 
submit Spark jobs.


Subject: RE: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication
To: user@spark.apache.org
From: jeis...@us.ibm.com
Date: Mon, 5 May 2014 12:40:53 -0500


Howdy Andrew,



I agree; the subnet idea is a good one...  unfortunately, it doesn't really 
help to secure the network.



You mentioned that the drivers need to talk to the workers.  I think it is 
slightly broader - all of the workers and the driver/shell need to be 
addressable from/to each other on any dynamic port.



I would check out setting the environment variable SPARK_LOCAL_IP [1].  This 
seems to enable Spark to bind correctly to a private subnet.



Jacob



[1]  http://spark.apache.org/docs/latest/configuration.html 



Jacob D. Eisinger

IBM Emerging Technologies

jeis...@us.ibm.com - (512) 286-6075



Andrew Lee ---05/04/2014 09:57:08 PM---Hi Jacob, Taking both concerns into 
account, I'm actually thinking about using a separate subnet to



From:   Andrew Lee 

To: "user@spark.apache.org" 

Date:   05/04/2014 09:57 PM

Subject:RE: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication








Hi Jacob,



Taking both concerns into account, I'm actually thinking about using a separate 
subnet to isolate the Spark Workers, but I need to look into how to bind the 
process onto the correct interface first. This may require some code change.

A separate subnet isn't limited to a port range, so port exhaustion should rarely 
happen, and it won't impact performance.

Opening up all ports between 32768-61000 is effectively the same as having no 
firewall; this exposes some security concerns, but we need more information on 
whether that is critical or not.

The bottom line is that the driver needs to talk to the Workers. How the user 
accesses the Driver should be easier to solve, for example by launching the Spark 
(shell) driver on a specific interface.

Likewise, if you find any interesting solutions, please let me know. I'll share my 
solution once I have something up and running. Currently, it is running OK with 
iptables off, but I still need to figure out how to productionize the security part.



Subject: RE: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication

To: user@spark.apache.org

From: jeis...@us.ibm.com

Date: Fri, 2 May 2014 16:07:50 -0500



Howdy Andrew,



I think I am running into the same issue [1] as you.  It appears that Spark 
opens up dynamic / ephemera [2] ports for each job on the shell and the 
workers.  As you are finding out, this makes securing and managing the network 
for Spark very difficult.



> Any idea how to restrict the 'Workers' port range?

The port range can be found by running: 
$ sysctl net.ipv4.ip_local_port_range

net.ipv4.ip_local_port_range = 32768 61000


With that being said, a couple avenues you may try: 

Limit the dynamic ports [3] to a more reasonable number and open all of these 
ports on your firewall; obviously, this might have unintended consequences like 
port exhaustion. 
Secure the network another way like through a private VPN; this may reduce 
Spark's performance.


If you have other workarounds, I am all ears --- please let me know!

Jacob



[1] 
http://apache-spark-user-list.1001560.n3.nabble.com/Securing-Spark-s-Network-tp4832p4984.html

[2] http://en.wikipedia.org/wiki/Ephemeral_port

[3] 
http://www.cyberciti.biz/tips/linux-increase-outgoing-network-sockets-range.html



Jacob D. Eisinger

IBM Emerging Technologies

jeis...@us.ibm.com - (512) 286-6075



Andrew Lee ---05/02/2014 03:15:42 PM---Hi Yana,  I did. I configured the the 
port in spark-env.sh, the problem is not the driver port which



From: Andrew Lee 

To: "user@spark.apache.org" 

Date: 05/02/2014 03:15 PM

Subject: RE: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication







Hi Yana, 



I did. I c

RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-04 Thread Andrew Lee
Hi Jacob,
Taking both concerns into account, I'm actually thinking about using a separate 
subnet to isolate the Spark Workers, but I need to look into how to bind the 
process onto the correct interface first. This may require some code change. A 
separate subnet isn't limited to a port range, so port exhaustion should rarely 
happen, and it won't impact performance.
Opening up all ports between 32768-61000 is effectively the same as having no 
firewall; this exposes some security concerns, but we need more information on 
whether that is critical or not.
The bottom line is that the driver needs to talk to the Workers. How the user 
accesses the Driver should be easier to solve, for example by launching the Spark 
(shell) driver on a specific interface.
Likewise, if you find any interesting solutions, please let me know. I'll share my 
solution once I have something up and running. Currently, it is running OK with 
iptables off, but I still need to figure out how to productionize the security part.
Subject: RE: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication
To: user@spark.apache.org
From: jeis...@us.ibm.com
Date: Fri, 2 May 2014 16:07:50 -0500


Howdy Andrew,



I think I am running into the same issue [1] as you.  It appears that Spark 
opens up dynamic / ephemera [2] ports for each job on the shell and the 
workers.  As you are finding out, this makes securing and managing the network 
for Spark very difficult.



> Any idea how to restrict the 'Workers' port range?

The port range can be found by running:

$ sysctl net.ipv4.ip_local_port_range

net.ipv4.ip_local_port_range = 32768 61000


With that being said, a couple avenues you may try:

Limit the dynamic ports [3] to a more reasonable number and open all of these 
ports on your firewall; obviously, this might have unintended consequences like 
port exhaustion.
Secure the network another way like through a private VPN; this may reduce 
Spark's performance.


If you have other workarounds, I am all ears --- please let me know!

Jacob



[1] 
http://apache-spark-user-list.1001560.n3.nabble.com/Securing-Spark-s-Network-tp4832p4984.html

[2] http://en.wikipedia.org/wiki/Ephemeral_port

[3] 
http://www.cyberciti.biz/tips/linux-increase-outgoing-network-sockets-range.html



Jacob D. Eisinger

IBM Emerging Technologies

jeis...@us.ibm.com - (512) 286-6075



Andrew Lee ---05/02/2014 03:15:42 PM---Hi Yana,  I did. I configured the the 
port in spark-env.sh, the problem is not the driver port which



From:   Andrew Lee 

To: "user@spark.apache.org" 

Date:   05/02/2014 03:15 PM

Subject:RE: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication







Hi Yana, 



I did. I configured the port in spark-env.sh; the problem is not the driver port, 
which is fixed.

It's the Workers' ports that are dynamic every time they are launched in the YARN 
container. :-(



Any idea how to restrict the 'Workers' port range?



Date: Fri, 2 May 2014 14:49:23 -0400

Subject: Re: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication

From: yana.kadiy...@gmail.com

To: user@spark.apache.org



I think what you want to do is set spark.driver.port to a fixed port.





On Fri, May 2, 2014 at 1:52 PM, Andrew Lee  wrote:
Hi All,



I encountered this problem when the firewall is enabled between the spark-shell 
and the Workers.



When I launch spark-shell in yarn-client mode, I notice that Workers on the 
YARN containers are trying to talk to the driver (spark-shell), however, the 
firewall is not opened and caused timeout.



For the Workers, it tries to open listening ports on 54xxx for each Worker? Is 
the port random in such case?

What will be the better way to predict the ports so I can configure the 
firewall correctly between the driver (spark-shell) and the Workers? Is there a 
range of ports we can specify in the firewall/iptables?



Any ideas?

  

RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-02 Thread Andrew Lee
Hi Yana, 
I did. I configured the port in spark-env.sh; the problem is not the driver port, 
which is fixed. It's the Workers' ports that are dynamic every time they are 
launched in the YARN container. :-(
Any idea how to restrict the 'Workers' port range?

Date: Fri, 2 May 2014 14:49:23 -0400
Subject: Re: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication
From: yana.kadiy...@gmail.com
To: user@spark.apache.org

I think what you want to do is set spark.driver.port to a fixed port.


On Fri, May 2, 2014 at 1:52 PM, Andrew Lee  wrote:




Hi All,
I encountered this problem when the firewall is enabled between the spark-shell 
and the Workers.
When I launch spark-shell in yarn-client mode, I notice that Workers on the 
YARN containers are trying to talk to the driver (spark-shell), however, the 
firewall is not opened and caused timeout.

For the Workers, it tries to open listening ports on 54xxx for each Worker? Is 
the port random in such case?What will be the better way to predict the ports 
so I can configure the firewall correctly between the driver (spark-shell) and 
the Workers? Is there a range of ports we can specify in the firewall/iptables?

Any ideas?

  

spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-02 Thread Andrew Lee
Hi All,
I encountered this problem when the firewall is enabled between the spark-shell 
and the Workers.
When I launch spark-shell in yarn-client mode, I notice that Workers on the 
YARN containers are trying to talk to the driver (spark-shell), however, the 
firewall is not opened and caused timeout.
For the Workers, it tries to open listening ports on 54xxx for each Worker? Is 
the port random in such case?What will be the better way to predict the ports 
so I can configure the firewall correctly between the driver (spark-shell) and 
the Workers? Is there a range of ports we can specify in the firewall/iptables?
Any ideas?
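As noted in the replies archived above, only the driver side can be pinned through 
configuration in this version; a hedged sketch with an arbitrary example port (the 
Worker/executor ports remain ephemeral, which is the open problem in this thread):

import org.apache.spark.SparkConf

// Pin the driver's listening port so the firewall rule for driver <-> worker
// traffic can at least be written against one known port.
val conf = new SparkConf().set("spark.driver.port", "51000")  // arbitrary example port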

RE: Using an external jar in the driver, in yarn-standalone mode.

2014-03-25 Thread Andrew Lee
Hi Julien,
The ADD_JAR variable doesn't work on the command line. I checked spark-class, 
and I couldn't find any Bash code that brings the ADD_JAR variable into the 
CLASSPATH.
Were you able to print out the properties and environment variables from the 
Web GUI?
localhost:4040
This should give you an idea of what is included in the current Spark shell. 
bin/spark-shell invokes bin/spark-class, and I don't see ADD_JAR in 
bin/spark-class either.
Hi Sandy,
Does Spark automatically deploy the JAR for you to the DFS cache if Spark is 
running in cluster mode? I haven't gotten that far yet to deploy my own 
one-time JAR for testing; I've just set up a local cluster for practice.

Date: Tue, 25 Mar 2014 23:13:58 +0100
Subject: Re: Using an external jar in the driver, in yarn-standalone mode.
From: julien.ca...@gmail.com
To: user@spark.apache.org

Thanks for your answer.
I am using bin/spark-class  org.apache.spark.deploy.yarn.Client --jar myjar.jar 
--class myclass ...

myclass in myjar.jar contains a main that initializes a SparkContext in 
yarn-standalone mode.

Then I am using some code that uses myotherjar.jar, but I do not execute it 
through the SparkContext or an RDD, so my understanding is that it is not 
executed on the YARN slaves, only on the YARN master.

I found no way to make my code able to find myotherjar.jar. The CLASSPATH is 
set by Spark (or YARN?) before execution on the YARN master; it is not set 
by me. It seems the idea is to set SPARK_CLASSPATH and/or ADD_JAR so that 
these jars become automatically available on the YARN master, but it did not 
work for me.

I also tried sc.addJar; it did not work either, but in any case it seems 
clear that this is used for dependencies in the code executed on the slaves, 
not on the master. Tell me if I am wrong.








2014-03-25 21:11 GMT+01:00 Nathan Kronenfeld :

By 'use ... my main program' I presume you mean you have a main function in a 
class file you want to use as your entry point.

SPARK_CLASSPATH, ADD_JAR, etc. add your jars on the master and the workers... 
but they don't on the client.
For that, you're just using ordinary, everyday Java/Scala - so the jar just has 
to be on the normal Java classpath.
Could that be your issue?
  -Nathan
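
A minimal sketch of the workaround that usually resolves this kind of struggle: 
bundle the extra classes into the application jar itself, so the driver finds 
them wherever it runs (client or YARN ApplicationMaster). The commands assume 
an sbt project with the sbt-assembly plugin, and the assembly jar name is 
hypothetical -- use whatever your build actually produces:

# Build one self-contained application jar (assumes the sbt-assembly plugin).
sbt clean assembly

# Submit it the same way as earlier in this thread; myapp-assembly.jar is a
# hypothetical output name.
bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar target/scala-2.10/myapp-assembly.jar \
  --class myclass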




On Tue, Mar 25, 2014 at 2:18 PM, Sandy Ryza  wrote:


Hi Julien,
Have you called SparkContext#addJars?
-Sandy



On Tue, Mar 25, 2014 at 10:05 AM, Julien Carme  wrote:



Hello,



I have been struggling for ages to use an external jar in my Spark driver 
program in yarn-standalone mode. I just want to use, in my main program and 
outside the calls to Spark functions, objects that are defined in another jar.




I tried to set SPARK_CLASSPATH and ADD_JAR, and I tried to use --addJar in the 
spark-class arguments; I always end up with a "Class not found" exception when 
I want to use classes defined in my jar.




Any ideas?




Thanks a lot,




-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,

Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238

Email:  nkronenf...@oculusinfo.com



  

RE: Spark 0.9.1 - How to run bin/spark-class with my own hadoop jar files?

2014-03-25 Thread Andrew Lee
Hi Paul,
I got it sorted out.
The problem is that the JARs are built into the assembly JARs when you run
sbt/sbt clean assembly
What I did is:
sbt/sbt clean package
This will only give you the small JARs. The next step is to update the 
CLASSPATH in the bin/compute-classpath.sh script manually, appending all the 
JARs.
With:
sbt/sbt assembly
we can't introduce our own Hadoop patch, since it will always pull from the 
Maven repo, unless we hijack the repository path or do a 'mvn install' locally. 
This is more of a hack, I think.
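
For completeness, the 'mvn install locally' route looks roughly like the sketch 
below. The artifact coordinates, version suffix, and file path are assumptions 
for a locally patched Hadoop, and it only helps if Spark's build resolves 
against the local Maven repository; SPARK_HADOOP_VERSION was the knob the 
0.9-era sbt build used to pick the Hadoop dependency.

# Publish the patched Hadoop jar locally (coordinates/paths are assumptions).
mvn install:install-file \
  -Dfile=/opt/hadoop23/share/hadoop/common/hadoop-common-2.2.0-patched.jar \
  -DgroupId=org.apache.hadoop \
  -DartifactId=hadoop-common \
  -Dversion=2.2.0-patched \
  -Dpackaging=jar

# Rebuild the assembly against that version.
SPARK_HADOOP_VERSION=2.2.0-patched SPARK_YARN=true sbt/sbt clean assembly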


Date: Tue, 25 Mar 2014 15:23:08 -0700
Subject: Re: Spark 0.9.1 - How to run bin/spark-class with my own hadoop jar 
files?
From: paulmscho...@gmail.com
To: user@spark.apache.org

Andrew, 
I ran into the same problem and eventually settled on just running the jars 
directly with java. Since we use sbt to build our jars, we had all the 
dependencies built into the jar itself, so there was no need for random 
classpaths.


On Tue, Mar 25, 2014 at 1:47 PM, Andrew Lee  wrote:




Hi All,
I'm getting the following error when I execute start-master.sh which also 
invokes spark-class at the end.








Failed to find Spark assembly in /root/spark/assembly/target/scala-2.10/

You need to build Spark with 'sbt/sbt assembly' before running this program.


After digging into the code, I see the CLASSPATH is hardcoded with 
"spark-assembly.*hadoop.*.jar".

In bin/spark-class :


if [ ! -f "$FWDIR/RELEASE" ]; then
  # Exit if the user hasn't compiled Spark
  num_jars=$(ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/ | grep "spark-assembly.*hadoop.*.jar" | wc -l)
  jars_list=$(ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/ | grep "spark-assembly.*hadoop.*.jar")
  if [ "$num_jars" -eq "0" ]; then
    echo "Failed to find Spark assembly in $FWDIR/assembly/target/scala-$SCALA_VERSION/" >&2
    echo "You need to build Spark with 'sbt/sbt assembly' before running this program." >&2
    exit 1
  fi
  if [ "$num_jars" -gt "1" ]; then
    echo "Found multiple Spark assembly jars in $FWDIR/assembly/target/scala-$SCALA_VERSION:" >&2
    echo "$jars_list"
    echo "Please remove all but one jar."
    exit 1
  fi
fi


Is there any reason why this only grabs spark-assembly.*hadoop.*.jar? I am 
trying to run Spark linked to my own version of Hadoop under /opt/hadoop23/, 
and I use 'sbt/sbt clean package' to build the package without the Hadoop jar. 
What is the correct way to link to my own Hadoop jar?





  


  

Spark 0.9.1 - How to run bin/spark-class with my own hadoop jar files?

2014-03-25 Thread Andrew Lee
Hi All,
I'm getting the following error when I execute start-master.sh which also 
invokes spark-class at the end.








Failed to find Spark assembly in /root/spark/assembly/target/scala-2.10/
You need to build Spark with 'sbt/sbt assembly' before running this program.
After digging into the code, I see the CLASSPATH is hardcoded with 
"spark-assembly.*hadoop.*.jar". In bin/spark-class:

if [ ! -f "$FWDIR/RELEASE" ]; then
  # Exit if the user hasn't compiled Spark
  num_jars=$(ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/ | grep "spark-assembly.*hadoop.*.jar" | wc -l)
  jars_list=$(ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/ | grep "spark-assembly.*hadoop.*.jar")
  if [ "$num_jars" -eq "0" ]; then
    echo "Failed to find Spark assembly in $FWDIR/assembly/target/scala-$SCALA_VERSION/" >&2
    echo "You need to build Spark with 'sbt/sbt assembly' before running this program." >&2
    exit 1
  fi
  if [ "$num_jars" -gt "1" ]; then
    echo "Found multiple Spark assembly jars in $FWDIR/assembly/target/scala-$SCALA_VERSION:" >&2
    echo "$jars_list"
    echo "Please remove all but one jar."
    exit 1
  fi
fi
Is there any reason why this only grabs spark-assembly.*hadoop.*.jar? I am 
trying to run Spark linked to my own version of Hadoop under /opt/hadoop23/, 
and I use 'sbt/sbt clean package' to build the package without the Hadoop jar. 
What is the correct way to link to my own Hadoop jar?
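
For reference, one way to get a locally installed Hadoop onto Spark's runtime 
classpath without rebuilding the assembly is to export it from spark-env.sh, 
assuming compute-classpath.sh on your version honors SPARK_CLASSPATH (worth 
double-checking). The /opt/hadoop23 prefix comes from the question; the exact 
jar directories below are assumptions:

# spark-env.sh -- sketch; adjust the jar directories to your Hadoop install
export SPARK_CLASSPATH="/opt/hadoop23/share/hadoop/common/*:/opt/hadoop23/share/hadoop/hdfs/*:/opt/hadoop23/share/hadoop/yarn/*:$SPARK_CLASSPATH"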


  

Spark 0.9.0-incubation + Apache Hadoop 2.2.0 + YARN encounter Compression codec com.hadoop.compression.lzo.LzoCodec not found

2014-03-17 Thread Andrew Lee
Hi All,

I have been contemplating this problem and couldn't figure out what is
missing in the configuration. I traced the script and tried to look at the
CLASSPATH to see what is included; however, I couldn't find any place that
honors/inherits HADOOP_CLASSPATH (or pulls in any map-reduce
JARs). The only thing I saw was the HADOOP_CONF_DIR and
YARN_CONF_DIR folders and other JARs being brought in, but not the mapred
compression JARs.

spark-shell => spark-class => compute-classpath.sh => spark-env.sh(empty)

This is the command I execute to run the spark-shell:

CLASSPATH=/opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-0.4.18-201403101806.jar:${CLASSPATH}
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar
SPARK_YARN_APP_JAR=examples/target/scala-2.10/spark-examples-assembly-0.9.0-incubating.jar
MASTER=yarn-client ./bin/spark-shell

and tried to run the LinearRegression examples from (
http://docs.sigmoidanalytics.com/index.php/MLlib); however, every time
I try to generate the model, it runs into an 'LzoCodec not found' exception.
Does anyone have any clue why this is happening? I see reflection applied
from the API stack, but I'm wondering why it isn't bringing in the correct
lzo libs. My assumptions are:


1. It looks like when Spark launches the YARN application, it overrides the
original Hadoop classpath on the NodeManager machine? I'm not sure what
happens here behind the scenes.
2. Running other MR apps or programs works fine, including the SparkPi and
KMeans examples.
3. I'm wondering what commands/tools/places I can look into to figure out
why the lib is missing from the remote classpath.
4. In the YARN log, I only see 2 JARs deployed to HDFS in the
.sparkStaging folder.

hdfs@alexie-dt ~/spark-0.9.0-incubating $ hdfs dfs -ls
/user/hdfs/.sparkStaging/application_1395091699241_0019/

Found 2 items

-rw-r--r--   3 hdfs hdfs   99744537 2014-03-17 23:52
/user/hdfs/.sparkStaging/application_1395091699241_0019/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar

-rw-r--r--   3 hdfs hdfs  135645837 2014-03-17 23:52
/user/hdfs/.sparkStaging/application_1395091699241_0019/spark-examples-assembly-0.9.0-incubating.jar

Any insights and feedback are welcome and appreciated.  Guess I probably
overlooked something in the doc.
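
If it helps others chasing the same LzoCodec error: the usual suspects at the 
time were the hadoop-lzo jar missing from the classpath Spark builds for the 
YARN containers, and the native liblzo2/libgplcompression libraries not being 
visible to the JVM. A sketch of the spark-env.sh settings people tried; the 
hadoop-lzo jar path is taken from the command above, the native-library path 
is an assumption, and whether these settings propagate to the YARN executors 
depends on the Spark version, so verify against yours.

# spark-env.sh -- sketch only
export SPARK_CLASSPATH="/opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-0.4.18-201403101806.jar:$SPARK_CLASSPATH"
# Native library location is an assumption; adjust to where libgplcompression
# and liblzo2 actually live.
export SPARK_LIBRARY_PATH="/opt/hadoop/lib/native:$SPARK_LIBRARY_PATH"
# or, for the launching JVM directly:
# export LD_LIBRARY_PATH="/opt/hadoop/lib/native:$LD_LIBRARY_PATH"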


I copied/pasted the console here for reference.

==

14/03/17 23:30:30 INFO spark.HttpServer: Starting HTTP Server

14/03/17 23:30:30 INFO server.Server: jetty-7.x.y-SNAPSHOT

14/03/17 23:30:30 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:43720

Welcome to

    __

 / __/__  ___ _/ /__

_\ \/ _ \/ _ `/ __/  '_/

   /___/ .__/\_,_/_/ /_/\_\   version 0.9.0

  /_/


Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java
1.7.0_45)

Type in expressions to have them evaluated.

Type :help for more information.

14/03/17 23:30:34 INFO slf4j.Slf4jLogger: Slf4jLogger started

14/03/17 23:30:34 INFO Remoting: Starting remoting

14/03/17 23:30:34 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sp...@alexie-dt.local.altiscale.com:47469]

14/03/17 23:30:34 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sp...@alexie-dt.local.altiscale.com:47469]

14/03/17 23:30:34 INFO spark.SparkEnv: Registering BlockManagerMaster

14/03/17 23:30:34 INFO storage.DiskBlockManager: Created local directory at
/tmp/spark-local-20140317233034-81b8

14/03/17 23:30:34 INFO storage.MemoryStore: MemoryStore started with
capacity 294.9 MB.

14/03/17 23:30:34 INFO network.ConnectionManager: Bound socket to port
43255 with id = ConnectionManagerId(alexie-dt.local.altiscale.com,43255)

14/03/17 23:30:34 INFO storage.BlockManagerMaster: Trying to register
BlockManager

14/03/17 23:30:34 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Registering block manager alexie-dt.local.altiscale.com:43255 with 294.9 MB
RAM

14/03/17 23:30:34 INFO storage.BlockManagerMaster: Registered BlockManager

14/03/17 23:30:34 INFO spark.HttpServer: Starting HTTP Server

14/03/17 23:30:34 INFO server.Server: jetty-7.x.y-SNAPSHOT

14/03/17 23:30:34 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:33611

14/03/17 23:30:34 INFO broadcast.HttpBroadcast: Broadcast server started at
http://10.10.10.4:33611

14/03/17 23:30:34 INFO spark.SparkEnv: Registering MapOutputTracker

14/03/17 23:30:34 INFO spark.HttpFileServer: HTTP File server directory is
/tmp/spark-9abcbe38-ef79-418d-94af-20979b1083fc

14/03/17 23:30:34 INFO spark.HttpServer: Starting HTTP Server

14/03/17 23:30:34 INFO server.Server: jetty-7.x.y-SNAPSHOT

14/03/17 23:30:34 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:53963

14/03/17 23:30:35 INFO server.Server: jetty-7.x.y-SNAPSHOT

14/03/17 23:30:35 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/storage/rdd,null}

14/03/17 23:30:35 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/storage,nu