A TPCH benchmark for Spark

2015-08-26 Thread Feng Tian
Hi, We released a package called LLQL, which is a serialization of operators of relational algebra. Spark SQL Plan is the first one supported. Probably more interesting to the Spark community is our test that implements TPCH. We manually rewrote some SQL -- mainly pulling subqueries out and con

Differing performance in self joins

2015-08-26 Thread David Smith
I've noticed that two queries, which return identical results, have very different performance. I'd be interested in any hints about how to avoid problems like this. The DataFrame df contains a string field "series" and an integer "eday", the number of days since (or before) the 1970-01-01 epoch. I'

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-26 Thread Calvin Jia
+1, tested that 1.5.0-RC2 works with Tachyon 0.7.1 as external block store.

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Furkan KAMACI
Hi Ted, You can check the full stack trace log in the attachment on Jira: https://issues.apache.org/jira/browse/GORA-386 Kind Regards, Furkan KAMACI On Wed, Aug 26, 2015 at 6:55 PM, Ted Yu wrote: > My understanding is that people on this mailing list who are interested to > help can log comment

Re: Building with sbt "impossible to get artifacts when data has not been loaded"

2015-08-26 Thread Josh Rosen
I ran into a similar problem while working on the spark-redshift library and was able to fix it by bumping that library's ScalaTest version. I'm still fighting some mysterious Scala issues while trying to test the spark-csv library against 1.5.0-RC1, so it's possible that a build or dependency chan

Re: Building with sbt "impossible to get artifacts when data has not been loaded"

2015-08-26 Thread Marcelo Vanzin
I ran into the same error (different dependency) earlier today. In my case, the maven pom files and the sbt dependencies had a conflict (different versions of the same artifact) and ivy got confused. Not sure whether that will help in your case or not... On Wed, Aug 26, 2015 at 2:23 PM, Holden Kar

Building with sbt "impossible to get artifacts when data has not been loaded"

2015-08-26 Thread Holden Karau
Has anyone else run into "impossible to get artifacts when data has not been loaded. IvyNode = org.scala-lang#scala-library;2.10.3" during hive/update when building with sbt? Working around it is pretty simple (just add it as a dependency), but I'm wondering if it's impacting anyone else and I shoul

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-26 Thread Reynold Xin
One small update -- the vote should close Saturday Aug 29. Not Friday Aug 29. On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes > if a majorit

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-26 Thread Reynold Xin
The Scala 2.11 issue should be fixed, but doesn't need to be a blocker, since Maven builds fine. The sbt build is more aggressive to make sure we catch warnings. On Wed, Aug 26, 2015 at 10:01 AM, Sean Owen wrote: > My quick take: no blockers at this point, except for one potential > issue. Sti

Re: SQLContext.read.json("path") throws java.io.IOException

2015-08-26 Thread gsvic
No, I created the file by appending each JSON record in a loop without changing line. I've just changed that and now it works fine. Thank you very much for your support. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SQLContext-read-json-path-throws-j
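The fix described in this thread (one JSON record per line instead of records appended back to back) can be sketched in Python. This is only an illustration under stated assumptions: the helper names `split_concatenated_json` and `to_json_lines` are hypothetical, the sample records are made up, and the whole input is held in memory, which would not be practical for a 2.5G file. For a file that size, writing a newline after each record when generating it, as the poster did, is the simpler route.

```python
import json


def split_concatenated_json(text):
    """Yield each top-level JSON value found back to back in `text`."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        record, pos = decoder.raw_decode(text, pos)
        yield record


def to_json_lines(records):
    """Serialize records as one JSON value per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


# Hypothetical sample: two records appended without a line break.
concatenated = '{"id": 1, "name": "a"}{"id": 2, "name": "b"}'
print(to_json_lines(split_concatenated_json(concatenated)), end="")
```

Each output line is then a complete JSON object, which is the layout the Spark SQL JSON reader is described as expecting elsewhere in this thread.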

Re: SQLContext.read.json("path") throws java.io.IOException

2015-08-26 Thread Reynold Xin
Any reason why you have more than 2G in a single line? There is a limit of 2G in the Hadoop library we use. Also the JVM doesn't work when your string is that long. On Wed, Aug 26, 2015 at 11:38 AM, gsvic wrote: > Yes, it contains one line > > On Wed, Aug 26, 2015 at 8:20 PM, Yin Huai-2 [via A

Re: SQLContext.read.json("path") throws java.io.IOException

2015-08-26 Thread gsvic
Yes, it contains one line On Wed, Aug 26, 2015 at 8:20 PM, Yin Huai-2 [via Apache Spark Developers List] wrote: > The JSON support in Spark SQL handles a file with one JSON object per line > or one JSON array of objects per line. What is the format of your file? Does > it only contain a single line?

Re: Maven issues with 1.5-RC

2015-08-26 Thread shane knapp
we build on jenkins w/3.1.1, but also have 3.0.4. On Wed, Aug 26, 2015 at 8:18 AM, Sean Owen wrote: > It sounds like you're doing the right things. I believe the Jenkins > test machines also have 3.0.4, but successfully build by using > build/mvn --force. Not sure what to make of that. > > On Wed

Re: SQLContext.read.json("path") throws java.io.IOException

2015-08-26 Thread Yin Huai
The JSON support in Spark SQL handles a file with one JSON object per line or one JSON array of objects per line. What is the format of your file? Does it only contain a single line? On Wed, Aug 26, 2015 at 6:47 AM, gsvic wrote: > Hi, > > I have the following issue. I am trying to load a 2.5G JSON

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-26 Thread Sean Owen
My quick take: no blockers at this point, except for one potential issue. Still some 'critical' bugs worth a look. The release seems to pass tests but I get a lot of spurious failures; it took about 16 hours of running tests to get everything to pass at least once. Current score: 56 issues target

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-26 Thread Luc Bourlier
- tested the backpressure/rate controlling in streaming. It works as expected. - there is a problem with the Scala 2.11 sbt build: https://issues.apache.org/jira/browse/SPARK-10227 Luc Bourlier *Spark Team - Typesafe, Inc.* luc.bourl...@typesafe.com On We

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Yu
My understanding is that people on this mailing list who are interested in helping can log comments on the GORA JIRA. HBase integration with Spark is proven to work. So the intricacies should be on the Gora side. On Wed, Aug 26, 2015 at 8:08 AM, Furkan KAMACI wrote: > Btw, here is the source code of Go

Re: Maven issues with 1.5-RC

2015-08-26 Thread Sean Owen
It sounds like you're doing the right things. I believe the Jenkins test machines also have 3.0.4, but successfully build by using build/mvn --force. Not sure what to make of that. On Wed, Aug 26, 2015 at 4:08 PM, Chris Freeman wrote: > Currently trying to compile 1.5-RC2 (from > https://github.c

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Furkan KAMACI
Btw, here is the source code of GoraInputFormat.java : https://github.com/kamaci/gora/blob/master/gora-core/src/main/java/org/apache/gora/mapreduce/GoraInputFormat.java On 26 Aug 2015 at 18:05, "Furkan KAMACI" wrote: > I'll send an e-mail to Gora dev list too and also attach my patch into my

Maven issues with 1.5-RC

2015-08-26 Thread Chris Freeman
Currently trying to compile 1.5-RC2 (from https://github.com/apache/spark/commit/727771352855dbb780008c449a877f5aaa5fc27a) and running into issues with the new Maven requirement. I have 3.0.4 installed at the system level, 1.5 requires 3.3.3. As Patrick has pointed out in other places, this sho

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Furkan KAMACI
I'll send an e-mail to the Gora dev list too and also attach my patch to the GSoC Jira issue you mentioned, and then we can continue there. Before I do that, I wanted to get the Spark dev community's ideas for solving my problem, since you may have faced this kind of problem before. 26 Aug 2015 17:

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Yu
I found GORA-386, "Gora Spark Backend Support". Should the discussion be continued there? Cheers On Wed, Aug 26, 2015 at 7:02 AM, Ted Malaska wrote: > Where is the input format class? Whenever I use the search on your > github it says "We couldn’t find any issues matching 'GoraInputFormat'" >

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Malaska
Where is the input format class? Whenever I use the search on your GitHub it says "We couldn’t find any issues matching 'GoraInputFormat'" On Wed, Aug 26, 2015 at 9:48 AM, Furkan KAMACI wrote: > Hi, > > Here is the MapReduceTestUtils.testSparkWordCount() > > > https://github.com/kamaci/gora

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Furkan KAMACI
Hi, Here is the MapReduceTestUtils.testSparkWordCount() https://github.com/kamaci/gora/blob/master/gora-core/src/test/java/org/apache/gora/mapreduce/MapReduceTestUtils.java#L108 Here is SparkWordCount https://github.com/kamaci/gora/blob/8f1acc6d4ef6c192e8fc06287558b7bc7c39b040/gora-core/src/e

SQLContext.read.json("path") throws java.io.IOException

2015-08-26 Thread gsvic
Hi, I have the following issue. I am trying to load a 2.5G JSON file from a 10-node Hadoop cluster. Specifically, I am trying to create a DataFrame using sqlContext.read.json("hdfs://master:9000/path/file.json"). The JSON file contains a parsed table (relation) from the TPCH benchmark. After finis

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Malaska
Where can I find the code for MapReduceTestUtils.testSparkWordCount? On Wed, Aug 26, 2015 at 9:29 AM, Furkan KAMACI wrote: > Hi, > > Here is the test method I've ignored due to Connection Refused problem > failure: > > > https://github.com/kamaci/gora/blob/master/gora-hbase/src/test/java/org/apa

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Furkan KAMACI
Hi, Here is the test method I've ignored due to a Connection Refused failure: https://github.com/kamaci/gora/blob/master/gora-hbase/src/test/java/org/apache/gora/hbase/mapreduce/TestHBaseStoreWordCount.java#L65 I've implemented a Spark backend for Apache Gora as a GSoC project and this is th

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Malaska
I've always used HBaseTestingUtility and never really had much trouble. I use that for all my unit testing between Spark and HBase. Here are some code examples if you're interested --Main HBase-Spark Module https://github.com/apache/hbase/tree/master/hbase-spark --Unit tests that cover all basic co

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Yu
Can you log the contents of the Configuration you pass from Spark? The output should give you some clues. Cheers > On Aug 26, 2015, at 2:30 AM, Furkan KAMACI wrote: > > Hi Ted, > > I'll check the ZooKeeper connection, but another test method which runs on HBase > without Spark works without any

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Furkan KAMACI
Hi Ted, I'll check the ZooKeeper connection, but another test method which runs on HBase without Spark works without any error. The HBase version is 0.98.8-hadoop2 and I use Spark 1.3.1. Kind Regards, Furkan KAMACI On 26 Aug 2015 at 12:08, "Ted Yu" wrote: > The connection failure was to zookeeper. > >

RE: Spark builds: allow user override of project version at buildtime

2015-08-26 Thread andrew.rowson
So, I actually tried this, and it built without problems, but publishing the artifacts to artifactory ended up with some strangeness in the child poms, where the property wasn’t resolved. This leads to issues pulling them into other projects of: “Could not find org.apache.spark:spark-parent_2.1

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Yu
The connection failure was to ZooKeeper. Have you verified that localhost:2181 can serve requests? What version of HBase was Gora built against? Cheers > On Aug 26, 2015, at 1:50 AM, Furkan KAMACI wrote: > > Hi, > > I start an HBase cluster for my test class. I use this helper class: >
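One way to check whether localhost:2181 can serve requests is ZooKeeper's four-letter "ruok" command, to which a healthy server replies "imok". The following Python sketch is not taken from the thread. The function name `zk_is_ok` and the default timeout are assumptions made for illustration, and some newer ZooKeeper releases restrict four-letter commands by configuration, so a False result does not always mean the server is down.

```python
import socket


def zk_is_ok(host="localhost", port=2181, timeout=2.0):
    """Return True if the server answers 'imok' to ZooKeeper's 'ruok' command."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = b""
            # Read until the four expected bytes arrive or the peer closes.
            while len(reply) < 4:
                chunk = sock.recv(4 - len(reply))
                if not chunk:
                    break
                reply += chunk
            return reply == b"imok"
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("ZooKeeper reachable:", zk_is_ok())
```

Running the check from the same JVM host and user as the failing Spark test helps separate "ZooKeeper is not listening" from "the test is pointed at a different host or port".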

Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Furkan KAMACI
Hi, I start an HBase cluster for my test class. I use this helper class: https://github.com/apache/gora/blob/master/gora-hbase/src/test/java/org/apache/gora/hbase/util/HBaseClusterSingleton.java and use it like this: private static final HBaseClusterSingleton cluster = HBaseClusterSingleton.

Re: Introduce a sbt plugin to deploy and submit jobs to a spark cluster on ec2

2015-08-26 Thread pishen tsai
Please ask questions in the gitter channel for now. https://gitter.im/pishen/spark-deployer - spark-deployer.conf should be placed in your project's root directory (beside build.sbt) - To use the nightly builds, you can replace the value of "spark-tgz-url" in spark-deployer.conf with the tgz you wan

Re: Introduce a sbt plugin to deploy and submit jobs to a spark cluster on ec2

2015-08-26 Thread rake
This looks promising. I'm trying to use spark-ec2 to launch a cluster with Spark 1.5.0-SNAPSHOT and failing. Where should we ask questions and report problems? A couple of questions I already have after looking through the project: - Where does the configuration file spark-deployer.conf go (w

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-26 Thread rake
rxin wrote > > > The release files, including signatures, digests, etc. can be found at: > http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/ > > Release artifacts are signed with the following key: > https://people.apache.org/keys/committer/pwendell.asc > > I was