Repeated data item search with Spark SQL(1.0.1)

2014-07-13 Thread anyweil
Hi All: I am using Spark SQL 1.0.1 for a simple test. The loaded data (JSON format), which is registered as table people, is: {"name":"Michael", "schools":[{"name":"ABC","time":1994},{"name":"EFG","time":2000}]} {"name":"Andy", "age":30, "scores":{"eng":98,"phy":89}} {"name":"Justin", "age":19} The schools field holds repeated values
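
A minimal sketch of the setup described, assuming the Spark SQL 1.0.1 API (jsonFile and registerAsTable) and a SparkContext sc as in the shell; the file path is hypothetical:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // One JSON object per line; the schema, including the nested
    // schools array, is inferred automatically.
    val people = sqlContext.jsonFile("people.json")
    people.registerAsTable("people")
    // Top-level fields query fine; how to filter on the repeated
    // schools array is the open question in this thread.
    sqlContext.sql("SELECT name FROM people").collect().foreach(println)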

can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Hi, I have a DStream that works just fine when I say: dstream.print If I say: dstream.map((_, 1)).print that works, too. However, if I do the following: dstream.reduce { case (x, y) => x }.print I don't get anything on my console. What's going on? Thanks

Re: Nested Query With Spark SQL(1.0.1)

2014-07-13 Thread anyweil
Or is it supported? I know I could do it myself with a filter, but if SQL could support it, that would be much better, thx!

Error in JavaKafkaWordCount.java example

2014-07-13 Thread Mahebub Sayyed
Hello, I am referring to the following example: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java I am getting the following compilation error: \example\JavaKafkaWordCount.java:[62,70] error: cannot access ClassTag Please help

Problem reading in LZO compressed files

2014-07-13 Thread Ognen Duzlevski
Hello, I have been trying to play with the Google ngram dataset provided by Amazon in the form of LZO compressed files. I am having trouble understanding what is going on ;). I have added the compression jar and native library to the underlying Hadoop/HDFS installation, restarted the name node

Re: Problem reading in LZO compressed files

2014-07-13 Thread Nicholas Chammas
If you’re still seeing gibberish, it’s because Spark is not using the LZO libraries properly. In your case, I believe you should be calling newAPIHadoopFile() instead of textFile(). For example: sc.newAPIHadoopFile(s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data,
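
A sketch of what the full call looks like, assuming the hadoop-lzo jar is on the classpath (the input format class name comes from that package; the S3 path is from the thread):

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // newAPIHadoopFile yields (offset, line) pairs; keep the line text.
    val ngrams = sc.newAPIHadoopFile(
        "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
        classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
      .map(_._2.toString)
    ngrams.take(5).foreach(println)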

Re: Confused by groupByKey() and the default partitioner

2014-07-13 Thread Guanhua Yan
Thanks, Aaron. I replaced groupByKey with reduceByKey along with some list concatenation operations, and found that the performance became even worse. So groupByKey is not that bad in my code. Best regards, - Guanhua From: Aaron Davidson ilike...@gmail.com Reply-To: user@spark.apache.org

Re: Problem reading in LZO compressed files

2014-07-13 Thread Ognen Duzlevski
Nicholas, Thanks! How do I make Spark assemble against a local version of Hadoop? I have 2.4.1 running on a test cluster and I did SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly, but all it did was pull in hadoop-2.4.1 dependencies via sbt (which is sufficient for using a 2.4.1 HDFS). I am

Re: Confused by groupByKey() and the default partitioner

2014-07-13 Thread Aaron Davidson
Ah -- I should have been more clear: list concatenation isn't going to be any faster. In many cases I've seen people use groupByKey() when they are really trying to do some sort of aggregation, and thus constructing this concatenated list is more expensive than they need. On Sun, Jul 13, 2014 at
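
To make the contrast concrete, a small hypothetical pair-RDD example: reduceByKey combines values map-side before the shuffle, while groupByKey ships every value across the network just to build the list:

    import org.apache.spark.SparkContext._ // pair-RDD implicits (pre-1.3)

    val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1))

    // Materializes the full list of values per key, then sums it.
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // Combines partial sums per partition first; usually what you want.
    val viaReduce = pairs.reduceByKey(_ + _)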

Re: can't print DStream after reduce

2014-07-13 Thread Tathagata Das
Try doing DStream.foreachRDD and then printing the RDD count and further inspecting the RDD. On Jul 13, 2014 1:03 AM, Walrus theCat walrusthe...@gmail.com wrote: Hi, I have a DStream that works just fine when I say: dstream.print If I say: dstream.map((_, 1)).print that works, too.
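
A sketch of that debugging step (dstream stands in for whichever stream is misbehaving):

    dstream.foreachRDD { rdd =>
      // Runs on the driver once per batch; count() forces evaluation.
      println("batch count = " + rdd.count())
      rdd.take(5).foreach(println)
    }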

Re: spark ui on yarn

2014-07-13 Thread Koert Kuipers
My YARN environment does have less memory for the executors. I am checking whether the RDDs are cached by calling sc.getRDDStorageInfo, which shows an RDD as fully cached in memory, yet it does not show up in the UI. On Sun, Jul 13, 2014 at 1:49 AM, Matei Zaharia matei.zaha...@gmail.com wrote: The
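
For reference, a sketch of that check; the RDDInfo field names used here (name, numCachedPartitions, memSize) are an assumption based on the API of that era:

    sc.getRDDStorageInfo.foreach { info =>
      println(info.name + ": cached partitions = " + info.numCachedPartitions +
        ", memory = " + info.memSize + " bytes")
    }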

Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Update on this: val lines = ssc.socketTextStream("localhost", ) lines.print // works lines.map(_ -> 1).print // works lines.map(_ -> 1).reduceByKey(_ + _).print // nothing printed to driver console Just lots of: 14/07/13 11:37:40 INFO receiver.BlockGenerator: Pushed block input-0-1405276660400

Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Thanks for your interest. lines.foreachRDD(x => println(x.count)) And I got 0 every once in a while (which I think is strange, because lines.print prints the input I'm giving it over the socket.) When I tried: lines.map(_ -> 1).reduceByKey(_ + _).foreachRDD(x => println(x.count)) I got no count.

SparkSql newbie problems with nested selects

2014-07-13 Thread Andy Davidson
Hi I am running into trouble with a nested query using Python. To try and debug it, I first wrote the query I want using sqlite3: select freq.docid, freqTranspose.docid, sum(freq.count * freqTranspose.count) from Frequency as freq, (select term, docid, count from Frequency) as

Re: SparkSql newbie problems with nested selects

2014-07-13 Thread Michael Armbrust
Hi Andy, The SQL parser is pretty basic (we plan to improve this for the 1.2 release). In this case I think part of the problem is that one of your variables is count, which is a reserved word. Unfortunately, we don't have the ability to escape identifiers at this point. However, I did manage
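
A hedged sketch of the workaround: re-register the table with the count column renamed (cnt here is a made-up name), since identifiers cannot be escaped in this release:

    // Hypothetical: Frequency re-registered with `count` renamed to `cnt`.
    val results = sqlContext.sql("""
      SELECT freq.docid, ft.docid, SUM(freq.cnt * ft.cnt)
      FROM Frequency AS freq, (SELECT term, docid, cnt FROM Frequency) AS ft
      WHERE freq.term = ft.term
      GROUP BY freq.docid, ft.docid""")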

Task serialized size dependent on size of RDD?

2014-07-13 Thread Sébastien Rainville
Hi, I'm having trouble serializing tasks for this code: val rddC = (rddA join rddB) .map { case (x, (y, z)) => z -> y } .reduceByKey({ (y1, y2) => Semigroup.plus(y1, y2) }, 1000) Somehow when running on a small data set the size of the serialized task is about 650KB, which is very big, and

Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
More strange behavior: lines.foreachRDD(x => println(x.first)) // works lines.foreachRDD(x => println((x.count, x.first))) // no output is printed to driver console On Sun, Jul 13, 2014 at 11:47 AM, Walrus theCat walrusthe...@gmail.com wrote: Thanks for your interest. lines.foreachRDD(x =>

Re: SparkSql newbie problems with nested selects

2014-07-13 Thread Andy Davidson
Hi Michael Changing my col name to something other than 'count' fixed the parse error. Many thanks, Andy From: Michael Armbrust mich...@databricks.com Reply-To: user@spark.apache.org Date: Sunday, July 13, 2014 at 1:18 PM To: user@spark.apache.org Cc: u...@spark.incubator.apache.org

Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Great success! I was able to get output to the driver console by changing the construction of the Streaming Spark Context from: val ssc = new StreamingContext("local" /**TODO change once a cluster is up **/, "AppName", Seconds(1)) to: val ssc = new StreamingContext("local[2]" /**TODO
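
Spelled out, with the app name and batch interval as in the thread: bare local gives the receiver the only available thread, leaving none to run the print, while local[2] reserves one thread for each:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // At least two local threads: one for the socket receiver,
    // one for actually processing and printing the batches.
    val ssc = new StreamingContext("local[2]", "AppName", Seconds(1))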

Ideal core count within a single JVM

2014-07-13 Thread lokesh.gidra
Hello, What would be an ideal core count to run a Spark job in local mode to get the best utilization of CPU? Actually I have a 48-core machine, but the performance of local[48] is poor compared to local[10]. Lokesh

Re: can't print DStream after reduce

2014-07-13 Thread Michael Campbell
This almost had me not using Spark; I couldn't get any output. It is not at all obvious to the layman what's going on here (and, to the best of my knowledge, it is not documented anywhere), but now that you know, you'll be able to answer this question for the numerous people who will also have it. On Sun,

Re: not getting output from socket connection

2014-07-13 Thread Michael Campbell
Make sure you use local[n] (where n > 1) in your context setup too, (if you're running locally), or you won't get output. On Sat, Jul 12, 2014 at 11:36 PM, Walrus theCat walrusthe...@gmail.com wrote: Thanks! I thought it would get passed through netcat, but given your email, I was able to

Re: can't print DStream after reduce

2014-07-13 Thread Sean Owen
How about a PR that rejects a context configured for local or local[1]? As I understand it, this is not intended to work and has bitten several people. On Jul 14, 2014 12:24 AM, Michael Campbell michael.campb...@gmail.com wrote: This almost had me not using Spark; I couldn't get any output. It is not

Re: Problem reading in LZO compressed files

2014-07-13 Thread Nicholas Chammas
I actually never got this to work, which is part of the reason why I filed that JIRA. Apart from using --jar when starting the shell, I don’t have any more pointers for you. :( On Sun, Jul 13, 2014 at 12:57 PM, Ognen Duzlevski ognen.duzlev...@gmail.com wrote: Nicholas, Thanks! How do I

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-13 Thread Tathagata Das
In case you still have issues with duplicate files in uber jar, here is a reference sbt file with assembly plugin that deals with duplicates https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt On Fri, Jul 11, 2014 at 10:06 AM, Bill Jay
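
For readers hitting those duplicate-file errors, a typical deduplication stanza of that era looks roughly like this (a sketch assuming sbt-assembly 0.11.x keys, not the exact contents of the linked build.sbt):

    import AssemblyKeys._

    assemblySettings

    mergeStrategy in assembly := {
      // Duplicate metadata files from different jars are the usual culprit.
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _                             => MergeStrategy.first
    }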

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-13 Thread Tathagata Das
Also, the reason spark-streaming-kafka is not included in the Spark assembly is that we do not want dependencies of external systems like Kafka (which itself probably has a complex dependency tree) to cause conflicts with core Spark's functionality and stability. TD On Sun, Jul 13, 2014

Possible bug in ClientBase.scala?

2014-07-13 Thread Ron Gonzalez
Hi, I was doing programmatic submission of Spark YARN jobs and I saw this code in ClientBase.getDefaultYarnApplicationClasspath(): val field = classOf[MRJobConfig].getField("DEFAULT_YARN_APPLICATION_CLASSPATH") MRJobConfig doesn't have this field, so the created launch env is incomplete. Workaround
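
A sketch of the shape of the workaround being discussed: guard the reflective lookup and fall back to YarnConfiguration when the field is missing (illustrative only, not actual Spark code):

    import org.apache.hadoop.mapreduce.MRJobConfig
    import org.apache.hadoop.yarn.conf.YarnConfiguration

    val defaultClasspath: Array[String] =
      try {
        classOf[MRJobConfig].getField("DEFAULT_YARN_APPLICATION_CLASSPATH")
          .get(null).asInstanceOf[Array[String]]
      } catch {
        // Some Hadoop builds do not define the field on MRJobConfig.
        case _: NoSuchFieldException =>
          YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH
      }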

Re: Possible bug in ClientBase.scala?

2014-07-13 Thread Nicholas Chammas
On Sun, Jul 13, 2014 at 9:49 PM, Ron Gonzalez zlgonza...@yahoo.com wrote: I can easily fix this by changing this to YarnConfiguration instead of MRJobConfig but was wondering what the steps are for submitting a fix. Relevant links: -

Re: not getting output from socket connection

2014-07-13 Thread Walrus theCat
Hah, thanks for tidying up the paper trail here, but I was the OP (and solver) of the recent reduce thread that ended in this solution. On Sun, Jul 13, 2014 at 4:26 PM, Michael Campbell michael.campb...@gmail.com wrote: Make sure you use local[n] (where n 1) in your context setup too, (if

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-13 Thread kytay
Hi Akhil Das Thanks. I tried the code, and it works. There's a problem with my socket code: it is not flushing the content out, and for the test tool, Hercules, I have to close the socket connection to flush the content out. I am going to troubleshoot why nc works, and the code and test

Re: Possible bug in ClientBase.scala?

2014-07-13 Thread Chester Chen
Ron, Which distribution and version of Hadoop are you using? I just looked at CDH5 (hadoop-mapreduce-client-core-2.3.0-cdh5.0.0); MRJobConfig does have the field: java.lang.String DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH; Chester On Sun, Jul 13, 2014 at 6:49 PM, Ron Gonzalez

Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Nicholas Chammas
For example, are LIKE 'string%' queries supported? Trying one on 1.0.1 yields java.lang.ExceptionInInitializerError. Nick On Sat, Jul 12, 2014 at 10:16 PM, Nick Chammas nicholas.cham...@gmail.com wrote: Is there a place where we can find an up-to-date list of supported SQL syntax in Spark
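
For concreteness, the kind of query in question (table and column names hypothetical):

    // A prefix match like this should not need a full regular expression,
    // per the optimization Michael describes downthread.
    sqlContext.sql("SELECT name FROM people WHERE name LIKE 'Mich%'").collect()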

Re: Anaconda Spark AMI

2014-07-13 Thread Jeremy Freeman
Hi Ben, This is great! I just spun up an EC2 cluster and tested basic pyspark + ipython/numpy/scipy functionality, and all seems to be working so far. Will let you know if any issues arise. We do a lot with pyspark + scientific computing, and for EC2 usage I think this is a terrific way to

Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Nicholas Chammas
Actually, this looks like it's some kind of regression in 1.0.1, perhaps related to assembly and packaging with spark-ec2. I don’t see this issue with the same data on a 1.0.0 EC2 cluster. How can I trace this down for a bug report? Nick On Sun, Jul 13, 2014 at 11:18 PM, Nicholas Chammas

Catalyst dependency on Spark Core

2014-07-13 Thread Aniket Bhatnagar
As per the recent presentation given in Scala days ( http://people.apache.org/~marmbrus/talks/SparkSQLScalaDays2014.pdf), it was mentioned that Catalyst is independent of Spark. But on inspecting pom.xml of sql/catalyst module, it seems it has a dependency on Spark Core. Any particular reason for

Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Michael Armbrust
Are you sure the code running on the cluster has been updated? We recently optimized the execution of LIKE queries that can be evaluated without using full regular expressions. So it's possible this error is due to missing functionality on the executors. How can I trace this down for a bug

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-13 Thread Tobias Pfeiffer
Hi, I experienced exactly the same problem when using a SparkContext with the local[1] master specification, because in that case the one available thread is used for receiving data, leaving none for processing. As there is only one thread running, no processing will take place. Once you shut down the connection, the

SPARK S3 LZO input; worker stuck

2014-07-13 Thread hassan
Hi I'm trying to read LZO compressed files from S3 using Spark. The LZO files are not indexed. The Spark job starts to read the files just fine, but after a while it just hangs: no network throughput. I have to restart the worker process to get it back up. Any idea what could be causing this? We were

Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Nicholas Chammas
Are you sure the code running on the cluster has been updated? I launched the cluster using spark-ec2 from the 1.0.1 release, so I’m assuming that’s taken care of, at least in theory. I just spun down the clusters I had up, but I will revisit this tomorrow and provide the information you

Re: Powered By Spark: Can you please add our org?

2014-07-13 Thread Alex Gaudio
Awesome! Thanks, Reynold! On Tue, Jul 8, 2014 at 4:00 PM, Reynold Xin r...@databricks.com wrote: I added you to the list. Cheers. On Mon, Jul 7, 2014 at 6:19 PM, Alex Gaudio adgau...@gmail.com wrote: Hi, Sailthru is also using Spark. Could you please add us to the Powered By Spark

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-13 Thread kytay
Hi Tobias I have been using local[4] to test. My problem is likely caused by the TCP host server that I am trying to emulate. I was trying to emulate the TCP host to send out messages (although I am not sure at the moment :D). The first way I tried was to use a TCP tool called Hercules. The second