Re: How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Marcelo Vanzin
I'm not aware of a way to do that. CTRL-C is generally handled by the terminal emulator and just sends a SIGINT to the process; and the JVM doesn't allow you to ignore SIGINT. Setting a signal handler on the parent process doesn't work either. And there are other combinations you need to care

Re: How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Marcelo Vanzin
BTW this might be useful: http://superuser.com/questions/160388/change-bash-shortcut-keys-such-as-ctrl-c On Fri, Mar 13, 2015 at 10:53 AM, Sean Owen so...@cloudera.com wrote: This is more of a function of your shell, since ctrl-C will send SIGINT. I'm sure you can change that, depending on your

Date and decimal datatype not working

2015-03-13 Thread BASAK, ANANDA
Hi All, I am very new to the Spark world. I just started some test coding last week. I am using spark-1.2.1-bin-hadoop2.4 and scala coding. I am having issues while using the Date and decimal data types. Following is my code that I am simply running on the scala prompt. I am trying to define a table and

Re: Getting incorrect weights for LinearRegression

2015-03-13 Thread Burak Yavuz
Hi, I would suggest you use LBFGS, as I think the step size is hurting you. You can run the same thing in LBFGS as: ``` val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater()) val initialWeights = Vectors.dense(Array.fill(3)( scala.util.Random.nextDouble())) val weights =
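
The code block in this snippet is cut off; a minimal sketch of what the full LBFGS call might look like, assuming a hypothetical training set `examples: RDD[LabeledPoint]` with 3 features per point:
```
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.{LBFGS, LeastSquaresGradient, SimpleUpdater}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// `examples` is a hypothetical RDD[LabeledPoint] with 3 features per point.
def fitWithLBFGS(examples: RDD[LabeledPoint]): Vector = {
  // LBFGS.optimize expects (label, features) pairs.
  val data: RDD[(Double, Vector)] = examples.map(lp => (lp.label, lp.features))
  val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
  val initialWeights = Vectors.dense(Array.fill(3)(scala.util.Random.nextDouble()))
  algorithm.optimize(data, initialWeights)  // returns the fitted weight vector
}
```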

Re: How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Sean Owen
This is more of a function of your shell, since ctrl-C will send SIGINT. I'm sure you can change that, depending on your shell or terminal, but it's not a Spark-specific answer. On Fri, Mar 13, 2015 at 5:44 PM, Adamantios Corais adamantios.cor...@gmail.com wrote: this doesn't solve my problem...

RE: Handling worker batch processing during driver shutdown

2015-03-13 Thread Tathagata Das
Are you running the code before or after closing the spark context? It must be after stopping the streaming context (without stopping the spark context) and before stopping the spark context. Cc'ing Sean. He may have more insights. On Mar 13, 2015 11:19 AM, Jose Fernandez jfernan...@sdl.com wrote: Thanks, I am

serialization stackoverflow error during reduce on nested objects

2015-03-13 Thread ilaxes
Hi, I'm working on an RDD of a tuple of objects which represent trees (Node containing a hashmap of nodes). I'm trying to aggregate these trees over the RDD. Let's take an example, 2 graphs : C - D - B - A - D - B - E F - E - B - A - D - B - F I'm splitting each graph according to the vertex A

Re: Hanging tasks in spark 1.2.1 while working with 1.1.1

2015-03-13 Thread Eugen Cepoi
Hum increased it to 1024 but doesn't help still have the same problem :( 2015-03-13 18:28 GMT+01:00 Eugen Cepoi cepoi.eu...@gmail.com: The one by default 0.07 of executor memory. I'll try increasing it and post back the result. Thanks 2015-03-13 18:09 GMT+01:00 Ted Yu yuzhih...@gmail.com:

Re: [ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Kushal Datta
Kudos to the whole team for such a significant achievement! On Fri, Mar 13, 2015 at 10:00 AM, Patrick Wendell pwend...@gmail.com wrote: Hi All, I'm happy to announce the availability of Spark 1.3.0! Spark 1.3.0 is the fourth release on the API-compatible 1.X line. It is Spark's largest

Re: How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Marcelo Vanzin
You can type :quit. On Fri, Mar 13, 2015 at 10:29 AM, Adamantios Corais adamantios.cor...@gmail.com wrote: Hi, I want to change the default combination of keys that exit the Spark shell (i.e. CTRL + C) to something else, such as CTRL + H? Thank you in advance. // Adamantios -- Marcelo

Any way to find out feature importance in Spark SVM?

2015-03-13 Thread Natalia Connolly
Hello, While running an SVMClassifier in spark, is there any way to print out the most important features the model selected? I see there's a way to print out the weights but I am not sure how to associate them with the input features, by name or in any other way. Any help would be

How do I alter the combination of keys that exit the Spark shell?

2015-03-13 Thread Adamantios Corais
Hi, I want to change the default combination of keys that exit the Spark shell (i.e. CTRL + C) to something else, such as CTRL + H? Thank you in advance. *// Adamantios*

Re: Hanging tasks in spark 1.2.1 while working with 1.1.1

2015-03-13 Thread Ted Yu
Might be related: what's the value for spark.yarn.executor.memoryOverhead ? See SPARK-6085 Cheers On Fri, Mar 13, 2015 at 9:45 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Hi, I have a job that hangs after upgrading to spark 1.2.1 from 1.1.1. Strange thing, the exact same code does work

Re: [ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Sebastián Ramírez
Awesome! Thanks! *Sebastián Ramírez* Head of Software Development http://www.senseta.com Tel: (+571) 795 7950 ext: 1012 Cel: (+57) 300 370 77 10 Calle 73 No 7 - 06 Piso 4 Linkedin: co.linkedin.com/in/tiangolo/ Twitter: @tiangolo https://twitter.com/tiangolo Email:

how to print RDD by key into file with groupByKey

2015-03-13 Thread Adrian Mocanu
Hi I have an RDD: RDD[(String, scala.Iterable[(Long, Int)])] which I want to print into a file, a file for each key string. I tried to trigger a repartition of the RDD by doing group by on it. The grouping gives RDD[(String, scala.Iterable[Iterable[(Long, Int)]])] so I flattened that:
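
One commonly suggested way to get one output file per key (a sketch, not necessarily what the original poster ended up using) is to skip the groupByKey shuffle and instead write through a custom MultipleTextOutputFormat that derives the file name from the key; the class name and output path below are hypothetical:
```
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Routes each record to a file named after its key; the key itself is not repeated in the file body.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

def writePerKey(rdd: RDD[(String, Iterable[(Long, Int)])], outputDir: String): Unit =
  rdd.flatMapValues(identity)                          // one record per (Long, Int) pair
     .mapValues { case (ts, count) => s"$ts,$count" }  // render each pair as a CSV line
     .saveAsHadoopFile(outputDir, classOf[String], classOf[String], classOf[KeyBasedOutput])
```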

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-13 Thread Shivaram Venkataraman
Do you have a small test case that can reproduce the out of memory error ? I have also seen some errors on large scale experiments but haven't managed to narrow it down. Thanks Shivaram On Fri, Mar 13, 2015 at 6:20 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: It runs faster but there is some

Re: Hanging tasks in spark 1.2.1 while working with 1.1.1

2015-03-13 Thread Eugen Cepoi
The one by default 0.07 of executor memory. I'll try increasing it and post back the result. Thanks 2015-03-13 18:09 GMT+01:00 Ted Yu yuzhih...@gmail.com: Might be related: what's the value for spark.yarn.executor.memoryOverhead ? See SPARK-6085 Cheers On Fri, Mar 13, 2015 at 9:45 AM,

RE: Handling worker batch processing during driver shutdown

2015-03-13 Thread Jose Fernandez
Thanks, I am using 1.2.0 so it looks like I am affected by the bug you described. It also appears that the shutdown hook doesn’t work correctly when the driver is running in YARN. According to the logs it looks like the SparkContext is closed and the code you suggested is never executed and

Re: Getting incorrect weights for LinearRegression

2015-03-13 Thread EcoMotto Inc.
Thanks a lot Burak, that helped. On Fri, Mar 13, 2015 at 1:55 PM, Burak Yavuz brk...@gmail.com wrote: Hi, I would suggest you use LBFGS, as I think the step size is hurting you. You can run the same thing in LBFGS as: ``` val algorithm = new LBFGS(new LeastSquaresGradient(), new

Unable to connect

2015-03-13 Thread Mohit Anchlia
I am running spark streaming standalone in ec2 and I am trying to run the wordcount example from my desktop. The program is unable to connect to the master; in the logs I see the following, which seems to be an issue with the hostname. 15/03/13 17:37:44 ERROR EndpointWriter: dropping message [class

LogisticRegressionWithLBFGS shows ERRORs

2015-03-13 Thread cjwang
I am running LogisticRegressionWithLBFGS. I got these lines on my console: 2015-03-12 17:38:03,897 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.5 2015-03-12 17:38:03,967 ERROR breeze.optimize.StrongWolfeLineSearch |

Re: spark sql writing in avro

2015-03-13 Thread Kevin Peng
Markus, Thanks. That makes sense. I was able to get this to work with spark-shell passing in the git built jar. I did notice that I couldn't get AvroSaver.save to work with SQLContext, but it works with HiveContext. Not sure if that is an issue, but for me, it is fine. Once again, thanks for

Re: Unable to connect

2015-03-13 Thread Tathagata Das
Are you running the driver program (that is, your application process) on your desktop and trying to run it on the cluster in EC2? It could very well be a hostname mismatch in some way due to all the public hostname, private hostname, private IP, firewall craziness of EC2. You have to probably

Spark on HDFS vs. Lustre vs. other file systems - formal research and performance evaluation

2015-03-13 Thread Edmon Begoli
All, Does anyone have any reference to a publication or other informal sources (blogs, notes) showing performance of Spark on HDFS vs. other shared (Lustre, etc.) or other file systems (NFS)? I need this for formal performance research. We are currently doing research into this on a very

Re: Using rdd methods with Dstream

2015-03-13 Thread Tathagata Das
Is the number of top K elements you want to keep small? That is, is K small? In which case, you can 1. either do it in the driver on the array DStream.foreachRDD ( rdd => { val topK = rdd.top(K) ; // use top K }) 2. Or, you can use the topK to create another RDD using sc.makeRDD
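
A minimal sketch of option 1, assuming a hypothetical DStream[(String, Long)] of counts named `counts` and K = 5:
```
// Option 1: compute the top K of each batch on the driver.
val K = 5
counts.foreachRDD { rdd =>
  val topK = rdd.top(K)(Ordering.by[(String, Long), Long](_._2))  // K largest elements by count
  topK.foreach(println)                                           // use the small array however needed
  // Option 2: turn it back into an RDD for further distributed processing.
  // val topKRDD = rdd.sparkContext.makeRDD(topK)
}
```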

Re: spark sql writing in avro

2015-03-13 Thread M. Dale
I probably did not do a good enough job explaining the problem. If you used Maven with the default Maven repository you have an old version of spark-avro that does not contain AvroSaver and does not have the saveAsAvro method implemented: Assuming you use the default Maven repo location: cd

spark flume tryOrIOException NoSuchMethodError

2015-03-13 Thread jaredtims
I am trying to process events from a flume avro sink, but I keep getting this same error. I am just running it locally using flume's avro-client, with the following commands to start the job and client. It seems like it should be a configuration problem since it's a NoSuchMethodError, but

Re: Joining data using Latitude, Longitude

2015-03-13 Thread Ankur Srivastava
Hi Everyone, Thank you for your suggestions; based on them I was able to move forward. I am now generating a Geohash for all the lats and lons in our reference data and then creating a trie of all the Geohashes. I am then broadcasting that trie and then using it to search for the nearest Geohash for

Re: Partitioning

2015-03-13 Thread Tathagata Das
If you want to access the keys in an RDD that is partitioned by key, then you can use RDD.mapPartitions(), which gives you access to the whole partition as an iterator of (key, value) pairs. You have the option of maintaining the partitioning information or not by setting the preservesPartitioning flag in
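
A small sketch of that pattern, assuming a hypothetical key-partitioned RDD[(String, Int)] named `pairs`:
```
// mapPartitions sees each partition as an Iterator[(key, value)];
// preservesPartitioning keeps the existing partitioner because the keys are not modified.
val doubled = pairs.mapPartitions(
  iter => iter.map { case (k, v) => (k, v * 2) },
  preservesPartitioning = true
)
```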

Re: Partitioning

2015-03-13 Thread Mohit Anchlia
I still don't follow how spark is partitioning data in a multi-node environment. Is there a document on how spark does partitioning of data? For e.g., in the word count example, how is spark distributing words to multiple nodes? On Fri, Mar 13, 2015 at 3:01 PM, Tathagata Das t...@databricks.com wrote: If you

org.apache.spark.SparkException Error sending message

2015-03-13 Thread Chen Song
When I ran a Spark SQL query (a simple group by query) via hive support, I saw lots of failures in the map phase. I am not sure if that is specific to Spark SQL or general. Has anyone seen such errors before? java.io.IOException: org.apache.spark.SparkException: Error sending message [message

Re: spark sql writing in avro

2015-03-13 Thread Michael Armbrust
BTW, I'll add that we are hoping to publish a new version of the Avro library for Spark 1.3 shortly. It should have improved support for writing data both programmatically and from SQL. On Fri, Mar 13, 2015 at 2:01 PM, Kevin Peng kpe...@gmail.com wrote: Markus, Thanks. That makes sense. I

spark there is no space on the disk

2015-03-13 Thread Peng Xia
Hi, I was running a logistic regression algorithm on an 8 node spark cluster; each node has 8 cores and 56 GB RAM (each node is running a windows system), and the spark installation drive has 1.9 TB capacity. The dataset I was training on has around 40 million records with around 6600

Re: RE: Explanation on the Hive in the Spark assembly

2015-03-13 Thread bit1...@163.com
Thanks Daoyuan. What do you mean by running some native command? I never thought that hive would run without a computing engine like Hadoop MR or spark. Thanks. bit1...@163.com From: Wang, Daoyuan Date: 2015-03-13 16:39 To: bit1...@163.com; user Subject: RE: Explanation on the Hive in the

Re: Loading in json with spark sql

2015-03-13 Thread Yin Huai
Seems you want to use array for the field of providers, like providers:[{id: ...}, {id:...}] instead of providers:{{id: ...}, {id:...}} On Fri, Mar 13, 2015 at 7:45 PM, kpeng1 kpe...@gmail.com wrote: Hi All, I was noodling around with loading in a json file into spark sql's hive context and
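
For illustration, a record shaped the way Yin suggests and a sketch of loading it, assuming a HiveContext named `hiveContext` as in the original thread (the file path, field names, and table name are hypothetical):
```
// example.json, one JSON object per line, with "providers" as an array:
// {"user": "u1", "providers": [{"id": "p1"}, {"id": "p2"}]}

val records = hiveContext.jsonFile("/path/to/example.json")
records.printSchema()                 // providers should appear as an array of structs, not _corrupt_record
records.registerTempTable("records")
hiveContext.sql("SELECT providers FROM records").collect().foreach(println)
```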

Re: Loading in json with spark sql

2015-03-13 Thread Kevin Peng
Yin, Yup thanks. I fixed that shortly after I posted and it worked. Thanks, Kevin On Fri, Mar 13, 2015 at 8:28 PM, Yin Huai yh...@databricks.com wrote: Seems you want to use array for the field of providers, like providers:[{id: ...}, {id:...}] instead of providers:{{id: ...}, {id:...}}

Re: serialization stackoverflow error during reduce on nested objects

2015-03-13 Thread Ted Yu
Have you registered your class with kryo ? See core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala and core/src/test/scala/org/apache/spark/SparkConfSuite.scala On Fri, Mar 13, 2015 at 10:52 AM, ilaxes ila...@hotmail.com wrote: Hi, I'm working on a RDD of a tuple of objects

Re: Problem connecting to HBase

2015-03-13 Thread fightf...@163.com
Hi, there You may want to check your hbase config, e.g. the following property can be changed to /hbase: <property> <name>zookeeper.znode.parent</name> <value>/hbase-unsecure</value> </property> fightf...@163.com From: HARIPRIYA AYYALASOMAYAJULA Date: 2015-03-14 10:47 To: user

Re: Problem connecting to HBase

2015-03-13 Thread Ted Yu
In HBaseTest.scala: val conf = HBaseConfiguration.create() You can add some log (for zookeeper.znode.parent, e.g.) to see if the values from hbase-site.xml are picked up correctly. Please use pastebin next time you want to post errors. Which Spark release are you using ? I assume it contains
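
A sketch of the logging Ted suggests; the property names are standard HBase keys, and the values you should expect depend on your own hbase-site.xml:
```
import org.apache.hadoop.hbase.HBaseConfiguration

val conf = HBaseConfiguration.create()
// If hbase-site.xml is on the classpath these should print your cluster's values,
// not the HBase defaults (e.g. /hbase for znode.parent, localhost for the quorum).
println("zookeeper.znode.parent = " + conf.get("zookeeper.znode.parent"))
println("hbase.zookeeper.quorum = " + conf.get("hbase.zookeeper.quorum"))
```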

Aggregation of distributed datasets

2015-03-13 Thread raggy
I am a PhD student trying to understand the internals of Spark, so that I can make some modifications to it. I am trying to understand how the aggregation of the distributed datasets (through the network) onto the driver node works. I would very much appreciate it if someone could point me towards

How does Spark honor data locality when allocating computing resources for an application

2015-03-13 Thread bit1...@163.com
Hi, sparkers, When I read the code about computing resources allocation for the newly submitted application in the Master#schedule method, I got a question about data locality: // Pack each app into as few nodes as possible until we've assigned all its cores for (worker <- workers if

Re: Partitioning

2015-03-13 Thread Tathagata Das
If you want to learn about how Spark partitions the data based on keys, here is a recent talk about that http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs?related=1 Of course you can read the original Spark paper

Upgrade from Spark 1.1.0 to 1.1.1+ Issues

2015-03-13 Thread EH
Hi all, I've been using Spark 1.1.0 for a while, and now would like to upgrade to Spark 1.1.1 or above. However, it throws the following errors: 18:05:31.522 [sparkDriver-akka.actor.default-dispatcher-3hread] ERROR TaskSchedulerImpl - Lost executor 37 on hcompute001: remote Akka client

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-13 Thread Zhan Zhang
During function evaluation in the line search, the value is either infinite or NaN, which may be caused by too large a step size. In the code, the step is reduced to half. Thanks. Zhan Zhang On Mar 13, 2015, at 2:41 PM, cjwang c...@cjwang.us wrote: I am running LogisticRegressionWithLBFGS.

Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-13 Thread Shuai Zheng
Hi All, I am trying to run a sort on an r3.2xlarge instance on AWS. I am just running it as a single node cluster for a test. The data I want to sort is around 4GB and sits on S3; the output will also go to S3. I just connect spark-shell to the local cluster and run the code in the script (because I just

Re: Partitioning

2015-03-13 Thread Gerard Maas
In spark-streaming, the consumers will fetch data and put it into 'blocks'. Each block becomes a partition of the rdd generated during that batch interval. The size of each block is controlled by the conf: 'spark.streaming.blockInterval'. That is, the amount of data the consumer can collect in
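
A minimal sketch of tuning that setting, assuming a Spark 1.x application where the value is given in milliseconds (the app name is a placeholder):
```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-app")                       // hypothetical application name
  .set("spark.streaming.blockInterval", "200")       // block interval in ms; smaller => more partitions per batch
val ssc = new StreamingContext(conf, Seconds(4))     // 4s batch interval => roughly 4000/200 = 20 blocks per batch
```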

Spark SQL 1.3 max operation giving wrong results

2015-03-13 Thread gtinside
Hi, I am playing around with Spark SQL 1.3 and noticed that the max function does not give the correct result, i.e. it doesn't give the maximum value. The same query works fine in Spark SQL 1.2. Is anyone aware of this issue? Regards, Gaurav -- View this message in context:

Loading in json with spark sql

2015-03-13 Thread kpeng1
Hi All, I was noodling around with loading in a json file into spark sql's hive context and I noticed that I get the following message after loading in the json file: PhysicalRDD [_corrupt_record#0], MappedRDD[5] at map at JsonRDD.scala:47 I am using the HiveContext to load in the json file

Re: building all modules in spark by mvn

2015-03-13 Thread Ted Yu
Please try the following command: mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package Cheers On Fri, Mar 13, 2015 at 5:57 PM, sequoiadb mailing-list-r...@sequoiadb.com wrote: guys, is there any easier way to build all modules by mvn ? right now if I run “mvn package” in spark root

RE: Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-13 Thread Shuai Zheng
And one thing I forgot to mention: even though I have this exception and the result is not well formatted in my target folder (part of it is there, the rest is under a different folder structure of the _temporary folder), in the webUI of spark-shell it is still marked as a successful step. I think this is a bug?

Need Advice about reading lots of text files

2015-03-13 Thread Pat Ferrel
We have many text files that we need to read in parallel. We can create a comma delimited list of files to pass in to sparkContext.textFile(fileList). The list can get very large (maybe 1) and is all on hdfs. The question is: what is the most performant way to read them? Should they be

building all modules in spark by mvn

2015-03-13 Thread sequoiadb
guys, is there any easier way to build all modules by mvn ? right now if I run “mvn package” in spark root directory I got: [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 8.327 s] [INFO] Spark Project Networking ...

SV: Pyspark Hbase scan.

2015-03-13 Thread Castberg, René Christian
Sorry, forgot to attach the traceback. Regards Rene Castberg From: Castberg, René Christian Sent: 13 March 2015 07:13 To: user@spark.apache.org Cc: gen tang Subject: SV: Pyspark Hbase scan. Hi, I have now successfully managed to test this in a local spark

SV: Pyspark Hbase scan.

2015-03-13 Thread Castberg, René Christian
Hi, I have now successfully managed to test this in a local spark session. But I am having a huge problem getting this to work with the Hortonworks technical preview. I think that there is an incompatibility with the way YARN has been compiled. After changing the hbase version, and

Re: spark sql performance

2015-03-13 Thread Akhil Das
That totally depends on your data size and your cluster setup. Thanks Best Regards On Thu, Mar 12, 2015 at 7:32 PM, Udbhav Agarwal udbhav.agar...@syncoms.com wrote: Hi, What is query time for join query on hbase with spark sql. Say tables in hbase have 0.5 million records each. I am

Re: set up spark cluster with heterogeneous hardware

2015-03-13 Thread Akhil Das
You could also add SPARK_MASTER_IP to bind to a specific host/IP so that it won't get confused with those hosts in your /etc/hosts file. Thanks Best Regards On Fri, Mar 13, 2015 at 12:00 PM, Du Li l...@yahoo-inc.com.invalid wrote: Hi Spark community, I searched for a way to configure a

Re: Unable to read files In Yarn Mode of Spark Streaming ?

2015-03-13 Thread Prannoy
You can put the files anywhere; the streaming application only picks up files having a timestamp greater than the time the batch was started. No, you don't need to set any properties for this to work. This is the default behaviour of spark streaming. On Fri, Mar 13, 2015 at 9:38 AM,

Re: Using Neo4j with Apache Spark

2015-03-13 Thread Gautam Bajaj
I have been trying to do the same, but where exactly do you suggest creating the static object? Creating it inside foreachRDD will ultimately result in the same problem, and not creating it inside will result in a serializability issue. On Fri, Mar 13, 2015 at 11:47 AM, Tathagata Das

RE: spark sql performance

2015-03-13 Thread Udbhav Agarwal
Thanks Akhil, What more info should I give so we can estimate query time in my scenario? Thanks, Udbhav Agarwal From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: 13 March, 2015 12:01 PM To: Udbhav Agarwal Cc: user@spark.apache.org Subject: Re: spark sql performance That totally depends

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
Forget about my last message. I was confused. Spark 1.2.1 + Scala 2.10.4 started by SBT console command also failed with this error. However running from a standard spark shell works. Jianshi On Fri, Mar 13, 2015 at 2:46 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hmm... look like the

RE: spark sql performance

2015-03-13 Thread Udbhav Agarwal
Let's say I am using 4 machines with 3 GB RAM. My data is customer records with 5 columns each, in two tables with 0.5 million records. I want to perform a join query on these two tables. Thanks, Udbhav Agarwal From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: 13 March, 2015 12:16 PM To:

Re: spark sql performance

2015-03-13 Thread Akhil Das
The size/type of your data, and your cluster configuration would be fine i think. Thanks Best Regards On Fri, Mar 13, 2015 at 12:07 PM, Udbhav Agarwal udbhav.agar...@syncoms.com wrote: Thanks Akhil, What more info should I give so we can estimate query time in my scenario? *Thanks,*

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
Hmm... looks like the console command still starts a Spark 1.3.0 with Scala 2.11.6 even though I changed them in build.sbt. So the test with 1.2.1 is not valid. Jianshi On Fri, Mar 13, 2015 at 2:34 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: I've confirmed it only failed in console started by

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Eric Charles
I have the same issue running spark sql code from an eclipse workspace. If you run your code from the command line (with a packaged jar) or from IntelliJ, I bet it should work. IMHO this is somehow related to the eclipse env, but would love to know how to fix it (whether via eclipse conf, or via a

RE: spark sql performance

2015-03-13 Thread Udbhav Agarwal
Okay Akhil! Thanks for the information. Thanks, Udbhav Agarwal From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: 13 March, 2015 12:34 PM To: Udbhav Agarwal Cc: user@spark.apache.org Subject: Re: spark sql performance Can't say that unless you try it. Thanks Best Regards On Fri, Mar

set up spark cluster with heterogeneous hardware

2015-03-13 Thread Du Li
Hi Spark community, I searched for a way to configure a heterogeneous cluster because the need recently emerged in my project. I didn't find any solution out there. Now I have thought out a solution and thought it might be useful to many other people with similar needs. Following is a blog post

Re: spark sql performance

2015-03-13 Thread Akhil Das
Can't say that unless you try it. Thanks Best Regards On Fri, Mar 13, 2015 at 12:32 PM, Udbhav Agarwal udbhav.agar...@syncoms.com wrote: Sounds great! So can I expect response time in milliseconds from the join query over this much data ( 0.5 million in each table) ? *Thanks,*

RE: spark sql performance

2015-03-13 Thread Udbhav Agarwal
Sounds great! So can I expect response time in milliseconds from the join query over this much data ( 0.5 million in each table) ? Thanks, Udbhav Agarwal From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: 13 March, 2015 12:27 PM To: Udbhav Agarwal Cc: user@spark.apache.org Subject: Re:

RE: spark sql performance

2015-03-13 Thread Udbhav Agarwal
Additionally, I wanted to mention that presently I was running the query on one machine with 3 GB RAM and the join query was taking around 6 seconds. Thanks, Udbhav Agarwal From: Udbhav Agarwal Sent: 13 March, 2015 12:45 PM To: 'Akhil Das' Cc: user@spark.apache.org Subject: RE: spark sql performance

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
I've confirmed it only failed in console started by SBT. I'm using sbt-spark-package plugin, and the initialCommands look like this (I added implicit sqlContext to it): show console::initialCommands [info] println(Welcome to\n + [info] __\n + [info] / __/__ ___

Re: spark sql performance

2015-03-13 Thread Akhil Das
You can see where it is spending time, whether there is any GC time etc. from the webUI (running on 4040). Also, how many cores do you have? Thanks Best Regards On Fri, Mar 13, 2015 at 1:05 PM, Udbhav Agarwal udbhav.agar...@syncoms.com wrote: Additionally I wanted to tell that presently I was

Re: KafkaUtils and specifying a specific partition

2015-03-13 Thread Akhil Das
Here's a simple consumer which does that https://github.com/dibbhatt/kafka-spark-consumer/ Thanks Best Regards On Thu, Mar 12, 2015 at 10:28 PM, ColinMc colin.mcqu...@shiftenergy.com wrote: Hi, How do you use KafkaUtils to specify a specific partition? I'm writing customer Marathon jobs

Re: Error running rdd.first on hadoop

2015-03-13 Thread Akhil Das
Make sure your hadoop is running on port 8020, you can check it in your core-site.xml file and use that URI like: sc.textFile(hdfs://myhost:myport/data) Thanks Best Regards On Fri, Mar 13, 2015 at 5:15 AM, Lau, Kawing (GE Global Research) kawing@ge.com wrote: Hi I was running with

Re: spark sql performance

2015-03-13 Thread Akhil Das
So you can cache up to 8GB of data in memory (hoping your data size for one table is 2GB), then it should be pretty fast with SparkSQL. Also I'm assuming you have around 12-16 cores total. Thanks Best Regards On Fri, Mar 13, 2015 at 12:22 PM, Udbhav Agarwal udbhav.agar...@syncoms.com wrote:
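
A rough sketch of that setup, assuming the two HBase-backed tables have already been loaded as SchemaRDDs `t1` and `t2`, that a SQLContext named `sqlContext` exists, and that both tables share a hypothetical `id` column:
```
t1.registerTempTable("customers_a")
t2.registerTempTable("customers_b")
sqlContext.cacheTable("customers_a")   // keep both tables in memory so the join does not re-read HBase
sqlContext.cacheTable("customers_b")

val joined = sqlContext.sql(
  "SELECT a.*, b.* FROM customers_a a JOIN customers_b b ON a.id = b.id")
joined.count()                         // first action materializes the cache; later queries run against memory
```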

RE: spark sql performance

2015-03-13 Thread Udbhav Agarwal
Okay Akhil. I have a 4 core CPU (2.4 GHz). Thanks, Udbhav Agarwal From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: 13 March, 2015 1:07 PM To: Udbhav Agarwal Cc: user@spark.apache.org Subject: Re: spark sql performance You can see where it is spending time, whether there is any GC

No assemblies found in assembly/target/scala-2.10

2015-03-13 Thread Patcharee Thongtra
Hi, I am trying to build spark 1.3 from source. After I executed mvn -DskipTests clean package, I tried to use the shell but got this error: [root@sandbox spark]# ./bin/spark-shell Exception in thread main java.lang.IllegalStateException: No assemblies found in

Spark SQL. Cast to Bigint

2015-03-13 Thread Masf
Hi. I have a query in Spark SQL and I cannot convert a value to BIGINT: CAST(column AS BIGINT) or CAST(0 AS BIGINT) The output is: Exception in thread main java.lang.RuntimeException: [34.62] failure: ``DECIMAL'' expected but identifier BIGINT found Thanks!! Regards. Miguel Ángel

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
Liancheng also found out that the Spark jars are not included in the classpath of URLClassLoader. Hmm... we're very close to the truth now. Jianshi On Fri, Mar 13, 2015 at 6:03 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: I'm almost certain the problem is the ClassLoader. So adding

RE: Explanation on the Hive in the Spark assembly

2015-03-13 Thread Wang, Daoyuan
Hi bit1129, 1. The hive in the spark assembly has most of its dependencies on Hadoop removed to avoid conflicts. 2. This hive is used to run some native commands, which do not rely on spark or mapreduce. Thanks, Daoyuan From: bit1...@163.com [mailto:bit1...@163.com] Sent: Friday, March 13, 2015 4:24 PM

Re: No assemblies found in assembly/target/scala-2.10

2015-03-13 Thread Sean Owen
FWIW I do not see any problem. Are you sure the build completed successfully? You should see an assembly in assembly/target/scala-2.10, so you should check that. On Fri, Mar 13, 2015 at 8:26 AM, Patcharee Thongtra patcharee.thong...@uni.no wrote: Hi, I am trying to build spark 1.3 from

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
I guess it's a ClassLoader issue. But I have no idea how to debug it. Any hints? Jianshi On Fri, Mar 13, 2015 at 3:00 PM, Eric Charles e...@apache.org wrote: i have the same issue running spark sql code from eclipse workspace. If you run your code from the command line (with a packaged jar)

Re: Visualizing the DAG of a Spark application

2015-03-13 Thread Todd Nist
There is the PR https://github.com/apache/spark/pull/2077 for doing this. On Fri, Mar 13, 2015 at 6:42 AM, t1ny wbr...@gmail.com wrote: Hi all, We are looking for a tool that would let us visualize the DAG generated by a Spark application as a simple graph. This graph would represent the

Re: RDD to InputStream

2015-03-13 Thread Sean Owen
These are quite different creatures. You have a distributed set of Strings, but want a local stream of bytes, which involves three conversions: - collect data to driver - concatenate strings in some way - encode strings as bytes according to an encoding Your approach is OK but might be faster to
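
A minimal sketch of those three steps for an assumed RDD[String] named `rdd`, only viable when the data fits in driver memory (the follow-ups later in this thread cover the case where it does not):
```
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets

// collect + concatenate on the driver, then encode as bytes.
val text = rdd.collect().mkString("\n")
val in   = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8))
```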

Visualizing the DAG of a Spark application

2015-03-13 Thread t1ny
Hi all, We are looking for a tool that would let us visualize the DAG generated by a Spark application as a simple graph. This graph would represent the Spark Job, its stages and the tasks inside the stages, with the dependencies between them (either narrow or shuffle dependencies). The Spark

Using rdd methods with Dstream

2015-03-13 Thread Laeeq Ahmed
Hi, I normally use dstream.transform whenever I need to use methods which are available in RDD API but not in streaming API. e.g. dstream.transform(x => x.sortByKey(true)) But there are other RDD methods which return types other than RDD. e.g. dstream.transform(x => x.top(5)) top here returns

Explanation on the Hive in the Spark assembly

2015-03-13 Thread bit1...@163.com
Hi, sparkers, I am kind of confused about hive in the spark assembly. I think hive in the spark assembly is not the same thing as Hive on Spark (Hive on Spark is meant to run hive using the spark execution engine). So, my question is: 1. What is the difference between Hive in the spark assembly

How to do sparse vector product in Spark?

2015-03-13 Thread Xi Shen
Hi, I have two RDD[Vector]; both Vectors are sparse and of the form: (id, value), where id indicates the position of the value in the vector space. I want to apply a dot product on two such RDD[Vector]s and get a scalar value. The non-existent values are treated as zero. Any convenient tool to do
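
One convenient representation is to treat each sparse vector as an RDD of (index, value) pairs; a sketch of the dot product under that assumption (missing indices count as zero):
```
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Each sparse vector is an RDD of (index, value) pairs; indices absent from either side contribute 0.
def sparseDot(a: RDD[(Long, Double)], b: RDD[(Long, Double)]): Double =
  a.join(b)                                    // keep only indices present in both vectors
   .map { case (_, (x, y)) => x * y }          // multiply the matching components
   .sum()                                      // scalar result on the driver
```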

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
I'm almost certain the problem is the ClassLoader. So adding fork := true solves the problems for test and run. The problem is: how can I fork a JVM for the sbt console? fork in console := true seems not to work... Jianshi On Fri, Mar 13, 2015 at 4:35 PM, Jianshi Huang jianshi.hu...@gmail.com

Re: RDD to InputStream

2015-03-13 Thread Ayoub
Thanks Sean, I forgot to mention that the data is too big to be collected on the driver. So yes, your proposition would work in theory, but in my case I cannot hold all the data in the driver memory, therefore it wouldn't work. I guess the crucial point is to do the collect in a lazy way and in

Re: Using rdd methods with Dstream

2015-03-13 Thread Sean Owen
Hm, aren't you able to use the SparkContext here? DStream operations happen on the driver. So you can parallelize() the result? take() won't work as it's not the same as top() On Fri, Mar 13, 2015 at 11:23 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Like this?

Re: Explanation on the Hive in the Spark assembly

2015-03-13 Thread bit1...@163.com
Can anyone have a look on this question? Thanks. bit1...@163.com From: bit1...@163.com Date: 2015-03-13 16:24 To: user Subject: Explanation on the Hive in the Spark assembly Hi, sparkers, I am kind of confused about hive in the spark assembly. I think hive in the spark assembly is not the

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-13 Thread Jaonary Rabarisoa
It runs faster but there are some drawbacks. It seems to consume more memory. I get a java.lang.OutOfMemoryError: Java heap space error if I don't have sufficient partitions for a fixed amount of memory. With the older (ampcamp) implementation, for the same data size, I didn't get it. On Thu, Mar

Re: Using rdd methods with Dstream

2015-03-13 Thread Akhil Das
Like this? dstream.repartition(1).mapPartitions(it => it.take(5)) Thanks Best Regards On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote: Hi, I normally use dstream.transform whenever I need to use methods which are available in RDD API but not in streaming

Pyspark saveAsTextFile exceptions

2015-03-13 Thread Madabhattula Rajesh Kumar
Hi Team, I'm getting the below exception when saving the results into hadoop. *Code :* rdd.saveAsTextFile(hdfs://localhost:9000/home/rajesh/data/result.rdd) Could you please help me resolve this issue? 15/03/13 17:19:31 INFO spark.SparkContext: Starting job: saveAsTextFile at

Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-13 Thread shahab
Thanks, it makes sense. On Thursday, March 12, 2015, Daniel Siegmann daniel.siegm...@teamaol.com wrote: Join causes a shuffle (sending data across the network). I expect it will be better to filter before you join, so you reduce the amount of data which is sent across the network. Note this

Re: Using rdd methods with Dstream

2015-03-13 Thread Laeeq Ahmed
Hi, repartition is expensive. Looking for an efficient way to do this. Regards, Laeeq On Friday, March 13, 2015 12:24 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Like this? dstream.repartition(1).mapPartitions(it => it.take(5)) Thanks Best Regards On Fri, Mar 13, 2015 at 4:11 PM,

Re: Using rdd methods with Dstream

2015-03-13 Thread Laeeq Ahmed
Hi, Earlier my code was like the following, but slow due to repartition. I want the top K of each window in a stream. val counts = keyAndValues.map(x => math.round(x._3.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4)) val topCounts = counts.repartition(1).map(_.swap).transform(rdd =>

Re: RDD to InputStream

2015-03-13 Thread Sean Owen
OK, then you do not want to collect() the RDD. You can get an iterator, yes. There is no such thing as making an Iterator into an InputStream. An Iterator is a sequence of arbitrary objects; an InputStream is a channel to a stream of bytes. I think you can employ similar Guava / Commons utilities
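
One way to wire that up with plain java.io classes and RDD.toLocalIterator, fetching one partition at a time rather than collecting everything (a sketch, assuming newline-separated UTF-8 output is acceptable):
```
import java.io.{ByteArrayInputStream, InputStream, SequenceInputStream}
import java.nio.charset.StandardCharsets
import scala.collection.JavaConverters._

import org.apache.spark.rdd.RDD

// Lazily turns each String into a small InputStream and chains them together;
// toLocalIterator pulls partitions to the driver one at a time as the stream is consumed.
def rddToInputStream(rdd: RDD[String]): InputStream = {
  val streams = rdd.toLocalIterator
    .map(line => new ByteArrayInputStream((line + "\n").getBytes(StandardCharsets.UTF_8)): InputStream)
  new SequenceInputStream(streams.asJavaEnumeration)
}
```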

[GRAPHX] could not process graph with 230M edges

2015-03-13 Thread Hlib Mykhailenko
Hello, I cannot process a graph with 230M edges. I cloned apache.spark, built it and then tried it on a cluster. I used a Spark Standalone Cluster: -5 machines (each has 12 cores/32GB RAM) -'spark.executor.memory' == 25g -'spark.driver.memory' == 3g The graph has 231359027 edges. And its file

Lots of fetch failures on saveAsNewAPIHadoopDataset PairRDDFunctions

2015-03-13 Thread freedafeng
spark1.1.1 + Hbase (CDH5.3.1). 20 nodes each with 4 cores and 32G memory. 3 cores and 16G memory were assigned to spark in each worker node. Standalone mode. Data set is 3.8 T. wondering how to fix this. Thanks!
