Re: Spark 1.2.x Yarn Auxiliary Shuffle Service

2015-02-09 Thread Arush Kharbanda
Is this what you are looking for? 1. Build Spark with the YARN profile http://spark.apache.org/docs/1.2.0/building-spark.html. Skip this step if you are using a pre-packaged distribution. 2. Locate the spark-version-yarn-shuffle.jar. This should be under

Re: generate a random matrix with uniform distribution

2015-02-09 Thread Luca Puggini
Thanks a lot! Can I ask why this code generates a uniform distribution? If dist is N(0,1) data should be N(-1, 2). Let me know. Thanks, Luca 2015-02-07 3:00 GMT+00:00 Burak Yavuz brk...@gmail.com: Hi, You can do the following: ``` import

OutOfMemoryError: Java heap space

2015-02-09 Thread Yifan LI
Hi, I just found the following errors during computation (GraphX); does anyone have ideas on this? Thanks so much! (I think the memory is sufficient: spark.executor.memory is 30GB.) 15/02/09 00:37:12 ERROR Executor: Exception in task 162.0 in stage 719.0 (TID 7653) java.lang.OutOfMemoryError: Java

Need a spark application.

2015-02-09 Thread Kartheek.R
Hi, can someone please suggest some real-life application implemented in Spark (things like gene sequencing) that follows the pattern of the code below? Basically, the application should have jobs submitted via as many threads as possible. I need a similar kind of Spark application for benchmarking. val
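The quoted code is cut off above; as a stand-in, here is a minimal sketch (all names and operations are placeholders, not gene sequencing) of the pattern being asked for: several independent jobs submitted to one SparkContext from separate threads.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ConcurrentJobs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("concurrent-jobs"))
    val data = sc.parallelize(1 to 1000000).cache()
    // SparkContext is thread-safe: each Future submits an independent job.
    val jobs = (1 to 8).map { i =>
      Future { data.map(_ * i).reduce(_ + _) }
    }
    jobs.foreach(job => println(Await.result(job, Duration.Inf)))
    sc.stop()
  }
}
```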

Re: Custom streaming receiver slow on YARN

2015-02-09 Thread Jong Wook Kim
Replying to my own thread: I realized that this only happens when the replication level is 1. Regardless of whether I set memory-only, disk, or deserialized storage, I had to make the replication level 2 to make the streaming work properly on YARN. I still don't get why, because intuitively less
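For readers following along, a toy sketch (not the original receiver) of what the replication level means here: it is the StorageLevel passed to the Receiver superclass, e.g. MEMORY_ONLY_2 rather than MEMORY_ONLY.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Toy custom receiver; the relevant detail is the StorageLevel handed to the superclass.
class DummyReceiver(level: StorageLevel) extends Receiver[String](level) {
  def onStart(): Unit = {
    new Thread("dummy-receiver") {
      override def run(): Unit = {
        while (!isStopped()) { store("tick"); Thread.sleep(1000) }
      }
    }.start()
  }
  def onStop(): Unit = {}
}

// MEMORY_ONLY_2 keeps two replicas of each received block (replication = 2),
// which is what reportedly made the stream progress on YARN in this thread.
// val stream = ssc.receiverStream(new DummyReceiver(StorageLevel.MEMORY_ONLY_2))
```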

Spark SQL - Point lookup optimisation in SchemaRDD?

2015-02-09 Thread nitin
Hi All, I have a use case where I have cached my schemaRDD and I want to launch executors just on the partition which I know of (the prime use case of PartitionPruningRDD). I tried something like the following: val partitionIdx = 2 val schemaRdd = hiveContext.table("myTable") // myTable is cached in
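A rough sketch of that idea, assuming an existing hiveContext and a cached table named myTable (PartitionPruningRDD is a DeveloperApi, so this is illustrative rather than a supported SQL feature):

```scala
import org.apache.spark.rdd.PartitionPruningRDD

val partitionIdx = 2
val schemaRdd = hiveContext.table("myTable")   // assumes myTable is already cached
// Keep only the partition we care about, so only that partition's task is launched.
val pruned = PartitionPruningRDD.create(schemaRdd, idx => idx == partitionIdx)
pruned.collect().foreach(println)
```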

Re: Spark Driver Host under Yarn

2015-02-09 Thread nitin
Are you running in yarn-cluster or yarn-client mode? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Driver-Host-under-Yarn-tp21536p21556.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Driver Host under Yarn

2015-02-09 Thread Al M
Yarn-cluster. When I run in yarn-client the driver is just run on the machine that runs spark-submit. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Driver-Host-under-Yarn-tp21536p21558.html Sent from the Apache Spark User List mailing list archive

Re: Spark (yarn-client mode) Hangs in final stages of Collect or Reduce

2015-02-09 Thread nitin
Have you checked the corresponding executor logs as well? I think the information you have provided here is not enough to actually understand your issue. -- View this message in context:

Re: saveAsTextFile of RDD[Array[Any]]

2015-02-09 Thread Jong Wook Kim
If you have `RDD[Array[Any]]` you can do `rdd.map(_.mkString("\t"))`, or use some other delimiter, to make it `RDD[String]`, and then call `saveAsTextFile`. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-of-RDD-Array-Any-tp21548p21554.html
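For completeness, a tiny self-contained sketch of that suggestion (the sample data and output path are made up):

```scala
// Flatten each Array[Any] into one tab-delimited line, then save as text.
val rdd: org.apache.spark.rdd.RDD[Array[Any]] =
  sc.parallelize(Seq(Array[Any](1, "a", 2.0), Array[Any](2, "b", 3.5)))
rdd.map(_.mkString("\t")).saveAsTextFile("/tmp/array-out")
```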

Re: generate a random matrix with uniform distribution

2015-02-09 Thread Burak Yavuz
Sorry about that, yes, it should be uniformVectorRDD. Thanks Sean! Burak On Mon, Feb 9, 2015 at 2:05 AM, Sean Owen so...@cloudera.com wrote: Yes the example given here should have used uniformVectorRDD. Then it's correct. On Mon, Feb 9, 2015 at 9:56 AM, Luca Puggini lucapug...@gmail.com
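Putting the correction together, a minimal sketch (Spark 1.2 MLlib API) of generating a random matrix with uniformly distributed entries; the rescaling to [-1, 1] is only an example transformation:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random.RandomRDDs.uniformVectorRDD

val n = 1000   // rows
val p = 10     // columns
// uniformVectorRDD gives entries uniform on [0, 1]; rescale to [-1, 1] if that is what you need.
val rows = uniformVectorRDD(sc, n, p).map(v => Vectors.dense(v.toArray.map(x => 2 * x - 1)))
val matrix = new RowMatrix(rows, n, p)
```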

running spark project using java -cp command

2015-02-09 Thread Hafiz Mujadid
Hi experts! Is there any way to run a Spark application using the java -cp command? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/running-spark-project-using-java-cp-command-tp21567.html Sent from the Apache Spark User List mailing list archive at

Re: running spark project using java -cp command

2015-02-09 Thread Akhil Das
Yes like this: /usr/lib/jvm/java-7-openjdk-i386/bin/java -cp

Re: How to create spark AMI in AWS

2015-02-09 Thread Guodong Wang
Hi Nicholas, Thanks for your quick reply. I'd like to try to build an image with create_image.sh. Then let's see how we can launch a Spark cluster in region cn-north-1. Guodong On Tue, Feb 10, 2015 at 3:59 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Guodong, spark-ec2 does not

Re: Will Spark serialize an entire Object or just the method referred in an object?

2015-02-09 Thread Yitong Zhou
Hi Marcelo, Thanks for the explanation! So you mean that, in this way, only the output of the map closure would need to be serialized so that it could be passed on to other operations (maybe reduce or something else)? And we don't have to worry about Utils.funcX because for each closure instance we

Re: How to create spark AMI in AWS

2015-02-09 Thread Nicholas Chammas
OK, good luck! On Mon Feb 09 2015 at 6:41:14 PM Guodong Wang wangg...@gmail.com wrote: Hi Nicholas, Thanks for your quick reply. I'd like to try to build a image with create_image.sh. Then let's see how we can launch spark cluster in region cn-north-1. Guodong On Tue, Feb 10, 2015 at

Re: textFile partitions

2015-02-09 Thread Kostas Sakellis
The partitions parameter to textFile is the minPartitions. So there will be at least that level of parallelism. Spark delegates to Hadoop to create the splits for that file (yes, even for a text file on disk and not hdfs). You can take a look at the code in FileInputFormat - but briefly it will
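A quick illustration of that behaviour (the file name is just an example):

```scala
// The second argument is only a lower bound; Hadoop's FileInputFormat computes the
// actual splits, so the final partition count can be higher than the value requested.
val rdd = sc.textFile("README.md", 100)
println(rdd.partitions.size)   // typically >= 100 here, since minPartitions = 100
```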

External Data Source in SPARK

2015-02-09 Thread Addanki, Santosh Kumar
Hi, We implemented an external data source by extending TableScan. We added the classes to the classpath. The data source works fine when run in the Spark shell, but currently we are unable to use this same data source in a Python environment. So when we execute the following in an

SparkSQL 1.2 and ElasticSearch-Spark 1.4 not working together, NoSuchMethodError problems

2015-02-09 Thread Aris
Hello Spark community and Holden, I am trying to follow Holden Karau's SparkSQL and ElasticSearch tutorial from Spark Summit 2014. I am trying to use elasticsearch-spark 2.1.0.Beta3 and SparkSQL 1.2 together. https://github.com/holdenk/elasticsearchspark *(Side Note: This very nice tutorial does

Re: MLLib: feature standardization

2015-02-09 Thread Xiangrui Meng
`mean()` and `variance()` are not defined in `Vector`. You can use the mean and variance implementation from commons-math3 (http://commons.apache.org/proper/commons-math/javadocs/api-3.4.1/index.html) if you don't want to implement them. -Xiangrui On Fri, Feb 6, 2015 at 12:50 PM, SK
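A sketch of that suggestion, assuming commons-math3 is on the classpath (DescriptiveStatistics is one convenient choice):

```scala
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics
import org.apache.spark.mllib.linalg.Vector

// Compute mean and (sample) variance of a Vector's values without writing them by hand.
def meanAndVariance(v: Vector): (Double, Double) = {
  val stats = new DescriptiveStatistics()
  v.toArray.foreach(stats.addValue)
  (stats.getMean, stats.getVariance)
}
```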

Re: How to create spark AMI in AWS

2015-02-09 Thread Nicholas Chammas
Guodong, spark-ec2 does not currently support the cn-north-1 region, but you can follow [SPARK-4241](https://issues.apache.org/jira/browse/SPARK-4241) to find out when it does. The base AMI used to generate the current Spark AMIs is very old. I'm not sure anyone knows what it is anymore. What I

Re: Number of goals to win championship

2015-02-09 Thread Xiangrui Meng
Logistic regression outputs probabilities if the data fits the model assumption. Otherwise, you might need to calibrate its output to correctly read it. You may be interested in reading this: http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression/. We have isotonic

Re: no option to add intercepts for StreamingLinearAlgorithm

2015-02-09 Thread Xiangrui Meng
No particular reason. We didn't add it in the first version. Let's add it in 1.4. -Xiangrui On Thu, Feb 5, 2015 at 3:44 PM, jamborta jambo...@gmail.com wrote: hi all, just wondering if there is a reason why it is not possible to add intercepts for streaming regression models? I understand

Re: [MLlib] Performance issues when building GBM models

2015-02-09 Thread Xiangrui Meng
Could you check the Spark UI and see whether there are RDDs being kicked out during the computation? We cache the residual RDD after each iteration. If we don't have enough memory/disk, it gets recomputed and results in something like `t(n) = t(n-1) + const`. We might cache the features multiple

RE: no space left at worker node

2015-02-09 Thread ey-chih chow
In other words, the working command is: /root/spark/bin/spark-submit --class com.crowdstar.etl.ParseAndClean --master spark://ec2-54-213-73-150.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --total-executor-cores 4 file:///root/etl-admin/jar/spark-etl-0.0.1-SNAPSHOT.jar

Re: getting error when submit spark with master as yarn

2015-02-09 Thread Al M
Open up 'yarn-site.xml' in your Hadoop configuration. You want to add configuration for yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb. Have a look here for details on how they work:

ImportError: No module named pyspark, when running pi.py

2015-02-09 Thread Ashish Kumar
*Command:* sudo python ./examples/src/main/python/pi.py *Error:* Traceback (most recent call last): File "./examples/src/main/python/pi.py", line 22, in <module> from pyspark import SparkContext ImportError: No module named pyspark

Re: ImportError: No module named pyspark, when running pi.py

2015-02-09 Thread Mohit Singh
I think you have to run that using $SPARK_HOME/bin/pyspark /path/to/pi.py instead of normal python pi.py On Mon, Feb 9, 2015 at 11:22 PM, Ashish Kumar ashish.ku...@innovaccer.com wrote: *Command:* sudo python ./examples/src/main/python/pi.py *Error:* Traceback (most recent call last):

parameter passed for AppendOnlyMap initialCapacity

2015-02-09 Thread fightf...@163.com
Hi all, can any experts show me what can be done to change the initialCapacity of the following? org.apache.spark.util.collection.AppendOnlyMap We ran into problems using Spark to process large data sets during sort shuffle. Does Spark offer a configurable parameter for

RE: no space left at worker node

2015-02-09 Thread ey-chih chow
Thanks. But, in spark-submit, I specified the jar file in the form of local:/spark-etl-0.0.1-SNAPSHOT.jar. It comes back with the following. What's wrong with this? Ey-Chih Chow === Date: Sun, 8 Feb 2015 22:27:17 -0800 Sending launch command to

Re: Spark streaming app shutting down

2015-02-09 Thread Mukesh Jha
Thanks for the info guys. For now I'm using the high-level consumer; I will give this one a try. As far as the queries are concerned, checkpointing helps. I'm still not sure what's the best way to gracefully stop the application in yarn-cluster mode. On 5 Feb 2015 09:38, Dibyendu Bhattacharya

sum of columns in rowMatrix and linear regression

2015-02-09 Thread Donbeo
I have a matrix X of type: res39: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@6cfff1d3 with n rows and p columns. I would like to obtain an array S of size n*1, defined as the sum of the columns of X. S will then be replaced by val
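One way to get that, as a sketch (assuming X is the RowMatrix above): sum each row's entries, which yields one value per row, i.e. an n-element RDD.

```scala
// For every row of X, add up its p entries; collect only if n is small enough.
val S = X.rows.map(_.toArray.sum)
```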

Can spark job server be used to visualize streaming data?

2015-02-09 Thread Su She
Hello Everyone, I was reading this blog post: http://homes.esat.kuleuven.be/~bioiuser/blog/a-d3-visualisation-from-spark-as-a-service/ and was wondering if this approach can be taken to visualize streaming data...not just historical data? Thank you! -Suh

Re: SparkSQL 1.2 and ElasticSearch-Spark 1.4 not working together, NoSuchMethodError problems

2015-02-09 Thread Costin Leau
Hi, Spark 1.2 changed the APIs a bit which is what's causing the problem with es-spark 2.1.0.Beta3. This has been addressed a while back in es-spark proper; you can get a hold of the dev build (the upcoming 2.1.Beta4) here [1]. P.S. Do note that a lot of things have happened in

Re: python api and gzip compression

2015-02-09 Thread Kane Kim
Found it - used saveAsHadoopFile On Mon, Feb 9, 2015 at 9:11 AM, Kane Kim kane.ist...@gmail.com wrote: Hi, how to compress output with gzip using python api? Thanks!

Executor Lost with StorageLevel.MEMORY_AND_DISK_SER

2015-02-09 Thread Marius Soutier
Hi there, I’m trying to improve performance on a job that has GC troubles and takes longer to compute simply because it has to recompute failed tasks. After deferring object creation as much as possible, I’m now trying to improve memory usage with StorageLevel.MEMORY_AND_DISK_SER and a custom
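The message is truncated here, but for readers following along, a hedged sketch of the kind of setup being described: serialized caching plus Kryo with a custom registrator (the registrator class and registered types are hypothetical placeholders).

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator
import org.apache.spark.storage.StorageLevel

// Hypothetical registrator: register the classes that actually end up in the cached RDD.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Array[Double]])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")

// Later, on whatever RDD is being cached:
// rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
```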

How to create spark AMI in AWS

2015-02-09 Thread Guodong Wang
Hi guys, I want to launch a Spark cluster in AWS, and I know there is a spark_ec2.py script. I am using the AWS service in China, but I cannot find the AMI in the China region, so I have to build one. My question is 1. Where is the bootstrap script to create the Spark AMI? Is it here(

Re: Installing a python library along with ec2 cluster

2015-02-09 Thread gen tang
Hi, Please take a look at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami-ebs.html Cheers Gen On Mon, Feb 9, 2015 at 6:41 AM, Chengi Liu chengi.liu...@gmail.com wrote: Hi I am very new both in spark and aws stuff.. Say, I want to install pandas on ec2.. (pip install

Re: generate a random matrix with uniform distribution

2015-02-09 Thread Sean Owen
Yes the example given here should have used uniformVectorRDD. Then it's correct. On Mon, Feb 9, 2015 at 9:56 AM, Luca Puggini lucapug...@gmail.com wrote: Thanks a lot! Can I ask why this code generates a uniform distribution? If dist is N(0,1) data should be N(-1, 2). Let me know. Thanks,

using spark in web services

2015-02-09 Thread Hafiz Mujadid
Hi experts! I am trying to use Spark in my RESTful web services. I am using the Scala Lift framework for writing web services. Here is my Boot class: class Boot extends Bootable { def boot { Constants.loadConfiguration val sc = new SparkContext(new

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
Hi Michael, The storage tab shows the RDD resides fully in memory (10 partitions) with zero disk usage. Tasks for a subsequent select on this cached table show minimal overheads (GC, queueing, shuffle write, etc.), so overhead is not the issue. However, it is still twice as slow as reading

Re: Will Spark serialize an entire Object or just the method referred in an object?

2015-02-09 Thread Marcelo Vanzin
`func1` and `func2` never get serialized. They must exist on the other end in the form of a class loaded by the JVM. What gets serialized is an instance of a particular closure (the argument to your map function). That's a separate class. The instance of that class that is serialized contains
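To make that concrete, a small sketch (assuming rdd is an RDD[String]; the method bodies are made up): the Utils object is loaded as a class on the executors, and only the closure instance passed to map is serialized.

```scala
object Utils {
  def func1(s: String): String = s.toUpperCase   // referenced by the closure
  def func2(s: String): String = s.reverse       // never serialized; it just lives in the jar
}

// The closure { r => Utils.func1(r) } is what gets serialized and shipped to executors;
// it holds a reference to the Utils class, not a serialized copy of the object.
val upper = rdd.map(r => Utils.func1(r))
```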

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Michael Armbrust
You'll probably only get good compression for strings when dictionary encoding works. We don't optimize decimals in the in-memory columnar storage, so you are likely paying an expensive serialization cost there. On Mon, Feb 9, 2015 at 2:18 PM, Manoj Samel manojsamelt...@gmail.com wrote: Flat data of

Will Spark serialize an entire Object or just the method referred in an object?

2015-02-09 Thread Yitong Zhou
If we define a Utils object: object Utils { def func1 = {..} def func2 = {..} } and then in an RDD we refer to one of the functions: rdd.map{r => Utils.func1(r)} Will Utils.func2 also get serialized or not? Thanks, Yitong -- View this message in context:

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
Could you share which data types are optimized in the in-memory storage and how they are optimized? On Mon, Feb 9, 2015 at 2:33 PM, Michael Armbrust mich...@databricks.com wrote: You'll probably only get good compression for strings when dictionary encoding works. We don't optimize decimals

Re: SparkSQL DateTime

2015-02-09 Thread Michael Armbrust
The standard way to add timestamps is java.sql.Timestamp. On Mon, Feb 9, 2015 at 3:23 PM, jay vyas jayunit100.apa...@gmail.com wrote: Hi spark ! We are working on the bigpetstore-spark implementation in apache bigtop, and want to implement idiomatic date/time usage for SparkSQL. It appears
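A minimal sketch of that approach against the Spark 1.2 API (the table and field names are made up; assumes an existing sc and sqlContext):

```scala
import java.sql.Timestamp

// java.sql.Timestamp is one of the types SparkSQL's reflection-based schema
// inference understands, unlike org.joda.time.DateTime.
case class Transaction(id: Long, ts: Timestamp)

val txns = sc.parallelize(Seq(Transaction(1L, new Timestamp(System.currentTimeMillis()))))
import sqlContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD conversion
txns.registerTempTable("transactions")
sqlContext.sql("SELECT id, ts FROM transactions").collect().foreach(println)
```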

Check if spark was built with hive

2015-02-09 Thread Ashic Mahtab
Is there an easy way to check if a Spark binary release was built with Hive support? Are any of the prebuilt binaries on the Spark website built with Hive support? Thanks, Ashic.

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Michael Armbrust
You could add a new ColumnType https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala . PRs welcome :) On Mon, Feb 9, 2015 at 3:01 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi Michael, As a test, I have same data loaded as

Re: Check if spark was built with hive

2015-02-09 Thread Sean Owen
https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L217 Yes all releases are built with -Phive except the 'without-hive' build. On Mon, Feb 9, 2015 at 10:41 PM, Ashic Mahtab as...@live.com wrote: Is there an easy way to check if a spark binary release was built

RE: Check if spark was built with hive

2015-02-09 Thread Ashic Mahtab
Awesome...thanks Sean. From: so...@cloudera.com Date: Mon, 9 Feb 2015 22:43:45 + Subject: Re: Check if spark was built with hive To: as...@live.com CC: user@spark.apache.org https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L217 Yes all releases are

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
Hi Michael, As a test, I have the same data loaded as another Parquet table, except with the 2 decimal(14,4) columns replaced by double. With this, the on-disk size is ~345MB, the in-memory size is 2GB (vs. 12GB), and the cached query runs in half the time of the uncached query. Would it be possible for Spark to

SparkSQL DateTime

2015-02-09 Thread jay vyas
Hi spark! We are working on the bigpetstore-spark implementation in apache bigtop, and want to implement idiomatic date/time usage for SparkSQL. It appears that org.joda.time.DateTime isn't in SparkSQL's rolodex of reflection types. I'd rather not force an artificial dependency on hive dates

textFile partitions

2015-02-09 Thread Yana Kadiyska
Hi folks, puzzled by something pretty simple: I have a standalone cluster with default parallelism of 2, and spark-shell running with 2 cores. sc.textFile("README.md").partitions.size returns 2 (this makes sense); sc.textFile("README.md").coalesce(100, true).partitions.size returns 100, which also makes sense

Re: word2vec more distributed

2015-02-09 Thread Xiangrui Meng
The C implementation of Word2Vec updates the model using multiple threads without locking. It is hard to implement that in a distributed way. In the MLlib implementation, each worker holds the entire model in memory and outputs the part of the model that gets updated. The driver still needs to collect and