Re: yarn does not accept job in cluster mode

2014-09-29 Thread Akhil Das
Can you try running the spark-shell in yarn-cluster mode? ./bin/spark-shell --master yarn-client Read more over here http://spark.apache.org/docs/1.0.0/running-on-yarn.html Thanks Best Regards On Sun, Sep 28, 2014 at 7:08 AM, jamborta jambo...@gmail.com wrote: hi all, I have a job that

aggregateByKey vs combineByKey

2014-09-29 Thread David Rowe
Hi All, After some hair pulling, I've reached the realisation that an operation I am currently doing via: myRDD.groupByKey.mapValues(func) should be done more efficiently using aggregateByKey or combineByKey. Both of these methods would do, and they seem very similar to me in terms of their

Re: MLlib 1.2 New Interesting Features

2014-09-29 Thread Xiangrui Meng
Hi Krishna, Some planned features for MLlib 1.2 can be found via Spark JIRA: http://bit.ly/1ywotkm , though this list is not fixed. The feature freeze will happen by the end of Oct. Then we will cut branch-1.2 and start QA. I don't recommend using branch-1.2 for hands-on tutorial around Oct 29th

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread Xiangrui Meng
The test accuracy doesn't tell you the total loss. All points between (-1, 1) can separate points -1 and +1 and give you 1.0 accuracy, but their corresponding losses are different. -Xiangrui On Sun, Sep 28, 2014 at 2:48 AM, Yanbo Liang yanboha...@gmail.com wrote: Hi We have used LogisticRegression

Re: aggregateByKey vs combineByKey

2014-09-29 Thread Liquan Pei
Hi Dave, You can replace groupByKey with reduceByKey to improve performance in some cases. reduceByKey performs a map-side combine, which can reduce network IO and shuffle size, whereas groupByKey will not perform a map-side combine. combineByKey is more general than aggregateByKey. Actually, the
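
A minimal sketch of the difference being discussed, assuming the spark-shell (so sc is defined) and a made-up RDD of (word, count) pairs:

    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // groupByKey + mapValues ships every value across the network, then reduces
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines on the map side before the shuffle
    val viaReduce = pairs.reduceByKey(_ + _)

    // aggregateByKey is like reduceByKey but the result type can differ from the
    // value type, e.g. computing (sum, count) per key in one pass
    val sumAndCount = pairs.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (a, b)   => (a._1 + b._1, a._2 + b._2))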

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread DB Tsai
Can you check the loss of both the LBFGS and SGD implementations? One reason may be that SGD doesn't converge well, and you can see that by comparing both log-likelihoods. Another potential reason may be that the labels of your training data are totally separable, so you can always increase the log-likelihood by
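
A rough sketch of one way to compare the two losses directly (not code from this thread; it assumes an RDD[LabeledPoint] named training with 0/1 labels):

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}

    val lbfgsModel = new LogisticRegressionWithLBFGS().run(training)
    val sgdModel   = LogisticRegressionWithSGD.train(training, 100)
    lbfgsModel.clearThreshold(); sgdModel.clearThreshold()   // predict() now returns probabilities

    def logLoss(model: LogisticRegressionModel) = training.map { p =>
      val prob = model.predict(p.features)
      -(p.label * math.log(prob) + (1 - p.label) * math.log(1 - prob))
    }.mean()

    println("LBFGS loss: " + logLoss(lbfgsModel) + ", SGD loss: " + logLoss(sgdModel))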

Re: aggregateByKey vs combineByKey

2014-09-29 Thread David Rowe
Thanks Liquan, that was really helpful. On Mon, Sep 29, 2014 at 5:54 PM, Liquan Pei liquan...@gmail.com wrote: Hi Dave, You can replace groupByKey with reduceByKey to improve performance in some cases. reduceByKey performs map side combine which can reduce Network IO and shuffle size where

SQL queries fail in 1.2.0-SNAPSHOT

2014-09-29 Thread Wang, Daoyuan
Hi all, I had some of my queries run on 1.1.0-SNAPSHOT at commit b1b20301 (Aug 24), but in the current master branch my queries do not work. I looked into the stderr file on the executor and found the following lines: 14/09/26 16:52:46 ERROR nio.NioBlockTransferService: Exception handling buffer

Re: REPL like interface for Spark

2014-09-29 Thread andy petrella
Heya, I started to port the scala-notebook to Spark some weeks ago (but doing it in my sparse time and for my Spark talks ^^). It's a WIP but works quite fine ftm, you can check my fork and branch over here: https://github.com/andypetrella/scala-notebook/tree/spark Feel free to ask any

Re: REPL like interface for Spark

2014-09-29 Thread moon soo Lee
Hi, There is a project called Zeppelin. You can check it out here https://github.com/NFLabs/zeppelin The homepage is here. http://zeppelin-project.org/ It's a notebook-style tool (like the Databricks demo, or scala-notebook) with a nice UI and built-in Spark integration. It's in active development, so don't

Re: REPL like interface for Spark

2014-09-29 Thread andy petrella
Cool!!! I'll give it a try ASAP! aℕdy ℙetrella about.me/noootsab [image: aℕdy ℙetrella on about.me] http://about.me/noootsab On Mon, Sep 29, 2014 at 10:48 AM, moon soo Lee leemoon...@gmail.com wrote: Hi, There is project called Zeppelin. You can checkout here

Re: REPL like interface for Spark

2014-09-29 Thread andy petrella
However (I must say ^^) it's funny that it has been built using plain old Java stuff :-D. aℕdy ℙetrella about.me/noootsab [image: aℕdy ℙetrella on about.me] http://about.me/noootsab On Mon, Sep 29, 2014 at 10:51 AM, andy petrella andy.petre...@gmail.com wrote: Cool!!! I'll give it

Re: REPL like interface for Spark

2014-09-29 Thread moon soo Lee
There's a bit of history behind using Java. Any feedback is warmly welcomed. On Mon, Sep 29, 2014 at 5:57 PM, andy petrella andy.petre...@gmail.com wrote: However (I must say ^^) it's funny that it has been built using plain old Java stuff :-D. aℕdy ℙetrella about.me/noootsab

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread Yanbo Liang
Thank you all for your patient responses. I can conclude that if the data is totally separable or over-fitting occurs, the weights may be different. This is also consistent with my experiments. I have evaluated two different datasets and the results are as follows: Loss function: LogisticGradient Regularizer: L2

Re: Workers disconnected from master sometimes and never reconnect back

2014-09-29 Thread Romi Kuntsman
Hi all, Regarding a post here a few months ago http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-tp6240.html Is there an answer to this? I saw workers being still active and not reconnecting after they lost connection to the

The confusion order of rows in SVD matrix ?

2014-09-29 Thread buring
Hi: I want to use SVD in my work. I tried some examples and have some confusions. The input is the following 4*3 matrix: [2 0 0; 0 3 2; 0 3 1; 2 0 3]. My input file, which corresponds to that matrix in (row col value) form, is as follows: 0 0 2 1 1 3 1 2

Re: The confusion order of rows in SVD matrix ?

2014-09-29 Thread Sean Owen
The RDD you define has no particular ordering. So the order that you encounter the elements (rows) with an operation like take or collect isn't defined. You can try to sort the RDD by the row number before that key is discarded. On Mon, Sep 29, 2014 at 2:58 PM, buring qyqb...@gmail.com wrote:
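
A minimal sketch of that suggestion, assuming a hypothetical coordinate-format input file coords.txt with "row col value" triples for the 4x3 matrix above:

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val entries = sc.textFile("coords.txt").map { line =>
      val Array(r, c, v) = line.trim.split("\\s+").map(_.toDouble)
      (r.toLong, (c.toInt, v))
    }
    // keep the row index as the key, sort by it, and only then drop it
    val rows = entries.groupByKey().sortByKey().map { case (_, cols) =>
      Vectors.sparse(3, cols.toSeq)
    }
    val svd = new RowMatrix(rows).computeSVD(3, computeU = true)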

Unresolved attributes: SparkSQL on the schemaRDD

2014-09-29 Thread vdiwakar.malladi
Hello, I'm exploring SparkSQL and I'm facing an issue while using queries. Any help on this is appreciated. I have the following schema once loaded as an RDD. root |-- data: array (nullable = true) | |-- element: struct (containsNull = false) | | |-- age: integer (nullable = true) |

Re: Unresolved attributes: SparkSQL on the schemaRDD

2014-09-29 Thread Cheng Lian
In your case, the table has only one row, whose content is “data”, which is an array. You need something like SELECT data[0].name FROM json_table to access the name field. On 9/29/14 11:08 PM, vdiwakar.malladi wrote: Hello, I'm exploring SparkSQL and I'm facing an issue while using the
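
A small sketch of the full round trip in Spark 1.1, assuming a hypothetical people.json whose top-level field is the "data" array:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val people = sqlContext.jsonFile("people.json")
    people.printSchema()                      // data: array<struct<age:int, name:string, ...>>
    people.registerTempTable("json_table")
    sqlContext.sql("SELECT data[0].name FROM json_table").collect().foreach(println)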

Re: yarn does not accept job in cluster mode

2014-09-29 Thread Tamas Jambor
thanks for the reply. As I mentioned above, all works in yarn-client mode, the problem starts when I try to run it in yarn-cluster mode. (seems that spark-shell does not work in yarn-cluster mode, so cannot debug that way). On Mon, Sep 29, 2014 at 7:30 AM, Akhil Das ak...@sigmoidanalytics.com

Re: Is it possible to use Parquet with Dremel encoding

2014-09-29 Thread matthes
Thank you so much, guys, for helping me, but I have some more questions about it! Do we have to presort the columns to get the benefits of the run-length encoding, or do I have to group the data first and wrap it into a case class? I tried to sort the data first and write it out, and I get different

Spark SQL + Hive + JobConf NoClassDefFoundError

2014-09-29 Thread Patrick McGloin
Hi, I have an error when submitting a Spark SQL application to our Spark cluster: 14/09/29 16:02:11 WARN scheduler.TaskSetManager: Loss was due to java.lang.NoClassDefFoundError *java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobConf* at

Simple Question: Spark Streaming Applications

2014-09-29 Thread Saiph Kappa
Hi, Do all spark streaming applications use the map operation? or the majority of them? Thanks.

Re: Simple Question: Spark Streaming Applications

2014-09-29 Thread Liquan Pei
Hi Saiph, Map is used for transformation on your input RDD. If you don't need transformation of your input, you don't need to use map. Thanks, Liquan On Mon, Sep 29, 2014 at 10:15 AM, Saiph Kappa saiph.ka...@gmail.com wrote: Hi, Do all spark streaming applications use the map operation? or

When to start optimizing for GC?

2014-09-29 Thread Ashish Jain
Hello, I have written a standalone Spark job which I run through the Ooyala Job Server. The program is working correctly; now I'm looking into how to optimize it. My program without optimization took 4 hours to run. The first optimization of using KryoSerializer and compiling the regex pattern once and reusing
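
For reference, a sketch of the two optimizations mentioned (class and file names here are placeholders, not the poster's code):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-job")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyRegistrator")   // optional custom registrator
    val sc = new SparkContext(conf)

    // compile the regex once per partition instead of once per record
    val cleaned = sc.textFile("input.txt").mapPartitions { lines =>
      val pattern = java.util.regex.Pattern.compile("""\d+""")
      lines.map(line => pattern.matcher(line).replaceAll("#"))
    }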

Re: iPython notebook ec2 cluster matplotlib not found?

2014-09-29 Thread Andy Davidson
Hi Nicholas Yes, out of the box PySpark works. My problem is that I am using the iPython notebook and matplotlib is not found. It seems that out of the box the cluster has an old version of Python and iPython notebook. It was suggested I upgrade iPython because the new version includes matplotlib. This

Ack RabbitMQ messages after processing through Spark Streaming

2014-09-29 Thread khaledh
Hi, I'm currently investigating whether it's possible in Spark Streaming to send back ack's to RabbitMQ after a message has gone through the processing pipeline. The problem is that the Receiver is the one who has the RabbitMQ channel open for receiving messages, but due to reliability concerns

Re: Unresolved attributes: SparkSQL on the schemaRDD

2014-09-29 Thread vdiwakar.malladi
Thanks for your prompt response. Still, on a further note, I'm getting an exception while executing the query. SELECT data[0].name FROM people where data[0].age =13 *Exception in thread main java.lang.RuntimeException: [1.46] failure: ``UNION'' expected but identifier .age found SELECT

Re: how to run spark job on yarn with jni lib?

2014-09-29 Thread mbaryu
You will also need to run 'ldconfig' on each host to read the ld.so.conf file and make it active. You might also need to stop Spark (the JVM) on each node to cause the loader to reload for those processes. -- View this message in context:

Window comparison matching using the sliding window functionality: feasibility

2014-09-29 Thread nitinkak001
I need to know the feasibility of the task below. I am thinking of this as a MapReduce/Spark effort. I need to run a distributed sliding-window comparison for digital data matching on top of Hadoop. The data (a Hive table) will be partitioned and distributed across data nodes. Then the window

Re: iPython notebook ec2 cluster matplotlib not found?

2014-09-29 Thread Benjamin Zaitlen
Hi Andy, I built an Anaconda/Spark AMI a few months ago. I'm still iterating on it, so if things break please report them. If you want to give it a whirl: ./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa -a ami-3ecd0c56 The nice thing about Anaconda is that it comes pre-baked with ipython-notebook,

Re: iPython notebook ec2 cluster matplotlib not found?

2014-09-29 Thread Andy Davidson
Hi Nicholas I wrote some test code and found a way to get my matplotlib script to work with the out of the box cluster created by spark-ec2 1. I commented out the python inline magic #%matplotlib inline 2. Replace #clear_output(wait=True) clear_output(True) The instructions of

Re: Does Spark Driver works with HDFS in HA mode

2014-09-29 Thread Petr Novak
Thank you. HADOOP_CONF_DIR has been missing. On Wed, Sep 24, 2014 at 4:48 PM, Matt Narrell matt.narr...@gmail.com wrote: Yes, this works. Make sure you have HADOOP_CONF_DIR set on your Spark machines mn On Sep 24, 2014, at 5:35 AM, Petr Novak oss.mli...@gmail.com wrote: Hello, if our

Re: Unresolved attributes: SparkSQL on the schemaRDD

2014-09-29 Thread Yin Huai
What version of Spark did you use? Can you try the master branch? On Mon, Sep 29, 2014 at 1:52 PM, vdiwakar.malladi vdiwakar.mall...@gmail.com wrote: Thanks for your prompt response. Still on further note, I'm getting the exception while executing the query. SELECT data[0].name FROM people

Re: SQL queries fail in 1.2.0-SNAPSHOT

2014-09-29 Thread Reynold Xin
Hi Daoyuan, Do you mind applying this patch and looking at the exception again? https://github.com/apache/spark/pull/2580 It has also been merged into master, so if you pull from master you should have it. On Mon, Sep 29, 2014 at 1:17 AM, Wang, Daoyuan daoyuan.w...@intel.com wrote: Hi all,

Re: Unresolved attributes: SparkSQL on the schemaRDD

2014-09-29 Thread Akhil Das
This one explains it nicely http://www.dofactory.com/topic/1816/spark-performing-a-join-and-getting-results-back-in-a-strongly-typed-collection.aspx Thanks Best Regards On Tue, Sep 30, 2014 at 12:57 AM, Yin Huai huaiyin@gmail.com wrote: What version of Spark did you use? Can you try the

Re: Unresolved attributes: SparkSQL on the schemaRDD

2014-09-29 Thread Akhil Das
Sorry. Not that, this one http://arjon.es/2014/07/01/processing-json-with-spark-sql/ Thanks Best Regards On Tue, Sep 30, 2014 at 1:43 AM, Akhil Das ak...@sigmoidanalytics.com wrote: This one explains it nicely

Schema change on Spark Hive (Parquet file format) table not working

2014-09-29 Thread barge.nilesh
I am using following releases: Spark 1.1 (built using */sbt/sbt -Dhadoop.version=2.2.0 -Phive assembly/*) , Apache HDFS 2.2 My job is able to create/add/read data in hive, parquet formatted, tables using HiveContext. But, after changing schema, job is not able to read existing data and throws

Using addFile with pipe on a yarn cluster

2014-09-29 Thread esamanas
Hi, I've been using pyspark with my YARN cluster with success. The work I'm doing involves using the RDD's pipe command to send data through a binary I've made. I can do this easily in pyspark like so (assuming 'sc' is already defined): sc.addFile("./dumb_prog") t = sc.parallelize(range(10))
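
The Scala equivalent of that pattern looks roughly like the sketch below ("dumb_prog" is the poster's placeholder binary). Note that SparkFiles.get here is evaluated on the driver when the command string is built, which is part of why behaviour can differ between local mode and YARN:

    import org.apache.spark.SparkFiles

    sc.addFile("./dumb_prog")                  // ship the binary to every executor
    val t = sc.parallelize(1 to 10)
    val piped = t.pipe(SparkFiles.get("dumb_prog"))
    println(piped.count())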

about partition number

2014-09-29 Thread anny9699
Hi, I read the past posts about partition number, but am still a little confused about partitioning strategy. I have a cluster with 8 workers and 2 cores for each worker. Is it true that the optimal partition number should be 2-4 * total_coreNumber, or should it approximately equal total_coreNumber?

Re: about partition number

2014-09-29 Thread Daniel Siegmann
A task is the work to be done on a partition for a given stage - you should expect the number of tasks to be equal to the number of partitions in each stage, though a task might need to be rerun (due to failure or need to recompute some data). 2-4 times the cores in your cluster should be a good

Fwd: about partition number

2014-09-29 Thread Liquan Pei
-- Forwarded message -- From: Liquan Pei liquan...@gmail.com Date: Mon, Sep 29, 2014 at 2:12 PM Subject: Re: about partition number To: anny9699 anny9...@gmail.com The number of cores available in your cluster determines the number of tasks that can be run concurrently. If your

Re: IOException running streaming job

2014-09-29 Thread Arun Ahuja
We are also seeing this PARSING_ERROR(2) error due to Caused by: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362) at

Re: about partition number

2014-09-29 Thread Liquan Pei
Hi Anny, Many more partitions than that is not recommended in general, as it creates a lot of small tasks. All tasks need to be sent to worker nodes for execution, so too many partitions increase task scheduling overhead. Spark uses a synchronous execution model, which means that all tasks in a stage need
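
A small sketch of the knobs being discussed (paths and multipliers are placeholders):

    val cores = sc.defaultParallelism                 // rough proxy for total cores granted

    // set the partition count at read time...
    val data = sc.textFile("hdfs:///input.txt", cores * 3)

    // ...or adjust it afterwards
    val resized = data.repartition(cores * 3)
    println(resized.partitions.size)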

partitions number with variable number of cores

2014-09-29 Thread Jonathan Esterhazy
I use Spark in a cluster shared with other applications. The number of nodes (and cores) assigned to my job varies depending on how many unrelated jobs are running in the same cluster. Is there any way for me to determine at runtime how many cores have been allocated to my job, so I can select an
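
A sketch of two heuristics sometimes used for this at runtime (neither gives an exact core count; treat them as approximations):

    // number of registered executors (the driver also appears in this list)
    val executors = sc.getExecutorStorageStatus.length - 1

    // default parallelism usually tracks the cores granted to the application
    val cores = sc.defaultParallelism

    val rdd = sc.textFile("input.txt").repartition(cores * 2)   // e.g. 2x cores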

Spark Language Integrated SQL for join on expression

2014-09-29 Thread Benyi Wang
scala> user res19: org.apache.spark.sql.SchemaRDD = SchemaRDD[0] at RDD at SchemaRDD.scala:98 == Query Plan == ParquetTableScan [id#0,name#1], (ParquetRelation /user/hive/warehouse/user), None scala> order res20: org.apache.spark.sql.SchemaRDD = SchemaRDD[72] at RDD at SchemaRDD.scala:98 == Query

in memory assumption in cogroup?

2014-09-29 Thread Koert Kuipers
Apologies for asking yet again about Spark memory assumptions, but I can't seem to keep it in my head. If I use PairRDDFunctions.cogroup, it returns two iterables for every key. Do the contents of these iterables have to fit in memory, or is the data streamed?

ExecutorLostFailure kills sparkcontext

2014-09-29 Thread jamborta
hi all, I have a problem with my application when I increase the data size over 5GB (the cluster has about 100GB memory to handle that). First I get this warning: WARN TaskSetManager: Lost task 10.1 in stage 4.1 (TID 408, backend-node1): FetchFailed(BlockManagerId(3, backend-node0, 41484, 0),

Re: Workers disconnected from master sometimes and never reconnect back

2014-09-29 Thread Andrew Ash
Hi Romi, I've observed this many times as well. So much so that on some clusters I restart the workers every night in order to maintain these worker - master connections. I couldn't find an open SPARK ticket on it so filed https://issues.apache.org/jira/browse/SPARK-3736 with you and Piotr

Re: in memory assumption in cogroup?

2014-09-29 Thread Liquan Pei
Hi Koert, cogroup is a transformation on an RDD; it creates a CoGroupedRDD and then performs some transformations on it. When an action is later called, the compute() method of the CoGroupedRDD will be called. Roughly speaking, each element in the CoGroupedRDD is fetched one at a time. Thus the contents of
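
A tiny example of what cogroup hands back, one pair of Iterables per key (made-up data):

    import org.apache.spark.SparkContext._

    val a = sc.parallelize(Seq(1 -> "x", 1 -> "y", 2 -> "z"))
    val b = sc.parallelize(Seq(1 -> 10, 3 -> 30))
    a.cogroup(b).collect().foreach { case (k, (left, right)) =>
      println(k + " -> " + left.toList + " / " + right.toList)
    }
    // 1 -> List(x, y) / List(10)
    // 2 -> List(z)    / List()
    // 3 -> List()     / List(30)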

newbie system architecture problem, trouble using streaming and RDD.pipe()

2014-09-29 Thread Andy Davidson
Hello I am trying to build a system that does a very simple calculation on a stream and displays the results in a graph that I want to update every second or so. I think I have a fundamental misunderstanding about how streams and rdd.pipe() work. I want to do the data visualization

Re: Spark Language Integrated SQL for join on expression

2014-09-29 Thread Michael Armbrust
I'll note that the DSL is pretty experimental. That said you should be able to do something like user.id.attr On Mon, Sep 29, 2014 at 3:39 PM, Benyi Wang bewang.t...@gmail.com wrote: scala user res19: org.apache.spark.sql.SchemaRDD = SchemaRDD[0] at RDD at SchemaRDD.scala:98 == Query Plan

Re: shuffle memory requirements

2014-09-29 Thread maddenpj
Hey Ameet, Thanks for the info, I'm running into the same issue myself and my last attempt crashed and my ulimit was 16834. I'm going to up it and try again, but yea I would like to know the best practice for computing this. Can you talk about the worker nodes, what are their specs? At least 45

Re: Unresolved attributes: SparkSQL on the schemaRDD

2014-09-29 Thread vdiwakar.malladi
I'm using the latest version i.e. Spark 1.1.0 Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unresolved-attributes-SparkSQL-on-the-schemaRDD-tp15339p15376.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Reading from HBase is too slow

2014-09-29 Thread Tao Xiao
I submitted a job in Yarn-Client mode, which simply reads from an HBase table containing tens of millions of records and then does a *count* action. The job runs for a much longer time than I expected, so I wonder whether it was because the data to read was too much. Actually, there are 20 nodes in

Re: Reading from HBase is too slow

2014-09-29 Thread Tao Xiao
I submitted the job in Yarn-Client mode using the following script: export SPARK_JAR=/usr/games/spark/xt/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar export HADOOP_CLASSPATH=$(hbase classpath) export

Re: Reading from HBase is too slow

2014-09-29 Thread Nan Zhu
can you look at your HBase UI to check whether your job is just reading from a single region server? Best, -- Nan Zhu On Monday, September 29, 2014 at 10:21 PM, Tao Xiao wrote: I submitted a job in Yarn-Client mode, which simply reads from a HBase table containing tens of millions of

RE: SQL queries fail in 1.2.0-SNAPSHOT

2014-09-29 Thread Wang, Daoyuan
Hi Reynold, Seems I am getting a much larger offset than file size. reading org.apache.spark.network.FileSegmentManagedBuffer(/mnt/DP_disk2/animal/spark/spark-local-20140930102549-622d/11/shuffle_6_191_0.data, 3154043, 588396) (actual file length 676025) at

RE: SQL queries fail in 1.2.0-SNAPSHOT

2014-09-29 Thread Wang, Daoyuan
And the /mnt/DP_disk2/animal/spark/spark-local-20140930102549-622d/11/shuffle_6_191_0.data file is comparatively much smaller than other shuffle*.data files From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com] Sent: Tuesday, September 30, 2014 10:54 AM To: Reynold Xin Cc: user@spark.apache.org

Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Haopu Wang
I took a look at HashOuterJoin and it builds a hash table for both sides. This consumes quite a lot of memory when the partition is big. And it doesn't reduce the iteration on the streamed relation, right? Thanks!

Re: Reading from HBase is too slow

2014-09-29 Thread Russ Weeks
Hi, Tao, When I used newAPIHadoopRDD (Accumulo not HBase) I found that I had to specify executor-memory and num-executors explicitly on the command line or else I didn't get any parallelism across the cluster. I used --executor-memory 3G --num-executors 24 but obviously other parameters will be

Re: Reading from HBase is too slow

2014-09-29 Thread Vladimir Rodionov
HBase TableInputFormat creates one input split per region. You cannot achieve a high level of parallelism unless you have at least 5-10 regions per RegionServer. What does that mean? You probably have too few regions. You can verify that in the HBase Web UI. -Vladimir Rodionov On Mon, Sep 29, 2014 at 7:21
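
For context, the usual read path looks like the sketch below (the table name is a placeholder and the HBase jars must be on the classpath). One Spark partition is created per HBase region, so partitions.size shows how much parallelism the scan can get:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")
    val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])
    println("partitions = " + hbaseRDD.partitions.size)   // should match the region count
    println("rows = " + hbaseRDD.count())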

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Liquan Pei
Hi Haopu, My understanding is that the hashtable on both left and right side is used for including null values in result in an efficient manner. If hash table is only built on one side, let's say left side and we perform a left outer join, for each row in left side, a scan over the right side is

Re: MLlib 1.2 New Interesting Features

2014-09-29 Thread Krishna Sankar
Thanks Xiangrui. Appreciate the insights. I have uploaded the initial version of my presentation at http://goo.gl/1nBD8N Cheers k/ On Mon, Sep 29, 2014 at 12:17 AM, Xiangrui Meng men...@gmail.com wrote: Hi Krishna, Some planned features for MLlib 1.2 can be found via Spark JIRA:

RE: SQL queries fail in 1.2.0-SNAPSHOT

2014-09-29 Thread Wang, Daoyuan
Also some lines on another node : 14/09/30 10:22:31 ERROR nio.NioBlockTransferService: Exception handling buffer message java.io.IOException: Error in reading org.apache.spark.network.FileSegmentManagedBuffer(/mnt/DP_disk10/animal/spark/spark-local-20140930101701-c9ee/38/shuffle_6_162_0.data,