Re: Execution stalls in LogisticRegressionWithSGD

2014-07-06 Thread Xiangrui Meng
Hi Bharath, 1) Did you sync the Spark jar and conf to the worker nodes after the build? 2) Since the dataset is not large, could you try local mode first using `spark-submit --driver-memory 12g --master local[*]`? 3) Try using fewer partitions, say 5. If the problem is still there, please
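
For reference, a minimal spark-shell-style sketch (not from the thread) of points 2 and 3: launch against local[*] with extra driver memory via `spark-submit --driver-memory 12g --master local[*]`, and repartition the training data down to a handful of partitions. The data path and iteration count are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.util.MLUtils

    val conf = new SparkConf().setAppName("LRLocalTest").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Load LIBSVM-formatted data and cut the partition count down to ~5 (point 3).
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  // placeholder path
      .repartition(5)
      .cache()

    val model = LogisticRegressionWithSGD.train(data, 100)  // 100 iterations, placeholder
    println(s"weights: ${model.weights}")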

Re: reading compressed LZO files

2014-07-06 Thread Gurvinder Singh
On 07/06/2014 05:19 AM, Nicholas Chammas wrote: On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh gurvinder.si...@uninett.no wrote: csv =

Re: window analysis with Spark and Spark streaming

2014-07-06 Thread alessandro finamore
On 5 July 2014 23:08, Mayur Rustagi [via Apache Spark User List] ml-node+s1001560n8860...@n3.nabble.com wrote: Key idea is to simulate your app time as you enter data. So you can connect Spark Streaming to a queue and insert data in it spaced by time. Easier said than done :). I see. I'll try
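
A rough sketch of the queue-based replay Mayur describes, using Spark Streaming's queueStream; the batch/window durations and the record type are assumptions, not from the thread.

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("ReplayByAppTime").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // A driver-side queue that a feeder thread fills, one RDD per simulated time step.
    val queue = new mutable.SynchronizedQueue[RDD[String]]()
    val stream = ssc.queueStream(queue)

    // Example window analysis over the replayed data: 30s windows, sliding every 10s.
    stream.window(Seconds(30), Seconds(10)).count().print()

    ssc.start()
    // Elsewhere, a thread pushes data spaced by the app's own timestamps, e.g.:
    //   queue += ssc.sparkContext.parallelize(recordsForThisStep)
    ssc.awaitTermination()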

Adding and subtracting workers on Spark EC2 cluster

2014-07-06 Thread Robert James
If I've created a Spark EC2 cluster, how can I add or take away workers? Also: If I use EC2 spot instances, what happens when Amazon removes them? Will my computation be saved in any way, or will I need to restart from scratch? Finally: The spark-ec2 scripts seem to use Hadoop 1. How can I

Re: reading compressed LZO files

2014-07-06 Thread Nicholas Chammas
Ah, indeed it looks like I need to install this separately https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1 as it is not part of the core. Nick On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh gurvinder.si...@uninett.no wrote: On 07/06/2014 05:19 AM,

Re: Adding and subtracting workers on Spark EC2 cluster

2014-07-06 Thread Nicholas Chammas
On Sun, Jul 6, 2014 at 10:10 AM, Robert James srobertja...@gmail.com wrote: If I've created a Spark EC2 cluster, how can I add or take away workers? There is a manual process by which this is possible, but I’m not sure of the procedure. There is also SPARK-2008

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread vs
Konstantin, HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf Let me know if you see issues with the tech preview. Spark Pi example on HDP 2.0 I downloaded Spark 1.0 pre-built from

Re: reading compressed LZO files

2014-07-06 Thread Sean Owen
Pardon, I was wrong about this. There is actually code distributed under com.hadoop, and that's where this class is. Oops. https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/source/browse/trunk/src/java/com/hadoop/mapreduce/LzoTextInputFormat.java On Sun, Jul 6, 2014 at 6:37
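
For completeness, a sketch of reading LZO-compressed text through that input format (assuming the hadoop-lzo / hadoop-gpl-compression jar and native libraries discussed in this thread are installed and on the classpath); the path is a placeholder.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.hadoop.io.{LongWritable, Text}
    import com.hadoop.mapreduce.LzoTextInputFormat

    val sc = new SparkContext(new SparkConf().setAppName("ReadLzo"))

    // Each record is (byte offset, line); keep just the line text.
    val csv = sc.newAPIHadoopFile(
      "hdfs:///data/input.csv.lzo",        // placeholder path
      classOf[LzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text]
    ).map(_._2.toString)

    println(csv.count())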

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-06 Thread Koert Kuipers
probably a dumb question, but why is reference equality used for the indexes? On Sun, Jul 6, 2014 at 12:43 AM, Ankur Dave ankurd...@gmail.com wrote: When joining two VertexRDDs with identical indexes, GraphX can use a fast code path (a zip join without any hash lookups). However, the check

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Konstantin Kudryavtsev
Hello, thanks for your message... I'm confused: Hortonworks suggests installing the Spark RPM on each node, but the Spark main page says that YARN is enough and I don't need to install it... What's the difference? sent from my HTC On Jul 6, 2014 8:34 PM, vs vinayshu...@gmail.com wrote: Konstantin, HWRK

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Marco Shaw
Can you provide links to the sections that are confusing? My understanding is that the HDP1 binaries do not need YARN, while the HDP2 binaries do. Now, you can also install the Hortonworks Spark RPM... For production, in my opinion, RPMs are better for manageability. On Jul 6, 2014, at 5:39 PM,

Re: reading compressed LZO files

2014-07-06 Thread Andrew Ash
Hi Nick, The cluster I was working on in those linked messages was a private data center cluster, not on EC2. I'd imagine that the setup would be pretty similar, but I'm not familiar with the EC2 init scripts that Spark uses. Also I upgraded that cluster to 1.0 recently and am continuing to use

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Konstantin Kudryavtsev
Marco, Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try from http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf HDP 2.1 means YARN; at the same time they propose to install the RPM. On the other hand, http://spark.apache.org/ says Integrated with

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Marco Shaw
That is confusing based on the context you provided. This might take more time than I can spare to try to understand. For sure, you need to add Spark to run it in/on the HDP 2.1 express VM. Cloudera's CDH 5 express VM includes Spark, but the service isn't running by default. I can't

Controlling amount of data sent to slaves

2014-07-06 Thread asylvest
I'm in the process of evaluating Spark to see if it's a fit for my CPU-intensive application. Many operations in my chain are highly parallelizable, but some require a minimum number of rows of an input image in order to operate. Is there a way to give Spark a minimum and/or maximum size to send
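
There is no built-in "minimum rows per task" knob that I know of; the usual workaround (a sketch under assumed sizes, with processBlock standing in for the real per-block operation) is to pick the partition count from the row requirement and then work on whole partitions with mapPartitions.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("MinRowsPerTask"))

    val totalRows = 100000      // placeholder: rows in the input image
    val minRowsPerTask = 512    // placeholder: rows the operation needs at once
    val numPartitions = math.max(1, totalRows / minRowsPerTask)

    // Stand-in for the real per-block image operation.
    def processBlock(rows: Array[String]): Int = rows.length

    val rows = sc.textFile("hdfs:///data/image_rows.txt")  // placeholder input
      .repartition(numPartitions)

    // mapPartitions hands each task every row of its partition in one call,
    // so each call sees roughly totalRows / numPartitions rows (barring skew).
    val processed = rows.mapPartitions { it =>
      Iterator(processBlock(it.toArray))
    }
    println(processed.count())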

Data loading to Parquet using spark

2014-07-06 Thread Shaikh Riyaz
Hi, We are planning to use Spark to load data into Parquet, and this data will be queried by Impala for visualization through Tableau. Can we achieve this flow? How do we load data into Parquet from Spark? Will Impala be able to access the data loaded by Spark? I will greatly appreciate if
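
A minimal sketch of the Spark side using the Spark 1.0-era SQL API; the case class, schema, and paths are assumptions, not from the thread.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Record(id: Int, name: String)   // assumed schema

    val sc = new SparkContext(new SparkConf().setAppName("LoadToParquet"))
    val sqlContext = new SQLContext(sc)
    import sqlContext._   // brings in createSchemaRDD for RDDs of case classes

    val records = sc.textFile("hdfs:///data/input.csv")   // placeholder input
      .map(_.split(","))
      .map(f => Record(f(0).toInt, f(1)))

    // The implicit conversion turns the RDD into a SchemaRDD, written out as Parquet.
    records.saveAsParquetFile("hdfs:///warehouse/records_parquet")

On the Impala side, pointing an external table at that directory (CREATE EXTERNAL TABLE ... STORED AS PARQUET, or PARQUETFILE on older versions, followed by a metadata refresh) is the usual way to expose the files, though the exact DDL depends on the Impala version.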

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Robert James
I can say from my experience that getting Spark to work with Hadoop 2 is not for the beginner; after solving one problem after another (dependencies, scripts, etc.), I went back to Hadoop 1. Spark's Maven build, EC2 scripts, and others all use Hadoop 1 - not sure why, but, given that, Hadoop 2 has too

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-06 Thread Ankur Dave
Well, the alternative is to do a deep equality check on the index arrays, which would be somewhat expensive since these are pretty large arrays (one element per vertex in the graph). But, in case the reference equality check fails, it actually might be a good idea to do the deep check before
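
For what it's worth, a small sketch of staying on the fast path from the user side (assumed, not from the thread): deriving the second VertexRDD from the first with mapValues reuses the same index object, so the reference-equality check passes and the zip join applies.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx._

    val sc = new SparkContext(new SparkConf().setAppName("VertexJoinFastPath"))

    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")  // placeholder

    val degrees: VertexRDD[Int] = graph.degrees
    // mapValues keeps the underlying index, so the indexes are reference-equal.
    val doubled: VertexRDD[Int] = degrees.mapValues(d => d * 2)
    val joined = degrees.innerJoin(doubled) { (id, d, d2) => d + d2 }
    println(joined.count())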