Re: Spark 2.0 with Hadoop 3.0?

2016-10-28 Thread Zoltán Zvara
It worked for me two weeks ago with a 3.0.0-alpha2 snapshot; I just changed hadoop.version while building.

On Fri, Oct 28, 2016, 11:50 Sean Owen wrote:
> I don't think it works, but there is no Hadoop 3.0 right now either. As
> the version implies, it's going to be somewhat […]
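For reference, a minimal sketch of such a build invocation, assuming Spark's standard Maven build; the profiles and the exact Hadoop version string depend on your setup:

    ./build/mvn -Pyarn -Dhadoop.version=3.0.0-alpha2-SNAPSHOT -DskipTests clean package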

Re: Spark Streaming updateStateByKey Implementation

2015-11-08 Thread Zoltán Zvara
It is implemented with cogroup. Basically, Spark Streaming keeps the states in a separate RDD, which is hidden from you, and cogroups the incoming batch RDD with that state RDD. See StateDStream.scala; everything you need to know is there.

On Fri, Nov 6, 2015 at 6:25 PM Hien Luu wrote:
> Hi,
>
> I […]
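To make that concrete, here is a simplified, illustrative sketch of the cogroup-based update; the names and signatures here are mine, the real logic lives in StateDStream.scala:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Each batch: cogroup the new (key, value) pairs with the previous
    // state RDD, then apply the user's update function per key.
    def updateState[K: ClassTag, V: ClassTag, S: ClassTag](
        newBatch: RDD[(K, V)],
        prevState: RDD[(K, S)],
        updateFunc: (Seq[V], Option[S]) => Option[S]): RDD[(K, S)] =
      newBatch.cogroup(prevState).flatMap { case (key, (values, states)) =>
        // Returning None drops the key from the state RDD.
        updateFunc(values.toSeq, states.headOption).map(s => (key, s))
      }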

Re: Shuffle Write v/s Shuffle Read

2015-10-02 Thread Zoltán Zvara
Hi, shuffle output goes to local disk each time; as far as I know, it is never kept in memory.

On Fri, Oct 2, 2015 at 1:26 PM Adrian Tanase wrote:
> I’m not sure this is related to memory management – the shuffle is the
> central act of moving data around nodes when the computations […]

Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zoltán Zvara
Hey, I'd try to debug and profile ResolvedDataSource. As far as I know, your write will be performed by the JVM.

On Mon, Sep 7, 2015 at 4:11 PM Tóth Zoltán wrote:
> Unfortunately I'm getting the same error:
> The other interesting things are that:
> - the parquet files got […]

Re: What's the best practice for developing new features for spark ?

2015-08-19 Thread Zoltán Zvara
I personally build with SBT and run Spark on YARN from IntelliJ. You need to attach a remote debugger to the remote JVMs. You need to do the same even if you use Python, because PySpark launches a JVM on the driver as well.

On Wed, Aug 19, 2015 at 2:10 PM canan chen <ccn...@gmail.com> wrote: […]
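As an illustration, the executor JVMs can be opened up for a remote debugger through extra JVM options; the port below is an arbitrary choice, and executors sharing a node would clash on a fixed port. The driver JVM needs the same agent flags passed at launch time instead (e.g. via --driver-java-options), since it is already running by the time SparkConf is read:

    import org.apache.spark.SparkConf

    // Make each executor JVM listen for a remote debugger (e.g. IntelliJ's);
    // suspend=n so executors don't block waiting for a debugger to attach.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005")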

Re: Always two tasks slower than others, and then job fails

2015-08-14 Thread Zoltán Zvara
Data skew is still a problem with Spark.
- If you use groupByKey, try to express your logic without groupByKey (see the sketch after this list).
- If you really need groupByKey, all you can do is scale vertically.
- If you can, repartition with a finer-grained HashPartitioner. You will have many tasks for each stage, but tasks […]
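As a sketch of the first point, a per-key sum written with reduceByKey instead of groupByKey combines values map-side before the shuffle, so far less data moves for skewed keys; the names here are illustrative:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    def sumPerKey(pairs: RDD[(String, Long)]): RDD[(String, Long)] =
      // pairs.groupByKey().mapValues(_.sum) // ships every value of a key
      pairs.reduceByKey(_ + _)               // combines locally first

    // Finer-grained partitioning spreads the keys over more tasks:
    // val finer = pairs.partitionBy(new HashPartitioner(400))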

Re: What is the Effect of Serialization within Stages?

2015-08-13 Thread Zoltán Zvara
Within a stage, serialization only occurs when you are using Python and, as far as I know, only in the first stage, when the data is read and passed to the Python interpreter for the first time. Multiple operations are just chains of simple map and flatMap operators at the task level on simple Scala […]
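A purely conceptual sketch of what such a chain boils down to at the task level within one stage (this is not Spark's actual code):

    // Chained operators become nested transformations over one partition's
    // iterator; elements flow through lazily, one at a time, with no
    // serialization or materialization between the steps.
    def runTask(partition: Iterator[String]): Iterator[Int] =
      partition
        .flatMap(_.split(" ")) // step 1, fused with...
        .map(_.length)         // ...step 2 in the same task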

Re: YARN mode startup takes too long (10+ secs)

2015-05-08 Thread Zoltán Zvara
[…], but essentially the same place that Zoltán Zvara picked:

15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster04:55237/user/Executor#-149550753] with ID 1

When I […]

Re: JAVA for SPARK certification

2015-05-05 Thread Zoltán Zvara
I might join this conversation with an ask: would someone point me to a decent exercise that approximates the level of this exam (mentioned above)? Thanks!

On Tue, May 5, 2015 at 3:37 PM Kartik Mehta <kartik.meht...@gmail.com> wrote:
> Production - not a whole lot of companies have implemented […]

Re: spark-defaults.conf

2015-04-27 Thread Zoltán Zvara
You should distribute your configuration file to the workers and set the appropriate environment variables, like HADOOP_HOME, SPARK_HOME, HADOOP_CONF_DIR and SPARK_CONF_DIR.

On Mon, Apr 27, 2015 at 12:56 PM James King <jakwebin...@gmail.com> wrote:
> I renamed spark-defaults.conf.template to […]
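For example, a minimal sketch of the relevant exports in conf/spark-env.sh, with placeholder paths for your installation:

    export HADOOP_HOME=/opt/hadoop
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export SPARK_HOME=/opt/spark
    export SPARK_CONF_DIR=$SPARK_HOME/conf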

Re: How to debug Spark on Yarn?

2015-04-27 Thread Zoltán Zvara
You can check the container logs from the RM web UI or, when log aggregation is enabled, with the yarn command. There are other, but less convenient, options.

On Mon, Apr 27, 2015 at 8:53 AM ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
> Spark 1.3
> 1. View stderr/stdout from executor from Web UI: when the job […]
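With log aggregation enabled, the yarn invocation looks like this, where the application ID is a placeholder you can take from the RM web UI:

    yarn logs -applicationId <appId>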

Re: Complexity of transformations in Spark

2015-04-26 Thread Zoltán Zvara
You can work out the complexity of these operators by looking at RDD.scala, basically. There you will find, for example, what happens when you call map on an RDD: it is a simple Scala map function over a simple Iterator of type T. distinct has been implemented with mapping and grouping on the […]
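For instance, distinct boils down to roughly the following, a simplified sketch of what RDD.scala does (tag each element, combine duplicates per key through a shuffle, untag), so its cost is dominated by that shuffle:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    def distinctSketch[T: ClassTag](rdd: RDD[T]): RDD[T] =
      rdd.map(x => (x, null))      // tag every element as a key
         .reduceByKey((x, _) => x) // one shuffle removes duplicates
         .map(_._1)                // drop the tag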