Re: A proposal for Spark 2.0

2015-11-11 Thread Zoltán Zvara
Hi, Reconsidering the execution model behind Streaming would be a good candidate here, as Spark will not be able to provide the low latency and sophisticated windowing semantics that more and more use cases will require. Maybe relaxing the strict batch model would help a lot. (Mainly this would hi

Re: What's the best practice for developing new features for spark ?

2015-08-19 Thread Zoltán Zvara
I personally build with SBT and run Spark on YARN with IntelliJ. You need to connect to the remote JVMs with a remote debugger. You need to do the same if you use Python, because it will launch a JVM on the driver as well. On Wed, Aug 19, 2015 at 2:10 PM canan chen wrote: > Thanks Ted. I notice
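For the remote-debugger part, a hedged sketch: the JDWP agent flags are standard JVM options, but the port number and the exact launch shape below are illustrative for a 1.x-era yarn-client setup, not prescribed by Spark.

```shell
# Let the driver JVM accept a remote debugger on port 5005 (port is illustrative):
spark-submit \
  --master yarn-client \
  --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" \
  your-app.jar

# For executors, the same agent string can be passed via
#   --conf spark.executor.extraJavaOptions="-agentlib:jdwp=..."
# then attach a "Remote" run configuration in IntelliJ pointing at host:5005.
```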

SparkSqlSerializer2

2015-07-03 Thread Zoltán Zvara
Hi, Is there any way to bypass the limitations of SparkSqlSerializer2 in the SQL module? Namely, 1) it does not support complex types, and 2) it assumes key-value pairs. Is there any other pluggable serializer that can be used here? Thanks!

DStream.reduce

2015-06-30 Thread Zoltán Zvara
Why is reduce in DStream implemented with a map, reduceByKey and another map, given that we have an RDD.reduce?
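For intuition, here is the shape the question describes, sketched in plain Scala with no Spark dependency (`reduceViaKeyed` is an illustrative name, not Spark's code): expressing reduce on top of a keyed reduce lets the same primitive do per-partition partial aggregation before the final combine, which is exactly the map → reduceByKey → map chain.

```scala
// Sketch: reduce expressed via a keyed reduce, mirroring the shape of
// DStream.reduce (map to a dummy key, reduce by key, strip the key).
def reduceViaKeyed[T](data: Seq[T])(f: (T, T) => T): T =
  data
    .map(v => ((), v))                                    // map: attach a dummy key
    .groupBy(_._1)                                        // "reduceByKey": group by the key...
    .map { case (_, pairs) => pairs.map(_._2).reduce(f) } // ...and reduce within the group
    .head                                                 // map: strip the key back off

println(reduceViaKeyed(Seq(1, 2, 3, 4))(_ + _))  // 10
```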

Re: YARN mode startup takes too long (10+ secs)

2015-05-11 Thread Zoltán Zvara
. > > In fact, the flow is: allocator.allocateResources() -> sleep -> > allocator.allocateResources() -> sleep … > > But I guess that on the first allocateResources() the allocation is not > fulfilled. So sleep occurs. > > > > *From:* Zoltán Zvara [mailto:zoltan.

Re: Spark remote communication pattern

2015-04-09 Thread Zoltán Zvara
pache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala > > On Thu, Apr 9, 2015 at 1:15 AM, Zoltán Zvara > wrote: > >> Dear Developers, >> >> I'm trying to investigate the communication pattern regarding data-flow

Connect to remote YARN cluster

2015-04-09 Thread Zoltán Zvara
I'm trying to debug Spark in yarn-client mode. On my local, single-node cluster everything works fine, but the remote YARN resource manager rejects my request because of an authentication error. I'm running IntelliJ 14 on Ubuntu and the driver tries to connect to YARN with my local user name. How
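One common workaround when the cluster uses Hadoop's "simple" authentication is to override the client-side user name so the remote ResourceManager sees the cluster account instead of the local desktop user. A hedged sketch (the user name and path below are illustrative):

```shell
# Make the Hadoop client identify as the cluster-side account:
export HADOOP_USER_NAME=spark-user
# Point the client at a copy of the remote cluster's configuration:
export HADOOP_CONF_DIR=/path/to/remote/cluster/conf
# Then launch the driver (or the IntelliJ run configuration) with these
# environment variables set.
```

This only applies to simple authentication; a Kerberized cluster needs a proper ticket instead.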

Spark remote communication pattern

2015-04-09 Thread Zoltán Zvara
Dear Developers, I'm trying to investigate the communication pattern regarding data-flow during execution of a Spark program defined by an RDD chain. I'm investigating from the Task point of view, and found out that the task type ResultTask (as retrieving the iterator for its RDD for a given parti

RDD firstParent

2015-04-08 Thread Zoltán Zvara
It does not seem safe to call RDD.firstParent from anywhere, as it might throw a java.util.NoSuchElementException: "head of empty list". This seems to be a bug for a consumer of the RDD API.
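To illustrate the failure mode in plain Scala (no Spark types; `Dep`, `firstParentUnsafe`, and `firstParentSafe` are made-up names for this sketch): `firstParent` boils down to taking `head` of the dependency list, and `head` on an empty list is what throws; a `headOption`-based variant fails gracefully instead.

```scala
case class Dep(name: String)

// Mirrors firstParent's behavior: throws NoSuchElementException if empty.
def firstParentUnsafe(deps: List[Dep]): Dep = deps.head

// Defensive variant: returns None instead of throwing.
def firstParentSafe(deps: List[Dep]): Option[Dep] = deps.headOption

println(firstParentSafe(Nil))            // None
println(firstParentSafe(List(Dep("p")))) // Some(Dep(p))
```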

Re: Can't assembly YARN project with SBT

2015-03-25 Thread Zoltán Zvara
2015-03-25 9:45 GMT+01:00 Zoltán Zvara : > Hi! > > I'm using the latest IntelliJ and I can't compile the yarn project into > the Spark assembly fat JAR. That is why I'm getting a SparkException with > message "Unable to load

Can't assembly YARN project with SBT

2015-03-25 Thread Zoltán Zvara
Hi! I'm using the latest IntelliJ and I can't compile the yarn project into the Spark assembly fat JAR. That is why I'm getting a SparkException with the message "Unable to load YARN support". The yarn project is also missing from the SBT tasks and I can't add it. How can I force sbt to include it? Thanks!
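For reference, a hedged sketch of the SBT invocation the Spark 1.x build docs describe for pulling the yarn module into the assembly (the profile names and Hadoop version below should be adjusted to your setup):

```shell
# The yarn module is only built when the yarn profile is enabled:
build/sbt -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly
```

IntelliJ's SBT import then needs the same profiles enabled so the yarn project shows up in the module list.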

Re: Spark Executor resources

2015-03-24 Thread Zoltán Zvara
2015-03-24 16:42 GMT+01:00 Sandy Ryza : > That's correct. What's the reason this information is needed? > > -Sandy > > On Tue, Mar 24, 2015 at 11:41 AM, Zoltán Zvara > wrote: > >> Thank

Re: Spark Executor resources

2015-03-24 Thread Zoltán Zvara
for the > amount that YARN has rounded up if those configuration properties > (yarn.scheduler.minimum-allocation-mb and > yarn.scheduler.increment-allocation-mb) are not present on the node. > > -Sandy > > -Sandy > > On Mon, Mar 23, 2015 at 5:08 PM, Zoltán Zvara > wrote:

Re: Optimize the first map reduce of DStream

2015-03-24 Thread Zoltán Zvara
ore lines into storage instead of in the memory. Could Spark > streaming work like this way? Dose Flink work like this? > > > > > > On Tue, Mar 24, 2015 at 7:04 PM Zoltán Zvara > wrote: > >> There is a BlockGenerator on each worker node next to the >> ReceiverSuper

Re: Optimize the first map reduce of DStream

2015-03-24 Thread Zoltán Zvara
There is a BlockGenerator on each worker node next to the ReceiverSupervisorImpl, which generates Blocks out of an ArrayBuffer in each interval (block_interval). These Blocks are passed to ReceiverSupervisorImpl, which puts these blocks into the BlockManager for storage. BlockInfos are passed
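The buffer-then-cut cycle described above can be sketched in plain Scala (all names here are illustrative, not Spark's actual API): records accumulate in a buffer, and at each block interval the buffer is cut into a Block.

```scala
import scala.collection.mutable.ArrayBuffer

case class Block(id: Long, records: Seq[String])

// Minimal stand-in for BlockGenerator: add() buffers records,
// cutBlock() is what a timer would call once per block interval.
class TinyBlockGenerator {
  private val buffer = ArrayBuffer.empty[String]
  private var nextId = 0L

  def add(record: String): Unit = buffer += record

  def cutBlock(): Option[Block] =
    if (buffer.isEmpty) None
    else {
      val block = Block(nextId, buffer.toList)
      nextId += 1
      buffer.clear()
      Some(block)
    }
}

val gen = new TinyBlockGenerator
gen.add("a"); gen.add("b")
println(gen.cutBlock())  // Some(Block(0,List(a, b)))
println(gen.cutBlock())  // None (buffer already drained)
```

The real BlockGenerator then hands each cut block to the supervisor, which stores it in the BlockManager.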

Spark Executor resources

2015-03-23 Thread Zoltán Zvara
Let's say I'm an Executor instance in a Spark system. Who started me, and where, when I run on a worker node supervised by (a) Mesos, (b) YARN? I suppose I'm the only Executor on a worker node for a given framework scheduler (driver). If I'm an Executor instance, who is the closest object to me

Spark scheduling, data locality

2015-03-19 Thread Zoltán Zvara
I'm trying to understand the task scheduling mechanism of Spark, and I'm curious about where locality preferences get evaluated. I'm trying to determine if locality preferences are fetchable before the task gets serialized. A hint would be most appreciated! Have a nice day!
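As a sketch of what "evaluating a locality preference" amounts to (plain Scala; the function and level names are illustrative, not Spark's TaskSetManager API): the scheduler compares a task's preferred hosts, derived from the RDD, against the host an executor actually runs on.

```scala
// Toy locality matcher: preferred hosts vs. the offered executor host.
def localityLevel(preferredHosts: Set[String], executorHost: String): String =
  if (preferredHosts.isEmpty) "ANY"                      // no preference recorded
  else if (preferredHosts.contains(executorHost)) "NODE_LOCAL"
  else "ANY"

println(localityLevel(Set("host1"), "host1"))  // NODE_LOCAL
println(localityLevel(Set("host1"), "host2"))  // ANY
```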

Spark Streaming - received block allocation to batch

2015-03-11 Thread Zoltán Zvara
I'm trying to understand the block allocation mechanism Spark uses to generate batch jobs and a JobSet. JobGenerator.generateJobs tries to allocate received blocks to a batch; effectively, ReceivedBlockTracker.allocateBlocksToBatch creates a streamIdToBlocks map, where stream IDs (Int) are mapped to S
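The allocation step can be sketched in plain Scala (all names are illustrative, not Spark's fields): at each batch boundary, per-stream queues of unallocated blocks are drained into a `Map[streamId -> blocks]` keyed by the batch time.

```scala
import scala.collection.mutable

case class BlockInfo(streamId: Int, blockId: Long)

// Toy stand-in for ReceivedBlockTracker's batch allocation.
class TinyBlockTracker(streamIds: Seq[Int]) {
  private val unallocated =
    mutable.Map(streamIds.map(id => id -> mutable.Queue.empty[BlockInfo]): _*)
  private val allocated = mutable.Map.empty[Long, Map[Int, Seq[BlockInfo]]]

  def addBlock(b: BlockInfo): Unit = unallocated(b.streamId).enqueue(b)

  // Drain every stream's pending blocks into one map for this batch time.
  def allocateBlocksToBatch(batchTime: Long): Unit = {
    val streamIdToBlocks = streamIds.map { id =>
      val q = unallocated(id)
      val drained = q.toList
      q.clear()
      id -> drained
    }.toMap
    allocated(batchTime) = streamIdToBlocks
  }

  def blocksOf(batchTime: Long, streamId: Int): Seq[BlockInfo] =
    allocated.getOrElse(batchTime, Map.empty).getOrElse(streamId, Nil)
}

val t = new TinyBlockTracker(Seq(0, 1))
t.addBlock(BlockInfo(0, 10)); t.addBlock(BlockInfo(1, 11))
t.allocateBlocksToBatch(1000L)
println(t.blocksOf(1000L, 0))  // List(BlockInfo(0,10))
```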