Re: Kinesis streaming misunderstanding..?

2017-01-27 Thread Graham Clark
Hi - thanks for the responses. You are right that I started by copying the word-counting example. I assumed that this would help spread the load evenly across the cluster, with each worker receiving a portion of the stream data - corresponding to one shard's worth - and then keeping the data local

Re: Kinesis streaming misunderstanding..?

2017-01-27 Thread Takeshi Yamamuro
He was probably referring to the word-counting example for Kinesis here: https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L114
On Fri, Jan 27, 2017 at 6:41 PM, ayan guha wrote:
> Maybe a

Re: Kinesis streaming misunderstanding..?

2017-01-27 Thread ayan guha
Maybe a naive question: why are you creating one DStream per shard? Shouldn't there be a single DStream corresponding to the Kinesis stream?
On Fri, Jan 27, 2017 at 8:09 PM, Takeshi Yamamuro wrote:
> Hi,
>
> Just a guess though, Kinesis shards sometimes have skew data.
> So,

Re: Kinesis streaming misunderstanding..?

2017-01-27 Thread Takeshi Yamamuro
Hi,
Just a guess, but Kinesis shards sometimes hold skewed data. So, before you compute anything from the Kinesis RDDs, you would do better to repartition them to get more parallelism.
// maropu
On Fri, Jan 27, 2017 at 2:54 PM, Graham Clark wrote:
> Hi everyone - I am building a
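
To illustrate the repartitioning suggestion, here is a minimal sketch in plain Python. It is a toy model, not Spark code: ordinary lists stand in for RDD partitions, and the `repartition` helper and the record counts are hypothetical names and numbers chosen for illustration. It only shows the effect a shuffle is meant to have, which is to spread records from a few uneven shards roughly evenly over more partitions. In Spark itself the corresponding call would be `repartition(numPartitions)` on the DStream or RDD, and its exact record placement differs from this round-robin imitation.

```python
# Toy model only: plain Python lists stand in for RDD partitions.
# Nothing here calls Spark; it imitates the effect of a round-robin shuffle.
from itertools import chain


def repartition(partitions, num_partitions):
    """Redistribute all records round-robin across num_partitions lists."""
    out = [[] for _ in range(num_partitions)]
    for i, record in enumerate(chain.from_iterable(partitions)):
        out[i % num_partitions].append(record)
    return out


# Hypothetical skewed input: one "shard" holds most of the records.
skewed = [list(range(90)), list(range(90, 95)), list(range(95, 100))]
balanced = repartition(skewed, 4)

print([len(p) for p in skewed])    # partition sizes before
print([len(p) for p in balanced])  # partition sizes after
```

With the skewed input, a per-partition write would spend most of its time on the single large partition. After redistribution each partition carries a similar share of the work, at the cost of moving the data, which in a real cluster means a network shuffle.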

Kinesis streaming misunderstanding..?

2017-01-26 Thread Graham Clark
Hi everyone - I am building a small prototype in Spark 1.6.0 (Cloudera) to read information from Kinesis and write it to HDFS in Parquet format. The write seems very slow and, if I understood Spark's diagnostics correctly, always seemed to run from the same executor, one partition after the other,