Hi - thanks for the responses. You are right that I started by copying the
word-counting example. I assumed that this would help spread the load
evenly across the cluster, with each worker receiving a portion of the
stream data - corresponding to one shard's worth - and then keeping the
data local.
He was probably referring to the word-counting example for Kinesis here:
https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L114
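For reference, the pattern in that example creates one receiver-based DStream per shard and unions them. A minimal sketch against the Spark 1.6 Kinesis API is below; the app name, stream name, endpoint, region, and shard count are assumed placeholders:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// Assumed placeholders: adjust appName, streamName, endpoint, region, numShards.
val numShards = 4
val batchInterval = Seconds(10)
val ssc = new StreamingContext(sc, batchInterval)

// One receiver (hence one DStream) per shard, as in KinesisWordCountASL.
val shardStreams = (0 until numShards).map { _ =>
  KinesisUtils.createStream(
    ssc, "myApp", "myStream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, batchInterval,
    StorageLevel.MEMORY_AND_DISK_2)
}

// Union into a single DStream for downstream processing.
val unified = ssc.union(shardStreams)
```

Note that each receiver occupies one executor core for its lifetime, so with few executors the receivers and the processing can end up competing for the same slots.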
On Fri, Jan 27, 2017 at 6:41 PM, ayan guha wrote:
Maybe a naive question: why are you creating one DStream per shard? It should
be one DStream corresponding to the Kinesis stream, shouldn't it?
On Fri, Jan 27, 2017 at 8:09 PM, Takeshi Yamamuro
wrote:
Hi,
Just a guess, but Kinesis shards sometimes have skewed data.
So, before you compute anything from the Kinesis RDDs, you'd do better to
repartition them for better parallelism.
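That suggestion might look like the sketch below; `kinesisStream` is a hypothetical DStream read from Kinesis, and the partition count of 16 is just an assumed example to tune per cluster:

```scala
// repartition() redistributes each batch's records across the cluster,
// so skewed shards don't pin the work to a handful of partitions.
val repartitioned = kinesisStream.repartition(16)  // assumed count; tune per cluster

repartitioned.foreachRDD { rdd =>
  // compute or write from the rebalanced RDD here
}
```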
// maropu
On Fri, Jan 27, 2017 at 2:54 PM, Graham Clark wrote:
Hi everyone - I am building a small prototype in Spark 1.6.0 (Cloudera) to
read information from Kinesis and write it to HDFS in Parquet format. The
write seems very slow, and if I understood Spark's diagnostics correctly, it
always seemed to run on the same executor, one partition after the other,