unsubscribe

2016-10-02 Thread Nikos Viorres

Re: How to get a single available message from kafka (case where OffsetRange.fromOffset == OffsetRange.untilOffset)

2015-11-30 Thread Nikos Viorres
it you would use OffsetRange("topicname", 0, 0, 1), Kafka's simple shell consumer in that case would print "next offset = 1". So instead trying to consume OffsetRange("topicname", 0, 1, 2) shouldn't be expect

How to get a single available message from kafka (case where OffsetRange.fromOffset == OffsetRange.untilOffset)

2015-11-28 Thread Nikos Viorres
Hi, I am using KafkaUtils.createRDD to retrieve data from Kafka for batch processing, and when invoking KafkaUtils.createRDD with an OffsetRange where OffsetRange.fromOffset == OffsetRange.untilOffset for a particular partition, I get an empty RDD. Documentation is clear that until is exclusive and
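The exclusive-until behaviour the question turns on can be sketched outside Spark. This is a plain-Python illustration of the offset-range semantics, with hypothetical helper names, not the actual KafkaUtils or OffsetRange API:

```python
# Sketch of OffsetRange semantics: fromOffset is inclusive, untilOffset is
# exclusive, so a range covers untilOffset - fromOffset messages.
def message_count(from_offset, until_offset):
    return until_offset - from_offset

def range_for_single_message(offset):
    """(from, until) pair covering exactly the one message at `offset`."""
    return (offset, offset + 1)

# fromOffset == untilOffset is a legal but empty range -- hence the empty RDD.
assert message_count(5, 5) == 0
# To read the single message at offset 5, use until = 6.
assert message_count(*range_for_single_message(5)) == 1
```

So an empty RDD for `from == until` is expected behaviour, not a bug: requesting the single message at offset n means passing until = n + 1.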

Re: DataFrame.write().partitionBy("some_column").parquet(path) produces OutOfMemory with very few items

2015-07-16 Thread Nikos Viorres
s to first repartition the data by partition columns first. Cheng. On 7/15/15 7:05 PM, Nikos Viorres wrote: Hi, I am trying to test partitioning for DataFrames with parquet usage so I attempted to do df.write().partitionBy("some_column"

DataFrame.write().partitionBy("some_column").parquet(path) produces OutOfMemory with very few items

2015-07-15 Thread Nikos Viorres
Hi, I am trying to test partitioning for DataFrames with parquet usage, so I attempted to do df.write().partitionBy("some_column").parquet(path) on a small dataset of 20,000 records which, when saved as parquet locally with gzip, takes 4 MB of disk space. However, on my dev machine with -Dspark.master=
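The usual explanation for this OutOfMemory, and the fix suggested in the reply above, can be sketched with plain Python. The function names here are illustrative stand-ins, not Spark API:

```python
# Why partitionBy can OOM on tiny data: without a prior repartition, every
# task may see every distinct value of the partition column, so each task
# keeps one open Parquet writer (and its buffer) per value.
def writers_per_task(tasks):
    """tasks: list of task inputs; each row is (partition_value, payload)."""
    return [len({value for value, _ in task}) for task in tasks]

def repartition_by_column(tasks, num_tasks):
    """Hash-partition rows by the partition column, mimicking a
    repartition on "some_column" before the write."""
    out = [[] for _ in range(num_tasks)]
    for task in tasks:
        for value, payload in task:
            out[hash(value) % num_tasks].append((value, payload))
    return out

# 100 distinct partition values spread evenly over 4 tasks:
tasks = [[(v, None) for v in range(100)] for _ in range(4)]
assert writers_per_task(tasks) == [100, 100, 100, 100]   # 400 writers total

shuffled = repartition_by_column(tasks, 4)
assert sum(writers_per_task(shuffled)) == 100            # each value lands in one task
```

After the shuffle, each distinct partition value is written by exactly one task, so the number of concurrently open writer buffers drops from tasks × values to just values, which is why repartitioning by the partition columns before the write avoids the OOM even on small datasets.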

Re: updateStateByKey performance & API

2015-03-18 Thread Nikos Viorres
Hi Akhil, Yes, that's what we are planning on doing at the end of the day. At the moment I am doing performance testing before the job hits production, and testing on 4 cores to get baseline figures, and deduced that in order to grow to 10 - 15 million keys we'll need a batch interval of ~20 secs

updateStateByKey performance / API

2015-03-17 Thread Nikos Viorres
Hi all, We are having a few issues with the performance of updateStateByKey operation in Spark Streaming (1.2.1 at the moment) and any advice would be greatly appreciated. Specifically, on each tick of the system (which is set at 10 secs) we need to update a state tuple where the key is the user_i
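For readers unfamiliar with the operation under discussion, here is a minimal plain-Python sketch of updateStateByKey's contract (illustrative, not the Spark implementation). The performance-relevant property is that the update function is invoked for every tracked key on every batch, not only for keys with new data, so per-tick cost grows with the total key count:

```python
def update_state_by_key(state, batch, update_func):
    """state: dict key -> state; batch: dict key -> list of new values.
    Calls update_func(new_values, old_state) for EVERY key, tracked or new."""
    new_state = {}
    for key in set(state) | set(batch):
        result = update_func(batch.get(key, []), state.get(key))
        if result is not None:          # returning None drops the key
            new_state[key] = result
    return new_state

# Example: keep a running count of events per user.
count = lambda new, old: (old or 0) + len(new)
s = update_state_by_key({}, {"u1": [1, 2], "u2": [3]}, count)
s = update_state_by_key(s, {"u1": [4]}, count)
assert s == {"u1": 3, "u2": 1}
```

Note that "u2" is updated on the second tick even though it received no data, which is exactly why the cost scales with the number of keys rather than with batch size.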

updateStateByKey performance

2015-03-17 Thread Nikos Viorres
Hi all, We are having a few issues with the performance of updateStateByKey operation in Spark Streaming (1.2.1 at the moment) and any advice would be greatly appreciated. Specifically, on each tick of the system (which is set at 10 secs) we need to update a state tuple where the key is the user_i