Consider using Cassandra with Spark Streaming for time series; Cassandra has been doing time series for years. Here are some snippets with Kafka streaming and writing/reading the data back:
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L62-L64

or write in the stream and read back:
https://github.com/killrweather/killrweather/blob/master/killrweather-examples/src/main/scala/com/datastax/killrweather/KafkaStreamingJson2.scala#L53-L61

or, for more detailed reads back:
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/TemperatureActor.scala#L65-L69

A CassandraInputDStream is coming; I'm working on it now.

Helena
@helenaedelson

> On May 15, 2015, at 9:59 AM, ayan guha <guha.a...@gmail.com> wrote:
>
> Hi,
>
> Do you have a cut-off time, i.e. a limit on how "late" an event can be? If
> not, you may consider a different persistent store such as Cassandra/HBase
> and delegate the "update" part to it.
>
> On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati
> <nisrina.luthfiy...@gmail.com> wrote:
>
>> Hi all,
>> I have a stream of data from Kafka that I want to process and store in
>> HDFS using Spark Streaming.
>> Each datum has a date/time dimension, and I want to write data within the
>> same time dimension to the same HDFS directory. The data stream might be
>> unordered (by time dimension).
>>
>> I'm wondering what the best practices are for grouping/storing a time
>> series data stream using Spark Streaming.
>>
>> I'm considering grouping each batch of data in Spark Streaming by time
>> dimension and then saving each group to a different HDFS directory.
>> However, since it is possible for data with the same time dimension to
>> appear in different batches, I would need to handle an "update" in case
>> the HDFS directory already exists.
>>
>> Is this a common approach? Are there other approaches that I could try?
>>
>> Thank you!
>> Nisrina.
>
> --
> Best Regards,
> Ayan Guha
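The grouping step discussed in the quoted thread boils down to mapping each event's timestamp to a time-dimension key. Here is a minimal sketch of that bucketing logic in Scala; the names (`TimePartitioner`, `partitionPath`), the hourly granularity, and the path layout are illustrative assumptions, not taken from the thread:

```scala
import java.time.Instant
import java.time.ZoneOffset
import java.time.format.DateTimeFormatter

// Maps an event timestamp (epoch millis) to an hourly HDFS partition
// directory, e.g. "hdfs:///events/2015/05/15/09". Events from the same
// hour always resolve to the same directory, regardless of which
// streaming batch they arrive in, so late data simply lands in an
// already-existing partition.
object TimePartitioner {
  private val fmt =
    DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC)

  def partitionPath(basePath: String, epochMillis: Long): String =
    s"$basePath/${fmt.format(Instant.ofEpochMilli(epochMillis))}"
}
```

In a streaming job one could then, inside `foreachRDD`, group records by `TimePartitioner.partitionPath(...)` and write each group out, appending when the directory already exists; that append handling is exactly the "update" concern raised above, and is what a store with upsert semantics like Cassandra sidesteps.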