Consider using Cassandra with Spark Streaming for time series; Cassandra has been doing time series for years. Here are some snippets with Kafka streaming and writing/reading the data back:
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L62-L64

or write in the stream and read back:
https://github.com/killrweather/killrweather/blob/master/killrweather-examples/src/main/scala/com/datastax/killrweather/KafkaStreamingJson2.scala#L53-L61

or, for more detailed reads back:
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/TemperatureActor.scala#L65-L69

A CassandraInputDStream is coming; I'm working on it now.

Helena
@helenaedelson

> On May 15, 2015, at 9:59 AM, ayan guha <guha.a...@gmail.com> wrote:
>
> Hi,
>
> Do you have a cut-off time, i.e. a limit on how "late" an event can be? If
> not, you may consider a different persistent store such as Cassandra/HBase
> and delegate the "update" part to it.
>
> On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati
> <nisrina.luthfiy...@gmail.com> wrote:
>
>> Hi all,
>> I have a stream of data from Kafka that I want to process and store in
>> HDFS using Spark Streaming.
>> Each datum has a date/time dimension, and I want to write data within the
>> same time dimension to the same HDFS directory. The data stream might be
>> unordered (by time dimension).
>>
>> I'm wondering what the best practices are for grouping/storing a time
>> series data stream using Spark Streaming.
>>
>> I'm considering grouping each batch of data in Spark Streaming by time
>> dimension and then saving each group to a different HDFS directory.
>> However, since it is possible for data with the same time dimension to
>> appear in different batches, I would need to handle an "update" in case
>> the HDFS directory already exists.
>>
>> Is this a common approach? Are there other approaches that I could try?
>>
>> Thank you!
>> Nisrina.
>
> --
> Best Regards,
> Ayan Guha
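The grouping step discussed in the quoted thread boils down to mapping each event's timestamp to a time-dimension key. Here is a minimal sketch of that bucketing logic in Scala; the names (`TimePartitioner`, `partitionPath`), the hourly granularity, and the path layout are illustrative assumptions, not taken from the thread:

```scala
import java.time.Instant
import java.time.ZoneOffset
import java.time.format.DateTimeFormatter

// Maps an event timestamp (epoch millis) to an hourly HDFS partition
// directory, e.g. "hdfs:///events/2015/05/15/09". Events from the same
// hour always resolve to the same directory, regardless of which
// streaming batch they arrive in, so late data simply lands in an
// already-existing partition.
object TimePartitioner {
  private val fmt =
    DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC)

  def partitionPath(basePath: String, epochMillis: Long): String =
    s"$basePath/${fmt.format(Instant.ofEpochMilli(epochMillis))}"
}
```

In a streaming job one could then, inside `foreachRDD`, group records by `TimePartitioner.partitionPath(...)` and write each group out, appending when the directory already exists; that append handling is exactly the "update" concern raised above, and is what a store with upsert semantics like Cassandra sidesteps.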