Hey Sandip:

TD has already outlined the right approach, but let me add a couple of
thoughts, as I recently worked on a similar project. I had to compute
real-time metrics on streaming data, and those metrics also had to be
aggregated at hour/day/week/month granularity. My data pipeline was Kafka
--> Spark Streaming --> Cassandra.

I had a Spark Streaming job that did the following: (1) receive a window of
raw streaming data and write it to Cassandra, and (2) do only the basic
computations that need to be shown on a real-time dashboard, and store the
results in Cassandra. (I had to use a sliding window because my computation
involved joining data that might occur in different time windows.)
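The sliding-window idea can be sketched independently of Spark. In Spark Streaming it maps to something like reduceByKeyAndWindow with a window length and slide interval; the (timestamp, key) event shape and the 60-second window below are just illustrative assumptions:

```python
# Plain-Python sketch of a sliding-window count. The event shape and the
# 60-second window length are assumptions for illustration only.
WINDOW_SECONDS = 60

def sliding_window_counts(events, now, window=WINDOW_SECONDS):
    """Count events per key over the trailing `window` ending at `now`.

    Events older than the window fall out of the result. Because the
    window is longer than one batch interval, data arriving in different
    batches can still be joined, as long as both pieces are inside the
    current window.
    """
    counts = {}
    for ts, key in events:
        if now - window < ts <= now:
            counts[key] = counts.get(key, 0) + 1
    return counts

# Events from two different 10-second batches still meet in one window;
# the event at t=30 has already fallen out.
events = [(100, "login"), (105, "login"), (112, "click"), (30, "login")]
print(sliding_window_counts(events, now=115))  # {'login': 2, 'click': 1}
```

Each batch interval you re-evaluate the window over the overlapping data, which is exactly what made the cross-batch join possible in my case.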

I had a separate set of Spark jobs that pulled the raw data from Cassandra,
computed the aggregations and more complex metrics, and wrote the results
back to the relevant Cassandra tables. These jobs ran periodically, every
few minutes.
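As a rough sketch of what those periodic rollup jobs computed (the bucketing scheme below is my own illustration, not the actual job; in practice the raw rows came from Cassandra and the totals went back into per-granularity tables):

```python
from datetime import datetime, timezone

def bucket_keys(ts):
    """Rollup keys (minute/hour/day/week/month) for one epoch timestamp -
    i.e. the row keys whose running totals this event contributes to."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return {
        "minute": dt.strftime("%Y-%m-%d %H:%M"),
        "hour":   dt.strftime("%Y-%m-%d %H"),
        "day":    dt.strftime("%Y-%m-%d"),
        "week":   dt.strftime("%G-W%V"),   # ISO year-week
        "month":  dt.strftime("%Y-%m"),
    }

def rollup(raw_rows):
    """Aggregate raw (timestamp, value) rows into every granularity at
    once, mirroring a periodic batch job over the raw-data table."""
    totals = {}
    for ts, value in raw_rows:
        for level, key in bucket_keys(ts).items():
            totals.setdefault(level, {})
            totals[level][key] = totals[level].get(key, 0) + value
    return totals

# Three raw rows on 2015-11-19 (UTC): two at 08:00-ish, one at 09:00.
rows = [(1447920000, 5), (1447920030, 3), (1447923600, 2)]
print(rollup(rows)["hour"])  # {'2015-11-19 08': 8, '2015-11-19 09': 2}
```

Because each raw row fans out to every granularity, one pass over the raw data refreshes all the rollup tables, which is what kept the periodic jobs simple.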

Regards,
Sanket

On Thu, Nov 19, 2015 at 8:09 AM, Sandip Mehta <sandip.mehta....@gmail.com>
wrote:

> Thank you TD for your time and help.
>
> SM
>
> On 19-Nov-2015, at 6:58 AM, Tathagata Das <t...@databricks.com> wrote:
>
> There are different ways to do the rollups. Either update the rollups from
> the streaming application, or generate the rollups in a later process -
> say, a periodic Spark job every hour. Or you could just generate rollups
> on demand, when they are queried.
> The whole thing depends on your downstream requirements - if you always
> need up-to-date rollups to show up in a dashboard (even day-level stuff),
> then the first approach is better. Otherwise, the second and third
> approaches are more efficient.
>
> TD
>
>
> On Wed, Nov 18, 2015 at 7:15 AM, Sandip Mehta <sandip.mehta....@gmail.com>
> wrote:
>
>> TD thank you for your reply.
>>
>> I agree on the data store requirement. I am using HBase as the underlying
>> store.
>>
>> So for every batch interval of say 10 seconds
>>
>> - Calculate the time dimensions (minute, hour, day, week, month and
>> quarter) along with other dimensions and metrics
>> - Update the relevant base table at each batch interval for the relevant
>> metrics for a given set of dimensions.
>>
>> The only caveat I see is that I’ll have to update each of the different
>> roll-up tables for each batch window.
>>
>> Is this a valid approach for calculating time series aggregation?
>>
>> Regards
>> SM
>>
>> For minute-level aggregates I have set up a streaming window of, say, 10
>> seconds, and am storing minute-level aggregates across multiple
>> dimensions in HBase at every window interval.
>>
>> On 18-Nov-2015, at 7:45 AM, Tathagata Das <t...@databricks.com> wrote:
>>
>> For this sort of long-term aggregation you should use a dedicated data
>> storage system, like a database or a key-value store. Spark Streaming
>> would just aggregate and push the necessary data to the data store.
>>
>> TD
>>
>> On Sat, Nov 14, 2015 at 9:32 PM, Sandip Mehta <sandip.mehta....@gmail.com
>> > wrote:
>>
>>> Hi,
>>>
>>> I am working on a requirement of calculating real-time metrics and am
>>> building a prototype on Spark Streaming. I need to build aggregates at
>>> second, minute, hour and day level.
>>>
>>> I am not sure whether I should calculate all these aggregates as
>>> different windowed functions on the input DStream, or whether I should
>>> use the updateStateByKey function. If I have to use updateStateByKey for
>>> these time-series aggregations, how can I remove keys from the state
>>> after their time window has lapsed?
>>>
>>> Please suggest.
>>>
>>> Regards
>>> SM
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>>
>
>


-- 
SuperReceptionist is now available on Android mobiles. Track your business
on the go with call analytics, recordings, insights and more: Download the
app here
<https://play.google.com/store/apps/details?id=com.superreceptionist>
