Bijeet, your understanding is correct, thanks for the comment. We had planned to release this feature in 0.8 for the streaming case. Now that we see the community needs it, we will back-port it to 0.7 and release it in 0.7.3 or 0.7.4.
Here are the retention-related JIRAs; I will link them together:
https://issues.apache.org/jira/browse/KYLIN-886
https://issues.apache.org/jira/browse/KYLIN-895
https://issues.apache.org/jira/browse/KYLIN-906

On 7/27/15, 2:47 AM, "Bijeet Singh" <[email protected]> wrote:

>From what I understand, a cube comprises multiple segments, and each
>segment is effectively a table in HBase. While querying, an HBaseKeyRange
>is created for each matching segment of the cube, and the results from
>the segments are finally merged. So it seems that truncating the HBase
>table corresponding to an older segment will not affect the other
>segments. Please correct me if I am wrong here.
>
>If it is indeed possible to truncate the older segments while maintaining
>the correctness of the cube, then the older data can effectively be
>deleted from the cube by truncating the corresponding HBase tables.
>
>This way, if I want to retain data for, say, around 60 days, I can have
>10 segments (given that 10 seems to be the optimal number of segments),
>each holding 6 days' worth of data. And once I have the 11th segment
>ready for the most recent 6 days, I can truncate the oldest segment.
>
>Please let me know if it looks feasible.
>
>Thanks,
>Bijeet
>
>On Sat, Jul 25, 2015 at 6:41 AM, vipul jhawar <[email protected]>
>wrote:
>
>> Sure, I will open a JIRA.
>>
>> So, at eBay you are storing the data forever in the cubes?
>>
>> Rebuilding the cube every several days seems very suboptimal, as it
>> means we have to spend a lot more resources again.
>> Even if I partitioned my cubes by month (e.g., cube_01, cube_02), I
>> would have to run parallel queries against all of them whenever my date
>> range spans months, and then re-aggregate in memory.
>>
>> On Fri, Jul 24, 2015 at 8:39 PM, Han, Luke <[email protected]> wrote:
>>
>> > Could you please open one JIRA for this? We have one for the
>> > streaming case, but I think it makes sense to enable retention for
>> > batch as well.
>> >
>> > Currently, I would say you have to rebuild the cube every several
>> > days to discard old data.
>> > To minimize the impact, you can define two cubes with the same logic,
>> > build one first, then build the other, say, 7 days later; once the
>> > new one is done, disable the old one and purge its data, then repeat
>> > again and again...
>> >
>> > Thanks.
>> >
>> > Sent from my iPhone
>> >
>> > > On Jul 24, 2015, at 22:22, vipul jhawar <[email protected]>
>> > > wrote:
>> > >
>> > > Hi
>> > >
>> > > I would be interested to know what solutions you would recommend to
>> > > implement data retention. Say we want to retain data for only up to
>> > > the last 90 days in the cube; what is the best option?
>> > >
>> > > Our daily size is > 60 GB, so we cannot store data forever and want
>> > > to limit it to a time range to support advanced analysis.
>> > >
>> > > Thanks
>> >
>>
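To make Bijeet's rolling-window scheme above concrete (10 segments of 6 days each for roughly 60 days of retention, dropping the oldest when the 11th is built), here is a minimal bookkeeping sketch. Only the numbers and the drop-the-oldest rule come from the thread; the function and variable names are hypothetical and this is not a Kylin API:

from datetime import date, timedelta

SEGMENT_DAYS = 6   # width of one segment, per Bijeet's example
MAX_SEGMENTS = 10  # 10 x 6 days = ~60 days of retention

def roll_segments(segments, today):
    """Append a new 6-day segment ending `today` and return the
    (possibly empty) list of segments that fall out of retention.
    Each segment is a (start, end) pair with `end` exclusive."""
    segments.append((today - timedelta(days=SEGMENT_DAYS), today))
    expired = segments[:-MAX_SEGMENTS] if len(segments) > MAX_SEGMENTS else []
    del segments[:len(expired)]
    return expired

# Build 11 consecutive segments: the 11th build evicts the oldest.
segments = []
start = date(2015, 7, 1)
for i in range(11):
    build_day = start + timedelta(days=SEGMENT_DAYS * (i + 1))
    for old in roll_segments(segments, build_day):
        # In Bijeet's proposal, this is the point where the HBase
        # table backing the expired segment would be truncated.
        print("expire segment", old)

Note that this only shows the window arithmetic; whether dropping the backing HBase table leaves the cube metadata consistent is exactly the question Bijeet raises in the thread.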

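For comparison, a sketch of the two-cube rotation Luke describes as the current workaround: two cubes with identical definitions, rebuilt alternately so one is always serving while the other is refreshed, after which the stale one is disabled and purged. KylinClient and all of its methods below are made-up stand-ins, not a real Kylin API; the real operations would go through the Kylin UI or the REST API of your version:

class KylinClient:
    # Hypothetical placeholder operations.
    def build(self, cube, start, end): ...
    def wait_until_ready(self, cube): ...
    def disable(self, cube): ...
    def purge(self, cube): ...

def rotate(client, serving, standby, window_start, window_end):
    """Rebuild `standby` over the retention window, then swap roles."""
    client.build(standby, window_start, window_end)
    client.wait_until_ready(standby)  # new cube is now queryable
    client.disable(serving)           # take the stale cube offline
    client.purge(serving)             # reclaim its HBase storage
    return standby, serving           # roles swapped for the next round

Each rotation rebuilds the full retention window from scratch, which is precisely the resource cost vipul objects to earlier in the thread, and why a native retention feature is being requested.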