Richard Calaba created KYLIN-1512:
-------------------------------------

             Summary: Functionality to remove/unload Cube data (by time 
partition) - before it gets merged
                 Key: KYLIN-1512
                 URL: https://issues.apache.org/jira/browse/KYLIN-1512
             Project: Kylin
          Issue Type: Improvement
          Components: REST Service
    Affects Versions: Future
            Reporter: Richard Calaba
            Assignee: Zhong,Jason


Hello Kylin gurus,

I am raising this issue as a "development request" for the functionality 
described below. As far as I can tell, it is not currently supported. If it 
were, Kylin could handle more advanced scenarios. If it is already supported 
(even if only through the REST services and not in the UI), please point me to 
how to achieve it.

The scenario:

1) We get data loaded into Hive tables (fact table) on a daily basis. But 
sometimes the data arrives with a delay of up to, let's say, 14 days. After 
those 14 days we "freeze" the changes and declare the data final.

2) We need to enable reporting on a Kylin Cube for all days, from 
today/yesterday to a few weeks/months/years back. With the current date 
partitioning and data loading strategy I can load only complete data into the 
cube and can neither update nor remove and reload a particular previously 
loaded data set. I am therefore forced to design 2 cubes: a 1st "historical" 
Cube (data 14 days and older) and a 2nd "last 14 days" Cube, which holds the 
"not yet complete" data. I can use standard incremental loads (Refresh) on the 
historical Cube and a full reload (Purge & Build) on the "last 14 days" Cube. 
BUT this design forces the query UI to understand this logical data split and 
to differentiate and combine the queries across both Cubes - not nice.
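To illustrate the extra work this puts on the query side, here is a minimal 
sketch of the range-splitting logic a client would need. The function name 
split_range, the constant FREEZE_DAYS and the half-open [start, end) convention 
are assumptions made for illustration only; nothing here is part of Kylin.

```python
from datetime import date, timedelta

FREEZE_DAYS = 14  # from the scenario: data is considered final after 14 days


def split_range(start, end, today, freeze_days=FREEZE_DAYS):
    """Split the half-open date range [start, end) at the freeze cutoff.

    Returns (historical, recent). Each part is a (start, end) tuple, or None
    if the requested range does not touch that cube.
    """
    if start >= end:
        raise ValueError("start must be before end")
    cutoff = today - timedelta(days=freeze_days)
    # Days before the cutoff live in the "historical" cube.
    historical = (start, min(end, cutoff)) if start < cutoff else None
    # Days from the cutoff onwards live in the "last 14 days" cube.
    recent = (max(start, cutoff), end) if end > cutoff else None
    return historical, recent
```

A query that straddles the cutoff comes back as two sub-ranges, one per cube, 
and the caller still has to run both queries and combine the results itself.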

Resolution suggestions:

I understand that implementing a generic data update capability in a Cube is 
very challenging, and that is not what I am requesting. I also understand that 
once the Cube merge has happened, it is impossible to logically remove/unload 
the most recently loaded data from the cube. BUT what could hopefully be 
achieved is this:

1) Allow configuring the Cube merge logic so that it does not try to merge 
data loads newer than N days (in my scenario I would set N to 14).

2) Provide a REST service (and later also a UI) through which I can remove 
previously loaded data. To simplify the coding logic, I am OK with the 
limitation that if the latest Cube Refresh was run for the date interval (X,Y), 
then I have to unload/remove the whole date interval (X,Y). Suppose there were 
2 Cube Refresh loads and no Merge yet, giving the data time partitions 
Build(A,X); Refresh(X,Y); Refresh(Y,Z), where A<X<Y<Z (dates). Then the 
Remove/Unload logic should allow me to call the REST service to remove the cube 
data for (Y,Z). A subsequent call to remove (X,Y) should also succeed, so that 
I can remove multiple previously loaded (and not yet merged) data sets. If you 
can make the service support a single call, remove(X,Z), even better.

3) Ability to re-schedule a Cube Refresh with a new current day, even if there 
were previous Cube Refreshes for the same/overlapping days. As those were 
removed, I should be able to schedule them again (even with a new most recent 
date).

4) Implement monitoring of Cube data loads (REST service + UI). It should 
retrieve/show the whole cube load history, e.g. Build (up to day X); Refresh 
(up to day Y); Remove (X,Y); Refresh (up to day Z, meaning the interval (X,Z)).
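The four suggestions above can be sketched as a small in-memory model. This is 
only an illustration of the requested behaviour under simplifying assumptions: 
the class CubeModel and its methods are hypothetical and are not Kylin classes 
or REST endpoints, and days are plain integers instead of dates.

```python
from dataclasses import dataclass, field


@dataclass
class CubeModel:
    """Toy model of a cube's time partitions (hypothetical, not Kylin code)."""
    segments: list = field(default_factory=list)  # (start, end, merged) tuples
    history: list = field(default_factory=list)   # (action, start, end) log

    def load(self, kind, start, end):
        """Build/Refresh: append a new partition [start, end)."""
        if start >= end:
            raise ValueError("start must be before end")
        if self.segments and self.segments[-1][1] != start:
            raise ValueError("load must continue where the last one ended")
        self.segments.append((start, end, False))
        self.history.append((kind, start, end))

    def merge(self, today, min_age):
        """Item 1: merge only partitions that ended at least min_age days ago."""
        cutoff = today - min_age
        old = [s for s in self.segments if s[1] <= cutoff]
        if len(old) > 1:
            merged = (old[0][0], old[-1][1], True)
            # Partitions are contiguous and ordered, so `old` is a prefix.
            self.segments = [merged] + self.segments[len(old):]
            self.history.append(("Merge", merged[0], merged[1]))

    def remove(self, start, end):
        """Item 2: remove the most recent, not yet merged partitions."""
        if not self.segments or self.segments[-1][1] != end:
            raise ValueError("can only remove the most recent loads")
        i = len(self.segments)
        while i > 0 and self.segments[i - 1][0] > start:
            i -= 1
        if i == 0 or self.segments[i - 1][0] != start:
            raise ValueError("range must align with partition boundaries")
        if any(merged for _, _, merged in self.segments[i - 1:]):
            raise ValueError("merged partitions cannot be removed")
        del self.segments[i - 1:]
        self.history.append(("Remove", start, end))
```

With A=0, X=10, Y=20, Z=30, loading Build(0,10), Refresh(10,20) and 
Refresh(20,30) and then calling remove(20,30) followed by remove(10,20), or 
just remove(10,30), leaves only the initial Build partition. A new 
Refresh(10,35) can then be loaded (item 3), and the history list keeps every 
step for monitoring (item 4).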

Some advanced data warehouse solutions like SAP BW do support this 
functionality (with some constraints), which lets them handle the scenario 
I describe above in a more elegant way.

Thank you for your consideration/review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
