Richard Calaba created KYLIN-1512:
-------------------------------------
Summary: Functionality to remove/unload Cube data (by time
partition) - before it gets merged
Key: KYLIN-1512
URL: https://issues.apache.org/jira/browse/KYLIN-1512
Project: Kylin
Issue Type: Improvement
Components: REST Service
Affects Versions: Future
Reporter: Richard Calaba
Assignee: Zhong,Jason
Hello Kylin gurus,
I am raising this issue as a "development request" for the following
functionality (see description below). If I am not mistaken, it is currently
not supported. Supporting it would let Kylin handle more advanced scenarios.
If it is already supported (even if only through the REST services and not
the UI), feel free to point me to how to achieve it.
The scenario:
1) We get data into Hive tables (the fact table), loaded on a daily basis.
Sometimes the data arrives with a delay of up to, let's say, 14 days.
After those 14 days we "freeze" the changes and declare the data
finalized.
2) We need to enable reporting on a Kylin Cube for all days, from
today/yesterday back to a few weeks/months/years ago. With the current date
partitioning and data loading strategy I can load only complete data into the
cube and can neither update nor remove and reload a particular previously
loaded data set. I am therefore forced to design 2 cubes: a 1st "historical"
Cube (data 14 days and older) and a 2nd "last 14 days" Cube, which holds the
"not yet complete" data. I can use standard incremental loads (Refresh) on
the historical Cube and a full reload (Purge & Build) on the 2nd "last 14
days" Cube. BUT this design forces the Query UI to understand this logical
data split and to differentiate & combine the queries from both Cubes - not
nice.
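To make the burden on the query side concrete, here is a minimal Python
sketch of the routing that every client would have to do with the two-cube
design. The function name, the half-open date ranges and the exact boundary
(whether a day exactly 14 days old counts as "historical") are my
assumptions for illustration only; none of this is Kylin code.

```python
from datetime import date, timedelta

FREEZE_DAYS = 14  # late-arrival window from the scenario above


def split_query_range(start, end, today, freeze_days=FREEZE_DAYS):
    """Split the half-open date range [start, end) across the two cubes.

    Returns (historical_range, recent_range). Either part is None when the
    query does not touch that cube. Hypothetical helper, not a Kylin API.
    """
    # First day that may still receive late data ("last 14 days" cube).
    cutoff = today - timedelta(days=freeze_days)
    historical = (start, min(end, cutoff)) if start < cutoff else None
    recent = (max(start, cutoff), end) if end > cutoff else None
    return historical, recent


# A query spanning both cubes has to be issued twice and combined:
hist, recent = split_query_range(date(2016, 3, 1), date(2016, 3, 20),
                                 today=date(2016, 3, 20))
```

With a single cube that allows recent loads to be removed and reloaded, this
split would not be needed at all.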
Resolution suggestions:
I understand that implementing a generic data update capability in a Cube is
very challenging, and this is not what I am requesting. I also understand that
once the Cube merge has happened, it is impossible to logically remove/unload
the most recently loaded data from the cube. BUT what could hopefully be
achieved is this:
1) Schedule/set the Cube Merge logic so that it does not try to merge data
loads newer than N days (in my scenario I would set it to 14 days).
2) Have a REST service (and later also UI) through which I can remove
previously loaded data. To simplify the coding logic, I am OK with the
limitation that if the latest Cube Refresh was run for date interval (X,Y),
then I have to unload/remove the whole date interval (X,Y). Suppose there were
2 Cube Refresh loads and no Merge yet, giving the data time partitions
Build (A,X); Refresh (X,Y); Refresh (Y,Z), where A<X<Y<Z (dates). Then the
Remove/Unload logic should allow me to call the REST service to remove the
cube data for (Y,Z). A subsequent call to remove (X,Y) should also succeed, so
that I can remove multiple previously loaded (and not yet merged) data loads.
If you manage to make the service accept a single call, remove(X,Z), then
even better.
3) Ability to re-schedule a Cube Refresh with a new current day, even if there
were previous Cube Refreshes for the same/overlapping days ... as they were
removed, I should be able to schedule them again (even with a new most recent
date).
4) Implement Cube data load monitoring (REST service + UI). It should
retrieve/show the whole cube load history, e.g. Build (up to day X); Refresh
(up to day Y); Remove (X,Y); Refresh (up to day Z, meaning interval (X,Z)).
Some advanced data warehouse solutions, like SAP BW, do support this
functionality (with some constraints), which lets them handle the scenario
I describe above in a more elegant way.
Thank you for your consideration/review.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)