Dayue Gao created KYLIN-2012:
--------------------------------

             Summary: more robust approach to hive schema changes
                 Key: KYLIN-2012
                 URL: https://issues.apache.org/jira/browse/KYLIN-2012
             Project: Kylin
          Issue Type: Bug
          Components: Metadata, REST Service, Web 
    Affects Versions: v1.5.3
            Reporter: Dayue Gao
            Assignee: Dayue Gao


Our users occasionally want to change an existing cube, for example by 
adding, renaming, or removing a dimension. Some of these changes require 
modifying the cube's source Hive table. After a user changes the table schema 
and reloads its metadata in Kylin, several issues can arise, depending on 
what was changed.

I ran some schema change tests against v1.5.3. The results after reloading 
the table are listed below:

|| type of changes || fact table || lookup table ||
| *minor* | both query and build still work | query can fail or return a wrong answer |
| *major* | fail to load related cube | fail to load related cube |

{{minor}} changes are those that don't change columns used in cubes, such as 
inserting/appending a new column or removing/changing an unused column.

{{major}} changes are the opposite, such as removing, renaming, or changing 
the type of a used column.
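
To make the minor/major distinction concrete, here is a minimal Java sketch 
of how a reloaded schema could be classified. It is an illustration only: 
{{SchemaChangeClassifier}}, {{ChangeType}} and {{classify}} are hypothetical 
names, not existing Kylin classes, and the sketch assumes a schema can be 
represented as a map from column name to data type.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Kylin: compares the old and the reloaded
// table schema against the columns that existing cubes reference.
final class SchemaChangeClassifier {
    enum ChangeType { NONE, MINOR, MAJOR }

    // oldCols / newCols map column name -> data type (assumed representation).
    // usedCols holds the names of columns referenced by at least one cube.
    static ChangeType classify(Map<String, String> oldCols,
                               Map<String, String> newCols,
                               Set<String> usedCols) {
        for (String col : usedCols) {
            String oldType = oldCols.get(col);
            String newType = newCols.get(col);
            // A used column that disappeared (removed or renamed) or whose
            // type changed counts as a major change.
            if (oldType != null && !oldType.equals(newType)) {
                return ChangeType.MAJOR;
            }
        }
        // Any remaining difference only touches unused columns.
        // Map equality ignores column order, so a pure reordering is NONE here.
        return oldCols.equals(newCols) ? ChangeType.NONE : ChangeType.MINOR;
    }
}
```

A rename shows up as a removed used column, so it is classified as major 
without any special handling.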

As the table shows, reloading a changed table is problematic in certain 
cases. KYLIN-1536 reports a similar problem.

So what can we do to support this kind of iterative development process (load 
-> define cube -> build -> reload -> change cube -> rebuild)?

My first thought was to simply detect and prohibit reloading a table that is 
in use. The user should be told which cube is preventing the reload, and 
could then drop the cube and recreate it after reloading. However, defining a 
cube is not an easy task (consider editing 100 measures). Forcing users to 
recreate their cubes over and over again will certainly not make them happy.
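
For reference, the detect-and-prohibit check could look roughly like the 
Java sketch below. {{ReloadGuard}}, {{blockingCubes}} and 
{{checkReloadAllowed}} are hypothetical names rather than Kylin APIs, and the 
sketch assumes the cube-to-table mapping is already available as a plain map 
with exact-match table names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical guard, not part of Kylin: refuses to reload a table that is
// still referenced by a cube and reports which cubes are in the way.
final class ReloadGuard {
    // cubeTables maps a cube name to the Hive tables it reads (fact and
    // lookup). Returns the names of cubes using the table, sorted by name.
    static List<String> blockingCubes(String table,
                                      Map<String, Set<String>> cubeTables) {
        Set<String> blockers = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : cubeTables.entrySet()) {
            if (entry.getValue().contains(table)) {
                blockers.add(entry.getKey());
            }
        }
        return new ArrayList<>(blockers);
    }

    // Throws if any cube still uses the table, naming the cubes so the user
    // knows what has to be dropped before reloading.
    static void checkReloadAllowed(String table,
                                   Map<String, Set<String>> cubeTables) {
        List<String> blockers = blockingCubes(table, cubeTables);
        if (!blockers.isEmpty()) {
            throw new IllegalStateException(
                "Cannot reload " + table + ", used by cubes: " + blockers);
        }
    }
}
```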

A better idea is to keep a cube editable even if it is broken because some 
columns changed after reloading. A broken cube can't be built or queried; it 
can only be edited or dropped. In fact, there is a cube status called 
{{RealizationStatusEnum.DESCBROKEN}} in the code, but it was never used. We 
should take advantage of it.

An enabled cube shouldn't allow schema changes, otherwise an unintentional 
reload could make it unavailable. Similarly, a disabled but unpurged cube 
shouldn't allow schema changes since it still has data in it.
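
The rules proposed above can be summarized in a small Java sketch. The names 
are assumptions made for illustration: {{CubePolicy}} and its {{Status}} enum 
are hypothetical and do not reproduce Kylin's {{RealizationStatusEnum}} (only 
the {{DESCBROKEN}} value is taken from the code), and "unpurged" is modeled 
here as a segment count greater than zero.

```java
// Hypothetical policy class, not part of Kylin: encodes the proposed rules
// for broken cubes and for tables whose schema is about to change.
final class CubePolicy {
    // Illustrative statuses; only DESCBROKEN is named in the Kylin code.
    enum Status { ENABLED, DISABLED, DESCBROKEN }

    // A broken cube can't be built or queried.
    static boolean canBuildOrQuery(Status status) {
        return status != Status.DESCBROKEN;
    }

    // Every cube, including a broken one, can still be edited or dropped.
    static boolean canEditOrDrop(Status status) {
        return true;
    }

    // A schema change on the source table is allowed only if the cube is
    // not enabled and holds no data (assumed: zero segments means purged).
    static boolean allowsSchemaChange(Status status, int segmentCount) {
        if (status == Status.ENABLED) {
            return false; // an unintentional reload would make it unavailable
        }
        return segmentCount == 0; // unpurged cubes still have data in them
    }
}
```

Under this sketch, a disabled cube has to be purged before a schema change 
on its source table is accepted.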






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
