As Hive can have multiple partition columns, making the actual event date column an additional partition column in Hive may help in this case.
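For illustration, a minimal sketch of that idea (table and column names are hypothetical, not from the thread): a Hive fact table partitioned on both the ingestion date and the event date, so a query filtering on either column can prune partitions instead of scanning the full table.

```sql
-- Hypothetical fact table with dual partition columns.
CREATE TABLE fact_events (
    event_id BIGINT,
    metric   DOUBLE
)
PARTITIONED BY (
    ingest_date STRING,  -- physical ingestion date, e.g. '2017-04-17'
    event_date  STRING   -- actual event date; late events carry older values
);

-- Late-arriving data lands in today's ingest_date partition but the
-- correct event_date sub-partition, so both columns stay prunable:
-- INSERT INTO fact_events PARTITION (ingest_date='2017-04-17',
--                                    event_date='2017-04-10') ...
```

The trade-off is more, smaller partitions (one per ingestion/event date pair), which can strain the Hive metastore if event dates spread over many weeks.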
________________________________
From: ShaoFeng Shi <[email protected]>
Sent: April 17, 2017 11:54:48
To: dev
Subject: Re: Dual timestamp columns support in 2.0, Fact table Partitioned on ingestion date, aggregate on event date

Billy,

Junhai's question is about how to leverage the Hive partition column to avoid a full table scan when the Cube's partition date column is not the same as Hive's. This is a good point, I think. In many cases the Hive partition column isn't the Cube's partition column (one is physical, the other is logical). If Kylin could leverage both, that would be great.

In 2.0 there is no change on this part: you cannot specify an additional time range. So if you want to avoid repeatedly scanning the full Hive table, please use the Hive partition column as the Cube's partition column, and add the "actual event date" as a normal dimension. That will fulfill both needs, although at query time Kylin needs to scan additional segments, which may lower performance a bit.

"Specifying the ingestion date range" sounds like a good idea; could you please open a JIRA to track this? We can discuss it in detail on JIRA.

2017-04-17 11:30 GMT+08:00 Billy Liu <[email protected]>:

> Hi Junhai,
>
> If you want to build late-arriving data, you have to refresh the cube
> manually or call the refresh API. Kylin does not monitor the ingestion
> timestamp.
>
> 2017-04-13 22:07 GMT+08:00 Junhai Guo <[email protected]>:
>
> > My Hive fact table is partitioned on an ingestion date column, but I
> > need to build the cube, and query it, on the actual event date column.
> > Events can arrive days or even weeks late. I want to build the cube
> > incrementally each day by specifying the ingestion date range. Does 2.0
> > support this scenario: building the cube efficiently without scanning
> > the whole fact table, and merging the newly ingested data with the
> > existing calculations?
> >
> > Thanks
> >
> > Jerry

--
Best regards,

Shaofeng Shi 史少锋
