Hi Shaofeng,

Currently, each partitioned range build will generate one segment. If Kylin could support two kinds of partitions, say the first is "partition on Hive ingestion" and the second is "partition on cube segment", then the "ingestion partition" would let Kylin scan only the part of the data that has never been built before. Kylin would then process all the late-arriving data and merge it into the existing segments. That is a refresh operation, I think. Are you proposing a build-and-refresh approach to address this requirement?

2017-04-17 11:54 GMT+08:00 ShaoFeng Shi <[email protected]>:

> Billy, Junhai's question is about how to leverage the Hive partition column
> to avoid a full table scan when the Cube's partition date isn't the same as
> Hive's.
>
> This is a good point, I think. In many cases the Hive partition column
> isn't the Cube's partition column (one is physical, the other is logical).
> If Kylin could leverage both, that would be great.
>
> In 2.0 there is no change in this area: you cannot specify an additional
> time range. So if you want to avoid repeatedly scanning the full Hive
> table, please use its partition column as the Cube's, and add "actual
> event date" as a normal dimension. That fulfills both needs, although at
> query time Kylin needs to scan additional segments, which may lower
> performance a bit.
>
> "Specifying the ingestion date range" sounds like a good idea; could you
> please open a JIRA to track this? We can discuss the details on JIRA.
>
>
> 2017-04-17 11:30 GMT+08:00 Billy Liu <[email protected]>:
>
> > Hi Junhai,
> >
> > If you want to build the late-arrived data, you have to refresh the
> > cube manually or call the refresh API. Kylin does not monitor the
> > ingestion timestamp.
> >
> > 2017-04-13 22:07 GMT+08:00 Junhai Guo <[email protected]>:
> >
> > > My Hive fact table is partitioned on the ingestion date column, but
> > > I need to build the cube, and query it, on the actual event date
> > > column. Events can arrive days or even weeks late. I want to build
> > > the cube incrementally each day by specifying the ingestion date
> > > range. Does 2.0 support this scenario? Can it build the cube
> > > efficiently without scanning the whole fact table, and merge the
> > > newly ingested data with the existing calculation?
> > >
> > > Thanks
> > >
> > > Jerry
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
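For readers following the thread: Billy's reply above mentions handling late-arrived data by "calling refresh API". As a minimal sketch, the request body for Kylin's documented cube rebuild endpoint (`PUT /kylin/api/cubes/{cubeName}/rebuild`, with `buildType` set to `REFRESH`) can be constructed as below. The host, cube name, and credentials in the trailing comment are placeholders for illustration, not values from this discussion:

```python
# Sketch: building the payload for Kylin's cube rebuild REST API to
# refresh an existing segment range (e.g. to fold in late-arriving data).
import json
from datetime import datetime, timezone


def epoch_ms(day: str) -> int:
    """Convert a 'YYYY-MM-DD' date string to epoch milliseconds (UTC)."""
    dt = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def refresh_payload(start_day: str, end_day: str) -> dict:
    """JSON body for PUT /kylin/api/cubes/{cubeName}/rebuild."""
    return {
        "startTime": epoch_ms(start_day),
        "endTime": epoch_ms(end_day),
        "buildType": "REFRESH",  # re-build existing segments in this range
    }


if __name__ == "__main__":
    body = refresh_payload("2017-04-10", "2017-04-11")
    print(json.dumps(body))
    # To actually submit the job (requires a running Kylin instance;
    # host, cube name, and credentials below are placeholders):
    # import requests
    # requests.put("http://kylin-host:7070/kylin/api/cubes/my_cube/rebuild",
    #              json=body, auth=("ADMIN", "KYLIN"))
```

Note the timestamps are epoch milliseconds and should align with existing segment boundaries, since a refresh rebuilds whole segments rather than arbitrary sub-ranges.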
