Re: Reply: Querying raw data / lowest granularity with Kylin

Li Yang Tue, 11 Aug 2015 22:02:12 -0700

Thanks for sharing the use cases. I can see the demand for seeing raw data
in Kylin.

I think the TopN feature may satisfy this need to a large extent.  Say that
for every aggregated number, the user can see the top 10000 records that
contribute to the sum.  Would this be enough?  I guess yes, because anything
beyond the top 10000 is a minority and of less interest.  And in case the
breakdown has fewer than 10000 rows, the user will see the full population.
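
A minimal sketch of that idea, in plain Python rather than anything
Kylin-specific: for each group, keep the aggregated sum plus the N largest
contributing records. The function name top_n_per_group, the sample rows and
the column names are made up for illustration, and this is not how Kylin
implements TopN.

```python
import heapq
from collections import defaultdict


def top_n_per_group(rows, key, measure, n):
    """Return {group: (total, top_rows)} for the given grouping column.

    total    -- sum of `measure` over all rows in the group
    top_rows -- up to n rows with the largest `measure`, largest first
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = {}
    for group, members in groups.items():
        total = sum(r[measure] for r in members)
        # If the group has fewer than n rows, nlargest returns all of them,
        # i.e. the full population.
        top = heapq.nlargest(n, members, key=lambda r: r[measure])
        result[group] = (total, top)
    return result


# Hypothetical sample data: transactions per book store.
rows = [
    {"store": "A", "txn": 1, "amount": 30},
    {"store": "A", "txn": 2, "amount": 10},
    {"store": "A", "txn": 3, "amount": 25},
    {"store": "B", "txn": 4, "amount": 5},
]
summary = top_n_per_group(rows, key="store", measure="amount", n=2)
```

Here store "A" gets its total plus only its two largest transactions, while
store "B", which has a single row, gets its full population.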

On Wed, Aug 12, 2015 at 1:35 AM, alex schufo <[email protected]> wrote:

> Thanks for those details.
>
> I read about mandatory dimensions in the presentation, but how does one
> make a dimension mandatory in the Cube Builder UI?
>
> In terms of use case I can see the following:
>
>    - Drill down from hierarchies (aggregations) to the lowest
>    granularity (raw data). For example, imagine you have book stores
>    everywhere in the US. The user would pick a date range and see the
>    number of sales per US state, then click on one state and see the
>    sales per city for that state, then click on one city and see the
>    sales per book store for that city, and finally, when clicking on one
>    store, see the actual transactions that led to those sales totals.
>    - Use Kylin as a single fast access point to Hadoop data: build cubes
>    for the regular OLAP process, but also be able to query other Hive
>    tables that do not specifically require aggregations but need
>    dimensional filtering on raw data, benefiting from the Kylin SQL
>    interface and fast HBase queries.
>
> These are not as strong requirements as what Kylin provides (OLAP) but
> having it would be very nice in my view, if it fits the project.
>
> On Tue, Aug 11, 2015 at 10:00 AM, Li Yang <[email protected]> wrote:
>
> > > ... at least one "group by" should always be used.
> >
> > This is correct. So the lowest granularity Kylin provides is by grouping
> > all dimensions, which is what Alex has tried if I understand correctly.
> > We believe this can solve 90% of analysis requirements.
> >
> > > ... using a lot of space whereas in this case it would not necessarily
> > > be used.
> >
> > You can set dimensions to be "mandatory" so that fewer dimension
> > combinations are calculated.  See more at
> > http://www.slideshare.net/YangLi43/design-cube-in-apache-kylin
> >
> > > "InvertedIndex" feature ... is still in early stage in terms of
> > > functionality and stability.
> >
> > Very true.  We have experimented with "inverted-index" to solve two
> > requirements: 1) Near Real Time data readiness in Kylin;  2) Query raw
> > data.  Later, 1) was solved by another feature called Stream Cubing, so
> > the priority of "inverted-index" dropped considerably, since the need for
> > raw record analysis did not seem strong.
> >
> >
> > Do you (or anyone) see raw record query as a must-have feature?  We'd
> > like to hear your use case.
> >
> > Cheers
> > Yang
> >
> > On Tue, Aug 11, 2015 at 8:30 AM, Luke Han <[email protected]> wrote:
> >
> > > Currently, Kylin does not support detail/raw data queries; that's why,
> > > as you already know, you have to add at least one "group by" to your
> > > query.
> > >
> > > As the demand for this feature is growing, we are actually evaluating
> > > it and will share our thoughts here soon.
> > >
> > > The roadmap has changed a little due to some priority changes.
> > > I'm drafting a new one for the coming release.
> > >
> > > Please let us know if there is any feature, function, or anything
> > > else that is missing but that your use cases really need.
> > >
> > > Thanks.
> > >
> > >
> > >
> > >
> > > Best Regards!
> > > ---------------------
> > >
> > > Luke Han
> > >
> > > On Mon, Aug 10, 2015 at 6:17 PM, Huang Hua <[email protected]>
> > > wrote:
> > >
> > > > I haven't used the "InvertedIndex" feature, but I think the feature
> > > > is still in an early stage in terms of functionality and stability.
> > > >
> > > > Back when we were using kylin-0.6, we had a very similar use case:
> > > > drilling down to the lowest granularity of the data.
> > > > What we did was define the filter columns as dimensions (mostly
> > > > defined as mandatory ones, to avoid cube expansion) and all other
> > > > result columns as measures.
> > > >
> > > > You can think of our case more as using Kylin to build a query index
> > > > in HBase in order to support queries like "fetch all transactions
> > > > given a user or server user ids or user names or other filters".
> > > > However, ultimately, we realized that maybe Kylin wasn't the best
> > > > option to support such queries, because Kylin is very good at rollup
> > > > queries with pre-computed measures and a limited number of filters.
> > > > Perhaps with the enhancement of "InvertedIndex" we can see more
> > > > possibilities from Kylin when dealing with the lowest granularity
> > > > queries.
> > > >
> > > > Best,
> > > > Hua
> > > > > -----Original Message-----
> > > > > From: dev-return-3593-
> > > > > [email protected] [mailto:
> > > > > dev-return-
> > > > > [email protected]] On Behalf Of alex
> > > > > schufo
> > > > > Sent: August 10, 2015 17:24
> > > > > To: [email protected]
> > > > > Subject: Querying raw data / lowest granularity with Kylin
> > > > >
> > > > > I have some scenarios where I would like to drill down to the
> > > > > lowest granularity of my table. Does Kylin handle this?
> > > > >
> > > > > If I am not mistaken, at least one "group by" should always be used.
> > > > >
> > > > > So I tried to query by grouping by all my dimensions at the same
> > > > > time:
> > > > > "select dim1, dim2, ..., dimN, sum(measure1), ..., sum(measureN)
> > > > > from ... where ... group by dim1, dim2, ..., dimN". This gives me
> > > > > the expected results.
> > > > > Is this the correct way to do it?
> > > > >
> > > > > Although this seems to work, with several dimensions it would mean
> > > > > building a lot of cubes and using a lot of space, whereas in this
> > > > > case it would not necessarily be used. I know that aggregation
> > > > > groups can be used to reduce this. With the same example I created
> > > > > 1 aggregation group for each dimension and the expansion rate is
> > > > > 200%, but I tested only on 5 dimensions.
> > > > > Again, is this the correct way to do it?
> > > > >
> > > > > Related to this topic, I saw:
> > > > >
> > > > > v0.7.x: InvertedIndex (HybridOLAP)
> > > > > Goal:
> > > > > Introduce InvertedIndex to optimise queries on raw data and low
> > > > > level aggregation
> > > > >
> > > > > on https://issues.apache.org/jira/browse/KYLIN-577
> > > > >
> > > > > Is this something that is currently available in 0.7.2? This ticket
> > > > > dates back to the beginning of 2015, so I am not sure whether it
> > > > > reflects Kylin's current plan or not.
> > > >
> > > >
> > > >
> > >
> >
>
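
As a rough illustration of why the mandatory dimensions discussed in the
quoted thread shrink the cube, the sketch below enumerates dimension
combinations in plain Python. It assumes a simplified model in which every
subset of the dimensions (including the empty one) is a candidate cuboid and
a mandatory dimension must appear in every cuboid; the function name cuboids
and the dimension names dim1..dim5 are hypothetical, and Kylin's real cuboid
planning (aggregation groups, hierarchies, and so on) is more involved.

```python
from itertools import combinations


def cuboids(dimensions, mandatory=()):
    """List dimension combinations, each containing all mandatory dimensions."""
    mandatory = frozenset(mandatory)
    optional = [d for d in dimensions if d not in mandatory]

    result = []
    for size in range(len(optional) + 1):
        for combo in combinations(optional, size):
            # Every cuboid is the mandatory set plus some optional subset.
            result.append(mandatory | frozenset(combo))
    return result


dims = ["dim1", "dim2", "dim3", "dim4", "dim5"]

full = cuboids(dims)                                   # 2**5 combinations
one_mandatory = cuboids(dims, mandatory=["dim1"])      # 2**4 combinations
two_mandatory = cuboids(dims, mandatory=["dim1", "dim2"])  # 2**3 combinations

print(len(full), len(one_mandatory), len(two_mandatory))
```

Under this simplified model each mandatory dimension halves the number of
combinations, which is the space saving the thread refers to.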
