Better support for UHC (ultra high cardinality) columns is on the dev plan.
I'm thinking of adding custom encodings for dimensions.

However, even with those done, filtering URLs using LIKE will still be very
slow, because Kylin cannot pre-process and prepare for such filtering.

Alternatively, I'd suggest talking to the user to understand what they
actually want by matching URLs with LIKE. Ideally you can extract "features"
from the URL during the ETL process and store those features in the cube
instead of the long URL. For example, maybe what the user wants to know is
whether the URL is from a search engine (contains google, baidu, yahoo...).
Then a new column "IS_FROM_SEARCH_ENGINE" could be derived during ETL and
stored in the cube. Not only is this more practical, it is also more flexible
and extensible: SQL LIKE can only do substring matching, while your ETL
process can handle very complex business logic. A sketch of such an
enrichment is below.
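
For illustration only, here is a minimal Python sketch of that kind of ETL
enrichment. The IS_FROM_SEARCH_ENGINE column, the engine list and the row
layout are just assumptions for the example, not anything Kylin ships:

    from urllib.parse import urlparse

    # Hypothetical list of search engine names; replace with your own biz rules.
    SEARCH_ENGINES = ("google", "baidu", "yahoo", "bing")

    def is_from_search_engine(url):
        """Return 'Y' if the URL's host contains a known search engine name."""
        host = urlparse(url).netloc.lower()
        return "Y" if any(name in host for name in SEARCH_ENGINES) else "N"

    def enrich(row):
        """Add the derived IS_FROM_SEARCH_ENGINE column to one fact row."""
        row["IS_FROM_SEARCH_ENGINE"] = is_from_search_engine(row["URL"])
        return row

    # enrich({"URL": "http://www.google.com/search?q=kylin", "PV": 1})
    # -> {"URL": "...", "PV": 1, "IS_FROM_SEARCH_ENGINE": "Y"}

The cube then stores the low cardinality IS_FROM_SEARCH_ENGINE column as a
dimension, and the query becomes a simple equality filter on that column
instead of a LIKE '%google%' on the raw URL.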

On Thu, Feb 18, 2016 at 1:11 PM, hongbin ma <mahong...@apache.org> wrote:

> First of all, using high cardinality dimensions (especially space-consuming
> dimensions like URLs) is not a really good idea. Even if the cube is built
> successfully, the expansion ratio tends to be unacceptable. Besides, for
> LIKE functions, Kylin basically treats the column as another group-by
> dimension, so the performance will be really bad.
>
> When a high cardinality dimension has to be included, we have limited
> solutions now. In the 2.x-staging branch, which is not officially released
> yet, we're trying to address the issue by:
>
> 1. Using new aggregation group techniques to reduce the number of cuboids
> to compute (http://kylin.apache.org/blog/2016/02/18/new-agg).
> 2. Using a short fixed-length dimension (like url_id) to derive long
> strings (like url); check https://issues.apache.org/jira/browse/KYLIN-1313
> for more details.
> 3. Adopting a scalable dictionary solution to replace the current in-memory
> dictionary building (TBD).
>
> I'm also answering your questions inline below.
>
> On Thu, Feb 18, 2016 at 11:03 AM, yu feng <olaptes...@gmail.com> wrote:
>
> > Hi All:
> >     We are encountering some problems while supporting a requirement for a
> > cube with some high cardinality dimensions. Those dimensions are URLs, and
> > the user wants to use those dimensions in the where clause and filter with
> > the LIKE function. Besides, the cube has one distinct count measure.
> >
> > We have the following problems:
> > 1. For one URL dimension, the cardinality is about 500,000 per day, and
> > the size of the fact_distinct_columns file is 500 MB+, so when we build
> > the cube with more days, the job fails in the 'Build Dimension Dictionary'
> > step (one dimension file is about 3 GB).
> >
>
> Currently the Build Dimension Dictionary step builds the dictionary of each
> dimension in memory. There are too many URLs, and each URL is too long, so
> in-memory dictionary building fails due to OOM.
>
> >
> > 2. After building a segment for one day, we find that the LIKE filter is
> > very slow to convert into an IN filter, and the resulting filter is so big
> > that the buffer goes out of bounds.
> >
>
> For LIKE functions, Kylin basically treats the column as another group-by
> dimension, so the performance will be really bad.
>
> >
> > 3. While executing SQL with count(distinct col), the coprocessor is
> > disabled (why?), and the scanner returns so many tuples that it exceeds
> > the context threshold and the query fails.
> >
> The coprocessor is disabled in order to protect the region server from OOM.
>
>
> >
> > Has anyone encountered such problems, and how do you solve them when
> > creating a cube with high cardinality dimensions such as URLs?
> >
> > Any suggestions are welcome. Thanks a lot.
> >
>
>
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io
> Github: https://github.com/binmahone
>
