Re: Re: Increase query performance

Huang Hua Wed, 20 May 2015 05:48:06 -0700

@Dong, please find my answers to your questions below:

1. HARD_THRESHOLD is an upper bound: no query may scan more rows than that 
value. kylin.query.scan.threshold lets you customize the scan limit, but the 
value you set is still capped by HARD_THRESHOLD.


2. long rowEst = MEM_BUDGET_PER_QUERY / rowSizeEst is how Kylin computes an 
estimated scan limit for SQL queries containing count(distinct xx). As I 
understand it, Kylin uses HyperLogLog to calculate an approximate distinct 
count, which may require a lot of memory. I don't know the details of why; it 
would be great if someone could comment on or explain this.

3. String propThreshold = connProps.getProperty(OLAPQuery.PROP_SCAN_THRESHOLD); 
This is just the logic by which Kylin sets the scan limit for each OLAPContext; 
not a big deal here.
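
To make points 1 and 2 concrete, here is a minimal Java sketch of how the two 
limits could relate. It is not Kylin's actual source. HARD_THRESHOLD uses the 
value quoted in this thread; the class name, the method names, and the value of 
MEM_BUDGET_PER_QUERY are assumptions made only for illustration. Capping the 
memory-based estimate at HARD_THRESHOLD is also an assumption of this sketch 
(the quoted code simply casts rowEst to int).

```java
// Illustrative sketch only; not Kylin's actual source.
final class ScanThresholdSketch {

    // Value quoted in this thread from StorageContext.
    static final int HARD_THRESHOLD = 4_000_000;

    // Hypothetical budget (3 GiB); the real constant's value is not given in this thread.
    static final long MEM_BUDGET_PER_QUERY = 3L * 1024 * 1024 * 1024;

    /** Configured limit (kylin.query.scan.threshold), capped by HARD_THRESHOLD. */
    static int effectiveThreshold(int configuredThreshold) {
        return Math.min(configuredThreshold, HARD_THRESHOLD);
    }

    /** Memory-based estimate: how many rows of the given size fit in the budget. */
    static int memoryBasedThreshold(long rowSizeEst) {
        if (rowSizeEst <= 0) {
            throw new IllegalArgumentException("rowSizeEst must be positive");
        }
        long rowEst = MEM_BUDGET_PER_QUERY / rowSizeEst;
        // Cap before narrowing to int so a large estimate cannot overflow.
        return (int) Math.min(rowEst, (long) HARD_THRESHOLD);
    }

    public static void main(String[] args) {
        System.out.println(effectiveThreshold(1_000_000));   // below the cap, kept as is
        System.out.println(effectiveThreshold(10_000_000));  // above the cap, reduced to it
        System.out.println(memoryBasedThreshold(16 * 1024)); // large rows give a small limit
    }
}
```

The larger the estimated row size (for example, rows carrying distinct-count 
measures), the smaller the memory-based limit becomes.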

Best.
Hua
> -----Original Message-----
> From: dev-return-1759-
> [email protected] [mailto:dev-return-
> [email protected]] On Behalf Of dong
> wang
> Sent: May 19, 2015 18:45
> To: [email protected]
> Subject: Re: Re: Increase query performance
> 
> In addition, what's the difference between kylin.query.scan.threshold inside
> kylin.properties and HARD_THRESHOLD inside class StorageContext?
> 
> 2015-05-19 16:29 GMT+08:00 dong wang <[email protected]>:
> 
> > Thanks Hua, from ur tests and detailed explanation, it should be safe
> > to change the value of the threshold directly and rebuild the codes.
> > however, I still have 2 questions about the threshold, from the codes,
> > there are 2 places which set the threshold:
> > 1,         long rowEst = MEM_BUDGET_PER_QUERY / rowSizeEst;
> >         context.setThreshold((int) rowEst);
> >
> > 2,         String propThreshold =
> > connProps.getProperty(OLAPQuery.PROP_SCAN_THRESHOLD);
> >         int threshold = Integer.valueOf(propThreshold);
> >         olapContext.storageContext.setThreshold(threshold);
> >
> >
> > does anyone know what's the purposes for the 2 settings?
> >
> > 2015-05-18 11:40 GMT+08:00 Li Yang <[email protected]>:
> >
> >> A query goes through GUI -> Query Engine -> HBase. You can analyze
> >> the response time of each step.
> >>
> >> - GUI, see browser console
> >> - Query Engine, see the "===[QUERY]===" log lines. In the
> >> kylinlog1.txt, it's 151 seconds
> >> - HBase, see the "HBase Metrics" log line. However I didn't see it in
> >> the log files. Weird.
> >>
> >> Concerning the SQL that pulls all 18 columns, transferring the big
> >> result alone may already eat up most of the time.
> >>
> >> Btw, Kylin is not designed for ETL purpose, we don't expect user to
> >> pull millions of rows, at least not for usual cases. Result set is
> >> typically a few thousands for interactive analysis.
> >>
> >> Cheers
> >> Yang
> >>
> >> On Mon, May 18, 2015 at 10:23 AM, dong wang
> <[email protected]>
> >> wrote:
> >>
> >> > Thanks hua, usually users don't need to fetch 4,000,000 + rows of
> >> > the result, but for the intermediate query result, the row number
> >> > may be
> >> much
> >> > more than 4,000,000+ rows,  in your above reply, u mentioned that
> >> > we can just change the value of the setting, then rebuild the codes
> >> > and restart the tomcat,  is it what you have already tested?  since
> >> > currently there
> >> are
> >> > so much data in the existing cubes, I have to make it sure that all
> >> > such operations are safe to take~
> >> >
> >> > 2015-05-15 21:29 GMT+08:00 Adunuthula, Seshu
> <[email protected]>:
> >> >
> >> > > As a short term fix, does it make sense to make this a tunable
> >> parameter
> >> > > and move this to a config file?
> >> > >
> >> > > On 5/15/15, 5:58 AM, "Huang Hua" <[email protected]>
> wrote:
> >> > >
> >> > > >Hi Dong,
> >> > > >
> >> > > >I don't think so. You can safely change that setting but then
> >> > > >you
> >> need
> >> > to
> >> > > >recompile kylin to generate the new war(don't use the deploy.sh
> >> because
> >> > > >that will wipe out all your kylin hbase meta storage). After the
> >> > > >war
> >> is
> >> > > >generated, put that war under Tomcat webapps directory and
> >> > > >restarts
> >> the
> >> > > >Tomcat. That should work well.
> >> > > >
> >> > > >Best.
> >> > > >Hua
> >> > > >> -----Original Message-----
> >> > > >> From: dev-return-1698-
> >> > > >> [email protected] [mailto:
> >> > dev-return-
> >> > > >> [email protected]] On Behalf Of
> >> > > >> dong wang
> >> > > >> Sent: May 15, 2015 18:54
> >> > > >> To: [email protected]
> >> > > >> Subject: Re: Increase query performance
> >> > > >>
> >> > > >> I found the setting for the threshold locates in
> >> StorageContext.java,
> >> > > >>the
> >> > > >> related piece of codes are:
> >> > > >> public class StorageContext {
> >> > > >>
> >> > > >>     public static final int HARD_THRESHOLD = 4000000;
> >> > > >>
> >> > > >>
> >> > > >> thus, I have a question that currently I have already built
> >> > > >>some segments  successfully,  later on, if I change the
> >> > > >>threshold much greater,
> >> will
> >> > > >>it affect the
> >> > > >> existing data in the cube storage?
> >> > > >>
> >> > > >> 2015-05-15 18:48 GMT+08:00 dong wang
> <[email protected]>:
> >> > > >>
> >> > > >> > Hi all, today I also met with the same problem, however,
> >> > > >> > maybe
> >> mine
> >> > is
> >> > > >> > much more strange, the SQL lies in the following:
> >> > > >> > select count(* ) from (select 1 from test1 where condtionx
> >> > > >> > group
> >> by
> >> > > >> > col1, col2, col3) t1
> >> > > >> >
> >> > > >> > since the result of the sub query is greater than 4000000,
> >> > > >> > the exception is thrown out~ however, the final row count of
> >> > > >> > the
> >> > whole
> >> > > >> > SQL is just 1 row, such kind of SQL is usually implemented
> >> > > >> > to
> >> obtain
> >> > > >> > the total row count of some queries for paging feature~
> >> > > >> >
> >> > > >> > 2015-05-13 18:15 GMT+08:00 Parkavi Nandagopal <
> >> [email protected]>:
> >> > > >> >
> >> > > >> >> After getting that below error (Scan row count exceeded
> >> threshold:
> >> > > >> >> 4000000), kylin is stopped/crashed automatically.
> >> > > >> >> Is Kylin single point of Failure?
> >> > > >> >> How to make it has an High availability?
> >> > > >> >>
> >> > > >> >> Thanks,
> >> > > >> >> Parkavi.
> >> > > >> >>
> >> > > >> >>
> >> > > >> >> -----Original Message-----
> >> > > >> >> From: Parkavi Nandagopal
> >> > > >> >> Sent: Wednesday, May 13, 2015 10:49 AM
> >> > > >> >> To: dev; '[email protected]'
> >> > > >> >> Subject: RE: Increase query performance
> >> > > >> >>
> >> > > >> >> Size of my hive fact table = 3.27 GB ( row count
> >> > > >> >> 25,236,160)
> >> Cube
> >> > > >> >> size =
> >> > > >> >> 2.21 GB
> >> > > >> >>
> >> > > >> >> I created hierarchy dimension with 18 levels.
> >> > > >> >> Col1 -> Col2 -> ......upto Col18 For this 18 levels, total
> >> > > >> >> cardinality = 2635
> >> > > >> >>
> >> > > >> >> I attached 2 log files.
> >> > > >> >> Log1 - query with limit 1000000 Partial result came.
> >> > > >> >> Log2 - Clicked show all in Query result.
> >> > > >> >> Getting ERROR : exception while executing query: Scan row
> >> > > >> >> count exceeded
> >> > > >> >> threshold: 4000000, please add filter condition to narrow
> >> > > >> >> down backend scan range, like where clause.
> >> > > >> >>
> >> > > >> >> Thanks,
> >> > > >> >> Parkavi.
> >> > > >> >>
> >> > > >> >> -----Original Message-----
> >> > > >> >> From: hongbin ma [mailto:[email protected]]
> >> > > >> >> Sent: Wednesday, May 13, 2015 7:15 AM
> >> > > >> >> To: dev
> >> > > >> >> Subject: Re: Increase query performance
> >> > > >> >>
> >> > > >> >> before you expand your cluster, you might need to analyse
> >> > > >> >> why
> >> it's
> >> > > >> >> delivering poor performance.
> >> > > >> >>
> >> > > >> >> how about the size of your hive fact table? the cardinality
> >> > > >> >> of
> >> the
> >> > > >> >> dimension columns?
> >> > > >> >>
> >> > > >> >> if possible you can run a query,and paste the query's log
> >> > > >> >> in KYLIN_HOME/logs/kylin.log for that query. we can help
> >> > > >> >> you check
> >> for
> >> > > >> >> any abnormalities. (make sure you're writing a slightly
> >> different
> >> > > >> >> query, to avoid hitting cache)
> >> > > >> >>
> >> > > >> >> On Tue, May 12, 2015 at 2:04 PM, Parkavi Nandagopal
> >> > > >> >> <[email protected]>
> >> > > >> >> wrote:
> >> > > >> >>
> >> > > >> >> > Hi ,
> >> > > >> >> >
> >> > > >> >> > I have installed kylin and created cube(3GB size) with
> >> > > >> >> > only
> >> one
> >> > > >> >> > region server and when I query the cube data, it is
> >> > > >> >> > taking
> >> much
> >> > > >> >> > time to show the query result in Kylin web UI.
> >> > > >> >> > If I add 3 or more region server node with high
> >> > > >> >> > configuration
> >> > and I
> >> > > >> >> > create a cube then query the cube means will it increase
> >> > > >> >> > the
> >> > query
> >> > > >> >> performance?
> >> > > >> >> >
> >> > > >> >> >
> >> > > >> >> > Thanks,
> >> > > >> >> > Parkavi.
> >> > > >> >> >
> >> > > >> >> >
> >> > > >> >>
> >> > > >> >>
> >> > > >> >>
> >> > > >> >> --
> >> > > >> >> Regards,
> >> > > >> >>
> >> > > >> >> *Bin Mahone | 马洪宾*
> >> > > >> >> Apache Kylin: http://kylin.io
> >> > > >> >> Github: https://github.com/binmahone
> >> > > >> >>
> >> > > >> >
> >> > > >> >
> >> > > >
> >> > > >
> >> > >
> >> > >
> >> >
> >>
> >
> >
