Re: trafodion query optimization

乔彦克 Mon, 19 Sep 2016 21:39:07 -0700

Many thanks  to Eric and Qifan again.
I upload a new log about the new ddl and the query plan.
@Eric, Following your advice I modified the ddl, the sum query now takes
25seconds while the old ddl will take 90seconds.(there are 3 nodes in my
cluster, 8cores(16 v-cores) and 16GRAM), Is there any other optimize
options to make the sum query more fast (better less than 10seconds)?
@Qifan, I log the output of showstats command in the attachment, looking
forward more suggestions.


Thanks again.
Qiao

Qifan Chen <[email protected]>于2016年9月19日周一 下午9:23写道：

> Hi Qiao,
>
> Thank you for the data. It is very helpful.
>
> There are several things noticed.
>
>    - The key columns are: _SALT_, _DIVISION_1_, SID, V_DATE, UID, VID
>    - The salt column is built from column SID only, which means all rows
>    with identical SID values V will be stored in the same partition.
>    - From the query plan, the compiler assigns 6 executor processes (we
>    call ESP) to read the salted table of 12 partitions: 1 ESP reads 2 salt
>    partitions.
>    - The frequency of V is high when sid='6b2a0957' (~8million rows), all
>    these relevant rows are handled by one execution process out of 6. That
>    probably is the reason of not much parallelism observed.
>    - You can use SQL command *showstats with detail* option to check the
>    frequency on column SID.
>       - showstats for table TRAFODION.SEABASE.VISIT_FROM_HIVE2 on SID
>       detail;
>    - If high frequency per unique value on column SID is confirmed, we
>    probably should consider our next step of action. For example, we could add
>    some columns from the primary key to the SALT clause to help spread V of
>    SID to all 12 partitions.
>
> Could you please send us the output of the showstats command above, and
> the showstats command below for all columns in the table?
>
> showstats for table TRAFODION.SEABASE.VISIT_FROM_HIVE2 on every column;
>
> Thanks --Qifan
>
> On Mon, Sep 19, 2016 at 3:15 AM, 乔彦克 <[email protected]> wrote:
>
>> Thanks Eric and Qifan. I am sorry to reply after so long a time because
>> I'm on the Chinese mid-autumn festival holiday.
>> According to Qifan's advice, I upload a log which contains the DDL and
>> the query plan hope to get more advice.
>>
>> and to Eric, I summit a jira about the block-encoding and the
>> compression, https://issues.apache.org/jira/browse/TRAFODION-2195, so I
>> only use the hbase compression.
>>
>> Qifan Chen <[email protected]>于2016年9月12日周一 下午10:43写道：
>>
>>> Hi Qiao,
>>>
>>> You can also send us the DDL and the query plan to help with the tuning.
>>>
>>> To generate a query plan, do the following from sqlci, and the plan is
>>> in text file mylog.
>>>
>>>
>>>    1. log mylog clear;
>>>    2. prepare xx from <your query>;
>>>    3. explain xx;
>>>    4. explain options 'f' xx;
>>>    5. exit;
>>>
>>>
>>> Thanks --Qifan
>>>
>>> On Mon, Sep 12, 2016 at 8:21 AM, Eric Owhadi <[email protected]>
>>> wrote:
>>>
>>>> Hello Qiao,
>>>> When you say whatever the table ddl it limits things a lot, as table ddl
>>>> will help define several things that will drastically improve the
>>>> degree of
>>>> parallelism and the table size.
>>>>
>>>> The DOP (degree of parallelism) of the scan operator is constrained by
>>>> the
>>>> number of regions your table uses. So if you want to increase DOP, you
>>>> need
>>>> to partition your table using the syntax like:
>>>> create table customer_demographics_salt
>>>> (
>>>>  cd_demo_sk int not null
>>>>  , cd_gender char(1)
>>>>  , cd_marital_status char(1)
>>>>  , cd_education_status char(20)
>>>>  , cd_purchase_estimate int
>>>>  , cd_credit_rating char(10)
>>>>  , cd_dep_count int
>>>>  , cd_dep_employed_count int
>>>>  , cd_dep_college_count int
>>>>  , primary key (cd_demo_sk)
>>>> )
>>>> salt using 12 partitions
>>>> ATTRIBUTES ALIGNED FORMAT
>>>> HBASE_OPTIONS
>>>>   (
>>>>     DATA_BLOCK_ENCODING = 'FAST_DIFF',
>>>>     COMPRESSION = 'SNAPPY'
>>>>   );
>>>>
>>>> you can experiment with different values of number of partitions (but
>>>> pick a
>>>> multiple of 3 since you have 3 nodes).
>>>>
>>>> Then the optimizer will pick the DOP with a compromise of resource
>>>> usage vs
>>>> gain in speed.
>>>> If you want to force higher DOP than what optimizer selected, you can
>>>> use :
>>>> CQD parallel_num_esps '12';
>>>> To force it to one ESP per partition (assuming you picked 12
>>>> partitions).
>>>> You can verify what optimizer picked as DOP by doing an explain on the
>>>> query.
>>>>
>>>> Other important factors plays in performance:
>>>> - use of aligned format (see above example)
>>>> - careful choice of the primary key (bad idea to use a varchar with big
>>>> max
>>>> size)
>>>> - It is good idea to use compression and encoding (see the most common
>>>> options we use above)
>>>> - you can also increase the HBASE_OPTIONS BLOCKSIZE parameter, but
>>>> there is
>>>> a drawback: increasing it will increase performance of SCAN but decrease
>>>> performance of keyed access. That is why I did not include it in the
>>>> example
>>>> above.
>>>>
>>>> Hope this helps,
>>>> Regards,
>>>> Eric
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: 乔彦克 [mailto:[email protected]]
>>>> Sent: Monday, September 12, 2016 1:22 AM
>>>> To: [email protected];
>>>> [email protected]
>>>> Subject: trafodion query optimization
>>>>
>>>> Hi all,
>>>>      I executed the sum and count query on my table where the cluster
>>>> has
>>>> three nodes. I found that the sum query is not well parallel
>>>> executed(not
>>>> all the three nodes get high load when executing the sum query) and the
>>>> cpu
>>>> load is very high while the memory load is very low(the machines have 16
>>>> cores and 16GB memory). My sum query on the 12 million data sets takes
>>>> about
>>>> 2 minutes and a half time.
>>>>     So my question is that is there any optimization advice that I can
>>>> use
>>>> to improve the query performance and maximize the usage of my machines,
>>>> what
>>>> ever the configuration or the table ddl.
>>>>     Any replies is appreciated.
>>>>
>>>> Thanks,
>>>> Qiao
>>>>
>>>
>>>
>>>
>>> --
>>> Regards, --Qifan
>>>
>>>
>
>
> --
> Regards, --Qifan
>
>

Re: trafodion query optimization

Reply via email to