[ https://issues.apache.org/jira/browse/TRAFODION-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15627065#comment-15627065 ]
ASF GitHub Bot commented on TRAFODION-2322:
-------------------------------------------
GitHub user DaveBirdsall opened a pull request:
https://github.com/apache/incubator-trafodion/pull/809
[TRAFODION-2322] Improve UPDATE STATS performance on long char columns
This set of code changes addresses two issues:
1. UPDATE STATISTICS performance on long character columns can be very
poor. A bit of explanation: Trafodion by default maps string columns in Hive
tables to a VARCHAR(31999) data type. For large tables, UPDATE STATISTICS uses
a SQL query to compute groupings of values, and this query can be very slow for
large VARCHARs. Often the Hive string columns contain only short strings, which
makes the slow performance all the more annoying. The problem isn't limited to
Hive, though; a Trafodion table with a VARCHAR(31999) column that holds only
short strings suffers from the same malady.
2. Creating a persistent sample table on very long character columns (>
200000 bytes) would fail. A bit of explanation: Trafodion supports longer
strings on Hive tables than it does for its own Trafodion tables. UPDATE
STATISTICS sometimes creates sample tables, and today those sample tables are
always Trafodion tables, even when the original table is a Hive table. JIRA
TRAFODION-2251 fixed this problem for sample tables created on the fly, but
that fix did not cover persistent sample tables created with the UPDATE
STATISTICS CREATE SAMPLE syntax. As it turns out, fixing the first issue above
required reworking the TRAFODION-2251 fix, and in the process this additional
issue was fixed as well. Example commands illustrating both code paths appear
after this list.
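For context, here is a hedged sketch of the kinds of statements affected. The
table name, column choice, and sampling percentage are illustrative and not
taken from the change itself, and the exact form of the CREATE SAMPLE clause
may differ from what is shown:
    -- Collect statistics on a Hive table whose string columns map to
    -- VARCHAR(31999) by default
    UPDATE STATISTICS FOR TABLE hive.hive.lineitem ON EVERY COLUMN;
    -- Create a persistent sample table; this is the UPDATE STATISTICS
    -- CREATE SAMPLE code path referred to in issue 2 above
    UPDATE STATISTICS FOR TABLE hive.hive.lineitem CREATE SAMPLE RANDOM 5 PERCENT;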
Design notes:
When UPDATE STATISTICS needs to process a long character column, it now
truncates the value to 256 bytes by default. This can enormously improve the
performance of the underlying SQL queries that UPDATE STATISTICS runs on such
columns. The trade-off is that the unique entry count (UEC) may be
underestimated; in the worst case, a data set in which all strings share the
same first 256 bytes will have its UEC mis-estimated as one! The 256-byte limit
is not hard-coded; it can be changed using CQD
USTAT_MAX_CHAR_COL_LENGTH_IN_BYTES.
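For example, to raise the limit for the current session one could issue the
usual CQD statement (the value shown here is illustrative):
    CONTROL QUERY DEFAULT USTAT_MAX_CHAR_COL_LENGTH_IN_BYTES '1000';
    -- or, using the sqlci shorthand
    CQD USTAT_MAX_CHAR_COL_LENGTH_IN_BYTES '1000';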
As part of this processing, UPDATE STATISTICS may also create sample
tables. When creating these tables, UPDATE STATISTICS now truncates long
character columns to the same 256-byte limit.
In JIRA Trafodion-2251, we employed a similar technique for sample tables,
but it did not work in the UPDATE STATISTICS CREATE SAMPLE code path. That has
been fixed.
Also, the technique used in JIRA Trafodion-2251 was to use CREATE TABLE AS
SELECT in order to get the shorter columns. This is problematic for use with
Trafodion tables, as partitioning is not preserved. That hurts the performance
of sample table population and querying. Extending this to allow specifying
partitioning appeared to be hard (because in principle, the SELECT can be *any*
query). So instead, we now use a different technique. We have extended CREATE
TABLE LIKE to take a new LIMIT COLUMN LENGTH TO n clause. When present, CREATE
TABLE LIKE will limit any character column to n bytes. One nuance to this is
that if a truncated column would form part of the primary key, the primary key
constraint is replaced by a clustering key (STORE BY). We must do this since
truncating values may affect their uniqueness properties.
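As a hedged sketch of the extended syntax (the table names and the exact
placement of the new clause are illustrative; the clause itself, LIMIT COLUMN
LENGTH TO n, comes from this change):
    -- Create a Trafodion sample table from a Hive source table, capping
    -- every character column at 256 bytes
    CREATE TABLE trafodion.sch.lineitem_sample LIKE hive.hive.lineitem
      LIMIT COLUMN LENGTH TO 256;
    -- If a truncated column would have been part of the primary key, the new
    -- table gets a STORE BY clustering key instead of a PRIMARY KEY constraint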
A few other small changes are present. The debug routines in NAMemory.h
have been slightly changed and made public to simplify debugging buffer overrun
issues. Also, a clear() method has been added to NAString because I got tired
of trying to figure out the NAString equivalent of the STL string clear method.
So, NAString becomes just a tiny bit more like STL. Regression test
compGeneral/TEST023 was enhanced to exercise code paths where long character
columns are truncated. Test seabase/TEST040 was extended to include testing of
CREATE TABLE LIKE LIMIT COLUMN LENGTH.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/DaveBirdsall/incubator-trafodion Trafodion2322
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-trafodion/pull/809.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #809
----
commit 30645fe2c7a765e98d25b5b65ea4c4b3548a6447
Author: Dave Birdsall <[email protected]>
Date: 2016-11-01T23:33:41Z
[TRAFODION-2322] Improve UPDATE STATS performance on long character columns
----
> UPDATE STATS for Hive TPC-H Lineitem table takes much longer now
> ----------------------------------------------------------------
>
> Key: TRAFODION-2322
> URL: https://issues.apache.org/jira/browse/TRAFODION-2322
> Project: Apache Trafodion
> Issue Type: Bug
> Components: sql-cmp
> Affects Versions: 2.0-incubating
> Environment: All
> Reporter: David Wayne Birdsall
> Assignee: David Wayne Birdsall
>
> When using a LINEITEM table with about 12 million rows, and storing that
> LINEITEM table in Hive files, UPDATE STATISTICS has regressed in its
> performance. On one test system, the elapsed time changed from 6 minutes 20
> seconds to 31 minutes 31 seconds.