[ https://issues.apache.org/jira/browse/TRAFODION-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15627065#comment-15627065 ]
ASF GitHub Bot commented on TRAFODION-2322:
-------------------------------------------
GitHub user DaveBirdsall opened a pull request:
https://github.com/apache/incubator-trafodion/pull/809
[TRAFODION-2322] Improve UPDATE STATS performance on long char columns
This set of code changes addresses two issues:
1. UPDATE STATISTICS performance on long character columns can be very
poor. A bit of explanation: Trafodion by default maps string columns in Hive
tables to a VARCHAR(31999) data type. For large tables, UPDATE STATISTICS uses
a SQL query to compute groupings of values, and this query can be very slow for
large VARCHARs. Often the Hive string columns contain only short strings, which
makes the slow performance all the more annoying. The problem isn't limited to
Hive, though; a Trafodion table with a VARCHAR(31999) column that holds only
short strings suffers from the same malady.
2. Creating a persistent sample table on very long character columns (>
200000 bytes) would fail. A bit of explanation: Trafodion supports longer
strings on Hive tables than it does for its own Trafodion tables. UPDATE
STATISTICS sometimes creates sample tables, and today those sample tables are
always Trafodion tables, even when the original table is a Hive table. JIRA
TRAFODION-2251 fixed this problem for sample tables created on the fly, but
that fix did not cover persistent sample tables created with the UPDATE
STATISTICS CREATE SAMPLE syntax. As it turns out, fixing the first issue above
required reworking the TRAFODION-2251 fix, and in the process this additional
issue was fixed as well. Example commands illustrating both code paths appear
after this list.
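For context, here is a hedged sketch of the kinds of statements affected. The
table name, column choice, and sampling percentage are illustrative and not
taken from the change itself, and the exact form of the CREATE SAMPLE clause
may differ from what is shown:
    -- Collect statistics on a Hive table whose string columns map to
    -- VARCHAR(31999) by default
    UPDATE STATISTICS FOR TABLE hive.hive.lineitem ON EVERY COLUMN;
    -- Create a persistent sample table; this is the UPDATE STATISTICS
    -- CREATE SAMPLE code path referred to in issue 2 above
    UPDATE STATISTICS FOR TABLE hive.hive.lineitem CREATE SAMPLE RANDOM 5 PERCENT;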
Design notes:
When UPDATE STATISTICS needs to process a long character column, it now
truncates the value to 256 bytes by default. This can enormously improve the
performance of the underlying SQL queries that UPDATE STATISTICS runs on such
columns. The trade-off is that the unique entry count (UEC) may be
underestimated; in the worst case, a data set in which all strings share the
same first 256 bytes will have its UEC mis-estimated as one! The 256-byte limit
is not hard-coded; it can be changed using CQD
USTAT_MAX_CHAR_COL_LENGTH_IN_BYTES.
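For example, to raise the limit for the current session one could issue the
usual CQD statement (the value shown here is illustrative):
    CONTROL QUERY DEFAULT USTAT_MAX_CHAR_COL_LENGTH_IN_BYTES '1000';
    -- or, using the sqlci shorthand
    CQD USTAT_MAX_CHAR_COL_LENGTH_IN_BYTES '1000';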
As part of this processing, UPDATE STATISTICS may also create sample
tables. When creating these tables, UPDATE STATISTICS now truncates long
character columns to the same 256-byte limit.
In JIRA Trafodion-2251, we employed a similar technique for sample tables,
but it did not work in the UPDATE STATISTICS CREATE SAMPLE code path. That has
been fixed.
Also, the technique used in JIRA Trafodion-2251 was to use CREATE TABLE AS
SELECT in order to get the shorter columns. This is problematic for use with
Trafodion tables, as partitioning is not preserved. That hurts the performance
of sample table population and querying. Extending this to allow specifying
partitioning appeared to be hard (because in principle, the SELECT can be *any*
query). So instead, we now use a different technique. We have extended CREATE
TABLE LIKE to take a new LIMIT COLUMN LENGTH TO n clause. When present, CREATE
TABLE LIKE will limit any character column to n bytes. One nuance to this is
that if a truncated column would form part of the primary key, the primary key
constraint is replaced by a clustering key (STORE BY). We must do this since
truncating values may affect their uniqueness properties.
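As a hedged sketch of the extended syntax (the table names and the exact
placement of the new clause are illustrative; the clause itself, LIMIT COLUMN
LENGTH TO n, comes from this change):
    -- Create a Trafodion sample table from a Hive source table, capping
    -- every character column at 256 bytes
    CREATE TABLE trafodion.sch.lineitem_sample LIKE hive.hive.lineitem
      LIMIT COLUMN LENGTH TO 256;
    -- If a truncated column would have been part of the primary key, the new
    -- table gets a STORE BY clustering key instead of a PRIMARY KEY constraint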
A few other small changes are present. The debug routines in NAMemory.h
have been slightly changed and made public to simplify debugging buffer overrun
issues. Also, a clear() method has been added to NAString because I got tired
of trying to figure out the NAString equivalent of the STL string clear method.
So, NAString becomes just a tiny bit more like STL. Regression test
compGeneral/TEST023 was enhanced to exercise code paths where long character
columns are truncated. Test seabase/TEST040 was extended to include testing of
CREATE TABLE LIKE LIMIT COLUMN LENGTH.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/DaveBirdsall/incubator-trafodion Trafodion2322
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-trafodion/pull/809.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #809
----
commit 30645fe2c7a765e98d25b5b65ea4c4b3548a6447
Author: Dave Birdsall <[email protected]>
Date: 2016-11-01T23:33:41Z
[TRAFODION-2322] Improve UPDATE STATS performance on long character columns
----
> UPDATE STATS for Hive TPC-H Lineitem table takes much longer now
> ----------------------------------------------------------------
>
> Key: TRAFODION-2322
> URL: https://issues.apache.org/jira/browse/TRAFODION-2322
> Project: Apache Trafodion
> Issue Type: Bug
> Components: sql-cmp
> Affects Versions: 2.0-incubating
> Environment: All
> Reporter: David Wayne Birdsall
> Assignee: David Wayne Birdsall
>
> When using a LINEITEM table with about 12 million rows, and storing that
> LINEITEM table in Hive files, UPDATE STATISTICS has regressed in its
> performance. On one test system, the elapsed time changed from 6 minutes 20
> seconds to 31 minutes 31 seconds.