GitHub user DaveBirdsall opened a pull request: https://github.com/apache/incubator-trafodion/pull/811
[TRAFODION-2322] Improve UPDATE STATS performance on long char columns This set of code changes addresses two issues: 1. UPDATE STATISTICS performance on long character columns can be very poor. A bit of explanation: With string columns in Hive tables in particular, Trafodion by default maps these to a VARCHAR(31999) data type. For large tables, UPDATE STATISTICS uses a SQL query to compute groupings of values. This query can be very slow for large VARCHARs. Often the Hive string columns contain just short strings, so this slow performance can be annoying. But the problem isnât just with Hive; a Trafodion table with a VARCHAR(31999) column that has just short strings in the column suffers from the same malady. 2. Creating a persistent sample table on very long character columns (> 200000 bytes) would fail. A bit of explanation: Trafodion supports longer strings on Hive tables than it does for its own Trafodion tables. UPDATE STATISTICS sometimes creates sample tables. Today the sample tables are always Trafodion tables, even though the original table is a Hive table. JIRA TRAFODION-2251 fixed this problem for sample tables created on-the-fly. However for persistent sample tables created using the UPDATE STATISTICS CREATE SAMPLE syntax, that fix did not work. As it turns out, in order to fix the first issue above, I ended up reworking the JIRA TRAFODION-2251 fix, and in the process fixed this additional issue as well. Design notes: When UPDATE STATISTICS needs to process a long character column, it now truncates the value to (by default) 256 bytes. This can enormously increase the performance of underlying SQL queries used by UPDATE STATISTICS on such columns. The trade-off is that unique entry count (UEC) may be underestimated. In particular, in the worst case, in a data set where all strings begin with the same first 256 bytes, the UEC will be mis-estimated as one! The â256â value is soft, and can be changed using CQD USTAT_MAX_CHAR_COL_LENGTH_IN_BYTES. As part of this processing, UPDATE STATISTICS also may create sample tables. When creating these tables, UPDATE STATISTICS now truncates long character columns to the same 256 byte limit. In JIRA Trafodion-2251, we employed a similar technique for sample tables, but it did not work in the UPDATE STATISTICS CREATE SAMPLE code path. That has been fixed (ustat/hs_yacc.y). Also, the technique used in JIRA Trafodion-2251 was to use CREATE TABLE AS SELECT in order to get the shorter columns. This is problematic for use with Trafodion tables, as partitioning is not preserved. That hurts the performance of sample table population and querying. Extending this to allow specifying partitioning appeared to be hard (because in principle, the SELECT can be *any* query). So instead, we now use a different technique. We have extended CREATE TABLE LIKE to take a new LIMIT COLUMN LENGTH TO n clause. When present, CREATE TABLE LIKE will limit any character column to n bytes. One nuance to this is that if a truncated column would form part of the primary key, the primary key constraint is replaced by a clustering key (STORE BY). We must do this since truncating values may affect their uniqueness properties. A few other small changes are present. The debug routines in NAMemory.h have been slightly changed and made public to simply debugging buffer overrun issues. Also, a clear() method has been added to NAString because I got tired of trying to figure out the NAString equivalent of the STL string clear method. So, NAString becomes just a tiny bit more like STL. Regression test compGeneral/TEST023 was enhanced to exercise code paths where long character columns are truncated. Test seabase/TEST040 was extended to include testing of CREATE TABLE LIKE LIMIT COLUMN LENGTH. You can merge this pull request into a Git repository by running: $ git pull https://github.com/DaveBirdsall/incubator-trafodion Trafodion2322a Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-trafodion/pull/811.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #811 ---- commit dfaaa41b044980ddf62c74b8e8778b29e3f267eb Author: Dave Birdsall <dbirds...@apache.org> Date: 2016-11-02T18:20:20Z [TRAFODION-2322] Improve UPDATE STATS performance on long char columns ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---