[ https://issues.apache.org/jira/browse/TRAFODION-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546769#comment-15546769 ]
ASF GitHub Bot commented on TRAFODION-2251:
-------------------------------------------
GitHub user DaveBirdsall opened a pull request:
https://github.com/apache/incubator-trafodion/pull/744
[TRAFODION-2251] Fix upd stats issues with long char/varchar columns
The following issues in UPDATE STATISTICS concerning long char/varchar
columns have been fixed:
1. Creating a sample table would fail if the source table was a Hive table
with columns longer than the Trafodion maximum (200000 bytes).
2. UPDATE STATISTICS might fail because the Optimizer raised a 2053 warning
during pass 2 optimization. This warning occurs for some queries on Hive
tables with columns more than 1 million bytes long. Now UPDATE STATISTICS
ignores the warning and continues. (The Optimizer does generate a plan in
spite of the warning.)
3. UPDATE STATISTICS might abend, producing a core file, when sorting
varchars longer than 32767 bytes.
Design notes:
The 32767 limitation is due to the ISVarChar class using a 2-byte length
field. I chose not to increase this to 4 bytes, as that would reduce the amount
of memory available for existing UPDATE STATISTICS scenarios. Instead, I chose
to truncate longer values to fit within 32767 bytes using the SQL SUBSTRING
function. The result is that UEC might be underestimated for such columns, for
example when the first 32767 bytes of two values are equal and the values
differ only at the 32768th or a later byte (see the sketch below). We can lift
this limitation either by accepting the loss of memory to a longer length
field, or by adding another class, say ISLongVarChar, for these long values.
That work is left for the future, when needed.
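As an illustration (a hedged sketch; this query is not part of the patch, and
the column and table names are taken from the repro in the JIRA description
below), truncation can make the UEC computed from the truncated values smaller
than the true UEC:

    -- If two c_string1 values agree in their first 32767 bytes and differ
    -- only beyond that point, truncated_uec counts them as one value while
    -- true_uec counts them as two.
    SELECT COUNT(DISTINCT SUBSTRING(c_string1 FROM 1 FOR 32767)) AS truncated_uec,
           COUNT(DISTINCT c_string1) AS true_uec
    FROM hive.mysch.mytable;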
The logic that creates the sample table previously used CREATE TABLE LIKE.
This does not work for Hive source tables with char/varchar columns longer
than 200000 bytes, as Trafodion imposes a 200000-byte limit on its own
columns. Instead, we now use CREATE TABLE AS SELECT, with SUBSTRING to limit
the size of the long char/varchar columns. (We also use SUBSTRING in the
subsequent UPSERT used to populate the table; see the sketch below.)
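Conceptually, the generated statements look something like the following (a
hedged sketch: sample_t stands in for the internally generated sample table
name, the exact text and NO LOAD placement of the internal statements may
differ, and the source table is the one from the repro below):

    -- Create the empty sample table; NO LOAD skips the load step. SUBSTRING
    -- keeps each long column within Trafodion's 200000-byte column limit.
    CREATE TABLE trafodion."_HIVESTATS_".sample_t NO LOAD
      AS SELECT c_int,
                SUBSTRING(c_string1 FROM 1 FOR 200000) AS c_string1,
                SUBSTRING(c_string2 FROM 1 FOR 200000) AS c_string2,
                p_int
         FROM hive.mysch.mytable;
    -- Populate it with the sampled rows, applying the same truncation.
    UPSERT USING LOAD INTO trafodion."_HIVESTATS_".sample_t
      SELECT c_int,
             SUBSTRING(c_string1 FROM 1 FOR 200000),
             SUBSTRING(c_string2 FROM 1 FOR 200000),
             p_int
      FROM hive.mysch.mytable SAMPLE RANDOM 50 PERCENT;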
CREATE TABLE AS SELECT did not support running in a user transaction, but
UPDATE STATISTICS needs to do so. Fortunately, we only use the NO LOAD flavor
of CREATE TABLE AS SELECT, so there is no non-transactional load step.
Therefore, the ExExeUtilLoad tcb has been changed to support running in a user
transaction if NO LOAD has been specified.
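A hedged sketch of the distinction (the syntax is illustrative and sample_t is
a made-up name): because NO LOAD omits the non-transactional load step, the
statement can now run while a transaction is open:

    BEGIN WORK;
    -- Allowed after this change: NO LOAD means there is no load step that
    -- would have to run outside the transaction.
    CREATE TABLE sample_t NO LOAD AS SELECT c_int FROM hive.mysch.mytable;
    -- A loading CREATE TABLE AS SELECT would still be rejected here.
    COMMIT WORK;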
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/DaveBirdsall/incubator-trafodion Trafodion2251
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-trafodion/pull/744.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #744
----
commit 8a7fe53736fb5720199e23864910df3cb092155d
Author: Dave Birdsall <[email protected]>
Date: 2016-10-04T21:35:42Z
[TRAFODION-2251] Fix upd stats issues with long char/varchar columns
----
> update stats does not work for large Hive varchar columns
> ---------------------------------------------------------
>
> Key: TRAFODION-2251
> URL: https://issues.apache.org/jira/browse/TRAFODION-2251
> Project: Apache Trafodion
> Issue Type: Bug
> Components: sql-cmp
> Affects Versions: 2.1-incubating
> Environment: All
> Reporter: David Wayne Birdsall
> Assignee: David Wayne Birdsall
>
> As shown below, update statistics currently returns various errors once a
> column's length exceeds 200000 bytes. But with large varchar support, the
> 64MB total row size is theoretically the only limit, and a long varchar
> column can easily exceed 200000 bytes.
> It's not clear if we intend to support update statistics for large varchar.
> If we don't plan to support it, perhaps update statistics can at least skip
> long varchar columns (much like the way it skips LOB columns right now), so
> that it does not fail when a user runs update statistics on 'every column'
> that includes long varchars.
> The following execution sequence also showed a second problem, which may or
> may not be related. The first update statistics failed with 'every column',
> which includes long varchar columns. Showstats didn't show any stats
> afterwards. The second update statistics was then run again on the long
> varchar columns; it didn't return any errors, yet showstats afterwards still
> showed no stats.
> $ cat hive.sql
> drop database if exists mysch cascade;
> create database mysch;
> use mysch;
> create table mytable(c_int int, c_string1 string, c_string2 string, p_int
> int) row format delimited fields terminated by '|';
> load data local inpath './data.1row_10MB.txt' overwrite into table mytable;
> $ hive -f hive.sql
> WARNING: Use "yarn jar" to launch YARN applications.
> Logging initialized using configuration in
> file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
> OK
> Time taken: 1.455 seconds
> OK
> Time taken: 0.185 seconds
> OK
> Time taken: 0.304 seconds
> OK
> Time taken: 0.371 seconds
> Loading data to table mysch.mytable
> Table mysch.mytable stats: [numFiles=1, totalSize=20971526]
> OK
> Time taken: 0.967 seconds
> $ sqlci -i mytest.sql
> EsgynDB Advanced Conversational Interface 2.2.0
> Copyright (c) 2015-2016 Esgyn Corporation
> >>log mytest.log clear;
> >>cqd HIVE_MAX_STRING_LENGTH '10485760';
> --- SQL operation complete.
> >>set schema hive.mysch;
> --- SQL operation complete.
> >>drop external table if exists mytable for hive.mysch.mytable;
> --- SQL operation complete.
> >>create external table mytable (c_int int, c_string1 varchar(10485760),
> >>c_string2 varchar(10485760), p_int int) for hive.mysch.mytable;
> --- SQL operation complete.
> >>
> >>update statistics for table mytable on every column sample random 50
> >>percent;
> *** ERROR[9214] Object
> TRAFODION."_HIVESTATS_".TRAF_SAMPLE_6638414188583073746_1473710297_39124
> could not be created.
> *** ERROR[4247] Specified size in bytes (10485760) exceeds the maximum size
> allowed (200000) for column C_STRING1.
> *** ERROR[9200] UPDATE STATISTICS for table HIVE.MYSCH.MYTABLE encountered an
> error (8609) from statement Process_Query.
> *** ERROR[8609] Waited rollback performed without starting a transaction.
> --- SQL operation failed with errors.
> >>showstats for table mytable on c_int to c_string2;
> Histogram data for Table HIVE.MYSCH.MYTABLE
> Table ID: 6638414188583073746
> Hist ID # Ints Rowcount UEC Colname(s)
> ========== ====== =========== =========== ===========================
> No Histograms exist for the requested columns or groups
> --- SQL operation complete.
> >>
> >>update statistics for table mytable on c_int to c_string2 sample random 50
> >>percent;
> --- SQL operation complete.
> >>showstats for table mytable on c_int to c_string2;
> Histogram data for Table HIVE.MYSCH.MYTABLE
> Table ID: 6638414188583073746
> Hist ID # Ints Rowcount UEC Colname(s)
> ========== ====== =========== =========== ===========================
> No Histograms exist for the requested columns or groups
> --- SQL operation complete.
> To reproduce, create 3 files with the following contents and run them in this
> order:
> chmod 755 ./data_gen.py
> ./data_gen.py data.1row_10MB.txt 1 2 10485760
> hive -f hive.sql
> sqlci -i mytest.sql
> $ cat data_gen.py
> #! /usr/bin/env python
> import sys
> if len(sys.argv) != 5 or \
>    sys.argv[1].lower() == '-h' or \
>    sys.argv[1].lower() == '-help':
>     print('Usage: ' + sys.argv[0] + ' <file> <num of rows> <num of varchar columns> <varchar column length>')
>     sys.exit()
> f = open(sys.argv[1], "w+")
> marker = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
> for num_rows in range(0, int(sys.argv[2])):
>     f.write(str(num_rows) + '|')
>     for num_cols in range(0, int(sys.argv[3])):
>         # each varchar column: one marker letter, then digits up to the
>         # requested length, then the field delimiter
>         f.write(marker[num_rows % len(marker)])
>         for i in range(1, int(sys.argv[4])):
>             f.write(str(i % 10))
>         f.write('|')
>     f.write(str(num_rows))
>     f.write('\n')
> f.close()
> $ cat hive.sql
> drop database if exists mysch cascade;
> create database mysch;
> use mysch;
> create table mytable(c_int int, c_string1 string, c_string2 string, p_int
> int) row format delimited fields terminated by '|';
> load data local inpath './data.1row_10MB.txt' overwrite into table mytable;
> $ cat mytest.sql
> log mytest.log clear;
> cqd HIVE_MAX_STRING_LENGTH '10485760';
> set schema hive.mysch;
> drop external table if exists mytable for hive.mysch.mytable;
> create external table mytable (c_int int, c_string1 varchar(10485760),
> c_string2 varchar(10485760), p_int int) for hive.mysch.mytable;
> update statistics for table mytable on every column sample random 50 percent;
> showstats for table mytable on c_int to c_string2;
> update statistics for table mytable on c_int to c_string2 sample random 50
> percent;
> showstats for table mytable on c_int to c_string2;