[
https://issues.apache.org/jira/browse/TRAFODION-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534336#comment-15534336
]
David Wayne Birdsall commented on TRAFODION-2251:
-------------------------------------------------
This is proving tricky to fix.
The sample table creation fails because we create the sample table with
"CREATE TABLE LIKE", and CREATE TABLE LIKE simply copies the datatypes of the
source columns to the target. Unfortunately, we support larger char/varchar
columns in Hive tables than we do in native Trafodion tables (and the sample
table is created as a native Trafodion table).
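For illustration, a minimal sketch of the failure mode (the table and schema
names here are made up; the 10485760 length and the 4247 error text are taken
from the log in the description below):
-- effectively what CREATE TABLE LIKE attempts for the sample table when the
-- Hive source has an oversized varchar column (hypothetical table name)
create table trafodion.seabase.t_too_wide (c_string1 varchar(10485760));
-- fails with:
-- *** ERROR[4247] Specified size in bytes (10485760) exceeds the maximum
-- size allowed (200000) for column C_STRING1.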
One choice might be to relax all restrictions on char/varchar column lengths
in Trafodion, but that would be a lot of work.
So instead we could do as the issue description suggests: prohibit UPDATE
STATS on long char/varchar columns, the same way we prohibit it for LOB
datatypes. Unfortunately, we still trip over CREATE TABLE LIKE, because we
create the sample table with *all* the columns of the source table, supported
or not. That works for LOBs because LOBs are supported in native Trafodion
tables; oversized char/varchar columns are not.
So we are faced with two choices (both sketched below):
1. Replace the ustat-level "CREATE TABLE LIKE" logic with something that picks
just the supported columns. The logic that populates the sample table would
need to change accordingly.
2. Change the CREATE TABLE LIKE DDL code to check column lengths and, where a
length is unsupported, use the maximum supported length instead. This also
implies a change to the populate logic: we would have to do a SUBSTRING
operation to avoid errors when writing rows.
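Roughly, the generated DDL/DML for each option might look like the following
sketch (the sample table names are hypothetical; the real ones are generated,
like the TRAF_SAMPLE_... name in the log below, and the real populate step
would also carry the sampling clause):
-- option 1 (sketch): enumerate only the supported columns instead of using
-- CREATE TABLE LIKE, and populate that same column list
create table trafodion."_HIVESTATS_".traf_sample_opt1 (c_int int, p_int int);
insert into trafodion."_HIVESTATS_".traf_sample_opt1
  select c_int, p_int from hive.mysch.mytable;
-- option 2 (sketch): clamp oversized columns to the 200000-byte native
-- maximum and truncate with SUBSTRING while populating, so writes don't fail
create table trafodion."_HIVESTATS_".traf_sample_opt2
  (c_int int, c_string1 varchar(200000), c_string2 varchar(200000),
   p_int int);
insert into trafodion."_HIVESTATS_".traf_sample_opt2
  select c_int,
         substring(c_string1 from 1 for 200000),
         substring(c_string2 from 1 for 200000),
         p_int
  from hive.mysch.mytable;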
> update stats does not work for large Hive varchar columns
> ---------------------------------------------------------
>
> Key: TRAFODION-2251
> URL: https://issues.apache.org/jira/browse/TRAFODION-2251
> Project: Apache Trafodion
> Issue Type: Bug
> Components: sql-cmp
> Affects Versions: 2.1-incubating
> Environment: All
> Reporter: David Wayne Birdsall
> Assignee: David Wayne Birdsall
>
> As shown below, update statistics currently returns various errors once the
> declared length of a column exceeds 200000 bytes. But with the large varchar
> support, theoretically the only limit is the total row size of 64MB, and a
> long varchar column can easily exceed the 200000 limit.
> It's not clear whether we intend to support update statistics for large
> varchars. If we don't, perhaps update statistics can at least skip long
> varchar columns (much like the way it skips LOB columns right now), so that
> it does not fail when a user runs update statistics on 'every column' for a
> table that includes long varchars.
> The following sequence of execution also showed a second problem, which may
> or may not be related. The first update statistics failed with 'every
> column', which includes the long varchar columns, and showstats didn't show
> any stats afterwards. The second update statistics was then run on the long
> varchar columns again; it didn't return any errors, but showstats afterwards
> still showed no stats.
> $ cat hive.sql
> drop database if exists mysch cascade;
> create database mysch;
> use mysch;
> create table mytable(c_int int, c_string1 string, c_string2 string, p_int
> int) row format delimited fields terminated by '|';
> load data local inpath './data.1row_10MB.txt' overwrite into table mytable;
> $ hive -f hive.sql
> WARNING: Use "yarn jar" to launch YARN applications.
> Logging initialized using configuration in
> file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
> OK
> Time taken: 1.455 seconds
> OK
> Time taken: 0.185 seconds
> OK
> Time taken: 0.304 seconds
> OK
> Time taken: 0.371 seconds
> Loading data to table mysch.mytable
> Table mysch.mytable stats: [numFiles=1, totalSize=20971526]
> OK
> Time taken: 0.967 seconds
> $ sqlci -i mytest.sql
> EsgynDB Advanced Conversational Interface 2.2.0
> Copyright (c) 2015-2016 Esgyn Corporation
> >>log mytest.log clear;
> >>cqd HIVE_MAX_STRING_LENGTH '10485760';
> --- SQL operation complete.
> >>set schema hive.mysch;
> --- SQL operation complete.
> >>drop external table if exists mytable for hive.mysch.mytable;
> --- SQL operation complete.
> >>create external table mytable (c_int int, c_string1 varchar(10485760),
> >>c_string2 varchar(10485760), p_int int) for hive.mysch.mytable;
> --- SQL operation complete.
> >>
> >>update statistics for table mytable on every column sample random 50
> >>percent;
> *** ERROR[9214] Object
> TRAFODION."_HIVESTATS_".TRAF_SAMPLE_6638414188583073746_1473710297_39124
> could not be created.
> *** ERROR[4247] Specified size in bytes (10485760) exceeds the maximum size
> allowed (200000) for column C_STRING1.
> *** ERROR[9200] UPDATE STATISTICS for table HIVE.MYSCH.MYTABLE encountered an
> error (8609) from statement Process_Query.
> *** ERROR[8609] Waited rollback performed without starting a transaction.
> --- SQL operation failed with errors.
> >>showstats for table mytable on c_int to c_string2;
> Histogram data for Table HIVE.MYSCH.MYTABLE
> Table ID: 6638414188583073746
> Hist ID # Ints Rowcount UEC Colname(s)
> ========== ====== =========== =========== ===========================
> No Histograms exist for the requested columns or groups
> --- SQL operation complete.
> >>
> >>update statistics for table mytable on c_int to c_string2 sample random 50
> >>percent;
> --- SQL operation complete.
> >>showstats for table mytable on c_int to c_string2;
> Histogram data for Table HIVE.MYSCH.MYTABLE
> Table ID: 6638414188583073746
> Hist ID # Ints Rowcount UEC Colname(s)
> ========== ====== =========== =========== ===========================
> No Histograms exist for the requested columns or groups
> --- SQL operation complete.
> To reproduce, create the three files shown below (data_gen.py, hive.sql,
> mytest.sql), then run them in this order:
> chmod 755 ./data_gen.py
> ./data_gen.py data.1row_10MB.txt 1 2 10485760
> hive -f hive.sql
> sqlci -i mytest.sql
> $ cat data_gen.py
> #! /usr/bin/env python
> # Writes <num of rows> rows of the form rownum|<varchar>|...|rownum,
> # where each varchar column is <varchar column length> characters long.
> import sys
> if len(sys.argv) != 5 or \
>    sys.argv[1].lower() == '-h' or \
>    sys.argv[1].lower() == '-help':
>     print 'Usage: ' + sys.argv[0] + \
>         ' <file> <num of rows> <num of varchar columns> <varchar column length>'
>     sys.exit()
> f = open(sys.argv[1], "w+")
> marker = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
> for num_rows in range(0, int(sys.argv[2])):
>     f.write(str(num_rows) + '|')
>     for num_cols in range(0, int(sys.argv[3])):
>         # each varchar value starts with a per-row marker letter, followed
>         # by the repeating digit pattern 123456789012...
>         f.write(marker[num_rows % len(marker)])
>         for i in range(1, int(sys.argv[4])):
>             f.write(str(i % 10))
>         f.write('|')
>     f.write(str(num_rows))
>     f.write('\n')
> f.close()
> $ cat hive.sql
> drop database if exists mysch cascade;
> create database mysch;
> use mysch;
> create table mytable(c_int int, c_string1 string, c_string2 string, p_int
> int) row format delimited fields terminated by '|';
> load data local inpath './data.1row_10MB.txt' overwrite into table mytable;
> $ cat mytest.sql
> log mytest.log clear;
> cqd HIVE_MAX_STRING_LENGTH '10485760';
> set schema hive.mysch;
> drop external table if exists mytable for hive.mysch.mytable;
> create external table mytable (c_int int, c_string1 varchar(10485760),
> c_string2 varchar(10485760), p_int int) for hive.mysch.mytable;
> update statistics for table mytable on every column sample random 50 percent;
> showstats for table mytable on c_int to c_string2;
> update statistics for table mytable on c_int to c_string2 sample random 50
> percent;
> showstats for table mytable on c_int to c_string2;