[jira] [Commented] (TRAFODION-2251) update stats does not work for large Hive varchar columns

David Wayne Birdsall (JIRA) Thu, 29 Sep 2016 16:24:38 -0700

    [ 
https://issues.apache.org/jira/browse/TRAFODION-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534394#comment-15534394
 ]


David Wayne Birdsall commented on TRAFODION-2251:
-------------------------------------------------

Just chatted with Anoop Sharma. There is a flavor of CREATE TABLE that we can 
use that will exclude the columns we don't want. CREATE TABLE AS SELECT ... NO 
LOAD (I might not have spelled this quite right). So in the SELECT part I can 
just select the columns I want, skipping the ones that are too long.

> update stats does not work for large Hive varchar columns
> ---------------------------------------------------------
>
>                 Key: TRAFODION-2251
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2251
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: sql-cmp
>    Affects Versions: 2.1-incubating
>         Environment: All
>            Reporter: David Wayne Birdsall
>            Assignee: David Wayne Birdsall
>
> As shown below, update statistics currently returns various errors once the 
> length of a column exceeds 200000. But with the large varchar support, 
> theoretically the total row size of 64MB is the only limit. A long varchar 
> column length can easily go over the 200000 limit.
> It's not clear if we intend to support update statistics for large varchar. 
> If we don't plan to support it, perhaps update statistics can at least skip 
> long varchar columns (much like the way it skips LOB columns right now), so 
> that it does not fail when a user runs update statistics on 'every column' 
> that includes long varchars.
> The following sequence of execution also showed a second problem, which may 
> or may not be related. The first update statistics had failed with 'every 
> column' which includes long varchar columns. Showstats didn't show any stats 
> aferwards. The second update statistics was run on the long varchar columns 
> again, but it didn't return any errors. Showstats afterwards showed no stats 
> still.
> $ cat hive.sql
> drop database if exists mysch cascade;
> create database mysch;
> use mysch;
> create table mytable(c_int int, c_string1 string, c_string2 string ,p_int 
> int) row format delimited fields terminated by '|';
> load data local inpath './data.1row_10MB.txt' overwrite into table mytable;
> $ hive -f hive.sql
> WARNING: Use "yarn jar" to launch YARN applications.
> Logging initialized using configuration in 
> file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
> OK
> Time taken: 1.455 seconds
> OK
> Time taken: 0.185 seconds
> OK
> Time taken: 0.304 seconds
> OK
> Time taken: 0.371 seconds
> Loading data to table mysch.mytable
> Table mysch.mytable stats: [numFiles=1, totalSize=20971526]
> OK
> Time taken: 0.967 seconds
> $ sqlci -i mytest.sql
> EsgynDB Advanced Conversational Interface 2.2.0
> Copyright (c) 2015-2016 Esgyn Corporation
> >>log mytest.log clear;
> >>cqd HIVE_MAX_STRING_LENGTH '10485760';
> --- SQL operation complete.
> >>set schema hive.mysch;
> --- SQL operation complete.
> >>drop external table if exists mytable for hive.mysch.mytable;
> --- SQL operation complete.
> >>create external table mytable (c_int int, c_string1 varchar(10485760), 
> >>c_string2 varchar(10485760), p_int int) for hive.mysch.mytable;
> --- SQL operation complete.
> >>
> >>update statistics for table mytable on every column sample random 50 
> >>percent;
> *** ERROR[9214] Object 
> TRAFODION."_HIVESTATS_".TRAF_SAMPLE_6638414188583073746_1473710297_39124 
> could not be created.
> *** ERROR[4247] Specified size in bytes (10485760) exceeds the maximum size 
> allowed (200000) for column C_STRING1.
> *** ERROR[9200] UPDATE STATISTICS for table HIVE.MYSCH.MYTABLE encountered an 
> error (8609) from statement Process_Query.
> *** ERROR[8609] Waited rollback performed without starting a transaction.
> --- SQL operation failed with errors.
> >>showstats for table mytable on c_int to c_string2;
> Histogram data for Table HIVE.MYSCH.MYTABLE
> Table ID: 6638414188583073746
>    Hist ID # Ints Rowcount UEC Colname(s)
> ========== ====== =========== =========== ===========================
> No Histograms exist for the requested columns or groups
> --- SQL operation complete.
> >>
> >>update statistics for table mytable on c_int to c_string2 sample random 50 
> >>percent;
> --- SQL operation complete.
> >>showstats for table mytable on c_int to c_string2;
> Histogram data for Table HIVE.MYSCH.MYTABLE
> Table ID: 6638414188583073746
>    Hist ID # Ints Rowcount UEC Colname(s)
> ========== ====== =========== =========== ===========================
> No Histograms exist for the requested columns or groups
> --- SQL operation complete.
> To reproduce, create 3 files with the following contents and run them in this 
> order:
> chmod 755 ./data_gen.py
> ./data_gen.py data.1row_10MB.txt 1 2 10485760
> hive -f hive.sql
> sqlci -i mytest.sql
> $ cat data_gen.py
> #! /usr/bin/env python
> import sys
> if len(sys.argv) != 5 or \
>    sys.argv[1].lower() == '-h' or \
>    sys.argv[1].lower() == '-help':
>     print 'Usage: ' + sys.argv[0] + ' <file> <num of rows> <num of varchar 
> columns> <varchar column length>'
>     sys.exit()
> f = open(sys.argv[1], "w+")
> marker=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
> for num_rows in range(0, int(sys.argv[2])):
>     f.write(str(num_rows) + '|')
>     for num_cols in range(0, int(sys.argv[3])):
>         f.write(marker[num_rows%len(marker)])
>         for i in range (1, int(sys.argv[4])):
>             f.write(str(i % 10))
>         f.write('|')
>     f.write(str(num_rows))
>     f.write('\n')
> f.close()
> $ cat hive.sql
> drop database if exists mysch cascade;
> create database mysch;
> use mysch;
> create table mytable(c_int int, c_string1 string, c_string2 string ,p_int 
> int) row format delimited fields terminated by '|';
> load data local inpath './data.1row_10MB.txt' overwrite into table mytable;
> $ cat mytest.sql
> log mytest.log clear;
> cqd HIVE_MAX_STRING_LENGTH '10485760';
> set schema hive.mysch;
> drop external table if exists mytable for hive.mysch.mytable;
> create external table mytable (c_int int, c_string1 varchar(10485760), 
> c_string2 varchar(10485760), p_int int) for hive.mysch.mytable;
> update statistics for table mytable on every column sample random 50 percent;
> showstats for table mytable on c_int to c_string2;
> update statistics for table mytable on c_int to c_string2 sample random 50 
> percent;
> showstats for table mytable on c_int to c_string2;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TRAFODION-2251) update stats does not work for large Hive varchar columns

Reply via email to