[jira] [Commented] (TRAFODION-2251) update stats does not work for large Hive varchar columns

ASF GitHub Bot (JIRA) Mon, 10 Oct 2016 09:30:36 -0700

    [ 
https://issues.apache.org/jira/browse/TRAFODION-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15562739#comment-15562739
 ]


ASF GitHub Bot commented on TRAFODION-2251:
-------------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-trafodion/pull/744


> update stats does not work for large Hive varchar columns
> ---------------------------------------------------------
>
>                 Key: TRAFODION-2251
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2251
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: sql-cmp
>    Affects Versions: 2.1-incubating
>         Environment: All
>            Reporter: David Wayne Birdsall
>            Assignee: David Wayne Birdsall
>
> As shown below, update statistics currently returns various errors once the 
> length of a column exceeds 200000. But with the large varchar support, 
> theoretically the total row size of 64MB is the only limit. A long varchar 
> column length can easily go over the 200000 limit.
> It's not clear if we intend to support update statistics for large varchar. 
> If we don't plan to support it, perhaps update statistics can at least skip 
> long varchar columns (much like the way it skips LOB columns right now), so 
> that it does not fail when a user runs update statistics on 'every column' 
> that includes long varchars.
> The following sequence of execution also showed a second problem, which may 
> or may not be related. The first update statistics had failed with 'every 
> column' which includes long varchar columns. Showstats didn't show any stats 
> aferwards. The second update statistics was run on the long varchar columns 
> again, but it didn't return any errors. Showstats afterwards showed no stats 
> still.
> $ cat hive.sql
> drop database if exists mysch cascade;
> create database mysch;
> use mysch;
> create table mytable(c_int int, c_string1 string, c_string2 string ,p_int 
> int) row format delimited fields terminated by '|';
> load data local inpath './data.1row_10MB.txt' overwrite into table mytable;
> $ hive -f hive.sql
> WARNING: Use "yarn jar" to launch YARN applications.
> Logging initialized using configuration in 
> file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
> OK
> Time taken: 1.455 seconds
> OK
> Time taken: 0.185 seconds
> OK
> Time taken: 0.304 seconds
> OK
> Time taken: 0.371 seconds
> Loading data to table mysch.mytable
> Table mysch.mytable stats: [numFiles=1, totalSize=20971526]
> OK
> Time taken: 0.967 seconds
> $ sqlci -i mytest.sql
> EsgynDB Advanced Conversational Interface 2.2.0
> Copyright (c) 2015-2016 Esgyn Corporation
> >>log mytest.log clear;
> >>cqd HIVE_MAX_STRING_LENGTH '10485760';
> --- SQL operation complete.
> >>set schema hive.mysch;
> --- SQL operation complete.
> >>drop external table if exists mytable for hive.mysch.mytable;
> --- SQL operation complete.
> >>create external table mytable (c_int int, c_string1 varchar(10485760), 
> >>c_string2 varchar(10485760), p_int int) for hive.mysch.mytable;
> --- SQL operation complete.
> >>
> >>update statistics for table mytable on every column sample random 50 
> >>percent;
> *** ERROR[9214] Object 
> TRAFODION."_HIVESTATS_".TRAF_SAMPLE_6638414188583073746_1473710297_39124 
> could not be created.
> *** ERROR[4247] Specified size in bytes (10485760) exceeds the maximum size 
> allowed (200000) for column C_STRING1.
> *** ERROR[9200] UPDATE STATISTICS for table HIVE.MYSCH.MYTABLE encountered an 
> error (8609) from statement Process_Query.
> *** ERROR[8609] Waited rollback performed without starting a transaction.
> --- SQL operation failed with errors.
> >>showstats for table mytable on c_int to c_string2;
> Histogram data for Table HIVE.MYSCH.MYTABLE
> Table ID: 6638414188583073746
>    Hist ID # Ints Rowcount UEC Colname(s)
> ========== ====== =========== =========== ===========================
> No Histograms exist for the requested columns or groups
> --- SQL operation complete.
> >>
> >>update statistics for table mytable on c_int to c_string2 sample random 50 
> >>percent;
> --- SQL operation complete.
> >>showstats for table mytable on c_int to c_string2;
> Histogram data for Table HIVE.MYSCH.MYTABLE
> Table ID: 6638414188583073746
>    Hist ID # Ints Rowcount UEC Colname(s)
> ========== ====== =========== =========== ===========================
> No Histograms exist for the requested columns or groups
> --- SQL operation complete.
> To reproduce, create 3 files with the following contents and run them in this 
> order:
> chmod 755 ./data_gen.py
> ./data_gen.py data.1row_10MB.txt 1 2 10485760
> hive -f hive.sql
> sqlci -i mytest.sql
> $ cat data_gen.py
> #! /usr/bin/env python
> import sys
> if len(sys.argv) != 5 or \
>    sys.argv[1].lower() == '-h' or \
>    sys.argv[1].lower() == '-help':
>     print 'Usage: ' + sys.argv[0] + ' <file> <num of rows> <num of varchar 
> columns> <varchar column length>'
>     sys.exit()
> f = open(sys.argv[1], "w+")
> marker=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
> for num_rows in range(0, int(sys.argv[2])):
>     f.write(str(num_rows) + '|')
>     for num_cols in range(0, int(sys.argv[3])):
>         f.write(marker[num_rows%len(marker)])
>         for i in range (1, int(sys.argv[4])):
>             f.write(str(i % 10))
>         f.write('|')
>     f.write(str(num_rows))
>     f.write('\n')
> f.close()
> $ cat hive.sql
> drop database if exists mysch cascade;
> create database mysch;
> use mysch;
> create table mytable(c_int int, c_string1 string, c_string2 string ,p_int 
> int) row format delimited fields terminated by '|';
> load data local inpath './data.1row_10MB.txt' overwrite into table mytable;
> $ cat mytest.sql
> log mytest.log clear;
> cqd HIVE_MAX_STRING_LENGTH '10485760';
> set schema hive.mysch;
> drop external table if exists mytable for hive.mysch.mytable;
> create external table mytable (c_int int, c_string1 varchar(10485760), 
> c_string2 varchar(10485760), p_int int) for hive.mysch.mytable;
> update statistics for table mytable on every column sample random 50 percent;
> showstats for table mytable on c_int to c_string2;
> update statistics for table mytable on c_int to c_string2 sample random 50 
> percent;
> showstats for table mytable on c_int to c_string2;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TRAFODION-2251) update stats does not work for large Hive varchar columns

Reply via email to