David Wayne Birdsall created TRAFODION-2251:
-----------------------------------------------

             Summary: update stats does not work for large Hive varchar columns
                 Key: TRAFODION-2251
                 URL: https://issues.apache.org/jira/browse/TRAFODION-2251
             Project: Apache Trafodion
          Issue Type: Bug
          Components: sql-cmp
    Affects Versions: 2.1-incubating
         Environment: All

            Reporter: David Wayne Birdsall
            Assignee: David Wayne Birdsall


As shown below, update statistics currently returns various errors once the 
length of a varchar column exceeds 200000 bytes. But with large varchar 
support, the 64MB total row size should theoretically be the only limit, and a 
long varchar column can easily exceed the 200000-byte limit.

It's not clear whether we intend to support update statistics for large 
varchars. If we don't plan to support it, update statistics could at least skip 
long varchar columns (much as it skips LOB columns today), so that it does not 
fail when a user runs update statistics on 'every column' for a table that 
includes long varchars.
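A minimal sketch of the proposed skip, assuming a hypothetical list of (column 
name, declared byte length) pairs; the function name and descriptor shape are 
illustrative, not actual Trafodion internals:

```python
# Hypothetical sketch: filter out columns too wide for stats collection,
# mirroring how LOB columns are skipped today. Names are illustrative.
MAX_STATS_VARCHAR_BYTES = 200000  # limit observed in ERROR[4247]

def columns_for_stats(columns):
    """Split columns into those whose declared byte length fits the
    stats limit (kept) and those to be silently skipped."""
    kept, skipped = [], []
    for name, byte_len in columns:
        (kept if byte_len <= MAX_STATS_VARCHAR_BYTES else skipped).append(name)
    return kept, skipped

kept, skipped = columns_for_stats([
    ("C_INT", 4),
    ("C_STRING1", 10485760),
    ("C_STRING2", 10485760),
    ("P_INT", 4),
])
# kept  -> ['C_INT', 'P_INT']
# skipped -> ['C_STRING1', 'C_STRING2']
```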

The execution sequence below also revealed a second problem, which may or may 
not be related. The first update statistics failed with 'every column', which 
included the long varchar columns, and showstats showed no stats afterwards. A 
second update statistics was then run on the long varchar columns alone and 
returned no errors, yet showstats afterwards still showed no stats.

$ cat hive.sql
drop database if exists mysch cascade;
create database mysch;
use mysch;
create table mytable(c_int int, c_string1 string, c_string2 string ,p_int int) 
row format delimited fields terminated by '|';
load data local inpath './data.1row_10MB.txt' overwrite into table mytable;

$ hive -f hive.sql
WARNING: Use "yarn jar" to launch YARN applications.

Logging initialized using configuration in 
file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
OK
Time taken: 1.455 seconds
OK
Time taken: 0.185 seconds
OK
Time taken: 0.304 seconds
OK
Time taken: 0.371 seconds
Loading data to table mysch.mytable
Table mysch.mytable stats: [numFiles=1, totalSize=20971526]
OK
Time taken: 0.967 seconds

$ sqlci -i mytest.sql
EsgynDB Advanced Conversational Interface 2.2.0
Copyright (c) 2015-2016 Esgyn Corporation
>>log mytest.log clear;
>>cqd HIVE_MAX_STRING_LENGTH '10485760';

--- SQL operation complete.
>>set schema hive.mysch;

--- SQL operation complete.
>>drop external table if exists mytable for hive.mysch.mytable;

--- SQL operation complete.
>>create external table mytable (c_int int, c_string1 varchar(10485760), 
>>c_string2 varchar(10485760), p_int int) for hive.mysch.mytable;

--- SQL operation complete.
>>
>>update statistics for table mytable on every column sample random 50 percent;

*** ERROR[9214] Object 
TRAFODION."_HIVESTATS_".TRAF_SAMPLE_6638414188583073746_1473710297_39124 could 
not be created.

*** ERROR[4247] Specified size in bytes (10485760) exceeds the maximum size 
allowed (200000) for column C_STRING1.

*** ERROR[9200] UPDATE STATISTICS for table HIVE.MYSCH.MYTABLE encountered an 
error (8609) from statement Process_Query.

*** ERROR[8609] Waited rollback performed without starting a transaction.

--- SQL operation failed with errors.
>>showstats for table mytable on c_int to c_string2;

Histogram data for Table HIVE.MYSCH.MYTABLE
Table ID: 6638414188583073746

   Hist ID # Ints Rowcount UEC Colname(s)
========== ====== =========== =========== ===========================

No Histograms exist for the requested columns or groups


--- SQL operation complete.
>>
>>update statistics for table mytable on c_int to c_string2 sample random 50 
>>percent;

--- SQL operation complete.
>>showstats for table mytable on c_int to c_string2;

Histogram data for Table HIVE.MYSCH.MYTABLE
Table ID: 6638414188583073746

   Hist ID # Ints Rowcount UEC Colname(s)
========== ====== =========== =========== ===========================

No Histograms exist for the requested columns or groups


--- SQL operation complete.

To reproduce, create three files (data_gen.py, hive.sql, mytest.sql) with the 
contents shown below, then run these commands in order:

chmod 755 ./data_gen.py
./data_gen.py data.1row_10MB.txt 1 2 10485760
hive -f hive.sql
sqlci -i mytest.sql

$ cat data_gen.py
#! /usr/bin/env python

import sys

if len(sys.argv) != 5 or sys.argv[1].lower() in ('-h', '-help'):
    print('Usage: ' + sys.argv[0] +
          ' <file> <num of rows> <num of varchar columns> <varchar column length>')
    sys.exit()

f = open(sys.argv[1], "w+")

marker = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
for num_rows in range(0, int(sys.argv[2])):
    f.write(str(num_rows) + '|')
    for num_cols in range(0, int(sys.argv[3])):
        # Each varchar value starts with a letter marker, then digits
        # padding it out to exactly <varchar column length> bytes.
        f.write(marker[num_rows % len(marker)])
        for i in range(1, int(sys.argv[4])):
            f.write(str(i % 10))
        f.write('|')
    f.write(str(num_rows))
    f.write('\n')

f.close()
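As a sanity check on the generator, the expected file size for one row with two 
10485760-byte varchar columns can be computed directly; it matches the 
totalSize=20971526 that Hive reports above:

```python
# One generated row is:
#   "0|" + 10485760-byte varchar + "|" + 10485760-byte varchar + "|" + "0" + "\n"
col_len = 10485760
row_bytes = len("0|") + (col_len + 1) * 2 + len("0") + len("\n")
print(row_bytes)  # 20971526, matching Hive's reported totalSize
```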

$ cat hive.sql
drop database if exists mysch cascade;
create database mysch;
use mysch;
create table mytable(c_int int, c_string1 string, c_string2 string ,p_int int) 
row format delimited fields terminated by '|';
load data local inpath './data.1row_10MB.txt' overwrite into table mytable;


$ cat mytest.sql
log mytest.log clear;
cqd HIVE_MAX_STRING_LENGTH '10485760';
set schema hive.mysch;
drop external table if exists mytable for hive.mysch.mytable;
create external table mytable (c_int int, c_string1 varchar(10485760), 
c_string2 varchar(10485760), p_int int) for hive.mysch.mytable;

update statistics for table mytable on every column sample random 50 percent;
showstats for table mytable on c_int to c_string2;

update statistics for table mytable on c_int to c_string2 sample random 50 
percent;
showstats for table mytable on c_int to c_string2;




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)