[
https://issues.apache.org/jira/browse/HIVE-21428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ganesha Shreedhara updated HIVE-21428:
--------------------------------------
Description:
*Steps to reproduce:*
// create a partitioned table
create external table src (c1 string, c2 string, c3 string) partitioned by
(part string)
location '/tmp/src';
// create data file with data present only in 2 columns and separated by tab,
put it in table's external location
printf 'd1\td2\n' >> data.txt;
hadoop dfs -put data.txt /tmp/src/part=part1/;
MSCK REPAIR TABLE src;
// Alter partition's property to have field delimiter as tab ('\t')
ALTER TABLE src PARTITION (part='part1')
SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('columns'='c1,c2', 'column.types' ='string,string',
'field.delim'='\t');
// Now write the data from src table to a dest table
create table dest (c1 string, c2 string, c3 string, c4 string);
insert overwrite table dest select * from src;
// Retrieve data from dest table
select * from dest;
*Result* (wrong)*:*
d1 d2 NULL NULL part1
// Now disable schema evolution, write data again from src table to dest table
and retrieve the data
set hive.exec.schema.evolution=false;
insert overwrite table dest select * from src;
select * from dest;
*Result* (Correct)*:*
d1 d2 NULL part1
This happens because "d1\td2" is treated as a single column: the field
delimiter used by the deserialiser is *^A* rather than the *\t* that is set at
partition level.
It works correctly if I alter the SerDe field delimiter for the entire table.
So it looks like the SerDe properties in TableDesc are taking precedence over
the SerDe properties in PartitionDesc. The issue occurs only when
hive.exec.schema.evolution is enabled (it is enabled by default), and it is not
present in 2.x versions.
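To illustrate the suspected delimiter mismatch, here is a minimal Python sketch of how one text row would be split under each delimiter. It is only a model of the behaviour described above, not Hive's actual code path; the names split_row and CTRL_A are hypothetical and exist only for this illustration.

```python
from typing import List, Optional

# ^A (0x01) is the default field delimiter of LazySimpleSerDe.
CTRL_A = "\x01"


def split_row(line: str, delim: str, num_cols: int) -> List[Optional[str]]:
    """Split a text row on `delim`, drop extra fields, and pad missing
    columns with None (standing in for NULL)."""
    fields: List[Optional[str]] = list(line.split(delim))[:num_cols]
    return fields + [None] * (num_cols - len(fields))


row = "d1\td2"  # the single line written to data.txt

# Delimiter from the partition's SerDe properties ('\t'):
# c1 and c2 are filled, c3 is NULL.
print(split_row(row, "\t", 3))

# Table-level default delimiter (^A): the whole line lands in c1,
# and both c2 and c3 are NULL.
print(split_row(row, CTRL_A, 3))
```

In the second case the tab is still embedded in c1, so the row prints as "d1 d2 NULL NULL part1" once the partition column is appended, which matches the wrong result shown above.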
was:
*Steps to reproduce:*
create external table src (c1 string, c2, string, c3 string) partitioned by
(part string)
location '/tmp/src';
echo "d1\td2" >> data.txt;
hadoop dfs -put data.txt /tmp/src/part=part1/;
MSCK REPAIR TABLE src;
ALTER TABLE src PARTITION (part='part1')
SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('columns'='c1,c2', 'column.types' ='string,string',
'field.delim'='\t');
create table dest (c1 string, c2 string, c3 string, c4 string);
insert overwrite table dest select * from src;
select * from dest;
*Result* (wrong)*:*
d1 d2 NULL NULL part1
set hive.vectorized.execution.enabled=false;
insert overwrite table dest select * from src;
select * from dest;
*Result* (Correct)*:*
d1 d2 NULL part1
This is because "d1\td2" is getting considered as single column because the
filed delimiter used by deserialiser is *^A* instead of *\t* which is set at
partition level.
It is working fine if I alter the field delimiter of serde for the entire table.
So, looks like serde properties in TableDesc is taking precedence over serde
properties in PartitionDesc. This issue is only when
hive.exec.schema.evolution is enabled and its not there in 2.x versions.
> field delimiter property set at partition level is not getting respected when
> schema evolution is enabled
> ---------------------------------------------------------------------------------------------------------
>
> Key: HIVE-21428
> URL: https://issues.apache.org/jira/browse/HIVE-21428
> Project: Hive
> Issue Type: Bug
> Affects Versions: 3.1.1
> Reporter: Ganesha Shreedhara
> Priority: Major
>
> *Steps to reproduce:*
> // create a partitioned table
> create external table src (c1 string, c2 string, c3 string) partitioned by
> (part string)
> location '/tmp/src';
>
> // create data file with data present only in 2 columns and separated by tab,
> put it in table's external location
> printf 'd1\td2\n' >> data.txt;
> hadoop dfs -put data.txt /tmp/src/part=part1/;
>
> MSCK REPAIR TABLE src;
>
> // Alter partition's property to have field delimiter as tab ('\t')
> ALTER TABLE src PARTITION (part='part1')
> SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES ('columns'='c1,c2', 'column.types' ='string,string',
> 'field.delim'='\t');
>
> // Now write the data from src table to a dest table
> create table dest (c1 string, c2 string, c3 string, c4 string);
> insert overwrite table dest select * from src;
>
> // Retrieve data from dest table
> select * from dest;
> *Result* (wrong)*:*
> d1 d2 NULL NULL part1
>
> // Now disable schema evolution, write data again from src table to dest
> table and retrieve the data
> set hive.exec.schema.evolution=false;
> insert overwrite table dest select * from src;
> select * from dest;
> *Result* (Correct)*:*
> d1 d2 NULL part1
>
> This happens because "d1\td2" is treated as a single column: the field
> delimiter used by the deserialiser is *^A* rather than the *\t* that is set
> at partition level.
> It works correctly if I alter the SerDe field delimiter for the entire
> table.
> So it looks like the SerDe properties in TableDesc are taking precedence over
> the SerDe properties in PartitionDesc. The issue occurs only when
> hive.exec.schema.evolution is enabled (it is enabled by default), and it is
> not present in 2.x versions.
>