[ https://issues.apache.org/jira/browse/HIVE-21428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ganesha Shreedhara updated HIVE-21428:
--------------------------------------
    Description: 
*Steps to reproduce:*

// create a partitioned table

create external table src (c1 string, c2 string, c3 string)
partitioned by (part string)
location '/tmp/src';

 

// create a data file with data in only 2 columns, separated by a tab, and put it in the table's external location

echo "d1\td2"  >> data.txt;

hadoop dfs -put  data.txt /tmp/src/part=part1/;
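
A side note on the data file: whether `echo "d1\td2"` writes a real tab depends on the shell (bash's builtin `echo`, for example, does not expand `\t` unless `-e` is given). A minimal sketch, assuming a POSIX shell, that writes the row with `printf` and checks that the file contains exactly one tab. The temporary file is only for illustration and is not part of the original steps:

```shell
# Write the two-column row with a literal tab between the values.
datafile="$(mktemp)"
printf 'd1\td2\n' > "$datafile"

# Count the tab characters in the file; exactly one is expected.
tabs="$(tr -cd '\t' < "$datafile" | wc -c | tr -d ' ')"
echo "tabs in file: $tabs"
```

If the count is 0, the file holds the literal characters `\t` and the tab delimiter will never match, regardless of the SerDe settings.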

 

MSCK REPAIR TABLE src;

 

// Alter the partition's SerDe properties to use tab ('\t') as the field delimiter

ALTER TABLE src PARTITION (part='part1')
SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('columns'='c1,c2', 'column.types'='string,string', 'field.delim'='\t');

 

// Now write the data from the src table to a dest table

create table dest (c1 string, c2 string, c3 string, c4 string);

insert overwrite table dest select * from src;

 

// Retrieve data from dest table

select * from dest; 

*Result (wrong):*

d1 d2 NULL NULL part1

 

// Now disable schema evolution, write the data from the src table to the dest table again, and retrieve it

set hive.exec.schema.evolution=false;

insert overwrite table dest select * from src;

select * from dest;

*Result (correct):*

d1 d2 NULL part1

 

This happens because "d1\td2" is treated as a single column: the field 
delimiter used by the deserialiser is *^A* (the default) instead of the *\t* 
that is set at the partition level.
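
The effect of the delimiter can be sketched outside Hive. The following shell snippet is illustrative only: `split_row` is a hypothetical helper, not Hive code. It splits one line on a given delimiter and pads missing columns with NULL, assuming the three data columns of src:

```shell
# Illustrative only: mimics how a delimited-text reader splits one row.
# split_row is a hypothetical helper and does not call Hive.
split_row() {
  # $1 = raw line, $2 = field delimiter, $3 = number of columns expected
  printf '%s\n' "$1" | awk -F"$2" -v n="$3" '{
    out = ""
    for (i = 1; i <= n; i++) {
      v = (i <= NF) ? $i : "NULL"      # pad missing columns with NULL
      out = out (i > 1 ? "|" : "") v   # show column boundaries as "|"
    }
    print out
  }'
}

row="$(printf 'd1\td2')"
ctrl_a="$(printf '\001')"
tab="$(printf '\t')"

split_row "$row" "$ctrl_a" 3   # ^A delimiter: the whole line stays in column 1
split_row "$row" "$tab" 3      # tab delimiter: prints d1|d2|NULL
```

With ^A the whole line lands in the first column and the remaining two columns are NULL, which matches the wrong result above. With tab the row splits into d1 and d2, which matches the correct result.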

It works fine if I alter the serde field delimiter for the entire table.

So it looks like the serde properties in TableDesc are taking precedence over 
the serde properties in PartitionDesc. The issue occurs only when 
hive.exec.schema.evolution is enabled (it is enabled by default), and it is 
not present in 2.x versions.

 

  was:
 

 

*Steps to reproduce:*

create external table src (c1 string, c2, string, c3 string) partitioned by 
(part string)

location '/tmp/src';

 

 

echo "d1\td2"  >> data.txt;

hadoop dfs -put  data.txt /tmp/src/part=part1/;

 

MSCK REPAIR TABLE src;

 

ALTER TABLE src PARTITION (part='part1')

SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

WITH SERDEPROPERTIES ('columns'='c1,c2', 'column.types' ='string,string', 
'field.delim'='\t');

 

create table dest (c1 string, c2 string, c3 string, c4 string);

insert overwrite table dest select * from src;

select * from dest;

 

*Result* (wrong)*:*

d1 d2 NULL NULL part1

 

set hive.vectorized.execution.enabled=false;

insert overwrite table dest select * from src;

select * from dest;

 

*Result* (Correct)*:*

d1 d2 NULL part1

 

This is because "d1\td2" is getting considered as single column because the 
filed delimiter used by deserialiser is  *^A* instead of *\t* which is set at 
partition level.

It is working fine if I alter the field delimiter of serde for the entire table.

So, looks like serde properties in TableDesc is taking precedence over serde 
properties in PartitionDesc.  This issue is only when 
hive.exec.schema.evolution is enabled and its not there in 2.x versions. 


> field delimiter property set at partition level is not getting respected when 
> schema evolution is enabled
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-21428
>                 URL: https://issues.apache.org/jira/browse/HIVE-21428
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.1.1
>            Reporter: Ganesha Shreedhara
>            Priority: Major
>
> *Steps to reproduce:*
> // create a partitioned table
> create external table src (c1 string, c2 string, c3 string)
> partitioned by (part string)
> location '/tmp/src';
>  
> // create a data file with data in only 2 columns, separated by a tab, and put it in the table's external location
> echo "d1\td2"  >> data.txt;
> hadoop dfs -put  data.txt /tmp/src/part=part1/;
>  
> MSCK REPAIR TABLE src;
>  
> // Alter the partition's SerDe properties to use tab ('\t') as the field delimiter
> ALTER TABLE src PARTITION (part='part1')
> SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES ('columns'='c1,c2', 'column.types'='string,string', 'field.delim'='\t');
>  
> // Now write the data from the src table to a dest table
> create table dest (c1 string, c2 string, c3 string, c4 string);
> insert overwrite table dest select * from src;
>  
> // Retrieve data from dest table
> select * from dest; 
> *Result (wrong):*
> d1 d2 NULL NULL part1
>  
> // Now disable schema evolution, write the data from the src table to the dest table again, and retrieve it
> set hive.exec.schema.evolution=false;
> insert overwrite table dest select * from src;
> select * from dest;
> *Result (correct):*
> d1 d2 NULL part1
>  
> This happens because "d1\td2" is treated as a single column: the field 
> delimiter used by the deserialiser is *^A* (the default) instead of the *\t* 
> that is set at the partition level.
> It works fine if I alter the serde field delimiter for the entire table.
> So it looks like the serde properties in TableDesc are taking precedence over 
> the serde properties in PartitionDesc. The issue occurs only when 
> hive.exec.schema.evolution is enabled (it is enabled by default), and it is 
> not present in 2.x versions.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
