Sergey Shelukhin commented on HIVE-18763:

Ok I was finally able to return to this.
Looks like it's a much bigger change than I expected.

I included the test that returns different results for different cases and in 
fact fails in LLAP IO with either ORC or text...
I cannot repro the issue with Parquet anymore, at least not for these 

For now I won't be working on this anymore... I think ConvertTreeReader-s 
conversion aspects need to be moved from ORC to Hive, since they should not be 
ORC specific. Or at least the conversion logic should be refactored to be 
reusable in a wider range of cases. cc [~mmccline]

> VectorMapOperator should take into account partition->table serde conversion 
> for all cases
> ------------------------------------------------------------------------------------------
>                 Key: HIVE-18763
>                 URL: https://issues.apache.org/jira/browse/HIVE-18763
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>         Attachments: HIVE-18763.WIP.patch
> When table and partition schema differ, non-vectorized MapOperator does row 
> by row conversion from whatever is read to the table schema.
> VectorMapOperator is less consistent... it does the conversion as part of 
> populating VRBs in row/serde modes (used to read e.g. text row-by-row or 
> natively, and make VRBs); see  VectorDeserializeRow class convert... methods 
> for an example. However, the native VRB mode relies on ORC 
> ConvertTreeReader... stuff that lives in ORC, and so never converts anything 
> inside VMO.
> So, anything running in native VRB mode that is not the vanilla ORC reader 
> will produce data with incorrect schema if there were schema changes and 
> partitions are present  - there are two such cases right now, LLAP IO with 
> ORC or text data, and Parquet. 
> It's possible to extend ConvertTreeReader... stuff to LLAP IO ORC that 
> already uses TreeReader-s for everything; LLAP IO text and (non-LLAP) 
> Parquet, as well as any future users however will have to invent their own 
> conversion.
> Therefore, I think the best fix for this is to treat all inputs in VMO the 
> same and convert them by default, like the regular MapOperator; and make ORC 
> special mode an exception that allows it to bypass the conversion. 
> cc [~mmccline]
> Test case - varchar column length should be limited after alter table but it 
> isn't.
> {noformat}
> CREATE TABLE schema_evolution_data(insert_num int, boolean1 boolean, tinyint1 
> tinyint, smallint1 smallint, int1 int, bigint1 bigint, decimal1 
> decimal(38,18), float1 float, double1 double, string1 varchar(50), string2 
> varchar(50), date1 date, timestamp1 timestamp, boolean_str string, 
> tinyint_str string, smallint_str string, int_str string, bigint_str string, 
> decimal_str string, float_str string, double_str string, date_str string, 
> timestamp_str string, filler string)
> row format delimited fields terminated by '|' stored as textfile;
> load data local inpath 
> '../../data/files/schema_evolution/schema_evolution_data.txt' overwrite into 
> table schema_evolution_data;
> drop table if exists vsp;
> create table vsp(vs varchar(50)) partitioned by(s varchar(50)) stored as 
> textfile;
> insert into table vsp partition(s='positive') select string1 from 
> schema_evolution_data;
> alter table vsp change column vs vs varchar(3);
> drop table if exists vsp_orc;
> create table vsp_orc(vs varchar(50)) partitioned by(s varchar(50)) stored as 
> orc;
> insert into table vsp_orc partition(s='positive') select string1 from 
> schema_evolution_data;
> alter table vsp_orc change column vs vs varchar(3);
> drop table if exists vsp_parquet;
> create table vsp_parquet(vs varchar(50)) partitioned by(s varchar(50)) stored 
> as parquet;
> insert into table vsp_parquet partition(s='positive') select string1 from 
> schema_evolution_data;
> alter table vsp_parquet change column vs vs varchar(3);
> SET hive.llap.io.enabled=true;
> -- BAD results from all queries; parquet affected regardless of IO.
> select length(vs) from vsp; 
> select length(vs) from vsp_orc;
> select length(vs) from vsp_parquet;
> SET hive.llap.io.enabled=false;
> select length(vs) from vsp; -- ok
> select length(vs) from vsp_orc; -- ok
> select length(vs) from vsp_parquet; -- still bad
> {noformat}
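
For reference, the conversion that the "bad" queries above are missing can be sketched roughly as follows. This is only an illustration of the expected varchar(50) -> varchar(3) behavior, not code from Hive or ORC: the class and method names are hypothetical, and it assumes that varchar length is counted in Unicode code points rather than bytes or UTF-16 units.

```java
public class VarcharTruncateSketch {

    // Hypothetical helper (not a Hive or ORC API): limits a value to at most
    // maxLength characters, counting Unicode code points (an assumption).
    public static String enforceMaxLength(String value, int maxLength) {
        if (value == null) {
            return null; // NULLs pass through unchanged
        }
        int codePoints = value.codePointCount(0, value.length());
        if (codePoints <= maxLength) {
            return value; // already fits the table-level type
        }
        // Find the UTF-16 index that corresponds to maxLength code points,
        // so a surrogate pair is never split in half.
        int end = value.offsetByCodePoints(0, maxLength);
        return value.substring(0, end);
    }

    public static void main(String[] args) {
        // A value written under the old partition schema, varchar(50),
        // read back under the new table schema, varchar(3).
        String converted = enforceMaxLength("positive", 3);
        System.out.println(converted + " " + converted.length());
    }
}
```

With a conversion like this applied on every read path, length(vs) in the queries above should never exceed 3, whatever the file format or IO mode.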
