[jira] [Comment Edited] (HIVE-18763) VectorMapOperator should take into account partition->table serde conversion for all cases

Sergey Shelukhin (JIRA) Thu, 22 Feb 2018 11:13:32 -0800

    [ https://issues.apache.org/jira/browse/HIVE-18763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373272#comment-16373272 ]


Sergey Shelukhin edited comment on HIVE-18763 at 2/22/18 7:12 PM:
------------------------------------------------------------------

Hey, I was planning to work on it, but not for a while (after some other bugs), 
so feel free if you want to.
My thinking after looking at the row converters was that the actual changes for 
most types are simple and insignificant (varchar lengths is one), so the change 
would be mostly wiring. It should be easy to add a separate VRB converter 
rather than trying to shoehorn this into VectorDeserializeRow or OI converters, 
since both of these deal with rows (row->row or row->VRB). All the necessary 
schema and stuff are available in VectorMapOperator - it is already used for 
the non-native-VRB case.
Then making sure it doesn't apply to cases that do their own conversion (the 
only one right now is ORC) is another small wiring issue. 
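
To make the idea of a separate VRB converter concrete, here is a minimal, 
self-contained sketch of the varchar-length case. It is an illustration under 
stated assumptions, not Hive code: BytesColumnSketch is a hypothetical stand-in 
for a bytes column of a VRB, and VarcharBatchConverterSketch, convertIfNeeded 
and the readerConvertsItself flag are made-up names. The flag models the 
"does its own conversion" bypass that ORC would use.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a bytes column in a vectorized row batch.
// It only mimics the byte[]/start/length layout; it is NOT a Hive class.
final class BytesColumnSketch {
    final byte[][] vector;
    final int[] start;
    final int[] length;
    final boolean[] isNull;

    BytesColumnSketch(int size) {
        vector = new byte[size][];
        start = new int[size];
        length = new int[size];
        isNull = new boolean[size];
    }

    void set(int row, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        vector[row] = bytes;
        start[row] = 0;
        length[row] = bytes.length;
        isNull[row] = false;
    }

    String get(int row) {
        if (isNull[row]) {
            return null;
        }
        return new String(vector[row], start[row], length[row], StandardCharsets.UTF_8);
    }
}

// Hypothetical batch-level converter: enforces the table's varchar length
// on a column that was read using the (longer) partition varchar length.
final class VarcharBatchConverterSketch {

    // Applies the conversion unless the reader already converts on its own
    // (assumption: the caller knows this, as it would for the ORC case).
    static void convertIfNeeded(BytesColumnSketch col, int batchSize,
                                int tableMaxChars, boolean readerConvertsItself) {
        if (readerConvertsItself) {
            return;
        }
        for (int i = 0; i < batchSize; i++) {
            if (col.isNull[i]) {
                continue;
            }
            // Truncation only shrinks the length; no bytes are copied.
            col.length[i] = truncatedByteLength(
                col.vector[i], col.start[i], col.length[i], tableMaxChars);
        }
    }

    // Returns the byte length of the first maxChars UTF-8 characters.
    static int truncatedByteLength(byte[] bytes, int start, int length, int maxChars) {
        int chars = 0;
        int end = start + length;
        for (int pos = start; pos < end; pos++) {
            // Any byte that is not a continuation byte (10xxxxxx) starts a character.
            if ((bytes[pos] & 0xC0) != 0x80) {
                if (chars == maxChars) {
                    return pos - start;
                }
                chars++;
            }
        }
        return length;
    }

    public static void main(String[] args) {
        BytesColumnSketch col = new BytesColumnSketch(1);
        col.set(0, "abcdef");
        // Partition wrote varchar(50); table schema is now varchar(3).
        convertIfNeeded(col, 1, 3, false);
        System.out.println(col.get(0));
    }
}
```

The point of the sketch is that the conversion works on a whole column of the 
batch and only adjusts lengths in place, so it does not need the row-based 
machinery; whether real VMO wiring would look like this is an assumption.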



was (Author: sershe):
Hey, I was planning to work on it, but not for a while (after some other bugs), 
so feel free if you want to.
My thinking after looking at the row converters was that the actual changes for 
most types are simple and insignificant (varchar lengths is one), so the change 
would be mostly wiring. It should be easy to add a separate VRB converter 
rather than trying to shoehorn this into VectorDeserializeRow or OI converters, 
since both of these deal with rows (row->row or row->VRB). All the necessary 
schema and stuff are available in VectorMapOperator - it is already used for 
the non-native-VRB case.
Then making sure it doesn't apply to cases that do their own conversion (the 
only one right now is ORC) is another small wiring issue. 
Under current plan I have for stuff I can do it next week probably... 

> VectorMapOperator should take into account partition->table serde conversion 
> for all cases
> ------------------------------------------------------------------------------------------
>
>                 Key: HIVE-18763
>                 URL: https://issues.apache.org/jira/browse/HIVE-18763
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> When table and partition schema differ, non-vectorized MapOperator does row 
> by row conversion from whatever is read to the table schema.
> VectorMapOperator is less consistent: it does the conversion as part of 
> populating VRBs in row/serde modes (used to read e.g. text row-by-row or 
> natively, and make VRBs); see the VectorDeserializeRow class convert... 
> methods for an example. However, the native VRB mode relies on ORC 
> ConvertTreeReader... stuff that lives in ORC, and so never converts anything 
> inside VMO.
> So, anything running in native VRB mode that is not the vanilla ORC reader 
> will produce data with an incorrect schema if there were schema changes and 
> partitions are present. There are two such cases right now: LLAP IO with 
> ORC or text data, and Parquet. 
> It's possible to extend ConvertTreeReader... stuff to LLAP IO ORC, which 
> already uses TreeReader-s for everything; however, LLAP IO text and 
> (non-LLAP) Parquet, as well as any future users, will have to invent their 
> own conversion.
> Therefore, I think the best fix for this is to treat all inputs in VMO the 
> same and convert them by default, like the regular MapOperator; and make ORC 
> special mode an exception that allows it to bypass the conversion. 
> cc [~mmccline]
> Test case - varchar column length should be limited after alter table but it 
> isn't.
> {noformat}
> CREATE TABLE schema_evolution_data(insert_num int, boolean1 boolean, tinyint1 
> tinyint, smallint1 smallint, int1 int, bigint1 bigint, decimal1 
> decimal(38,18), float1 float, double1 double, string1 varchar(50), string2 
> varchar(50), date1 date, timestamp1 timestamp, boolean_str string, 
> tinyint_str string, smallint_str string, int_str string, bigint_str string, 
> decimal_str string, float_str string, double_str string, date_str string, 
> timestamp_str string, filler string)
> row format delimited fields terminated by '|' stored as textfile;
> load data local inpath 
> '../../data/files/schema_evolution/schema_evolution_data.txt' overwrite into 
> table schema_evolution_data;
> drop table if exists vsp;
> create table vsp(vs varchar(50)) partitioned by(s varchar(50)) stored as 
> textfile;
> insert into table vsp partition(s='positive') select string1 from 
> schema_evolution_data;
> alter table vsp change column vs vs varchar(3);
> drop table if exists vsp_orc;
> create table vsp_orc(vs varchar(50)) partitioned by(s varchar(50)) stored as 
> orc;
> insert into table vsp_orc partition(s='positive') select string1 from 
> schema_evolution_data;
> alter table vsp_orc change column vs vs varchar(3);
> drop table if exists vsp_parquet;
> create table vsp_parquet(vs varchar(50)) partitioned by(s varchar(50)) stored 
> as parquet;
> insert into table vsp_parquet partition(s='positive') select string1 from 
> schema_evolution_data;
> alter table vsp_parquet change column vs vs varchar(3);
> SET hive.llap.io.enabled=true;
> -- BAD results from all queries; parquet affected regardless of IO.
> select length(vs) from vsp; 
> select length(vs) from vsp_orc;
> select length(vs) from vsp_parquet;
> SET hive.llap.io.enabled=false;
> select length(vs) from vsp; -- ok
> select length(vs) from vsp_orc; -- ok
> select length(vs) from vsp_parquet; -- still bad
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
