[ https://issues.apache.org/jira/browse/HIVE-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978476#comment-14978476 ]
Elliot West commented on HIVE-11092:
------------------------------------
Fixed by HIVE-4243 apparently. I'll confirm and close.
> First delta of an ORC ACID table contains non-descriptive schema
> ----------------------------------------------------------------
>
> Key: HIVE-11092
> URL: https://issues.apache.org/jira/browse/HIVE-11092
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Reporter: Elliot West
> Assignee: Elliot West
> Priority: Minor
> Labels: orc, orcfile, transaction, transactions
>
> I've been reading ORC ACID data that backs transactional tables from a
> process external to Hive. Initially I tried to use 'schema on read' but found
> inconsistencies between the schema returned from the initial delta file and
> the schemas returned from subsequent delta and base files. To reproduce the
> issue by example:
> {code}
> CREATE TABLE base_table ( id int, message string )
> PARTITIONED BY ( continent string, country string )
> CLUSTERED BY (id) INTO 1 BUCKETS
> STORED AS ORC
> TBLPROPERTIES ('transactional' = 'true');
>
> INSERT INTO TABLE base_table PARTITION (continent = 'Asia', country = 'India')
> VALUES (1, 'x'), (2, 'y'), (3, 'z');
> UPDATE base_table SET message = 'updated' WHERE id = 1;
> {code}
> Now examining the raw data with the {{orcfiledump}} utility (edited for
> brevity):
> {code}
> cd hive/warehouse/base_table/continent=Asia/country=India/
> hive --orcfiledump delta_0000001_0000001/bucket_00000
> Type: struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<_col0:int,_col1:string>>
>
>
> hive --orcfiledump delta_0000002_0000002/bucket_00000
> Type: struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<id:int,message:string>>
>
> {code}
> The row schema for the first delta that resulted from the inserts has its
> field names erased: {{row:struct<_col0:int,_col1:string>}}, whereas the delta
> for the update reports the correct schema:
> {{row:struct<id:int,message:string>}}. I have also checked this with my own
> reader code, so I am confident that {{FileDump}} is not at fault.
> I believe that the row field names, and hence schema, should be consistent
> across all ORC files in the ACID data set. This will enable schema on read
> with field access by name (not index), which is currently not possible.
> Therefore I'd like to get this issue resolved.
> I'm happy to work on this; however, after working through
> {{OrcRecordUpdater}}, {{FileSinkOperator}}, and related tests, I've failed to
> reproduce or isolate the issue at a smaller scale. I'd be grateful for some
> suggestions on where to look next.
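
For external readers affected by the inconsistency described above, the following is a minimal sketch of how the erased field names could be detected and worked around. It parses the two type strings quoted from the {{orcfiledump}} output and, when the row struct carries only positional {{_colN}} names, substitutes the declared column names by position. The helper names (row_fields, has_placeholder_names, resolve_names) are hypothetical and are not part of Hive or ORC. The sketch assumes a flat row of simple types and assumes that the positional order matches the column order in the CREATE TABLE statement.

```python
import re

# Type strings copied from the orcfiledump output quoted in the issue.
FIRST_DELTA = (
    "struct<operation:int,originalTransaction:bigint,bucket:int,"
    "rowId:bigint,currentTransaction:bigint,"
    "row:struct<_col0:int,_col1:string>>"
)
SECOND_DELTA = (
    "struct<operation:int,originalTransaction:bigint,bucket:int,"
    "rowId:bigint,currentTransaction:bigint,"
    "row:struct<id:int,message:string>>"
)


def row_fields(type_description):
    """Return [(name, type), ...] for the nested `row` struct.

    Hypothetical helper: handles only flat rows of simple types
    (no nested structs, no parameterised types such as decimal(10,2)).
    """
    match = re.search(r"row:struct<([^<>]*)>", type_description)
    if match is None:
        raise ValueError("no flat row struct found in type description")
    return [tuple(field.split(":", 1)) for field in match.group(1).split(",")]


def has_placeholder_names(fields):
    """True if every field name has the positional form _col<N>."""
    return all(re.fullmatch(r"_col\d+", name) is not None for name, _ in fields)


def resolve_names(fields, declared_columns):
    """Replace placeholder names with declared column names, by position.

    Assumption: declared_columns is in the same order as the table DDL.
    Fields that already carry real names are returned unchanged.
    """
    if not has_placeholder_names(fields):
        return fields
    if len(fields) != len(declared_columns):
        raise ValueError("declared column count does not match row struct")
    return [(name, ftype) for name, (_, ftype) in zip(declared_columns, fields)]


if __name__ == "__main__":
    declared = ["id", "message"]  # from the CREATE TABLE statement in the issue
    for label, desc in (("delta 1", FIRST_DELTA), ("delta 2", SECOND_DELTA)):
        fields = row_fields(desc)
        print(label, has_placeholder_names(fields), resolve_names(fields, declared))
```

This only papers over the problem on the reading side. It still depends on access by index for the first delta, which is the limitation the issue asks to have removed.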