[
https://issues.apache.org/jira/browse/IMPALA-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on IMPALA-9290 started by Norbert Luksa.
---------------------------------------------
> ORC scanner should support schema evolution between date and timestamp types
> ----------------------------------------------------------------------------
>
> Key: IMPALA-9290
> URL: https://issues.apache.org/jira/browse/IMPALA-9290
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 3.3.0
> Reporter: Gabor Kaszab
> Assignee: Norbert Luksa
> Priority: Major
> Labels: orc
>
> *This is the desired use case:*
> 1. Create an ORC table TBL1 with a DATE column.
> 2. Create an ORC table TBL2 with a TIMESTAMP column that has the same
> location as TBL1.
> 3. Insert some DATE values into TBL1 and some TIMESTAMP values into TBL2.
> 4. select from TBL1 returns both DATE and TIMESTAMP values (converted to
> DATE).
> 5. select from TBL2 returns both DATE and TIMESTAMP values. The DATE values
> are converted to TIMESTAMP.
> Without this feature, Impala returns an error:
> {code:java}
> ERROR: Type mismatch: table column DATE is map to column timestamp in ORC
> file 'hdfs://localhost:20500/test-warehouse/orc_date_tbl/000000_0_copy_1'
> {code}
> *Note:*
> With https://issues.apache.org/jira/browse/IMPALA-8801 implementing Date type
> for ORC it is possible to read date values in ORC format. However, writing is
> still not supported and has to be done by Hive.
> *Let me copy-paste a code review comment from IMPALA-8801 as a suggestion for
> the implementation:*
> We can modify OrcTimestampReader to support reading orc::TimestampVectorBatch
> into Date type slots. In its constructor it knows which kind of slots
> (timestamp or date) it's writing to. So in ReadValue() it can have different
> behaviors based on different modes (timestamp values => timestamp slots /
> timestamp values => date slots). We can do the same on OrcDateColumnReader to
> let it support reading ORC Date values into Timestamp type slots.
> Note that the life cycle of an OrcColumnReader is within the life cycle of the
> HdfsOrcScanner which only reads a split of an ORC file, and an ORC file can't
> have two types for one column (e.g. column1 is timestamp in stripe1 and is
> date in stripe2). So we don't need to deal with different batch types in
> UpdateInputBatch().
> BTW, it'd be better to add test coverage for this type compatibility check
> in test_scanners.py (see TestOrc.test_type_conversions).
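
The two reader modes suggested above can be sketched roughly as follows. This is only an illustration in Python, not Impala's C++ code: the class names TimestampReaderSketch and DateReaderSketch, the SlotType enum and the TimestampValue type are all hypothetical. It assumes a DATE is a count of days since the Unix epoch, a TIMESTAMP is seconds since the epoch plus nanoseconds, that timestamp => date drops the time of day, and that no time zone adjustment is applied.

```python
from dataclasses import dataclass
from enum import Enum

SECONDS_PER_DAY = 86_400


class SlotType(Enum):
    """Kind of slot the reader writes to (hypothetical name)."""
    DATE = "date"
    TIMESTAMP = "timestamp"


@dataclass(frozen=True)
class TimestampValue:
    """Hypothetical stand-in for a timestamp: seconds since epoch plus nanos."""
    seconds: int
    nanos: int = 0


class TimestampReaderSketch:
    """Reads timestamp values into timestamp slots or date slots."""

    def __init__(self, slot_type: SlotType):
        # The target slot type is fixed once, at construction time.
        self.slot_type = slot_type

    def read_value(self, value: TimestampValue):
        if self.slot_type is SlotType.TIMESTAMP:
            return value  # timestamp values => timestamp slots
        # timestamp values => date slots: keep only the day number.
        # Floor division maps pre-epoch seconds to the earlier day.
        return value.seconds // SECONDS_PER_DAY


class DateReaderSketch:
    """Reads date values (days since epoch) into date or timestamp slots."""

    def __init__(self, slot_type: SlotType):
        self.slot_type = slot_type

    def read_value(self, days: int):
        if self.slot_type is SlotType.DATE:
            return days  # date values => date slots
        # date values => timestamp slots: midnight of that day (assumption).
        return TimestampValue(days * SECONDS_PER_DAY, 0)
```

Because the mode is chosen in the constructor, read_value() never has to inspect the batch type per row, which matches the observation that a column cannot change type within one file.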
--
This message was sent by Atlassian Jira
(v8.3.4#803005)