szlta opened a new pull request #2613:
URL: https://github.com/apache/iceberg/pull/2613


   Currently Hive relies on org.apache.iceberg.mr.mapreduce.IcebergInputFormat 
for reading data from Iceberg. This in turn relies on the internal reader 
implementations for the various file formats, such as GenericOrcReader(s). 
Unfortunately for Hive, the whole read pipeline is record based: the readers 
produce GenericRecord instances, and Hive has to use the Iceberg object 
inspectors to retrieve the encoded data from each of them.
   
   From a design point of view I see 3 layers, with data flowing upwards 
through them during a read:
   - Execution engine: Hive
   - Table format: Iceberg
   - File format: ORC/Parquet/Avro
   
   The current design, however, combines the two bottom layers. Although Hive 
already supports vectorized execution and already has implementations for 
vectorized reading from various file formats, it cannot benefit from 
vectorization once Iceberg is in the picture.
   
   In this change I propose to swap out the bottom layer for the 
implementations Hive already has; in the case of ORC, that is 
VectorizedOrcInputFormat. It produces the VectorizedRowBatch instances that 
Hive expects as its vectorized format. These could serve as the 'Hive' 
in-memory data model, a distinction IcebergInputFormat already makes.
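
   To illustrate the difference between the record-based and the vectorized 
read path, here is a minimal, self-contained sketch. It does not use any Hive 
or Iceberg classes: Row and ColumnBatch are hypothetical stand-ins for 
GenericRecord and VectorizedRowBatch, and the batch size of 1024 is only an 
illustrative value.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchVsRecordSketch {

    // Hypothetical stand-in for a record-oriented row (NOT Iceberg's GenericRecord).
    static final class Row {
        final long id;
        final long amount;

        Row(long id, long amount) {
            this.id = id;
            this.amount = amount;
        }
    }

    // Hypothetical stand-in for a columnar batch (NOT Hive's VectorizedRowBatch).
    // Each column is a primitive array; 'size' is the number of valid entries.
    static final class ColumnBatch {
        final long[] id;
        final long[] amount;
        final int size;

        ColumnBatch(long[] id, long[] amount, int size) {
            this.id = id;
            this.amount = amount;
            this.size = size;
        }
    }

    // Record-based path: one object is visited per row.
    static long sumByRecord(List<Row> rows) {
        long total = 0;
        for (Row row : rows) {
            total += row.amount;
        }
        return total;
    }

    // Vectorized path: a tight loop over one primitive column per batch.
    static long sumByBatch(List<ColumnBatch> batches) {
        long total = 0;
        for (ColumnBatch batch : batches) {
            for (int i = 0; i < batch.size; i++) {
                total += batch.amount[i];
            }
        }
        return total;
    }

    // Converts rows into column batches of at most batchSize rows each.
    static List<ColumnBatch> toBatches(List<Row> rows, int batchSize) {
        List<ColumnBatch> batches = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int size = Math.min(batchSize, rows.size() - start);
            long[] id = new long[size];
            long[] amount = new long[size];
            for (int i = 0; i < size; i++) {
                Row row = rows.get(start + i);
                id[i] = row.id;
                amount[i] = row.amount;
            }
            batches.add(new ColumnBatch(id, amount, size));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        for (long i = 0; i < 2500; i++) {
            rows.add(new Row(i, i));
        }
        List<ColumnBatch> batches = toBatches(rows, 1024);
        System.out.println("batches: " + batches.size());
        System.out.println("sum by record: " + sumByRecord(rows));
        System.out.println("sum by batch:  " + sumByBatch(batches));
    }
}
```

   Both paths compute the same result; the point is only that the vectorized 
path hands the engine whole columns at once, so no per-row object has to be 
unpacked through an object inspector.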
   
   One big obstacle is that Iceberg depends on the ORC artifact that has the 
nohive classifier. This means that classes that actually come from 
Hive/storage-api are shaded inside ORC under an ORC class name. To revert this 
while causing minimal disruption to other parts of Iceberg, I propose to 
create an intermediary module for iceberg-hive3 to depend on (rather than 
depending on iceberg-orc and iceberg-data).
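
   The naming mismatch caused by the shading can be sketched as follows. This 
assumes that the nohive artifact relocates the org.apache.hadoop.hive package 
prefix to org.apache.orc.storage, which is my understanding of the ORC build 
and not something stated above. The class and helper names in the sketch are 
hypothetical and only manipulate strings; nothing is loaded from Hive or ORC.

```java
public class NohiveRelocationSketch {

    // Assumed package prefixes: original Hive storage-api vs. ORC nohive relocation.
    static final String HIVE_PREFIX = "org.apache.hadoop.hive.";
    static final String NOHIVE_PREFIX = "org.apache.orc.storage.";

    // Maps an original Hive class name to the name it would have after relocation.
    static String toNohiveName(String hiveClassName) {
        if (!hiveClassName.startsWith(HIVE_PREFIX)) {
            return hiveClassName;
        }
        return NOHIVE_PREFIX + hiveClassName.substring(HIVE_PREFIX.length());
    }

    // Maps a relocated class name back to the original Hive class name.
    static String toHiveName(String nohiveClassName) {
        if (!nohiveClassName.startsWith(NOHIVE_PREFIX)) {
            return nohiveClassName;
        }
        return HIVE_PREFIX + nohiveClassName.substring(NOHIVE_PREFIX.length());
    }

    public static void main(String[] args) {
        String hiveBatch = "org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch";
        String relocated = toNohiveName(hiveBatch);
        System.out.println("Hive expects:    " + hiveBatch);
        System.out.println("nohive provides: " + relocated);
        // Different fully qualified names mean different, incompatible types.
        System.out.println("same type name:  " + hiveBatch.equals(relocated));
    }
}
```

   Because the two names denote unrelated types to the JVM, a batch produced 
through the shaded classes cannot be handed to Hive's vectorized operators, 
which is why the Hive-facing module needs to depend on the non-shaded ORC.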


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


