[GitHub] [iceberg] pvary opened a new pull request #1407: Hive: HiveIcebergOutputFormat first implementation - WIP

GitBox Tue, 01 Sep 2020 05:59:44 -0700


pvary opened a new pull request #1407:
URL: https://github.com/apache/iceberg/pull/1407



   The goal of the patch is to have a PoC implementation of Hive writes to 
Iceberg tables.
   
   The patch changes:
   - _HiveIcebergOutputFormat / RecordWriter_ to write out data to Iceberg 
tables
   - _HiveIcebergSerDe.serialize_ so the data in the Writables is converted to 
the correct Iceberg values
   - Tests for the happy path of the things above
   
   The tests pass. In addition, after applying this change on top of the uber 
jar patch, I was able to create a Hive table on top of an unpartitioned Iceberg 
table, write data to it, and read the data back. The table was created with the 
following command:
   ```
   CREATE EXTERNAL TABLE purchases STORED BY 
'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
   LOCATION '/tmp/hive/iceberg/test-run-3885506338417686444/purchases'
   TBLPROPERTIES ('iceberg.mr.write.file.format'='orc') ;
   ```
   
   Findings:
   - When writing UUID fields, the Parquet writer requires a byte[], whereas 
the ORC and Avro writers require a UUID object. I think this should be 
consistent across the FileFormats.
   - The Hive interface does not provide a way to "commit" all the writes of 
a query at once. I will investigate further, but so far I have not found a way 
to do it in one run. The current implementation commits the changes immediately 
when the writer is closed. This is _suboptimal and not correct_ when the 
written data ends up in multiple FileSinks.
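
To illustrate the UUID finding above, here is a minimal sketch of converting between a `java.util.UUID` and a `byte[]`. It is not the code in this patch. The class name `UuidBytes` is made up for the example, and the 16-byte big-endian layout (most significant bits first) is an assumption about what a byte-based writer expects.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper, not part of the patch.
final class UuidBytes {
  private UuidBytes() {
  }

  // Packs the UUID into 16 bytes; ByteBuffer writes big-endian by default.
  static byte[] toBytes(UUID uuid) {
    ByteBuffer buffer = ByteBuffer.allocate(16);
    buffer.putLong(uuid.getMostSignificantBits());
    buffer.putLong(uuid.getLeastSignificantBits());
    return buffer.array();
  }

  // Rebuilds the UUID from the same 16-byte layout.
  static UUID fromBytes(byte[] bytes) {
    if (bytes.length != 16) {
      throw new IllegalArgumentException("Expected 16 bytes, got " + bytes.length);
    }
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    long mostSignificant = buffer.getLong();
    long leastSignificant = buffer.getLong();
    return new UUID(mostSignificant, leastSignificant);
  }

  public static void main(String[] args) {
    UUID original = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
    UUID roundTripped = fromBytes(toBytes(original));
    System.out.println("round trip ok: " + original.equals(roundTripped));
  }
}
```

A conversion like this could live in one place so that callers always hand over the same Java type regardless of the file format.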
   
   Missing stuff:
   - Hive commit as described above
   - A way to handle partitioned tables. Do I have to write multiple DataFiles 
for them and add each one manually to the commit, or is there an API that helps 
with this? What about multi-column partitioning: is it possible?
   - More tests
   - Error handling
   - Logging
   - Did I mention more tests? :)
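
To make the missing commit step from the list above more concrete, here is a toy model of the intended behaviour: writers only report the files they produced when closed, and a single commit at the end of the query makes everything visible at once. All names here (`SingleCommitSketch`, `ToyTable`, `ToyWriter`, the file paths) are hypothetical and use neither Hive nor Iceberg APIs. It only sketches the shape of the idea, and where Hive would allow such a final step is still open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model, not Hive or Iceberg code.
final class SingleCommitSketch {
  // Stand-in for a table; every call to commit() is one visible change.
  static final class ToyTable {
    final List<List<String>> commits = new ArrayList<>();

    void commit(List<String> dataFiles) {
      commits.add(new ArrayList<>(dataFiles));
    }
  }

  // Stand-in for a writer: close() returns the written file names and does NOT commit.
  static final class ToyWriter {
    private final List<String> written = new ArrayList<>();

    void write(String dataFile) {
      written.add(dataFile);
    }

    List<String> close() {
      return new ArrayList<>(written);
    }
  }

  // Runs one "query" that writes through two writers and commits exactly once.
  static ToyTable runQuery() {
    ToyTable table = new ToyTable();
    ToyWriter first = new ToyWriter();
    ToyWriter second = new ToyWriter();
    first.write("sink-1/data-0.orc");
    second.write("sink-2/data-0.orc");

    // Collect the files from every writer before touching the table.
    List<String> pending = new ArrayList<>();
    pending.addAll(first.close());
    pending.addAll(second.close());
    table.commit(pending);
    return table;
  }

  public static void main(String[] args) {
    ToyTable table = runQuery();
    System.out.println(table.commits.size() + " commit(s): " + table.commits);
  }
}
```

With commit-on-close, the same query would produce one commit per writer, so a reader could observe the output of only one of the sinks.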
   
   Any suggestions / ideas / thoughts are welcome.
   
   Thanks,
   Peter
   
   

