pvary opened a new pull request #1407:
URL: https://github.com/apache/iceberg/pull/1407
The goal of the patch is to have a PoC implementation of Hive writes to
Iceberg tables.
The patch changes:
- _HiveIcebergOutputFormat / RecordWriter_ to write out data to Iceberg
tables
- _HiveIcebergSerDe.serialize_ so that the data in the Writables is converted
to the correct Iceberg values
- Tests for the happy path of the things above
The tests ran successfully. In addition, after applying this change on top of
the uber jar patch, I was able to create a Hive table on top of an
unpartitioned Iceberg table, write data to it, and read the data back. The
table was created with the following command:
```
CREATE EXTERNAL TABLE purchases STORED BY
'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION '/tmp/hive/iceberg/test-run-3885506338417686444/purchases'
TBLPROPERTIES ('iceberg.mr.write.file.format'='orc') ;
```
Findings:
- When writing UUID type fields, the Parquet writer requires a byte[], whereas
the ORC and Avro writers require a UUID object. I think this should be
consistent across the FileFormats.
- The Hive interface does not provide a way to "commit" all the writes of a
query at once. I will investigate further, but so far I have not found a way
to do it in one run. The current implementation commits the changes
immediately when the writer is closed. This is _suboptimal and not correct_
when a query writes data that ends up on multiple FileSinks.
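
Until the writers agree on a single UUID representation, the serializer could
convert per file format. The following is only a minimal sketch of such a
conversion. It assumes that the byte[] expected on the Parquet side is the
16-byte big-endian form of the UUID, which should be verified against the
Parquet writer. `UuidBytes` is a hypothetical helper and is not part of
Iceberg or of this patch.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Hypothetical helper (not an Iceberg class): UUID <-> 16-byte big-endian array. */
final class UuidBytes {
  private UuidBytes() {}

  /** Most significant 8 bytes first, then the least significant 8 bytes. */
  static byte[] toBytes(UUID uuid) {
    ByteBuffer buffer = ByteBuffer.allocate(16); // ByteBuffer is big-endian by default
    buffer.putLong(uuid.getMostSignificantBits());
    buffer.putLong(uuid.getLeastSignificantBits());
    return buffer.array();
  }

  /** Inverse of toBytes; rejects arrays that are not exactly 16 bytes long. */
  static UUID fromBytes(byte[] bytes) {
    if (bytes.length != 16) {
      throw new IllegalArgumentException("Expected 16 bytes, got " + bytes.length);
    }
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    long mostSignificant = buffer.getLong();
    long leastSignificant = buffer.getLong();
    return new UUID(mostSignificant, leastSignificant);
  }
}
```

With something like this, the serialization path could hand `toBytes(uuid)` to
the Parquet writer and the UUID object itself to the ORC and Avro writers.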
Missing stuff:
- Hive commit as described above
- A way to handle partitioned tables. Do I have to write multiple DataFiles
for them and add them manually to the commit, or is there an API that helps
with this? What about multi-column partitioning: is it possible?
- More tests
- Error handling
- Logging
- Did I mention more tests? :)
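
As a starting point for discussion, the commit and partitioning items above
could be modelled roughly as below. This is a toy sketch under stated
assumptions and does not use the Hive or Iceberg APIs: `FakeTable`,
`PartitionedWriter` and the file naming are hypothetical stand-ins. It assumes
that each data file belongs to exactly one partition, that a partition key can
consist of several column values, and that closing a writer only reports the
files it produced, so the commit can happen once for the whole query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; none of these types exist in Hive or Iceberg. */
final class DeferredCommitSketch {
  private DeferredCommitSketch() {}

  /** Stand-in for a table: every commit produces one snapshot listing the added files. */
  static final class FakeTable {
    final List<List<String>> snapshots = new ArrayList<>();

    void commit(List<String> files) {
      snapshots.add(new ArrayList<>(files));
    }
  }

  /** Stand-in for a task writer that keeps one output file per partition key. */
  static final class PartitionedWriter {
    private final String taskId;
    // Partition key (possibly several column values) -> number of rows written.
    private final Map<List<Object>, Integer> rowsPerPartition = new LinkedHashMap<>();

    PartitionedWriter(String taskId) {
      this.taskId = taskId;
    }

    /** Routes one row to the file of its partition; only the key is tracked here. */
    void write(Object... partitionValues) {
      rowsPerPartition.merge(Arrays.asList(partitionValues), 1, Integer::sum);
    }

    /** Returns the files written by this task. It deliberately does not commit. */
    List<String> close() {
      List<String> files = new ArrayList<>();
      for (List<Object> key : rowsPerPartition.keySet()) {
        files.add(taskId + "/" + key + ".orc");
      }
      return files;
    }
  }
}
```

A job-level step would then gather the results of every `close()` call and call
`commit` exactly once, for example:

```java
DeferredCommitSketch.FakeTable table = new DeferredCommitSketch.FakeTable();
DeferredCommitSketch.PartitionedWriter first = new DeferredCommitSketch.PartitionedWriter("task-1");
DeferredCommitSketch.PartitionedWriter second = new DeferredCommitSketch.PartitionedWriter("task-2");
first.write("2020-08-31", "US");
first.write("2020-08-31", "HU");
second.write("2020-09-01", "US");

List<String> pending = new ArrayList<>();
pending.addAll(first.close());
pending.addAll(second.close());
table.commit(pending); // one snapshot for the whole query
```

On the Iceberg side, my understanding is that the equivalent would be to add
every DataFile to a single `AppendFiles` operation obtained from
`table.newAppend()` and to call `commit()` once. Where Hive would allow such a
job-level step is exactly the open question.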
Any suggestions / ideas / thoughts are welcome.
Thanks,
Peter