massdosage commented on issue #843: [WIP] InputFormat support for Iceberg
URL: https://github.com/apache/incubator-iceberg/pull/843#issuecomment-599672081
 
 
   OK, so the `InputFormat` we've been working on 
(https://github.com/ExpediaGroup/hiveberg/blob/master/src/main/java/com/expediagroup/hiveberg/IcebergInputFormat.java)
 implements the `org.apache.hadoop.mapred.InputFormat` API. I see the one above 
implements the `org.apache.hadoop.mapreduce.InputFormat` API. Which version of 
Hive is the latter targeting? I'm pretty sure it won't work on Hive 2.x: we 
tried that first and then moved to the `mapred` API, which we can confirm is 
working for a simple read path (we also have a Hive unit test for this at 
https://github.com/ExpediaGroup/hiveberg/blob/master/src/test/java/com/expediagroup/hiveberg/TestIcebergInputFormat.java).
 See 
https://stackoverflow.com/questions/33235199/hadoop-hive-serde-input-format-must-implement-inputformat
 for some more information on the two APIs.
   
   One possible approach here would be for us to move our code from the 
"hiveberg" repo back into a PR against Iceberg, but we would need someone from 
the Iceberg project to help us with the dependency and version hell that this 
leads to (specifically Hive, Guava and Avro version conflicts). We could then 
have a `mapred` input format and a `mapreduce` input format and, at that point, 
refactor the common code out of them.
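
   To illustrate the kind of refactoring meant here, below is a minimal sketch 
in which both entry points delegate to one shared planning method. It does not 
use the real Hadoop or Iceberg types: `SharedSplitPlanning`, `PlannedSplit`, 
`planSplits` and the two `getSplits...Style` methods are hypothetical names 
made up for this example, and they only mirror the shape of the two APIs (the 
`mapred` one returns an array and takes a split-count hint, the `mapreduce` one 
returns a `List`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SharedSplitPlanning {

    // Hypothetical stand-in for one planned unit of work. In real code this
    // would wrap whatever the table scan planning returns.
    static final class PlannedSplit {
        final String dataFile;

        PlannedSplit(String dataFile) {
            this.dataFile = dataFile;
        }
    }

    // Common logic that both input formats would delegate to.
    static List<PlannedSplit> planSplits(List<String> dataFiles) {
        List<PlannedSplit> splits = new ArrayList<>();
        for (String file : dataFiles) {
            splits.add(new PlannedSplit(file));
        }
        return splits;
    }

    // Mirrors the shape of the mapred API: returns an array and receives a
    // split-count hint, which this sketch ignores.
    static PlannedSplit[] getSplitsMapredStyle(List<String> dataFiles, int numSplitsHint) {
        return planSplits(dataFiles).toArray(new PlannedSplit[0]);
    }

    // Mirrors the shape of the mapreduce API: returns a List.
    static List<PlannedSplit> getSplitsMapreduceStyle(List<String> dataFiles) {
        return planSplits(dataFiles);
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.parquet", "b.parquet");
        System.out.println(getSplitsMapredStyle(files, 1).length);
        System.out.println(getSplitsMapreduceStyle(files).size());
    }
}
```

   In the real classes the two thin wrappers would be the actual `InputFormat` 
implementations, and only the conversion to each API's split type would differ.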
   
   The InputFormat is also only one part of the picture; we also intend 
to add a SerDe (see 
https://github.com/ExpediaGroup/hiveberg/blob/master/src/main/java/com/expediagroup/hiveberg/IcebergSerDe.java)
 and related classes, and ultimately potentially bundle this all into a Hive 
StorageHandler. This and the other roadmap items we have in mind can be found at 
https://github.com/ExpediaGroup/hiveberg/issues. 
   
   Again, we would happily move all this work over here, and I think that 
would make it a lot easier to have visibility of all the issues and to split 
the work so that we remove duplication etc. 

