massdosage commented on issue #843: [WIP] InputFormat support for Iceberg
URL: https://github.com/apache/incubator-iceberg/pull/843#issuecomment-599672081

OK, so the `InputFormat` we've been working on (https://github.com/ExpediaGroup/hiveberg/blob/master/src/main/java/com/expediagroup/hiveberg/IcebergInputFormat.java) implements the `org.apache.hadoop.mapred.InputFormat` API. I see the one above implements the `org.apache.hadoop.mapreduce.InputFormat` API. Which version of Hive is the latter targeting? I'm pretty sure it won't work on Hive 2.x: we tried that first and then moved to the `mapred` API, which we can confirm is working for a simple read path (we also have a Hive unit test for this at https://github.com/ExpediaGroup/hiveberg/blob/master/src/test/java/com/expediagroup/hiveberg/TestIcebergInputFormat.java). See https://stackoverflow.com/questions/33235199/hadoop-hive-serde-input-format-must-implement-inputformat for some more information on the two APIs.

One possible approach here would be to move our code from the "hiveberg" repo back into a PR against Iceberg, but we would need someone from the Iceberg project to help us with the dependency and version hell that this leads to (specifically Hive, Guava and Avro version conflicts). We could then have a `mapred` input format and a `mapreduce` input format, and at that point refactor out the commonality between them.

The InputFormat is also only one part of the picture. We also intend to add a SerDe (see https://github.com/ExpediaGroup/hiveberg/blob/master/src/main/java/com/expediagroup/hiveberg/IcebergSerDe.java) and related classes, and ultimately potentially bundle this all into a Hive StorageHandler. This and the other roadmap items we have in mind can be found at https://github.com/ExpediaGroup/hiveberg/issues.

Again, we would happily move all this work over to here. I think that would make it a lot easier to have visibility of all the issues and to split up the work, remove duplication, etc.
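
To make the "refactor out the commonality" idea a bit more concrete, here is a rough sketch of the shape it could take. It deliberately does not depend on Hadoop: `MapredStyleFormat` and `MapreduceStyleFormat` are hypothetical stand-ins that only mirror how `getSplits` differs between the two APIs (the `mapred` one returns an array and takes a split-count hint, the `mapreduce` one returns a `List`). `Split`, `planSplits`, the fixed-size chunking and the file name are all assumptions made for illustration; none of this is code from hiveberg or Iceberg.

```java
import java.util.ArrayList;
import java.util.List;

final class InputFormatSketch {

    /** Hypothetical unit of work: a byte range of one data file. */
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    /** Stand-in shaped like the older mapred API: splits come back as an array. */
    interface MapredStyleFormat {
        Split[] getSplits(int numSplitsHint);
    }

    /** Stand-in shaped like the newer mapreduce API: splits come back as a List. */
    interface MapreduceStyleFormat {
        List<Split> getSplits();
    }

    /** Shared planning logic: cut a file into ranges of at most targetSize bytes. */
    static List<Split> planSplits(String path, long fileLength, long targetSize) {
        if (targetSize <= 0) {
            throw new IllegalArgumentException("targetSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += targetSize) {
            splits.add(new Split(path, start, Math.min(targetSize, fileLength - start)));
        }
        return splits;
    }

    /** Thin mapred-style wrapper; the hint is ignored in this sketch. */
    static MapredStyleFormat mapredFormat(String path, long fileLength, long targetSize) {
        return hint -> planSplits(path, fileLength, targetSize).toArray(new Split[0]);
    }

    /** Thin mapreduce-style wrapper over the same planning logic. */
    static MapreduceStyleFormat mapreduceFormat(String path, long fileLength, long targetSize) {
        return () -> planSplits(path, fileLength, targetSize);
    }
}
```

With this layout, both wrappers would return the same three ranges for a 250-byte file and a 100-byte target size, for example via `InputFormatSketch.mapredFormat("data.parquet", 250L, 100L).getSplits(1)`. The real classes would of course also need record readers and the Hive-specific wiring, which is where most of the version conflicts mentioned above are likely to show up.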
