Hi folks, I'd like to bring up a discussion regarding better read support for Hudi datasets.
At this point, the Hudi MOR table depends on Hive's MapredParquetInputFormat, which uses the first-generation Hadoop MapReduce APIs (org.apache.mapred.*), while the Spark DataSource uses the second generation (org.apache.mapreduce.*). Since combining the two sets of APIs leads to unexpected behaviors, we need to decouple the Hudi-related logic from the MapReduce APIs and provide two separate support levels for V1 and V2.

With native Spark DataSource support, we can take advantage of better query performance, more Spark optimizations such as the VectorizedReader, and support for more storage formats such as ORC. Abstracting the InputFormat and RecordReader will also open the door to supporting more query engines in the future.

I created an RFC with more details: https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstract+HoodieInputFormat+and+RecordReader. Any feedback is appreciated.

Best Regards,
Gary Li
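To make the decoupling idea concrete, here is a minimal sketch of the shape such an abstraction could take. All names here (HoodieRecordIterator, openReader, the adapter methods) are illustrative assumptions, not Hudi's actual classes, and the real adapters would implement org.apache.mapred.RecordReader and org.apache.mapreduce.RecordReader respectively; the point is that the Hudi-specific reading logic lives behind an engine-agnostic interface that both API generations delegate to.

```java
import java.util.Iterator;
import java.util.List;

public class AbstractReaderSketch {

    /** Hudi-specific record iteration, free of any org.apache.mapred(uce) types. */
    interface HoodieRecordIterator<T> extends Iterator<T> {}

    /**
     * Shared core logic used by both API generations. In this toy sketch it
     * just iterates an in-memory list standing in for base-file records.
     */
    static <T> HoodieRecordIterator<T> openReader(List<T> baseFileRecords) {
        Iterator<T> it = baseFileRecords.iterator();
        return new HoodieRecordIterator<T>() {
            public boolean hasNext() { return it.hasNext(); }
            public T next() { return it.next(); }
        };
    }

    // Thin per-API adapters: in real code these would implement the V1
    // (org.apache.mapred) and V2 (org.apache.mapreduce) RecordReader
    // contracts, each delegating to the shared openReader(...) logic so the
    // Hudi logic is written only once.
    static <T> Iterator<T> v1Adapter(List<T> records) { return openReader(records); }
    static <T> Iterator<T> v2Adapter(List<T> records) { return openReader(records); }

    public static void main(String[] args) {
        List<String> records = List.of("r1", "r2", "r3");
        StringBuilder out = new StringBuilder();
        Iterator<String> reader = v1Adapter(records);
        while (reader.hasNext()) {
            out.append(reader.next()).append(" ");
        }
        System.out.println(out.toString().trim());
    }
}
```

With this shape, adding another engine (or the ORC path mentioned above) means writing one more thin adapter rather than duplicating the merge/read logic per API.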
