Hi Folks,

I’d like to bring up a discussion regarding better read support for Hudi
datasets.

At this point, the Hudi MOR table depends on Hive's MapredParquetInputFormat,
which is built on the first-generation Hadoop MapReduce APIs
(org.apache.hadoop.mapred.xxx), while the Spark DataSource uses the second
generation (org.apache.hadoop.mapreduce.xxx). Since combining the two sets of
APIs leads to unexpected behaviors, we need to decouple the Hudi-related logic
from the MapReduce APIs and provide two separate support levels for V1 and V2.
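
For context, the two generations expose different split and record-reader
contracts, which is why mixing them gets awkward. Below is a minimal sketch
(the wrapper class names are hypothetical; the Hadoop interfaces and method
signatures are the real ones) contrasting the two:

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.io.ArrayWritable;
    import org.apache.hadoop.io.NullWritable;

    // First generation (org.apache.hadoop.mapred): the style Hive's
    // MapredParquetInputFormat, and therefore the current Hudi input
    // format, follows.
    class MapredStyleFormat
        implements org.apache.hadoop.mapred.InputFormat<NullWritable, ArrayWritable> {
      @Override
      public org.apache.hadoop.mapred.InputSplit[] getSplits(
          org.apache.hadoop.mapred.JobConf job, int numSplits) throws IOException {
        throw new UnsupportedOperationException("sketch only");
      }

      @Override
      public org.apache.hadoop.mapred.RecordReader<NullWritable, ArrayWritable> getRecordReader(
          org.apache.hadoop.mapred.InputSplit split,
          org.apache.hadoop.mapred.JobConf job,
          org.apache.hadoop.mapred.Reporter reporter) throws IOException {
        throw new UnsupportedOperationException("sketch only");
      }
    }

    // Second generation (org.apache.hadoop.mapreduce): the style Spark-side
    // readers build on; note the different method names, split types, and
    // context objects.
    class MapreduceStyleFormat
        extends org.apache.hadoop.mapreduce.InputFormat<NullWritable, ArrayWritable> {
      @Override
      public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(
          org.apache.hadoop.mapreduce.JobContext context)
          throws IOException, InterruptedException {
        throw new UnsupportedOperationException("sketch only");
      }

      @Override
      public org.apache.hadoop.mapreduce.RecordReader<NullWritable, ArrayWritable> createRecordReader(
          org.apache.hadoop.mapreduce.InputSplit split,
          org.apache.hadoop.mapreduce.TaskAttemptContext context)
          throws IOException, InterruptedException {
        throw new UnsupportedOperationException("sketch only");
      }
    }

Because the split and reader types live in disjoint hierarchies, a shared Hudi
abstraction would likely need to sit above both rather than bridge one into the
other.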

With native Spark DataSource support, we can take advantage of better query
performance, more Spark optimizations such as the VectorizedReader, and
support for more storage formats such as ORC. Abstracting the InputFormat and
RecordReader will also open the door to supporting more query engines in the
future.
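
To make the goal concrete, here is a rough sketch of what reading a Hudi table
directly through the Spark DataSource API could look like (the path below is a
placeholder, and this is illustrative rather than the final API proposed in the
RFC):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class HudiDataSourceReadSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hudi-datasource-read")
            .getOrCreate();

        // Read a Hudi table through the Spark DataSource API instead of
        // going through Hive's InputFormat.
        Dataset<Row> df = spark.read()
            .format("org.apache.hudi")       // Hudi DataSource format
            .load("/path/to/hudi/table");    // placeholder base path

        df.show();
        spark.stop();
      }
    }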

I created an RFC with more details:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstract+HoodieInputFormat+and+RecordReader.
Any feedback is appreciated.

Best Regards,
Gary Li
