Hi Gary,
Thanks for your proposal. I read the input format codebase before and
thought it could be optimized. I will take a look at the design doc when I
get a chance and give feedback.

Gary Li <[email protected]> wrote on Wed, Apr 22, 2020 at 5:43 AM:

> Hi Folks,
>
> I’d like to bring up a discussion regarding better read support for
> Hudi datasets.
>
> At this point, the Hudi MOR table depends on Hive's
> MapredParquetInputFormat, which uses the first-generation Hadoop MapReduce
> APIs (org.apache.hadoop.mapred.xxx), while Spark DataSource uses the
> second-generation APIs (org.apache.hadoop.mapreduce.xxx). Since combining
> the two sets of APIs leads to unexpected behavior, we need to decouple the
> Hudi-related logic from the MapReduce APIs and have two separate support
> levels for V1 and V2.
>
> With native Spark DataSource support, we can take advantage of better
> query performance, more Spark optimizations like VectorizedReader, and more
> storage format support like ORC. The abstraction of the InputFormat and
> RecordReader will also open the door to support for more query engines in
> the future.
>
> I created an RFC with more details:
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstract+HoodieInputFormat+and+RecordReader
> Any feedback is appreciated.
>
> Best Regards,
> Gary Li
>
>
