Hi Gary,

Thanks for your proposal. I read the input format codebase a while ago and thought it could be optimized. I will take a look at the design doc when I get a chance and give feedback.
Gary Li <[email protected]> wrote on Wed, Apr 22, 2020 at 5:43 AM:

> Hi Folks,
>
> I’d like to start a discussion about better read support for Hudi
> datasets.
>
> Currently, the Hudi MOR table depends on Hive's
> MapredParquetInputFormat, which uses the first-generation Hadoop
> MapReduce APIs (org.apache.hadoop.mapred.xxx), while the Spark
> DataSource uses the second generation
> (org.apache.hadoop.mapreduce.xxx). Since combining the two sets of APIs
> leads to unexpected behavior, we need to decouple the Hudi-related
> logic from the MapReduce APIs and have two separate support levels for
> V1 and V2.
>
> With native Spark DataSource support, we can take advantage of better
> query performance, more Spark optimizations such as VectorizedReader,
> and support for more storage formats such as ORC. Abstracting the
> InputFormat and RecordReader will also open the door to supporting
> more query engines in the future.
>
> I created an RFC with more details:
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstract+HoodieInputFormat+and+RecordReader
> Any feedback is appreciated.
>
> Best Regards,
> Gary Li
