Hi Gary,

On COW, today we already let the engines (Spark, Hive, Presto) use their own readers for Parquet. But as we embark on the MOR snapshot query (aka the realtime input format), it may make sense to have abstractions in our own code base to efficiently read a file slice (base file + log files) out in different common formats: ArrayWritable for Hive, Row for Spark, and so on.
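To illustrate the idea (a rough sketch only; names like FileSliceReader and RecordConverter are hypothetical, not an actual Hudi API): one engine-agnostic reader merges the base file with its log files, and each engine supplies a small converter from the merged record to its native row type.

    import java.io.Closeable;
    import java.util.Iterator;

    import org.apache.avro.generic.GenericRecord;

    // Hypothetical sketch, not Hudi's actual API: the engine-agnostic part
    // merges a base file with its log files and iterates the merged records.
    public interface FileSliceReader extends Iterator<GenericRecord>, Closeable {

      // Adapter each engine implements: Hive would produce ArrayWritable,
      // Spark would produce Row/InternalRow, and so on.
      interface RecordConverter<T> {
        T convert(GenericRecord mergedRecord);
      }

      // Wrap this reader so a given engine sees its own record type directly.
      default <T> Iterator<T> as(RecordConverter<T> converter) {
        FileSliceReader self = this;
        return new Iterator<T>() {
          @Override public boolean hasNext() { return self.hasNext(); }
          @Override public T next() { return converter.convert(self.next()); }
        };
      }
    }

The merge logic then lives in one place, and each engine integration is reduced to a thin converter rather than a full reader of its own.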
I will review the RFC more closely. But at a high level, I am +1 on evolving our FileSlice abstraction more cleanly (HUDI-684's goal is precisely that) and plugging it in flexibly under Spark. We need not be encumbered by the mapreduce/InputFormat APIs per se.

Thanks,
Vinoth

On Wed, Apr 22, 2020 at 6:06 PM leesf <[email protected]> wrote:

> Hi Gary,
>
> Thanks for your proposal. I read the input format codebase before and
> thought it could be optimized. I will take a look at the design doc when
> I get a chance and give feedback.
>
> Gary Li <[email protected]> wrote on Wed, Apr 22, 2020 at 5:43 AM:
>
> > Hi folks,
> >
> > I’d like to bring up a discussion regarding better read support for
> > Hudi datasets.
> >
> > At this point, the Hudi MOR table depends on Hive's
> > MapredParquetInputFormat, built on the first-generation Hadoop
> > MapReduce APIs (org.apache.hadoop.mapred.*), while the Spark
> > DataSource uses the second generation (org.apache.hadoop.mapreduce.*).
> > Since combining the two sets of APIs leads to unexpected behavior, we
> > need to decouple the Hudi-related logic from the MapReduce APIs and
> > provide two separate support levels for V1 and V2.
> >
> > With native Spark DataSource support, we can take advantage of better
> > query performance, more Spark optimizations such as the
> > VectorizedReader, and support for more storage formats such as ORC.
> > Abstracting the InputFormat and RecordReader will also open the door
> > to supporting more query engines in the future.
> >
> > I created an RFC with more details:
> > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstract+HoodieInputFormat+and+RecordReader
> > Any feedback is appreciated.
> >
> > Best Regards,
> > Gary Li
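For reference, a minimal illustration (not Hudi code) of the two API generations discussed above: they define unrelated InputFormat types, which is why logic written against one cannot simply be handed to an engine that expects the other.

    // Not Hudi code -- just shows the two distinct Hadoop type hierarchies.
    public class ApiGenerations {
      // V1 ("mapred"): the generation Hive's MapredParquetInputFormat implements.
      org.apache.hadoop.mapred.InputFormat<?, ?> v1;

      // V2 ("mapreduce"): the generation the Spark DataSource path builds on,
      // per the discussion above.
      org.apache.hadoop.mapreduce.InputFormat<?, ?> v2;

      // There is no subtype relationship between the two, so e.g. "v2 = v1"
      // does not compile; shared Hudi logic must sit behind its own
      // abstraction and be adapted to each generation separately.
    }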
