umehrot2 commented on pull request #1722: URL: https://github.com/apache/hudi/pull/1722#issuecomment-643568699
Like @vinothchandar I do agree with the **high level approach** here, and thanks for putting out this PR 👍 However, I would highly recommend both of you check out https://github.com/apache/hudi/pull/1702/ which is along similar lines, and solves some of the issues I see in this PR:

- Here we are instantiating another datasource/relation, i.e. `HoodieRealtimeFileFormat` and the `spark parquet` relation, which has overheads associated with it, like Spark having to form the index again by listing the paths passed to the datasource.
- We re-use the `ParquetFileFormat` reader and all of its functionality, like **vectorized reading**, **predicate pushdown**, and **column pruning**, without having to copy it over and maintain it internally.
- We do not have to pass the expensive `map from parquet to log files` to each task. Instead it gives complete control over what goes into each task partition, and we send only the file and its corresponding mapping (in our case `external data file`, and in this case `log file`) over to the task. This is exactly the use of the **RDD** interface to get that kind of control over the datasource we are building.

Happy to have a more in-depth discussion on this and help get this to completion.
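The per-partition mapping point in the last bullet can be sketched roughly as follows. This is a minimal illustration in plain Python, not Hudi's or Spark's actual API: the class and function names (`FileSlicePartition`, `plan_partitions`) are hypothetical. The idea is that partition planning slices the full base-file-to-log-files map up front, so each task only ever receives its own `(base file, log files)` pair instead of the whole map.

```python
# Hypothetical sketch of RDD-style partition planning (not Hudi's real classes):
# each partition object carries only its own slice of the parquet -> log-files
# map, so only that slice is serialized and shipped to the corresponding task.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileSlicePartition:
    """One task partition: a base file plus the log files that apply to it."""
    index: int
    base_file: str
    log_files: List[str] = field(default_factory=list)


def plan_partitions(parquet_to_logs: Dict[str, List[str]]) -> List[FileSlicePartition]:
    """Split the full mapping into one partition per base file.

    A custom RDD's getPartitions() would do the analogous thing: decide here,
    on the driver, exactly what goes into each partition, so tasks never need
    the full (potentially large) mapping.
    """
    return [
        FileSlicePartition(i, base, logs)
        for i, (base, logs) in enumerate(sorted(parquet_to_logs.items()))
    ]


mapping = {"a.parquet": ["a.log.1", "a.log.2"], "b.parquet": []}
parts = plan_partitions(mapping)
# Each task would see only its own partition, e.g. parts[0] holds just
# ("a.parquet", ["a.log.1", "a.log.2"]).
```

In Spark terms, this corresponds to overriding `getPartitions` on a custom `RDD` so the driver controls partition contents, rather than broadcasting the full mapping to every executor.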
