umehrot2 edited a comment on pull request #1722:
URL: https://github.com/apache/hudi/pull/1722#issuecomment-643568699


   Like @vinothchandar, I agree with the **high level approach** here, and 
thanks for putting out this PR 👍  However, I would highly recommend that both of 
you check out https://github.com/apache/hudi/pull/1702/ which is along similar 
lines and solves some of the issues I see in this PR:
   
   - Here we are instantiating another datasource/relation, i.e. 
`HoodieRealtimeFileFormat` and the `spark parquet` relation, within the `Snapshot 
relation`. This has overheads associated with it, such as Spark having to build 
its index again by listing the paths passed to the `HoodieRealtimeFileFormat` and 
`spark parquet` relations in order to instantiate them.
   
   - We re-use the `ParquetFileFormat` reader and all of its 
functionality, like **vectorized reading**, **predicate pushdown**, and **column 
pruning**, without having to copy it over and maintain it internally.
   
   - We do not have to pass the expensive `map from parquet to log files` to 
each task. Instead, it gives complete control over what goes into each task 
partition, and we send only the file and its corresponding mapping (in our case 
the `external data file`, and in this case the `log file`) over to the task. That 
is the very point of the **RDD** interface: to have that kind of control over the 
datasource we are building.
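   A minimal sketch of the per-partition mapping idea in plain Python (hypothetical names, not Hudi's actual classes): instead of shipping the full parquet-to-log-file map to every task, the partition planner slices the map so each task partition carries only the one file group it will read:

```python
# Hypothetical sketch: plan task partitions so each task receives only
# its own parquet -> log-files mapping, not the full global map.

def plan_partitions(parquet_to_logs):
    """Turn a global {parquet_file: [log_files]} map into per-task partitions.

    Each partition holds exactly one parquet file and its log files, so the
    driver never has to send the whole map to every executor task.
    """
    return [
        {"parquet": parquet, "logs": logs}
        for parquet, logs in sorted(parquet_to_logs.items())
    ]

# Example: three file groups become three independent task partitions.
mapping = {
    "f1.parquet": ["f1.log.1", "f1.log.2"],
    "f2.parquet": [],
    "f3.parquet": ["f3.log.1"],
}
partitions = plan_partitions(mapping)
print(len(partitions))        # 3
print(partitions[0]["logs"])  # ['f1.log.1', 'f1.log.2']
```

   In a custom RDD, `plan_partitions` would correspond to `getPartitions`, with each returned element becoming one `Partition` object serialized to exactly one task.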
   
   Happy to have more in-depth discussion on this and help get this to 
completion.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
