umehrot2 commented on pull request #1722: URL: https://github.com/apache/hudi/pull/1722#issuecomment-643568699
Like @vinothchandar I do agree with the **high level approach** here, and thanks for putting out this PR 👍 However, I would highly recommend both of you check out https://github.com/apache/hudi/pull/1702/ which is along similar lines, and solves some of the issues I see in this PR:

- Here we are instantiating another datasource/relation, i.e. `HoodieRealtimeFileFormat` and the `spark parquet` relation, which has overheads associated with it, like Spark having to form the index again by listing the paths passed to the datasource.
- We re-use the `ParquetFileFormat` reader and all of its functionality, like **vectorized reading**, **predicate pushdown**, and **column pruning**, without having to copy it over and maintain it internally.
- We do not have to pass the expensive `map from parquet to log files` to each task. Instead it gives complete control over what goes into each task partition, and we send only the file and its corresponding mapping (in our case `external data file`, and in this case `log file`) over to the task. This is exactly the use of the **RDD** interface to get that kind of control over the datasource we are building.

Happy to have a more in-depth discussion on this and help get this to completion.
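The per-partition mapping point in the last bullet can be sketched roughly as follows. This is a minimal illustration in plain Python, not Hudi's or Spark's actual API: the class and function names (`FileSlicePartition`, `plan_partitions`) are hypothetical. The idea is that partition planning slices the full base-file-to-log-files map up front, so each task only ever receives its own `(base file, log files)` pair instead of the whole map.

```python
# Hypothetical sketch of RDD-style partition planning (not Hudi's real classes):
# each partition object carries only its own slice of the parquet -> log-files
# map, so only that slice is serialized and shipped to the corresponding task.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileSlicePartition:
    """One task partition: a base file plus the log files that apply to it."""
    index: int
    base_file: str
    log_files: List[str] = field(default_factory=list)


def plan_partitions(parquet_to_logs: Dict[str, List[str]]) -> List[FileSlicePartition]:
    """Split the full mapping into one partition per base file.

    A custom RDD's getPartitions() would do the analogous thing: decide here,
    on the driver, exactly what goes into each partition, so tasks never need
    the full (potentially large) mapping.
    """
    return [
        FileSlicePartition(i, base, logs)
        for i, (base, logs) in enumerate(sorted(parquet_to_logs.items()))
    ]


mapping = {"a.parquet": ["a.log.1", "a.log.2"], "b.parquet": []}
parts = plan_partitions(mapping)
# Each task would see only its own partition, e.g. parts[0] holds just
# ("a.parquet", ["a.log.1", "a.log.2"]).
```

In Spark terms, this corresponds to overriding `getPartitions` on a custom `RDD` so the driver controls partition contents, rather than broadcasting the full mapping to every executor.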
