garyli1019 commented on pull request #1722:
URL: https://github.com/apache/hudi/pull/1722#issuecomment-643901550


   A few major concerns here:
   - Listing files is too expensive.
   Solution: switch to the bootstrap file listing methods once udit's PR is 
merged, then move to RFC-15 once it is ready.
   - Broadcasting paths in the option hashmap could cause performance issues.
   I am not sure there is a better way to do this until RFC-15 is ready. 
Searching for log files from the executors could be more expensive, since it 
requires a TableView, a metaClient, etc. Even if we have thousands of log 
files, the hashmap might only be a few MBs, so I guess it could be OK?
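
   As a rough sanity check on the "few MBs" guess, here is a back-of-envelope 
sketch. The function names, the 200-character average path length, and the 
64-byte per-entry overhead are all assumptions for illustration, not measured 
values from Hudi or Spark; it also assumes ASCII paths at one byte per 
character.

```python
def estimate_path_payload_bytes(num_log_files, avg_path_chars,
                                per_entry_overhead_bytes=0):
    """Rough size of the log file paths carried in the option hashmap.

    Assumes one byte per path character (ASCII) plus a hypothetical
    fixed overhead per entry for keys, separators, or object headers.
    """
    return num_log_files * (avg_path_chars + per_entry_overhead_bytes)


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical table: 5,000 log files, ~200-character paths.
    raw = estimate_path_payload_bytes(5000, 200)
    padded = estimate_path_payload_bytes(5000, 200, per_entry_overhead_bytes=64)
    print(f"paths only:    {to_mib(raw):.2f} MiB")
    print(f"with overhead: {to_mib(padded):.2f} MiB")
```

   Under these assumptions, thousands of log files stay in the low single-digit 
MB range, but the estimate grows linearly with the file count, so a table with 
far more log files would need a second look.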
   
   Major follow-ups after this PR:
   - Incremental view on MOR tables.
   - Vectorized reader.
   - More efficient type conversion.
   - Support for custom payloads.
   These four can be done in parallel, but all of them depend on this PR. We 
can establish a baseline in this PR and then iterate on the different topics 
in parallel.
   Also, we can ask the community for help testing on very large production 
datasets. This should be quite easy for anyone already using MOR tables.



