nsivabalan commented on issue #3751:
URL: https://github.com/apache/hudi/issues/3751#issuecomment-939305990


   @MikeBuh : I understand you have a lot of small files in S3, but what is the size of the writes you do with Hudi? Is it similar? If the writes are going to be of a similarly small nature, I would recommend trying out MOR (Merge On Read).
   
   Anyways, let's walk through your use-case, keeping that aside for now.
   Is it an ongoing pipeline, or is it more of a one-time load and you are done with it? If it is going to be ongoing, I would recommend using a tool called [Deltastreamer](https://hudi.apache.org/docs/writing_data#deltastreamer) that Hudi provides. You can connect a source to Deltastreamer (parquet DFS in this case) and add a transformer (with which you can do filtering, transformation, key population, etc). Deltastreamer takes care of fetching data from the source and ingesting it into Hudi, either at a regular cadence (schedule it at a regular cadence for a one-time sync) or in continuous mode. We know many users in the community use this and it has been very effective. Just in case you were not aware of it.
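   To make the suggestion concrete, a Deltastreamer launch for this kind of parquet-to-Hudi pipeline might look roughly like the sketch below. This is only an illustration: the jar path, S3 paths, table name, ordering field, and properties file are all placeholders you would substitute for your setup, while the source and transformer class names come from Hudi's utilities bundle.

   ```shell
   # Hypothetical sketch: ingest parquet files into a Hudi MOR table with
   # Deltastreamer. All paths, the table name, and the ordering field are
   # placeholders for illustration only.
   spark-submit \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     hudi-utilities-bundle.jar \
     --table-type MERGE_ON_READ \
     --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
     --source-ordering-field updated_at \
     --target-base-path s3://my-bucket/hudi/my_table \
     --target-table my_table \
     --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
     --props s3://my-bucket/configs/my_table.properties \
     --continuous   # drop this flag to run as a scheduled one-shot job instead
   ```

   In continuous mode the job stays up and keeps polling the source; without `--continuous`, you would schedule the same command at whatever cadence suits your load.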
   
   May I know the size and number of records in each batch of writes to Hudi?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
