nsivabalan commented on issue #3751: URL: https://github.com/apache/hudi/issues/3751#issuecomment-939305990
@MikeBuh : I understand you have a lot of small files in S3, but what is the size of the writes you do with Hudi? Is it similar? If the writes are going to be of a similar nature, I would recommend trying out MOR (Merge-On-Read). Setting that aside for now, let's walk through your use case. Is it an ongoing pipeline, or more of a one-time load after which you are done? If it is going to be ongoing, I would recommend [Deltastreamer](https://hudi.apache.org/docs/writing_data#deltastreamer), a tool that Hudi ships with. You can connect a source to Deltastreamer (parquet DFS in this case) and add a transformer (with which you can do filtering, transformations, populate keys, etc.). Deltastreamer will take care of fetching data from the source and ingesting it into Hudi, either at a regular cadence (schedule a single run for a one-time sync) or in continuous mode. Many users in the community rely on it and have found it very effective; mentioning it just in case you were not aware of it.

May I know the size and number of records in each batch of writes to Hudi?
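For reference, a minimal Deltastreamer invocation along those lines might look like the sketch below. The bucket paths, table name, record key, partition field, and ordering field are all placeholders you would replace with your own values, and the SQL filter in the transformer is just an illustration:

```sh
# Hypothetical sketch: ingest parquet files from S3 into a Hudi MOR table
# via Deltastreamer, with a SQL-based transformer for filtering.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --source-ordering-field ts \
  --target-base-path s3://your-bucket/hudi/target_table \
  --target-table target_table \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://your-bucket/raw/parquet \
  --hoodie-conf "hoodie.deltastreamer.transformer.sql=SELECT * FROM <SRC> WHERE some_col IS NOT NULL" \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=dt \
  --continuous   # drop this flag to instead run once per scheduled invocation
```

In `SqlQueryBasedTransformer`, `<SRC>` is the token Deltastreamer substitutes with the incoming source dataset, so the filtering/key-population logic lives entirely in that SQL.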
