bvaradar commented on issue #2252: URL: https://github.com/apache/hudi/issues/2252#issuecomment-728221544
Since you are seeing delete requests, can you check whether failures or cleaning are kicking in and inflating the number of S3 requests? Can you list your `.hoodie` folder to see if you have `.rollback` or `.clean` files? To compare apples to apples, you would have to discount those requests, as they come from additional functionality that Hudi provides on top of a plain parquet dataset.

Regarding the insert vs. bulk-insert PUT and HEAD requests, also check how many files got created in the Hudi dataset vs. the parquet dataset. You may want to tune parallelism and configure file sizing: https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoItoavoidcreatingtonsofsmallfiles

Hudi uses an optimistic approach to failure handling: it avoids writing to a tmp folder and re-copying, which performs badly on S3. To support this, it keeps additional marker files, which are tracked and deleted as part of the commit process.
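To check whether rollbacks or cleans are present, you can group the timeline files in `.hoodie` by their action suffix. This is a hypothetical helper sketch, assuming the table's `.hoodie` folder is accessible as a local or mounted path (for S3 you would list the prefix with the AWS CLI or SDK instead); the function name and path are illustrative:

```python
import os
from collections import Counter

def timeline_action_counts(hoodie_dir):
    """Group .hoodie timeline files by action suffix (commit, clean, rollback, ...)."""
    counts = Counter()
    for name in os.listdir(hoodie_dir):
        # Timeline files look like "<instant_time>.<action>",
        # e.g. "20201116103000.rollback" or "20201116103000.clean"
        ext = os.path.splitext(name)[1].lstrip(".")
        if ext:
            counts[ext] += 1
    return counts
```

A non-zero count for `rollback` or `clean` would indicate those extra requests are coming from failure handling or cleaning rather than from the write path itself.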
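For the file-sizing and parallelism tuning mentioned above, a sketch of the relevant write options might look like the following. The table name and the numeric values are illustrative assumptions, not recommendations; the right values depend on your data volume and cluster:

```python
# Illustrative Hudi write options for controlling parallelism and file sizing.
# The numbers below are example values only; tune them for your workload.
hudi_write_options = {
    "hoodie.table.name": "my_table",                 # example table name
    "hoodie.insert.shuffle.parallelism": "200",      # insert write parallelism
    "hoodie.bulkinsert.shuffle.parallelism": "200",  # bulk-insert write parallelism
    # Target max size per parquet file (~120 MB here)
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    # Files below this size are treated as "small" and appended to on later inserts
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
}

# Typical usage with a Spark DataFrame writer (sketch):
# df.write.format("hudi").options(**hudi_write_options).mode("append").save(base_path)
```

Lower parallelism and larger file-size targets generally mean fewer output files, and therefore fewer PUT/HEAD requests against S3.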
