[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937330#comment-16937330
 ] 

BALAJI VARADARAJAN commented on HUDI-269:
-----------------------------------------

[~XingXPan] :

Thank you for sharing the S3 metrics.

Can you confirm that all of these requests are for writing to one table and 
that no other writes happened on that bucket? The incremental timeline sync 
should show benefits if you are running for several iterations. I am wondering 
whether this test was done for only one iteration.

Regarding the embedded timeline-server only mode, you should see reductions 
approximately on the order of (Number of files updated)/(Number of partitions 
touched).
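
As a rough illustration of that ratio, here is a minimal Python sketch. The 
function name and the numbers are hypothetical and are not taken from the 
shared S3 metrics; the result is only an order-of-magnitude estimate.

```python
def estimated_reduction_factor(files_updated: int, partitions_touched: int) -> float:
    """Rough factor by which requests might drop, using the ratio
    (number of files updated) / (number of partitions touched)."""
    if files_updated < 0:
        raise ValueError("files_updated must not be negative")
    if partitions_touched <= 0:
        raise ValueError("partitions_touched must be positive")
    return files_updated / partitions_touched


# Hypothetical example: 200 files updated across 10 partitions.
print(estimated_reduction_factor(200, 10))  # 20.0
```

With many updated files per partition the expected saving is large; with about 
one updated file per partition the ratio approaches 1 and little change would 
be expected.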

How many partitions does the dataset have?

If the number of partitions is large, cleaner operations could have produced 
more directory listing calls when trying to find all partitions. Just to test 
this hypothesis, can you try disabling the cleaner by setting 
hoodie.clean.automatic=false?
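
For example, assuming the job reads its Hudi configs from a properties file 
(such as the one passed to DeltaStreamer via --props), the override would be a 
single line. Where exactly it goes depends on your setup, so treat this as a 
sketch:

```properties
# Temporarily disable automatic cleaning, for this test only
hoodie.clean.automatic=false
```

Remember to re-enable the cleaner after the test, since older file versions 
will otherwise accumulate.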

Thanks,

Balaji.V

> Provide ability to throttle DeltaStreamer sync runs
> ---------------------------------------------------
>
>                 Key: HUDI-269
>                 URL: https://issues.apache.org/jira/browse/HUDI-269
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: deltastreamer
>            Reporter: BALAJI VARADARAJAN
>            Assignee: Xing Pan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.0
>
>         Attachments: image-2019-09-25-08-51-19-686.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenarios in our cluster, we may want DeltaStreamer to slow down a 
> bit, so it would be nice to have a parameter to control the minimum interval 
> between syncs in continuous mode.
> This parameter defaults to 0, so it should not affect the current logic.
> Minor PR: [#921|https://github.com/apache/incubator-hudi/pull/921]
> The main reason we want to slow it down is that AWS S3 charges per 
> GET/PUT/LIST request. We don't want to pay for too many requests for a 
> table that changes really slowly.
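
The minimum sync interval described in the quoted issue can be sketched as 
follows. This is a hypothetical Python illustration of the idea, not the code 
from the pull request; every name here (run_continuous, 
min_sync_interval_seconds, max_rounds) is an assumption made for the example.

```python
import time
from typing import Callable


def remaining_sleep_seconds(min_sync_interval_seconds: float,
                            elapsed_seconds: float) -> float:
    """Time left to wait so that one round lasts at least the minimum interval."""
    return max(0.0, min_sync_interval_seconds - elapsed_seconds)


def run_continuous(sync_once: Callable[[], None],
                   min_sync_interval_seconds: float = 0.0,
                   max_rounds: int = 3,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> None:
    """Run sync_once repeatedly, pausing so rounds start no more often than
    the minimum interval. A value of 0 (the default) never sleeps."""
    for _ in range(max_rounds):
        start = clock()
        sync_once()
        sleep_for = remaining_sleep_seconds(min_sync_interval_seconds,
                                            clock() - start)
        if sleep_for > 0:
            sleep(sleep_for)


if __name__ == "__main__":
    # Two quick rounds, each padded out to at least 0.1 seconds.
    run_continuous(lambda: print("sync"), min_sync_interval_seconds=0.1,
                   max_rounds=2)
```

Because only the remainder of the interval is slept, a sync that already takes 
longer than the minimum interval is not delayed further.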



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
