[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16940892#comment-16940892
 ] 

Vinoth Chandar commented on HUDI-269:
-------------------------------------

[~XingXPan] You shouldn't have to choose based on this.

> if I don't have throttle, delta streamer will send too many requests even with no 
> input coming) 
Throttling is a useful feature. On each run, the delta streamer lists the 
target directory to obtain the previous checkpoint. I think we had always been 
testing with large-volume input streams :) 
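
To make the throttling idea concrete, here is a minimal sketch of a continuous loop with a minimum sync interval. It is not the DeltaStreamer code; `run_continuous`, `sync_once` and `min_sync_interval_s` are hypothetical names used only for illustration. With an interval of 0 the loop never sleeps, which matches the default behaviour described in the issue.

```python
import time


def run_continuous(sync_once, min_sync_interval_s, rounds,
                   clock=time.monotonic, sleep=time.sleep):
    """Call sync_once() `rounds` times, starting rounds at least
    min_sync_interval_s seconds apart. Returns the list of sleep durations.

    All names here are hypothetical; this only illustrates the throttle.
    """
    slept = []
    for _ in range(rounds):
        start = clock()
        sync_once()  # one sync round (checkpoint lookup, read, write)
        elapsed = clock() - start
        remaining = min_sync_interval_s - elapsed
        if remaining > 0:
            # The round finished early, so wait out the rest of the interval.
            sleep(remaining)
            slept.append(remaining)
    return slept
```

Since every round lists the target directory at least once, a slow-changing input with no throttle pays for that listing as often as the loop can spin; the interval puts an upper bound on how many rounds run per unit of time.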


>But if I have kafka input streaming like 10 records per second, I found that 
>even if I set the 5 seconds throttle, writing hudi with delta streamer will 
>cause 10 times more requests than doing it the data source writer way.

We will fix the delta streamer and bring it in line with the data source writer. 
Conceptually, I can't think of anything here that would cause 10x more calls. Are 
you running the datasource every 5 seconds as well? Also, is there a way for 
you to tell us where these additional requests are going at the S3 level, i.e. 
which files/objects are accessed in each case? Then one of us can try 
reproducing and fixing this. Once we identify the cause, it should be a simple 
fix, if one is needed. 
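
One possible way to produce that comparison, sketched under assumptions: if the requests from each run can be exported as (operation, object key) pairs, for example from S3 access logs, a simple tally shows which objects account for the difference. The function name `request_diff` and the input format are hypothetical; how the pairs are extracted depends on your logging setup.

```python
from collections import Counter


def request_diff(delta_streamer_requests, datasource_requests):
    """Return {(operation, key): extra_count} for requests that the
    delta streamer run issued more often than the datasource run.

    Both inputs are iterables of (operation, key) tuples; this input
    format is an assumption made for illustration.
    """
    streamer_counts = Counter(delta_streamer_requests)
    datasource_counts = Counter(datasource_requests)
    # Counter subtraction keeps only entries with a positive difference.
    return dict(streamer_counts - datasource_counts)
```

For example, with made-up sample keys:

```python
streamer = [("LIST", "table/.hoodie/")] * 3 + [("PUT", "table/part/file1")]
datasource = [("LIST", "table/.hoodie/"), ("PUT", "table/part/file1")]
extra = request_diff(streamer, datasource)
# extra == {("LIST", "table/.hoodie/"): 2}
```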

Thanks for working through this with us. 

> Provide ability to throttle DeltaStreamer sync runs
> ---------------------------------------------------
>
>                 Key: HUDI-269
>                 URL: https://issues.apache.org/jira/browse/HUDI-269
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: deltastreamer
>            Reporter: Balaji Varadarajan
>            Assignee: Xing Pan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.0
>
>         Attachments: hudi_request_test.tar.gz, 
> image-2019-09-25-08-51-19-686.png, image-2019-09-26-09-02-24-761.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenarios in our cluster, we may want the delta streamer to slow down a 
> bit.
> So it would be nice to have a parameter to control the minimum interval between 
> syncs in continuous mode.
> This param defaults to 0, so it should not affect current logic.
> minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921]
> The main reason we want to slow it down is that AWS S3 charges per 
> get/put/list request. We don't want to pay for too many requests for a 
> really slowly changing table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
