[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939797#comment-16939797 ]
Xing Pan commented on HUDI-269:
-------------------------------

[~vinoth] I also thought the cause was that delta streamer syncs too aggressively, so I added the throttle param to control this, and it helped.

The issue now: setting the delta streamer *throttle to 5 seconds* does significantly reduce the request count.
* When there is *nothing* coming from Kafka, the request metrics look acceptable for both the data source writer and delta streamer.
* But with Kafka input streaming at around 10 records per second, even with the 5-second throttle, writing to Hudi with delta streamer causes 10 times more requests than the data source writer.

So I would choose the data source writer anyway, since it always keeps the request count lower.

Now I am wondering what the pros and cons are of choosing between the Spark data source and delta streamer. As far as I can see, in my scenario, if I use delta streamer:
* Delta streamer can help ingest data from Kafka.
* It has a self-managed checkpoint.
* I can set the compaction job weight.

And if I use the Spark data source writer:
* I have more control over my code, and I can have my own Kafka ingestion implementation.
* I will save money :) (it still costs less, even now that delta streamer has a throttle control)

So for now, if I can't bring the two request metrics to the same level, I'd use the data source writer. Any more suggestions on choosing delta streamer?
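
To make the throttle concrete, here is a minimal sketch in Python (not the actual DeltaStreamer code) of a minimum sync interval in continuous mode. The names `run_continuous`, `sync_once` and `min_sync_interval_s` are hypothetical, and the sketch assumes the interval is measured from the start of each sync round and that 0 means no throttling.

```python
import time


def run_continuous(sync_once, min_sync_interval_s=0.0, max_rounds=None,
                   clock=time.monotonic, sleep=time.sleep):
    """Run sync_once in a loop, starting rounds at least min_sync_interval_s apart.

    Hypothetical illustration only: sync_once stands in for one ingest/sync
    round. max_rounds, clock and sleep exist so the loop can be bounded and
    tested; a real continuous mode would loop until shut down.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        started = clock()
        sync_once()
        rounds += 1
        # Sleep only for the part of the interval the sync itself did not use.
        remaining = min_sync_interval_s - (clock() - started)
        if remaining > 0:
            sleep(remaining)
    return rounds
```

With a 5-second interval and a sync round that takes 1 second, this loop sleeps about 4 seconds per round. It only caps how often rounds start; it does not change how many requests a single round issues.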

> Provide ability to throttle DeltaStreamer sync runs
> ---------------------------------------------------
>
>                 Key: HUDI-269
>                 URL: https://issues.apache.org/jira/browse/HUDI-269
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: deltastreamer
>            Reporter: Balaji Varadarajan
>            Assignee: Xing Pan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.0
>
>         Attachments: hudi_request_test.tar.gz, image-2019-09-25-08-51-19-686.png, image-2019-09-26-09-02-24-761.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenario in our cluster, we may want delta streamer to slow down a bit.
> so it's nice to have a parameter to control the min sync interval of each sync in continuous mode.
> this param is default to 0, so this should not affect current logic.
> minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921]
> the main reason we want to slow it down is that aws s3 is charged by s3 get/put/list requests. we don't want to pay for too many requests for a really slow change table.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)