[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939797#comment-16939797 ] Xing Pan edited comment on HUDI-269 at 9/28/19 12:23 AM:

[~vinoth] I also thought it was because the delta streamer is too aggressive, so I added the throttle param to control this, and it helped. The remaining issue is: I did significantly reduce the request count by setting the delta streamer *throttle to 5 seconds*.
* When there is *nothing* coming from Kafka, the request metrics look acceptable both for the data source writer and for the delta streamer. (Without the throttle, the delta streamer sends far too many requests even with no input coming in.)
* But with a Kafka input stream of roughly 10 records per second, I found that even with the 5-second throttle, writing to Hudi with the delta streamer causes about 10 times more requests than doing it with the data source writer.

So I would choose the data source writer anyway, since that way I always save on request count. Now I am wondering about the pros and cons of choosing between the Spark datasource and the delta streamer. As far as I can see, in my scenario, if I use the delta streamer:
* The delta streamer handles ingesting data from Kafka.
* It has a self-managed checkpoint.
* I can set the compaction job weight.

And if I use the Spark data source writer:
* I have more control over my code; I can have my own Kafka ingestion implementation.
* I will save money :) (costs are still lower even now that the delta streamer has throttle control).

So currently, if I can't bring the two request metrics to the same level, I'd use the data source writer. Any more suggestions on choosing the delta streamer?
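For readers comparing the two approaches, the "data source writer way" referred to above is roughly the following. This is only a minimal sketch assuming Hudi 0.5.x datasource option names; the input path, table name, and field names (`key`, `ts`, `partition_path`) are made-up placeholders, not values taken from this issue.
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class DatasourceWriterSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-datasource-writer").getOrCreate();

    // Placeholder: in the real job this would be the micro-batch read from Kafka.
    Dataset<Row> batch = spark.read().json("s3://xxx/input/batch.json");

    batch.write()
        .format("org.apache.hudi")                                    // Hudi datasource
        .option("hoodie.table.name", "my_table")                      // assumed table name
        .option("hoodie.datasource.write.recordkey.field", "key")     // assumed record key field
        .option("hoodie.datasource.write.precombine.field", "ts")     // assumed pre-combine field
        .option("hoodie.datasource.write.partitionpath.field", "partition_path")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode(SaveMode.Append)
        .save("s3://xxx/output");                                     // placeholder base path
  }
}
{code}
With this approach the application controls when each write is issued, which is why the request count stays proportional to how often the job actually writes.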
> Provide ability to throttle DeltaStreamer sync runs
> ----------------------------------------------------
>
>                 Key: HUDI-269
>                 URL: https://issues.apache.org/jira/browse/HUDI-269
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: deltastreamer
>            Reporter: Balaji Varadarajan
>            Assignee: Xing Pan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.0
>
>         Attachments: hudi_request_test.tar.gz, image-2019-09-25-08-51-19-686.png, image-2019-09-26-09-02-24-761.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenarios in our cluster, we may want the delta streamer to slow down a bit,
> so it is nice to have a parameter to control the minimum sync interval of each sync in continuous mode.
> This param defaults to 0, so it does not affect the current logic.
> Minor PR: [#921|https://github.com/apache/incubator-hudi/pull/921]
> The main reason we want to slow it down is that AWS S3 charges for S3 GET/PUT/LIST requests; we don't want to pay for too many requests for a really slowly changing table.
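To make the proposal above concrete, the following is a rough sketch of the kind of throttling logic being described: after each sync round in continuous mode, sleep until at least the configured minimum interval has elapsed. This is an illustration of the idea, not the actual HoodieDeltaStreamer code; the names `minSyncIntervalSeconds` and `syncOnce()` are placeholders.
{code:java}
import java.util.concurrent.TimeUnit;

public class ThrottledSyncLoop {

  // Placeholder for the proposed config; 0 keeps the current behaviour (no throttling).
  private final long minSyncIntervalSeconds;

  public ThrottledSyncLoop(long minSyncIntervalSeconds) {
    this.minSyncIntervalSeconds = minSyncIntervalSeconds;
  }

  public void runContinuously() throws InterruptedException {
    while (true) {
      long start = System.currentTimeMillis();
      syncOnce();                                   // one ingest round (placeholder)
      long elapsedMs = System.currentTimeMillis() - start;
      long minIntervalMs = TimeUnit.SECONDS.toMillis(minSyncIntervalSeconds);
      if (elapsedMs < minIntervalMs) {
        // The round finished early; wait out the remainder so rounds are at least
        // minSyncIntervalSeconds apart, which caps the rate of S3 list/get calls.
        Thread.sleep(minIntervalMs - elapsedMs);
      }
    }
  }

  private void syncOnce() {
    // Placeholder: read from the source (e.g. Kafka), write to the Hudi dataset, commit.
  }
}
{code}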
[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937358#comment-16937358 ] Xing Pan edited comment on HUDI-269 at 9/25/19 2:56 PM:

[~vbalaji] Yes, those strange 5K requests are mainly HEAD requests, and they cause a lot of S3 4xx errors, which are classified as "client side errors". I only have one partition, "1100/01/01"; attached please find *hudi_request_test.tar.gz*.
{code:java}
aws s3 ls s3://xxx/output/1100/01/01/
2019-09-25 01:56:57 93 .hoodie_partition_metadata
2019-09-25 02:12:18 535993 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-111-99_20190925021213.parquet
2019-09-25 02:50:30 679546 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-126-108_20190925025025.parquet
2019-09-25 02:32:27 597943 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-135-117_20190925023222.parquet
2019-09-25 02:38:03 623372 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-135-117_20190925023758.parquet
2019-09-25 02:12:48 537971 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-149-130_20190925021243.parquet
2019-09-25 02:50:39 680323 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-161-136_20190925025033.parquet
2019-09-25 02:32:57 599788 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-173-148_20190925023252.parquet
2019-09-25 02:38:33 625295 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-173-148_20190925023828.parquet
2019-09-25 02:13:18 540308 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-187-161_20190925021313.parquet
2019-09-25 02:50:47 681076 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-196-164_20190925025042.parquet
2019-09-25 02:31:07 591207 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-21-24_20190925023057.parquet
2019-09-25 02:36:48 615894 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-21-24_20190925023637.parquet
2019-09-25 02:50:01 675036 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-21-24_20190925024946.parquet
2019-09-25 02:33:27 602011 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-211-179_20190925023322.parquet
2019-09-25 02:39:03 627524 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-211-179_20190925023858.parquet
2019-09-25 02:13:48 542690 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-225-192_20190925021343.parquet
2019-09-25 02:50:55 681495 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-231-192_20190925025049.parquet
2019-09-25 02:33:57 604273 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-249-210_20190925023352.parquet
2019-09-25 02:39:33 629743 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-249-210_20190925023928.parquet
2019-09-25 02:14:18 545021 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-263-223_20190925021413.parquet
2019-09-25 02:51:03 682267 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-266-220_20190925025058.parquet
2019-09-25 02:34:27 606495 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-287-241_20190925023422.parquet
2019-09-25 02:40:03 632018 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-287-241_20190925023958.parquet
2019-09-25 02:51:11 682667 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-301-248_20190925025106.parquet
2019-09-25 02:14:48 547294 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-301-254_20190925021443.parquet
2019-09-25 02:34:57 608770 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-325-272_20190925023452.parquet
2019-09-25 02:40:33 634280 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-325-272_20190925024028.parquet
2019-09-25 02:51:18 683418 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-336-276_20190925025113.parquet
2019-09-25 02:15:18 549588 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-339-285_20190925021513.parquet
2019-09-25 01:56:59 533148 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-35-37_20190925015651.parquet
2019-09-25 02:35:27 610998 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-363-303_20190925023522.parquet
2019-09-25 02:41:04 636524 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-363-303_20190925024058.parquet
2019-09-25 02:51:26 683833 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-371-304_20190925025121.parquet
2019-09-25 02:15:48 551902 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-377-316_20190925021543.parquet
2019-09-25 02:35:57 613259 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-401-334_20190925023552.parquet
2019-09-25 02:41:33 638757 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-401-334_20190925024128.parquet
2019-09-25 02:51:34 684572 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-406-332_20190925025130.parquet
2019-09-25 02:16:18 553820 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-415-347_20190925021613.parquet
2019-09-25 02:42:03 641007 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-439-365_20190925024158.parquet
2019-09-25 02:51:42 684965 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-441-360_20190925025137.parquet
2019-09-25 02:16:48 556070 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-453-378_20190925021643.parquet
2019-09-25 02:51:49 685729 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-476-388_20190925025144.parquet
2019-09-25
[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937360#comment-16937360 ] BALAJI VARADARAJAN edited comment on HUDI-269 at 9/25/19 3:12 AM:

OK, it looks like there is only one file-group (though there could be multiple versions of it) and one partition in the whole dataset, so it's understandable why you did not see any benefit from the timeline server. With a larger dataset, you should observe savings.
[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937302#comment-16937302 ] Xing Pan edited comment on HUDI-269 at 9/25/19 1:05 AM:

[~vbalaji] [~vinoth] I just ran some simple tests with these configs. Basically, in my scenario we get CDC data for a slowly changing table and sync the CDC stream from Kafka into a Hudi dataset. As plotted in the graph above, throttling the DeltaStreamer sync runs significantly decreases the GET requests per minute, but *embed timeline server* and *incr timeline sync* did not reduce the request count by much.
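For reference, the two settings mentioned above are Hudi write/view configs that can be supplied to the DeltaStreamer through its properties. The snippet below is a hedged sketch: the key names are my best guess at the 0.5.x names for "embed timeline server" and "incr timeline sync" and should be verified against the release in use.
{code:java}
import java.util.Properties;

public class TimelineServerConfigSketch {
  public static void main(String[] args) {
    // Assumed key names for the configs discussed above; verify against your Hudi version.
    Properties props = new Properties();
    props.setProperty("hoodie.embed.timeline.server", "true");
    props.setProperty("hoodie.filesystem.view.incr.timeline.sync.enable", "true");

    // These would then be passed to the DeltaStreamer, e.g. via its properties file
    // or as individual key=value overrides on the command line.
    props.forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
{code}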
[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936542#comment-16936542 ] Xing Pan edited comment on HUDI-269 at 9/24/19 8:27 AM:

Just updated the pull request.