[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-27 Thread Xing Pan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939797#comment-16939797
 ] 

Xing Pan edited comment on HUDI-269 at 9/28/19 12:23 AM:
-

[~vinoth]
 I also thought it was because the delta streamer is too aggressive, so I added the throttle param to control this, and it helped.
 Now the issue is:
 I did significantly reduce the request count by setting the delta streamer *throttle to 5 seconds*.
 * When there is *nothing* coming from Kafka, the request metrics look acceptable for both the data source writer and the delta streamer. (Without the throttle, the delta streamer sends too many requests even when there is no input coming in.)
 * But when Kafka input comes in at around 10 records per second, I found that even with the 5 second throttle, writing to Hudi with the delta streamer causes roughly 10 times more requests than doing the same with the data source writer.
 So I would choose the data source writer anyway, since that way I always save on request count (the throttle setup I used is sketched below).
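To make that concrete, the throttled run looks roughly like the sketch below. It is only a sketch: the config field names (continuousMode, minSyncIntervalSeconds from PR #921), the JsonKafkaSource class, and the 0.5.0 package layout are written from memory, and the paths, table name and props file are placeholders, so please double check them against your version.
{code:scala}
// Minimal sketch of a throttled DeltaStreamer run (placeholder paths/names).
// Field names follow the 0.5.0 HoodieDeltaStreamer.Config as I remember them;
// a real run also needs the Kafka/source/schema settings in the props file.
import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SparkSession

object ThrottledDeltaStreamerJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hudi-delta-streamer-throttled").getOrCreate()
    val jssc = new JavaSparkContext(spark.sparkContext)

    val cfg = new HoodieDeltaStreamer.Config()
    cfg.targetBasePath = "s3://xxx/output"                                    // placeholder base path
    cfg.targetTableName = "cdc_table"                                         // placeholder table name
    cfg.storageType = "COPY_ON_WRITE"
    cfg.sourceClassName = "org.apache.hudi.utilities.sources.JsonKafkaSource" // read from Kafka
    cfg.propsFilePath = "s3://xxx/delta-streamer.properties"                  // brokers, topic, schema, ...
    cfg.continuousMode = true                                                 // keep syncing in a loop
    cfg.minSyncIntervalSeconds = 5                                            // the 5 second throttle (PR #921)

    new HoodieDeltaStreamer(cfg, jssc).sync()
  }
}
{code}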

So now I am wondering what the pros and cons are of choosing between the Spark data source and the delta streamer.
 As far as I can see, in my scenario, if I use the delta streamer:
 * The delta streamer handles ingesting data from Kafka for me.
 * It has a self-managed checkpoint.
 * I can set the compaction job weight.

And if I use the Spark data source writer:
 * I have more control over my code; I can have my own Kafka ingestion implementation.
 * I will save money :) (it still costs less, even now that the delta streamer has a throttle control)

 

So, as long as I can't bring the two request metrics to the same level, I'd use the data source writer; a rough sketch of that path follows below.
 Any more suggestions on choosing the delta streamer?
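For reference, the data source writer path I am comparing against is basically a small batch job like the sketch below. It is only a sketch: the topic, brokers, record key / partition / ordering fields, and the output path are placeholders, Kafka offset checkpointing is left out, and it assumes the spark-sql-kafka package is on the classpath.
{code:scala}
// Rough sketch of the "spark data source writer" path: read a micro-batch of CDC
// records from Kafka and upsert them into the Hudi dataset. All names/paths are
// placeholders; offset tracking between runs is omitted for brevity.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object DataSourceWriterJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hudi-datasource-writer").getOrCreate()

    // Batch-read a slice of the CDC topic (offsets would normally come from our own checkpoint).
    val raw = spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder brokers
      .option("subscribe", "cdc_topic")                 // placeholder topic
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()

    // Assume the payload is JSON with id / ts / partition_date fields.
    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("ts", LongType),
      StructField("partition_date", StringType)))
    val records = raw
      .select(from_json(col("value").cast("string"), schema).as("r"))
      .select("r.*")

    records.write.format("org.apache.hudi")
      .option("hoodie.table.name", "cdc_table")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.partitionpath.field", "partition_date")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .mode(SaveMode.Append)
      .save("s3://xxx/output") // placeholder base path
  }
}
{code}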



> Provide ability to throttle DeltaStreamer sync runs
> ---
>
> Key: HUDI-269
> URL: https://issues.apache.org/jira/browse/HUDI-269
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Balaji Varadarajan
>Assignee: Xing Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
> Attachments: hudi_request_test.tar.gz, 
> image-2019-09-25-08-51-19-686.png, image-2019-09-26-09-02-24-761.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenario in our cluster, we may want delta streamer to slow down a 
> bit.
> so it's nice to have a parameter to control the min sync interval of each 
> sync in continuous mode.
> this param is default to 0, so this should not affect current logic.
> minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921]
> the main reason we want to slow it down is that aws s3 is charged by s3 
> get/put/list requests. we don't want to pay for too many requests for a 
> really slow change table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-25 Thread Xing Pan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937358#comment-16937358
 ] 

Xing Pan edited comment on HUDI-269 at 9/25/19 2:56 PM:


 

[~vbalaji]

Yes, these strange 5K requests are mainly HEAD requests, and they cause a lot of S3 4xx errors, which are classified as "client side errors".

I only have one partition, "1100/01/01"; attached please find *hudi_request_test.tar.gz*.
{code:java}
aws s3 ls s3://xxx/output/1100/01/01/
2019-09-25 01:56:57 93 .hoodie_partition_metadata
2019-09-25 02:12:18 535993 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-111-99_20190925021213.parquet
2019-09-25 02:50:30 679546 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-126-108_20190925025025.parquet
2019-09-25 02:32:27 597943 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-135-117_20190925023222.parquet
2019-09-25 02:38:03 623372 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-135-117_20190925023758.parquet
2019-09-25 02:12:48 537971 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-149-130_20190925021243.parquet
2019-09-25 02:50:39 680323 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-161-136_20190925025033.parquet
2019-09-25 02:32:57 599788 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-173-148_20190925023252.parquet
2019-09-25 02:38:33 625295 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-173-148_20190925023828.parquet
2019-09-25 02:13:18 540308 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-187-161_20190925021313.parquet
2019-09-25 02:50:47 681076 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-196-164_20190925025042.parquet
2019-09-25 02:31:07 591207 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-21-24_20190925023057.parquet
2019-09-25 02:36:48 615894 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-21-24_20190925023637.parquet
2019-09-25 02:50:01 675036 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-21-24_20190925024946.parquet
2019-09-25 02:33:27 602011 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-211-179_20190925023322.parquet
2019-09-25 02:39:03 627524 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-211-179_20190925023858.parquet
2019-09-25 02:13:48 542690 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-225-192_20190925021343.parquet
2019-09-25 02:50:55 681495 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-231-192_20190925025049.parquet
2019-09-25 02:33:57 604273 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-249-210_20190925023352.parquet
2019-09-25 02:39:33 629743 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-249-210_20190925023928.parquet
2019-09-25 02:14:18 545021 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-263-223_20190925021413.parquet
2019-09-25 02:51:03 682267 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-266-220_20190925025058.parquet
2019-09-25 02:34:27 606495 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-287-241_20190925023422.parquet
2019-09-25 02:40:03 632018 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-287-241_20190925023958.parquet
2019-09-25 02:51:11 682667 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-301-248_20190925025106.parquet
2019-09-25 02:14:48 547294 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-301-254_20190925021443.parquet
2019-09-25 02:34:57 608770 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-325-272_20190925023452.parquet
2019-09-25 02:40:33 634280 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-325-272_20190925024028.parquet
2019-09-25 02:51:18 683418 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-336-276_20190925025113.parquet
2019-09-25 02:15:18 549588 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-339-285_20190925021513.parquet
2019-09-25 01:56:59 533148 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-35-37_20190925015651.parquet
2019-09-25 02:35:27 610998 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-363-303_20190925023522.parquet
2019-09-25 02:41:04 636524 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-363-303_20190925024058.parquet
2019-09-25 02:51:26 683833 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-371-304_20190925025121.parquet
2019-09-25 02:15:48 551902 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-377-316_20190925021543.parquet
2019-09-25 02:35:57 613259 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-401-334_20190925023552.parquet
2019-09-25 02:41:33 638757 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-401-334_20190925024128.parquet
2019-09-25 02:51:34 684572 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-406-332_20190925025130.parquet
2019-09-25 02:16:18 553820 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-415-347_20190925021613.parquet
2019-09-25 02:42:03 641007 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-439-365_20190925024158.parquet
2019-09-25 02:51:42 684965 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-441-360_20190925025137.parquet
2019-09-25 02:16:48 556070 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-453-378_20190925021643.parquet
2019-09-25 02:51:49 685729 
68d656cc-65a5-47f7-bf28-961315e718bc-0_0-476-388_20190925025144.parquet
2019-09-25 

[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-24 Thread BALAJI VARADARAJAN (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937360#comment-16937360
 ] 

BALAJI VARADARAJAN edited comment on HUDI-269 at 9/25/19 3:12 AM:
--

OK, it looks like there is only one file group (though there could be multiple versions of it) and one partition in the whole dataset, so it's understandable why you did not see any benefit from the timeline server. With a larger dataset, you should observe savings.



> Provide ability to throttle DeltaStreamer sync runs
> ---
>
> Key: HUDI-269
> URL: https://issues.apache.org/jira/browse/HUDI-269
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: BALAJI VARADARAJAN
>Assignee: Xing Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
> Attachments: hudi_request_test.tar.gz, 
> image-2019-09-25-08-51-19-686.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenario in our cluster, we may want delta streamer to slow down a 
> bit.
> so it's nice to have a parameter to control the min sync interval of each 
> sync in continuous mode.
> this param is default to 0, so this should not affect current logic.
> minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921]
> the main reason we want to slow it down is that aws s3 is charged by s3 
> get/put/list requests. we don't want to pay for too many requests for a 
> really slow change table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-24 Thread Xing Pan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937302#comment-16937302
 ] 

Xing Pan edited comment on HUDI-269 at 9/25/19 1:05 AM:


[~vbalaji] [~vinoth] 
 I just did some simple tests of these configs.
 Basically, in my scenario we receive CDC data from a slowly changing table and sync it from Kafka into a Hudi dataset.
 As plotted in the graph above, throttling the DeltaStreamer sync runs significantly decreases GET requests per minute.

But *embed timeline server* and *incr timeline sync* didn't help reduce the request count much; the way I toggled them is sketched below.
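For completeness, this is roughly how those two settings were toggled in the test, only a sketch: the property keys (hoodie.embed.timeline.server and hoodie.filesystem.view.incr.timeline.sync.enable) and the field names are written from memory for the 0.5.x configs, so please verify them against your Hudi version. The same keys can also be given to the delta streamer through its props file or --hoodie-conf.
{code:scala}
// Sketch only: toggle "embed timeline server" / "incr timeline sync" on a Hudi
// datasource write. Property keys are from memory (0.5.x write configs) and the
// record key / ordering / partition fields are placeholders.
import org.apache.spark.sql.{DataFrame, SaveMode}

object TimelineServerConfigSketch {
  def upsertWithTimelineConfigs(batch: DataFrame, basePath: String): Unit = {
    batch.write.format("org.apache.hudi")
      .option("hoodie.table.name", "cdc_table")                                // placeholder table name
      .option("hoodie.datasource.write.recordkey.field", "id")                 // placeholder key field
      .option("hoodie.datasource.write.precombine.field", "ts")                // placeholder ordering field
      .option("hoodie.datasource.write.partitionpath.field", "partition_date") // placeholder partition field
      .option("hoodie.embed.timeline.server", "true")                          // "embed timeline server"
      .option("hoodie.filesystem.view.incr.timeline.sync.enable", "true")      // "incr timeline sync"
      .mode(SaveMode.Append)
      .save(basePath)
  }
}
{code}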



> Provide ability to throttle DeltaStreamer sync runs
> ---
>
> Key: HUDI-269
> URL: https://issues.apache.org/jira/browse/HUDI-269
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: BALAJI VARADARAJAN
>Assignee: Xing Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
> Attachments: image-2019-09-25-08-51-19-686.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenario in our cluster, we may want delta streamer to slow down a 
> bit.
> so it's nice to have a parameter to control the min sync interval of each 
> sync in continuous mode.
> this param is default to 0, so this should not affect current logic.
> minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921]
> the main reason we want to slow it down is that aws s3 is charged by s3 
> get/put/list requests. we don't want to pay for too many requests for a 
> really slow change table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-24 Thread Xing Pan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936542#comment-16936542
 ] 

Xing Pan edited comment on HUDI-269 at 9/24/19 8:27 AM:


Just updated the pull request.



> Provide ability to throttle DeltaStreamer sync runs
> ---
>
> Key: HUDI-269
> URL: https://issues.apache.org/jira/browse/HUDI-269
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: BALAJI VARADARAJAN
>Priority: Major
> Fix For: 0.5.0
>
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenario in our cluster, we may want delta streamer to slow down a 
> bit.
> so it's nice to have a parameter to control the min sync interval of each 
> sync in continuous mode.
> this param is default to 0, so this should not affect current logic.
> minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921]
> the main reason we want to slow it down is that aws s3 is charged by s3 
> get/put/list requests. we don't want to pay for too many requests for a 
> really slow change table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)