[jira] [Updated] (HUDI-376) AWS Glue dependency issue for EMR 5.28.0
[ https://issues.apache.org/jira/browse/HUDI-376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xing Pan updated HUDI-376:
    Attachment: (was: Collibra-DGC-572-Administration-Guide.pdf)

> AWS Glue dependency issue for EMR 5.28.0
>
> Key: HUDI-376
> URL: https://issues.apache.org/jira/browse/HUDI-376
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Usability
> Reporter: Xing Pan
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.5.1
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Hi Hudi team, it's really encouraging that Hudi is finally an officially
> supported application on AWS EMR. Great job!
> I found a *ClassNotFound* exception when running:
> {code:java}
> /usr/lib/hudi/bin/run_sync_tool.sh
> {code}
> on the EMR master node.
> I think this is due to a missing AWS Glue SDK dependency. (I used AWS
> Glue as the Hive metastore.)
> So I added a line to run_sync_tool.sh as a quick fix:
> {code:java}
> HIVE_JARS=$HIVE_JARS:/usr/lib/hive/auxlib/aws-glue-datacatalog-hive2-client.jar:/usr/share/aws/emr/emr-metrics-collector/lib/aws-java-sdk-glue-1.11.475.jar{code}
> I'm not sure whether any more jars are needed, but these two jars fixed my problem.
>
> I think it would be great if the EMR scripts took Glue into consideration.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
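A slightly more defensive variant of the quick fix above might append each Glue jar only when it actually exists on the node, since jar paths and versions can differ between EMR releases. This is a minimal sketch under that assumption, not the change that went into the Hudi repository; the helper name `append_jar_if_present` is hypothetical, and the two jar paths are simply the ones quoted in the report.

```shell
# Sketch only: append_jar_if_present is a hypothetical helper,
# not part of the real run_sync_tool.sh.

# Print classpath $1 with jar $2 appended, but only if $2 is an existing file.
append_jar_if_present() {
  if [ -f "$2" ]; then
    # ${1:+$1:} avoids a leading ':' when the classpath is still empty.
    printf '%s\n' "${1:+$1:}$2"
  else
    printf '%s\n' "$1"
  fi
}

# Jar paths as quoted in the report (EMR 5.28.0); they may differ elsewhere.
for jar in \
  /usr/lib/hive/auxlib/aws-glue-datacatalog-hive2-client.jar \
  /usr/share/aws/emr/emr-metrics-collector/lib/aws-java-sdk-glue-1.11.475.jar
do
  HIVE_JARS="$(append_jar_if_present "${HIVE_JARS:-}" "$jar")"
done
```

On a node where neither jar is present, `HIVE_JARS` is left unchanged, so the script would fail with the original ClassNotFound error instead of carrying a classpath entry that points at a missing file.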
[jira] [Updated] (HUDI-376) AWS Glue dependency issue for EMR 5.28.0
[ https://issues.apache.org/jira/browse/HUDI-376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xing Pan updated HUDI-376:
    Attachment: Collibra-DGC-572-Administration-Guide.pdf
[jira] [Commented] (HUDI-376) AWS Glue dependency issue for EMR 5.28.0
[ https://issues.apache.org/jira/browse/HUDI-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008528#comment-17008528 ]

Xing Pan commented on HUDI-376:

PR: [https://github.com/apache/incubator-hudi/pull/1189] [~xleesf]

> AWS Glue dependency issue for EMR 5.28.0
>
> Key: HUDI-376
> URL: https://issues.apache.org/jira/browse/HUDI-376
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Usability
> Reporter: Xing Pan
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.5.1
>
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Commented] (HUDI-376) AWS Glue dependency issue for EMR 5.28.0
[ https://issues.apache.org/jira/browse/HUDI-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17008482#comment-17008482 ]

Xing Pan commented on HUDI-376:

[~xleesf] Sorry for the delayed response.
I'd like to send a PR, but I think the "run_sync_tool.sh" script in the GitHub repo is different from the one on EMR, and I am not sure where the source of the EMR version of "run_sync_tool.sh" lives.
I can certainly send a PR that adds documentation for aws-configs, though.

> AWS Glue dependency issue for EMR 5.28.0
>
> Key: HUDI-376
> URL: https://issues.apache.org/jira/browse/HUDI-376
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Usability
> Reporter: Xing Pan
> Priority: Minor
> Fix For: 0.5.1
[jira] [Updated] (HUDI-376) AWS Glue dependency issue for EMR 5.28.0
[ https://issues.apache.org/jira/browse/HUDI-376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xing Pan updated HUDI-376:
    Description:
Hi Hudi team, it's really encouraging that Hudi is finally an officially supported application on AWS EMR. Great job!
I found a *ClassNotFound* exception when running:
{code:java}
/usr/lib/hudi/bin/run_sync_tool.sh
{code}
on the EMR master node.
I think this is due to a missing AWS Glue SDK dependency. (I used AWS Glue as the Hive metastore.)
So I added a line to run_sync_tool.sh as a quick fix:
{code:java}
HIVE_JARS=$HIVE_JARS:/usr/lib/hive/auxlib/aws-glue-datacatalog-hive2-client.jar:/usr/share/aws/emr/emr-metrics-collector/lib/aws-java-sdk-glue-1.11.475.jar{code}
I'm not sure whether any more jars are needed, but these two jars fixed my problem.

I think it would be great if the EMR scripts took Glue into consideration.

  was:
Hi Hudi team, it's really encouraging that Hudi is finally an officially supported application on AWS EMR. Great job!
I found a *ClassNotFound* exception when running:
{code:java}
/usr/lib/hudi/bin/run_sync_tool.sh
{code}
on the EMR master node.
I think this is due to a missing AWS Glue SDK dependency. (I used AWS Glue as the Hive metastore.)
So I added a line to run_sync_tool.sh as a quick fix:
{code:java}
HIVE_JARS=$HIVE_JARS:/usr/lib/hive/auxlib/aws-glue-datacatalog-hive2-client.jar:/usr/share/aws/emr/emr-metrics-collector/lib/aws-java-sdk-glue-1.11.475.jar{code}
I'm not sure whether any more jars are needed, but these two jars fixed my problem.

> AWS Glue dependency issue for EMR 5.28.0
>
> Key: HUDI-376
> URL: https://issues.apache.org/jira/browse/HUDI-376
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: CLI
> Reporter: Xing Pan
> Priority: Minor
[jira] [Updated] (HUDI-376) AWS Glue dependency issue for EMR 5.28.0
[ https://issues.apache.org/jira/browse/HUDI-376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xing Pan updated HUDI-376:
    Issue Type: Improvement (was: Bug)
[jira] [Updated] (HUDI-376) AWS Glue dependency issue for EMR 5.28.0
[ https://issues.apache.org/jira/browse/HUDI-376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xing Pan updated HUDI-376:
    Description:
Hi Hudi team, it's really encouraging that Hudi is finally an officially supported application on AWS EMR. Great job!
I found a *ClassNotFound* exception when running:
{code:java}
/usr/lib/hudi/bin/run_sync_tool.sh
{code}
on the EMR master node.
I think this is due to a missing AWS Glue SDK dependency. (I used AWS Glue as the Hive metastore.)
So I added a line to run_sync_tool.sh as a quick fix:
{code:java}
HIVE_JARS=$HIVE_JARS:/usr/lib/hive/auxlib/aws-glue-datacatalog-hive2-client.jar:/usr/share/aws/emr/emr-metrics-collector/lib/aws-java-sdk-glue-1.11.475.jar{code}
I'm not sure whether any more jars are needed, but these two jars fixed my problem.

  was:
Hi Hudi team, it's really encouraging that Hudi is finally an officially supported application on AWS EMR. Great job!
I found a *ClassNotFound* exception when running:
{code:java}
/usr/lib/hudi/bin/run_sync_tool.sh
{code}
on the EMR master node.
I think this is due to a missing AWS Glue SDK dependency.
So I added a line to run_sync_tool.sh as a quick fix:
{code:java}
HIVE_JARS=$HIVE_JARS:/usr/lib/hive/auxlib/aws-glue-datacatalog-hive2-client.jar:/usr/share/aws/emr/emr-metrics-collector/lib/aws-java-sdk-glue-1.11.475.jar{code}
I'm not sure whether any more jars are needed, but these two jars fixed my problem.

> AWS Glue dependency issue for EMR 5.28.0
>
> Key: HUDI-376
> URL: https://issues.apache.org/jira/browse/HUDI-376
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Components: CLI
> Reporter: Xing Pan
> Priority: Minor
[jira] [Created] (HUDI-376) AWS Glue dependency issue for EMR 5.28.0
Xing Pan created HUDI-376:

    Summary: AWS Glue dependency issue for EMR 5.28.0
    Key: HUDI-376
    URL: https://issues.apache.org/jira/browse/HUDI-376
    Project: Apache Hudi (incubating)
    Issue Type: Bug
    Components: CLI
    Reporter: Xing Pan

Hi Hudi team, it's really encouraging that Hudi is finally an officially supported application on AWS EMR. Great job!
I found a *ClassNotFound* exception when running:
{code:java}
/usr/lib/hudi/bin/run_sync_tool.sh
{code}
on the EMR master node.
I think this is due to a missing AWS Glue SDK dependency.
So I added a line to run_sync_tool.sh as a quick fix:
{code:java}
HIVE_JARS=$HIVE_JARS:/usr/lib/hive/auxlib/aws-glue-datacatalog-hive2-client.jar:/usr/share/aws/emr/emr-metrics-collector/lib/aws-java-sdk-glue-1.11.475.jar{code}
I'm not sure whether any more jars are needed, but these two jars fixed my problem.
[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939797#comment-16939797 ]

Xing Pan edited comment on HUDI-269 at 9/28/19 12:23 AM:

[~vinoth] I also thought it was because the delta streamer is too aggressive, so I added the throttle param to control this, and it helped.

Now the issue is this: I do significantly reduce the request count by setting the delta streamer *throttle to 5 seconds*.
 * When there is *nothing* coming from Kafka, the request metrics look acceptable with both the data source writer and the delta streamer. (Without the throttle, the delta streamer sends too many requests even when no input is coming.)
 * But with Kafka input streaming at around 10 records per second, I found that even with the 5-second throttle, writing Hudi with the delta streamer causes 10 times more requests than the data source writer.

So I would choose the data source writer anyway, since that way I always save on request count.

Now I am wondering what the pros and cons are when choosing between the Spark datasource and the delta streamer. As far as I can see, in my scenario, if I use the delta streamer:
 * The delta streamer can help ingest data from Kafka.
 * It has a self-managed checkpoint.
 * I can set the compaction job weight.

And if I use the Spark data source writer:
 * I have more control over my code; I can have my own Kafka ingestion implementation.
 * I will save money :) (it still costs less, even with the delta streamer's throttle control for now)

So if I currently can't get the two request metrics to the same level, I'd use the data source writer. Any more suggestions on choosing the delta streamer?

> Provide ability to throttle DeltaStreamer sync runs
>
> Key: HUDI-269
> URL: https://issues.apache.org/jira/browse/HUDI-269
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: deltastreamer
> Reporter: Balaji Varadarajan
> Assignee: Xing Pan
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.5.0
>
> Attachments: hudi_request_test.tar.gz, image-2019-09-25-08-51-19-686.png, image-2019-09-26-09-02-24-761.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenarios in our cluster, we may want the delta streamer to slow down a bit,
> so it would be nice to have a parameter that controls the minimum interval between
> syncs in continuous mode.
> This parameter defaults to 0, so it should not affect the current logic.
> Minor PR: [#921|https://github.com/apache/incubator-hudi/pull/921]
> The main reason we want to slow it down is that AWS S3 charges per
> GET/PUT/LIST request. We don't want to pay for too many requests on a
> table that changes really slowly.
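The minimum sync interval described in the issue above can be pictured as a loop that, after each sync round, sleeps for whatever is left of the interval. The following shell sketch illustrates only that idea and is not DeltaStreamer's actual implementation; `remaining_sleep` and `run_throttled` are hypothetical names, and `date +%s` is assumed to be available (it is on GNU and BSD systems).

```shell
# Sketch only: hypothetical helpers illustrating a minimum sync interval.

# Seconds still to wait after a round that took $1 seconds,
# given a minimum interval of $2 seconds (never negative).
remaining_sleep() {
  if [ "$1" -lt "$2" ]; then
    echo $(( $2 - $1 ))
  else
    echo 0
  fi
}

# Run a command $1 times, starting rounds at least $2 seconds apart.
# Usage: run_throttled ROUNDS MIN_INTERVAL_SECONDS COMMAND [ARGS...]
run_throttled() {
  rounds="$1"
  min_interval="$2"
  shift 2
  i=0
  while [ "$i" -lt "$rounds" ]; do
    start=$(date +%s)
    "$@"                                   # one sync round
    elapsed=$(( $(date +%s) - start ))
    sleep "$(remaining_sleep "$elapsed" "$min_interval")"
    i=$(( i + 1 ))
  done
}
```

With a minimum interval of 0 the loop never sleeps, which mirrors the statement in the issue that the default should not affect existing behaviour. For example, `run_throttled 3 5 echo "sync round"` would start its three rounds at least five seconds apart.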
[jira] [Commented] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939797#comment-16939797 ]

Xing Pan commented on HUDI-269:

[~vinoth] I also thought it was because the delta streamer is too aggressive, so I added the throttle param to control this, and it helped.

Now the issue is this: I do significantly reduce the request count by setting the delta streamer *throttle to 5 seconds*.
 * When there is *nothing* coming from Kafka, the request metrics look acceptable with both the data source writer and the delta streamer.
 * But with Kafka input streaming at around 10 records per second, I found that even with the 5-second throttle, writing Hudi with the delta streamer causes 10 times more requests than the data source writer.

So I would choose the data source writer anyway, since that way I always save on request count.

Now I am wondering what the pros and cons are when choosing between the Spark datasource and the delta streamer. As far as I can see, in my scenario, if I use the delta streamer:
 * The delta streamer can help ingest data from Kafka.
 * It has a self-managed checkpoint.
 * I can set the compaction job weight.

And if I use the Spark data source writer:
 * I have more control over my code; I can have my own Kafka ingestion implementation.
 * I will save money :) (it still costs less, even with the delta streamer's throttle control for now)

So if I currently can't get the two request metrics to the same level, I'd use the data source writer. Any more suggestions on choosing the delta streamer?
[jira] [Commented] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938163#comment-16938163 ]

Xing Pan commented on HUDI-269:

I tried running the same Hudi app via the Hudi Spark datasource writer:
{code:java}
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", KAFKA_SERVER)
  .option("subscribe", DEMO_11_TOPIC)
  .load()
  // Decode the Confluent Avro payload and flatten it into top-level columns.
  .select(from_confluent_avro(col("value"), SCHEMA_REGISTRY_CONF) as 'data).select("data.*")
  .writeStream.format("org.apache.hudi")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "dateStr")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")
  .option(HoodieWriteConfig.TABLE_NAME, DEMO_11_TABLE_NAME)
  .option("checkpointLocation", checkpointPath)
  // Hive sync settings.
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, DEMO_11_TABLE_NAME)
  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "default")
  .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, HIVE_URL)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "dateStr")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[SlashEncodedDayPartitionValueExtractor].getCanonicalName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .outputMode(OutputMode.Append)
  // Trigger a micro-batch every 5 seconds.
  .trigger(Trigger.ProcessingTime(5000))
  .start(outputPath)
  .awaitTermination()
{code}
{code:java}
spark-submit --class xxx.HudiSpark \
  --jars \
xxx/hudi-spark-bundle-0.5.1-SNAPSHOT.jar,\
xxx/abris_2.11-3.0.1.jar,\
xxx/common-utils-5.3.0.jar,xxx/kafka-schema-registry-client-5.3.0.jar,xxx/kafka-avro-serializer-5.3.0.jar,xxx/common-config-5.3.0.jar \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3,org.apache.spark:spark-avro_2.11:2.4.3 \
  --conf spark.hadoop.fs.s3a.endpoint=s3-ap-east-1.amazonaws.com \
  --conf spark.dynamicAllocation.executorIdleTimeout=10s \
  --conf hoodie.embed.timeline.server=true \
  --conf hoodie.filesystem.view.incr.timeline.sync.enable=true \
  --conf hoodie.upsert.shuffle.parallelism=2 \
  --executor-memory 1g \
  my_test.jar
{code}
I pushed 300 records every second, and the S3 request count was fairly low:

!image-2019-09-26-09-02-24-761.png!

I am not quite sure what the difference between the datasource writer and the delta streamer is. As far as I can tell, when no data is coming in, the request count is about the same, but if I push some records every second, the *datasource writer* issues about 10 times fewer requests than the delta streamer.

[~vinoth]
[jira] [Updated] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xing Pan updated HUDI-269:
    Attachment: image-2019-09-26-09-02-24-761.png
[jira] [Commented] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937825#comment-16937825 ]

Xing Pan commented on HUDI-269:

[~vinoth], I'm planning to use Hudi in our data lake project and am happy to contribute.
Since the naive throttle feature in this ticket will not actually solve the request issue completely, I will investigate this more deeply.

> Provide ability to throttle DeltaStreamer sync runs
>
> Key: HUDI-269
> URL: https://issues.apache.org/jira/browse/HUDI-269
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: deltastreamer
> Reporter: BALAJI VARADARAJAN
> Assignee: Xing Pan
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.5.0
>
> Attachments: hudi_request_test.tar.gz, image-2019-09-25-08-51-19-686.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937358#comment-16937358 ] Xing Pan edited comment on HUDI-269 at 9/25/19 2:56 PM: [~vbalaji] Yes, these strange 5K requests are mainly HEAD requests, and they cause a lot of S3 4xx errors, which are classified as "client side error". I only have one partition, "1100/01/01"; please find *hudi_request_test.tar.gz* attached.
{code:java}
aws s3 ls s3://xxx/output/1100/01/01/
2019-09-25 01:56:57 93 .hoodie_partition_metadata
2019-09-25 02:12:18 535993 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-111-99_20190925021213.parquet
2019-09-25 02:50:30 679546 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-126-108_20190925025025.parquet
2019-09-25 02:32:27 597943 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-135-117_20190925023222.parquet
2019-09-25 02:38:03 623372 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-135-117_20190925023758.parquet
2019-09-25 02:12:48 537971 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-149-130_20190925021243.parquet
2019-09-25 02:50:39 680323 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-161-136_20190925025033.parquet
2019-09-25 02:32:57 599788 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-173-148_20190925023252.parquet
2019-09-25 02:38:33 625295 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-173-148_20190925023828.parquet
2019-09-25 02:13:18 540308 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-187-161_20190925021313.parquet
2019-09-25 02:50:47 681076 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-196-164_20190925025042.parquet
2019-09-25 02:31:07 591207 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-21-24_20190925023057.parquet
2019-09-25 02:36:48 615894 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-21-24_20190925023637.parquet
2019-09-25 02:50:01 675036 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-21-24_20190925024946.parquet
2019-09-25 02:33:27 602011 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-211-179_20190925023322.parquet
2019-09-25 02:39:03 627524 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-211-179_20190925023858.parquet
2019-09-25 02:13:48 542690 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-225-192_20190925021343.parquet
2019-09-25 02:50:55 681495 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-231-192_20190925025049.parquet
2019-09-25 02:33:57 604273 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-249-210_20190925023352.parquet
2019-09-25 02:39:33 629743 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-249-210_20190925023928.parquet
2019-09-25 02:14:18 545021 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-263-223_20190925021413.parquet
2019-09-25 02:51:03 682267 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-266-220_20190925025058.parquet
2019-09-25 02:34:27 606495 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-287-241_20190925023422.parquet
2019-09-25 02:40:03 632018 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-287-241_20190925023958.parquet
2019-09-25 02:51:11 682667 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-301-248_20190925025106.parquet
2019-09-25 02:14:48 547294 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-301-254_20190925021443.parquet
2019-09-25 02:34:57 608770 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-325-272_20190925023452.parquet
2019-09-25 02:40:33 634280 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-325-272_20190925024028.parquet
2019-09-25 02:51:18 683418 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-336-276_20190925025113.parquet
2019-09-25 02:15:18 549588 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-339-285_20190925021513.parquet
2019-09-25 01:56:59 533148 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-35-37_20190925015651.parquet
2019-09-25 02:35:27 610998 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-363-303_20190925023522.parquet
2019-09-25 02:41:04 636524 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-363-303_20190925024058.parquet
2019-09-25 02:51:26 683833 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-371-304_20190925025121.parquet
2019-09-25 02:15:48 551902 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-377-316_20190925021543.parquet
2019-09-25 02:35:57 613259 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-401-334_20190925023552.parquet
2019-09-25 02:41:33 638757 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-401-334_20190925024128.parquet
2019-09-25 02:51:34 684572 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-406-332_20190925025130.parquet
2019-09-25 02:16:18 553820 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-415-347_20190925021613.parquet
2019-09-25 02:42:03 641007 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-439-365_20190925024158.parquet
2019-09-25 02:51:42 684965 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-441-360_20190925025137.parquet
2019-09-25 02:16:48 556070 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-453-378_20190925021643.parquet
2019-09-25 02:51:49 685729 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-476-388_20190925025144.parquet
2019-09-25
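The listing above can be summarized programmatically to confirm that every parquet file belongs to the same file group and differs only by commit. The following is a minimal sketch, assuming the base-file name layout `<fileId>_<writeToken>_<instantTime>.parquet` (an assumption read off the names in the listing, not taken from Hudi's code); `parse_listing` and the `sample` excerpt are hypothetical helpers for illustration.

```python
import re
from collections import Counter

# Assumed base-file layout: <fileId>_<writeToken>_<instantTime>.parquet
BASE_FILE = re.compile(
    r"^(?P<file_id>.+)_(?P<write_token>\d+-\d+-\d+)_(?P<instant>\d{14})\.parquet$"
)


def parse_listing(text):
    """Yield (file_id, write_token, instant, size_bytes) per parquet entry
    in `aws s3 ls` output; non-parquet keys are skipped."""
    for line in text.splitlines():
        parts = line.split()
        # Expected columns: date, time, size, key.
        if len(parts) != 4:
            continue
        match = BASE_FILE.match(parts[3])
        if match:
            yield (match["file_id"], match["write_token"],
                   match["instant"], int(parts[2]))


# Three entries copied from the listing above, plus the metadata file.
sample = """\
2019-09-25 01:56:57 93 .hoodie_partition_metadata
2019-09-25 02:12:18 535993 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-111-99_20190925021213.parquet
2019-09-25 02:50:30 679546 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-126-108_20190925025025.parquet
2019-09-25 01:56:59 533148 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-35-37_20190925015651.parquet
"""

rows = list(parse_listing(sample))
# Number of file versions per file id, and the most recent commit instant.
versions_per_file_id = Counter(file_id for file_id, _, _, _ in rows)
latest_instant = max(instant for _, _, instant, _ in rows)
print(versions_per_file_id, latest_instant)
```

Feeding the full `aws s3 ls` output into `parse_listing` instead of `sample` would show how many versions of the single file group have accumulated in the partition.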
[jira] [Commented] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937363#comment-16937363 ] Xing Pan commented on HUDI-269: --- [~vbalaji] I see. I was just trying to observe the change in request count in the simplest way, so I generated thousands of records in one partition. > Provide ability to throttle DeltaStreamer sync runs > --- > > Key: HUDI-269 > URL: https://issues.apache.org/jira/browse/HUDI-269 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: deltastreamer >Reporter: BALAJI VARADARAJAN >Assignee: Xing Pan >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Attachments: hudi_request_test.tar.gz, > image-2019-09-25-08-51-19-686.png > > Time Spent: 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/incubator-hudi/issues/922] > In some scenario in our cluster, we may want delta streamer to slow down a > bit. > so it's nice to have a parameter to control the min sync interval of each > sync in continuous mode. > this param is default to 0, so this should not affect current logic. > minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921] > the main reason we want to slow it down is that aws s3 is charged by s3 > get/put/list requests. we don't want to pay for too many requests for a > really slow change table. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Pan updated HUDI-269: -- Attachment: hudi_request_test.tar.gz > Provide ability to throttle DeltaStreamer sync runs > --- > > Key: HUDI-269 > URL: https://issues.apache.org/jira/browse/HUDI-269 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: deltastreamer >Reporter: BALAJI VARADARAJAN >Assignee: Xing Pan >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Attachments: hudi_request_test.tar.gz, > image-2019-09-25-08-51-19-686.png > > Time Spent: 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/incubator-hudi/issues/922] > In some scenario in our cluster, we may want delta streamer to slow down a > bit. > so it's nice to have a parameter to control the min sync interval of each > sync in continuous mode. > this param is default to 0, so this should not affect current logic. > minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921] > the main reason we want to slow it down is that aws s3 is charged by s3 > get/put/list requests. we don't want to pay for too many requests for a > really slow change table. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937336#comment-16937336 ] Xing Pan commented on HUDI-269: --- And I ran DeltaStreamer like this:
{code:java}
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --executor-memory 1g \
  --executor-cores 1 \
  --conf spark.dynamicAllocation.executorIdleTimeout=10s \
  --conf spark.dynamicAllocation.maxExecutors=3 \
  xxx/hudi-utilities-bundle-0.5.1-SNAPSHOT.jar \
  --target-base-path s3a://xxx \
  --target-table default.xxx \
  --storage-type MERGE_ON_READ \
  --props s3a:///kafka-source.properties \
  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
  --source-ordering-field ts \
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --op UPSERT \
  --enable-hive-sync \
  --continuous \
  --min-sync-interval-seconds 0 \
  --hoodie-conf hoodie.embed.timeline.server=true \
  --hoodie-conf hoodie.filesystem.view.incr.timeline.sync.enable=true
{code}
> Provide ability to throttle DeltaStreamer sync runs > --- > > Key: HUDI-269 > URL: https://issues.apache.org/jira/browse/HUDI-269 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: deltastreamer >Reporter: BALAJI VARADARAJAN >Assignee: Xing Pan >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Attachments: image-2019-09-25-08-51-19-686.png > > Time Spent: 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/incubator-hudi/issues/922] > In some scenario in our cluster, we may want delta streamer to slow down a > bit. > so it's nice to have a parameter to control the min sync interval of each > sync in continuous mode. > this param is default to 0, so this should not affect current logic. > minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921] > the main reason we want to slow it down is that aws s3 is charged by s3 > get/put/list requests. we don't want to pay for too many requests for a > really slow change table. -- This message was sent by Atlassian Jira (v8.3.4#803005)
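The behaviour described in the ticket (a minimum interval between syncs in continuous mode, defaulting to 0 so existing runs are unaffected) can be illustrated with a small loop. This is only a sketch of the idea, not the DeltaStreamer implementation: `run_continuous`, `sync_once`, `max_rounds`, and `FakeClock` are hypothetical names, and measuring the interval from the start of each sync is an assumption.

```python
import time


def run_continuous(sync_once, min_sync_interval_seconds=0, max_rounds=None,
                   clock=time.monotonic, sleep=time.sleep):
    """Call sync_once repeatedly, keeping the starts of consecutive rounds
    at least min_sync_interval_seconds apart. Returns the rounds executed."""
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        started = clock()
        sync_once()
        rounds += 1
        # Sleep only for the part of the interval the sync did not use up.
        remaining = min_sync_interval_seconds - (clock() - started)
        if remaining > 0:
            sleep(remaining)
    return rounds


class FakeClock:
    """Deterministic clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0
        self.sleeps = []

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.sleeps.append(seconds)
        self.now += seconds


fake = FakeClock()


def fake_sync():
    fake.now += 2.0  # pretend every sync round takes 2 seconds


# With a 10 second minimum interval, each 2 second sync is followed by a pause.
throttled_rounds = run_continuous(fake_sync, min_sync_interval_seconds=10,
                                  max_rounds=3, clock=fake.time, sleep=fake.sleep)

# With the default of 0 the loop never sleeps, matching the unthrottled behaviour.
unthrottled = FakeClock()
run_continuous(lambda: None, max_rounds=3,
               clock=unthrottled.time, sleep=unthrottled.sleep)
```

Injecting `clock` and `sleep` keeps the sketch testable without waiting; a real job would use the defaults.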
[jira] [Commented] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937334#comment-16937334 ] Xing Pan commented on HUDI-269: --- [~vbalaji]: Yes, this is just a single test Hudi app in my sandbox EMR, so I'm pretty sure this one Hudi Spark job is the only writer to this bucket. I have just one partition for this table, and I've tested it in two scenarios: # nothing comes from Kafka at all. # 10 records come in every 5 seconds. I didn't run any queries on this table during delta sync; do you expect the request count to decrease from the query end? I've set sync interval = 0 in some tests, so I think it has run several iterations. > Provide ability to throttle DeltaStreamer sync runs > --- > > Key: HUDI-269 > URL: https://issues.apache.org/jira/browse/HUDI-269 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: deltastreamer >Reporter: BALAJI VARADARAJAN >Assignee: Xing Pan >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Attachments: image-2019-09-25-08-51-19-686.png > > Time Spent: 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/incubator-hudi/issues/922] > In some scenario in our cluster, we may want delta streamer to slow down a > bit. > so it's nice to have a parameter to control the min sync interval of each > sync in continuous mode. > this param is default to 0, so this should not affect current logic. > minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921] > the main reason we want to slow it down is that aws s3 is charged by s3 > get/put/list requests. we don't want to pay for too many requests for a > really slow change table. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937302#comment-16937302 ] Xing Pan edited comment on HUDI-269 at 9/25/19 1:05 AM: [~vbalaji] [~vinoth] I just ran some simple tests on these configs. Basically, in my scenario we get CDC data from a slowly changing table and sync it from Kafka to a Hudi dataset. As plotted in the graph above, throttling DeltaStreamer sync runs significantly decreases GET requests per minute, but *embed timeline server* and *incr timeline sync* didn't help reduce the request count much. was (Author: xingxpan): [~vbalaji] [~vinoth] I just did some simple test on these configs. basically in my scenario, we will get cdc data from a slow change table and sync cdc from kafka to hudi dataset. and as plot in the graph above, throttle DeltaStreamer sync runs will significantly decrease get request per min. but `embed timeline server` and `incr timeline sync` didn't help reduce requests count too much. > Provide ability to throttle DeltaStreamer sync runs > --- > > Key: HUDI-269 > URL: https://issues.apache.org/jira/browse/HUDI-269 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: deltastreamer >Reporter: BALAJI VARADARAJAN >Assignee: Xing Pan >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Attachments: image-2019-09-25-08-51-19-686.png > > Time Spent: 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/incubator-hudi/issues/922] > In some scenario in our cluster, we may want delta streamer to slow down a > bit. > so it's nice to have a parameter to control the min sync interval of each > sync in continuous mode. > this param is default to 0, so this should not affect current logic. > minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921] > the main reason we want to slow it down is that aws s3 is charged by s3 > get/put/list requests. we don't want to pay for too many requests for a > really slow change table. -- This message was sent by Atlassian Jira (v8.3.4#803005)
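Why throttling lowers the per-minute request count can be seen with a back-of-the-envelope model: if each sync round issues roughly the same number of S3 requests, the rate scales with the number of rounds per minute. The sketch below assumes exactly that; `requests_per_minute` is a hypothetical helper, and the figures (100 requests per round, 5 seconds per round) are made-up illustration values, not measurements from this ticket.

```python
def requests_per_minute(requests_per_sync, sync_duration_seconds,
                        min_sync_interval_seconds=0):
    """Estimate S3 requests per minute, assuming every sync round issues
    about the same number of requests and rounds run back to back."""
    # A round cannot repeat faster than its own duration or the minimum interval.
    cycle_seconds = max(sync_duration_seconds, min_sync_interval_seconds)
    if cycle_seconds <= 0:
        raise ValueError("sync duration or minimum interval must be positive")
    return requests_per_sync * 60.0 / cycle_seconds


# Hypothetical workload: 100 requests per round, 5 seconds per round.
unthrottled_rate = requests_per_minute(100, 5)        # interval 0
rate_30s = requests_per_minute(100, 5, 30)            # at most one round per 30 s
rate_60s = requests_per_minute(100, 5, 60)            # at most one round per minute
print(unthrottled_rate, rate_30s, rate_60s)
```

Under this model the minimum interval only matters once it exceeds the time a sync round already takes, which is consistent with a default of 0 leaving current behaviour unchanged.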
[jira] [Updated] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Pan updated HUDI-269: -- Attachment: (was: request_histogram.png) > Provide ability to throttle DeltaStreamer sync runs > --- > > Key: HUDI-269 > URL: https://issues.apache.org/jira/browse/HUDI-269 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: deltastreamer >Reporter: BALAJI VARADARAJAN >Assignee: Xing Pan >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Attachments: image-2019-09-25-08-51-19-686.png > > Time Spent: 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/incubator-hudi/issues/922] > In some scenario in our cluster, we may want delta streamer to slow down a > bit. > so it's nice to have a parameter to control the min sync interval of each > sync in continuous mode. > this param is default to 0, so this should not affect current logic. > minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921] > the main reason we want to slow it down is that aws s3 is charged by s3 > get/put/list requests. we don't want to pay for too many requests for a > really slow change table. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937296#comment-16937296 ] Xing Pan commented on HUDI-269: --- !image-2019-09-25-08-51-19-686.png! > Provide ability to throttle DeltaStreamer sync runs > --- > > Key: HUDI-269 > URL: https://issues.apache.org/jira/browse/HUDI-269 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: deltastreamer >Reporter: BALAJI VARADARAJAN >Assignee: Xing Pan >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Attachments: request_histogram.png > > Time Spent: 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/incubator-hudi/issues/922] > In some scenario in our cluster, we may want delta streamer to slow down a > bit. > so it's nice to have a parameter to control the min sync interval of each > sync in continuous mode. > this param is default to 0, so this should not affect current logic. > minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921] > the main reason we want to slow it down is that aws s3 is charged by s3 > get/put/list requests. we don't want to pay for too many requests for a > really slow change table. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Pan updated HUDI-269: -- Attachment: image-2019-09-25-08-51-19-686.png > Provide ability to throttle DeltaStreamer sync runs > --- > > Key: HUDI-269 > URL: https://issues.apache.org/jira/browse/HUDI-269 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: deltastreamer >Reporter: BALAJI VARADARAJAN >Assignee: Xing Pan >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Attachments: image-2019-09-25-08-51-19-686.png, request_histogram.png > > Time Spent: 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/incubator-hudi/issues/922] > In some scenario in our cluster, we may want delta streamer to slow down a > bit. > so it's nice to have a parameter to control the min sync interval of each > sync in continuous mode. > this param is default to 0, so this should not affect current logic. > minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921] > the main reason we want to slow it down is that aws s3 is charged by s3 > get/put/list requests. we don't want to pay for too many requests for a > really slow change table. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xing Pan updated HUDI-269: -- Attachment: request_histogram.png > Provide ability to throttle DeltaStreamer sync runs > --- > > Key: HUDI-269 > URL: https://issues.apache.org/jira/browse/HUDI-269 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: deltastreamer >Reporter: BALAJI VARADARAJAN >Assignee: Xing Pan >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Attachments: request_histogram.png > > Time Spent: 10m > Remaining Estimate: 0h > > Copied from [https://github.com/apache/incubator-hudi/issues/922] > In some scenario in our cluster, we may want delta streamer to slow down a > bit. > so it's nice to have a parameter to control the min sync interval of each > sync in continuous mode. > this param is default to 0, so this should not affect current logic. > minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921] > the main reason we want to slow it down is that aws s3 is charged by s3 > get/put/list requests. we don't want to pay for too many requests for a > really slow change table. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs
[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936542#comment-16936542 ] Xing Pan edited comment on HUDI-269 at 9/24/19 8:27 AM: just updated pull request was (Author: xingxpan): just update pull request > Provide ability to throttle DeltaStreamer sync runs > --- > > Key: HUDI-269 > URL: https://issues.apache.org/jira/browse/HUDI-269 > Project: Apache Hudi (incubating) > Issue Type: Improvement > Components: deltastreamer >Reporter: BALAJI VARADARAJAN >Priority: Major > Fix For: 0.5.0 > > > Copied from [https://github.com/apache/incubator-hudi/issues/922] > In some scenario in our cluster, we may want delta streamer to slow down a > bit. > so it's nice to have a parameter to control the min sync interval of each > sync in continuous mode. > this param is default to 0, so this should not affect current logic. > minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921] > the main reason we want to slow it down is that aws s3 is charged by s3 > get/put/list requests. we don't want to pay for too many requests for a > really slow change table. -- This message was sent by Atlassian Jira (v8.3.4#803005)