[jira] [Created] (GOBBLIN-1205) restarting gobblin on yarn fails with error
Jay Sen created GOBBLIN-1205:

Summary: restarting gobblin on yarn fails with error
Key: GOBBLIN-1205
URL: https://issues.apache.org/jira/browse/GOBBLIN-1205
Project: Apache Gobblin
Issue Type: Bug
Affects Versions: 0.15.0
Reporter: Jay Sen
Fix For: 0.15.0

Restarting Gobblin deployed in YARN mode occasionally fails at startup with the following error. The ZooKeeper path may still be held by the previous process, so it may just need a bit of time between stop and start.

{code:java}
WARN [ZKHelixAdmin] Root directory exists.Cleaning the root directory:/GobblinYarnHelixApp
WARN [ZKHelixAdmin] Root directory exists.Cleaning the root directory:/GobblinYarnHelixApp
WARN [ZkClient] Failed to delete path /GobblinYarnHelixApp/CONTROLLER! org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /GobblinYarnHelixApp/CONTROLLER
ERROR [ZkClient] Failed to delete /GobblinYarnHelixApp/CONTROLLER
org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /GobblinYarnHelixApp/CONTROLLER
    at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68)
    at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1160)
    at org.apache.helix.manager.zk.zookeeper.ZkClient.delete(ZkClient.java:1215)
    at org.apache.helix.manager.zk.zookeeper.ZkClient.deleteRecursively(ZkClient.java:949)
    at org.apache.helix.manager.zk.zookeeper.ZkClient.deleteRecursively(ZkClient.java:942)
    at org.apache.helix.manager.zk.ZKHelixAdmin.addCluster(ZKHelixAdmin.java:698)
    at org.apache.helix.tools.ClusterSetup.addCluster(ClusterSetup.java:162)
    at org.apache.gobblin.cluster.HelixUtils.createGobblinHelixCluster(HelixUtils.java:96)
    at org.apache.gobblin.yarn.GobblinYarnAppLauncher.launch(GobblinYarnAppLauncher.java:337)
    at org.apache.gobblin.yarn.GobblinYarnAppLauncher.main(GobblinYarnAppLauncher.java:1067)
Caused by: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /GobblinYarnHelixApp/CONTROLLER
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:128)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:882)
    at org.apache.helix.manager.zk.zookeeper.ZkConnection.delete(ZkConnection.java:119)
    at org.apache.helix.manager.zk.zookeeper.ZkClient$9.call(ZkClient.java:1219)
    at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1150)
    ... 8 more

==> logs/yarn.err <==
Exception in thread "main" org.apache.helix.HelixException: Failed to delete /GobblinYarnHelixApp/CONTROLLER
    at org.apache.helix.manager.zk.zookeeper.ZkClient.deleteRecursively(ZkClient.java:952)
    at org.apache.helix.manager.zk.zookeeper.ZkClient.deleteRecursively(ZkClient.java:942)
    at org.apache.helix.manager.zk.ZKHelixAdmin.addCluster(ZKHelixAdmin.java:698)
    at org.apache.helix.tools.ClusterSetup.addCluster(ClusterSetup.java:162)
    at org.apache.gobblin.cluster.HelixUtils.createGobblinHelixCluster(HelixUtils.java:96)
    at org.apache.gobblin.yarn.GobblinYarnAppLauncher.launch(GobblinYarnAppLauncher.java:337)
    at org.apache.gobblin.yarn.GobblinYarnAppLauncher.main(GobblinYarnAppLauncher.java:1067)
Caused by: org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /GobblinYarnHelixApp/CONTROLLER
    at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68)
    at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1160)
    at org.apache.helix.manager.zk.zookeeper.ZkClient.delete(ZkClient.java:1215)
    at org.apache.helix.manager.zk.zookeeper.ZkClient.deleteRecursively(ZkClient.java:949)
    ... 6 more
Caused by: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /GobblinYarnHelixApp/CONTROLLER
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:128)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:882)
    at org.apache.helix.manager.zk.zookeeper.ZkConnection.delete(ZkConnection.java:119)
    at org.apache.helix.manager.zk.zookeeper.ZkClient$9.call(ZkClient.java:1219)
    at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1150)
    ... 8 more
Exception in thread "Thread-6" org.apache.helix.HelixException: HelixManager (ZkClient) is not connected. Call HelixManager#connect()
    at org.apache.helix.manager.zk.ZKHelixManager.checkConnected(ZKHelixManager.java:363)
    at org.apache.helix.manager.zk.ZKHelixManager.getClusterManagmentTo
[jira] [Resolved] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sudarshan Vasudevan resolved GOBBLIN-1099.
--
Resolution: Fixed

> Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
> ---
>
> Key: GOBBLIN-1099
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1099
> Project: Apache Gobblin
> Issue Type: Improvement
> Components: gobblin-yarn
> Affects Versions: 0.15.0
> Reporter: Sudarshan Vasudevan
> Assignee: Abhishek Tiwari
> Priority: Major
> Fix For: 0.15.0
>
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> A Yarn application may leave behind orphaned containers, which can happen due
> to lost node managers. The orphaned containers however can continue to run
> (potentially forever) as participants in the Helix cluster. This can cause
> the following problems for a Gobblin-on-Yarn application:
> # Double publish of data and commit of state
> # Task failures and partition starvation during application restarts, as
> Helix may assign tasks to the orphaned containers which have a stale state
> and configuration
> # Container failures on application restarts due to Helix instance name
> collisions with orphaned containers

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
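The issue description above says that orphaned containers keep running as Helix participants after the application has lost track of them. As a rough illustration of that condition only (this is not taken from the GOBBLIN-1099 change, and `OrphanCheckSketch`, `findOrphans`, and both parameter names are hypothetical), an orphan can be thought of as a participant that is live in the cluster but is not backed by any container the application currently tracks:

```java
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: not a Gobblin, Helix, or Yarn class. */
final class OrphanCheckSketch {

  /**
   * Returns the participants that are live in the cluster but have no
   * matching entry among the instances the application tracks.
   * Both input sets are assumed to hold plain instance names.
   */
  static Set<String> findOrphans(Set<String> liveParticipants, Set<String> trackedInstances) {
    Set<String> orphans = new TreeSet<>(liveParticipants); // copy, so the inputs are not modified
    orphans.removeAll(trackedInstances);                   // set difference: live minus tracked
    return orphans;
  }
}
```

How the live and tracked sets would actually be obtained depends on the cluster manager and resource manager APIs, which this sketch deliberately leaves out.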
[jira] [Closed] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sudarshan Vasudevan closed GOBBLIN-1099.
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411607=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411607 ]

ASF GitHub Bot logged work on GOBBLIN-1099:
---
Author: ASF GitHub Bot
Created on: 28/Mar/20 04:26
Start Date: 28/Mar/20 04:26
Worklog Time Spent: 10m
Work Description: asfgit commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 411607)
Time Spent: 2h 40m (was: 2.5h)
[GitHub] [incubator-gobblin] asfgit closed pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
asfgit closed pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus… URL: https://github.com/apache/incubator-gobblin/pull/2940 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411595=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411595 ]

ASF GitHub Bot logged work on GOBBLIN-1099:
---
Author: ASF GitHub Bot
Created on: 28/Mar/20 03:19
Start Date: 28/Mar/20 03:19
Worklog Time Spent: 10m
Work Description: codecov-io commented on issue #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#issuecomment-604751963
[GitHub] [incubator-gobblin] codecov-io edited a comment on issue #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
codecov-io edited a comment on issue #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#issuecomment-604751963

# [Codecov](https://codecov.io/gh/apache/incubator-gobblin/pull/2940?src=pr=h1) Report
> Merging [#2940](https://codecov.io/gh/apache/incubator-gobblin/pull/2940?src=pr=desc) into [master](https://codecov.io/gh/apache/incubator-gobblin/commit/a63461257c3fcea8f4ff67087f8cb29be25d6baf=desc) will **decrease** coverage by `40.49%`.
> The diff coverage is `0.00%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/graphs/tree.svg?width=650=150=pr=4MgURJ0bGc)](https://codecov.io/gh/apache/incubator-gobblin/pull/2940?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master    #2940      +/-   ##
- Coverage     44.60%    4.10%   -40.50%
+ Complexity     8980      750     -8230
  Files          1936     1936
  Lines         73234    73292       +58
  Branches       8083     8088        +5
- Hits          32669     3012    -29657
- Misses        37515    69960    +32445
+ Partials       3050      320     -2730
```

| [Impacted Files](https://codecov.io/gh/apache/incubator-gobblin/pull/2940?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [.../apache/gobblin/cluster/GobblinClusterManager.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkNsdXN0ZXJNYW5hZ2VyLmphdmE=) | `0.00% <0.00%> (-53.92%)` | `0.00 <0.00> (-26.00)` | |
| [.../org/apache/gobblin/cluster/GobblinTaskRunner.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpblRhc2tSdW5uZXIuamF2YQ==) | `0.00% <0.00%> (-64.16%)` | `0.00 <0.00> (-27.00)` | |
| [...in/java/org/apache/gobblin/cluster/HelixUtils.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvSGVsaXhVdGlscy5qYXZh) | `0.00% <0.00%> (-38.27%)` | `0.00 <0.00> (-14.00)` | |
| [.../apache/gobblin/yarn/GobblinApplicationMaster.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbkFwcGxpY2F0aW9uTWFzdGVyLmphdmE=) | `0.00% <0.00%> (-18.06%)` | `0.00 <0.00> (-3.00)` | |
| [...rg/apache/gobblin/yarn/GobblinYarnAppLauncher.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbllhcm5BcHBMYXVuY2hlci5qYXZh) | `0.00% <0.00%> (-21.17%)` | `0.00 <0.00> (-8.00)` | |
| [...main/java/org/apache/gobblin/yarn/YarnService.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vWWFyblNlcnZpY2UuamF2YQ==) | `0.00% <0.00%> (-15.28%)` | `0.00 <0.00> (-4.00)` | |
| [...c/main/java/org/apache/gobblin/util/FileUtils.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi11dGlsaXR5L3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3V0aWwvRmlsZVV0aWxzLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...n/java/org/apache/gobblin/fork/CopyableSchema.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1jb3JlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2ZvcmsvQ29weWFibGVTY2hlbWEuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...java/org/apache/gobblin/stream/ControlMessage.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1hcGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vc3RyZWFtL0NvbnRyb2xNZXNzYWdlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...va/org/apache/gobblin/dataset/DatasetResolver.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1hcGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vZGF0YXNldC9EYXRhc2V0UmVzb2x2ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| ... and [1150 more](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree-more) | |

--

[Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-gobblin/pull/2940?src=pr=continue
[GitHub] [incubator-gobblin] sv2000 commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
sv2000 commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399594383

## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinTaskRunner.java ##
@@ -366,20 +377,74 @@ boolean isStopped() {
   }

   @VisibleForTesting
-  void connectHelixManager() {
-    try {
-      this.jobHelixManager.connect();
-      this.jobHelixManager.getMessagingService()
-          .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE,
-              new ParticipantShutdownMessageHandlerFactory());
-      this.jobHelixManager.getMessagingService()
-          .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(),
-              getUserDefinedMessageHandlerFactory());
-      if (this.taskDriverHelixManager.isPresent()) {
-        this.taskDriverHelixManager.get().connect();
+  void connectHelixManager() throws Exception {
+    this.jobHelixManager.connect();
+    //Ensure the instance is enabled.
+    this.jobHelixManager.getClusterManagmentTool().enableInstance(clusterName, helixInstanceName, true);
+    this.jobHelixManager.getMessagingService()
+        .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE,
+            new ParticipantShutdownMessageHandlerFactory());
+    this.jobHelixManager.getMessagingService()
+        .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(),
+            getUserDefinedMessageHandlerFactory());
+    if (this.taskDriverHelixManager.isPresent()) {
+      this.taskDriverHelixManager.get().connect();
+      //Ensure the instance is enabled.
+      this.taskDriverHelixManager.get().getClusterManagmentTool().enableInstance(this.taskDriverHelixManager.get().getClusterName(), helixInstanceName, true);
+    }
+  }
+
+  /**
+   * A method to handle failures joining Helix cluster. The method will perform the following steps before attempting
+   * to re-join the cluster:
+   *
+   * Disconnect from Helix cluster, which would close any open clients
+   * Drop instance from Helix cluster, to remove any partial instance structure from Helix
+   * Re-construct helix manager instances, used to re-join the cluster
+   *
+   */
+  private void onClusterJoinFailure() {
+    logger.warn("Disconnecting Helix manager..");
+    disconnectHelixManager();
+
+    HelixAdmin admin = new ZKHelixAdmin(clusterConfig.getString(GobblinClusterConfigurationKeys.ZK_CONNECTION_STRING_KEY));
+    //Drop the helix Instance
+    logger.warn("Dropping instance: {} from cluster: {}", helixInstanceName, clusterName);
+    HelixUtils.dropInstanceIfExists(admin, clusterName, helixInstanceName);

Review comment:
Changed it to return void. The dropInstanceIfExists method is intended to swallow HelixException that can only occur due to instance path not being present i.e instance does not exist.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
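The review comment above describes two ideas: a drop operation that treats "instance does not exist" as success, and a recovery path that cleans up partial instance state before re-joining the cluster. A minimal sketch of that pattern follows, under stated assumptions. It deliberately avoids the real Helix API: `InMemoryAdmin`, `InstanceMissingException`, `Joiner`, `joinWithRecovery`, and the `maxAttempts` parameter are all hypothetical names introduced for illustration, and the retry count is an assumption rather than something the diff shows.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: none of these types are Helix or Gobblin classes. */
final class JoinRecoverySketch {

  /** Hypothetical stand-in for the exception an admin client throws when the instance is absent. */
  static final class InstanceMissingException extends RuntimeException {
    InstanceMissingException(String message) {
      super(message);
    }
  }

  /** Hypothetical in-memory stand-in for a cluster admin client. */
  static final class InMemoryAdmin {
    private final Set<String> instances = new HashSet<>();

    void addInstance(String name) {
      instances.add(name);
    }

    boolean contains(String name) {
      return instances.contains(name);
    }

    void dropInstance(String name) {
      if (!instances.remove(name)) {
        throw new InstanceMissingException("No such instance: " + name);
      }
    }
  }

  /** Hypothetical connection step that may fail, for example on a stale instance entry. */
  interface Joiner {
    void join() throws Exception;
  }

  /** Drops the instance and treats "already absent" as success, so it returns void. */
  static void dropInstanceIfExists(InMemoryAdmin admin, String name) {
    try {
      admin.dropInstance(name);
    } catch (InstanceMissingException e) {
      // Nothing to clean up: the instance was never created or is already gone.
    }
  }

  /**
   * Tries to join; after each failure, drops any leftover instance state and
   * tries again, up to maxAttempts times. Returns the number of attempts used.
   */
  static int joinWithRecovery(Joiner joiner, InMemoryAdmin admin, String name, int maxAttempts) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        joiner.join();
        return attempt;
      } catch (Exception e) {
        last = e;
        dropInstanceIfExists(admin, name); // clean up before the next attempt
      }
    }
    throw new IllegalStateException("Could not join after " + maxAttempts + " attempts", last);
  }
}
```

Swallowing only the "missing instance" exception keeps the cleanup step idempotent while still letting any other failure surface to the caller.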
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411565=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411565 ]

ASF GitHub Bot logged work on GOBBLIN-1099:
---
Author: ASF GitHub Bot
Created on: 28/Mar/20 00:34
Start Date: 28/Mar/20 00:34
Worklog Time Spent: 10m
Work Description: sv2000 commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399594383

Issue Time Tracking
---
Worklog Id: (was: 411565)
Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411560=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411560 ]

ASF GitHub Bot logged work on GOBBLIN-1099:
---
Author: ASF GitHub Bot
Created on: 28/Mar/20 00:28
Start Date: 28/Mar/20 00:28
Worklog Time Spent: 10m
Work Description: sv2000 commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399593342

Issue Time Tracking
---
Worklog Id: (was: 411560)
Time Spent: 2h 10m (was: 2h)
[GitHub] [incubator-gobblin] sv2000 commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
sv2000 commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399593342

## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ##
@@ -625,7 +634,7 @@ private boolean shouldStickToTheSameNode(int containerExitStatus) {
    */
   protected void handleContainerCompletion(ContainerStatus containerStatus) {
     Map.Entry completedContainerEntry = this.containerMap.remove(containerStatus.getContainerId());
-    String completedInstanceName = completedContainerEntry == null? "unknown" : completedContainerEntry.getValue();
+    String completedInstanceName = completedContainerEntry == null? UNKNOWN_HELIX_INSTANCE : completedContainerEntry.getValue();

Review comment:
Added a comment.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
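The hunk above replaces the string literal `"unknown"` with a named constant when a completed container has no entry in the container map. A small self-contained sketch of that lookup is shown below. It is an illustration rather than the `YarnService` code: the class name `CompletedContainerLookup`, the `track` and `onCompletion` methods, and the use of plain strings for container IDs and handles are assumptions, and the constant's value here is a placeholder because the diff does not show how `UNKNOWN_HELIX_INSTANCE` is defined.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: a simplified stand-in, not the YarnService implementation. */
final class CompletedContainerLookup {

  // Placeholder value; the real constant's value is not visible in the diff.
  static final String UNKNOWN_HELIX_INSTANCE = "UNKNOWN";

  // Container ID -> (container handle, Helix instance name). Strings are a simplification.
  private final Map<String, Map.Entry<String, String>> containerMap = new ConcurrentHashMap<>();

  /** Records which Helix instance name a container was launched with. */
  void track(String containerId, String container, String instanceName) {
    containerMap.put(containerId, new AbstractMap.SimpleImmutableEntry<>(container, instanceName));
  }

  /** Removes the completed container and returns its instance name, or the sentinel if untracked. */
  String onCompletion(String containerId) {
    Map.Entry<String, String> entry = containerMap.remove(containerId);
    return entry == null ? UNKNOWN_HELIX_INSTANCE : entry.getValue();
  }
}
```

Naming the sentinel lets other code compare against the same constant instead of repeating a string literal.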
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411537=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411537 ]

ASF GitHub Bot logged work on GOBBLIN-1099:
---
Author: ASF GitHub Bot
Created on: 28/Mar/20 00:11
Start Date: 28/Mar/20 00:11
Worklog Time Spent: 10m
Work Description: sv2000 commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399590474

## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinTaskRunner.java ##
@@ -82,6 +91,7 @@ import org.apache.gobblin.util.JvmUtils;
 import org.apache.gobblin.util.reflection.GobblinConstructorUtils;

+

Review comment:
Fixed.

Issue Time Tracking
---
Worklog Id: (was: 411537)
Time Spent: 2h (was: 1h 50m)
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411535=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411535 ] ASF GitHub Bot logged work on GOBBLIN-1099: --- Author: ASF GitHub Bot Created on: 28/Mar/20 00:11 Start Date: 28/Mar/20 00:11 Worklog Time Spent: 10m Work Description: sv2000 commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus… URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399590389 ## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinTaskRunner.java ## @@ -366,20 +377,74 @@ boolean isStopped() { } @VisibleForTesting - void connectHelixManager() { -try { - this.jobHelixManager.connect(); - this.jobHelixManager.getMessagingService() - .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE, - new ParticipantShutdownMessageHandlerFactory()); - this.jobHelixManager.getMessagingService() - .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(), - getUserDefinedMessageHandlerFactory()); - if (this.taskDriverHelixManager.isPresent()) { -this.taskDriverHelixManager.get().connect(); + void connectHelixManager() throws Exception { +this.jobHelixManager.connect(); +//Ensure the instance is enabled. +this.jobHelixManager.getClusterManagmentTool().enableInstance(clusterName, helixInstanceName, true); +this.jobHelixManager.getMessagingService() + .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE, +new ParticipantShutdownMessageHandlerFactory()); +this.jobHelixManager.getMessagingService() + .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(), +getUserDefinedMessageHandlerFactory()); +if (this.taskDriverHelixManager.isPresent()) { + this.taskDriverHelixManager.get().connect(); + //Ensure the instance is enabled. 
+ this.taskDriverHelixManager.get().getClusterManagmentTool().enableInstance(this.taskDriverHelixManager.get().getClusterName(), helixInstanceName, true); +} + } + + /** + * A method to handle failures joining Helix cluster. The method will perform the following steps before attempting + * to re-join the cluster: + * + * Disconnect from Helix cluster, which would close any open clients + * Drop instance from Helix cluster, to remove any partial instance structure from Helix + * Re-construct helix manager instances, used to re-join the cluster + * + */ + private void onClusterJoinFailure() { +logger.warn("Disconnecting Helix manager.."); +disconnectHelixManager(); + +HelixAdmin admin = new ZKHelixAdmin(clusterConfig.getString(GobblinClusterConfigurationKeys.ZK_CONNECTION_STRING_KEY)); +//Drop the helix Instance +logger.warn("Dropping instance: {} from cluster: {}", helixInstanceName, clusterName); +HelixUtils.dropInstanceIfExists(admin, clusterName, helixInstanceName); + +if (this.taskDriverHelixManager.isPresent()) { + String taskDriverCluster = clusterConfig.getString(GobblinClusterConfigurationKeys.TASK_DRIVER_CLUSTER_NAME_KEY); + logger.warn("Dropping instance: {} from task driver cluster: {}", helixInstanceName, taskDriverCluster); + HelixUtils.dropInstanceIfExists(admin, taskDriverCluster, helixInstanceName); +} +admin.close(); + +logger.warn("Reinitializing Helix manager.."); +initHelixManager(); + } + + @VisibleForTesting + void connectHelixManagerWithRetry() { +Callable connectHelixManagerCallable = () -> { + try { +logger.info("Instance: {} attempting to join cluster: {}", helixInstanceName, clusterName); +connectHelixManager(); + } catch (HelixException e) { +logger.error("Exception encountered when joining cluster", e); +onClusterJoinFailure(); +throw e; } -} catch (Exception e) { - logger.error("HelixManager failed to connect", e); + return null; +}; + +Retryer retryer = RetryerBuilder.newBuilder() +.retryIfException() 
+.withStopStrategy(StopStrategies.stopAfterAttempt(5)).build(); + +try { + retryer.call(connectHelixManagerCallable); +} catch (ExecutionException | RetryException e) { + logger.error("Connecting to Helix manager failed", e); Review comment: Thanks! Will fix it.
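For readers following the diff, the join-with-retry flow can be sketched in plain Java without the retry library used in the patch. This is only an illustration under stated assumptions: `JoinRetrySketch`, `joinWithRetry`, and the simulated failure are hypothetical names, and the sketch runs the cleanup on any exception, whereas the patch calls `onClusterJoinFailure()` only for `HelixException`.

```java
import java.util.concurrent.Callable;

/** Plain-Java stand-in for the retry flow in the diff; every name here is hypothetical. */
final class JoinRetrySketch {

  /**
   * Runs attempt up to maxAttempts times. After every failed attempt (including the
   * last one) cleanup runs, mirroring where onClusterJoinFailure() sits in the diff.
   * Returns the number of attempts used, or throws once the attempts are exhausted.
   */
  static int joinWithRetry(Callable<Void> attempt, Runnable cleanup, int maxAttempts) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int i = 1; i <= maxAttempts; i++) {
      try {
        attempt.call();
        return i;
      } catch (Exception e) {
        last = e;
        cleanup.run();
      }
    }
    throw new IllegalStateException("gave up after " + maxAttempts + " attempts", last);
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Simulated join that fails twice and then succeeds.
    int used = joinWithRetry(() -> {
      if (++calls[0] < 3) {
        throw new IllegalStateException("simulated join failure");
      }
      return null;
    }, () -> System.out.println("cleanup: disconnect, drop instance, re-init manager"), 5);
    System.out.println("joined after " + used + " attempts");
  }
}
```

The limit of 5 in `main` matches `stopAfterAttempt(5)` in the diff; the simulated join succeeds on its third call, so the cleanup step runs twice before that.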
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411534&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411534 ]

ASF GitHub Bot logged work on GOBBLIN-1099:
---
Author: ASF GitHub Bot
Created on: 28/Mar/20 00:11
Start Date: 28/Mar/20 00:11
Worklog Time Spent: 10m
Work Description: sv2000 commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399590340

## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinYarnAppLauncher.java ##
@@ -479,6 +488,26 @@ void connectHelixManager() {
     }
   }

+  /**
+   * A method to disable pre-existing live instances in a Helix cluster. This can happen when a previous Yarn application
+   * leaves behind orphaned Yarn worker processes. Since Helix does not provide an API to drop a live instance, we use
+   * the disable instance API to fence off these orphaned instances and prevent them from becoming participants in the
+   * new cluster.
+   *
+   * NOTE: this is a workaround for an existing YARN bug. Once YARN has a fix to guarantee container kills on application
+   * completion, this method should be removed.
+   */
+  void disableLiveHelixInstances() {
+    String clusterName = this.helixManager.getClusterName();
+    HelixAdmin helixAdmin = this.helixManager.getClusterManagmentTool();
+    List<String> liveInstances = HelixUtils.getLiveInstances(this.helixManager);
+    LOGGER.warn("Found {} live instances in the cluster.", liveInstances.size());
+    for (String instanceName: liveInstances) {
+      LOGGER.warn("Disabling instance: {}", instanceName);
+      helixAdmin.enableInstance(clusterName, instanceName, false);

Review comment: No, there is none. The instance is part of the cluster, but disabled. It needs to be re-enabled to start getting task assignments from Helix.

Issue Time Tracking
---
Worklog Id: (was: 411534)
Time Spent: 1h 40m (was: 1.5h)
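The fencing idea in the Javadoc and the reply above (disable whatever is still live at launch, then have each new worker re-enable its own name when it joins) can be illustrated with a small in-memory model. This is a sketch under assumptions, not Helix code: `FencingSketch`, its `live` and `enabled` flags, and the worker names are all hypothetical, and instance-name collisions between an orphan and a new worker are not modelled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for a Helix cluster; all names here are hypothetical. */
final class FencingSketch {

  /** "live" mimics an active session, "enabled" mimics the per-instance enable flag. */
  static final class Instance {
    boolean live;
    boolean enabled = true;

    Instance(boolean live) {
      this.live = live;
    }
  }

  private final Map<String, Instance> instances = new LinkedHashMap<>();

  void register(String name, boolean live) {
    instances.put(name, new Instance(live));
  }

  /** Launcher side: disable every instance that is still live from a previous application. */
  List<String> disableLiveInstances() {
    List<String> fenced = new ArrayList<>();
    for (Map.Entry<String, Instance> e : instances.entrySet()) {
      if (e.getValue().live) {
        e.getValue().enabled = false;
        fenced.add(e.getKey());
      }
    }
    return fenced;
  }

  /** Worker side: a freshly started worker joins and explicitly re-enables its own name. */
  void join(String name) {
    Instance i = instances.computeIfAbsent(name, n -> new Instance(true));
    i.live = true;
    i.enabled = true;
  }

  /** Only instances that are both live and enabled are eligible for task assignment. */
  boolean isAssignable(String name) {
    Instance i = instances.get(name);
    return i != null && i.live && i.enabled;
  }

  public static void main(String[] args) {
    FencingSketch cluster = new FencingSketch();
    cluster.register("worker_1", true);   // orphan left over from the previous application
    cluster.register("worker_2", false);  // stale entry with no live session
    System.out.println("fenced: " + cluster.disableLiveInstances());
    cluster.join("worker_2");
    System.out.println("worker_1 assignable: " + cluster.isAssignable("worker_1"));
    System.out.println("worker_2 assignable: " + cluster.isAssignable("worker_2"));
  }
}
```

In this model the orphan stays a member of the cluster but is no longer assignable, which matches the reply: a disabled instance has to be re-enabled before it receives work again.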
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411531&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411531 ]

ASF GitHub Bot logged work on GOBBLIN-1099:
---
Author: ASF GitHub Bot
Created on: 28/Mar/20 00:09
Start Date: 28/Mar/20 00:09
Worklog Time Spent: 10m
Work Description: sv2000 commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r39958

## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ##
@@ -188,6 +193,7 @@
   // into the queue if the container running the instance completes. Unused Helix
   // instance names get picked up when replacement containers get allocated.
   private final ConcurrentLinkedQueue unusedHelixInstanceNames = Queues.newConcurrentLinkedQueue();
+  private boolean reuseUnusedHelixInstanceNames = true;

Review comment: Thanks! Removed this variable.

Issue Time Tracking
---
Worklog Id: (was: 411531)
Time Spent: 1.5h (was: 1h 20m)
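The code comment quoted in this message describes a queue of unused Helix instance names that replacement containers pick up. A minimal sketch of that recycling pattern follows; it is an assumption-laden illustration, with `InstanceNamePoolSketch`, `acquire`, `release`, and the `prefix_N` naming scheme all hypothetical rather than taken from `YarnService`.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical name pool illustrating reuse of released instance names. */
final class InstanceNamePoolSketch {
  private final ConcurrentLinkedQueue<String> unusedNames = new ConcurrentLinkedQueue<>();
  private final AtomicInteger nextId = new AtomicInteger(0);
  private final String prefix;

  InstanceNamePoolSketch(String prefix) {
    this.prefix = prefix;
  }

  /** On container allocation: reuse a released name if one is queued, otherwise mint a new one. */
  String acquire() {
    String reused = unusedNames.poll();
    return reused != null ? reused : prefix + "_" + nextId.getAndIncrement();
  }

  /** On container completion: make the name available to a replacement container. */
  void release(String name) {
    unusedNames.add(name);
  }

  public static void main(String[] args) {
    InstanceNamePoolSketch pool = new InstanceNamePoolSketch("worker");
    String first = pool.acquire();
    String second = pool.acquire();
    pool.release(first);
    System.out.println(first + ", " + second + ", then reused: " + pool.acquire());
  }
}
```

Reusing names keeps the set of instances known to the cluster bounded, but it is also why an orphaned container holding the same name can collide with its replacement, which is the third problem listed in the issue description.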
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411528=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411528 ] ASF GitHub Bot logged work on GOBBLIN-1099: --- Author: ASF GitHub Bot Created on: 28/Mar/20 00:04 Start Date: 28/Mar/20 00:04 Worklog Time Spent: 10m Work Description: sv2000 commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus… URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399589040 ## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinTaskRunner.java ## @@ -366,20 +377,74 @@ boolean isStopped() { } @VisibleForTesting - void connectHelixManager() { -try { - this.jobHelixManager.connect(); - this.jobHelixManager.getMessagingService() - .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE, - new ParticipantShutdownMessageHandlerFactory()); - this.jobHelixManager.getMessagingService() - .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(), - getUserDefinedMessageHandlerFactory()); - if (this.taskDriverHelixManager.isPresent()) { -this.taskDriverHelixManager.get().connect(); + void connectHelixManager() throws Exception { +this.jobHelixManager.connect(); +//Ensure the instance is enabled. +this.jobHelixManager.getClusterManagmentTool().enableInstance(clusterName, helixInstanceName, true); +this.jobHelixManager.getMessagingService() + .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE, +new ParticipantShutdownMessageHandlerFactory()); +this.jobHelixManager.getMessagingService() + .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(), +getUserDefinedMessageHandlerFactory()); +if (this.taskDriverHelixManager.isPresent()) { + this.taskDriverHelixManager.get().connect(); + //Ensure the instance is enabled. 
+ this.taskDriverHelixManager.get().getClusterManagmentTool().enableInstance(this.taskDriverHelixManager.get().getClusterName(), helixInstanceName, true); +} + } + + /** + * A method to handle failures joining Helix cluster. The method will perform the following steps before attempting + * to re-join the cluster: + * + * Disconnect from Helix cluster, which would close any open clients + * Drop instance from Helix cluster, to remove any partial instance structure from Helix + * Re-construct helix manager instances, used to re-join the cluster + * + */ + private void onClusterJoinFailure() { +logger.warn("Disconnecting Helix manager.."); +disconnectHelixManager(); + +HelixAdmin admin = new ZKHelixAdmin(clusterConfig.getString(GobblinClusterConfigurationKeys.ZK_CONNECTION_STRING_KEY)); Review comment: The intent was to avoid a state where the HelixManager is in an inconsistent state where the underlying ZkClient is connected but there is a failure later. By disconnecting before retrying, we ensure any partial state during the previous connect attempt is cleaned up. The reason we are instantiating a new HelixAdmin instance (separate from HelixManager) is that getClusterManagementTool() requires HelixManager to be connected first. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 411528) Time Spent: 1h 20m (was: 1h 10m)
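The reasoning in the reply above (disconnect first, then use an admin client that does not depend on the manager being connected) can be made concrete with a toy model. Everything below is hypothetical: `Manager`, `Admin`, and the shared `Set` that stands in for cluster state are illustrative only and are not the Helix classes.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of "manager-bound admin" versus "standalone admin"; all names are hypothetical. */
final class StandaloneAdminSketch {

  /** Standalone admin: works directly against the shared cluster state. */
  static final class Admin {
    private final Set<String> registry;

    Admin(Set<String> registry) {
      this.registry = registry;
    }

    /** Returns true if the instance existed and was dropped. */
    boolean dropInstanceIfExists(String name) {
      return registry.remove(name);
    }
  }

  /** Manager whose admin handle is only available while connected. */
  static final class Manager {
    private final Set<String> registry;
    private final String instanceName;
    private boolean connected;

    Manager(Set<String> registry, String instanceName) {
      this.registry = registry;
      this.instanceName = instanceName;
    }

    void connect() {
      registry.add(instanceName);
      connected = true;
    }

    void disconnect() {
      connected = false;
    }

    Admin admin() {
      if (!connected) {
        throw new IllegalStateException("manager must be connected first");
      }
      return new Admin(registry);
    }
  }

  /** Recovery: disconnect, drop leftovers via a standalone admin, hand back a fresh manager. */
  static Manager recover(Manager failed, Set<String> registry, String instanceName) {
    failed.disconnect();
    new Admin(registry).dropInstanceIfExists(instanceName);
    return new Manager(registry, instanceName);
  }

  public static void main(String[] args) {
    Set<String> registry = new HashSet<>();
    Manager first = new Manager(registry, "worker_1");
    first.connect();                       // leaves instance state behind
    Manager fresh = recover(first, registry, "worker_1");
    System.out.println("leftover state present: " + registry.contains("worker_1"));
    fresh.connect();
    System.out.println("rejoined: " + registry.contains("worker_1"));
  }
}
```

Once `disconnect()` has run, `admin()` on the old manager would throw in this model, which is the situation the reply describes and the reason the recovery path builds its own admin.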
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411517=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411517 ] ASF GitHub Bot logged work on GOBBLIN-1099: --- Author: ASF GitHub Bot Created on: 27/Mar/20 23:41 Start Date: 27/Mar/20 23:41 Worklog Time Spent: 10m Work Description: autumnust commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus… URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399583366 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ## @@ -635,10 +644,26 @@ protected void handleContainerCompletion(ContainerStatus containerStatus) { containerStatus.getContainerId(), containerStatus.getDiagnostics())); } -if (this.releasedContainerCache.getIfPresent(containerStatus.getContainerId()) != null) { - LOGGER.info("Container release requested, so not spawning a replacement for containerId {}", - containerStatus.getContainerId()); - return; +if (containerStatus.getExitStatus() == ContainerExitStatus.ABORTED) { + if (this.releasedContainerCache.getIfPresent(containerStatus.getContainerId()) != null) { +LOGGER.info("Container release requested, so not spawning a replacement for containerId {}", containerStatus.getContainerId()); +return; + } else { +LOGGER.info("Container {} aborted due to lost NM", containerStatus.getContainerId()); + // Container release was not requested. Likely, the container was running on a node on which the NM died. + // In this case, RM assumes that the containers are "lost", even though the container process may still be +// running on the node. We need to ensure that the Helix instances running on the orphaned containers +// are fenced off from the Helix cluster to avoid double publishing and state being committed by the +// instances. 
+if (!UNKNOWN_HELIX_INSTANCE.equals(completedInstanceName)) { + String clusterName = this.helixManager.getClusterName(); + //Disable the orphaned instance. + if (HelixUtils.isInstanceLive(helixManager, completedInstanceName)) { +LOGGER.info("Disabling the Helix instance {}", completedInstanceName); + this.helixManager.getClusterManagmentTool().enableInstance(clusterName, completedInstanceName, false); Review comment: enableInstance, This is a very good API :) Issue Time Tracking --- Worklog Id: (was: 411517) Time Spent: 1h 10m (was: 1h)
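The branching in the quoted hunk can be summarised as a small decision function. This is a sketch of only the part of the diff visible in this message; what the patch does after disabling the instance is not shown, so the sketch just returns the decision. `ContainerCompletionSketch`, the `Action` names, and the placeholder value of `UNKNOWN_HELIX_INSTANCE` are hypothetical, and `ABORTED = -100` is assumed to mirror YARN's `ContainerExitStatus.ABORTED`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Decision logic paraphrased from the visible part of the hunk; names are hypothetical. */
final class ContainerCompletionSketch {
  /** Assumed to mirror YARN's ContainerExitStatus.ABORTED. */
  static final int ABORTED = -100;
  /** Placeholder value; the real constant's value is not shown in the diff. */
  static final String UNKNOWN_HELIX_INSTANCE = "UNKNOWN";

  enum Action { SKIP_REPLACEMENT, FENCE_INSTANCE, NO_FENCING }

  static Action decide(int exitStatus, boolean releaseRequested, String instanceName,
      Set<String> liveInstances) {
    if (exitStatus == ABORTED) {
      if (releaseRequested) {
        // The application asked for the release, so no replacement is spawned.
        return Action.SKIP_REPLACEMENT;
      }
      // Not requested: likely a lost node manager, so the process may still be running.
      if (!UNKNOWN_HELIX_INSTANCE.equals(instanceName) && liveInstances.contains(instanceName)) {
        return Action.FENCE_INSTANCE;
      }
    }
    return Action.NO_FENCING;
  }

  public static void main(String[] args) {
    Set<String> live = new HashSet<>(Arrays.asList("worker_1"));
    System.out.println(decide(ABORTED, true, "worker_1", live));
    System.out.println(decide(ABORTED, false, "worker_1", live));
    System.out.println(decide(ABORTED, false, "worker_2", live));
    System.out.println(decide(1, false, "worker_1", live));
  }
}
```

Checking liveness before fencing matters because an instance that has already left the cluster cannot publish or commit anything, so disabling it would only leave a disabled entry behind.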
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411515=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411515 ]

ASF GitHub Bot logged work on GOBBLIN-1099:
---
Author: ASF GitHub Bot
Created on: 27/Mar/20 23:41
Start Date: 27/Mar/20 23:41
Worklog Time Spent: 10m
Work Description: autumnust commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399580058

## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinTaskRunner.java ##
@@ -366,20 +377,74 @@ boolean isStopped() {
   }

   @VisibleForTesting
-  void connectHelixManager() {
-    try {
-      this.jobHelixManager.connect();
-      this.jobHelixManager.getMessagingService()
-          .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE,
-              new ParticipantShutdownMessageHandlerFactory());
-      this.jobHelixManager.getMessagingService()
-          .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(),
-              getUserDefinedMessageHandlerFactory());
-      if (this.taskDriverHelixManager.isPresent()) {
-        this.taskDriverHelixManager.get().connect();
+  void connectHelixManager() throws Exception {
+    this.jobHelixManager.connect();
+    //Ensure the instance is enabled.
+    this.jobHelixManager.getClusterManagmentTool().enableInstance(clusterName, helixInstanceName, true);
+    this.jobHelixManager.getMessagingService()
+        .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE,
+            new ParticipantShutdownMessageHandlerFactory());
+    this.jobHelixManager.getMessagingService()
+        .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(),
+            getUserDefinedMessageHandlerFactory());
+    if (this.taskDriverHelixManager.isPresent()) {
+      this.taskDriverHelixManager.get().connect();
+      //Ensure the instance is enabled.
+      this.taskDriverHelixManager.get().getClusterManagmentTool().enableInstance(this.taskDriverHelixManager.get().getClusterName(), helixInstanceName, true);
+    }
+  }
+
+  /**
+   * A method to handle failures joining Helix cluster. The method will perform the following steps before attempting
+   * to re-join the cluster:
+   *
+   * Disconnect from Helix cluster, which would close any open clients
+   * Drop instance from Helix cluster, to remove any partial instance structure from Helix
+   * Re-construct helix manager instances, used to re-join the cluster
+   *
+   */
+  private void onClusterJoinFailure() {
+    logger.warn("Disconnecting Helix manager..");
+    disconnectHelixManager();
+
+    HelixAdmin admin = new ZKHelixAdmin(clusterConfig.getString(GobblinClusterConfigurationKeys.ZK_CONNECTION_STRING_KEY));
+    //Drop the helix Instance
+    logger.warn("Dropping instance: {} from cluster: {}", helixInstanceName, clusterName);
+    HelixUtils.dropInstanceIfExists(admin, clusterName, helixInstanceName);

Review comment:
The return value of this method is ignored; is that intentional? Or shall we leave the exception handling to the caller of this static method?

Issue Time Tracking
---
Worklog Id: (was: 411515)
Time Spent: 1h 10m (was: 1h)
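As a reader's note on the review comment above: the sketch below shows one way a caller could act on the result of a "drop if exists" helper instead of discarding it. It assumes the helper returns a boolean, which the thread does not confirm; `DropInstanceSketch` and its in-memory instance set are hypothetical stand-ins and do not use the Helix API.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, in-memory stand-in for a cluster's registered instances.
// It is NOT the Helix API; it only mirrors the shape of a "drop if exists" helper.
class DropInstanceSketch {
  private final Set<String> instances = new HashSet<>();

  void register(String name) {
    instances.add(name);
  }

  // Assumption: the helper reports whether anything was actually removed.
  boolean dropInstanceIfExists(String name) {
    return instances.remove(name);
  }

  // A caller that inspects the result rather than ignoring it.
  String dropAndReport(String name) {
    return dropInstanceIfExists(name)
        ? "dropped " + name
        : "nothing to drop for " + name;
  }
}
```

Whether the real caller should log, retry, or rethrow on a negative result is a design decision for the PR author; the sketch only makes the two outcomes visible.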
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411514=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411514 ] ASF GitHub Bot logged work on GOBBLIN-1099: --- Author: ASF GitHub Bot Created on: 27/Mar/20 23:41 Start Date: 27/Mar/20 23:41 Worklog Time Spent: 10m Work Description: autumnust commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus… URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399582595 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ## @@ -625,7 +634,7 @@ private boolean shouldStickToTheSameNode(int containerExitStatus) { */ protected void handleContainerCompletion(ContainerStatus containerStatus) { Map.Entry completedContainerEntry = this.containerMap.remove(containerStatus.getContainerId()); -String completedInstanceName = completedContainerEntry == null? "unknown" : completedContainerEntry.getValue(); +String completedInstanceName = completedContainerEntry == null? UNKNOWN_HELIX_INSTANCE : completedContainerEntry.getValue(); Review comment: Just curious, What kind of situation result in a containerId is not in containerMap? can we also add javadoc here This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking
---
Worklog Id: (was: 411514)
Time Spent: 1h 10m (was: 1h)
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411512=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411512 ] ASF GitHub Bot logged work on GOBBLIN-1099: --- Author: ASF GitHub Bot Created on: 27/Mar/20 23:41 Start Date: 27/Mar/20 23:41 Worklog Time Spent: 10m Work Description: autumnust commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus… URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399578539 ## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinTaskRunner.java ## @@ -366,20 +377,74 @@ boolean isStopped() { } @VisibleForTesting - void connectHelixManager() { -try { - this.jobHelixManager.connect(); - this.jobHelixManager.getMessagingService() - .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE, - new ParticipantShutdownMessageHandlerFactory()); - this.jobHelixManager.getMessagingService() - .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(), - getUserDefinedMessageHandlerFactory()); - if (this.taskDriverHelixManager.isPresent()) { -this.taskDriverHelixManager.get().connect(); + void connectHelixManager() throws Exception { +this.jobHelixManager.connect(); +//Ensure the instance is enabled. +this.jobHelixManager.getClusterManagmentTool().enableInstance(clusterName, helixInstanceName, true); +this.jobHelixManager.getMessagingService() + .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE, +new ParticipantShutdownMessageHandlerFactory()); +this.jobHelixManager.getMessagingService() + .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(), +getUserDefinedMessageHandlerFactory()); +if (this.taskDriverHelixManager.isPresent()) { + this.taskDriverHelixManager.get().connect(); + //Ensure the instance is enabled. 
+ this.taskDriverHelixManager.get().getClusterManagmentTool().enableInstance(this.taskDriverHelixManager.get().getClusterName(), helixInstanceName, true); +} + } + + /** + * A method to handle failures joining Helix cluster. The method will perform the following steps before attempting + * to re-join the cluster: + * + * Disconnect from Helix cluster, which would close any open clients + * Drop instance from Helix cluster, to remove any partial instance structure from Helix + * Re-construct helix manager instances, used to re-join the cluster + * + */ + private void onClusterJoinFailure() { +logger.warn("Disconnecting Helix manager.."); +disconnectHelixManager(); + +HelixAdmin admin = new ZKHelixAdmin(clusterConfig.getString(GobblinClusterConfigurationKeys.ZK_CONNECTION_STRING_KEY)); Review comment: I couldn't fully get the purpose of disconnect from existing helixManager, creating a HelixAdmin and use the newly-created admin to conduct instance-dropping. Isn't `helixManager.getClusterManagmentTool()` essentially a same HelixAdmin that we can use to drop instance ? What's the gain of disconnecting helixManager? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 411512) Time Spent: 1h (was: 50m) > Handle orphaned Yarn containers in Gobblin-on-Yarn clusters > --- > > Key: GOBBLIN-1099 > URL: https://issues.apache.org/jira/browse/GOBBLIN-1099 > Project: Apache Gobblin > Issue Type: Improvement > Components: gobblin-yarn >Affects Versions: 0.15.0 >Reporter: Sudarshan Vasudevan >Assignee: Abhishek Tiwari >Priority: Major > Fix For: 0.15.0 > > Time Spent: 1h > Remaining Estimate: 0h > > A Yarn application may leave behind orphaned containers, which can happen due > to lost node managers. 
[GitHub] [incubator-gobblin] autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399582595

## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ##
@@ -625,7 +634,7 @@ private boolean shouldStickToTheSameNode(int containerExitStatus) {
    */
   protected void handleContainerCompletion(ContainerStatus containerStatus) {
     Map.Entry completedContainerEntry = this.containerMap.remove(containerStatus.getContainerId());
-    String completedInstanceName = completedContainerEntry == null? "unknown" : completedContainerEntry.getValue();
+    String completedInstanceName = completedContainerEntry == null? UNKNOWN_HELIX_INSTANCE : completedContainerEntry.getValue();

Review comment:
Just curious: what kind of situation results in a containerId not being in containerMap? Can we also add Javadoc here?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
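A reader's sketch of the lookup discussed above, under simplifying assumptions: the map here goes straight from a container id string to an instance name (the real `containerMap` stores `Map.Entry` values keyed by `ContainerId`), and the value of `UNKNOWN_HELIX_INSTANCE` is made up for illustration. In this toy model, one way an id can be missing is that the same completion is handled twice.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CompletedInstanceLookup {
  // Assumption: the actual constant value in YarnService is not shown in the thread.
  static final String UNKNOWN_HELIX_INSTANCE = "UNKNOWN";

  // Removes the container from the map and resolves its instance name,
  // falling back to the sentinel when the id was never recorded or already removed.
  static String completedInstanceName(Map<String, String> containerMap, String containerId) {
    String instance = containerMap.remove(containerId);
    return instance == null ? UNKNOWN_HELIX_INSTANCE : instance;
  }

  static Map<String, String> newContainerMap() {
    return new ConcurrentHashMap<>();
  }
}
```

Using a named sentinel rather than the literal `"unknown"` lets later code compare against it with `equals`, which is what the fencing branch in the same PR relies on.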
[GitHub] [incubator-gobblin] autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399575480

## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinTaskRunner.java ##
@@ -82,6 +91,7 @@
 import org.apache.gobblin.util.JvmUtils;
 import org.apache.gobblin.util.reflection.GobblinConstructorUtils;

+

Review comment:
Empty line.
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411513=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411513 ] ASF GitHub Bot logged work on GOBBLIN-1099: --- Author: ASF GitHub Bot Created on: 27/Mar/20 23:41 Start Date: 27/Mar/20 23:41 Worklog Time Spent: 10m Work Description: autumnust commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus… URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399583642 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ## @@ -188,6 +193,7 @@ // into the queue if the container running the instance completes. Unused Helix // instance names get picked up when replacement containers get allocated. private final ConcurrentLinkedQueue unusedHelixInstanceNames = Queues.newConcurrentLinkedQueue(); + private boolean reuseUnusedHelixInstanceNames = true; Review comment: what's the usage of this? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 411513) Time Spent: 1h 10m (was: 1h) > Handle orphaned Yarn containers in Gobblin-on-Yarn clusters > --- > > Key: GOBBLIN-1099 > URL: https://issues.apache.org/jira/browse/GOBBLIN-1099 > Project: Apache Gobblin > Issue Type: Improvement > Components: gobblin-yarn >Affects Versions: 0.15.0 >Reporter: Sudarshan Vasudevan >Assignee: Abhishek Tiwari >Priority: Major > Fix For: 0.15.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > A Yarn application may leave behind orphaned containers, which can happen due > to lost node managers. The orphaned containers however can continue to run > (potentially forever) as participants in the Helix cluster. 
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411510=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411510 ] ASF GitHub Bot logged work on GOBBLIN-1099: --- Author: ASF GitHub Bot Created on: 27/Mar/20 23:41 Start Date: 27/Mar/20 23:41 Worklog Time Spent: 10m Work Description: autumnust commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus… URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399575480 ## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinTaskRunner.java ## @@ -82,6 +91,7 @@ import org.apache.gobblin.util.JvmUtils; import org.apache.gobblin.util.reflection.GobblinConstructorUtils; + Review comment: Empty line. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 411510) Time Spent: 50m (was: 40m) > Handle orphaned Yarn containers in Gobblin-on-Yarn clusters > --- > > Key: GOBBLIN-1099 > URL: https://issues.apache.org/jira/browse/GOBBLIN-1099 > Project: Apache Gobblin > Issue Type: Improvement > Components: gobblin-yarn >Affects Versions: 0.15.0 >Reporter: Sudarshan Vasudevan >Assignee: Abhishek Tiwari >Priority: Major > Fix For: 0.15.0 > > Time Spent: 50m > Remaining Estimate: 0h > > A Yarn application may leave behind orphaned containers, which can happen due > to lost node managers. The orphaned containers however can continue to run > (potentially forever) as participants in the Helix cluster. 
[GitHub] [incubator-gobblin] autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399583366

## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ##
@@ -635,10 +644,26 @@ protected void handleContainerCompletion(ContainerStatus containerStatus) {
           containerStatus.getContainerId(), containerStatus.getDiagnostics()));
     }

-    if (this.releasedContainerCache.getIfPresent(containerStatus.getContainerId()) != null) {
-      LOGGER.info("Container release requested, so not spawning a replacement for containerId {}",
-          containerStatus.getContainerId());
-      return;
+    if (containerStatus.getExitStatus() == ContainerExitStatus.ABORTED) {
+      if (this.releasedContainerCache.getIfPresent(containerStatus.getContainerId()) != null) {
+        LOGGER.info("Container release requested, so not spawning a replacement for containerId {}", containerStatus.getContainerId());
+        return;
+      } else {
+        LOGGER.info("Container {} aborted due to lost NM", containerStatus.getContainerId());
+        // Container release was not requested. Likely, the container was running on a node on which the NM died.
+        // In this case, RM assumes that the containers are "lost", even though the container process may still be
+        // running on the node. We need to ensure that the Helix instances running on the orphaned containers
+        // are fenced off from the Helix cluster to avoid double publishing and state being committed by the
+        // instances.
+        if (!UNKNOWN_HELIX_INSTANCE.equals(completedInstanceName)) {
+          String clusterName = this.helixManager.getClusterName();
+          //Disable the orphaned instance.
+          if (HelixUtils.isInstanceLive(helixManager, completedInstanceName)) {
+            LOGGER.info("Disabling the Helix instance {}", completedInstanceName);
+            this.helixManager.getClusterManagmentTool().enableInstance(clusterName, completedInstanceName, false);

Review comment:
enableInstance, This is a very good API :)
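The branching in the hunk above can be restated as a small pure function, which may make the three outcomes easier to see. This is a reader's sketch, not code from the PR: the `Action` enum and `classify` method are hypothetical, and `ABORTED` is hard-coded to -100 (the value of YARN's `ContainerExitStatus.ABORTED`) so the example does not depend on Hadoop. What happens for non-aborted exits is not shown in the hunk, so the sketch only says that no fencing applies there.

```java
class ContainerCompletionSketch {
  // Hypothetical outcome labels for illustration only.
  enum Action { SKIP_REPLACEMENT, FENCE_INSTANCE, NO_FENCING }

  // Mirrors YARN's ContainerExitStatus.ABORTED; hard-coded to stay self-contained.
  static final int ABORTED = -100;

  static Action classify(int exitStatus, boolean releaseRequested,
                         boolean instanceKnown, boolean instanceLive) {
    if (exitStatus != ABORTED) {
      return Action.NO_FENCING;          // the hunk only fences on ABORTED
    }
    if (releaseRequested) {
      return Action.SKIP_REPLACEMENT;    // we asked for the release ourselves
    }
    // Aborted without a release request: likely a lost NodeManager.
    // Fence only when the instance name is known and still live in the cluster.
    return (instanceKnown && instanceLive) ? Action.FENCE_INSTANCE : Action.NO_FENCING;
  }
}
```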
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411511=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411511 ] ASF GitHub Bot logged work on GOBBLIN-1099: --- Author: ASF GitHub Bot Created on: 27/Mar/20 23:41 Start Date: 27/Mar/20 23:41 Worklog Time Spent: 10m Work Description: autumnust commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus… URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399579077 ## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinTaskRunner.java ## @@ -366,20 +377,74 @@ boolean isStopped() { } @VisibleForTesting - void connectHelixManager() { -try { - this.jobHelixManager.connect(); - this.jobHelixManager.getMessagingService() - .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE, - new ParticipantShutdownMessageHandlerFactory()); - this.jobHelixManager.getMessagingService() - .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(), - getUserDefinedMessageHandlerFactory()); - if (this.taskDriverHelixManager.isPresent()) { -this.taskDriverHelixManager.get().connect(); + void connectHelixManager() throws Exception { +this.jobHelixManager.connect(); +//Ensure the instance is enabled. +this.jobHelixManager.getClusterManagmentTool().enableInstance(clusterName, helixInstanceName, true); +this.jobHelixManager.getMessagingService() + .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE, +new ParticipantShutdownMessageHandlerFactory()); +this.jobHelixManager.getMessagingService() + .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(), +getUserDefinedMessageHandlerFactory()); +if (this.taskDriverHelixManager.isPresent()) { + this.taskDriverHelixManager.get().connect(); + //Ensure the instance is enabled. 
+ this.taskDriverHelixManager.get().getClusterManagmentTool().enableInstance(this.taskDriverHelixManager.get().getClusterName(), helixInstanceName, true); +} + } + + /** + * A method to handle failures joining Helix cluster. The method will perform the following steps before attempting + * to re-join the cluster: + * + * Disconnect from Helix cluster, which would close any open clients + * Drop instance from Helix cluster, to remove any partial instance structure from Helix + * Re-construct helix manager instances, used to re-join the cluster + * + */ + private void onClusterJoinFailure() { +logger.warn("Disconnecting Helix manager.."); +disconnectHelixManager(); + +HelixAdmin admin = new ZKHelixAdmin(clusterConfig.getString(GobblinClusterConfigurationKeys.ZK_CONNECTION_STRING_KEY)); +//Drop the helix Instance +logger.warn("Dropping instance: {} from cluster: {}", helixInstanceName, clusterName); +HelixUtils.dropInstanceIfExists(admin, clusterName, helixInstanceName); + +if (this.taskDriverHelixManager.isPresent()) { + String taskDriverCluster = clusterConfig.getString(GobblinClusterConfigurationKeys.TASK_DRIVER_CLUSTER_NAME_KEY); + logger.warn("Dropping instance: {} from task driver cluster: {}", helixInstanceName, taskDriverCluster); + HelixUtils.dropInstanceIfExists(admin, taskDriverCluster, helixInstanceName); +} +admin.close(); + +logger.warn("Reinitializing Helix manager.."); +initHelixManager(); + } + + @VisibleForTesting + void connectHelixManagerWithRetry() { +Callable connectHelixManagerCallable = () -> { + try { +logger.info("Instance: {} attempting to join cluster: {}", helixInstanceName, clusterName); +connectHelixManager(); + } catch (HelixException e) { +logger.error("Exception encountered when joining cluster", e); +onClusterJoinFailure(); +throw e; } -} catch (Exception e) { - logger.error("HelixManager failed to connect", e); + return null; +}; + +Retryer retryer = RetryerBuilder.newBuilder() +.retryIfException() 
+.withStopStrategy(StopStrategies.stopAfterAttempt(5)).build(); + +try { + retryer.call(connectHelixManagerCallable); +} catch (ExecutionException | RetryException e) { + logger.error("Connecting to Helix manager failed", e); Review comment: let's avoid log.error and having an unchecked exception propagated at the same time: It doubled stack traces which is not very necessary and confused reader on how many times the exception actually happen. This is an automated mess
[GitHub] [incubator-gobblin] autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399583642

## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ##
@@ -188,6 +193,7 @@
   // into the queue if the container running the instance completes. Unused Helix
   // instance names get picked up when replacement containers get allocated.
   private final ConcurrentLinkedQueue unusedHelixInstanceNames = Queues.newConcurrentLinkedQueue();
+  private boolean reuseUnusedHelixInstanceNames = true;

Review comment:
what's the usage of this?
[GitHub] [incubator-gobblin] autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399579077

## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinTaskRunner.java ##
@@ -366,20 +377,74 @@ boolean isStopped() {
   }

   @VisibleForTesting
-  void connectHelixManager() {
-    try {
-      this.jobHelixManager.connect();
-      this.jobHelixManager.getMessagingService()
-          .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE,
-              new ParticipantShutdownMessageHandlerFactory());
-      this.jobHelixManager.getMessagingService()
-          .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(),
-              getUserDefinedMessageHandlerFactory());
-      if (this.taskDriverHelixManager.isPresent()) {
-        this.taskDriverHelixManager.get().connect();
+  void connectHelixManager() throws Exception {
+    this.jobHelixManager.connect();
+    //Ensure the instance is enabled.
+    this.jobHelixManager.getClusterManagmentTool().enableInstance(clusterName, helixInstanceName, true);
+    this.jobHelixManager.getMessagingService()
+        .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE,
+            new ParticipantShutdownMessageHandlerFactory());
+    this.jobHelixManager.getMessagingService()
+        .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(),
+            getUserDefinedMessageHandlerFactory());
+    if (this.taskDriverHelixManager.isPresent()) {
+      this.taskDriverHelixManager.get().connect();
+      //Ensure the instance is enabled.
+      this.taskDriverHelixManager.get().getClusterManagmentTool().enableInstance(this.taskDriverHelixManager.get().getClusterName(), helixInstanceName, true);
+    }
+  }
+
+  /**
+   * A method to handle failures joining Helix cluster. The method will perform the following steps before attempting
+   * to re-join the cluster:
+   *
+   * Disconnect from Helix cluster, which would close any open clients
+   * Drop instance from Helix cluster, to remove any partial instance structure from Helix
+   * Re-construct helix manager instances, used to re-join the cluster
+   *
+   */
+  private void onClusterJoinFailure() {
+    logger.warn("Disconnecting Helix manager..");
+    disconnectHelixManager();
+
+    HelixAdmin admin = new ZKHelixAdmin(clusterConfig.getString(GobblinClusterConfigurationKeys.ZK_CONNECTION_STRING_KEY));
+    //Drop the helix Instance
+    logger.warn("Dropping instance: {} from cluster: {}", helixInstanceName, clusterName);
+    HelixUtils.dropInstanceIfExists(admin, clusterName, helixInstanceName);
+
+    if (this.taskDriverHelixManager.isPresent()) {
+      String taskDriverCluster = clusterConfig.getString(GobblinClusterConfigurationKeys.TASK_DRIVER_CLUSTER_NAME_KEY);
+      logger.warn("Dropping instance: {} from task driver cluster: {}", helixInstanceName, taskDriverCluster);
+      HelixUtils.dropInstanceIfExists(admin, taskDriverCluster, helixInstanceName);
+    }
+    admin.close();
+
+    logger.warn("Reinitializing Helix manager..");
+    initHelixManager();
+  }
+
+  @VisibleForTesting
+  void connectHelixManagerWithRetry() {
+    Callable connectHelixManagerCallable = () -> {
+      try {
+        logger.info("Instance: {} attempting to join cluster: {}", helixInstanceName, clusterName);
+        connectHelixManager();
+      } catch (HelixException e) {
+        logger.error("Exception encountered when joining cluster", e);
+        onClusterJoinFailure();
+        throw e;
       }
-    } catch (Exception e) {
-      logger.error("HelixManager failed to connect", e);
+      return null;
+    };
+
+    Retryer retryer = RetryerBuilder.newBuilder()
+        .retryIfException()
+        .withStopStrategy(StopStrategies.stopAfterAttempt(5)).build();
+
+    try {
+      retryer.call(connectHelixManagerCallable);
+    } catch (ExecutionException | RetryException e) {
+      logger.error("Connecting to Helix manager failed", e);

Review comment:
Let's avoid calling log.error and propagating an unchecked exception at the same time: it doubles the stack traces, which is unnecessary and confuses readers about how many times the exception actually happened.
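To illustrate the reviewer's point, here is a reader's sketch of a bounded retry loop that runs a cleanup hook after each failed attempt and surfaces the failure exactly once, by throwing with the last cause attached, rather than both logging and rethrowing. It uses plain JDK types instead of the `Retryer` library in the diff; `BoundedRetrySketch` and `callWithRetry` are hypothetical names.

```java
import java.util.concurrent.Callable;

class BoundedRetrySketch {
  // Runs the task up to maxAttempts times. After each failure the cleanup hook runs
  // (in the PR this role is played by onClusterJoinFailure). If every attempt fails,
  // a single unchecked exception is thrown carrying the last failure as its cause,
  // so the stack trace is reported once by whoever handles it.
  static <T> T callWithRetry(Callable<T> task, Runnable onFailure, int maxAttempts) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return task.call();
      } catch (Exception e) {
        last = e;
        onFailure.run();
      }
    }
    throw new IllegalStateException("Gave up after " + maxAttempts + " attempts", last);
  }
}
```

The sketch retries immediately; a real implementation would probably add a wait between attempts, which the diff as quoted does not configure either.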
[GitHub] [incubator-gobblin] autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399578539

## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinTaskRunner.java ##
@@ -366,20 +377,74 @@ boolean isStopped() {
   }

   @VisibleForTesting
-  void connectHelixManager() {
-    try {
-      this.jobHelixManager.connect();
-      this.jobHelixManager.getMessagingService()
-          .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE,
-              new ParticipantShutdownMessageHandlerFactory());
-      this.jobHelixManager.getMessagingService()
-          .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(),
-              getUserDefinedMessageHandlerFactory());
-      if (this.taskDriverHelixManager.isPresent()) {
-        this.taskDriverHelixManager.get().connect();
+  void connectHelixManager() throws Exception {
+    this.jobHelixManager.connect();
+    //Ensure the instance is enabled.
+    this.jobHelixManager.getClusterManagmentTool().enableInstance(clusterName, helixInstanceName, true);
+    this.jobHelixManager.getMessagingService()
+        .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE,
+            new ParticipantShutdownMessageHandlerFactory());
+    this.jobHelixManager.getMessagingService()
+        .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(),
+            getUserDefinedMessageHandlerFactory());
+    if (this.taskDriverHelixManager.isPresent()) {
+      this.taskDriverHelixManager.get().connect();
+      //Ensure the instance is enabled.
+      this.taskDriverHelixManager.get().getClusterManagmentTool().enableInstance(this.taskDriverHelixManager.get().getClusterName(), helixInstanceName, true);
+    }
+  }
+
+  /**
+   * A method to handle failures joining Helix cluster. The method will perform the following steps before attempting
+   * to re-join the cluster:
+   *
+   * Disconnect from Helix cluster, which would close any open clients
+   * Drop instance from Helix cluster, to remove any partial instance structure from Helix
+   * Re-construct helix manager instances, used to re-join the cluster
+   *
+   */
+  private void onClusterJoinFailure() {
+    logger.warn("Disconnecting Helix manager..");
+    disconnectHelixManager();
+
+    HelixAdmin admin = new ZKHelixAdmin(clusterConfig.getString(GobblinClusterConfigurationKeys.ZK_CONNECTION_STRING_KEY));

Review comment:
I couldn't fully understand the purpose of disconnecting from the existing helixManager, creating a HelixAdmin, and using the newly created admin to drop the instance. Isn't `helixManager.getClusterManagmentTool()` essentially the same HelixAdmin that we could use to drop the instance? What's the gain of disconnecting helixManager?
[GitHub] [incubator-gobblin] autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399584405

## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinYarnAppLauncher.java ##
@@ -479,6 +488,26 @@ void connectHelixManager() {
     }
   }

+  /**
+   * A method to disable pre-existing live instances in a Helix cluster. This can happen when a previous Yarn application
+   * leaves behind orphaned Yarn worker processes. Since Helix does not provide an API to drop a live instance, we use
+   * the disable instance API to fence off these orphaned instances and prevent them from becoming participants in the
+   * new cluster.
+   *
+   * NOTE: this is a workaround for an existing YARN bug. Once YARN has a fix to guarantee container kills on application
+   * completion, this method should be removed.
+   */
+  void disableLiveHelixInstances() {
+    String clusterName = this.helixManager.getClusterName();
+    HelixAdmin helixAdmin = this.helixManager.getClusterManagmentTool();
+    List<String> liveInstances = HelixUtils.getLiveInstances(this.helixManager);
+    LOGGER.warn("Found {} live instances in the cluster.", liveInstances.size());
+    for (String instanceName: liveInstances) {
+      LOGGER.warn("Disabling instance: {}", instanceName);
+      helixAdmin.enableInstance(clusterName, instanceName, false);

Review comment:
Is there any mechanism on the Helix side for an instance to retry joining the cluster after it has been disabled? Just want to make sure.
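One way to read how the two sides of this PR interact: the launcher disables every instance that is live at start-up, and, per the `GobblinTaskRunner.connectHelixManager()` diff in the same PR, a worker calls `enableInstance(clusterName, helixInstanceName, true)` on its own name right after connecting. The in-memory model below sketches only that interplay. `FencingSketch` and its methods are hypothetical names invented for illustration, and treating an unknown instance as enabled by default is an assumption of the sketch, not a statement about Helix.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative in-memory model; the class and its methods are hypothetical, not Helix APIs. */
final class FencingSketch {
  // Instance name -> enabled flag, standing in for per-instance state kept by the cluster manager.
  private final Map<String, Boolean> enabled = new LinkedHashMap<>();

  /** Launcher side: fence off every instance that is still live when a new application starts. */
  void disableLiveInstances(List<String> liveInstances) {
    for (String name : liveInstances) {
      enabled.put(name, false);
    }
  }

  /** Worker side: a newly started worker re-enables its own instance name after connecting. */
  void onWorkerConnected(String name) {
    enabled.put(name, true);
  }

  /** Assumption of this sketch: an instance never seen before counts as enabled. */
  boolean isEnabled(String name) {
    return enabled.getOrDefault(name, true);
  }

  public static void main(String[] args) {
    FencingSketch cluster = new FencingSketch();
    // Two orphaned workers from a previous application are still live.
    cluster.disableLiveInstances(Arrays.asList("worker_1", "worker_2"));
    // The new application starts a worker that reuses the name worker_1.
    cluster.onWorkerConnected("worker_1");
    System.out.println("worker_1 enabled: " + cluster.isEnabled("worker_1"));
    System.out.println("worker_2 enabled: " + cluster.isEnabled("worker_2"));
  }
}
```

In this model an orphan stays fenced only as long as nothing re-enables its name, which is why the PR description also has the application master avoid assigning a name that is currently live.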
[GitHub] [incubator-gobblin] autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399580058

## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinTaskRunner.java ##
@@ -366,20 +377,74 @@ boolean isStopped() {
   }

   @VisibleForTesting
-  void connectHelixManager() {
-    try {
-      this.jobHelixManager.connect();
-      this.jobHelixManager.getMessagingService()
-          .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE,
-              new ParticipantShutdownMessageHandlerFactory());
-      this.jobHelixManager.getMessagingService()
-          .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(),
-              getUserDefinedMessageHandlerFactory());
-      if (this.taskDriverHelixManager.isPresent()) {
-        this.taskDriverHelixManager.get().connect();
+  void connectHelixManager() throws Exception {
+    this.jobHelixManager.connect();
+    //Ensure the instance is enabled.
+    this.jobHelixManager.getClusterManagmentTool().enableInstance(clusterName, helixInstanceName, true);
+    this.jobHelixManager.getMessagingService()
+        .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE,
+            new ParticipantShutdownMessageHandlerFactory());
+    this.jobHelixManager.getMessagingService()
+        .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(),
+            getUserDefinedMessageHandlerFactory());
+    if (this.taskDriverHelixManager.isPresent()) {
+      this.taskDriverHelixManager.get().connect();
+      //Ensure the instance is enabled.
+      this.taskDriverHelixManager.get().getClusterManagmentTool().enableInstance(this.taskDriverHelixManager.get().getClusterName(), helixInstanceName, true);
+    }
+  }
+
+  /**
+   * A method to handle failures joining Helix cluster. The method will perform the following steps before attempting
+   * to re-join the cluster:
+   *
+   * Disconnect from Helix cluster, which would close any open clients
+   * Drop instance from Helix cluster, to remove any partial instance structure from Helix
+   * Re-construct helix manager instances, used to re-join the cluster
+   *
+   */
+  private void onClusterJoinFailure() {
+    logger.warn("Disconnecting Helix manager..");
+    disconnectHelixManager();
+
+    HelixAdmin admin = new ZKHelixAdmin(clusterConfig.getString(GobblinClusterConfigurationKeys.ZK_CONNECTION_STRING_KEY));
+    //Drop the helix Instance
+    logger.warn("Dropping instance: {} from cluster: {}", helixInstanceName, clusterName);
+    HelixUtils.dropInstanceIfExists(admin, clusterName, helixInstanceName);

Review comment:
The return value of this method is ignored; is that intentional? Or should we leave exception handling to the caller of this static method?
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=410741=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-410741 ]

ASF GitHub Bot logged work on GOBBLIN-1099:
---
Author: ASF GitHub Bot
Created on: 27/Mar/20 00:35
Start Date: 27/Mar/20 00:35
Worklog Time Spent: 10m
Work Description: codecov-io commented on issue #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#issuecomment-604751963

# [Codecov](https://codecov.io/gh/apache/incubator-gobblin/pull/2940?src=pr=h1) Report
> Merging [#2940](https://codecov.io/gh/apache/incubator-gobblin/pull/2940?src=pr=desc) into [master](https://codecov.io/gh/apache/incubator-gobblin/commit/a63461257c3fcea8f4ff67087f8cb29be25d6baf=desc) will **decrease** coverage by `0.00%`.
> The diff coverage is `41.57%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/graphs/tree.svg?width=650=150=pr=4MgURJ0bGc)](https://codecov.io/gh/apache/incubator-gobblin/pull/2940?src=pr=tree)

```diff
@@             Coverage Diff             @@
##             master    #2940     +/-   ##
- Coverage     44.60%   44.60%   -0.01%
- Complexity     8980     8981      +1
  Files          1936     1936
  Lines         73234    73296     +62
  Branches       8083     8089      +6
+ Hits          32669    32695     +26
- Misses        37515    37550     +35
- Partials       3050     3051      +1
```

| [Impacted Files](https://codecov.io/gh/apache/incubator-gobblin/pull/2940?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [.../apache/gobblin/yarn/GobblinApplicationMaster.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbkFwcGxpY2F0aW9uTWFzdGVyLmphdmE=) | `17.80% <0.00%> (-0.25%)` | `3.00 <0.00> (ø)` | |
| [...rg/apache/gobblin/yarn/GobblinYarnAppLauncher.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbllhcm5BcHBMYXVuY2hlci5qYXZh) | `20.66% <0.00%> (-0.51%)` | `8.00 <0.00> (ø)` | |
| [...in/java/org/apache/gobblin/cluster/HelixUtils.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvSGVsaXhVdGlscy5qYXZh) | `34.12% <18.18%> (-4.14%)` | `14.00 <1.00> (ø)` | |
| [...main/java/org/apache/gobblin/yarn/YarnService.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vWWFyblNlcnZpY2UuamF2YQ==) | `15.71% <22.22%> (+0.44%)` | `4.00 <0.00> (ø)` | |
| [.../org/apache/gobblin/cluster/GobblinTaskRunner.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpblRhc2tSdW5uZXIuamF2YQ==) | `66.01% <75.00%> (+1.85%)` | `29.00 <4.00> (+2.00)` | |
| [.../apache/gobblin/cluster/GobblinClusterManager.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkNsdXN0ZXJNYW5hZ2VyLmphdmE=) | `53.91% <100.00%> (ø)` | `26.00 <1.00> (ø)` | |
| [...ava/org/apache/gobblin/fsm/FiniteStateMachine.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi11dGlsaXR5L3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2ZzbS9GaW5pdGVTdGF0ZU1hY2hpbmUuamF2YQ==) | `73.48% <0.00%> (-3.04%)` | `18.00% <0.00%> (-3.00%)` | |
| [...lin/restli/throttling/ZookeeperLeaderElection.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1yZXN0bGkvZ29iYmxpbi10aHJvdHRsaW5nLXNlcnZpY2UvZ29iYmxpbi10aHJvdHRsaW5nLXNlcnZpY2Utc2VydmVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3Jlc3RsaS90aHJvdHRsaW5nL1pvb2tlZXBlckxlYWRlckVsZWN0aW9uLmphdmE=) | `70.00% <0.00%> (-2.23%)` | `13.00% <0.00%> (ø%)` | |
| ... and [3 more](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree-more) | |

--

[Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-gobblin/pull/2940?src=pr=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolut
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=410737=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-410737 ]

ASF GitHub Bot logged work on GOBBLIN-1099:
---
Author: ASF GitHub Bot
Created on: 27/Mar/20 00:19
Start Date: 27/Mar/20 00:19
Worklog Time Spent: 10m
Work Description: codecov-io commented on issue #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#issuecomment-604751963

# [Codecov](https://codecov.io/gh/apache/incubator-gobblin/pull/2940?src=pr=h1) Report
> Merging [#2940](https://codecov.io/gh/apache/incubator-gobblin/pull/2940?src=pr=desc) into [master](https://codecov.io/gh/apache/incubator-gobblin/commit/a63461257c3fcea8f4ff67087f8cb29be25d6baf=desc) will **decrease** coverage by `40.49%`.
> The diff coverage is `0.00%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/graphs/tree.svg?width=650=150=pr=4MgURJ0bGc)](https://codecov.io/gh/apache/incubator-gobblin/pull/2940?src=pr=tree)

```diff
@@              Coverage Diff              @@
##             master    #2940      +/-   ##
- Coverage     44.60%    4.11%   -40.50%
+ Complexity     8980      750     -8230
  Files          1936     1936
  Lines         73234    73296       +62
  Branches       8083     8089        +6
- Hits          32669     3016    -29653
- Misses        37515    69960    +32445
+ Partials       3050      320     -2730
```

| [Impacted Files](https://codecov.io/gh/apache/incubator-gobblin/pull/2940?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [.../apache/gobblin/cluster/GobblinClusterManager.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkNsdXN0ZXJNYW5hZ2VyLmphdmE=) | `0.00% <0.00%> (-53.92%)` | `0.00 <0.00> (-26.00)` | |
| [.../org/apache/gobblin/cluster/GobblinTaskRunner.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpblRhc2tSdW5uZXIuamF2YQ==) | `0.00% <0.00%> (-64.16%)` | `0.00 <0.00> (-27.00)` | |
| [...in/java/org/apache/gobblin/cluster/HelixUtils.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvSGVsaXhVdGlscy5qYXZh) | `0.00% <0.00%> (-38.27%)` | `0.00 <0.00> (-14.00)` | |
| [...n/compaction/action/CompactionWatermarkAction.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1jb21wYWN0aW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NvbXBhY3Rpb24vYWN0aW9uL0NvbXBhY3Rpb25XYXRlcm1hcmtBY3Rpb24uamF2YQ==) | `0.00% <0.00%> (-77.05%)` | `0.00 <0.00> (-11.00)` | |
| [...che/gobblin/salesforce/ResultChainingIterator.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi1zYWxlc2ZvcmNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3NhbGVzZm9yY2UvUmVzdWx0Q2hhaW5pbmdJdGVyYXRvci5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
| [.../apache/gobblin/yarn/GobblinApplicationMaster.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbkFwcGxpY2F0aW9uTWFzdGVyLmphdmE=) | `0.00% <0.00%> (-18.06%)` | `0.00 <0.00> (-3.00)` | |
| [...rg/apache/gobblin/yarn/GobblinYarnAppLauncher.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbllhcm5BcHBMYXVuY2hlci5qYXZh) | `0.00% <0.00%> (-21.17%)` | `0.00 <0.00> (-8.00)` | |
| [...main/java/org/apache/gobblin/yarn/YarnService.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vWWFyblNlcnZpY2UuamF2YQ==) | `0.00% <0.00%> (-15.28%)` | `0.00 <0.00> (-4.00)` | |
| [...c/main/java/org/apache/gobblin/util/FileUtils.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2940/diff?src=pr=tree#diff-Z29iYmxpbi11dGlsaXR5L3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3V0aWwvRmlsZVV0aWxzLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...n/java/org/apach
[GitHub] [incubator-gobblin] sv2000 commented on issue #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
sv2000 commented on issue #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus… URL: https://github.com/apache/incubator-gobblin/pull/2940#issuecomment-604735451 @autumnust @htran1 please review.
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=410697=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-410697 ]

ASF GitHub Bot logged work on GOBBLIN-1099:
---
Author: ASF GitHub Bot
Created on: 26/Mar/20 23:18
Start Date: 26/Mar/20 23:18
Worklog Time Spent: 10m
Work Description: sv2000 commented on issue #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#issuecomment-604735451

@autumnust @htran1 please review.

Issue Time Tracking
---
Worklog Id: (was: 410697)
Time Spent: 20m (was: 10m)

> Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
> ---
>
> Key: GOBBLIN-1099
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1099
> Project: Apache Gobblin
> Issue Type: Improvement
> Components: gobblin-yarn
> Affects Versions: 0.15.0
> Reporter: Sudarshan Vasudevan
> Assignee: Abhishek Tiwari
> Priority: Major
> Fix For: 0.15.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> A Yarn application may leave behind orphaned containers, which can happen due
> to lost node managers. The orphaned containers however can continue to run
> (potentially forever) as participants in the Helix cluster. This can cause
> the following problems for a Gobblin-on-Yarn application:
> # Double publish of data and commit of state
> # Task failures and partition starvation during application restarts, as
> Helix may assign tasks to the orphaned containers which have a stale state
> and configuration
> # Container failures on application restarts due to Helix instance name
> collisions with orphaned containers
>

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=410696=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-410696 ]

ASF GitHub Bot logged work on GOBBLIN-1099:
---
Author: ASF GitHub Bot
Created on: 26/Mar/20 23:18
Start Date: 26/Mar/20 23:18
Worklog Time Spent: 10m
Work Description: sv2000 commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940

…ters

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

### JIRA
- [x] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
    - https://issues.apache.org/jira/browse/GOBBLIN-1099

### Description
- [x] Here are some details about my PR, including screenshots (if applicable):

A Yarn application may leave behind orphaned containers, which can happen due to lost node managers. The orphaned containers however can continue to run (potentially forever) as participants in the Helix cluster. This can cause the following problems for a Gobblin-on-Yarn application:
- Double publish of data and commit of state
- Task failures and partition starvation during application restarts, as Helix may assign tasks to the orphaned containers which have a stale state and configuration
- Container failures on application restarts due to Helix instance name collisions with orphaned containers

This PR incorporates the following changes to handle orphaned containers:
1. Disables live instances during Yarn application start up and shutdown in GobblinYarnAppLauncher. This ensures that any orphaned Helix instances are disabled from joining the cluster on restart and any tasks running on these instances are cancelled.
2. The GobblinApplicationMaster (inside YarnService) ensures that Helix instance name assigned to a newly allocated container is not a currently live instance to avoid instance name collisions. We also handle NM failures/restarts that result in Yarn RM "aborting" the container (i.e. container is deemed dead from Yarn's point of view, even though the container may physically be alive). In this case, we disable the instance to ensure it is fenced off from the Helix cluster.
3. Gobblin workers (i.e. GobblinYarnTaskRunner) retry in case of failure in joining a Helix cluster. The retry logic disconnects from the Helix cluster and drops the instance before reattempting to join the cluster.

### Tests
- [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
Added unit test in GobblinTaskRunnerTest

### Commits
- [x] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Issue Time Tracking
---
Worklog Id: (was: 410696)
Remaining Estimate: 0h
Time Spent: 10m

> Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
> ---
>
> Key: GOBBLIN-1099
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1099
> Project: Apache Gobblin
> Issue Type: Improvement
> Components: gobblin-yarn
> Affects Versions: 0.15.0
> Reporter: Sudarshan Vasudevan
> Assignee: Abhishek Tiwari
> Priority: Major
> Fix For: 0.15.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> A Yarn application may leave behind orphaned containers, which can happen due
> to lost node managers. The orphaned containers however can continue to run
> (potentially forever) as participants in the Helix cluster. Th
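Change 2 in the PR description in this message (do not hand a newly allocated container a Helix instance name that is currently live) amounts to a lookup against the set of live instance names. The sketch below shows one possible way to do that. The class name, the `prefix_N` naming scheme and the linear scan are assumptions made for illustration and are not taken from the PR's `YarnService` code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: the naming scheme and helper are hypothetical, not Gobblin's implementation. */
final class InstanceNameSketch {

  /** Returns the first name of the form prefix_N (N = 1, 2, ...) that is not currently live. */
  static String nextUnusedName(String prefix, Set<String> liveInstances) {
    int suffix = 1;
    while (liveInstances.contains(prefix + "_" + suffix)) {
      suffix++;
    }
    return prefix + "_" + suffix;
  }

  public static void main(String[] args) {
    // Two orphaned workers from an earlier application still hold these names.
    Set<String> live = new HashSet<>(Arrays.asList("worker_1", "worker_2"));
    System.out.println(nextUnusedName("worker", live));
  }
}
```

Skipping live names only avoids collisions. The orphaned instances themselves still have to be disabled, which is what changes 1 and 2 in the description cover.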
[jira] [Created] (GOBBLIN-1099) Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
Sudarshan Vasudevan created GOBBLIN-1099:

Summary: Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
Key: GOBBLIN-1099
URL: https://issues.apache.org/jira/browse/GOBBLIN-1099
Project: Apache Gobblin
Issue Type: Improvement
Components: gobblin-yarn
Affects Versions: 0.15.0
Reporter: Sudarshan Vasudevan
Assignee: Abhishek Tiwari
Fix For: 0.15.0

A Yarn application may leave behind orphaned containers, which can happen due to lost node managers. The orphaned containers however can continue to run (potentially forever) as participants in the Helix cluster. This can cause the following problems for a Gobblin-on-Yarn application:

# Double publish of data and commit of state
# Task failures and partition starvation during application restarts, as Helix may assign tasks to the orphaned containers which have a stale state and configuration
# Container failures on application restarts due to Helix instance name collisions with orphaned containers

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [incubator-gobblin] asfgit closed pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
asfgit closed pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873
[GitHub] [incubator-gobblin] autumnust commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
autumnust commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873#discussion_r378448071 ## File path: gobblin-modules/gobblin-azkaban/src/main/java/org/apache/gobblin/azkaban/EmbeddedGobblinYarnAppLauncher.java ## @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.gobblin.azkaban; + +import java.io.File; +import java.io.FileOutputStream; +import java.io.OutputStream; +import java.io.PrintWriter; +import java.lang.reflect.Field; +import java.util.Map; + +import org.apache.gobblin.testing.AssertWithBackoff; +import org.apache.hadoop.yarn.conf.YarnConfiguration; +import org.apache.hadoop.yarn.server.MiniYARNCluster; +import org.testng.collections.Lists; + +import com.google.common.base.Preconditions; +import com.google.common.base.Predicate; +import com.google.common.collect.ImmutableMap; +import com.google.common.io.Closer; + +import lombok.extern.slf4j.Slf4j; + + +/** + * Given a set up Azkaban job configuration, launch the Gobblin-on-Yarn job in a semi-embedded mode: + * - Uses external Kafka cluster and requires external Zookeeper(Non-embedded TestingServer) to be set up. + * The Kafka Cluster was intentionally set to be external due to the data availability. External ZK was unintentional + * as the helix version (0.9) being used cannot finish state transition in the Embedded ZK. + * TODO: Adding embedded Kafka cluster and set golden datasets for data-validation. + * - Uses MiniYARNCluster so YARN components don't have to be installed. 
+ */ +@Slf4j +public class EmbeddedGobblinYarnAppLauncher extends AzkabanJobRunner { + public static final String DYNAMIC_CONF_PATH = "dynamic.conf"; + public static final String YARN_SITE_XML_PATH = "yarn-site.xml"; + private static String zkString = ""; + private static String fileAddress = ""; + + private static void setup(String[] args) + throws Exception { +// Parsing zk-string +Preconditions.checkArgument(args.length == 1); +zkString = args[0]; + +// Initialize necessary external components: Yarn and Helix +Closer closer = Closer.create(); + +// Set java home in environment since it isn't set on some systems +String javaHome = System.getProperty("java.home"); +setEnv("JAVA_HOME", javaHome); + +final YarnConfiguration clusterConf = new YarnConfiguration(); +clusterConf.set("yarn.resourcemanager.connect.max-wait.ms", "1"); +clusterConf.set("yarn.nodemanager.resource.memory-mb", "512"); +clusterConf.set("yarn.scheduler.maximum-allocation-mb", "1024"); + +MiniYARNCluster miniYARNCluster = closer.register(new MiniYARNCluster("TestCluster", 1, 1, 1)); +miniYARNCluster.init(clusterConf); +miniYARNCluster.start(); + +// YARN client should not be started before the Resource Manager is up +AssertWithBackoff.create().logger(log).timeoutMs(1).assertTrue(new Predicate() { + @Override + public boolean apply(Void input) { +return !clusterConf.get(YarnConfiguration.RM_ADDRESS).contains(":0"); + } +}, "Waiting for RM"); + +try (PrintWriter pw = new PrintWriter(DYNAMIC_CONF_PATH, "UTF-8")) { + File dir = new File("target/dummydir"); + + // dummy directory specified in configuration + if (!dir.mkdir()) { +log.error("The dummy folder's creation is not successful"); + } + dir.deleteOnExit(); + + pw.println("gobblin.cluster.zk.connection.string=\"" + zkString + "\""); + pw.println("jobconf.fullyQualifiedPath=\"" + dir.getAbsolutePath() + "\""); +} + +// YARN config is dynamic and needs to be passed to other processes +try (OutputStream os = new FileOutputStream(new 
File(YARN_SITE_XML_PATH))) { + clusterConf.writeXml(os); +} + +/**
[GitHub] [incubator-gobblin] sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873#discussion_r376171137 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinYarnAppLauncher.java ## @@ -687,9 +691,13 @@ private void addAppLocalFiles(String localFilePathList, Optional
[GitHub] [incubator-gobblin] sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873#discussion_r376166058 ## File path: gobblin-modules/gobblin-azkaban/src/main/java/org/apache/gobblin/azkaban/EmbeddedGobblinYarnAppLauncher.java ## @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.gobblin.azkaban; + +import java.io.File; +import java.io.FileOutputStream; +import java.io.OutputStream; +import java.io.PrintWriter; +import java.lang.reflect.Field; +import java.util.Map; + +import org.apache.gobblin.testing.AssertWithBackoff; +import org.apache.hadoop.yarn.conf.YarnConfiguration; +import org.apache.hadoop.yarn.server.MiniYARNCluster; +import org.testng.collections.Lists; + +import com.google.common.base.Preconditions; +import com.google.common.base.Predicate; +import com.google.common.collect.ImmutableMap; +import com.google.common.io.Closer; + +import lombok.extern.slf4j.Slf4j; + + +/** + * Given a set up Azkaban job configuration, launch the Gobblin-on-Yarn job in a semi-embedded mode: + * - Uses external Kafka cluster and requires external Zookeeper(Non-embedded TestingServer) to be set up. + * The Kafka Cluster was intentionally set to be external due to the data availability. External ZK was unintentional + * as the helix version (0.9) being used cannot finish state transition in the Embedded ZK. + * TODO: Adding embedded Kafka cluster and set golden datasets for data-validation. + * - Uses MiniYARNCluster so YARN components don't have to be installed. 
+ */ +@Slf4j +public class EmbeddedGobblinYarnAppLauncher extends AzkabanJobRunner { + public static final String DYNAMIC_CONF_PATH = "dynamic.conf"; + public static final String YARN_SITE_XML_PATH = "yarn-site.xml"; + private static String zkString = ""; + private static String fileAddress = ""; + + private static void setup(String[] args) + throws Exception { +// Parsing zk-string +Preconditions.checkArgument(args.length == 1); +zkString = args[0]; + +// Initialize necessary external components: Yarn and Helix +Closer closer = Closer.create(); + +// Set java home in environment since it isn't set on some systems +String javaHome = System.getProperty("java.home"); +setEnv("JAVA_HOME", javaHome); + +final YarnConfiguration clusterConf = new YarnConfiguration(); +clusterConf.set("yarn.resourcemanager.connect.max-wait.ms", "1"); +clusterConf.set("yarn.nodemanager.resource.memory-mb", "512"); +clusterConf.set("yarn.scheduler.maximum-allocation-mb", "1024"); + +MiniYARNCluster miniYARNCluster = closer.register(new MiniYARNCluster("TestCluster", 1, 1, 1)); +miniYARNCluster.init(clusterConf); +miniYARNCluster.start(); + +// YARN client should not be started before the Resource Manager is up +AssertWithBackoff.create().logger(log).timeoutMs(1).assertTrue(new Predicate() { + @Override + public boolean apply(Void input) { +return !clusterConf.get(YarnConfiguration.RM_ADDRESS).contains(":0"); + } +}, "Waiting for RM"); + +try (PrintWriter pw = new PrintWriter(DYNAMIC_CONF_PATH, "UTF-8")) { + File dir = new File("target/dummydir"); + + // dummy directory specified in configuration + if (!dir.mkdir()) { +log.error("The dummy folder's creation is not successful"); + } + dir.deleteOnExit(); + + pw.println("gobblin.cluster.zk.connection.string=\"" + zkString + "\""); + pw.println("jobconf.fullyQualifiedPath=\"" + dir.getAbsolutePath() + "\""); +} + +// YARN config is dynamic and needs to be passed to other processes +try (OutputStream os = new FileOutputStream(new 
File(YARN_SITE_XML_PATH))) { + clusterConf.writeXml(os); +} + +/** Have to pass the
[GitHub] [incubator-gobblin] codecov-io edited a comment on issue #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
codecov-io edited a comment on issue #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
URL: https://github.com/apache/incubator-gobblin/pull/2873#issuecomment-575821726

# [Codecov](https://codecov.io/gh/apache/incubator-gobblin/pull/2873?src=pr=h1) Report
> Merging [#2873](https://codecov.io/gh/apache/incubator-gobblin/pull/2873?src=pr=desc) into [master](https://codecov.io/gh/apache/incubator-gobblin/commit/ef3003c85531aaeacb760a4e30fca2d3069001aa?src=pr=desc) will **decrease** coverage by `0.04%`.
> The diff coverage is `3.5%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/graphs/tree.svg?width=650=4MgURJ0bGc=150=pr)](https://codecov.io/gh/apache/incubator-gobblin/pull/2873?src=pr=tree)

```diff
@@            Coverage Diff            @@
##           master   #2873     +/-   ##
- Coverage     45.73%   45.68%   -0.05%
  Complexity     9112     9112
  Files          1921     1924       +3
  Lines         72391    72495     +104
  Branches       7967     7977      +10
+ Hits          33106    33120      +14
- Misses        36262    36353      +91
+ Partials       3023     3022       -1
```

| [Impacted Files](https://codecov.io/gh/apache/incubator-gobblin/pull/2873?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [.../apache/gobblin/cluster/GobblinClusterManager.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkNsdXN0ZXJNYW5hZ2VyLmphdmE=) | `54.75% <ø> (ø)` | `28 <0> (ø)` | :arrow_down: |
| [...n/java/org/apache/gobblin/util/logs/LogCopier.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi11dGlsaXR5L3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3V0aWwvbG9ncy9Mb2dDb3BpZXIuamF2YQ==) | `69.56% <ø> (ø)` | `18 <0> (ø)` | :arrow_down: |
| [...che/gobblin/yarn/GobblinYarnConfigurationKeys.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbllhcm5Db25maWd1cmF0aW9uS2V5cy5qYXZh) | `66.66% <ø> (ø)` | `1 <0> (ø)` | :arrow_down: |
| [.../org/apache/gobblin/yarn/GobblinYarnLogSource.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbllhcm5Mb2dTb3VyY2UuamF2YQ==) | `23.07% <ø> (ø)` | `3 <0> (ø)` | :arrow_down: |
| [...e/gobblin/yarn/AbstractYarnAppSecurityManager.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vQWJzdHJhY3RZYXJuQXBwU2VjdXJpdHlNYW5hZ2VyLmphdmE=) | `48.27% <ø> (ø)` | `6 <0> (ø)` | :arrow_down: |
| [...gobblin/azkaban/AzkabanGobblinYarnAppLauncher.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi1tb2R1bGVzL2dvYmJsaW4tYXprYWJhbi9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvZ29iYmxpbi9hemthYmFuL0F6a2FiYW5Hb2JibGluWWFybkFwcExhdW5jaGVyLmphdmE=) | `30.55% <0%> (-2.78%)` | `2 <0> (ø)` | |
| [...obblin/azkaban/EmbeddedGobblinYarnAppLauncher.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi1tb2R1bGVzL2dvYmJsaW4tYXprYWJhbi9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvZ29iYmxpbi9hemthYmFuL0VtYmVkZGVkR29iYmxpbllhcm5BcHBMYXVuY2hlci5qYXZh) | `0% <0%> (ø)` | `0 <0> (?)` | |
| [...in/source/extractor/extract/kafka/KafkaSource.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi1tb2R1bGVzL2dvYmJsaW4ta2Fma2EtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3NvdXJjZS9leHRyYWN0b3IvZXh0cmFjdC9rYWZrYS9LYWZrYVNvdXJjZS5qYXZh) | `0% <0%> (ø)` | `0 <0> (ø)` | :arrow_down: |
| [...in/azkaban/AzkabanGobblinLocalYarnAppLauncher.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi1tb2R1bGVzL2dvYmJsaW4tYXprYWJhbi9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvZ29iYmxpbi9hemthYmFuL0F6a2FiYW5Hb2JibGluTG9jYWxZYXJuQXBwTGF1bmNoZXIuamF2YQ==) | `0% <0%> (ø)` | `0 <0> (?)` | |
| [...rg/apache/gobblin/yarn/GobblinYarnAppLauncher.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbllhcm5BcHBMYXVuY2hlci5qYXZh) | `21.18% <0%> (-0.22%)` | `8 <0> (ø)` | |
| ... and [10 more](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree-more) | |

--

[Continue to revie
[GitHub] [incubator-gobblin] autumnust commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
autumnust commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873#discussion_r374996548 ## File path: gobblin-modules/gobblin-azkaban/src/main/java/org/apache/gobblin/azkaban/EmbeddedGobblinYarnAppLauncher.java ## @@ -0,0 +1,132 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.gobblin.azkaban; + +import java.io.File; +import java.io.FileOutputStream; +import java.io.OutputStream; +import java.io.PrintWriter; +import java.lang.reflect.Field; +import java.util.Map; + +import org.apache.gobblin.testing.AssertWithBackoff; +import org.apache.hadoop.yarn.conf.YarnConfiguration; +import org.apache.hadoop.yarn.server.MiniYARNCluster; +import org.testng.collections.Lists; + +import com.google.common.base.Preconditions; +import com.google.common.base.Predicate; +import com.google.common.collect.ImmutableMap; +import com.google.common.io.Closer; + +import lombok.extern.slf4j.Slf4j; + + +/** + * Given a set up Azkaban job configuration, launch the Gobblin-on-Yarn job in a semi-embedded mode: + * - Uses external Kafka cluster and requires external Zookeeper(Non-embedded TestingServer) to be set up. Review comment: Yes, Kafka was intentional from the very beginning, as we don't have a quick set-up script for Kafka or a good way to create golden datasets there. ZK was kind of intentional: in brief, Helix behaved unexpectedly unstably on an embedded ZK server, so I eventually abandoned that. Will add these to the comments.
[GitHub] [incubator-gobblin] sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873#discussion_r368765996 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ## @@ -533,6 +533,7 @@ private void addRemoteAppFiles(String hdfsFileList, Map r } } + // TODO: Not fully understand the purpose of this method. Review comment: What is the TODO here? Drop this comment if unintentional.
[GitHub] [incubator-gobblin] sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873#discussion_r368764748 ## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java ## @@ -93,8 +93,8 @@ * * * - * This class runs in the {@link GobblinClusterManager}. The actual task execution happens in the in the - * {@link GobblinTaskRunner}. + * The instance of this class is instantiated and runing along with the {@link GobblinHelixJobScheduler}. Review comment: Rewording: This class is instantiated by the {@link GobblinHelixJobScheduler} on every job submission to launch the Gobblin job.
[GitHub] [incubator-gobblin] sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873#discussion_r368766539 ## File path: gobblin-modules/gobblin-azkaban/src/main/resources/conf/properties/local.properties ## @@ -0,0 +1,23 @@ +# Misc. deployment/cluster specific variables +user.name=kafkaetl +logs.user.name=kafkaetlstream +root.project.name=gobblin-kafka-streaming-local +grid.name=ltx1-holdem Review comment: Change this config.
[GitHub] [incubator-gobblin] sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873#discussion_r368766504 ## File path: gobblin-modules/gobblin-azkaban/src/main/resources/conf/properties/local.properties ## @@ -0,0 +1,23 @@ +# Misc. deployment/cluster specific variables +user.name=kafkaetl +logs.user.name=kafkaetlstream +root.project.name=gobblin-kafka-streaming-local +grid.name=ltx1-holdem + + +# Cluster specific directory configurations +# home.dir and root.data.location should be unique per cluster deployment + +home.dir=/jobs/${user.name} +logs.dir=/jobs/${logs.user.name} +root.data.location=/data + +fs.uri=file:/// + +# Yarn Configuration +gobblin.yarn.app.queue=default + +# Gobblin Cluster +#Add a suffix "-1" to the app name becuase of a corrupted Helix cluster znode (LIZK-619). This made the helix cluster Review comment: Drop this comment.
[GitHub] [incubator-gobblin] sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873#discussion_r368766370 ## File path: gobblin-modules/gobblin-azkaban/src/main/java/org/apache/gobblin/azkaban/EmbeddedGobblinYarnAppLauncher.java ## @@ -0,0 +1,132 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.gobblin.azkaban; + +import java.io.File; +import java.io.FileOutputStream; +import java.io.OutputStream; +import java.io.PrintWriter; +import java.lang.reflect.Field; +import java.util.Map; + +import org.apache.gobblin.testing.AssertWithBackoff; +import org.apache.hadoop.yarn.conf.YarnConfiguration; +import org.apache.hadoop.yarn.server.MiniYARNCluster; +import org.testng.collections.Lists; + +import com.google.common.base.Preconditions; +import com.google.common.base.Predicate; +import com.google.common.collect.ImmutableMap; +import com.google.common.io.Closer; + +import lombok.extern.slf4j.Slf4j; + + +/** + * Given a set up Azkaban job configuration, launch the Gobblin-on-Yarn job in a semi-embedded mode: + * - Uses external Kafka cluster and requires external Zookeeper(Non-embedded TestingServer) to be set up. + * - Uses MiniYARNCluster so YARN components don't have to be installed. + */ +@Slf4j +public class EmbeddedGobblinYarnAppLauncher extends AzkabanJobRunner { + public static final String DYNAMIC_CONF_PATH = "dynamic.conf"; + public static final String YARN_SITE_XML_PATH = "yarn-site.xml"; + private static String zkString = "zk-ltx1-gobblin.stg.linkedin.com:6312"; Review comment: Change this line. Shouldn't expose LI internal stuff outside.
[GitHub] [incubator-gobblin] sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873#discussion_r368765618 ## File path: gobblin-modules/gobblin-azkaban/src/main/java/org/apache/gobblin/azkaban/EmbeddedGobblinYarnAppLauncher.java ## @@ -0,0 +1,132 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.gobblin.azkaban; + +import java.io.File; +import java.io.FileOutputStream; +import java.io.OutputStream; +import java.io.PrintWriter; +import java.lang.reflect.Field; +import java.util.Map; + +import org.apache.gobblin.testing.AssertWithBackoff; +import org.apache.hadoop.yarn.conf.YarnConfiguration; +import org.apache.hadoop.yarn.server.MiniYARNCluster; +import org.testng.collections.Lists; + +import com.google.common.base.Preconditions; +import com.google.common.base.Predicate; +import com.google.common.collect.ImmutableMap; +import com.google.common.io.Closer; + +import lombok.extern.slf4j.Slf4j; + + +/** + * Given a set up Azkaban job configuration, launch the Gobblin-on-Yarn job in a semi-embedded mode: + * - Uses external Kafka cluster and requires external Zookeeper(Non-embedded TestingServer) to be set up. Review comment: Can you add why we are using an external ZK/Kafka server? And a TODO to set these up in embedded mode if that is the intention?
[GitHub] [incubator-gobblin] sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873#discussion_r368766551 ## File path: gobblin-modules/gobblin-azkaban/src/main/resources/conf/properties/local.properties ## @@ -0,0 +1,23 @@ +# Misc. deployment/cluster specific variables +user.name=kafkaetl Review comment: Change this config.
[GitHub] [incubator-gobblin] sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873#discussion_r368764662 ## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinClusterManager.java ## @@ -228,6 +228,8 @@ private void stopAppLauncherAndServices() { /** * Configure Helix quota-based task scheduling + * This Task-quota indicates how many helix-tasks can be running concurrently within a Helix Job. Review comment: More precisely: this config controls the number of tasks that are concurrently assigned to a single Helix instance.
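The distinction drawn in the review comment above (a cap per Helix instance, not per job) can be made concrete with a toy model. This is only a sketch under stated assumptions: it does not use or reproduce Helix's scheduler, and `PerInstanceQuotaSketch`, `assign`, and the round-robin policy are hypothetical names and choices made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerInstanceQuotaSketch {

  /**
   * Assigns tasks to instances round-robin, never giving a single instance more
   * than quotaPerInstance concurrent tasks. Tasks beyond the total capacity
   * (instances * quotaPerInstance) are left out of the result, i.e. they stay pending.
   */
  static Map<String, List<String>> assign(
      List<String> tasks, List<String> instances, int quotaPerInstance) {
    Map<String, List<String>> assignment = new LinkedHashMap<>();
    for (String instance : instances) {
      assignment.put(instance, new ArrayList<>());
    }
    if (instances.isEmpty() || quotaPerInstance <= 0) {
      return assignment;
    }
    int capacity = instances.size() * quotaPerInstance;
    for (int i = 0; i < tasks.size() && i < capacity; i++) {
      String instance = instances.get(i % instances.size());
      assignment.get(instance).add(tasks.get(i));
    }
    return assignment;
  }

  public static void main(String[] args) {
    List<String> tasks = Arrays.asList("t1", "t2", "t3", "t4", "t5");
    List<String> instances = Arrays.asList("worker_1", "worker_2");
    // With a quota of 2 per instance, only 4 of the 5 tasks run concurrently.
    System.out.println(assign(tasks, instances, 2));
  }
}
```

In this model the number of concurrently running tasks grows with the number of instances, which is the point of the reviewer's correction: adding workers raises total concurrency even though the quota value itself is unchanged.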
[GitHub] [incubator-gobblin] sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
sv2000 commented on a change in pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873#discussion_r368764148 ## File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixJobLauncher.java ## @@ -93,8 +93,8 @@ * * * - * This class runs in the {@link GobblinClusterManager}. The actual task execution happens in the in the - * {@link GobblinTaskRunner}. + * The instance of this class is instantiated and runing along with the {@link GobblinHelixJobScheduler}. + * The actual task execution happens in the in the {@link GobblinTaskRunner}, usually in a different process. Review comment: typo: "in the in the" -> "in the".
[GitHub] [incubator-gobblin] codecov-io edited a comment on issue #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
codecov-io edited a comment on issue #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
URL: https://github.com/apache/incubator-gobblin/pull/2873#issuecomment-575821726

# [Codecov](https://codecov.io/gh/apache/incubator-gobblin/pull/2873?src=pr=h1) Report
> Merging [#2873](https://codecov.io/gh/apache/incubator-gobblin/pull/2873?src=pr=desc) into [master](https://codecov.io/gh/apache/incubator-gobblin/commit/adb0efcc20a306d16b2f59d89c189539bd20dc2d?src=pr=desc) will **decrease** coverage by `0.05%`.
> The diff coverage is `3.5%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/graphs/tree.svg?width=650=4MgURJ0bGc=150=pr)](https://codecov.io/gh/apache/incubator-gobblin/pull/2873?src=pr=tree)

```diff
@@            Coverage Diff            @@
##           master   #2873     +/-   ##
- Coverage      45.8%   45.74%   -0.06%
- Complexity     9108     9110       +2
  Files          1915     1918       +3
  Lines         72266    72389     +123
  Branches       7969     7982      +13
+ Hits          33098    33112      +14
- Misses        36143    36250     +107
- Partials       3025     3027       +2
```

| [Impacted Files](https://codecov.io/gh/apache/incubator-gobblin/pull/2873?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [.../apache/gobblin/cluster/GobblinClusterManager.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkNsdXN0ZXJNYW5hZ2VyLmphdmE=) | `53.91% <ø> (ø)` | `27 <0> (ø)` | :arrow_down: |
| [...n/java/org/apache/gobblin/util/logs/LogCopier.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi11dGlsaXR5L3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3V0aWwvbG9ncy9Mb2dDb3BpZXIuamF2YQ==) | `69.56% <ø> (ø)` | `18 <0> (ø)` | :arrow_down: |
| [...main/java/org/apache/gobblin/yarn/YarnService.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vWWFyblNlcnZpY2UuamF2YQ==) | `15.16% <ø> (ø)` | `4 <0> (ø)` | :arrow_down: |
| [...che/gobblin/yarn/GobblinYarnConfigurationKeys.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbllhcm5Db25maWd1cmF0aW9uS2V5cy5qYXZh) | `66.66% <ø> (ø)` | `1 <0> (ø)` | :arrow_down: |
| [.../org/apache/gobblin/yarn/GobblinYarnLogSource.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbllhcm5Mb2dTb3VyY2UuamF2YQ==) | `23.07% <ø> (ø)` | `3 <0> (ø)` | :arrow_down: |
| [...e/gobblin/yarn/AbstractYarnAppSecurityManager.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vQWJzdHJhY3RZYXJuQXBwU2VjdXJpdHlNYW5hZ2VyLmphdmE=) | `48.27% <ø> (ø)` | `6 <0> (ø)` | :arrow_down: |
| [...org/apache/gobblin/yarn/GobblinYarnTaskRunner.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbllhcm5UYXNrUnVubmVyLmphdmE=) | `0% <ø> (ø)` | `0 <0> (ø)` | :arrow_down: |
| [...a/org/apache/gobblin/azkaban/AzkabanJobRunner.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi1tb2R1bGVzL2dvYmJsaW4tYXprYWJhbi9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvZ29iYmxpbi9hemthYmFuL0F6a2FiYW5Kb2JSdW5uZXIuamF2YQ==) | `0% <0%> (ø)` | `0 <0> (?)` | |
| [...obblin/azkaban/EmbeddedGobblinYarnAppLauncher.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi1tb2R1bGVzL2dvYmJsaW4tYXprYWJhbi9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvZ29iYmxpbi9hemthYmFuL0VtYmVkZGVkR29iYmxpbllhcm5BcHBMYXVuY2hlci5qYXZh) | `0% <0%> (ø)` | `0 <0> (?)` | |
| [...in/source/extractor/extract/kafka/KafkaSource.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree#diff-Z29iYmxpbi1tb2R1bGVzL2dvYmJsaW4ta2Fma2EtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3NvdXJjZS9leHRyYWN0b3IvZXh0cmFjdC9rYWZrYS9LYWZrYVNvdXJjZS5qYXZh) | `0% <0%> (ø)` | `0 <0> (ø)` | :arrow_down: |
| ... and [16 more](https://codecov.io/gh/apache/incubator-gobblin/pull/2873/diff?src=pr=tree-more) | |

--

[Continue to review full report at Codecov](https://codecov.io/gh/apache/incubato
[GitHub] [incubator-gobblin] autumnust opened a new pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton
autumnust opened a new pull request #2873: [GOBBLIN-1031]Gobblin-on-Yarn locally running Azkaban job skeleton URL: https://github.com/apache/incubator-gobblin/pull/2873 Dear Gobblin maintainers, Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below! ### JIRA - [ ] https://issues.apache.org/jira/browse/GOBBLIN-1031 ### Description - [ ] Here are some details about my PR, including screenshots (if applicable): - The major part of this PR is `AzkabanJobRunner.java`, which lets a Gobblin job run locally using the same Azkaban settings that LinkedIn uses internally for job scheduling. - On top of that is an embedded Gobblin JobLauncher that starts a Gobblin-Streaming job and submits the ApplicationMaster to a `MiniYARNCluster` spun up in the same process. - It also provides a skeleton of configurations that can be used for running the whole job locally. The skeleton references many class definitions that have not been open-sourced or that serve only LinkedIn-specific purposes; the general-purpose job configuration is still under development. ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 2. Subject is limited to 50 characters 3. Subject does not end with a period 4. Subject uses the imperative mood ("add", not "adding") 5. Body wraps at 72 characters 6. Body explains "what" and "why", not "how" This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
gobblin on yarn - job stuck without any error message logged
Hi, I am new to Gobblin and am trying to use Gobblin on YARN to import data from Postgres and store it in HDFS as daily-partitioned Avro files. (Initially I faced an issue with Joda-Time not handling Postgres timestamps with microsecond precision in JsonElementConversionFactory.DateConverter, for which I have yet to find a solution; for now I am using the simple watermark type with user-defined partitions to test the remaining configuration.) Right now I am facing another issue: the job gets stuck after completing more than half of the work units, without any error logs. The Helix debug logs indicate that the tasks whose partition files are missing from the taskoutput dir changed from INIT to RUNNING in one task runner, say task runner 1, and later I can see the same transition to RUNNING in another task runner, say task runner 4. In between, there are logs indicating that the task changed from RUNNING to COMPLETED, but the task output file is missing. The job never finishes; it stays stuck. The Helix logs suggest that Helix sees no pending tasks and that no task is assigned to any task runner. The config works for a smaller volume of data (fewer partitions). I am not sure how to troubleshoot this one and would appreciate your suggestions. Thanks & Regards, Praveen
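For the timestamp issue mentioned in the message above, the following is a minimal sketch, not a Gobblin fix. Joda-Time keeps at most millisecond precision, so the sketch uses `java.time` to parse a microsecond-precision timestamp and, if needed, truncate it to milliseconds. The class name `MicrosecondTimestampSketch`, the method names, and the assumed text layout `yyyy-MM-dd HH:mm:ss[.fraction]` are all illustrative assumptions; how such a value would be wired into `JsonElementConversionFactory.DateConverter` is not covered here.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.time.temporal.ChronoUnit;

// Hypothetical helper, for illustration only; not part of Gobblin.
class MicrosecondTimestampSketch {
    // Assumed layout: date, space, time, then an optional fraction of 0 to 9 digits.
    static final DateTimeFormatter FORMAT = new DateTimeFormatterBuilder()
        .appendPattern("uuuu-MM-dd HH:mm:ss")
        .appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true)
        .toFormatter();

    // Parses the text and keeps the full sub-second precision.
    static LocalDateTime parse(String text) {
        return LocalDateTime.parse(text, FORMAT);
    }

    // Parses the text and drops everything below one millisecond,
    // which is the finest resolution a Joda-Time value can hold.
    static LocalDateTime parseToMillis(String text) {
        return parse(text).truncatedTo(ChronoUnit.MILLIS);
    }

    public static void main(String[] args) {
        String sample = "2020-01-15 10:23:45.123456";
        System.out.println(parse(sample));
        System.out.println(parseToMillis(sample));
    }
}
```

Truncating loses the last three fractional digits, so whether that is acceptable depends on how the column is used downstream (for example, as a watermark).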
[jira] [Closed] (GOBBLIN-996) Add support for managed Helix clusters for Gobblin-on-Yarn applications
[ https://issues.apache.org/jira/browse/GOBBLIN-996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudarshan Vasudevan closed GOBBLIN-996. --- > Add support for managed Helix clusters for Gobblin-on-Yarn applications > --- > > Key: GOBBLIN-996 > URL: https://issues.apache.org/jira/browse/GOBBLIN-996 > Project: Apache Gobblin > Issue Type: Improvement > Components: gobblin-cluster >Affects Versions: 0.15.0 >Reporter: Sudarshan Vasudevan >Assignee: Hung Tran >Priority: Major > Fix For: 0.15.0 > > Time Spent: 50m > Remaining Estimate: 0h > > For managed Helix clusters, the GobblinApplicationMaster should skip > connecting to the cluster as CONTROLLER and we should skip cluster creation > from GobblinYarnApplicationLauncher. > > This PR also bumps up Helix dependencies to 0.9.x version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (GOBBLIN-996) Add support for managed Helix clusters for Gobblin-on-Yarn applications
[ https://issues.apache.org/jira/browse/GOBBLIN-996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudarshan Vasudevan resolved GOBBLIN-996. - Resolution: Fixed > Add support for managed Helix clusters for Gobblin-on-Yarn applications > --- > > Key: GOBBLIN-996 > URL: https://issues.apache.org/jira/browse/GOBBLIN-996 > Project: Apache Gobblin > Issue Type: Improvement > Components: gobblin-cluster >Affects Versions: 0.15.0 >Reporter: Sudarshan Vasudevan >Assignee: Hung Tran >Priority: Major > Fix For: 0.15.0 > > Time Spent: 50m > Remaining Estimate: 0h > > For managed Helix clusters, the GobblinApplicationMaster should skip > connecting to the cluster as CONTROLLER and we should skip cluster creation > from GobblinYarnApplicationLauncher. > > This PR also bumps up Helix dependencies to 0.9.x version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (GOBBLIN-996) Add support for managed Helix clusters for Gobblin-on-Yarn applications
[ https://issues.apache.org/jira/browse/GOBBLIN-996?focusedWorklogId=354756=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354756 ] ASF GitHub Bot logged work on GOBBLIN-996: -- Author: ASF GitHub Bot Created on: 05/Dec/19 22:42 Start Date: 05/Dec/19 22:42 Worklog Time Spent: 10m Work Description: asfgit commented on pull request #2841: GOBBLIN-996: Add support for managed Helix clusters for Gobblin-on-Ya… URL: https://github.com/apache/incubator-gobblin/pull/2841 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 354756) Time Spent: 50m (was: 40m) > Add support for managed Helix clusters for Gobblin-on-Yarn applications > --- > > Key: GOBBLIN-996 > URL: https://issues.apache.org/jira/browse/GOBBLIN-996 > Project: Apache Gobblin > Issue Type: Improvement > Components: gobblin-cluster >Affects Versions: 0.15.0 >Reporter: Sudarshan Vasudevan >Assignee: Hung Tran >Priority: Major > Fix For: 0.15.0 > > Time Spent: 50m > Remaining Estimate: 0h > > For managed Helix clusters, the GobblinApplicationMaster should skip > connecting to the cluster as CONTROLLER and we should skip cluster creation > from GobblinYarnApplicationLauncher. > > This PR also bumps up Helix dependencies to 0.9.x version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (GOBBLIN-996) Add support for managed Helix clusters for Gobblin-on-Yarn applications
[ https://issues.apache.org/jira/browse/GOBBLIN-996?focusedWorklogId=354739=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354739 ] ASF GitHub Bot logged work on GOBBLIN-996: -- Author: ASF GitHub Bot Created on: 05/Dec/19 22:18 Start Date: 05/Dec/19 22:18 Worklog Time Spent: 10m Work Description: codecov-io commented on issue #2841: GOBBLIN-996: Add support for managed Helix clusters for Gobblin-on-Ya… URL: https://github.com/apache/incubator-gobblin/pull/2841#issuecomment-562340792 # [Codecov](https://codecov.io/gh/apache/incubator-gobblin/pull/2841?src=pr=h1) Report > Merging [#2841](https://codecov.io/gh/apache/incubator-gobblin/pull/2841?src=pr=desc) into [master](https://codecov.io/gh/apache/incubator-gobblin/commit/1092a891af7eda5732c948695b92e34540751613?src=pr=desc) will **increase** coverage by `0.01%`. > The diff coverage is `11.11%`. [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/graphs/tree.svg?width=650=4MgURJ0bGc=150=pr)](https://codecov.io/gh/apache/incubator-gobblin/pull/2841?src=pr=tree) ```diff @@ Coverage Diff @@ ## master #2841 +/- ## + Coverage 45.66% 45.68% +0.01% - Complexity 8988 8991 +3 Files 1903 1903 Lines 71327 71331 +4 Branches 7871 7873 +2 + Hits 32572 32584 +12 + Misses 35754 35748 -6 + Partials 3001 2999 -2 ``` | [Impacted Files](https://codecov.io/gh/apache/incubator-gobblin/pull/2841?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...bblin/cluster/GobblinClusterConfigurationKeys.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkNsdXN0ZXJDb25maWd1cmF0aW9uS2V5cy5qYXZh) | `0% <ø> (ø)` | `0 <0> (ø)` | :arrow_down: | | 
[...rg/apache/gobblin/yarn/GobblinYarnAppLauncher.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbllhcm5BcHBMYXVuY2hlci5qYXZh) | `19.94% <0%> (-0.11%)` | `7 <0> (ø)` | | | [...ache/gobblin/cluster/GobblinHelixMultiManager.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkhlbGl4TXVsdGlNYW5hZ2VyLmphdmE=) | `54.16% <100%> (-0.29%)` | `19 <0> (ø)` | | | [...apache/gobblin/salesforce/SalesforceExtractor.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1zYWxlc2ZvcmNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3NhbGVzZm9yY2UvU2FsZXNmb3JjZUV4dHJhY3Rvci5qYXZh) | `0% <0%> (ø)` | `0% <0%> (ø)` | :arrow_down: | | [...lin/restli/throttling/ZookeeperLeaderElection.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1yZXN0bGkvZ29iYmxpbi10aHJvdHRsaW5nLXNlcnZpY2UvZ29iYmxpbi10aHJvdHRsaW5nLXNlcnZpY2Utc2VydmVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3Jlc3RsaS90aHJvdHRsaW5nL1pvb2tlZXBlckxlYWRlckVsZWN0aW9uLmphdmE=) | `72.22% <0%> (+2.22%)` | `13% <0%> (ø)` | :arrow_down: | | [.../limiter/RedirectAwareRestClientRequestSender.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1yZXN0bGkvZ29iYmxpbi10aHJvdHRsaW5nLXNlcnZpY2UvZ29iYmxpbi10aHJvdHRsaW5nLXNlcnZpY2UtY2xpZW50L3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3V0aWwvbGltaXRlci9SZWRpcmVjdEF3YXJlUmVzdENsaWVudFJlcXVlc3RTZW5kZXIuamF2YQ==) | `87.2% <0%> (+2.32%)` | `18% <0%> (+1%)` | :arrow_up: | | [...he/gobblin/writer/FineGrainedWatermarkTracker.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1jb3JlLWJhc2Uvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vd3JpdGVyL0ZpbmVHcmFpbmVkV2F0ZXJtYXJrVHJhY2tlci5qYXZh) | `84.67% <0%> (+2.41%)` | 
`29% <0%> (+1%)` | :arrow_up: | | [...in/java/org/apache/gobblin/cluster/HelixUtils.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvSGVsaXhVdGlscy5qYXZh) | `39.25% <0%> (+3.73%)` | `13% <0%> (+1%)` | :arrow_up: | | [...lin/util/filesystem/FileSystemInstrumentation.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi11dGlsaXR5L3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL
[jira] [Work logged] (GOBBLIN-996) Add support for managed Helix clusters for Gobblin-on-Yarn applications
[ https://issues.apache.org/jira/browse/GOBBLIN-996?focusedWorklogId=354734=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354734 ] ASF GitHub Bot logged work on GOBBLIN-996: -- Author: ASF GitHub Bot Created on: 05/Dec/19 22:08 Start Date: 05/Dec/19 22:08 Worklog Time Spent: 10m Work Description: codecov-io commented on issue #2841: GOBBLIN-996: Add support for managed Helix clusters for Gobblin-on-Ya… URL: https://github.com/apache/incubator-gobblin/pull/2841#issuecomment-562340792 # [Codecov](https://codecov.io/gh/apache/incubator-gobblin/pull/2841?src=pr=h1) Report > Merging [#2841](https://codecov.io/gh/apache/incubator-gobblin/pull/2841?src=pr=desc) into [master](https://codecov.io/gh/apache/incubator-gobblin/commit/1092a891af7eda5732c948695b92e34540751613?src=pr=desc) will **decrease** coverage by `41.51%`. > The diff coverage is `0%`. [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/graphs/tree.svg?width=650=4MgURJ0bGc=150=pr)](https://codecov.io/gh/apache/incubator-gobblin/pull/2841?src=pr=tree) ```diff @@ Coverage Diff @@ ## master #2841 +/- ## - Coverage 45.66% 4.14% -41.52% + Complexity 8988 747 -8241 Files 1903 1903 Lines 71327 71331 +4 Branches 7871 7873 +2 - Hits 32572 2960 -29612 - Misses 35754 68052 +32298 + Partials 3001 319 -2682 ``` | [Impacted Files](https://codecov.io/gh/apache/incubator-gobblin/pull/2841?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...bblin/cluster/GobblinClusterConfigurationKeys.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkNsdXN0ZXJDb25maWd1cmF0aW9uS2V5cy5qYXZh) | `0% <ø> (ø)` | `0 <0> (ø)` | :arrow_down: | | 
[...ache/gobblin/cluster/GobblinHelixMultiManager.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkhlbGl4TXVsdGlNYW5hZ2VyLmphdmE=) | `0% <0%> (-54.46%)` | `0 <0> (-19)` | | | [...rg/apache/gobblin/yarn/GobblinYarnAppLauncher.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbllhcm5BcHBMYXVuY2hlci5qYXZh) | `0% <0%> (-20.06%)` | `0 <0> (-7)` | | | [...n/converter/AvroStringFieldDecryptorConverter.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1tb2R1bGVzL2dvYmJsaW4tY3J5cHRvLXByb3ZpZGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NvbnZlcnRlci9BdnJvU3RyaW5nRmllbGREZWNyeXB0b3JDb252ZXJ0ZXIuamF2YQ==) | `0% <0%> (-100%)` | `0% <0%> (-2%)` | | | [...he/gobblin/cluster/TaskRunnerSuiteThreadModel.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvVGFza1J1bm5lclN1aXRlVGhyZWFkTW9kZWwuamF2YQ==) | `0% <0%> (-100%)` | `0% <0%> (-5%)` | | | [...n/mapreduce/avro/AvroKeyCompactorOutputFormat.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1jb21wYWN0aW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NvbXBhY3Rpb24vbWFwcmVkdWNlL2F2cm8vQXZyb0tleUNvbXBhY3Rvck91dHB1dEZvcm1hdC5qYXZh) | `0% <0%> (-100%)` | `0% <0%> (-3%)` | | | [...apache/gobblin/fork/CopyNotSupportedException.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1hcGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2dvYmJsaW4vZm9yay9Db3B5Tm90U3VwcG9ydGVkRXhjZXB0aW9uLmphdmE=) | `0% <0%> (-100%)` | `0% <0%> (-1%)` | | | 
[.../gobblin/kafka/writer/KafkaWriterCommonConfig.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1tb2R1bGVzL2dvYmJsaW4ta2Fma2EtY29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2thZmthL3dyaXRlci9LYWZrYVdyaXRlckNvbW1vbkNvbmZpZy5qYXZh) | `0% <0%> (-100%)` | `0% <0%> (-7%)` | | | [...ker/task/TaskLevelPolicyCheckerBuilderFactory.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2841/diff?src=pr=tree#diff-Z29iYmxpbi1jb3JlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3F1YWxpdHljaGVja2VyL3Rhc2svVGFza0xldmVsUG9saWN5Q2hlY2tlckJ1aWxkZXJGYWN0b3J5LmphdmE=) | `0% <0%> (-100%)` | `0% <0%> (-2%)` | | | [...bblin/data/
[jira] [Work logged] (GOBBLIN-996) Add support for managed Helix clusters for Gobblin-on-Yarn applications
[ https://issues.apache.org/jira/browse/GOBBLIN-996?focusedWorklogId=354469=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354469 ] ASF GitHub Bot logged work on GOBBLIN-996: -- Author: ASF GitHub Bot Created on: 05/Dec/19 17:04 Start Date: 05/Dec/19 17:04 Worklog Time Spent: 10m Work Description: sv2000 commented on issue #2841: GOBBLIN-996: Add support for managed Helix clusters for Gobblin-on-Ya… URL: https://github.com/apache/incubator-gobblin/pull/2841#issuecomment-562220963 @htran1 @autumnust please review this PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 354469) Time Spent: 20m (was: 10m) > Add support for managed Helix clusters for Gobblin-on-Yarn applications > --- > > Key: GOBBLIN-996 > URL: https://issues.apache.org/jira/browse/GOBBLIN-996 > Project: Apache Gobblin > Issue Type: Improvement > Components: gobblin-cluster >Affects Versions: 0.15.0 >Reporter: Sudarshan Vasudevan >Assignee: Hung Tran >Priority: Major > Fix For: 0.15.0 > > Time Spent: 20m > Remaining Estimate: 0h > > For managed Helix clusters, the GobblinApplicationMaster should skip > connecting to the cluster as CONTROLLER and we should skip cluster creation > from GobblinYarnApplicationLauncher. > > This PR also bumps up Helix dependencies to 0.9.x version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (GOBBLIN-996) Add support for managed Helix clusters for Gobblin-on-Yarn applications
[ https://issues.apache.org/jira/browse/GOBBLIN-996?focusedWorklogId=354467=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354467 ] ASF GitHub Bot logged work on GOBBLIN-996: -- Author: ASF GitHub Bot Created on: 05/Dec/19 17:04 Start Date: 05/Dec/19 17:04 Worklog Time Spent: 10m Work Description: sv2000 commented on pull request #2841: GOBBLIN-996: Add support for managed Helix clusters for Gobblin-on-Ya… URL: https://github.com/apache/incubator-gobblin/pull/2841 …rn applications Dear Gobblin maintainers, Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below! ### JIRA - [x] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR" - https://issues.apache.org/jira/browse/GOBBLIN-996 ### Description - [x] Here are some details about my PR, including screenshots (if applicable): Add support for managed Helix clusters for Gobblin-on-Yarn applications ### Tests - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: End-to-end test. ### Commits - [x] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 2. Subject is limited to 50 characters 3. Subject does not end with a period 4. Subject uses the imperative mood ("add", not "adding") 5. Body wraps at 72 characters 6. Body explains "what" and "why", not "how" This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 354467) Remaining Estimate: 0h Time Spent: 10m > Add support for managed Helix clusters for Gobblin-on-Yarn applications > ------- > > Key: GOBBLIN-996 > URL: https://issues.apache.org/jira/browse/GOBBLIN-996 > Project: Apache Gobblin > Issue Type: Improvement > Components: gobblin-cluster >Affects Versions: 0.15.0 >Reporter: Sudarshan Vasudevan >Assignee: Hung Tran >Priority: Major > Fix For: 0.15.0 > > Time Spent: 10m > Remaining Estimate: 0h > > For managed Helix clusters, the GobblinApplicationMaster should skip > connecting to the cluster as CONTROLLER and we should skip cluster creation > from GobblinYarnApplicationLauncher. > > This PR also bumps up Helix dependencies to 0.9.x version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (GOBBLIN-996) Add support for managed Helix clusters for Gobblin-on-Yarn applications
Sudarshan Vasudevan created GOBBLIN-996: --- Summary: Add support for managed Helix clusters for Gobblin-on-Yarn applications Key: GOBBLIN-996 URL: https://issues.apache.org/jira/browse/GOBBLIN-996 Project: Apache Gobblin Issue Type: Improvement Components: gobblin-cluster Affects Versions: 0.15.0 Reporter: Sudarshan Vasudevan Assignee: Hung Tran Fix For: 0.15.0 For managed Helix clusters, the GobblinApplicationMaster should skip connecting to the cluster as CONTROLLER and we should skip cluster creation from GobblinYarnApplicationLauncher. This PR also bumps up Helix dependencies to 0.9.x version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
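The behaviour described in this issue can be sketched as a single flag that guards two steps: cluster creation in the launcher and the CONTROLLER connection in the ApplicationMaster. The sketch below is only an illustration under assumptions. The class `ManagedHelixSketch`, the key `example.helix.cluster.managed`, and the step labels are hypothetical and are not taken from the Gobblin code base or from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical model of the decision logic; it does not call Helix or YARN.
class ManagedHelixSketch {
    // Hypothetical configuration key, invented for this example.
    static final String MANAGED_KEY = "example.helix.cluster.managed";

    // A cluster is treated as unmanaged unless the flag is explicitly "true".
    static boolean isManaged(Properties config) {
        return Boolean.parseBoolean(config.getProperty(MANAGED_KEY, "false"));
    }

    // Launcher side: create the Helix cluster only when it is not managed externally.
    static List<String> launcherSteps(Properties config) {
        List<String> steps = new ArrayList<>();
        if (!isManaged(config)) {
            steps.add("create-helix-cluster");
        }
        steps.add("submit-yarn-application");
        return steps;
    }

    // ApplicationMaster side: connect as CONTROLLER only for unmanaged clusters.
    static List<String> applicationMasterSteps(Properties config) {
        List<String> steps = new ArrayList<>();
        if (!isManaged(config)) {
            steps.add("connect-as-controller");
        }
        steps.add("start-application-services");
        return steps;
    }

    public static void main(String[] args) {
        Properties managed = new Properties();
        managed.setProperty(MANAGED_KEY, "true");
        System.out.println(launcherSteps(managed));
        System.out.println(applicationMasterSteps(new Properties()));
    }
}
```

Defaulting the flag to false keeps the existing behaviour for unmanaged clusters, which is one reasonable choice when adding such an option. The source does not say which default the actual change uses.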
[jira] [Resolved] (GOBBLIN-836) Expose container logs location via system property to be used in log4j configuration for Gobblin-on-Yarn applications
[ https://issues.apache.org/jira/browse/GOBBLIN-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudarshan Vasudevan resolved GOBBLIN-836. - Resolution: Fixed > Expose container logs location via system property to be used in log4j > configuration for Gobblin-on-Yarn applications > - > > Key: GOBBLIN-836 > URL: https://issues.apache.org/jira/browse/GOBBLIN-836 > Project: Apache Gobblin > Issue Type: Improvement > Components: gobblin-yarn >Affects Versions: 0.15.0 >Reporter: Sudarshan Vasudevan >Assignee: Abhishek Tiwari >Priority: Major > Fix For: 0.15.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This task exposes container log directory and container log file name via > Java System properties, so that they can be referenced in log4j properties. > This in turn allows configuring log rotation of container logs based on size > and maximum number of back ups. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
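As a rough illustration of the idea in this issue, the sketch below publishes a container's log directory and log file name as Java system properties, so that a log4j 1.x properties file could reference them through `${...}` substitution. The property names `example.container.log.dir` and `example.container.log.file` and the class `ContainerLogPropertiesSketch` are assumptions made for this example; the names Gobblin actually uses are not given in the source. The log4j fragment in the comment relies only on the standard `RollingFileAppender` options `File`, `MaxFileSize`, and `MaxBackupIndex`, and its values are placeholders.

```java
import java.nio.file.Paths;

// Hypothetical helper, for illustration only; not Gobblin's implementation.
class ContainerLogPropertiesSketch {
    // Hypothetical property names, invented for this example.
    static final String LOG_DIR_PROPERTY = "example.container.log.dir";
    static final String LOG_FILE_PROPERTY = "example.container.log.file";

    // Publishes the log location so that logging configuration can reference it.
    static void expose(String logDir, String logFileName) {
        System.setProperty(LOG_DIR_PROPERTY, logDir);
        System.setProperty(LOG_FILE_PROPERTY, logFileName);
    }

    // Rebuilds the full log path from the two properties, as a logging
    // configuration would after substituting them.
    static String resolvedLogPath() {
        return Paths.get(System.getProperty(LOG_DIR_PROPERTY),
                         System.getProperty(LOG_FILE_PROPERTY)).toString();
    }

    // A log4j 1.x properties fragment that could consume these properties
    // (size and backup count are placeholder values):
    //
    //   log4j.appender.R=org.apache.log4j.RollingFileAppender
    //   log4j.appender.R.File=${example.container.log.dir}/${example.container.log.file}
    //   log4j.appender.R.MaxFileSize=100MB
    //   log4j.appender.R.MaxBackupIndex=10

    public static void main(String[] args) {
        expose("/tmp/container-logs", "container.log");
        System.out.println(resolvedLogPath());
    }
}
```

The same properties could equally be passed as `-D` flags on the container's JVM command line. The sketch sets them programmatically only to stay self-contained.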
[jira] [Work logged] (GOBBLIN-836) Expose container logs location via system property to be used in log4j configuration for Gobblin-on-Yarn applications
[ https://issues.apache.org/jira/browse/GOBBLIN-836?focusedWorklogId=283724=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-283724 ] ASF GitHub Bot logged work on GOBBLIN-836: -- Author: ASF GitHub Bot Created on: 27/Jul/19 04:19 Start Date: 27/Jul/19 04:19 Worklog Time Spent: 10m Work Description: asfgit commented on pull request #2694: GOBBLIN-836: Expose container logs location via system property to be… URL: https://github.com/apache/incubator-gobblin/pull/2694 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 283724) Time Spent: 0.5h (was: 20m) > Expose container logs location via system property to be used in log4j > configuration for Gobblin-on-Yarn applications > - > > Key: GOBBLIN-836 > URL: https://issues.apache.org/jira/browse/GOBBLIN-836 > Project: Apache Gobblin > Issue Type: Improvement > Components: gobblin-yarn >Affects Versions: 0.15.0 >Reporter: Sudarshan Vasudevan >Assignee: Abhishek Tiwari >Priority: Major > Fix For: 0.15.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This task exposes container log directory and container log file name via > Java System properties, so that they can be referenced in log4j properties. > This in turn allows configuring log rotation of container logs based on size > and maximum number of back ups. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Closed] (GOBBLIN-836) Expose container logs location via system property to be used in log4j configuration for Gobblin-on-Yarn applications
[ https://issues.apache.org/jira/browse/GOBBLIN-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudarshan Vasudevan closed GOBBLIN-836. --- > Expose container logs location via system property to be used in log4j > configuration for Gobblin-on-Yarn applications > - > > Key: GOBBLIN-836 > URL: https://issues.apache.org/jira/browse/GOBBLIN-836 > Project: Apache Gobblin > Issue Type: Improvement > Components: gobblin-yarn >Affects Versions: 0.15.0 >Reporter: Sudarshan Vasudevan >Assignee: Abhishek Tiwari >Priority: Major > Fix For: 0.15.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This task exposes container log directory and container log file name via > Java System properties, so that they can be referenced in log4j properties. > This in turn allows configuring log rotation of container logs based on size > and maximum number of back ups. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Work logged] (GOBBLIN-836) Expose container logs location via system property to be used in log4j configuration for Gobblin-on-Yarn applications
[ https://issues.apache.org/jira/browse/GOBBLIN-836?focusedWorklogId=283680=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-283680 ]

ASF GitHub Bot logged work on GOBBLIN-836:
------------------------------------------

Author: ASF GitHub Bot
Created on: 26/Jul/19 23:16
Start Date: 26/Jul/19 23:16
Worklog Time Spent: 10m
Work Description: codecov-io commented on issue #2694: GOBBLIN-836: Expose container logs location via system property to be…
URL: https://github.com/apache/incubator-gobblin/pull/2694#issuecomment-515626110

# [Codecov](https://codecov.io/gh/apache/incubator-gobblin/pull/2694?src=pr=h1) Report
> Merging [#2694](https://codecov.io/gh/apache/incubator-gobblin/pull/2694?src=pr=desc) into [master](https://codecov.io/gh/apache/incubator-gobblin/commit/7a74579c05eb5c0bf8965bfd96c3c1036c529ffa?src=pr=desc) will **increase** coverage by `0.04%`.
> The diff coverage is `50%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/incubator-gobblin/pull/2694/graphs/tree.svg?width=650=4MgURJ0bGc=150=pr)](https://codecov.io/gh/apache/incubator-gobblin/pull/2694?src=pr=tree)

```diff
@@            Coverage Diff             @@
##             master   #2694     +/-   ##
+ Coverage     44.79%   44.84%  +0.04%
- Complexity     8686     8691      +5
  Files          1878     1878
  Lines         70054    70058      +4
  Branches       7701     7701
+ Hits          31384    31417     +33
+ Misses        35764    35738     -26
+ Partials       2906     2903      -3
```

| [Impacted Files](https://codecov.io/gh/apache/incubator-gobblin/pull/2694?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...che/gobblin/yarn/GobblinYarnConfigurationKeys.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2694/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbllhcm5Db25maWd1cmF0aW9uS2V5cy5qYXZh) | `0% <ø> (ø)` | `0 <0> (ø)` | :arrow_down: |
| [...main/java/org/apache/gobblin/yarn/YarnService.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2694/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vWWFyblNlcnZpY2UuamF2YQ==) | `14.64% <0%> (-0.09%)` | `3 <0> (ø)` | |
| [...rg/apache/gobblin/yarn/GobblinYarnAppLauncher.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2694/diff?src=pr=tree#diff-Z29iYmxpbi15YXJuL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3lhcm4vR29iYmxpbllhcm5BcHBMYXVuY2hlci5qYXZh) | `19.89% <100%> (+0.42%)` | `7 <0> (ø)` | :arrow_down: |
| [...lin/restli/throttling/ZookeeperLeaderElection.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2694/diff?src=pr=tree#diff-Z29iYmxpbi1yZXN0bGkvZ29iYmxpbi10aHJvdHRsaW5nLXNlcnZpY2UvZ29iYmxpbi10aHJvdHRsaW5nLXNlcnZpY2Utc2VydmVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3Jlc3RsaS90aHJvdHRsaW5nL1pvb2tlZXBlckxlYWRlckVsZWN0aW9uLmphdmE=) | `70% <0%> (-2.23%)` | `13% <0%> (ø)` | |
| [...in/java/org/apache/gobblin/cluster/HelixUtils.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2694/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvSGVsaXhVdGlscy5qYXZh) | `41.83% <0%> (ø)` | `14% <0%> (+1%)` | :arrow_up: |
| [.../org/apache/gobblin/cluster/GobblinTaskRunner.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2694/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpblRhc2tSdW5uZXIuamF2YQ==) | `66.19% <0%> (+0.46%)` | `29% <0%> (ø)` | :arrow_down: |
| [.../apache/gobblin/runtime/api/JobExecutionState.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2694/diff?src=pr=tree#diff-Z29iYmxpbi1ydW50aW1lL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL3J1bnRpbWUvYXBpL0pvYkV4ZWN1dGlvblN0YXRlLmphdmE=) | `80.37% <0%> (+0.93%)` | `24% <0%> (ø)` | :arrow_down: |
| [...ache/gobblin/cluster/GobblinHelixJobScheduler.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2694/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkhlbGl4Sm9iU2NoZWR1bGVyLmphdmE=) | `40.52% <0%> (+1.3%)` | `6% <0%> (ø)` | :arrow_down: |
| [...pache/gobblin/cluster/GobblinHelixJobLauncher.java](https://codecov.io/gh/apache/incubator-gobblin/pull/2694/diff?src=pr=tree#diff-Z29iYmxpbi1jbHVzdGVyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9nb2JibGluL2NsdXN0ZXIvR29iYmxpbkhlbGl4Sm9iTGF1bmNoZXIuamF2YQ==) | `83.33% <0%> (+2.08%)` | `28% <0%> (+2%)` | :arrow_up: |
| [.../gobblin/cluster
[jira] [Work logged] (GOBBLIN-836) Expose container logs location via system property to be used in log4j configuration for Gobblin-on-Yarn applications
[ https://issues.apache.org/jira/browse/GOBBLIN-836?focusedWorklogId=283666=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-283666 ] ASF GitHub Bot logged work on GOBBLIN-836: -- Author: ASF GitHub Bot Created on: 26/Jul/19 22:42 Start Date: 26/Jul/19 22:42 Worklog Time Spent: 10m Work Description: sv2000 commented on pull request #2694: GOBBLIN-836: Expose container logs location via system property to be… URL: https://github.com/apache/incubator-gobblin/pull/2694 … used in log4j configuration for Gobblin-on-Yarn applications Dear Gobblin maintainers, Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below! ### JIRA - [x] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR" - https://issues.apache.org/jira/browse/GOBBLIN-836 Expose container logs location via system property to be used in log4j configuration for Gobblin-on-Yarn applications ### Description - [x] Here are some details about my PR, including screenshots (if applicable): This task exposes container log directory and container log file name via Java System properties, so that they can be referenced in log4j properties. This in turn allows configuring log rotation of container logs based on size and maximum number of back ups. ### Tests - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: Tested by deploying a Gobblin-on-Yarn application ### Commits - [x] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 2. Subject is limited to 50 characters 3. Subject does not end with a period 4. 
Subject uses the imperative mood ("add", not "adding") 5. Body wraps at 72 characters 6. Body explains "what" and "why", not "how" This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 283666) Time Spent: 10m Remaining Estimate: 0h > Expose container logs location via system property to be used in log4j > configuration for Gobblin-on-Yarn applications > ----- > > Key: GOBBLIN-836 > URL: https://issues.apache.org/jira/browse/GOBBLIN-836 > Project: Apache Gobblin > Issue Type: Improvement > Components: gobblin-yarn >Affects Versions: 0.15.0 >Reporter: Sudarshan Vasudevan >Assignee: Abhishek Tiwari >Priority: Major > Fix For: 0.15.0 > > Time Spent: 10m > Remaining Estimate: 0h > > This task exposes container log directory and container log file name via > Java System properties, so that they can be referenced in log4j properties. > This in turn allows configuring log rotation of container logs based on size > and maximum number of back ups. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (GOBBLIN-836) Expose container logs location via system property to be used in log4j configuration for Gobblin-on-Yarn applications
Sudarshan Vasudevan created GOBBLIN-836:
-------------------------------------------

Summary: Expose container logs location via system property to be used in log4j configuration for Gobblin-on-Yarn applications
Key: GOBBLIN-836
URL: https://issues.apache.org/jira/browse/GOBBLIN-836
Project: Apache Gobblin
Issue Type: Improvement
Components: gobblin-yarn
Affects Versions: 0.15.0
Reporter: Sudarshan Vasudevan
Assignee: Abhishek Tiwari
Fix For: 0.15.0

This task exposes container log directory and container log file name via Java System properties, so that they can be referenced in log4j properties. This in turn allows configuring log rotation of container logs based on size and maximum number of back ups.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
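The mechanism described in this issue can be illustrated with a small sketch. It only shows the general idea of publishing a log directory and file name as Java system properties so that a log4j configuration can reference them. The property names `example.container.log.dir` and `example.container.log.file`, the class name, and the sample paths are hypothetical and are not taken from the Gobblin change itself; the log4j 1.x keys in the comment (`File`, `MaxFileSize`, `MaxBackupIndex` on `RollingFileAppender`) are standard, but the values are placeholders.

```java
// A minimal sketch, assuming hypothetical property names; not the Gobblin implementation.
final class ContainerLogProperties {
  // Hypothetical system property names chosen for illustration only.
  static final String LOG_DIR_PROPERTY = "example.container.log.dir";
  static final String LOG_FILE_PROPERTY = "example.container.log.file";

  // Publishes the log location so that a logging configuration can reference it.
  static void exposeLogLocation(String logDir, String logFileName) {
    System.setProperty(LOG_DIR_PROPERTY, logDir);
    System.setProperty(LOG_FILE_PROPERTY, logFileName);
  }

  // Shows what a ${dir}/${file} reference would resolve to.
  static String resolvedLogPath() {
    return System.getProperty(LOG_DIR_PROPERTY) + "/" + System.getProperty(LOG_FILE_PROPERTY);
  }

  // A log4j 1.x properties file could then reference the same properties, e.g.:
  //   log4j.appender.file=org.apache.log4j.RollingFileAppender
  //   log4j.appender.file.File=${example.container.log.dir}/${example.container.log.file}
  //   log4j.appender.file.MaxFileSize=10MB
  //   log4j.appender.file.MaxBackupIndex=5
  public static void main(String[] args) {
    exposeLogLocation("/tmp/container_01", "task-runner.log");
    System.out.println(resolvedLogPath());
  }
}
```

In a real deployment the properties would more likely be passed as `-D` flags on the container launch command than set programmatically; the sketch sets them in code only to stay self-contained.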
[jira] [Work logged] (GOBBLIN-834) Provide config for setting ACLs to control visibility of Gobblin-on-Yarn application logs
[ https://issues.apache.org/jira/browse/GOBBLIN-834?focusedWorklogId=283517=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-283517 ] ASF GitHub Bot logged work on GOBBLIN-834: -- Author: ASF GitHub Bot Created on: 26/Jul/19 17:41 Start Date: 26/Jul/19 17:41 Worklog Time Spent: 10m Work Description: asfgit commented on pull request #2693: GOBBLIN-834: Provide config for setting ACLs to control visibility of… URL: https://github.com/apache/incubator-gobblin/pull/2693 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 283517) Time Spent: 20m (was: 10m) > Provide config for setting ACLs to control visibility of Gobblin-on-Yarn > application logs > -- > > Key: GOBBLIN-834 > URL: https://issues.apache.org/jira/browse/GOBBLIN-834 > Project: Apache Gobblin > Issue Type: Improvement > Components: gobblin-yarn >Affects Versions: 0.15.0 >Reporter: Sudarshan Vasudevan >Assignee: Abhishek Tiwari >Priority: Major > Fix For: 0.15.0 > > Time Spent: 20m > Remaining Estimate: 0h > > This change provides a config to set ACLs that control the visibility of the > AM and container logs for Gobblin-on-Yarn applications. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (GOBBLIN-834) Provide config for setting ACLs to control visibility of Gobblin-on-Yarn application logs
[ https://issues.apache.org/jira/browse/GOBBLIN-834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sudarshan Vasudevan resolved GOBBLIN-834.
-----------------------------------------
Resolution: Fixed

> Provide config for setting ACLs to control visibility of Gobblin-on-Yarn
> application logs
> ------------------------------------------------------------------------
>
> Key: GOBBLIN-834
> URL: https://issues.apache.org/jira/browse/GOBBLIN-834
> Project: Apache Gobblin
> Issue Type: Improvement
> Components: gobblin-yarn
> Affects Versions: 0.15.0
> Reporter: Sudarshan Vasudevan
> Assignee: Abhishek Tiwari
> Priority: Major
> Fix For: 0.15.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> This change provides a config to set ACLs that control the visibility of the
> AM and container logs for Gobblin-on-Yarn applications.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Closed] (GOBBLIN-834) Provide config for setting ACLs to control visibility of Gobblin-on-Yarn application logs
[ https://issues.apache.org/jira/browse/GOBBLIN-834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sudarshan Vasudevan closed GOBBLIN-834.
---------------------------------------

> Provide config for setting ACLs to control visibility of Gobblin-on-Yarn
> application logs
> ------------------------------------------------------------------------
>
> Key: GOBBLIN-834
> URL: https://issues.apache.org/jira/browse/GOBBLIN-834
> Project: Apache Gobblin
> Issue Type: Improvement
> Components: gobblin-yarn
> Affects Versions: 0.15.0
> Reporter: Sudarshan Vasudevan
> Assignee: Abhishek Tiwari
> Priority: Major
> Fix For: 0.15.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> This change provides a config to set ACLs that control the visibility of the
> AM and container logs for Gobblin-on-Yarn applications.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Work logged] (GOBBLIN-834) Provide config for setting ACLs to control visibility of Gobblin-on-Yarn application logs
[ https://issues.apache.org/jira/browse/GOBBLIN-834?focusedWorklogId=283475=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-283475 ] ASF GitHub Bot logged work on GOBBLIN-834: -- Author: ASF GitHub Bot Created on: 26/Jul/19 16:52 Start Date: 26/Jul/19 16:52 Worklog Time Spent: 10m Work Description: sv2000 commented on pull request #2693: GOBBLIN-834: Provide config for setting ACLs to control visibility of… URL: https://github.com/apache/incubator-gobblin/pull/2693 … Gobblin-on-Yarn application logs Dear Gobblin maintainers, Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below! ### JIRA - [x] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR" - https://issues.apache.org/jira/browse/GOBBLIN-834 ### Description - [x] Here are some details about my PR, including screenshots (if applicable): This change provides a config to set ACLs that control the visibility of the AM and container logs for Gobblin-on-Yarn applications. ### Tests - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: Test logs visible via UI by running a test application. ### Commits - [x] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 2. Subject is limited to 50 characters 3. Subject does not end with a period 4. Subject uses the imperative mood ("add", not "adding") 5. Body wraps at 72 characters 6. Body explains "what" and "why", not "how" This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 283475) Time Spent: 10m Remaining Estimate: 0h > Provide config for setting ACLs to control visibility of Gobblin-on-Yarn > application logs > -- > > Key: GOBBLIN-834 > URL: https://issues.apache.org/jira/browse/GOBBLIN-834 > Project: Apache Gobblin > Issue Type: Improvement > Components: gobblin-yarn >Affects Versions: 0.15.0 >Reporter: Sudarshan Vasudevan >Assignee: Abhishek Tiwari >Priority: Major > Fix For: 0.15.0 > > Time Spent: 10m > Remaining Estimate: 0h > > This change provides a config to set ACLs that control the visibility of the > AM and container logs for Gobblin-on-Yarn applications. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (GOBBLIN-834) Provide config for setting ACLs to control visibility of Gobblin-on-Yarn application logs
Sudarshan Vasudevan created GOBBLIN-834:
-------------------------------------------

Summary: Provide config for setting ACLs to control visibility of Gobblin-on-Yarn application logs
Key: GOBBLIN-834
URL: https://issues.apache.org/jira/browse/GOBBLIN-834
Project: Apache Gobblin
Issue Type: Improvement
Components: gobblin-yarn
Affects Versions: 0.15.0
Reporter: Sudarshan Vasudevan
Assignee: Abhishek Tiwari
Fix For: 0.15.0

This change provides a config to set ACLs that control the visibility of the AM and container logs for Gobblin-on-Yarn applications.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Resolved] (GOBBLIN-762) Add automatic scaling for Gobblin on YARN
[ https://issues.apache.org/jira/browse/GOBBLIN-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hung Tran resolved GOBBLIN-762.
-------------------------------
Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request #2626
[https://github.com/apache/incubator-gobblin/pull/2626]

> Add automatic scaling for Gobblin on YARN
> -----------------------------------------
>
> Key: GOBBLIN-762
> URL: https://issues.apache.org/jira/browse/GOBBLIN-762
> Project: Apache Gobblin
> Issue Type: Task
> Reporter: Hung Tran
> Priority: Major
> Fix For: 0.15.0
>
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> Gobblin on YARN needs a way to scale up and down the containers based on the
> workload.
> Added `YarnAutoScalingManager` which can be started by the
> `GobblinApplicationMaster` by setting the
> `gobblin.yarn.app.master.serviceClasses` configuration. This class runs a
> scheduled task with a default interval of 60 seconds to detect the number of
> required partitions for the workflows submitted to Helix. It will request the
> `YarnService` to scale to a computed number of containers. If the requested
> number of containers is higher than the YarnService has previously requested
> then it will request more containers. If the requested count is less than the
> current number of allocated containers then it will free any unused
> containers.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
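The scaling decision described in this issue can be sketched as plain arithmetic. The following is a minimal illustration under stated assumptions, not the code from pull request #2626: the class and method names are hypothetical, the partition counts are passed in directly instead of being read from Helix, and rounding up (one container per started group of N partitions) is an assumption based on the "partitions per container" setting visible in the review diff.

```java
import java.util.Arrays;
import java.util.List;

// A minimal sketch of the scaling arithmetic; names are hypothetical.
final class AutoScalingSketch {
  // Target container count: total partitions divided by partitionsPerContainer, rounded up.
  static int targetContainers(List<Integer> partitionsPerJob, int partitionsPerContainer) {
    if (partitionsPerContainer <= 0) {
      throw new IllegalArgumentException("partitionsPerContainer needs to be greater than 0");
    }
    int totalPartitions = 0;
    for (int partitions : partitionsPerJob) {
      totalPartitions += partitions;
    }
    return (totalPartitions + partitionsPerContainer - 1) / partitionsPerContainer;
  }

  // Scale up: request only the containers that have not been requested before.
  static int containersToRequest(int target, int previouslyRequested) {
    return Math.max(0, target - previouslyRequested);
  }

  // Scale down: release surplus containers, but only those that are currently unused.
  static int containersToRelease(int target, int allocated, int unused) {
    return Math.max(0, Math.min(allocated - target, unused));
  }

  public static void main(String[] args) {
    int target = targetContainers(Arrays.asList(3, 4), 2);
    System.out.println("target=" + target
        + " request=" + containersToRequest(target, 1)
        + " release=" + containersToRelease(2, 5, 1));
  }
}
```

Capping the release count at the number of unused containers mirrors the issue text, which says only unused containers are freed when the target drops below the allocated count.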
[jira] [Work logged] (GOBBLIN-762) Add automatic scaling for Gobblin on YARN
[ https://issues.apache.org/jira/browse/GOBBLIN-762?focusedWorklogId=241822=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241822 ] ASF GitHub Bot logged work on GOBBLIN-762: -- Author: ASF GitHub Bot Created on: 14/May/19 16:06 Start Date: 14/May/19 16:06 Worklog Time Spent: 10m Work Description: asfgit commented on pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN URL: https://github.com/apache/incubator-gobblin/pull/2626 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 241822) Time Spent: 3.5h (was: 3h 20m) > Add automatic scaling for Gobblin on YARN > - > > Key: GOBBLIN-762 > URL: https://issues.apache.org/jira/browse/GOBBLIN-762 > Project: Apache Gobblin > Issue Type: Task >Reporter: Hung Tran >Priority: Major > Fix For: 0.15.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > Gobblin on YARN needs a way to scale up and down the containers based on the > workload. > Added `YarnAutoScalingManager` which can be started by the > `GobblinApplicationMaster` by setting the > `gobblin.yarn.app.master.serviceClasses` configuration. This class runs a > scheduled task with a default interval of 60 seconds to detect the number of > required partitions for the workflows submitted to Helix. It will request the > `YarnService` to scale to a computed number of containers. If the requested > number of containers is higher than the YarnService has previously requested > then it will request more containers. If the requested count is less than the > current number of allocated containers then it will free any unused > containers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] [incubator-gobblin] asfgit closed pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN
asfgit closed pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN URL: https://github.com/apache/incubator-gobblin/pull/2626
[jira] [Work logged] (GOBBLIN-762) Add automatic scaling for Gobblin on YARN
[ https://issues.apache.org/jira/browse/GOBBLIN-762?focusedWorklogId=239976=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-239976 ] ASF GitHub Bot logged work on GOBBLIN-762: -- Author: ASF GitHub Bot Created on: 09/May/19 23:11 Start Date: 09/May/19 23:11 Worklog Time Spent: 10m Work Description: jhsenjaliya commented on pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN URL: https://github.com/apache/incubator-gobblin/pull/2626#discussion_r282698677 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAutoScalingManager.java ## @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.gobblin.yarn; + +import java.util.HashSet; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.Executors; +import java.util.concurrent.ScheduledExecutorService; +import java.util.concurrent.TimeUnit; +import java.util.stream.Collectors; + +import org.apache.helix.HelixManager; +import org.apache.helix.task.JobContext; +import org.apache.helix.task.JobDag; +import org.apache.helix.task.TaskDriver; +import org.apache.helix.task.TaskState; +import org.apache.helix.task.WorkflowConfig; +import org.apache.helix.task.WorkflowContext; + +import com.google.common.annotations.VisibleForTesting; +import com.google.common.base.Optional; +import com.google.common.base.Preconditions; +import com.google.common.util.concurrent.AbstractIdleService; +import com.typesafe.config.Config; + +import lombok.AllArgsConstructor; +import lombok.extern.slf4j.Slf4j; + +import org.apache.gobblin.util.ConfigUtils; +import org.apache.gobblin.util.ExecutorsUtils; + + +/** + * The autoscaling manager is responsible for figuring out how many containers are required for the workload and + * requesting the {@link YarnService} to request that many containers. 
+ */ +@Slf4j +public class YarnAutoScalingManager extends AbstractIdleService { + private final String AUTO_SCALING_PREFIX = GobblinYarnConfigurationKeys.GOBBLIN_YARN_PREFIX + "autoScaling."; + private final String AUTO_SCALING_POLLING_INTERVAL_SECS = + AUTO_SCALING_PREFIX + "pollingIntervalSeconds"; + private final int DEFAULT_AUTO_SCALING_POLLING_INTERVAL_SECS = 60; + // Only one container will be requested for each N partitions of work + private final String AUTO_SCALING_PARTITIONS_PER_CONTAINER = AUTO_SCALING_PREFIX + "partitionsPerContainer"; + private final int DEFAULT_AUTO_SCALING_PARTITIONS_PER_CONTAINER = 1; + + private final Config config; + private final HelixManager helixManager; + private final ScheduledExecutorService autoScalingExecutor; + private final YarnService yarnService; + private final int partitionsPerContainer; + + public YarnAutoScalingManager(GobblinApplicationMaster appMaster) { +this.config = appMaster.getConfig(); +this.helixManager = appMaster.getMultiManager().getJobClusterHelixManager(); +this.yarnService = appMaster.getYarnService(); +this.partitionsPerContainer = ConfigUtils.getInt(this.config, AUTO_SCALING_PARTITIONS_PER_CONTAINER, +DEFAULT_AUTO_SCALING_PARTITIONS_PER_CONTAINER); + +Preconditions.checkArgument(this.partitionsPerContainer > 0, +AUTO_SCALING_PARTITIONS_PER_CONTAINER + " needs to be greater than 0"); + +this.autoScalingExecutor = Executors.newSingleThreadScheduledExecutor( +ExecutorsUtils.newThreadFactory(Optional.of(log), Optional.of("AutoScalingExecutor"))); + } + + @Override + protected void startUp() throws Exception { +int scheduleInterval = ConfigUtils.getInt(this.config, AUTO_SCALING_POLLING_INTERVAL_SECS, +DEFAULT_AUTO_SCALING_POLLING_INTERVAL_SECS); +log.info("Starting the " + YarnAutoScalingManager.class.getSimpleName()); +log.info("Scheduling the auto scaling task with an interval of {} seconds", scheduleInterval); + +this.autoScalingExecutor.scheduleAtFixedRate(new YarnAutoScalingRunnable(new 
TaskDriver(this.helixManager), +t
[GitHub] [incubator-gobblin] jhsenjaliya commented on a change in pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN
jhsenjaliya commented on a change in pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN URL: https://github.com/apache/incubator-gobblin/pull/2626#discussion_r282698677 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAutoScalingManager.java ## @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.gobblin.yarn; + +import java.util.HashSet; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.Executors; +import java.util.concurrent.ScheduledExecutorService; +import java.util.concurrent.TimeUnit; +import java.util.stream.Collectors; + +import org.apache.helix.HelixManager; +import org.apache.helix.task.JobContext; +import org.apache.helix.task.JobDag; +import org.apache.helix.task.TaskDriver; +import org.apache.helix.task.TaskState; +import org.apache.helix.task.WorkflowConfig; +import org.apache.helix.task.WorkflowContext; + +import com.google.common.annotations.VisibleForTesting; +import com.google.common.base.Optional; +import com.google.common.base.Preconditions; +import com.google.common.util.concurrent.AbstractIdleService; +import com.typesafe.config.Config; + +import lombok.AllArgsConstructor; +import lombok.extern.slf4j.Slf4j; + +import org.apache.gobblin.util.ConfigUtils; +import org.apache.gobblin.util.ExecutorsUtils; + + +/** + * The autoscaling manager is responsible for figuring out how many containers are required for the workload and + * requesting the {@link YarnService} to request that many containers. 
+ */ +@Slf4j +public class YarnAutoScalingManager extends AbstractIdleService { + private final String AUTO_SCALING_PREFIX = GobblinYarnConfigurationKeys.GOBBLIN_YARN_PREFIX + "autoScaling."; + private final String AUTO_SCALING_POLLING_INTERVAL_SECS = + AUTO_SCALING_PREFIX + "pollingIntervalSeconds"; + private final int DEFAULT_AUTO_SCALING_POLLING_INTERVAL_SECS = 60; + // Only one container will be requested for each N partitions of work + private final String AUTO_SCALING_PARTITIONS_PER_CONTAINER = AUTO_SCALING_PREFIX + "partitionsPerContainer"; + private final int DEFAULT_AUTO_SCALING_PARTITIONS_PER_CONTAINER = 1; + + private final Config config; + private final HelixManager helixManager; + private final ScheduledExecutorService autoScalingExecutor; + private final YarnService yarnService; + private final int partitionsPerContainer; + + public YarnAutoScalingManager(GobblinApplicationMaster appMaster) { +this.config = appMaster.getConfig(); +this.helixManager = appMaster.getMultiManager().getJobClusterHelixManager(); +this.yarnService = appMaster.getYarnService(); +this.partitionsPerContainer = ConfigUtils.getInt(this.config, AUTO_SCALING_PARTITIONS_PER_CONTAINER, +DEFAULT_AUTO_SCALING_PARTITIONS_PER_CONTAINER); + +Preconditions.checkArgument(this.partitionsPerContainer > 0, +AUTO_SCALING_PARTITIONS_PER_CONTAINER + " needs to be greater than 0"); + +this.autoScalingExecutor = Executors.newSingleThreadScheduledExecutor( +ExecutorsUtils.newThreadFactory(Optional.of(log), Optional.of("AutoScalingExecutor"))); + } + + @Override + protected void startUp() throws Exception { +int scheduleInterval = ConfigUtils.getInt(this.config, AUTO_SCALING_POLLING_INTERVAL_SECS, +DEFAULT_AUTO_SCALING_POLLING_INTERVAL_SECS); +log.info("Starting the " + YarnAutoScalingManager.class.getSimpleName()); +log.info("Scheduling the auto scaling task with an interval of {} seconds", scheduleInterval); + +this.autoScalingExecutor.scheduleAtFixedRate(new YarnAutoScalingRunnable(new 
TaskDriver(this.helixManager), +this.yarnService, this.partitionsPerContainer), 0, +scheduleInterval, TimeUnit.SECONDS); + } + + @Override + protected void shutDown() throws Exception { +log.info("Stopping the " + YarnAutoScalingManager.class.getSimpleName()); + +ExecutorsUtils.shutdownExecutorService(this.autoScalingExecutor, Optional.of(log)); + } + + /** + * A {@link Runnable} that figures out the number of
[jira] [Work logged] (GOBBLIN-762) Add automatic scaling for Gobblin on YARN
[ https://issues.apache.org/jira/browse/GOBBLIN-762?focusedWorklogId=239453=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-239453 ]

ASF GitHub Bot logged work on GOBBLIN-762:
------------------------------------------

Author: ASF GitHub Bot
Created on: 08/May/19 20:18
Start Date: 08/May/19 20:18
Worklog Time Spent: 10m
Work Description: htran1 commented on pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN
URL: https://github.com/apache/incubator-gobblin/pull/2626#discussion_r281835351

## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinApplicationMaster.java ##
@@ -82,13 +89,22 @@ public GobblinApplicationMaster(String applicationName, ContainerId containerId,
           .addService(gobblinYarnLogSource.buildLogCopier(this.config, containerId, this.fs, this.appWorkDir));
     }
-    this.applicationLauncher
-        .addService(buildYarnService(this.config, applicationName, this.applicationId, yarnConfiguration, this.fs));
+    this.yarnService = buildYarnService(this.config, applicationName, this.applicationId, yarnConfiguration, this.fs);
+    this.applicationLauncher.addService(this.yarnService);
     if (UserGroupInformation.isSecurityEnabled()) {
       LOGGER.info("Adding YarnContainerSecurityManager since security is enabled");
       this.applicationLauncher.addService(buildYarnContainerSecurityManager(this.config, this.fs));
     }
+
+    // Add additional services
+    List serviceClassNames = ConfigUtils.getStringList(this.config,
+        GobblinYarnConfigurationKeys.APP_MASTER_SERVICE_CLASSES);
+
+    for (String serviceClassName : serviceClassNames) {
+      Class serviceClass = Class.forName(serviceClassName);
+      this.applicationLauncher.addService((Service) GobblinConstructorUtils.invokeLongestConstructor(serviceClass, this));

Review comment:
This is to support services that take no arguments.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 239453)
Time Spent: 2h 20m (was: 2h 10m)

> Add automatic scaling for Gobblin on YARN
> -----------------------------------------
>
> Key: GOBBLIN-762
> URL: https://issues.apache.org/jira/browse/GOBBLIN-762
> Project: Apache Gobblin
> Issue Type: Task
> Reporter: Hung Tran
> Priority: Major
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> Gobblin on YARN needs a way to scale up and down the containers based on the
> workload.
> Added `YarnAutoScalingManager` which can be started by the
> `GobblinApplicationMaster` by setting the
> `gobblin.yarn.app.master.serviceClasses` configuration. This class runs a
> scheduled task with a default interval of 60 seconds to detect the number of
> required partitions for the workflows submitted to Helix. It will request the
> `YarnService` to scale to a computed number of containers. If the requested
> number of containers is higher than the YarnService has previously requested
> then it will request more containers. If the requested count is less than the
> current number of allocated containers then it will free any unused
> containers.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
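The review comment above explains that the "longest constructor" lookup lets configured services either accept the application master or take no arguments. A minimal sketch of that idea with plain `java.lang.reflect` follows. It is not Gobblin's `GobblinConstructorUtils` implementation: the helper `newInstanceWithLongestPrefix` and the two example service classes are hypothetical, and the matching rule (try the full argument list first, then progressively shorter prefixes) is an assumption made for illustration.

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;

// Hypothetical example service that wants a context argument.
final class ContextService {
  final String context;
  ContextService(String context) { this.context = context; }
}

// Hypothetical example service that takes no arguments.
final class PlainService {
  PlainService() { }
}

// A minimal sketch, not Gobblin's implementation.
final class LongestConstructorSketch {
  // Tries the full argument list first, then shorter prefixes, down to no arguments.
  static Object newInstanceWithLongestPrefix(Class<?> cls, Object... args)
      throws ReflectiveOperationException {
    for (int n = args.length; n >= 0; n--) {
      Object[] prefix = Arrays.copyOf(args, n);
      for (Constructor<?> ctor : cls.getDeclaredConstructors()) {
        if (accepts(ctor.getParameterTypes(), prefix)) {
          ctor.setAccessible(true);
          return ctor.newInstance(prefix);
        }
      }
    }
    throw new NoSuchMethodException("No matching constructor on " + cls.getName());
  }

  // True when each value can be passed for the parameter at the same position.
  // Primitive parameter types are not matched in this sketch.
  private static boolean accepts(Class<?>[] types, Object[] values) {
    if (types.length != values.length) {
      return false;
    }
    for (int i = 0; i < types.length; i++) {
      if (values[i] == null ? types[i].isPrimitive() : !types[i].isInstance(values[i])) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws Exception {
    Object withContext = newInstanceWithLongestPrefix(ContextService.class, "appMaster");
    Object plain = newInstanceWithLongestPrefix(PlainService.class, "appMaster");
    System.out.println(withContext.getClass().getSimpleName() + ", " + plain.getClass().getSimpleName());
  }
}
```

With this fallback, a single loop over configured class names can instantiate both kinds of service without the caller knowing which constructor each one declares.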
[jira] [Work logged] (GOBBLIN-762) Add automatic scaling for Gobblin on YARN
[ https://issues.apache.org/jira/browse/GOBBLIN-762?focusedWorklogId=239460=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-239460 ] ASF GitHub Bot logged work on GOBBLIN-762: -- Author: ASF GitHub Bot Created on: 08/May/19 20:18 Start Date: 08/May/19 20:18 Worklog Time Spent: 10m Work Description: htran1 commented on pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN URL: https://github.com/apache/incubator-gobblin/pull/2626#discussion_r281870422 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAutoScalingManager.java ## @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.gobblin.yarn; + +import java.util.HashSet; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.Executors; +import java.util.concurrent.ScheduledExecutorService; +import java.util.concurrent.TimeUnit; +import java.util.stream.Collectors; + +import org.apache.helix.HelixManager; +import org.apache.helix.task.JobContext; +import org.apache.helix.task.JobDag; +import org.apache.helix.task.TaskDriver; +import org.apache.helix.task.TaskState; +import org.apache.helix.task.WorkflowConfig; +import org.apache.helix.task.WorkflowContext; + +import com.google.common.annotations.VisibleForTesting; +import com.google.common.base.Optional; +import com.google.common.base.Preconditions; +import com.google.common.util.concurrent.AbstractIdleService; +import com.typesafe.config.Config; + +import lombok.AllArgsConstructor; +import lombok.extern.slf4j.Slf4j; + +import org.apache.gobblin.util.ConfigUtils; +import org.apache.gobblin.util.ExecutorsUtils; + + +/** + * The autoscaling manager is responsible for figuring out how many containers are required for the workload and + * requesting the {@link YarnService} to request that many containers. 
+ */ +@Slf4j +public class YarnAutoScalingManager extends AbstractIdleService { + private final String AUTO_SCALING_PREFIX = GobblinYarnConfigurationKeys.GOBBLIN_YARN_PREFIX + "autoScaling."; + private final String AUTO_SCALING_POLLING_INTERVAL_SECS = + AUTO_SCALING_PREFIX + "pollingIntervalSeconds"; + private final int DEFAULT_AUTO_SCALING_POLLING_INTERVAL_SECS = 60; + // Only one container will be requested for each N partitions of work + private final String AUTO_SCALING_PARTITIONS_PER_CONTAINER = AUTO_SCALING_PREFIX + "partitionsPerContainer"; + private final int DEFAULT_AUTO_SCALING_PARTITIONS_PER_CONTAINER = 1; + + private final Config config; + private final HelixManager helixManager; + private final ScheduledExecutorService autoScalingExecutor; + private final YarnService yarnService; + private final int partitionsPerContainer; + + public YarnAutoScalingManager(GobblinApplicationMaster appMaster) { +this.config = appMaster.getConfig(); +this.helixManager = appMaster.getMultiManager().getJobClusterHelixManager(); +this.yarnService = appMaster.getYarnService(); +this.partitionsPerContainer = ConfigUtils.getInt(this.config, AUTO_SCALING_PARTITIONS_PER_CONTAINER, +DEFAULT_AUTO_SCALING_PARTITIONS_PER_CONTAINER); + +Preconditions.checkArgument(this.partitionsPerContainer > 0, +AUTO_SCALING_PARTITIONS_PER_CONTAINER + " needs to be greater than 0"); + +this.autoScalingExecutor = Executors.newSingleThreadScheduledExecutor( +ExecutorsUtils.newThreadFactory(Optional.of(log), Optional.of("AutoScalingExecutor"))); + } + + @Override + protected void startUp() throws Exception { +int scheduleInterval = ConfigUtils.getInt(this.config, AUTO_SCALING_POLLING_INTERVAL_SECS, +DEFAULT_AUTO_SCALING_POLLING_INTERVAL_SECS); +log.info("Starting the " + YarnAutoScalingManager.class.getSimpleName()); +log.info("Scheduling the auto scaling task with an interval of {} seconds", scheduleInterval); + +this.autoScalingExecutor.scheduleAtFixedRate(new YarnAutoScalingRunnable(new 
TaskDriver(this.helixManager), +this.yarnService, this.
[jira] [Work logged] (GOBBLIN-762) Add automatic scaling for Gobblin on YARN
[ https://issues.apache.org/jira/browse/GOBBLIN-762?focusedWorklogId=239461=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-239461 ] ASF GitHub Bot logged work on GOBBLIN-762: -- Author: ASF GitHub Bot Created on: 08/May/19 20:18 Start Date: 08/May/19 20:18 Worklog Time Spent: 10m Work Description: htran1 commented on pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN URL: https://github.com/apache/incubator-gobblin/pull/2626#discussion_r281850457 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ## @@ -310,10 +344,70 @@ private EventSubmitter buildEventSubmitter() { .build(); } - private void requestInitialContainers(int containersRequested) { -for (int i = 0; i < containersRequested; i++) { + /** + * Request an allocation of containers. If numTargetContainers is larger than the max of current and expected number + * of containers then additional containers are requested. + * + * If numTargetContainers is less than the current number of allocated containers then release free containers. + * Shrinking is relative to the number of currently allocated containers since it takes time for containers + * to be allocated and assigned work and we want to avoid releasing a container prematurely before it is assigned + * work. This means that a container may not be released even though numTargetContainers is less than the requested + * number of containers. The intended usage is for the caller of this method to make periodic calls to attempt to + * adjust the cluster towards the desired number of containers. 
+ * + * @param numTargetContainers the desired number of containers + * @param inUseInstances a set of in use instances + */ + public synchronized void requestTargetNumberOfContainers(int numTargetContainers, Set inUseInstances) { +LOGGER.debug("Requesting numTargetContainers {} current numRequestedContainers {} in use instances {} map size {}", +numTargetContainers, this.numRequestedContainers, inUseInstances, this.containerMap.size()); + +// YARN can allocate more than the requested number of containers, compute additional allocations and deallocations +// based on the max of the requested and actual allocated counts +int numAllocatedContainers = this.containerMap.size(); + +// The number of allocated containers may be higher than the previously requested amount +// and there may be outstanding allocation requests, so the max of both counts is computed here +// and used to decide whether to allocate containers. +int numContainers = Math.max(numRequestedContainers, numAllocatedContainers); + +// Request additional containers if the desired count is higher than the max of the current allocation or previously +// requested amount. +for (int i = numContainers; i < numTargetContainers; i++) { Review comment: Yes, the true allocation count can be higher than the requested count. If YARN happens to give more containers after we compute `numContainers` here then the requested amount can take us higher than the target. This should stabilize over time as auto-scaling brings the count down. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 239461) Time Spent: 3h 10m (was: 3h) > Add automatic scaling for Gobblin on YARN > - > > Key: GOBBLIN-762 > URL: https://issues.apache.org/jira/browse/GOBBLIN-762 > Project: Apache Gobblin > Issue Type: Task >Reporter: Hung Tran >Priority: Major > Time Spent: 3h 10m > Remaining Estimate: 0h > > Gobblin on YARN needs a way to scale up and down the containers based on the > workload. > Added `YarnAutoScalingManager` which can be started by the > `GobblinApplicationMaster` by setting the > `gobblin.yarn.app.master.serviceClasses` configuration. This class runs a > scheduled task with a default interval of 60 seconds to detect the number of > required partitions for the workflows submitted to Helix. It will request the > `YarnService` to scale to a computed number of containers. If the requested > number of containers is higher than the YarnService has previously requested > then it will request more containers. If the requested count is less than the > current number of allocated containers then it will
[jira] [Work logged] (GOBBLIN-762) Add automatic scaling for Gobblin on YARN
[ https://issues.apache.org/jira/browse/GOBBLIN-762?focusedWorklogId=239459=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-239459 ] ASF GitHub Bot logged work on GOBBLIN-762: -- Author: ASF GitHub Bot Created on: 08/May/19 20:18 Start Date: 08/May/19 20:18 Worklog Time Spent: 10m Work Description: htran1 commented on pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN URL: https://github.com/apache/incubator-gobblin/pull/2626#discussion_r281840066 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAutoScalingManager.java ## @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.gobblin.yarn; + +import java.util.HashSet; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.Executors; +import java.util.concurrent.ScheduledExecutorService; +import java.util.concurrent.TimeUnit; +import java.util.stream.Collectors; + +import org.apache.helix.HelixManager; +import org.apache.helix.task.JobContext; +import org.apache.helix.task.JobDag; +import org.apache.helix.task.TaskDriver; +import org.apache.helix.task.TaskState; +import org.apache.helix.task.WorkflowConfig; +import org.apache.helix.task.WorkflowContext; + +import com.google.common.annotations.VisibleForTesting; +import com.google.common.base.Optional; +import com.google.common.base.Preconditions; +import com.google.common.util.concurrent.AbstractIdleService; +import com.typesafe.config.Config; + +import lombok.AllArgsConstructor; +import lombok.extern.slf4j.Slf4j; + +import org.apache.gobblin.util.ConfigUtils; +import org.apache.gobblin.util.ExecutorsUtils; + + +/** + * The autoscaling manager is responsible for figuring out how many containers are required for the workload and + * requesting the {@link YarnService} to request that many containers. 
+ */ +@Slf4j +public class YarnAutoScalingManager extends AbstractIdleService { + private final String AUTO_SCALING_PREFIX = GobblinYarnConfigurationKeys.GOBBLIN_YARN_PREFIX + "autoScaling."; + private final String AUTO_SCALING_POLLING_INTERVAL_SECS = + AUTO_SCALING_PREFIX + "pollingIntervalSeconds"; + private final int DEFAULT_AUTO_SCALING_POLLING_INTERVAL_SECS = 60; + // Only one container will be requested for each N partitions of work + private final String AUTO_SCALING_PARTITIONS_PER_CONTAINER = AUTO_SCALING_PREFIX + "partitionsPerContainer"; + private final int DEFAULT_AUTO_SCALING_PARTITIONS_PER_CONTAINER = 1; + + private final Config config; + private final HelixManager helixManager; + private final ScheduledExecutorService autoScalingExecutor; + private final YarnService yarnService; + private final int partitionsPerContainer; + + public YarnAutoScalingManager(GobblinApplicationMaster appMaster) { +this.config = appMaster.getConfig(); +this.helixManager = appMaster.getMultiManager().getJobClusterHelixManager(); +this.yarnService = appMaster.getYarnService(); +this.partitionsPerContainer = ConfigUtils.getInt(this.config, AUTO_SCALING_PARTITIONS_PER_CONTAINER, +DEFAULT_AUTO_SCALING_PARTITIONS_PER_CONTAINER); + +Preconditions.checkArgument(this.partitionsPerContainer > 0, +AUTO_SCALING_PARTITIONS_PER_CONTAINER + " needs to be greater than 0"); + +this.autoScalingExecutor = Executors.newSingleThreadScheduledExecutor( +ExecutorsUtils.newThreadFactory(Optional.of(log), Optional.of("AutoScalingExecutor"))); + } + + @Override + protected void startUp() throws Exception { +int scheduleInterval = ConfigUtils.getInt(this.config, AUTO_SCALING_POLLING_INTERVAL_SECS, +DEFAULT_AUTO_SCALING_POLLING_INTERVAL_SECS); +log.info("Starting the " + YarnAutoScalingManager.class.getSimpleName()); +log.info("Scheduling the auto scaling task with an interval of {} seconds", scheduleInterval); + +this.autoScalingExecutor.scheduleAtFixedRate(new YarnAutoScalingRunnable(new 
TaskDriver(this.helixManager), +this.yarnService, this.
[jira] [Work logged] (GOBBLIN-762) Add automatic scaling for Gobblin on YARN
[ https://issues.apache.org/jira/browse/GOBBLIN-762?focusedWorklogId=239457=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-239457 ] ASF GitHub Bot logged work on GOBBLIN-762: -- Author: ASF GitHub Bot Created on: 08/May/19 20:18 Start Date: 08/May/19 20:18 Worklog Time Spent: 10m Work Description: htran1 commented on pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN URL: https://github.com/apache/incubator-gobblin/pull/2626#discussion_r281867296 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ## @@ -468,6 +562,13 @@ private void handleContainerCompletion(ContainerStatus containerStatus) { containerStatus.getContainerId(), containerStatus.getDiagnostics())); } +if (this.releasedContainerSet.contains(containerStatus.getContainerId())) { + LOGGER.info("Container release requested, so not spawning a replacement for containerId {}", + containerStatus.getContainerId()); + this.releasedContainerSet.remove(containerStatus.getContainerId()); Review comment: The existing code assumes `handleContainerCompletion` only gets called once per completion, since the code ```Map.Entry completedContainerEntry = this.containerMap.remove(containerStatus.getContainerId()); String completedInstanceName = completedContainerEntry.getValue();``` would otherwise hit an NPE. I can't find in the documentation what the guarantee is on this. I made a change to store the released container in a cache with a TTL, so I removed the `remove` code, but if `handleContainerCompletion` does get called multiple times then that is an existing bug that needs to be fixed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 239457) > Add automatic scaling for Gobblin on YARN > - > > Key: GOBBLIN-762 > URL: https://issues.apache.org/jira/browse/GOBBLIN-762 > Project: Apache Gobblin > Issue Type: Task >Reporter: Hung Tran >Priority: Major > Time Spent: 2h 50m > Remaining Estimate: 0h > > Gobblin on YARN needs a way to scale up and down the containers based on the > workload. > Added `YarnAutoScalingManager` which can be started by the > `GobblinApplicationMaster` by setting the > `gobblin.yarn.app.master.serviceClasses` configuration. This class runs a > scheduled task with a default interval of 60 seconds to detect the number of > required partitions for the workflows submitted to Helix. It will request the > `YarnService` to scale to a computed number of containers. If the requested > number of containers is higher than the YarnService has previously requested > then it will request more containers. If the requested count is less than the > current number of allocated containers then it will free any unused > containers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
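Editorial sketch of the idea in the review comment above: released container IDs are remembered for a limited time, so a repeated completion callback still sees the release, and entries whose completion never arrives eventually expire. The comment only says "a cache with TTL"; the pull request's actual cache type and timeout are not shown here. The stand-in below uses only the JDK, and `ExpiringReleasedSetSketch`, `markReleased`, `wasReleased`, and the injected clock are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical illustration; not part of Gobblin.
final class ExpiringReleasedSetSketch<K> {
  private final Map<K, Long> expiryByKey = new ConcurrentHashMap<>();
  private final long ttlMillis;
  private final LongSupplier clock; // injected so expiry can be tested deterministically

  ExpiringReleasedSetSketch(long ttlMillis, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  /** Records that the container was explicitly released. */
  void markReleased(K containerId) {
    expiryByKey.put(containerId, clock.getAsLong() + ttlMillis);
  }

  /**
   * Returns true if the container was released and its entry has not expired.
   * The entry is not removed on lookup, so repeated callbacks agree.
   */
  boolean wasReleased(K containerId) {
    Long expiry = expiryByKey.get(containerId);
    if (expiry == null) {
      return false;
    }
    if (clock.getAsLong() >= expiry) {
      // Lazy cleanup of a stale entry; remove only if it was not refreshed meanwhile.
      expiryByKey.remove(containerId, expiry);
      return false;
    }
    return true;
  }
}
```

In production code the clock would typically be `System::currentTimeMillis`; a library cache with write expiry gives the same behaviour plus bounded size.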
[jira] [Work logged] (GOBBLIN-762) Add automatic scaling for Gobblin on YARN
[ https://issues.apache.org/jira/browse/GOBBLIN-762?focusedWorklogId=239458=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-239458 ] ASF GitHub Bot logged work on GOBBLIN-762: -- Author: ASF GitHub Bot Created on: 08/May/19 20:18 Start Date: 08/May/19 20:18 Worklog Time Spent: 10m Work Description: htran1 commented on pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN URL: https://github.com/apache/incubator-gobblin/pull/2626#discussion_r281846385 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ## @@ -159,6 +173,11 @@ private volatile boolean shutdownInProgress = false; + // The number of containers requested Review comment: Done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 239458) Time Spent: 3h (was: 2h 50m) > Add automatic scaling for Gobblin on YARN > - > > Key: GOBBLIN-762 > URL: https://issues.apache.org/jira/browse/GOBBLIN-762 > Project: Apache Gobblin > Issue Type: Task >Reporter: Hung Tran >Priority: Major > Time Spent: 3h > Remaining Estimate: 0h > > Gobblin on YARN needs a way to scale up and down the containers based on the > workload. > Added `YarnAutoScalingManager` which can be started by the > `GobblinApplicationMaster` by setting the > `gobblin.yarn.app.master.serviceClasses` configuration. This class runs a > scheduled task with a default interval of 60 seconds to detect the number of > required partitions for the workflows submitted to Helix. It will request the > `YarnService` to scale to a computed number of containers. If the requested > number of containers is higher than the YarnService has previously requested > then it will request more containers. 
If the requested count is less than the > current number of allocated containers then it will free any unused > containers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (GOBBLIN-762) Add automatic scaling for Gobblin on YARN
[ https://issues.apache.org/jira/browse/GOBBLIN-762?focusedWorklogId=239456=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-239456 ] ASF GitHub Bot logged work on GOBBLIN-762: -- Author: ASF GitHub Bot Created on: 08/May/19 20:18 Start Date: 08/May/19 20:18 Worklog Time Spent: 10m Work Description: htran1 commented on pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN URL: https://github.com/apache/incubator-gobblin/pull/2626#discussion_r281849098 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ## @@ -229,6 +248,21 @@ public void handleContainerShutdownRequest(ContainerShutdownRequest containerShu } } + /** + * Request the Resource Manager to release the container + * @param containerReleaseRequest containers to release + */ + @Subscribe + public void handleContainerReleaseRequest(ContainerReleaseRequest containerReleaseRequest) { +for (Container container : containerReleaseRequest.getContainers()) { + LOGGER.info(String.format("Releasing container %s running on %s", container.getId(), container.getNodeId())); + + // record that this container was explicitly released so that a new one is not spawned to replace it + this.releasedContainerSet.add(container.getId()); Review comment: Added some comments. Also changed the concurrent set to a `Cache` to address Kuai's concern about cleanup. That case should be rare, but the Cache will clean up after some time if it occurs. The async container operations can interleave, and the race conditions cannot be avoided since there is no global ordering or locking mechanism when interacting with YARN. The auto-scaler should be able to reach the desired steady state after several iterations. So if a container happens to die and get replaced just as the auto-scaler decides to release it, then another release request will be issued in a future iteration if the new container is unused. This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 239456) Time Spent: 2h 50m (was: 2h 40m) > Add automatic scaling for Gobblin on YARN > - > > Key: GOBBLIN-762 > URL: https://issues.apache.org/jira/browse/GOBBLIN-762 > Project: Apache Gobblin > Issue Type: Task >Reporter: Hung Tran >Priority: Major > Time Spent: 2h 50m > Remaining Estimate: 0h > > Gobblin on YARN needs a way to scale up and down the containers based on the > workload. > Added `YarnAutoScalingManager` which can be started by the > `GobblinApplicationMaster` by setting the > `gobblin.yarn.app.master.serviceClasses` configuration. This class runs a > scheduled task with a default interval of 60 seconds to detect the number of > required partitions for the workflows submitted to Helix. It will request the > `YarnService` to scale to a computed number of containers. If the requested > number of containers is higher than the YarnService has previously requested > then it will request more containers. If the requested count is less than the > current number of allocated containers then it will free any unused > containers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (GOBBLIN-762) Add automatic scaling for Gobblin on YARN
[ https://issues.apache.org/jira/browse/GOBBLIN-762?focusedWorklogId=239454=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-239454 ] ASF GitHub Bot logged work on GOBBLIN-762: -- Author: ASF GitHub Bot Created on: 08/May/19 20:18 Start Date: 08/May/19 20:18 Worklog Time Spent: 10m Work Description: htran1 commented on pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN URL: https://github.com/apache/incubator-gobblin/pull/2626#discussion_r281836523 ## File path: gobblin-yarn/build.gradle ## @@ -65,6 +65,10 @@ dependencies { testCompile externalDependency.hadoopYarnMiniCluster testCompile externalDependency.curatorFramework testCompile externalDependency.curatorTest + + testCompile ('com.google.inject:guice:3.0') { +force = true Review comment: The test mini cluster uses guice APIs that are not compatible with guice 4.0. I think this is a known issue: this file already has the line `force 'com.google.inject:guice:3.0'` for the testRuntime configuration, but that did not seem to work fully. The test would run fine in the IDE but failed at the command line. Adding the testCompile entry resolved the dependency issue when running the test from the gradle build command. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 239454) Time Spent: 2.5h (was: 2h 20m) > Add automatic scaling for Gobblin on YARN > - > > Key: GOBBLIN-762 > URL: https://issues.apache.org/jira/browse/GOBBLIN-762 > Project: Apache Gobblin > Issue Type: Task >Reporter: Hung Tran >Priority: Major > Time Spent: 2.5h > Remaining Estimate: 0h > > Gobblin on YARN needs a way to scale up and down the containers based on the > workload. 
> Added `YarnAutoScalingManager` which can be started by the > `GobblinApplicationMaster` by setting the > `gobblin.yarn.app.master.serviceClasses` configuration. This class runs a > scheduled task with a default interval of 60 seconds to detect the number of > required partitions for the workflows submitted to Helix. It will request the > `YarnService` to scale to a computed number of containers. If the requested > number of containers is higher than the YarnService has previously requested > then it will request more containers. If the requested count is less than the > current number of allocated containers then it will free any unused > containers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (GOBBLIN-762) Add automatic scaling for Gobblin on YARN
[ https://issues.apache.org/jira/browse/GOBBLIN-762?focusedWorklogId=239455=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-239455 ] ASF GitHub Bot logged work on GOBBLIN-762: -- Author: ASF GitHub Bot Created on: 08/May/19 20:18 Start Date: 08/May/19 20:18 Worklog Time Spent: 10m Work Description: htran1 commented on pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN URL: https://github.com/apache/incubator-gobblin/pull/2626#discussion_r281838842 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAutoScalingManager.java ## @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.gobblin.yarn; + +import java.util.HashSet; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.Executors; +import java.util.concurrent.ScheduledExecutorService; +import java.util.concurrent.TimeUnit; +import java.util.stream.Collectors; + +import org.apache.helix.HelixManager; +import org.apache.helix.task.JobContext; +import org.apache.helix.task.JobDag; +import org.apache.helix.task.TaskDriver; +import org.apache.helix.task.TaskState; +import org.apache.helix.task.WorkflowConfig; +import org.apache.helix.task.WorkflowContext; + +import com.google.common.annotations.VisibleForTesting; +import com.google.common.base.Optional; +import com.google.common.base.Preconditions; +import com.google.common.util.concurrent.AbstractIdleService; +import com.typesafe.config.Config; + +import lombok.AllArgsConstructor; +import lombok.extern.slf4j.Slf4j; + +import org.apache.gobblin.util.ConfigUtils; +import org.apache.gobblin.util.ExecutorsUtils; + + +/** + * The autoscaling manager is responsible for figuring out how many containers are required for the workload and + * requesting the {@link YarnService} to request that many containers. 
+ */ +@Slf4j +public class YarnAutoScalingManager extends AbstractIdleService { + private final String AUTO_SCALING_PREFIX = GobblinYarnConfigurationKeys.GOBBLIN_YARN_PREFIX + "autoScaling."; + private final String AUTO_SCALING_POLLING_INTERVAL_SECS = + AUTO_SCALING_PREFIX + "pollingIntervalSeconds"; + private final int DEFAULT_AUTO_SCALING_POLLING_INTERVAL_SECS = 60; + // Only one container will be requested for each N partitions of work + private final String AUTO_SCALING_PARTITIONS_PER_CONTAINER = AUTO_SCALING_PREFIX + "partitionsPerContainer"; + private final int DEFAULT_AUTO_SCALING_PARTITIONS_PER_CONTAINER = 1; + + private final Config config; + private final HelixManager helixManager; + private final ScheduledExecutorService autoScalingExecutor; + private final YarnService yarnService; + private final int partitionsPerContainer; + + public YarnAutoScalingManager(GobblinApplicationMaster appMaster) { +this.config = appMaster.getConfig(); +this.helixManager = appMaster.getMultiManager().getJobClusterHelixManager(); +this.yarnService = appMaster.getYarnService(); +this.partitionsPerContainer = ConfigUtils.getInt(this.config, AUTO_SCALING_PARTITIONS_PER_CONTAINER, +DEFAULT_AUTO_SCALING_PARTITIONS_PER_CONTAINER); + +Preconditions.checkArgument(this.partitionsPerContainer > 0, +AUTO_SCALING_PARTITIONS_PER_CONTAINER + " needs to be greater than 0"); + +this.autoScalingExecutor = Executors.newSingleThreadScheduledExecutor( +ExecutorsUtils.newThreadFactory(Optional.of(log), Optional.of("AutoScalingExecutor"))); + } + + @Override + protected void startUp() throws Exception { +int scheduleInterval = ConfigUtils.getInt(this.config, AUTO_SCALING_POLLING_INTERVAL_SECS, +DEFAULT_AUTO_SCALING_POLLING_INTERVAL_SECS); +log.info("Starting the " + YarnAutoScalingManager.class.getSimpleName()); +log.info("Scheduling the auto scaling task with an interval of {} seconds", scheduleInterval); + +this.autoScalingExecutor.scheduleAtFixedRate(new YarnAutoScalingRunnable(new 
TaskDriver(this.helixManager), +this.yarnService, this.
[GitHub] [incubator-gobblin] htran1 commented on a change in pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN
htran1 commented on a change in pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN URL: https://github.com/apache/incubator-gobblin/pull/2626#discussion_r281836523 ## File path: gobblin-yarn/build.gradle ## @@ -65,6 +65,10 @@ dependencies { testCompile externalDependency.hadoopYarnMiniCluster testCompile externalDependency.curatorFramework testCompile externalDependency.curatorTest + + testCompile ('com.google.inject:guice:3.0') { +force = true Review comment: The test mini cluster uses guice APIs that are not compatible with guice 4.0. I think this is a known issue: this file already has the line `force 'com.google.inject:guice:3.0'` for the testRuntime configuration, but that did not seem to work fully. The test would run fine in the IDE but failed at the command line. Adding the testCompile entry resolved the dependency issue when running the test from the gradle build command. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-gobblin] htran1 commented on a change in pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN
htran1 commented on a change in pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN URL: https://github.com/apache/incubator-gobblin/pull/2626#discussion_r281867296 ## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ## @@ -468,6 +562,13 @@ private void handleContainerCompletion(ContainerStatus containerStatus) { containerStatus.getContainerId(), containerStatus.getDiagnostics())); } +if (this.releasedContainerSet.contains(containerStatus.getContainerId())) { + LOGGER.info("Container release requested, so not spawning a replacement for containerId {}", + containerStatus.getContainerId()); + this.releasedContainerSet.remove(containerStatus.getContainerId()); Review comment: The existing code assumes `handleContainerCompletion` only gets called once per completion, since the code ```Map.Entry completedContainerEntry = this.containerMap.remove(containerStatus.getContainerId()); String completedInstanceName = completedContainerEntry.getValue();``` would otherwise hit an NPE. I can't find in the documentation what the guarantee is on this. I made a change to store the released container in a cache with a TTL, so I removed the `remove` code, but if `handleContainerCompletion` does get called multiple times then that is an existing bug that needs to be fixed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-gobblin] htran1 commented on a change in pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN
htran1 commented on a change in pull request #2626: [GOBBLIN-762] Add automatic scaling for Gobblin on YARN
URL: https://github.com/apache/incubator-gobblin/pull/2626#discussion_r281850457

## File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java ##
@@ -310,10 +344,70 @@ private EventSubmitter buildEventSubmitter() {
         .build();
   }

-  private void requestInitialContainers(int containersRequested) {
-    for (int i = 0; i < containersRequested; i++) {
+  /**
+   * Request an allocation of containers. If numTargetContainers is larger than the max of current and expected number
+   * of containers then additional containers are requested.
+   *
+   * If numTargetContainers is less than the current number of allocated containers then release free containers.
+   * Shrinking is relative to the number of currently allocated containers since it takes time for containers
+   * to be allocated and assigned work and we want to avoid releasing a container prematurely before it is assigned
+   * work. This means that a container may not be released even though numTargetContainers is less than the requested
+   * number of containers. The intended usage is for the caller of this method to make periodic calls to attempt to
+   * adjust the cluster towards the desired number of containers.
+   *
+   * @param numTargetContainers the desired number of containers
+   * @param inUseInstances a set of in use instances
+   */
+  public synchronized void requestTargetNumberOfContainers(int numTargetContainers, Set inUseInstances) {
+    LOGGER.debug("Requesting numTargetContainers {} current numRequestedContainers {} in use instances {} map size {}",
+        numTargetContainers, this.numRequestedContainers, inUseInstances, this.containerMap.size());
+
+    // YARN can allocate more than the requested number of containers, compute additional allocations and deallocations
+    // based on the max of the requested and actual allocated counts
+    int numAllocatedContainers = this.containerMap.size();
+
+    // The number of allocated containers may be higher than the previously requested amount
+    // and there may be outstanding allocation requests, so the max of both counts is computed here
+    // and used to decide whether to allocate containers.
+    int numContainers = Math.max(numRequestedContainers, numAllocatedContainers);
+
+    // Request additional containers if the desired count is higher than the max of the current allocation or previously
+    // requested amount.
+    for (int i = numContainers; i < numTargetContainers; i++) {

Review comment:
Yes, the true allocation count can be higher than the requested count. If YARN happens to give more containers after we compute `numContainers` here then the requested amount can take us higher than the target. This should stabilize over time as auto-scaling brings the count down.
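The grow/shrink arithmetic described in the Javadoc above can be sketched as a pure function, which makes the asymmetry easier to see: growth is measured against the max of requested and allocated counts, while shrinking is measured against allocated containers only and skips instances that are in use. This is an illustrative sketch, not the body of `requestTargetNumberOfContainers`. `ScalingPlanSketch`, `plan`, the `String` ids, and the choice to release free containers in map iteration order are all assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the YarnService implementation.
final class ScalingPlanSketch {
  final int containersToRequest;
  final List<String> containersToRelease;

  private ScalingPlanSketch(int containersToRequest, List<String> containersToRelease) {
    this.containersToRequest = containersToRequest;
    this.containersToRelease = Collections.unmodifiableList(containersToRelease);
  }

  /**
   * @param numTargetContainers desired number of containers
   * @param numRequestedContainers number of containers requested so far
   * @param containerToInstance allocated containers: containerId -> instance name
   * @param inUseInstances instance names that currently have work assigned
   */
  static ScalingPlanSketch plan(int numTargetContainers, int numRequestedContainers,
      Map<String, String> containerToInstance, Set<String> inUseInstances) {
    int numAllocatedContainers = containerToInstance.size();

    // Grow relative to the max of requested and allocated counts, so that
    // outstanding requests are not issued a second time.
    int numContainers = Math.max(numRequestedContainers, numAllocatedContainers);
    int toRequest = Math.max(0, numTargetContainers - numContainers);

    // Shrink relative to allocated containers only, and never release a
    // container whose instance is in use.
    List<String> toRelease = new ArrayList<>();
    int excess = numAllocatedContainers - numTargetContainers;
    if (excess > 0) {
      for (Map.Entry<String, String> entry : containerToInstance.entrySet()) {
        if (toRelease.size() >= excess) {
          break;
        }
        if (!inUseInstances.contains(entry.getValue())) {
          toRelease.add(entry.getKey());
        }
      }
    }
    return new ScalingPlanSketch(toRequest, toRelease);
  }
}
```

For example, with three allocated containers, four previously requested, and a target of five, the plan requests one more container. With a target of one and one of the three instances in use, it releases the two free containers. If every instance is in use, nothing is released and the caller is expected to try again on a later periodic call.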