[jira] [Created] (YARN-11640) capacity scheduler supports application priority with FairOrderingPolicy
Ming Chen created YARN-11640: Summary: capacity scheduler supports application priority with FairOrderingPolicy Key: YARN-11640 URL: https://issues.apache.org/jira/browse/YARN-11640 Project: Hadoop YARN Issue Type: Improvement Components: yarn Reporter: Ming Chen -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-574) PrivateLocalizer does not support parallel resource download via ContainerLocalizer
[ https://issues.apache.org/jira/browse/YARN-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802793#comment-17802793 ] Shilun Fan commented on YARN-574: - Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > PrivateLocalizer does not support parallel resource download via > ContainerLocalizer > --- > > Key: YARN-574 > URL: https://issues.apache.org/jira/browse/YARN-574 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.6.0, 2.8.0, 2.7.1 >Reporter: Omkar Vinit Joshi >Assignee: Ajith S >Priority: Major > Attachments: YARN-574.03.patch, YARN-574.04.patch, YARN-574.05.patch, > YARN-574.1.patch, YARN-574.2.patch > > > At present, private resources are downloaded in parallel only if multiple > containers request the same resource; otherwise downloads are serial. > The protocol between PrivateLocalizer and ContainerLocalizer supports > multiple downloads; however, it is not used and only one resource is sent for > download at a time. > I think we can increase/ensure parallelism (even for a single container > requesting resources) for private/application resources by making multiple > downloads per ContainerLocalizer. > Total parallelism before > = number of threads allotted for PublicLocalizer [public resources] + number > of containers [private and application resources] > Total parallelism after > = number of threads allotted for PublicLocalizer [public resources] + number > of containers * max downloads per container [private and application resources] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
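[Editor's note] The proposal above amounts to giving each ContainerLocalizer a small download pool instead of a serial loop. A minimal sketch under that assumption; the maxDownloadsPerContainer knob and the fetch() stand-in are illustrative, not the real ContainerLocalizer API:

{code}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelDownloadSketch {
  // Hypothetical knob; not an existing YARN configuration property.
  private static final int MAX_DOWNLOADS_PER_CONTAINER = 4;

  // Stand-in for the real per-resource download done by the localizer.
  private static void fetch(String resource) {
    System.out.println("downloading " + resource);
  }

  public static void main(String[] args) throws InterruptedException {
    List<String> resources = Arrays.asList(
        "hdfs:///app/lib1.jar", "hdfs:///app/lib2.jar", "hdfs:///app/conf.xml");

    // A bounded pool per localizer: total parallelism becomes
    // (number of containers) * MAX_DOWNLOADS_PER_CONTAINER,
    // matching the "Total parallelism after" formula above.
    ExecutorService pool = Executors.newFixedThreadPool(MAX_DOWNLOADS_PER_CONTAINER);
    for (final String r : resources) {
      pool.submit(new Runnable() {
        @Override public void run() { fetch(r); }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
  }
}
{code}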
[jira] [Updated] (YARN-574) PrivateLocalizer does not support parallel resource download via ContainerLocalizer
[ https://issues.apache.org/jira/browse/YARN-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-574: Target Version/s: 3.5.0 (was: 3.4.0) > PrivateLocalizer does not support parallel resource download via > ContainerLocalizer > --- > > Key: YARN-574 > URL: https://issues.apache.org/jira/browse/YARN-574 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.6.0, 2.8.0, 2.7.1 >Reporter: Omkar Vinit Joshi >Assignee: Ajith S >Priority: Major > Attachments: YARN-574.03.patch, YARN-574.04.patch, YARN-574.05.patch, > YARN-574.1.patch, YARN-574.2.patch > > > At present, private resources are downloaded in parallel only if multiple > containers request the same resource; otherwise downloads are serial. > The protocol between PrivateLocalizer and ContainerLocalizer supports > multiple downloads; however, it is not used and only one resource is sent for > download at a time. > I think we can increase/ensure parallelism (even for a single container > requesting resources) for private/application resources by making multiple > downloads per ContainerLocalizer. > Total parallelism before > = number of threads allotted for PublicLocalizer [public resources] + number > of containers [private and application resources] > Total parallelism after > = number of threads allotted for PublicLocalizer [public resources] + number > of containers * max downloads per container [private and application resources] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-1564) add workflow YARN services
[ https://issues.apache.org/jira/browse/YARN-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802788#comment-17802788 ] Shilun Fan commented on YARN-1564: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > add workflow YARN services > -- > > Key: YARN-1564 > URL: https://issues.apache.org/jira/browse/YARN-1564 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Labels: oct16-hard > Attachments: YARN-1564-001.patch, YARN-1564-002.patch, > YARN-1564-003.patch > > Original Estimate: 24h > Time Spent: 48h > Remaining Estimate: 0h > > I've been using some alternative composite services to help build workflows > of process execution in a YARN AM. > They and their tests could be moved into YARN for use by others - this would > make it easier to build aggregate services in an AM -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-1426) YARN Components need to unregister their beans upon shutdown
[ https://issues.apache.org/jira/browse/YARN-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-1426: - Target Version/s: 3.5.0 (was: 3.4.0) > YARN Components need to unregister their beans upon shutdown > > > Key: YARN-1426 > URL: https://issues.apache.org/jira/browse/YARN-1426 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.3.0, 3.0.0-alpha1 >Reporter: Jonathan Turner Eagles >Assignee: Jonathan Turner Eagles >Priority: Major > Labels: oct16-easy > Attachments: YARN-1426.2.patch, YARN-1426.patch, YARN-1426.patch > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-867) Isolation of failures in aux services
[ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-867: Target Version/s: 3.5.0 (was: 3.4.0) > Isolation of failures in aux services > -- > > Key: YARN-867 > URL: https://issues.apache.org/jira/browse/YARN-867 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Hitesh Shah >Assignee: Xuan Gong >Priority: Major > Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, > YARN-867.4.patch, YARN-867.5.patch, YARN-867.6.patch, > YARN-867.sampleCode.2.patch > > > Today, a malicious application can bring down the NM by sending bad data to a > service. For example, sending data to the ShuffleService such that it results > in any non-IOException will cause the NM's async dispatcher to exit, as the > service's INIT APP event is not handled properly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-1564) add workflow YARN services
[ https://issues.apache.org/jira/browse/YARN-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-1564: - Target Version/s: 3.5.0 (was: 3.4.0) > add workflow YARN services > -- > > Key: YARN-1564 > URL: https://issues.apache.org/jira/browse/YARN-1564 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Labels: oct16-hard > Attachments: YARN-1564-001.patch, YARN-1564-002.patch, > YARN-1564-003.patch > > Original Estimate: 24h > Time Spent: 48h > Remaining Estimate: 0h > > I've been using some alternative composite services to help build workflows > of process execution in a YARN AM. > They and their tests could be moved into YARN for use by others - this would > make it easier to build aggregate services in an AM -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-1946) need Public interface for WebAppUtils.getProxyHostAndPort
[ https://issues.apache.org/jira/browse/YARN-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-1946: - Target Version/s: 3.5.0 (was: 3.4.0) > need Public interface for WebAppUtils.getProxyHostAndPort > - > > Key: YARN-1946 > URL: https://issues.apache.org/jira/browse/YARN-1946 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, webapp >Affects Versions: 2.4.0 >Reporter: Thomas Graves >Priority: Major > > ApplicationMasters are supposed to go through the ResourceManager web app > proxy if they have web UIs so that they are properly secured. There is currently > no public interface for Application Masters to conveniently get the proxy > host and port. There is a function in WebAppUtils, but that class is > private. > We should provide this as a utility since any properly written AM will need > to do this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
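[Editor's note] A public helper of the kind requested could be as small as the sketch below. It mirrors the behavior described (prefer a standalone proxy address, fall back to the RM web app address) using the standard yarn.web-proxy.address and yarn.resourcemanager.webapp.address keys, but the class name and exact fallback logic are assumptions, not the private WebAppUtils implementation:

{code}
import org.apache.hadoop.conf.Configuration;

public final class ProxyAddressUtil {
  private ProxyAddressUtil() {}

  /** Returns host:port of the web app proxy an AM should expect. */
  public static String getProxyHostAndPort(Configuration conf) {
    String addr = conf.get("yarn.web-proxy.address");
    if (addr == null || addr.isEmpty()) {
      // No standalone proxy configured; the RM embeds the proxy,
      // so the RM web app address is the proxy address.
      addr = conf.get("yarn.resourcemanager.webapp.address", "0.0.0.0:8088");
    }
    return addr;
  }
}
{code}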
[jira] [Commented] (YARN-1426) YARN Components need to unregister their beans upon shutdown
[ https://issues.apache.org/jira/browse/YARN-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802789#comment-17802789 ] Shilun Fan commented on YARN-1426: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > YARN Components need to unregister their beans upon shutdown > > > Key: YARN-1426 > URL: https://issues.apache.org/jira/browse/YARN-1426 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.3.0, 3.0.0-alpha1 >Reporter: Jonathan Turner Eagles >Assignee: Jonathan Turner Eagles >Priority: Major > Labels: oct16-easy > Attachments: YARN-1426.2.patch, YARN-1426.patch, YARN-1426.patch > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-1946) need Public interface for WebAppUtils.getProxyHostAndPort
[ https://issues.apache.org/jira/browse/YARN-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802787#comment-17802787 ] Shilun Fan commented on YARN-1946: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > need Public interface for WebAppUtils.getProxyHostAndPort > - > > Key: YARN-1946 > URL: https://issues.apache.org/jira/browse/YARN-1946 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, webapp >Affects Versions: 2.4.0 >Reporter: Thomas Graves >Priority: Major > > ApplicationMasters are supposed to go through the ResourceManager web app > proxy if they have web UIs so that they are properly secured. There is currently > no public interface for Application Masters to conveniently get the proxy > host and port. There is a function in WebAppUtils, but that class is > private. > We should provide this as a utility since any properly written AM will need > to do this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2024) IOException in AppLogAggregatorImpl does not give stacktrace and leaves aggregated TFile in a bad state.
[ https://issues.apache.org/jira/browse/YARN-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802785#comment-17802785 ] Shilun Fan commented on YARN-2024: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > IOException in AppLogAggregatorImpl does not give stacktrace and leaves > aggregated TFile in a bad state. > > > Key: YARN-2024 > URL: https://issues.apache.org/jira/browse/YARN-2024 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation >Affects Versions: 0.23.10, 2.4.0 >Reporter: Eric Payne >Assignee: Xuan Gong >Priority: Major > > Multiple issues were encountered when AppLogAggregatorImpl encountered an > IOException in AppLogAggregatorImpl#uploadLogsForContainer while aggregating > yarn-logs for an application that had very large (>150G each) error logs. > - An IOException was encountered during the LogWriter#append call, and a > message was printed, but no stacktrace was provided. Message: "ERROR: > Couldn't upload logs for container_n_nnn_nn_nn. Skipping > this container." > - After the IOException, the TFile is in a bad state, so subsequent calls to > LogWriter#append fail with the following stacktrace: > 2014-04-16 13:29:09,772 [LogAggregationService #17907] ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[LogAggregationService #17907,5,main] threw an Exception. > java.lang.IllegalStateException: Incorrect state to start a new key: IN_VALUE > at > org.apache.hadoop.io.file.tfile.TFile$Writer.prepareAppendKey(TFile.java:528) > at > org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogWriter.append(AggregatedLogFormat.java:262) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainer(AppLogAggregatorImpl.java:128) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:164) > ... > - At this point, the yarn-logs cleaner still thinks the thread is > aggregating, so the huge yarn-logs never get cleaned up for that application. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
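[Editor's note] One way to address the first two symptoms is to log the full stack trace and to stop reusing a writer once an append has failed, since the underlying TFile can no longer accept keys. A minimal sketch under those assumptions; LogWriter here is a stand-in interface, not the real AggregatedLogFormat.LogWriter:

{code}
import java.io.IOException;

public class SafeAppender {
  /** Stand-in for AggregatedLogFormat.LogWriter. */
  interface LogWriter { void append(String containerId) throws IOException; }

  private final LogWriter writer;
  private boolean broken = false;   // set once the TFile state is suspect

  SafeAppender(LogWriter writer) { this.writer = writer; }

  void upload(String containerId) {
    if (broken) {
      System.err.println("Skipping " + containerId + ": writer already failed");
      return;
    }
    try {
      writer.append(containerId);
    } catch (IOException e) {
      broken = true;            // never touch the half-written TFile again
      e.printStackTrace();      // keep the full stack trace, not just a message
    }
  }
}
{code}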
[jira] [Updated] (YARN-2024) IOException in AppLogAggregatorImpl does not give stacktrace and leaves aggregated TFile in a bad state.
[ https://issues.apache.org/jira/browse/YARN-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-2024: - Target Version/s: 3.5.0 (was: 3.4.0) > IOException in AppLogAggregatorImpl does not give stacktrace and leaves > aggregated TFile in a bad state. > > > Key: YARN-2024 > URL: https://issues.apache.org/jira/browse/YARN-2024 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation >Affects Versions: 0.23.10, 2.4.0 >Reporter: Eric Payne >Assignee: Xuan Gong >Priority: Major > > Multiple issues were encountered when AppLogAggregatorImpl encountered an > IOException in AppLogAggregatorImpl#uploadLogsForContainer while aggregating > yarn-logs for an application that had very large (>150G each) error logs. > - An IOException was encountered during the LogWriter#append call, and a > message was printed, but no stacktrace was provided. Message: "ERROR: > Couldn't upload logs for container_n_nnn_nn_nn. Skipping > this container." > - After the IOException, the TFile is in a bad state, so subsequent calls to > LogWriter#append fail with the following stacktrace: > 2014-04-16 13:29:09,772 [LogAggregationService #17907] ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[LogAggregationService #17907,5,main] threw an Exception. > java.lang.IllegalStateException: Incorrect state to start a new key: IN_VALUE > at > org.apache.hadoop.io.file.tfile.TFile$Writer.prepareAppendKey(TFile.java:528) > at > org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogWriter.append(AggregatedLogFormat.java:262) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainer(AppLogAggregatorImpl.java:128) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:164) > ... > - At this point, the yarn-logs cleaner still thinks the thread is > aggregating, so the huge yarn-logs never get cleaned up for that application. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-2098) App priority support in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-2098: - Target Version/s: 3.5.0 > App priority support in Fair Scheduler > -- > > Key: YARN-2098 > URL: https://issues.apache.org/jira/browse/YARN-2098 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Ashwin Shankar >Assignee: Shilun Fan >Priority: Major > Labels: pull-request-available > Attachments: YARN-2098.patch, YARN-2098.patch > > > This jira is created to support app priorities in the fair scheduler. > AppSchedulable hard-codes the priority of apps to 1; we should change this to get the > priority from ApplicationSubmissionContext. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-2031) YARN Proxy model doesn't support REST APIs in AMs
[ https://issues.apache.org/jira/browse/YARN-2031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-2031: - Target Version/s: 3.5.0 (was: 3.4.0) > YARN Proxy model doesn't support REST APIs in AMs > - > > Key: YARN-2031 > URL: https://issues.apache.org/jira/browse/YARN-2031 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > Attachments: YARN-2031-002.patch, YARN-2031-003.patch, > YARN-2031-004.patch, YARN-2031-005.patch, YARN-2031.patch.001 > > > AMs can't support REST APIs because > # the AM filter redirects all requests to the proxy with a 302 response (not > 307) > # the proxy doesn't forward PUT/POST/DELETE verbs > Either the AM filter needs to return 307 and the proxy to forward the verbs, > or the AM filter should not filter the REST part of the web site -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
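[Editor's note] The first half of the proposed fix is easy to picture: have the filter answer with 307 (Temporary Redirect), which obliges clients to replay the same verb and body against the proxy, instead of 302, which most clients follow with a GET. A hedged sketch using the servlet API; the proxy URL and the filter class are illustrative, not the real AmIpFilter:

{code}
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class TemporaryRedirectFilter implements Filter {
  // Assumed proxy base URL; the real filter derives this from config.
  private static final String PROXY_BASE =
      "http://rm-host:8088/proxy/application_1234_0001";

  @Override public void init(FilterConfig cfg) {}
  @Override public void destroy() {}

  @Override
  public void doFilter(ServletRequest req, ServletResponse res,
      FilterChain chain) throws IOException, ServletException {
    HttpServletRequest r = (HttpServletRequest) req;
    HttpServletResponse w = (HttpServletResponse) res;
    // A real filter would pass requests arriving via the proxy through
    // (chain.doFilter) and only redirect direct hits; elided here.
    w.setStatus(HttpServletResponse.SC_TEMPORARY_REDIRECT); // 307, not 302
    w.setHeader("Location", PROXY_BASE + r.getRequestURI());
  }
}
{code}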
[jira] [Commented] (YARN-2031) YARN Proxy model doesn't support REST APIs in AMs
[ https://issues.apache.org/jira/browse/YARN-2031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802784#comment-17802784 ] Shilun Fan commented on YARN-2031: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > YARN Proxy model doesn't support REST APIs in AMs > - > > Key: YARN-2031 > URL: https://issues.apache.org/jira/browse/YARN-2031 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > Attachments: YARN-2031-002.patch, YARN-2031-003.patch, > YARN-2031-004.patch, YARN-2031-005.patch, YARN-2031.patch.001 > > > AMs can't support REST APIs because > # the AM filter redirects all requests to the proxy with a 302 response (not > 307) > # the proxy doesn't forward PUT/POST/DELETE verbs > Either the AM filter needs to return 307 and the proxy to forward the verbs, > or the AM filter should not filter the REST part of the web site -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2098) App priority support in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802783#comment-17802783 ] Shilun Fan commented on YARN-2098: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > App priority support in Fair Scheduler > -- > > Key: YARN-2098 > URL: https://issues.apache.org/jira/browse/YARN-2098 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Ashwin Shankar >Assignee: Shilun Fan >Priority: Major > Labels: pull-request-available > Attachments: YARN-2098.patch, YARN-2098.patch > > > This jira is created to support app priorities in the fair scheduler. > AppSchedulable hard-codes the priority of apps to 1; we should change this to get the > priority from ApplicationSubmissionContext. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-2098) App priority support in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan reopened YARN-2098: -- Assignee: Shilun Fan > App priority support in Fair Scheduler > -- > > Key: YARN-2098 > URL: https://issues.apache.org/jira/browse/YARN-2098 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Ashwin Shankar >Assignee: Shilun Fan >Priority: Major > Labels: pull-request-available > Attachments: YARN-2098.patch, YARN-2098.patch > > > This jira is created to support app priorities in the fair scheduler. > AppSchedulable hard-codes the priority of apps to 1; we should change this to get the > priority from ApplicationSubmissionContext. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-2098) App priority support in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-2098: - Target Version/s: 3.5.0 (was: 3.4.0) > App priority support in Fair Scheduler > -- > > Key: YARN-2098 > URL: https://issues.apache.org/jira/browse/YARN-2098 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Ashwin Shankar >Priority: Major > Labels: pull-request-available > Attachments: YARN-2098.patch, YARN-2098.patch > > > This jira is created to support app priorities in the fair scheduler. > AppSchedulable hard-codes the priority of apps to 1; we should change this to get the > priority from ApplicationSubmissionContext. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802782#comment-17802782 ] Shilun Fan commented on YARN-2681: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Support bandwidth enforcement for containers while reading from HDFS > > > Key: YARN-2681 > URL: https://issues.apache.org/jira/browse/YARN-2681 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.5.1 > Environment: Linux >Reporter: Nam H. Do >Priority: Major > Attachments: Traffic Control Design.png, YARN-2681.001.patch, > YARN-2681.002.patch, YARN-2681.003.patch, YARN-2681.004.patch, > YARN-2681.005.patch, YARN-2681.patch > > > To read/write data from HDFS on a data node, applications establish TCP/IP > connections with the datanode. HDFS reads can be controlled by configuring the > Linux Traffic Control (TC) subsystem on the data node to apply filters to the > appropriate connections. > The current cgroups net_cls concept cannot be applied on the node where the > container is launched, nor on the data node, since: > - TC handles outgoing bandwidth only, so it cannot be set on the container node > (HDFS read = incoming data for the container) > - Since the HDFS data node is handled by only one process, it is not possible > to use net_cls to separate connections from different containers to the > datanode. > Tasks: > 1) Extend the Resource model to define a bandwidth enforcement rate > 2) Monitor TCP/IP connections established by the container handling process and > its child processes > 3) Set Linux Traffic Control rules on the data node based on address:port pairs in > order to enforce bandwidth of outgoing data > Concept: http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf > Implementation: > http://www.hit.bme.hu/~dohoai/documents/HdfsTrafficControl.pdf > http://www.hit.bme.hu/~dohoai/documents/HdfsTrafficControl_UML_diagram.png -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
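[Editor's note] Task 3 boils down to installing an htb class capped at the enforcement rate and a u32 filter that steers the matching datanode connection into it. A sketch of that shell plumbing driven from Java; the device name, handles, rate, and address:port pair are all illustrative placeholders, and real code would install one filter per tracked connection:

{code}
import java.io.IOException;

public class TrafficControlSketch {
  private static void run(String... cmd)
      throws IOException, InterruptedException {
    new ProcessBuilder(cmd).inheritIO().start().waitFor();
  }

  public static void main(String[] args) throws Exception {
    String dev = "eth0"; // assumed datanode egress device

    // Root qdisc plus a class capped at the container's enforcement rate.
    run("tc", "qdisc", "add", "dev", dev, "root", "handle", "1:", "htb");
    run("tc", "class", "add", "dev", dev, "parent", "1:", "classid", "1:10",
        "htb", "rate", "50mbit");

    // Steer the datanode -> container connection (matched by destination
    // address and port) into the capped class.
    run("tc", "filter", "add", "dev", dev, "parent", "1:", "protocol", "ip",
        "prio", "1", "u32",
        "match", "ip", "dst", "10.0.0.42/32",
        "match", "ip", "dport", "39000", "0xffff",
        "flowid", "1:10");
  }
}
{code}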
[jira] [Resolved] (YARN-2098) App priority support in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-2098. -- Target Version/s: (was: 3.5.0) Resolution: Done > App priority support in Fair Scheduler > -- > > Key: YARN-2098 > URL: https://issues.apache.org/jira/browse/YARN-2098 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Ashwin Shankar >Priority: Major > Labels: pull-request-available > Attachments: YARN-2098.patch, YARN-2098.patch > > > This jira is created for supporting app priorities in fair scheduler. > AppSchedulable hard codes priority of apps to 1, we should change this to get > priority from ApplicationSubmissionContext. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-2684) FairScheduler: When failing an application due to changes in queue config or placement policy, indicate the cause.
[ https://issues.apache.org/jira/browse/YARN-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-2684: - Target Version/s: 3.5.0 (was: 3.4.0) > FairScheduler: When failing an application due to changes in queue config or > placement policy, indicate the cause. > -- > > Key: YARN-2684 > URL: https://issues.apache.org/jira/browse/YARN-2684 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.5.1 >Reporter: Karthik Kambatla >Priority: Major > Attachments: 0001-YARN-2684.patch, 0002-YARN-2684.patch > > > YARN-2308 fixes this issue for CS; this JIRA is to fix it for FS. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-2681) Support bandwidth enforcement for containers while reading from HDFS
[ https://issues.apache.org/jira/browse/YARN-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-2681: - Target Version/s: 3.5.0 (was: 3.4.0) > Support bandwidth enforcement for containers while reading from HDFS > > > Key: YARN-2681 > URL: https://issues.apache.org/jira/browse/YARN-2681 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.5.1 > Environment: Linux >Reporter: Nam H. Do >Priority: Major > Attachments: Traffic Control Design.png, YARN-2681.001.patch, > YARN-2681.002.patch, YARN-2681.003.patch, YARN-2681.004.patch, > YARN-2681.005.patch, YARN-2681.patch > > > To read/write data from HDFS on a data node, applications establish TCP/IP > connections with the datanode. HDFS reads can be controlled by configuring the > Linux Traffic Control (TC) subsystem on the data node to apply filters to the > appropriate connections. > The current cgroups net_cls concept cannot be applied on the node where the > container is launched, nor on the data node, since: > - TC handles outgoing bandwidth only, so it cannot be set on the container node > (HDFS read = incoming data for the container) > - Since the HDFS data node is handled by only one process, it is not possible > to use net_cls to separate connections from different containers to the > datanode. > Tasks: > 1) Extend the Resource model to define a bandwidth enforcement rate > 2) Monitor TCP/IP connections established by the container handling process and > its child processes > 3) Set Linux Traffic Control rules on the data node based on address:port pairs in > order to enforce bandwidth of outgoing data > Concept: http://www.hit.bme.hu/~do/papers/EnforcementDesign.pdf > Implementation: > http://www.hit.bme.hu/~dohoai/documents/HdfsTrafficControl.pdf > http://www.hit.bme.hu/~dohoai/documents/HdfsTrafficControl_UML_diagram.png -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-2748) Upload logs in the sub-folders under the local log dir when aggregating logs
[ https://issues.apache.org/jira/browse/YARN-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-2748: - Target Version/s: 3.5.0 (was: 3.4.0) > Upload logs in the sub-folders under the local log dir when aggregating logs > > > Key: YARN-2748 > URL: https://issues.apache.org/jira/browse/YARN-2748 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation >Affects Versions: 2.6.0 >Reporter: Zhijie Shen >Assignee: Varun Saxena >Priority: Major > Attachments: YARN-2748.001.patch, YARN-2748.002.patch, > YARN-2748.03.patch, YARN-2748.04.patch > > > YARN-2734 has a temporary fix to skip sub-folders to avoid an exception. Ideally, > if the app is creating a sub-folder and putting its rolling logs there, we > need to upload these logs as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2836) RM behaviour on token renewal failures is broken
[ https://issues.apache.org/jira/browse/YARN-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802779#comment-17802779 ] Shilun Fan commented on YARN-2836: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > RM behaviour on token renewal failures is broken > > > Key: YARN-2836 > URL: https://issues.apache.org/jira/browse/YARN-2836 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli >Priority: Major > > Found this while reviewing YARN-2834. > We now completely ignore token renewal failures. For things like Timeline > tokens, which are automatically obtained whether the app needs them or not (we > should fix this to be user-driven), we can ignore failures. But for HDFS > tokens etc., ignoring failures is bad because it (1) wastes resources, as AMs > will continue and eventually fail, and (2) the app doesn't know what happened and > eventually fails. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3232) Some application states are not necessarily exposed to users
[ https://issues.apache.org/jira/browse/YARN-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802777#comment-17802777 ] Shilun Fan commented on YARN-3232: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Some application states are not necessarily exposed to users > > > Key: YARN-3232 > URL: https://issues.apache.org/jira/browse/YARN-3232 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.7.0 >Reporter: Jian He >Assignee: Varun Saxena >Priority: Major > Attachments: YARN-3232.002.patch, YARN-3232.01.patch, > YARN-3232.02.patch, YARN-3232.v2.01.patch > > > application NEW_SAVING and SUBMITTED states are not necessarily exposed to > users, as they are mostly internal to the system, transient, and not user-facing. > We may deprecate these two states and remove them from the web UI. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-2836) RM behaviour on token renewal failures is broken
[ https://issues.apache.org/jira/browse/YARN-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-2836: - Target Version/s: 3.5.0 (was: 3.4.0) > RM behaviour on token renewal failures is broken > > > Key: YARN-2836 > URL: https://issues.apache.org/jira/browse/YARN-2836 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli >Priority: Major > > Found this while reviewing YARN-2834. > We now completely ignore token renewal failures. For things like Timeline > tokens, which are automatically obtained whether the app needs them or not (we > should fix this to be user-driven), we can ignore failures. But for HDFS > tokens etc., ignoring failures is bad because it (1) wastes resources, as AMs > will continue and eventually fail, and (2) the app doesn't know what happened and > eventually fails. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-3232) Some application states are not necessarily exposed to users
[ https://issues.apache.org/jira/browse/YARN-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-3232: - Target Version/s: 3.5.0 (was: 3.4.0) > Some application states are not necessarily exposed to users > > > Key: YARN-3232 > URL: https://issues.apache.org/jira/browse/YARN-3232 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.7.0 >Reporter: Jian He >Assignee: Varun Saxena >Priority: Major > Attachments: YARN-3232.002.patch, YARN-3232.01.patch, > YARN-3232.02.patch, YARN-3232.v2.01.patch > > > application NEW_SAVING and SUBMITTED states are not necessarily exposed to > users, as they are mostly internal to the system, transient, and not user-facing. > We may deprecate these two states and remove them from the web UI. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-3514) Active directory usernames like domain\login cause YARN failures
[ https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-3514: - Target Version/s: 3.5.0 (was: 3.4.0) > Active directory usernames like domain\login cause YARN failures > > > Key: YARN-3514 > URL: https://issues.apache.org/jira/browse/YARN-3514 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.2.0 > Environment: CentOS6 >Reporter: john lilley >Priority: Minor > Labels: oct16-easy > Attachments: YARN-3514.001.patch, YARN-3514.002.patch > > > We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is > Kerberos-enabled and uses an external AD domain controller for the KDC. We > are able to authenticate, browse HDFS, etc. However, YARN fails during > localization because it seems to get confused by the presence of a \ > character in the local user name. > Our AD authentication on the nodes goes through sssd and is configured to > map AD users onto the form domain\username. For example, our test user has a > Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user > "domain\hadoopuser". We have no problem validating that user with PAM, > logging in as that user, su-ing to that user, etc. > However, when we attempt to run a YARN application master, the localization > step fails when setting up the local cache directory for the AM. The error > that comes out of the RM logs: > 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: > ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, > diagnostics='Application application_1429295486450_0001 failed 1 times due to > AM Container for appattempt_1429295486450_0001_01 exited with exitCode: > -1000 due to: Application application_1429295486450_0001 initialization > failed (exitCode=255) with output: main : command provided 0 > main : user is DOMAIN\hadoopuser > main : requested yarn user is domain\hadoopuser > org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create > directory: > /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10 > at > org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347) > .Failing this attempt.. Failing the application.' > However, when we look on the node launching the AM, we see this: > [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache > [root@rpb-cdh-kerb-2 usercache]# ls -l > drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser > There appears to be different treatment of the \ character in different > places. Something creates the directory as "domain\hadoopuser" but something > else later attempts to use it as "domain%5Chadoopuser". I’m not sure where > or why the URL escaping converts the \ to %5C or why this is not consistent.
> I should also mention, for the sake of completeness, our auth_to_local rule > is set up to map u...@domain.com to domain\user: > RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
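[Editor's note] The %5C in the failing path is ordinary URL escaping of the backslash, which is why the directory created from the raw name and the path built from the escaped name never match. A two-line demonstration using only the JDK, nothing YARN-specific:

{code}
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class BackslashEscapeDemo {
  public static void main(String[] args) throws UnsupportedEncodingException {
    String user = "domain\\hadoopuser";
    String escaped = URLEncoder.encode(user, "UTF-8");
    System.out.println("raw:     " + user);     // domain\hadoopuser
    System.out.println("escaped: " + escaped);  // domain%5Chadoopuser
    // A directory created with the raw name will not match a path built
    // from the escaped name, which is the failure seen in the localizer.
  }
}
{code}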
[jira] [Commented] (YARN-3625) RollingLevelDBTimelineStore Incorrectly Forbids Related Entity in Same Put
[ https://issues.apache.org/jira/browse/YARN-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802774#comment-17802774 ] Shilun Fan commented on YARN-3625: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > RollingLevelDBTimelineStore Incorrectly Forbids Related Entity in Same Put > -- > > Key: YARN-3625 > URL: https://issues.apache.org/jira/browse/YARN-3625 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Jonathan Turner Eagles >Assignee: Jonathan Turner Eagles >Priority: Major > Labels: oct16-medium > Attachments: YARN-3625.1.patch, YARN-3625.2.patch > > > RollingLevelDBTimelineStore batches all entities in the same put to improve > performance. However, this causes an error when relating to an entity in the > same put. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3514) Active directory usernames like domain\login cause YARN failures
[ https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802775#comment-17802775 ] Shilun Fan commented on YARN-3514: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Active directory usernames like domain\login cause YARN failures > > > Key: YARN-3514 > URL: https://issues.apache.org/jira/browse/YARN-3514 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.2.0 > Environment: CentOS6 >Reporter: john lilley >Priority: Minor > Labels: oct16-easy > Attachments: YARN-3514.001.patch, YARN-3514.002.patch > > > We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is > Kerberos-enabled and uses an external AD domain controller for the KDC. We > are able to authenticate, browse HDFS, etc. However, YARN fails during > localization because it seems to get confused by the presence of a \ > character in the local user name. > Our AD authentication on the nodes goes through sssd and is configured to > map AD users onto the form domain\username. For example, our test user has a > Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user > "domain\hadoopuser". We have no problem validating that user with PAM, > logging in as that user, su-ing to that user, etc. > However, when we attempt to run a YARN application master, the localization > step fails when setting up the local cache directory for the AM. The error > that comes out of the RM logs: > 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: > ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, > diagnostics='Application application_1429295486450_0001 failed 1 times due to > AM Container for appattempt_1429295486450_0001_01 exited with exitCode: > -1000 due to: Application application_1429295486450_0001 initialization > failed (exitCode=255) with output: main : command provided 0 > main : user is DOMAIN\hadoopuser > main : requested yarn user is domain\hadoopuser > org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create > directory: > /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10 > at > org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347) > .Failing this attempt.. Failing the application.' > However, when we look on the node launching the AM, we see this: > [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache > [root@rpb-cdh-kerb-2 usercache]# ls -l > drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser > There appears to be different treatment of the \ character in different > places. Something creates the directory as "domain\hadoopuser" but something > else later attempts to use it as "domain%5Chadoopuser". I’m not sure where > or why the URL escaping converts the \ to %5C or why this is not consistent.
> I should also mention, for the sake of completeness, our auth_to_local rule > is set up to map u...@domain.com to domain\user: > RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-3625) RollingLevelDBTimelineStore Incorrectly Forbids Related Entity in Same Put
[ https://issues.apache.org/jira/browse/YARN-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-3625: - Target Version/s: 3.5.0 (was: 3.4.0) > RollingLevelDBTimelineStore Incorrectly Forbids Related Entity in Same Put > -- > > Key: YARN-3625 > URL: https://issues.apache.org/jira/browse/YARN-3625 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Jonathan Turner Eagles >Assignee: Jonathan Turner Eagles >Priority: Major > Labels: oct16-medium > Attachments: YARN-3625.1.patch, YARN-3625.2.patch > > > RollingLevelDBTimelineStore batches all entities in the same put to improve > performance. However, this causes an error when relating to an entity in the > same put. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4485) [Umbrella] Capture per-application and per-queue container allocation latency
[ https://issues.apache.org/jira/browse/YARN-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4485: - Target Version/s: 3.5.0 (was: 3.4.0) > [Umbrella] Capture per-application and per-queue container allocation latency > - > > Key: YARN-4485 > URL: https://issues.apache.org/jira/browse/YARN-4485 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.7.1 >Reporter: Karthik Kambatla >Priority: Major > Labels: supportability, tuning > > Per-application and per-queue container allocation latencies would go a long > way towards helping with tuning scheduler queue configs. > This umbrella JIRA tracks adding these metrics. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4435) Add RM Delegation Token DtFetcher Implementation for DtUtil
[ https://issues.apache.org/jira/browse/YARN-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4435: - Target Version/s: 3.5.0 (was: 3.4.0) > Add RM Delegation Token DtFetcher Implementation for DtUtil > --- > > Key: YARN-4435 > URL: https://issues.apache.org/jira/browse/YARN-4435 > Project: Hadoop YARN > Issue Type: Improvement > Components: client, security, yarn >Affects Versions: 3.0.0-alpha2 >Reporter: Matthew Paduano >Assignee: Matthew Paduano >Priority: Major > Labels: oct16-medium > Attachments: YARN-4435-003.patch, YARN-4435-003.patch, > YARN-4435.00.patch.txt, YARN-4435.01.patch, YARN-4435.02.patch, > proposed_solution > > > Add a class to the YARN project that implements the DtFetcher interface to return > an RM delegation token object. > I attached a proposed class implementation that does this, but it cannot be > added as a patch until the interface is merged in HADOOP-12563 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-3861) Add fav icon to YARN & MR daemons web UI
[ https://issues.apache.org/jira/browse/YARN-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-3861: - Target Version/s: 3.5.0 (was: 3.4.0) > Add fav icon to YARN & MR daemons web UI > > > Key: YARN-3861 > URL: https://issues.apache.org/jira/browse/YARN-3861 > Project: Hadoop YARN > Issue Type: Improvement > Components: webapp >Reporter: Devaraj Kavali >Assignee: Devaraj Kavali >Priority: Major > Labels: oct16-easy > Attachments: RM UI in Chrome-With Patch.png, RM UI in Chrome-Without > Patch.png, RM UI in IE-With Patch.png, RM UI in IE-Without Patch.png.png, > YARN-3861.patch, hadoop-fav-transparent.png, hadoop-fav.png > > > Add a fav icon image to all YARN & MR daemon web UIs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4485) [Umbrella] Capture per-application and per-queue container allocation latency
[ https://issues.apache.org/jira/browse/YARN-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802770#comment-17802770 ] Shilun Fan commented on YARN-4485: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > [Umbrella] Capture per-application and per-queue container allocation latency > - > > Key: YARN-4485 > URL: https://issues.apache.org/jira/browse/YARN-4485 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.7.1 >Reporter: Karthik Kambatla >Priority: Major > Labels: supportability, tuning > > Per-application and per-queue container allocation latencies would go a long > way towards helping with tuning scheduler queue configs. > This umbrella JIRA tracks adding these metrics. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4495) add a way to tell AM container increase/decrease request is invalid
[ https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4495: - Target Version/s: 3.5.0 (was: 3.4.0) > add a way to tell AM container increase/decrease request is invalid > --- > > Key: YARN-4495 > URL: https://issues.apache.org/jira/browse/YARN-4495 > Project: Hadoop YARN > Issue Type: Improvement > Components: api, client >Reporter: sandflee >Priority: Major > Labels: oct16-hard > Attachments: YARN-4495.01.patch > > > Now the RM may pass an InvalidResourceRequestException to the AM or just ignore the > change request; the former will bring AMRMClientAsync down, and the latter > will leave the AM waiting for the reply. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
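[Editor's note] Until the API provides an explicit signal, an AM has to special-case the exception itself. A minimal sketch of that defensive handling; the method shape mirrors AMRMClientAsync.CallbackHandler#onError, but the surrounding class and the recovery policy are assumptions, not the proposed fix:

{code}
import org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException;

public class ChangeRequestErrorHandler {
  /** Called by the async client when any error surfaces. */
  public void onError(Throwable e) {
    if (e instanceof InvalidResourceRequestException) {
      // The container increase/decrease request was rejected; drop it
      // and keep the client alive instead of tearing the AM down.
      System.err.println("Container change rejected: " + e.getMessage());
      return;
    }
    // Anything else stays fatal in this sketch.
    throw new RuntimeException("fatal AM error", e);
  }
}
{code}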
[jira] [Commented] (YARN-4495) add a way to tell AM container increase/decrease request is invalid
[ https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802764#comment-17802764 ] Shilun Fan commented on YARN-4495: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > add a way to tell AM container increase/decrease request is invalid > --- > > Key: YARN-4495 > URL: https://issues.apache.org/jira/browse/YARN-4495 > Project: Hadoop YARN > Issue Type: Improvement > Components: api, client >Reporter: sandflee >Priority: Major > Labels: oct16-hard > Attachments: YARN-4495.01.patch > > > Now the RM may pass an InvalidResourceRequestException to the AM or just ignore the > change request; the former will bring AMRMClientAsync down, and the latter > will leave the AM waiting for the reply. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4636) Make blacklist tracking policy pluggable for more extensions.
[ https://issues.apache.org/jira/browse/YARN-4636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802763#comment-17802763 ] Shilun Fan commented on YARN-4636: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Make blacklist tracking policy pluggable for more extensions. > - > > Key: YARN-4636 > URL: https://issues.apache.org/jira/browse/YARN-4636 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Junping Du >Assignee: Sunil G >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4636) Make blacklist tracking policy pluggable for more extensions.
[ https://issues.apache.org/jira/browse/YARN-4636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4636: - Target Version/s: 3.5.0 (was: 3.4.0) > Make blacklist tracking policy pluggable for more extensions. > - > > Key: YARN-4636 > URL: https://issues.apache.org/jira/browse/YARN-4636 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Junping Du >Assignee: Sunil G >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4637) AM launching blacklist purge mechanism (time based)
[ https://issues.apache.org/jira/browse/YARN-4637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4637: - Target Version/s: 3.5.0 (was: 3.4.0) > AM launching blacklist purge mechanism (time based) > --- > > Key: YARN-4637 > URL: https://issues.apache.org/jira/browse/YARN-4637 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Junping Du >Assignee: Sunil G >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4713) Warning by unchecked conversion in TestTimelineWebServices
[ https://issues.apache.org/jira/browse/YARN-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4713: - Target Version/s: 3.5.0 (was: 3.4.0) > Warning by unchecked conversion in TestTimelineWebServices > --- > > Key: YARN-4713 > URL: https://issues.apache.org/jira/browse/YARN-4713 > Project: Hadoop YARN > Issue Type: Test > Components: test >Reporter: Tsuyoshi Ozawa >Priority: Major > Labels: newbie > Attachments: YARN-4713.1.patch, YARN-4713.2.patch > > > [WARNING] > /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServices.java:[123,38] > [unchecked] unchecked conversion > {code} > Enumeration<String> names = mock(Enumeration.class); > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
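[Editor's note] The warning comes from assigning the raw Enumeration that mock(Enumeration.class) returns to a parameterized variable. The usual fix is a targeted cast plus @SuppressWarnings on the narrowest possible scope; a sketch (class and method names are illustrative, not the actual test code):

{code}
import static org.mockito.Mockito.mock;

import java.util.Enumeration;

public class MockEnumerationExample {
  @SuppressWarnings("unchecked")
  static Enumeration<String> mockNames() {
    // mock(Enumeration.class) yields a raw Enumeration; the explicit cast
    // with a method-level suppression silences the unchecked-conversion
    // warning without suppressing anything else in the test class.
    return (Enumeration<String>) mock(Enumeration.class);
  }
}
{code}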
[jira] [Updated] (YARN-4638) Node whitelist support for AM launching
[ https://issues.apache.org/jira/browse/YARN-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4638: - Target Version/s: 3.5.0 (was: 3.4.0) > Node whitelist support for AM launching > > > Key: YARN-4638 > URL: https://issues.apache.org/jira/browse/YARN-4638 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4713) Warning by unchecked conversion in TestTimelineWebServices
[ https://issues.apache.org/jira/browse/YARN-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802759#comment-17802759 ] Shilun Fan commented on YARN-4713: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Warning by unchecked conversion in TestTimelineWebServices > --- > > Key: YARN-4713 > URL: https://issues.apache.org/jira/browse/YARN-4713 > Project: Hadoop YARN > Issue Type: Test > Components: test >Reporter: Tsuyoshi Ozawa >Priority: Major > Labels: newbie > Attachments: YARN-4713.1.patch, YARN-4713.2.patch > > > [WARNING] > /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServices.java:[123,38] > [unchecked] unchecked conversion > {code} > Enumeration<String> names = mock(Enumeration.class); > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4758) Enable discovery of AMs by containers
[ https://issues.apache.org/jira/browse/YARN-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4758: - Target Version/s: 3.5.0 (was: 3.4.0) > Enable discovery of AMs by containers > - > > Key: YARN-4758 > URL: https://issues.apache.org/jira/browse/YARN-4758 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Junping Du >Priority: Major > Attachments: YARN-4758. AM Discovery Service for YARN Container.pdf > > > {color:red} > This is already discussed on the umbrella JIRA YARN-1489. > Copying some of my condensed summary from the design doc (section 3.2.10.3) of YARN-4692. > {color} > Even after the existing work on work-preserving AM restart (Section 3.1.2 / YARN-1489), we still haven't solved the problem of old running containers not knowing where the new AM starts running after the previous AM crashes. This is an especially important problem for long-running services, where we'd like to avoid killing service containers when AMs fail over. So far, we have left this as a task for the apps, but solving it in YARN is highly desirable. This looks very much like the service registry (YARN-913), but for app containers to discover their own AMs. > Combining this requirement (of any container being able to find its AM across failovers) with those of services (to be able to find through DNS where a service container is running - YARN-4757) will push our registry scalability needs much higher than those of just service endpoints. This calls for a more distributed solution for registry readers, something that is discussed in the comments section of YARN-1489 and MAPREDUCE-6608. > See comment > https://issues.apache.org/jira/browse/YARN-1489?focusedCommentId=13862359&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13862359 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4758) Enable discovery of AMs by containers
[ https://issues.apache.org/jira/browse/YARN-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802758#comment-17802758 ] Shilun Fan commented on YARN-4758: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Enable discovery of AMs by containers > - > > Key: YARN-4758 > URL: https://issues.apache.org/jira/browse/YARN-4758 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Junping Du >Priority: Major > Attachments: YARN-4758. AM Discovery Service for YARN Container.pdf > > > {color:red} > This is already discussed on the umbrella JIRA YARN-1489. > Copying some of my condensed summary from the design doc (section 3.2.10.3) of YARN-4692. > {color} > Even after the existing work on work-preserving AM restart (Section 3.1.2 / YARN-1489), we still haven't solved the problem of old running containers not knowing where the new AM starts running after the previous AM crashes. This is an especially important problem for long-running services, where we'd like to avoid killing service containers when AMs fail over. So far, we have left this as a task for the apps, but solving it in YARN is highly desirable. This looks very much like the service registry (YARN-913), but for app containers to discover their own AMs. > Combining this requirement (of any container being able to find its AM across failovers) with those of services (to be able to find through DNS where a service container is running - YARN-4757) will push our registry scalability needs much higher than those of just service endpoints. This calls for a more distributed solution for registry readers, something that is discussed in the comments section of YARN-1489 and MAPREDUCE-6608. > See comment > https://issues.apache.org/jira/browse/YARN-1489?focusedCommentId=13862359&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13862359 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4808) SchedulerNode can use a few more cosmetic changes
[ https://issues.apache.org/jira/browse/YARN-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802753#comment-17802753 ] Shilun Fan commented on YARN-4808: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > SchedulerNode can use a few more cosmetic changes > - > > Key: YARN-4808 > URL: https://issues.apache.org/jira/browse/YARN-4808 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla >Priority: Major > Attachments: yarn-4808-1.patch, yarn-4808-2.patch > > > We have made some cosmetic changes to SchedulerNode recently. While working on YARN-4511, we realized we could improve it a little more: > # Remove volatile variables - we don't see the need for them to be volatile > # Consolidate methods that end up doing very similar things > # Rename totalResource to capacity. YARN-4511 plans to add inflatedCapacity to include the un-utilized resources, and having two totals can be a little confusing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4804) [Umbrella] Improve test run duration
[ https://issues.apache.org/jira/browse/YARN-4804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4804: - Target Version/s: 3.5.0 (was: 3.4.0) > [Umbrella] Improve test run duration > > > Key: YARN-4804 > URL: https://issues.apache.org/jira/browse/YARN-4804 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Priority: Major > > Our tests take a long time to run; e.g., the RM tests take 67 minutes. Since our precommit builds run our tests against two Java versions, this issue is exacerbated. > Filing this umbrella JIRA to address this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4808) SchedulerNode can use a few more cosmetic changes
[ https://issues.apache.org/jira/browse/YARN-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4808: - Target Version/s: 3.5.0 (was: 3.4.0) > SchedulerNode can use a few more cosmetic changes > - > > Key: YARN-4808 > URL: https://issues.apache.org/jira/browse/YARN-4808 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla >Priority: Major > Attachments: yarn-4808-1.patch, yarn-4808-2.patch > > > We have made some cosmetic changes to SchedulerNode recently. While working on YARN-4511, we realized we could improve it a little more: > # Remove volatile variables - we don't see the need for them to be volatile > # Consolidate methods that end up doing very similar things > # Rename totalResource to capacity. YARN-4511 plans to add inflatedCapacity to include the un-utilized resources, and having two totals can be a little confusing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9354) Resources should be created with ResourceTypesTestHelper instead of TestUtils
[ https://issues.apache.org/jira/browse/YARN-9354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-9354: - Target Version/s: 3.4.0 (was: 3.5.0) > Resources should be created with ResourceTypesTestHelper instead of TestUtils > - > > Key: YARN-9354 > URL: https://issues.apache.org/jira/browse/YARN-9354 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Andras Gyori >Priority: Trivial > Labels: newbie, newbie++ > Fix For: 3.3.0, 3.2.2, 3.4.0 > > Attachments: YARN-9354.001.patch, YARN-9354.002.patch, > YARN-9354.003.patch, YARN-9354.004.patch, YARN-9354.branch-3.2.001.patch, > YARN-9354.branch-3.2.002.patch, YARN-9354.branch-3.2.003.patch > > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestUtils#createResource has an implementation that is not identical, but very similar, to org.apache.hadoop.yarn.resourcetypes.ResourceTypesTestHelper#newResource. > Since these 2 methods essentially do the same thing, and ResourceTypesTestHelper is newer and more widely used, TestUtils#createResource should be replaced with ResourceTypesTestHelper#newResource in all occurrences. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4843) [Umbrella] Revisit YARN ProtocolBuffer int32 usages that need to upgrade to int64
[ https://issues.apache.org/jira/browse/YARN-4843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4843: - Target Version/s: 3.5.0 (was: 3.4.0) > [Umbrella] Revisit YARN ProtocolBuffer int32 usages that need to upgrade to > int64 > - > > Key: YARN-4843 > URL: https://issues.apache.org/jira/browse/YARN-4843 > Project: Hadoop YARN > Issue Type: Bug > Components: api >Affects Versions: 3.0.0-alpha1 >Reporter: Wangda Tan >Priority: Major > > This JIRA is to track all int32 usages in YARN's ProtocolBuffer APIs that we possibly need to update to int64. > One example is the resource API: we use int32 for memory now, so if a cluster has 10k nodes, each with 210G of memory, we will get a negative total cluster memory. > We may have other fields that need to upgrade from int32 to int64. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4843) [Umbrella] Revisit YARN ProtocolBuffer int32 usages that need to upgrade to int64
[ https://issues.apache.org/jira/browse/YARN-4843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802752#comment-17802752 ] Shilun Fan commented on YARN-4843: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > [Umbrella] Revisit YARN ProtocolBuffer int32 usages that need to upgrade to > int64 > - > > Key: YARN-4843 > URL: https://issues.apache.org/jira/browse/YARN-4843 > Project: Hadoop YARN > Issue Type: Bug > Components: api >Affects Versions: 3.0.0-alpha1 >Reporter: Wangda Tan >Priority: Major > > This JIRA is to track all int32 usages in YARN's ProtocolBuffer APIs that we possibly need to update to int64. > One example is the resource API: we use int32 for memory now, so if a cluster has 10k nodes, each with 210G of memory, we will get a negative total cluster memory. > We may have other fields that need to upgrade from int32 to int64. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
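To make the overflow concrete with the numbers from the issue: 10k nodes at 210G each is 2,150,400,000 MB, just past Integer.MAX_VALUE (2,147,483,647). A self-contained sketch of the arithmetic (illustrative only, not YARN's actual Resource API):
{code}
public class Int32MemoryOverflow {
  public static void main(String[] args) {
    int nodes = 10_000;
    int nodeMemoryMb = 210 * 1024;            // 210G per node, in MB
    int totalInt32 = nodes * nodeMemoryMb;    // overflows int32
    long totalInt64 = (long) nodes * nodeMemoryMb;
    System.out.println(totalInt32);           // -2144567296 (negative!)
    System.out.println(totalInt64);           // 2150400000
  }
}
{code}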
[jira] [Updated] (YARN-9354) Resources should be created with ResourceTypesTestHelper instead of TestUtils
[ https://issues.apache.org/jira/browse/YARN-9354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-9354: - Target Version/s: 3.5.0 (was: 3.4.0) > Resources should be created with ResourceTypesTestHelper instead of TestUtils > - > > Key: YARN-9354 > URL: https://issues.apache.org/jira/browse/YARN-9354 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Andras Gyori >Priority: Trivial > Labels: newbie, newbie++ > Fix For: 3.3.0, 3.2.2, 3.4.0 > > Attachments: YARN-9354.001.patch, YARN-9354.002.patch, > YARN-9354.003.patch, YARN-9354.004.patch, YARN-9354.branch-3.2.001.patch, > YARN-9354.branch-3.2.002.patch, YARN-9354.branch-3.2.003.patch > > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestUtils#createResource has an implementation that is not identical, but very similar, to org.apache.hadoop.yarn.resourcetypes.ResourceTypesTestHelper#newResource. > Since these 2 methods essentially do the same thing, and ResourceTypesTestHelper is newer and more widely used, TestUtils#createResource should be replaced with ResourceTypesTestHelper#newResource in all occurrences. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4971) RM fails to re-bind to wildcard IP after failover in multi homed clusters
[ https://issues.apache.org/jira/browse/YARN-4971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802749#comment-17802749 ] Shilun Fan commented on YARN-4971: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > RM fails to re-bind to wildcard IP after failover in multi homed clusters > - > > Key: YARN-4971 > URL: https://issues.apache.org/jira/browse/YARN-4971 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-4971.1.patch > > > If the RM has {{yarn.resourcemanager.bind-host}} set to 0.0.0.0, then the first time the service becomes active, binding to the wildcard works as expected. If the service has transitioned from active to standby and then becomes active again after failover, the service only binds to one of the IP addresses. > There is a difference between the services inside the RM: it only seems to happen for the services listening on ports 8030 and 8032. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4988) Limit filter in ApplicationBaseProtocol#getApplications should return latest applications
[ https://issues.apache.org/jira/browse/YARN-4988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802748#comment-17802748 ] Shilun Fan commented on YARN-4988: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Limit filter in ApplicationBaseProtocol#getApplications should return latest > applications > - > > Key: YARN-4988 > URL: https://issues.apache.org/jira/browse/YARN-4988 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Major > Labels: oct16-medium > Attachments: YARN-4988-wip.patch > > > Whenever the limit filter is used to get application reports via ApplicationBaseProtocol#getApplications, the applications retrieved are not the latest; the retrieved applications are random, based on the hashcode. > The reason for the above problem is that the RM maintains the apps in a map where the insertion of application ids is based on the hashcode. So if there are 10 applications from app-1 to app-10 and the limit is 5, one would expect applications app-6 to app-10 to be retrieved, but instead the first 5 apps in the map are retrieved, so the retrieved applications are a random 5! > I think the limit should retrieve the latest applications only. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4953) Delete completed container log folder when rolling log aggregation is enabled
[ https://issues.apache.org/jira/browse/YARN-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4953: - Target Version/s: 3.5.0 (was: 3.4.0) > Delete completed container log folder when rolling log aggregation is enabled > - > > Key: YARN-4953 > URL: https://issues.apache.org/jira/browse/YARN-4953 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Major > > There is a potential bottleneck when a cluster runs a very large number of containers on the same NodeManager for a single application. Linux limits the subfolder count to 32K, so if the number of containers for an application exceeds 32K, container launches will fail, and at that point no more containers can be launched on this node. > Currently log folders are deleted after the app is finished, while rolling log aggregation aggregates logs to HDFS periodically. > I think that once aggregation is completed for finished containers, cleanup can be done, i.e., deleting the log folders of finished containers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4969) Fix more loggings in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802750#comment-17802750 ] Shilun Fan commented on YARN-4969: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Fix more loggings in CapacityScheduler > -- > > Key: YARN-4969 > URL: https://issues.apache.org/jira/browse/YARN-4969 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Labels: oct16-easy > Attachments: YARN-4969.1.patch > > > YARN-3966 did a logging cleanup for the Capacity Scheduler before; however, there are some log messages we still need to improve: > Container allocation / complete / reservation / un-reserve messages at every level of the hierarchy (app/leaf/parent queue) should be printed at INFO level. > I'm debugging an issue where the root queue's resource usage can become negative; it is very hard to reproduce, and we cannot enable debug logging from RM start because the log would not fit on a single disk. > The existing CS prints INFO messages when a container cannot be allocated, such as on re-reservation / node heartbeat, etc.; we should avoid printing such messages at INFO level. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4988) Limit filter in ApplicationBaseProtocol#getApplications should return latest applications
[ https://issues.apache.org/jira/browse/YARN-4988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4988: - Target Version/s: 3.5.0 (was: 3.4.0) > Limit filter in ApplicationBaseProtocol#getApplications should return latest > applications > - > > Key: YARN-4988 > URL: https://issues.apache.org/jira/browse/YARN-4988 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Major > Labels: oct16-medium > Attachments: YARN-4988-wip.patch > > > Whenever the limit filter is used to get application reports via ApplicationBaseProtocol#getApplications, the applications retrieved are not the latest; the retrieved applications are random, based on the hashcode. > The reason for the above problem is that the RM maintains the apps in a map where the insertion of application ids is based on the hashcode. So if there are 10 applications from app-1 to app-10 and the limit is 5, one would expect applications app-6 to app-10 to be retrieved, but instead the first 5 apps in the map are retrieved, so the retrieved applications are a random 5! > I think the limit should retrieve the latest applications only. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
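The fix direction described above amounts to ordering before limiting. A minimal sketch with a hypothetical AppReport type (not the RM's real report classes), sorting by submission time descending and then applying the limit:
{code}
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class AppReport {
  final String appId;
  final long submitTime;

  AppReport(String appId, long submitTime) {
    this.appId = appId;
    this.submitTime = submitTime;
  }
}

class LatestAppsFilter {
  // Taking the first N entries of a hash-ordered map yields an effectively
  // random subset; sorting by submission time first makes "limit" mean
  // "the N most recently submitted applications".
  static List<AppReport> latest(List<AppReport> apps, long limit) {
    return apps.stream()
        .sorted(Comparator.comparingLong((AppReport a) -> a.submitTime).reversed())
        .limit(limit)
        .collect(Collectors.toList());
  }
}
{code}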
[jira] [Updated] (YARN-4969) Fix more loggings in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4969: - Target Version/s: 3.5.0 (was: 3.4.0) > Fix more loggings in CapacityScheduler > -- > > Key: YARN-4969 > URL: https://issues.apache.org/jira/browse/YARN-4969 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Labels: oct16-easy > Attachments: YARN-4969.1.patch > > > YARN-3966 did a logging cleanup for the Capacity Scheduler before; however, there are some log messages we still need to improve: > Container allocation / complete / reservation / un-reserve messages at every level of the hierarchy (app/leaf/parent queue) should be printed at INFO level. > I'm debugging an issue where the root queue's resource usage can become negative; it is very hard to reproduce, and we cannot enable debug logging from RM start because the log would not fit on a single disk. > The existing CS prints INFO messages when a container cannot be allocated, such as on re-reservation / node heartbeat, etc.; we should avoid printing such messages at INFO level. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
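As a sketch of the logging split argued for above (illustrative, not the actual CapacityScheduler code): state-changing events stay at INFO so they are visible without enabling DEBUG cluster-wide, while high-frequency no-op events move to DEBUG:
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class SchedulerLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(SchedulerLoggingSketch.class);

  void onAllocated(String containerId, String queuePath) {
    // Allocate / complete / reserve / un-reserve change state: keep at INFO.
    LOG.info("Allocated {} on queue {}", containerId, queuePath);
  }

  void onNoAllocation(String nodeId) {
    // Re-reservation and per-heartbeat "nothing allocated" messages are too
    // frequent for INFO; the parameterized call costs little when disabled.
    LOG.debug("No allocation made on node {} during this heartbeat", nodeId);
  }
}
{code}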
[jira] [Commented] (YARN-4953) Delete completed container log folder when rolling log aggregation is enabled
[ https://issues.apache.org/jira/browse/YARN-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802751#comment-17802751 ] Shilun Fan commented on YARN-4953: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Delete completed container log folder when rolling log aggregation is enabled > - > > Key: YARN-4953 > URL: https://issues.apache.org/jira/browse/YARN-4953 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S >Priority: Major > > There is a potential bottleneck when a cluster runs a very large number of containers on the same NodeManager for a single application. Linux limits the subfolder count to 32K, so if the number of containers for an application exceeds 32K, container launches will fail, and at that point no more containers can be launched on this node. > Currently log folders are deleted after the app is finished, while rolling log aggregation aggregates logs to HDFS periodically. > I think that once aggregation is completed for finished containers, cleanup can be done, i.e., deleting the log folders of finished containers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
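If the cleanup described above were implemented, its core could look like this hypothetical hook (the names and the wiring are assumptions, not the NodeManager's actual code):
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

class FinishedContainerLogCleaner {
  // Once rolling aggregation has shipped every log file of a finished
  // container to HDFS, the local container log dir can be removed so a
  // single long-running app cannot exhaust the 32K-subfolder limit.
  void deleteIfAggregated(FileContext lfs, Path containerLogDir,
      boolean aggregationComplete) throws IOException {
    if (aggregationComplete) {
      lfs.delete(containerLogDir, true); // recursive delete
    }
  }
}
{code}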
[jira] [Updated] (YARN-4971) RM fails to re-bind to wildcard IP after failover in multi homed clusters
[ https://issues.apache.org/jira/browse/YARN-4971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-4971: - Target Version/s: 3.5.0 (was: 3.4.0) > RM fails to re-bind to wildcard IP after failover in multi homed clusters > - > > Key: YARN-4971 > URL: https://issues.apache.org/jira/browse/YARN-4971 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-4971.1.patch > > > If the RM has {{yarn.resourcemanager.bind-host}} set to 0.0.0.0, then the first time the service becomes active, binding to the wildcard works as expected. If the service has transitioned from active to standby and then becomes active again after failover, the service only binds to one of the IP addresses. > There is a difference between the services inside the RM: it only seems to happen for the services listening on ports 8030 and 8032. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
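For illustration of the expected behavior (a hypothetical helper, not the RM's actual service code): the listen address should be derived from {{yarn.resourcemanager.bind-host}} on every transition to active, rather than reusing an address resolved during a previous activation. A minimal sketch:
{code}
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;

class BindAddressSketch {
  // Recompute on every transition-to-active; caching a resolved address
  // across standby/active cycles is what can lose the 0.0.0.0 wildcard.
  static InetSocketAddress listenAddress(Configuration conf, int port) {
    String bindHost = conf.getTrimmed("yarn.resourcemanager.bind-host", "");
    if (bindHost.isEmpty()) {
      return new InetSocketAddress(port); // wildcard by default
    }
    return new InetSocketAddress(bindHost, port); // e.g. 0.0.0.0
  }
}
{code}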
[jira] [Updated] (YARN-5205) yarn logs for live applications does not provide log files which may have already been aggregated
[ https://issues.apache.org/jira/browse/YARN-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-5205: - Target Version/s: 3.5.0 (was: 3.4.0) > yarn logs for live applications does not provide log files which may have > already been aggregated > - > > Key: YARN-5205 > URL: https://issues.apache.org/jira/browse/YARN-5205 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Siddharth Seth >Priority: Major > > With periodic aggregation enabled, logs which have been partially aggregated are not always displayed by the yarn logs command. > If a file exists in the log dir for a container, all previously aggregated files with the same name, along with the current file, will be part of the yarn logs output. > Files which have been previously aggregated, and for which a file with the same name does not exist in the container log dir, do not show up in the output. > After the app completes, all logs are available. > cc [~xgong] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5414) Integrate NodeQueueLoadMonitor with ClusterNodeTracker
[ https://issues.apache.org/jira/browse/YARN-5414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802745#comment-17802745 ] Shilun Fan commented on YARN-5414: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Integrate NodeQueueLoadMonitor with ClusterNodeTracker > -- > > Key: YARN-5414 > URL: https://issues.apache.org/jira/browse/YARN-5414 > Project: Hadoop YARN > Issue Type: Sub-task > Components: container-queuing, distributed-scheduling, scheduler >Reporter: Arun Suresh >Assignee: Abhishek Modi >Priority: Major > > The {{ClusterNodeTracker}} tracks the states of clusterNodes and provides > convenience methods like sort and filter. > The {{NodeQueueLoadMonitor}} should use the {{ClusterNodeTracker}} instead of > maintaining its own data-structure of node information. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5414) Integrate NodeQueueLoadMonitor with ClusterNodeTracker
[ https://issues.apache.org/jira/browse/YARN-5414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-5414: - Target Version/s: 3.5.0 (was: 3.4.0) > Integrate NodeQueueLoadMonitor with ClusterNodeTracker > -- > > Key: YARN-5414 > URL: https://issues.apache.org/jira/browse/YARN-5414 > Project: Hadoop YARN > Issue Type: Sub-task > Components: container-queuing, distributed-scheduling, scheduler >Reporter: Arun Suresh >Assignee: Abhishek Modi >Priority: Major > > The {{ClusterNodeTracker}} tracks the states of clusterNodes and provides > convenience methods like sort and filter. > The {{NodeQueueLoadMonitor}} should use the {{ClusterNodeTracker}} instead of > maintaining its own data-structure of node information. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5464) Server-Side NM Graceful Decommissioning with RM HA
[ https://issues.apache.org/jira/browse/YARN-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802744#comment-17802744 ] Shilun Fan commented on YARN-5464: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Server-Side NM Graceful Decommissioning with RM HA > -- > > Key: YARN-5464 > URL: https://issues.apache.org/jira/browse/YARN-5464 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, yarn >Reporter: Robert Kanter >Assignee: Gergely Pollák >Priority: Major > Attachments: YARN-5464.001.patch, YARN-5464.002.patch, > YARN-5464.003.patch, YARN-5464.004.patch, YARN-5464.005.patch, > YARN-5464.006.patch, YARN-5464.wip.patch > > > Make sure to remove the note added by YARN-7094 about RM HA failover not > working right. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5464) Server-Side NM Graceful Decommissioning with RM HA
[ https://issues.apache.org/jira/browse/YARN-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-5464: - Target Version/s: 3.5.0 (was: 3.4.0) > Server-Side NM Graceful Decommissioning with RM HA > -- > > Key: YARN-5464 > URL: https://issues.apache.org/jira/browse/YARN-5464 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, yarn >Reporter: Robert Kanter >Assignee: Gergely Pollák >Priority: Major > Attachments: YARN-5464.001.patch, YARN-5464.002.patch, > YARN-5464.003.patch, YARN-5464.004.patch, YARN-5464.005.patch, > YARN-5464.006.patch, YARN-5464.wip.patch > > > Make sure to remove the note added by YARN-7094 about RM HA failover not > working right. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5194) Avoid adding yarn-site to all Configuration instances created by the JVM
[ https://issues.apache.org/jira/browse/YARN-5194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-5194: - Target Version/s: 3.5.0 (was: 3.4.0) > Avoid adding yarn-site to all Configuration instances created by the JVM > > > Key: YARN-5194 > URL: https://issues.apache.org/jira/browse/YARN-5194 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Siddharth Seth >Priority: Major > > {code} > static { > addDeprecatedKeys(); > Configuration.addDefaultResource(YARN_DEFAULT_CONFIGURATION_FILE); > Configuration.addDefaultResource(YARN_SITE_CONFIGURATION_FILE); > } > {code} > This puts the contents of yarn-default and yarn-site into every Configuration instance created in the JVM after YarnConfiguration has been initialized. > This should be changed to a local addResource for the specific YarnConfiguration instance, instead of polluting every Configuration instance. > This is an incompatible change; the target version has been set to 3.x. > The same applies to HdfsConfiguration (hdfs-site.xml) and Configuration (core-site.xml, etc.). > core-site may be worth including everywhere; however, it would be better to expect users to explicitly add the relevant resources. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
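A sketch of the proposed direction (assuming the XML files are on the classpath; this is a sketch, not the patch itself): keep yarn-default.xml/yarn-site.xml out of the process-wide defaults and load them only into the instance that wants them:
{code}
import org.apache.hadoop.conf.Configuration;

class YarnConfLoading {
  // Today a static initializer calls Configuration.addDefaultResource(...),
  // which affects every Configuration created afterwards in the JVM.
  // The local alternative touches only this one instance:
  static Configuration newYarnConf() {
    Configuration conf = new Configuration();
    conf.addResource("yarn-default.xml");
    conf.addResource("yarn-site.xml");
    return conf;
  }
}
{code}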
[jira] [Commented] (YARN-5536) Multiple format support (JSON, etc.) for exclude node file in NM graceful decommission with timeout
[ https://issues.apache.org/jira/browse/YARN-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802742#comment-17802742 ] Shilun Fan commented on YARN-5536: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Multiple format support (JSON, etc.) for exclude node file in NM graceful > decommission with timeout > --- > > Key: YARN-5536 > URL: https://issues.apache.org/jira/browse/YARN-5536 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful >Reporter: Junping Du >Priority: Major > > Per discussion in YARN-4676, we agree that multiple formats (other than XML) should be supported for decommissioning nodes with timeout values. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5465) Server-Side NM Graceful Decommissioning subsequent call behavior
[ https://issues.apache.org/jira/browse/YARN-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-5465: - Target Version/s: 3.5.0 (was: 3.4.0) > Server-Side NM Graceful Decommissioning subsequent call behavior > > > Key: YARN-5465 > URL: https://issues.apache.org/jira/browse/YARN-5465 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful >Reporter: Robert Kanter >Priority: Major > > The Server-Side NM Graceful Decommissioning feature added by YARN-4676 has the following behavior when subsequent calls are made: > # Start a long-running job that has containers running on nodeA > # Add nodeA to the exclude file > # Run {{-refreshNodes -g 120 -server}} (2min) to begin gracefully decommissioning nodeA > # Wait 30 seconds > # Add nodeB to the exclude file > # Run {{-refreshNodes -g 30 -server}} (30sec) > # After 30 seconds, both nodeA and nodeB shut down > In a nutshell, issuing a subsequent call to gracefully decommission nodes updates the timeout for any currently decommissioning nodes. This makes it impossible to gracefully decommission different sets of nodes with different timeouts, though it does let you easily update the timeout of currently decommissioning nodes. > Another behavior we could adopt is this: > # {color:grey}Start a long-running job that has containers running on nodeA{color} > # {color:grey}Add nodeA to the exclude file{color} > # {color:grey}Run {{-refreshNodes -g 120 -server}} (2min) to begin gracefully decommissioning nodeA{color} > # {color:grey}Wait 30 seconds{color} > # {color:grey}Add nodeB to the exclude file{color} > # {color:grey}Run {{-refreshNodes -g 30 -server}} (30sec){color} > # After 30 seconds, nodeB shuts down > # After 60 more seconds, nodeA shuts down > This keeps the nodes affected by each call to gracefully decommission nodes independent. You can now have different sets of decommissioning nodes with different timeouts. However, to update the timeout of a currently decommissioning node, you'd have to first recommission it, and then decommission it again. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5536) Multiple format support (JSON, etc.) for exclude node file in NM graceful decommission with timeout
[ https://issues.apache.org/jira/browse/YARN-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-5536: - Target Version/s: 3.5.0 (was: 3.4.0) > Multiple format support (JSON, etc.) for exclude node file in NM graceful > decommission with timeout > --- > > Key: YARN-5536 > URL: https://issues.apache.org/jira/browse/YARN-5536 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful >Reporter: Junping Du >Priority: Major > > Per discussion in YARN-4676, we agree that multiple formats (other than XML) should be supported for decommissioning nodes with timeout values. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
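Purely as a sketch of what a non-XML exclude file could look like (the JSON schema below is a hypothetical example, not a format settled on in YARN-4676), hosts could carry optional per-host timeouts and be read with Jackson:
{code}
import java.io.File;
import java.io.IOException;
import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;

// Example (hypothetical) file content:
//   [{"name": "host1.example.com", "timeout": 3600}, {"name": "host2.example.com"}]
class ExcludeHost {
  public String name;
  public Integer timeout; // seconds; null means "use the default timeout"
}

class JsonExcludeFileReader {
  static List<ExcludeHost> read(File excludeFile) throws IOException {
    ObjectMapper mapper = new ObjectMapper();
    return mapper.readValue(excludeFile,
        mapper.getTypeFactory().constructCollectionType(List.class, ExcludeHost.class));
  }
}
{code}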
[jira] [Commented] (YARN-5625) FairScheduler should use FSContext more aggressively to avoid constructors with many parameters
[ https://issues.apache.org/jira/browse/YARN-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802741#comment-17802741 ] Shilun Fan commented on YARN-5625: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > FairScheduler should use FSContext more aggressively to avoid constructors > with many parameters > --- > > Key: YARN-5625 > URL: https://issues.apache.org/jira/browse/YARN-5625 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 2.9.0 >Reporter: Karthik Kambatla >Priority: Major > > YARN-5609 introduces FSContext, a structure to capture basic FairScheduler > information. In addition to preemption details, it could host references to > the scheduler, QueueManager, AllocationConfiguration etc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5465) Server-Side NM Graceful Decommissioning subsequent call behavior
[ https://issues.apache.org/jira/browse/YARN-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802743#comment-17802743 ] Shilun Fan commented on YARN-5465: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Server-Side NM Graceful Decommissioning subsequent call behavior > > > Key: YARN-5465 > URL: https://issues.apache.org/jira/browse/YARN-5465 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful >Reporter: Robert Kanter >Priority: Major > > The Server-Side NM Graceful Decommissioning feature added by YARN-4676 has the following behavior when subsequent calls are made: > # Start a long-running job that has containers running on nodeA > # Add nodeA to the exclude file > # Run {{-refreshNodes -g 120 -server}} (2min) to begin gracefully decommissioning nodeA > # Wait 30 seconds > # Add nodeB to the exclude file > # Run {{-refreshNodes -g 30 -server}} (30sec) > # After 30 seconds, both nodeA and nodeB shut down > In a nutshell, issuing a subsequent call to gracefully decommission nodes updates the timeout for any currently decommissioning nodes. This makes it impossible to gracefully decommission different sets of nodes with different timeouts, though it does let you easily update the timeout of currently decommissioning nodes. > Another behavior we could adopt is this: > # {color:grey}Start a long-running job that has containers running on nodeA{color} > # {color:grey}Add nodeA to the exclude file{color} > # {color:grey}Run {{-refreshNodes -g 120 -server}} (2min) to begin gracefully decommissioning nodeA{color} > # {color:grey}Wait 30 seconds{color} > # {color:grey}Add nodeB to the exclude file{color} > # {color:grey}Run {{-refreshNodes -g 30 -server}} (30sec){color} > # After 30 seconds, nodeB shuts down > # After 60 more seconds, nodeA shuts down > This keeps the nodes affected by each call to gracefully decommission nodes independent. You can now have different sets of decommissioning nodes with different timeouts. However, to update the timeout of a currently decommissioning node, you'd have to first recommission it, and then decommission it again. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5625) FairScheduler should use FSContext more aggressively to avoid constructors with many parameters
[ https://issues.apache.org/jira/browse/YARN-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-5625: - Target Version/s: 3.5.0 (was: 3.4.0) > FairScheduler should use FSContext more aggressively to avoid constructors > with many parameters > --- > > Key: YARN-5625 > URL: https://issues.apache.org/jira/browse/YARN-5625 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 2.9.0 >Reporter: Karthik Kambatla >Priority: Major > > YARN-5609 introduces FSContext, a structure to capture basic FairScheduler > information. In addition to preemption details, it could host references to > the scheduler, QueueManager, AllocationConfiguration etc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
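A minimal sketch of the FSContext idea (the placeholder types below stand in for the real FairScheduler, QueueManager and AllocationConfiguration classes): thread one context object through instead of one constructor parameter per collaborator:
{code}
// Placeholder collaborator types, for illustration only.
class SchedulerRef {}
class QueueManagerRef {}
class AllocConfRef {}

class FSContext {
  final SchedulerRef scheduler;
  final QueueManagerRef queueManager;
  final AllocConfRef allocConf;

  FSContext(SchedulerRef s, QueueManagerRef qm, AllocConfRef ac) {
    this.scheduler = s;
    this.queueManager = qm;
    this.allocConf = ac;
  }
}

// A queue now takes one context parameter instead of many, and new
// collaborators can be added without touching every constructor signature.
class FSQueueSketch {
  private final FSContext context;

  FSQueueSketch(FSContext context) {
    this.context = context;
  }
}
{code}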
[jira] [Updated] (YARN-5674) FairScheduler handles "dots" in user names inconsistently in the config
[ https://issues.apache.org/jira/browse/YARN-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-5674: - Target Version/s: 3.5.0 (was: 3.4.0) > FairScheduler handles "dots" in user names inconsistently in the config > --- > > Key: YARN-5674 > URL: https://issues.apache.org/jira/browse/YARN-5674 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > > A user name can contain a dot; because the user name can be used as the queue name, we replace the dot with a defined separator. When defining queues in the configuration for users containing a dot, we expect the dot to be replaced by the "\_dot\_" string. > In the user limits we do not do that, and user limits need a normal dot in the user name. This is confusing when you create a scheduler configuration: in some places you need to replace the dot, in others you do not. This can cause issues where user limits are not enforced as expected. > We should use one way to specify the user, and since the queue naming cannot be changed, we should also use the same "\_dot\_" in the user limits and enforce them correctly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5683) Support specifying storage type for per-application local dirs
[ https://issues.apache.org/jira/browse/YARN-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802739#comment-17802739 ] Shilun Fan commented on YARN-5683: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Support specifying storage type for per-application local dirs > -- > > Key: YARN-5683 > URL: https://issues.apache.org/jira/browse/YARN-5683 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: oct16-hard > Attachments: YARN-5683-1.patch, YARN-5683-2.patch, YARN-5683-3.patch, > flow_diagram_for_MapReduce-2.png, flow_diagram_for_MapReduce.png > > > h3. Introduction > * Some applications of various frameworks (Flink, Spark, MapReduce, etc.) that use local storage (checkpoint, shuffle, etc.) might require high IO performance. It's useful to allocate local directories on high-performance storage media for these applications on heterogeneous clusters. > * YARN does not distinguish different storage types, and hence applications cannot selectively use storage media with different performance characteristics. Adding awareness of storage media can allow YARN to make better decisions about the placement of local directories. > h3. Approach > * NodeManager will distinguish storage types for local directories. > ** The yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs configuration should allow the cluster administrator to optionally specify the storage type for each local directory. Example: [SSD]/disk1/nm-local-dir,/disk2/nm-local-dir,/disk3/nm-local-dir (equivalent to [SSD]/disk1/nm-local-dir,[DISK]/disk2/nm-local-dir,[DISK]/disk3/nm-local-dir) > ** StorageType defines the DISK/SSD storage types and takes DISK as the default storage type. > ** StorageLocation separates storage type and directory path, and is used by LocalDirAllocator to be aware of the types of local dirs; the default storage type is DISK. > ** The getLocalPathForWrite method of LocalDirAllocator will prefer to choose a local directory of the specified storage type, and will fall back to ignoring the storage type if the requirement cannot be satisfied. > ** Support for container-related local/log directories is added in ContainerLaunch. All application frameworks can set the environment variables (LOCAL_STORAGE_TYPE and LOG_STORAGE_TYPE) to specify the desired storage type of local/log directories, and can choose not to launch the container on fallback through these environment variables (ENSURE_LOCAL_STORAGE_TYPE and ENSURE_LOG_STORAGE_TYPE). > * Allow a specified storage type for various frameworks (take MapReduce as an example): > ** New configurations should allow the application administrator to optionally specify the storage type of local/log directories and the fallback strategy (MapReduce configurations: mapreduce.job.local-storage-type, mapreduce.job.log-storage-type, mapreduce.job.ensure-local-storage-type and mapreduce.job.ensure-log-storage-type). > ** Support for container work directories: set the environment variables (LOCAL_STORAGE_TYPE and LOG_STORAGE_TYPE) according to the configurations above for ContainerLaunchContext and ApplicationSubmissionContext. (MapReduce should update YARNRunner and TaskAttemptImpl.) > ** Add a storage type prefix to the request path to support other local directories of frameworks (such as shuffle directories for MapReduce). (MapReduce should update YarnOutputFiles, MROutputFiles and YarnChild to support output/work directories.) > ** Flow diagram for MapReduce framework > !flow_diagram_for_MapReduce-2.png! > h3. Further Discussion > * Scheduling: the requirement of storage type for local/log directories may not be satisfied for some nodes on heterogeneous clusters. To achieve a global optimum, the scheduler should be aware of and manage disk resources. > ** Approach-1: Based on node attributes (YARN-3409), the scheduler can allocate containers which have an SSD requirement on nodes with attribute:ssd=true. > ** Approach-2: Based on the extended resource model (YARN-3926), it's easy to support scheduling through extending resource models like vdisk and vssd using this feature, but hard to measure for applications and to isolate for non-CFQ based disks. > * The fallback strategy still needs consideration. Certain applications might not work well when the requirement of storage type is not satisfied. When no disks of the desired storage type are available, should container launch fail, or should the AM handle it? We have implemented a fallback strategy that fails to launch the container when no disks of the desired storage type are available. Are there better methods?
[jira] [Commented] (YARN-5674) FairScheduler handles "dots" in user names inconsistently in the config
[ https://issues.apache.org/jira/browse/YARN-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802740#comment-17802740 ] Shilun Fan commented on YARN-5674: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > FairScheduler handles "dots" in user names inconsistently in the config > --- > > Key: YARN-5674 > URL: https://issues.apache.org/jira/browse/YARN-5674 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > > A user name can contain a dot; because the user name can be used as the queue name, we replace the dot with a defined separator. When defining queues in the configuration for users containing a dot, we expect the dot to be replaced by the "\_dot\_" string. > In the user limits we do not do that, and user limits need a normal dot in the user name. This is confusing when you create a scheduler configuration: in some places you need to replace the dot, in others you do not. This can cause issues where user limits are not enforced as expected. > We should use one way to specify the user, and since the queue naming cannot be changed, we should also use the same "\_dot\_" in the user limits and enforce them correctly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
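The consistent rule proposed above amounts to a single substitution applied everywhere a user name is used; a sketch (illustrative, not the FairScheduler's actual code):
{code}
class UserNameNormalizer {
  private static final String DOT_REPLACEMENT = "_dot_";

  // Apply the same substitution the queue naming already uses, so
  // "first.last" is written as "first_dot_last" both in queue names and in
  // user-limit entries, and the two always match.
  static String normalize(String userName) {
    return userName.replace(".", DOT_REPLACEMENT);
  }
}
{code}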
[jira] [Updated] (YARN-5683) Support specifying storage type for per-application local dirs
[ https://issues.apache.org/jira/browse/YARN-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-5683: - Target Version/s: 3.5.0 (was: 3.4.0) > Support specifying storage type for per-application local dirs > -- > > Key: YARN-5683 > URL: https://issues.apache.org/jira/browse/YARN-5683 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: oct16-hard > Attachments: YARN-5683-1.patch, YARN-5683-2.patch, YARN-5683-3.patch, > flow_diagram_for_MapReduce-2.png, flow_diagram_for_MapReduce.png > > > h3. Introduction > * Some applications of various frameworks (Flink, Spark, MapReduce, etc.) that use local storage (checkpoint, shuffle, etc.) might require high IO performance. It's useful to allocate local directories on high-performance storage media for these applications on heterogeneous clusters. > * YARN does not distinguish different storage types, and hence applications cannot selectively use storage media with different performance characteristics. Adding awareness of storage media can allow YARN to make better decisions about the placement of local directories. > h3. Approach > * NodeManager will distinguish storage types for local directories. > ** The yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs configuration should allow the cluster administrator to optionally specify the storage type for each local directory. Example: [SSD]/disk1/nm-local-dir,/disk2/nm-local-dir,/disk3/nm-local-dir (equivalent to [SSD]/disk1/nm-local-dir,[DISK]/disk2/nm-local-dir,[DISK]/disk3/nm-local-dir) > ** StorageType defines the DISK/SSD storage types and takes DISK as the default storage type. > ** StorageLocation separates storage type and directory path, and is used by LocalDirAllocator to be aware of the types of local dirs; the default storage type is DISK. > ** The getLocalPathForWrite method of LocalDirAllocator will prefer to choose a local directory of the specified storage type, and will fall back to ignoring the storage type if the requirement cannot be satisfied. > ** Support for container-related local/log directories is added in ContainerLaunch. All application frameworks can set the environment variables (LOCAL_STORAGE_TYPE and LOG_STORAGE_TYPE) to specify the desired storage type of local/log directories, and can choose not to launch the container on fallback through these environment variables (ENSURE_LOCAL_STORAGE_TYPE and ENSURE_LOG_STORAGE_TYPE). > * Allow a specified storage type for various frameworks (take MapReduce as an example): > ** New configurations should allow the application administrator to optionally specify the storage type of local/log directories and the fallback strategy (MapReduce configurations: mapreduce.job.local-storage-type, mapreduce.job.log-storage-type, mapreduce.job.ensure-local-storage-type and mapreduce.job.ensure-log-storage-type). > ** Support for container work directories: set the environment variables (LOCAL_STORAGE_TYPE and LOG_STORAGE_TYPE) according to the configurations above for ContainerLaunchContext and ApplicationSubmissionContext. (MapReduce should update YARNRunner and TaskAttemptImpl.) > ** Add a storage type prefix to the request path to support other local directories of frameworks (such as shuffle directories for MapReduce). (MapReduce should update YarnOutputFiles, MROutputFiles and YarnChild to support output/work directories.) > ** Flow diagram for MapReduce framework > !flow_diagram_for_MapReduce-2.png! > h3. Further Discussion > * Scheduling: the requirement of storage type for local/log directories may not be satisfied for some nodes on heterogeneous clusters. To achieve a global optimum, the scheduler should be aware of and manage disk resources. > ** Approach-1: Based on node attributes (YARN-3409), the scheduler can allocate containers which have an SSD requirement on nodes with attribute:ssd=true. > ** Approach-2: Based on the extended resource model (YARN-3926), it's easy to support scheduling through extending resource models like vdisk and vssd using this feature, but hard to measure for applications and to isolate for non-CFQ based disks. > * The fallback strategy still needs consideration. Certain applications might not work well when the requirement of storage type is not satisfied. When no disks of the desired storage type are available, should container launch fail, or should the AM handle it? We have implemented a fallback strategy that fails to launch the container when no disks of the desired storage type are available. Are there better methods? > This feature has been used for half a year to meet
[jira] [Updated] (YARN-5814) Add druid as storage backend in YARN Timeline Service
[ https://issues.apache.org/jira/browse/YARN-5814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-5814: - Target Version/s: 3.5.0 (was: 3.4.0) > Add druid as storage backend in YARN Timeline Service > -- > > Key: YARN-5814 > URL: https://issues.apache.org/jira/browse/YARN-5814 > Project: Hadoop YARN > Issue Type: New Feature > Components: ATSv2 >Affects Versions: 3.0.0-alpha2 >Reporter: Bingxue Qiu >Priority: Major > Attachments: Add-Druid-in-YARN-Timeline-Service.pdf > > > h3. Introduction > I propose to add Druid as a storage backend in YARN Timeline Service. > We run more than 6000 applications and generate 450 million metrics daily in Alibaba clusters with thousands of nodes. We need to collect and store meta/events/metrics data, analyze utilization reports across various dimensions online, and display allocation/usage resource trends for the cluster by joining and aggregating data. This helps us manage and optimize the cluster by tracking resource utilization. > To achieve our goal we have switched to Druid as the storage instead of HBase, and have achieved sub-second OLAP performance in our production environment for a few months. > h3. Analysis > Currently YARN Timeline Service only supports aggregating metrics at a) the flow level by FlowRunCoprocessor and b) the application level by AppLevelTimelineCollector; offline (time-based periodic) aggregation for flows/users/queues for reporting and analysis is planned but not yet implemented. YARN Timeline Service chooses Apache HBase as the primary storage backend, and as we all know, HBase is not a good fit for OLAP. > For arbitrary exploration of data, such as online analysis of utilization reports across various dimensions (Queue, Flow, Users, Application, CPU, Memory) by joining and aggregating data, Druid's custom column format enables ad-hoc queries without pre-computation. The format also enables fast scans on columns, which is important for good aggregation performance. > To achieve our goal of supporting online analysis of utilization reports across various dimensions, displaying allocation/usage resource trends for the cluster, and arbitrary exploration of data, we propose to add Druid storage and implement DruidWriter/DruidReader in YARN Timeline Service. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5814) Add druid as storage backend in YARN Timeline Service
[ https://issues.apache.org/jira/browse/YARN-5814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802738#comment-17802738 ] Shilun Fan commented on YARN-5814: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Add druid as storage backend in YARN Timeline Service > -- > > Key: YARN-5814 > URL: https://issues.apache.org/jira/browse/YARN-5814 > Project: Hadoop YARN > Issue Type: New Feature > Components: ATSv2 >Affects Versions: 3.0.0-alpha2 >Reporter: Bingxue Qiu >Priority: Major > Attachments: Add-Druid-in-YARN-Timeline-Service.pdf > > > h3. Introduction > I propose to add Druid as a storage backend in YARN Timeline Service. > We run more than 6000 applications and generate 450 million metrics daily in Alibaba clusters with thousands of nodes. We need to collect and store meta/events/metrics data, analyze utilization reports across various dimensions online, and display allocation/usage resource trends for the cluster by joining and aggregating data. This helps us manage and optimize the cluster by tracking resource utilization. > To achieve our goal we have switched to Druid as the storage instead of HBase, and have achieved sub-second OLAP performance in our production environment for a few months. > h3. Analysis > Currently YARN Timeline Service only supports aggregating metrics at a) the flow level by FlowRunCoprocessor and b) the application level by AppLevelTimelineCollector; offline (time-based periodic) aggregation for flows/users/queues for reporting and analysis is planned but not yet implemented. YARN Timeline Service chooses Apache HBase as the primary storage backend, and as we all know, HBase is not a good fit for OLAP. > For arbitrary exploration of data, such as online analysis of utilization reports across various dimensions (Queue, Flow, Users, Application, CPU, Memory) by joining and aggregating data, Druid's custom column format enables ad-hoc queries without pre-computation. The format also enables fast scans on columns, which is important for good aggregation performance. > To achieve our goal of supporting online analysis of utilization reports across various dimensions, displaying allocation/usage resource trends for the cluster, and arbitrary exploration of data, we propose to add Druid storage and implement DruidWriter/DruidReader in YARN Timeline Service. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5902) yarn.scheduler.increment-allocation-mb and yarn.scheduler.increment-allocation-vcores are undocumented in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-5902: - Target Version/s: 3.5.0 (was: 3.4.0) > yarn.scheduler.increment-allocation-mb and > yarn.scheduler.increment-allocation-vcores are undocumented in > yarn-default.xml > -- > > Key: YARN-5902 > URL: https://issues.apache.org/jira/browse/YARN-5902 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Major > Attachments: YARN-5902.001.patch > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5902) yarn.scheduler.increment-allocation-mb and yarn.scheduler.increment-allocation-vcores are undocumented in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802735#comment-17802735 ] Shilun Fan commented on YARN-5902: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > yarn.scheduler.increment-allocation-mb and > yarn.scheduler.increment-allocation-vcores are undocumented in > yarn-default.xml > -- > > Key: YARN-5902 > URL: https://issues.apache.org/jira/browse/YARN-5902 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Major > Attachments: YARN-5902.001.patch > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
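For context, these two properties control the rounding step applied to container resource requests. A hedged sketch of reading them through Hadoop's Configuration API follows; the fallback values shown (1024 MB, 1 vcore) are assumptions for illustration, not quoted from yarn-default.xml:

{code}
// Sketch only: reading the two undocumented properties via the real
// org.apache.hadoop.conf.Configuration API. The defaults passed to getInt()
// are illustrative assumptions, not authoritative values.
import org.apache.hadoop.conf.Configuration;

public class IncrementAllocationSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    int incrementMb =
        conf.getInt("yarn.scheduler.increment-allocation-mb", 1024);
    int incrementVcores =
        conf.getInt("yarn.scheduler.increment-allocation-vcores", 1);
    // Container asks are rounded up to multiples of these increments,
    // e.g. a 1500 MB request becomes 2048 MB with a 1024 MB increment.
    System.out.println("increment-mb=" + incrementMb
        + ", increment-vcores=" + incrementVcores);
  }
}
{code}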
[jira] [Commented] (YARN-5852) Consolidate CSAssignment, ContainerAllocation, ContainerAllocationContext class in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802736#comment-17802736 ] Shilun Fan commented on YARN-5852: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Consolidate CSAssignment, ContainerAllocation, ContainerAllocationContext > class in CapacityScheduler > > > Key: YARN-5852 > URL: https://issues.apache.org/jira/browse/YARN-5852 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Priority: Major > > There are quite a few data structures with similar names that wrap container-related info: CSAssignment, ContainerAllocation, ContainerAllocationContext, plus a fair amount of code converting one to another. We should consolidate these into a single class. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5852) Consolidate CSAssignment, ContainerAllocation, ContainerAllocationContext class in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-5852: - Target Version/s: 3.5.0 (was: 3.4.0) > Consolidate CSAssignment, ContainerAllocation, ContainerAllocationContext > class in CapacityScheduler > > > Key: YARN-5852 > URL: https://issues.apache.org/jira/browse/YARN-5852 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Priority: Major > > There are quite a few data structures with similar names that wrap container-related info: CSAssignment, ContainerAllocation, ContainerAllocationContext, plus a fair amount of code converting one to another. We should consolidate these into a single class. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
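As an illustration of the consolidation idea, a single result type could replace the three overlapping ones; every name and field below is a hypothetical sketch, not the actual CapacityScheduler code:

{code}
// Hypothetical sketch of one consolidated allocation-result class. Field
// and type names are assumptions; the real types (ContainerId, Resource)
// are replaced with stand-ins to keep the example self-contained.
public class UnifiedContainerAllocation {
  public enum State { ALLOCATED, RESERVED, SKIPPED }

  private final State state;
  private final String containerId;   // stand-in for the real ContainerId
  private final long memoryMb;
  private final int vcores;

  public UnifiedContainerAllocation(State state, String containerId,
      long memoryMb, int vcores) {
    this.state = state;
    this.containerId = containerId;
    this.memoryMb = memoryMb;
    this.vcores = vcores;
  }

  // A single type removes the CSAssignment <-> ContainerAllocation
  // conversion code the issue complains about.
  public State getState() { return state; }
  public String getContainerId() { return containerId; }
  public long getMemoryMb() { return memoryMb; }
  public int getVcores() { return vcores; }
}
{code}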
[jira] [Commented] (YARN-5995) Add RMStateStore metrics to monitor all RMStateStoreEventTypeTransition performance
[ https://issues.apache.org/jira/browse/YARN-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802734#comment-17802734 ] Shilun Fan commented on YARN-5995: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Add RMStateStore metrics to monitor all RMStateStoreEventTypeTransition > performance > --- > > Key: YARN-5995 > URL: https://issues.apache.org/jira/browse/YARN-5995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, resourcemanager >Affects Versions: 2.7.1 > Environment: CentOS7.2 Hadoop-2.7.1 >Reporter: zhangyubiao >Assignee: zhangyubiao >Priority: Major > Labels: patch > Attachments: YARN-5995.0001.patch, YARN-5995.0002.patch, > YARN-5995.0003.patch, YARN-5995.0004.patch, YARN-5995.0005.patch, > YARN-5995.patch > > > Add RMStateStore metrics to monitor all RMStateStoreEventTypeTransition > performance -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5995) Add RMStateStore metrics to monitor all RMStateStoreEventTypeTransition performance
[ https://issues.apache.org/jira/browse/YARN-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-5995: - Target Version/s: 3.5.0 (was: 3.4.0) > Add RMStateStore metrics to monitor all RMStateStoreEventTypeTransition > performance > --- > > Key: YARN-5995 > URL: https://issues.apache.org/jira/browse/YARN-5995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, resourcemanager >Affects Versions: 2.7.1 > Environment: CentOS7.2 Hadoop-2.7.1 >Reporter: zhangyubiao >Assignee: zhangyubiao >Priority: Major > Labels: patch > Attachments: YARN-5995.0001.patch, YARN-5995.0002.patch, > YARN-5995.0003.patch, YARN-5995.0004.patch, YARN-5995.0005.patch, > YARN-5995.patch > > > Add RMStateStore metrics to monitor all RMStateStoreEventTypeTransition > performance -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
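A minimal sketch of what such monitoring could look like, assuming plain JDK timing rather than the Hadoop metrics2 integration an actual patch would use:

{code}
// Sketch under assumptions: time each state-store event transition and keep
// per-event-type totals, enough to spot slow RMStateStore operations. A real
// patch would publish these through Hadoop metrics2 instead.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class StateStoreTimings {
  private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();
  private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

  public void record(String eventType, Runnable transition) {
    long start = System.nanoTime();
    try {
      transition.run();                  // the state-store transition itself
    } finally {
      long elapsed = System.nanoTime() - start;
      totalNanos.computeIfAbsent(eventType, k -> new LongAdder()).add(elapsed);
      counts.computeIfAbsent(eventType, k -> new LongAdder()).increment();
    }
  }

  public double avgMillis(String eventType) {
    long n = counts.getOrDefault(eventType, new LongAdder()).sum();
    return n == 0 ? 0.0 : totalNanos.get(eventType).sum() / 1_000_000.0 / n;
  }
}
{code}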
[jira] [Commented] (YARN-6147) Blacklisting nodes not happening for AM containers
[ https://issues.apache.org/jira/browse/YARN-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802732#comment-17802732 ] Shilun Fan commented on YARN-6147: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Blacklisting nodes not happening for AM containers > -- > > Key: YARN-6147 > URL: https://issues.apache.org/jira/browse/YARN-6147 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Major > > Blacklisting of nodes is not happening in the following scenarios: > 1. RMAppAttempt is in ALLOCATED and a LAUNCH_FAILED event arrives while the NM is down. > 2. RMAppAttempt is in LAUNCHED and an EXPIRE event arrives while the NM is down. > In both cases the AppAttempt goes to *FINAL_SAVING* and eventually to the *FINAL* state before the *CONTAINER_FINISHED* event is triggered by {{RMContainerImpl}}, and in the {{FINAL}} state the {{CONTAINER_FINISHED}} event is ignored. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6147) Blacklisting nodes not happening for AM containers
[ https://issues.apache.org/jira/browse/YARN-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-6147: - Target Version/s: 3.5.0 (was: 3.4.0) > Blacklisting nodes not happening for AM containers > -- > > Key: YARN-6147 > URL: https://issues.apache.org/jira/browse/YARN-6147 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Major > > Blacklisting of nodes is not happening in the following scenarios: > 1. RMAppAttempt is in ALLOCATED and a LAUNCH_FAILED event arrives while the NM is down. > 2. RMAppAttempt is in LAUNCHED and an EXPIRE event arrives while the NM is down. > In both cases the AppAttempt goes to *FINAL_SAVING* and eventually to the *FINAL* state before the *CONTAINER_FINISHED* event is triggered by {{RMContainerImpl}}, and in the {{FINAL}} state the {{CONTAINER_FINISHED}} event is ignored. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6088) RM UI has to redirect to AHS for completed applications logs
[ https://issues.apache.org/jira/browse/YARN-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-6088: - Target Version/s: 3.5.0 (was: 3.4.0) > RM UI has to redirect to AHS for completed applications logs > > > Key: YARN-6088 > URL: https://issues.apache.org/jira/browse/YARN-6088 > Project: Hadoop YARN > Issue Type: Task > Components: webapp >Affects Versions: 2.7.3 >Reporter: Sunil G >Priority: Major > > Currently the AM container logs link in RMAppBlock is hardcoded to the container's host node. If that node is unavailable, we will not have enough information. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6382) Address race condition on TimelineWriter.flush() caused by buffer-sized flush
[ https://issues.apache.org/jira/browse/YARN-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-6382: - Target Version/s: 3.5.0 (was: 3.4.0) > Address race condition on TimelineWriter.flush() caused by buffer-sized flush > - > > Key: YARN-6382 > URL: https://issues.apache.org/jira/browse/YARN-6382 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Yousef Abu-Salah >Priority: Major > > YARN-6376 fixes the race condition between putEntities() and the periodic flush() by WriterFlushThread in TimelineCollectorManager, and between putEntities() calls in different threads. > However, BufferedMutator can also trigger an internal size-based flush. We need to address the resulting race condition. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-6315: - Target Version/s: 3.5.0 (was: 3.4.0) > Improve LocalResourcesTrackerImpl#isResourcePresent to return false for > corrupted files > --- > > Key: YARN-6315 > URL: https://issues.apache.org/jira/browse/YARN-6315 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.3, 2.8.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-6315.001.patch, YARN-6315.002.patch, YARN-6315.003.patch, YARN-6315.004.patch, YARN-6315.005.patch, YARN-6315.006.patch > > > We currently check whether a resource is present by making sure that the file exists locally. There can be a case where the tracker thinks it has the resource because the file exists, even though its size is 0 or less than the "expected" size of the LocalResource. This JIRA tracks the change to harden the isResourcePresent call to address that case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6315) Improve LocalResourcesTrackerImpl#isResourcePresent to return false for corrupted files
[ https://issues.apache.org/jira/browse/YARN-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802727#comment-17802727 ] Shilun Fan commented on YARN-6315: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Improve LocalResourcesTrackerImpl#isResourcePresent to return false for > corrupted files > --- > > Key: YARN-6315 > URL: https://issues.apache.org/jira/browse/YARN-6315 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.3, 2.8.1 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Major > Attachments: YARN-6315.001.patch, YARN-6315.002.patch, YARN-6315.003.patch, YARN-6315.004.patch, YARN-6315.005.patch, YARN-6315.006.patch > > > We currently check whether a resource is present by making sure that the file exists locally. There can be a case where the tracker thinks it has the resource because the file exists, even though its size is 0 or less than the "expected" size of the LocalResource. This JIRA tracks the change to harden the isResourcePresent call to address that case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
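A hedged sketch of the hardened check described above, as a standalone illustration rather than the actual LocalResourcesTrackerImpl code:

{code}
// Sketch: treat a resource as present only if the local file exists AND is
// at least the expected size. The method name mirrors isResourcePresent,
// but this is an illustration, not the real implementation.
import java.io.File;

public class ResourcePresenceCheck {
  static boolean isResourcePresent(File localFile, long expectedSize) {
    if (!localFile.exists()) {
      return false;
    }
    long len = localFile.length();
    // The bug described above: a zero-length file, or one shorter than the
    // expected LocalResource size, was still treated as present.
    return len > 0 && len >= expectedSize;
  }

  public static void main(String[] args) {
    System.out.println(
        isResourcePresent(new File("/tmp/resource.jar"), 1048576L));
  }
}
{code}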
[jira] [Commented] (YARN-6382) Address race condition on TimelineWriter.flush() caused by buffer-sized flush
[ https://issues.apache.org/jira/browse/YARN-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802725#comment-17802725 ] Shilun Fan commented on YARN-6382: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Address race condition on TimelineWriter.flush() caused by buffer-sized flush > - > > Key: YARN-6382 > URL: https://issues.apache.org/jira/browse/YARN-6382 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Yousef Abu-Salah >Priority: Major > > YARN-6376 fixes the race condition between putEntities() and the periodic flush() by WriterFlushThread in TimelineCollectorManager, and between putEntities() calls in different threads. > However, BufferedMutator can also trigger an internal size-based flush. We need to address the resulting race condition. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
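One possible shape of the fix, sketched under assumptions: serialize explicit flush() calls against in-flight writes with a read-write lock. BufferedMutator and Mutation are the real HBase client types; the wrapper class itself is hypothetical, and BufferedMutator's internal size-based flush would still need separate handling inside or around the mutator:

{code}
// Hypothetical wrapper: many writers share the read lock, so mutate() calls
// run concurrently; flush() takes the write lock and therefore excludes all
// in-flight writes. This covers the collector-side race only.
import java.io.IOException;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Mutation;

public class GuardedTimelineWriter {
  private final BufferedMutator mutator;
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  public GuardedTimelineWriter(BufferedMutator mutator) {
    this.mutator = mutator;
  }

  public void write(Mutation m) throws IOException {
    lock.readLock().lock();
    try {
      mutator.mutate(m);
    } finally {
      lock.readLock().unlock();
    }
  }

  public void flush() throws IOException {
    lock.writeLock().lock();
    try {
      mutator.flush();
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}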
[jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802724#comment-17802724 ] Shilun Fan commented on YARN-6409: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Kanwaljeet Sachdev >Priority: Major > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that happen after the AM container is launched (see RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch can also fail if the NM where the AM container is allocated goes unresponsive. Because this case is not handled, the scheduler may continue to allocate AM containers on that same NM for subsequent app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host Details : local host is: "*.me.com/17.111.179.113"; destination host is: "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read.
ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at >
[jira] [Commented] (YARN-6429) Revisit implementation of LocalitySchedulingPlacementSet to avoid invoke methods of AppSchedulingInfo
[ https://issues.apache.org/jira/browse/YARN-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802722#comment-17802722 ] Shilun Fan commented on YARN-6429: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Revisit implementation of LocalitySchedulingPlacementSet to avoid invoke > methods of AppSchedulingInfo > - > > Key: YARN-6429 > URL: https://issues.apache.org/jira/browse/YARN-6429 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > > An example is LocalitySchedulingPlacementSet#decrementOutstanding: it calls appSchedulingInfo directly, which could potentially cause trouble since it modifies the parent from the child. Is it possible to move this logic to AppSchedulingInfo#allocate? > We need to check the other methods as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6426) Compress ZK YARN keys to scale up (especially AppStateData)
[ https://issues.apache.org/jira/browse/YARN-6426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-6426: - Target Version/s: 3.5.0 (was: 3.4.0) > Compress ZK YARN keys to scale up (especially AppStateData) > -- > > Key: YARN-6426 > URL: https://issues.apache.org/jira/browse/YARN-6426 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.0.0-alpha2 >Reporter: Roni Burd >Assignee: Roni Burd >Priority: Major > Labels: patch > Attachments: zkcompression.patch > > > ZK today stores the protobuf files uncompressed. This is not an issue, except that if a customer job has thousands of files, AppStateData will store the user context as a string with multiple URLs, and it is easy to reach 1 MB or more. > This can put unnecessary strain on ZK and slow the process down. > The proposal is to simply compress the protobufs before sending them to ZK. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6426) Compress ZK YARN keys to scale up (especially AppStateData)
[ https://issues.apache.org/jira/browse/YARN-6426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802723#comment-17802723 ] Shilun Fan commented on YARN-6426: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Compress ZK YARN keys to scale up (especially AppStateData) > -- > > Key: YARN-6426 > URL: https://issues.apache.org/jira/browse/YARN-6426 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.0.0-alpha2 >Reporter: Roni Burd >Assignee: Roni Burd >Priority: Major > Labels: patch > Attachments: zkcompression.patch > > > ZK today stores the protobuf files uncompressed. This is not an issue, except that if a customer job has thousands of files, AppStateData will store the user context as a string with multiple URLs, and it is easy to reach 1 MB or more. > This can put unnecessary strain on ZK and slow the process down. > The proposal is to simply compress the protobufs before sending them to ZK. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
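A JDK-only sketch of the proposal follows; the ZKRMStateStore integration points are not shown, and this only illustrates the compress-before-write and decompress-after-read steps:

{code}
// Sketch: gzip the serialized AppStateData bytes before writing the znode,
// and gunzip on read. Class and method names are illustrative assumptions.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ZkPayloadCodec {
  static byte[] compress(byte[] raw) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(raw);
    }
    return bos.toByteArray();    // what would be stored in the znode
  }

  static byte[] decompress(byte[] stored) throws IOException {
    try (GZIPInputStream gz =
        new GZIPInputStream(new ByteArrayInputStream(stored))) {
      return gz.readAllBytes();  // Java 9+; loop with a buffer on Java 8
    }
  }
}
{code}

Gzip works well here because the payload is dominated by repeated URL strings, exactly the case the issue describes.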
[jira] [Updated] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-6409: - Target Version/s: 3.5.0 (was: 3.4.0) > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Kanwaljeet Sachdev >Priority: Major > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that happen after the AM container is launched (see RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch can also fail if the NM where the AM container is allocated goes unresponsive. Because this case is not handled, the scheduler may continue to allocate AM containers on that same NM for subsequent app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host Details : local host is: "*.me.com/17.111.179.113"; destination host is: "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read.
ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) > at >
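To illustrate the missing policy, here is a hypothetical sketch that counts a timed-out AM launch toward node blacklisting; it is not the real RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting() logic:

{code}
// Sketch under assumptions: when an AM launch fails because the NM
// connection timed out (the SocketTimeoutException in the trace above),
// record the host so the next attempt avoids it.
import java.net.SocketTimeoutException;
import java.util.HashSet;
import java.util.Set;

public class AmLaunchBlacklistSketch {
  private final Set<String> blacklistedNodes = new HashSet<>();

  public void onAmLaunchFailure(String nodeHost, Throwable cause) {
    // An unresponsive NM surfaces as a SocketTimeoutException wrapped in
    // IOException layers, so walk the cause chain.
    for (Throwable t = cause; t != null; t = t.getCause()) {
      if (t instanceof SocketTimeoutException) {
        blacklistedNodes.add(nodeHost);
        return;
      }
    }
  }

  public boolean isBlacklisted(String nodeHost) {
    return blacklistedNodes.contains(nodeHost);
  }
}
{code}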
[jira] [Updated] (YARN-6429) Revisit implementation of LocalitySchedulingPlacementSet to avoid invoke methods of AppSchedulingInfo
[ https://issues.apache.org/jira/browse/YARN-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-6429: - Target Version/s: 3.5.0 (was: 3.4.0) > Revisit implementation of LocalitySchedulingPlacementSet to avoid invoke > methods of AppSchedulingInfo > - > > Key: YARN-6429 > URL: https://issues.apache.org/jira/browse/YARN-6429 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > > An example is LocalitySchedulingPlacementSet#decrementOutstanding: it calls appSchedulingInfo directly, which could potentially cause trouble since it modifies the parent from the child. Is it possible to move this logic to AppSchedulingInfo#allocate? > We need to check the other methods as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6466) Provide shaded framework jar for containers
[ https://issues.apache.org/jira/browse/YARN-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-6466: - Target Version/s: 3.5.0 (was: 3.4.0) > Provide shaded framework jar for containers > --- > > Key: YARN-6466 > URL: https://issues.apache.org/jira/browse/YARN-6466 > Project: Hadoop YARN > Issue Type: New Feature > Components: build, yarn >Affects Versions: 3.0.0-alpha1 >Reporter: Sean Busbey >Assignee: Haibo Chen >Priority: Major > > We should build on the existing shading work to provide a jar with all of the > bits needed within a YARN application's container to talk to the resource > manager and node manager. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6488) Remove continuous scheduling tests
[ https://issues.apache.org/jira/browse/YARN-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802720#comment-17802720 ] Shilun Fan commented on YARN-6488: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > Remove continuous scheduling tests > -- > > Key: YARN-6488 > URL: https://issues.apache.org/jira/browse/YARN-6488 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > > Remove all continuous scheduling tests from the code -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6606) The implementation of LocalizationStatus in ContainerStatusProto
[ https://issues.apache.org/jira/browse/YARN-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802717#comment-17802717 ] Shilun Fan commented on YARN-6606: -- Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0. > The implementation of LocalizationStatus in ContainerStatusProto > > > Key: YARN-6606 > URL: https://issues.apache.org/jira/browse/YARN-6606 > Project: Hadoop YARN > Issue Type: Task > Components: nodemanager >Affects Versions: 2.9.0 >Reporter: Bingxue Qiu >Priority: Major > Attachments: YARN-6606.1.patch, YARN-6606.2.patch > > > We have a use case where the full implementation of localization status in ContainerStatusProto ([Continuous-resource-localization|https://issues.apache.org/jira/secure/attachment/12825041/Continuous-resource-localization.pdf]) needs to be done, so we implemented it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
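As an illustration of what such a status entry could carry, here is a hypothetical Java mirror; the field names are assumptions drawn from the continuous-resource-localization design doc referenced above, not the actual proto definition:

{code}
// Hypothetical mirror of a per-resource localization status entry inside
// ContainerStatusProto. All names here are illustrative assumptions.
public class LocalizationStatusSketch {
  public enum LocalizationState { PENDING, COMPLETED, FAILED }

  private final String resourceKey;       // e.g. the requested resource path
  private final LocalizationState state;
  private final String diagnostics;       // populated when state == FAILED

  public LocalizationStatusSketch(String resourceKey,
      LocalizationState state, String diagnostics) {
    this.resourceKey = resourceKey;
    this.state = state;
    this.diagnostics = diagnostics;
  }

  public String getResourceKey() { return resourceKey; }
  public LocalizationState getState() { return state; }
  public String getDiagnostics() { return diagnostics; }
}
{code}

Exposing per-resource state like this would let an AM poll container status and observe localization progress instead of inferring it from container start delays.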