[jira] [Commented] (YARN-371) Consolidate resource requests in AM-RM heartbeat

2013-02-04 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570109#comment-13570109
 ] 

Tom White commented on YARN-371:


Looks like there's a misunderstanding here - Sandy talks about _reducing_ the 
memory requirements of the RM. If I understand the proposal correctly, the 
number of resource request objects sent by the AM in MR would be reduced from 
five (three node-local, one rack-local, one ANY) to one resource request with 
an array of locations (host names) of length five.
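
To make that concrete, a rough sketch of the consolidated shape (the class and 
field names below are illustrative only, not the actual YARN API):

{code}
// Illustrative sketch only -- the names here are not the actual YARN API.
import java.util.Arrays;
import java.util.List;

class ConsolidatedRequestSketch {
  final List<String> locations;  // acceptable nodes, racks, and possibly "*"
  final int numContainers;
  final int memoryMb;

  ConsolidatedRequestSketch(List<String> locations, int numContainers, int memoryMb) {
    this.locations = locations;
    this.numContainers = numContainers;
    this.memoryMb = memoryMb;
  }

  public static void main(String[] args) {
    // What is currently five separate requests (node1, node2, node3, rack1, *)
    // for one node-local task becomes a single request carrying five locations:
    ConsolidatedRequestSketch req = new ConsolidatedRequestSketch(
        Arrays.asList("node1", "node2", "node3", "rack1", "*"), 1, 1024);
    System.out.println(req.locations.size() + " locations in one request object");
  }
}
{code}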

BTW Arun, immediately vetoing an issue in the first comment is not conducive to 
a balanced discussion!

 Consolidate resource requests in AM-RM heartbeat
 

 Key: YARN-371
 URL: https://issues.apache.org/jira/browse/YARN-371
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api, resourcemanager, scheduler
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Each AMRM heartbeat consists of a list of resource requests. Currently, each 
 resource request consists of a container count, a resource vector, and a 
 location, which may be a node, a rack, or *. When an application wishes to 
 request a task run in multiple locations, it must issue a request for each 
 location.  This means that for a node-local task, it must issue three 
 requests, one at the node-level, one at the rack-level, and one with * (any). 
 These requests are not linked with each other, so when a container is 
 allocated for one of them, the RM has no way of knowing which others to get 
 rid of. When a node-local container is allocated, this is handled by 
 decrementing the number of requests on that node's rack and in *. But when 
 the scheduler allocates a task with a node-local request on its rack, the 
 request on the node is left there.  This can cause delay-scheduling to try to 
 assign a container on a node that nobody cares about anymore.
 Additionally, unless I am missing something, the current model does not allow 
 requests for containers only on a specific node or specific rack. While this 
 is not a use case for MapReduce currently, it is conceivable that it might be 
 something useful to support in the future, for example to schedule 
 long-running services that persist state in a particular location, or for 
 applications that generally care less about latency than data-locality.
 Lastly, the ability to understand which requests are for the same task will 
 possibly allow future schedulers to make more intelligent scheduling 
 decisions, as well as permit a more exact understanding of request load.
 I would propose the tweak of allowing a single ResourceRequest to encapsulate 
 all the location information for a task.  So instead of just a single 
 location, a ResourceRequest would contain an array of locations, including 
 nodes that it would be happy with, racks that it would be happy with, and 
 possibly *.  Side effects of this change would be a reduction in the amount 
 of data that needs to be transferred in a heartbeat, as well as in the RM's 
 memory footprint, because what used to be different requests for the same 
 task are now able to share some common data.
 While this change breaks compatibility, if it is going to happen, it makes 
 sense to do it now, before YARN becomes beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-374) Job History Server doesn't show jobs which were killed by ClientRMProtocol.forceKillApplication

2013-02-04 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570285#comment-13570285
 ] 

Thomas Graves commented on YARN-374:


At a high level, the job history server is currently a MapReduce-specific 
component, so this behavior is currently expected.  YARN/RM doesn't know about 
any history server, so if you force-kill something through the RM it has no 
knowledge of how to handle the history; it simply force-kills the application. 
There is another jira that is looking at making the history server generic, which 
would help with this issue - see YARN-321.

 Job History Server doesn't show jobs which were killed by 
 ClientRMProtocol.forceKillApplication
 --

 Key: YARN-374
 URL: https://issues.apache.org/jira/browse/YARN-374
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client, resourcemanager
Affects Versions: 2.0.1-alpha
Reporter: nemon lou

 After I kill an app by typing bin/yarn rmadmin app -kill APP_ID,
 no job info is kept on the JHS web page.
 However, when I kill a job by typing bin/mapred job -kill JOB_ID,
 I can see a killed job left on the JHS.
 Some Hive users are confused that their jobs have been killed but nothing is left 
 on the JHS, and the killed app's info on the RM web page is not enough. (They kill 
 jobs via ClientRMProtocol.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-371) Consolidate resource requests in AM-RM heartbeat

2013-02-04 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570314#comment-13570314
 ] 

Arun C Murthy commented on YARN-371:


{quote}
Looks like there's a misunderstanding here - Sandy talks about reducing the 
memory requirements of the RM. If I understand the proposal correctly, the 
number of resource request objects sent by the AM in MR would be reduced from 
five (three node-local, one rack-local, one ANY) to one resource request with 
an array of locations (host names) of length five.
{quote}

Please read my explanation again. 

This change is *explicitly* against the design goals of YARN ResourceManager 
and would increase memory requirements of RM by a couple of orders of magnitude.

Hadoop MR applications, routinely, have 100K+ tasks. The proposed change in 
this jira would require 100K+ resource-requests (one per task). Currently, in 
YARN, that can be expressed in O(nodes + racks + 1) resource-requests, which is 
~O(5000) on even the largest clusters known today. 

So, in effect, this change would be a significant regression and result in 
100,000 resource-requests v/s ~5000 needed today.
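
To put rough numbers on it (the figures below are purely illustrative, not 
measurements):

{code}
// Back-of-the-envelope comparison; cluster and task counts are illustrative only.
class RequestCountSketch {
  public static void main(String[] args) {
    int nodes = 4000, racks = 100, tasks = 100000;

    // Resource-centric model (current): bounded by cluster size,
    // roughly O(nodes + racks + 1) distinct request entries.
    int resourceCentric = nodes + racks + 1;

    // Task-centric model (proposed): one request object per task.
    int taskCentric = tasks;

    System.out.println("resource-centric: " + resourceCentric); // 4101
    System.out.println("task-centric:     " + taskCentric);     // 100000
  }
}
{code}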

bq. BTW Arun, immediately vetoing an issue in the first comment is not 
conducive to a balanced discussion!

Tom - You can read it as a veto, or you can read it as *I strongly disagree 
since this is against the goals of the project and a significant regression*. 
IAC, we should allow for people's communication style... and keep discussions 
technical - I'd appreciate that.

 Consolidate resource requests in AM-RM heartbeat
 

 Key: YARN-371
 URL: https://issues.apache.org/jira/browse/YARN-371
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api, resourcemanager, scheduler
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Each AMRM heartbeat consists of a list of resource requests. Currently, each 
 resource request consists of a container count, a resource vector, and a 
 location, which may be a node, a rack, or *. When an application wishes to 
 request a task run in multiple locations, it must issue a request for each 
 location.  This means that for a node-local task, it must issue three 
 requests, one at the node-level, one at the rack-level, and one with * (any). 
 These requests are not linked with each other, so when a container is 
 allocated for one of them, the RM has no way of knowing which others to get 
 rid of. When a node-local container is allocated, this is handled by 
 decrementing the number of requests on that node's rack and in *. But when 
 the scheduler allocates a task with a node-local request on its rack, the 
 request on the node is left there.  This can cause delay-scheduling to try to 
 assign a container on a node that nobody cares about anymore.
 Additionally, unless I am missing something, the current model does not allow 
 requests for containers only on a specific node or specific rack. While this 
 is not a use case for MapReduce currently, it is conceivable that it might be 
 something useful to support in the future, for example to schedule 
 long-running services that persist state in a particular location, or for 
 applications that generally care less about latency than data-locality.
 Lastly, the ability to understand which requests are for the same task will 
 possibly allow future schedulers to make more intelligent scheduling 
 decisions, as well as permit a more exact understanding of request load.
 I would propose the tweak of allowing a single ResourceRequest to encapsulate 
 all the location information for a task.  So instead of just a single 
 location, a ResourceRequest would contain an array of locations, including 
 nodes that it would be happy with, racks that it would be happy with, and 
 possibly *.  Side effects of this change would be a reduction in the amount 
 of data that needs to be transferred in a heartbeat, as well as in the RM's 
 memory footprint, because what used to be different requests for the same 
 task are now able to share some common data.
 While this change breaks compatibility, if it is going to happen, it makes 
 sense to do it now, before YARN becomes beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-371) Consolidate resource requests in AM-RM heartbeat

2013-02-04 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570325#comment-13570325
 ] 

Arun C Murthy commented on YARN-371:


Sandy - please don't let this side discussion distract you, it's an individual 
style thing. I use -1 on grocery list discussions with my wife... 
unfortunately I don't have the luxury of vetos in that context! *smile*

Anyway, there are other good discussion points such as 'allowing requests on a 
specific node/rack' which I have pondered about for a long while too. Maybe we 
can close this jira and open one for specific enhancements?



In future, it would help if jira descriptions are short and propose a specific 
enhancement - this way we can debate solutions separately (maybe even on *-dev 
list). 

On the plus side, this way I can -1 a specific implementation proposal rather 
than the jira too... ;-)

 Consolidate resource requests in AM-RM heartbeat
 

 Key: YARN-371
 URL: https://issues.apache.org/jira/browse/YARN-371
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api, resourcemanager, scheduler
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Each AMRM heartbeat consists of a list of resource requests. Currently, each 
 resource request consists of a container count, a resource vector, and a 
 location, which may be a node, a rack, or *. When an application wishes to 
 request a task run in multiple locations, it must issue a request for each 
 location.  This means that for a node-local task, it must issue three 
 requests, one at the node-level, one at the rack-level, and one with * (any). 
 These requests are not linked with each other, so when a container is 
 allocated for one of them, the RM has no way of knowing which others to get 
 rid of. When a node-local container is allocated, this is handled by 
 decrementing the number of requests on that node's rack and in *. But when 
 the scheduler allocates a task with a node-local request on its rack, the 
 request on the node is left there.  This can cause delay-scheduling to try to 
 assign a container on a node that nobody cares about anymore.
 Additionally, unless I am missing something, the current model does not allow 
 requests for containers only on a specific node or specific rack. While this 
 is not a use case for MapReduce currently, it is conceivable that it might be 
 something useful to support in the future, for example to schedule 
 long-running services that persist state in a particular location, or for 
 applications that generally care less about latency than data-locality.
 Lastly, the ability to understand which requests are for the same task will 
 possibly allow future schedulers to make more intelligent scheduling 
 decisions, as well as permit a more exact understanding of request load.
 I would propose the tweak of allowing a single ResourceRequest to encapsulate 
 all the location information for a task.  So instead of just a single 
 location, a ResourceRequest would contain an array of locations, including 
 nodes that it would be happy with, racks that it would be happy with, and 
 possibly *.  Side effects of this change would be a reduction in the amount 
 of data that needs to be transferred in a heartbeat, as well as in the RM's 
 memory footprint, because what used to be different requests for the same 
 task are now able to share some common data.
 While this change breaks compatibility, if it is going to happen, it makes 
 sense to do it now, before YARN becomes beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-371) Consolidate resource requests in AM-RM heartbeat

2013-02-04 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570329#comment-13570329
 ] 

Robert Joseph Evans commented on YARN-371:
--

Tom, just like Arun said, the memory usage changes based on the size of the 
cluster vs. the size of the request.  The current approach is on the order of 
the size of the cluster, whereas the proposed approach is on the order of the 
number of desired containers.  If I have a 100-node cluster and I am requesting 
10 map tasks, the size will be O(100 nodes + X racks + 1), possibly * 2 if 
reducers are included in it. What is more, it is probably exactly the same size 
of request for 1 or even 1000 tasks.  The proposed approach, in contrast, would 
grow without bound as the number of tasks increased.

However, I also agree with Sandy that the current state compression is lossy 
and as such restricts what is possible in the scheduler. I would like to 
understand better what the size differences would be for various requests, both 
in memory and also over the wire.  It seems conceivable to me that if the size 
difference is not too big, especially over the wire, we could allow the 
scheduler itself to decide on its in-memory representation.  This would allow 
for the Capacity Scheduler to keep its current layout and allow for others to 
experiment with more advanced scheduling options.  Different groups could 
decide which scheduler best fits their needs and workload.  If the size is 
significantly larger I would like to see hard numbers about how much 
better/worse it makes specific use cases.

I am also very concerned about adding too much complexity to the scheduler.  We 
have run into issues where the RM will get very far behind in scheduling 
because it is trying to do a lot already and eventually OOM as its event queue 
grows too large. 

I also don't want to change the scheduler protocol too much without first 
understanding how that new protocol would impact other potential scheduling 
features.  There are a number of other computing patterns that could benefit 
from specific scheduler support.  Things like gang scheduling where you need 
all of the containers at once or none of them can make any progress, or where 
you want all of the containers to be physically close to one another because 
they are very I/O intensive, but you don't really care where exactly they are.  
Or even something like HBase where you essentially want one process on every 
single node with no duplicates.  Do the proposed changes make these use cases 
trivially simple, or do they require a lot of support on the AM to implement 
them?

  

 Consolidate resource requests in AM-RM heartbeat
 

 Key: YARN-371
 URL: https://issues.apache.org/jira/browse/YARN-371
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api, resourcemanager, scheduler
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Each AMRM heartbeat consists of a list of resource requests. Currently, each 
 resource request consists of a container count, a resource vector, and a 
 location, which may be a node, a rack, or *. When an application wishes to 
 request a task run in multiple locations, it must issue a request for each 
 location.  This means that for a node-local task, it must issue three 
 requests, one at the node-level, one at the rack-level, and one with * (any). 
 These requests are not linked with each other, so when a container is 
 allocated for one of them, the RM has no way of knowing which others to get 
 rid of. When a node-local container is allocated, this is handled by 
 decrementing the number of requests on that node's rack and in *. But when 
 the scheduler allocates a task with a node-local request on its rack, the 
 request on the node is left there.  This can cause delay-scheduling to try to 
 assign a container on a node that nobody cares about anymore.
 Additionally, unless I am missing something, the current model does not allow 
 requests for containers only on a specific node or specific rack. While this 
 is not a use case for MapReduce currently, it is conceivable that it might be 
 something useful to support in the future, for example to schedule 
 long-running services that persist state in a particular location, or for 
 applications that generally care less about latency than data-locality.
 Lastly, the ability to understand which requests are for the same task will 
 possibly allow future schedulers to make more intelligent scheduling 
 decisions, as well as permit a more exact understanding of request load.
 I would propose the tweak of allowing a single ResourceRequest to encapsulate 
 all the location information for a task.  So instead of just a single 
 location, a 

[jira] [Created] (YARN-375) FIFO scheduler may crash due to buggy app

2013-02-04 Thread Eli Collins (JIRA)
Eli Collins created YARN-375:


 Summary: FIFO scheduler may crash due to buggy app  
 Key: YARN-375
 URL: https://issues.apache.org/jira/browse/YARN-375
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.0-alpha
Reporter: Eli Collins
Priority: Critical


The following code should check for a 0 return value rather than crash!

{code}
int availableContainers = 
    node.getAvailableResource().getMemory() / capability.getMemory();
    // TODO: A buggy application with this zero would crash the scheduler.
{code}
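
A minimal sketch of the kind of guard implied (illustrative only, not a patch; 
the helper method below is hypothetical):

{code}
// Hypothetical helper, not the actual FifoScheduler code: treat a zero (or
// negative) memory request as unsatisfiable instead of dividing by zero.
class AvailableContainersSketch {
  static int availableContainers(int nodeAvailableMemory, int requestedMemory) {
    if (requestedMemory <= 0) {
      return 0;
    }
    return nodeAvailableMemory / requestedMemory;
  }

  public static void main(String[] args) {
    System.out.println(availableContainers(8192, 1024)); // 8
    System.out.println(availableContainers(8192, 0));    // 0, no crash
  }
}
{code}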

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-370) CapacityScheduler app submission fails when min alloc size not multiple of AM size

2013-02-04 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570514#comment-13570514
 ] 

Thomas Graves commented on YARN-370:


Sorry, the only other thing I can think of that would matter is having security 
on.  I had security on, and the code that throws the exception is looking at the 
Token, so if you don't have security on you probably won't see it.

Other than that, it was running any simple job - sleep, wordcount.  

 CapacityScheduler app submission fails when min alloc size not multiple of AM 
 size
 --

 Key: YARN-370
 URL: https://issues.apache.org/jira/browse/YARN-370
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.0.3-alpha
Reporter: Thomas Graves
Assignee: Zhijie Shen
Priority: Blocker

 I was running 2.0.3-SNAPSHOT with the capacity scheduler configured with 
 minimum allocation size 1G. The AM size was set to 1.5G. I didn't specify 
 resource calculator so it was using DefaultResourceCalculator.  The AM launch 
 failed with the error below:
 Application application_1359688216672_0001 failed 1 times due to Error 
 launching appattempt_1359688216672_0001_01. Got exception: RemoteTrace: 
 at LocalTrace: 
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: 
 RemoteTrace: at LocalTrace: 
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: 
 Unauthorized request to start container. Expected resource memory:2048, 
 vCores:1 but found memory:1536, vCores:1 at 
 org.apache.hadoop.yarn.factories.impl.pb.YarnRemoteExceptionFactoryPBImpl.createYarnRemoteException(YarnRemoteExceptionFactoryPBImpl.java:39)
  at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:47) at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.authorizeRequest(ContainerManagerImpl.java:383)
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainer(ContainerManagerImpl.java:400)
  at 
 org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagerPBServiceImpl.startContainer(ContainerManagerPBServiceImpl.java:68)
  at 
 org.apache.hadoop.yarn.proto.ContainerManager$ContainerManagerService$2.callBlockingMethod(ContainerManager.java:83)
  at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014) at 
 org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1735) at 
 org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1731) at 
 java.security.AccessController.doPrivileged(Native Method) at 
 javax.security.auth.Subject.doAs(Subject.java:415) at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1441)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1729) at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at 
 org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
  at 
 org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
  at 
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:123)
  at 
 org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagerPBClientImpl.startContainer(ContainerManagerPBClientImpl.java:109)
  at 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:111)
  at 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:255)
  at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:722) . Failing the application. 
 It looks like the launch context for the app didn't have the resources rounded 
 up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-375) FIFO scheduler may crash due to buggy app

2013-02-04 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570545#comment-13570545
 ] 

Arun C Murthy commented on YARN-375:


I believe this doesn't happen at all since there is a check upfront, but I'll 
double-check to make sure.

 FIFO scheduler may crash due to buggy app  
 --

 Key: YARN-375
 URL: https://issues.apache.org/jira/browse/YARN-375
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.0-alpha
Reporter: Eli Collins
Assignee: Arun C Murthy
Priority: Critical

 The following code should check for a 0 return value rather than crash!
 {code}
 int availableContainers = 
     node.getAvailableResource().getMemory() / capability.getMemory();
     // TODO: A buggy application with this zero would crash the scheduler.
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-376) Apps that have completed can appear as RUNNING on the NM UI

2013-02-04 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-376:
---

 Summary: Apps that have completed can appear as RUNNING on the NM 
UI
 Key: YARN-376
 URL: https://issues.apache.org/jira/browse/YARN-376
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.3-alpha, 0.23.6
Reporter: Jason Lowe


On a busy cluster we've noticed a growing number of applications appear as 
RUNNING on a nodemanager's web page even though the applications have long since 
finished.  Looking at the NM logs, it appears the RM never told the nodemanager 
that the application had finished.  This is also reflected in a jstack of the 
NM process, since many more log aggregation threads are running than one would 
expect from the number of actively running applications.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-376) Apps that have completed can appear as RUNNING on the NM UI

2013-02-04 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570568#comment-13570568
 ] 

Jason Lowe commented on YARN-376:
-

There appears to be a race condition in the RM's handling of finished 
applications that may explain this.  ResourceTrackerService is sending the list 
of finished applications to the node when the node heartbeats and then 
subsequently sending a status update event to the RMNodeImpl that corresponds 
to the node.  The RMNodeImpl clears the entire list of finished applications 
once it has processed the status update.  If an application completes *after* 
the ResourceTrackerService has asynchronously retrieved the list of finished 
applications but *before* the status update event is posted to the RMNodeImpl 
then the application will be added to, and then cleared from, the list of finished 
applications before the ResourceTrackerService has had a chance to notify the node 
of the completed application.
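
One possible direction (a sketch of the idea only, not a patch) would be for the 
status-update path to remove only the applications that were actually reported in 
that heartbeat, rather than clearing the whole list:

{code}
// Illustrative sketch only -- not the actual RMNodeImpl/ResourceTrackerService code.
import java.util.ArrayList;
import java.util.List;

class FinishedAppsSketch {
  private final List<String> finishedApps = new ArrayList<String>();

  // Heartbeat path: snapshot what will be reported to the node.
  synchronized List<String> snapshotForHeartbeat() {
    return new ArrayList<String>(finishedApps);
  }

  // Status-update path: remove only what was actually reported, so an app that
  // finished after the snapshot was taken is not silently dropped.
  synchronized void acknowledge(List<String> reportedApps) {
    finishedApps.removeAll(reportedApps);
  }

  synchronized void applicationFinished(String appId) {
    finishedApps.add(appId);
  }
}
{code}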

 Apps that have completed can appear as RUNNING on the NM UI
 ---

 Key: YARN-376
 URL: https://issues.apache.org/jira/browse/YARN-376
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.3-alpha, 0.23.6
Reporter: Jason Lowe

 On a busy cluster we've noticed a growing number of applications appear as 
 RUNNING on a nodemanager's web page even though the applications have long since 
 finished.  Looking at the NM logs, it appears the RM never told the 
 nodemanager that the application had finished.  This is also reflected in a 
 jstack of the NM process, since many more log aggregation threads are running 
 than one would expect from the number of actively running applications.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-5) Add support for FifoScheduler to schedule CPU along with memory.

2013-02-04 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated YARN-5:
---

Issue Type: New Feature  (was: Sub-task)
Parent: (was: YARN-2)

 Add support for FifoScheduler to schedule CPU along with memory.
 

 Key: YARN-5
 URL: https://issues.apache.org/jira/browse/YARN-5
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Arun C Murthy
Assignee: Arun C Murthy



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-371) Resource-centric compression in AM-RM protocol limits scheduling

2013-02-04 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570608#comment-13570608
 ] 

Sandy Ryza commented on YARN-371:
-

Bobby,

I believe that a task-centric request format is necessary for the cases you 
mention, but not entirely sufficient for all of them.  All would likely require 
significant modifications to the scheduler.

For the HBase case, each task request could simply be at a single node (no 
racks or *).  

For the case of applications that want containers located near each other, but 
don't care where, tasks could include a special location value that means "try 
to put me near other tasks that share this value."

I believe gang-scheduling would require a task-centric protocol as well, but 
would either require a flag that says the entire heartbeat should be 
gang-scheduled or a grouping of requests within a heartbeat into gangs.
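
As a rough illustration of those three cases under a hypothetical task-centric 
format (none of these fields exist in YARN today):

{code}
// Hypothetical task-centric request shapes -- not an actual YARN API.
import java.util.Arrays;
import java.util.List;

class TaskRequestSketch {
  List<String> locations;  // nodes, racks, and/or "*" the task would accept
  String affinityGroup;    // "put me near other tasks that share this value"
  String gangId;           // group that must be scheduled all-or-nothing

  // HBase-style: a single node, no rack and no "*".
  static TaskRequestSketch nodeOnly(String node) {
    TaskRequestSketch r = new TaskRequestSketch();
    r.locations = Arrays.asList(node);
    return r;
  }
}
{code}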

 Resource-centric compression in AM-RM protocol limits scheduling
 

 Key: YARN-371
 URL: https://issues.apache.org/jira/browse/YARN-371
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api, resourcemanager, scheduler
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Each AMRM heartbeat consists of a list of resource requests. Currently, each 
 resource request consists of a container count, a resource vector, and a 
 location, which may be a node, a rack, or *. When an application wishes to 
 request a task run in multiple locations, it must issue a request for each 
 location.  This means that for a node-local task, it must issue three 
 requests, one at the node-level, one at the rack-level, and one with * (any). 
 These requests are not linked with each other, so when a container is 
 allocated for one of them, the RM has no way of knowing which others to get 
 rid of. When a node-local container is allocated, this is handled by 
 decrementing the number of requests on that node's rack and in *. But when 
 the scheduler allocates a task with a node-local request on its rack, the 
 request on the node is left there.  This can cause delay-scheduling to try to 
 assign a container on a node that nobody cares about anymore.
 Additionally, unless I am missing something, the current model does not allow 
 requests for containers only on a specific node or specific rack. While this 
 is not a use case for MapReduce currently, it is conceivable that it might be 
 something useful to support in the future, for example to schedule 
 long-running services that persist state in a particular location, or for 
 applications that generally care less about latency than data-locality.
 Lastly, the ability to understand which requests are for the same task will 
 possibly allow future schedulers to make more intelligent scheduling 
 decisions, as well as permit a more exact understanding of request load.
 I would propose the tweak of allowing a single ResourceRequest to encapsulate 
 all the location information for a task.  So instead of just a single 
 location, a ResourceRequest would contain an array of locations, including 
 nodes that it would be happy with, racks that it would be happy with, and 
 possibly *.  Side effects of this change would be a reduction in the amount 
 of data that needs to be transferred in a heartbeat, as well as in the RM's 
 memory footprint, because what used to be different requests for the same 
 task are now able to share some common data.
 While this change breaks compatibility, if it is going to happen, it makes 
 sense to do it now, before YARN becomes beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-371) Resource-centric compression in AM-RM protocol limits scheduling

2013-02-04 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570619#comment-13570619
 ] 

Robert Joseph Evans commented on YARN-371:
--

I didn't really expect them to be trivial :). So I think that there may be some 
value in having a different protocol, but we need some hard numbers to be able 
to really make an informed decision.

I would like to see the size of a request in the following table (both the 
in-memory size on the RM and the size sent over the wire):

||nodes(down)/tasks(across)||1,000||10,000||100,000||500,000||
||100|?|?|?|?|
||1,000|?|?|?|?|
||4,000|?|?|?|?|
||10,000|?|?|?|?| 

It would also be great to see in practice how bad the scheduling problem is 
when the wrong node is sent.

 Resource-centric compression in AM-RM protocol limits scheduling
 

 Key: YARN-371
 URL: https://issues.apache.org/jira/browse/YARN-371
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api, resourcemanager, scheduler
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Each AMRM heartbeat consists of a list of resource requests. Currently, each 
 resource request consists of a container count, a resource vector, and a 
 location, which may be a node, a rack, or *. When an application wishes to 
 request a task run in multiple locations, it must issue a request for each 
 location.  This means that for a node-local task, it must issue three 
 requests, one at the node-level, one at the rack-level, and one with * (any). 
 These requests are not linked with each other, so when a container is 
 allocated for one of them, the RM has no way of knowing which others to get 
 rid of. When a node-local container is allocated, this is handled by 
 decrementing the number of requests on that node's rack and in *. But when 
 the scheduler allocates a task with a node-local request on its rack, the 
 request on the node is left there.  This can cause delay-scheduling to try to 
 assign a container on a node that nobody cares about anymore.
 Additionally, unless I am missing something, the current model does not allow 
 requests for containers only on a specific node or specific rack. While this 
 is not a use case for MapReduce currently, it is conceivable that it might be 
 something useful to support in the future, for example to schedule 
 long-running services that persist state in a particular location, or for 
 applications that generally care less about latency than data-locality.
 Lastly, the ability to understand which requests are for the same task will 
 possibly allow future schedulers to make more intelligent scheduling 
 decisions, as well as permit a more exact understanding of request load.
 I would propose the tweak of allowing a single ResourceRequest to encapsulate 
 all the location information for a task.  So instead of just a single 
 location, a ResourceRequest would contain an array of locations, including 
 nodes that it would be happy with, racks that it would be happy with, and 
 possibly *.  Side effects of this change would be a reduction in the amount 
 of data that needs to be transferred in a heartbeat, as well as in the RM's 
 memory footprint, because what used to be different requests for the same 
 task are now able to share some common data.
 While this change breaks compatibility, if it is going to happen, it makes 
 sense to do it now, before YARN becomes beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-371) Resource-centric compression in AM-RM protocol limits scheduling

2013-02-04 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570633#comment-13570633
 ] 

Arun C Murthy commented on YARN-371:


Jira isn't a particularly good medium to have a discussion of this kind - we 
should move this to yarn-dev@.



I'm very wary of supporting multiple protocols (task-centric v/s 
resource-centric) to the point of being paranoid about it. Supporting multiple 
protocols or APIs is very expensive - look at the hardships we have had with 
mapred v/s mapreduce apis.

The *task-centric* protocol in the current JobTracker is something we *know* 
doesn't work well at scale (cluster sizes, number of concurrent applications 
etc.); we need to remember that - I have lots of scars I don't want to re-open.

Instead, we should focus on specific use-cases and debate how we can fix them 
in the context of a protocol which we know scales well as it stands. 

 Resource-centric compression in AM-RM protocol limits scheduling
 

 Key: YARN-371
 URL: https://issues.apache.org/jira/browse/YARN-371
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api, resourcemanager, scheduler
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 Each AMRM heartbeat consists of a list of resource requests. Currently, each 
 resource request consists of a container count, a resource vector, and a 
 location, which may be a node, a rack, or *. When an application wishes to 
 request a task run in multiple locations, it must issue a request for each 
 location.  This means that for a node-local task, it must issue three 
 requests, one at the node-level, one at the rack-level, and one with * (any). 
 These requests are not linked with each other, so when a container is 
 allocated for one of them, the RM has no way of knowing which others to get 
 rid of. When a node-local container is allocated, this is handled by 
 decrementing the number of requests on that node's rack and in *. But when 
 the scheduler allocates a task with a node-local request on its rack, the 
 request on the node is left there.  This can cause delay-scheduling to try to 
 assign a container on a node that nobody cares about anymore.
 Additionally, unless I am missing something, the current model does not allow 
 requests for containers only on a specific node or specific rack. While this 
 is not a use case for MapReduce currently, it is conceivable that it might be 
 something useful to support in the future, for example to schedule 
 long-running services that persist state in a particular location, or for 
 applications that generally care less about latency than data-locality.
 Lastly, the ability to understand which requests are for the same task will 
 possibly allow future schedulers to make more intelligent scheduling 
 decisions, as well as permit a more exact understanding of request load.
 I would propose the tweak of allowing a single ResourceRequest to encapsulate 
 all the location information for a task.  So instead of just a single 
 location, a ResourceRequest would contain an array of locations, including 
 nodes that it would be happy with, racks that it would be happy with, and 
 possibly *.  Side effects of this change would be a reduction in the amount 
 of data that needs to be transferred in a heartbeat, as well as in the RM's 
 memory footprint, because what used to be different requests for the same 
 task are now able to share some common data.
 While this change breaks compatibility, if it is going to happen, it makes 
 sense to do it now, before YARN becomes beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-377) Fix test failure for HADOOP-9252

2013-02-04 Thread Tsz Wo (Nicholas), SZE (JIRA)
Tsz Wo (Nicholas), SZE created YARN-377:
---

 Summary: Fix test failure for HADOOP-9252
 Key: YARN-377
 URL: https://issues.apache.org/jira/browse/YARN-377
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Tsz Wo (Nicholas), SZE
Priority: Minor


HADOOP-9252 slightly changes the format of some StringUtils outputs.  It may 
cause test failures.

Also, some methods was deprecated by HADOOP-9252.  The use of them should be 
replaced with the new methods.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-377) Fix test failure for HADOOP-9252

2013-02-04 Thread Tsz Wo (Nicholas), SZE (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo (Nicholas), SZE updated YARN-377:


Description: 
HADOOP-9252 slightly changes the format of some StringUtils outputs.  It may 
cause test failures.

Also, some methods were deprecated by HADOOP-9252.  The use of them should be 
replaced with the new methods.

  was:
HADOOP-9252 slightly changes the format of some StringUtils outputs.  It may 
cause test failures.

Also, some methods was deprecated by HADOOP-9252.  The use of them should be 
replaced with the new methods.


 Fix test failure for HADOOP-9252
 

 Key: YARN-377
 URL: https://issues.apache.org/jira/browse/YARN-377
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Tsz Wo (Nicholas), SZE
Priority: Minor

 HADOOP-9252 slightly changes the format of some StringUtils outputs.  It may 
 cause test failures.
 Also, some methods were deprecated by HADOOP-9252.  The use of them should be 
 replaced with the new methods.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (YARN-377) Fix test failure for HADOOP-9252

2013-02-04 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth reassigned YARN-377:
--

Assignee: Chris Nauroth

 Fix test failure for HADOOP-9252
 

 Key: YARN-377
 URL: https://issues.apache.org/jira/browse/YARN-377
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Tsz Wo (Nicholas), SZE
Assignee: Chris Nauroth
Priority: Minor

 HADOOP-9252 slightly changes the format of some StringUtils outputs.  It may 
 cause test failures.
 Also, some methods were deprecated by HADOOP-9252.  The use of them should be 
 replaced with the new methods.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-360) Allow apps to concurrently register tokens for renewal

2013-02-04 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570746#comment-13570746
 ] 

Siddharth Seth commented on YARN-360:
-

+1. Committing this. Nice unit test btw. The test failure isn't related; passes 
locally on multiple runs.

 Allow apps to concurrently register tokens for renewal
 --

 Key: YARN-360
 URL: https://issues.apache.org/jira/browse/YARN-360
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 0.23.3, 3.0.0, 2.0.0-alpha
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: YARN-357.patch, YARN-360.patch


 {{DelegationTokenRenewer#addApplication}} has an unnecessary {{synchronized}} 
 keyword.  This serializes job submissions and can add unnecessary latency 
 and/or hang all submissions if there are problems renewing the token.
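
 For context, a minimal sketch of the kind of change implied (illustrative only, 
 not the actual committed patch):
 {code}
 // Illustrative only -- not the actual DelegationTokenRenewer.
 import java.util.Collections;
 import java.util.Set;
 import java.util.concurrent.ConcurrentHashMap;

 class TokenRenewalRegistry {
   // A concurrent collection instead of a synchronized addApplication() method,
   // so one slow or hung token renewal cannot serialize or block every other
   // job submission.
   private final Set<String> pendingRenewals =
       Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

   void addApplication(String appId) {
     pendingRenewals.add(appId);
     // ... kick off renewal asynchronously ...
   }
 }
 {code}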

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-370) CapacityScheduler app submission fails when min alloc size not multiple of AM size

2013-02-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570749#comment-13570749
 ] 

Vinod Kumar Vavilapalli commented on YARN-370:
--

That's correct, the validation today is only done in secure mode. Extending it 
to non-secure mode is pending - MAPREDUCE-2744.

Also, what about the scheduler? The scheduler in use doesn't seem to be 
normalizing requests correctly.

 CapacityScheduler app submission fails when min alloc size not multiple of AM 
 size
 --

 Key: YARN-370
 URL: https://issues.apache.org/jira/browse/YARN-370
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.0.3-alpha
Reporter: Thomas Graves
Assignee: Zhijie Shen
Priority: Blocker

 I was running 2.0.3-SNAPSHOT with the capacity scheduler configured with 
 minimum allocation size 1G. The AM size was set to 1.5G. I didn't specify 
 resource calculator so it was using DefaultResourceCalculator.  The AM launch 
 failed with the error below:
 Application application_1359688216672_0001 failed 1 times due to Error 
 launching appattempt_1359688216672_0001_01. Got exception: RemoteTrace: 
 at LocalTrace: 
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: 
 RemoteTrace: at LocalTrace: 
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: 
 Unauthorized request to start container. Expected resource memory:2048, 
 vCores:1 but found memory:1536, vCores:1 at 
 org.apache.hadoop.yarn.factories.impl.pb.YarnRemoteExceptionFactoryPBImpl.createYarnRemoteException(YarnRemoteExceptionFactoryPBImpl.java:39)
  at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:47) at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.authorizeRequest(ContainerManagerImpl.java:383)
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainer(ContainerManagerImpl.java:400)
  at 
 org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagerPBServiceImpl.startContainer(ContainerManagerPBServiceImpl.java:68)
  at 
 org.apache.hadoop.yarn.proto.ContainerManager$ContainerManagerService$2.callBlockingMethod(ContainerManager.java:83)
  at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014) at 
 org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1735) at 
 org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1731) at 
 java.security.AccessController.doPrivileged(Native Method) at 
 javax.security.auth.Subject.doAs(Subject.java:415) at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1441)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1729) at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at 
 org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
  at 
 org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
  at 
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:123)
  at 
 org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagerPBClientImpl.startContainer(ContainerManagerPBClientImpl.java:109)
  at 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:111)
  at 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:255)
  at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:722) . Failing the application. 
 It looks like the launch context for the app didn't have the resources rounded 
 up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-373) Allow an AM to reuse the resources allocated to container for a new container

2013-02-04 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570782#comment-13570782
 ] 

Alejandro Abdelnur commented on YARN-373:
-

Hitesh,

Didn't dive into the whole approach yet; I first wanted to 'socialize' the idea. 
Now let me answer with my current thoughts.

For this use case I was not thinking about resizing 'in-flight' containers; 
while we could resize easily on CPU, for memory it would be quite difficult. 

The use case is about shortcutting getting resources for a container by reusing 
the same (or fewer) resources being freed up by a terminating container on the 
same node. By doing this you don't have to go all the way to the scheduler 
and compete/wait for those resources to become available. In short, it is 
recycling resources the AM already got.

The terminating container would still exit, not changing the notion of 
completion of a container. The container using the recycled resources would be 
a fresh new container process. (Otherwise we could not shrink in memory).

Regarding localized resources, a new resource localization would be done.

 Allow an AM to reuse the resources allocated to container for a new container
 -

 Key: YARN-373
 URL: https://issues.apache.org/jira/browse/YARN-373
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur

 When a container completes, instead of the corresponding resources being freed 
 up, it should be possible for the AM to reuse the assigned resources for a 
 new container.
 As part of the reallocation, the AM would notify the RM about partial 
 resources being freed up and the RM would make the necessary corrections in 
 the corresponding node.
 With this functionality, an AM can ensure it gets a container in the same 
 node where previous containers ran.
 This will allow getting rid of the ShuffleHandler as a service in the NMs and 
 running it as a regular container task of the corresponding AM. In this case, the 
 reallocation would reduce the CPU/MEM obtained for the original container to 
 what is needed for serving the shuffle. Note that in this example the MR 
 AM would only do this reallocation for one of the many tasks that may have 
 run in a particular node (as a single shuffle task could serve all the map 
 outputs from all map tasks run in that node). 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-360) Allow apps to concurrently register tokens for renewal

2013-02-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570797#comment-13570797
 ] 

Hudson commented on YARN-360:
-

Integrated in Hadoop-trunk-Commit #3320 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/3320/])
YARN-360. Allow apps to concurrently register tokens for renewal. 
Contributed by Daryn Sharp. (Revision 1442441)

 Result = SUCCESS
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1442441
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java


 Allow apps to concurrently register tokens for renewal
 --

 Key: YARN-360
 URL: https://issues.apache.org/jira/browse/YARN-360
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 0.23.3, 3.0.0, 2.0.0-alpha
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: YARN-357.patch, YARN-360.patch


 {{DelegationTokenRenewer#addApplication}} has an unnecessary {{synchronized}} 
 keyword.  This serializes job submissions and can add unnecessary latency 
 and/or hang all submissions if there are problems renewing the token.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-366) Add a tracing async dispatcher to simplify debugging

2013-02-04 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-366:


Attachment: YARN-366.patch

 Add a tracing async dispatcher to simplify debugging
 

 Key: YARN-366
 URL: https://issues.apache.org/jira/browse/YARN-366
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-366.patch


 Exceptions thrown in YARN/MR code with asynchronous event handling do not 
 contain informative stack traces, as all handle() methods sit directly under 
 the dispatcher thread's loop.
 This makes errors very difficult to debug for those who are not intimately 
 familiar with the code, as it is difficult to see which chain of events 
 caused a particular outcome.
 I propose adding an AsyncDispatcher that instruments events with tracing 
 information.  Whenever an event is dispatched during the handling of another 
 event, the dispatcher would annotate that event with a pointer to its parent. 
  When the dispatcher catches an exception, it could reconstruct a stack 
 trace of the chain of events that led to it, and be able to log something 
 informative.
 This would be an experimental feature, off by default, unless extensive 
 testing showed that it did not have a significant performance impact.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-366) Add a tracing async dispatcher to simplify debugging

2013-02-04 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570856#comment-13570856
 ] 

Sandy Ryza commented on YARN-366:
-

I've attached an initial patch that adds TracingAsyncDispatcher.  The basic 
idea is that it maintains a thread-local reference to the event currently 
being handled, so that any other events that are fired off while it is being 
handled get tagged with it as a parent.  I've added an EventTrace field to the 
Event class that maintains trace and parentage information - if we want this 
feature not to affect existing code at all, we could instead maintain a 
mapping inside the dispatcher.

I still need to add in configuration hooks to turn it on.
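
Roughly, the parent-tracking mechanism looks like this (a sketch of the idea only, 
not the attached patch):

{code}
// Sketch of the parent-tracking idea -- not the attached patch.
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

class TracingDispatcherSketch {
  // Event currently being handled on the dispatcher thread.
  private final ThreadLocal<Object> currentEvent = new ThreadLocal<Object>();
  private final Map<Object, Object> parentOf = new ConcurrentHashMap<Object, Object>();
  private final BlockingQueue<Object> queue = new LinkedBlockingQueue<Object>();

  // Called from handlers: tag the new event with the event being handled.
  void fire(Object event) {
    Object parent = currentEvent.get();
    if (parent != null) {
      parentOf.put(event, parent);
    }
    queue.add(event);
  }

  // Dispatcher loop: on an exception, walk the parent chain to reconstruct
  // the chain of events that led to it.
  void dispatchLoop() throws InterruptedException {
    while (true) {
      Object event = queue.take();
      currentEvent.set(event);
      try {
        handle(event);
      } catch (RuntimeException e) {
        for (Object p = event; p != null; p = parentOf.get(p)) {
          System.err.println("  via event: " + p);
        }
      } finally {
        currentEvent.remove();
      }
    }
  }

  void handle(Object event) { /* deliver to the registered handler */ }
}
{code}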

 Add a tracing async dispatcher to simplify debugging
 

 Key: YARN-366
 URL: https://issues.apache.org/jira/browse/YARN-366
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-366.patch


 Exceptions thrown in YARN/MR code with asynchronous event handling do not 
 contain informative stack traces, as all handle() methods sit directly under 
 the dispatcher thread's loop.
 This makes errors very difficult to debug for those who are not intimately 
 familiar with the code, as it is difficult to see which chain of events 
 caused a particular outcome.
 I propose adding an AsyncDispatcher that instruments events with tracing 
 information.  Whenever an event is dispatched during the handling of another 
 event, the dispatcher would annotate that event with a pointer to its parent. 
  When the dispatcher catches an exception, it could reconstruct a stack 
 trace of the chain of events that led to it, and be able to log something 
 informative.
 This would be an experimental feature, off by default, unless extensive 
 testing showed that it did not have a significant performance impact.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-366) Add a tracing async dispatcher to simplify debugging

2013-02-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570865#comment-13570865
 ] 

Hadoop QA commented on YARN-366:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12567927/YARN-366.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/379//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/379//console

This message is automatically generated.

 Add a tracing async dispatcher to simplify debugging
 

 Key: YARN-366
 URL: https://issues.apache.org/jira/browse/YARN-366
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-366.patch


 Exceptions thrown in YARN/MR code with asynchronous event handling do not 
 contain informative stack traces, as all handle() methods sit directly under 
 the dispatcher thread's loop.
 This makes errors very difficult to debug for those who are not intimately 
 familiar with the code, as it is difficult to see which chain of events 
 caused a particular outcome.
 I propose adding an AsyncDispatcher that instruments events with tracing 
 information.  Whenever an event is dispatched during the handling of another 
 event, the dispatcher would annotate that event with a pointer to its parent. 
  When the dispatcher catches an exception, it could reconstruct a stack 
 trace of the chain of events that led to it, and be able to log something 
 informative.
 This would be an experimental feature, off by default, unless extensive 
 testing showed that it did not have a significant performance impact.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira