[ https://issues.apache.org/jira/browse/TEZ-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361273#comment-14361273 ]
Bikas Saha commented on TEZ-776: -------------------------------- All options have been sufficiently discussed on this thread and offline. The option of moving all of event handling to edge plugins is a much larger change that shifts a lot of framework responsibility to the user. Secondly, its not clear how future changes/features additions around dynamic graph reconfigurations like changes edges and vertices at runtime may or may not be affected by having given control of event management to user code. Things like event obsoletion which can be done easily by the framework for all edges and IO's would need to be done by every plugin. Every plugin would need to have additional metadata tracking objects which are currently provided by the framework. Each plugin would have to handle versioning of events and speculation like conditions which break the time-sequential nature of version numbers. And probably other stuff. Firstly, that is a much larger change, that is related but orthogonal to the memory issue and must be discussed separately on its own right. Secondly, while at a high level it may seem likely that in some cases edge plugins might do better at CPU, I suspect that after handling event versioning, obsoletion, etc. the argument that plugins can avoid iterating over events may turn out to be specious for CPU efficiency. My suggestion to follow up on that approach separately is based on the above arguments. It's not been effectively established that moving essential framework responsibilities to the user is the right approach long term. Neither is it clear that the CPU efficiency of the final implementation that does more than the sunny day scenario is going to be significantly better at the cost of adding complexity in user code. That can only be measured. Hence, I suggested that it be evaluated before including that change in the project. This is the case with any change or feature right? My only objection was to tie the progress on this jira by pre-accepting the other changes without going through due process. Specially when this jira does not mandate any user code changes. In the meanwhile, the current patch does not mandate any API changes for users. Unless users want to make the API change they can continue to use the existing API, even across releases. If they do want to make the change, its much simpler because it follows the existing pattern. But for users who are running large jobs and using framework built-in components, they can be unblocked on their scalability issues. Hence, my suggestion to complete the reviews of this patch and resolve it so that there is forward progress without requiring any user to make any code changes. In order to make progress, what I can try to do is limit the on demand routing to only composite event expansion and not change the flow for any other event. Add a new optional API for composite event expansion that will be implement by internal scatter-gather edge and optional so that users dont need to change their code. This will solve the memory scalability issue without increasing any CPU cost compared to any scenario as it exists today. I hope that clarifies and we can make progress on this jira. > Reduce AM mem usage caused by storing TezEvents > ----------------------------------------------- > > Key: TEZ-776 > URL: https://issues.apache.org/jira/browse/TEZ-776 > Project: Apache Tez > Issue Type: Sub-task > Reporter: Siddharth Seth > Assignee: Bikas Saha > Attachments: TEZ-776.ondemand.1.patch, TEZ-776.ondemand.2.patch, > TEZ-776.ondemand.3.patch, TEZ-776.ondemand.4.patch, TEZ-776.ondemand.5.patch, > TEZ-776.ondemand.6.patch, TEZ-776.ondemand.patch, With_Patch_AM_hotspots.png, > With_Patch_AM_profile.png, Without_patch_AM_CPU_Usage.png, > events-problem-solutions.txt, with_patch_jmc_output_of_AM.png, > without_patch_jmc_output_of_AM.png > > > This is open ended at the moment. > A fair chunk of the AM heap is taken up by TezEvents (specifically > DataMovementEvents - 64 bytes per event). > Depending on the connection pattern - this puts limits on the number of tasks > that can be processed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)