[
https://issues.apache.org/jira/browse/TEZ-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337050#comment-14337050
]
Bikas Saha commented on TEZ-776:
--------------------------------
>From the simulation, the CPU overhead for routing 10000 composite events per
>heartbeat is 2ms. With our typical 500 events at a time it would be about
>100us - which seems reasonable.
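The arithmetic in the quoted figures checks out; a quick sanity check (the
10000-event batch and 2ms cost are taken from the quote above, the rest is
derived):

```java
// Back-of-the-envelope check of the routing cost figures quoted above.
public class RoutingCost {
    // Nanoseconds of routing CPU per event, given a simulated batch.
    static double nanosPerEvent(double batchSize, double batchCostMillis) {
        return batchCostMillis * 1_000_000.0 / batchSize;
    }

    public static void main(String[] args) {
        double perEvent = nanosPerEvent(10000, 2.0);          // ~200 ns/event
        double perHeartbeatMicros = 500 * perEvent / 1000.0;  // ~100 us for 500 events
        System.out.println(perEvent + " ns/event, "
            + perHeartbeatMicros + " us per 500-event heartbeat");
    }
}
```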
The details of option 4 mentioned above sound very similar to what the patch is
doing (if option 4 is not storing the events in TaskImpl). The APIs suggested
also essentially do the same thing as the ones exemplified in the patch, unless
I am missing something. However, not forcing edge plugins to do all of this
seems like a better choice overall. Obsoletion, in theory, suffers from a race
condition between when a task pulls events and when some other task reports
them as failed. Any mitigation in the AM is probably superfluous because the
timing windows in the routing are too small for the race to become a problem
there. The problem, from what I see, is in the windowing of event fetches from
the inputs. Typically the DMEvent and its InputFailed counterpart will both be
available to the input, but across different fetch windows. So handling that in
the input might be a practical mitigation. InputFailedEvents are well known to
the framework, and obsoletion can be done by the framework itself by matching
up routing indices, but that adds overhead (wherever it is done). If
TransientDataMovementEvents are arriving and sitting around in the AM without
consumers, then they are probably not being used correctly. Consolidating is
essentially routing overhead. In general, IMO, we should try to keep event
routing plugins as simple and quick as possible so that we can route events as
soon as possible.
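One shape an input-side mitigation could take is sketched below. This is a
hypothetical illustration, not the actual Tez runtime API: the class and
method names are invented. The idea is simply that DataMovementEvents are
remembered by source identity, so an InputFailedEvent arriving in a later
fetch window cancels the pending fetch rather than letting it fail:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the Tez runtime API) of input-side obsoletion
// across fetch windows: DataMovementEvents are remembered by source
// identity so that an InputFailedEvent arriving in a later fetch window
// cancels the pending fetch instead of letting it fail at runtime.
class ObsoletionTracker {
    private final Map<String, String> pendingFetches = new HashMap<>();

    private static String key(int srcTask, int attempt) {
        return srcTask + "_" + attempt;
    }

    // A DataMovementEvent arrived: remember the fetch it describes.
    void onDataMovement(int srcTask, int attempt, String payload) {
        pendingFetches.put(key(srcTask, attempt), payload);
    }

    // An InputFailedEvent arrived, possibly in a later fetch window:
    // drop the matching pending fetch. Returns true if one was dropped.
    boolean onInputFailed(int srcTask, int attempt) {
        return pendingFetches.remove(key(srcTask, attempt)) != null;
    }

    int pendingCount() { return pendingFetches.size(); }
}
```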
Yes. The return of a single target index is less expressive, but that could be
changed, without adding much overhead, to follow the pattern of the CDMEvent.
However, I could not think of any use case for it, so the patch currently
constrains it.
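The difference in expressiveness could be illustrated as follows; the
interface names here are hypothetical, the point is only that the
CDMEvent-style shape carries a contiguous range of targets where the
constrained shape carries one index:

```java
// Hypothetical illustration of the two routing-result shapes discussed
// above: a single target index vs. a CDMEvent-style contiguous range.
interface SingleTargetRouting {
    int routeToTarget(int sourceTaskIndex); // exactly one consumer
}

interface RangeTargetRouting {
    // CDM-style: consumers [firstTarget, firstTarget + count) per event.
    int[] routeToTargets(int sourceTaskIndex);
}

// Example: a 1:1 edge is trivially expressible with the single index.
class OneToOne implements SingleTargetRouting {
    public int routeToTarget(int src) { return src; }
}
```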
Having IOs do their own routing sounds like an interesting option. That aside,
however, the docs mention who can use what, and that can be enforced by runtime
checks.
The readlock has not been replaced in the latest patch. Perhaps you mixed them
up.
It is essential for the tasks to report their own event indices to handle
failure of the AM itself, unless we want to save all these tracked indices in
the history log and block heartbeats on those saves. Of course, re-connecting
to existing tasks is not there yet, but doing this would make that quite
difficult.
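The point about AM failure can be sketched as below. This is a hypothetical
illustration (the names are invented, not the Tez heartbeat protocol): if the
task carries its own next-event position, the AM keeps no per-task delivery
state that would need to survive a restart:

```java
// Hypothetical sketch: the task, not the AM, tracks how far into the
// event stream it has read, so an AM restart loses no delivery state.
class TaskEventCursor {
    private int nextFromEventId = 0;

    // The window the task asks for in its next heartbeat:
    // {fromEventId, maxEvents}.
    int[] heartbeatRequest(int maxEvents) {
        return new int[] { nextFromEventId, maxEvents };
    }

    // After the AM replies, advance past the events actually received.
    void onEventsReceived(int numReceived) {
        nextFromEventId += numReceived;
    }

    int position() { return nextFromEventId; }
}
```

A restarted AM can then resume serving a task from whatever index the task
reports, without replaying indices from a history log.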
No, they should not exceed maxEvents, and the code handles that by stopping the
routing when that could happen. We could add a flag that tells the task to come
back immediately for more events (which would be useful independently of this
jira).
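Capping at maxEvents plus a come-back-immediately flag could look like the
following sketch; the names are hypothetical and the events are simplified to
strings, but it shows the cap being enforced by truncation rather than
overflow, with the flag derived from whatever remains:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of capping a routed window at maxEvents and
// signalling the task to heartbeat again right away if more remain.
class EventWindow {
    final List<String> events;
    final boolean moreEventsPending; // hint: come back immediately

    EventWindow(List<String> events, boolean moreEventsPending) {
        this.events = events;
        this.moreEventsPending = moreEventsPending;
    }
}

class OnDemandRouter {
    // Route at most maxEvents starting at fromEventId; stop early
    // rather than exceed the cap, and report whether anything is left.
    static EventWindow route(List<String> pending, int fromEventId,
                             int maxEvents) {
        int end = Math.min(pending.size(), fromEventId + maxEvents);
        List<String> window =
            new ArrayList<>(pending.subList(fromEventId, end));
        return new EventWindow(window, end < pending.size());
    }
}
```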
> Reduce AM mem usage caused by storing TezEvents
> -----------------------------------------------
>
> Key: TEZ-776
> URL: https://issues.apache.org/jira/browse/TEZ-776
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Siddharth Seth
> Assignee: Bikas Saha
> Attachments: TEZ-776.ondemand.1.patch, TEZ-776.ondemand.patch,
> events-problem-solutions.txt
>
>
> This is open ended at the moment.
> A fair chunk of the AM heap is taken up by TezEvents (specifically
> DataMovementEvents - 64 bytes per event).
> Depending on the connection pattern - this puts limits on the number of tasks
> that can be processed.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)