[ 
https://issues.apache.org/jira/browse/TEZ-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346585#comment-14346585
 ] 

Siddharth Seth commented on TEZ-776:
------------------------------------

bq. The edge manager implementations are precisely that. Impls of interfaces so 
that routing logic can be captured in code rather than as routing tables in 
memory. And thats what all impls to date do. They have logic in code to 
generate the routing data on the fly instead of encoding it in some memory 
efficient manner. We are probably unnecessarily going around in circles about 
this point of efficiently storing routing tables by smart edge manager plugins. 
Thats already the case.
There's a difference between a routing table, and how events are routed through 
this table. I don't think we're discussing the routing table at all over here.
As an example - the routing table for ScatterGather / Broadcast is encoded in 
code due to symmetry. For Hive Bucket Joins - this is a combination of code and 
a lookup table.
Neither of these relate to how events are handled before or after they have 
been routed.

Comparisons between Option 4 and OnDemandRouting for the existing edge types - 
just on how they work.
||EdgeType|Option4|ODR|Notes|
|ScatterGather|EdgePlugin stores CompositeEvents. getNextN() for each task 
causes DMEvents to be created after a range list lookup.|Iterate over events in 
VM - call edge API for each event.|Similar, except the EdgePlugin runs a much 
leaner tight loop without any type checks etc. Maybe possible to optimize 
withing the EMPlugin depending on API|
|Broadcast|EdgePlugin stores 1 DMEvent from each task. Destination index for 
all is the same. EMPlugin just does a simple list lookup|ODR - routes each 
event like in the ScatterGather case|(Can be multiple DMEvents per source, but 
storage can be optimized for that)|
|OneToOne|EdgePlugin stores DMEvent for each src task. Looks up precisely one 
event for destination tasks.|ODR routes all src events through the EdgePlugin 
for each destination task, even though only one will ever match.|1000x1000 
OneToOne cost should be quite a bit lower with Option 4|
The fundamental point here is that the Edge knows and recognizes the repeatable 
behaviour post routing, and can optimize the storage accordingly. This has 
always been my main point of contention with ODR - other than the fact that 
we're trading one MXN problem for another.

bq. InputFailedEvent ...
This deals with physical indices, yes ? Doesn't that make it more difficult, if 
at all possible, to deprecate past events which haven't yet been routed ?

bq. Routing events as soon as they arrive would be covered by on-demand-routing 
since essentially it is on demand routing of events upon arrival.
Are we talking about one more pass over the events to generate a potential set 
of target tasks which have pending events ? Not all downstream consumers will 
be ready when a source event comes in. How would this work - source event comes 
in, route it immediately to generate a list of tasks, or maybe lookup 
destination tasks which are ready and waiting, and immediately route for those ?

bq. Running the simulation consumes 200-400% cpu but the AM is not always 
bottle-necked on event routing.
Does that mean 200-400% CPU spikes ? which is to be expected, since event 
routing doesn't happen throughout the span of the AM.

bq. For a job with 2 large joins taking ~1300s to complete on a 20 node cluster 
the cpu times reported are 487640 (with on demand routing) and 526340 (with 
existing routing).
On demand routing consumes less CPU than the current model ? That is extremely 
surprising. Is this a lookup from /proc ?, and are there additional 
optimizations which help with this.

bq. Another important correctness issue came up during the review of the tez 
paper submitted to sigmod. User payloads capture user metadata (and potentially 
data) and thus from a security correctness point of view its essential to limit 
who has access to the payloads. Inputs and outputs communicate via the payloads 
and need access for correctness. But nothing else should. Edge plugins are user 
code. Some written by Tez developers, some by others. Using third party plugins 
to route events should not mean that they have access to the user payloads of 
those events. That would be a potential security hole.
I'm not sure I agree with this being a security hole. If a user is sensitive 
about their meta-data, they're free to chose which plugins they run. I do agree 
that EdgePlugins should not be in the business of parsing the payload - since 
that will kill the AM (maybe with some super specialized exceptions where a 
user explicitly makes this choice).
IAC, a model like this potentially makes Option 4 for ScatterGather even more 
efficient, and converts it to a list lookup - since DMEvents don't need to be 
modified or created by the plugin. Instead, it's up to the framework to create 
the actual DMEvents, which can be moved over to the runtime.

> Reduce AM mem usage caused by storing TezEvents
> -----------------------------------------------
>
>                 Key: TEZ-776
>                 URL: https://issues.apache.org/jira/browse/TEZ-776
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Siddharth Seth
>            Assignee: Bikas Saha
>         Attachments: TEZ-776.ondemand.1.patch, TEZ-776.ondemand.2.patch, 
> TEZ-776.ondemand.3.patch, TEZ-776.ondemand.patch, events-problem-solutions.txt
>
>
> This is open ended at the moment.
> A fair chunk of the AM heap is taken up by TezEvents (specifically 
> DataMovementEvents - 64 bytes per event).
> Depending on the connection pattern - this puts limits on the number of tasks 
> that can be processed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to