[
https://issues.apache.org/jira/browse/TEZ-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346585#comment-14346585
]
Siddharth Seth commented on TEZ-776:
------------------------------------
bq. The edge manager implementations are precisely that. Impls of interfaces so
that routing logic can be captured in code rather than as routing tables in
memory. And thats what all impls to date do. They have logic in code to
generate the routing data on the fly instead of encoding it in some memory
efficient manner. We are probably unnecessarily going around in circles about
this point of efficiently storing routing tables by smart edge manager plugins.
Thats already the case.
There's a difference between a routing table, and how events are routed through
this table. I don't think we're discussing the routing table at all over here.
As an example - the routing table for ScatterGather / Broadcast is encoded in
code due to symmetry. For Hive Bucket Joins - this is a combination of code and
a lookup table.
Neither of these relate to how events are handled before or after they have
been routed.
Comparisons between Option 4 and OnDemandRouting for the existing edge types -
just on how they work.
||EdgeType|Option4|ODR|Notes|
|ScatterGather|EdgePlugin stores CompositeEvents. getNextN() for each task
causes DMEvents to be created after a range list lookup.|Iterate over events in
VM - call edge API for each event.|Similar, except the EdgePlugin runs a much
leaner tight loop without any type checks etc. Maybe possible to optimize
withing the EMPlugin depending on API|
|Broadcast|EdgePlugin stores 1 DMEvent from each task. Destination index for
all is the same. EMPlugin just does a simple list lookup|ODR - routes each
event like in the ScatterGather case|(Can be multiple DMEvents per source, but
storage can be optimized for that)|
|OneToOne|EdgePlugin stores DMEvent for each src task. Looks up precisely one
event for destination tasks.|ODR routes all src events through the EdgePlugin
for each destination task, even though only one will ever match.|1000x1000
OneToOne cost should be quite a bit lower with Option 4|
The fundamental point here is that the Edge knows and recognizes the repeatable
behaviour post routing, and can optimize the storage accordingly. This has
always been my main point of contention with ODR - other than the fact that
we're trading one MXN problem for another.
bq. InputFailedEvent ...
This deals with physical indices, yes ? Doesn't that make it more difficult, if
at all possible, to deprecate past events which haven't yet been routed ?
bq. Routing events as soon as they arrive would be covered by on-demand-routing
since essentially it is on demand routing of events upon arrival.
Are we talking about one more pass over the events to generate a potential set
of target tasks which have pending events ? Not all downstream consumers will
be ready when a source event comes in. How would this work - source event comes
in, route it immediately to generate a list of tasks, or maybe lookup
destination tasks which are ready and waiting, and immediately route for those ?
bq. Running the simulation consumes 200-400% cpu but the AM is not always
bottle-necked on event routing.
Does that mean 200-400% CPU spikes ? which is to be expected, since event
routing doesn't happen throughout the span of the AM.
bq. For a job with 2 large joins taking ~1300s to complete on a 20 node cluster
the cpu times reported are 487640 (with on demand routing) and 526340 (with
existing routing).
On demand routing consumes less CPU than the current model ? That is extremely
surprising. Is this a lookup from /proc ?, and are there additional
optimizations which help with this.
bq. Another important correctness issue came up during the review of the tez
paper submitted to sigmod. User payloads capture user metadata (and potentially
data) and thus from a security correctness point of view its essential to limit
who has access to the payloads. Inputs and outputs communicate via the payloads
and need access for correctness. But nothing else should. Edge plugins are user
code. Some written by Tez developers, some by others. Using third party plugins
to route events should not mean that they have access to the user payloads of
those events. That would be a potential security hole.
I'm not sure I agree with this being a security hole. If a user is sensitive
about their meta-data, they're free to chose which plugins they run. I do agree
that EdgePlugins should not be in the business of parsing the payload - since
that will kill the AM (maybe with some super specialized exceptions where a
user explicitly makes this choice).
IAC, a model like this potentially makes Option 4 for ScatterGather even more
efficient, and converts it to a list lookup - since DMEvents don't need to be
modified or created by the plugin. Instead, it's up to the framework to create
the actual DMEvents, which can be moved over to the runtime.
> Reduce AM mem usage caused by storing TezEvents
> -----------------------------------------------
>
> Key: TEZ-776
> URL: https://issues.apache.org/jira/browse/TEZ-776
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Siddharth Seth
> Assignee: Bikas Saha
> Attachments: TEZ-776.ondemand.1.patch, TEZ-776.ondemand.2.patch,
> TEZ-776.ondemand.3.patch, TEZ-776.ondemand.patch, events-problem-solutions.txt
>
>
> This is open ended at the moment.
> A fair chunk of the AM heap is taken up by TezEvents (specifically
> DataMovementEvents - 64 bytes per event).
> Depending on the connection pattern - this puts limits on the number of tasks
> that can be processed.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)