GitHub user markhamstra commented on the issue:
https://github.com/apache/spark/pull/16291
> Each listener should in theory could work independent of each other and
> we should only guarantee ordered processing of the events within a listener.

If we were starting from nothing, then yes, it would be valid and advisable
to design the Listener infrastructure using only this weaker guarantee. The
issue, though, is that we are not starting from nothing, but rather from a
system that currently offers a much stronger guarantee on the synchronized
behavior of Listeners. If it is the case that no Listeners currently rely on
the stronger guarantee and thus could work completely correctly under the
weaker guarantee of this PR, then we could make this change without much
additional concern. But reaching that level of confidence in current Listeners
is a difficult prerequisite -- strictly speaking, it's an impossible task.
We could carefully work through all the internal behavior of Spark's
Listeners to convince ourselves that they can work correctly under the new,
weaker guarantee. At a bare minimum, we need to do that much before we can
consider merging this PR -- but that's probably not enough. The problem is
that Listeners aren't just internal to Spark. Users have also developed their
own custom Listeners that either implement `SparkListenerInterface` or extend
`SparkListener` or `SparkFirehoseListener`, and we can't just assume that those
custom Listeners don't rely upon the current guarantee to either synchronize
behavior with other custom Listeners or even with Spark internal Listeners.
Since we can't know that user Listeners don't already rely upon the current,
stronger guarantee, the question now becomes whether we even have the freedom
to change that guarantee within the lifetime of Spark 2.x, or whether any such
change would have to wait for Spark 3.x.
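To make the concern concrete, here is a minimal sketch of two custom Listeners that are coupled through shared state. `SingleThreadBus` is a hypothetical stand-in that I made up for illustration, not Spark's actual bus; it only models the current behavior of delivering each event to every Listener, in registration order, on one thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for a bus that delivers each event to every listener,
// in registration order, on a single thread (the current, stronger guarantee).
class SingleThreadBus {
    private final List<Consumer<String>> listeners = new ArrayList<>();

    void addListener(Consumer<String> listener) {
        listeners.add(listener);
    }

    void post(String event) {
        for (Consumer<String> listener : listeners) {
            listener.accept(event);
        }
    }
}

class CoupledListenersDemo {
    // Returns what listener B observed for each event.
    static List<String> run() {
        Map<String, Long> startTimes = new HashMap<>(); // state shared by A and B
        List<String> seenByB = new ArrayList<>();
        SingleThreadBus bus = new SingleThreadBus();

        // Listener A records a (dummy) start time for each event.
        bus.addListener(event -> startTimes.put(event, 1L));
        // Listener B silently assumes A has already handled the same event.
        bus.addListener(event -> seenByB.add(event + ":" + startTimes.containsKey(event)));

        bus.post("jobStart-0");
        bus.post("jobStart-1");
        return seenByB;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With single-threaded, in-order delivery, B always finds A's entry. If each Listener instead drained its own queue on its own thread, B could run before A for the same event and see nothing (and the plain `HashMap` would no longer be safe to share), which is exactly the kind of hidden dependency we cannot rule out in user code.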
`SparkListener` is still annotated as `@DeveloperApi`, so if that were the
only piece in play, then we could change its guarantee fairly freely.
`SparkListenerInterface` is almost as good, since it includes the admonition in
a comment to "[n]ote that this is an internal interface which might change in
different Spark releases." The stickier issue is with `SparkFirehoseListener`,
which carries no such annotations or comments, but is just a plain public class
and API. So, after convincing ourselves that Spark's internal Listeners would
be fine with this PR, we'd still have to convince the Spark PMC that changing
the behavior of the public `SparkFirehoseListener` before Spark 3.x (with
prominent warnings in the release notes, of course) would be acceptable.
And all of the above is still really only arguing about whether we *could*
adopt this PR in essentially its present form. There is still the question of
whether we *should* do this, or whether we should instead do something a little
different or something more. I can see some merit in Marcelo's "opt in" suggestion. If
there is utility in having groups of Listeners that can rely upon synchronized
behavior, then we should probably retain one or more threads running
synchronized Listeners. For example, if Listener A relies upon synchronization
with Listeners B and C while D needs to synchronize with E, but F, G and H are
all independent, then there are a couple of things we could do. First, the
independent Listeners (F, G and H) can each run in its own thread, providing
the scalable performance that this PR is aiming for. After that, we could
either have one synchronized Listener thread for all the other Listeners, or we
could have one thread for A, B and C and one thread for D and E. Whether
we support only one synchronized Listener group/thread or multiple, we'd
still need some mechanism for Listeners to select into a synchronized group or
to indicate that they can and should be run independently on their own thread.
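As a rough sketch of what such an opt-in mechanism might look like (every name here, including `GroupedListener`, `syncGroup()` and `GroupedBus`, is hypothetical and does not exist in Spark): each Listener either names a synchronized group or declares nothing, each group shares one single-threaded executor, and each independent Listener gets an executor of its own. The sketch assumes all Listeners are registered before the first event is posted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical opt-in interface; nothing like this exists in Spark.
interface GroupedListener {
    // Empty means "run me independently on my own thread".
    Optional<String> syncGroup();
    void onEvent(String event);
}

class GroupedBus {
    // One single-threaded executor per key; each one preserves FIFO event order.
    private final Map<String, ExecutorService> executors = new LinkedHashMap<>();
    private final Map<String, List<GroupedListener>> members = new LinkedHashMap<>();
    private int soloCount = 0;

    void addListener(GroupedListener listener) {
        // Independent listeners get a private key, so they never share a thread.
        String key = listener.syncGroup()
                .map(group -> "group:" + group)
                .orElseGet(() -> "solo:" + soloCount++);
        members.computeIfAbsent(key, k -> new ArrayList<>()).add(listener);
        executors.computeIfAbsent(key, k -> Executors.newSingleThreadExecutor());
    }

    void post(String event) {
        for (Map.Entry<String, List<GroupedListener>> entry : members.entrySet()) {
            List<GroupedListener> group = entry.getValue();
            // Within a group, listeners see the event in registration order.
            executors.get(entry.getKey()).execute(() -> {
                for (GroupedListener listener : group) {
                    listener.onEvent(event);
                }
            });
        }
    }

    int threadCount() {
        return executors.size();
    }

    // Drains all queued events, then stops the worker threads.
    void stop() {
        for (ExecutorService executor : executors.values()) {
            executor.shutdown();
            try {
                executor.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}

class GroupedBusDemo {
    // group == null marks an independent listener.
    static GroupedListener listener(String name, String group, List<String> log) {
        return new GroupedListener() {
            public Optional<String> syncGroup() { return Optional.ofNullable(group); }
            public void onEvent(String event) { log.add(name + "<-" + event); }
        };
    }

    public static void main(String[] args) {
        List<String> abcLog = Collections.synchronizedList(new ArrayList<>());
        List<String> otherLog = Collections.synchronizedList(new ArrayList<>());
        GroupedBus bus = new GroupedBus();
        for (String n : new String[] {"A", "B", "C"}) bus.addListener(listener(n, "abc", abcLog));
        for (String n : new String[] {"D", "E"}) bus.addListener(listener(n, "de", otherLog));
        for (String n : new String[] {"F", "G", "H"}) bus.addListener(listener(n, null, otherLog));
        bus.post("e1");
        bus.post("e2");
        bus.stop();
        System.out.println(bus.threadCount() + " threads; abc group saw " + abcLog);
    }
}
```

In this toy setup A, B and C keep the synchronized ordering among themselves, as do D and E, while F, G and H can never block one another. Collapsing to a single synchronized group would just mean mapping every non-empty `syncGroup()` to the same key.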
@rxin