I've been working on a few enhancements to the LivySessionController as part of
my ticket NIFI-6175: Livy - Add Support for All Missing Session Startup
Features.
I've been running into a lot of... interesting? features in the code, and I was
hoping someone (Matt?) who was involved with it from the beginning could shed
some light on these questions before I change any default behavior or fix any
perceived bugs.
* If NiFi is running in a cluster, there is a race condition: you can't
really predict how many Livy sessions will end up being created.
* If the controller service was recently running and you just restart it,
things mostly work: the existing Livy sessions will be found and reused. But
even in this scenario it isn't quite right, because if I ask the controller
service for two sessions, it will take every available Livy session for the
configured "Kind" (more on this further down).
* The logic checks whether any sessions exist on controller startup (and on
each update interval). Since all instances of the controller service start up
at roughly the same time, you might end up with full duplication across the
cluster, or 50% duplication, or anything in between, depending on how quickly
the session-create requests go out.
* The controller service will "steal" any open Livy session it can see, as
long as the session's "Kind" matches the configuration. It will also overreach
in session allocation if more sessions are available than it needs.
* If there are 10 Livy sessions open, it will load all 10 as available for
use, even if I only wanted 2. If some of those sessions die off, it does not
create new ones, but it will keep using the surviving sessions as long as they
are available.
* If you have multiple Livy Controller Services, it's very hard
(impossible?) to keep their sessions separate if they run under the same
account (and maybe even if they don't; I have not spent much time testing the
separate-account option).
* The code does not lock a session or mark it as in use. It relies on the
Livy session state value of "idle" to designate a session as available. This
is another race condition: running multiple threads of ExecuteSparkInteractive,
either because you're in a cluster or because you simply configured multiple
threads, can easily hand the same Livy session to two threads instead of
routing to the expected WAIT relationship (see the sketch after this list).
* The Controller Service is unable to delete existing sessions. So even
though there is a Controller Service shutdown hook in the code, it does not
clean up its open sessions and they have to time out.
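
For that last point, at least the single-node half of the race seems fixable by
atomically claiming a session id before handing it to a processor thread (the
cross-node half would still need something more). A rough sketch of what I
mean; the class and method names are just illustrative, nothing like this
exists in the current code:

import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LivySessionClaims {

    // Session ids currently checked out by a processor thread on this node.
    private final Set<Integer> checkedOut = ConcurrentHashMap.newKeySet();

    // Return the first idle session no other thread has claimed yet,
    // or null so the caller can route the flow file to WAIT.
    public Integer claimIdleSession(List<Integer> idleSessionIds) {
        for (Integer id : idleSessionIds) {
            if (checkedOut.add(id)) {  // add() is atomic; only one thread wins a given id
                return id;
            }
        }
        return null;
    }

    // Called once ExecuteSparkInteractive is done with the session.
    public void releaseSession(int id) {
        checkedOut.remove(id);
    }
}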
I don't have resolutions for all of these. But one thing I was thinking about
doing was using the Livy Session Name parameter to tag each session when it's
created, so that it's associated with a specific Controller Service instance by
UUID (which would work across a cluster too; rough sketch below), and maybe
only manage Livy sessions from the master node? (I'm not sure how to find out
whether you're on the master, but it's a thought.)
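
Roughly what I have in mind for the name-based tagging. The payload shape and
helper names are just illustrative, and it assumes the Livy version we target
both accepts a "name" on POST /sessions and echoes it back from GET /sessions
(I still need to verify the latter):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LivySessionTagging {

    // Prefix for every session this controller service instance creates,
    // e.g. "nifi-livy-<controller-service-uuid>" (naming is just a proposal).
    static String sessionNamePrefix(String controllerServiceId) {
        return "nifi-livy-" + controllerServiceId;
    }

    // Body for POST /sessions; assumes the target Livy version accepts an
    // optional "name" field alongside "kind".
    static String createSessionPayload(String kind, String controllerServiceId, int slot) {
        return String.format("{\"kind\":\"%s\",\"name\":\"%s-%d\"}",
                kind, sessionNamePrefix(controllerServiceId), slot);
    }

    // Given the parsed GET /sessions result, keep only the sessions this
    // controller service instance tagged at creation time. Assumes the
    // session objects echo back the "name" field.
    static List<Map<String, Object>> ownedSessions(List<Map<String, Object>> allSessions,
                                                   String controllerServiceId) {
        String prefix = sessionNamePrefix(controllerServiceId);
        return allSessions.stream()
                .filter(s -> s.get("name") instanceof String
                        && ((String) s.get("name")).startsWith(prefix))
                .collect(Collectors.toList());
    }
}

The controller service would then only adopt (and clean up) sessions whose name
carries its own UUID, which should also keep multiple Livy Controller Services
from stealing each other's sessions.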
Thanks,
Peter