I've been working on a few enhancements to the LivySessionController as part of 
my ticket NIFI-6175: Livy - Add Support for All Missing Session Startup 
Features.

I've been running into a lot of... interesting? features in the code, and I was 
hoping someone (Matt?) who was involved with it from the beginning could shed 
some light on these questions before I change any default behavior/fix any 
perceived bugs.


  *   If NiFi is running in a cluster, there is a race condition where you 
can't really predict how many Livy sessions will be created.
     *   If the controller service was recently running and you just restarted 
it, then everything is fine: the existing Livy sessions will be found and used. 
But even in this scenario it's not working correctly, because even if I ask the 
controller service for two sessions, it will use every available session in 
Livy of the matching "Kind" (more on this further down).
     *   The logic checks whether any sessions exist at controller startup 
(and on each update interval). Since all instances of the controller service 
start at roughly the same time, you might end up with full duplication across 
the cluster, 50% duplication, or anything in between, depending on how quickly 
the session-create requests go out.
  *   The controller service will "steal" any open Livy session that it can 
see, so long as the session's "Kind" matches the configuration. It will also 
over-reach in session allocation if more sessions are available than it needs.
     *   If there are 10 Livy sessions open, it will load all 10 as available 
for use, even if I only wanted 2. If some of those sessions die off it does not 
create new ones, but it will keep using the surviving sessions as long as they 
are available.
     *   If you have multiple Livy Controller Services, it's very hard 
(impossible?) to keep their sessions separate if they run under the same 
account (and maybe even if they don't; I haven't spent much time testing the 
separate-account option).
     *   The code does not lock a session or mark it as in use; it relies on 
the Livy session state value of "idle" to designate a session as available. 
This is another race condition: running multiple threads of 
ExecuteSparkInteractive, either because you're in a cluster or because you 
simply have multiple threads, can easily double-assign a Livy session instead 
of using the expected WAIT relationship (see the sketch after this list).
  *   The Controller Service is unable to delete existing sessions. So even 
though there is a Controller Service shutdown hook in the code, it does not 
clean up its open sessions and they have to time out (also covered in the 
sketch below).
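
To make the last two points concrete, here is a rough sketch of what I have 
in mind. Nothing like this exists in the current code; the class and method 
names are made up for the example. The only real API it leans on is Livy's 
DELETE /sessions/{id} REST endpoint and NiFi's controller-service shutdown 
hook.

    // Hypothetical sketch, not the existing LivySessionController code.
    // Illustrates (1) atomically claiming a session so two threads can't
    // both grab the same "idle" session, and (2) deleting owned sessions
    // on shutdown instead of letting them time out.
    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;

    public class ManagedLivySessionPool {

        // session id -> in-use flag; only sessions this service created
        private final Map<Integer, Boolean> ownedSessions = new ConcurrentHashMap<>();
        private final String livyUrl;

        public ManagedLivySessionPool(String livyUrl) {
            this.livyUrl = livyUrl;
        }

        void register(int sessionId) {
            ownedSessions.putIfAbsent(sessionId, Boolean.FALSE);
        }

        // Atomically claim an idle session; returns empty if none are free,
        // so the processor can route to WAIT instead of double-assigning.
        Optional<Integer> claimSession() {
            for (Integer id : ownedSessions.keySet()) {
                if (ownedSessions.replace(id, Boolean.FALSE, Boolean.TRUE)) {
                    return Optional.of(id);
                }
            }
            return Optional.empty();
        }

        void releaseSession(int sessionId) {
            ownedSessions.replace(sessionId, Boolean.TRUE, Boolean.FALSE);
        }

        // Would be called from the controller service's shutdown hook
        // (e.g. an @OnDisabled method) so sessions don't have to time out.
        void deleteOwnedSessions() {
            for (Integer id : ownedSessions.keySet()) {
                try {
                    HttpURLConnection conn = (HttpURLConnection)
                            new URL(livyUrl + "/sessions/" + id).openConnection();
                    conn.setRequestMethod("DELETE");
                    conn.getResponseCode(); // fire the request; ignore the body
                    conn.disconnect();
                } catch (IOException e) {
                    // best effort; the session will eventually time out anyway
                }
            }
            ownedSessions.clear();
        }
    }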

I don't have resolutions for all of these issues. One thing I was thinking 
about doing is using the Livy Session Name parameter to tag each session when 
it's created, so it's associated with a specific Controller Service instance 
by UUID (which would work across a cluster too), and maybe only managing Livy 
sessions from the master node? (I'm not sure how to find out whether you're on 
the master node, but it's a thought.)
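
For the tagging idea, this is roughly what I mean: put the controller 
service's identifier into the session "name" when POSTing to /sessions, and 
when polling GET /sessions only adopt sessions whose name carries that tag. 
The "nifi-livy-" prefix and the helper names below are just for illustration, 
and if Livy rejects duplicate names we would probably need a per-node suffix 
as well.

    // Illustration only; the prefix and method names are hypothetical.
    public class SessionTagging {

        private static final String TAG_PREFIX = "nifi-livy-";

        // Payload for POST /sessions with a name tied to this controller
        // service instance. A per-node suffix could be appended if Livy
        // requires unique session names.
        static String buildCreatePayload(String kind, String controllerServiceId) {
            return String.format("{\"kind\":\"%s\",\"name\":\"%s%s\"}",
                    kind, TAG_PREFIX, controllerServiceId);
        }

        // When polling GET /sessions, only adopt sessions we created
        // ourselves instead of grabbing every idle session of the same kind.
        static boolean ownsSession(String sessionName, String controllerServiceId) {
            return sessionName != null
                    && sessionName.startsWith(TAG_PREFIX + controllerServiceId);
        }
    }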

Thanks,
  Peter
