[Python-ideas] Improved multi-tasking performance through deterministic GIL hand-off

Sophist Mon, 10 May 2021 08:06:13 -0700

I don't know how many people will remember some work that David Beazley did 
about a decade ago on how the GIL impacts multithreading performance - 
essentially he instrumented the Python interpreter to log how multiple threads 
competed for  the GIL and gave several presentations over the space of 2-3 
years. A couple of years ago I reached out to him with an idea on how to 
significantly improve the way that Python handles multi-threading hand-off of 
the GIL, but unfortunately he was not interested in pursuing this further. I am 
raising it here in the hope that someone else would be interested in 
implementing this.


In essence my idea is to stop Python handing off the GIL through a competition 
between threads that are ready to run, and instead for Python to implement a 
scheduler for the GIL which decides which thread should get the GIL next and 
directly hands it over. 

Background
========
Here are the links to David Beazley's presentations:

2009: Inside the Python GIL                                          - 
https://www.youtube.com/watch?v=ph374fJqFPE
2010: Understanding the Python GIL                            - 
https://speakerdeck.com/dabeaz/understanding-the-python-gil 
https://www.youtube.com/watch?v=Obt-vMVdM8s
2011: Embracing the Global Interpreter Lock               - 
https://speakerdeck.com/dabeaz/embracing-the-global-interpreter-lock 
https://www.youtube.com/watch?v=fwzPF2JLoeU
2011: In Search of the Perfect Global Interpreter Lock - 
https://speakerdeck.com/dabeaz/in-search-of-the-perfect-global-interpreter-lock 
https://www.youtube.com/watch?v=5jbG7UKT1l4

The Problem
========

A currently executing thread can give up the GIL (essentially) for two reasons: 
A) It has started a blocking operation (like a reading data from disk) and is 
no longer runnable; or B) because it is using a lot of CPU and has reached a 
Python CPU timeout (nothing to do with the O/S scheduler time-slice) and is 
still runnable.

When the currently executing thread wants to give up the GIL, the current 
competitive approach essentially sends a signal to all runnable threads that 
the GIL is now available and the first thread to grab it gets to execute with 
the GIL. The problem with this (at least as I understand it) is that all 
threads except the thread that has just given up the GIL are not actually 
running on a CPU and need the O/S Scheduler to dispatch them before they can 
run, but the current thread is already running - so if the currently executing 
thread is is still runnable i.e. because it has reached the Python CPU timeout, 
then it is highly likely to be the first to grab the GIL back again. This leads 
to CPU-heavy threads dominating the CPU usage, and the I/O threads being 
starved out - and this is the opposite of what O/S Scheduling Theory says is 
optimum - which is to run the I/O threads first to keep the data flowing and 
everything responsive, typically using low levels of CPU, and let the CPU heavy
  threads mop up the remaining CPU.

But even if the currently executing CPU heavy thread doesn't grab the GIL 
again, there is no guarantee that the most appropriate of the other threads 
will grab it instead.

It should also be noted that this competitive approach has a lot of overhead - 
because every thread wakes up to try to compete for the GIL.

Proposed Solution
============

My proposed solution is for Python to implement its own (light-weight) 
scheduling system for the GIL - maintaining a list of threads that are runnable 
and whether they are I/O bound (gave up the GIL because they became 
non-runnable due to blocking I/O or waiting for a signal) or CPU heavy (gave up 
the GIL because of reaching the Python timeout). Ideally we would also add 
functionality to Python to allow you to specify a thread priority (ideally 
adjusting the O/S thread priority to match) and we would also retain this 
priority as part of the GIL scheduling.

As far as I can see this proposal is complementary to the current work on 
sub-interpreters - because there is a big base of existing code which would 
immediately benefit from these changes without any coding changes.

The solution I have in mind is this:

a. A scheduling queue is created with its own mutex. The reason for having a 
separate mutex is that threads can add themselves to the queue without needing 
to obtain the GIL.
b. The queue consists of a tree with the following branches:
    i) A list of thread priorities, decreasing order
    ii) Two subqueues: I/O bound, CPU bound
    iii) Each has a FIFO list of runnable threads 

When a thread is created, it is assumed to be I/O bound and is added to the 
back of the appropriate queue.

When the GIL is given up, if the currently executing thread is still runnable 
(reached the Python timeout) it adds itself to the back of the appropriate 
queue. It then takes the first runnable thread in the queue (highest priority, 
I/O bound if non empty, otherwise CPU bound) and signals that thread to take 
the GIL and start executing.

As a thread becomes runnable (because it gets a signal it is waiting for or 
because a blocking I/O completes), it adds itself to the back of the 
appropriate queue.

This is a deterministic solution with low overhead that should ensure that the 
appropriate thread gets the GIL. And I believe that it would be compatible with 
the intended sub-interpreter implementation.

A couple of caveats:

A. I think that there may be some other rare cases where a thread that has the 
GIL is forced to give it up - and the above algorithm might needs some tweaks 
to accommodate that.
B. Although this should be deterministic, I can imagine a possibility that the 
signalled thread does not wake up and take the GIL (due to bugs or other 
reasons not imagined) - and it might be worthwhile having some watchdog timeout 
mechanism to be started immediately before sending the GIL transfer signal and 
terminated immediately after receiving the GIL transfer signal, which will be 
triggered if for any reason the signalled thread does not wake up and take the 
GIL within a short period.

Because the code for handing off the GIL is quite localised, I believe that the 
implementation of this might be reasonably straight forward.
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/A5MX6SQUHP65JC6V5ZFCCHMMJURM4KHB/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Improved multi-tasking performance through deterministic GIL hand-off

Reply via email to