Author: Armin Rigo <[email protected]>
Branch: stmgc-c7
Changeset: r71335:b0b7c78b3182
Date: 2014-05-06 18:56 +0200
http://bitbucket.org/pypy/pypy/changeset/b0b7c78b3182/

Log:    Merge and start refactoring stm.rst

diff --git a/pypy/doc/stm.rst b/pypy/doc/stm.rst

Software Transactional Memory
=============================

.. contents::


This page is about ``pypy-stm``, a special in-development version of
PyPy which can run multiple independent CPU-hungry threads in the same
process in parallel.  It is a solution to what is known in the Python
world as the "global interpreter lock (GIL)" problem --- it is an
implementation of Python without the GIL.

"STM" stands for Software `Transactional Memory`_, the technique used
internally.  This page describes ``pypy-stm`` from the perspective of
a user, describes the work in progress, and finally gives references
to more implementation details.

This work was done by Remi Meier and Armin Rigo.  Thanks to all donors
for crowd-funding the work so far!  Please have a look at the `2nd
call for donation`_.

.. _`Transactional Memory`: http://en.wikipedia.org/wiki/Transactional_memory
.. _`2nd call for donation`: http://pypy.org/tmdonate2.html


Introduction
============

``pypy-stm`` is a variant of the regular PyPy interpreter.  With the
caveats_ listed below, it should in theory be no more than 20%-50%
slower than a regular PyPy, comparing the JIT version in both cases.
It is called STM for Software Transactional Memory, which is the
internal technique used (see `Reference to implementation details`_).

The benefit is that the resulting ``pypy-stm`` can execute multiple
threads of Python code in parallel.  Programs running two or more
threads in parallel should ideally run faster than on a regular PyPy
(either now, or soon, as the remaining bugs are fixed).

* ``pypy-stm`` is fully compatible with a GIL-based PyPy; you can use
  it as a drop-in replacement and multithreaded programs will run on
  multiple cores.

* ``pypy-stm`` does not impose any special API on the user, but it
  provides a new pure Python module called `transactional_memory`_
  with features to inspect the state or debug conflicts_ that prevent
  parallelization.  This module can also be imported on top of a
  non-STM PyPy or CPython.

* Building on top of the way the GIL is removed, we will talk
  about `Atomic sections, Transactions, etc.: a better way to write
  parallel programs`_.


Getting Started
===============

**pypy-stm requires 64-bit Linux for now.**

Development is done in the branch `stmgc-c7`_.
If you are only
interested in trying it out, you can download an Ubuntu binary here__
(``pypy-2.3.x-stm*.tar.bz2``, Ubuntu 12.04-14.04; these builds are in
release mode but not stripped of debug symbols).  The current version
supports four "segments", which means that it will run up to four
threads in parallel.

To build a version from sources, you first need to compile a custom
version of clang(!); we recommend downloading `llvm and clang as
described here`__, but at revision 201645 (use ``svn co -r 201645
<path>`` for all checkouts).  Then apply all the patches in `this
directory`__: they are fixes for a clang-only feature that hasn't been
used so heavily in the past (without the patches, you get clang
crashes).  Then get the branch `stmgc-c7`_ of PyPy and run::

    rpython/bin/rpython -Ojit --stm pypy/goal/targetpypystandalone.py

.. _`stmgc-c7`: https://bitbucket.org/pypy/pypy/src/stmgc-c7/
.. __: http://cobra.cs.uni-duesseldorf.de/~buildmaster/misc/
.. __: http://clang.llvm.org/get_started.html
.. __: https://bitbucket.org/pypy/stmgc/src/default/c7/llvmfix/


.. _caveats:

Current status
--------------

* So far, small examples work fine, but there are still a few bugs.
  We're busy fixing them as we find them; feel free to `report bugs`_.

* Currently limited to 1.5 GB of RAM (this is just a parameter in
  `core.h`__).  Memory overflows are not correctly handled; they
  cause segfaults.

* The JIT warm-up time has improved recently but is still bad.  In
  order to produce machine code, the JIT needs to enter a special
  single-threaded mode for now.  This means that you will get bad
  performance results if your program doesn't run for several
  seconds, where *several* can mean *many*.  When trying benchmarks,
  be sure to check that you have reached the warmed state, i.e. that
  the performance is not improving any more.  This should be clear
  from the fact that as long as it is producing more machine code,
  ``pypy-stm`` will run on a single core.

* The GC is new; although clearly inspired by PyPy's regular GC, it
  misses a number of optimizations for now.  Programs allocating
  large numbers of small objects that don't immediately die, as well
  as programs that modify large lists or dicts, suffer from these
  missing optimizations.

* The GC has no support for destructors: the ``__del__`` method is
  never called (including on file objects, which won't be closed for
  you).  This is of course temporary.  Also, weakrefs might appear to
  work a bit strangely for now (staying alive even after a
  ``gc.collect()``, or dying but then un-dying for a short time
  before dying again).

* The STM system is based on very efficient read/write barriers,
  which are mostly done (their placement could be improved a bit in
  JIT-generated machine code).  But the overall bookkeeping logic
  could see more improvements (see `Low-level statistics`_ below).

* Forking the process is slow because the complete memory needs to be
  copied manually.  A warning is printed to this effect.

* Very long-running processes (on the order of days) will eventually
  crash on an assertion error because of a non-implemented overflow
  of an internal 29-bit number.

.. _`report bugs`: https://bugs.pypy.org/
.. __: https://bitbucket.org/pypy/pypy/raw/stmgc-c7/rpython/translator/stm/src_stm/stm/core.h
.. __:



User Guide
==========


Drop-in replacement
-------------------

Multithreaded, CPU-intensive Python programs should work unchanged on
``pypy-stm``.  They will run using multiple CPU cores in parallel.

The existing semantics of the GIL (Global Interpreter Lock) are
unchanged: although running on multiple cores in parallel,
``pypy-stm`` gives the illusion that threads are run serially, with
switches only occurring between bytecodes, not in the middle of them.
Programs can rely on this: using ``shared_list.append()/pop()`` or
``shared_dict.setdefault()`` as synchronization mechanisms continues
to work as expected.
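
For instance, here is a minimal sketch of this guarantee in action;
the worker function and the item-doubling "work" are made-up
placeholders::

    import threading

    shared_list = range(1000)     # work items, consumed by all threads
    results = []

    def worker():
        while True:
            try:
                # pop() runs atomically (switches happen only between
                # bytecodes), so no two threads ever get the same item
                item = shared_list.pop()
            except IndexError:
                break             # the list is empty, we are done
            results.append(item * 2)   # append is atomic as well

    threads = [threading.Thread(target=worker) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert sorted(results) == [i * 2 for i in range(1000)]
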
This works by internally considering the points where a standard PyPy
or CPython would release the GIL, and replacing them with the
boundaries of "transactions".  Like their database equivalent,
multiple transactions can execute in parallel, but will commit in
some serial order.  They appear to behave as if they were completely
run in this serialization order.


Atomic sections
---------------

PyPy supports *atomic sections,* which are blocks of code that you
want to execute without "releasing the GIL".  *This is experimental
and may be removed in the future.*  In STM terms, this means blocks
of code that are executed while guaranteeing that the transaction is
not interrupted in the middle.

Here is a usage example::

    with __pypy__.thread.atomic:
        assert len(lst1) == 10
        x = lst1.pop(0)
        lst1.append(x)

In this (contrived) example, we are sure that the item popped off one
end of the list is appended again at the other end atomically.  It
means that another thread can run ``len(lst1)`` or ``x in lst1``
without any particular synchronization, and always see the same
results, respectively ``10`` and ``True``.  It will never see the
intermediate state where ``lst1`` only contains 9 elements.  Atomic
sections are similar to re-entrant locks (they can be nested), but
additionally they protect against the concurrent execution of *any*
code instead of just code that happens to be protected by the same
lock in other threads.

Note that the notion of atomic sections is very strong.  If you write
code like this::

    with __pypy__.thread.atomic:
        time.sleep(10)

then, if you think about it as if we had a GIL, you are executing a
10-seconds-long atomic transaction without releasing the GIL at all.
This prevents all other threads from progressing at all.  While it is
not strictly true in ``pypy-stm``, the exact rules for when other
threads can progress or not are rather complicated; you should
consider it likely that such a piece of code will eventually block
all other threads anyway.

Note that if you want to experiment with ``atomic``, you may have to
manually add a transaction break just before the atomic block.  This
is because the boundaries of the block are not guaranteed to be the
boundaries of the transaction: the latter is at least as big as the
block, but may be bigger.  Therefore, if you run a big atomic block,
it is a good idea to break the transaction just before.  This can be
done e.g. by the hack of calling ``time.sleep(0)``.  (This may be
fixed at some point.)
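
Here is a sketch of this workaround; ``update_shared_state()`` stands
for whatever your program really does, and the exact effect of
``time.sleep(0)`` on the transaction boundaries is only the
approximation described above::

    import time
    import __pypy__

    def big_atomic_block():
        time.sleep(0)                  # hack: break the transaction here
        with __pypy__.thread.atomic:   # the new transaction then starts
            update_shared_state()      # roughly at the block boundary
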
There are also issues with the interaction of locks and atomic
blocks.  This can be seen if you write to files (which have locks),
including with a ``print`` to standard output.  If one thread tries
to acquire a lock while running in an atomic block, and another
thread has already acquired the same lock, then the former may fail
with a ``thread.error``.  The reason is that "waiting" for some
condition to become true --while running in an atomic block-- does
not really make sense.  For now you can work around it by making
sure that, say, all your prints are either in an ``atomic`` block or
none of them are.  (This kind of issue is theoretically hard to
solve.)


Locks
-----

**Not Implemented Yet**

The thread module's locks have their basic semantics unchanged.
However, using them (e.g. in ``with my_lock:`` blocks) starts an
alternative running mode, called `Software lock elision`_.  This
means that PyPy will try to make sure that the transaction extends
until the point where the lock is released, and if it succeeds, then
the acquiring and releasing of the lock will be "elided".  This means
that in this case, the whole transaction will technically not cause
any write into the lock object --- it was unacquired before, and is
still unacquired after the transaction.

This is especially useful if two threads run ``with my_lock:`` blocks
with the same lock.  If they each run a transaction that is long
enough to contain the whole block, then all writes into the lock will
be elided and the two transactions will not conflict with each other.
As usual, they will be serialized in some order: one of the two will
appear to run before the other.  Simply, each of them executes an
"acquire" followed by a "release" in the same transaction.  As
explained above, the lock state goes from "unacquired" to
"unacquired" and can thus be left unchanged.

This approach can gracefully fail: unlike atomic sections, there is
no guarantee that the transaction runs until the end of the block.
If you perform any input/output while you hold the lock, the
transaction will end as usual just before the input/output operation.
If this occurs, then the lock elision mode is cancelled and the
lock's "acquired" state is really written.

Even if the lock is really acquired already, a transaction doesn't
have to wait for it to become free again.  It can enter the elision
mode anyway and tentatively execute the content of the block.  It is
only at the end, when trying to commit, that the thread will pause.
As soon as the real value stored in the lock is switched back to
"unacquired", it can then proceed and attempt to commit its
already-executed transaction (which can fail and abort and restart
from scratch, as usual).

Note that this is all *not implemented yet,* but we expect it to work
even if you acquire and release several locks.  The elision-mode
transaction will extend until the first lock you acquired is
released, or until the code performs an input/output or a wait
operation (for example, waiting for another lock that is currently
not free).  In the common case of acquiring several locks in nested
order, they will all be elided by the same transaction.
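
To illustrate, here is the kind of code that this planned elision
mode is meant to parallelize.  This is a sketch of the behavior
described above, not of something the current build implements;
``counters`` and ``f()`` are made-up names::

    import threading

    my_lock = threading.Lock()
    counters = {}

    def f(key):
        # With lock elision, each "with my_lock:" block would become
        # one transaction that never really writes to my_lock, so the
        # two threads below could run in parallel: they would only
        # conflict if they touched the same key.
        with my_lock:
            counters[key] = counters.get(key, 0) + 1

    t1 = threading.Thread(target=f, args=("a",))
    t2 = threading.Thread(target=f, args=("b",))
    t1.start(); t2.start()
    t1.join(); t2.join()
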

Atomic sections, Transactions, etc.: a better way to write parallel programs
------------------------------------------------------------------------------

(This section describes locks as we plan to implement them, but it
also works with the existing atomic sections.)

In the cases where elision works, the block of code can run in
parallel with other blocks of code *even if they are protected by the
same lock.*  You still get the illusion that the blocks are run
sequentially.  This works even for multiple threads that each run a
series of such blocks and nothing else, protected by one single
global lock.  This is basically the Python application-level
equivalent of what was done with the interpreter in ``pypy-stm``:
while you think you are writing thread-unfriendly code because of
this global lock, actually the underlying system is able to make it
run on multiple cores anyway.

...

``pypy-stm`` enables a better programming model whereby you can run
non-threaded programs on multiple cores, simply by starting multiple
threads but running each of them protected by the same lock.  (Note
that "protected by the same lock" means right now "they are all
protected by ``__pypy__.thread.atomic``", but this might change in
the future.)

This capability can be hidden in a library or in the framework you
use; the end user's code does not need to be explicitly aware of
using threads.  For a simple example of this, there is
`transaction.py`_ in ``lib_pypy``.  The idea is that you write, or
already have, some program where the function ``f(key, value)`` runs
on every item of some big dictionary, say::

    for key, value in bigdict.items():
        f(key, value)

Then you simply replace the loop with::

    for key, value in bigdict.items():
        transaction.add(f, key, value)
    transaction.run()

This code runs the various calls to ``f(key, value)`` using a thread
pool, but every single call is executed under the protection of a
unique lock.  The end result is that the behavior is exactly
equivalent --- in fact it makes little sense to do it this way on a
non-STM PyPy or on CPython.  But on ``pypy-stm``, the various locked
calls to ``f(key, value)`` can tentatively be executed in parallel,
even if the observable result is as if they were executed in some
serial order.

This approach hides the notion of threads from the end programmer,
including all the hard multithreading-related issues.  This is not
the first alternative approach to explicit threads; for example,
OpenMP_ is one.  However, it is one of the first ones which does not
require the code to be organized in a particular fashion.  Instead,
it works on any Python program which has latent, imperfect
parallelism.  Ideally, it only requires that the end programmer
identifies where this parallelism is likely to be found, and
communicates it to the system, using for example the
``transaction.add()`` scheme.
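
Putting the pieces together, here is a self-contained sketch of this
scheme; ``transaction`` is the pure Python module from ``lib_pypy``
mentioned above, while the dictionary and the doubling "work" are
made-up placeholders::

    import transaction

    bigdict = dict.fromkeys(range(10000), 1)
    results = {}

    def f(key, value):
        # each call runs as one atomic block; two calls can only
        # conflict if they touch the same objects
        results[key] = value * 2

    for key, value in bigdict.items():
        transaction.add(f, key, value)
    transaction.run()

    assert len(results) == len(bigdict)
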
.. _`transaction.py`: https://bitbucket.org/pypy/pypy/raw/stmgc-c7/lib_pypy/transaction.py
.. _OpenMP: http://en.wikipedia.org/wiki/OpenMP


API of transactional_memory
---------------------------

The new pure Python module ``transactional_memory`` runs on both
CPython and PyPy, both with and without STM.  It contains:

* ``getsegmentlimit()``: returns the number of "segments" in this
  pypy-stm.  This is the limit above which more threads will not be
  able to execute on more cores.  (Right now it is limited to 4 due
  to inter-segment overhead, but should be increased in the future.
  It should also be settable, and the default value should depend on
  the number of actual CPUs.)  If STM is not available, this returns
  1.

* ``print_abort_info(minimum_time=0.0)``: debugging help.  Each
  thread remembers the longest abort or pause it did because of
  cross-thread contention_.  This function prints it to ``stderr``
  if the time lost is greater than ``minimum_time`` seconds.  The
  record is then cleared, to make it ready for new events.  This
  function returns ``True`` if it printed a report, and ``False``
  otherwise.


API of __pypy__.thread
----------------------

The ``__pypy__.thread`` submodule is a built-in module of PyPy that
contains a few internal built-in functions used by the
``transactional_memory`` module.

* ``__pypy__.thread.signals_enabled``: a context manager that runs
  its block with signals enabled.  By default, signals are only
  enabled in the main thread; a non-main thread will not receive
  signals (this is like CPython).  Enabling signals in non-main
  threads is useful for libraries where threads are hidden and the
  end user is not expecting their code to run anywhere other than in
  the main thread.
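
As a small sketch of how the ``transactional_memory`` functions
described above fit together; the busy loop stands in for real work::

    import threading
    import transactional_memory

    # How many threads can actually run in parallel?  This is simply
    # 1 on a non-STM interpreter.
    print "segments:", transactional_memory.getsegmentlimit()

    def worker():
        total = 0
        for i in range(1000000):
            total += i          # stand-in for real work
        # Report this thread's longest contention-related abort or
        # pause, if it lost more than 0.1 seconds to it.
        transactional_memory.print_abort_info(minimum_time=0.1)

    t = threading.Thread(target=worker)
    t.start()
    t.join()
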

Conflicts
---------

Based on Software Transactional Memory, the ``pypy-stm`` solution is
prone to "conflicts".  The basic idea is that threads execute their
code speculatively, and at known points (e.g. between bytecodes) they
coordinate with each other to agree on which order their respective
actions should be "committed", i.e. become globally visible.  Each
duration of time between two commit-points is called a "transaction"
(this is related to, but slightly different from, the transactions
discussed above).

A conflict occurs when there is no consistent ordering.  The
classical example is if two threads both tried to change the value of
the same global variable.  In that case, only one of them can be
allowed to proceed, and the other one must be either paused or
aborted (restarting the transaction).
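
For instance, here is a minimal sketch of code whose transactions
conflict constantly; the shared list stands for any global state
written by every thread::

    import threading

    shared = []

    def bump():
        for i in range(100000):
            # every iteration writes to the same object from both
            # threads, so one of the two transactions must be paused
            # or aborted and re-run each time
            shared.append(i)

    t1 = threading.Thread(target=bump)
    t2 = threading.Thread(target=bump)
    t1.start(); t2.start()
    t1.join(); t2.join()
    print len(shared)   # 200000: the result is correct, but the
                        # conflicts prevent any parallel speed-up
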

Implementation
==============

.. __: Parallelization_


.. include:: _ref.txt


----------------------------------------------------------------------

::

    with lock:
        sleep(1)

option 1: ``lock.is_acquired`` is never touched, and all is done
atomically; from the ``sleep()`` it is also inevitable; however,
other transactions can commit other "with lock" blocks as long as it
goes into the past, so progress is not hindered if the other thread
never needs inevitable; drawback = no other inevitable allowed

option 2: ``lock.is_acquired = True`` is realized by the ``sleep()``
and the transaction commits; then other transactions cannot commit if
they elided an ``acquire()`` until we have a real write to
``lock.is_acquired = False`` again; in the common case we need to
make the transaction longer, to try to go until the release of the
lock


Low-level statistics
--------------------

When a non-main thread finishes, you get low-level statistics printed
to stderr, looking like this::

    thread 0x7f73377fe600:
        outside transaction       42182   0.506 s
        run current               85466   0.000 s
        run committed             34262   3.178 s
        run aborted write write    6982   0.083 s
        run aborted write read      550   0.005 s
        run aborted inevitable      388   0.010 s
        run aborted other             0   0.000 s
        wait free segment             0   0.000 s
        wait write read              78   0.027 s
        wait inevitable             887   0.490 s
        wait other                    0   0.000 s
        sync commit soon              1   0.000 s
        bookkeeping               51418   0.606 s
        minor gc                 162970   1.135 s
        major gc                      1   0.019 s
        sync pause                59173   1.738 s
        longest recorded marker           0.000826 s
        "File "x.py", line 5, in f"

On each line, the first number is a counter, and the second number
gives the associated time --- the amount of real time that the thread
was in this state.  The sum of all the times should be equal to the
total time between the thread's start and the thread's end.  The most
important points are "run committed", which gives the amount of
useful work, and "outside transaction", which should give the time
spent e.g. in library calls (right now it seems to be larger than
that; to investigate).  The various "run aborted" and "wait" entries
are time lost due to conflicts_.  Everything else is overhead of
various forms.  (Short-, medium- and long-term future work involves
reducing this overhead. :-)

The last two lines are special; they are an internal marker read by
`transactional_memory.print_longest_marker`_.

These statistics are not printed out for the main thread, for now.
