Author: Armin Rigo <[email protected]>
Branch: extradoc
Changeset: r5169:1bc0051aed74
Date: 2014-04-03 22:46 +0200
http://bitbucket.org/pypy/extradoc/changeset/1bc0051aed74/

Log:    in-progress

diff --git a/planning/tmdonate2.txt b/planning/tmdonate2.txt
--- a/planning/tmdonate2.txt
+++ b/planning/tmdonate2.txt
@@ -12,49 +12,326 @@
 Memory (TM) in PyPy, a way to run CPU-hungry Python programs in
 multithreaded mode.  It is a follow-up on our `first call`_.  Two years
 ago we suggested a single-threaded slow-down of somewhere between 2x and
-5x.  Our aim now is closer to 1.25x, i.e. running only 25% slower than
-the regular PyPy.
+5x.  The aim that now seems within reach is closer to 1.25x, i.e.
+running only 25% slower than the regular PyPy.
 
 We achieved --or overachieved-- most goals laid out in the first call by
 a large margin, while at the same time raising only about half the
-money.  The present proposal is thus about development of the second
-half: starting from the various missing low-level optimizations, it will
-most importantly focus on development of the Python-facing interface.
-This includes both internal things (e.g. do dictionaries need to be more
-TM-friendly in general?) as well as directly visible things (e.g. some
-debugger-like interface to explore common conflicts in a program).  It
-also includes exploring and tweaking some existing libraries
-(e.g. Twisted) to improve their TM-friendliness.
+money.  The result is described `in our docs`_.  The present proposal is
+about development of the second half: starting from the various missing
+low-level optimizations, it will most importantly focus on development
+of the Python-facing interface.  This includes both internal things
+(e.g. do dictionaries need to be more TM-friendly in general?) as well
+as directly visible things (e.g. some debugger-like interface to explore
+common conflicts in a program).  It also includes exploring and tweaking
+some existing libraries to improve their TM-friendliness (e.g. Twisted).
 
 See also the `update on HTM`_ below.
 
+.. _`in our docs`: https://pypy.readthedocs.org/en/latest/stm.html
+
 
 
 Introduction
 ============
 
+Today's machines come with multiple processors, but Python progress in
+using them is lagging behind: on any CPU-constrained program, developers
+face a difficult choice.  They can stay with in-process solutions that
+do not offer multi-CPU usage.  In this respect, the natural choice
+nowadays is to use Twisted or other event-based paradigms, or systems
+that hide events in the control flow, like Stackless; or alternatively,
+they can use the existing ``threading`` module, with its associated GIL
+and the complexities of real multi-threaded programming (locks,
+deadlocks, races, etc.), which make this solution less attractive.  The
+big alternative is to rely on one of the various multi-process
+solutions that are outside the scope of the core language; all of them
+are in one way or another hacks that require extra knowledge and time
+to use and that have an impact on the structure of the whole program.
 
+The aim of this series of proposals is to research and implement
+Transactional Memory in PyPy.  This is a technique that recently came to
+the forefront of the multi-core scene.  It promises to offer multi-core
+CPU usage without requiring a fall-back to the multi-process solutions
+described above, and also without using the ``threading`` module ---
+just a small, local extension of the programming language, used only in
+the core of the event loops.
+
+The first proposal was launched near the start of 2012 and has covered
+the fundamental research part, up to the point of getting a first
+version of PyPy working in a reasonable state (after collecting about
+USD$27'000, which is little more than half of the money that would have
+been required to do it more swiftly).
+
+This second proposal aims at fixing the remaining issues until we get a
+really good GIL-free PyPy (described in `goal 1`_ below); and then we
+will focus on the various new features needed to actually use multiple
+cores without explicitly using multithreading (`goal 2`_ below), up to
+and including adapting some existing framework libraries like Twisted,
+Tornado, and possibly Stackless and gevent (`goal 3`_ below).
 
 
 
 In more details
 ===============
 
+This is a call for financial help in implementing a version of PyPy able
+to use multiple processors in a single process, called PyPy-TM; in
+developing the APIs and libraries it needs; and in enhancing commonly
+available frameworks to use the new feature.  The developers will be
+Armin Rigo and Remi Meier, and possibly others.
 
-Hardware Transactional Memory
+We currently estimate the final performance goal to be a slow-down of
+25% to 50% compared with the regular PyPy in fully serial applications.
+(This goal has already been reached in some cases, but we need to make
+this result more broadly applicable.)  We feel confident that it can
+work, in the following sense: the performance of PyPy-TM running any
+suitable application should scale linearly or close-to-linearly with the
+number of processors.  This means that, starting with two cores, such
+applications should already perform better than with a regular PyPy.
+(All numbers presented here compare different versions of PyPy, all of
+which have the JIT enabled.)
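+
+To illustrate why two cores should already be enough to come out ahead,
+here is a small back-of-the-envelope sketch.  The single-core figures
+below are merely assumed examples taken from the 25%-50% range above,
+not measurements::
+
+    # Hypothetical sketch: expected speed of PyPy-TM relative to a regular
+    # PyPy, assuming perfect (linear) scaling with the number of cores.
+    def relative_speed(cores, single_core_speed):
+        # single_core_speed is PyPy-TM's one-core speed as a fraction of
+        # the regular PyPy's speed (e.g. 0.67 means "about 50% slower")
+        return cores * single_core_speed
+
+    for fraction in (0.8, 0.67):        # i.e. 25% and about 50% slower
+        for cores in (1, 2, 4):
+            print("%d cores at %.2fx one-core speed -> %.2fx overall"
+                  % (cores, fraction, relative_speed(cores, fraction)))
+    # Even at about 50% slower on one core, two cores already give ~1.33x
+    # the speed of a regular PyPy, and four cores give ~2.7x.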
+
+You will find below a sketch of the `work plan`_.  If more money than
+requested is collected, then the excess will be entered into the general
+PyPy pot, used for example to finance sprint travel costs for students.
+
+**Note:** For donations higher than $1,000, we can arrange for an invoice
+and a different payment method to avoid the high Paypal fees.  Please
+contact pypy at sfconservancy.org if you want to know details on how
+to donate via other means.
+
+
+What is the Global Interpreter Lock?
+------------------------------------
+
+The GIL, or Global Interpreter Lock, is a single lock in both CPython
+and the regular PyPy.  Every thread must acquire it in order to execute
+Python bytecodes.  This means that, both with CPython and with the
+regular PyPy, Python programs do not gain any benefit in terms of
+multicore performance even if they are using threads.
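+
+As a minimal illustration (a sketch, not a benchmark: the absolute
+timings depend entirely on the machine), a CPU-bound function does not
+finish faster when spread over two threads, because the GIL serializes
+the execution of the bytecodes::
+
+    import time, threading
+
+    def cpu_bound():
+        # a pure-Python busy loop; the GIL is held while it runs bytecodes
+        n = 0
+        i = 0
+        while i < 10**7:
+            n += i
+            i += 1
+        return n
+
+    start = time.time()
+    threads = [threading.Thread(target=cpu_bound) for _ in range(2)]
+    for t in threads: t.start()
+    for t in threads: t.join()
+    print("two threads: %.2fs" % (time.time() - start))
+    # On CPython or the regular PyPy this takes about as long as calling
+    # cpu_bound() twice in a row: the two threads never run bytecodes on
+    # two cores at the same time.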
+
+
+What is Transactional Memory?
 -----------------------------
 
+`Transactional Memory`_ --- TM --- is a technique imported from
+databases: every time we want to make a change to the processors' main
+memory, we do it in a "transaction".  Multiple transactions can be
+executed in parallel by multiple cores.  When a transaction is complete,
+we try to commit it.  This might either succeed, or (if another
+transaction committed incompatible changes) fail.  If it fails, which is
+hopefully rare, we need to restart the transaction from scratch.
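+
+The following toy model sketches this commit-and-retry cycle in plain
+Python.  It is only a conceptual illustration of the idea, not how the
+actual stmgc-c7 library is implemented::
+
+    import threading
+
+    class ToyTM(object):
+        """Toy optimistic-concurrency model of one shared dict."""
+        def __init__(self):
+            self.data = {}
+            self.version = 0
+            self._lock = threading.Lock()    # held only while committing
+
+        def run_transaction(self, func):
+            while True:
+                start_version = self.version
+                snapshot = dict(self.data)   # private working copy
+                func(snapshot)               # run the transaction body
+                with self._lock:
+                    if self.version == start_version:
+                        self.data = snapshot     # commit succeeded
+                        self.version += 1
+                        return
+                # another transaction committed in the meantime: restart
+
+    tm = ToyTM()
+    tm.run_transaction(lambda d: d.update(x=d.get('x', 0) + 1))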
+
+Transactional Memory research has progressed a lot over the past two years,
+notably with the introduction of Intel's Haswell_ processors, which
+offer Hardware Transactional Memory (HTM).  We discuss below why we
+think HTM is, so far, still not suitable for our goals.
+
+.. _`Transactional Memory`: http://en.wikipedia.org/wiki/Transactional_memory
+.. _Haswell: http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29
+
+
+Hardware vs Software Transactional Memory
+-----------------------------------------
+
+The idea of Transactional Memory was recently made popular by Intel's
+Haswell_ processor (released in 2013).  We could replace most of the
+Software Transactional Memory (STM) library currently used inside PyPy
+with a much smaller Hardware Transactional Memory (HTM) library based on
+hardware features and running on Haswell-generation processors.  This
+has been attempted by Remi Meier recently.  However, it seems that we
+are running into the problems we expected: the current generation of HTM
+processors is limited to running small-scale transactions.  Even the
+default transaction size used in PyPy-STM is often too large for HTM; and
+reducing this size increases overhead without completely solving the
+problem.  Based on this experience, it seems safe to say that right now
+HTM-enabled processors lack the support that we need.
+
+Future processors might improve on various aspects.  We are particularly
+interested in `Virtualizing Transactional Memory`_, a 2005 paper that
+describes the limits that we're running into and how to solve them more
+generally.  A CPU with support for the virtualized transactional memory
+described in this paper would certainly be better for running PyPy-HTM.
+
+None of the papers we found discusses the issue of sub-cache-line false
+conflicts, though (conflicts caused by two independent objects that
+happen to live in the same cache line, which is usually 64 bytes).
+This is in contrast with the current PyPy-STM, which doesn't have false
+conflicts of this kind at all and might thus be ultimately better for
+very-long-running transactions.
+
+.. _`Virtualizing Transactional Memory`: http://pages.cs.wisc.edu/~isca2005/papers/08A-02.PDF
+
+
+Why do it with PyPy instead of CPython?
+---------------------------------------
+
+While there have been early experiments on Hardware Transactional Memory
+with CPython (`Riley and Zilles (2006)`__, `Tabba (2010)`__), there has
+been no recent one.  The closest is an attempt using `Haswell on the
+Ruby interpreter`__.  None of these attempts tries to do the same using
+Software Transactional Memory.  We would nowadays consider it possible
+to adapt our stmgc-c7 library for CPython, but it would be a lot of
+work, starting from changing the reference-counting scheme.  PyPy is
+better designed to be open to this kind of research.
+
+But the best argument from an external point of view is probably that
+PyPy has got a JIT to start with.  It is thus starting from a better
+position in terms of performance, particularly for the long-running kind
+of programs that we target here.
+
+.. __: http://sabi.net/nriley/pubs/dls6-riley.pdf
+.. __: http://www.cs.auckland.ac.nz/~fuad/parpycan.pdf
+.. __: http://researcher.watson.ibm.com/researcher/files/jp-ODAIRA/PPoPP2014_RubyGILHTM.pdf
+
+
+Alternatives
+------------
+
+PyPy-TM will be slower than judicious usage of existing alternatives,
+based on multiple processes that communicate with each other in one way
+or another.  The counter-argument is that TM is not only a cleaner
+solution: there are cases in which it is not feasible to organize (or
+retrofit) an existing program into the particular format needed for the
+alternatives.  In particular, small quickly-written programs don't need
+the additional baggage of cross-process communication; and large
+programs can sometimes be almost impossible to turn into multi-process
+versions.  By contrast, we believe that TM can fit naturally into most
+programs, because it only requires local changes to some dispatcher; the
+rest of the program should work without changes.
+
+
+Other non-Linux platforms
+-------------------------
+
+The current work relies heavily on Linux-only, clang-only, and
+64-bit-only features.  We believe it is a suitable restriction: a lot of
+the multi- and many-core servers commonly available nowadays are x86-64
+machines running Linux.  Nevertheless, non-Linux solutions appear to be
+possible as well.  OS/X (and likely the various BSDs) seems to handle
+``mmap()`` better than Linux does, and can remap individual pages of an
+existing mapping to various other pages without hitting a limit of 65536
+mappings as Linux does.  Windows might also have a way, although we have
+not checked yet; but the first issue with Windows would be to support
+Win64, which the regular PyPy doesn't.
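+
+The Linux limit mentioned above can be observed directly.  The following
+sketch assumes default kernel settings (``vm.max_map_count`` is about
+65530 by default); the exact point at which it stops depends on the
+system configuration::
+
+    import mmap
+
+    maps = []
+    try:
+        while True:
+            # each call creates one more anonymous mapping of one page
+            maps.append(mmap.mmap(-1, 4096))
+    except (OSError, mmap.error):
+        print("hit the kernel's mapping limit after %d mappings" % len(maps))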
+
+We will likely explore the OS/X way (as well as the Windows way if Win64
+support grows in PyPy), but this is not included in the scope of this
+proposal.
+
+
 More readings
 -------------
 
+See `our blog posts about STM`__.
+
+.. __: http://morepypy.blogspot.com/search/label/stm
 
 
 
 Work plan
 =========
 
+This is a very rough estimate of the amount of work it would take an
+experienced developer who is already familiar with the PyPy codebase to
+complete the steps.  As before, we cannot guarantee the time estimates
+here, but we do agree to report regularly to the community, so our
+progress can be followed publicly.
 
+Paid work will be at $60/hour, but at least one developer who will work
+on the project --- Armin Rigo --- has committed to two hours of volunteer
+work per paid hour (so the total amount of money that we ask for is
+divided by three).  A 10% general donation will go to the `Software Freedom
+Conservancy`_ itself, the non-profit organization of which the PyPy
+project is a member and which manages all the issues related to
+donations, payments, and tax-exempt status.
+
+.. _`Software Freedom Conservancy`: http://sfconservancy.org/
+
+
+Goal 1
+------
+
+The PyPy-STM that we have at the end of March 2014 is good enough in
+some cases to run existing multithreaded code without a GIL, but not in
+all of them.  There are a number of caveats for the user and missing
+optimizations.  Goal #1 is to improve this situation and address the
+caveats.  The current status is written down `in the docs`__ and will
+evolve over time.
+
+.. __: https://pypy.readthedocs.org/en/latest/stm.html
+
+For future reference, at the end of March the main identified issues
+are:
+
+* There are still a number of bugs.
+
+* The JIT warm-up time is abysmal.
+
+* The GC is missing a number of optimizations that are present in
+  a regular PyPy.
+
+* Destructors are not supported (``__del__()`` methods).
+
+* The STM bookkeeping logic could see more improvements.
+
+* Forking the process is slow.
+
+Fixing all these issues is required before we can confidently say that
+PyPy-STM is an out-of-the-box replacement for a regular PyPy which gives
+speed-ups over the regular PyPy independently of the Python program it
+runs, as long as it is using at least two threads.
+
+
+Goal 2
+------
+
+This goal contains the various new features needed to use multiple cores
+without explicitly using multithreading; in other words, the new APIs
+and libraries accessible from Python programs that want to make use of
+this benefit.
+
+XXX improve from here
+
+The goal is to improve the existing atomic sections.  The most visible
+missing piece is that you don't get any report about the "conflicts"
+that occur.  Such reports would be the first thing you need in order to
+start using atomic sections more extensively.  Also, for now, for better
+results, you have to explicitly force a transaction break just before
+(and possibly after) each large atomic section, with ``time.sleep(0)``,
+as in the sketch below.
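+
+For reference, here is roughly what this looks like today.  The names
+below follow the interface described in the current PyPy-STM docs
+(``__pypy__.thread.atomic``) but are still subject to change, and
+``process()`` is just a placeholder for the application's own logic::
+
+    import time
+    from __pypy__.thread import atomic    # only available on PyPy-STM
+
+    def process(request):
+        pass    # placeholder for the application's own work
+
+    def handle_request(request):
+        time.sleep(0)       # for now: force a transaction break before...
+        with atomic:
+            # the whole block runs as one transaction: it appears to
+            # execute "all at once" with respect to the other threads
+            process(request)
+        time.sleep(0)       # ...and possibly one after the atomic section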
+
+This approach hides the notion of threads from the end programmer,
+including all the hard multithreading-related issues.  This is not the
+first alternative approach to explicit threads; for example, OpenMP_ is
+one.  However, it is one of the first ones that does not require the
+code to be organized in a particular fashion.  Instead, it works on any
+Python program which has got latent, imperfect parallelism.  Ideally, it
+only requires that the end programmer identifies where this parallelism
+is likely to be found, and communicates it to the system, using for
+example the ``transaction.add()`` scheme sketched below.
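+
+As a sketch of the ``transaction.add()`` scheme (following the interface
+of the experimental ``transaction`` module; the exact API may still
+change, and ``handle()`` is a placeholder for the per-item work), a loop
+with latent parallelism can be rewritten like this::
+
+    import transaction     # experimental module shipped with PyPy-STM
+
+    def handle(item):
+        pass    # placeholder for the per-item work of the program
+
+    def process_all(items):
+        # instead of:  for item in items: handle(item)
+        for item in items:
+            transaction.add(handle, item)   # schedule one transaction each
+        transaction.run()   # runs them on multiple cores, with results
+                            # equivalent to running the handle() calls one
+                            # after the other in some sequential order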
+
+XXX Talk also about dict- or list-specific conflict avoidance;
+delaying some updates or I/O; etc. etc.
+
+
+Goal 3
+------
+
+XXX
+
+
+---------
+
+XXX fix
+Total: 5 months for the initial version; at least 8 additional months
+for the fast version.  We will go with a total estimate of 15 months,
+corresponding to USD$151200.  The amount sought by this fundraising
+campaign, considering the 2 volunteer hours per paid hour, is thus USD$50400.
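+
+As a check of the arithmetic (the monthly hour figure is only an assumed
+full-time month used to make the numbers add up)::
+
+    hours_per_month = 168     # assumption: one full-time month of work
+    rate = 60                 # USD per hour
+    months = 15
+    total = hours_per_month * rate * months
+    print(total)              # 151200
+    print(total // 3)         # 50400: two volunteer hours per paid hour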
 
 
 Benefits of This Work to the Python Community and the General Public
 ====================================================================
+
+XXX