Author: Armin Rigo <[email protected]>
Branch: extradoc
Changeset: r5372:3c5ba3e46ec5
Date: 2014-07-23 08:02 +0200
http://bitbucket.org/pypy/extradoc/changeset/3c5ba3e46ec5/
Log:	Finish the slides

diff --git a/talk/ep2014/stm/talk.html b/talk/ep2014/stm/talk.html
--- a/talk/ep2014/stm/talk.html
+++ b/talk/ep2014/stm/talk.html
@@ -502,40 +502,64 @@
 </li>
 </ul>
 </div>
-<div class="slide" id="big-point">
-<h1>Big Point</h1>
+<div class="slide" id="pypy-stm">
+<h1>PyPy-STM</h1>
 <ul class="simple">
-<li>application-level locks still needed...</li>
+<li>implementation of a specially-tailored STM ("hard" part):<ul>
+<li>a reusable C library</li>
+<li>called STMGC-C7</li>
+</ul>
+</li>
+<li>used in PyPy to replace the GIL ("easy" part)</li>
+<li>could also be used in CPython<ul>
+<li>but refcounting needs replacing</li>
+</ul>
+</li>
+</ul>
+</div>
+<div class="slide" id="how-does-it-work">
+<h1>How does it work?</h1>
+<object data="fig4.svg" type="image/svg+xml">
+fig4.svg</object>
+</div>
+<div class="slide" id="demo">
+<h1>Demo</h1>
+<ul class="simple">
+<li>counting primes</li>
+</ul>
+</div>
+<div class="slide" id="long-transactions">
+<h1>Long Transactions</h1>
+<ul class="simple">
+<li>threads and application-level locks still needed...</li>
 <li>but <em>can be very coarse:</em><ul>
-<li>even two big transactions can optimistically run in parallel</li>
+<li>two transactions can optimistically run in parallel</li>
 <li>even if they both <em>acquire and release the same lock</em></li>
+<li>internally, drive the transaction lengths by the locks we acquire</li>
 </ul>
 </li>
 </ul>
 </div>
 <div class="slide" id="id2">
-<h1>Big Point</h1>
+<h1>Long Transactions</h1>
 <object data="fig4.svg" type="image/svg+xml">
 fig4.svg</object>
 </div>
-<div class="slide" id="demo-1">
-<h1>Demo 1</h1>
+<div class="slide" id="id3">
+<h1>Demo</h1>
 <ul class="simple">
-<li>"Twisted apps made parallel out of the box"</li>
 <li>Bottle web server</li>
 </ul>
 </div>
-<div class="slide" id="pypy-stm">
-<h1>PyPy-STM</h1>
+<div class="slide" id="pypy-stm-programming-model">
+<h1>PyPy-STM Programming Model</h1>
 <ul class="simple">
-<li>implementation of a specially-tailored STM:<ul>
-<li>a reusable C library</li>
-<li>called STMGC-C7</li>
-</ul>
-</li>
-<li>used in PyPy to replace the GIL</li>
-<li>could also be used in CPython<ul>
-<li>but refcounting needs replacing</li>
+<li>threads-and-locks, fully compatible with the GIL</li>
+<li>this is not "everybody should use careful explicit threading
+with all the locking issues"</li>
+<li>instead, PyPy-STM pushes forward:<ul>
+<li>use a thread pool library</li>
+<li>coarse locking, inside that library only</li>
 </ul>
 </li>
 </ul>
@@ -546,7 +570,7 @@
 <li>current status:<ul>
 <li>basics work</li>
 <li>best case 25-40% overhead (much better than originally planned)</li>
-<li>parallelizing user locks not done yet</li>
+<li>parallelizing user locks not done yet (see "with atomic")</li>
 <li>tons of things to improve</li>
 <li>tons of things to improve</li>
 <li>tons of things to improve</li>
@@ -558,52 +582,113 @@
 </li>
 </ul>
 </div>
-<div class="slide" id="demo-2">
-<h1>Demo 2</h1>
+<div class="slide" id="summary-benefits">
+<h1>Summary: Benefits</h1>
 <ul class="simple">
-<li>counting primes</li>
-</ul>
-</div>
-<div class="slide" id="benefits">
-<h1>Benefits</h1>
-<ul class="simple">
-<li>Keep locks coarse-grained</li>
 <li>Potential to enable parallelism:<ul>
-<li>in CPU-bound multithreaded programs</li>
+<li>in any CPU-bound multithreaded program</li>
 <li>or as a replacement of <tt class="docutils literal">multiprocessing</tt></li>
 <li>but also in existing applications not written for that</li>
 <li>as long as they do multiple things that are "often independent"</li>
 </ul>
 </li>
+<li>Keep locks coarse-grained</li>
 </ul>
 </div>
-<div class="slide" id="issues">
-<h1>Issues</h1>
+<div class="slide" id="summary-issues">
+<h1>Summary: Issues</h1>
 <ul class="simple">
-<li>Performance hit: 25-40% everywhere (may be ok)</li>
 <li>Keep locks coarse-grained:<ul>
 <li>but in case of systematic conflicts, performance is bad again</li>
 <li>need to track and fix them</li>
-<li>need tool support (debugger/profiler)</li>
+<li>need tool to support this (debugger/profiler)</li>
 </ul>
 </li>
+<li>Performance hit: 25-40% over a plain PyPy-JIT (may be ok)</li>
 </ul>
 </div>
-<div class="slide" id="summary">
-<h1>Summary</h1>
+<div class="slide" id="summary-pypy-stm">
+<h1>Summary: PyPy-STM</h1>
 <ul class="simple">
-<li>Transactional Memory is still too researchy for production</li>
-<li>But it has the potential to enable "easier parallelism"</li>
+<li>Not production-ready</li>
+<li>But it has the potential to enable "easier parallelism for everybody"</li>
 <li>Still alpha but slowly getting there!<ul>
 <li>see <a class="reference external" href="http://morepypy.blogspot.com/">http://morepypy.blogspot.com/</a></li>
 </ul>
 </li>
+<li>Crowdfunding!<ul>
+<li>see <a class="reference external" href="http://pypy.org/">http://pypy.org/</a></li>
+</ul>
+</li>
 </ul>
 </div>
 <div class="slide" id="part-2-under-the-hood">
 <h1>Part 2 - Under The Hood</h1>
 <p><strong>STMGC-C7</strong></p>
 </div>
+<div class="slide" id="overview">
+<h1>Overview</h1>
+<ul class="simple">
+<li>Say we want to run N = 2 threads</li>
+<li>We reserve twice the memory</li>
+<li>Thread 1 reads/writes "memory segment" 1</li>
+<li>Thread 2 reads/writes "memory segment" 2</li>
+<li>Upon commit, we (try to) copy the changes to the other segment</li>
+</ul>
+</div>
+<div class="slide" id="trick-1">
+<h1>Trick #1</h1>
+<ul class="simple">
+<li>Objects contain pointers to each other</li>
+<li>These pointers are relative instead of absolute:<ul>
+<li>accessed as if they were "thread-local data"</li>
+<li>the x86 has a zero-cost way to do that (<tt class="docutils literal">%fs</tt>, <tt class="docutils literal">%gs</tt>)</li>
+<li>supported in clang (not gcc so far)</li>
+</ul>
+</li>
+</ul>
+</div>
+<div class="slide" id="trick-2">
+<h1>Trick #2</h1>
+<ul class="simple">
+<li>With Trick #1, most objects are exactly identical in all N segments:<ul>
+<li>so we share the memory</li>
+<li><tt class="docutils literal">mmap() MAP_SHARED</tt></li>
+<li>actual memory usage is multiplied by much less than N</li>
+</ul>
+</li>
+<li>Newly allocated objects are directly in shared pages:<ul>
+<li>we don't actually need to copy <em>all new objects</em> at commit,
+but only the few <em>old objects</em> modified</li>
+</ul>
+</li>
+</ul>
+</div>
+<div class="slide" id="barriers">
+<h1>Barriers</h1>
+<ul class="simple">
+<li>Need to record all reads and writes done by a transaction</li>
+<li>Extremely cheap way to do that:<ul>
+<li><em>Read:</em> set a flag in thread-local memory (one byte)</li>
+<li><em>Write</em> into a newly allocated object: nothing to do</li>
+<li><em>Write</em> into an old object: add the object to a list</li>
+</ul>
+</li>
+<li>Commit: check if each object from that list conflicts with
+a read flag set in some other thread</li>
+</ul>
+</div>
+<div class="slide" id="id4">
+<h1>...</h1>
+</div>
+<div class="slide" id="thank-you">
+<h1>Thank You</h1>
+<ul class="simple">
+<li><a class="reference external" href="http://morepypy.blogspot.com/">http://morepypy.blogspot.com/</a></li>
+<li><a class="reference external" href="http://pypy.org/">http://pypy.org/</a></li>
+<li>irc: <tt class="docutils literal">#pypy</tt> on freenode.net</li>
+</ul>
+</div>
 </div>
 </body>
 </html>
diff --git a/talk/ep2014/stm/talk.rst b/talk/ep2014/stm/talk.rst
--- a/talk/ep2014/stm/talk.rst
+++ b/talk/ep2014/stm/talk.rst
@@ -153,8 +153,8 @@

   - but refcounting needs replacing

-Commits
----------
+How does it work?
+-----------------

 .. image:: fig4.svg

@@ -165,17 +165,19 @@

 * counting primes

-Big Point
+Long Transactions
 ----------------------------

-* application-level locks still needed...
+* threads and application-level locks still needed...

 * but *can be very coarse:*

-  - even two big transactions can optimistically run in parallel
+  - two transactions can optimistically run in parallel

   - even if they both *acquire and release the same lock*

+  - internally, drive the transaction lengths by the locks we acquire
+

 Long Transactions
 -----------------

@@ -211,7 +213,7 @@
   - basics work

   - best case 25-40% overhead (much better than originally planned)

-  - parallelizing user locks not done yet (see ``with atomic``)
+  - parallelizing user locks not done yet (see "with atomic")

   - tons of things to improve
   - tons of things to improve
   - tons of things to improve

@@ -224,8 +226,6 @@
 Summary: Benefits
 -----------------

-* Keep locks coarse-grained
-
 * Potential to enable parallelism:

   - in any CPU-bound multithreaded program

@@ -236,6 +236,8 @@

   - as long as they do multiple things that are "often independent"

+* Keep locks coarse-grained
+

 Summary: Issues
 ---------------

@@ -248,7 +250,7 @@

   - need tool to support this (debugger/profiler)

-* Performance hit: 25-40% everywhere (may be ok)
+* Performance hit: 25-40% over a plain PyPy-JIT (may be ok)


 Summary: PyPy-STM

@@ -256,12 +258,16 @@

 * Not production-ready

-* But it has the potential to enable "easier parallelism"
+* But it has the potential to enable "easier parallelism for everybody"

 * Still alpha but slowly getting there!

   - see http://morepypy.blogspot.com/

+* Crowdfunding!
+
+  - see http://pypy.org/
+

 Part 2 - Under The Hood
 -----------------------

@@ -272,7 +278,7 @@
 Overview
 --------

-* Say we want to run two threads
+* Say we want to run N = 2 threads

 * We reserve twice the memory

@@ -290,16 +296,56 @@

 * These pointers are relative instead of absolute:

-  -
+  - accessed as if they were "thread-local data"
+  - the x86 has a zero-cost way to do that (``%fs``, ``%gs``)

-Trick #1
+  - supported in clang (not gcc so far)
+
+
+Trick #2
 --------

-* Most objects are the same in all segments:
+* With Trick #1, most objects are exactly identical in all N segments:

   - so we share the memory

-  - ``mmap() MAP_SHARED`` trickery
+  - ``mmap() MAP_SHARED``
+  - actual memory usage is multiplied by much less than N
+* Newly allocated objects are directly in shared pages:
+
+  - we don't actually need to copy *all new objects* at commit,
+    but only the few *old objects* modified
+
+
+Barriers
+--------
+
+* Need to record all reads and writes done by a transaction
+
+* Extremely cheap way to do that:
+
+  - *Read:* set a flag in thread-local memory (one byte)
+
+  - *Write* into a newly allocated object: nothing to do
+
+  - *Write* into an old object: add the object to a list
+
+* Commit: check if each object from that list conflicts with
+  a read flag set in some other thread
+
+
+...
+-------------------
+
+
+Thank You
+---------
+
+* http://morepypy.blogspot.com/
+
+* http://pypy.org/
+
+* irc: ``#pypy`` on freenode.net

_______________________________________________
pypy-commit mailing list
[email protected]
https://mail.python.org/mailman/listinfo/pypy-commit
