Author: Remi Meier <remi.me...@inf.ethz.ch>
Branch: extradoc
Changeset: r5349:897d0fa65804
Date: 2014-06-17 17:23 +0200
http://bitbucket.org/pypy/extradoc/changeset/897d0fa65804/
Log:	some adjustments

diff --git a/talk/icooolps2014/position-paper.pdf b/talk/icooolps2014/position-paper.pdf
index 4c20e424eab52b3d1bdeed6cd22738c5db3b0800..e7a9e9a08d66ba2c9c26bafa1c59d5d4a2e7bf14
GIT binary patch
[cut]

diff --git a/talk/icooolps2014/position-paper.tex b/talk/icooolps2014/position-paper.tex
--- a/talk/icooolps2014/position-paper.tex
+++ b/talk/icooolps2014/position-paper.tex
@@ -116,7 +116,7 @@
 Dynamic languages became very popular in recent years. At some point,
 the need for concurrency arose, and many of them made the choice to
 use a single global interpreter lock (GIL) to synchronise
-the interpreter in a multithreading scenario. This choice however
+the interpreter in a multithreading scenario. This choice, however,
 makes it impossible to actually run code in parallel.

 Here we want to compare different approaches to replacing the GIL
@@ -143,7 +143,7 @@
 performance increases less and less every year, many dynamic languages
 have a problem. While there is certainly a lot of popularity around
 languages like Python and Ruby, their ability to make use of multiple
-cores is somewhat limited. For ease of implementation they chose to
+cores is somewhat limited. For ease of implementation, they chose to
 use a single, global interpreter lock (GIL) to synchronise the
 execution of code in multiple threads. While this is a
 straight-forward way to eliminate synchronisation issues in the
@@ -168,43 +168,20 @@
 threads for concurrency, as well as the changes that are required to
 the interpreter itself.

-These requirements are not easy to meet. We argue that STM is the
-overall winner. While it currently has a big performance problem, it
-gets more points in all the other categories. We think that it is the
-only solution that also provides a better synchronisation mechanism to
-the application in the form of parallelisable atomic blocks.
-
-%% \subsection{Issue}
-%% The issue that we want to discuss is how to efficiently support
-%% multi-core parallel execution of code in dynamic languages that were
-%% designed with GIL semantics in mind.
-
-%% Furthermore, a solution to this problem should also bring better
-%% synchronisation mechanism with it...
-
-%% (supporting (large) atomic blocks for synchronisation)
-
-%% \subsection{Our Position}
-%% Current solutions for replacing the GIL include STM, HTM, and
-%% fine-grained locking. STM is usually too slow, HTM very limited, and
-%% locking suffers from complexity that makes it hard to implement
-%% correctly. We argue that the best way forward is still STM and that
-%% its performance problem can be solved.
-
-%% Current solutions like STM, HTM, and fine-grained locking are slow, hard
-%% to implement correctly, and don't fit the specific problems of dynamic
-%% language. STM is the best way forward but has bad performance, so we
-%% fix that.
+These requirements are not easy to meet. The author's position is that
+STM provides the best way forward. While STM currently has a big
+performance problem, it gets more points in the other categories. We
+think that it is the only solution that also provides a better
+synchronisation mechanism to the application in the form of
+parallelisable atomic blocks. In the following section, we try to
+present a balanced view of the compared approaches.

 \section{Discussion}

-In this section we examine the approaches and highlight their
-advantages and disadvantages.
-%% \paragraph{dynamic language VM problems}
-%% XXX:
-%% - high allocation rate (short lived objects)\\
-%% - (don't know anything about the program that runs until it actually runs: arbitrary atomic block size)
+In this section we first explain the motivation for using a GIL and
+then examine different approaches to remove or avoid it -- highlighting
+their advantages and disadvantages.
 \subsection{Why is there a GIL?}

@@ -214,7 +191,7 @@
 also applies to Abstract Syntax Tree (AST) interpreters, where the
 GIL may only be released between interpreting two AST nodes.}. The
 interpreter can thus rely on complete isolation and atomicity for the
-instructions' execution. Thus, accesses to data structures like
+instructions' execution. Also, accesses to data structures like
 dictionaries and lists happen atomically and do not need additional
 protection from data races when shared between threads.
@@ -222,8 +199,8 @@
 model~\cite{lamport79}. This can be very valuable as it means fewer
 surprises for the programmer. For example in Table~\ref{tab:seq_cons},
 the programmer can expect the critical section to only be entered by
-one thread. If the model allowed to buffer the writes, both threads
-may enter the critical section at the same time.
+one thread. On the other hand, if the model allowed the writes to be
+buffered, both threads might enter the critical section at the same time.

 \begin{table}[!ht]
 \begin{center}
@@ -231,11 +208,12 @@
 \hline
 Thread 1 & Thread 2 \\
 \hline
-\multicolumn{2}{|l|}{\texttt{A = B = 0}} \\
+\multicolumn{2}{|c|}{\texttt{A = B = 0}} \\
 \hline
 \texttt{A = 1} & \texttt{B = 1} \\
 \texttt{if B == 0:} & \texttt{if A == 0:} \\
-\texttt{ critical section} & \texttt{ critical section} \\
+\multicolumn{2}{|c|}{only one thread enters here} \\
+\multicolumn{2}{|c|}{(e.g.\ critical section)} \\
 \hline
 \end{tabular}
 \caption{Critical section with a sequential consistency model.}
@@ -263,7 +241,7 @@
 degree. Again, a potential solution should be able to integrate with
 external libraries with similar ease. We will however focus our
 argumentation more on running code in the interpreted language in
-parallel, not the external C calls.
+parallel, not the external C code.

 Since the GIL is mostly an implementation detail of the interpreter,
 it is not exposed to the application running on top of it.
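The store-buffering pattern from the table above can be sketched directly in Python. This illustration is not from the paper; it assumes CPython-style execution, where the GIL gives these reads and writes sequentially consistent behaviour, so at most one thread can observe the other's flag still at zero:

```python
import threading

# Dekker-style flags from the table: A = B = 0, each thread sets its
# own flag and then checks the other's.
A = 0
B = 0
entered = []                      # records which threads "entered"
record_lock = threading.Lock()    # protects the shared list

def thread1():
    global A
    A = 1
    if B == 0:                    # under SC: B == 0 means T2 hasn't written yet
        with record_lock:
            entered.append("T1")

def thread2():
    global B
    B = 1
    if A == 0:
        with record_lock:
            entered.append("T2")

t1 = threading.Thread(target=thread1)
t2 = threading.Thread(target=thread2)
t1.start(); t2.start()
t1.join(); t2.join()

# With sequentially consistent writes (as under a GIL), at most one
# thread can read the other's flag as 0, so at most one entry appears.
# A model that buffers writes could let both threads enter.
print(entered)
```

Running this repeatedly on CPython always yields zero or one entry, never two; that is the guarantee the sequential consistency model provides.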
 To
@@ -284,7 +262,8 @@
 that the GIL is not released during the execution of the atomic
 block. Of course, this still means that no two atomic blocks can
 execute in parallel or even concurrently. Potential solutions are
-preferable if they provide a good way to implement atomic blocks that
+preferable if they provide a good way to implement atomic blocks
+(or another, comparable synchronisation mechanism) that
 are also able to be executed in parallel.
@@ -318,8 +297,8 @@
 \subsection{Potential Solutions}
 \label{sec:pot_solutions}

-For the discussion we define a set of criteria to evaluate the
-multiple potential solutions for removing or avoiding the GIL and its
+For the discussion, we define a set of criteria to evaluate the
+potential solutions for removing or avoiding the GIL and its
 limitations:

 \begin{description}
@@ -350,14 +329,13 @@
 and releasing locks produces. The former means that sometimes it is
 necessary to fall back to less fine-grained locking, preventing some
 potential parallelism in order to keep the complexity manageable.
-The latter means that we lose a bit of performance in the
-single-threaded case compared to the GIL, which requires much less
-acquire-release operations.
+The latter means that we lose a bit of performance compared to the
+GIL, which requires far fewer acquire-release operations.

 Jython~\cite{webjython} is one project that implements an interpreter
 for Python on the Java Virtual Machine (JVM) and that uses
 fine-grained locking\footnote{The performance impact of
-fine-grained locking is milder in Java than it would be in a typical piece
+fine-grained locking is milder on the JVM than it would be in a typical piece
 of C code; see e.g.~\cite{biased}.} to correctly synchronise the
 interpreter.
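The atomic-block idea discussed above can be sketched with a deliberately naive emulation: a single re-entrant lock that stands in for "do not release the GIL during this block". The `atomic` object and `transfer` function here are made up for illustration; real systems such as pypy-stm aim to run such blocks in parallel via STM, whereas this emulation serialises them, which is exactly the limitation the text describes:

```python
import threading

# One global lock emulating "the GIL is held for the whole block".
_global_lock = threading.RLock()

class _Atomic:
    """Hypothetical atomic-block context manager (serialising sketch)."""
    def __enter__(self):
        _global_lock.acquire()
    def __exit__(self, *exc):
        _global_lock.release()
        return False

atomic = _Atomic()

counter = 0

def transfer():
    global counter
    with atomic:           # the composite read-modify-write appears indivisible
        tmp = counter
        counter = tmp + 1

threads = [threading.Thread(target=transfer) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)
```

Without the `with atomic:` block, the read-modify-write could interleave and lose updates; with it, the count is always exact, but no two blocks ever overlap, which is why solutions that can run atomic blocks in parallel score better.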
 For a language like Python, one needs quite a few, carefully placed
 locks -- every dictionary, list, instance, or mutable
@@ -367,21 +345,21 @@
 there is no central location for all these locks, the complexity of
 the implementation is quite a bit larger compared to using a
 GIL. Integrating external, non-thread-safe libraries should
-however be very simple too. One could simply use one lock per library
+however be very simple too. One can simply use one lock per library
 to avoid this issue.

 In the end, fine-grained locking can transparently replace the GIL
 and therefore parallelise existing applications, generally without any
-changes\footnote{There are rare cases where not having atomic
-  bytecodes actually changes the semantics. E.g.\ in Jython,
-  \texttt{dict1.update(dict2)} is not atomic: it first reads data from
-  \texttt{dict2} with \texttt{dict2}'s lock, and then puts it into
-  \texttt{dict1} with \texttt{dict1}'s lock. A lot can happen
-  in-between.}. An implementation has to follow the GIL semantics very
+changes. An implementation has to follow the GIL semantics very
 closely, otherwise it may expose latent data races in existing
-applications which are just not exposed with a GIL. This approach does
-however not provide a better parallelising synchronisation mechanism
-to the application like e.g.\ atomic blocks.
+applications which simply never surface with a GIL\footnote{There are
+  rare cases where not having atomic bytecodes actually changes the
+  semantics. E.g.\ in Jython, \texttt{dict1.update(dict2)} is not
+  atomic: it first reads data from \texttt{dict2} with \texttt{dict2}'s
+  lock, and then puts it into \texttt{dict1} with \texttt{dict1}'s
+  lock. A lot can happen in-between.}. This approach, however, does
+not provide a better parallelising synchronisation mechanism to the
+application and still requires explicit locking.
@@ -389,7 +367,7 @@
 There are also approaches that work around the GIL instead of trying
 to replace it.
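The footnote's Jython example can be made concrete with a sketch of per-object locking. The `LockedDict` class below is hypothetical (Jython locks its builtin dicts internally); it shows the structure of the problem: `update_from` takes `other`'s lock for the read and `self`'s lock for the write, so the update as a whole is not atomic and another thread may mutate `other` between the two steps:

```python
import threading

class LockedDict:
    """Hypothetical fine-grained-locked dict: one lock per instance."""
    def __init__(self):
        self.lock = threading.Lock()
        self.data = {}

    def set(self, key, value):
        with self.lock:
            self.data[key] = value

    def snapshot(self):
        with self.lock:               # atomic read of this dict alone
            return dict(self.data)

    def update_from(self, other):
        items = other.snapshot()      # step 1: read under other's lock
        # ... a concurrent thread may modify `other` right here,
        # so the two steps together are NOT atomic ...
        with self.lock:               # step 2: write under our own lock
            self.data.update(items)

d1, d2 = LockedDict(), LockedDict()
d2.set("x", 1)
d1.update_from(d2)
print(d1.data)
```

Each individual operation is race-free, yet `d1.update_from(d2)` is weaker than the GIL-atomic `dict1.update(dict2)` bytecode semantics; closing that gap without a central lock is what makes a correct fine-grained implementation hard.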
 If an application can be split into completely
-independent parts that only very rarely need to share anything, or
+independent parts that only very rarely need to share something, or
 only do so via an external program like a database, then it is
 sensible to have one GIL per independent part. At the extreme, there
 are applications that parallelise perfectly simply by running
@@ -450,7 +428,7 @@
 suffered a lot from this.

 The performance of HTM is pretty good as it does not introduce much
-overhead ($<40\%$ overhead~\cite{odaira14}). And it can transparently
+overhead ($<40\%$~\cite{odaira14}). It can also transparently
 parallelise existing applications to some degree. The implementation
 is very straight-forward because it directly replaces the GIL in a
 central place. HTM is also directly compatible with any external
@@ -512,8 +490,8 @@
 general overview in Table \ref{tab:comparison}. The points were assigned
 according to the criteria described in \ref{sec:pot_solutions}. Since
 the criteria are defined intuitively, there are no formal justifications
-for the assigned points. The reader is thus advised to take them with a
-grain of salt.
+for the number of points. The reader is thus advised to take the result
+with a grain of salt.

 The general picture is anything but clear. It looks like HTM
 may be a good solution to replace the GIL in the near future. Current
@@ -537,9 +515,9 @@
 One way to get more performance is to develop STM systems that make
 better use of low-level features in existing OS kernels. We are
-currently working on a STM system that makes use of several such
+currently working on an STM system that makes use of several such
 features like virtual memory and memory segmentation. We further
-tailor the system to the discussed use case which gives us an
+tailor the system to the discussed use case, which gives us an
 advantage over other STM systems that are more general.
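The "one GIL per independent part" workaround described above corresponds to process-based parallelism in standard Python. A minimal sketch with `multiprocessing` (the `count_primes` workload is made up for illustration): each worker process runs its own interpreter with its own GIL, so CPU-bound work proceeds in parallel as long as the parts share essentially nothing:

```python
from multiprocessing import Pool

def count_primes(bound):
    """CPU-bound toy workload: count primes below `bound` by trial division."""
    return sum(
        1
        for n in range(2, bound)
        if all(n % d for d in range(2, int(n ** 0.5) + 1))
    )

if __name__ == "__main__":
    # Four independent parts, one interpreter (and one GIL) each.
    with Pool(4) as pool:
        results = pool.map(count_primes, [2000] * 4)
    print(results)
```

The cost is exactly what the text notes: the parts communicate only by message passing (pickled arguments and results), so anything that must be shared frequently makes this approach fall apart.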
 With this approach, initial results suggest that we can keep the
 overhead of STM well below 50\%. A hybrid TM system, which also uses
 HTM to accelerate

_______________________________________________
pypy-commit mailing list
pypy-commit@python.org
https://mail.python.org/mailman/listinfo/pypy-commit