Author: David Schneider <[email protected]>
Branch: extradoc
Changeset: r4540:68df76a06dd0
Date: 2012-08-13 13:45 +0200
http://bitbucket.org/pypy/extradoc/changeset/68df76a06dd0/

Log:    Changes based on Stephan's remarks

diff --git a/talk/vmil2012/paper.tex b/talk/vmil2012/paper.tex
--- a/talk/vmil2012/paper.tex
+++ b/talk/vmil2012/paper.tex
@@ -125,8 +125,8 @@
 \cfbolz{the first two paragraphs talk about deoptimization, then it
 switches to guards. I would say we should only talk about guards in the
 beginning}
-In this paper we describe and analyze how deoptimization works in the context
-of tracing just-in-time compilers. What instructions are used in the
+\bivab{Introduce in a sentence what guards are}
+In this paper we describe and analyze how guards, a concept central to tracing just-in-time compilers, work, explicating which concepts are used in the
 intermediate and low-level representation of the JIT instructions and how these
 are implemented.
 
@@ -138,7 +138,7 @@
 operations in the traces produced by RPython's tracing JIT, our
 goal is to present concrete numbers for the frequency and the overhead related
 to guards, explain how they are implemented in the different levels of RPython's
-tracing JIT and explain the rationale behind the design decisions based on the
+tracing JIT and clarify the rationale behind the design decisions based on the
 numbers provided here.
 
 \cfbolz{this paragraph now suddenly \emph{introduces} guards, despite having talked about them already}
@@ -151,7 +151,9 @@
 operations before and for about 15\% to 20\% of the operations after the
 optimization pass over the traced and later compiled parts of the benchmarks,
 making guards one of the most common types of operations. Many of these guards
-fail rarely or not all during execution. There are several aspects to consider
+fail rarely or not at all during execution.
+
+There are several aspects to consider
 in the design and optimization of guards. The first aspect is that, due to the
 large number of guards, the memory overhead related to storing the information
 needed for deoptimization should be kept low. A second aspect is that
@@ -169,8 +171,9 @@
 describe based on them the reasoning behind and the implementation of guards in
 RPython's tracing just-in-time compiler. The contributions of this paper are:
 \begin{itemize}
-  \item An analysis of guards in the context of RPython's tracing JIT to
-  substantiate the aforementioned observation, based on a set of benchmarks,
+  \item An analysis and benchmark of guards in the context of RPython's tracing JIT.
+  %An analysis of guards in the context of RPython's tracing JIT to
+  %substantiate the aforementioned observation, based on a set of benchmarks,
   \item detailed measurements about the frequency and the
   overhead associated with guards, and
   \item a description about how guards are implemented in the high\-
@@ -234,12 +237,12 @@
 and run on one of the different supported target platforms/architectures such
 as C, .NET and Java. During the transformation process
 different low level aspects suited for the target environment are automatically
-added to program such as (if needed) a garbage collector and with some hints
-provided by the author a just-in-time compiler.
+added to the program such as (if needed) a garbage collector
+and a just-in-time compiler based on hints provided by the author.
 
 \subsection{RPython's Tracing JIT Compilers}
 \label{sub:tracing}
-Tracing JITs are a technique of just-in-time compilers that generate code by
+Tracing is a technique for just-in-time compilers to generate code by
 observing the execution of a program. VMs using tracing JITs typically are
 mixed-mode execution environments that also contain an interpreter. The
 interpreter profiles the executing program and selects frequently executed code
@@ -247,7 +250,7 @@
 selecting hot loops.
 
 After profiling identifies an interesting
-path, tracing is started, recording all operations that are executed on this
+path, tracing is started, thus recording all operations that are executed on this
 path. This includes inlining function calls.
 As in most compilers, tracing JITs use an intermediate representation to
 store the recorded operations, which is typically in SSA
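
To make the mixed-mode setup described above concrete, here is a minimal hypothetical Python sketch (not RPython's actual implementation) of an interpreter that counts executions of loop headers and starts recording a trace once a hot-loop threshold is crossed; all names and the threshold value are illustrative assumptions.

    # Hypothetical sketch of hot-loop detection and trace recording.
    HOT_THRESHOLD = 1000  # made-up value; real thresholds are tuned

    class Interpreter:
        def __init__(self, program):
            self.program = program   # list of (opcode, arg) pairs
            self.counters = {}       # loop-header pc -> execution count
            self.traces = {}         # loop-header pc -> recorded trace

        def run(self, pc=0):
            trace = None             # operations recorded while tracing
            while pc < len(self.program):
                op, arg = self.program[pc]
                if op == 'JUMP_BACK':            # backward jump: loop header at arg
                    self.counters[arg] = self.counters.get(arg, 0) + 1
                    if trace is not None:        # back at the header: loop closed
                        self.traces[arg] = trace  # a real VM would now compile it
                        trace = None
                    elif arg not in self.traces and \
                            self.counters[arg] > HOT_THRESHOLD:
                        trace = []               # loop is hot: start recording
                    pc = arg
                    continue
                if trace is not None:
                    trace.append((op, arg))      # record every executed operation
                pc += 1
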
@@ -295,7 +298,7 @@
 
 \begin{figure}
     \input{figures/unopt-log.tex}
-    \caption{Unoptimized trace}
+    \caption{Unoptimized trace. Numbers on the right represent the line numbers in the original program. Lines marked with -1 represent lines added by the JIT.}
     \label{fig:unopt-trace}
 \end{figure}
 
@@ -305,7 +308,7 @@
 Since tracing linearizes control flow by following one concrete execution,
 the full control flow of a program is not observed.
 The possible points of deviation from the trace are guard operations
-that check whether the same assumptions observed during tracing
+that check whether the same assumptions observed while tracing
 still hold during execution.
 Similarly, in the case of dynamic languages guards can also encode type
 assumptions.
@@ -315,17 +318,17 @@
 to reconstruct the interpreter state when that guard fails.
 This information is called the \emph{resume data}.
 
-To do this reconstruction, it is necessary to take the values of the SSA
+To do this reconstruction it is necessary to take the values of the SSA
 variables of the trace and build interpreter stack frames.  Tracing
 aggressively inlines functions; therefore the reconstructed state of the
 interpreter can consist of several interpreter frames.
 
 If a guard fails often enough, a trace is started from it,
-to create a trace tree.
+forming a trace tree.
 When that happens, another use case of resume data
 is to construct the tracer state.
 After the bridge has been recorded and compiled it is attached to the guard.
-If the guard fails later, the bridge is executed. Therefore the resume data of
+If the guard fails later the bridge is executed. Therefore the resume data of
 that guard is no longer needed.
 
 There are several forces guiding the design of resume data handling.
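
As a rough illustration of the reconstruction described above, the following hypothetical Python sketch models resume data as a linked list of symbolic frames whose slots are tagged either as constants or as references to SSA variables; the class and function names are invented for this sketch.

    # Hypothetical sketch: resume data as a linked list of symbolic frames.
    class SymbolicFrame:
        def __init__(self, back, slots):
            self.back = back     # next (older) symbolic frame, or None
            self.slots = slots   # each slot: ('const', value) or ('var', ssa_index)

    def rebuild_frames(symframe, ssa_values):
        """Turn resume data plus the SSA values captured at a failing
        guard back into a stack of concrete interpreter frames."""
        frames = []
        while symframe is not None:
            concrete = [payload if kind == 'const' else ssa_values[payload]
                        for kind, payload in symframe.slots]
            frames.append(concrete)
            symframe = symframe.back
        frames.reverse()             # oldest (outermost) frame first
        return frames

    # Two frames because tracing inlined a call; two SSA values live at the guard.
    outer = SymbolicFrame(None, [('const', 42)])
    inner = SymbolicFrame(outer, [('var', 0), ('var', 1)])
    print(rebuild_frames(inner, ssa_values=[7, 13]))   # [[42], [7, 13]]
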
@@ -362,10 +365,10 @@
 
 The core idea of storing resume data as compactly as possible
 is to share parts of the data structure between subsequent guards.
-This is often useful because the density of guards in traces is so high,
+This is useful because the density of guards in traces is so high
 that quite often not much changes between them.
 Since resume data is a linked list of symbolic frames,
-often only the information in the top frame changes from one guard to the next.
+in many cases only the information in the top frame changes from one guard to the next.
 The other symbolic frames can often just be reused.
 The reason for this is that during tracing only the variables
 of the currently executing frame can change.
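
The sharing described here can be pictured with a tiny continuation of the previous sketch: because symbolic frames behave like immutable linked-list nodes, two subsequent guards can hold different top frames while pointing at the very same older frames (again purely illustrative).

    # Hypothetical sketch of sharing resume data between subsequent guards.
    outer = SymbolicFrame(None, [('const', 42)])            # caller's frame

    guard1_resume = SymbolicFrame(outer, [('var', 0)])
    guard2_resume = SymbolicFrame(outer, [('var', 0), ('var', 1)])

    # Only the top frame differs; the caller's frame is reused, not copied.
    assert guard1_resume.back is guard2_resume.back is outer
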
@@ -396,14 +399,13 @@
 \label{sub:optimization}
 
 Guards interact with optimizations in various ways.
-Most importantly optimizations try to remove as many operations
-and therefore guards as possible.
-This is done with many classical compiler optimizations.
+Using many classical compiler optimizations the JIT tries to remove as many
+operations, and therefore guards, as possible.
 In particular guards can be removed by subexpression elimination.
 If the same guard is encountered a second time in the trace,
 the second one can be removed.
 This also works if a later guard is weaker
-and therefore implied by a earlier guard.
+and hence implied by an earlier guard.
 
 One of the techniques in the optimizer specific to tracing for removing guards
 is guard strengthening~\cite{bebenita_spur:_2010}.
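
A minimal sketch of the duplicate-guard case (not the actual optimizer, and handling only exact duplicates rather than implication between weaker and stronger guards) could look as follows in Python:

    # Hypothetical sketch of redundant-guard removal in a trace optimizer.
    def remove_duplicate_guards(trace):
        """Keep each guard only the first time its condition is established."""
        established = set()          # conditions known to hold at this point
        optimized = []
        for op in trace:
            if op[0] == 'guard':
                condition = op[1]    # e.g. ('class_is', 'v3', 'int')
                if condition in established:
                    continue         # implied by an earlier guard: drop it
                established.add(condition)
            optimized.append(op)
        return optimized

    trace = [('guard', ('class_is', 'v3', 'int')),
             ('int_add', 'v3', 'v4'),
             ('guard', ('class_is', 'v3', 'int'))]   # duplicate, removed
    print(remove_duplicate_guards(trace))
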
@@ -422,10 +424,10 @@
 Allocation removal makes resume data more complex.
 Since allocations are removed from the trace, it becomes necessary
 to reconstruct the objects that were not allocated so far when a guard fails.
-Therefore the resume data needs to store enough information
+Consequently the resume data needs to store enough information
 to make this reconstruction possible.
 
-Adding this additional information is done as follows.
+Adding this additional information is done as follows:
 So far, every variable in the symbolic frames
 contains a constant or an SSA variable.
 After allocation removal the variables in the symbolic frames can also contain
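
To sketch what such a symbolic-frame entry for a removed allocation might look like: besides constants and SSA variables, a slot can describe a "virtual" object that the bailout code must materialize on guard failure. The following Python illustration extends the tagged-slot idea from the earlier sketches; the names and representation are assumptions.

    # Hypothetical sketch: a 'virtual' slot stands for an object whose
    # allocation was removed and must be redone when the guard fails.
    class Virtual:
        def __init__(self, cls, fields):
            self.cls = cls          # class of the removed allocation
            self.fields = fields    # field name -> tagged slot

    def materialize(slot, ssa_values):
        kind, payload = slot
        if kind == 'const':
            return payload
        if kind == 'var':
            return ssa_values[payload]
        # kind == 'virtual': perform the delayed allocation now
        obj = {'class': payload.cls}
        for name, field_slot in payload.fields.items():
            obj[name] = materialize(field_slot, ssa_values)
        return obj

    point = Virtual('Point', {'x': ('var', 0), 'y': ('const', 0)})
    print(materialize(('virtual', point), ssa_values=[17]))
    # {'class': 'Point', 'x': 17, 'y': 0}
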
@@ -488,16 +490,16 @@
 \end{figure}
 
 
-After optimization the resulting trace is handed to the over platform specific
+After optimization the resulting trace is handed over to the platform specific
 backend to be compiled to machine code. The compilation phase consists of two
 passes over the list of instructions: a backwards pass to calculate live
 ranges of IR-level variables and a forward pass to emit the instructions. During
 the forward pass IR-level variables are assigned to registers and stack
-locations by the register allocator according to the requirements of the to be
+locations by the register allocator according to the requirements of the
 emitted instructions.  Eviction/spilling is performed based on the live range
 information collected in the first pass. Each IR instruction is transformed
 into one or more machine level instructions that implement the required
-semantics, operations without side effects whose result is not used are not
+semantics. Operations without side effects whose result is not used are not
 emitted. Guard instructions are transformed into fast checks at the machine
 code level that verify the corresponding condition.  In cases where the value being
 checked by the guard is not used anywhere else, the guard and the operation
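
The two passes can be sketched with a toy model (not the actual backend): a backwards pass records the last use of every IR variable, and a forward pass assigns registers, returning one to the free pool as soon as the variable's live range ends. Register names, the spilling shortcut, and the trace format are invented for the sketch.

    # Hypothetical sketch of the backend's two passes over a trace.
    def compute_last_uses(trace):
        """Backwards pass: index of the last instruction using each variable."""
        last_use = {}
        for i in range(len(trace) - 1, -1, -1):
            _result, _op, args = trace[i]
            for v in args:
                last_use.setdefault(v, i)
        return last_use

    def emit(trace, inputs):
        """Forward pass: allocate registers and free them after the last use."""
        last_use = compute_last_uses(trace)
        free = ['r3', 'r2', 'r1', 'r0']
        in_reg = {v: free.pop() for v in inputs}   # trace arguments live on entry
        code = []
        for i, (result, op, args) in enumerate(trace):
            regs = [in_reg[v] for v in args]
            for v in args:                 # live range ends here: recycle register
                if last_use[v] == i and v in in_reg:
                    free.append(in_reg.pop(v))
            in_reg[result] = free.pop()    # a real allocator would spill if empty
            code.append((op, in_reg[result], regs))
        return code

    trace = [('v2', 'int_add', ('v0', 'v1')),
             ('v3', 'int_mul', ('v2', 'v0'))]
    print(emit(trace, inputs=['v0', 'v1']))
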
@@ -536,28 +538,29 @@
 ...
     \end{lstlisting}
   \end{minipage}
-  \caption{Separated and merged compilation of operations and guards}
+  \caption{Result of separated (left) and merged (right) compilation of operations and guards (top).}
   \label{fig:trace-compiled}
 \end{figure}
 
-Each guard in the IR has attached to it a list of the IR-variables required to
+Attached to each guard in the IR is a list of the IR-variables required to
 rebuild the execution state in case the trace is left through
 the guard. When a guard is compiled, in addition to the
-condition check two things are generated/compiled. First a special data
-structure called \emph{low-level resume data} is created that encodes the
-information provided by the register allocator about where the values
-corresponding to each IR-variable required by the guard will be stored when
-execution reaches the code emitted for the corresponding guard. This data
-structure stores the data in a compressed manner using an encoding that uses
-8bits to store 7bits of information. This encoding is efficient to create and
+condition check, two things are generated/compiled.
+
+First, a special data structure called \emph{low-level resume data} is created.
+This data structure encodes where, i.e. in which register or stack location,
+the value of each IR-variable required to rebuild the state will be stored
+when execution reaches the guard.
+The values are stored in a succinct manner using an encoding that uses
+8 bits to store 7 bits of information, ignoring leading zeros.
+This encoding is efficient to create and
 provides a compact representation of the needed information,
 helping to maintain an acceptable memory profile.
 
 Second, for each guard a piece of code is generated that acts as a trampoline.
-Guards are implemented as a conditional jump to this trampoline. In case the
-condition checked in the guard fails
-execution jumps to the corresponding trampoline. In the trampoline the pointer 
to the
-\emph{low-level resume data} is loaded and jumps to generic bailout handler, 
also known as compensation code,
+Guards are implemented as a conditional jump to this trampoline, taken in case
+the checked condition fails.
+In the trampoline the pointer to the \emph{low-level resume data} is loaded
+and execution jumps to a generic bailout handler, also known as compensation code,
 that is used to leave the compiled trace in case of a guard failure.
 
 Using the encoded location information the bailout handler reads from the
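
The text does not spell out the exact byte format, so the following Python sketch is only an assumed illustration of "8 bits storing 7 bits, ignoring leading zeros": a varint-style encoding where each byte carries 7 payload bits plus a continuation bit, so small register or stack location numbers fit in a single byte.

    # Assumed illustration of a 7-bits-per-byte encoding; the concrete
    # format used by the backend may differ.
    def encode(value):
        """Encode a non-negative integer, 7 bits per byte, high bit = more."""
        out = bytearray()
        while True:
            byte = value & 0x7F
            value >>= 7
            if value:
                out.append(byte | 0x80)   # more bytes follow
            else:
                out.append(byte)          # leading zero bytes never emitted
                return bytes(out)

    def decode(data):
        value = shift = 0
        for byte in data:
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        return value

    assert decode(encode(5)) == 5 and len(encode(5)) == 1
    assert decode(encode(300)) == 300 and len(encode(300)) == 2
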
@@ -570,7 +573,7 @@
 As in previous sections the underlying idea for the design of guards is to have
 a fast on-trace profile and a potentially slow one in the bailout case where
 the execution takes one of the side exits due to a guard failure. At the same
-time the data stored in the backend needed to rebuild the state needs to be as
+time the data stored in the backend, required to rebuild the state, should be as
 compact as possible to reduce the memory overhead produced by the large number
 of guards. The numbers in Figure~\ref{fig:backend_data} illustrate that the
 compressed encoding currently has about 15\% to 25\% of the size of the
@@ -578,9 +581,10 @@
 
 As explained in previous sections, when a specific guard has failed often enough
 a new trace, referred to as a \emph{bridge}, starting from this guard is recorded and
-compiled. When compiling bridges the goal is that future failures of the guards
-that led to the compilation of the bridge should execute the bridge without
-additional overhead. In particular the failure of the guard should not lead
+compiled.
+Since the goal of compiling bridges is to improve execution speed on the
+diverged path (the failing guard), they should not introduce additional overhead.
+In particular the failure of the guard should not lead
 to leaving the compiled code prior to executing the code of the bridge.
 
 The process of compiling a bridge is very similar to compiling a loop.
@@ -595,7 +599,7 @@
 original loop up to the guard. This means that no register/stack reshuffling is
 needed before executing a bridge.
 
-Once the bridge has been compiled the guard that led to compiling the bridge is
+Once the bridge has been compiled the corresponding guard is
 patched to redirect control flow to the bridge in case the check fails. In
 the future, if the guard fails again it jumps to the code compiled for the bridge
 instead of bailing out. Once the guard has been compiled and attached to the
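
Patching can be pictured as rewriting the displacement of the guard's conditional jump inside the already-emitted machine code. The Python sketch below fakes this with a byte buffer and a little-endian 32-bit relative offset; the buffer layout and offsets are invented for illustration.

    import struct

    # Hypothetical sketch: retarget a guard's conditional jump in place so
    # that future failures run the bridge instead of the trampoline.
    code = bytearray(64)          # stands in for executable memory
    OPERAND_OFFSET = 10           # made-up position of the guard's rel32 operand

    def patch_jump_target(code, operand_offset, target):
        """Overwrite the 32-bit displacement inside the emitted code."""
        next_insn = operand_offset + 4    # displacement is relative to next insn
        struct.pack_into('<i', code, operand_offset, target - next_insn)

    patch_jump_target(code, OPERAND_OFFSET, 0x30)  # initially: the trampoline
    patch_jump_target(code, OPERAND_OFFSET, 0x50)  # after the bridge is compiled
    print(struct.unpack_from('<i', code, OPERAND_OFFSET)[0])  # 66 == 0x50 - 14
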
@@ -671,7 +675,7 @@
   \item Guard failures are local and rare.
 \end{itemize}
 
-All figures in this section do not take garbage collection of machine code 
into account. Pieces
+The measurements presented in this section do not take garbage collection of machine code into account. Pieces
 of machine code can be globally invalidated or just become cold again. In both
 cases the generated machine code and the related data is garbage collected. The
 figures show the total amount of operations that are evaluated by the JIT and
@@ -731,7 +735,7 @@
 resume data combined and being compressed as described before.
 
 Tracing JIT compilers only compile the subset of the code executed in a program
-that is traced in a hot loop, for this reason the amount of generated machine
+that occurs in a hot loop; for this reason the amount of generated machine
 code will be smaller than in other just-in-time compilation approaches.  This
 creates a larger discrepancy between the size of the \texttt{resume data} when
 compared to the size of the generated machine code and illustrates why it is important to compress the \texttt{resume data} information.
@@ -755,7 +759,7 @@
 requires 18.3\% to 31.1\% of the space compared to a naive approach. This
 shows that large parts of the resume data are redundant and can be stored more
 efficiently through using the techniques described above. On the other hand
-comparing the results to the xz compression which only requires between 17.1\%
+comparing the results to the xz compression which only needs between 17.1\%
 and 21.1\% of the space required by our compression shows that the compression
 is not optimal but a trade-off between the required space and the time needed
 to build a good compressed representation of the resume data for the
@@ -765,15 +769,15 @@
 \label{sub:guard_failure}
 The last point in this discussion is the frequency of guard failures.
 Figure~\ref{fig:failing_guards} presents for each benchmark a list of the
-relative amounts of guards that ever fail and of guards that fail more than 200
-times.\footnote{
-    The threshold of 200 is rather high. It was picked experimentally to give
+relative amounts of guards that ever fail and of guards that fail often enough that a bridge is compiled.\footnote{
+    The threshold used is 200 failures. This rather high threshold was picked experimentally to give
     good results for long-running programs.
 }
-As described before, for guards that fail more than 200 times, a trace
-is recorded that starts from the guard. Afterwards the guard is patched so 
that later
-failures execute the new trace instead of jumping to the trampoline and 
returning to the interpreter. Hence the
-numbers presented for guards that fail more than 200 times represent the 200
+
+After the guard is patched, subsequent
+failures execute the new bridge instead of jumping to the trampoline and returning to the interpreter. Hence the
+numbers presented for guards that have a bridge represent the
 failures up to the compilation of the bridge and all executions of the
 subsequently attached bridge.
 
@@ -787,7 +791,7 @@
 of all the guards in the optimized traces ever fail. This amount varies between
 2.4\% and 5.7\% of all guards. As can be expected, even fewer guards fail often
 enough that a bridge is compiled for them: only 1.2\% to 3.6\% of all guards
-fail more than 200 times. Also of all failing guards a few fail extremely often
+fail often enough that a bridge is compiled. Also, of all failing guards a few fail extremely often
 and most fail rarely. The results emphasize that, as most of the guards never
 fail, it is important to make sure that the successful execution of a guard does
 not have unnecessary overhead.
@@ -899,27 +903,27 @@
 In this paper we have concentrated on guards, an operation typically found in
 tracing just-in-time compilers and used to denote points of possible control
 flow divergence in recorded traces.
-We described how, based on the observation that guards are a frequent operation
-in traces and that they do not fail often, guards have been implemented in the
+Based on the observation that guards are a frequent operation in traces and
+that they do not fail often, we described how they have been implemented in the
 high and low level components of RPython's tracing JIT compiler.
 
 Finally we have presented experimental data collected using the standard PyPy
 benchmark set to evaluate previous observations and assumptions. Our
-experiments showed that, as previously assumed, guards are a very common
+experiments confirmed that guards are a very common
 operation in traces. At the same time guards are associated with a high
 overhead, because for all compiled guards information needs to be
 stored to restore the execution state in case of a bailout. The measurements
 showed that the compression techniques used in PyPy effectively reduce the
-overhead of guards, while it still produces a significant overhead. The results
+overhead of guards, though a significant overhead remains. The results
 also showed that guard failure is a local event: there are few
 guards that fail at all, and even fewer that fail very often.
 These numbers validate the design decision of reducing the overhead of
 successful guard checks as much as possible while paying a higher price in the
 case of bailout due to having to decode the compressed state representation.
-The compressed state representation is reduces the memory footprint of rarely
+The compressed state representation reduces the memory footprint of rarely
 used data.
 
-Based on the observation that most guards do not fail very often or at all it
+Based on the observation that guard failure is rare, it
 would be worth exploring whether a more aggressive compression scheme for guards
 would justify the memory savings in contrast to the increased decoding
 overhead. Based on the same observation we would like to explore the concept of