On 08/09/2013 05:27 PM, Jim Blandy wrote:
> On 08/09/2013 04:29 PM, Nicolas B. Pierron wrote:
>> The goal of the tainting is to reconstruct the inverted data-flow graph,
>> i.e. to find the origin of a string that flows into a function. And the
>> data-flow graph is basically what we monitor when we register that new
>> values can be seen flowing into a store at a specific code location.
>> I think that if we want to capture this kind of information, we should at
>> least do it in such a way that we can also use it to improve our
>> performance. If we are able to isolate the data flow, we could optimize
>> our data representation based on guarded invariants of the data flow
>> (dynamic deforestation?), and with the support of a moving GC, we could
>> optimize/deoptimize the value representation on GCs. [to be seen as a JIT
>> compiler for the data flow instead of only having JIT compilers for the
>> control flow]
> It's true that, in principle, the flow graph the compiler uses and the flow
> graph taint analysis uses are the same. But in practice they're very different.
>
>  * Taint is concerned with flow *through* string primitives:
>    concatenation, substring, regexp match extraction, and so on. The
>    compiler doesn't know much about those operations, and so is only
>    concerned with getting them their arguments, and delivering their
>    results to the right place. It doesn't relate their inputs to their
>    outputs.
This is a problem of instrumentation, and it would still exist even with
tainting. Also, as I mentioned to Ivan, monitoring strings is an
approximation, as a string might be given to JSON.parse or converted into an
Array/TypedArray.
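
To make that approximation concrete, here is a minimal standalone C++ sketch
(all names are invented for illustration; none of this is engine code):
concatenation and substring relate their inputs to their outputs by unioning
origin sets, while a JSON.parse-like decoding step returns a non-string value
and silently drops the labels.

  #include <cstddef>
  #include <cstdlib>
  #include <set>
  #include <string>
  #include <vector>

  struct TaintedString {
      std::string chars;
      std::set<int> origins;  // ids of the sources this string flowed from
  };

  // Concatenation relates inputs to output: the result carries both origin sets.
  TaintedString concat(const TaintedString& a, const TaintedString& b) {
      TaintedString r{a.chars + b.chars, a.origins};
      r.origins.insert(b.origins.begin(), b.origins.end());
      return r;
  }

  // Substring keeps the origins of the string it was cut from.
  TaintedString substring(const TaintedString& s, std::size_t begin, std::size_t len) {
      return {s.chars.substr(begin, len), s.origins};
  }

  // A JSON.parse-like step: the output is no longer a string, so a
  // string-only analysis stops seeing the flow at this point.
  std::vector<double> parseNumbers(const TaintedString& s) {
      std::vector<double> out;
      const char* p = s.chars.c_str();
      char* end = nullptr;
      for (double v = std::strtod(p, &end); end != p; v = std::strtod(p, &end)) {
          out.push_back(v);
          p = end;
      }
      return out;  // the origin set is dropped here
  }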
>  * Taint needs to dynamically observe the flow of values in specific
>    actual executions. If a particular branch isn't taken, then the
>    not-executed code shouldn't affect taint results. But the compiler
>    needs to reach conservative conclusions that hold on all possible
>    executions.
This would be true in the case of a static compiler, but in the case of a
dynamic compiler, we can omit information based on the monitored flow. In
fact, TI already restricts the possible types to the observed types, and it
is for this precise reason that we need to insert type barriers in
IonMonkey's code, when the set of observed types is not equal to the upper
bound computed by the type inference.
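
As a rough sketch of that mechanism (not IonMonkey's actual code; names and
types are invented for illustration): the compiled code assumes only the set
of observed types, and the barrier catches any value outside of it so that
the set can be widened and the optimized code invalidated.

  #include <set>

  enum class Type { Int32, Double, String, Object };

  struct TypeSet {
      std::set<Type> observed;  // types seen so far at this code location
      bool has(Type t) const { return observed.count(t) != 0; }
  };

  // Inserted where the observed types are a strict subset of the inferred
  // upper bound: a new incoming type invalidates the optimized assumptions.
  bool typeBarrier(TypeSet& site, Type incoming) {
      if (site.has(incoming))
          return true;                // fast path: the assumption still holds
      site.observed.insert(incoming); // record the new type...
      return false;                   // ...and request a bailout/recompilation
  }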
> What you propose would require substantial contributions from a group of
> engineers (IonMonkey and GC hackers) that is in high demand; it's hard for
> me to imagine taint support becoming a sufficient priority for that team -
> especially since it's an unproven approach. In contrast, the taint analysis
> I brought up here has been prototyped and shown to be valuable, and is
> within reach of a volunteer (Ivan) from the security team.
One of the reasons why I would prefer us to depend on such information is
that our focus is on performance. If a bug or an incorrect value appeared
in the analysis, it would likely surface as a performance issue or as
incorrect behavior. The reason I want a performance justification for doing
this analysis is that we would then rely on it ourselves, and keep making
it better.
As a side note: we currently maintain, conditionally, an artificial stack
alongside the Interpreter, Baseline, and IonMonkey. This stack is only used
by the Gecko profiler. Worse, tbpl does not even run tests on the JS engine
to ensure that we keep it in a correct shape. So using the information
collected by this profiling could be helpful in many ways, such as finding
functions which are worth keeping across GCs.
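
For readers who have not seen it, the idea looks roughly like the sketch
below (illustrative names, not the actual profiler types): entries are
pushed and popped alongside the real frames, and the sampler reads the
stack asynchronously, which is why the depth is published last.

  #include <atomic>
  #include <cstdint>

  struct PseudoStackEntry {
      const char* label;   // e.g. the function name
      const void* script;  // identifies the executing script, if any
  };

  struct PseudoStack {
      static const uint32_t kMaxEntries = 1024;
      PseudoStackEntry entries[kMaxEntries];
      std::atomic<uint32_t> depth{0};

      void push(const char* label, const void* script) {
          uint32_t d = depth.load(std::memory_order_relaxed);
          if (d < kMaxEntries)
              entries[d] = {label, script};
          // The sampler keys off the depth, so publish it after the entry.
          depth.store(d + 1, std::memory_order_release);
      }

      void pop() {
          depth.store(depth.load(std::memory_order_relaxed) - 1,
                      std::memory_order_release);
      }
  };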
Another example is the type inference. Currently we collect a lot of
information which is valuable for the developer tools. Sadly, we do not
have a well-detailed API to make it usable outside the engine. But the
fact that we rely on it ourselves ensures that the type information we see
is better than what any static analysis tool would produce.
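
Purely as a hypothetical shape for such an API (to be clear: nothing like
this exists in the engine today, and all names are made up), the observed
sets could be keyed by (script, bytecode offset) and queried by the
devtools:

  #include <cstdint>
  #include <map>
  #include <set>
  #include <string>
  #include <utility>

  using TypeName = std::string;
  using SiteKey = std::pair<const void*, uint32_t>;  // (script, bytecode offset)

  struct ObservedTypeInfo {
      std::map<SiteKey, std::set<TypeName>> observed;

      // Filled in by the engine as values are monitored.
      void record(SiteKey site, const TypeName& t) { observed[site].insert(t); }

      // Queried by the devtools: what actually flowed through this site?
      const std::set<TypeName>& query(SiteKey site) { return observed[site]; }
  };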
--
Nicolas B. Pierron