[PATCH] D114231: [clang][docs][dataflow] Added an introduction to dataflow analysis

Dmitri Gribenko via Phabricator via cfe-commits Mon, 29 Nov 2021 05:51:15 -0800

gribozavr2 added inline comments.


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:24
+*   [Introduction to Dataflow 
Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE)
+*   [Introduction to abstract 
interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf).
+*   [Introduction to symbolic 
execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf).
----------------
xazax.hun wrote:
> I know, everyone has their favorite set of resources. Let me share mine: 
> `https://cs.au.dk/~amoeller/spa/`
> I think this one might have a bit more content with more examples. Feel free 
> to leave it as is. 
Forgot about this book -- I also like it a lot, added!


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:49
+  int x;
+  // x is {}
+  if (n > 0) {
----------------
xazax.hun wrote:
> Nit: I wonder if representing the value of `x` with an empty set is a good 
> choice. One could argue that `x` has an indeterminate value that could be 
> represented with the set of all integers using top. Others could argue for a 
> bottom value. Those concepts are not yet introduced. Maybe initializing `x` 
> in the source code would side step this problem.
Since there is a discussion of these topics below, I agree that avoiding the 
question at this point is better. Added initialization.


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:71-72
+Abstract algebra provides a nice formalism that models this type of structure,
+namely, a lattice. A lattice is a partially ordered set, in which every two
+elements have a least upper bound (called a *join*).
+
----------------
xazax.hun wrote:
> Technically, I think this would be a `join-semilattice`. To have a `lattice`, 
> we would need both `join` and `meet`.
Corrected.


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:86-87
+
+Computing the join in the lattice corresponds to finding the lowest common
+ancestor between two nodes in its Hasse diagram.
+
----------------
xazax.hun wrote:
> If we want to be practical, I wonder if we want to give some guidance how to 
> implement that efficiently. E.g. we could cite a paper like this: 
> https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.106.4911&rep=rep1&type=pdf
Thanks, added.


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:111-112
+information that we track. In this case, we can, for example, arbitrarily limit
+the size of sets to 3 elements. If at a certain program point `x` has more than
+3 possible values, we stop tracking specific values at that program point.
+Instead, we denote possible values of `x` with the symbol `⊤` (pronounced "top"
----------------
xazax.hun wrote:
> I wonder if we want to cite widening and narrowing in this context to give 
> people some pointers where to look for more info.
I think it should be a separate section, with a slightly different example. To 
show the value of widening we need a lattice that can abstract the infinite set 
`{0; 1; 2; ...}` more precisely than "⊤" (e.g., "non-negative"). There would be 
also the usual handwaving about figuring out how to make those types of 
conclusions in the general case. I think adding all that here would break the 
flow and make the explanation less crisp. Adding a separate section on widening 
would be good one day, when the dataflow framework gets the feature. I don't 
think it is necessary right now.


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:182
+void Example(int n) {
+  int x;          // x is ⊥
+  if (n > 0) {
----------------
xazax.hun wrote:
> I think it would make sense to include the whole program state in the 
> comments. E.g. `// x is ⊥, n is ⊤`.
> This could showcase how we constrain the value of `n` to `42` in one of the 
> branches.
So far we have been focusing on figuring out the values of `x` the local 
variable, not `n` the parameter, so I'd rather not deviate. Also, comparing `n` 
to 42 wouldn't necessarily lead to constraining the set of values here (for 
example, in the framework we do it in the flow condition).


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:279
+```
+out = transfer(basic_block, join(in_1, in_2, ..., in_n))
+```
----------------
xazax.hun wrote:
> While I agree that this is the general formula, people sometimes do small 
> variations, e.g.:
> ```
> out =  join(transfer(basic_block,in_1), transfer(basic_block,in_2), ..., 
> transfer(basic_block,in_n))
> ```
> 
> This is less efficient as we end up invoking the transfer function more 
> often, but it can be more precise. E.g. with some ad-hoc notation:
> 
> ```
> Given the branches: x: {1}, x: {2}, x: {3}, x: {4}
> join(in...) : x : {⊤}
> transfer("x/2", ...) == x : {⊤}
> 
> vs.
> Given the branches: x: {1}, x: {2}, x: {3}, x: {4}
> transfer("x/2", ...) ==x : {0}, x : {1}, x : {1}, x: {2} == outs
> join(outs) == x: {0, 1, 2}
> ```
This section is only attempting to give the readers an intuition of why 
dataflow works, hence it must be straightforward, not necessarily exhaustive or 
rigorous. I tried to add your suggestion here, but it breaks the flow. I tried 
to add it at the end of the section, but it looks out of place. So, sorry, I 
couldn't fit it here. If you have a specific wording suggestion that works 
here, I would be happy to apply it.

I think one would need to add a section like "variations on dataflow equations" 
to properly introduce the idea. It also seems to me that this specific 
variation is also a special case of a more general idea of deferring join 
operations to improve precision; that is, instead of doing a join everywhere 
where classic dataflow prescribes it, we can instead keep exploring separate 
paths, and only do a join at some later point in the CFG. Similarly we can also 
unroll loops in a precise way up to a certain number of iterations, and attempt 
to join/widen only at after that. These are of course important ideas, but they 
don't help with what this document is trying to achieve.


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:288
+Since the lattice has a finite height, the algorithm is guaranteed to 
terminate.
+Each iteration of the algorithm can change computed values only to larger 
values
+from the lattice. In the worst case, all computed values become `⊤`, which is
----------------
xazax.hun wrote:
> Only if our transfer functions are monotonic :)
That's an important condition, added.


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:299
+
+## Symbolic execution: a very short informal introduction
+
----------------
xazax.hun wrote:
> I think many people find it confusing, what is the relationship between 
> symbolic execution and abstract interpretation? I found this answer helpful: 
> https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation
> I wonder if it might be worth citing.
So far I have avoided using the term "abstract interpretation" in this 
document, or going into more deep topics. However I read this answer and I like 
it too, so I added a link to it here.


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:365
+void ExampleOfSymbolicPointers(bool b) {
+  int x = 0;     // x is {0}
+  int* ptr = &x; // x is {0}
----------------
xazax.hun wrote:
> Here, I think we want to include `ptr` in the comments describing the program 
> state. 
Added.


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:381-382
+
+Output parameters are function parameters of pointer or reference type whose
+pointee is completely overwritten by the function. They are common in pre-C++11
+code due to the absence of move semantics. In modern C++ output parameters are
----------------
xazax.hun wrote:
> One could argue that you also cannot read from the pointee, otherwise you 
> have an input-output parameter. There is an example clarifying this below, 
> but I wonder if we want to get the definition right from the beginning.
Corrected.


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:469
+
+*   Prove that the function only borrows the pointer and does not persist it.
+
----------------
xazax.hun wrote:
> While I do agree we cannot explain everything, like what borrowing or 
> escaping means in this document, I wonder if we want to add some citations.
I reworded to avoid using the term "borrow".


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:478-479
+
+To analyze the function body we can use a lattice which consists of normal
+states and failure states. A normal state describes program points where we are
+sure that no behaviors that block the refactoring have occurred. Normal states
----------------
xazax.hun wrote:
> I wonder if the distinction between normal states and failure states is 
> useful. In general, we can combine arbitrary number of lattices and propagate 
> all the information in a single pass. I.e., we could have multiple "normal" 
> or "failure" states.
> 
> There are multiple ways to combine lattices, we can put them on top of each 
> other, or next to each other introducing new top/bottom values, or we can 
> take their products.
> I wonder if the distinction between normal states and failure states is 
> useful.

I'm not sure I understand -- this distinction is useful in this particular 
approach to solve this problem, since it helps solve the problem? Or are you 
objecting to the term "failure"?

Of course, in general, an analysis does not need to have failure/normal states, 
and like you said, if we track information about multiple candidate output 
parameters at the same time, each can be in either a normal or failure state at 
every program point independently of other parameters. However, this document 
provides an example of a solution for this particular problem; the goal is not 
to solve the problem, but to give the reader an intuition of how dataflow ideas 
can be applied to solve real problems.


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:570
+
+The implementation of dead store analysis is very similar to output parameter
+analysis: we need to track stores and loads, and find stores that were never
----------------
xazax.hun wrote:
> It might be a good idea to reference liveness analysis which is often used to 
> answer many related questions:
> * Finding dead stores
> * Finding uninitialized variable
> * Finding a good point to deallocate memory
> * Finding out if it would be safe to move an object
> * ... 
Good point, added a reference.


================
Comment at: clang/docs/DataFlowAnalysisIntro.md:808
+To implement it using the data flow analysis framework, we can produce a 
warning
+if any part of the branch condition is implied by the path condition.
+
----------------
xazax.hun wrote:
> Here, I think the term `path condition` becomes ambiguous. Traditionally, 
> path condition is a path sensitive property, in the code snippet below:
> ```
> if (b1)
> { ... }
> 
> if (b1 && b2)
> { ... }
> ```
> 
> We would reach the second branch on two paths (depending on whether we took 
> the first one). On one of the paths, the path condition will have `b1`, on 
> the other, it will have `!b1`.
> 
> I think here you mean the non-path-sensitive property. I'd suggest doing a 
> strict distinction and use path condition for path sensitive properties and 
> flow condition for flow sensitive properties.
Replaced with "flow condition".


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114231/new/

https://reviews.llvm.org/D114231

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D114231: [clang][docs][dataflow] Added an introduction to dataflow analysis

Reply via email to