On Thu, 2023-03-16 at 09:54 +0100, Pierrick Philippe wrote:
> On 15/03/2023 17:26, David Malcolm wrote:
> > On Wed, 2023-03-15 at 16:24 +0100, Pierrick Philippe wrote:
[...snip...]
> >
> >
> > An ana::svalue is a pattern of bits (possibly symbolic, such as
> > "the
> > constant 42" or "the initial value of param x")
> >
> > An ana::region is a description of a memory location for
> > reads/writes
> > (possibly symbolic, such as "local var x within this frame", or
> > "the
> > block reached by dereferencing the initial value of ptr in the
> > frame
> > above").
> >
> > Sorry if I've misread things; I'll try to answer the rest of the
> > email
> > as best as I can...
> >
> > > So, first question: is there any way to associate and track the
> > > state
> > > of
> > > a rvalue, independently of its lvalue?
> > >
> > > To try to clarify the question, here's an example:
> > >
> > > '''
> > > int __attribute__("__some_attribute__") x = 42;
> > > /* STEP-1
> > > From now on, we consider the state of x as being marked by
> > > some_attribute.
> > > But in fact, in the log, we can observe that we'll have something
> > > like
> > > this in the new '/ana::program_state/':
> > > {0x4b955b0: (int)42: marked_state (‘x’)} */
> > Yes: sm-state is associated within a program_state with an svalue,
> > in
> > this case with the constant 42.
> >
> > There isn't yet a way to instead give sm-state to "x" itself. I
> > suppose you could give sm-state to &x (the pointer that points to x
> > is
> > an instance of region_svalue, which is an svalue), but I haven't
> > tried
> > this approach myself.
> >
> > > int *y = &x;
> > > /* STEP-2
> > > For analysis purpose, you want to know that from now on, y is
> > > pointing
> > > to marked data.
> > > So you set state of the LHS of the GIMPLE statement (i.e. some
> > > ssa_name
> > > instance of y) accordingly, with a state named 'points-
> > > to_marked_data'
> > > and setting 'x' as the origin of the state (in the sense of the
> > > argument
> > > /origin/ from '/ana::sm_context::on_transition/'.
> > > What we now have in the new '/ana::program_state/' is this:
> > > {0x4b9acb0: &x: points-to-marked_data (‘&x’) (origin: 0x4b955b0:
> > > (int)42
> > > (‘x’)), 0x4b955b0: (int)42: marked_state (‘x’)} */
> > Yes: you've set the state on the svalue "&x", not on "y".
> >
> > > int z = *y;
> > > /* STEP-3
> > > Now you have implicit copy of marked data, and you want to report
> > > it.
> > > So you state the LHS of the GIMPLE statement (i.e. some ssa_name
> > > instance of z) as being marked, with 'y' as the origin.
> > > What we now have in the new '/ana::program_state/' is this:
> > > {0x4b9acb0: &x: points-to-marked_data (‘&x’) (origin: 0x4b955b0:
> > > (int)42
> > > (‘x’)), 0x4b955b0: (int)42: marked_state (‘x’)} */
> > > '''
> > Presumably the program_state also shows that you have a binding for
> > the
> > region "z", containing the svalue 42 (this is represented within
> > the
> > "store", within the region_model within the program_state).
>
> Indeed, there is a binding for region "z" to the svalue 42 within the
> new program state.
>
> > > In STEP-2:
> > >
> > > We lost the information saying that the rvalue of y (i.e. y), is
> > > pointing to marked data.
> > > Only remain the information that its lvalue (i.e. &x), is
> > > pointing to
> > > marked data, which is of course true.
> > Note that the analyzer by default attempts to purge state that it
> > doesn't think will be needed anymore, so it may have purged the
> > state
> > of "y" if y isn't going to be read from anymore. You could try -
> > fno-
> > analyzer-state-purge to avoid this purging.
>
> Nothing changes when I run it with the -fno-anlyzer-state-purge,
> it is still the state of &x which is tracked.
>
> > > In STEP-3:
> > >
> > > No information is added regarding z in the new program_state.
> > > In fact, if one have a closer look at the log, we see that the
> > > LHS of
> > > the GIMPLE statement (the ssa_name instance of z), is already in
> > > the
> > > state of 'marked_data'.
> > > Through the log to the call to '/sm_context::on_transition/' this
> > > can
> > > be
> > > seen:
> > > marked_sm: state transition of ‘z_5’: marked_data ->
> > > marked_data
> > >
> > > All of this step is somehow leading to confusing diagnostic.
> > > For example, you miss the fact that 'z' is being marked, because
> > > no
> > > event happen as it is somehow aliasing 'x' svalue.
> > > Though it might not be true in case of missed optimization.
> > >
> > > Of course, if anyone wants to have a look at my code, I can
> > > provide
> > > it
> > > to you (it is not yet publicly available at the moment).
> > I think I'd have to look at the code to be of more help; I confess
> > that
> > I stopped understanding somewhere around step 3, sorry.
> No worries, the idea is that regarding the state_map, z_3 is already
> considered as marked.
> > Are you able to post a simple example of the code you'd like to
> > analyze, with an idea of the behavior you'd like from the analyzer?
> > (e.g. the state transitions)
>
> Sure here is an exemple:
>
> '''
> int __attribute__("__some_attribute__") x = 42;
> // Transition state for x: start -> marked
> int *y = &x;
> // Transition state for y: start -> points-to_marked
> int z = *y;
> // Transition state for z: start -> marked
> if (z);
> // Diagnostic is issued because of the use of a marked data in a
> condition, for analysis purpose.
> '''
>
> The issued diagnostic I do have in mind looks like:
>
> '''
> 1 | int __attribute__("__some_attribute__") x = 42;
> | ^
> | |
> | (1) ‘x’ gets marked here
> 2 | int *y = &x;
> | ~
> | |
> | (2) ‘y’ points-to marked data here
> 3 | int z = *y;
> | ~
> | |
> | (3) ‘z’ gets marked here
> 4 | if (z);
> | ~
> | |
> | (4) used of a marked value 'z' in a condition
> '''
>
> What I get from now on is this:
>
> '''
> 1 | int __attribute__((__taint__)) x = 42;
> | ^
> | |
> | (1) ‘x’ gets
> tainted here
> |......
> 4 | if(z) return 666;
> | ~
> | |
> | (2) use of tainted value ‘z’ in a condition
> '''
If I'm reading things right, it looks like the analyzer is correctly
complaining that the marked svalue is being used in a condition, but
the problem is that the events in the diagnostic_path aren't very good:
it's missing events (2) and (3) from the "what you had in mind"
diagnostic.
If I'm right, then the issue here seems to be more about how we
generate the diagnostic_path (the list of event messages displayed for
the diagnostic) when we emit the saved_diagnostic, rather than about
the state-tracking.
Unfortunately, when I wrote that code I was primarily looking at state
machines for tracking allocation state (e.g. double-free), and it
doesn't work well with other kinds of state machine. I've experimented
with other approaches that might work better. I think ultimately we
want to be able to walk backwards from a saved_diagnostic and be able
to answer the question "why did it get into this state?", which might
involve following assignments backwards. I have a semi-working
prototype of this, but it would be a lot of work to get ready for trunk
(it tries to record a set of reversible "change" objects, so that we
can walk backwards from the diagnostic, undoing changes to see where a
value came from).
>
> And for some reason, if I change the analyzed code to this:
>
> '''
> int __attribute__("__some_attribute__") x = 42;
> int *y = &x;
> x = 6; // New line
> int z = *y;
> if (z);
> '''
>
> This is what I get:
>
> '''
> 2 | int __attribute__((__taint__)) x = 42;
> | ^
> | |
> | (1) ‘x’ gets
> marked
> here
> 3 | int * y = &x;
> | ~
> | |
> | (2) ‘&x’ points to tainted data here
> 4 | x = 6;
> 5 | int z = *y;
> | ~
> | |
> | (3) ‘x’ gets tainted here
> 6 | if(z) return 666;
> | ~
> | |
> | (4) use of tainted value ‘z’ in a condition
That event (3) looks bogus; looks like a bug in how we generate events
(the existing version of the code I was talking about above).
> '''
>
> Sorry for the long mail here,
No worries; sorry about the bugs in the analyzer :/
> Let me know if you want to have a look at the exploded graph or at
> the
> analyzer's log.
Does the above help?
Dave