Re: On IPA-PTA field sensitivity and pointer expressions

Richard Biener via Gcc Fri, 25 Sep 2020 05:51:42 -0700

On Fri, Sep 25, 2020 at 2:27 PM Erick Ochoa
<erick.oc...@theobroma-systems.com> wrote:
>
>
>
> On 25/09/2020 13:30, Richard Biener wrote:
> > On Fri, Sep 25, 2020 at 9:05 AM Erick Ochoa
> > <erick.oc...@theobroma-systems.com> wrote:
> >>
> >> Hi,
> >>
> >> I am working on an alias analysis using the points-to information
> >> generated during IPA-PTA. If we look at the varmap varinfo_t array in
> >> gcc/tree-ssa-struct.c, most of the constraint variable info structs
> >> contain a non-null decl field which points to a valid tree in gimple
> >> (which is an SSA variable and a pointer). I am trying to find out a way
> >> to obtain points-to information for pointer expressions. By this, the
> >> concrete example I have in mind is answering the following question:
> >>
> >> What does `astruct->aptrfield` points to?
> >>
> >> Here I have a concrete example:
> >>
> >>
> >> #include <stdlib.h>
> >>
> >> struct A { char *f1; struct A *f2;};
> >>
> >> int __GIMPLE(startwith("ipa-pta"))
> >> main (int argc, char * * argv)
> >> {
> >>     struct A * p1;
> >>     char * pc;
> >>     int i;
> >>     int _27;
> >>
> >>     i_15 = 1;
> >>     pc = malloc(100); // HEAP(1)
> >>     p1 = malloc (16); // HEAP(2)
> >>     p1->f1 = pc;
> >>     p1->f2 = p1;
> >>     _27 = (int) 0;
> >>     return _27;
> >> }
> >>
> >>
> >> Will give the following correct points-to information:
> >>
> >> HEAP(1) = { }
> >> HEAP(2) = { HEAP(1) HEAP(2) }
> >> pc_30 = { HEAP(1) }
> >> p1_32 = { HEAP(2) }
> >>
> >> However, there does not seem to be information printed for neither:
> >>
> >>     p1->f1
> >>     p1->f2
> >>
> >> which I would expect (or like) something like:
> >>
> >>     p1_32->0 = { HEAP(1) }
> >>     p1_32->64 = { HEAP(2) }
> >>
> >> Looking more closely at the problem, I found that some varinfo_t have a
> >> non-null "complex" field. Which has an array of "complex" constraints
> >> used to handle offsets and dereferences in gimple. For this same gimple
> >> code, we have the following complex constraints for the variable p1_32:
> >>
> >> main.clobber = p1_32 + 64
> >> *p1_32 = pc_30
> >> *p1_32 + 64 = p1_32
> >
> > The issue is that allocated storage is not tracked field-sensitive since
> > we do not know it's layout at the point of allocation (where we allocate
> > the HEAP variable).  There are some exceptions, see what we do
> > for by-reference parameters in create_variable_info_for_1:
> >
> >        if (vi->only_restrict_pointers
> >            && !type_contains_placeholder_p (TREE_TYPE (decl_type))
> >            && handle_param
> >            && !bitmap_bit_p (handled_struct_type,
> >                              TYPE_UID (TREE_TYPE (decl_type))))
> >          {
> >            varinfo_t rvi;
> >            tree heapvar = build_fake_var_decl (TREE_TYPE (decl_type));
> >            DECL_EXTERNAL (heapvar) = 1;
> >            if (var_can_have_subvars (heapvar))
> >              bitmap_set_bit (handled_struct_type,
> >                              TYPE_UID (TREE_TYPE (decl_type)));
> >            rvi = create_variable_info_for_1 (heapvar, "PARM_NOALIAS", true,
> >                                              true, handled_struct_type);
> >            if (var_can_have_subvars (heapvar))
> >              bitmap_clear_bit (handled_struct_type,
> >                                TYPE_UID (TREE_TYPE (decl_type)));
> >            rvi->is_restrict_var = 1;
> >            insert_vi_for_tree (heapvar, rvi);
> >            make_constraint_from (vi, rvi->id);
> >            make_param_constraints (rvi);
> >
> > where we create a heapvarwith a specific aggregate type.  Generally
> > make_heapvar (for the allocation case) allocates a variable without
> > subfields:
> >
> > static varinfo_t
> > make_heapvar (const char *name, bool add_id)
> > {
> >    varinfo_t vi;
> >    tree heapvar;
> >
> >    heapvar = build_fake_var_decl (ptr_type_node);
> >    DECL_EXTERNAL (heapvar) = 1;
> >
> >    vi = new_var_info (heapvar, name, add_id);
> >    vi->is_heap_var = true;
> >    vi->is_unknown_size_var = true;
> >    vi->offset = 0;
> >    vi->fullsize = ~0;
> >    vi->size = ~0;
> >    vi->is_full_var = true;
> >
> > I've once had attempted to split (aka generate subfields) a variable
> > on-demand during solving but that never worked well.
> >
> > So for specific cases like C++ new T we could create heapvars
> > appropriately typed.  But you have to double-check for correctness
> > because of may_have_pointers and so on.
>
> Hi Richard, I am not 100% sure about the meaning of the
> may_have_pointers field. In the source it says:
>
>    270   /* True if this field may contain pointers.  */
>    271   unsigned int may_have_pointers : 1;
>
> Does that mean that (1) the field is a pointer type? (2) An integer
> which may have hold the value of a pointer?


Yes.

> Or that (3) there are
> pointers pointing to that field?

No, that property doesn't exist as flag.

> You mention that for C++ new T heap vars are typed appropriately, but we
> still need to check for correctness. If I understood correctly, this
> means that may_have_pointers means that there might be pointers pointing
> to that variable (?) Because the pointers could potentially be casted
> and therefore the layout be interpreted in a different way. This is just
> my interpretation of what you said, can you please confirm or clarify
> that bit?

No, C++ new T heap vars are the same as for malloc.  I said it is probably
easier to derive a good guess what type to use there.  And yes, we'd
have to still force all fields to have may_have_pointers to true because
it's really just arbitrary sub-dividing of storage without any type info.

>
> Another clarification point: you mentioned that the heap variables'
> layout needs to be known in order to do what I want. I think you are
> right, but let me continue the example.

So we at least need to know the size for most of the machinery to work.

> These are the points-to sets we have:
>
>    HEAP(1) = { }
>    HEAP(2) = { HEAP(1) HEAP(2) }
>    pc_30 = { HEAP(1) }
>    p1_32 = { HEAP(2) }
>
> These are the complex constraints of interest:
>
>    *p1_32 = pc_30
>    *p1_32 + 64 = p1_32
>
> This is what we want:
>
>     p1_32->0 = { HEAP(1) }
>     p1_32->64 = { HEAP(2) }
>
> If I were to continue working on say "parsing the constraint variables
> to generate something like what I want", it would mean that I would have
> to parse all possible complex constraint variables. Which would include
> the following type of statements:
>
>   *x = y + off_y
>
> x corresponds to a constraint that has a valid gimple SSA variable
> y is a heap variable
>
> And therefore, since y does not have a layout known, in these cases *x
> can point to whatever y points to right? Because even though off_y might
> be known, since the layout of y is unknown this basically would mean
> that we are forced to treat off_y as UNKNOWN_OFFSET.

yes

> So, the only cases which I would be able to correctly determine are
> those in which there is no heap variable either in the lhs or rhs
> (inclusive) of the complex constraint. Which I think implies that there
> are *some* cases which can be statically determined, like the one
> above... (There might be some complications which I haven't thought of
> of course...)

well, usually the fields are accessed via x->a and x->b so the
offsets are constant and not unknown.

>
> Now this is just some random thought and maybe does not warrant a
> concrete response below but I will be thinking about this during the
> weekend:
>
> I'm wondering whether it would be possible to keep HEAP variables
> without a concrete layout but still "unify" all aliases that point to
> that heap variable to at least have some idea of what the layout looks
> like. For example:
>
> struct A { char *f0; struct A* f1; };
> struct A *p0 = malloc(argc * sizeof(struct A)); // HEAP(0)
>
> We don't know the size of malloc... but if we find all pointers which
> point to HEAP(0) we could say something like
>
> HEAP(0) is an array with size unknown and we can accesses the following:
> 8 * n bytes for a struct A* type. No need to keep track of individual
> array elements accessed, just knowing that if its a multiple of 8 bytes,
> it will have a layout of struct A*.

not sure what you're after, but the current points-to solver does not
track offsets, it just has an approximation of offsets via splitting
pointed-to variables into pieces.  So if p points to { s@0+32 }
(subfield of s with extent 0 to 32) we know it points _somewhere_
inside that region.  If you then have a constraint like q = p + 4
then q will point to { s@0+32 s@32+32 } if there exists an adjacent
field.  A following r = q - 4 will then add another field left of s@0+32
(ok bad example, negative offset).  So the field-sensitivity will
bleed out quickly and once you see a UNKNOWN_OFFSET every
subfield will be replaced by all subfields.

> Or instead of doing an analysis to unify all alias sets, just perform a
> speculative layout and disable it if something else which may point to
> that location has a different type or performs some UNKNOWN_OFFSET
> pointer arithmetic... (But I digress, I have to read a bit more on
> pointer analyses to find out more...)
>
> Thanks (also, I know I digressed a little in this last point, but please
> clarify the first two points! This has been very helpful)
>
> >
> >> It seems to me that I can probably parse these complex constraints to
> >> generate the answers which I want. Is this the way this is currently
> >> being handled in GCC or is there some other standard mechanism for this?
> >
> > GCC is in the end only interested in points-to sets for SSA names
> > which never have subfields.  The missing subfields for aggregates
> > simply make the points-to solution less precise.
> >
> > Richard.
> >
> >> Thanks!

Re: On IPA-PTA field sensitivity and pointer expressions

Reply via email to