On 25/09/2020 13:30, Richard Biener wrote:
On Fri, Sep 25, 2020 at 9:05 AM Erick Ochoa
<erick.oc...@theobroma-systems.com> wrote:

Hi,

I am working on an alias analysis using the points-to information
generated during IPA-PTA. If we look at the varmap varinfo_t array in
gcc/tree-ssa-struct.c, most of the constraint variable info structs
contain a non-null decl field which points to a valid tree in gimple
(which is an SSA variable and a pointer). I am trying to find out a way
to obtain points-to information for pointer expressions. By this, the
concrete example I have in mind is answering the following question:

What does `astruct->aptrfield` points to?

Here I have a concrete example:


#include <stdlib.h>

struct A { char *f1; struct A *f2;};

int __GIMPLE(startwith("ipa-pta"))
main (int argc, char * * argv)
{
    struct A * p1;
    char * pc;
    int i;
    int _27;

    i_15 = 1;
    pc = malloc(100); // HEAP(1)
    p1 = malloc (16); // HEAP(2)
    p1->f1 = pc;
    p1->f2 = p1;
    _27 = (int) 0;
    return _27;
}


Will give the following correct points-to information:

HEAP(1) = { }
HEAP(2) = { HEAP(1) HEAP(2) }
pc_30 = { HEAP(1) }
p1_32 = { HEAP(2) }

However, there does not seem to be information printed for neither:

    p1->f1
    p1->f2

which I would expect (or like) something like:

    p1_32->0 = { HEAP(1) }
    p1_32->64 = { HEAP(2) }

Looking more closely at the problem, I found that some varinfo_t have a
non-null "complex" field. Which has an array of "complex" constraints
used to handle offsets and dereferences in gimple. For this same gimple
code, we have the following complex constraints for the variable p1_32:

main.clobber = p1_32 + 64
*p1_32 = pc_30
*p1_32 + 64 = p1_32

The issue is that allocated storage is not tracked field-sensitive since
we do not know it's layout at the point of allocation (where we allocate
the HEAP variable).  There are some exceptions, see what we do
for by-reference parameters in create_variable_info_for_1:

       if (vi->only_restrict_pointers
           && !type_contains_placeholder_p (TREE_TYPE (decl_type))
           && handle_param
           && !bitmap_bit_p (handled_struct_type,
                             TYPE_UID (TREE_TYPE (decl_type))))
         {
           varinfo_t rvi;
           tree heapvar = build_fake_var_decl (TREE_TYPE (decl_type));
           DECL_EXTERNAL (heapvar) = 1;
           if (var_can_have_subvars (heapvar))
             bitmap_set_bit (handled_struct_type,
                             TYPE_UID (TREE_TYPE (decl_type)));
           rvi = create_variable_info_for_1 (heapvar, "PARM_NOALIAS", true,
                                             true, handled_struct_type);
           if (var_can_have_subvars (heapvar))
             bitmap_clear_bit (handled_struct_type,
                               TYPE_UID (TREE_TYPE (decl_type)));
           rvi->is_restrict_var = 1;
           insert_vi_for_tree (heapvar, rvi);
           make_constraint_from (vi, rvi->id);
           make_param_constraints (rvi);

where we create a heapvarwith a specific aggregate type.  Generally
make_heapvar (for the allocation case) allocates a variable without
subfields:

static varinfo_t
make_heapvar (const char *name, bool add_id)
{
   varinfo_t vi;
   tree heapvar;

   heapvar = build_fake_var_decl (ptr_type_node);
   DECL_EXTERNAL (heapvar) = 1;

   vi = new_var_info (heapvar, name, add_id);
   vi->is_heap_var = true;
   vi->is_unknown_size_var = true;
   vi->offset = 0;
   vi->fullsize = ~0;
   vi->size = ~0;
   vi->is_full_var = true;

I've once had attempted to split (aka generate subfields) a variable
on-demand during solving but that never worked well.

So for specific cases like C++ new T we could create heapvars
appropriately typed.  But you have to double-check for correctness
because of may_have_pointers and so on.

Hi Richard, I am not 100% sure about the meaning of the may_have_pointers field. In the source it says:

  270   /* True if this field may contain pointers.  */
  271   unsigned int may_have_pointers : 1;

Does that mean that (1) the field is a pointer type? (2) An integer which may have hold the value of a pointer? Or that (3) there are pointers pointing to that field?

You mention that for C++ new T heap vars are typed appropriately, but we still need to check for correctness. If I understood correctly, this means that may_have_pointers means that there might be pointers pointing to that variable (?) Because the pointers could potentially be casted and therefore the layout be interpreted in a different way. This is just my interpretation of what you said, can you please confirm or clarify that bit?

Another clarification point: you mentioned that the heap variables' layout needs to be known in order to do what I want. I think you are right, but let me continue the example.

These are the points-to sets we have:

  HEAP(1) = { }
  HEAP(2) = { HEAP(1) HEAP(2) }
  pc_30 = { HEAP(1) }
  p1_32 = { HEAP(2) }

These are the complex constraints of interest:

  *p1_32 = pc_30
  *p1_32 + 64 = p1_32

This is what we want:

   p1_32->0 = { HEAP(1) }
   p1_32->64 = { HEAP(2) }

If I were to continue working on say "parsing the constraint variables to generate something like what I want", it would mean that I would have to parse all possible complex constraint variables. Which would include the following type of statements:

 *x = y + off_y

x corresponds to a constraint that has a valid gimple SSA variable
y is a heap variable

And therefore, since y does not have a layout known, in these cases *x can point to whatever y points to right? Because even though off_y might be known, since the layout of y is unknown this basically would mean that we are forced to treat off_y as UNKNOWN_OFFSET.

So, the only cases which I would be able to correctly determine are those in which there is no heap variable either in the lhs or rhs (inclusive) of the complex constraint. Which I think implies that there are *some* cases which can be statically determined, like the one above... (There might be some complications which I haven't thought of of course...)


Now this is just some random thought and maybe does not warrant a concrete response below but I will be thinking about this during the weekend:

I'm wondering whether it would be possible to keep HEAP variables without a concrete layout but still "unify" all aliases that point to that heap variable to at least have some idea of what the layout looks like. For example:

struct A { char *f0; struct A* f1; };
struct A *p0 = malloc(argc * sizeof(struct A)); // HEAP(0)

We don't know the size of malloc... but if we find all pointers which point to HEAP(0) we could say something like

HEAP(0) is an array with size unknown and we can accesses the following:
8 * n bytes for a struct A* type. No need to keep track of individual array elements accessed, just knowing that if its a multiple of 8 bytes, it will have a layout of struct A*.

Or instead of doing an analysis to unify all alias sets, just perform a speculative layout and disable it if something else which may point to that location has a different type or performs some UNKNOWN_OFFSET pointer arithmetic... (But I digress, I have to read a bit more on pointer analyses to find out more...)

Thanks (also, I know I digressed a little in this last point, but please clarify the first two points! This has been very helpful)


It seems to me that I can probably parse these complex constraints to
generate the answers which I want. Is this the way this is currently
being handled in GCC or is there some other standard mechanism for this?

GCC is in the end only interested in points-to sets for SSA names
which never have subfields.  The missing subfields for aggregates
simply make the points-to solution less precise.

Richard.

Thanks!

Reply via email to