Re: [PATCH] Make strlen range computations more conservative

Jeff Law Fri, 03 Aug 2018 00:47:51 -0700

On 08/01/2018 09:13 PM, Martin Sebor wrote:
> On 08/01/2018 01:19 AM, Richard Biener wrote:
>> On Tue, 31 Jul 2018, Martin Sebor wrote:
>>
>>> On 07/31/2018 09:48 AM, Jakub Jelinek wrote:
>>>> On Tue, Jul 31, 2018 at 09:17:52AM -0600, Martin Sebor wrote:
>>>>> On 07/31/2018 12:38 AM, Jakub Jelinek wrote:
>>>>>> On Mon, Jul 30, 2018 at 09:45:49PM -0600, Martin Sebor wrote:
>>>>>>> Even without _FORTIFY_SOURCE GCC diagnoses (some) writes past
>>>>>>> the end of subobjects by string functions.  With _FORTIFY_SOURCE=2
>>>>>>> it calls abort.  This is the default on popular distributions,
>>>>>>
>>>>>> Note that _FORTIFY_SOURCE=2 is the mode that goes beyond what the
>>>>>> standard
>>>>>> requires, imposes extra requirements.  So from what this mode
>>>>>> accepts or
>>>>>> rejects we shouldn't determine what is or isn't considered valid.
>>>>>
>>>>> I'm not sure what the additional requirements are but the ones
>>>>> I am referring to are the enforcing of struct member boundaries.
>>>>> This is in line with the standard requirements of not accessing
>>>>> [sub]objects via pointers derived from other [sub]objects.
>>>>
>>>> In the middle-end the distinction between what was originally a
>>>> reference
>>>> to subobjects and what was a reference to objects is quickly lost
>>>> (whether through SCCVN or other optimizations).
>>>> We've run into this many times with the __builtin_object_size already.
>>>> So, if e.g.
>>>> struct S { char a[3]; char b[5]; } s = { "abc", "defg" };
>>>> ...
>>>> strlen ((char *) &s) is well defined but
>>>> strlen (s.a) is not in C, for the middle-end you might not figure
>>>> out which
>>>> one is which.
>>>
>>> Yes, I'm aware of the middle-end transformation to MEM_REF
>>> -- it's one of the reasons why detecting invalid accesses
>>> by the middle end warnings, including -Warray-bounds,
>>> -Wformat-overflow, -Wsprintf-overflow, and even -Wrestrict,
>>> is less than perfect.
>>>
>>> But is strlen(s.a) also meant to be well-defined in the middle
>>> end (with the semantics of computing the length or "abcdefg"?)
>>
>> Yes.
>>
>>> And if so, what makes it well defined?
>>
>> The fact that strlen takes a char * argument and thus inline-expansion
>> of a trivial implementation like
>>
>>  int len = 0;
>>  for (; *p; ++p)
>>    ++len;
>>
>> will have
>>
>>  p = &s.a;
>>
>> and the middle-end doesn't reconstruct s.a[..] from the pointer
>> access.
>>
>>>
>>> Certainly not every "strlen" has these semantics.  For example,
>>> this open-coded one doesn't:
>>>
>>>   int len = 0;
>>>   for (int i = 0; s.a[i]; ++i)
>>>     ++len;
>>>
>>> It computes 2 (with no warning for the out-of-bounds access).
>>
>> Yes.
> 
> If that's not a problem then why is it one when strlen() does
> the same thing?  Presumably the answer is: "because here
> the access is via array indexing and in strlen via pointer
> dereferences."  (But in C there is no difference between
> the two.  Also see below.)
But the semantics within GCC aren't necessarily C, they are GIMPLE.
While they are usually very similar, they can differ.  That also means
that if array indexing gets turned into pointer dereferences we lose
semantic information.




> I have seen and I think shown in this discussion examples
> where this is not so.  For instance:
> 
>   struct S { char a[1], b[1]; };
> 
>   void f (struct S *s, int i)
>   {
>     char *p = &s->a[i];
>     char *q = &s->b[0];
> 
>     char x = *p;
>     *q = 11;
> 
>     if (x != *p)            // folded to false
>       __builtin_abort ();   // eliminated
>   }
> 
> Is this a bug?  (I hope not.)
What I keep coming back to is what is the type of the object we pass to
strlen in GIMPLE.  If it is a char array, then we can use the array
bounds to construct bounds for the result.  If it is a char *, then we
can not.

This example is really looking at the aliasing model (which is another
place where the semantics may not match precisely).  But I don't think
you can equate what happens in the aliasing model of GIMPLE with the
strlen case.  They're apples and oranges.


> 
>>> If we can't then the only language we have in common with users
>>> is the standard.  (This, by the way, is what the C memory model
>>> group is trying to address -- the language or feature that's
>>> missing from the standard that says when, if ever, these things
>>> might be valid.)
>>
>> Well, you simply have to not compare apples and oranges,
>> a strlen implementation that isn't a strlen implementation
>> and strlen.
> 
> As I'm sure you know, the C standard doesn't differentiate
> between the semantics of array subscript expressions and
> pointer dereferencing.  They both mean the same thing.
> (Nothing prevents an implementation from defining strlen
> as a macro that expands into a loop using array indices
> for array arguments.)
But this is an area were the C standard and GIMPLE differ.  It's easy to
miss (as I did in the initial review).  But that's the way it is.


Jeff

Re: [PATCH] Make strlen range computations more conservative

Reply via email to