Re: PING [PATCH] warn for strlen of arrays with missing nul (PR 86552, 86711, 86714) )

Bernd Edlinger Fri, 24 Aug 2018 14:57:02 -0700

On 08/24/18 18:04, Jeff Law wrote:
> On 08/24/2018 06:27 AM, Bernd Edlinger wrote:
> [ Lots of snipping throughout ]
> 
> 
>>>>
>>>>> +
>>>>>      if (TREE_CODE (src) == COND_EXPR
>>>>>          && (only_value || !TREE_SIDE_EFFECTS (TREE_OPERAND (src, 0))))
>>>>>        {
>>>>> -      tree len1, len2;
>>>>> -
>>>>> -      len1 = c_strlen (TREE_OPERAND (src, 1), only_value);
>>>>> -      len2 = c_strlen (TREE_OPERAND (src, 2), only_value);
>>>>> +      tree len1 = c_strlen (TREE_OPERAND (src, 1), only_value, arrs);
>>>>> +      tree len2 = c_strlen (TREE_OPERAND (src, 2), only_value, arrs + 1);
>>>>>          if (tree_int_cst_equal (len1, len2))
>>>>> - return len1;
>>>>> + {
>>>>
>>>> funny, if called with NULL *arr and arrs[0] alias each other.
>>>
>>>>
>>>>> +   *arr = arrs[0] ? arrs[0] : arrs[1];
>>>>> +   return len1;
>>>>> + }
>>>>>        }
>>>
>>> And in this case it looks like your comment is before the code you're
>>> commenting about.  It was fairly obvious in this case because the code
>>> prior to your "funny, if called..." comment didn't reference arr or arrs
>>> at all.
>>>
>>> And more importantly, why are you concerned about the aliasing?
>>>
>>
>> It is just *arr = arrs[0] does nothing, but it looks like the author
>> was not aware of it.  It may be okay, but causes head-scratching.
>> If you don't have the context you will think this does something different.
> *arr = arrs[0] ? arrs[0] : arrs[1];
> 
> Yes, there's potentially a dead store when arr points to arrs[0] because
> of the earlier initialization when NULL was passed for arr.  But
> otherwise we'd be checking repeatedly that arr != NULL.
> 
> 
>>>>>      /* PTR can point to the byte representation of any string type, 
>>>>> including
>>>>>         char* and wchar_t*.  */
>>>>>      const char *ptr = TREE_STRING_POINTER (src);
>>>>> @@ -650,7 +709,8 @@ c_strlen (tree src, int only_value)
>>>>>          offsave = fold_convert (ssizetype, offsave);
>>>>>          tree condexp = fold_build2_loc (loc, LE_EXPR, boolean_type_node, 
>>>>> offsave,
>>>>>                                         build_int_cst (ssizetype, len * 
>>>>> eltsize));
>>>>
>>>> this computation is wrong, it computes not in units of eltsize,
>>>> I am however not sure if it is really good that this function tries to
>>>> compute strlen of wide character strings.
>>> Note that in the second version of Martin's patch eltsize will always be
>>> 1 when we get here.  Furthermore, the multiplication by eltsize is gone.
>>>    It looks like you switched back to commenting after the code for that
>>> comment, but then immediately thereafter...
>>>
>>
>> Acutally I had no idea that the second patch did resolve some of my comments
>> and which those were.
> It happens.  Multiple long threads make it difficult to follow.
> 
>>
>> I had the impression that it is just splitting up of a large patch into
>> several smaller without reworking at the same time.
>>
>> Once again, a summary what was changed would be helpful.
> Agreed.
> 
> 
>>>> What are you fixing here, I think that was another bug.
>>>> If this fixes something then it should be in a different patch,
>>>> just handling this.
>>>>
>>>>>      unsigned len = string_length (ptr + eltoff * eltsize, eltsize,
>>>>> -                         maxelts - eltoff);
>>>>> +                         strelts - eltoff);
>>>>>    
>>>>>      return ssize_int (len);
>>>>>    }
>>> I'm guessing the comment in this case refers to the code after.
>>> Presumably questioning the change from maxelts to strelts.
>>>
>>>
>>
>> Yes.  I was thinking that could be a patch of its own.
> Funny.  I found myself in agreement and was going to extract this out
> and run it through the usual tests and get it onto the trunk
> immediately.  Then I realized you'd already fixed this in the patch that
> added the eltsize paramter to c_strlen which has already been committed
> :-)
> 
> 
> 
>>
>> Well, yes, although I changed my mind on strlen(L"abc") meanwhile.
>>
>> This may appear as if I contradict myself but, it is more like I learn.
>>
>> The installed patch for the ELTSIZE parameter was in part inspired by the
>> thought about "not folding invalid stuff" that was forming in my mind
>> at that time, even before I wrote it down and sent it to this list.
> ACK.  And I think there seems to be consensus forming around that
> concept which is good.
> 
> If we ultimately decide not to fold strlen of a wide character string,
> then that'll be an easy enough change to make.  In the mean time bring
> consistent with how the C library strlen works is a good thing IMHO.
> 
> 
>>>>> @@ -8255,19 +8333,30 @@ fold_builtin_classify_type (tree arg)
>>>>>      return build_int_cst (integer_type_node, type_to_class (TREE_TYPE 
>>>>> (arg)));
>>>>>    }
>>>>>    
>>>>> -/* Fold a call to __builtin_strlen with argument ARG.  */
>>>>> +/* Fold a strlen call to FNDECL of TYPE, and with argument ARG.  */
>>>>>    
>>>>>    static tree
>>>>> -fold_builtin_strlen (location_t loc, tree type, tree arg)
>>>>> +fold_builtin_strlen (location_t loc, tree fndecl, tree type, tree arg)
>>>>>    {
>>>>>      if (!validate_arg (arg, POINTER_TYPE))
>>>>>        return NULL_TREE;
>>>>>      else
>>>>>        {
>>>>> -      tree len = c_strlen (arg, 0);
>>>>> -
>>>>> +      tree arr = NULL_TREE;
>>>>> +      tree len = c_strlen (arg, 0, &arr);
>>>>
>>>> Is it possible to write a test case where strlen(L"test") reaches this 
>>>> point?
>>>> what will c_strlen return then?
>>> Given your fix to have c_strlen count bytes by default, I think we're OK
>>> here.
>>>
>>
>> Yes.  But my fix was still incomplete.
>> So incremental progress is necessary here.
> That's fine.  I'm generally a fan of incremental improvements.
> 
>>>>
>>>>>        {
>>>>>          *ptr_offset = fold_convert (sizetype, offset);
>>>>> +      if (nulterm)
>>>>> + *nulterm = true;
>>>>> +      if (decl)
>>>>> + *decl = NULL_TREE;
>>>>>          return array;
>>>>>        }
>>> So your comment seems to be referring to the code above the comment as
>>> well as below.  Confusing.  Consistency really helps.
>>>
>>> Are we at a point where we're ready to declare STRING_CSTs as always
>>> being properly terminated?  If not the hunk of code above needs some
>>> rethinking since we can't guarantee the string is properly terminated.
>>>
>>
>> No.  That is still in the flux.
>> And I am not sure if the "properly terminated" will survive.
>>
>> I am digging deep to the ground now.
>> And the correctness of the code in questing depends heavily on the
>> results.  Therefore I would like to fix this from the ground.
>>
>> Adding lots of new code that is based on wrong assumptions
>> will not help.
> OK.  So clearly this hunk needs rethinking.  I'm not sure if/how
> critical the code above is -- we may be able to get away with deferring
> this chunk.  I'll have to look at that more deeply.
>


Yes. Note one interesting thing.  The new version of the STRING_CST
consistency check in varasm.c does some magic.
This time I think Richard generally agreed on the approach.

The surprise is, it fixes both PR86711/86714 without any other patch.

I really think that is the preferred way to go ahead.

If you prevent inconsistent input to the string_constant function
then it is a lot easier to prevent inconsistent output.

So I start to think that at least my proposed fix for PR86711/86714
seems like it was not yet at the root cause of the problem.
The root cause was probably the semantical differences between
different uses of STRING_CSTs.


>>
>>
>>>
>>>>>    
>>>>> @@ -11414,6 +11422,49 @@ string_constant (tree arg, tree *ptr_offset)
>>>>>      if (!array_size || TREE_CODE (array_size) != INTEGER_CST)
>>>>>        return NULL_TREE;
>>>>>    
>>>>> +  unsigned HOST_WIDE_INT array_elts = tree_to_uhwi (array_size);
>>>>> +
>>>>
>>>> I don't understand why this is necessary at all.
>>>> It looks way too complicated, to say the least.
>>>>
>>>> TREE_TYPE (init) has already the type of the member.
>>>>
>>>>> +  /* When ARG refers to an aggregate (of arrays) try to determine
>>>>> +     the size of the character array within the aggregate.  */
>>>>> +  tree ref = arg;
>>>>> +  tree reftype = TREE_TYPE (arg);
>>>>> +
>>>>> +  if (TREE_CODE (ref) == MEM_REF)
>>>>> +    {
>>>>> +      ref = TREE_OPERAND (ref, 0);
>>>>> +      if (TREE_CODE (ref) == ADDR_EXPR)
>>>>> + {
>>>>> +   ref = TREE_OPERAND (ref, 0);
>>>>> +   reftype = TREE_TYPE (ref);
>>>>> + }
>>>>> +    }
>>>>> +  else
>>>>> +    while (TREE_CODE (ref) == ARRAY_REF)
>>>>> +      {
>>>>> + reftype = TREE_TYPE (ref);
>>>>> + ref = TREE_OPERAND (ref, 0);
>>>>> +      }
>>>>> +
>>>>> +  if (TREE_CODE (ref) == COMPONENT_REF)
>>>>> +    reftype = TREE_TYPE (ref);
>>>>> +
>>>>> +  while (TREE_CODE (reftype) == ARRAY_TYPE)
>>>>> +    {
>>>>> +      tree next = TREE_TYPE (reftype);
>>>>> +      if (TREE_CODE (next) == INTEGER_TYPE)
>>>>> + {
>>>>> +   if (tree size = TYPE_SIZE_UNIT (reftype))
>>>>> +     if (tree_fits_uhwi_p (size))
>>>>> +       array_elts = tree_to_uhwi (size);
>>>>
>>>> so array_elts is measued in bytes.
>>> So I'm guessing your comment about the code looking way too complicated
>>> is referring to the code *after* your comment.  That code is not in the
>>> v2 patch.  At least not 01/06 which addresses the codegen/opt issues.  I
>>> haven't checked the full kit to see if this code appears in a subsequent
>>> patch.
>>>
>>
>> No I meant the code between
>> /* When ARG refers to an aggregate (of arrays) try to determine
>>      the size of the character array within the aggregate.  */
>>
>> and here.  It is wrong (assuming C semantic on GIMPLE).
> Ah.  Yea.  I think we're broadly in agreement we can't do that for
> anything which affects optimization/codegen.  Given those bits aren't in
> Martin's v2 patch I think we can move on.
> 
> 

Yes, with the new semantics of STRING_CST we can guarantee the
STRING_CST have a _valid_ TYPE_UNIT_SIZE, even in cases of flexible array 
members
where previously the TYPE_UNIT_SIZE was NULL.  That makes a big difference here.


So I would wait with this topic until we have the semantic issues resolved.



>>
>>>
>>>>
>>>>> +   break;
>>>>> + }
>>>>> +
>>>>> +      reftype = TREE_TYPE (reftype);
>>>>> +    }
>>>>> +
>>>>> +  if (decl)
>>>>> +    *decl = array;
>>>>> +
>>>>>      /* Avoid returning a string that doesn't fit in the array
>>>>>         it is stored in, like
>>>>>         const char a[4] = "abcde";
>>>>> @@ -11427,7 +11478,9 @@ string_constant (tree arg, tree *ptr_offset)
>>>>>      unsigned HOST_WIDE_INT length = TREE_STRING_LENGTH (init);
>>>>>      length = string_length (TREE_STRING_POINTER (init), charsize,
>>>>>                             length / charsize);
>>>>
>>>> Some callers especially those where the wrong code happens, expect
>>>> to be able to access the STRING_CST up to TREE_STRING_LENGTH,
>>>> But using string_length assume thy stop at the first nul char.
>>> Example please?
>>>
>>
>> tree-ssa-forwprop.c:
>>                 str1 = string_constant (src1, &off1);
>>                 if (str1 == NULL_TREE)
>>                   break;
>>                 if (!tree_fits_uhwi_p (off1)
>>                     || compare_tree_int (off1, TREE_STRING_LENGTH (str1) - 
>> 1) > 0
>>                     || compare_tree_int (len1, TREE_STRING_LENGTH (str1)
>>                                                - tree_to_uhwi (off1)) > 0
> I understand that you're concerned about reading past the end of the
> returned STRING_CST via the subsequent memcpy in tree-ssa-forwprop.c
> which starts at src1+off1 and reads len1 bytes.
> 
> But I'm not sure how that can happen in the V2 patch from Martin.  In
> fact, one of the fundamental goals of the V2 patch is to avoid this
> exact problem.
> 

here I quote form martins 1/6 patch:
> +  /* Compute the lower bound number of elements (not bytes) in the array
> +     that the string is used to initialize.  The actual size of the array
> +     will be may be greater if the string is shorter, but the important
> +     data point is whether the literal, including the terminating nul,
> +     fits in the array. */
> +  unsigned HOST_WIDE_INT array_elts
> +    = tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (init))) / charsize;
> +
> +  /* Compute the string length in (wide) characters.  */

So prior to my STRING_CST patch this will ICE on a flexible array member,
because those have TYPE_SIZE_UNIT = NULL, and tree_to_uhwi will ICE.

I used:

  compare_tree_int (TYPE_SIZE_UNIT (TREE_TYPE (init)),
                    TREE_STRING_LENGTH (init))
  
and this will not ICE with NULL, but consider it like infinity,
and return 1.

So my version did not ICE in that case.

Once we have the new STRING_CST semantics in place it will make
this defined, but it is at the same time completely unnecessary.


Bernd.

Re: PING [PATCH] warn for strlen of arrays with missing nul (PR 86552, 86711, 86714) )

Reply via email to