On 08/24/18 12:52, Richard Biener wrote: > On Wed, Aug 22, 2018 at 6:57 AM Bernd Edlinger > <bernd.edlin...@hotmail.de> wrote: >> >> On 08/21/18 10:33, Richard Biener wrote: >>> On Mon, 20 Aug 2018, Bernd Edlinger wrote: >>> >>>> On 08/20/18 15:19, Richard Biener wrote: >>>>> On Mon, 20 Aug 2018, Bernd Edlinger wrote: >>>>> >>>>>> On 08/20/18 13:01, Richard Biener wrote: >>>>>>> On Wed, Aug 1, 2018 at 3:05 PM Bernd Edlinger >>>>>>> <bernd.edlin...@hotmail.de> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 08/01/18 11:29, Richard Biener wrote: >>>>>>>>> >>>>>>>>> Hmm. I think it would be nice if TREE_STRING_LENGTH would >>>>>>>>> match char[2] and TYPE_SIZE_UNIT even if that is inconvenient >>>>>>>>> for your check above. Because the '\0' doesn't belong to the >>>>>>>>> string. Then build_string internally appends a '\0' outside >>>>>>>>> of TREE_STRING_LENGTH. >>>>>>>>> >>>>>>>> >>>>>>>> Hmm. Yes, but the outside-0 byte is just one byte, not a wide >>>>>>>> character. >>>>>>> >>>>>>> That could be fixed though (a wide 0 is just N 0s). Add a elsz = 1 >>>>>>> parameter to build_string and allocate as many extra 0s as needed. >>>>>>> >>>>>>> There are STRING_CSTs which are not string literals, >>>>>>>> for instance attribute tags, Pragmas, asm constrants, etc. >>>>>>>> They use the '\0' outside, and have probably no TREE_TYPE. >>>>>>>> >>>>>>>>> >>>>>>>>>> So I would like to be able to assume that the STRING_CST objects >>>>>>>>>> are internally always generated properly by the front end. >>>>>>>>> >>>>>>>>> Yeah, I guess we need to define what "properly" is ;) >>>>>>>>> >>>>>>>> Yes. >>>>>>>> >>>>>>>>>> And that the ARRAY_TYPE of the string literal either has the >>>>>>>>>> same length than the TREE_STRING_LENGTH or if it is shorter, >>>>>>>>>> this is always exactly one (wide) character size less than >>>>>>>>>> TREE_STRING_LENGTH >>>>>>>>> >>>>>>>>> I think it should be always the same... >>>>>>>>> >>>>>>>> >>>>>>>> One could not differentiate between "\0" without zero-termination >>>>>>>> and "" with zero-termination, theoretically. >>>>>>> >>>>>>> Is that important? Doesn't the C standard say how to parse string >>>>>>> literals? >>>>>>> >>>>>>>> We also have char x[100] = "ab"; >>>>>>>> that is TREE_STRING_LENGTH=3, and TYPE_SIZE_UNIT(TREE_TYPE(x)) = 100. >>>>>>>> Of course one could create it with a TREE_STRING_LENGTH = 100, >>>>>>>> but imagine char x[100000000000] = "ab" >>>>>>> >>>>>>> The question is more about TYPE_SIZE_UNIT (TREE_TYPE ("ab")) which I >>>>>>> hope matches "ab" and not 'x'. If it matches 'x' then I'd rather have >>>>>>> it >>>>>>> unconditionally be [], thus incomplete (because the literals "size" >>>>>>> depends >>>>>>> on the context/LHS it is used on). >>>>>>> >>>>>> >>>>>> Sorry, but I must say, it is not at all like that. >>>>>> >>>>>> If I compile x.c: >>>>>> const char x[100] = "ab"; >>>>>> >>>>>> and set a breakpoint at output_constant: >>>>>> >>>>>> Breakpoint 1, output_constant (exp=0x7ffff6ff9dc8, size=100, align=256, >>>>>> reverse=false) at ../../gcc-trunk/gcc/varasm.c:4821 >>>>>> 4821 if (size == 0 || flag_syntax_only) >>>>>> (gdb) p size >>>>>> $1 = 100 >>>>>> (gdb) call debug(exp) >>>>>> "ab" >>>>>> (gdb) p *exp >>>>>> $2 = {base = {code = STRING_CST, side_effects_flag = 0, constant_flag = >>>>>> 1, >>>>>> (gdb) p exp->typed.type->type_common.size_unit >>>>>> $5 = (tree) 0x7ffff6ff9d80 >>>>>> (gdb) call debug(exp->typed.type->type_common.size_unit) >>>>>> 100 >>>>>> (gdb) p exp->string.length >>>>>> $6 = 3 >>>>>> (gdb) p exp->string.str[0] >>>>>> $8 = 97 'a' >>>>>> (gdb) p exp->string.str[1] >>>>>> $9 = 98 'b' >>>>>> (gdb) p exp->string.str[2] >>>>>> $10 = 0 '\000' >>>>>> (gdb) p exp->string.str[3] >>>>>> $11 = 0 '\000' >>>>>> >>>>>> >>>>>> This is an important property of string_cst objects, that is used in >>>>>> c_strlen: >>>>>> >>>>>> It folds c_strlen(&x[4]) directly to 0, because every byte beyond >>>>>> TREE_STRING_LENGTH >>>>>> is guaranteed to be zero up to the type size. >>>>>> >>>>>> I would not have spent one thought on implementing an optimization like >>>>>> that, >>>>>> but that's how it is right now. >>>>> >>>>> Huh. So somebody interpreted STRING_CSTs similar to CONSTRUCTORs aka >>>>> they have zero-padding up to its type size. I don't see this documented >>>>> anywhere and it would suggest to "optimize" "ab\0\0\0\0" to "ab\0" >>>>> with appropriate TYPE_SIZE. >>>>> >>>>> This is also a relatively new thing on trunk (coming out of the added >>>>> mem_size parameter of string_constant). That this looks at the STRING_CST >>>>> type like >>>>> >>>>> if (TREE_CODE (array) == STRING_CST) >>>>> { >>>>> *ptr_offset = fold_convert (sizetype, offset); >>>>> if (mem_size) >>>>> *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); >>>>> return array; >>>>> >>>>> I'd call simply a bug. As said, interpretation of STRING_CSTs should >>>>> depend on the context. And for >>>>> >>>> >>>> This use of the TYPE_SIZE_UNIT was pre-existing to r263607 >>>> before that (but not in the gcc-8-branch) we had this in c_strlen: >>>> >>>> HOST_WIDE_INT maxelts = strelts; >>>> tree type = TREE_TYPE (src); >>>> if (tree size = TYPE_SIZE_UNIT (type)) >>>> if (tree_fits_shwi_p (size)) >>>> { >>>> maxelts = tree_to_uhwi (size); >>>> maxelts = maxelts / eltsize - 1; >>>> } >>>> >>>> which I moved to string_constant and return the result through memsize. >>>> >>>> It seems to be working somehow, and I'd bet removing that will immediately >>>> break twenty or thirty of the strlenopt tests. >>>> >>>> Obviously the tree string objects allow way too much variations, >>>> and it has to be restricted in some way or another. >>>> >>>> In the moment my approach is: look at what most string constants do >>>> and add assertions to ensure that there are no exceptions. >>>> >>>> >>>>> char a[4] = "abc"; >>>>> char b[5] = "abc"; >>>>> >>>>> we should better be able to share STRING_CSTs [you can see LTO >>>>> sharing the nodes if you do b[4] but not when b[5] I suppose]. >>>>> >>>>>> All I want to do, is make sure that all string constants have the same >>>>>> look and feel >>>>>> in the middle-end, and restrict the variations that are allowed by the >>>>>> current >>>>>> implementation. >>>>> >>>>> Sure, I understand that. But I'd like to simplify things and not add >>>>> complications like looking at TYPE_SIZE vs. TREE_STRING_LENGTH to decide >>>>> whether sth is 0-terminated. >>>>> >>>> >>>> This is not only about 0-terminated. (*) >>>> >>>> It is also about when you get a STRING_CST with a TREE_STRING_LENGTH, >>>> there are places in the middle-end that assume that the object contains >>>> _all_ bytes up to TREE_STRING_LENGTH without looking at TYPE_SIZE_UINT. >>>> Those I want to protect. >>> >>> Well, but string_constant handles &STRING_CST just fine but in that >>> context there's no "object" we look at. >>> >>> IMHO whenever string_constant ends up with a DECL, looking at >>> ctor_for_folding and we end up with a STRING_CST that doesn't fit >>> in the DECL we looked at we have a bug (non-truncated STRING_CST) >>> and either should do the truncation or simply return NULL. >>> >>> So we should look at DECL_SIZE_UNIT and compare that with >>> TREE_STRING_LENGTH. Otherwise you are relying on TYPE_SIZE_UNIT >>> of the STRING_CST being "correct" (fitting the decl context). >>> >>>> Bernd. >>>> >>>> >>>> *: I called the parameter memsize, and yes, I wrote: "If MEM_SIZE is zero, >>>> the string is only returned when it is properly zero terminated.", >>>> but maybe I should have written: >>>> "If MEM_SIZE is zero, the string is only returned when the storage size >>>> of the string object is at least TREE_STRING_LENGTH." >>>> That's at least exactly what the check does. >>> >>> Well, as said above >>> >>> if (TREE_CODE (array) == STRING_CST) >>> { >>> *ptr_offset = fold_convert (sizetype, offset); >>> if (mem_size) >>> *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (array)); >>> return array; >>> } >>> >>> simply assumes the "storage size" of a STRING_CST is determined >>> by its TYPE_SIZE_UNIT. That may be true as you noted in the folloup, >>> but clearly in the above case there's nothing to protect? And in >>> the case we pulled the STRING_CST from some DECL_INITIAL it doesn't >>> protect from overflow of the FIELD_DECL unless frontends never >>> generate "wrong" STRING_CSTs? >>> >> >> Hmm, I digged some more in varasm.c >> >> Normal STRING_CST use get_constant_size to allocate >> MAX(TYPE_SIZE_UNIT, TREE_STRING_LENGTH) >> thus they can have excess zero bytes. >> >> while STRING_CST that are in DECL_INITIALIZERs >> are shrinked to DECL_SIZE_UNIT. >> >> At least that difference looks unnecessary to me. >> >> >> But compare_constant does not compare the TYPE_SIZE_UNIT: >> >> case STRING_CST: >> if (TYPE_MODE (TREE_TYPE (t1)) != TYPE_MODE (TREE_TYPE (t2))) >> return 0; >> >> return (TREE_STRING_LENGTH (t1) == TREE_STRING_LENGTH (t2) >> && ! memcmp (TREE_STRING_POINTER (t1), TREE_STRING_POINTER >> (t2), >> TREE_STRING_LENGTH (t1))); >> >> >> This looks now like a bug. > > Well. I think that we emit excess zero bytes for "truncated" STRING_CSTs > isn't > a designed feature. I'd rather not have it that way because the length of a > STRING_CST becomes defined in multiple places if you consider STRING_CST > with [] or too small type then TREE_STRING_LENGTH is authorative while > if the type is larger then that is authorative. > > The length of a STRING_CST if output should be only determined from > TREE_STRING_LENGTH. >
Hmm. I think about that. It is possible that I change my mind.... Currently the semantic of STRING_CST objects differs for literals and for initializer elements. For literals the TREE_STRING_LENGTH is guaranteed to resemble the runtime string object. For initializer elements the type domain overrides the TREE_STRING_LENGTH if it is available. If it is a flexible array member the type domain is not available, and the TREE_STRING_LENGTH determines the memory size. The regression in tree-ssa-forwprop.c (PR 86714) is caused by the fact that previously constant_string did always return string literals while now it can as well return initializer elements. But tree-ssa-forwprop expects string literal semantics. I would like to remove the semantic differences. I have a patch in the test, that fixes the TYPE_DOMAIN of flexible array members. Maybe the most important property is TYPE_DOMAIN of the STRING_CST should never be shorter than the TREE_STRING_LENGTH. Then it the difference between string literals and string initializers will disappear. And that should probably take precedence over the zero-termination property of the string_cst objects. Which will of course put one more question mark on the correctness of the string_constant, c_strlen etc. >> >>> Btw, get_constant_size / mergeable_string_section suggsts that >>> we may view STRING_CST as general target-encoded byte blob. >>> That may be useful to compress CONSTRUCTORs memory use. >>> >>> It looks like mergeable_string_section wrongly would reject >>> a char[] typed string because int_size_in_bytes returns -1 >>> for incomplete types. I still believe using char[] for most >>> STRING_CSTs generated by FEs would be a good thing for >>> IL memory use. Shouldn't the mem_size initializations use >>> sth like get_constant_size as well? >>> >> >> I think incomplete types exist only in structs with variable >> length: >> >> struct { >> int x; >> char y[]; >> } s = { 1, "test" }; >> >> this is no candidate for a mergeable string. >> >> but >> char y[] = "test"; >> >> is fixed by the FE to: >> char y[5] = "test"; >> >> So that does not make problems with mergeable_string_section. >> >> >>> Also >>> >>> if (mem_size) >>> *mem_size = TYPE_SIZE_UNIT (TREE_TYPE (init)); >>> else if (compare_tree_int (array_size, length + 1) < 0) >>> return NULL_TREE; >>> >>> the latter check doesn't seem to honor 'offset' and *mem_size >>> is again looking at the STRING_CST, not at the FIELD_DECL or >>> wherever we got the STRING_CST from. >>> >> >> the caller compares offset with *mem_size, but we do not have >> the FIELD_DECL at hand here (Or I did not know where it was). >> I'd rather fail the cases when the TYPE_SIZE_UNIT is null, >> or something non-constant. >> >> All those are variations of vla with initializer and similar. >> Once the STRING_CST has a type, I would like to use it, >> when it doesn't the whole thing is probably not worth it >> anyway. > > I would rather have STRING_CSTs only have a type for the > elements, not the whole thing... you can get that > if you drop TYPE_DOMAIN on all STRING_CSTs array types. > This will break the string_constant code even more. It depends on the TYPE_DOMAIN to mean something. > Not sure what breaks if we change get_constant_size to > return TREE_STRING_LENGTH unconditionally. At least > somehow the need to special-case STRING_CSTs looks > odd and I wonder if we do this in other functions as well... > I think as well this special handling in get_constant_size is something that has to go away. It is even worse, the DECL_SIZE_UNIT is also NULL in flexible array members. But I think I can make it work the other way round. So that STRING_CST always have a TYPE_DOMAIN which may only be larger than the TREE_STRING_LENGTH. Bernd. > Richard. > >> >> >> Bernd.