Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-08-04 Thread Jeff Law

On 08/04/2016 08:27 AM, David Malcolm wrote:


As for test coverage, v2 and v3 of the kit add over a thousand lines of
selftest code that heavily exercise string lexing, using the
 line_table_case machinery to run the tests with various interesting
boundary conditions with line_table (e.g. near
 LINE_MAP_MAX_LOCATION_WITH_PACKED_RANGES).

In terms of test coverage of the fallbacks, patch 2 of v3 of the kit
directly exercises the substr_loc.get_range in
gcc.dg/plugin/diagnostic_plugin_test_string_literals.c via
gcc.dg/plugin/diagnostic-test-string-literals-1.c, and some of the
tests there cover the failures, via:

  error_at (strloc, "unable to read substring range: %s", err);

which we wouldn't do in a normal diagnostic (but which is appropriate
for testing the machinery itself).

Patch 3 of the v3 kit adds a format_warning_va function to c-format.c
which is responsible for dealing with failures:
https://gcc.gnu.org/ml/gcc-patches/2016-08/msg00204.html
THanks for pointing this out.  I hadn't started looking at the meat of 
the on-demand locations until this morning.





Looking at patch 3, there's a fair amount of end-to-end testing in
 gcc.dg/format/diagnostic-ranges.c but it looks like I forgot to add an
end-to-end test there of failure due to stringification; I can add one.
 Is the rest of the v3 patch kit reviewable?
Absolutely.  I wasn't trying to imply that it wasn't  -- in fact most of 
it is self-approvable stuff and I've only got a couple questions about 
the rest.


Jeff



Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-08-04 Thread David Malcolm
On Wed, 2016-08-03 at 09:59 -0600, Jeff Law wrote:
> On 07/29/2016 03:42 PM, Joseph Myers wrote:
> > On Tue, 26 Jul 2016, David Malcolm wrote:
> > 
> > > This patch implements precise tracking of source locations for
> > > the
> > > individual chars within string literals, so that we can e.g.
> > > underline
> > > specific ranges in -Wformat diagnostics.  It handles macros,
> > > concatenated tokens, escaped characters etc.
> > 
> > What if the string literal results from stringizing other tokens
> > (which
> > might have arisen in turn from macro expansion, including expansion
> > of
> > built-in macros not just those defined in source files, etc.)? 
> >  "You don't
> > get precise locations" would be a fine answer for such cases -
> > provided
> > there is good testsuite coverage of them to show they don't crash
> > the
> > compiler or underline nonsensical characters.
> I think losing precise locations in some circumstances would be fine
> as 
> well -- as long as we understand the limitations.

In v3 of the patch, this fails gracefully.

> And, yes, crashing or underlining nonsensical characters would be
> bad, 

The API in input.c is get_source_range_for_substring, which returns an
error message (intended for us, rather than end-users); it is wrapped
by this method in c-common.c:

/* Attempt to determine the source range of the substring.
   If successful, return NULL and write the source range to *OUT_RANGE.
   Otherwise return an error message.  Error messages are intended
   for GCC developers (to help debugging) rather than for end-users.  */

const char *
substring_loc::get_range (source_range *out_range) const

> so it'd be obviously good to test some of that to ensure the
> fallbacks 
> work as expected.

As for test coverage, v2 and v3 of the kit add over a thousand lines of
selftest code that heavily exercise string lexing, using the
 line_table_case machinery to run the tests with various interesting
boundary conditions with line_table (e.g. near
 LINE_MAP_MAX_LOCATION_WITH_PACKED_RANGES).

In terms of test coverage of the fallbacks, patch 2 of v3 of the kit
directly exercises the substr_loc.get_range in 
gcc.dg/plugin/diagnostic_plugin_test_string_literals.c via
gcc.dg/plugin/diagnostic-test-string-literals-1.c, and some of the
tests there cover the failures, via:

  error_at (strloc, "unable to read substring range: %s", err);

which we wouldn't do in a normal diagnostic (but which is appropriate
for testing the machinery itself).

Patch 3 of the v3 kit adds a format_warning_va function to c-format.c
which is responsible for dealing with failures:
https://gcc.gnu.org/ml/gcc-patches/2016-08/msg00204.html

Sadly the comment got a bit mangled by git in that patch due to the
proximity to the deleted function location_column_from_byte_offset;
here's an inline copy (after patch 4, which adds param
CORRECTED_SUBSTRING for doing fix-it hints for bad format strings):

/* Emit a warning governed by option OPT, using GMSGID as the format
   string and AP as its arguments.

   Attempt to obtain precise location information within a string
   literal from FMT_LOC.

   Case 1: if substring location is available, and is within the range of
   the format string itself, the primary location of the
   diagnostic is the substring range obtained from FMT_LOC, with the
   caret at the *end* of the substring range.

   For example:

 test.c:90:10: warning: problem with '%i' here [-Wformat=]
 printf ("hello %i", msg);
~^

   Case 2: if the substring location is available, but is not within
   the range of the format string, the primary location is that of the
   format string, and an note is emitted showing the substring location.

   For example:
 test.c:90:10: warning: problem with '%i' here [-Wformat=]
 printf("hello " INT_FMT " world", msg);
^
 test.c:19: note: format string is defined here
 #define INT_FMT "%i"
  ~^

   Case 3: if precise substring information is unavailable, the primary
   location is that of the whole string passed to FMT_LOC's constructor.
   For example:

 test.c:90:10: warning: problem with '%i' here [-Wformat=]
 printf(fmt, msg);
^~~

   For each of cases 1-3, if param_range is non-NULL, then it is used
   as a secondary range within the warning.  For example, here it
   is used with case 1:

 test.c:90:16: warning: '%s' here but arg 2 has 'long' type [-Wformat=]
 printf ("foo %s bar", long_i + long_j);
  ~^   ~~~

   and here with case 2:

 test.c:90:16: warning: '%s' here but arg 2 has 'long' type [-Wformat=]
 printf ("foo " STR_FMT " bar", long_i + long_j);
 ^  ~~~
 test.c:89:16: note: format string is defined here
 #define STR_FMT "%s"
  ~^

   and with case 3:

 test.c:90:10: warning: '%i' here, but arg 2 is "const char *' [-Wformat=]
 

Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-08-03 Thread Jeff Law

On 07/29/2016 03:42 PM, Joseph Myers wrote:

On Tue, 26 Jul 2016, David Malcolm wrote:


This patch implements precise tracking of source locations for the
individual chars within string literals, so that we can e.g. underline
specific ranges in -Wformat diagnostics.  It handles macros,
concatenated tokens, escaped characters etc.


What if the string literal results from stringizing other tokens (which
might have arisen in turn from macro expansion, including expansion of
built-in macros not just those defined in source files, etc.)?  "You don't
get precise locations" would be a fine answer for such cases - provided
there is good testsuite coverage of them to show they don't crash the
compiler or underline nonsensical characters.
I think losing precise locations in some circumstances would be fine as 
well -- as long as we understand the limitations.


And, yes, crashing or underlining nonsensical characters would be bad, 
so it'd be obviously good to test some of that to ensure the fallbacks 
work as expected.


jeff



Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-08-03 Thread Jeff Law

On 07/29/2016 11:27 AM, David Malcolm wrote:

On Fri, 2016-07-29 at 17:53 +0100, Manuel López-Ibáñez wrote:

On 29 July 2016 at 16:25, David Malcolm  wrote:


FWIW, it appears that clang uses the on-demand approach; the
relevant
code appears to be StringLiteral::getLocationOfByte:
http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008


As far as I know, llvm doesn't do language diagnostics from the
middle-end/LTO. Thus, they do not have those problems.


If you really want to have middle-end diagnostics from LTO, I can make
the on-demand approach work.

I can also do the stored-location approach, but it would mean rewriting
all the patches again, I think, would be less efficient.

I would prefer the on-demand approach.

Who is empowered to make a decision here?
ISTM we've got a bit of a deadlock here with the two intertwined 
patches.  I'm wondering if we can move both forward, perhaps without the 
higher quality diagnostics for Martin's work initially.  Then iterate on 
what's in-tree to add the higher quality diagnostics, then figure out 
how to deal with some of the issues we have in the LTO space.


Martin's model of running early or late depending on flags is, IMHO, the 
right approach.  And more generally its a good solution for other 
problems in this space.  With that in mind, finding a way to get at the 
diagnostics framework from within the middle end and eventually LTO is, 
IMHO, important.


Given that the diagnostics are the uncommon case, I would strongly 
prefer an on-demand approach rather than recording a ton of stuff in the 
front-end for the unlikely case that we're going to want a diagnostic in 
the middle-end or LTO.


Jeff


Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-08-01 Thread Joseph Myers
On Thu, 28 Jul 2016, Martin Sebor wrote:

> like it as well.  So perhaps the problem to solve is how to teach
> LTO to talk to the front end.  One way to do it would be to build
> the front ends as shared libraries.

I think building front ends as shared libraries would run into different 
platforms (e.g. Windows) having very different conceptual models for 
shared libraries, especially when you get into shared libraries depending 
on symbols from the main executable (you might need to make all the 
language-independent parts of the compiler into a shared library as well).  
But a useful starting point could be to eliminate all cases where 
different front ends define external functions / variables with the same 
name (which would also enable statically linking multiple front ends 
together, to do such things without depending on shared libraries at all).

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-29 Thread Manuel López-Ibáñez
On 29 July 2016 at 18:27, David Malcolm  wrote:
> On Fri, 2016-07-29 at 17:53 +0100, Manuel López-Ibáñez wrote:
>> On 29 July 2016 at 16:25, David Malcolm  wrote:
>> >
>> > FWIW, it appears that clang uses the on-demand approach; the
>> > relevant
>> > code appears to be StringLiteral::getLocationOfByte:
>> > http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008
>>
>> As far as I know, llvm doesn't do language diagnostics from the
>> middle-end/LTO. Thus, they do not have those problems.
>
> If you really want to have middle-end diagnostics from LTO, I can make
> the on-demand approach work.

Personally, I'm happy with having this work only on the FEs. I haven't
had time to look at what Martin is doing, so he may prefer otherwise.

In any case, making it work from LTO could be done as a follow-up, no?

> I can also do the stored-location approach, but it would mean rewriting
> all the patches again, I think, would be less efficient.

Agreed, FWIW.

> I would prefer the on-demand approach.
>
> Who is empowered to make a decision here?

I thought you were the diagnostics maintainer ;-)

Cheers,

Manuel.


Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-29 Thread David Malcolm
On Fri, 2016-07-29 at 21:42 +, Joseph Myers wrote:
> On Tue, 26 Jul 2016, David Malcolm wrote:
> 
> > This patch implements precise tracking of source locations for the
> > individual chars within string literals, so that we can e.g.
> > underline
> > specific ranges in -Wformat diagnostics.  It handles macros,
> > concatenated tokens, escaped characters etc.
> 
> What if the string literal results from stringizing other tokens
> (which 
> might have arisen in turn from macro expansion, including expansion
> of 
> built-in macros not just those defined in source files, etc.)?  "You
> don't 
> get precise locations" would be a fine answer for such cases -
> provided 
> there is good testsuite coverage of them to show they don't crash the
> compiler or underline nonsensical characters.

Good question.  I briefly tested it just now, and it happens to fail
gracefully.  I'll add proper test coverage for this.


> > +   return "range starts after
> > LINE_MAP_MAX_LOCATION_WITH_COLS";
> 
> Where do these strings get used?  Hopefully not in diagnostics for
> users, 
> as they aren't written in user terms, and any diagnostic string like
> that 
> would need to be marked up to be extracted for translation.

Quoting from the comment for get_source_range_for_substring:

   Return NULL if successful, or an error message if any errors occurred.
   Error messages are intended for GCC developers (to help debugging) rather
   than for end-users.

and various functions in the patch follow this pattern (maybe I need to
add this to more comments?)

I initially had these functions return bool, but found that a const
char * was much more useful when debugging failures.
(In the testsuite I do happen to use it in a diagnostic, but that's in
a plugin, and is purely intended for verifying that various cases are
hitting various error paths - analogous to looking for messages in a
dumpfile).


Thanks
Dave


Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-29 Thread Joseph Myers
On Tue, 26 Jul 2016, David Malcolm wrote:

> This patch implements precise tracking of source locations for the
> individual chars within string literals, so that we can e.g. underline
> specific ranges in -Wformat diagnostics.  It handles macros,
> concatenated tokens, escaped characters etc.

What if the string literal results from stringizing other tokens (which 
might have arisen in turn from macro expansion, including expansion of 
built-in macros not just those defined in source files, etc.)?  "You don't 
get precise locations" would be a fine answer for such cases - provided 
there is good testsuite coverage of them to show they don't crash the 
compiler or underline nonsensical characters.

> + return "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS";

Where do these strings get used?  Hopefully not in diagnostics for users, 
as they aren't written in user terms, and any diagnostic string like that 
would need to be marked up to be extracted for translation.

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-29 Thread David Malcolm
On Fri, 2016-07-29 at 17:53 +0100, Manuel López-Ibáñez wrote:
> On 29 July 2016 at 16:25, David Malcolm  wrote:
> > 
> > FWIW, it appears that clang uses the on-demand approach; the
> > relevant
> > code appears to be StringLiteral::getLocationOfByte:
> > http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008
> 
> As far as I know, llvm doesn't do language diagnostics from the
> middle-end/LTO. Thus, they do not have those problems.

If you really want to have middle-end diagnostics from LTO, I can make
the on-demand approach work.

I can also do the stored-location approach, but it would mean rewriting
all the patches again, I think, would be less efficient.

I would prefer the on-demand approach.

Who is empowered to make a decision here?




Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-29 Thread Manuel López-Ibáñez
On 29 July 2016 at 16:25, David Malcolm  wrote:
>
> FWIW, it appears that clang uses the on-demand approach; the relevant
> code appears to be StringLiteral::getLocationOfByte:
> http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008

As far as I know, llvm doesn't do language diagnostics from the
middle-end/LTO. Thus, they do not have those problems.

Cheers,

Manuel.


Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-29 Thread David Malcolm
On Fri, 2016-07-29 at 10:46 -0400, David Malcolm wrote:
> On Fri, 2016-07-29 at 08:22 -0600, Martin Sebor wrote:
> > > Currently all that we need from the C family of frontends is the
> > > cpp_reader and the string concatenation records.  I think we can
> > > reconstruct the cpp_reader if we have the options, though
> > > presumably
> > > that's per TU, so to support all this we'd need to capture e.g.
> > > the
> > > per
> > > -TU encoding information in the LTO records, for the case where
> > > one
> > > TU
> > > is UTF-8 encoded source to UTF-8 execution, and another TU is
> > > EBCDIC
> > > -encoded source to UCS-4 execution (or whatever).  And there's an
> > > issue
> > > if different TUs compiled the same header with different encoding
> > > options.
> > > 
> > > Or... we could not bother.  This is a Quality of Implementation
> > > thing,
> > > for improving diagnostics, and in each case, the diagnostic is
> > > required
> > > to cope with substring location information not being available
> > > (and
> > > the code I posted in patch 2 of the kit makes it trivial to
> > > handle
> > > that
> > > case from a diagnostic).  So we could simply have LTO use the
> > > fallback mode.
> > > 
> > > There are two high-level approaches I've tried:
> > > 
> > > (a) capture the substring location information in the
> > > lexer/parser
> > > in
> > > the frontend as it runs, and store it somehow.
> > > 
> > > (b) regenerate it "on-demand" when a diagnostic needs it.
> > > 
> > > Approach (b) is inherently going to be prone to the LTO issues
> > > you
> > > describe, but it avoids adding to the CPU cycles/memory
> > > consumption
> > > for
> > > the common case of not needing the information. [1]
> > > 
> > > Is approach (b) acceptable?
> > 
> > If (b) means potentially reduced quality of the location ranges
> > in the -Wformat-length pass (e.g., with funky C++ format strings)
> > then I don't think that's enough of a problem to worry about, at
> > least not for this warning.
> > 
> > If it means not being able to use the solution you're working
> > on in the middle end  at all (unless I misunderstood that doesn't
> > seem to be what you're implying, but just to be sure) then that
> > would seem like a serious shortcoming.  I would continue to use
> > the code I copied from c-format.c (assuming that will still work),
> > but as more warnings are implemented in later passes it would
> > lead to duplicating code or reinventing the wheel just to get
> > around the limitation (or simply worse quality diagnostics).
> 
> It'll work fine for the middle-end within cc1 and cc1plus.
> 
> I'm specifically referring to LTO here, and it would be fixable from
> LTO if we can encode information about the TU encoding options into
> the
> LTO data stream, and capture the string concatenation records there
> too
> (but that would be followup work).

FWIW, it appears that clang uses the on-demand approach; the relevant
code appears to be StringLiteral::getLocationOfByte:
http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008


> 
> > Martin
> > 
> > > 
> > > Thanks
> > > Dave
> > > 
> > > [1] with the exception of the string concatenation records, but I
> > > believe those are tiny
> > > 
> > 


Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-29 Thread David Malcolm
On Fri, 2016-07-29 at 08:22 -0600, Martin Sebor wrote:
> > Currently all that we need from the C family of frontends is the
> > cpp_reader and the string concatenation records.  I think we can
> > reconstruct the cpp_reader if we have the options, though
> > presumably
> > that's per TU, so to support all this we'd need to capture e.g. the
> > per
> > -TU encoding information in the LTO records, for the case where one
> > TU
> > is UTF-8 encoded source to UTF-8 execution, and another TU is
> > EBCDIC
> > -encoded source to UCS-4 execution (or whatever).  And there's an
> > issue
> > if different TUs compiled the same header with different encoding
> > options.
> > 
> > Or... we could not bother.  This is a Quality of Implementation
> > thing,
> > for improving diagnostics, and in each case, the diagnostic is
> > required
> > to cope with substring location information not being available
> > (and
> > the code I posted in patch 2 of the kit makes it trivial to handle
> > that
> > case from a diagnostic).  So we could simply have LTO use the
> > fallback mode.
> > 
> > There are two high-level approaches I've tried:
> > 
> > (a) capture the substring location information in the lexer/parser
> > in
> > the frontend as it runs, and store it somehow.
> > 
> > (b) regenerate it "on-demand" when a diagnostic needs it.
> > 
> > Approach (b) is inherently going to be prone to the LTO issues you
> > describe, but it avoids adding to the CPU cycles/memory consumption
> > for
> > the common case of not needing the information. [1]
> > 
> > Is approach (b) acceptable?
> 
> If (b) means potentially reduced quality of the location ranges
> in the -Wformat-length pass (e.g., with funky C++ format strings)
> then I don't think that's enough of a problem to worry about, at
> least not for this warning.
> 
> If it means not being able to use the solution you're working
> on in the middle end  at all (unless I misunderstood that doesn't
> seem to be what you're implying, but just to be sure) then that
> would seem like a serious shortcoming.  I would continue to use
> the code I copied from c-format.c (assuming that will still work),
> but as more warnings are implemented in later passes it would
> lead to duplicating code or reinventing the wheel just to get
> around the limitation (or simply worse quality diagnostics).

It'll work fine for the middle-end within cc1 and cc1plus.

I'm specifically referring to LTO here, and it would be fixable from
LTO if we can encode information about the TU encoding options into the
LTO data stream, and capture the string concatenation records there too
(but that would be followup work).

> Martin
> 
> > 
> > Thanks
> > Dave
> > 
> > [1] with the exception of the string concatenation records, but I
> > believe those are tiny
> > 
> 


Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-29 Thread Martin Sebor

Currently all that we need from the C family of frontends is the
cpp_reader and the string concatenation records.  I think we can
reconstruct the cpp_reader if we have the options, though presumably
that's per TU, so to support all this we'd need to capture e.g. the per
-TU encoding information in the LTO records, for the case where one TU
is UTF-8 encoded source to UTF-8 execution, and another TU is EBCDIC
-encoded source to UCS-4 execution (or whatever).  And there's an issue
if different TUs compiled the same header with different encoding
options.

Or... we could not bother.  This is a Quality of Implementation thing,
for improving diagnostics, and in each case, the diagnostic is required
to cope with substring location information not being available (and
the code I posted in patch 2 of the kit makes it trivial to handle that
case from a diagnostic).  So we could simply have LTO use the
fallback mode.

There are two high-level approaches I've tried:

(a) capture the substring location information in the lexer/parser in
the frontend as it runs, and store it somehow.

(b) regenerate it "on-demand" when a diagnostic needs it.

Approach (b) is inherently going to be prone to the LTO issues you
describe, but it avoids adding to the CPU cycles/memory consumption for
the common case of not needing the information. [1]

Is approach (b) acceptable?


If (b) means potentially reduced quality of the location ranges
in the -Wformat-length pass (e.g., with funky C++ format strings)
then I don't think that's enough of a problem to worry about, at
least not for this warning.

If it means not being able to use the solution you're working
on in the middle end  at all (unless I misunderstood that doesn't
seem to be what you're implying, but just to be sure) then that
would seem like a serious shortcoming.  I would continue to use
the code I copied from c-format.c (assuming that will still work),
but as more warnings are implemented in later passes it would
lead to duplicating code or reinventing the wheel just to get
around the limitation (or simply worse quality diagnostics).

Martin



Thanks
Dave

[1] with the exception of the string concatenation records, but I
believe those are tiny





Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-29 Thread David Malcolm
On Thu, 2016-07-28 at 15:16 -0600, Martin Sebor wrote:
> On 07/28/2016 02:38 PM, Martin Sebor wrote:
> > On 07/28/2016 02:12 PM, David Malcolm wrote:
> > > On Wed, 2016-07-27 at 23:41 +0100, Manuel López-Ibáñez wrote:
> > > > On 27 July 2016 at 15:30, David Malcolm 
> > > > wrote:
> > > > > > Perhaps it could live for now in c-format.c, since it is
> > > > > > the only
> > > > > > place using it?
> > > > > 
> > > > > Martin Sebor [CC-ed] wants to use it from the middle-end:
> > > > >https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html
> > > > > so it's unclear to me that c-format.c would be a better
> > > > > location.
> > > > 
> > > > Fine. He will have to figure out how to get a cpp_reader from
> > > > the
> > > > middle-end, though.
> > > 
> > > It seems to me that on-demand reconstruction of source locations
> > > for
> > > STRING_CST nodes is inherently frontend-specific: unless we have
> > > the
> > > frontend record the information in some fe-independent way (which
> > > I
> > > assume we *don't* want to do, for space-efficiency), we need to
> > > be able
> > > to effectively re-run part of the frontend.
> > > 
> > > So maybe this needs to be a langhook; the c-family can use the
> > > global
> > > cpp_reader * there, and everything else can return a "not
> > > supported"
> > > code if a diagnostic requests substring location information (and
> > > the
> > > diagnostic needs to be able to cope with that).
> > 
> > The problem with the lanhook approach, as I learned from my first
> > -Wformat-length attempt, is that it doesn't make the front end
> > implementation available to LTO.  So passes that run late enough
> > with LTO (like the latest version of the -Wformat-length pass
> > does) would not be bale to make use of it.
> 
> I'm sorry, I didn't mean to sound like I was dismissing the idea.
> I agree that string processing is language and front-end specific.
> Having the middle end call back into the front-end also seems like
> the right thing to do, not just to make this case work, but others
> like it as well.  So perhaps the problem to solve is how to teach
> LTO to talk to the front end.  One way to do it would be to build
> the front ends as shared libraries.

Turning frontends into shared libraries as a prerequisite would seem to
be imposing a significant burden on the patch.

Currently all that we need from the C family of frontends is the
cpp_reader and the string concatenation records.  I think we can
reconstruct the cpp_reader if we have the options, though presumably
that's per TU, so to support all this we'd need to capture e.g. the per
-TU encoding information in the LTO records, for the case where one TU
is UTF-8 encoded source to UTF-8 execution, and another TU is EBCDIC
-encoded source to UCS-4 execution (or whatever).  And there's an issue
if different TUs compiled the same header with different encoding
options.

Or... we could not bother.  This is a Quality of Implementation thing,
for improving diagnostics, and in each case, the diagnostic is required
to cope with substring location information not being available (and
the code I posted in patch 2 of the kit makes it trivial to handle that
case from a diagnostic).  So we could simply have LTO use the
fallback mode.

There are two high-level approaches I've tried:

(a) capture the substring location information in the lexer/parser in
the frontend as it runs, and store it somehow.

(b) regenerate it "on-demand" when a diagnostic needs it.

Approach (b) is inherently going to be prone to the LTO issues you
describe, but it avoids adding to the CPU cycles/memory consumption for
the common case of not needing the information. [1]

Is approach (b) acceptable?

Thanks
Dave

[1] with the exception of the string concatenation records, but I
believe those are tiny


Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-28 Thread Martin Sebor

On 07/28/2016 02:38 PM, Martin Sebor wrote:

On 07/28/2016 02:12 PM, David Malcolm wrote:

On Wed, 2016-07-27 at 23:41 +0100, Manuel López-Ibáñez wrote:

On 27 July 2016 at 15:30, David Malcolm  wrote:

Perhaps it could live for now in c-format.c, since it is the only
place using it?


Martin Sebor [CC-ed] wants to use it from the middle-end:
   https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html
so it's unclear to me that c-format.c would be a better location.


Fine. He will have to figure out how to get a cpp_reader from the
middle-end, though.


It seems to me that on-demand reconstruction of source locations for
STRING_CST nodes is inherently frontend-specific: unless we have the
frontend record the information in some fe-independent way (which I
assume we *don't* want to do, for space-efficiency), we need to be able
to effectively re-run part of the frontend.

So maybe this needs to be a langhook; the c-family can use the global
cpp_reader * there, and everything else can return a "not supported"
code if a diagnostic requests substring location information (and the
diagnostic needs to be able to cope with that).


The problem with the lanhook approach, as I learned from my first
-Wformat-length attempt, is that it doesn't make the front end
implementation available to LTO.  So passes that run late enough
with LTO (like the latest version of the -Wformat-length pass
does) would not be bale to make use of it.


I'm sorry, I didn't mean to sound like I was dismissing the idea.
I agree that string processing is language and front-end specific.
Having the middle end call back into the front-end also seems like
the right thing to do, not just to make this case work, but others
like it as well.  So perhaps the problem to solve is how to teach
LTO to talk to the front end.  One way to do it would be to build
the front ends as shared libraries.

Martin


Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-28 Thread Martin Sebor

On 07/28/2016 02:12 PM, David Malcolm wrote:

On Wed, 2016-07-27 at 23:41 +0100, Manuel López-Ibáñez wrote:

On 27 July 2016 at 15:30, David Malcolm  wrote:

Perhaps it could live for now in c-format.c, since it is the only
place using it?


Martin Sebor [CC-ed] wants to use it from the middle-end:
   https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html
so it's unclear to me that c-format.c would be a better location.


Fine. He will have to figure out how to get a cpp_reader from the
middle-end, though.


It seems to me that on-demand reconstruction of source locations for
STRING_CST nodes is inherently frontend-specific: unless we have the
frontend record the information in some fe-independent way (which I
assume we *don't* want to do, for space-efficiency), we need to be able
to effectively re-run part of the frontend.

So maybe this needs to be a langhook; the c-family can use the global
cpp_reader * there, and everything else can return a "not supported"
code if a diagnostic requests substring location information (and the
diagnostic needs to be able to cope with that).


The problem with the lanhook approach, as I learned from my first
-Wformat-length attempt, is that it doesn't make the front end
implementation available to LTO.  So passes that run late enough
with LTO (like the latest version of the -Wformat-length pass
does) would not be bale to make use of it.

Martin


Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-28 Thread David Malcolm
On Wed, 2016-07-27 at 23:41 +0100, Manuel López-Ibáñez wrote:
> On 27 July 2016 at 15:30, David Malcolm  wrote:
> > > Perhaps it could live for now in c-format.c, since it is the only
> > > place using it?
> > 
> > Martin Sebor [CC-ed] wants to use it from the middle-end:
> >   https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html
> > so it's unclear to me that c-format.c would be a better location.
> 
> Fine. He will have to figure out how to get a cpp_reader from the
> middle-end, though.

It seems to me that on-demand reconstruction of source locations for
STRING_CST nodes is inherently frontend-specific: unless we have the
frontend record the information in some fe-independent way (which I
assume we *don't* want to do, for space-efficiency), we need to be able
to effectively re-run part of the frontend.

So maybe this needs to be a langhook; the c-family can use the global
cpp_reader * there, and everything else can return a "not supported"
code if a diagnostic requests substring location information (and the
diagnostic needs to be able to cope with that).

> > There are various places it could live; but getting it working took
> > a
> > lot of effort to achieve - the currently proposed mixture of
> > libcpp,
> > input.c and c-format.c for the locations of the various pieces
> > works
> > (for example, auto_vec isn't available in libcpp).
> 
> I don't doubt it. I tried to do something similar in the past and I
> failed, this is why I ended up with the poor approximation that was
> in
> place until now. This is a significant step forward.

Thanks (for the current implementation, and for the kind words).

> Is libcpp still C? When would be the time to move it to C++ already
> and start using common utilities?

libcpp is very much C++: I converted the linemap types to use
inheritance as part of gcc 6 (and it helped a lot when implementing the
range-tracking stuff).

> Also, moving vec.h, sbitmap, etc to their own directory/library so
> that they can be used by other parts of the compiler (hey! maybe even
> by other parts of the toolchain?) is desirable. Richard has said in
> the past that he supports such moves. Did I understand correctly
> Richard?

FWIW, I'd want the selftest framework there too; part of the reason
things are in input.c rather than libcpp in the current patch is that
selftests aren't yet available from libcpp (and reworking that seems
orthogonal).

> > Given that both Martin and I have candidate patches that are
> > touching
> > the same area, I'd prefer to focus on getting this code in to
> > trunk,
> > rather than rewrite it out-of-tree, so that we can at least have
> > the
> > improvement to location-handling for Wformat.  Once the code is in
> > the
> > tree, it should be easier to figure out how to access it from the
> > middle-end.
> 
> Sure, I think this version is fine. I'm a big proponent of
> step-by-step, even if the steps are only approximations to the
> optimal
> solution :)
> It may be enough to motivate someone else more capable to improve
> over
> my poor approximations ;-)

:)

> > > [*] In an ideal world, we would have a language-agnostic
> > > diagnostics
> > > library
> > > that would include line-map and that would be used by libcpp and
> > > the
> > > rest of
> > > GCC, so that we can remove all the error-routines in libcpp and
> > > the
> > > awkward
> > > glue code that ties it into diagnostics.c.,
> > 
> > Agreed, though that may have to wait until gcc 8 at this point.
> > (Given that the proposed diagnostics library would use line maps,
> > and
> > would be used by libcpp, would it make sense to move the
> > diagnostics
> > into libcpp itself?  Diagnostics would seem to be intimately
> > related to
> > location-tracking)
> 
> I don't think so. There is nothing in diagnostic.* pretty-print.*
> input.* line-map.* that requires libcpp (and only two mentions of
> tree
> that could be easily abstracted out). This was a deliberate design
> goal of Gabriel and followed by most of us later working on
> diagnostics. Of course, cpp may make use of the new library, but also
> other parts of the toolchain (GAS?). The main obstacle I faced when
> trying to do this move was the build machinery to make both libcpp
> and
> gcc build and statically link with this new library.
> 
> Once that move is done, the main abstraction challenge to remove the
> glue is that libcpp has its own flags for options and diagnostics
> that
> are independent from those of gcc (see c_cpp_error in c-common.c). It
> would be great if libcpp used the common flags, but then one would
> have to figure out a way to reorder things so that the diagnostic
> library, libcpp and gcc can use (or avoid being dependent on) the
> same
> flags.

Thanks.

Dave


Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-27 Thread Manuel López-Ibáñez
On 27 July 2016 at 15:30, David Malcolm  wrote:
>> Perhaps it could live for now in c-format.c, since it is the only
>> place using it?
>
> Martin Sebor [CC-ed] wants to use it from the middle-end:
>   https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html
> so it's unclear to me that c-format.c would be a better location.

Fine. He will have to figure out how to get a cpp_reader from the
middle-end, though.

> There are various places it could live; but getting it working took a
> lot of effort to achieve - the currently proposed mixture of libcpp,
> input.c and c-format.c for the locations of the various pieces works
> (for example, auto_vec isn't available in libcpp).

I don't doubt it. I tried to do something similar in the past and I
failed, this is why I ended up with the poor approximation that was in
place until now. This is a significant step forward.

Is libcpp still C? When would be the time to move it to C++ already
and start using common utilities?

Also, moving vec.h, sbitmap, etc to their own directory/library so
that they can be used by other parts of the compiler (hey! maybe even
by other parts of the toolchain?) is desirable. Richard has said in
the past that he supports such moves. Did I understand correctly
Richard?

> Given that both Martin and I have candidate patches that are touching
> the same area, I'd prefer to focus on getting this code in to trunk,
> rather than rewrite it out-of-tree, so that we can at least have the
> improvement to location-handling for Wformat.  Once the code is in the
> tree, it should be easier to figure out how to access it from the
> middle-end.

Sure, I think this version is fine. I'm a big proponent of
step-by-step, even if the steps are only approximations to the optimal
solution :)
It may be enough to motivate someone else more capable to improve over
my poor approximations ;-)


>> [*] In an ideal world, we would have a language-agnostic diagnostics
>> library
>> that would include line-map and that would be used by libcpp and the
>> rest of
>> GCC, so that we can remove all the error-routines in libcpp and the
>> awkward
>> glue code that ties it into diagnostics.c.,
>
> Agreed, though that may have to wait until gcc 8 at this point.
> (Given that the proposed diagnostics library would use line maps, and
> would be used by libcpp, would it make sense to move the diagnostics
> into libcpp itself?  Diagnostics would seem to be intimately related to
> location-tracking)

I don't think so. There is nothing in diagnostic.* pretty-print.*
input.* line-map.* that requires libcpp (and only two mentions of tree
that could be easily abstracted out). This was a deliberate design
goal of Gabriel and followed by most of us later working on
diagnostics. Of course, cpp may make use of the new library, but also
other parts of the toolchain (GAS?). The main obstacle I faced when
trying to do this move was the build machinery to make both libcpp and
gcc build and statically link with this new library.

Once that move is done, the main abstraction challenge to remove the
glue is that libcpp has its own flags for options and diagnostics that
are independent from those of gcc (see c_cpp_error in c-common.c). It
would be great if libcpp used the common flags, but then one would
have to figure out a way to reorder things so that the diagnostic
library, libcpp and gcc can use (or avoid being dependent on) the same
flags.

Cheers,

Manuel.


Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-27 Thread David Malcolm
On Tue, 2016-07-26 at 19:05 +0100, Manuel López-Ibáñez wrote:
> On 26/07/16 18:11, David Malcolm wrote:
> 
> > gcc/ChangeLog:
> > * gcc.c (cpp_options): Rename string to...
> > (cpp_options_): ...this, to avoid clashing with struct in
> > cpplib.h.
> 
> It seems to me that you need this because  now gcc.c includes
> cpplib.h via 
> input.h, which seems wrong.
> 
> input.h was FE-independent (it depends on line-map.h but it is an
> accident of 
> history that line-map.h is in libcpp since it doesn't depend on
> anything from 
> libcpp [*]). Note that input.h is included in coretypes.h, so this
> means that 
> now cpplib.h is included almost everywhere! [**]
> 
> There is the following in coretypes.h:
> 
> /* Provide forward struct declaration so that we don't have to
> include
> all of cpplib.h whenever a random prototype includes a pointer.
> Note that the cpp_reader and cpp_token typedefs remain part of
> cpplib.h.  */
> 
> struct cpp_reader;
> struct cpp_token;
> 
> precisely to avoid including cpplib.h.
> 
> 
> If I understand correctly, cpplib.h is needed in input.h because of
> this 
> declaration:
> 
> +extern const char *get_source_range_for_substring (cpp_reader
> *pfile,
> +string_concat_db
> *concats,
> +location_t
> strloc,
> +enum cpp_ttype
> type,
> +int start_idx,
> int end_idx,
> +source_range
> *out_range);
> 
> 
> Does this really need to be in input.h ?  It seems something that
> only C-family 
> languages will be able to use. Note that you need a reader to use
> this 
> function, and for that, you need to already include cpplib.h.

Fair point; the attached modification to patch 1 compiles cleanly, and
moves it to a new header.

> Perhaps it could live for now in c-format.c, since it is the only
> place using it?

Martin Sebor [CC-ed] wants to use it from the middle-end:
  https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html
so it's unclear to me that c-format.c would be a better location.

There are various places it could live; but getting it working took a
lot of effort to achieve - the currently proposed mixture of libcpp,
input.c and c-format.c for the locations of the various pieces works
(for example, auto_vec isn't available in libcpp).

Given that both Martin and I have candidate patches that are touching
the same area, I'd prefer to focus on getting this code in to trunk,
rather than rewrite it out-of-tree, so that we can at least have the
improvement to location-handling for Wformat.  Once the code is in the
tree, it should be easier to figure out how to access it from the
middle-end.

> Cheers,
> 
>   Manuel.
> 
> [*] In an ideal world, we would have a language-agnostic diagnostics
> library 
> that would include line-map and that would be used by libcpp and the
> rest of 
> GCC, so that we can remove all the error-routines in libcpp and the
> awkward 
> glue code that ties it into diagnostics.c.,

Agreed, though that may have to wait until gcc 8 at this point.
(Given that the proposed diagnostics library would use line maps, and
would be used by libcpp, would it make sense to move the diagnostics
into libcpp itself?  Diagnostics would seem to be intimately related to
location-tracking)

> [**] And it seems that we are slowly undoing all the work that was
> done by 
> Andrew MacLeod to clean up the .h web and remove dependencies 
> (https://gcc.gnu.org/wiki/rearch).
> 
> From 09824cb27c0e817b29de1c7eb9b53c603116f13e Mon Sep 17 00:00:00 2001
From: David Malcolm 
Date: Wed, 27 Jul 2016 10:33:52 -0400
Subject: [PATCH] Avoid including cpplib.h from input.h

gcc/c-family/ChangeLog:
	* c-common.c: Include "substring-locations.h".

gcc/ChangeLog:
	* input.h: Don't include cpplib.h.
	(get_source_range_for_substring): Move to...
	* substring-locations.h: New header.
---
 gcc/c-family/c-common.c   |  1 +
 gcc/input.h   |  8 
 gcc/substring-locations.h | 30 ++
 3 files changed, 31 insertions(+), 8 deletions(-)
 create mode 100644 gcc/substring-locations.h

diff --git a/gcc/c-family/c-common.c b/gcc/c-family/c-common.c
index f4ffc0e..c4843db 100644
--- a/gcc/c-family/c-common.c
+++ b/gcc/c-family/c-common.c
@@ -45,6 +45,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-iterator.h"
 #include "opts.h"
 #include "gimplify.h"
+#include "substring-locations.h"
 
 cpp_reader *parse_in;		/* Declared in c-pragma.h.  */
 
diff --git a/gcc/input.h b/gcc/input.h
index 24d9115..c17e440 100644
--- a/gcc/input.h
+++ b/gcc/input.h
@@ -22,7 +22,6 @@ along with GCC; see the file COPYING3.  If not see
 #define GCC_INPUT_H
 
 #include "line-map.h"
-#include 
 
 extern GTY(()) struct line_maps *line_table;
 
@@ -131,11 +130,4 @@ class GTY(()) 

Re: [PATCH 1/3] (v2) On-demand locations within string-literals

2016-07-26 Thread Manuel López-Ibáñez

On 26/07/16 18:11, David Malcolm wrote:


gcc/ChangeLog:
* gcc.c (cpp_options): Rename string to...
(cpp_options_): ...this, to avoid clashing with struct in
cpplib.h.


It seems to me that you need this because  now gcc.c includes cpplib.h via 
input.h, which seems wrong.


input.h was FE-independent (it depends on line-map.h but it is an accident of 
history that line-map.h is in libcpp since it doesn't depend on anything from 
libcpp [*]). Note that input.h is included in coretypes.h, so this means that 
now cpplib.h is included almost everywhere! [**]


There is the following in coretypes.h:

/* Provide forward struct declaration so that we don't have to include
   all of cpplib.h whenever a random prototype includes a pointer.
   Note that the cpp_reader and cpp_token typedefs remain part of
   cpplib.h.  */

struct cpp_reader;
struct cpp_token;

precisely to avoid including cpplib.h.


If I understand correctly, cpplib.h is needed in input.h because of this 
declaration:


+extern const char *get_source_range_for_substring (cpp_reader *pfile,
+  string_concat_db *concats,
+  location_t strloc,
+  enum cpp_ttype type,
+  int start_idx, int end_idx,
+  source_range *out_range);


Does this really need to be in input.h ?  It seems something that only C-family 
languages will be able to use. Note that you need a reader to use this 
function, and for that, you need to already include cpplib.h.


Perhaps it could live for now in c-format.c, since it is the only place using 
it?

Cheers,

Manuel.

[*] In an ideal world, we would have a language-agnostic diagnostics library 
that would include line-map and that would be used by libcpp and the rest of 
GCC, so that we can remove all the error-routines in libcpp and the awkward 
glue code that ties it into diagnostics.c.


[**] And it seems that we are slowly undoing all the work that was done by 
Andrew MacLeod to clean up the .h web and remove dependencies 
(https://gcc.gnu.org/wiki/rearch).