Re: [PATCH v4 8/8] diagnostics: Support generated data locations in SARIF output

2023-08-15 Thread Lewis Hyatt via Gcc-patches
On Tue, Aug 15, 2023 at 01:04:04PM -0400, David Malcolm wrote:
> On Wed, 2023-08-09 at 18:14 -0400, Lewis Hyatt wrote:
> > The diagnostics routines for SARIF output need to read the source code back
> > in, so that they can generate "snippet" and "content" records, so they need 
> > to
> > be able to cope with generated data locations.  Add support for that in
> > diagnostic-format-sarif.cc.
> > 
> > gcc/ChangeLog:
> > 
> > * diagnostic-format-sarif.cc (class sarif_builder): Adapt interface
> > to support generated data locations.
> > (sarif_builder::maybe_make_physical_location_object): Change the
> > m_filenames hash_set to support generated data.
> > (sarif_builder::make_artifact_location_object): Use a source_id 
> > rather
> > than a plain file name.
> > (sarif_builder::maybe_make_region_object): Adapt to
> > expanded_location interface changes.
> > (sarif_builder::maybe_make_region_object_for_context): Likewise.
> > (sarif_builder::make_artifact_object): Likewise.
> > (sarif_builder::make_run_object): Handle generated data.
> > (sarif_builder::maybe_make_artifact_content_object): Likewise.
> > (get_source_lines): Likewise.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> > * c-c++-common/diagnostic-format-sarif-file-5.c: New test.
> 
> I'm not sure if generated data is allowed as part of a SARIF artefact,
> or if there's a more standard-compliant way of representing this; SARIF
> says an artefact is a "sequence of bytes addressable via a URI".
> 
> Can you post a simple example of the generated .sarif JSON please? 
> e.g. from the new test, so that we can see it looks like.
> 
> You could run it through:
> 
>   python -m json.tool 
> 
> to format it for easier reading.

For a simple example like:

_Pragma("GCC diagnostic ignored \"-Wnot-an-option\"")

for which the normal output is:

=
In buffer generated from t.cpp:1:
:1:24: warning: unknown option after ‘#pragma GCC diagnostic’ kind 
[-Wpragmas]
1 | GCC diagnostic ignored "-Wnot-an-option"
  |^
t.cpp:1:1: note: in <_Pragma directive>
1 | _Pragma("GCC diagnostic ignored \"-Wnot-an-option\"")
  | ^~~
=

The SARIF output does not end up referencing any generated data locations,
because those are logically part of the "expansion" of the _Pragma
directive, and it doesn't output macro expansions.  In order for SARIF to
currently do something with generated data, it needs to see a generated data
location in a non-macro context. The only way to get GCC to do that, right
now, is with -fdump-internal-locations, which is what the new test case
does. That just unfortunately generates a larger amount of output. I attached
it, in case that's still helpful, for the following program:

=
_Pragma("GCC diagnostic push")
=

I guess there's potentially already a problem here because 'python -m
json.tool' is unhappy with this output and refuses to process it:

=
Invalid \escape: line 1 column 3436 (char 3435)
=

The related text is:
=
{"location": {"uri": "", "uriBaseId": "PWD"},
"contents":{"text": "GCC diagnostic push\n\0"}
=

And the \0 is not allowed it seems?

I also attached the output of 'python -m json.tool' anyway, after manually
removing the \0.

Is it better to just skip these locations for now?

-Lewis
{"$schema": 
"https://raw.githubusercontent.com/oasis-tcs/sarif-spec/master/Schemata/sarif-schema-2.1.0.json;,
 "version": "2.1.0", "runs": [{"tool": {"driver": {"name": "GNU C++17", 
"fullName": "GNU C++17 (GCC) version 14.0.0 20230811 (experimental) 
(x86_64-pc-linux-gnu)", "version": "14.0.0 20230811 (experimental)", 
"informationUri": "https://gcc.gnu.org/gcc-14/;, "rules": []}}, "invocations": 
[{"executionSuccessful": true, "toolExecutionNotifications": []}], 
"originalUriBaseIds": {"PWD": {"uri": "file:///home/lewis/"}}, "artifacts": 
[{"location": {"uri": "t.cpp", "uriBaseId": "PWD"}, "contents": {"text": 
"_Pragma(\"GCC diagnostic push\")\n"}, "sourceLanguage": "cplusplus"}, 
{"location": {"uri": "/usr/include/stdc-predef.h"}, "contents": {"text": "/* 
Copyright (C) 1991-2022 Free Software Foundation, Inc.\n   This file is part of 
the GNU C Library.\n\n   The GNU C Library is free software; you can 
redistribute it and/or\n   modify it under the terms of the GNU Lesser General 
Public\n 
   License as published by the Free Software Foundation; either\n   version 2.1 
of the License, or (at your option) any later version.\n\n   The GNU C Library 
is distributed in the hope that it will be useful,\n   but WITHOUT ANY 
WARRANTY; without even the implied warranty of\n   MERCHANTABILITY or FITNESS 
FOR A PARTICULAR PURPOSE.  See the GNU\n   Lesser General Public License for 
more details.\n\n   You should have received a copy of the GNU Lesser General 
Public\n   License along with the GNU C Library; if not, see\n   
.  

Re: [PATCH v4 8/8] diagnostics: Support generated data locations in SARIF output

2023-08-15 Thread David Malcolm via Gcc-patches
On Wed, 2023-08-09 at 18:14 -0400, Lewis Hyatt wrote:
> The diagnostics routines for SARIF output need to read the source code back
> in, so that they can generate "snippet" and "content" records, so they need to
> be able to cope with generated data locations.  Add support for that in
> diagnostic-format-sarif.cc.
> 
> gcc/ChangeLog:
> 
> * diagnostic-format-sarif.cc (class sarif_builder): Adapt interface
> to support generated data locations.
> (sarif_builder::maybe_make_physical_location_object): Change the
> m_filenames hash_set to support generated data.
> (sarif_builder::make_artifact_location_object): Use a source_id rather
> than a plain file name.
> (sarif_builder::maybe_make_region_object): Adapt to
> expanded_location interface changes.
> (sarif_builder::maybe_make_region_object_for_context): Likewise.
> (sarif_builder::make_artifact_object): Likewise.
> (sarif_builder::make_run_object): Handle generated data.
> (sarif_builder::maybe_make_artifact_content_object): Likewise.
> (get_source_lines): Likewise.
> 
> gcc/testsuite/ChangeLog:
> 
> * c-c++-common/diagnostic-format-sarif-file-5.c: New test.

I'm not sure if generated data is allowed as part of a SARIF artefact,
or if there's a more standard-compliant way of representing this; SARIF
says an artefact is a "sequence of bytes addressable via a URI".

Can you post a simple example of the generated .sarif JSON please? 
e.g. from the new test, so that we can see it looks like.

You could run it through:

  python -m json.tool 

to format it for easier reading.


Thanks
Dave



[PATCH v4 8/8] diagnostics: Support generated data locations in SARIF output

2023-08-09 Thread Lewis Hyatt via Gcc-patches
The diagnostics routines for SARIF output need to read the source code back
in, so that they can generate "snippet" and "content" records, so they need to
be able to cope with generated data locations.  Add support for that in
diagnostic-format-sarif.cc.

gcc/ChangeLog:

* diagnostic-format-sarif.cc (class sarif_builder): Adapt interface
to support generated data locations.
(sarif_builder::maybe_make_physical_location_object): Change the
m_filenames hash_set to support generated data.
(sarif_builder::make_artifact_location_object): Use a source_id rather
than a plain file name.
(sarif_builder::maybe_make_region_object): Adapt to
expanded_location interface changes.
(sarif_builder::maybe_make_region_object_for_context): Likewise.
(sarif_builder::make_artifact_object): Likewise.
(sarif_builder::make_run_object): Handle generated data.
(sarif_builder::maybe_make_artifact_content_object): Likewise.
(get_source_lines): Likewise.

gcc/testsuite/ChangeLog:

* c-c++-common/diagnostic-format-sarif-file-5.c: New test.
---
 gcc/diagnostic-format-sarif.cc| 88 +++
 .../diagnostic-format-sarif-file-5.c  | 31 +++
 2 files changed, 82 insertions(+), 37 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/diagnostic-format-sarif-file-5.c

diff --git a/gcc/diagnostic-format-sarif.cc b/gcc/diagnostic-format-sarif.cc
index 1eff71962d7..c7c0e5d4b0a 100644
--- a/gcc/diagnostic-format-sarif.cc
+++ b/gcc/diagnostic-format-sarif.cc
@@ -174,7 +174,7 @@ private:
   json::array *maybe_make_kinds_array (diagnostic_event::meaning m) const;
   json::object *maybe_make_physical_location_object (location_t loc);
   json::object *make_artifact_location_object (location_t loc);
-  json::object *make_artifact_location_object (const char *filename);
+  json::object *make_artifact_location_object (source_id src);
   json::object *make_artifact_location_object_for_pwd () const;
   json::object *maybe_make_region_object (location_t loc) const;
   json::object *maybe_make_region_object_for_context (location_t loc) const;
@@ -197,9 +197,9 @@ private:
   json::object *make_reporting_descriptor_object_for_cwe_id (int cwe_id) const;
   json::object *
   make_reporting_descriptor_reference_object_for_cwe_id (int cwe_id);
-  json::object *make_artifact_object (const char *filename);
-  json::object *maybe_make_artifact_content_object (const char *filename) 
const;
-  json::object *maybe_make_artifact_content_object (const char *filename,
+  json::object *make_artifact_object (source_id src);
+  json::object *maybe_make_artifact_content_object (source_id src) const;
+  json::object *maybe_make_artifact_content_object (source_id src,
int start_line,
int end_line) const;
   json::object *make_fix_object (const rich_location _loc);
@@ -220,7 +220,11 @@ private:
  diagnostic group.  */
   sarif_result *m_cur_group_result;
 
-  hash_set  m_filenames;
+  /* If the second member is >0, then this is a buffer of generated content,
+ with that length, not a filename.  */
+  hash_set ,
+  int_hash  >
+   > m_filenames;
   bool m_seen_any_relative_paths;
   hash_set  m_rule_id_set;
   json::array *m_rules_arr;
@@ -787,7 +791,8 @@ sarif_builder::maybe_make_physical_location_object 
(location_t loc)
   /* "artifactLocation" property (SARIF v2.1.0 section 3.29.3).  */
   json::object *artifact_loc_obj = make_artifact_location_object (loc);
   phys_loc_obj->set ("artifactLocation", artifact_loc_obj);
-  m_filenames.add (LOCATION_FILE (loc));
+  const auto src = LOCATION_SRC (loc);
+  m_filenames.add ({src.get_filename_or_buffer (), src.get_buffer_len ()});
 
   /* "region" property (SARIF v2.1.0 section 3.29.4).  */
   if (json::object *region_obj = maybe_make_region_object (loc))
@@ -811,7 +816,7 @@ sarif_builder::maybe_make_physical_location_object 
(location_t loc)
 json::object *
 sarif_builder::make_artifact_location_object (location_t loc)
 {
-  return make_artifact_location_object (LOCATION_FILE (loc));
+  return make_artifact_location_object (LOCATION_SRC (loc));
 }
 
 /* The ID value for use in "uriBaseId" properties (SARIF v2.1.0 section 3.4.4)
@@ -823,10 +828,13 @@ sarif_builder::make_artifact_location_object (location_t 
loc)
or return NULL.  */
 
 json::object *
-sarif_builder::make_artifact_location_object (const char *filename)
+sarif_builder::make_artifact_location_object (source_id src)
 {
   json::object *artifact_loc_obj = new json::object ();
 
+  const auto filename = src.is_buffer ()
+? special_fname_generated () : src.get_filename_or_buffer ();
+
   /* "uri" property (SARIF v2.1.0 section 3.4.3).  */
   artifact_loc_obj->set ("uri", new json::string (filename));
 
@@ -912,9 +920,9 @@ sarif_builder::maybe_make_region_object (location_t loc)