[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098 --- Comment #9 from David Malcolm --- (In reply to Hans-Peter Nilsson from comment #8) > (In reply to David Malcolm from comment #7) > > The invalid UTF-8 in the patch seems to have broken the server-side script: > > Maybe the not-really-utf8 files need to be marked in some way in the git > repo to be safely handled for future checkout and updates, including the > problematic scripting? However, reading gitattributes(5) it's not obvious > how. Perhaps https://www.git-scm.com/docs/gitattributes#_marking_files_as_binary though it's not clear if we can do that for individual files (or if it's worth bothering)
[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098 --- Comment #8 from Hans-Peter Nilsson --- (In reply to David Malcolm from comment #7) > The invalid UTF-8 in the patch seems to have broken the server-side script: Maybe the not-really-utf8 files need to be marked in some way in the git repo to be safely handled for future checkout and updates, including the problematic scripting? However, reading gitattributes(5) it's not obvious how.
[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098 David Malcolm changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #7 from David Malcolm --- Should be fixed on trunk by r13-6861-gd495ea2b232f3e: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=d495ea2b232f3eb50155d7c7362c09a744766746 https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff_plain;h=d495ea2b232f3eb50155d7c7362c09a744766746 The invalid UTF-8 in the patch seems to have broken the server-side script: Enumerating objects: 51, done. Counting objects: 100% (51/51), done. Delta compression using up to 64 threads Compressing objects: 100% (29/29), done. Writing objects: 100% (29/29), 7.74 KiB | 1.29 MiB/s, done. Total 29 (delta 22), reused 0 (delta 0), pack-reused 0 remote: Traceback (most recent call last): remote: File "hooks/post_receive.py", line 118, in remote: post_receive(refs_data, args.submitter_email) remote: File "hooks/post_receive.py", line 65, in post_receive remote: submitter_email) remote: File "hooks/post_receive.py", line 47, in post_receive_one remote: update.send_email_notifications() remote: File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 189, in send_email_notifications remote: self.__email_new_commits() remote: File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 1031, in __email_new_commits remote: commit, self.get_standard_commit_email(commit)) remote: File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 1011, in __send_commit_email remote: default_diff=email.diff) remote: File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 946, in __maybe_get_email_custom_contents remote: hook_input=json.dumps(hooks_data), remote: File "/usr/lib64/python2.7/json/__init__.py", line 244, in dumps remote: return _default_encoder.encode(obj) remote: File "/usr/lib64/python2.7/json/encoder.py", line 207, in encode remote: chunks = self.iterencode(o, _one_shot=True) remote: File "/usr/lib64/python2.7/json/encoder.py", line 270, in iterencode remote: return _iterencode(o, 0) remote: UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 13147: invalid start byte To git+ssh://gcc.gnu.org/git/gcc.git 13ec81eb4c3..d495ea2b232 master -> master
[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098 --- Comment #6 from joseph at codesourcery dot com --- For diagnosis of non-UTF-8 in strings / comments, see commit 0b8c57ed40f19086e30ce54faec3222ac21cc0df, "libcpp: Add -Winvalid-utf8 warning [PR106655]" (implementing a new C++ requirement).
[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098 Hans-Peter Nilsson changed: What|Removed |Added CC||hp at gcc dot gnu.org --- Comment #5 from Hans-Peter Nilsson --- While considering UTF-8 in SARIF files, please also have a look at PR105959.
[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098 --- Comment #4 from David Malcolm --- (In reply to Andrew Pinski from comment #2) > So I think there is a bug in that code ... The issue is in sarif_builder::maybe_make_artifact_content_object, which uses; char *text_utf8 = maybe_read_file (filename); where there's no guarantee that "text_utf8" is (ahem) actually utf-8. Sorry about that. Working on a fix to make it use the input.cc source-quoting machinery, which should handle encoding.
[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098 --- Comment #3 from David Malcolm --- (In reply to Andrew Pinski from comment #1) > I would have assumed you need -finput-charset= for the non-utf8 ones really > if your LANG/LANGUAGE is not set to C/UTF8 really. Yeah, but when complaining about encoding issues, the error message we emit should at least be properly encoded :/ It's a major pain for my integration testing where two(?) bad bytes in one source file lead to an unparseable .sarif file (out of thousands). When quoting source in the .sarif output, we should ensure that the final JSON output is all valid UTF-8, perhaps falling back to not quoting source for cases where e.g. - the source file isn't validly encoded, or - the -finput-charset= is wrong, or - the -finput-charset= is missing or - where the source file (erroneously) uses a mixture of different encodings in different parts of itself Probably should also check we do something sane for trojan source attacks
[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098 --- Comment #2 from Andrew Pinski --- https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Preprocessor-Options.html#index-finput-charset Even has the following: -finput-charset=charset Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command-line option. Currently the command-line option takes precedence if there’s a conflict. charset can be any encoding supported by the system’s iconv library routine. So I think there is a bug in that code ...
[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098 --- Comment #1 from Andrew Pinski --- I would have assumed you need -finput-charset= for the non-utf8 ones really if your LANG/LANGUAGE is not set to C/UTF8 really.
[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098 David Malcolm changed: What|Removed |Added Last reconfirmed||2023-03-11 Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1