[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files

dmalcolm at gcc dot gnu.org via Gcc-bugs Fri, 10 Mar 2023 16:31:31 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098


--- Comment #3 from David Malcolm <dmalcolm at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> I would have assumed you need -finput-charset= for the non-utf8 ones really
> if your LANG/LANGUAGE is not set to C/UTF8 really.

Yeah, but when complaining about encoding issues, the error message we emit
should at least be properly encoded :/

It's a major pain for my integration testing where two(?) bad bytes in one
source file lead to an unparseable .sarif file (out of thousands).

When quoting source in the .sarif output, we should ensure that the final JSON
output is all valid UTF-8, perhaps falling back to not quoting source for cases
where e.g.
- the source file isn't validly encoded, or
- the -finput-charset= is wrong, or
- the -finput-charset= is missing, or
- the source file (erroneously) uses a mixture of different encodings in
  different parts of itself
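
As a rough illustration of the fallback idea (not GCC's actual implementation, which is in C++), here is a minimal Python sketch. The helper names decode_source_line and make_region are hypothetical; the "snippet"/"text" layout follows the SARIF region object. The source line is quoted only if its bytes decode as strict UTF-8, and is otherwise omitted so the serialized JSON stays valid.

```python
import json
from typing import Optional


def decode_source_line(raw: bytes) -> Optional[str]:
    """Return the line as text if it is valid UTF-8, else None (hypothetical helper)."""
    try:
        return raw.decode("utf-8")  # strict mode rejects malformed byte sequences
    except UnicodeDecodeError:
        return None


def make_region(start_line: int, raw_line: bytes) -> dict:
    """Build a SARIF-style region, quoting the source only when it decodes cleanly."""
    region = {"startLine": start_line}
    text = decode_source_line(raw_line)
    if text is not None:
        region["snippet"] = {"text": text}
    # Otherwise fall back to not quoting the source at all.
    return region


good = make_region(1, "int caf\u00e9 = 0;".encode("utf-8"))
bad = make_region(2, b"int caf\xe9 = 0;")  # lone Latin-1 byte, invalid as UTF-8

# The serialized output is valid UTF-8 either way.
serialized = json.dumps([good, bad], ensure_ascii=False).encode("utf-8")
```

Dropping the snippet loses the quoted source but keeps the location, so a consumer can still find the line in the original file.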

We should probably also check that we do something sane for Trojan Source
attacks.
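
One possible reading of "something sane", sketched in Python under my own assumptions: before quoting, replace the Unicode bidirectional control characters used in such attacks with visible escapes. The function name and the <U+XXXX> escape format are illustrative only, not taken from the GCC sources.

```python
# Bidirectional formatting characters: LRE, RLE, PDF, LRO, RLO (U+202A..U+202E)
# and LRI, RLI, FSI, PDI (U+2066..U+2069).
BIDI_CONTROLS = frozenset(
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
)


def escape_bidi_controls(text: str) -> str:
    """Replace each bidi control character with a visible <U+XXXX> escape."""
    return "".join(
        f"<U+{ord(ch):04X}>" if ch in BIDI_CONTROLS else ch for ch in text
    )


sample = 'access = "user\u202e // admin";'
escaped = escape_bidi_controls(sample)
```

Escaping rather than dropping the characters keeps the quoted snippet honest about what is in the file while preventing a viewer from reordering the text.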
