On Sun, 26 Sep 2021 09:55:00 GMT, Jie Fu <ji...@openjdk.org> wrote:

> Hi all,
> 
> I tried to build OpenJDK on Cygwin (Windows 2016 + VS2019).
> However, I failed with C4474 and C4778 warnings as below:
> 
> Compiling 100 properties into resource bundles for java.desktop
> Compiling 3038 files for java.base
> e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(269): error 
> C2220: the following warning is treated as an error
> e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(269): warning 
> C4778: 'sscanf' : unterminated format string 
> '%255[*\x01\x02\x03\x04\x05\x06\a\b\n\v\f\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f!"#$%&'*+,-0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\^_`abcdefghijklmnopqrstuvwxyz{|}~\xe2\x82\xac\xe4\xba\x97\xe5\x84\x8e\xe5\x8e\x97%n'
> e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(269): warning 
> C4474: 'sscanf' : too many arguments passed for format string
> e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(269): note: 
> placeholders and their parameters expect 1 variadic arguments, but 3 were 
> provided
> e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(319): warning 
> C4778: 'sscanf' : unterminated format string 
> '%1022[[);/\x01\x02\x03\x04\x05\x06\a\b\n\v\f\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f!"#$%&'*+,-0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\^_`abcdefghijklmnopqrstuvwxyz{|}~\xe2\x82\xac\xe4\xba\x97\xe5\x84\x8e\xe5\x8e%n'
> e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(319): warning 
> C4474: 'sscanf' : too many arguments passed for format string
> e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(319): note: 
> placeholders and their parameters expect 0 variadic arguments, but 2 were 
> provided
> 
> 
> The failure is caused by non-ASCII chars in the format string of sscanf 
> [1][2], which is non-portable on our Windows platform.
> In fact, these non-ASCII coding also triggers C4819 warning, which had been 
> disabled in JDK-8216154 [3].
> And I also found an article showing that sscanf may fail with non-ASCII in 
> the format string [4].
> 
> So it would be nice to remove these non-ASCII chars  (`\x80 ~ \xef`).
> And I think it's safe to do so.
> 
> This is because:
>  1) There are actually no non-ASCII chars for package/class/method/signature 
> names.
>  2) I don't think there is a use case, in which people will input non-ASCII 
> for `CompileCommand`.
> 
> You may argue that the non-ASCII may be used by the parser itself.
> But I didn't find that usage at all. (Please let me know if I miss something.)
> 
> So I suggest to remove these non-ASCII code to make HotSpot to be more 
> portable.
> And if we do so, we can also remove the only one 
> `PRAGMA_DISABLE_MSVC_WARNING(4819)` [5].
> 
> Testing:
>  - Build tests on Windows
>  - tier1~3 on Linux/x64
> 
> Thanks.
> Best regards,
> Jie
> 
> [1] 
> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/compiler/methodMatcher.cpp#L269
> [2] 
> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/compiler/methodMatcher.cpp#L319
> [3] 
> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-January/032014.html
> [4] https://jeffpar.github.io/kbarchive/kb/047/Q47369/
> [5] 
> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/compiler/methodMatcher.cpp#L246

The current limitations of the MethodMather class are:

[1] `parse_method_pattern(char*& line, ...)` requires `line` to be a 
UTF8-encoded byte sequence. Essentially, this means that for -XX:CompileCommand 
to support non-ASCII characters, the JVM (and the shell that runs the JVM) must 
be using UTF8 character encoding.

Note that a "locale" contains 3 parts: language, country and character 
encoding. For example,

- en_US.utf8 (English language, United States, UTF8 character encoding)
- zh_CN.utf8 (Chinese language, China,  UTF8 character encoding)
- zh_CN.gbk (Chinese language, China,  GBK character encoding)

The first two support  non-ASCII characters in  -XX:CompileCommand, but the 
third one doesn't.

[2] MethodMather uses `sscanf`. It assumes that the JVM is running with UTF8 
character encoding because

- It uses `char*` strings returned by `sscanf` to compare with the bytes stored 
in Symbols. This requires `sscanf`  to return strings that are encoded in UTF8, 
because Symbols stores the string with UTF8-encoded bytes.
- It relies on range checking by `sscanf` to enforce the following 
restrictions. However, these restrictions are given with the RANGE macro which 
is UTF8 encoded bytes (and I suspect that these are incorrect when handling 
multi-byte UTF8-encoded characters):


// '\0' and 0xf0-0xff are disallowed in constant string values
// 0x20 ' ', 0x09 '\t' and, 0x2c ',' are used in the matching
// 0x5b '[' and 0x5d ']' can not be used because of the matcher
// 0x28 '(' and 0x29 ')' are used for the signature
// 0x2e '.' is always replaced before the matching
// 0x2f '/' is only used in the class name as package separator

==================================
Proposed solutions:

I don't think we can solve [1] easily. To handle non-ASCII characters that are 
non encoded in UTF8, we need to call NewPlatformString() in 
src/java.base/share/native/libjli/java.c. However, this executes Java code, but 
-XX:CompileCommand needs to be processed before any Java code execution.  ==> 
Proposal: IGNORE it for now.

For [2], there are two distinct issues:

(a) The restriction checks are invalid when the JVM is running in an non-UTF8 
encoding -- this is a moot point because we can't handle [1] anyway, so the 
data given to sscanf() is already bad.   => Proposal: IGNORE it for now

(b) VC++ compilation warning when methodMather.cpp is compiled in non-UTF8 
environments

This is just a warning, and (I think .....) it doesn't change the object file 
at all. I.e., the literal strings in methodMatcher.obj are exactly the same as 
if methodMather.cpp is compiled under a UTF8 environment.

Proposal: use pragma to disable the warning.
Assuming that my analysis for [1] and (a) is correct, there's no reason to fix 
the sscanf code. Disabling the warnings with pragma is the most painless and 
easiest way to handle this.

@DamonFool could you try this experiment:

- Implement the pragma and build two JDKs -- one in a Chinese Windows 
environment, and another in an English Windows environment 
- run `strings methodMatcher.obj` and see if the output is identical
- run the "CJK" test example in my previous comment, and see if you get 
identical results with both JDKs
  - On Windows, you may need to do this to force the terminal to be using UTF8 
code page. See 
https://stackoverflow.com/questions/388490/how-to-use-unicode-characters-in-windows-command-line

(If this doesn't work, an alternative is to avoid using sscanf and write our 
own parser).

Thanks

-------------

PR: https://git.openjdk.java.net/jdk/pull/5704

Reply via email to