Duansg commented on code in PR #3810:
URL: https://github.com/apache/hertzbeat/pull/3810#discussion_r2418782393
##########
hertzbeat-collector/hertzbeat-collector-basic/src/main/java/org/apache/hertzbeat/collector/collect/prometheus/parser/OnlineParser.java:
##########
@@ -362,37 +365,123 @@ private static CharChecker parseLabelValue(InputStream
inputStream, StringBuilde
return new CharChecker(i);
}
+ /**
+ * Handles multi-byte UTF-8 character parsing from input stream.
+ * Reads additional bytes based on the first byte and validates the UTF-8
sequence.
+ * Appends the decoded character to the string builder or replacement
character if invalid.
+ *
+ * @param firstByte the first byte of the UTF-8 character sequence
+ * @param inputStream the input stream to read additional bytes from
+ * @param stringBuilder the string builder to append the decoded character
to
+ * @throws IOException if an I/O error occurs while reading from the input
stream
+ */
private static void handleUtf8Character(int firstByte, InputStream
inputStream, StringBuilder stringBuilder) throws IOException {
- List<Integer> bytes = new ArrayList<>();
- bytes.add(firstByte);
-
- int additionalBytes = getUtf8AdditionalByteCount(firstByte);
-
- for (int j = 0; j < additionalBytes; j++) {
- int nextByte = inputStream.read();
- if (nextByte == -1) break;
- bytes.add(nextByte);
+ byte[] byteArray = new byte[4];
+ byteArray[0] = (byte) firstByte;
+ int additionalBytes = calculateUtf8ContinuationBytes(firstByte);
+ if (additionalBytes == -1) {
+ appendInvalidCharacters(stringBuilder);
+ return;
}
+ int totalBytes = 1;
- byte[] byteArray = new byte[bytes.size()];
- for (int j = 0; j < bytes.size(); j++) {
- byteArray[j] = (byte) bytes.get(j).intValue();
+ for (int i = 0; i < additionalBytes; i++) {
+ int nextByte = inputStream.read();
+ if (nextByte == -1) {
+ appendInvalidCharacters(stringBuilder);
+ return;
+ }
+ // Verify subsequent byte format:10xxxxxx
+ if ((nextByte & 0xC0) != 0x80) {
+ appendInvalidCharacters(stringBuilder);
+ return;
Review Comment:
> There seems to be a lack of surrogate and U+10FFFF checks here ❤
@mengnankkkk Hi, thank you for your reply. Regarding the point you raised,
I have indeed considered it.
My consideration is:
1. It cannot appear in a standard UTF-8 data stream, as RFC 3629 explicitly
prohibits it.
2. I reviewed the documentation related to Prometheus. It only specifies
UTF-8 encoding but does not describe any associated code point detection.
3. The `label value` generated by the official Prometheus exporter will
never appear, as it is considered invalid UTF-8 within the official Prometheus
ecosystem (including the Go implementation and mainstream exporters) or when
using the Go standard library. Consequently, it will never appear in valid
metric outputs.
Therefore, a relatively lenient approach was adopted. Such scenarios are
extremely rare and only occur in non-standard implementations or when error
byte streams are intentionally generated manually. Consequently, I added a
`todo` annotation in the code to ensure additional detection and handling can
be implemented if necessary in the future.
What do you think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]