[
https://issues.apache.org/jira/browse/FLINK-39805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084723#comment-18084723
]
Dennis-Mircea Ciupitu commented on FLINK-39805:
-----------------------------------------------
Hi [~pavelzeger]! Before opening a bug, please include a concrete reproduction
for what you are describing. As written, what this ticket describes is not
reproducible on the operator as it ships today.
The operator runtime image is {{eclipse-temurin:17-jre-jammy}}
({{Dockerfile}} line 38; the Maven image on line 20 is build-only). That
base
[explicitly|https://github.com/adoptium/containers/blob/main/17/jre/ubuntu/jammy/Dockerfile]
sets {{LANG=en_US.UTF-8}}, {{LC_ALL=en_US.UTF-8}}, and runs
{{locale-gen en_US.UTF-8}}. Nothing in the operator {{Dockerfile}},
{{docker-entrypoint.sh}}, or Helm chart overrides those (grepped {{LANG}} /
{{LC_ALL}} / {{file.encoding}} / {{JAVA_TOOL_OPTIONS}} / {{JAVA_OPTS}},
zero hits). So at runtime {{Charset.defaultCharset()}} resolves to UTF-8 and
the three {{getBytes()}} calls produce exactly what log4j, logback, and the
Kubernetes YAML parser expect. The empty {{Affected Versions}} field is
consistent with this.
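A quick way to confirm this inside any given container is a small probe like the one below. It is only a sketch: {{DefaultCharsetCheck}} and {{matchesUtf8}} are hypothetical names, not part of the operator, and the sample string is arbitrary.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical probe class, not part of the operator code base.
public class DefaultCharsetCheck {

    // True when no-arg getBytes() yields the same bytes as explicit UTF-8
    // for the given sample.
    static boolean matchesUtf8(String sample) {
        return Arrays.equals(sample.getBytes(), sample.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Arbitrary sample containing non-ASCII characters.
        String sample = "tenant: caf\u00e9 \u6771\u4eac";

        // The inputs the JVM uses to pick its default charset on JDK 17.
        System.out.println("LANG=" + System.getenv("LANG"));
        System.out.println("LC_ALL=" + System.getenv("LC_ALL"));
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));

        // What no-arg getBytes() will actually use.
        System.out.println("defaultCharset=" + Charset.defaultCharset());
        System.out.println("getBytes() matches UTF-8: " + matchesUtf8(sample));
    }
}
```

Running it in the image under suspicion and attaching the output would already go a long way towards a reproduction.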
To actually corrupt anything, the user has to deviate from the shipped setup:
rebuild on a non-UTF-8 base (distroless, alpine/musl, scratch-derived, hardened
internal), strip the locale environment variables in their Pod spec, or pass
{{-Dfile.encoding=US-ASCII}} via {{JVM_ARGS}}. None of those happen by
default.
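For illustration, the effect of such a deviation can be simulated without touching the JVM default by passing the charset explicitly. This is a minimal sketch, assuming the consumer of the file decodes it as UTF-8; {{CharsetCorruptionDemo}} and {{roundTrip}} are hypothetical names and the annotation value is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the operator code base.
public class CharsetCorruptionDemo {

    // Encode with the charset the writer would use, then decode as UTF-8,
    // which is the assumption made here about the downstream reader.
    static String roundTrip(String text, Charset writeCharset) {
        byte[] onDisk = text.getBytes(writeCharset);
        return new String(onDisk, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String value = "owner: caf\u00e9";

        // UTF-8 writer: the text survives unchanged.
        System.out.println(roundTrip(value, StandardCharsets.UTF_8));

        // US-ASCII writer: the unmappable character is replaced with '?',
        // so this prints "owner: caf?".
        System.out.println(roundTrip(value, StandardCharsets.US_ASCII));
    }
}
```

The same loss would occur with no-arg {{getBytes()}} only if the JVM default were actually US-ASCII, which is the condition a reproduction needs to demonstrate.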
Two asks:
# If you have a concrete reproduction (base image plus a pod template that
produces wrong bytes on disk), please attach it. Otherwise the failure is
hypothetical.
# If not, this should be filed at most as an {{Improvement}} / {{[hotfix]}} PR.
The SpotBugs {{DM_DEFAULT_ENCODING}} rule you suggested is independently useful
for recurrence prevention.
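For reference, the hardening itself is small. The following is a minimal sketch of an explicit-charset write, not the actual patch: {{ExplicitUtf8Write}} and {{writeUtf8}} are hypothetical names, and the temp-file prefix and YAML content are made up for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the operator code base.
public class ExplicitUtf8Write {

    // Writes the content as UTF-8 regardless of the JVM default charset.
    static void writeUtf8(Path target, String content) throws IOException {
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("podTemplate_", ".yaml");
        try {
            writeUtf8(tmp, "metadata:\n  annotations:\n    owner: \"Jos\u00e9\"\n");
            // Read back with an explicit charset as well (Files.readString needs Java 11+).
            System.out.print(Files.readString(tmp, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Passing the charset at each call site is also what stops SpotBugs from reporting {{DM_DEFAULT_ENCODING}}.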
In general, {{Bug}} + {{Major}} should require an observed failure with a
reproducer. For example, something like "Spotted in the code, could break under
conditions X" is real and worth fixing, but belongs in {{Improvement}} so
triage stays accurate and reviewers know exactly what is being hardened against.
So if you do not have a concrete reproduction for this, I'll close this bug as
Invalid. (cc [~gyfora])
> FlinkConfigBuilder uses platform-default charset when writing
> log/pod-template files
> ------------------------------------------------------------------------------------
>
> Key: FLINK-39805
> URL: https://issues.apache.org/jira/browse/FLINK-39805
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Pavel Zeger
> Priority: Major
>
> `flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/config/FlinkConfigBuilder.java`,
> three calls:
> {code:java}
> File log4jConfFile = new File(tmpDir.getAbsolutePath(),
> CONFIG_FILE_LOG4J_NAME);
> Files.write(log4jConfFile.toPath(), log4jConf.getBytes());
> File logbackConfFile = new File(tmpDir.getAbsolutePath(),
> CONFIG_FILE_LOGBACK_NAME);
> Files.write(logbackConfFile.toPath(), logbackConf.getBytes());
> final File tmp = File.createTempFile(GENERATED_FILE_PREFIX + "podTemplate_",
> ".yaml");
> Files.write(tmp.toPath(), Serialization.asYaml(podTemplate).getBytes());{code}
> `String.getBytes()` (no-arg) encodes using the JVM’s
> Charset.defaultCharset(), which is environment-dependent. On most modern
> Linux containers it happens to be UTF-8, but:
> # On older Linux base images and on container runtimes that don’t set
> LANG=*UTF-8, the default falls back to US-ASCII or ISO-8859-1.
> # On Windows hosts the default is typically windows-1252 or another local
> code page.
> # In a JVM run with -Dfile.encoding=, the result depends on whatever the
> operator was started with.
> When this happens, any non-ASCII character in the user’s log4j.properties,
> logback.xml, or podTemplate.yaml (a UTF-8 emoji in a comment, an
> internationalised label key, an annotation containing a CJK character,
> non-breaking spaces in YAML, etc.) is corrupted.
> The pod template case is the most concerning. Users frequently add
> annotations / labels / env values containing non-ASCII characters (legitimate
> use cases: internationalised tenant labels, owner names with diacritics,
> region tags, etc.). A corrupted YAML written to the temp file is then passed
> to Kubernetes, which either rejects it (best case) or silently accepts a
> corrupted value (worst case).
>
> *Proposed fix*
> # Always use UTF-8 explicitly.
> # Add the SpotBugs DM_DEFAULT_ENCODING rule to the project to prevent
> recurrence.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)