René Zeidler created NIFI-12750:
-----------------------------------
Summary: ExecuteStreamCommand incorrectly decodes error stream
Key: NIFI-12750
URL: https://issues.apache.org/jira/browse/NIFI-12750
Project: Apache NiFi
Issue Type: Bug
Components: Extensions
Affects Versions: 2.0.0-M2, 1.25.0
Environment: any
Reporter: René Zeidler
Attachments: ExecuteStreamCommand_Encoding_Bug.json, encodingTest.sh,
image-2024-02-07-15-14-08-518.png, image-2024-02-07-15-14-54-841.png,
image-2024-02-07-15-20-11-684.png
h1. Summary
The ExecuteStreamCommand processor stores everything the invoked command writes
to the error stream (stderr) into the FlowFile attribute
{{execution.error}}.
When converting the bytes from the stream to a String, it interprets each
individual byte as a Unicode code point. Because every value read this way is
a single byte (0-255), this is effectively ISO-8859-1 (Latin-1) decoding.
Instead, it should use the system default encoding (like it already does for
writing stdout if Output Destination Attribute is set) or use a configurable
encoding (for both stdout and stderr).
h1. Details
When reading/writing FlowFiles, NiFi always uses raw bytes, so encoding issues
are the responsibility of the flow designer, and NiFi has the
ConvertCharacterSet processor to deal with those issues.
When writing to attributes, the API uses Java String objects, which are
encoding agnostic (they represent Unicode codepoints, not bytes). Therefore,
processors receiving bytes have to interpret them using an encoding.
The ExecuteStreamCommand processor writes the output of the command (stdout) to
the Output Destination Attribute (if set). To do that, it converts the bytes into
a String using the system default encoding* by calling {{new String}} without
an encoding argument:
[https://github.com/apache/nifi/blob/72f6d8a6800c643d5f283ae9bff6d7de25b503e9/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExecuteStreamCommand.java#L499]
When converting stderr to a String to write into the {{execution.error}}
attribute, it uses a different, byte-by-byte algorithm:
[https://github.com/apache/nifi/blob/72f6d8a6800c643d5f283ae9bff6d7de25b503e9/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExecuteStreamCommand.java#L507-L517]
It reads individual bytes from the error stream (as {{int}}s) and casts
them to {{char}}s. What Java does in this case is interpret the integer as
a Unicode code point. For single bytes, this matches the ISO-8859-1 encoding.
Instead, it should use the same decoding method as for stdout.
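The difference between the two decoding approaches can be illustrated with a small standalone sketch. This is not the NiFi source: the class and method names ({{StderrDecodingSketch}}, {{decodeByCasting}}, {{decodeWithCharset}}) are hypothetical, and the test string is written with Unicode escapes so that the example does not depend on the encoding of the source file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the actual ExecuteStreamCommand code.
public class StderrDecodingSketch {

    // Mirrors the byte-by-byte approach described above: every byte read
    // from the stream is cast to a char, which amounts to ISO-8859-1 decoding.
    public static String decodeByCasting(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Alternative: collect all bytes first, then decode them in one step
    // with an explicit charset (for example Charset.defaultCharset()).
    public static String decodeWithCharset(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // "ÄÖÜäöüß" written as Unicode escapes
        String text = "\u00C4\u00D6\u00DC\u00E4\u00F6\u00FC\u00DF";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        String cast = decodeByCasting(new ByteArrayInputStream(utf8));
        String decoded = decodeWithCharset(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);

        // The cast result is identical to decoding the bytes as ISO-8859-1.
        System.out.println(cast.equals(new String(utf8, StandardCharsets.ISO_8859_1))); // true
        // Decoding with the charset the command actually used restores the text.
        System.out.println(decoded.equals(text)); // true
    }
}
```

Decoding the complete byte array at once also avoids splitting a multi-byte sequence, which a per-byte conversion cannot handle for encodings such as UTF-8.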
h1. Reproduction steps
These steps are for a Linux environment, but can be adapted with a different
executable for Windows.
# Create the file /opt/nifi/data/encodingTest.sh (attached) with the following
contents and make it executable:
{quote}{{#!/bin/bash}}
{{echo "|out static: ÄÖÜäöüß"}}
{{echo "|error static: ÄÖÜäöüß" >&2}}
{{echo "|out arg: $1"}}
{{echo "|error arg: $1" >&2}}
{{echo "|out arg hexdump:"}}
{{printf '%s' "$1" | od -A x -t x1z -v}}
{{echo "|error arg hexdump:" >&2}}
{{printf '%s' "$1" | od -A x -t x1z -v >&2}}{quote}
The script writes identical data to both stdout and stderr. It contains
non-ASCII characters to make the encoding issues visible.
# Import the attached flow or create it manually:
!image-2024-02-07-15-14-08-518.png|width=324,height=373!!image-2024-02-07-15-14-54-841.png|width=326,height=120!
# Run the GenerateFlowFile processor once and observe the attributes of the
FlowFile in the final queue:
!image-2024-02-07-15-20-11-684.png|width=523,height=195!
The output attribute (stdout) is correctly decoded. The execution.error
attribute (stderr) contains garbled text (UTF-8 bytes interpreted as ISO-8859-1
and reencoded in UTF-8).
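The garbling observed in the attribute, and a possible way to undo it, can be sketched as follows. This is only an illustration under the assumption that the command wrote UTF-8 and that the attribute value was not truncated in the middle of a multi-byte sequence. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the mojibake described above and its reversal.
public class MojibakeSketch {

    // Simulates the bug: UTF-8 bytes are decoded as if they were ISO-8859-1.
    public static String garble(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses it: recover the original bytes via ISO-8859-1, then decode
    // them as UTF-8. This is lossless because ISO-8859-1 maps every byte
    // value to exactly one char and back.
    public static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ÄÖÜäöüß" written as Unicode escapes
        String text = "\u00C4\u00D6\u00DC\u00E4\u00F6\u00FC\u00DF";
        String garbled = garble(text);

        // Each of the 7 characters takes 2 bytes in UTF-8, so the garbled
        // string has 14 chars.
        System.out.println(garbled.length());             // 14
        System.out.println(repair(garbled).equals(text)); // true
    }
}
```

Such a round trip is at best a workaround for existing flows. It does not replace decoding stderr correctly in the processor.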
h1. *On the system default encoding
The system default encoding is a property of the JVM. It is typically UTF-8 on
Linux, but Windows-1252 (or a different code page depending on locale) in Windows
environments. It can be overridden using the {{file.encoding}} JVM argument on
startup.
Relying on the system default encoding is dangerous and can lead to subtle
bugs, like the ones I previously reported (NIFI-12669 and NIFI-12670).
In this case, it might make sense to use the system default encoding, as it
concerns data passed between NiFi and another process that runs on the host
system. Also, the ProcessBuilder class used to create the process always
passes arguments in the system default encoding, and there doesn't seem to be
a way to change that. This behavior should probably be documented.
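A configurable encoding that falls back to the system default could be resolved roughly as in the sketch below. The class {{CharsetResolutionSketch}} and the method {{resolveCharset}} are hypothetical, and how such a setting would be exposed as a processor property is left open here.

```java
import java.nio.charset.Charset;

// Hypothetical sketch: choose a charset from an optional configuration value.
public class CharsetResolutionSketch {

    // Returns the configured charset, or the JVM default if none is given.
    // Charset.forName throws an unchecked exception for unknown or illegal
    // names, so a misconfiguration fails loudly instead of garbling text.
    public static Charset resolveCharset(String configuredName) {
        if (configuredName == null || configuredName.trim().isEmpty()) {
            return Charset.defaultCharset();
        }
        return Charset.forName(configuredName.trim());
    }

    public static void main(String[] args) {
        // The JVM default depends on platform, Java version and startup options.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        System.out.println("configured:      " + resolveCharset("ISO-8859-1"));
    }
}
```

The same resolved charset could then be used for decoding both stdout and stderr, so that the two attributes stay consistent.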