[jira] [Created] (NIFI-12750) ExecuteStreamCommand incorrectly decodes error stream

Jira Wed, 07 Feb 2024 06:34:05 -0800

René Zeidler created NIFI-12750:
-----------------------------------

             Summary: ExecuteStreamCommand incorrectly decodes error stream
                 Key: NIFI-12750
                 URL: https://issues.apache.org/jira/browse/NIFI-12750
             Project: Apache NiFi
          Issue Type: Bug
          Components: Extensions
    Affects Versions: 2.0.0-M2, 1.25.0
         Environment: any
            Reporter: René Zeidler
         Attachments: ExecuteStreamCommand_Encoding_Bug.json, encodingTest.sh, 
image-2024-02-07-15-14-08-518.png, image-2024-02-07-15-14-54-841.png, 
image-2024-02-07-15-20-11-684.png


h1. Summary

The ExecuteStreamCommand processor stores everything the invoked command writes 
to the error stream (stderr) into the FlowFile attribute 
{{execution.error}}.

When converting the bytes from the stream to a String, it interprets each 
individual byte as a Unicode codepoint. When reading only single bytes this 
effectively results in ISO-8859-1 (Latin-1).

Instead, it should use the system default encoding (like it already does for 
writing stdout if Output Destination Attribute is set) or use a configurable 
encoding (for both stdout and stderr).
h1. Details

When reading/writing FlowFiles, NiFi always uses raw bytes, so encoding issues 
are the responsibility of the flow designer, and NiFi has the 
ConvertCharacterSet processor to deal with those issues.

When writing to attributes, the API uses Java String objects, which are 
encoding agnostic (they represent Unicode codepoints, not bytes). Therefore, 
processors receiving bytes have to interpret them using an encoding.

The ExecuteStreamCommand processor writes the output of the command (stdout) to 
the Output Destination Attribute (if set). To do that, it converts bytes into 
a String using the system default encoding* by calling {{new String}} without 
an encoding argument:
[https://github.com/apache/nifi/blob/72f6d8a6800c643d5f283ae9bff6d7de25b503e9/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExecuteStreamCommand.java#L499]

When converting stderr to a String to write into the {{execution.error}} 
attribute, it uses a different, byte-by-byte algorithm:
[https://github.com/apache/nifi/blob/72f6d8a6800c643d5f283ae9bff6d7de25b503e9/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ExecuteStreamCommand.java#L507-L517]
It reads individual bytes from the error stream (as {{int}} values) and casts 
each one to {{char}}. Java treats the integer as a UTF-16 code unit, which for 
byte values (0-255) is the Unicode code point with the same number. This 
matches the ISO-8859-1 encoding. Instead, it should use the same decoding 
method as for stdout.
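The difference between the two decoding approaches can be sketched in isolation. This is not the processor's code: the class and method names are hypothetical, and the input is an in-memory stream standing in for the stderr of the invoked command. Non-ASCII characters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StderrDecodingSketch {

    // Mirrors the behavior described above: every byte read from the
    // stream is cast to a char, which amounts to ISO-8859-1 decoding.
    static String decodeByCasting(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Decodes all bytes at once using an explicitly chosen charset.
    static String decodeWithCharset(InputStream in, Charset charset) throws IOException {
        return new String(in.readAllBytes(), charset);
    }

    public static void main(String[] args) throws IOException {
        // "ÄÖÜäöüß" as Unicode escapes, encoded the way a UTF-8 process would emit it.
        String original = "\u00C4\u00D6\u00DC\u00E4\u00F6\u00FC\u00DF";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String cast = decodeByCasting(new ByteArrayInputStream(utf8));
        String decoded = decodeWithCharset(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);

        // Casting gives the same result as decoding the bytes as ISO-8859-1.
        System.out.println(cast.equals(new String(utf8, StandardCharsets.ISO_8859_1)));
        // Decoding with the matching charset recovers the original text.
        System.out.println(decoded.equals(original));
    }
}
```

Passing {{Charset.defaultCharset()}} instead of {{StandardCharsets.UTF_8}} would correspond to the behavior already used for stdout.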
h1. Reproduction steps

These steps are for a Linux environment, but can be adapted with a different 
executable for Windows.
 # Create the file /opt/nifi/data/encodingTest.sh (attached) with the following 
contents and make it executable:

{quote}{{#!/bin/bash}}
{{echo "|out static: ÄÖÜäöüß"}}
{{echo "|error static: ÄÖÜäöüß" >&2}}
{{echo "|out arg: $1"}}
{{echo "|error arg: $1" >&2}}
{{echo "|out arg hexdump:"}}
{{printf '%s' "$1" | od -A x -t x1z -v}}
{{echo "|error arg hexdump:" >&2}}
{{printf '%s' "$1" | od -A x -t x1z -v >&2}}{quote}
The script writes identical data to both stdout and stderr. It contains 
non-ASCII characters to make the encoding issues visible.
 # Import the attached flow or create it manually:
!image-2024-02-07-15-14-08-518.png|width=324,height=373!!image-2024-02-07-15-14-54-841.png|width=326,height=120!

 # Run the GenerateFlowFile processor once and observe the attributes of the 
FlowFile in the final queue:
!image-2024-02-07-15-20-11-684.png|width=523,height=195!
The output attribute (stdout) is correctly decoded. The execution.error 
attribute (stderr) contains garbled text (UTF-8 bytes interpreted as ISO-8859-1 
and re-encoded as UTF-8).
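The garbling observed in the last step can be reproduced, and reversed, outside NiFi. The following is a minimal sketch with hypothetical names, assuming the command emitted UTF-8. The repair only works because the byte-to-char cast never produces characters above U+00FF, so encoding the garbled string as ISO-8859-1 restores the original bytes exactly.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeSketch {

    // Simulates the bug: UTF-8 bytes interpreted as ISO-8859-1.
    static String garble(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses it: recover the raw bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00C4\u00D6\u00DC\u00E4\u00F6\u00FC\u00DF"; // "ÄÖÜäöüß"
        String garbled = garble(original);

        // Each two-byte UTF-8 sequence became two separate characters.
        System.out.println(garbled.length() == 2 * original.length());
        // The round trip restores the original text.
        System.out.println(repair(garbled).equals(original));
    }
}
```

This is a possible workaround for a flow that already holds a garbled {{execution.error}} value, not a substitute for fixing the decoding in the processor.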

h1. *On the system default encoding

The system default encoding is a property of the JVM. It is typically UTF-8 on 
Linux, but Windows-1252 (or a different code page depending on locale) in 
Windows environments. It can be overridden using the {{file.encoding}} JVM 
argument on startup.
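To check which encoding a given JVM actually uses, the default charset can be printed directly. This is a small sketch with a hypothetical class name; it only reads {{Charset.defaultCharset()}} and the {{file.encoding}} system property.

```java
import java.nio.charset.Charset;

public class DefaultCharsetSketch {

    // Reports the charset that new String(byte[]) without an encoding
    // argument would use, together with the file.encoding property.
    static String describe() {
        return "defaultCharset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it on the NiFi host, with the same JVM arguments NiFi uses, shows which encoding the stdout conversion relies on.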

Relying on the system default encoding is dangerous and can lead to subtle 
bugs, like the ones I previously reported (NIFI-12669 and NIFI-12670).

In this case, it might make sense to use the system default encoding, as it 
concerns data passed between NiFi and another process that runs on the host 
system. Also, the ProcessBuilder class used to create the process always 
passes arguments in the system default encoding, and there doesn't seem to be 
a way to change that. This behavior should probably be documented.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
