[ https://issues.apache.org/jira/browse/NIFI-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846695#comment-17846695 ]
David Handermann commented on NIFI-12669:
-----------------------------------------

[~mattyb149] I pushed an update to the support branch, correcting the ByteArrayOutputStream.toString() method call. The signature that takes a Charset was added in Java 10, so the change requires using the String name of the character set instead.

> EvaluateXQuery processor incorrectly encodes result attributes
> --------------------------------------------------------------
>
>                 Key: NIFI-12669
>                 URL: https://issues.apache.org/jira/browse/NIFI-12669
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Configuration, Extensions
>         Environment: JVM with non-UTF-8 default encoding (e.g. default
> Windows installation)
>            Reporter: René Zeidler
>            Assignee: Jim Steinebrey
>            Priority: Major
>              Labels: encoding, utf8, windows, xml
>             Fix For: 1.27.0, 2.0.0-M4
>
>         Attachments: EvaluateXQuery_Encoding_Bug.json,
> image-2024-01-25-10-24-17-005.png, image-2024-01-25-10-31-35-200.png
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> h2. Environment
> This issue affects environments where the JVM default encoding is not
> {{UTF-8}}. Standard Java installations on Windows are affected, as they
> usually use the default encoding {{windows-1252}}. To reproduce the issue
> on Linux, change the default encoding to {{windows-1252}} by adding the
> following line to your {{bootstrap.conf}}:
> {quote}{{java.arg.21=-Dfile.encoding=windows-1252}}
> {quote}
> h2. Summary
> The EvaluateXQuery processor incorrectly encodes result values when
> storing them in attributes. This causes non-ASCII characters to be garbled.
> Example:
> !image-2024-01-25-10-24-17-005.png!
> h2. Steps to reproduce
> # Make sure NiFi runs with a non-UTF-8 default encoding, see "Environment"
> # Create a GenerateFlowFile processor with the following content:
> {quote}{{<?xml version="1.0" encoding="UTF-8"?>}}
> {{<myRoot>}}
> {{ <myData>This text contains non-ASCII characters: ÄÖÜäöüßéèóò</myData>}}
> {{</myRoot>}}
> {quote}
> # Connect the processor to an EvaluateXQuery processor.
> Set the {{Destination}} to {{flowfile-attribute}}.
> Create a custom property {{myData}} with value {{string(/myRoot/myData)}}.
> # Connect the outputs of the EvaluateXQuery processor to funnels to be able
> to observe the result in the queue.
> # Start the EvaluateXQuery processor and run the GenerateFlowFile processor
> once.
> The flow should look similar to this:
> !image-2024-01-25-10-31-35-200.png!
> I also attached a JSON export of the example flow.
> # Observe the attributes of the resulting FlowFile in the queue.
> h3. Expected Result
> The FlowFile should contain an attribute {{myData}} with the value
> {{"This text contains non-ASCII characters: ÄÖÜäöüßéèóò"}}.
> h3. Actual Result
> The attribute has the value "This text contains non-ASCII characters:
> Ã„Ã–ÃœÃ¤Ã¶Ã¼ÃŸÃ©Ã¨Ã³Ã²".
> h2. Root Cause Analysis
> EvaluateXQuery uses the method
> [{{formatItem}}|https://github.com/apache/nifi/blob/2e3f83eb54cbc040b5a1da5bce9a74a558f08ea4/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/EvaluateXQuery.java#L368-L372]
> to write the query result to an attribute. This method calls
> {{ByteArrayOutputStream}}'s
> [toString|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/io/ByteArrayOutputStream.html#toString()]
> method without an encoding argument, which then falls back to the default
> charset of the environment. Bytes are always written to this output stream
> using UTF-8
> ([.getBytes(StandardCharsets.UTF_8)|https://github.com/apache/nifi/blob/2e3f83eb54cbc040b5a1da5bce9a74a558f08ea4/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/EvaluateXQuery.java#L397]).
> When the default charset is not UTF-8, this results in UTF-8 bytes being
> interpreted in a different encoding when converting to a string, producing
> garbled text (see above).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
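
The root cause described in the issue, and the Java 8-compatible fix mentioned in the comment, can be illustrated outside NiFi with a few lines of Java. This is only a minimal sketch, not the actual patch: the class name EncodingDemo and its helper methods are hypothetical, and decoding with "windows-1252" stands in for calling toString() without arguments on a JVM whose default charset is windows-1252 (this assumes the JVM ships that charset, which common JDK builds do).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from the NiFi code base.
class EncodingDemo {

    // Write the text into the buffer as UTF-8 bytes, as the issue says the processor does.
    static ByteArrayOutputStream utf8Buffer(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(text.getBytes(StandardCharsets.UTF_8));
        return out;
    }

    // Stand-in for out.toString() on a JVM whose default charset is windows-1252:
    // the UTF-8 bytes are decoded with the wrong charset.
    static String decodeAsWindows1252(ByteArrayOutputStream out) throws IOException {
        return out.toString("windows-1252");
    }

    // Java 8-compatible decoding: toString(String charsetName) has existed since
    // early JDK releases, while toString(Charset) was only added in Java 10.
    static String decodeAsUtf8(ByteArrayOutputStream out) throws IOException {
        return out.toString(StandardCharsets.UTF_8.name());
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes for "ÄÖÜ" keep this source file independent of the
        // compiler's own default encoding.
        String original = "\u00C4\u00D6\u00DC";
        ByteArrayOutputStream out = utf8Buffer(original);

        System.out.println("wrong charset round-trips: "
                + original.equals(decodeAsWindows1252(out)));
        System.out.println("UTF-8 round-trips: "
                + original.equals(decodeAsUtf8(out)));
    }
}
```

Passing the charset name explicitly makes the result independent of the file.encoding setting, at the cost of a checked UnsupportedEncodingException (a subclass of IOException) that the Charset overload does not declare.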