[
https://issues.apache.org/jira/browse/NIFI-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pierre Villard updated NIFI-12669:
----------------------------------
Fix Version/s: 2.0.0-M4
> EvaluateXQuery processor incorrectly encodes result attributes
> --------------------------------------------------------------
>
> Key: NIFI-12669
> URL: https://issues.apache.org/jira/browse/NIFI-12669
> Project: Apache NiFi
> Issue Type: Bug
> Components: Configuration, Extensions
> Environment: JVM with non-UTF-8 default encoding (e.g. default
> Windows installation)
> Reporter: René Zeidler
> Assignee: Jim Steinebrey
> Priority: Major
> Labels: encoding, utf8, windows, xml
> Fix For: 2.0.0-M4, 1.27.0
>
> Attachments: EvaluateXQuery_Encoding_Bug.json,
> image-2024-01-25-10-24-17-005.png, image-2024-01-25-10-31-35-200.png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> h2. Environment
> This issue affects environments where the JVM default encoding is not
> {{UTF-8}}. Standard Java installations on Windows are affected, as they
> usually use the default encoding {{windows-1252}}. To reproduce the issue
> on Linux, change the default encoding to {{windows-1252}} by adding the
> following line to your {{bootstrap.conf}}:
> {quote}{{java.arg.21=-Dfile.encoding=windows-1252}}
> {quote}
> h2. Summary
> The EvaluateXQuery processor incorrectly encodes result values when storing
> them in attributes. This causes non-ASCII characters to be garbled.
> Example:
> !image-2024-01-25-10-24-17-005.png!
> h2. Steps to reproduce
> # Make sure NiFi runs with a non-UTF-8 default encoding, see "Environment"
> # Create a GenerateFlowFile processor with the following content:
> {quote}{{<?xml version="1.0" encoding="UTF-8"?>}}
> {{<myRoot>}}
> {{ <myData>This text contains non-ASCII characters: ÄÖÜäöüßéèóò</myData>}}
> {{</myRoot>}}
> {quote}
> # Connect the processor to an EvaluateXQuery processor.
> Set the {{Destination}} to {{flowfile-attribute}}.
> Create a custom property {{myData}} with value {{string(/myRoot/myData)}}.
> # Connect the outputs of the EvaluateXQuery processor to funnels to be able
> to observe the result in the queue.
> # Start the EvaluateXQuery processor and run the GenerateFlowFile processor
> once.
> The flow should look similar to this:
> !image-2024-01-25-10-31-35-200.png!
> I also attached a JSON export of the example flow.
> # Observe the attributes of the resulting FlowFile in the queue.
> h3. Expected Result
> The FlowFile should contain an attribute {{myData}} with the value {{"This
> text contains non-ASCII characters: ÄÖÜäöüßéèóò"}}.
> h3. Actual Result
> The attribute has the garbled value "This text contains non-ASCII characters:
> Ã„Ã–ÃœÃ¤Ã¶Ã¼ÃŸÃ©Ã¨Ã³Ã²".
> h2. Root Cause Analysis
> EvaluateXQuery uses the method
> [{{formatItem}}|https://github.com/apache/nifi/blob/2e3f83eb54cbc040b5a1da5bce9a74a558f08ea4/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/EvaluateXQuery.java#L368-L372]
> to write the query result to an attribute. This method calls
> {{ByteArrayOutputStream}}'s
> [toString|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/io/ByteArrayOutputStream.html#toString()]
> method without an encoding argument, which then defaults to the default
> charset of the environment. Bytes are always written to this output stream
> using UTF-8
> ([.getBytes(StandardCharsets.UTF_8)|https://github.com/apache/nifi/blob/2e3f83eb54cbc040b5a1da5bce9a74a558f08ea4/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/EvaluateXQuery.java#L397]).
> When the default charset is not UTF-8, the UTF-8 bytes are decoded with a
> different charset when the stream is converted to a string, which produces
> the garbled text shown above.
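
The mismatch described in the root cause analysis can be reproduced outside NiFi with a minimal sketch. The class and method names below ({{CharsetMismatchSketch}}, {{roundTrip}}) are hypothetical and are not taken from the NiFi code base. The sketch writes UTF-8 bytes to a {{ByteArrayOutputStream}} and decodes the buffer with an explicitly chosen charset, so that windows-1252 can stand in for a non-UTF-8 default without changing JVM settings. It is an illustration of the mechanism, not necessarily how the issue was fixed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of NiFi.
final class CharsetMismatchSketch {

    // Writes the text as UTF-8 bytes (as the report says EvaluateXQuery does),
    // then decodes the buffer with the given charset. Calling toString() with
    // no argument on the stream would decode with Charset.defaultCharset().
    static String roundTrip(String text, Charset decodeCharset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        out.write(utf8, 0, utf8.length);
        return new String(out.toByteArray(), decodeCharset);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String original = "\u00C4\u00D6\u00DC"; // the first three umlauts from the example

        // Simulates a JVM whose default charset is windows-1252.
        String garbled = roundTrip(original, Charset.forName("windows-1252"));

        // Decoding with the same charset that was used for writing.
        String intact = roundTrip(original, StandardCharsets.UTF_8);

        System.out.println("garbled equals original: " + garbled.equals(original));
        System.out.println("intact equals original:  " + intact.equals(original));
    }
}
```

Passing the charset explicitly when decoding makes the result independent of {{file.encoding}}, which is why the windows-1252 call corrupts the text while the UTF-8 call preserves it.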
--
This message was sent by Atlassian Jira
(v8.20.10#820010)