[
https://issues.apache.org/jira/browse/NIFI-12670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Handermann resolved NIFI-12670.
-------------------------------------
Fix Version/s: 2.0.0-M4
Resolution: Fixed
> JoltTransform processors incorrectly encode/decode text in the Jolt
> Specification
> ---------------------------------------------------------------------------------
>
> Key: NIFI-12670
> URL: https://issues.apache.org/jira/browse/NIFI-12670
> Project: Apache NiFi
> Issue Type: Bug
> Components: Configuration, Extensions
> Affects Versions: 2.0.0-M1, 1.24.0, 1.25.0, 2.0.0-M2, 1.26.0, 2.0.0-M3
> Environment: JVM with non-UTF-8 default encoding (e.g. default
> Windows installation)
> Reporter: René Zeidler
> Assignee: Jim Steinebrey
> Priority: Minor
> Labels: encoding, jolt, json, utf8, windows
> Fix For: 2.0.0-M4
>
> Attachments: Jolt_Transform_Encoding_Bug.json,
> Jolt_Transform_Encoding_Bug_M2.json, image-2024-01-25-11-01-15-405.png,
> image-2024-01-25-11-59-56-662.png, image-2024-01-25-12-00-09-544.png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> h2. Environment
> This issue affects environments where the JVM default encoding is not
> {{UTF-8}}. Standard Java installations on Windows are affected, as they
> usually use the default encoding {{windows-1252}}. To reproduce the issue
> on Linux, change the default encoding to {{windows-1252}} by adding the
> following line to your {{bootstrap.conf}}:
> {quote}{{java.arg.21=-Dfile.encoding=windows-1252}}
> {quote}
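
To confirm which default charset a JVM actually uses, a small standalone check along the following lines may help. This is only a sketch: the class name DefaultCharsetCheck and the method defaultIsUtf8 are made up for illustration and are not part of NiFi. It would need to run with the same JVM arguments as NiFi to be meaningful.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of NiFi: reports the JVM default charset.
class DefaultCharsetCheck {

    // True when readers created without an explicit charset would decode as UTF-8.
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        // The file.encoding property is what the bootstrap.conf line above sets.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset().name());
        System.out.println("non-UTF-8 setup = " + !defaultIsUtf8());
    }
}
```

If the last line prints true, the JVM matches the environment described in this issue.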
> h2. Summary
> The Jolt Specification of both the JoltTransformJSON and JoltTransformRecord
> processors is read internally using the system default encoding, even though
> it is always stored in UTF-8. This causes non-ASCII characters to be garbled
> in the Jolt Specification, resulting in incorrect transformations (missing
> data or garbled keys).
> h2. Steps to reproduce
> # Make sure NiFi runs with a non-UTF-8 default encoding, see "Environment"
> # Create a GenerateFlowFile processor with the following content:
> {quote}{
> "regularString": "string with only ASCII characters",
> "umlautString": "string with non-ASCII characters: ÄÖÜäöüßéèóò",
> "keyWithÜmlaut": "any string"
> }
> {quote}
> # Connect the processor to a JoltTransformJSON and/or JoltTransformRecord
> processor.
> (If using the record based processor, use a default JsonTreeReader and
> JsonRecordSetWriter. The record reader/writer don't affect this bug.)
> Set the Jolt Specification to:
> {quote}[
> {
> "operation": "shift",
> "spec": {
> "regularString": "Remapped to Umlaut ÄÖÜ",
> "umlautString": "Umlaut String",
> "keyWithÜmlaut": "Key with Umlaut"
> }
> }
> ]
> {quote}
> # Connect the outputs of the Jolt processor(s) to funnels to be able to
> observe the result in the queue.
> # Start the Jolt processor(s) and run the GenerateFlowFile processor once.
> The flow should look similar to this:
> !image-2024-01-25-11-01-15-405.png!
> I also attached a JSON export of the example flow.
> # Observe the content of the resulting FlowFile(s) in the queue.
> h3. Expected Result
> !image-2024-01-25-12-00-09-544.png!
> h3. Actual Result
> !image-2024-01-25-11-59-56-662.png!
> * Remapped key containing non-ASCII characters is garbled, since the key
> value originated from the Jolt Specification.
> * The key "{{keyWithÜmlaut}}" could not be matched at all, since it
> contains non-ASCII characters, resulting in missing data in the output.
> h2. Root Cause Analysis
> Both processors use the
> {{[readTransform|https://github.com/apache/nifi/blob/2e3f83eb54cbc040b5a1da5bce9a74a558f08ea4/nifi-nar-bundles/nifi-jolt-bundle/nifi-jolt-processors/src/main/java/org/apache/nifi/processors/jolt/AbstractJoltTransform.java#L242-L249]}}
> method of {{AbstractJoltTransform}} to read the Jolt Specification property.
> This method uses an
> [{{InputStreamReader}}|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/io/InputStreamReader.html]
> without specifying an encoding, which then defaults to the default charset
> of the environment. Text properties are [always encoded in
> UTF-8|https://github.com/apache/nifi/blob/89836f32d017d77972a4de09c4e864b0e11899a8/nifi-api/src/main/java/org/apache/nifi/components/resource/StandardResourceReferenceFactory.java#L111].
> When the default charset is not UTF-8, the UTF-8 bytes are decoded with a
> different encoding when they are converted to a string, so a garbled Jolt
> Specification is used.
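
The effect described above can be reproduced outside NiFi with a minimal sketch. It does not use the actual readTransform code: the class JoltSpecEncodingDemo and its read method are hypothetical, and the windows-1252 charset is passed explicitly to stand in for a non-UTF-8 JVM default, so the result is the same on any machine.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical demo class, not NiFi code.
class JoltSpecEncodingDemo {

    // Decodes UTF-8-encoded bytes line by line using the given charset.
    // An InputStreamReader built without a charset behaves like this method
    // called with Charset.defaultCharset().
    static String read(byte[] utf8Bytes, Charset decodeWith) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(utf8Bytes), decodeWith))) {
            return reader.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String key = "keyWith\u00DCmlaut"; // "keyWithÜmlaut" from the example spec
        byte[] stored = key.getBytes(StandardCharsets.UTF_8); // property text is stored as UTF-8

        String garbled = read(stored, Charset.forName("windows-1252")); // simulated JVM default
        String correct = read(stored, StandardCharsets.UTF_8);          // explicit charset

        System.out.println("windows-1252 decode matches key: " + garbled.equals(key));
        System.out.println("UTF-8 decode matches key:        " + correct.equals(key));
    }
}
```

With windows-1252 the two UTF-8 bytes of "Ü" (0xC3 0x9C) are decoded as two separate characters, so the key no longer equals the one in the incoming JSON. Passing StandardCharsets.UTF_8 to the reader is one way to avoid this; whether that matches the fix applied in NiFi is not stated in this message.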
> h2. Workaround
> This issue is not present when any attribute expression language is present
> in the Jolt Specification. Simply adding {{${literal('')}}} anywhere in the
> Jolt Specification works around this issue.
> This happens because [a different code path is
> used|https://github.com/apache/nifi/blob/2e3f83eb54cbc040b5a1da5bce9a74a558f08ea4/nifi-nar-bundles/nifi-jolt-bundle/nifi-jolt-processors/src/main/java/org/apache/nifi/processors/jolt/AbstractJoltTransform.java#L233-L237]
> when expression language is present.
> I don't know why the property is even read line-by-line using a stream reader
> when no expression language is present. It seems like just using
> {{getValue()}} would work fine even without expression language, and that
> method doesn't have the encoding bug.