[ 
https://issues.apache.org/jira/browse/NIFI-12670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847320#comment-17847320
 ] 

ASF subversion and git services commented on NIFI-12670:
--------------------------------------------------------

Commit b27fc46b60cea8ef47420254596bf3fdb15754f5 in nifi's branch 
refs/heads/main from Jim Steinebrey
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=b27fc46b60 ]

NIFI-12670 Read Jolt Transform with UTF-8 Encoding

- Specified UTF-8 encoding for reading Jolt Transform to avoid decoding issues 
on Windows or platforms with different default character sets

This closes #8842

Signed-off-by: David Handermann <[email protected]>


> JoltTransform processors incorrectly encode/decode text in the Jolt 
> Specification
> ---------------------------------------------------------------------------------
>
>                 Key: NIFI-12670
>                 URL: https://issues.apache.org/jira/browse/NIFI-12670
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Configuration, Extensions
>    Affects Versions: 2.0.0-M1, 1.24.0, 1.25.0, 2.0.0-M2, 1.26.0, 2.0.0-M3
>         Environment: JVM with non-UTF-8 default encoding (e.g. default 
> Windows installation)
>            Reporter: René Zeidler
>            Assignee: Jim Steinebrey
>            Priority: Minor
>              Labels: encoding, jolt, json, utf8, windows
>         Attachments: Jolt_Transform_Encoding_Bug.json, 
> Jolt_Transform_Encoding_Bug_M2.json, image-2024-01-25-11-01-15-405.png, 
> image-2024-01-25-11-59-56-662.png, image-2024-01-25-12-00-09-544.png
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> h2. Environment
> This issue affects environments where the JVM default encoding is not 
> {{UTF-8}}. Standard Java installations on Windows are affected, as they 
> usually use the default encoding {{windows-1252}}. To reproduce the issue 
> on Linux, change the default encoding to {{windows-1252}} by adding the 
> following line to your {{bootstrap.conf}}:
> {quote}{{java.arg.21=-Dfile.encoding=windows-1252}}
> {quote}
> h2. Summary
> The Jolt Specification of both the JoltTransformJSON and JoltTransformRecord 
> processors is read internally using the system default encoding, even though 
> it is always stored in UTF-8. This causes non-ASCII characters to be garbled 
> in the Jolt Specification, resulting in incorrect transformations (missing 
> data or garbled keys).
> h2. Steps to reproduce
>  # Make sure NiFi runs with a non-UTF-8 default encoding, see "Environment"
>  # Create a GenerateFlowFile processor with the following content:
> {quote}{
>   "regularString": "string with only ASCII characters",
>   "umlautString": "string with non-ASCII characters: ÄÖÜäöüßéèóò",
>   "keyWithÜmlaut": "any string"
> }
> {quote}
>  # Connect the processor to a JoltTransformJSON and/or JoltTransformRecord 
> processor.
> (If using the record based processor, use a default JsonTreeReader and 
> JsonRecordSetWriter. The record reader/writer don't affect this bug.)
> Set the Jolt Specification to:
> {quote}[
>   {
>     "operation": "shift",
>     "spec": {
>       "regularString": "Remapped to Umlaut ÄÖÜ",
>       "umlautString": "Umlaut String",
>       "keyWithÜmlaut": "Key with Umlaut"
>     }
>   }
> ]
> {quote}
>  # Connect the outputs of the Jolt processor(s) to funnels to be able to 
> observe the result in the queue.
>  # Start the Jolt processor(s) and run the GenerateFlowFile processor once.
> The flow should look similar to this:
> !image-2024-01-25-11-01-15-405.png!
> I also attached a JSON export of the example flow.
>  # Observe the content of the resulting FlowFile(s) in the queue.
> h3. Expected Result
> !image-2024-01-25-12-00-09-544.png!
> h3. Actual Result
> !image-2024-01-25-11-59-56-662.png!
>  * Remapped key containing non-ASCII characters is garbled, since the key 
> value originated from the Jolt Specification.
>  * The key "{{keyWithÜmlaut}}" could not be matched at all, since it 
> contains non-ASCII characters, resulting in missing data in the output.
> h2. Root Cause Analysis
> Both processors use the 
> {{[readTransform|https://github.com/apache/nifi/blob/2e3f83eb54cbc040b5a1da5bce9a74a558f08ea4/nifi-nar-bundles/nifi-jolt-bundle/nifi-jolt-processors/src/main/java/org/apache/nifi/processors/jolt/AbstractJoltTransform.java#L242-L249]}}
>  method of {{AbstractJoltTransform}} to read the Jolt Specification property. 
> This method uses an 
> [{{InputStreamReader}}|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/io/InputStreamReader.html]
>  without specifying an encoding, which then defaults to the default charset 
> of the environment. Text properties are [always encoded in 
> UTF-8|https://github.com/apache/nifi/blob/89836f32d017d77972a4de09c4e864b0e11899a8/nifi-api/src/main/java/org/apache/nifi/components/resource/StandardResourceReferenceFactory.java#L111].
>  When the default charset is not UTF-8, the UTF-8 bytes are decoded with a 
> different charset when they are converted to a string, so a garbled Jolt 
> Specification is used.
> h2. Workaround
> This issue is not present when any attribute expression language is present 
> in the Jolt Specification. Simply adding {{${literal('')}}} anywhere in the 
> Jolt Specification works around this issue.
> This happens because [a different code path is 
> used|https://github.com/apache/nifi/blob/2e3f83eb54cbc040b5a1da5bce9a74a558f08ea4/nifi-nar-bundles/nifi-jolt-bundle/nifi-jolt-processors/src/main/java/org/apache/nifi/processors/jolt/AbstractJoltTransform.java#L233-L237]
>  when expression language is present.
> I don't know why the property is even read line-by-line using a stream reader 
> when no expression language is present. It seems like just using 
> {{getValue()}} would work fine even without expression language, and that 
> method doesn't have the encoding bug.
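
The decoding mismatch described under "Root Cause Analysis" can be reproduced
outside NiFi. The following is a minimal sketch, not the NiFi code: the class
name JoltSpecEncodingSketch and the helper readAll are hypothetical, and it
assumes the JVM provides the windows-1252 charset. It encodes a small spec
fragment as UTF-8, then decodes the same bytes once with windows-1252 and once
with UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class JoltSpecEncodingSketch {

    // Hypothetical helper (not NiFi's readTransform): reads the stream line by
    // line, decoding the bytes with the charset passed in explicitly.
    static String readAll(InputStream in, Charset charset) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // \u00DC is the letter U with umlaut, written as an escape so that the
        // source file encoding does not matter.
        String spec = "{\"keyWith\u00DCmlaut\": \"Key with Umlaut\"}";
        byte[] utf8 = spec.getBytes(StandardCharsets.UTF_8);

        // Assumption: windows-1252 is available in this JVM.
        Charset legacy = Charset.forName("windows-1252");

        // Simulates a reader that falls back to a non-UTF-8 default charset.
        String garbled = readAll(new ByteArrayInputStream(utf8), legacy);
        // Decodes with the charset the bytes were actually written in.
        String correct = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);

        System.out.println("legacy decode matches spec: " + garbled.equals(spec));
        System.out.println("UTF-8 decode matches spec:  " + correct.equals(spec));
    }
}
```

Passing an explicit charset to InputStreamReader, as the commit message above
describes, makes the decoded text independent of the file.encoding setting.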



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
