[jira] [Resolved] (NIFI-12708) UnpackContent should allow the user to specify a character set to apply in reading paths and filenames

Joe Witt (Jira) Wed, 07 Feb 2024 08:59:06 -0800


     [ 
https://issues.apache.org/jira/browse/NIFI-12708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Joe Witt resolved NIFI-12708.
-----------------------------
    Resolution: Fixed

> UnpackContent should allow the user to specify a character set to apply in 
> reading paths and filenames
> ------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-12708
>                 URL: https://issues.apache.org/jira/browse/NIFI-12708
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Joe Witt
>            Assignee: Umar Hussain
>            Priority: Major
>             Fix For: 2.0.0
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> https://apachenifi.slack.com/archives/C0L9VCD47/p1706716977280569
> Timon Faerber
>   1 hour ago
> I am currently struggling with an encoding problem for unzipped files.
> The following:
> I have a .zip in my content, and I don't know how it was created (I don't 
> know the character set).
> Then I use UnpackContent processor.
> After that, the path (folder) and filename of the unpacked files are not 
> encoded in UTF-8, and the characters are output as ?.
> I have already tried solutions like 
> https://community.cloudera.com/t5/Support-Questions/Unable-to-write-a-file-with-Chinese-Characters-filename-in/m-p/177183,
>  for example, but they do not work for me.
> Does anyone know another solution?
> Joe Witt
>   43 minutes ago
> If you take nifi out of the equation and just unpack the zip using a command 
> line tool - does it see the paths/names correctly?
> Joe Witt
>   43 minutes ago
> is there a sample zip you can share which has this problem?
> Umar Hussain
>   9 minutes ago
> We tried it with unzip on Linux, and if we give the parameter -O CP437 the 
> German characters ü ä ö in the path appear correctly in the output.
> But a plain unzip command without that parameter doesn't produce correct 
> paths in the output.
> Joe Witt
>   5 minutes ago
> Interesting.  So if you tell the zip program the encoding is CP437 the output 
> appears correct.  Otherwise it is incorrect, yes?
> Umar Hussain
>   3 minutes ago
> Yes, I think it's the encoding of the zip: the zip was created on a Windows 
> machine, and on Linux the default encoding is a different one.
> The processor's current implementation takes the platform's default encoding.
> Joe Witt
>   3 minutes ago
> Yeah this is probably a good summary of behavior we need to consider.  
> https://stackoverflow.com/questions/13261347/correctly-decoding-zip-entry-file-names-cp437-utf-8-or
> Stack Overflow
> Correctly decoding zip entry file names -- CP437, UTF-8 or?
> I recently wrote a zip file I/O library called zipzap, but I'm struggling 
> with correctly decoding zip entry file names from arbitrary zip files.
> Now, the PKWARE spec states:
> D.1 The ZIP format ...
> Joe Witt
>   2 minutes ago
> My guess is we need to allow the user to override the default behavior by 
> selecting the character set used to read the filenames/paths, for cases such 
> as zips created by legacy applications.
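
The behavior discussed in the thread can be sketched with the JDK's own
java.util.zip classes. This is only an illustration of reading entry names
with a caller-selected character set, not the UnpackContent implementation.
The class and method names are hypothetical, and the IBM437 charset (the
JDK's name for CP437, the code page named in the linked Stack Overflow
question) is assumed to be available in the running JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch class; not part of NiFi.
public class ZipNameCharsetSketch {

    // Builds a one-entry zip in memory, encoding the entry name with the
    // given charset (stands in for a zip produced by a legacy tool).
    static byte[] zipWithEntryName(String entryName, Charset nameCharset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer, nameCharset)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write(new byte[] {'o', 'k'});
            out.closeEntry();
        }
        return buffer.toByteArray();
    }

    // Lists entry names, decoding them with the charset chosen by the caller
    // rather than a fixed default.
    static List<String> readEntryNames(byte[] zipBytes, Charset nameCharset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: IBM437 is present in this JDK's charset set.
        Charset legacy = Charset.forName("IBM437");
        // "Bücher/März.txt", written with escapes to avoid source-encoding issues.
        String name = "B\u00fccher/M\u00e4rz.txt";
        byte[] zip = zipWithEntryName(name, legacy);
        System.out.println(readEntryNames(zip, legacy));
    }
}
```

Reading the same bytes with a charset other than the one used to write the
names (ISO-8859-1, for instance) returns different, garbled names, which is
the mismatch described above.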



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
