[
https://issues.apache.org/jira/browse/NIFI-12708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815364#comment-17815364
]
ASF subversion and git services commented on NIFI-12708:
--------------------------------------------------------
Commit e00d2b6d5e5ef7ffb726251b7edf94d0735ce7a8 in nifi's branch
refs/heads/main from Umar Hussain
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=e00d2b6d5e ]
NIFI-12708: add option in UnpackContent to specify encoding charset for
filenames in zip unpacking
This closes #8350
The processor can now take a filename encoding parameter and pass it to zip
unpacking. This allows users to unzip files with a specific encoding and get
correct filenames in the output.
For example, this helps with zip files created on Windows, which by default
uses Cp437 for filename encoding.
If a filename contains special characters such as the German letters ä and ü,
decoding it with Linux's default encoding (usually UTF-8) produces `?` in the
output. When the same file is processed with the property set to `Cp437`, the
processor outputs correct filenames with the special characters preserved.
Signed-off-by: Joseph Witt <[email protected]>
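As a rough illustration of the behaviour described in the commit message, the following Java sketch reads zip entry names with an explicitly chosen charset. It uses only the JDK's `java.util.zip` classes and is not NiFi's UnpackContent code, which may rely on a different zip library. The class name `ZipFilenameCharsetSketch` and the methods `writeZip` and `listEntryNames` are hypothetical, and the sketch assumes the JDK ships the `Cp437` (IBM437) charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: not the NiFi implementation.
public class ZipFilenameCharsetSketch {

    // Build a small in-memory zip whose entry name is encoded with the given charset,
    // standing in for an archive produced by a legacy tool.
    static byte[] writeZip(String entryName, Charset filenameCharset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, filenameCharset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("hello".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    // Read the entry names, decoding them with the caller-supplied charset
    // instead of relying on a default.
    static List<String> listEntryNames(byte[] zipBytes, Charset filenameCharset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), filenameCharset)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Charset cp437 = Charset.forName("Cp437");
        // Unicode escapes keep the source file ASCII-only: "Bücher/Äpfel.txt".
        byte[] zipBytes = writeZip("B\u00fccher/\u00c4pfel.txt", cp437);
        // Decoding with the same charset that was used for writing recovers the name.
        System.out.println(listEntryNames(zipBytes, cp437));
    }
}
```

The design point is simply that the reader, not the platform default, decides how filename bytes are decoded. A processor property for the filename charset would feed the second argument of a reader like this one.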
> UnpackContent should allow the user to specify a character set to apply in
> reading paths and filenames
> ------------------------------------------------------------------------------------------------------
>
> Key: NIFI-12708
> URL: https://issues.apache.org/jira/browse/NIFI-12708
> Project: Apache NiFi
> Issue Type: Improvement
> Reporter: Joe Witt
> Assignee: Umar Hussain
> Priority: Major
> Time Spent: 3h 10m
> Remaining Estimate: 0h
>
> https://apachenifi.slack.com/archives/C0L9VCD47/p1706716977280569
> Timon Faerber
> 1 hour ago
> I am currently struggling with an encoding problem for unzipped files.
> The following:
> I have a .zip in my content, and I don't know how it was created (I don't
> know the character set).
> Then I use the UnpackContent processor.
> After that, the path (folder) and filename of the unpacked files are not
> encoded in UTF-8, and the characters are output as ?.
> I have already tried solutions like
> https://community.cloudera.com/t5/Support-Questions/Unable-to-write-a-file-with-Chinese-Characters-filename-in/m-p/177183,
> for example, but it does not work for me.
> Does anyone know another solution?
> Joe Witt
> 43 minutes ago
> If you take nifi out of the equation and just unpack the zip using a command
> line tool - does it see the paths/names correctly?
> Joe Witt
> 43 minutes ago
> is there a sample zip you can share which has this problem?
> Umar Hussain
> 9 minutes ago
> We tried it with unzip on Linux and if we give parameter -O Cp437 the German
> characters ü ä ö in the path appear correctly in output.
> But a simple unzip command also doesn't produce correct paths in output.
> Joe Witt
> 5 minutes ago
> Interesting. So if you tell the zip program the encoding is cp437 the output
> appears correct. Otherwise it is incorrect, yes?
> Umar Hussain
> 3 minutes ago
> Yes, I think it's the encoding of the zip. The zip was created on a Windows
> machine, and on Linux the default encoding is a different one.
> The processor's current implementation uses the platform's default encoding.
> Joe Witt
> 3 minutes ago
> Yeah this is probably a good summary of behavior we need to consider.
> https://stackoverflow.com/questions/13261347/correctly-decoding-zip-entry-file-names-cp437-utf-8-or
> Stack Overflow
> Correctly decoding zip entry file names -- CP437, UTF-8 or?
> I recently wrote a zip file I/O library called zipzap, but I'm struggling
> with correctly decoding zip entry file names from arbitrary zip files.
> Now, the PKWARE spec states:
> D.1 The ZIP format ...
> Joe Witt
> 2 minutes ago
> My guess is we need to allow the user to override the default behavior by
> selecting the character set we'll read the filenames/paths as in some cases
> of reading legacy app created zips
--
This message was sent by Atlassian Jira
(v8.20.10#820010)