[ https://issues.apache.org/jira/browse/NIFI-12708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joe Witt updated NIFI-12708:
----------------------------
    Fix Version/s: 2.0.0

> UnpackContent should allow the user to specify a character set to apply in
> reading paths and filenames
> ------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-12708
>                 URL: https://issues.apache.org/jira/browse/NIFI-12708
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Joe Witt
>            Assignee: Umar Hussain
>            Priority: Major
>             Fix For: 2.0.0
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> https://apachenifi.slack.com/archives/C0L9VCD47/p1706716977280569
> Timon Faerber
> 1 hour ago
> I am currently struggling with an encoding problem for unzipped files.
> The following:
> I have a .zip in my content, which Im not aware of how it was created (dont
> know Character Set).
> Then I use UnpackContent processor.
> The path (folder) and filename is after that for unpacked files not encoded
> in UTF-8 and the characters are output as ?.
> I have already tried this solutions like
> https://community.cloudera.com/t5/Support-Questions/Unable-to-write-a-file-with-Chinese-Characters-filename-in/m-p/177183,
> for example, but it does not work for me.
> Does anyone know another solution?
> Joe Witt
> 43 minutes ago
> If you take nifi out of the equation and just unpack the zip using a command
> line tool - does it see the paths/names correctly?
> Joe Witt
> 43 minutes ago
> is there a sample zip you can share which has this problem?
> Umar Hussain
> 9 minutes ago
> We tried it with unzip on Linux and if we give parameter -O Cp347 the German
> characters ü ä ö in the path appear correctly in output.
> But a simple unzip command also doesn't produce correct paths in output.
> Joe Witt
> 5 minutes ago
> Interesting. So if you tell the zip program the encoding is cp347 the output
> appears correct. otherwise it is incorrect yes?
> Umar Hussain
> 3 minutes ago
> Yes, I think its the encoding of zip and the zip was created on a windows
> machine and on Linux it's by default a different one.
> The processor current implementation takes the platforms default encoding
> Joe Witt
> 3 minutes ago
> Yeah this is probably a good summary of behavior we need to consider.
> https://stackoverflow.com/questions/13261347/correctly-decoding-zip-entry-file-names-cp437-utf-8-or
> Stack Overflow
> Correctly decoding zip entry file names -- CP437, UTF-8 or?
> I recently wrote a zip file I/O library called zipzap, but I'm struggling
> with correctly decoding zip entry file names from arbitrary zip files.
> Now, the PKWARE spec states:
> D.1 The ZIP format ...
> Joe Witt
> 2 minutes ago
> My guess is we need to allow the user to override the default behavior by
> selecting the character set we'll read the filenames/paths as in some cases
> of reading legacy app created zips



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
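
For illustration only, the override proposed in the thread (let the user pick the character set used to read entry paths and names) might look roughly like the sketch below. It is written in Python, not NiFi's Java, and is not the UnpackContent implementation. The helper names decode_entry_name and list_names are hypothetical. It relies on Python's zipfile module decoding entry names as cp437 by default when the UTF-8 flag (general purpose bit 11) is not set, so the original bytes can be recovered by re-encoding as cp437 and then decoded with the charset the user selected.

```python
import io
import zipfile
from typing import List

# General purpose bit 11 of a zip entry: the name is stored as UTF-8.
UTF8_FLAG = 0x800


def decode_entry_name(info: zipfile.ZipInfo, charset: str) -> str:
    """Return the entry name decoded with `charset` (hypothetical helper).

    Entries that carry the UTF-8 flag are returned unchanged, because
    their encoding is already declared by the archive itself.
    """
    if info.flag_bits & UTF8_FLAG:
        return info.filename
    # zipfile decoded the raw name bytes as cp437 by default; cp437 maps
    # every byte value, so encoding back recovers the original bytes.
    raw = info.filename.encode("cp437")
    return raw.decode(charset)


def list_names(data: bytes, charset: str) -> List[str]:
    """List entry names of an in-memory zip, read with `charset`."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return [decode_entry_name(info, charset) for info in zf.infolist()]
```

A caller would pass whatever charset the archive was actually created with, for example list_names(zip_bytes, "cp1252"); that charset name is an assumption for the example and is not taken from the thread. Entries flagged as UTF-8 are left alone so that a user-selected override cannot corrupt names whose encoding is already known.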