[
https://issues.apache.org/jira/browse/OAK-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258780#comment-17258780
]
Julian Reschke commented on OAK-9304:
-------------------------------------
So...
It *really* is important not to use non-ASCII chars in the source code, so that
it's clear what's going on. It seems that you were testing the NFD form of "a
umlaut" ("a" followed by the combining diaeresis U+0308), which indeed cannot be
encoded in ISO-8859-1. (See example source below).
Also, *never* use the String constructor (for byte[]) without specifying the
charset; the outcome is platform-dependent.
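To illustrate the point about the String constructor, here is a minimal sketch that passes the charset explicitly. The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharsetExample {

    // Decodes the bytes as UTF-8 regardless of the platform default charset.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Encode with a named charset as well, so both directions are well defined.
        byte[] utf8 = "uml\u00e4ut.jpg".getBytes(StandardCharsets.UTF_8);
        // "new String(utf8)" would decode with whatever the platform default happens to be.
        String decoded = decodeUtf8(utf8);
        System.out.println(decoded.equals("uml\u00e4ut.jpg")); // prints "true"
    }
}
```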
*If* the intent is to strip non-ASCII characters, the simplest way is to copy
char-by-char to a new String and remove/replace these characters.
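A minimal sketch of that char-by-char copy, assuming that "non-ASCII" means any char above U+007F. The class and method names are hypothetical, and whether to remove or replace is a choice the caller would have to make.

```java
class AsciiOnly {

    // Copies the input char by char, substituting every non-ASCII char.
    static String replaceNonAscii(String input, char replacement) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            sb.append(c < 0x80 ? c : replacement);
        }
        return sb.toString();
    }

    // Copies the input char by char, dropping every non-ASCII char.
    static String removeNonAscii(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nfd = "umla\u0308ut.jpg";
        System.out.println(replaceNonAscii(nfd, '?')); // prints "umla?ut.jpg"
        System.out.println(removeNonAscii(nfd));       // prints "umlaut.jpg"
    }
}
```

Note that a character outside the BMP is stored as two chars, so the replacing variant would emit two replacement characters for it.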
Finally, my suspicion is that the problem you want to solve is somewhere else:
where the desired field value of Content-Disposition is sent to Azure. If that
happens as a query parameter, it itself may need encoding or recoding. (If you
can point me at the source or the docs I might be able to help).
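If the value does travel as a query parameter, the encoding step might look roughly like the sketch below. This is only an assumption about how the URI is built: the class and method names are hypothetical, and the parameter name "rscd" is used for illustration and should be checked against the Azure docs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryParamEncoding {

    // Percent-encodes a value for use in a URI query string.
    // URLEncoder produces form encoding (space becomes "+"), so "+" is
    // rewritten to "%20"; a literal "+" in the input is already "%2B" by then.
    static String encodeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String disposition = "inline; filename=\"umla?ut.jpg\"";
        // "rscd" is an assumed parameter name, see the note above.
        System.out.println("rscd=" + encodeQueryValue(disposition));
        // prints "rscd=inline%3B%20filename%3D%22umla%3Fut.jpg%22"
    }
}
```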
So, here's the test code:
{noformat}
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncTest {

    public static void main(String[] args) {
        System.out.println("Test with NFC");
        test("uml\u00e4ut.jpg");   // precomposed a-umlaut, U+00E4
        System.out.println("");
        System.out.println("Test with NFD");
        test("umla\u0308ut.jpg");  // "a" followed by combining diaeresis, U+0308
        System.out.println("");
    }

    public static void test(String input) {
        Charset ISO_8859_1 = StandardCharsets.ISO_8859_1;
        Charset UTF_8 = StandardCharsets.UTF_8;
        System.out.println("input: " + input);
        // Copy only the encoded bytes; array() may expose unused backing capacity.
        ByteBuffer buffer = ISO_8859_1.encode(input);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        dump(bytes);
        System.out.println("output (parsed as ISO-8859-1): " + new String(bytes, ISO_8859_1));
        System.out.println("output (parsed as UTF-8): " + new String(bytes, UTF_8));
    }

    // Prints the bytes as space-separated hex values.
    private static void dump(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        System.out.println(sb);
    }
}
{noformat}
Output:
{noformat}
Test with NFC
input: umläut.jpg
75 6d 6c e4 75 74 2e 6a 70 67
output (parsed as ISO-8859-1): umläut.jpg
output (parsed as UTF-8): uml?ut.jpg
Test with NFD
input: umla?ut.jpg
75 6d 6c 61 3f 75 74 2e 6a 70 67
output (parsed as ISO-8859-1): umla?ut.jpg
output (parsed as UTF-8): umla?ut.jpg
{noformat}
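The difference between the two runs above comes down to Unicode normalization. As a sketch only (the comment does not propose this as the fix), java.text.Normalizer can convert the NFD input to NFC, after which this particular name becomes encodable in ISO-8859-1. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class NormalizeBeforeEncode {

    // Composes base characters and combining marks where a precomposed form exists.
    static String toNfc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    // Reports whether every char of the string can be encoded in ISO-8859-1.
    static boolean isLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String nfd = "umla\u0308ut.jpg";
        System.out.println(isLatin1(nfd));                           // prints "false"
        System.out.println(isLatin1(toNfc(nfd)));                    // prints "true"
        System.out.println(toNfc(nfd).equals("uml\u00e4ut.jpg"));    // prints "true"
    }
}
```

Normalizing only helps for characters that have a precomposed form inside ISO-8859-1; anything else would still need to be removed or replaced.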
> Filename with special characters in direct download URI Content-Disposition
> are causing HTTP 400 errors from Azure
> ------------------------------------------------------------------------------------------------------------------
>
> Key: OAK-9304
> URL: https://issues.apache.org/jira/browse/OAK-9304
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: blob-cloud, blob-cloud-azure, blob-plugins
> Affects Versions: 1.36.0
> Reporter: Matt Ryan
> Assignee: Matt Ryan
> Priority: Major
>
> When generating a direct download URI for a filename with certain
> non-standard characters in the name, it can cause the resulting signed URI to
> be considered invalid by some blob storage services (Azure in particular).
> This can lead to blob storage services being unable to service the URI
> request.
> For example, a filename of "Ausländische.jpg" currently requests a
> Content-Disposition header that looks like:
> {noformat}
> inline; filename="Ausländische.jpg"; filename*=UTF-8''Ausla%CC%88ndische.jpg
> {noformat}
> Azure blob storage service fails trying to parse a URI with that
> Content-Disposition header specification in the query string. It instead
> should look like:
> {noformat}
> inline; filename="Ausla?ndische.jpg"; filename*=UTF-8''Ausla%CC%88ndische.jpg
> {noformat}
>
> The "filename" portion of the Content-Disposition needs to consist of
> ISO-8859-1 characters, per [https://tools.ietf.org/html/rfc6266#section-4.3]
> in this paragraph:
> {quote}The parameters "filename" and "filename*" differ only in that
> "filename*" uses the encoding defined in RFC5987, allowing the use of
> characters not present in the ISO-8859-1 character set ([ISO-8859-1]).
> {quote}
> Note that the purpose of this ticket is to address compatibility issues with
> blob storage services, not to ensure ISO-8859-1 compatibility. However, by
> encoding the "filename" portion using standard Java character set encoding
> conversion (e.g. {{Charsets.ISO_8859_1.encode(fileName)}}), we can generate a
> URI that works with Azure, delivers the proper Content-Disposition header in
> responses, and generates the proper client result (meaning, the correct name
> for the downloaded file).