[ 
https://issues.apache.org/jira/browse/OAK-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258780#comment-17258780
 ] 

Julian Reschke commented on OAK-9304:
-------------------------------------

So...

It *really* is important not to use non-ASCII chars in the source code, so 
that it's clear what's going on. It seems that you were testing the NFD form 
of "a umlaut" ("a" followed by U+0308, COMBINING DIAERESIS), which indeed 
cannot be represented in ISO-8859-1. (See example source below.)
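
To make the NFC/NFD difference concrete, here is a minimal sketch (the class 
name NormalizationDemo and the helper fitsLatin1 are made up for illustration) 
that checks whether each form can be encoded in ISO-8859-1 and uses 
java.text.Normalizer to convert one form into the other:

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class NormalizationDemo {

    // True if every char of s has a mapping in ISO-8859-1.
    static boolean fitsLatin1(String s) {
        CharsetEncoder enc = StandardCharsets.ISO_8859_1.newEncoder();
        return enc.canEncode(s);
    }

    public static void main(String[] args) {
        String nfc = "uml\u00e4ut.jpg";   // precomposed U+00E4
        String nfd = "umla\u0308ut.jpg";  // 'a' followed by combining U+0308

        System.out.println(nfc.length() + " " + fitsLatin1(nfc)); // 10 true
        System.out.println(nfd.length() + " " + fitsLatin1(nfd)); // 11 false

        // Composing the NFD string yields the NFC string.
        String composed = Normalizer.normalize(nfd, Normalizer.Form.NFC);
        System.out.println(composed.equals(nfc));                 // true
    }
}
```

Both strings render identically, which is why the escapes matter: only the 
code points show which form is actually being tested.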

Also, *never* use the String constructor (for byte[]) without specifying the 
charset; the outcome is platform-dependent.
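
As an illustration (a sketch only; ExplicitCharsetDemo and its helper methods 
are hypothetical names), the same bytes decode to different strings depending 
on the charset, so the charset should always be named explicitly:

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {

    // Decode with a named charset instead of the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // UTF-8 encoding of U+00E4 is the two bytes 0xC3 0xA4.
        byte[] utf8 = "\u00e4".getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeUtf8(utf8).equals("\u00e4")); // true
        System.out.println(decodeLatin1(utf8).length());       // 2

        // new String(utf8) would pick whichever charset the JVM
        // happens to use as its default, so it is avoided here.
    }
}
```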

*If* the intent is to strip non-ASCII characters, the simplest way is to copy 
char-by-char to a new String and remove/replace these characters.
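
A minimal sketch of that char-by-char approach, assuming replacement (rather 
than removal) is wanted; the class AsciiFilter and the method toAscii are 
hypothetical names, not existing Oak code:

```java
class AsciiFilter {

    // Copies s char by char, substituting the replacement for every
    // char outside the ASCII range. A supplementary character is two
    // chars (a surrogate pair) and therefore yields two replacements.
    static String toAscii(String s, char replacement) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(c < 0x80 ? c : replacement);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("uml\u00e4ut.jpg", '_'));  // uml_ut.jpg
        System.out.println(toAscii("umla\u0308ut.jpg", '_')); // umla_ut.jpg
    }
}
```

Unlike a charset round trip, this makes the replacement character an explicit 
choice and does not depend on any encoder's behavior.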

Finally, my suspicion is that the problem you want to solve is somewhere else: 
where the desired field value of Content-Disposition is sent to Azure. If that 
happens as a query parameter, it itself may need encoding or recoding. (If you 
can point me at the source or the docs I might be able to help).
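
For illustration only: if the header value does travel in a query string, 
percent-encoding it could look roughly like the sketch below. QueryParamDemo 
and encodeQueryValue are hypothetical names, and whether the Azure code path 
needs (or already does) this is an open question, not something verified 
here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryParamDemo {

    // Percent-encodes a header value for use as a query parameter value.
    // URLEncoder performs form encoding, which turns spaces into '+',
    // so those are rewritten to %20 afterwards. A literal '+' in the
    // input is already encoded as %2B at that point.
    static String encodeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String disposition = "inline; filename=\"umla?ut.jpg\"";
        // prints inline%3B%20filename%3D%22umla%3Fut.jpg%22
        System.out.println(encodeQueryValue(disposition));
    }
}
```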

So, here's the test code:

{noformat}
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncTest {

    public static void main(String[] args) {
        System.out.println("Test with NFC");
        test("uml\u00e4ut.jpg");
        System.out.println("");

        System.out.println("Test with NFD");
        test("umla\u0308ut.jpg");
        System.out.println("");
    }

    public static void test(String input) {
        Charset ISO_8859_1 = StandardCharsets.ISO_8859_1;
        Charset UTF_8 = StandardCharsets.UTF_8;
        System.out.println("input: " + input);
        // getBytes substitutes '?' (0x3f) for chars the charset cannot encode
        byte[] bytes = input.getBytes(ISO_8859_1);
        dump(bytes);
        System.out.println("output (parsed as ISO-8859-1): "
                + new String(bytes, ISO_8859_1));
        System.out.println("output (parsed as UTF-8): "
                + new String(bytes, UTF_8));
    }

    // Prints each byte as two hex digits followed by a space.
    private static void dump(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        System.out.println(sb);
    }
}
{noformat}

Output:

{noformat}
Test with NFC
input: umläut.jpg
75 6d 6c e4 75 74 2e 6a 70 67 
output (parsed as ISO-8859-1): umläut.jpg
output (parsed as UTF-8): uml?ut.jpg

Test with NFD
input: umla?ut.jpg
75 6d 6c 61 3f 75 74 2e 6a 70 67 
output (parsed as ISO-8859-1): umla?ut.jpg
output (parsed as UTF-8): umla?ut.jpg
{noformat}

> Filename with special characters in direct download URI Content-Disposition 
> are causing HTTP 400 errors from Azure
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-9304
>                 URL: https://issues.apache.org/jira/browse/OAK-9304
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: blob-cloud, blob-cloud-azure, blob-plugins
>    Affects Versions: 1.36.0
>            Reporter: Matt Ryan
>            Assignee: Matt Ryan
>            Priority: Major
>
> When generating a direct download URI for a filename with certain 
> non-standard characters in the name, it can cause the resulting signed URI to 
> be considered invalid by some blob storage services (Azure in particular).  
> This can lead to blob storage services being unable to service the URI 
> request.
> For example, a filename of "Ausländische.jpg" currently requests a 
> Content-Disposition header that looks like:
> {noformat}
> inline; filename="Ausländische.jpg"; filename*=UTF-8''Ausla%CC%88ndische.jpg 
> {noformat}
> Azure blob storage service fails trying to parse a URI with that 
> Content-Disposition header specification in the query string.  It instead 
> should look like:
> {noformat}
> inline; filename="Ausla?ndische.jpg"; filename*=UTF-8''Ausla%CC%88ndische.jpg 
> {noformat}
>  
> The "filename" portion of the Content-Disposition needs to consist of 
> ISO-8859-1 characters, per [https://tools.ietf.org/html/rfc6266#section-4.3] 
> in this paragraph:
> {quote}The parameters "filename" and "filename*" differ only in that 
> "filename*" uses the encoding defined in RFC5987, allowing the use of 
> characters not present in the ISO-8859-1 character set ([ISO-8859-1]).
> {quote}
> Note that the purpose of this ticket is to address compatibility issues with 
> blob storage services, not to ensure ISO-8859-1 compatibility.  However, by 
> encoding the "filename" portion using standard Java character set encoding 
> conversion (e.g. {{Charsets.ISO_8859_1.encode(fileName)}}), we can generate a 
> URI that works with Azure, delivers the proper Content-Disposition header in 
> responses, and generates the proper client result (meaning, the correct name 
> for the downloaded file).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
