[ https://issues.apache.org/jira/browse/OAK-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Dürig updated OAK-7279:
-------------------------------
    Labels: tech-debt  (was: )

> segment-tar update from java 7 to java 8 may break persisted names using 
> invalid characters
> -------------------------------------------------------------------------------------------
>
>                 Key: OAK-7279
>                 URL: https://issues.apache.org/jira/browse/OAK-7279
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: segment-tar
>            Reporter: Julian Reschke
>            Priority: Minor
>              Labels: tech-debt
>
> segment-tar relies on {{String.getBytes()}} when persisting strings such as 
> item names.
> The problem is that the behavior of this method changed in Java 8 with 
> respect to invalid strings (here: null characters and unpaired surrogates).
> In Java 7, these would roundtrip, as Java was using the so-called "modified 
> UTF-8" encoding (see 
> https://docs.oracle.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8).
>  This encoding produces byte sequences that are *not* valid UTF-8.
> Java 7 will read them back, but Java 8 will map the non-conforming byte 
> sequences to the Unicode replacement character. Note that in particular, 
> multiple child entries might get identical names as a consequence.
> I'm not sure about the severity of this, and whether something needs to be 
> done about it. AFAIC, this is another good reason to reject invalid strings 
> as early as possible in the stack.
> cc [~mduerig]
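
A minimal sketch of the effect described above, not taken from segment-tar.
The class and method names are hypothetical. It assumes that
DataOutputStream.writeUTF (with its two-byte length prefix stripped) is an
acceptable stand-in for producing "modified UTF-8" bytes, and it decodes them
with the standard UTF-8 charset, as a Java 8 or later runtime would. Two
distinct names that differ only in an unpaired surrogate should come back as
the same string. isWellFormed illustrates one possible early check for
rejecting such names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class InvalidNameRoundTrip {

    // Assumption: writeUTF is used here only to obtain "modified UTF-8"
    // bytes; the two-byte length prefix it writes first is stripped.
    public static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Decodes with the standard UTF-8 charset. This String constructor
    // replaces malformed input with the replacement character U+FFFD.
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical early check: rejects null characters and unpaired
    // surrogates, the two kinds of invalid input named in the issue.
    public static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == 0) {
                return false;
            }
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip the low surrogate
                } else {
                    return false;
                }
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high one
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String first = "x\uD800";   // name ending in an unpaired surrogate
        String second = "x\uD801";  // a different unpaired surrogate
        String a = decodeUtf8(modifiedUtf8(first));
        String b = decodeUtf8(modifiedUtf8(second));
        System.out.println("round trip preserved: " + first.equals(a));
        System.out.println("distinct names collide: " + a.equals(b));
        System.out.println("first is well-formed: " + isWellFormed(first));
    }
}
```

Whether such a check belongs at the API boundary or deeper in the stack is
left open here, as in the description above.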



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
