[
https://issues.apache.org/jira/browse/AVRO-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated AVRO-4065:
---------------------------------
Labels: pull-request-available (was: )
> Do Not Copy Array Contents when Expanding UTF-8 Arrays
> ------------------------------------------------------
>
> Key: AVRO-4065
> URL: https://issues.apache.org/jira/browse/AVRO-4065
> Project: Apache Avro
> Issue Type: Improvement
> Components: java
> Affects Versions: 1.12.0
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.13.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The following code snippet in UTF8.java shows that when an array is re-sized
> to accept a new String, the contents of the original array are copied into
> the new array.
> I'm not sure the intent here, but the expectation is that the array is being
> resized precisely because it is about to be overwritten, hence why the cache
> String is blown away too.
> {code:java}
> /**
> * Set length in bytes. Should called whenever byte content changes, even
> if the
> * length does not change, as this also clears the cached String.
> */
> public Utf8 setByteLength(int newLength) {
> SystemLimitException.checkMaxStringLength(newLength);
> if (this.bytes.length < newLength) {
> this.bytes = Arrays.copyOf(this.bytes, newLength);
> }
> this.length = newLength;
> this.string = null;
> this.hash = 0;
> return this;
> }
> {code}
> If we peek at the JDK code, it's simply creating a new array, and then
> copying over the existing contents, but also zero-padding.
> {code:java}
> public static byte[] copyOf(byte[] original, int newLength) {
> byte[] copy = new byte[newLength];
> System.arraycopy(original, 0, copy, 0,
> Math.min(original.length, newLength));
> return copy;
> }
> {code}
> So this is problematic for a few reasons... number one is that it is wasted
> CPU cycle to copy the data and then immediately overwrite it. Second, it's
> not well documented/understood/expected what the state of the String is after
> calling this method. It makes sense that when truncating the string the value
> is a prefix, but unexpected what the behavior is when expanding the String
> without knowing the underlying code (i.e., padding of zeros).
>
> So, the most sane thing is that no assumptions should made by the caller
> what the side-effects of calling this method are. If they want to get a
> Substring, then they can call #subSequence.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)