[
https://issues.apache.org/jira/browse/HADOOP-17901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Bacsko updated HADOOP-17901:
----------------------------------
Attachment: HADOOP-17901-001.patch
> Performance degradation in Text.append() after HADOOP-16951
> -----------------------------------------------------------
>
> Key: HADOOP-17901
> URL: https://issues.apache.org/jira/browse/HADOOP-17901
> Project: Hadoop Common
> Issue Type: Bug
> Components: common
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Critical
> Attachments: HADOOP-17901-001.patch
>
>
> We discovered a serious performance degradation in {{Text.append()}}.
> The logic that is supposed to grow the backing array does not work as
> intended.
> The problem is very hard to spot in the code, so I added extra logging to
> see what happens.
> Let's append 4096 bytes of textual data in a loop:
> {noformat}
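> import java.nio.charset.StandardCharsets;
> import org.apache.commons.lang3.RandomStringUtils;
> import org.apache.hadoop.io.Text;
>
> public class TextAppendTest {
>   public static void main(String[] args) {
>     Text text = new Text();
>     byte[] toAppend = RandomStringUtils.randomAscii(4096).getBytes(StandardCharsets.US_ASCII);
>     for (int i = 0; i < 100; i++) {
>       text.append(toAppend, 0, toAppend.length); // 4096 bytes per call
>     }
>   }
> }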
> {noformat}
> With some debug printouts, we can observe:
> {noformat}
> 2021-09-08 13:35:29,528 INFO [main] io.Text (Text.java:append(251)) - length: 24576, len: 4096, utf8ArraySize: 4096, bytes.length: 30720
> 2021-09-08 13:35:29,528 INFO [main] io.Text (Text.java:append(253)) - length + (length >> 1): 36864
> 2021-09-08 13:35:29,528 INFO [main] io.Text (Text.java:append(254)) - length + len: 28672
> 2021-09-08 13:35:29,528 INFO [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 30720 to 36864
> 2021-09-08 13:35:29,528 INFO [main] io.Text (Text.java:append(251)) - length: 28672, len: 4096, utf8ArraySize: 4096, bytes.length: 36864
> 2021-09-08 13:35:29,528 INFO [main] io.Text (Text.java:append(253)) - length + (length >> 1): 43008
> 2021-09-08 13:35:29,529 INFO [main] io.Text (Text.java:append(254)) - length + len: 32768
> 2021-09-08 13:35:29,529 INFO [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 36864 to 43008
> 2021-09-08 13:35:29,529 INFO [main] io.Text (Text.java:append(251)) - length: 32768, len: 4096, utf8ArraySize: 4096, bytes.length: 43008
> 2021-09-08 13:35:29,529 INFO [main] io.Text (Text.java:append(253)) - length + (length >> 1): 49152
> 2021-09-08 13:35:29,529 INFO [main] io.Text (Text.java:append(254)) - length + len: 36864
> 2021-09-08 13:35:29,529 INFO [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 43008 to 49152
> ...
> {noformat}
> After a certain number of {{append()}} calls, every subsequent capacity
> increment is small.
> Each call raises {{length}} by 4096, so {{length + (length >> 1)}} rises by
> exactly 6144 bytes from one call to the next. The backing array always has
> the size calculated on the previous call, so it always trails the newly
> calculated value and grows by only 6144 bytes. As the log shows, a new array
> is created on every call, even though the existing one still has room.
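
To make the arithmetic concrete, here is a small stand-alone model of the behaviour seen in the log. It is not the Hadoop source: the class and method names are hypothetical, and it assumes (as the log suggests) that append() requests max(length + len, length + (length >> 1)) and that the array is replaced by one of exactly the requested size whenever it is smaller.

```java
/** Hypothetical model of the growth policy inferred from the log; not Hadoop code. */
class AppendGrowthModel {

    /** Counts how many append() calls would allocate a new backing array. */
    static int countReallocations(int chunk, int calls) {
        int length = 0;        // bytes stored so far
        int capacity = 0;      // size of the backing array (assumed to start empty)
        int reallocations = 0;
        for (int i = 0; i < calls; i++) {
            // Capacity is derived from length, not from the current array size.
            int requested = Math.max(length + chunk, length + (length >> 1));
            if (capacity < requested) {
                capacity = requested; // new array of exactly the requested size
                reallocations++;
            }
            length += chunk;
        }
        return reallocations;
    }

    public static void main(String[] args) {
        System.out.println("reallocations: " + countReallocations(4096, 100));
    }
}
```

Under these assumptions the requested capacity rises on every call, so the array from the previous call is never large enough and each of the 100 appends allocates and copies.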
> Suggested solution: don't calculate the capacity in advance based on
> {{length}}. Instead, pass the required minimum ({{length + len}}) to
> {{ensureCapacity()}}. If that minimum is larger than the current byte array,
> the increment should depend on the actual size of the array.
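
A minimal sketch of the suggestion above, for illustration only. GrowableBuffer is a hypothetical stand-in for the buffer inside Text, not the attached patch; the 1.5x factor is carried over from the existing length + (length >> 1) expression, and integer overflow for very large buffers is ignored.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the byte buffer inside Text; not the Hadoop class. */
class GrowableBuffer {
    private byte[] bytes = new byte[0];
    private int length = 0;
    private int reallocations = 0;

    void append(byte[] src, int start, int len) {
        // Pass only the required minimum; no capacity is precomputed from length.
        ensureCapacity(length + len);
        System.arraycopy(src, start, bytes, length, len);
        length += len;
    }

    private void ensureCapacity(int minCapacity) {
        if (bytes.length >= minCapacity) {
            return; // enough room already, keep the current array
        }
        // Grow relative to the actual array size, but never below the minimum.
        int grown = bytes.length + (bytes.length >> 1);
        bytes = Arrays.copyOf(bytes, Math.max(minCapacity, grown));
        reallocations++;
    }

    int length() { return length; }
    int capacity() { return bytes.length; }
    int reallocations() { return reallocations; }
}
```

Because the increment is proportional to the array itself, the capacity grows geometrically and most appends find enough room, so the same 100 x 4096-byte workload should trigger only a handful of reallocations instead of one per call.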