[
https://issues.apache.org/jira/browse/HDFS-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153913#comment-14153913
]
Kihwal Lee commented on HDFS-7174:
----------------------------------
I propose that the children list automatically switch to a more efficient
implementation when the number of children goes over a given threshold. The
one I came up with uses hash-partitioned smaller ArrayLists. It will no longer
keep the whole list sorted, but applications that create a large number of
entries (e.g. over 50K) will likely care more about performance than about the
list being sorted. By setting the threshold high, normal apps won't be
affected.
Here are the results of some measurements.
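To illustrate the idea (this is only a rough sketch, not the actual patch; the class name, partition count, and method names are my own for illustration), each child name is hashed into one of many small ArrayLists, and each bucket is kept sorted so lookups and inserts use binary search on a short list:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Illustrative sketch of a hash-partitioned children list: names are
 * spread across many small ArrayLists, each kept sorted. Only each
 * bucket is sorted, not the list as a whole.
 */
class PartitionedChildren {
    private static final int NUM_PARTITIONS = 1024; // as in the benchmark below
    private final List<List<String>> partitions = new ArrayList<>(NUM_PARTITIONS);

    PartitionedChildren() {
        for (int i = 0; i < NUM_PARTITIONS; i++) {
            partitions.add(new ArrayList<String>());
        }
    }

    private List<String> partitionFor(String name) {
        // Mask off the sign bit so the index is non-negative.
        return partitions.get((name.hashCode() & 0x7fffffff) % NUM_PARTITIONS);
    }

    /** Insert into the small sorted bucket; returns false on duplicate. */
    boolean add(String name) {
        List<String> bucket = partitionFor(name);
        int pos = Collections.binarySearch(bucket, name);
        if (pos >= 0) {
            return false; // already present
        }
        bucket.add(-(pos + 1), name); // insertion point per binarySearch contract
        return true;
    }

    /** Binary search within the name's bucket. */
    boolean contains(String name) {
        return Collections.binarySearch(partitionFor(name), name) >= 0;
    }
}
```

The key point is that an insert only shifts elements within one bucket of roughly n/1024 entries, instead of shifting up to n entries in a single array.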
{noformat}
Unit: milliseconds

Single ArrayList
==================================================
entries   random add   random get   iterating
==================================================
32K              221           45           5
128K             977          178          10
512K          12,843          967          13
1M            49,551        2,214          20
3M           472,408        8,254          43
5M         1,442,971       15,091          66
==================================================

Hashed 1024 ArrayLists (sorted within each list)
==================================================
entries   random add   random get   iterating
==================================================
32K              166           51           8
128K             296          149          13
512K             970          809          19
1M             2,013        1,974          34
3M             7,698        6,997          93
5M            14,126       12,614         133
==================================================

Auto Switching at 32768
==================================================
entries   random add   random get   iterating
==================================================
32K              239           41           7
128K             369          158          13
512K           1,020          822          18
1M             2,162        1,947          34
3M             8,013        7,036          94
5M            15,224       13,509         130
==================================================

The one-time conversion took 28 ms.
{noformat}
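The auto-switching variant in the last table could look roughly like the following (again a sketch under my own naming assumptions, not the actual implementation): keep a single sorted ArrayList until it crosses the threshold, then do a one-time conversion into the partitioned layout.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Illustrative sketch of the auto-switching children list: a single
 * sorted ArrayList below the threshold, hash-partitioned sorted
 * buckets above it. Names and constants are assumptions.
 */
class SwitchingChildren {
    static final int SWITCH_THRESHOLD = 32768; // threshold used in the benchmark
    static final int NUM_PARTITIONS = 1024;

    private List<String> flat = new ArrayList<>();  // small case: one sorted list
    private List<List<String>> buckets = null;      // large case: sorted buckets

    boolean add(String name) {
        List<String> target = (buckets == null) ? flat : bucketFor(name);
        int pos = Collections.binarySearch(target, name);
        if (pos >= 0) {
            return false; // duplicate
        }
        target.add(-(pos + 1), name);
        if (buckets == null && flat.size() > SWITCH_THRESHOLD) {
            convert(); // one-time switch to the partitioned layout
        }
        return true;
    }

    boolean contains(String name) {
        List<String> target = (buckets == null) ? flat : bucketFor(name);
        return Collections.binarySearch(target, name) >= 0;
    }

    private List<String> bucketFor(String name) {
        return buckets.get((name.hashCode() & 0x7fffffff) % NUM_PARTITIONS);
    }

    /** Rehash the flat list into buckets and sort each bucket once. */
    private void convert() {
        buckets = new ArrayList<>(NUM_PARTITIONS);
        for (int i = 0; i < NUM_PARTITIONS; i++) {
            buckets.add(new ArrayList<String>());
        }
        for (String n : flat) {
            bucketFor(n).add(n);
        }
        for (List<String> b : buckets) {
            Collections.sort(b);
        }
        flat = null;
    }
}
```

Since the conversion happens only once per directory, its cost (28 ms in the measurement above) is easily amortized over subsequent inserts.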
> Support for more efficient large directories
> --------------------------------------------
>
> Key: HDFS-7174
> URL: https://issues.apache.org/jira/browse/HDFS-7174
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Critical
>
> When the number of children under a directory grows very large, insertion
> becomes very costly. E.g. creating 1M entries takes tens of minutes. This is
> because each insertion is O(n) in the current list size, so building the
> whole list costs O(n^2) in total (the integral of a linear function). It also
> causes allocations and copies of big arrays.