[
https://issues.apache.org/jira/browse/HDFS-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153913#comment-14153913
]
Kihwal Lee commented on HDFS-7174:
----------------------------------
I propose that the children list automatically switch to a more efficient
implementation when the number of children goes over a given threshold. The
one I came up with uses hash-partitioned smaller ArrayLists. It will no longer
keep the whole list sorted, but applications that create a large number of
entries (e.g. over 50K) will likely care more about performance than about the
list being sorted. By setting the threshold high, normal apps won't be
affected.
Here are the results of some measurements.
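To illustrate the idea (this is only a rough sketch, not the actual patch; the class name, partition count, and method names are my own for illustration), each child name is hashed into one of many small ArrayLists, and each bucket is kept sorted so lookups and inserts use binary search on a short list:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Illustrative sketch of a hash-partitioned children list: names are
 * spread across many small ArrayLists, each kept sorted. Only each
 * bucket is sorted, not the list as a whole.
 */
class PartitionedChildren {
    private static final int NUM_PARTITIONS = 1024; // as in the benchmark below
    private final List<List<String>> partitions = new ArrayList<>(NUM_PARTITIONS);

    PartitionedChildren() {
        for (int i = 0; i < NUM_PARTITIONS; i++) {
            partitions.add(new ArrayList<String>());
        }
    }

    private List<String> partitionFor(String name) {
        // Mask off the sign bit so the index is non-negative.
        return partitions.get((name.hashCode() & 0x7fffffff) % NUM_PARTITIONS);
    }

    /** Insert into the small sorted bucket; returns false on duplicate. */
    boolean add(String name) {
        List<String> bucket = partitionFor(name);
        int pos = Collections.binarySearch(bucket, name);
        if (pos >= 0) {
            return false; // already present
        }
        bucket.add(-(pos + 1), name); // insertion point per binarySearch contract
        return true;
    }

    /** Binary search within the name's bucket. */
    boolean contains(String name) {
        return Collections.binarySearch(partitionFor(name), name) >= 0;
    }
}
```

The key point is that an insert only shifts elements within one bucket of roughly n/1024 entries, instead of shifting up to n entries in a single array.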
{noformat}
Unit: milliseconds

Single ArrayList
==================================================
entries   random add   random get   iterating
==================================================
32K              221           45           5
128K             977          178          10
512K          12,843          967          13
1M            49,551        2,214          20
3M           472,408        8,254          43
5M         1,442,971       15,091          66
==================================================

Hashed 1024 ArrayLists (sorted within each list)
==================================================
entries   random add   random get   iterating
==================================================
32K              166           51           8
128K             296          149          13
512K             970          809          19
1M             2,013        1,974          34
3M             7,698        6,997          93
5M            14,126       12,614         133
==================================================

Auto Switching at 32768
==================================================
entries   random add   random get   iterating
==================================================
32K              239           41           7
128K             369          158          13
512K           1,020          822          18
1M             2,162        1,947          34
3M             8,013        7,036          94
5M            15,224       13,509         130
==================================================

The one-time conversion took 28 ms.
{noformat}
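The auto-switching variant in the last table could look roughly like the following (again a sketch under my own naming assumptions, not the actual implementation): keep a single sorted ArrayList until it crosses the threshold, then do a one-time conversion into the partitioned layout.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Illustrative sketch of the auto-switching children list: a single
 * sorted ArrayList below the threshold, hash-partitioned sorted
 * buckets above it. Names and constants are assumptions.
 */
class SwitchingChildren {
    static final int SWITCH_THRESHOLD = 32768; // threshold used in the benchmark
    static final int NUM_PARTITIONS = 1024;

    private List<String> flat = new ArrayList<>();  // small case: one sorted list
    private List<List<String>> buckets = null;      // large case: sorted buckets

    boolean add(String name) {
        List<String> target = (buckets == null) ? flat : bucketFor(name);
        int pos = Collections.binarySearch(target, name);
        if (pos >= 0) {
            return false; // duplicate
        }
        target.add(-(pos + 1), name);
        if (buckets == null && flat.size() > SWITCH_THRESHOLD) {
            convert(); // one-time switch to the partitioned layout
        }
        return true;
    }

    boolean contains(String name) {
        List<String> target = (buckets == null) ? flat : bucketFor(name);
        return Collections.binarySearch(target, name) >= 0;
    }

    private List<String> bucketFor(String name) {
        return buckets.get((name.hashCode() & 0x7fffffff) % NUM_PARTITIONS);
    }

    /** Rehash the flat list into buckets and sort each bucket once. */
    private void convert() {
        buckets = new ArrayList<>(NUM_PARTITIONS);
        for (int i = 0; i < NUM_PARTITIONS; i++) {
            buckets.add(new ArrayList<String>());
        }
        for (String n : flat) {
            bucketFor(n).add(n);
        }
        for (List<String> b : buckets) {
            Collections.sort(b);
        }
        flat = null;
    }
}
```

Since the conversion happens only once per directory, its cost (28 ms in the measurement above) is easily amortized over subsequent inserts.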
> Support for more efficient large directories
> --------------------------------------------
>
> Key: HDFS-7174
> URL: https://issues.apache.org/jira/browse/HDFS-7174
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Critical
>
> When the number of children under a directory grows very large, insertion
> becomes very costly. E.g. creating 1M entries takes tens of minutes. This is
> because each insertion is O(n) in the current list size, so building the
> whole list costs O(n^2) in total (the integral of a linear function). It also
> causes allocations and copies of big arrays.