[ 
https://issues.apache.org/jira/browse/HDFS-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157325#comment-14157325
 ] 

Kihwal Lee commented on HDFS-7174:
----------------------------------

bq. I wonder, is it worthwhile to try to store the inodeid as an long[] 
directly? 
For majority of cases, I don't think so. If we fix the issue of inode table 
lookup requiring creation of a dummy object, it may be feasible.

bq.  I suspect the performance issues can be related to cache-related issues.
The biggest performance hit is due to the way insertion is done in ArrayList. 
An insertion requires copying of average of 1/2 n entries.  Also as the list 
gets bigger, a new one with a bigger size needs to be allocated, copied and 
partially copied again. For big directories, we are talking megabytes. Far 
better performance can be obtained by using different data structures, but I 
believe ArrayList (originally primitive array) was chosen to minimize the 
memory usage.

> Support for more efficient large directories
> --------------------------------------------
>
>                 Key: HDFS-7174
>                 URL: https://issues.apache.org/jira/browse/HDFS-7174
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>         Attachments: HDFS-7174.new.patch, HDFS-7174.patch, HDFS-7174.patch
>
>
> When the number of children under a directory grows very large, insertion 
> becomes very costly.  E.g. creating 1M entries takes 10s of minutes.  This is 
> because the complexity of an insertion is O\(n\). As the size of a list 
> grows, the overhead grows n^2. (integral of linear function).  It also causes 
> allocations and copies of big arrays.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to