"Dfs -du" is implemented using list requests. So I propose that we support two types of list: one computing subtree size and one not.
Hairong

-----Original Message-----
From: Eric Baldeschwieler [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 15, 2006 1:11 PM
To: [email protected]
Subject: Re: [jira] Created: (HADOOP-713) dfs list operation is too expensive

It is not free. As I understand it, we are recursively walking the
namespace tree with every ls to get this. This is not a scalable design.
Even POSIX doesn't do this! This is a performance problem that will only
get worse.

I suggest removing this performance mistake and documenting the existence
of dfs -du, which is a rather familiar solution to most users.

On Nov 15, 2006, at 12:19 PM, Yoram Arnon wrote:

> I opt for displaying the size in bytes for now, since it's computed
> anyway, is readily available for free, and improves the UI.
> If/when we fix HADOOP-713 we can replace the computation of size with
> a better value for # of files.
> Let's not prevent an improvement just because it might change in the
> future.
>
> Yoram
>
>> -----Original Message-----
>> From: Eric Baldeschwieler [mailto:[EMAIL PROTECTED]
>> Sent: Tuesday, November 14, 2006 7:10 PM
>> To: [email protected]
>> Subject: Re: [jira] Created: (HADOOP-713) dfs list operation is too
>> expensive
>>
>> So let's display nothing for now and revisit this once we have a
>> cleaner CRC story.
>>
>> On Nov 14, 2006, at 10:55 AM, Hairong Kuang wrote:
>>
>>> Setting the size of a directory to be the # of files is a good idea.
>>> But the problem is that the dfs name node has no notion of checksum
>>> files, so its file count includes them. What's displayed at the
>>> client side, however, has the checksum files filtered out, so the
>>> # of files would not match what's actually shown to the user.
>>>
>>> Hairong
>>>
>>> -----Original Message-----
>>> From: Arkady Borkovsky [mailto:[EMAIL PROTECTED]
>>> Sent: Monday, November 13, 2006 5:07 PM
>>> To: [email protected]
>>> Subject: Re: [jira] Created: (HADOOP-713) dfs list operation is too
>>> expensive
>>>
>>> When listing a directory, for directory entries it may be more useful
>>> to display the number of files in the directory, rather than the
>>> number of bytes used by all the files in the directory and its
>>> subdirectories. This is a subjective opinion -- comments?
>>>
>>> (Currently, the value displayed for a subdirectory is "0".)
>>>
>>> On Nov 13, 2006, at 3:25 PM, Hairong Kuang (JIRA) wrote:
>>>
>>>> dfs list operation is too expensive
>>>> -----------------------------------
>>>>
>>>>                 Key: HADOOP-713
>>>>                 URL: http://issues.apache.org/jira/browse/HADOOP-713
>>>>             Project: Hadoop
>>>>          Issue Type: Improvement
>>>>          Components: dfs
>>>>    Affects Versions: 0.8.0
>>>>            Reporter: Hairong Kuang
>>>>
>>>> A list request to dfs returns an array of DFSFileInfo. A DFSFileInfo
>>>> of a directory contains a field called contentsLen, indicating its
>>>> size, which gets computed on the namenode side by recursively going
>>>> through its subdirs. While this happens, the whole dfs directory
>>>> tree is locked.
>>>>
>>>> The list operation is used a lot by DFSClient for listing a
>>>> directory, getting a file's size and # of replicas, and getting the
>>>> size of dfs. Only the last operation needs the field contentsLen to
>>>> be computed.
>>>>
>>>> To reduce its cost, we can add a flag to the list request.
>>>> ContentsLen is computed if the flag is set. By default, the flag is
>>>> false.
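To see why making the flag default to false matters, the computation described in the issue amounts to something like the following simplified sketch (an assumed INode-like tree; this is not the actual namenode code):

    import java.util.List;

    // Simplified sketch of the cost described above -- not the real
    // FSDirectory code. Computing contentsLen for a directory visits
    // every descendant, and per the issue this walk runs while the
    // whole dfs directory tree is locked, so one listing of a large
    // directory can stall every other namenode operation.
    class SubtreeSize {

        static class Node {
            long fileLength;      // meaningful only for files
            List<Node> children;  // null for files

            boolean isDirectory() { return children != null; }
        }

        static long contentsLength(Node node) {
            if (!node.isDirectory()) {
                return node.fileLength;
            }
            long total = 0;
            for (Node child : node.children) {
                total += contentsLength(child);  // O(subtree size)
            }
            return total;
        }
    }

With the flag off by default, an ordinary listing costs O(number of direct children) rather than O(size of the subtree), and only "dfs -du" opts into the recursive walk.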
