"Dfs -du" is implemented using list requests. So I propose that we support two types of list: one computing subtree size and one not.
Hairong

-----Original Message-----
From: Eric Baldeschwieler [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 15, 2006 1:11 PM
To: [email protected]
Subject: Re: [jira] Created: (HADOOP-713) dfs list operation is too expensive

It is not free. As I understand it, we are recursively walking the
namespace tree with every ls to get this. This is not a scalable design.
Even POSIX doesn't do this! This is a performance problem that will only
get worse.

I suggest removing this performance mistake and documenting the existence
of dfs -du, which is a rather familiar solution to most users.

On Nov 15, 2006, at 12:19 PM, Yoram Arnon wrote:

> I opt for displaying the size in bytes for now, since it's computed
> anyway, is readily available for free, and improves the UI.
> If/when we fix HADOOP-713 we can replace the computation of size with
> a better value for # of files.
> Let's not prevent an improvement just because it might change in the
> future.
>
> Yoram
>
>> -----Original Message-----
>> From: Eric Baldeschwieler [mailto:[EMAIL PROTECTED]
>> Sent: Tuesday, November 14, 2006 7:10 PM
>> To: [email protected]
>> Subject: Re: [jira] Created: (HADOOP-713) dfs list operation is too
>> expensive
>>
>> So let's display nothing for now and revisit this once we have a
>> cleaner CRC story.
>>
>> On Nov 14, 2006, at 10:55 AM, Hairong Kuang wrote:
>>
>>> Setting the size of a directory to be the # of files is a good idea.
>>> But the problem is that the dfs name node has no notion of checksum
>>> files, so its file count includes them. What's displayed at the
>>> client side, however, has the checksum files filtered out, so the
>>> # of files would not match what's actually shown to the user.
>>>
>>> Hairong
>>>
>>> -----Original Message-----
>>> From: Arkady Borkovsky [mailto:[EMAIL PROTECTED]
>>> Sent: Monday, November 13, 2006 5:07 PM
>>> To: [email protected]
>>> Subject: Re: [jira] Created: (HADOOP-713) dfs list operation is too
>>> expensive
>>>
>>> When listing a directory, for directory entries it may be more useful
>>> to display the number of files in the directory, rather than the
>>> number of bytes used by all the files in the directory and its
>>> subdirectories. This is a subjective opinion -- comments?
>>>
>>> (Currently, the value displayed for a subdirectory is "0".)
>>>
>>> On Nov 13, 2006, at 3:25 PM, Hairong Kuang (JIRA) wrote:
>>>
>>>> dfs list operation is too expensive
>>>> -----------------------------------
>>>>
>>>>                 Key: HADOOP-713
>>>>                 URL: http://issues.apache.org/jira/browse/HADOOP-713
>>>>             Project: Hadoop
>>>>          Issue Type: Improvement
>>>>          Components: dfs
>>>>    Affects Versions: 0.8.0
>>>>            Reporter: Hairong Kuang
>>>>
>>>> A list request to dfs returns an array of DFSFileInfo. A DFSFileInfo
>>>> of a directory contains a field called contentsLen, indicating its
>>>> size, which gets computed on the namenode side by recursively going
>>>> through its subdirs. While this happens, the whole dfs directory
>>>> tree is locked.
>>>>
>>>> The list operation is used a lot by DFSClient for listing a
>>>> directory, getting a file's size and # of replicas, and getting the
>>>> size of dfs. Only the last operation needs the field contentsLen to
>>>> be computed.
>>>>
>>>> To reduce its cost, we can add a flag to the list request.
>>>> ContentsLen is computed if the flag is set. By default, the flag is
>>>> false.
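To see why making the flag default to false matters, the computation described in the issue amounts to something like the following simplified sketch (an assumed INode-like tree; this is not the actual namenode code):

    import java.util.List;

    // Simplified sketch of the cost described above -- not the real
    // FSDirectory code. Computing contentsLen for a directory visits
    // every descendant, and per the issue this walk runs while the
    // whole dfs directory tree is locked, so one listing of a large
    // directory can stall every other namenode operation.
    class SubtreeSize {

        static class Node {
            long fileLength;      // meaningful only for files
            List<Node> children;  // null for files

            boolean isDirectory() { return children != null; }
        }

        static long contentsLength(Node node) {
            if (!node.isDirectory()) {
                return node.fileLength;
            }
            long total = 0;
            for (Node child : node.children) {
                total += contentsLength(child);  // O(subtree size)
            }
            return total;
        }
    }

With the flag off by default, an ordinary listing costs O(number of direct children) rather than O(size of the subtree), and only "dfs -du" opts into the recursive walk.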
