Eric,

there is some difference between foundation components and user-facing components like the UI. While the foundation is expected to be stable and compatible from release to release, the UI is expected to evolve, continuously becoming more useful and powerful, and providing as much useful functionality at any given time as possible. The specific issue discussed in this thread (which data to show in a directory listing for a subdirectory) is pretty minor -- the more the better, commensurate with resources, and nothing too misleading, is the answer. But as a matter of design principles, user-facing components have a different nature than the infrastructure. There can be no "time bomb" in the UI.

-- ab

On Nov 15, 2006, at 10:23 PM, Eric Baldeschwieler wrote:

Come on. This is a time bomb. Let's fix it. Let's not wire it into our web UI. That makes tree browsing dangerously expensive and sets us up to have users expect this misfeature to be supported.

The goal is to keep things simple. Expanding the deployment of unsustainable / unscalable features is a distraction.

Name node lockups are hardly a hypothetical problem for us.

-1


On Nov 15, 2006, at 2:06 PM, Yoram Arnon wrote:

I agree with all that, except that that's how the ls command works now,
performance issues and all, and that will change only when we fix
HADOOP-713. Until then, using that field is free - it's being computed
anyway.

That said, HADOOP-713 is not a current pain point. Users running ls is pretty much a non-issue, since it's a rare operation, and it takes a fraction of a second on the name node with our largish dfs. M-R jobs don't really pay a penalty for this behaviour, since they normally execute on the last level of the tree anyway, where the current behaviour is desirable.
With all that in mind, the bug may stay in the queue for a while, until more important issues are addressed.
Until then, we may as well get a better UI.

Yoram

-----Original Message-----
From: Eric Baldeschwieler [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 15, 2006 1:11 PM
To: hadoop-dev@lucene.apache.org
Subject: Re: [jira] Created: (HADOOP-713) dfs list operation
is too expensive

It is not free.  As I understand it, we are recursively walking the
namespace tree with every ls to get this.

This is not a scalable design.  Even posix doesn't do this!

This is a performance problem that will only get worse.  I suggest
removing this performance mistake and documenting the existence of
dfs -du, which is a rather familiar solution to most users.

On Nov 15, 2006, at 12:19 PM, Yoram Arnon wrote:

I opt for displaying the size in bytes for now, since it's computed anyway, is readily available for free, and improves the UI.
If/when we fix HADOOP-713 we can replace the computation of size with a better value for # of files.
Let's not prevent an improvement just because it might change in the future.
Yoram

-----Original Message-----
From: Eric Baldeschwieler [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 14, 2006 7:10 PM
To: hadoop-dev@lucene.apache.org
Subject: Re: [jira] Created: (HADOOP-713) dfs list operation
is too expensive

So let's display nothing for now and revisit this once we have a
cleaner CRC story.


On Nov 14, 2006, at 10:55 AM, Hairong Kuang wrote:

Setting the size of a directory to be the # of files is a good idea. But the problem is that the dfs name node has no idea of checksum files. So the number of files includes the checksum files, while what's displayed at the client side has the checksum files filtered out. So the # of files would not match what's really displayed at the client side.

Hairong

-----Original Message-----
From: Arkady Borkovsky [mailto:[EMAIL PROTECTED]
Sent: Monday, November 13, 2006 5:07 PM
To: hadoop-dev@lucene.apache.org
Subject: Re: [jira] Created: (HADOOP-713) dfs list
operation is too
expensive

When listing a directory, for directory entries it may be more useful to display the number of files in the directory, rather than the number of bytes used by all the files in the directory and its subdirectories.
This is a subjective opinion -- comments?

(Currently, the value displayed for a subdirectory is "0")

On Nov 13, 2006, at 3:25 PM, Hairong Kuang (JIRA) wrote:

dfs list operation is too expensive
-----------------------------------

                 Key: HADOOP-713
                 URL:
http://issues.apache.org/jira/browse/HADOOP-713
             Project: Hadoop
          Issue Type: Improvement
          Components: dfs
    Affects Versions: 0.8.0
            Reporter: Hairong Kuang


A list request to dfs returns an array of DFSFileInfo. A DFSFileInfo of a directory contains a field called contentsLen, indicating its size, which gets computed at the namenode side by recursively going through its subdirs. At the same time, the whole dfs directory tree is locked.

The list operation is used a lot by DFSClient for listing a directory, getting a file's size and # of replicas, and getting the size of dfs. Only the last operation needs the field contentsLen to be computed.

To reduce its cost, we can add a flag to the list request. ContentsLen is computed if the flag is set. By default, the flag is false.
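To make the proposal concrete, here is a minimal toy sketch of the flag idea in plain Java. All names here (Node, listLens, contentsLen as a standalone method) are illustrative only, not the actual Hadoop API; the point is just that the recursive walk is skipped unless the caller asks for it:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the proposed change: a directory's contentsLen is
// computed by a recursive walk only when the caller sets the flag.
public class ListFlagSketch {
    static class Node {
        final boolean isDir;
        final long fileLen;  // bytes; meaningful for files only
        final Map<String, Node> children = new LinkedHashMap<>();
        Node(boolean isDir, long fileLen) { this.isDir = isDir; this.fileLen = fileLen; }
    }

    // The expensive recursive computation HADOOP-713 targets.
    static long contentsLen(Node n) {
        if (!n.isDir) return n.fileLen;
        long total = 0;
        for (Node child : n.children.values()) total += contentsLen(child);
        return total;
    }

    // List the entries of a directory. Subdirectory sizes are walked
    // only if computeContentsLen is set; otherwise they report 0,
    // matching today's cheap default.
    static List<Long> listLens(Node dir, boolean computeContentsLen) {
        List<Long> lens = new ArrayList<>();
        for (Node child : dir.children.values()) {
            if (!child.isDir) lens.add(child.fileLen);
            else lens.add(computeContentsLen ? contentsLen(child) : 0L);
        }
        return lens;
    }

    public static void main(String[] args) {
        Node root = new Node(true, 0);
        Node sub = new Node(true, 0);
        sub.children.put("a", new Node(false, 100));
        sub.children.put("b", new Node(false, 200));
        root.children.put("sub", sub);

        System.out.println(listLens(root, false)); // no recursive walk
        System.out.println(listLens(root, true));  // flag set: full walk
    }
}
```

With the flag defaulting to false, a plain ls never pays for the walk; only a size query (e.g. du) opts in.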

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira