[jira] [Commented] (HDFS-7878) API - expose an unique file identifier

Colin Patrick McCabe (JIRA) Fri, 13 Mar 2015 16:25:01 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361303#comment-14361303
 ]

Colin Patrick McCabe commented on HDFS-7878:
--------------------------------------------

bq. Jing wrote: Could you please add more details here? Note that the getFileId 
API in the current patch only calls getFileStatus and returns the inode id 
field contained in the HdfsFileStatus. Or you mean the client is making both 
calls separately? Then why the subclass approach can solve this?

My point is that if the client makes two separate calls to getFileStatus, the 
file at that path could change in between.  So we could end up with the ID of 
one file and the other details of another file.  This is also clearly 
inefficient, since we're doing twice as many RPCs to the NameNode as we need 
to.  And since the NN is the hardest part of HDFS to scale (it hasn't been 
scaled horizontally), the extra load is a concern as well.
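
To make the race concrete, here is a toy model of it.  This is only a sketch: 
{{ToyNameNode}}, {{Status}} and {{rpcCount}} are names invented for 
illustration and are not Hadoop classes.  It just shows that two lookups by 
path can straddle an overwrite, while a single lookup cannot.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the NameNode namespace; not a Hadoop class.
class ToyNameNode {
    // Immutable snapshot of one file's metadata, as returned by one call.
    static final class Status {
        final long inodeId;
        final long length;

        Status(long inodeId, long length) {
            this.inodeId = inodeId;
            this.length = length;
        }
    }

    private final Map<String, Status> namespace = new HashMap<>();
    private long nextInodeId = 1000;  // arbitrary starting value for the sketch
    int rpcCount = 0;                 // counts simulated client RPCs

    // Creating (or overwriting) a path always allocates a fresh inode ID.
    void create(String path, long length) {
        namespace.put(path, new Status(nextInodeId++, length));
    }

    // One simulated RPC: returns ID and other details from the same snapshot.
    Status getFileStatus(String path) {
        rpcCount++;
        return namespace.get(path);
    }
}
```

If a client reads {{inodeId}} from one {{getFileStatus}} call and {{length}} 
from a second one, and the path is overwritten in between, the two values 
describe different files and {{rpcCount}} is 2.  Reading both fields from one 
returned {{Status}} avoids both problems.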

bq. If you call getFileStatus and open currently, you can have the same problem 
- status from one file, open from different file.

Sure, and we ought to fix this too, by making it possible for the client to get 
{{FileStatus}} from a {{DFSInputStream}}.  It would be as easy as just having a 
method inside DFSInputStream that called 
{{open(/.reserved/.inodes/<inode-id-of-file>)}}.
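
For illustration, building that inode-based path could look something like 
the following.  This is a sketch only: {{InodePathSketch}} and {{inodePath}} 
are hypothetical names, not existing HDFS helpers, and rejecting negative IDs 
is an assumption made for the example.

```java
// Hypothetical helper; not part of the HDFS client API.
class InodePathSketch {
    // Prefix of the reserved inode path mentioned above.
    static final String RESERVED_INODES_PREFIX = "/.reserved/.inodes/";

    // Builds a path that names a file by inode ID instead of by its current name.
    static String inodePath(long inodeId) {
        if (inodeId < 0) {
            // Assumption for this sketch: negative IDs are never valid.
            throw new IllegalArgumentException("negative inode ID: " + inodeId);
        }
        return RESERVED_INODES_PREFIX + inodeId;
    }
}
```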

bq. Sergey wrote: ID allows to overcome this by getting ID first, then using 
ID-based path. Of course if ID is obtained separately there's no guarantee but 
there's no way to overcome this.

It seems like there is a very easy way to overcome this... just add a method 
to {{FileStatus}} that either throws {{OperationNotSupported}} or returns the 
inode ID.  Then FileStatus objects returned from HDFS (and any other 
filesystem that has user-visible inode IDs) can return the inode ID, and the 
default implementation can throw {{OperationNotSupported}}.  We do 1/2 the 
RPCs of the current patch, put 1/2 the load on the NN, and don't open up 
another race condition.
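
Roughly, I am picturing something like the sketch below.  The class names 
{{SketchFileStatus}} and {{SketchHdfsFileStatus}} and the method name 
{{getFileId}} on the base class are placeholders for illustration, not the 
real Hadoop classes, and I've used Java's standard 
{{UnsupportedOperationException}} for the "not supported" case.

```java
// Hypothetical stand-in for org.apache.hadoop.fs.FileStatus; not the real class.
class SketchFileStatus {
    private final String path;
    private final long length;

    SketchFileStatus(String path, long length) {
        this.path = path;
        this.length = length;
    }

    String getPath() {
        return path;
    }

    long getLength() {
        return length;
    }

    // Default: file systems without user-visible inode IDs do not support this.
    long getFileId() {
        throw new UnsupportedOperationException(
            "file ID is not supported by this file system");
    }
}

// Hypothetical HDFS-flavoured subclass.  The inode ID arrives in the same
// response as the rest of the status, so no second RPC is needed.
class SketchHdfsFileStatus extends SketchFileStatus {
    private final long inodeId;

    SketchHdfsFileStatus(String path, long length, long inodeId) {
        super(path, length);
        this.inodeId = inodeId;
    }

    @Override
    long getFileId() {
        return inodeId;
    }
}
```

The caller gets the ID and the other details from one object, so they always 
describe the same file.  Callers on other file systems have to be prepared to 
catch the exception.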

What do you think?

> API - expose an unique file identifier
> --------------------------------------
>
>                 Key: HDFS-7878
>                 URL: https://issues.apache.org/jira/browse/HDFS-7878
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.patch
>
>
> See HDFS-487.
> Even though that is resolved as duplicate, the ID is actually not exposed by 
> the JIRA it supposedly duplicates.
> INode ID for the file should be easy to expose; alternatively ID could be 
> derived from block IDs, to account for appends...
> This is useful e.g. for cache key by file, to make sure cache stays correct 
> when file is overwritten.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
