[
https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361303#comment-14361303
]
Colin Patrick McCabe commented on HDFS-7878:
--------------------------------------------
bq. Jing wrote: Could you please add more details here? Note that the getFileId
API in the current patch only calls getFileStatus and returns the inode id
field contained in the HdfsFileStatus. Or you mean the client is making both
calls separately? Then why the subclass approach can solve this?
My point is that if the client makes two separate calls to getFileStatus, the
file status could change in between. So we could end up with the ID of one
file and the other details of a different file. It is also clearly inefficient,
since we're doing twice as many RPCs to the NameNode as we need to. And since
the NN is the hardest part of HDFS to scale (it hasn't been scaled
horizontally), the extra load is another concern.
bq. If you call getFileStatus and open currently, you can have the same problem
- status from one file, open from different file.
Sure, and we ought to fix this too, by making it possible for the client to get
{{FileStatus}} from a {{DFSInputStream}}. It would be as easy as just having a
method inside DFSInputStream that called
{{open(/.reserved/.inodes/<inode-id-of-file>)}}.
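To illustrate the idea, here is a rough sketch of a stream that remembers the inode ID it was opened with and builds the reserved inode path from it, so a later lookup would target the same file even if the original path now points elsewhere. {{SketchInputStream}} and its methods are hypothetical names for illustration, not the actual {{DFSInputStream}} API.

```java
// Hypothetical stand-in for a DFSInputStream that knows its inode ID.
// Not the real HDFS class; names and fields are assumptions for illustration.
class SketchInputStream {
    // Prefix of the reserved inode path mentioned in the comment above.
    static final String RESERVED_INODES_PREFIX = "/.reserved/.inodes/";

    private final String openedPath; // path the client originally asked for
    private final long inodeId;      // inode ID recorded when the file was opened

    SketchInputStream(String openedPath, long inodeId) {
        this.openedPath = openedPath;
        this.inodeId = inodeId;
    }

    String getOpenedPath() {
        return openedPath;
    }

    // Path that identifies the file by inode rather than by name, so a
    // follow-up status lookup is not affected by a rename or overwrite
    // of openedPath.
    String inodePath() {
        return RESERVED_INODES_PREFIX + inodeId;
    }
}
```

A status call made through such a stream would pass {{inodePath()}} instead of the original path.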
bq. Sergey wrote: ID allows to overcome this by getting ID first, then using
ID-based path. Of course if ID is obtained separately there's no guarantee but
there's no way to overcome this.
It seems like there is a very easy way to overcome this... just add a method
to {{FileStatus}} that either throws {{OperationNotSupported}} or returns the
inode ID. Then FileStatus objects returned from HDFS (and any other file
system that has user-visible inode IDs) can return the inode ID, and the
default implementation can throw {{OperationNotSupported}}. That does half the
RPCs of the current patch, puts half the load on the NN, and doesn't open up
another race condition.
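A minimal sketch of what I have in mind, under some assumptions: the class names below are hypothetical stand-ins rather than the real {{FileStatus}} and {{HdfsFileStatus}}, and the sketch uses Java's standard {{UnsupportedOperationException}} in place of the {{OperationNotSupported}} named above.

```java
// Hypothetical stand-in for FileStatus; not the real Hadoop class.
class SketchFileStatus {
    private final String path;
    private final long length;

    SketchFileStatus(String path, long length) {
        this.path = path;
        this.length = length;
    }

    String getPath() {
        return path;
    }

    long getLength() {
        return length;
    }

    // Default implementation: file systems without user-visible inode IDs
    // do not support this call.
    long getFileId() {
        throw new UnsupportedOperationException("no file ID for " + path);
    }
}

// Hypothetical HDFS-specific status. The inode ID arrives in the same
// object as the other fields, so no second lookup is needed.
class SketchHdfsFileStatus extends SketchFileStatus {
    private final long inodeId;

    SketchHdfsFileStatus(String path, long length, long inodeId) {
        super(path, length);
        this.inodeId = inodeId;
    }

    @Override
    long getFileId() {
        return inodeId;
    }
}

class FileIdSketch {
    // Builds a cache key from a single status object, falling back to the
    // path when the file system does not expose an ID.
    static String cacheKey(SketchFileStatus status) {
        try {
            return "inode:" + status.getFileId();
        } catch (UnsupportedOperationException e) {
            return "path:" + status.getPath();
        }
    }
}
```

Because the ID and the other fields come from one status object, they always describe the same file.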
What do you think?
> API - expose an unique file identifier
> --------------------------------------
>
> Key: HDFS-7878
> URL: https://issues.apache.org/jira/browse/HDFS-7878
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.patch
>
>
> See HDFS-487.
> Even though that is resolved as duplicate, the ID is actually not exposed by
> the JIRA it supposedly duplicates.
> INode ID for the file should be easy to expose; alternatively ID could be
> derived from block IDs, to account for appends...
> This is useful e.g. for cache key by file, to make sure cache stays correct
> when file is overwritten.