[
https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139332#comment-16139332
]
Chris Douglas commented on HDFS-7878:
-------------------------------------
[~sershe] I'm finishing a patch, and will post it this week. This took a long
detour through HDFS-6984. We seem to be down to whether we should work directly
with {{PathHandle}} instances or with {{FileStatus}} instances.
While {{FileSystem}} has been trending toward {{FileStatus}}-based APIs for a
[long
time|https://issues.apache.org/jira/browse/HADOOP-6198?focusedCommentId=12744611&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12744611],
I agree 1) we don't yet have consensus on this approach and 2) that trend has
stalled as {{FileSystem}} evolution accommodates new features, rather than
improving base functionality. Nonetheless, I'd like to make a case for it, here.
To repeat the discussion of its (in)efficiency from
[elsewhere|https://issues.apache.org/jira/browse/HDFS-6984?focusedCommentId=15755358&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15755358],
the overhead of passing {{FileStatus}} instead of a {{PathHandle}} is rarely
significant, and it is solvable in cases where thousands or millions of handles
are held in memory. [~sershe], can you comment on whether this overhead would be
significant for your use case?
Mostly, I want to avoid a proliferation of similar APIs in {{FileSystem}}. We
need something for open. Let's solve that, and see whether it's useful.
*Alternatively*, we could create a new API designed around {{PathHandle}}
instances. By way of example, we could have an API like:
{code:java}
public class FileSystem /* blah blah */ {
  /* blah blah */
  public DirectFS direct() { /* ... */ }
}

interface DirectFS {
  FSDataInputStream open(PathHandle file /* opts */);
  FSDataOutputStream create(PathHandle parent, Path child /* opts */);
  FSDataOutputStream append(PathHandle file /* opts */);
  boolean rename(Path src, PathHandle dst /* opts */);
  boolean rename(PathHandle src, Path dst /* opts */);
  boolean rename(PathHandle src, PathHandle dst /* opts */);
  RemoteIterator<FileStatus> listFileStatus(PathHandle dir /* opts */);
  /* globFileStatus, ACLs, xattr, etc. */
}
{code}
{{DirectFS}} calls would follow the {{FileSystem}} specification, with the
additional requirement that every {{PathHandle}} resolves to the entity
extracted from a {{FileStatus}} from that {{FileSystem}}, to the extent that is
enforceable on that {{FileSystem}}. Among the opts, {{FileSystem}}
implementations could add specific criteria that change how strictly the "same
entity" constraint is enforced. This would be an alternative way to encode the
quasi-read-committed semantics in HADOOP-12077. It would also admit cases like
the one that motivated this JIRA i.e., "I don't care where this file is in the
namespace, just open it".
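To make the "opts change how strictly the same-entity constraint is enforced" idea concrete, here is a rough, self-contained sketch. Every name in it ({{HandleStrictnessSketch}}, {{Strictness}}, {{Entity}}, {{Handle}}, {{resolves}}) is hypothetical and stands in for whatever the real API would look like; none of it is Hadoop code. It assumes the store can report a stable file ID and a modification time.

```java
// Hypothetical sketch for discussion; none of these names are Hadoop APIs.
class HandleStrictnessSketch {

    /** How strictly a handle must match the entity it was captured from. */
    enum Strictness {
        SAME_ID_UNMODIFIED, // same file ID and no modification since capture
        SAME_ID             // same file ID; later changes are tolerated
    }

    /** What the store currently reports for a file (assumed fields). */
    static final class Entity {
        final long fileId;
        final String path;
        final long mtime;

        Entity(long fileId, String path, long mtime) {
            this.fileId = fileId;
            this.path = path;
            this.mtime = mtime;
        }
    }

    /** A handle captured from an earlier view of the file. */
    static final class Handle {
        final long fileId;
        final long mtime;

        Handle(Entity e) {
            this.fileId = e.fileId;
            this.mtime = e.mtime;
        }
    }

    /** The path is never compared, so a renamed file still resolves. */
    static boolean resolves(Handle h, Entity current, Strictness s) {
        if (h.fileId != current.fileId) {
            return false; // a different entity never matches
        }
        switch (s) {
            case SAME_ID_UNMODIFIED:
                return h.mtime == current.mtime;
            case SAME_ID:
                return true;
            default:
                throw new AssertionError(s);
        }
    }

    public static void main(String[] args) {
        Entity original = new Entity(42L, "/a/f", 100L);
        Handle h = new Handle(original);

        Entity renamed = new Entity(42L, "/b/f", 100L);   // moved, unchanged
        Entity appended = new Entity(42L, "/a/f", 200L);  // same file, modified
        Entity replaced = new Entity(43L, "/a/f", 300L);  // new file at old path

        System.out.println(resolves(h, renamed, Strictness.SAME_ID_UNMODIFIED));  // true
        System.out.println(resolves(h, appended, Strictness.SAME_ID_UNMODIFIED)); // false
        System.out.println(resolves(h, appended, Strictness.SAME_ID));            // true
        System.out.println(resolves(h, replaced, Strictness.SAME_ID));            // false
    }
}
```

The renamed case is the "I don't care where this file is in the namespace" scenario: the handle still resolves because only the ID (and optionally the modification time) is checked.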
In the fullness of time, _maybe_ we would implement this. However, all the use
cases we have now are much, much simpler. We need a consistent handle to read
files, and we have some ideas about how to implement this for other storage
systems like S3/Azure. If the {{PathHandle}} API gets no traction, at least we have a
reasonable fallback so {{FileSystem::open(FileStatus, ...)}} can dispatch to
{{FileSystem::open(Path, ...)}}.
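A minimal sketch of that fallback, using hypothetical stand-in types rather than the real {{FileSystem}} and {{FileStatus}} classes ({{OpenFallbackSketch}}, {{Status}}, {{BaseFs}}, {{IdAwareFs}} and the {{fileId}} field are all assumptions made for illustration). The base class dispatches the status-based open to the path-based one, and a store that can resolve a stable ID overrides it.

```java
import java.util.Objects;

// Hypothetical stand-ins for illustration; these are not the Hadoop types.
class OpenFallbackSketch {

    /** Stand-in for FileStatus: a path plus an optional store-assigned ID. */
    static final class Status {
        final String path;
        final long fileId; // -1 when the store exposes no stable ID (assumption)

        Status(String path, long fileId) {
            this.path = Objects.requireNonNull(path, "path");
            this.fileId = fileId;
        }
    }

    /** Default behavior: the Status overload dispatches to the path overload. */
    static class BaseFs {
        String open(String path) {
            return "path:" + path;
        }

        String open(Status stat) {
            return open(stat.path);
        }
    }

    /** A store with stable IDs overrides the Status overload to resolve by ID. */
    static class IdAwareFs extends BaseFs {
        @Override
        String open(Status stat) {
            return stat.fileId >= 0 ? "id:" + stat.fileId : super.open(stat);
        }
    }

    public static void main(String[] args) {
        Status stat = new Status("/data/part-0", 16389L);
        System.out.println(new BaseFs().open(stat));    // path:/data/part-0
        System.out.println(new IdAwareFs().open(stat)); // id:16389
    }
}
```

Callers written against the status-based overload keep working on every store; only stores that can do better need to override anything.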
[~andrew.wang] and [[email protected]], please, please let me know if this is
OK with you soon, so we can finish this.
> API - expose an unique file identifier
> --------------------------------------
>
> Key: HDFS-7878
> URL: https://issues.apache.org/jira/browse/HDFS-7878
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Labels: BB2015-05-TBR
> Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch,
> HDFS-7878.03.patch, HDFS-7878.04.patch, HDFS-7878.05.patch,
> HDFS-7878.06.patch, HDFS-7878.patch
>
>
> See HDFS-487.
> Even though that is resolved as duplicate, the ID is actually not exposed by
> the JIRA it supposedly duplicates.
> INode ID for the file should be easy to expose; alternatively ID could be
> derived from block IDs, to account for appends...
> This is useful e.g. for cache key by file, to make sure cache stays correct
> when file is overwritten.