[
https://issues.apache.org/jira/browse/HDFS-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746989#comment-15746989
]
Chris Douglas commented on HDFS-6984:
-------------------------------------
bq. This means we need to be able to add new fields to FileStatus, and that
PathHandle should be an opaque PB-encoded blob. It doesn't require
cross-serialization between FileStatus and HdfsFileStatus, and we don't
necessarily need FileStatus to be serializable either.
Yes, I agree. To add new fields to {{FileStatus}}, we were changing its
serialization in 3.x to PB (since updating {{Writable}} is problematic). If
we're going to use PB, then arranging its fields so it _could_ be compatible
with {{HdfsFileStatus}} was just anticipating future changes; you're right,
cross-serialization is not a requirement. By arranging the type this way, the
{{bytes}} field is accessible deserialized into a vanilla {{FileStatus}}, but
if one were to deserialize into {{HdfsFileStatus}} then the {{PathHandle}}
could have typed, versioned fields without extra steps. It's free to make this
easy, so the patch doesn't make it hard.
The current proposal in HDFS-7878 offers {{open(FileStatus)}} instead of
{{open(PathHandle)}} to avoid adding new functions to {{FileSystem}} for
listing, deleting, renaming, etc. against a {{PathHandle}}. It'd be convenient
if, as is fairly common, processes could just throw the result of a
{{FileSystem}} query to a service/framework using some of the automatic
serialization (e.g., {{Serializable}} in Spark and {{Writable}} in MapReduce),
but we could move it to a library. Again, I'm pretty ambivalent about removing
{{Writable}} from {{FileStatus}}, since we still have it on other data returned
by {{FileSystem}} (like {{Path}}), but it is largely vestigial in 3.x.
bq, Is HDFS-9806 using the FileSystem interface to access the backing storage
system? It sounds like we need to be able to add a new field to FileStatus, but
not necessarily cross-serialization (or serialization) of a FileStatus.
Yes, that's correct. We just need a consistent place to add "metadata that
uniquely identifies a file" so we can handle it consistently across
{{FileSystem}} implementations.
bq. Regarding cross-serialization, yes we can save some copies by making
HdfsFileStatus implement FileStatus and returning an HdfsFileStatus directly in
FileSystem, but HdfsFileStatus has even more fields that a user is unlikely to
care about. It also seems weird from an API perspective to include extra,
inaccessible data in the serialized version of a class.
If the record doesn't leave the process or the deserializer knew to use a
{{HdfsFileStatus}} then it'd be accessible. Unlike today, a client could use
the data we return from the NN if we didn't throw it out.
I hear you on the inefficiency of carrying excess data per record, but the NN
is a more critical bottleneck, so perhaps we should focus on not returning the
unused data at all. Among clients, unless they're caching thousands of
{{FileStatus}} objects in memory, the overhead is negligible, and the solution
is to build efficient composites (i.e., instead of shaving redundant bytes per
record, we can compress repeated data across millions of records).
> In Hadoop 3, make FileStatus serialize itself via protobuf
> ----------------------------------------------------------
>
> Key: HDFS-6984
> URL: https://issues.apache.org/jira/browse/HDFS-6984
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 3.0.0-alpha1
> Reporter: Colin P. McCabe
> Assignee: Colin P. McCabe
> Labels: BB2015-05-TBR
> Attachments: HDFS-6984.001.patch, HDFS-6984.002.patch,
> HDFS-6984.003.patch, HDFS-6984.nowritable.patch
>
>
> FileStatus was a Writable in Hadoop 2 and earlier. Originally, we used this
> to serialize it and send it over the wire. But in Hadoop 2 and later, we
> have the protobuf {{HdfsFileStatusProto}} which serves to serialize this
> information. The protobuf form is preferable, since it allows us to add new
> fields in a backwards-compatible way. Another issue is that already a lot of
> subclasses of FileStatus don't override the Writable methods of the
> superclass, breaking the interface contract that read(status.write) should be
> equal to the original status.
> In Hadoop 3, we should just make FileStatus serialize itself via protobuf so
> that we don't have to deal with these issues. It's probably too late to do
> this in Hadoop 2, since user code may be relying on the existing FileStatus
> serialization there.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]