[jira] [Commented] (HDFS-6984) In Hadoop 3, make FileStatus serialize itself via protobuf

Chris Douglas (JIRA) Tue, 13 Dec 2016 18:27:56 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746989#comment-15746989
 ]


Chris Douglas commented on HDFS-6984:
-------------------------------------

bq. This means we need to be able to add new fields to FileStatus, and that 
PathHandle should be an opaque PB-encoded blob. It doesn't require 
cross-serialization between FileStatus and HdfsFileStatus, and we don't 
necessarily need FileStatus to be serializable either.
Yes, I agree. To add new fields to {{FileStatus}}, we were changing its 
serialization in 3.x to PB (since updating {{Writable}} is problematic). If 
we're going to use PB, then arranging its fields so it _could_ be compatible 
with {{HdfsFileStatus}} was just anticipating future changes; you're right, 
cross-serialization is not a requirement. By arranging the type this way, the 
{{bytes}} field is accessible deserialized into a vanilla {{FileStatus}}, but 
if one were to deserialize into {{HdfsFileStatus}} then the {{PathHandle}} 
could have typed, versioned fields without extra steps. It's free to make this 
easy, so the patch doesn't make it hard.

The current proposal in HDFS-7878 offers {{open(FileStatus)}} instead of 
{{open(PathHandle)}} to avoid adding new functions to {{FileSystem}} for 
listing, deleting, renaming, etc. against a {{PathHandle}}. It'd be convenient 
if, as is fairly common, processes could just throw the result of a 
{{FileSystem}} query to a service/framework using some of the automatic 
serialization (e.g., {{Serializable}} in Spark and {{Writable}} in MapReduce), 
but we could move it to a library. Again, I'm pretty ambivalent about removing 
{{Writable}} from {{FileStatus}}, since we still have it on other data returned 
by {{FileSystem}} (like {{Path}}), but it is largely vestigial in 3.x.

bq, Is HDFS-9806 using the FileSystem interface to access the backing storage 
system? It sounds like we need to be able to add a new field to FileStatus, but 
not necessarily cross-serialization (or serialization) of a FileStatus.

Yes, that's correct. We just need a consistent place to add "metadata that 
uniquely identifies a file" so we can handle it consistently across 
{{FileSystem}} implementations.

bq. Regarding cross-serialization, yes we can save some copies by making 
HdfsFileStatus implement FileStatus and returning an HdfsFileStatus directly in 
FileSystem, but HdfsFileStatus has even more fields that a user is unlikely to 
care about. It also seems weird from an API perspective to include extra, 
inaccessible data in the serialized version of a class.

If the record doesn't leave the process or the deserializer knew to use a 
{{HdfsFileStatus}} then it'd be accessible. Unlike today, a client could use 
the data we return from the NN if we didn't throw it out.

I hear you on the inefficiency of carrying excess data per record, but the NN 
is a more critical bottleneck, so perhaps we should focus on not returning the 
unused data at all. Among clients, unless they're caching thousands of 
{{FileStatus}} objects in memory, the overhead is negligible, and the solution 
is to build efficient composites (i.e., instead of shaving redundant bytes per 
record, we can compress repeated data across millions of records).

> In Hadoop 3, make FileStatus serialize itself via protobuf
> ----------------------------------------------------------
>
>                 Key: HDFS-6984
>                 URL: https://issues.apache.org/jira/browse/HDFS-6984
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Colin P. McCabe
>            Assignee: Colin P. McCabe
>              Labels: BB2015-05-TBR
>         Attachments: HDFS-6984.001.patch, HDFS-6984.002.patch, 
> HDFS-6984.003.patch, HDFS-6984.nowritable.patch
>
>
> FileStatus was a Writable in Hadoop 2 and earlier.  Originally, we used this 
> to serialize it and send it over the wire.  But in Hadoop 2 and later, we 
> have the protobuf {{HdfsFileStatusProto}} which serves to serialize this 
> information.  The protobuf form is preferable, since it allows us to add new 
> fields in a backwards-compatible way.  Another issue is that already a lot of 
> subclasses of FileStatus don't override the Writable methods of the 
> superclass, breaking the interface contract that read(status.write) should be 
> equal to the original status.
> In Hadoop 3, we should just make FileStatus serialize itself via protobuf so 
> that we don't have to deal with these issues.  It's probably too late to do 
> this in Hadoop 2, since user code may be relying on the existing FileStatus 
> serialization there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HDFS-6984) In Hadoop 3, make FileStatus serialize itself via protobuf

Reply via email to