[ 
https://issues.apache.org/jira/browse/HDFS-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15738840#comment-15738840
 ] 

Chris Douglas commented on HDFS-6984:
-------------------------------------

bq. Can we split the Serializable stuff into a separate change? Changing 
booleans into Boolean objects is additional overhead, and combining these two 
changes makes it harder to review.
Reluctantly... this will create multiple, conflicting patches, but OK. The two 
are somewhat related, in that {{InodeType}} is required in the PB impl, so 
{{Serializable}} should consider that missing field from the stream (i.e., 
null) as an error. That said, nothing enforces or relies on consistent 
semantics between the two serializations.

bq. Also, I am not a Java serialization expert, but IIUC maintaining 
compatibility is difficult, which means maybe people should just use PB anyway.
[[email protected]] requested it for Spark, so if there's a way to configure 
Spark to use PB then we can avoid it. Even distcp doesn't really have a 
cross-version compatibility requirement, so this seemed mostly harmless.

bq. how about we simply stop implementing Writable altogether in FileStatus?
Agreed, I don't think this functionality is often used. Though your point about 
the ACL bit and the encryption bit highlights why I'd like some PB 
representation for HDFS-7878, and why the 2.x PB semantics are useful.

Consider {{DistributedFileSystem#listStatusInternal}}. Each HdfsFileStatus 
instance copies its fields to a new FileStatus object by 
{{HdfsFileStatus#makeQualified}}, those HdfsFileStatus fields having been 
populated by copying fields from the PB record returned from RPC. Setting aside 
the inefficiency of creating copies, this also drops information returned by 
the NN to construct the generic {{FileStatus}}, which (per your summary of 
HDFS-6326) discards even more information to preserve {{Writable}} 
compatibility and performance.

In HDFS-7878, a {{PathHandle}} (or whatever) needs some FS-specific metadata to 
be opaquely serialized as {{bytes}} in FileStatus, but resolved by a FileSystem 
instance. If a receiver deserializes a FileStatus object without knowing the 
source FS (i.e., as a vanilla FileStatusProto object), a serialized 
HdfsFileStatus object will populate those fields it understands, and (in PB 
2.x) the record will retain unknown fields. The unit test in v003 verifies this.

[~sershe] cited a case in Hive/LLAP where this object needs to be passed across 
process boundaries (as splits, etc.), and HDFS-9806 needs a nonce from the FS 
to detect changes. Neither knows the FileStatus type before it's read, but by 
using PB the {{PathHandle}} data are addressable and the basic fields (path, 
type, owner, etc.) are available, and that's enough.

If HDFS were to return HdfsFileStatus wrapping the HdfsFileStatusProto from 
which it's derived, that would not only be more efficient (avoiding copies), 
but serializing it and reading it back as a vanilla FileStatus would also work 
(again, per v003).

There's no particular reason that requires implementing the {{Writable}} 
contract, though. I'm ambivalent about removing it, since HdfsFileStatus could 
be modified to write HdfsFileStatusProto objects compatible with FileStatus 
equally well through some utility class.

bq. I'm guessing FileStatus was originally made Writable as a shortcut for 
DistCp
Hadoop RPC used to require {{Writable}} types, so it was probably to implement 
{{listStatus}} and {{getFileStatus}}.

> In Hadoop 3, make FileStatus serialize itself via protobuf
> ----------------------------------------------------------
>
>                 Key: HDFS-6984
>                 URL: https://issues.apache.org/jira/browse/HDFS-6984
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Colin P. McCabe
>            Assignee: Colin P. McCabe
>              Labels: BB2015-05-TBR
>         Attachments: HDFS-6984.001.patch, HDFS-6984.002.patch, 
> HDFS-6984.003.patch, HDFS-6984.nowritable.patch
>
>
> FileStatus was a Writable in Hadoop 2 and earlier.  Originally, we used this 
> to serialize it and send it over the wire.  But in Hadoop 2 and later, we 
> have the protobuf {{HdfsFileStatusProto}} which serves to serialize this 
> information.  The protobuf form is preferable, since it allows us to add new 
> fields in a backwards-compatible way.  Another issue is that already a lot of 
> subclasses of FileStatus don't override the Writable methods of the 
> superclass, breaking the interface contract that read(status.write) should be 
> equal to the original status.
> In Hadoop 3, we should just make FileStatus serialize itself via protobuf so 
> that we don't have to deal with these issues.  It's probably too late to do 
> this in Hadoop 2, since user code may be relying on the existing FileStatus 
> serialization there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to