[
https://issues.apache.org/jira/browse/HADOOP-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998482#comment-12998482
]
Matt Foley commented on HADOOP-6949:
------------------------------------
Regarding VersionedProtocols: By searching for all implementations of the
VersionedProtocol API "getProtocolVersion(String, long)", and the values they
return, I found the following 16 version constants:
hdfs.protocol.ClientDatanodeProtocol.versionID; (8)
hdfs.protocol.ClientProtocol.versionID; (65)
hdfs.server.protocol.DatanodeProtocol.versionID; (27)
hdfs.server.protocol.InterDatanodeProtocol.versionID; (6)
hdfs.server.protocol.NamenodeProtocol.versionID; (5)
mapred.AdminOperationsProtocol.versionID; (3)
mapred.InterTrackerProtocol.versionID; (31)
mapred.TaskUmbilicalProtocol.versionID; (19)
mapreduce.protocol.ClientProtocol.versionID; (36)
hadoop.security.authorize.RefreshAuthorizationPolicyProtocol.versionID; (1)
hadoop.security.RefreshUserMappingsProtocol.versionID; (1)
hadoop.ipc.AvroRpcEngine.VERSION; (0)
TEST:
hadoop.security.TestDoAsEffectiveUser.TestProtocol.versionID (1)
hadoop.ipc.MiniRPCBenchmark.MiniProtocol.versionID; (1)
hadoop.ipc.TestRPC.TestProtocol.versionID; (1)
mapred.TestTaskCommit.MyUmbilical unnamed constant (0)
The first nine are clearly production version numbers. The next three (two
security and one Avro) do not seem to have ever been incremented, and I wonder
whether they need to be now. The last four are test-specific and, I think,
should not be incremented. So please advise:
1. Do all the first nine protocols use WritableRPCEngine and therefore need to
have their version numbers incremented?
2. Do the next three need to have their version numbers incremented for this
change?
3. Do you agree that the four Test protocol versions should not change?
4. Did I miss any that you are aware of?
Thank you. I will put together the versioning patch when we have consensus on
what to change.
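To make the versioning mechanics concrete, here is a minimal standalone sketch of the contract behind getProtocolVersion(String, long) as discussed above. This is a hypothetical illustration, not Hadoop source; the interface and constant names here are stand-ins, and the example version numbers mirror ClientDatanodeProtocol's versionID from the list above.

```java
// Hypothetical sketch (not Hadoop source) of the VersionedProtocol contract:
// each protocol publishes a static versionID, and the RPC layer rejects a
// call when the client's expected version does not match the server's.
public class VersionCheckSketch {

    interface VersionedProtocolSketch {
        long getProtocolVersion(String protocol, long clientVersion);
    }

    // ClientDatanodeProtocol.versionID is 8 in the list above; an
    // incompatible serialization change would bump the server to 9,
    // which is exactly why the constants enumerated above may need edits.
    public static final long SERVER_VERSION_ID = 9L;

    // Simplified stand-in for the engine's version check.
    public static boolean accept(long clientVersion) {
        return clientVersion == SERVER_VERSION_ID;
    }

    public static void main(String[] args) {
        System.out.println(accept(9L)); // matching client: true
        System.out.println(accept(8L)); // stale client: false, call rejected
    }
}
```

The point of bumping versionID for this change is that a stale client fails fast at the version check instead of misparsing the new array encoding.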
Konstantin, regarding your suggestion to extend this enhancement to arrays of
non-Primitive objects:
There is a simple way to extend this approach to arrays of Strings and
Writables. I coded it up and have it available; it adds an
ArrayWritable.Internal very similar to ArrayPrimitiveWritable.Internal. Nice
and clean. Of course the size improvement isn't as dramatic, since the ratio of
label size to object size isn't as bad as with primitives, but the performance
improvement is still there (not having to go through the type-decision loop for
every element of a long array).
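To make the label-overhead tradeoff concrete, here is a standalone sketch (not Hadoop's actual ObjectWritable code; the tag strings are simplified stand-ins) comparing per-element type tagging with a single up-front tag for a long[], the primitive case where the ratio is worst:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Standalone sketch comparing the wire size of a long[] serialized with a
// per-element type label versus a single up-front label, which is the
// essence of the ArrayPrimitiveWritable change. Tag formats are simplified.
public class TagOverheadSketch {

    // Old-style encoding: array type + length + (element tag + value) * n.
    public static int taggedSize(long[] arr) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF("[J");          // array type, once
            out.writeInt(arr.length);    // length
            for (long v : arr) {
                out.writeUTF("long");    // type label repeated per element
                out.writeLong(v);
            }
            return buf.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Optimized encoding: array type + length + raw values only.
    public static int untaggedSize(long[] arr) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF("[J");          // array type, once
            out.writeInt(arr.length);    // length
            for (long v : arr) {
                out.writeLong(v);        // no per-element label
            }
            return buf.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] blockIds = new long[1000]; // e.g. block IDs in a block report
        System.out.println(taggedSize(blockIds));   // 14008 bytes
        System.out.println(untaggedSize(blockIds)); // 8008 bytes
    }
}
```

With larger elements such as Writables, the fixed value size in the ratio grows, which is why the savings are less dramatic there even though the per-element decision loop still disappears.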
However, using it would cause a significant change in semantics:
The current ObjectWritable can handle non-homogeneous arrays containing
different kinds of Writables, and nulls.
The optimization we are discussing here is removing the type-tagging of every
array entry, thereby assuming that the array is in fact strictly homogeneous,
including having no null entries. There is also the question of what type
declaration the container array has on entry, and what type it should have on
exit. In the current code the only restrictions on the array type are
    Writable[].class.isAssignableFrom(X[].class), and
    X.class.isAssignableFrom(x[i].getClass()) for all elements i.
Also in the current code, the array produced on the receiving side is always
simply Writable[].
Questions:
1. Is it acceptable to assume/mandate that all arrays of Writables passed in to
RPC shall be homogeneous and have no null elements? Note that this is a very
strict form of homogeneity, forbidding even subclass instances: for example, if
you define a class FooWritable and a subclass SubFooWritable, you can put both
in an array of declared type FooWritable[], but the receiving-side deserializer
will ONLY produce objects of type FooWritable, and will fail entirely unless
the serialized output of FooWritable.write() and SubFooWritable.write() happens
to be compatible (which is too complicated an exception to try to explain,
IMO).
2. On the receiving side, should we (i) continue producing an array of type
Writable[], or (ii) preserve the type of the array during the implicit encoding
process, or (iii) produce an array of componentType same as the actual element
type, assuming the array is indeed strictly homogeneous so all elements are the
same type? All three are easily done, but have implications about what can be
done with the array later.
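The two questions above can be illustrated with a small standalone sketch (class names are hypothetical stand-ins, not Hadoop source). It shows, first, that Java's own array typing permits exactly the subclass mix that strict homogeneity would forbid, and second, how option (iii) would allocate the receive-side array with the actual componentType via reflection instead of plain Writable[]:

```java
import java.lang.reflect.Array;

// Sketch of the array-semantics questions above. FooWritable/SubFooWritable
// are hypothetical stand-ins; newTypedArray sketches option (iii).
public class ArraySemanticsSketch {
    public static class FooWritable {}
    public static class SubFooWritable extends FooWritable {}

    // Option (iii): build the receive-side array from the element type read
    // once from the stream, so a later cast like (FooWritable[]) succeeds.
    public static Object newTypedArray(Class<?> componentType, int length) {
        return Array.newInstance(componentType, length);
    }

    public static void main(String[] args) {
        // Legal in Java, but forbidden by the strict-homogeneity assumption:
        FooWritable[] mixed = { new FooWritable(), new SubFooWritable() };
        System.out.println(mixed[1].getClass().getSimpleName()); // SubFooWritable

        Object received = newTypedArray(FooWritable.class, 2);
        System.out.println(received instanceof FooWritable[]);   // true
    }
}
```

Under option (i), `received` would instead be a plain Writable[], and any caller wanting FooWritable[] back would have to copy the array element by element.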
I would like to get the ArrayPrimitiveWritable committed soon, so unless the
answers to the above are really obvious to all interested parties, maybe I
should open a new Jira for a longer discussion?
Finally, regarding putting it in v22, it should port trivially, but is it
acceptable to have an incompatible protocol change in a released version?
Thanks.
> Reduces RPC packet size for primitive arrays, especially long[], which is
> used at block reporting
> -------------------------------------------------------------------------------------------------
>
> Key: HADOOP-6949
> URL: https://issues.apache.org/jira/browse/HADOOP-6949
> Project: Hadoop Common
> Issue Type: Improvement
> Components: io
> Reporter: Navis
> Assignee: Matt Foley
> Fix For: 0.23.0
>
> Attachments: ObjectWritable.diff, arrayprim.patch, arrayprim_v4.patch
>
> Original Estimate: 10m
> Remaining Estimate: 10m
>
> Current implementation of oah.io.ObjectWritable marshals primitive array
> types as a general object array: array type string + array length + (element
> type string + value)*n.
> There is no need to specify the type of each element for primitive arrays.
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira