Some changes to Record I/O interfaces
-------------------------------------

                 Key: HADOOP-2030
                 URL: https://issues.apache.org/jira/browse/HADOOP-2030
             Project: Hadoop
          Issue Type: Improvement
            Reporter: Vivek Ratan


I wanted to suggest some changes to the Record I/O interfaces. 

Under org.apache.hadoop.record, _RecordInput_ and _RecordOutput_ are the 
interfaces to serialize and deserialize basic types for Java-generated stubs. 
All the methods in _RecordInput_ and _RecordOutput_ take a parameter, a string, 
called 'tag'. As far as I can see, this tag is used only for XML-based 
serialization, to write out the name of the field that is being serialized. A 
lot of the methods ignore it. My proposal is to eliminate this parameter, for 
a number of reasons: 
- We don't need to write the name of a field when serializing in XML. None of 
the other serializers (for binary or CSV) write out the name of a field - we 
only write the field value. The generated stubs know which field is associated 
with which value (and now, with type information support, the field name is 
part of the type information and is not required to be serialized along with 
the field data). In fact, even in XML, I don't see the field name being read 
back in, so it serves no purpose whatsoever. 
- The tag is used occasionally in the error message, but again this can be 
handled better by the caller of _RecordInput_ and _RecordOutput_. 
- The tag is also used to detect whether a record is nested or not. In CSV, we 
wrap nested records with "s{}". We also want to know whether a record is nested 
or the top-most, so that we add a newline at the end of a top-most record. If a 
tag is empty, it is assumed that the record is the top-most. This overloads 
the tag parameter with an unrelated meaning. It's far more readable to just 
pass a boolean to _startRecord()_ and _endRecord()_ which directly indicates 
whether the record is nested or not. Or, add two additional methods to 
_RecordOutput_ and _RecordInput_: _start()_ and _stop()_, which are called at 
the beginning and end of every top-most record while _startRecord()_ and 
_endRecord()_ are used only for nested records. The former's slightly better, 
IMO, but either approach is much better than using an empty tag to indicate a 
top-level record.
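
To make the boolean variant concrete, here is a minimal sketch of a CSV-style 
writer. All names in it (SketchRecordOutput, CsvSketchOutput, NestedFlagDemo) 
are hypothetical and are not the existing org.apache.hadoop.record classes; it 
also skips string escaping, and assumes only what is described above: nested 
records are wrapped in "s{}" and a top-most record ends with a newline.

```java
// Hypothetical sketch only: these types are illustrative and are NOT the
// actual org.apache.hadoop.record interfaces.
interface SketchRecordOutput {
    void writeInt(int i);
    void writeString(String s);
    void startRecord(boolean nested);
    void endRecord(boolean nested);
}

class CsvSketchOutput implements SketchRecordOutput {
    private final StringBuilder buf = new StringBuilder();
    // True when the next value is the first one in the current record.
    private boolean first = true;

    // Emit a comma before every value except the first in a record.
    private void separator() {
        if (!first) {
            buf.append(',');
        }
        first = false;
    }

    public void writeInt(int i) {
        separator();
        buf.append(i);
    }

    // No escaping is done here; a real serializer would need it.
    public void writeString(String s) {
        separator();
        buf.append(s);
    }

    public void startRecord(boolean nested) {
        if (nested) {
            // A nested record is itself a value of the enclosing record.
            separator();
            buf.append("s{");
            first = true;
        }
    }

    public void endRecord(boolean nested) {
        if (nested) {
            buf.append('}');
            first = false;
        } else {
            // Only the top-most record is terminated by a newline.
            buf.append('\n');
            first = true;
        }
    }

    @Override
    public String toString() {
        return buf.toString();
    }
}

class NestedFlagDemo {
    public static void main(String[] args) {
        CsvSketchOutput out = new CsvSketchOutput();
        out.startRecord(false);   // top-most record
        out.writeInt(1);
        out.startRecord(true);    // nested record
        out.writeString("a");
        out.writeInt(2);
        out.endRecord(true);
        out.writeInt(3);
        out.endRecord(false);
        System.out.print(out);
    }
}
```

The point of the sketch is that the nested/top-most decision is explicit at 
the call site, rather than inferred from whether a tag string is empty.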

The issue with tags brings up a related issue. Sometimes, we may need to pass 
in additional information to _RecordInput_ or _RecordOutput_. For example, 
suppose we do need to write the field name along with the field value. We can 
think of such a requirement in two ways. A) Such decisions of what to 
serialize/deserialize are independent of the format/protocol that the data is 
serialized in. If we want to write something else, that should be written 
separately by the stub. So, if we want to serialize the field name before a 
field value, a stub should call _RecordOutput.writeString(<field name>)_ first, 
followed by _RecordOutput.writeInt(<field value>)_. The methods in 
_RecordInput_ and _RecordOutput_ are the lowest-level methods and they should 
just be concerned with writing individual types. B) What if a protocol wants 
to write things differently? For example, we may want to write the field name 
before the field value for XML only (for debugging's sake, or for whatever 
else). Or it may be that the field name and field value need to be enclosed in 
certain tags, which isn't possible if you write them separately. In these 
cases, methods in 
_RecordInput_ and _RecordOutput_ need to be passed additional information. This 
can be done by providing an optional parameter for these methods. Maybe a 
structure/class containing field information, or a reference to the field 
itself (the Tag parameter was meant to serve a similar purpose, but just 
passing in a String may be inadequate). For now, there is no real need for 
either of these situations, so we should be OK with getting rid of the tag 
parameter. 
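
The two options can be sketched side by side. Everything below is 
hypothetical: FieldInfo, FieldAwareOutput, XmlFieldSketch and the XML element 
names are invented for illustration and do not reflect the format the real 
XML serializer emits. The sketch also does no XML escaping.

```java
// Hypothetical sketch only: none of these types exist in
// org.apache.hadoop.record, and the XML layout is made up.
class FieldInfo {
    final String name;

    FieldInfo(String name) {
        this.name = name;
    }
}

interface FieldAwareOutput {
    // Lowest-level methods: write a single value and nothing else.
    void writeInt(int value);
    void writeString(String value);

    // Option B: an overload that carries optional field information,
    // so a format can decide for itself how to use it.
    void writeInt(int value, FieldInfo field);
}

class XmlFieldSketch implements FieldAwareOutput {
    private final StringBuilder buf = new StringBuilder();

    public void writeInt(int value) {
        buf.append("<int>").append(value).append("</int>");
    }

    public void writeString(String value) {
        buf.append("<string>").append(value).append("</string>");
    }

    public void writeInt(int value, FieldInfo field) {
        // Name and value share one enclosing element, which two
        // separate writeString()/writeInt() calls cannot produce.
        buf.append("<member name=\"").append(field.name).append("\">");
        writeInt(value);
        buf.append("</member>");
    }

    @Override
    public String toString() {
        return buf.toString();
    }
}

class FieldInfoDemo {
    public static void main(String[] args) {
        // Option A: the stub writes the name as just another value.
        XmlFieldSketch a = new XmlFieldSketch();
        a.writeString("count");
        a.writeInt(7);
        System.out.println(a);

        // Option B: the stub hands field information to the serializer.
        XmlFieldSketch b = new XmlFieldSketch();
        b.writeInt(7, new FieldInfo("count"));
        System.out.println(b);
    }
}
```

With option A the interface stays minimal and every format sees the same 
sequence of calls; with option B a format such as XML can wrap the name and 
value together, at the cost of a wider interface.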

Similar changes need to be made on the C++ side, where we have _OArchive_ and 
_IArchive_: 
- The tag parameter needs to be removed
- _startRecord()_ and _endRecord()_ in _OArchive_ and _IArchive_ need to take a 
boolean parameter that indicates whether the record is nested or not
- Currently, both _startRecord()_ and _endRecord()_ in _IArchive_ take an 
additional parameter, a reference to a hadoop record. This is never used 
anywhere and is not required (the corresponding methods in _RecordInput_ and 
_RecordOutput_ don't take any such parameter, which is the right thing to 
do), and should be removed. 


