Adding versioning to Record I/O
-------------------------------

                 Key: HADOOP-1883
                 URL: https://issues.apache.org/jira/browse/HADOOP-1883
             Project: Hadoop
          Issue Type: New Feature
            Reporter: Vivek Ratan


There is a need to add versioning support to Record I/O. Users frequently 
update DDL files, usually by adding/removing fields, but do not want to change 
the name of the data structure. They would like older and newer deserializers to 
read as much of each other's data as possible. For example, suppose Record I/O is used to 
serialize/deserialize log records, each of which contains a message and a 
timestamp. An initial data definition could be as follows:
{code}
class MyLogRecord {
  ustring msg;
  long timestamp;
}
{code}
Record I/O generates a class, _MyLogRecord_, which represents a log record and 
can serialize/deserialize itself. Now, suppose newer log records additionally 
contain a severity level. A user would want to update the definition for a log 
record but use the same class name. The new definition would be:
{code}
class MyLogRecord {
  ustring msg;
  long timestamp;
  int severity;
}
{code}
Users would want a new deserializer to read old log records (and perhaps use a 
default value for the severity field), and an old deserializer to read newer 
log records (and skip the severity field).

This requires some concept of versioning in Record I/O, or rather, the 
additional ability to read/write type information of a record. The following is 
a proposal to do this. 

Every Record I/O Record will have type information that describes how the 
record is structured (what fields it has, their types, and so on). This type 
information, represented by the class _RecordTypeInfo_, is itself 
serializable/deserializable. Every Record class supports a static method 
_getRecordTypeInfo()_, which returns a _RecordTypeInfo_ object. Users are 
expected to serialize this type information (by calling 
_RecordTypeInfo.serialize()_) in an appropriate fashion (in a separate file, 
for example, or at the beginning of a file). Using the same DDL as above, 
here's how we could serialize log records: 
{code}
FileOutputStream fOut = new FileOutputStream("data.log");
CsvRecordOutput csvOut = new CsvRecordOutput(fOut);
...
// get the type information for MyLogRecord
RecordTypeInfo typeInfo = MyLogRecord.getRecordTypeInfo();
// ask it to write itself out
typeInfo.serialize(csvOut);
...
// now, serialize a bunch of records
while (...) {
  MyLogRecord log = new MyLogRecord();
  // fill up the MyLogRecord object
  ...
  // serialize
  log.serialize(csvOut);
}
{code}
In this example, the type information of the Record is serialized first, followed 
by the contents of the individual records, all into the same file. 

Every Record class also supports a static method, _setRTIFilter()_, that lets 
a user set a filter for deserializing. It takes a _RecordTypeInfo_ object as a 
parameter. This filter represents the type information of the data that is 
being deserialized. When deserializing, the Record uses this filter (if one is 
set) to figure out what to read. Continuing with our example, here's how we 
could deserialize records:
{code}
FileInputStream fIn = new FileInputStream("data.log");
// we know the record was written in CSV format
CsvRecordInput csvIn = new CsvRecordInput(fIn);
...
// we know the type info is written at the beginning; read it
RecordTypeInfo typeInfoFilter = new RecordTypeInfo();
// deserialize it
typeInfoFilter.deserialize(csvIn);
// let MyLogRecord know what to expect
MyLogRecord.setRTIFilter(typeInfoFilter);
// deserialize each record
while (fIn.available() > 0) {  // more data left in the file
  MyLogRecord log = new MyLogRecord();
  log.deserialize(csvIn);
  ...
}
{code}

The filter is optional. If none is provided, the deserializer expects the data 
to be in the same format it would itself serialize. (Note that a filter can also 
be provided when serializing, forcing the serializer to write data in the 
format of the filter, but there is no use case for this functionality yet.) 
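If no filter is set, deserialization looks just like it does without versioning. 
A minimal sketch (the file name is illustrative, and the file is assumed to hold 
record data only, with no leading type information):
{code}
// reading without a filter: the data is assumed to match this version's layout
FileInputStream fIn = new FileInputStream("old-data.log"); // hypothetical file: records only
CsvRecordInput csvIn = new CsvRecordInput(fIn);
MyLogRecord log = new MyLogRecord();
log.deserialize(csvIn);  // misreads if the data was written by a different version
{code}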

What goes in the type information for a record? The type information for each 
field in a Record is made up of:
   1. a unique field ID, which is the field name. 
   2. a type ID, which denotes the type of the field (int, string, map, etc). 

The type information for a composite type contains type information for each of 
its fields. This approach is somewhat similar to the one taken by [Facebook's 
Thrift|http://developers.facebook.com/thrift/], as well as by Google's Sawzall. 
The main difference is that we use field names as the field ID, whereas Thrift 
and Sawzall use user-defined field numbers. While field names take more space, 
they have the big advantage that existing DDLs require no changes. 
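To make this concrete, here is a rough sketch of how such type information might 
be laid out internally. All class, field, and accessor names below are 
illustrative assumptions, not a committed API:
{code}
import java.util.List;

// Illustrative sketch only: a possible internal layout for type information.
class RecordTypeInfo {
  String name;                      // name of the record type, e.g. "MyLogRecord"
  List<FieldTypeInfo> fieldTypes;   // one entry per field, in declaration order
}

class FieldTypeInfo {
  String fieldID;   // the field name doubles as the unique field ID, e.g. "severity"
  TypeID typeID;    // denotes the field's type: int, string, map, etc.
}

class TypeID {
  byte typeVal;     // a tag identifying the type (INT, STRING, MAP, STRUCT, ...)
  // composite TypeIDs (structs, vectors, maps) would subclass this and carry
  // nested type information, so composite records are described recursively
}
{code}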

When deserializing, a Record looks at the filter and compares it with its own 
set of {field name, field type} tuples. If there is a field in the data that it 
doesn't know about, it skips it (it knows how many bytes to skip, based on 
the filter). If the deserialized data does not contain some field values, the 
Record gives them default values. Additionally, we could allow users to 
optionally specify default values in the DDL. The location of a field in a 
structure does not matter. This lets us support reordering of fields. Note that 
no change is required to the DDL syntax, and only minimal changes to 
client code (clients just need to read and write type information in addition to 
record data). 
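As a sketch of the matching logic a generated _deserialize()_ might follow under 
this proposal (the _rtiFilter_ field and the helpers _deserializeOwnFormat()_, 
_hasField()_, _readField()_, _skip()_, and _getFieldTypeInfos()_ are hypothetical 
placeholders, not part of any existing API):
{code}
// Illustrative sketch of filter-driven deserialization; all helpers hypothetical.
public void deserialize(RecordInput rin) throws IOException {
  if (rtiFilter == null) {
    // no filter set: assume the data matches our own current layout
    deserializeOwnFormat(rin);
    return;
  }
  // walk the fields in the order the *writer* laid them out, per the filter
  for (FieldTypeInfo writerField : rtiFilter.getFieldTypeInfos()) {
    if (hasField(writerField.getFieldID())) {
      // a field we know: read it into the matching member
      readField(writerField.getFieldID(), rin);
    } else {
      // a field we don't know: its type info tells us how much to skip
      skip(writerField.getTypeID(), rin);
    }
  }
  // our fields that are absent from the writer's layout keep default values
}
{code}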

This scheme gives us an additional powerful feature: we can build a generic 
serializer/deserializer, so that users can read all kinds of data without 
having access to the original DDL or the original stubs. As long as you know 
where the type information of a record is serialized, you can read the data. 
One can also build a simple UI that displays the structure of the data 
serialized in any file. This is very useful for handling data across 
lots of versions. 
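For example, reading arbitrary data with a generic reader could look roughly 
like this (_GenericRecord_ and its API are hypothetical):
{code}
// Hypothetical generic reader: no compiled stubs for MyLogRecord are needed.
FileInputStream fIn = new FileInputStream("data.log");
CsvRecordInput csvIn = new CsvRecordInput(fIn);

// the type information at the head of the file fully describes the record layout
RecordTypeInfo typeInfo = new RecordTypeInfo();
typeInfo.deserialize(csvIn);

// a GenericRecord would walk the type info and read one value per field,
// exposing values by field name instead of through generated accessors
while (fIn.available() > 0) {
  GenericRecord rec = GenericRecord.read(typeInfo, csvIn);
  System.out.println(rec.get("msg") + " @ " + rec.get("timestamp"));
}
{code}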


