[ https://issues.apache.org/jira/browse/HADOOP-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527961 ]

Vivek Ratan commented on HADOOP-1883:
-------------------------------------

Let me answer #2 first. Remember, with versioning, we're looking to figure out 
which fields have changed. You need both a field name and its type to uniquely 
identify a field. Suppose we have: 
{code}
class a{
  int s;
  char c;
}
{code}

Now, suppose I replace the second field: 
{code}
class a{
  int s;
  long c;
}
{code}

A deserializer generated from the new class, when reading data serialized by 
the old class, needs to understand that the second field in the serialized data 
needs to be skipped, i.e., 'char c' is different from 'long c'. Similarly, if I 
change just the name of a field, and not its type, it is a different field 
(this one is more debatable, but it's safer to assume that it's a different 
field). So the type information needs to contain both the field name and field 
type, which a deserializer matches against the field name and type of its own 
fields to see what to read and what to skip. [BTW, your example is right in 
that the class is invalid because both fields have the same name, but I think 
it's the wrong example for what we're discussing]. 
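
To make the matching concrete, here is a minimal self-contained sketch of the 
decision a deserializer makes for each serialized field. The Map-based 
representation is purely illustrative, not the actual Record I/O API: 
{code}
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: decide, field by field, whether the reader can
// consume a serialized field or must skip it, using {name, type} pairs
// as the field identity.
public class FieldMatchSketch {
  public static void main(String[] args) {
    // Type info written by the old class: field name -> field type.
    Map<String, String> writer = new LinkedHashMap<>();
    writer.put("s", "int");
    writer.put("c", "char");

    // Fields the deserializer generated from the new class knows about.
    Map<String, String> reader = new LinkedHashMap<>();
    reader.put("s", "int");
    reader.put("c", "long");  // same name, different type

    for (Map.Entry<String, String> field : writer.entrySet()) {
      if (field.getValue().equals(reader.get(field.getKey()))) {
        System.out.println("read " + field.getKey());
      } else {
        // Unknown name, or same name with a different type: treat it as
        // a different field and skip its bytes (the serialized type says
        // how many bytes that is).
        System.out.println("skip " + field.getKey());
      }
    }
  }
}
{code}
Running this prints "read s" and "skip c": the int field matches, while 
'char c' vs 'long c' is treated as two different fields. 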

Now, alternatively, I could have the user assign a unique field number to each 
field (which Thrift and Sawzall do). I could have something like this: 
{code}
class a{
  1: int s;
  2: char c;
}
{code}
If a user changes a field, they decide whether the field number changes. They 
SHOULD change the field number if the field type changes (and maybe if only the 
name changes). Any deserializer will depend on the user-generated field numbers 
to decide what fields to read and what to skip. So, for example, you could 
change the class as follows: 
{code}
class a{
  1: int s;
  3: long c;
}
{code}
Here, the changed field has been given a new field number. 

Either approach works. The latter (using field numbers) is more space-efficient, 
as we just need to keep track of field numbers (and types, in order to know how 
many bytes to skip). But it requires manual effort from users and is more prone 
to errors. We recommend the former, as the space inefficiency is minuscule (you 
will presumably store the type information once for hundreds or thousands of 
records), and you don't have to change DDLs or depend on users generating field 
numbers. 


> Adding versioning to Record I/O
> -------------------------------
>
>                 Key: HADOOP-1883
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1883
>             Project: Hadoop
>          Issue Type: New Feature
>            Reporter: Vivek Ratan
>
> There is a need to add versioning support to Record I/O. Users frequently 
> update DDL files, usually by adding/removing fields, but do not want to 
> change the name of the data structure. They would like older & newer 
> deserializers to read as much data as possible. For example, suppose Record 
> I/O is used to serialize/deserialize log records, each of which contains a 
> message and a timestamp. An initial data definition could be as follows:
> {code}
> class MyLogRecord {
>   ustring msg;
>   long timestamp;
> }
> {code}
> Record I/O creates a class, _MyLogRecord_, which represents a log record and 
> can serialize/deserialize itself. Now, suppose newer log records additionally 
> contain a severity level. A user would want to update the definition for a 
> log record but use the same class name. The new definition would be:
> {code}
> class MyLogRecord {
>   ustring msg;
>   long timestamp;
>   int severity;
> }
> {code}
> Users would want a new deserializer to read old log records (and perhaps use 
> a default value for the severity field), and an old deserializer to read 
> newer log records (and skip the severity field).
> This requires some concept of versioning in Record I/O, or rather, the 
> additional ability to read/write type information of a record. The following 
> is a proposal to do this. 
> Every Record I/O Record will have type information which represents how the 
> record is structured (what fields it has, what types, etc.). This type 
> information, represented by the class _RecordTypeInfo_, is itself 
> serializable/deserializable. Every Record supports a method 
> _getRecordTypeInfo()_, which returns a _RecordTypeInfo_ object. Users are 
> expected to serialize this type information (by calling 
> _RecordTypeInfo.serialize()_) in an appropriate fashion (in a separate file, 
> for example, or at the beginning of a file). Using the same DDL as above, 
> here's how we could serialize log records: 
> {code}
> FileOutputStream fOut = new FileOutputStream("data.log");
> CsvRecordOutput csvOut = new CsvRecordOutput(fOut);
> ...
> // get the type information for MyLogRecord
> RecordTypeInfo typeInfo = MyLogRecord.getRecordTypeInfo();
> // ask it to write itself out
> typeInfo.serialize(csvOut);
> ...
> // now, serialize a bunch of records
> while (...) {
>   MyLogRecord log = new MyLogRecord();
>   // fill up the MyLogRecord object
>   ...
>   // serialize
>   log.serialize(csvOut);
> }
> {code}
> In this example, the type information of a Record is serialized first, 
> followed by the contents of various records, all into the same file. 
> Every Record also supports a method that allows a user to set a filter for 
> deserializing. A method _setRTIFilter()_ takes a _RecordTypeInfo_ object as a 
> parameter. This filter represents the type information of the data that is 
> being deserialized. When deserializing, the Record uses this filter (if one 
> is set) to figure out what to read. Continuing with our example, here's how 
> we could deserialize records:
> {code}
> FileInputStream fIn = new FileInputStream("data.log");
> // we know the record was written in CSV format
> CsvRecordInput csvIn = new CsvRecordInput(fIn);
> ...
> // we know the type info is written in the beginning. read it. 
> RecordTypeInfo typeInfoFilter = new RecordTypeInfo();
> // deserialize it
> typeInfoFilter.deserialize(csvIn);
> // let MyLogRecord know what to expect
> MyLogRecord.setRTIFilter(typeInfoFilter);
> // deserialize each record
> while (there is data in file) {
>   MyLogRecord log = new MyLogRecord();
>   log.read(csvIn);
>   ...
> }
> {code}
> The filter is optional. If not provided, the deserializer expects data to be 
> in the same format as it would serialize. (Note that a filter can also be 
> provided for serializing, forcing the serializer to write information in the 
> format of the filter, but there is no use case for this functionality yet). 
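> For illustration, here is what reading without a filter might look like (a 
> sketch under the proposed API; it assumes the data was written by the same 
> version of MyLogRecord, so no filter setup is needed):
> {code}
> FileInputStream fIn = new FileInputStream("data.log");
> CsvRecordInput csvIn = new CsvRecordInput(fIn);
> // no setRTIFilter() call: the deserializer assumes the data is laid out
> // exactly as this version of MyLogRecord would serialize it
> MyLogRecord log = new MyLogRecord();
> log.read(csvIn);
> {code}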
> What goes in the type information for a record? The type information for each 
> field in a Record is made up of:
>    1. a unique field ID, which is the field name. 
>    2. a type ID, which denotes the type of the field (int, string, map, etc). 
> The type information for a composite type contains type information for each 
> of its fields. This approach is somewhat similar to the one taken by 
> [Facebook's Thrift|http://developers.facebook.com/thrift/], as well as by 
> Google's Sawzall. The main difference is that we use field names as the field 
> ID, whereas Thrift and Sawzall use user-defined field numbers. While field 
> names take more space, they have the big advantage that existing DDLs are 
> supported without any change. 
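> As a sketch, the type information could be modeled roughly as follows (the 
> names and layout here are illustrative, not a committed API):
> {code}
> // Illustrative sketch of what a record's type information might hold.
> class FieldTypeInfo {
>   String fieldID;                    // the unique field ID, i.e. the field name
>   int typeID;                        // denotes the type: int, string, map, etc.
>   List<FieldTypeInfo> nestedFields;  // non-empty only for composite types
> }
> class RecordTypeInfo {
>   String recordName;
>   List<FieldTypeInfo> fields;        // one entry per field, in order
>   // plus the serialize()/deserialize() methods described above
> }
> {code}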
> When deserializing, a Record looks at the filter and compares it with its own 
> set of {field name, field type} tuples. If there is a field in the data that 
> it doesn't know about, it skips it (it knows how many bytes to skip, based 
> on the filter). If the deserialized data does not contain some field values, 
> the Record gives them default values. Additionally, we could allow users to 
> optionally specify default values in the DDL. The location of a field in a 
> structure does not matter. This lets us support reordering of fields. Note 
> that there is no change required to the DDL syntax, and very minimal changes 
> to client code (clients just need to read/write type information, in addition 
> to record data). 
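> To illustrate reordering: a reader generated from the following definition 
> (illustrative) would still consume records written with the original 
> MyLogRecord layout, since fields are matched by name and type, not by 
> position:
> {code}
> class MyLogRecord {
>   long timestamp;   // moved up; still matched by {name, type}
>   ustring msg;
>   int severity;
> }
> {code}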
> This scheme gives us an additional powerful feature: we can build a generic 
> serializer/deserializer, so that users can read all kinds of data without 
> having access to the original DDL or the original stubs. As long as you know 
> where the type information of a record is serialized, you can read all kinds 
> of data. One can also build a simple UI that displays the structure of data 
> serialized in any generic file. This is very useful for handling data across 
> lots of versions. 
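> As a sketch, such a generic reader might look something like this (the 
> type-info walking methods shown here are illustrative, not an existing API):
> {code}
> // Read the serialized type information, then use it to drive
> // deserialization of records whose DDL and generated stubs we have
> // never seen.
> CsvRecordInput csvIn = new CsvRecordInput(new FileInputStream("data.log"));
> RecordTypeInfo typeInfo = new RecordTypeInfo();
> typeInfo.deserialize(csvIn);
> while (there is data in file) {
>   for (FieldTypeInfo field : typeInfo.getFieldTypeInfos()) {
>     // hypothetical helper: reads a value of the field's type, formats it
>     System.out.println(field.getFieldID() + " = " + readAsString(field, csvIn));
>   }
> }
> {code}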

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
