[ https://issues.apache.org/jira/browse/HADOOP-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537069 ]
Vivek Ratan commented on HADOOP-1883:
-------------------------------------

Thanks for the feedback, Milind. This was the first patch, a prototype, and is weak on documentation. That will be fixed in the next patch. I will also add test cases and remove code that has been commented out. My feedback is below. I've spent a fair bit of time thinking about the class design for the type-information functionality, both in terms of simplicity/extensibility and in terms of performance, so that we don't create extra objects or spend too much time dealing with type information. Let me address your concerns individually.

>> TypeID should extend Record, since it needs to be serialized and
>> deserialized.

TypeID is not a Record. Yes, it can write itself to a stream, but that's the only similarity. A Record implements WritableComparable and Cloneable, neither of which is needed for TypeID. TypeID and its children are very different from Record: they represent the ID of a Jute type.

>> Please consider making RIOType a private static inner class.

RIOType is referred to outside of TypeID (in RecordTypeInfo, for example). It cannot be private.

>> TypeInfo should extend Record. (See comment on TypeID above.)

As I mentioned for TypeID, TypeInfo is also not a Record and shouldn't extend it.

>> RecordTypeInfo.java: Should extend TypeInfo, once TypeInfo extends Record.
>> That would move the hairy serialization and deserialization code (the
>> genericRead* methods) to TypeInfo, and TypeID where they belong.

RecordTypeInfo actually is a Record, since you can serialize, deserialize, and compare it. It is, however, not a TypeInfo. RecordTypeInfo represents the type information for a record; it is the class that a Jute user sees. TypeInfo and TypeID are internal classes used by RecordTypeInfo to represent type information (though they do show up in the generated code). Think of it this way: the type information for a record, represented by RecordTypeInfo, is a collection of type information for each of its fields. Each field's type information, represented by a TypeInfo object, comprises the field name (a string) and a field type ID (a TypeID object). So TypeInfo and TypeID are helper classes used by RecordTypeInfo.

The genericRead* methods belong to RecordTypeInfo, as they are used during deserialization. I'm not sure why you think they're 'hairy'. They're required because there is a hierarchy of TypeID classes, and RecordTypeInfo does not know which one to create until it reads the type of a field during deserialization. Having read the type, it can itself create the right TypeID object - this is clean, simple, and minimizes performance impact. In fact, this same code will be used for generic deserialization. Any particular reason you think the code is hairy or belongs elsewhere?
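To make these relationships concrete, here's a simplified sketch of the three classes (member names here are illustrative, not the actual fields in the patch; the real classes carry the serialization logic as well):

{code}
import java.util.ArrayList;
import java.util.List;

// Simplified sketch only: member names are illustrative.
class TypeID {
  final byte typeVal;             // a RIOType constant (INT, STRING, ...)
  TypeID(byte typeVal) { this.typeVal = typeVal; }
}

class TypeInfo {
  final String fieldID;           // the field name doubles as the field ID
  final TypeID typeID;            // the type of the field
  TypeInfo(String fieldID, TypeID typeID) {
    this.fieldID = fieldID;
    this.typeID = typeID;
  }
}

// In the real code this extends Record, since it can be serialized,
// deserialized, and compared.
class RecordTypeInfo {
  private final List<TypeInfo> typeInfos = new ArrayList<TypeInfo>();
  void addField(String fieldName, TypeID tid) {
    typeInfos.add(new TypeInfo(fieldName, tid));
  }
}
{code}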
>> The skip method really belongs to TypeID. So, we can get rid of Utils.java.

Not really. skip() is a general-purpose method and has nothing to do with a TypeID. It's used during deserialization by RecordTypeInfo. I placed it in a separate class because it really is a utility method, and it can be used in other places (generic deserialization, for example). I realize that Utils only has this one method for now, but that's where it belongs.

>> Please consider making these constants into static methods that take a
>> string as parameter, and return another string (prepended by that constant).

Why? These really are constants. Turning them into method calls will cost us a method invocation each time they're used, which seems completely unnecessary. What do I gain by using static methods? What I've done is a pretty standard way of defining constants in Java.

>> All types have an additional inner class now that extends from generic
>> CppType, and adds one method getTypeIDObjectString() that returns the TypeID
>> string. These classes can be eliminated by providing this method in the base
>> class CppType. CppType's constructor should be modified to take as parameter
>> the TypeIDString. Is there any other functionality that these inner
>> type-specific classes are providing?

I've defined inner classes for the C++ types for the following reasons:
- They mirror more closely the inner classes for the Java types.
- Even though they really have only one method for now, this is the right way to do OO. Rather than pass a string to the constructor, each class overrides the getTypeIDObjectString() method, just as with the Java classes. This makes change much easier: tomorrow, if the method does other things besides return a string (which it actually did, before I optimized the code out), this design accommodates that easily. I don't like passing subclass-specific data to the constructor just to save a method; it's not good design. I realize we have additional classes, but so what? There's no performance impact, and the code, IMO, is much cleaner, more symmetric, and more OO.
- It's quite possible that we will add other methods to these classes in the future. The Java classes already have other methods. I didn't look at redesigning the C++ classes, but I would expect the C++ inner classes to mirror the Java classes quite closely. This way, we don't have two different algorithms or control flows, one for Java and one for C++.

>> Method getID() should be modified to prepend RIO_PREFIX, rather than
>> prepending RIO_PREFIX everywhere to getId()'s return value.

I'm ambivalent on this one. getID() is supposed to append a level number to a string. The string is prefixed by RIO_PREFIX in some cases (where we can clash with user-defined variables; see HADOOP-1994), but there are other cases where the string passed to getID() is not prefixed by RIO_PREFIX. So we can't add RIO_PREFIX in getID(). Well, we could, but the generated code would look a little uglier (RIO_PREFIX, while required, does not make for pretty variable names). Right now, the Jute code looks a little uglier with RIO_PREFIX being explicitly prefixed in some cases, but I'd prefer that over slightly uglier generated code, which the user is more likely to look at. This is a pretty small matter, so if you have stronger feelings about it than I do, I can make the change.

>> genCompareBytes() and genSlurpBytes() code is not modified to use
>> RIO_PREFIXES.

This is because this code does not use the DDL-defined fields, so we don't need RIO_PREFIX (again, see HADOOP-1994).

>> Generate another private method called deserializeWithFilter. That will make
>> generated code more readable.

deserialize() really is 'deserialize with filter'. It checks whether a filter is present; if not, it calls deserializeWithoutFilter(), otherwise it continues. It seems unnecessary to add another method called deserializeWithFilter(), with deserialize() doing nothing but checking for the presence of a filter and calling either deserializeWithFilter() or deserializeWithoutFilter(). I don't think this would enhance readability much.
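To illustrate the point, the generated deserialize() is structured roughly like this (a hand-written sketch, not the literal generated code; names other than deserialize() and deserializeWithoutFilter() are placeholders):

{code}
import java.io.IOException;

// Sketch of the generated control flow; the real method operates on the
// actual RecordInput and RecordTypeInfo types.
class GeneratedRecordSketch {
  private static Object rtiFilter;     // set via setRTIFilter(); null if absent

  public void deserialize(Object rin, String tag) throws IOException {
    if (rtiFilter == null) {
      // no filter: the data is laid out exactly as this version serializes it
      deserializeWithoutFilter(rin, tag);
      return;
    }
    // a filter is present: continue here, walking the filter's fields,
    // reading the ones this version knows about and skipping the rest
  }

  private void deserializeWithoutFilter(Object rin, String tag)
      throws IOException {
    // the pre-versioning deserialize() body: read every field in DDL order
  }
}
{code}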
>> In C++ generated code, deserializeWithoutFilter need not be virtual

deserializeWithoutFilter(), which really is the old deserialize(), is virtual because deserialize() is virtual. I think either both should be virtual or neither should be. Since deserialize() was virtual in the old code, I made deserializeWithoutFilter() virtual as well.

> Adding versioning to Record I/O
> -------------------------------
>
> Key: HADOOP-1883
> URL: https://issues.apache.org/jira/browse/HADOOP-1883
> Project: Hadoop
> Issue Type: New Feature
> Reporter: Vivek Ratan
> Assignee: Vivek Ratan
> Attachments: 1883_patch01, 1883_patch02, example.zip, example.zip
>
> There is a need to add versioning support to Record I/O. Users frequently update DDL files, usually by adding or removing fields, but do not want to change the name of the data structure. They would like older and newer deserializers to read as much data as possible. For example, suppose Record I/O is used to serialize/deserialize log records, each of which contains a message and a timestamp. An initial data definition could be as follows:
> {code}
> class MyLogRecord {
>   ustring msg;
>   long timestamp;
> }
> {code}
> Record I/O creates a class, _MyLogRecord_, which represents a log record and can serialize/deserialize itself. Now, suppose newer log records additionally contain a severity level. A user would want to update the definition for a log record but use the same class name. The new definition would be:
> {code}
> class MyLogRecord {
>   ustring msg;
>   long timestamp;
>   int severity;
> }
> {code}
> Users would want a new deserializer to read old log records (and perhaps use a default value for the severity field), and an old deserializer to read newer log records (and skip the severity field).
> This requires some concept of versioning in Record I/O, or rather, the additional ability to read/write the type information of a record. The following is a proposal to do this.
> Every Record I/O record will have type information which represents how the record is structured (what fields it has, what their types are, etc.). This type information, represented by the class _RecordTypeInfo_, is itself serializable/deserializable. Every Record supports a method _getRecordTypeInfo()_, which returns a _RecordTypeInfo_ object. Users are expected to serialize this type information (by calling _RecordTypeInfo.serialize()_) in an appropriate fashion (in a separate file, for example, or at the beginning of a file). Using the same DDL as above, here's how we could serialize log records:
> {code}
> FileOutputStream fOut = new FileOutputStream("data.log");
> CsvRecordOutput csvOut = new CsvRecordOutput(fOut);
> ...
> // get the type information for MyLogRecord
> RecordTypeInfo typeInfo = MyLogRecord.getRecordTypeInfo();
> // ask it to write itself out
> typeInfo.serialize(csvOut);
> ...
> // now, serialize a bunch of records
> while (...) {
>   MyLogRecord log = new MyLogRecord();
>   // fill up the MyLogRecord object
>   ...
>   // serialize
>   log.serialize(csvOut);
> }
> {code}
> In this example, the type information of a Record is serialized first, followed by the contents of various records, all into the same file.
> Every Record also supports a method that allows a user to set a filter for deserializing. A method _setRTIFilter()_ takes a _RecordTypeInfo_ object as a parameter. This filter represents the type information of the data that is being deserialized.
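> To make the resulting API concrete, the versioning-related members on a generated Record would look roughly like this (an illustrative sketch, not generated code; the rtiFilter field and the string-taking constructor are assumptions):
> {code}
> public class MyLogRecord /* extends Record */ {
>   // hypothetical storage for the filter set by setRTIFilter()
>   private static RecordTypeInfo rtiFilter;
>
>   // returns the type information describing this version of the record
>   public static RecordTypeInfo getRecordTypeInfo() {
>     return new RecordTypeInfo("MyLogRecord");  // constructor form assumed
>   }
>
>   // records the type information of the data about to be deserialized
>   public static void setRTIFilter(RecordTypeInfo rti) {
>     rtiFilter = rti;
>   }
> }
> {code}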
> When deserializing, the Record uses this filter (if one is set) to figure out what to read. Continuing with our example, here's how we could deserialize records:
> {code}
> FileInputStream fIn = new FileInputStream("data.log");
> // we know the record was written in CSV format
> CsvRecordInput csvIn = new CsvRecordInput(fIn);
> ...
> // we know the type info is written at the beginning. read it.
> RecordTypeInfo typeInfoFilter = new RecordTypeInfo();
> // deserialize it
> typeInfoFilter.deserialize(csvIn);
> // let MyLogRecord know what to expect
> MyLogRecord.setRTIFilter(typeInfoFilter);
> // deserialize each record
> while (there is data in file) {
>   MyLogRecord log = new MyLogRecord();
>   log.read(csvIn);
>   ...
> }
> {code}
> The filter is optional. If not provided, the deserializer expects the data to be in the same format as it would serialize. (Note that a filter can also be provided for serializing, forcing the serializer to write information in the format of the filter, but there is no use case for this functionality yet.)
> What goes into the type information for a record? The type information for each field in a Record is made up of:
> 1. a unique field ID, which is the field name.
> 2. a type ID, which denotes the type of the field (int, string, map, etc.).
> The type information for a composite type contains the type information for each of its fields. This approach is somewhat similar to the one taken by [Facebook's Thrift|http://developers.facebook.com/thrift/], as well as by Google's Sawzall. The main difference is that we use field names as the field IDs, whereas Thrift and Sawzall use user-defined field numbers. While field names take more space, they have the big advantage that no change is needed to support existing DDLs.
> When deserializing, a Record looks at the filter and compares it with its own set of {field name, field type} tuples. If there is a field in the data that it doesn't know about, it skips it (it knows how many bytes to skip, based on the filter). If the deserialized data does not contain some field values, the Record gives them default values. Additionally, we could allow users to optionally specify default values in the DDL. The location of a field in a structure does not matter, which lets us support reordering of fields. Note that there is no change required to the DDL syntax, and only very minimal changes to client code (clients just need to read/write type information in addition to record data).
> This scheme gives us an additional powerful feature: we can build a generic serializer/deserializer, so that users can read all kinds of data without having access to the original DDL or the original stubs. As long as you know where the type information of a record is serialized, you can read all kinds of data. One can also build a simple UI that displays the structure of data serialized in any generic file. This is very useful for handling data across lots of versions.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.