[ https://issues.apache.org/jira/browse/HADOOP-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537069 ]
Vivek Ratan commented on HADOOP-1883:
-------------------------------------

Thanks for the feedback, Milind. This was the first patch, a prototype, and is weak on documentation. That will be fixed in the next patch. I will also add test cases and remove code that has been commented out. My feedback is below. I've spent a fair bit of time thinking about the class design for the type-information functionality, both in terms of simplicity/extensibility and in terms of performance, so that we don't create extra objects or spend too much time dealing with type information. Let me address your concerns individually.

>> TypeID should extend Record, since it needs to be serialized and
>> deserialized.

TypeID is not a Record. Yes, it can write itself to a stream, but that's the only similarity. A Record implements WritableComparable and Cloneable, neither of which is needed for TypeID. TypeID and its children are very different from Record: they represent the ID of a Jute type.

>> Please consider making RIOType a private static inner class.

RIOType is referred to outside of TypeID (in RecordTypeInfo, for example). It cannot be private.

>> TypeInfo should extend Record. (See comment on TypeID above.)

As I mentioned for TypeID, TypeInfo is also not a Record and shouldn't extend it.

>> RecordTypeInfo.java: Should extend TypeInfo, once TypeInfo extends Record.
>> That would move the hairy serialization and deserialization code (the
>> genericRead* methods) to TypeInfo, and TypeID where they belong.

RecordTypeInfo actually is a Record, since you can serialize, deserialize, and compare it. It is, however, not a TypeInfo. RecordTypeInfo represents the type information for a record; it is the class that a Jute user sees. TypeInfo and TypeID are internal classes used by RecordTypeInfo to represent type information (though they do show up in the generated code). Think of it this way: the type information for a record, represented by RecordTypeInfo, is a collection of type information for each of its fields. Each field's type information, represented by a TypeInfo object, comprises the field name (a string) and a field type ID (a TypeID object). So TypeInfo and TypeID are helper classes used by RecordTypeInfo.

The genericRead* methods belong to RecordTypeInfo, as they are used during deserialization. I'm not sure why you think they're 'hairy'. They're required because there is a hierarchy of TypeID classes, and RecordTypeInfo does not know which one to create until it reads the type of a field during deserialization. Having read the type, it can itself create the right TypeID object - this is clean, simple, and minimizes performance impact. In fact, this same code will be used for generic deserialization. Any particular reason you think the code is hairy or belongs elsewhere?
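To make these relationships concrete, here's a simplified sketch of the three classes (member names here are illustrative, not the actual fields in the patch; the real classes carry the serialization logic as well):

{code}
import java.util.ArrayList;
import java.util.List;

// Simplified sketch only: member names are illustrative.
class TypeID {
  final byte typeVal;             // a RIOType constant (INT, STRING, ...)
  TypeID(byte typeVal) { this.typeVal = typeVal; }
}

class TypeInfo {
  final String fieldID;           // the field name doubles as the field ID
  final TypeID typeID;            // the type of the field
  TypeInfo(String fieldID, TypeID typeID) {
    this.fieldID = fieldID;
    this.typeID = typeID;
  }
}

// In the real code this extends Record, since it can be serialized,
// deserialized, and compared.
class RecordTypeInfo {
  private final List<TypeInfo> typeInfos = new ArrayList<TypeInfo>();
  void addField(String fieldName, TypeID tid) {
    typeInfos.add(new TypeInfo(fieldName, tid));
  }
}
{code}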
>> The skip method really belongs to TypeID. So, we can get rid of Utils.java.

Not really. skip() is a general-purpose method and has nothing to do with a TypeID. It's used during deserialization by RecordTypeInfo. I placed it in a separate class because it really is a utility method, and it can be used in other places (generic deserialization, for example). I realize that Utils only has this one method for now, but that's where it belongs.

>> Please consider making these constants into static methods that take a
>> string as parameter, and return another string (prepended by that constant).

Why? These really are constants. Turning them into method calls will cost us a method invocation each time they're used, which seems completely unnecessary. What do I gain by using static methods? What I've done is a pretty standard way of defining constants in Java.

>> All types have an additional inner class now that extends from generic
>> CppType, and adds one method getTypeIDObjectString() that returns the TypeID
>> string. These classes can be eliminated by providing this method in the base
>> class CppType. CppType's constructor should be modified to take as parameter
>> the TypeIDString. Is there any other functionality that these inner
>> type-specific classes are providing?

I've defined inner classes for the C++ types for the following reasons:
- They mirror more closely the inner classes for the Java types.
- Even though they really have only one method for now, this is the right way to do OO. Rather than pass a string to the constructor, each class overrides the getTypeIDObjectString() method, just as with the Java classes. This makes change much easier: tomorrow, if the method does other things besides return a string (which it actually did, before I optimized the code out), this design accommodates that easily. I don't like passing subclass-specific data to the constructor just to save a method; it's not good design. I realize we have additional classes, but so what? There's no performance impact, and the code, IMO, is much cleaner, more symmetric, and more OO.
- It's quite possible that we will add other methods to these classes in the future. The Java classes already have other methods. I didn't look at redesigning the C++ classes, but I would expect the C++ inner classes to mirror the Java classes quite closely. This way, we don't have two different algorithms or control flows, one for Java and one for C++.

>> Method getID() should be modified to prepend RIO_PREFIX, rather than
>> prepending RIO_PREFIX everywhere to getId()'s return value.

I'm ambivalent on this one. getID() is supposed to append a level number to a string. The string is prefixed by RIO_PREFIX in some cases (where we can clash with user-defined variables; see HADOOP-1994), but there are other cases where the string passed to getID() is not prefixed by RIO_PREFIX. So we can't add RIO_PREFIX in getID(). Well, we could, but the generated code would look a little uglier (RIO_PREFIX, while required, does not make for pretty variable names). Right now, the Jute code looks a little uglier with RIO_PREFIX being explicitly prefixed in some cases, but I'd prefer that over slightly uglier generated code, which the user is more likely to look at. This is a pretty small matter, so if you have stronger feelings about it than I do, I can make the change.

>> genCompareBytes() and genSlurpBytes() code is not modified to use
>> RIO_PREFIXES.

This is because this code does not use the DDL-defined fields, so we don't need RIO_PREFIX (again, see HADOOP-1994).

>> Generate another private method called deserializeWithFilter. That will make
>> generated code more readable.

deserialize() really is 'deserialize with filter'. It checks whether a filter is present; if not, it calls deserializeWithoutFilter(), otherwise it continues. It seems unnecessary to add another method called deserializeWithFilter(), with deserialize() doing nothing but checking for the presence of a filter and calling either deserializeWithFilter() or deserializeWithoutFilter(). I don't think this would enhance readability much.
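To illustrate the point, the generated deserialize() is structured roughly like this (a hand-written sketch, not the literal generated code; names other than deserialize() and deserializeWithoutFilter() are placeholders):

{code}
import java.io.IOException;

// Sketch of the generated control flow; the real method operates on the
// actual RecordInput and RecordTypeInfo types.
class GeneratedRecordSketch {
  private static Object rtiFilter;     // set via setRTIFilter(); null if absent

  public void deserialize(Object rin, String tag) throws IOException {
    if (rtiFilter == null) {
      // no filter: the data is laid out exactly as this version serializes it
      deserializeWithoutFilter(rin, tag);
      return;
    }
    // a filter is present: continue here, walking the filter's fields,
    // reading the ones this version knows about and skipping the rest
  }

  private void deserializeWithoutFilter(Object rin, String tag)
      throws IOException {
    // the pre-versioning deserialize() body: read every field in DDL order
  }
}
{code}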
>> In C++ generated code, deserializeWithoutFilter need not be virtual

deserializeWithoutFilter(), which really is the old deserialize(), is virtual because deserialize() is virtual. I think either both should be virtual or neither should be. Since deserialize() was virtual in the old code, I made deserializeWithoutFilter() virtual as well.

> Adding versioning to Record I/O
> -------------------------------
>
> Key: HADOOP-1883
> URL: https://issues.apache.org/jira/browse/HADOOP-1883
> Project: Hadoop
> Issue Type: New Feature
> Reporter: Vivek Ratan
> Assignee: Vivek Ratan
> Attachments: 1883_patch01, 1883_patch02, example.zip, example.zip
>
> There is a need to add versioning support to Record I/O. Users frequently update DDL files, usually by adding or removing fields, but do not want to change the name of the data structure. They would like older and newer deserializers to read as much data as possible. For example, suppose Record I/O is used to serialize/deserialize log records, each of which contains a message and a timestamp. An initial data definition could be as follows:
> {code}
> class MyLogRecord {
>   ustring msg;
>   long timestamp;
> }
> {code}
> Record I/O creates a class, _MyLogRecord_, which represents a log record and can serialize/deserialize itself. Now, suppose newer log records additionally contain a severity level. A user would want to update the definition for a log record but use the same class name. The new definition would be:
> {code}
> class MyLogRecord {
>   ustring msg;
>   long timestamp;
>   int severity;
> }
> {code}
> Users would want a new deserializer to read old log records (and perhaps use a default value for the severity field), and an old deserializer to read newer log records (and skip the severity field).
> This requires some concept of versioning in Record I/O, or rather, the additional ability to read/write the type information of a record. The following is a proposal to do this.
> Every Record I/O record will have type information which represents how the record is structured (what fields it has, what their types are, etc.). This type information, represented by the class _RecordTypeInfo_, is itself serializable/deserializable. Every Record supports a method _getRecordTypeInfo()_, which returns a _RecordTypeInfo_ object. Users are expected to serialize this type information (by calling _RecordTypeInfo.serialize()_) in an appropriate fashion (in a separate file, for example, or at the beginning of a file). Using the same DDL as above, here's how we could serialize log records:
> {code}
> FileOutputStream fOut = new FileOutputStream("data.log");
> CsvRecordOutput csvOut = new CsvRecordOutput(fOut);
> ...
> // get the type information for MyLogRecord
> RecordTypeInfo typeInfo = MyLogRecord.getRecordTypeInfo();
> // ask it to write itself out
> typeInfo.serialize(csvOut);
> ...
> // now, serialize a bunch of records
> while (...) {
>   MyLogRecord log = new MyLogRecord();
>   // fill up the MyLogRecord object
>   ...
>   // serialize
>   log.serialize(csvOut);
> }
> {code}
> In this example, the type information of a Record is serialized first, followed by the contents of various records, all into the same file.
> Every Record also supports a method that allows a user to set a filter for deserializing. A method _setRTIFilter()_ takes a _RecordTypeInfo_ object as a parameter. This filter represents the type information of the data that is being deserialized.
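> To make the resulting API concrete, the versioning-related members on a generated Record would look roughly like this (an illustrative sketch, not generated code; the rtiFilter field and the string-taking constructor are assumptions):
> {code}
> public class MyLogRecord /* extends Record */ {
>   // hypothetical storage for the filter set by setRTIFilter()
>   private static RecordTypeInfo rtiFilter;
>
>   // returns the type information describing this version of the record
>   public static RecordTypeInfo getRecordTypeInfo() {
>     return new RecordTypeInfo("MyLogRecord");  // constructor form assumed
>   }
>
>   // records the type information of the data about to be deserialized
>   public static void setRTIFilter(RecordTypeInfo rti) {
>     rtiFilter = rti;
>   }
> }
> {code}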
> When deserializing, the Record uses this filter (if one is set) to figure out what to read. Continuing with our example, here's how we could deserialize records:
> {code}
> FileInputStream fIn = new FileInputStream("data.log");
> // we know the record was written in CSV format
> CsvRecordInput csvIn = new CsvRecordInput(fIn);
> ...
> // we know the type info is written at the beginning. read it.
> RecordTypeInfo typeInfoFilter = new RecordTypeInfo();
> // deserialize it
> typeInfoFilter.deserialize(csvIn);
> // let MyLogRecord know what to expect
> MyLogRecord.setRTIFilter(typeInfoFilter);
> // deserialize each record
> while (there is data in file) {
>   MyLogRecord log = new MyLogRecord();
>   log.read(csvIn);
>   ...
> }
> {code}
> The filter is optional. If not provided, the deserializer expects the data to be in the same format as it would serialize. (Note that a filter can also be provided for serializing, forcing the serializer to write information in the format of the filter, but there is no use case for this functionality yet.)
> What goes into the type information for a record? The type information for each field in a Record is made up of:
> 1. a unique field ID, which is the field name.
> 2. a type ID, which denotes the type of the field (int, string, map, etc.).
> The type information for a composite type contains the type information for each of its fields. This approach is somewhat similar to the one taken by [Facebook's Thrift|http://developers.facebook.com/thrift/], as well as by Google's Sawzall. The main difference is that we use field names as the field IDs, whereas Thrift and Sawzall use user-defined field numbers. While field names take more space, they have the big advantage that no change is needed to support existing DDLs.
> When deserializing, a Record looks at the filter and compares it with its own set of {field name, field type} tuples. If there is a field in the data that it doesn't know about, it skips it (it knows how many bytes to skip, based on the filter). If the deserialized data does not contain some field values, the Record gives them default values. Additionally, we could allow users to optionally specify default values in the DDL. The location of a field in a structure does not matter, which lets us support reordering of fields. Note that there is no change required to the DDL syntax, and only very minimal changes to client code (clients just need to read/write type information in addition to record data).
> This scheme gives us an additional powerful feature: we can build a generic serializer/deserializer, so that users can read all kinds of data without having access to the original DDL or the original stubs. As long as you know where the type information of a record is serialized, you can read all kinds of data. One can also build a simple UI that displays the structure of data serialized in any generic file. This is very useful for handling data across lots of versions.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.