Dear Serialization-Boosters,

In the past weeks I have been too busy at work to contribute much to the serialization debate, but I have now found some time. It seems to me that the discussion is drifting too far into semantic debates, and I would like to refocus it by proposing to split the problem vertically and discuss a "bottom-up" approach to a serialization library: I start with the most basic serialization and add each higher level on top of the lower ones.

1) "Definition of serialization": I want a very broad definition. Serialization for me is the conversion of the content of an object into a sequential stream. (I am not talking about C++ I/O streams here. It does not matter whether this stream is text or binary, or holes punched into a tape; this is an archive-specific implementation detail.) Deserialization is the reverse process of converting a sequential stream into object contents. While in many applications the process has to be reversible to be useful, this is not needed in all cases (e.g. output for debugging purposes or output to be read by another program). I hope that we can agree so far.

In the following I will use the term serialization to mean both serialization and deserialization but will focus the examples on serialization to keep the text shorter.

2) "Serialization engine": Next I propose that we agree on serialization of the basic data types: char, short, int, long, long long, float, double, long double and their signed/unsigned variants where appropriate. The abstract archive class should provide for the serialization of these basic data types and also contain optimized functions to serialize contiguous arrays of these types (e.g. a string, or an array of basic data types). The concrete archive class provides the actual serialization of these types into the archive format (which could be text, native binary, XDR, or whatever). As I have learned from the discussion here, the main issue seems to be the optimized functions for arrays of basic data types, which I view as essential; otherwise I don't believe there is any controversy so far. The higher-level functionality of the archive classes described below will be built on top of this basic functionality and will write to the stream only by writing these basic data types. I therefore propose to separate the serialization of basic data types out into a "serialization engine". That way, the format specificity (text vs. native binary vs. XDR, ...) is encapsulated in the serialization engine. If we agree on this, then we could start by defining an interface for the serialization engine.
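
To make the idea concrete, here is a minimal sketch of what two such engines could look like. This is only an illustration under my own assumptions: the names text_engine, binary_engine, write and write_array are hypothetical and are not taken from Robert's library or from any agreed interface.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <type_traits>

// Hypothetical text engine: every basic value becomes whitespace-separated text.
// Note that char types are written as characters, not as numbers.
class text_engine {
public:
    // Write one value of a basic (arithmetic) type.
    template <class T>
    void write(const T& value) {
        static_assert(std::is_arithmetic<T>::value, "engine handles basic types only");
        out_ << value << ' ';
    }

    // Write a contiguous array; the text format has no faster path than a loop.
    template <class T>
    void write_array(const T* data, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) write(data[i]);
    }

    std::string str() const { return out_.str(); }

private:
    std::ostringstream out_;
};

// Hypothetical native-binary engine: an array is copied as one block of bytes,
// which is the optimization for contiguous arrays argued for above.
class binary_engine {
public:
    template <class T>
    void write(const T& value) { write_array(&value, 1); }

    template <class T>
    void write_array(const T* data, std::size_t n) {
        static_assert(std::is_arithmetic<T>::value, "engine handles basic types only");
        bytes_.append(reinterpret_cast<const char*>(data), n * sizeof(T));
    }

    const std::string& bytes() const { return bytes_; }

private:
    std::string bytes_;
};
```

Both engines expose the same two operations, so anything built on top of them does not need to know which format is in use.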

3) "Archive preamble": Next, there is the question of a preamble of the archive. Here we need flexibility to enable compatibility with formats defined by other applications and with legacy formats. I do not view the standardization of a preamble as a task for Boost. Rather, the preamble should be archive-format specific, and Boost should just provide the framework for many archives (as well as several useful standard archive formats). Again, I believe that we have no disagreements here, or am I mistaken?

4) "Serialization of UDTs (user-defined types)" is the next level up. Just overloading operator<< for a UDT, as I did in my 8-year-old serialization library, will not allow the advanced functionality provided by Robert's library. I therefore propose to follow Robert's idea of a serialization<T> template, or to implement similar functionality using free functions.
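
As a rough sketch of the serialization<T> route: the archive handles basic types itself and forwards every other type to a user-provided specialization. The member name save, the text_archive class and the point type are my own placeholders for illustration, not the interface of Robert's library.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Primary template is left undefined on purpose: a UDT must provide a
// specialization (hypothetical interface, for illustration only).
template <class T> struct serialization;

// Minimal stand-in archive that writes whitespace-separated text.
class text_archive {
public:
    // Basic types are written directly (only int and double in this sketch).
    text_archive& operator<<(int v)    { out_ << v << ' '; return *this; }
    text_archive& operator<<(double v) { out_ << v << ' '; return *this; }

    // Every other type is routed through serialization<T>.
    template <class T>
    text_archive& operator<<(const T& v) {
        serialization<T>::save(*this, v);
        return *this;
    }

    std::string str() const { return out_.str(); }

private:
    std::ostringstream out_;
};

// A user-defined type and its non-intrusive serialization.
struct point { int x; int y; };

template <>
struct serialization<point> {
    template <class Archive>
    static void save(Archive& ar, const point& p) { ar << p.x << p.y; }
};
```

Because the archive, and not an operator<< written by the user, is the single entry point for every UDT, it has a place to hook in versioning or pointer tracking later.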

5) "Versioning": The next level for me is versioning support. We have discussed versioning support on a per-archive and a per-class level. I would like to see both variants supported. Per-class versioning is more flexible, but has two disadvantages: i) it introduces overhead and ii) it writes extra information into the stream, which might make the output incompatible with some applications.
Regarding i: we have to write the version number for each UDT encountered, but want to write it only once per UDT. We thus have to keep track of which UDTs have been serialized so far, and whenever a new UDT is encountered, its version number must be written to the archive. This introduces overhead, especially if many small objects have to be serialized.
I see a two-pronged approach as the best solution:
a) per-archive versioning, per-class versioning and no versioning should all be supported, for compatibility with other formats (issue ii above)
b) if per-class versioning is used, it should be possible to turn it off for some classes by a traits class - this gets rid of the overhead (issue i above) for every UDT that has versioning turned off.
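
A minimal sketch of point b), assuming two hypothetical traits, version<T> and is_versioned<T>, and a toy archive that writes text. The bookkeeping set is only touched for types that keep versioning on; for the others the call compiles down to nothing.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <type_traits>
#include <typeindex>
#include <typeinfo>

// Hypothetical traits: class version (default 0) and a per-class on/off switch.
template <class T> struct version : std::integral_constant<unsigned, 0> {};
template <class T> struct is_versioned : std::true_type {};

class versioned_archive {
public:
    template <class T>
    void save(const T& obj) {
        write_version<T>(is_versioned<T>());  // resolved at compile time
        obj.save(*this);
    }

    void write(int v) { out_ << v << ' '; }
    std::string str() const { return out_.str(); }

private:
    // Versioned types: write the version the first time the type is seen.
    template <class T>
    void write_version(std::true_type) {
        if (seen_.insert(std::type_index(typeid(T))).second)
            out_ << 'v' << version<T>::value << ' ';
    }

    // Unversioned types: no lookup, no output.
    template <class T>
    void write_version(std::false_type) {}

    std::set<std::type_index> seen_;
    std::ostringstream out_;
};

// Example types (placeholders for this sketch).
struct particle {
    int id;
    template <class A> void save(A& ar) const { ar.write(id); }
};
struct tiny {
    int n;
    template <class A> void save(A& ar) const { ar.write(n); }
};

template <> struct version<particle> : std::integral_constant<unsigned, 2> {};
template <> struct is_versioned<tiny> : std::false_type {};
```

Saving two particle objects and one tiny object writes the particle version once and nothing at all for tiny.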

6) "Advanced functionality": Robert's serialization library includes further functionality, such as the serialization of pointers and of polymorphic types. Here I want to focus on the serialization of pointers. I have not checked the implementation of Robert's library in detail, so please correct me if I have this wrong. Serialization of pointers requires the conversion of a pointer to an integer. When serializing objects, the archive thus has to keep track of the addresses of objects, in order to later convert pointers into numbers. This again introduces overhead. Robert addresses this partially in his library by showing how to bypass this system for a UDT. His approach, however, requires that if I want to bypass the pointer serialization mechanism for a type T, I have to re-implement the serialization of all standard containers of T, such as std::vector<T>, std::list<T>, std::stack<T>, ... My proposal, which I have mentioned before, is to just add another traits type that specifies whether the pointer serialization scheme can be bypassed for a type T (like versioning above), so that a faster, optimized serialization is used.
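
The following sketch shows how I imagine such a trait working. It is not how Robert's library does it; is_tracked<T>, tracking_archive and the "obj"/"ref" markers are all hypothetical and only serve to show that the address table is bypassed for types that opt out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <type_traits>

// Hypothetical trait: should objects of type T go through address tracking?
template <class T> struct is_tracked : std::true_type {};

class tracking_archive {
public:
    template <class T>
    void save_pointer(const T* p) { save_impl(p, is_tracked<T>()); }

    void write(int v) { out_ << v << ' '; }
    std::string str() const { return out_.str(); }
    std::size_t tracked_count() const { return ids_.size(); }

private:
    // Tracked types: map the address to an integer id; an address that was
    // already saved is written as a back-reference only.
    template <class T>
    void save_impl(const T* p, std::true_type) {
        std::map<const void*, std::size_t>::iterator it = ids_.find(p);
        if (it != ids_.end()) {
            out_ << "ref" << it->second << ' ';
            return;
        }
        std::size_t id = ids_.size();
        ids_[p] = id;
        out_ << "obj" << id << ' ';
        p->save(*this);
    }

    // Untracked types: no table lookup, the content is written every time.
    template <class T>
    void save_impl(const T* p, std::false_type) { p->save(*this); }

    std::map<const void*, std::size_t> ids_;
    std::ostringstream out_;
};

// Example types (placeholders for this sketch).
struct node {
    int value;
    template <class A> void save(A& ar) const { ar.write(value); }
};
struct sample {
    int value;
    template <class A> void save(A& ar) const { ar.write(value); }
};

template <> struct is_tracked<sample> : std::false_type {};
```

Generic container serialization could consult the same trait, so std::vector<T> and friends would not have to be re-implemented for each type that opts out.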

To summarize, I propose to split the serialization archive into a serialization engine, which does the serialization of the basic types, and an archive class, which takes the engine as a template parameter. The archive class can contain, as add-ons, versioning, pointer serialization, polymorphic types, etc. It is important for me to have:

a) flexible preambles to the archive
b) versioning support that can be none, per-archive or per-class, with per-class versioning selectively switched off by traits
c) the possibility to turn off advanced functionality, such as pointer serialization, selectively by traits classes, which should result in a no-overhead solution.
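
Put together, the overall structure could look roughly like the sketch below: an archive template that takes the engine, and here also a preamble policy, as template parameters. The names archive, no_preamble and tagged_preamble are hypothetical, and so are the two numbers written by tagged_preamble (a made-up format id and archive version).

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Minimal text engine, repeated here so that the sketch is self-contained.
class text_engine {
public:
    template <class T>
    void write(const T& value) { out_ << value << ' '; }

    template <class T>
    void write_array(const T* data, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) write(data[i]);
    }

    std::string str() const { return out_.str(); }

private:
    std::ostringstream out_;
};

// Preamble policies: the format decides what, if anything, comes first.
struct no_preamble {
    template <class Engine> static void write(Engine&) {}
};
struct tagged_preamble {
    template <class Engine> static void write(Engine& e) {
        e.write(42);  // hypothetical format id
        e.write(1);   // hypothetical per-archive version
    }
};

// The archive writes only through the engine, so it is format independent.
template <class Engine, class Preamble = no_preamble>
class archive {
public:
    archive() { Preamble::write(engine_); }

    archive& operator<<(int v)    { engine_.write(v); return *this; }
    archive& operator<<(double v) { engine_.write(v); return *this; }

    // Contiguous arrays use the engine's array function.
    // Assumes T is a basic type other than bool.
    template <class T>
    archive& operator<<(const std::vector<T>& v) {
        engine_.write(v.size());
        if (!v.empty()) engine_.write_array(&v[0], v.size());
        return *this;
    }

    const Engine& engine() const { return engine_; }

private:
    Engine engine_;
};
```

Swapping the engine changes the format, and swapping the preamble policy changes what comes first in the archive, without touching any user serialization code.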

I believe that Robert's library has most of what it takes to get to such a solution and am willing to help with implementing.

Matthias
