[boost] some thoughts about serialisation

Ares Lagae Sun, 12 Jan 2003 05:22:18 -0800

On the Yahoo groups, I followed some discussions about a possible Boost
serialization library. I followed them with great interest because I am also
working on a serialization library. For what it's worth (I'm not a Boost
developer), these are some thoughts about serialization:


1) Serialization is based on reflection (introspection, MOP, ... whatever), and
to implement serialization there has to be a reflection library first. Try to
decouple serialization from reflection as much as possible. Because
serialization needs reflection anyway and there are other uses for reflection,
it is better to stay more general and decouple the two (e.g. one could use the
reflection library without the serialization library, but not the other way
around).

2) Reflection requires knowing the properties of classes: the base classes, the
data members with their types and names, and the member functions with their
types and names.
For example, one could implement the classes Class, BaseClass<class, base>,
DataMember<class, type>(data member pointer, name) and MemberFunction<class,
return, args>(member function pointer, name).
For each class C, a method describe() could be implemented that makes a
Class<C> object and adds to it BaseClass<C, X> for all the base classes,
DataMember<C, type>(data member pointer, name) for all the data members, and
MemberFunction<C, return, args>(member function pointer, name) for all the
member functions.
Additionally, DataMember<class, type>(data member pointer, name) should provide
type get() and set(type value), and MemberFunction<C, return, args>(member
function pointer, name) should provide return invoke(instance, args).
The Class<C> object should support methods like getDataMember(name) and
getMemberFunction(name).
This describe system relies only on things the compiler knows at compile time,
and therefore one could imagine a compiler that generates the describe method
automatically. There is no need to add additional (non-static) class members,
because reflection is inherently about classes, and not about class methods.
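
A minimal sketch of the describe() idea, under some loud assumptions: the
names Point, add() and the exact signatures are hypothetical, get()/set() take
the instance explicitly, and only int data members are supported so that no
type erasure is needed. BaseClass and MemberFunction are left out.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical names throughout; this is not an existing library's API.

// Describes one data member of class C with type T.
template <class C, class T>
struct DataMember {
    T C::*ptr;         // pointer to the data member
    std::string name;  // name of the data member

    T get(const C& obj) const { return obj.*ptr; }
    void set(C& obj, const T& value) const { obj.*ptr = value; }
};

// Describes class C. To keep the sketch short, only int data members are
// stored; arbitrary member types would need some form of type erasure.
template <class C>
class Class {
public:
    void add(const DataMember<C, int>& m) { ints_.emplace(m.name, m); }

    // Throws std::out_of_range if no member with that name was described.
    const DataMember<C, int>& getDataMember(const std::string& name) const {
        return ints_.at(name);
    }

private:
    std::map<std::string, DataMember<C, int>> ints_;
};

struct Point {
    int x = 0;
    int y = 0;

    // Static: the description belongs to the class, not to an instance.
    static Class<Point> describe() {
        Class<Point> c;
        c.add({&Point::x, "x"});
        c.add({&Point::y, "y"});
        return c;
    }
};
```

With this, Point::describe().getDataMember("x").set(p, 7) changes p.x through
the description alone, which is all a serializer would need.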

3) Reflection is the difficult part; serialization is the easy part. Given a
reflection system, all the serialization method must do is query the list of
data members. If the type of a data member is primitive, the data member is
serialized directly; if it is not a primitive type, we query the data members
of that type in turn, repeating the process recursively.
Clearly, some care must be taken with pointers. When serializing a pointer, the
system should create a handle. The handle is constructed with an id (each
handle has a unique id), the address held by the pointer, and the data the
pointer points to. The handle is then serialized and remembered by the
serialization subsystem. The next time we encounter a pointer, we check whether
we already have a handle for it, and if we do, we only put the handle id in the
serialization stream. This ensures the data can be deserialized properly, and
without overhead.
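
The handle bookkeeping could look roughly like this. It is only a sketch: the
Archive and Node types and the "new"/"ref"/"null" text format are made up for
illustration, and the pointee is a fixed struct instead of something walked
through reflection.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical types; the stream format is invented for this sketch.
struct Node {
    int value;
    Node* next;
};

class Archive {
public:
    // Writes a pointer. The first time an address is seen, a new handle id
    // is assigned and the pointed-to data follows; after that, only the
    // handle id is written.
    void write(const Node* p) {
        if (p == nullptr) {
            out_ << "null ";
            return;
        }
        auto it = handles_.find(p);
        if (it != handles_.end()) {
            out_ << "ref " << it->second << ' ';
            return;
        }
        const int id = static_cast<int>(handles_.size()) + 1;
        handles_[p] = id;  // remember the handle before recursing
        out_ << "new " << id << ' ' << p->value << ' ';
        write(p->next);
    }

    std::string str() const { return out_.str(); }

private:
    std::map<const Node*, int> handles_;  // address -> handle id
    std::ostringstream out_;
};
```

Registering the handle before following p->next is what lets cyclic structures
terminate: the second visit to an address finds the handle and writes only the
id.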

4) There are some pitfalls involved with serialization:

- Some data members cannot be serialized in an easy way (e.g. pointers, because
they only have meaning on the local machine), and some data members cannot be
serialized at all, for example sockets or open files. These are so-called
transient data members.
Due to restrictions of C++, at first sight references cannot be serialized
either. A scheme to solve this is quite difficult. However, one knows that
references are typically initialized in a constructor, and that transient
members are typically created from non-transient data (e.g. an attribute
char * fileName would be non-transient, but the file pointer for it can be
created once the value of fileName is known). So we could imagine serializable
classes having a method onSerialized() to initialize transient data members.
I would have to see how Java handles transient data members.

- How do deserialization and constructors go together? I don't see "the
approach" to handle this, but this is what I think about it: when an object is
deserialized, one allocates memory for the object without invoking a
constructor (we do not want to call e.g. the default constructor and
initialize all attributes with default values, only to overwrite them next),
or with a constructor invocation containing no code. One then sets the data
members from the serialization stream, and finally calls a constructor
(knowing that the data members have been deserialized), e.g. to initialize
reference data members and transients.

- In this context it would be good practice not to pass transient data across
classes. E.g., instead of passing socket handles, one could make a Socket
wrapper class and pass that instead. The Socket class will know how to
initialize its transient data (for example, create a socket from a
deserialized hostname data member).
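
A small sketch of such a wrapper with an onSerialized() hook. Everything here
is an assumption for illustration: Connection, FakeSocket (standing in for a
real OS handle), and the save()/load() stand-ins for the archive.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a real socket handle.
struct FakeSocket {
    std::string peer;
    bool open = false;
};

class Connection {
public:
    Connection() = default;
    explicit Connection(const std::string& host) : hostname(host) {
        onSerialized();
    }

    // Called by the (imagined) serialization system once the non-transient
    // members have been restored; rebuilds the transient ones from them.
    void onSerialized() {
        socket.peer = hostname;
        socket.open = !hostname.empty();
    }

    std::string hostname;  // non-transient: written to the stream
    FakeSocket socket;     // transient: never written, always rebuilt
};

// Stand-ins for the archive: only hostname crosses the "stream".
std::string save(const Connection& c) { return c.hostname; }

Connection load(const std::string& stream) {
    Connection c;         // transient state is still closed here
    c.hostname = stream;  // restore the non-transient data
    c.onSerialized();     // rebuild the transient data from it
    return c;
}
```

Callers only ever hold a Connection, so the transient handle never leaks into
other classes and never needs to be serialized itself.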

- The ultimate goal of serialization is not writing classes to disk (this is
only one of the goals) but storing classes in a generic way. When one
serializes a class on a little-endian machine and sends it over the network to
a big-endian machine, it should be deserialized properly. This is one of the
most important properties of serialization and must be addressed. However,
when one knows the class will only be serialized to and from the local disk on
the same machine, one wants to avoid the overhead of platform-independent
writes. The serialization subsystem should support different external formats.
In my opinion, these are essential:
* local binary format, only for fast serialization, e.g. to local disk; this
format should not be used to e.g. transmit data to other hosts
* local text format, same notes as the previous one, but it lets the user
easily edit the data; text streams using the local locale could be used
* global binary format, XDR comes to mind
* global text format (one cannot use regular text streams here), XML comes to
mind
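
For the global binary format, a platform-independent write might be sketched
as follows. The helper names writeU32 and readU32 are hypothetical; the byte
order is fixed to big endian (the order XDR uses for integers), so the result
does not depend on the endianness of the host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for a fixed, host-independent byte order.

// Appends v to out, most significant byte first.
void writeU32(std::vector<unsigned char>& out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<unsigned char>((v >> shift) & 0xFFu));
}

// Reads four bytes starting at pos, most significant byte first.
// The caller must make sure in holds at least pos + 4 bytes.
std::uint32_t readU32(const std::vector<unsigned char>& in, std::size_t pos) {
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v = (v << 8) | in[pos + i];
    return v;
}
```

The shifts work on values, not on memory, which is why the same code produces
the same bytes on little-endian and big-endian machines. A local binary format
could instead copy the in-memory bytes directly and skip this cost.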

- Serialization typically deals with only one instance of an object. This
object, and all objects connected to it, are serialized (flattened). This
serialization process is atomic. The object structure (e.g. pointers in the
object) cannot (must not, when using e.g. threads) change during
serialization. By the same token, different serialization calls are not
atomic with respect to each other, and therefore must be completely
independent. Consecutive serializations of the same instance of an object to
the same archive are, and must be, totally independent. For example, suppose
object A contains a pointer B *BPtr, and we serialize A. (*BPtr) will be
serialized as well, in the same atomic "serialization unit". If we next
serialize A again, (*BPtr) will also be serialized again. Questions like "is
the object pointed to by A::BPtr serialized again the second time I serialize
A? Also when it points to the same object?" are irrelevant. The answer is
"yes" both times, because these are two different serialization units and not
one atomic serialization unit, and therefore the system has no control over
what happens in between.

just my 2 cents,
Ares Lagae
