On Sunday, November 17, 2002, at 07:22 AM, Robert Ramey wrote:

From: Matthias Troyer <[EMAIL PROTECTED]>

Imagine I use a platform where long is
64-bit, write it to the archive and then read it again on a platform
where long is 32-bit. This will cause major problems.
Suppose you have a number on the first platform that exceeds 32
significant bits.  What happens when the number is loaded onto
the second platform?  Are the high order bits truncated? How
do you address this problem now?  If none of your longs
are larger than 32 significant bits then there is no problem.
If some are, the 32-bit machine can't represent them.
This can't cause any problems you don't have already.
It can cause trouble, since in my portable codes I use int64_t or int32_t precisely to be portable. In order for the library to write numbers in binary consistently, we should also serialize them as 64-bit or 32-bit values. How do you do that when the bit size can vary from platform to platform? Do you check at runtime what the number of bits is and dispatch to the serialization for that number of bits?

No, it seems that in the binary file you just write out the sizes of the integer types and simply fail the loading if the bit sizes don't agree. Using fixed-bit-size integers instead would make your binary files much more portable.
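
To make the point concrete, here is a small illustration (not the library's actual interface; write_bytes and save are made-up names) of why fixed-width parameter types matter for a portable binary archive:

#include <cstddef>
#include <cstdio>
#include <boost/cstdint.hpp>

// Illustration only: 'out' stands in for the archive's low-level sink.
// Byte order is ignored here to keep the example short.
inline void write_bytes(std::FILE * out, const void * p, std::size_t n)
{
    std::fwrite(p, 1, n, out);
}

// With fixed-width parameter types every platform writes the same number
// of bytes for the same value, so the binary layout is reproducible:
inline void save(std::FILE * out, boost::int32_t x) { write_bytes(out, &x, 4); }

// With 'long' the on-disk size silently follows the platform: 4 bytes where
// long is 32 bits, 8 bytes where it is 64, so archives written on one
// platform cannot be read back on the other.
inline void save(std::FILE * out, long x) { write_bytes(out, &x, sizeof(long)); }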


It also prevents the use of archive formats that rely on fixed bit sizes (such as XDR or
any other platform-independent binary format). My suggestion thus is to
change the types in these functions to int8_t, int16_t, int32_t, as was
already done for int64_t. That way portable implementations will be
possible.
I believe that you could just typedef the above on both platforms, use a text archive,
and everything would work just fine. The text archive represents all numbers
as arbitrary-length integers, which would be converted correctly on
save as well as on load.
As I mentioned in the introductory part of my post, text archives are much longer than binary ones and thus cause bandwidth problems for some applications. Note that the option of compressing the archive after writing a) works only if you serialize into files (which is only one use case) and b) does not address the bandwidth problem of first writing the large text files.

2.) The second problem is speed when serializing large containers of
basic data types, e.g. a vector<double> or ublas vectors and matrices.
In my applications these can easily be hundreds of megabytes in size. In
the current implementation, serializing a std::vector<double>(10000000)
requires ten million virtual function calls. In order to prevent this,
I propose to add extra virtual functions (like the operator<< above),
which serialize C-arrays of basic data types, i.e. functions like
Serialization version 6 which was submitted for review includes
serialization of C-arrays. It is documented in the reference
under the title "Serialization Implementations included in the Library"
and a test case was added to test.cpp.
Yes, but it does so by calling the virtual operator << for each element, which is very slow if you
call it millions of times.
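
For example, the proposed hook might look roughly like the following sketch (example_oarchive and save_array are invented names, not the library's actual basic_oarchive interface):

#include <cstddef>
#include <vector>

// Sketch of the proposed extension: one virtual call per contiguous block
// instead of one per element.
class example_oarchive
{
public:
    virtual ~example_oarchive() {}
    // current style: one virtual call per value
    virtual example_oarchive & operator<<(double x) = 0;
    // proposed addition: hand a whole C-array to the archive in one call
    virtual void save_array(const double * p, std::size_t n) = 0;
};

// A container serializer could then turn the ten million virtual calls for
// a std::vector<double>(10000000) into a single one:
inline void save(example_oarchive & ar, const std::vector<double> & v)
{
    if (!v.empty())
        ar.save_array(&v[0], v.size());
}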


In conjunction with this, the serialization for std::vector and for
ublas vectors, etc. has to be adapted to make use of these optimized
serialization functions for basic data types.
The library permits override of the included implementations.
Of course, this has to be up to the person who finds the
included implementation inconvenient in some way, as he is
the only one who knows what he wants changed.
That will not work, since overriding is a compile-time decision, while I decide the archive format at runtime and thus need these optimized functions to be available as virtual functions.


3.) The third problem is the serialization of very large numbers of small objects. The current
library shows a way to optimize this (in reference.html#large), but it
is rather cumbersome. As it is now, I have to reimplement the
serialization of std::vector<T>, or std::list<T>, etc., for all such
types T. In almost all of my codes I have a large number of small
objects of various types for which I know that I will never serialize a
pointer. I would thus propose the following:

i) add a traits class to specify whether a pointer to an object will
ever be serialized, or whether it should be treated as a small object for
which serialization should be optimized

ii) specialize the serialization of the standard library containers for
these small objects, using the mechanism in the documentation.

That way I just need to specify a trait for my object and it will be
serialized efficiently.
I would be loath to implement this idea. Basically, instead of overloading
the serializations that you want to speed up, you want to require
all of us to specify traits for every class we want to serialize.
No, for the user who does not care about this, nothing needs to be changed in his code at all!

We can have a general template that defaults to the full, non-optimized serialization method for all classes for which we have not specialized it. That means no extra code for the standard user, while the user who needs to optimize large collections of small objects would just provide a traits specialization, instead of reimplementing the serialization of all the standard containers for all of his classes that need to be optimized. An example could be:

template <class T>
struct serialization_traits {
    static const bool optimize_serialization = false;
};

Thus the general traits template covers all classes that do not need to be optimized. Only for the classes that I need to optimize would I have to write:

template <> struct serialization_traits<MySmallClass> {
    static const bool optimize_serialization = true;
};

and the operator << would dispatch based on the value of this trait, something like this:
template <class T, class A>
basic_oarchive& operator<<(basic_oarchive& a, const std::vector<T,A>& v)
{
    return dispatch_serialization<serialization_traits<T>::optimize_serialization>::serialize(a, v);
}
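
For completeness, here is a minimal sketch of what the dispatch_serialization helper could look like (the names dispatch_serialization and save_array are only illustrative, and the optimized branch assumes the bulk-save hook proposed under point 2.) above):

#include <cstddef>
#include <vector>

// Generic case: element-wise serialization, roughly the current behaviour.
template <bool Optimize>
struct dispatch_serialization
{
    template <class Archive, class T, class A>
    static Archive & serialize(Archive & ar, const std::vector<T, A> & v)
    {
        ar << v.size();
        for (typename std::vector<T, A>::const_iterator it = v.begin();
             it != v.end(); ++it)
            ar << *it;
        return ar;
    }
};

// Optimized case: hand the whole contiguous block to the archive in one
// (virtual) call, assuming the archive provides save_array.
template <>
struct dispatch_serialization<true>
{
    template <class Archive, class T, class A>
    static Archive & serialize(Archive & ar, const std::vector<T, A> & v)
    {
        ar << v.size();
        if (!v.empty())
            ar.save_array(&v[0], v.size());
        return ar;
    }
};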

It would
make things harder to use. Also, the current implementation - like much
boost code - stretches current compilers to the breaking point. It's
already much more complex to implement than I expected, and
I already have much difficulty accommodating all the differences
in C++ implementations.
This would keep things just as easy to use: no extra coding is required for those who do not care about the optimization, but life would be MUCH easier for other users. If you like, I can try to find time to implement this in the library. Also, since this uses just simple template specialization, no modern compiler should have more problems than it already has with the library.

Java has runtime reflection, which is used to
obtain all the information required for serialization.  Also,
Java has a much more limited usage of pointers, so certain
problems we are dealing with don't come up.  I don't believe
that all the data structures can be unambiguously mapped
to Java.
Could Java data structures be mapped to C++ then, to be able to read Java-serialized files? But that is probably beyond the scope of this library anyway, though it might be interesting as a later extension.

I would like to see a platform-independent binary archive format (e.g.
using XDR), but am also willing to contribute that myself once the
interface has been finalized.
Thank you. Note that none of the comments made so far have any
impact on the interfaces defined by the base classes basic_[i|o]archive,
except that I prefer int16_t, int32_t, ... instead of short and long.

So there is no reason you can't get started now.  As you can see
from the 3 derivations included in the package, making your own
XDRarchive is a pretty simple proposition.
I'll do that once I get the library to compile, and will send it to you.
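
Until then, here is a rough sketch of the core of such an XDR encoder (xdr_encoder and its members are invented names and are not tied to the basic_[i|o]archive interface); XDR simply stores every integer big-endian, padded to a multiple of four bytes, which is what makes it platform independent:

#include <cstdio>
#include <boost/cstdint.hpp>

class xdr_encoder
{
public:
    explicit xdr_encoder(std::FILE * f) : file_(f) {}

    // every 32-bit quantity is written high byte first (big-endian)
    void put_int32(boost::int32_t x)
    {
        boost::uint32_t u = static_cast<boost::uint32_t>(x);
        unsigned char buf[4];
        buf[0] = static_cast<unsigned char>((u >> 24) & 0xff);
        buf[1] = static_cast<unsigned char>((u >> 16) & 0xff);
        buf[2] = static_cast<unsigned char>((u >> 8) & 0xff);
        buf[3] = static_cast<unsigned char>(u & 0xff);
        std::fwrite(buf, 1, 4, file_);
    }

    // a 64-bit hyper is two 32-bit words, high word first
    void put_int64(boost::int64_t x)
    {
        boost::uint64_t u = static_cast<boost::uint64_t>(x);
        put_int32(static_cast<boost::int32_t>(u >> 32));
        put_int32(static_cast<boost::int32_t>(u & 0xffffffff));
    }

private:
    std::FILE * file_;
};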

As was already remarked by others, I would like to see documentation on
exactly which functions a new archive type has to implement.
Wouldn't it be easier just to look at the basic_[i|o]archive code?
I like the documentation to be self-contained. A documentation page including a synopsis of basic_[i|o]archive, and showing which functions to implement, would make it easier than scanning through the header file, past all the pragmas, comments and other classes, until one finds the class definition.

Perhaps
we might want to break out text_archive and the native binary archive
into separate headers. That might make it more obvious that
these derivations aren't really part of the library but rather more
like popular examples.
That makes sense.

I thank you for your effort in replying in such a detailed manner to my comments and want to quickly summarize the open issues:

i) as you already use int64_t for 64-bit integers, why not also use int32_t, int16_t, etc.? That would make the interface more consistent and make it MUCH easier to implement portable binary formats!

ii) support for optimization of serialization by a traits class would be extremely important and helpful, without incurring any extra coding effort for standard users!

iii) additional virtual functions to serialize large arrays of data (e.g. dense vectors and matrices), instead of calling operator << for each of the (possibly millions of) elements, are still needed for optimization and to make use of corresponding functions in some binary serialization formats (e.g. in XDR or for PVM).

I would volunteer to implement ii) and iii) in the library if you agree and do not want to do it yourself.

With best regards,

Matthias
