Re: [boost] Reminder: Serialization Library Review

Matthias Troyer Sat, 16 Nov 2002 08:45:15 -0800

Before coming to my detailed review I would like to thank Robert for all his work and for contributing it to Boost.

I want to start with a general comment about text vs. binary archives. While a text archive is nice to look at, and can also be compressed after writing to save disk space, there are two important reasons why I will always use binary formats:

i) when using serialization in passing messages between processes (using MPI, PVM or another message passing library), I am often restricted by bandwidth, especially when sending large vectors or matrices. Then converting numbers to text instead of sending them as binary numbers will make my codes slow down by a factor of 2-3.

ii) as my simulation programs usually run several weeks, I serialize the state of the simulation at the end of every batch job (usually every 24 hours). At that time 128-256 CPUs write about 100 MByte each to a central file server, which already severely overloads the server. Thus, again, I want to keep the file sizes as small as possible and consider support for efficient portable binary serialization formats essential.

On Friday, November 15, 2002, at 05:13 PM, David Abrahams wrote:

   Here are some questions you might want to answer in your review:

      What is your evaluation of the design?

I like the design, as much thought was apparently put into being able to serialize polymorphic classes.

I see two serious problems, however, that have to be addressed before I can vote for inclusion into Boost.

1.) The first problem is the basic data types used in the archive:

virtual basic_oarchive & operator<<(signed char _Val) = 0;
virtual basic_oarchive & operator<<(unsigned char _Val) = 0;
virtual basic_oarchive & operator<<(char _Val) = 0;
virtual basic_oarchive & operator<<(short _Val) = 0;
virtual basic_oarchive & operator<<(unsigned short _Val) = 0;
virtual basic_oarchive & operator<<(int _Val) = 0;
virtual basic_oarchive & operator<<(unsigned int _Val) = 0;
virtual basic_oarchive & operator<<(long _Val) = 0;
virtual basic_oarchive & operator<<(unsigned long _Val) = 0;
virtual basic_oarchive & operator<<(float _Val) = 0;
virtual basic_oarchive & operator<<(double _Val) = 0;
virtual basic_oarchive & operator<<(long double _Val) = 0;
#ifndef BOOST_NO_INT64_T
virtual basic_oarchive & operator<<(int64_t _Val) = 0;
virtual basic_oarchive & operator<<(uint64_t _Val) = 0;
#endif

short, int and long have no defined bit size, and can thus never be used for portable serialization. Imagine I use a platform where long is 64-bit, write it to the archive and then read it again on a platform where long is 32-bit. This will cause major problems. It also prevents the use of archive formats that rely on fixed bit sizes (such as XDR or any other platform-independent binary format). My suggestion thus is to change the types in these functions to int8_t, int16_t, int32_t, as was already done for int64_t. That way portable implementations will be possible.
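
To illustrate what fixed-size types make possible, here is a minimal sketch of a byte-order-independent encoding of a 32-bit integer. The helper names put_int32 and get_int32 are made up for this example and are not part of the library under review; a real portable archive (e.g. one based on XDR) would differ in detail.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: append a 32-bit integer as four big-endian bytes.
// The result does not depend on the host's byte order or on sizeof(long).
inline void put_int32(std::vector<unsigned char>& buf, std::int32_t value)
{
    std::uint32_t u;
    std::memcpy(&u, &value, sizeof u);  // reinterpret the bits as unsigned
    buf.push_back(static_cast<unsigned char>((u >> 24) & 0xFFu));
    buf.push_back(static_cast<unsigned char>((u >> 16) & 0xFFu));
    buf.push_back(static_cast<unsigned char>((u >> 8) & 0xFFu));
    buf.push_back(static_cast<unsigned char>(u & 0xFFu));
}

// Hypothetical helper: read back four big-endian bytes starting at pos.
// The caller must ensure that pos + 4 <= buf.size().
inline std::int32_t get_int32(const std::vector<unsigned char>& buf,
                              std::size_t pos)
{
    const std::uint32_t u =
        (static_cast<std::uint32_t>(buf[pos]) << 24) |
        (static_cast<std::uint32_t>(buf[pos + 1]) << 16) |
        (static_cast<std::uint32_t>(buf[pos + 2]) << 8) |
        static_cast<std::uint32_t>(buf[pos + 3]);
    std::int32_t value;
    std::memcpy(&value, &u, sizeof value);  // reinterpret the bits as signed
    return value;
}
```

Because the width is fixed at 32 bits, the reader always consumes exactly four bytes, whatever the size of long happens to be on either platform.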

The second big problem I see concerns speed and efficiency when large containers of small classes have to be serialized.

2.) The second problem is speed when serializing large containers of basic data types, e.g. a vector<double> or ublas vectors and matrices. In my applications these can easily be hundreds of megabytes in size. In the current implementation, serializing a std::vector<double>(10000000) requires ten million virtual function calls. In order to prevent this, I propose to add extra virtual functions (like the operator<< above), which serialize C-arrays of basic data types, i.e. functions like

virtual void basic_oarchive::save_array(const int32_t*, std::size_t n)

which as default just call the operator<< n times, but which can be overridden in specialized archive types. Examples are serialization into a PVM buffer, where the pvm_pkint function accepts any array of integers, or the use of memcpy to copy the array in native binary format into a buffer, or binary write functions into a file. Having these extra functions allows implementors of archives to make use of fast functions for arrays of data.
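
A minimal sketch of this proposal, assuming a stripped-down archive base class with a single basic type. The class names oarchive_sketch and native_buffer_oarchive are invented for illustration and do not exist in the library; only save_array follows the signature suggested above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the archive base class (one basic type only).
class oarchive_sketch {
public:
    virtual ~oarchive_sketch() {}
    virtual oarchive_sketch& operator<<(std::int32_t value) = 0;

    // Default: fall back to one virtual operator<< call per element.
    virtual void save_array(const std::int32_t* data, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            *this << data[i];
    }
};

// Hypothetical archive that stores values in native binary format in memory.
class native_buffer_oarchive : public oarchive_sketch {
public:
    oarchive_sketch& operator<<(std::int32_t value) override
    {
        save_array(&value, 1);
        return *this;
    }

    // Override: copy the whole array with a single memcpy.
    void save_array(const std::int32_t* data, std::size_t n) override
    {
        if (n == 0)
            return;
        const std::size_t old_size = buffer_.size();
        buffer_.resize(old_size + n * sizeof(std::int32_t));
        std::memcpy(buffer_.data() + old_size, data,
                    n * sizeof(std::int32_t));
    }

    const std::vector<unsigned char>& buffer() const { return buffer_; }

private:
    std::vector<unsigned char> buffer_;
};
```

With this layout, saving a vector of n integers costs one virtual call and one memcpy in the specialized archive, while archives that do not override save_array keep working through the default loop.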

In conjunction with this, the serialization for std::vector and for ublas vectors, etc. has to be adapted to make use of these optimized serialization functions for basic data types.

3.) While I consider the two issues above show stoppers for the use of the library in serious scientific simulations with large data sets, the next issue is not as serious but would be easy to address. It concerns the serialization of very large numbers of small objects. The current library shows a way to optimize this (in reference.html#large), but it is rather cumbersome. As it is now, I have to reimplement the serialization of std::vector<T>, or std::list<T>, etc., for all such types T. In almost all of my codes I have a large number of small objects of various types for which I know that I will never serialize a pointer. I would thus propose the following:

i) add a traits class to specify whether a pointer to an object will ever be serialized or if it should be treated as a small object for which serialization should be optimized

ii) specialize the serialization of the standard library containers for these small objects, using the mechanism in the documentation.

That way I just need to specify a trait for my object and it will be serialized efficiently.
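
Roughly, the trait could look like the sketch below. The names is_small_object, save_strategy and choose_strategy are hypothetical, as is the example type spin; the actual optimized code path would be whatever the documentation describes, which I only represent here by an enum value.

```cpp
#include <list>
#include <type_traits>
#include <vector>

// Hypothetical trait: true if pointers to T are never serialized, so the
// container code may skip per-object bookkeeping. Defaults to false.
template <class T>
struct is_small_object : std::false_type {};

// Example user type (invented for this sketch) and its one-line opt-in.
struct spin { signed char value; };

template <>
struct is_small_object<spin> : std::true_type {};

// Placeholder for the two code paths a container serializer could take.
enum class save_strategy { tracked, untracked };

// Generic over any standard container: pick the path from the element type.
template <class Container>
save_strategy choose_strategy(const Container&)
{
    typedef typename Container::value_type value_type;
    return is_small_object<value_type>::value ? save_strategy::untracked
                                              : save_strategy::tracked;
}
```

The point is that the specialization of the trait is the only code a user writes per type; the container serialization is written once in the library.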

4.) I am confused about registering polymorphic types. If one program reads an archive written by another program, do both have to register all the types in exactly the same order, or is it OK if the program reading the archive registers only a subset of types and in another order? I need that when an evaluation program reads only the first part of a file (e.g. only the base class), without reading the rest of the serialized data of the derived class. Can I read the base class from an archive into which I serialized the derived class?
This is important for programs which just act on the information in the base class.

5.) This is a point for discussion and not a criticism of the library. Instead of polluting the global namespace with a serialization class, I would prefer to implement serialization with free functions save and load instead.
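
As a sketch of what I have in mind, with a standard stream standing in for the archive (the namespace myapp and the type particle are made up for this example, and the real archive interface would of course differ):

```cpp
#include <sstream>

namespace myapp {

// Example user type, invented for this sketch.
struct particle {
    double x;
    double v;
};

// Free functions live next to the type in its own namespace; nothing is
// added to the global namespace, and the type itself stays unchanged.
template <class OArchive>
void save(OArchive& ar, const particle& p)
{
    ar << p.x << ' ' << p.v << ' ';
}

template <class IArchive>
void load(IArchive& ar, particle& p)
{
    ar >> p.x >> p.v;
}

}  // namespace myapp
```

An unqualified call such as save(ar, p) would then find myapp::save through argument-dependent lookup, so library code can call it without knowing the user's namespace.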

6.) Finally, if I am correctly informed, the Java language includes serialization and has a portable archive format. Could this library be made compatible with this Java language standard, i.e. might it be possible to create an archive format which can read such Java serialization files?

      What is your evaluation of the implementation?

I would like to see a platform-independent binary archive format (e.g. using XDR), but am also willing to contribute that myself once the interface has been finalized.

      What is your evaluation of the documentation?

As was already remarked by others, I would like to see documentation on exactly which functions a new archive type has to implement. Also, it is unclear (see point 4 above) whether the registration of types has to be identical in all programs accessing the same serialized data.

      What is your evaluation of the potential usefulness of the
      library?

Extremely useful once the issues above have been sorted out.

      Did you try to use the library?  With what compiler?  Did you
      have any problems?

I tried to use the library but could not compile it under MacOS X 10.2 with gcc 3.1.
Compiling the file "demo.cpp" gives me the error:

../../boost/serialization/serialization_imp.hpp:382: sorry, not implemented: `tree_list' not supported by dump_expr

Thus, unfortunately, I could not do detailed tests of speed and file sizes.

      How much effort did you put into your evaluation? A glance? A
      quick reading? In-depth study?

half a day now and more time with previous versions.

      Are you knowledgeable about the problem domain?

Yes, I implemented my own serialization library eight years ago and used it for many years.

   And finally, every review should answer this question:

      Do you think the library should be accepted as a Boost library?
      Be sure to say this explicitly so that your other comments don't
      obscure your overall opinion.

Overall I like the library, and believe that it will not be hard to address the issues 1-2 above which I consider show stoppers, and issues 3-4 which I consider serious.
I will vote yes if these issues can be resolved.

Robert, many thanks for your efforts - I would love to use the library in my programs once it is suitable.

Best regards,

Matthias

