[boost] Re: Serialization -- limits, and variable length integers

Dave Harris Sun, 17 Nov 2002 08:30:54 -0800

In-Reply-To: <[EMAIL PROTECTED]>
From the headers...
>    typedef unsigned char version_type;     // upto 255 versions
>    namespace serialization_detail {
>        typedef unsigned short class_id_type;   // upto 64k kinds
>                                                // of objects
>        typedef int object_id_type;             // upto 2G objects
>    }


It seems to me these limits are arbitrary, and in some cases rather low. 
Wouldn't it be better, and more general, to use int or long?

On a related note, I think variable length integers ought to be supported 
as primitive. For example, consider something like:

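    void basic_oarchive::save_vri( unsigned long x ) {
        bool more_to_come = true;

        while (more_to_come) {
            // Take the next 7 bits, least significant group first.
            unsigned char low_bits = x & 0x7f;
            x >>= 7;
            // Keep going only while there are bits left to write.
            more_to_come = (x != 0);
            unsigned char high_bit = more_to_come ? 0x80 : 0x00;
            // high_bit | low_bits promotes to int, so cast back to a byte.
            *this << static_cast<unsigned char>(high_bit | low_bits);
        }
    }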

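    unsigned long basic_iarchive::load_vri() {
        unsigned long x = 0;
        unsigned shift = 0;
        bool more_to_come = true;

        while (more_to_come) {
            unsigned char bits;
            *this >> bits;
            // save_vri writes the least significant 7 bits first, so each
            // new group goes above the bits already read.
            x |= static_cast<unsigned long>(bits & 0x7f) << shift;
            shift += 7;
            more_to_come = (bits & 0x80) != 0;
        }
        return x;
    }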

This encodes an unsigned integer as a variable number of bytes, least 
significant bits first. The low 7 bits of each byte contribute to the 
number, and the high bit says whether there are more bytes to come.
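
As a rough standalone sketch of the same scheme, here it is written 
against a plain byte vector rather than an archive. The names encode_vri 
and decode_vri are made up for illustration and are not part of the 
library, and the decoder assumes well-formed input apart from a simple 
bounds assertion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: append x to out, 7 bits per byte, least
// significant group first. The high bit of each byte is set when at
// least one more byte follows.
void encode_vri(std::uint64_t x, std::vector<unsigned char>& out) {
    do {
        unsigned char low_bits = static_cast<unsigned char>(x & 0x7f);
        x >>= 7;
        unsigned char high_bit = (x != 0) ? 0x80 : 0x00;
        out.push_back(static_cast<unsigned char>(high_bit | low_bits));
    } while (x != 0);
}

// Hypothetical helper: decode one value starting at in[pos] and advance
// pos past the bytes consumed. No overflow checking is attempted.
std::uint64_t decode_vri(const std::vector<unsigned char>& in,
                         std::size_t& pos) {
    std::uint64_t x = 0;
    unsigned shift = 0;
    unsigned char bits;
    do {
        assert(pos < in.size());  // input must not end mid-value
        bits = in[pos++];
        x |= static_cast<std::uint64_t>(bits & 0x7f) << shift;
        shift += 7;
    } while ((bits & 0x80) != 0);
    return x;
}
```

With these helpers the value 300 encodes as the two bytes 0xAC 0x02, and 
any value below 128 encodes as a single byte.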

Although I've used this technique in the past I haven't tested this exact 
code, so it may have bugs or be in the wrong place. If we are saving in an 
ASCII format we wouldn't want to do this, because a number written as text 
is intrinsically variable length anyway. And of course, we cannot use it 
as the default way of writing integers, because large numbers take more 
bytes than a fixed-width encoding (although with this scheme the overhead 
for a 32-bit value can never be more than a byte).
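
To make the size trade-off concrete, here is a small sketch that counts 
the bytes the scheme needs. The function names vri_size and max_vri_size 
are hypothetical. A 32-bit value needs at most 5 bytes against 4 fixed; 
for a 64-bit value the worst case is 10 bytes against 8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes needed to encode x at 7 payload bits per byte.
std::size_t vri_size(std::uint64_t x) {
    std::size_t n = 1;          // even zero takes one byte
    while ((x >>= 7) != 0) {
        ++n;
    }
    return n;
}

// Hypothetical helper: worst-case encoded size for a type that is
// value_bits wide, i.e. value_bits / 7 rounded up.
std::size_t max_vri_size(unsigned value_bits) {
    assert(value_bits > 0);
    return (value_bits + 6) / 7;
}
```

Values up to 127 take one byte, values up to 16383 take two, and so on, 
so the scheme only loses out near the top of the type's range.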

That said, when used appropriately the benefits include:

(a) Smaller archives in the common case.
(b) Faster loading and saving (because there are fewer bytes to move 
around).
(c) Avoidance of arbitrary limits caused by hardwired sizes.
(d) Extra portability due to not relying on the number and ordering of 
bytes in primitive types.

Of course something like this can be built on top of the current library, 
but if it is included then the library can use it for its bookkeeping 
data. It can be used for things like class_id_type and the lengths of 
strings and vectors. Then the library will get benefits (a)-(d).

For example, a string like "hello" is currently stored (by boarchive) with 
a size_t length, which on my machine is 32 bits, taking 9 bytes 
altogether. If the variable length format is used, it will take 6 bytes, a 
33% saving. Further, it can be reloaded into a machine for which size_t is 
only 16 bits.
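
The arithmetic can be checked with a sketch like the following, which 
writes a string as a variable length count followed by its characters. 
This is not the format boarchive actually uses; save_string and 
load_string are hypothetical names, and the loader assumes well-formed 
input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: variable length count, then the raw characters.
std::vector<unsigned char> save_string(const std::string& s) {
    std::vector<unsigned char> out;
    std::size_t n = s.size();
    do {
        unsigned char low_bits = static_cast<unsigned char>(n & 0x7f);
        n >>= 7;
        unsigned char high_bit = (n != 0) ? 0x80 : 0x00;
        out.push_back(static_cast<unsigned char>(high_bit | low_bits));
    } while (n != 0);
    out.insert(out.end(), s.begin(), s.end());
    return out;
}

// Hypothetical helper: read the count back, then that many characters.
std::string load_string(const std::vector<unsigned char>& in) {
    std::size_t n = 0;
    std::size_t pos = 0;
    unsigned shift = 0;
    unsigned char bits;
    do {
        assert(pos < in.size());  // input must not end inside the count
        bits = in[pos++];
        n |= static_cast<std::size_t>(bits & 0x7f) << shift;
        shift += 7;
    } while ((bits & 0x80) != 0);
    assert(in.size() - pos >= n);  // all characters must be present
    std::string s;
    for (std::size_t i = 0; i < n; ++i) {
        s.push_back(static_cast<char>(in[pos + i]));
    }
    return s;
}
```

Here "hello" comes out as 6 bytes: one byte holding the count 5, then the 
five characters. The count byte carries no trace of how wide size_t was 
on the saving machine.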

-- Dave Harris
