Although binary serialization gives us a substantial performance advantage,
the primary motive for this change was saving space (which in a big data
context IS a performance factor). With that in mind I've put together
a comparison of a common type of data record using various methods of
serialization. For this comparison I've picked the classic stock market
quote. It's a common small record type that's used in large volumes
(and also happens to be the main entry barrier for me to using OrientDB):
class Quote {
    String ticker; // assume 3 chars
    Date date;
    float open;
    float high;
    float low;
    float close;
    long volume;
}
For the purpose of the exercise we assume that a stock price is on average
4 digits plus a decimal point (5 chars), e.g. $12.51 or, for a penny stock,
$0.013. We will also assume that String-encoded field values are enclosed
in quotes as per the current implementation (we'll ignore escape chars for
this exercise).
Also assume that Strings are encoded using 2 bytes/char.

**String keys + values serialization (current implementation)**
key: ticker - 6 chars + fieldValueDelimiter (1 char) = 14 bytes
value: ticker - 3 chars + 2 " chars = 10 bytes
key: date - 4 + 1 = 10 bytes
value: date - 13 + 2 = 30 bytes // System.currentTimeMillis() returns a
13-digit number
key: open - 4 + 1 = 10 bytes
value: open - 5 + 1 = 12 bytes
key: high - 4 + 1 = 10 bytes
value: high - 5 + 1 = 12 bytes
key: low - 3 + 1 = 8 bytes
value: low - 5 + 1 = 12 bytes
key: close - 5 + 1 = 12 bytes
value: close - 5 + 1 = 12 bytes
key: volume - 6 + 1 = 14 bytes
value: volume - 8 + 1 = 18 bytes
TOTAL = 184 bytes
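
As a cross-check, the string tally can be reproduced with a short sketch. It assumes only what is stated in this post (2 bytes/char, one fieldValueDelimiter char per key, and the per-value character counts listed); the class and method names are illustrative, not OrientDB APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class StringFormatSize {
    static final int BYTES_PER_CHAR = 2; // assumption stated in the post

    // Field name -> chars in the serialized value, including the quotes
    // or extra char counted in the listing.
    static Map<String, Integer> valueChars() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("ticker", 3 + 2);  // 3 chars + 2 quote chars
        m.put("date", 13 + 2);   // 13-digit millisecond timestamp + 2
        m.put("open", 5 + 1);
        m.put("high", 5 + 1);
        m.put("low", 5 + 1);
        m.put("close", 5 + 1);
        m.put("volume", 8 + 1);
        return m;
    }

    // Each key costs its name plus one fieldValueDelimiter char.
    static int keyBytes() {
        int total = 0;
        for (String name : valueChars().keySet()) {
            total += (name.length() + 1) * BYTES_PER_CHAR;
        }
        return total;
    }

    static int valueBytes() {
        int total = 0;
        for (int chars : valueChars().values()) {
            total += chars * BYTES_PER_CHAR;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("keys   = " + keyBytes());
        System.out.println("values = " + valueBytes());
        System.out.println("total  = " + (keyBytes() + valueBytes()));
    }
}
```

Run as-is it reports 78 bytes of keys and 106 bytes of values, 184 bytes in total.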

**String keys + binary values serialization**
(assumes length byte for Strings)
key: ticker - 6 chars + fieldValueDelimiter (1 char) = 14 bytes
value: ticker - 1 byte + 3 chars = 7 bytes
key: date - 4 + 1 = 10 bytes
value: date - 8 bytes
key: open - 4 + 1 = 10 bytes
value: open - 4 bytes
key: high - 4 + 1 = 10 bytes
value: high - 4 bytes
key: low - 3 + 1 = 8 bytes
value: low - 4 bytes
key: close - 5 + 1 = 12 bytes
value: close - 4 bytes
key: volume - 6 + 1 = 14 bytes
value: volume - 8 bytes
TOTAL = 117 bytes
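
To make the binary value widths concrete, here is a rough sketch using java.nio.ByteBuffer. It is not the proposed serializer, just an illustration of the sizes assumed here: one length byte plus 2 bytes/char for the ticker, 8 bytes for the date, 4 per float and 8 for the volume. The class name and sample values are made up.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class BinaryValues {
    // Encodes only the field values; the string keys are not included.
    static byte[] encode(String ticker, long dateMillis, float open, float high,
                         float low, float close, long volume) {
        // UTF_16BE gives 2 bytes per (BMP) char and writes no byte-order mark.
        byte[] t = ticker.getBytes(StandardCharsets.UTF_16BE);
        ByteBuffer buf = ByteBuffer.allocate(1 + t.length + 8 + 4 * 4 + 8);
        buf.put((byte) ticker.length()); // length byte: char count of the String
        buf.put(t);
        buf.putLong(dateMillis);
        buf.putFloat(open).putFloat(high).putFloat(low).putFloat(close);
        buf.putLong(volume);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("IBM", 1396909653000L,
                12.51f, 12.80f, 12.40f, 12.75f, 12345678L);
        System.out.println("value bytes = " + bytes.length);
    }
}
```

The seven values come to 39 bytes, compared with 106 bytes for the same values encoded as Strings.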

**Binary serialization without declared schema**
*Header:*
format, classId, version, headerLength, fieldCount, nullbitsLength,
nullbits = 7 bytes
7 fields * (nameId, datatype, offset, length) = 4 * 7 = 28 bytes
dataLength = 4 bytes
header total: 39 bytes
*Data:*
ticker = 6 bytes
open, high, low, close = 4 * 4 = 16 bytes
volume = 8 bytes
data total: 30 bytes
record total: 69 bytes

**Binary serialization with declared schema**
*Header:*
format, classId, version, headerLength, fieldCount, nullbitsLength,
nullbits = 7 bytes
ticker field: offset, length = 2 bytes
dataLength = 4 bytes
header total: 13 bytes
*Data:*
ticker = 6 bytes
open, high, low, close = 4 * 4 = 16 bytes
volume = 8 bytes
data total: 30 bytes
record total: 43 bytes
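
The two header layouts can be tallied the same way. This sketch assumes one byte per header item, as the listings imply, and reads the declared-schema case as needing offset + length entries only for variable-length fields (here just ticker). The names are illustrative.

```java
class HeaderSize {
    // format, classId, version, headerLength, fieldCount, nullbitsLength,
    // nullbits: assumed to be 1 byte each.
    static final int FIXED_ITEMS = 7;
    static final int DATA_LENGTH = 4;

    // No declared schema: every field carries nameId, datatype, offset, length.
    static int schemaless(int fieldCount) {
        return FIXED_ITEMS + fieldCount * 4 + DATA_LENGTH;
    }

    // Declared schema: only variable-length fields carry offset + length.
    static int declared(int variableLengthFields) {
        return FIXED_ITEMS + variableLengthFields * 2 + DATA_LENGTH;
    }

    public static void main(String[] args) {
        System.out.println("schemaless header = " + schemaless(7));
        System.out.println("declared header   = " + declared(1));
    }
}
```

For the 7-field Quote this gives 39 bytes without a declared schema and 13 bytes with one.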

The valid comparison currently is between the current implementation
(which doesn't change its serialized size regardless of whether the
class is schema declared) and either of the two binary examples.
Summing the line items above, that is 184 bytes (78 keys + 106 values)
vs either 69 bytes (39 header + 30 data) or 43 bytes, which in terms of
records able to be cached means a factor of about 2.7 or 4.3 depending on
whether the class is schema declared.

On 07/04/14 22:27, [email protected] wrote:
>
> Steve,
>
> I see you mention serialization of sub-elements as well.
>
> How much effort do you think it is to get this working for embedded
> maps and do you see that as something you will look into?
>
> Regards,
> -Stefan
>
> On Monday, 7 April 2014 12:25:33 UTC, [email protected] wrote:
>
> Hi,
>
> Do you have any rough estimation regarding how much space this
> could save?
> I know the question is very vague but I'm curious to know if you
> have done any comparison at all.
>
> Luca; Are you able to prioritize this to take advantage of this
> great work asap?
>
> Steve, again, thank you very much.
>
> Regards,
> -Stefán
>
> On Monday, 7 April 2014 03:41:11 UTC, Steve Coughlan wrote:
>
> On 07/04/14 13:12, Steve wrote:
> > For testing/debug it may be convenient to use a string data
> > serialization format. Or if using compressedbits this field may
> > specify which compression algorithm or settings to use.
>
> I should add a caveat to this statement. As long as there is not a
> mismatch between whether the serializer serializes fixed length fields
> using the same fixed length. Some code changes would be required to
> allow for this although they would not be difficult to do.
>
> --
>
> ---
> You received this message because you are subscribed to the Google
> Groups "OrientDB" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to [email protected]
> <mailto:[email protected]>.
> For more options, visit https://groups.google.com/d/optout.