[orientdb] Schema driven serialization #1890

Steve Sat, 15 Feb 2014 19:54:07 -0800

This is probably going to be a stupid question because the solution
seems so obvious I must have missed something fundamental.


I found OrientDB when I gave up on MongoDB due to the issue of storing
field names in every document (for a lot of my data the field names are
larger than the data itself).  I just came across issue #1890
<https://github.com/orientechnologies/orientdb/issues/1890> and am happy to
see that Orient considers this a priority, but I don't quite understand
the need for such a complex approach.

Why not simply maintain an internal index of field names and store the
index instead of the name?  It wouldn't really matter if different classes
used the same field name, since the name is all you are interested in.  To
further compact things you could use a format like Google Protocol Buffers'
'varint' type
<https://developers.google.com/protocol-buffers/docs/encoding#varints>.
If you altered the varint format so the first 'grouping' was 16 bits
rather than 8, then (with one bit kept as the continuation flag) you'd
have 32k field names available before needing to expand, which would
cover an awful lot of use cases.
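
To make the idea concrete, here is a rough sketch in Java of what I have
in mind.  It is not how OrientDB does (or would) implement it; the class
and method names are made up for illustration, and the byte layout (big-
endian 16-bit first group, then ordinary 7-bit varint groups) is just one
possible choice.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from OrientDB.
final class FieldNameDictionary {
    private final Map<String, Integer> idsByName = new HashMap<>();
    private final List<String> namesById = new ArrayList<>();

    // Returns the existing id for a name, or assigns the next free one.
    int idOf(String name) {
        Integer id = idsByName.get(name);
        if (id == null) {
            id = namesById.size();
            idsByName.put(name, id);
            namesById.add(name);
        }
        return id;
    }

    // The reverse lookup is a plain indexed access.
    String nameOf(int id) {
        return namesById.get(id);
    }

    // Encodes an id with a 16-bit first group (1 continuation bit plus
    // 15 payload bits), followed by ordinary 8-bit varint groups
    // (1 continuation bit plus 7 payload bits) only when needed.
    static byte[] encodeId(int id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int first = id & 0x7FFF;
        int rest = id >>> 15;
        if (rest != 0) {
            first |= 0x8000; // continuation flag on the 16-bit group
        }
        out.write(first >>> 8);
        out.write(first & 0xFF);
        while (rest != 0) {
            int group = rest & 0x7F;
            rest >>>= 7;
            if (rest != 0) {
                group |= 0x80; // continuation flag on an 8-bit group
            }
            out.write(group);
        }
        return out.toByteArray();
    }

    // Inverse of encodeId; assumes the bytes were produced by encodeId.
    static int decodeId(byte[] bytes) {
        int first = ((bytes[0] & 0xFF) << 8) | (bytes[1] & 0xFF);
        int id = first & 0x7FFF;
        boolean more = (first & 0x8000) != 0;
        int shift = 15;
        int pos = 2;
        while (more) {
            int group = bytes[pos++] & 0xFF;
            id |= (group & 0x7F) << shift;
            shift += 7;
            more = (group & 0x80) != 0;
        }
        return id;
    }
}
```

With this layout, ids 0 through 32767 take exactly two bytes and only
larger ids grow beyond that.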

The lookup would be as trivial as an array lookup, and any overhead would
be more than offset by the benefit of being able to cache many more
records in memory thanks to the space savings.  Another potential
advantage is that you would only ever use one instance of each field name
String, which would vastly improve any map lookups that are done
internally.  If the current format writes the actual field name as a
string, then every time a field is read a new String is created, so for
every field * every record where a map lookup is required it must compute
the hashcode and run a manual char-by-char equals().  Sharing a single
instance saves 3 traversals of the string on the first lookup (1 for the
hashcode and 1 over each of the two strings in equals()) and 2 on
subsequent lookups, where the hashcode is already cached.
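
A small Java sketch of the difference I mean, assuming a hypothetical
CanonicalNames table (my own name, nothing from OrientDB).  Decoding a
name from bytes always allocates a fresh String, whereas resolving an id
hands back the same instance every time, so String.equals() can return on
its reference check and the hashcode is only computed once per name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: compares the two ways a reader could obtain a field name.
final class CanonicalNames {
    private final String[] namesById;

    CanonicalNames(String... names) {
        this.namesById = names.clone();
    }

    // Id-based format: every read of the same id returns the same String
    // instance, whose hashcode is cached after its first use.
    String byId(int id) {
        return namesById[id];
    }

    // Name-in-record format: every read builds a new String, so each map
    // lookup has to hash it and compare it character by character.
    static String fromBytes(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```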

On the client side I suppose there is the issue of whether the client
should keep the entire lookup table in memory.  It could be passed
portions of it as needed and use something like a Trove map for
lookups.  Not quite as fast as an array lookup, but again I would imagine
the savings in memory, bandwidth, etc. would more than offset the cost.
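
Roughly, the client side could look like the sketch below.  Everything
here is hypothetical: the class name, the fetch callback standing in for
a server round trip, and the use of a plain java.util.HashMap where a
primitive-keyed map such as Trove's would avoid boxing.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical sketch of a client that learns field names lazily.
final class ClientFieldNameCache {
    // Stand-in for a primitive int-keyed map; HashMap keeps the sketch
    // free of external dependencies.
    private final Map<Integer, String> known = new HashMap<>();
    // Assumed callback that asks the server for the name behind an id.
    private final IntFunction<String> fetchFromServer;
    private int fetchCount;

    ClientFieldNameCache(IntFunction<String> fetchFromServer) {
        this.fetchFromServer = fetchFromServer;
    }

    // Returns the cached name, fetching it once on the first miss.
    String nameOf(int id) {
        String name = known.get(id);
        if (name == null) {
            name = fetchFromServer.apply(id);
            fetchCount++;
            known.put(id, name);
        }
        return name;
    }

    // Number of server fetches so far, to show repeated lookups stay local.
    int fetchCount() {
        return fetchCount;
    }
}
```

The client then only holds the names it has actually seen, and each one
costs a single fetch no matter how many records use it.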

I must be missing something?
