[orientdb] Re: Schema driven serialization #1890

stefan Tue, 18 Feb 2014 10:54:59 -0800

+1 for this (if someone is counting)

it's very relevant for our use-case (schema-mixed).


Regards,
 -Stefan


On Tuesday, 18 February 2014 13:32:24 UTC, Steve Coughlan wrote:
>
> Curious to know: is there currently a 'defrag' tool or something of that 
> nature?  If so, that would be the ideal place to insert the schema 
> consolidation process.
>
> On Sunday, February 16, 2014 1:53:27 PM UTC+10, Steve Coughlan wrote:
>>
>>  This is probably going to be a stupid question because the solution 
>> seems so obvious I must have missed something fundamental.
>>
>> I found OrientDB when I gave up on MongoDB due to the issue of storing field 
>> names in every document (for a lot of my data the field names are larger 
>> than the data itself).  I just came across issue 
>> #1890 <https://github.com/orientechnologies/orientdb/issues/1890> and am 
>> happy to see that Orient considers this a priority, but I don't quite 
>> understand the need for such a complex approach.
>>
>> Why not simply maintain an internal index of field names and store the 
>> index?  It wouldn't really matter if you had different classes with the 
>> same field name since the name is all you are interested in.  To further 
>> compact things you could use a format like Google Protocol Buffers' 'varint' 
>> type <https://developers.google.com/protocol-buffers/docs/encoding#varints>. 
>> If you altered the varint format so the first 'grouping' was 16 bits 
>> rather than 8 (15 payload bits plus the continuation bit) then you'd have 
>> 32k field names available before needing to expand (which would cover an 
>> awful lot of use cases).
>>
>> The lookup would be as trivial as an array lookup and any overhead would 
>> be more than offset by the benefits of being able to cache many more 
>> records in memory due to the space savings.  Another potential advantage 
>> would be that you only ever use one instance of each field name String and 
>> vastly improve any map lookups that are done internally.  If the current 
>> format writes the actual field name as a string, then every time a field is 
>> read a new String is created, so for every field * every record where a 
>> map lookup is required it must compute hashCode() and run a manual 
>> char-by-char equals().  That is 3 traversals of the string saved on the 
>> first lookup (1 for hashCode() and 1 for each of the two strings compared 
>> in equals()) and 2 for subsequent lookups.
>>
>> On the client side I suppose there is the issue of whether the client 
>> should keep the entire lookup table in memory.  It could be passed portions 
>> of it as needed and use something like a Trove map for lookups.  Not quite 
>> as fast as an array lookup but again I would imagine the savings in memory, 
>> bandwidth etc would more than offset the cost.
>>
>> I must be missing something?
>>  
>
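
For illustration only, the field-name index and the modified varint proposed in the quoted message could be sketched roughly as below. This is not OrientDB code: the class name FieldNameIndex and the methods idOf, nameOf, writeId and readId are hypothetical, and the byte layout (a big-endian 2-byte first group with the top bit as continuation flag, followed by ordinary 7-bit varint groups, least significant first) is just one assumed reading of "first grouping is 16 bits".

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not OrientDB code): maps field names to small integer ids. */
final class FieldNameIndex {
    private final Map<String, Integer> idsByName = new HashMap<>();
    private final List<String> namesById = new ArrayList<>();

    /** Returns the id for a name, assigning the next free id the first time it is seen. */
    synchronized int idOf(String name) {
        Integer id = idsByName.get(name);
        if (id == null) {
            id = namesById.size();
            idsByName.put(name, id);
            namesById.add(name);
        }
        return id;
    }

    /** Array-style lookup; the same String instance is returned for a given id. */
    synchronized String nameOf(int id) {
        return namesById.get(id);
    }

    /**
     * Writes an id as a modified varint (assumed layout): the first group is
     * 2 bytes (1 continuation bit + 15 payload bits); any further groups are
     * 1 byte each (1 continuation bit + 7 payload bits), least significant first.
     */
    static void writeId(ByteArrayOutputStream out, int id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        int first = id & 0x7FFF;       // low 15 bits go in the first group
        int rest = id >>> 15;          // whatever does not fit
        if (rest != 0) {
            first |= 0x8000;           // continuation flag
        }
        out.write(first >>> 8);
        out.write(first & 0xFF);
        while (rest != 0) {
            int group = rest & 0x7F;
            rest >>>= 7;
            if (rest != 0) {
                group |= 0x80;
            }
            out.write(group);
        }
    }

    /** Reads an id previously written by writeId, starting at the given offset. */
    static int readId(byte[] buf, int offset) {
        int first = ((buf[offset] & 0xFF) << 8) | (buf[offset + 1] & 0xFF);
        int id = first & 0x7FFF;
        boolean more = (first & 0x8000) != 0;
        int shift = 15;
        int pos = offset + 2;
        while (more) {
            int group = buf[pos++] & 0xFF;
            id |= (group & 0x7F) << shift;
            shift += 7;
            more = (group & 0x80) != 0;
        }
        return id;
    }
}
```

Under these assumptions, ids 0 to 32767 take two bytes each and larger ids grow by one byte per extra 7 bits, which matches the "32k field names before needing to expand" estimate in the quoted message.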
