Is there currently a 'defrag' tool or something of that nature?  If so, 
that would be the ideal place to insert the schema consolidation 
process.

On Sunday, February 16, 2014 1:53:27 PM UTC+10, Steve Coughlan wrote:
>
>  This is probably going to be a stupid question because the solution seems 
> so obvious I must have missed something fundamental.
>
> I found OrientDB when I gave up on MongoDB due to the issue of storing field 
> names in every document (for a lot of my data the field names are larger 
> than the data itself).  I just came across issue 
> #1890 <https://github.com/orientechnologies/orientdb/issues/1890> and am 
> happy to see that Orient considers this a priority, but I don't quite 
> understand the need for such a complex approach.
>
> Why not simply maintain an internal index of field names and store the 
> index in place of the name?  It wouldn't really matter if different classes 
> had the same field name, since the name is all you are interested in.  To 
> further compact things you could use a format like Google protobuf's 'varint' 
> type <https://developers.google.com/protocol-buffers/docs/encoding#varints>. 
> If you altered the varint format so the first 'grouping' was 16 bits 
> rather than 8 then you'd have 32k field names available before needing to 
> expand (which would cover an awful lot of use cases).
>
> The lookup would be as trivial as an array lookup, and any overhead would 
> be more than offset by the benefit of being able to cache many more 
> records in memory due to the space savings.  Another potential advantage 
> is that you would only ever use one instance of each field name String, 
> which would vastly improve any map lookups that are done internally.  If the 
> current format writes the actual field name as a string, then every time a 
> field is read a new String is created, so for every field * every record 
> where a map lookup is required it must compute the hashcode and run a 
> char-by-char equals().  Sharing one instance saves 3 traversals of the 
> string on the first lookup (1 for the hashcode and 1 for each of the two 
> strings in equals()) and 2 on subsequent lookups.
>
> On the client side I suppose there is the issue of whether the client 
> should keep the entire lookup table in memory.  The client could be passed 
> portions of the table as needed and use something like a Trove map for 
> lookups.  Not quite as fast as an array lookup, but again I would imagine the 
> savings in memory, bandwidth etc. would more than offset the cost.
>
> I must be missing something?
>  
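
For reference, the field-name index and the widened varint from the quoted 
message could be sketched roughly as below. This is only an illustration of 
the idea, not how OrientDB stores records: the class FieldNameDictionary and 
its methods idFor, nameFor, encodeId and decodeId are hypothetical names, and 
the byte layout (a 16-bit first group holding 15 payload bits plus a 
continuation flag, followed by ordinary 8-bit groups) is an assumption about 
what "first grouping of 16 bits" would mean.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: maps each distinct field name to a small integer id. */
final class FieldNameDictionary {
    private final Map<String, Integer> idsByName = new HashMap<>();
    private final List<String> namesById = new ArrayList<>();

    /** Returns the id already assigned to the name, or assigns the next free one. */
    int idFor(String name) {
        Integer id = idsByName.get(name);
        if (id == null) {
            id = namesById.size();
            idsByName.put(name, id);
            namesById.add(name);
        }
        return id;
    }

    /** Index-based lookup; every call for an id returns the same String instance. */
    String nameFor(int id) {
        return namesById.get(id);
    }

    /**
     * Assumed layout: the first group is 2 bytes (15 payload bits, top bit set
     * when more groups follow); later groups are 1 byte (7 payload bits, top
     * bit set when more follow), as in a protobuf varint.
     */
    static byte[] encodeId(int id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int first = id & 0x7FFF;
        int rest = id >>> 15;
        if (rest != 0) {
            first |= 0x8000; // continuation flag on the 16-bit group
        }
        out.write(first >>> 8);
        out.write(first & 0xFF);
        while (rest != 0) {
            int group = rest & 0x7F;
            rest >>>= 7;
            if (rest != 0) {
                group |= 0x80; // continuation flag on an 8-bit group
            }
            out.write(group);
        }
        return out.toByteArray();
    }

    /** Inverse of encodeId; expects the bytes of exactly one encoded id. */
    static int decodeId(byte[] bytes) {
        int first = ((bytes[0] & 0xFF) << 8) | (bytes[1] & 0xFF);
        int id = first & 0x7FFF;
        boolean more = (first & 0x8000) != 0;
        int shift = 15;
        int pos = 2;
        while (more) {
            int group = bytes[pos++] & 0xFF;
            id |= (group & 0x7F) << shift;
            shift += 7;
            more = (group & 0x80) != 0;
        }
        return id;
    }
}
```

Under these assumptions the first 32768 ids (0 to 32767) each take two bytes 
in a record, and later ids grow by one byte per extra 7 bits. Because nameFor 
always hands back the same String instance, map lookups keyed on it can reuse 
the cached hashcode and usually succeed on a reference comparison.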
