Re: [orientdb] Titan-Cassandra Combination vs OrientDB

Andrey Lomakin Mon, 13 Jan 2014 10:20:46 -0800

Hi,
Thank you!


On Mon, Jan 13, 2014 at 5:48 PM, LSP <[email protected]> wrote:

> Hello Andrey,
>
> Thanks for the detailed explanation. I am currently building a prototype
> with OrientDB. I am confident that along the process I will be challenged
> with modeling techniques and setting up relationships as my application may
> mandate. I will try and share any insights I may have along the process
> with the community.
>
>
> Thanks
> LSP
>
> On Monday, January 6, 2014 2:51:49 AM UTC-6, Andrey Lomakin wrote:
>
>> Hi,
>> Well, I will try to answer your question.
>>
>> There are several primary differences between Cassandra and OrientDB.
>>
>> 1. Primary key handling.
>>
>> Cassandra:
>>
>> Cassandra is designed for high write performance, so it uses LSM trees
>> as the underlying data structure for the primary key index.
>> This means that it achieves high write performance by mitigating random
>> I/O overhead.
>> But the trade-offs for this performance gain include:
>> 1. Memory consumption.
>> 2. Disk space consumption.
>> 3. Read performance that is a bit slower than with a B-tree index, which
>>    is typical for a DBMS.
>>
>> You can think of LSM trees as several sorted arrays which are stored on
>> disk and merged by a background process.
>> So if you want to retrieve an entry, you have to look through all of
>> those arrays.
>> As a result, the lookup complexity is N * log(M), where N is the number
>> of sorted arrays and M is the number of records in an array.
>> To avoid the N multiplier, Cassandra uses bloom filters. A bloom filter
>> tells you either that your key is definitely not in a sorted array, so
>> you can skip that array, or that it is probably there, so you need to
>> search for it in that array.
>> If I remember correctly, they use counting ones, so they require at
>> least 3 bits of additional memory per entry, or about 37.5 GB of
>> theoretical overhead (without implementation overhead) for 100 billion
>> entries.
>> If you are going to update your records, you still have to look through
>> several sorted arrays.
>>
>> So for a Cassandra primary key lookup the best-case complexity is log(M)
>> and the worst case is N * log(M).
>>
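A minimal Python sketch of the lookup described above, under stated assumptions: each sorted array is an in-memory list, and each one carries a plain (not counting) bloom filter. The names `BloomFilter`, `SortedRun` and `lookup` are made up for illustration and say nothing about how Cassandra actually implements this.

```python
import bisect
import hashlib


class BloomFilter:
    """Plain bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class SortedRun:
    """One immutable sorted array of (key, value) pairs."""

    def __init__(self, items):
        self.items = sorted(items)
        self.keys = [k for k, _ in self.items]
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)  # O(log M) binary search
        if i < len(self.keys) and self.keys[i] == key:
            return self.items[i][1]
        return None


def lookup(runs, key):
    """Search runs newest first; return (value, number_of_runs_searched)."""
    searched = 0
    for run in runs:
        if not run.bloom.might_contain(key):
            continue  # definitely absent: skip the binary search
        searched += 1
        value = run.get(key)
        if value is not None:
            return value, searched
    return None, searched


runs = [
    SortedRun([("k3", "new"), ("k9", "x")]),  # newest run
    SortedRun([("k1", "a"), ("k3", "old")]),
    SortedRun([("k5", "b"), ("k7", "c")]),    # oldest run
]
value, searched = lookup(runs, "k3")
```

Without the filters every lookup would binary-search all N runs. With them, a run is searched only when its filter says the key may be present, which is what brings the typical cost closer to log(M).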
>> OrientDB:
>>
>> OrientDB uses a list-based data structure in which the list index serves
>> as the primary key.
>> As a result, lookup complexity is always O(1). When you create records,
>> I/O operations are mostly append-only, so you will not see write speed
>> degradation.
>> But record updates use random I/O, so they are slower than record
>> creation operations.
>>
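To illustrate the idea only (this is a toy in-memory model, not OrientDB's cluster code, and `RecordList` is a made-up name), a list whose positions act as primary keys could look like this in Python:

```python
class RecordList:
    """Toy cluster: a record's position in the list is its primary key."""

    def __init__(self):
        self._records = []

    def create(self, record):
        self._records.append(record)      # append-only write
        return len(self._records) - 1     # the position doubles as the record id

    def read(self, position):
        return self._records[position]    # direct index access, O(1)

    def update(self, position, record):
        self._records[position] = record  # in-place write at an arbitrary position


cluster = RecordList()
first_id = cluster.create({"name": "sample-1"})
second_id = cluster.create({"name": "sample-2"})
cluster.update(first_id, {"name": "sample-1", "checked": True})
```

Creation only ever touches the end of the list, while an update can land anywhere in it. On disk that difference corresponds to sequential versus random I/O.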
>> To avoid random I/O overhead during updates, we are considering a new
>> cluster implementation. It uses a much simpler data structure than the
>> current one (which means faster) and uses an append-only approach:
>> https://github.com/orientechnologies/orientdb/issues/1600 .
>>
>> 2. Secondary key handling.
>>
>> Cassandra:
>>
>> As far as I know, Cassandra secondary indexes are limited. You can use
>> hash indexes and, as I remember, they are intended for data with low
>> cardinality such as color names, sex and so on (but you should recheck
>> this, I am not a Cassandra expert).
>>
>> OrientDB:
>>
>> OrientDB has 2 types of indexes: hash index and sb-tree (b-tree based).
>> The first guarantees at most 1 I/O operation for reads and at most 3 I/O
>> operations for writes; the second has log(M) complexity.
>> In OrientDB you can index almost everything. For example, you can index
>> an embedded map by value and then perform containsValue SQL queries using
>> the index.
>>
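As a rough Python analogy for indexing an embedded map by value (not OrientDB's API; `MapValueIndex`, the `thumbs` field and the sample records are all hypothetical), the index maps each map value back to the records that hold it:

```python
from collections import defaultdict


class MapValueIndex:
    """Toy index over the values of an embedded map field."""

    def __init__(self, field):
        self.field = field
        self._by_value = defaultdict(set)  # map value -> ids of records holding it

    def put(self, record_id, record):
        for value in record.get(self.field, {}).values():
            self._by_value[value].add(record_id)

    def contains_value(self, value):
        # Answered from the index alone, without scanning the records.
        return sorted(self._by_value.get(value, ()))


records = {
    0: {"title": "A", "thumbs": {"small": "a_s.png", "large": "a_l.png"}},
    1: {"title": "B", "thumbs": {"small": "b_s.png"}},
    2: {"title": "C", "thumbs": {"small": "a_s.png"}},
}
index = MapValueIndex("thumbs")
for rid, rec in records.items():
    index.put(rid, rec)

matches = index.contains_value("a_s.png")
```

A containsValue-style query then becomes a single index lookup instead of a scan over every record's map.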
>> But OrientDB indexes suffer from random I/O, which means that you will
>> probably need more nodes in the cluster in the case of big data.
>> We have several issues open to fix this disadvantage:
>> https://github.com/orientechnologies/orientdb/issues/1756
>> https://github.com/orientechnologies/orientdb/issues/1757
>>
>> 3. Server cluster support.
>>
>> The primary difference is in scalability options. OrientDB does not use
>> a DHT in its cluster, which means that you have to migrate your data from
>> one cluster to a bigger one manually.
>> But records can be distributed between nodes using different strategies;
>> round robin is the default one.
>>
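For readers unfamiliar with the term, round robin simply hands each new record to the next node in turn. A small Python sketch, with hypothetical node names and no relation to OrientDB's actual distribution code:

```python
from itertools import cycle


def distribute_round_robin(records, nodes):
    """Assign each record to the next node in turn; returns node -> records."""
    placement = {node: [] for node in nodes}
    for record, node in zip(records, cycle(nodes)):
        placement[node].append(record)
    return placement


placement = distribute_round_robin(range(7), ["node-a", "node-b", "node-c"])
```

Placement here depends on arrival order rather than on a hash of the key, so a reader cannot compute from the key alone which node holds a record, which is the contrast with a DHT.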
>> 4. Model.
>> The OrientDB model is more powerful than the Blueprints model (but maybe
>> Titan provides additional extensions). We support one-to-many relations
>> using not only edges but also LINKLIST, LINKSET and LINKMAP data
>> structures.
>> OrientDB also supports embedded documents and the multi-value property
>> types List, Set and Map. The OrientDB SQL language has operators to
>> support all these collections.
>>
>> I hope this information helps.
>>
>> But please note that we are not Cassandra or Titan experts, and it would
>> be better to ask questions about concrete OrientDB features so that you
>> can compare both implementations.
>>
>> On Fri, Jan 3, 2014 at 12:04 AM, LSP <[email protected]> wrote:
>>
>>>  Hi All,
>>>
>>> We are currently in the process of building a statistical analysis
>>> system. As a part of technology evaluation and due diligence we are
>>> drawing a comparison between the Titan-Cassandra combination and
>>> OrientDB.
>>>
>>> There was a topic in these forums that compared Cassandra and OrientDB
>>> (last update in October 2012). The comparison was quite succinct within the
>>> applicable context and the points therein have been factored in as a part
>>> of the due diligence. The biggest difference is obviously the fact that the
>>> comparison was between a columnar DB and a graph DB. The inclusion of Titan
>>> into this discussion makes it an apples-to-apples comparison. Besides,
>>> a lot has changed between October 2012 and January 2014 for OrientDB
>>> (Hazelcast support, multi-master support, etc.).
>>>
>>> Following is a high level summary of the scale requirements and internal
>>> design consensus we have:
>>>
>>>    1. 500-750 billion live samples per year (at this point in time we
>>>    do not have visibility into whether all of this will necessarily
>>>    translate into vertices per se).
>>>    2. A federated model/system is acceptable
>>>    3. Over and above the 500-750 billion live samples, the application
>>>    will have a couple of million records (just in case an additional drop
>>>    created chaos in the ocean :) )
>>>
>>>
>>> Given that we can store JSON data in Cassandra (with the knowledge that
>>> marshalling and unmarshalling will induce latency) and Titan can provide
>>> graph relationships, what, in the estimation of this community, tips the
>>> scales in favor of OrientDB?
>>>
>>> At the time of this writing, I have only managed to scratch the surface
>>> and I am relatively new to NoSQL and Big Data systems in general. So, if
>>> the question lacks clarity/depth, please let me know and I will share any
>>> additional information required.
>>>
>>> Thanks
>>> LSP
>>> PS - Wishing you all a happy new year and a great 2014.
>>>
>>> --
>>>
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "OrientDB" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>>
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>
>>
>>
>> --
>> Best regards,
>> Andrey Lomakin.
>>
>> Orient Technologies
>> the Company behind OrientDB
>>



-- 
Best regards,
Andrey Lomakin.

Orient Technologies
the Company behind OrientDB

