Re: A blog about Cassandra in the IoT arena

2018-08-29 Thread Rahul Singh
Understood. Deep problems to consider.

Partition size:
I’ve been looking at how Yugabyte uses “tablets” to group data. It’s an 
interesting proposition... it all comes down to the token-based addressing, 
which is optimized as a single-dimension array, and I think this is part of 
the limitation.
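A rough sketch of token-based addressing as a single-dimension array (the hash function and the node/vnode names below are illustrative stand-ins, not Cassandra's actual Murmur3Partitioner):

```python
# Route a partition key to a node by binary search in one sorted
# array of tokens: the first token >= hash(key), wrapping around.
import bisect
import hashlib

def token(key: str) -> int:
    # Stand-in for Murmur3: any uniform hash illustrates the idea.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

# Hypothetical ring of 3 nodes with 4 vnodes each.
RING = sorted((token(f"node-{i}-vnode-{v}"), f"node-{i}")
              for i in range(3) for v in range(4))
TOKENS = [t for t, _ in RING]

def owner(partition_key: str) -> str:
    t = token(partition_key)
    i = bisect.bisect_left(TOKENS, t) % len(RING)  # wrap around the ring
    return RING[i][1]
```

The point of the sketch is that the whole placement scheme lives in that one sorted array, which is what makes it both fast and hard to extend to a second dimension.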


The sorting problem is one of the oldest in the industry. Maybe we need to 
look at Kafka and Lucene. Between the two, there are some interesting patterns 
for referencing the location of data and for storing those references. The 
compaction process wouldn’t need to “sort” if there were an optimized index 
that orders the vectors and the locations. Compacting files should be a “dumb” 
operation if a “smart” index is ready to serve as the task table. The major 
reason Cassandra is fast is the partitioner, which effectively “indexes” the 
data onto a node and into a token. We need to go one level deeper. Maybe it’s 
another compaction strategy that evenly distributes data, either by a size 
threshold or by maintaining a certain number of SSTables.
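A minimal sketch of that “dumb” compaction, assuming each input SSTable is already sorted by clustering key (the tuple layout is hypothetical):

```python
# If every SSTable's rows arrive in clustering order, compaction reduces
# to a k-way merge; the newest timestamp wins for duplicate keys.
import heapq

def compact(sstables):
    """sstables: list of lists of (clustering_key, timestamp, value),
    each already sorted by clustering_key."""
    merged = {}
    for key, ts, val in heapq.merge(*sstables):
        # keep the newest write per clustering key
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, val)
    return [(k, ts, v) for k, (ts, v) in sorted(merged.items())]
```

The merge itself never sorts; all the ordering work was done up front, which is the “smart index, dumb compaction” split described above.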

I don’t have any ideas yet for anything better than Merkle trees. I’ll get 
back to you with ideas or code.

Good stuff.

Rahul

Re: A blog about Cassandra in the IoT arena

2018-08-24 Thread DuyHai Doan
No, what I meant by infinite partitions is not auto sub-partitioning, even
server-side. Ideally Cassandra should be able to support infinite partition
size and make compaction, repair and streaming of such partitions
manageable:

- compaction: find a way to iterate super efficiently through the whole
partition and merge-sort all sstables containing data for the same
partition.

- repair: find an approach other than Merkle trees, because their resolution
is not granular enough. Ideally repair resolution should be at the
clustering level, or every xxx clustering values.

- streaming: same idea as repair; in case of error/disconnection the stream
should be resumed at the latest clustering-level checkpoint, or at least we
should checkpoint every xxx clustering values.

- partition index: find a way to index a huge partition efficiently. Right
now huge partitions have a dramatic impact on the partition index. The work
of Michael Kjellman on birch indices (CASSANDRA-9754) is going in the right
direction.

About tombstones, there is a recent research paper about Dotted DB and an
attempt to implement deletes without using tombstones:
http://haslab.uminho.pt/tome/files/dotteddb_srds.pdf
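The repair/streaming granularity idea above can be sketched roughly like this (a hypothetical illustration, not Cassandra code; the chunk size `n` stands in for the "every xxx clustering values" checkpoint):

```python
# Hash every n clustering values so two replicas can narrow a mismatch
# down to a small slice of a huge partition, instead of a whole
# Merkle-leaf token range.
import hashlib

def chunk_digests(rows, n=100):
    """rows: (clustering_key, value) pairs in clustering order.
    Returns [(first_key_in_chunk, digest), ...]."""
    out = []
    for i in range(0, len(rows), n):
        chunk = rows[i:i + n]
        digest = hashlib.sha256(repr(chunk).encode()).hexdigest()
        out.append((chunk[0][0], digest))
    return out

def mismatched_chunks(a, b):
    # Replicas exchange per-chunk digests; only differing slices need
    # to be re-streamed, and a resumed stream restarts at a chunk key.
    return [ka for (ka, da), (kb, db) in zip(a, b) if da != db]
```

The same chunk boundaries would double as the streaming checkpoints: on disconnection, resume from the first key of the last unacknowledged chunk.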





Re: A blog about Cassandra in the IoT arena

2018-08-23 Thread Rahul Singh
Agreed. One of the ideas I had on partition size is to automatically and 
synthetically shard based on some basic patterns seen in the data.

It could be implemented as a tool that would create a new table with an 
additional part of the key that is an automatically created shard, or it 
would use an existing key and then migrate the data.

The internal automatic shard would adjust as needed and keep “subpartitions” 
or “rowsets” but return the full partition given some special CQL.

This is done today at the data access layer and in the data model design, but 
it’s pretty much a step-by-step process that could be done algorithmically.
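A minimal sketch of that synthetic-sharding idea, under assumptions of my own (the fixed bucket count, the `sensor_id`/`day` key shape, and the `query_bucket` helper are all hypothetical):

```python
# Split one logical partition into NUM_SHARDS physical ones by adding
# a derived bucket to the partition key; reading the "full partition"
# fans out to every bucket and merges by clustering order.
import heapq
import zlib

NUM_SHARDS = 8  # assumed fixed bucket count

def shard(sensor_id: str, day: str) -> int:
    # Stable bucket derived from values the writer already has,
    # so a write needs no coordination to pick its sub-partition.
    return zlib.crc32(f"{sensor_id}:{day}".encode()) % NUM_SHARDS

def read_full_partition(query_bucket, sensor_id):
    # query_bucket(sensor_id, s) stands for one SELECT per physical
    # bucket, each returning rows sorted by the clustering column.
    return list(heapq.merge(*(query_bucket(sensor_id, s)
                              for s in range(NUM_SHARDS))))
```

The “special CQL” mentioned above would essentially hide this fan-out-and-merge behind a single query.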

Regarding tombstones: maybe we need another thread dedicated to cleaning 
tombstones, separate from compaction. Depending on the amount of tombstones 
and a threshold, it would be dedicated to deletion. It may be an edge case, 
but people face issues with tombstones all the time because they don’t know 
better.
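A very simplified sketch of such a dedicated cleaning pass (hypothetical and deliberately naive: real purging must also check that no older data in other SSTables is still shadowed by the tombstone):

```python
# Rewrite one SSTable's rows, dropping only tombstones whose grace
# period has expired, without doing any of compaction's merge work.
import time

GC_GRACE_SECONDS = 10 * 24 * 3600  # assumed grace period

def purge_tombstones(rows, now=None):
    """rows: (key, timestamp_seconds, value); value None marks a
    tombstone. Keeps live cells and tombstones still within grace."""
    now = time.time() if now is None else now
    return [(k, ts, v) for k, ts, v in rows
            if v is not None or now - ts < GC_GRACE_SECONDS]
```

Running a pass like this on a tombstone-threshold trigger, independently of compaction scheduling, is the idea being floated above.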

Rahul


Re: A blog about Cassandra in the IoT arena

2018-08-23 Thread DuyHai Doan
As I used to tell some people, the day we make:

1. partition size unlimited, or at least huge partitions easily manageable
(compaction, repair, streaming, partition index file)
2. tombstones a non-issue

that day, Cassandra will dominate any other IoT technology out there.

Until then ...



Re: A blog about Cassandra in the IoT arena

2018-08-23 Thread Rahul Singh
Good analysis of how the different key structures affect use cases and 
performance. I think you could extend this article with a potential 
evaluation of FiloDB, which specifically tries to solve the OLAP issue with 
arbitrary queries.

Another option is leveraging Elassandra (index in Elasticsearch colocated 
with C*) or DataStax (index in Solr colocated with C*).

I personally haven’t used SnappyData, but that’s another Spark-based DB that 
could be leveraged for performant real-time queries on the OLTP side.

Rahul


A blog about Cassandra in the IoT arena

2018-08-23 Thread Affan Syed
Hi,

We wrote a blog about some of the results that engineers from AN10 shared
earlier.

I am sharing it here for comments and discussion.

http://www.an10.io/technology/cassandra-and-iot-queries-are-they-a-good-match/


Thank you.



- Affan