Re: Academic paper about Cassandra database compaction

2018-05-15 Thread Jeff Jirsa
On Mon, May 14, 2018 at 11:04 AM, Lucas Benevides <
lu...@maurobenevides.com.br> wrote:

> Thank you Jeff Jirsa for your comments,
>
> How can we do this:  "fix this by not scheduling the major compaction
> until we know all of the sstables in the window are available to be
> compacted"?
>
>
That would require a change to TWCS itself. Right here, where we grab the
not-currently-compacting sstables (
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/compaction/TimeWindowCompactionStrategy.java#L110
), we'd also grab the compacting set, and if the candidate sstables for the
task fell in the same window as any of the compacting sstables (respecting the
repaired/unrepaired/pending-repair sets), then we'd skip compacting until
the previous compactions finished.
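
A minimal sketch of that check, written in Python rather than Cassandra's Java and meant only to illustrate the idea. The names `SSTable`, `window_of`, and `should_skip` are hypothetical and do not exist in Cassandra. It assumes windows are assigned by each sstable's max timestamp, and a plain string stands in for the repaired/unrepaired/pending-repair sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    name: str
    max_timestamp: int  # seconds since epoch of the newest cell (assumption)
    repair_state: str   # "repaired", "unrepaired" or "pending" (stand-in)


def window_of(sstable: SSTable, window_seconds: int) -> int:
    """Lower bound of the time window the sstable falls into."""
    return sstable.max_timestamp - (sstable.max_timestamp % window_seconds)


def should_skip(candidates, compacting, window_seconds: int) -> bool:
    """True if any candidate shares a window and repair state with an
    sstable that is still being compacted."""
    busy = {(window_of(s, window_seconds), s.repair_state) for s in compacting}
    return any(
        (window_of(s, window_seconds), s.repair_state) in busy
        for s in candidates
    )
```

With a one-hour window, candidates from the first hour would be held back while another sstable from that same hour and repair set is still compacting, and released once the compacting set no longer touches that window.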


> About the column-family schema, I had to customize the cassandra-stress
> tool so that it could create a reasonable number of rows per partition. In
> the default behavior it keeps creating repeated clustering keys for each
> partition, and so most data get updated instead of inserted.
>

A similar customization may be useful to create partitions that are
narrowly bucketed into fixed-size time windows (a common schema in IoT
use cases).

- Jeff







Re: Academic paper about Cassandra database compaction

2018-05-14 Thread Lucas Benevides
Hello kooljava2,

There aren't many books about Cassandra, but one of the best known is
"Cassandra: The Definitive Guide: Distributed Data at Web Scale", by Hewitt.
The problem is that Cassandra evolves very fast, so these books get out of
date quickly.
To understand some concepts that are common to many different NoSQL databases
(many of them originating in the Distributed Systems area), there is the
book "NoSQL Distilled", by Pramod Sadalage and Martin Fowler.

Unfortunately, the documentation is also not Cassandra's strongest point,
which is why this group is so important. But everyone can contribute to
improving it.

Lucas B. Dias



2018-05-14 13:56 GMT-03:00 kooljava2 :

> Hello,
>
> Thank you Lucas for sharing. I am still a beginner in the Cassandra NoSQL
> world. Are there any other good books related to performance tuning and
> an architecture overview?
>
> Thank you.


Re: Academic paper about Cassandra database compaction

2018-05-14 Thread Lucas Benevides
Thank you Jeff Jirsa for your comments,

How can we do this:  "fix this by not scheduling the major compaction until
we know all of the sstables in the window are available to be compacted"?

About the column-family schema, I had to customize the cassandra-stress
tool so that it could create a reasonable number of rows per partition. In
the default behavior it keeps creating repeated clustering keys for each
partition, and so most data get updated instead of inserted.

Lucas B. Dias



Re: Academic paper about Cassandra database compaction

2018-05-14 Thread Jeff Jirsa
Interesting!

I suspect I know what causes the increased disk usage in TWCS, and it's a
solvable problem. The problem is roughly something like this:
- Window 1 has sstables 1, 2, 3, 4, 5, 6
- We start compacting 1, 2, 3, 4 (using STCS-in-TWCS first window)
- The TWCS window rolls over
- We flush (sstable 7), and trigger the TWCS window major compaction, which
starts compacting 5, 6, 7 + any other sstable from that window
- If the first compaction (1,2,3,4) has finished by the time sstable 7 is
flushed, we'll include its result in that compaction; if it hasn't, we'll
have to do the major compaction twice to guarantee we have exactly one
sstable per window, which will temporarily increase disk usage

We can likely fix this by not scheduling the major compaction until we know
all of the sstables in the window are available to be compacted.
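
The sequence above can be modelled with a toy simulation. This is only a sketch of the scenario as described here, not of Cassandra's scheduler: the function name `majors_needed` and the sstable labels are made up for illustration, and it assumes the in-flight compaction finishes right after the first major compaction starts.

```python
def majors_needed(first_done_before_flush: bool) -> int:
    """Count the major compactions needed to reach one sstable in the window."""
    live = {"1", "2", "3", "4", "5", "6"}
    compacting = {"1", "2", "3", "4"}  # STCS-in-TWCS compaction in flight

    if first_done_before_flush:
        # The first compaction finished: its inputs are replaced by one output.
        live = (live - compacting) | {"1-4"}
        compacting = set()

    live.add("7")  # flush after the window rolls over

    majors = 0
    while len(live) > 1:
        inputs = live - compacting  # only idle sstables can be picked
        live = (live - inputs) | {f"major-{majors + 1}"}
        majors += 1
        if compacting:
            # The in-flight compaction finishes and leaves a second sstable.
            live = (live - compacting) | {"1-4"}
            compacting = set()
    return majors
```

In this model the window needs one major compaction when the first compaction finishes before the flush, and two when it does not; the second pass rewrites data that was already rewritten once, which is where the temporary extra disk usage comes from.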

Also, your data model is probably typical, but not well suited to time
series use cases. If you find my 2016 Cassandra Summit TWCS talk (it's on
YouTube), I mention aligning partition keys to TWCS windows, which involves
adding a second component to the partition key. This is hugely important in
terms of making sure TWCS data expires quickly and avoiding having to read
from more than one TWCS window at a time.
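
As a rough illustration of that second partition key component, here is a sketch that derives a time bucket from the write timestamp. The function name `bucketed_partition_key`, the `sensor_id` field, and the one-day default are assumptions for the example, not something taken from the paper or from Cassandra; the bucket size should match whatever window size the table's TWCS options use.

```python
from datetime import datetime, timezone


def bucketed_partition_key(sensor_id: str, ts: datetime, window_hours: int = 24):
    """Return (sensor_id, bucket), where bucket is the start of the
    fixed-size window (in epoch seconds) that contains ts."""
    window_seconds = window_hours * 3600
    epoch = int(ts.timestamp())
    bucket = epoch - (epoch % window_seconds)
    return (sensor_id, bucket)
```

Two readings from the same sensor on the same UTC day land in the same partition, and a reading from the next day starts a new one, so a query for one day only has to touch data written in one window.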


- Jeff





Re: Academic paper about Cassandra database compaction

2018-05-14 Thread kooljava2
Hello,

Thank you Lucas for sharing. I am still a beginner in the Cassandra NoSQL
world. Are there any other good books related to performance tuning and
an architecture overview?

Thank you.


Re: Academic paper about Cassandra database compaction

2018-05-14 Thread Nitan Kainth
Hi Lucas,

I am not able to download it. Can you share it as an attachment by email?



Regards,
Nitan K.
Cassandra and Oracle Architect/SME
Datastax Certified Cassandra expert
Oracle 10g Certified



Academic paper about Cassandra database compaction

2018-05-14 Thread Lucas Benevides
Dear community,

I want to tell you about my paper published at a conference in March. The
title is "NoSQL Database Performance Tuning for IoT Data - Cassandra Case
Study" and it is available (not for free) at:
http://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0006782702770284

TWCS is used and compared with DTCS.

I hope you can download it; unfortunately, I cannot send copies because the
publisher holds the copyright.

Lucas B. Dias