Re: on-disk size vs partition-size in cfhistograms

2016-05-20 Thread Alain RODRIGUEZ
Hi Joseph,

> The approach I took was to insert an increasing number of rows into a replica
> of the table to be sized, watch the size of the "data" directory (after doing
> nodetool flush and compact), and calculate the average size per row (total
> directory size / count of rows). Can this be considered a valid approach to
> extrapolate for future growth of data?


You also need to consider the replication factor you are going to use and
the percentage of the data owned by the node you are looking at.
Also, when you run "nodetool compact" you get the minimal possible size,
whereas in real conditions you will probably never be in this state. If
you update the same row again and again, fragments of the row will be spread
across multiple SSTables, with more overhead. Plus, if you plan to TTL data or
to delete some, you will always have some tombstones in there too, and maybe
for a long time, depending on how you tune Cassandra and on your use case I
guess.

So I would say this approach is not very accurate. My guess is you will end
up using more space than you think. But it is also harder to do capacity
planning from nothing than from a working system.
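
If you still want to put rough numbers on it, here is a minimal
back-of-the-envelope sketch in Python. Every value in it is a made-up
placeholder to replace with your own measurements and cluster settings, and
the overhead factor in particular is only a guess you would refine by
watching a real workload:

# Back-of-the-envelope disk estimate from a measured on-disk average row size.
# All numbers are illustrative placeholders, not values from this thread.
avg_row_size_bytes = 2 * 1024            # measured: data directory size / row count
row_count          = 500 * 1000 * 1000   # rows expected at the planning horizon
replication_factor = 3                   # each row is stored on RF nodes in total
overhead_factor    = 1.5                 # margin for row fragments, tombstones, pending compactions

cluster_bytes = avg_row_size_bytes * row_count * replication_factor * overhead_factor

node_count = 6                           # assuming token ownership is evenly balanced
per_node_bytes = cluster_bytes / node_count

print("cluster: %.2f TB, per node: %.1f GB"
      % (cluster_bytes / 1e12, per_node_bytes / 1e9))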

> It seems the size in cfhisto varies widely from the calculated value
> using the approach detailed above (avg 2KB/row). Could this difference be
> due to compression, or are there any other factors at play here?


It could be compression indeed. To check that, you would need to dig into the
code. What Cassandra version are you planning to use? By the way, if disk
space matters to you, as it seems it does, you might want to use Cassandra
3.0+: http://www.datastax.com/2015/12/storage-engine-30,
http://www.planetcassandra.org/blog/this-week-in-cassandra-3-0-storage-engine-deep-dive-3112016/,
http://thelastpickle.com/blog/2016/03/04/introductiont-to-the-apache-cassandra-3-storage-engine.html
.


> What would be the typical use/interpretation of the "partition size"
> metric?


I guess people use that to spot wide rows mainly, but if you are happy
summing those, it should be fine as long as you know what you are summing.
Each Cassandra operator has their own tips and their own way of using the
tools available, and might perform operations differently depending on their
needs and experience :-). So if it looks relevant to you, go ahead. For
example, if you find out that this is the data before compression, then just
applying the compression ratio to your sum should be good. Still, keep my
first point above in mind.
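
For what it is worth, here is a minimal sketch of that kind of cross-check.
The buckets are copied from the snipped cfhistograms output above (so the
totals are incomplete), and the 0.6 compression ratio is just a made-up
placeholder; the real ratio for your table should show up in "nodetool
cfstats" output:

# Weighted partition-size estimate from the cfhistograms buckets quoted above,
# then a rough on-disk figure using a placeholder compression ratio.
buckets = [            # (partition size in bytes, partition count), snipped output only
    (642, 221), (770, 2328), (924, 328858),
    (8239, 153178),
    (24601, 16973), (29521, 10805),
    (219342, 23), (263210, 6), (315852, 4),
]

partition_count = sum(count for _, count in buckets)
uncompressed_bytes = sum(size * count for size, count in buckets)
avg_partition_bytes = uncompressed_bytes / float(partition_count)

compression_ratio = 0.6    # placeholder: compressed size / uncompressed size
estimated_on_disk_bytes = uncompressed_bytes * compression_ratio

print("avg partition: %.0f bytes, on disk (listed buckets only): %.2f GB"
      % (avg_partition_bytes, estimated_on_disk_bytes / 1e9))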

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-05-06 13:27 GMT+02:00 Joseph Tech:

> Hi,
>
> I am trying to get some baselines for capacity planning. The approach I
> took was to insert an increasing number of rows into a replica of the table
> to be sized, watch the size of the "data" directory (after doing nodetool
> flush and compact), and calculate the average size per row (total directory
> size / count of rows). Can this be considered a valid approach to
> extrapolate for future growth of data?
>
> Related to this, is there any information we can gather from the
> partition-size histogram of cfhistograms (snipped output for my table below):
>
> Partition Size (bytes)
>642 bytes: 221
>770 bytes: 2328
>924 bytes: 328858
> ..
> 8239 bytes: 153178
> ...
>  24601 bytes: 16973
>  29521 bytes: 10805
> ...
> 219342 bytes: 23
> 263210 bytes: 6
> 315852 bytes: 4
>
> It seems the size in cfhisto varies widely from the calculated value
> using the approach detailed above (avg 2KB/row). Could this difference be
> due to compression, or are there any other factors at play here? What
> would be the typical use/interpretation of the "partition size" metric?
>
> The table definition is like:
>
> CREATE TABLE abc (
>   key1 text,
>   col1 text,
>   PRIMARY KEY ((key1))
> ) WITH
>   bloom_filter_fp_chance=0.01 AND
>   caching='KEYS_ONLY' AND
>   comment='' AND
>   dclocal_read_repair_chance=0.10 AND
>   gc_grace_seconds=864000 AND
>   index_interval=128 AND
>   read_repair_chance=0.00 AND
>   replicate_on_write='true' AND
>   populate_io_cache_on_flush='false' AND
>   default_time_to_live=0 AND
>   speculative_retry='99.0PERCENTILE' AND
>   memtable_flush_period_in_ms=0 AND
>   compaction={'sstable_size_in_mb': '50', 'class':
> 'LeveledCompactionStrategy'} AND
>   compression={'sstable_compression': 'LZ4Compressor'};
>
> Thanks,
> Joseph


on-disk size vs partition-size in cfhistograms

2016-05-06 Thread Joseph Tech
Hi,

I am trying to get some baselines for capacity planning. The approach I
took was to insert an increasing number of rows into a replica of the table
to be sized, watch the size of the "data" directory (after doing nodetool
flush and compact), and calculate the average size per row (total directory
size / count of rows). Can this be considered a valid approach to extrapolate
for future growth of data?

Related to this, is there any information we can gather from the partition-size
histogram of cfhistograms (snipped output for my table below):

Partition Size (bytes)
   642 bytes: 221
   770 bytes: 2328
   924 bytes: 328858
..
8239 bytes: 153178
...
 24601 bytes: 16973
 29521 bytes: 10805
...
219342 bytes: 23
263210 bytes: 6
315852 bytes: 4

It seems the size in cfhisto varies widely from the calculated value
using the approach detailed above (avg 2KB/row). Could this difference be
due to compression, or are there any other factors at play here? What
would be the typical use/interpretation of the "partition size" metric?

The table definition is like:

CREATE TABLE abc (
  key1 text,
  col1 text,
  PRIMARY KEY ((key1))
) WITH
  bloom_filter_fp_chance=0.01 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.10 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.00 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'sstable_size_in_mb': '50', 'class':
'LeveledCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

Thanks,
Joseph