Re: clarification on 100k tombstone limit in indexes

2014-08-13 Thread DuyHai Doan
add an additional integer column to the partition key (making it a
composite partition key if it isn't already).  When inserting, randomly
pick a value between, say, 0 and 10 to use for this column -- Due to the
low cardinality of the bucket (only 10 values), there is no guarantee that
the partitions will be distributed evenly. But it's better than nothing.

Alternatively, instead of using a random number, you could hash the other
key components and use the lowest bits for the value.  This has the
advantage of being deterministic -- Does it work with VNodes, where
tokens are split into 256 ranges per node and shuffled across all nodes?


On Tue, Aug 12, 2014 at 7:39 PM, Tyler Hobbs ty...@datastax.com wrote:


 On Mon, Aug 11, 2014 at 4:17 PM, Ian Rose ianr...@fullstory.com wrote:


 You'd be better off creating a manual reverse-index to track modification
 date, something like this -- I had considered an approach like this but
 my concern is that for any given minute *all* of the updates will be
 handled by a single node, right?  For example, if the minute_bucket is 2739
 then for that one minute, every single item update will flow to the node at
 HASH(2739).  Assuming I am thinking about that right, that seemed like a
 potential scaling bottleneck, which scared me off that approach.


 If you're concerned about bottlenecking on one node (or set of replicas)
 during the minute, add an additional integer column to the partition key
 (making it a composite partition key if it isn't already).  When inserting,
 randomly pick a value between, say, 0 and 10 to use for this column.  When
 reading, read all 10 partitions and merge them.  (Alternatively, instead of
 using a random number, you could hash the other key components and use the
 lowest bits for the value.  This has the advantage of being deterministic.)


 --
 Tyler Hobbs
 DataStax http://datastax.com/



Re: clarification on 100k tombstone limit in indexes

2014-08-13 Thread Tyler Hobbs
On Wed, Aug 13, 2014 at 4:35 AM, DuyHai Doan doanduy...@gmail.com wrote:

 add an additional integer column to the partition key (making it a
 composite partition key if it isn't already).  When inserting, randomly
 pick a value between, say, 0 and 10 to use for this column -- Due to the
 low cardinality of the bucket (only 10 values), there is no guarantee that
 the partitions will be distributed evenly. But it's better than nothing.


It's important to think about it probabilistically, i.e. what is the
probability that all ten partitions belong to the same node?  If you have
a ten-node cluster (assume RF=1 for simplicity), there's a 1/10^9 (one in a
billion) chance that a single node is the owner of all partitions.  So
it's quite a bit better than nothing.  If you want to improve your odds,
bump the number up.  But, keep in mind that it's a balance, because reads
become more expensive.



 Alternatively, instead of using a random number, you could hash the
 other key components and use the lowest bits for the value.  This has the
 advantage of being deterministic -- Does it work with VNodes, where
 tokens are split into 256 ranges per node and shuffled across all nodes?


Yes, it works perfectly fine with vnodes.
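
For illustration, here is a minimal client-side sketch of that deterministic
variant (the helper name, the md5 choice and the 10-shard count are
illustrative assumptions, not something prescribed in this thread):

import hashlib

NUM_SHARDS = 10  # matches the "read all 10 partitions" fan-out described above

def shard_for(minute_bucket, item_id):
    # Hash the other key components and keep only the lowest bytes of the
    # digest. This is plain client-side hashing, entirely independent of the
    # cluster's token ring, which is why vnodes make no difference here.
    digest = hashlib.md5(("%s:%s" % (minute_bucket, item_id)).encode("ascii")).digest()
    return int.from_bytes(digest[-4:], "big") % NUM_SHARDS

The same (minute_bucket, item_id) pair always maps to the same shard, so
writers need no coordination, and readers still fan out over all NUM_SHARDS
partitions.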


-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: clarification on 100k tombstone limit in indexes

2014-08-12 Thread DuyHai Doan
Hello Ian

So that way each index entry *will* have quite a few entries and the index
as a whole won't grow too big.  Is my thinking correct here? -- In this
case yes. Do not forget that for each date value, there will be 1
corresponding index value + 10 updates. If you have an approximate count
for a few entries, some quick math should give you an idea of how
large the index partition is.

I had considered an approach like this but my concern is that for any
given minute *all* of the updates will be handled by a single node, right?
-- If your time resolution is a minute, yes, it will be a problem. And
depending on the insert rate, it can quickly become a bottleneck during
this minute.

 The manual index approach suffers a lot from this bottleneck issue under
heavy workloads; that's the main reason they implemented a distributed
secondary index. There is no free lunch, though. What you gain in terms of
control and tuning with the manual index, you lose on the load distribution
side.




On Mon, Aug 11, 2014 at 11:17 PM, Ian Rose ianr...@fullstory.com wrote:

 Hi DuyHai,

 Thanks for the detailed response!  A few responses below:

 On a side note, your usage of a secondary index is not the best one.
 Indeed, indexing the update date will lead to a situation where for one
 date, you'll mostly have one or a few matching items (assuming that the
 update date resolution is small enough and update rate is not intense).
 -- I should have mentioned this originally (slipped my mind) but to deal
 specifically with this problem I had planned to use a timestamp with a
 resolution of 1 minute (like your minute_bucket).  So that way each index
 entry *will* have quite a few entries and the index as a whole won't grow
 too big.  Is my thinking correct here?

 You'd be better off creating a manual reverse-index to track modification
 date, something like this -- I had considered an approach like this but my
 concern is that for any given minute *all* of the updates will be handled
 by a single node, right?  For example, if the minute_bucket is 2739 then
 for that one minute, every single item update will flow to the node at
 HASH(2739).  Assuming I am thinking about that right, that seemed like a
 potential scaling bottleneck, which scared me off that approach.

 Cheers,
 Ian




 On Sun, Aug 10, 2014 at 5:20 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Hello Ian

 It sounds like this 100k limit is, indeed, a global limit as opposed
 to a per-row limit -- The threshold applies to each REQUEST, not per
 partition or globally.

 The threshold does not apply to a partition (physical row) simply because
 in one request you can fetch data from many partitions (multi get slice).
 There was a JIRA about this here:
 https://issues.apache.org/jira/browse/CASSANDRA-6865

 Are these tombstones ever GCed out of the index? -- Yes they are,
 during compactions of the index column family.

 How frequently? -- That's the real pain. Indeed, you do not have any
 control over the tuning of secondary index CF compaction. As far as I know,
 the compaction settings (strategy, min/max thresholds...) are inherited
 from those of the base table.

 Now, from a quick look at your data model, it seems that you have a
 skinny partition pattern. Since you mentioned that the date is updated only
 10 times max, you should not run into the tombstone threshold issue.

 On a side note, your usage of a secondary index is not the best one.
 Indeed, indexing the update date will lead to a situation where for one
 date, you'll mostly have one or a few matching items (assuming that the
 update date resolution is small enough and update rate is not intense). It
 is the high-cardinality scenario to be avoided (
 http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html).
 Plus, the query on the index (find all items where last_updated < [now - 30
 minutes]) makes things worse since it is not an exact match but an inequality.

  You'd be better off creating a manual reverse-index to track modification
 date, something like this:

 CREATE TABLE last_updated_item (
 minute_bucket int, // format MMDDHHmm
 last_update_date timestamp,
 item_id ascii,
 PRIMARY KEY(minute_bucket, last_update_date)
 );

  The last_update_date column is quite self-explanatory. The minute_bucket
 is trickier. The idea is to split time into 30-minute buckets: 00:00
 to 00:30 is bucket 1, 00:30 to 01:00 is bucket 2, and so on. For a whole
 day, you'd have 48 buckets. We need to put data into buckets to avoid
 ultra-wide rows, since you mentioned that there are ~10 items (so ~10
 updates) per second. Of course, 30 minutes is just an example; you can tune
 it down to a window of 5 minutes or 1 minute, depending on the insertion
 rate.





 On Sun, Aug 10, 2014 at 10:02 PM, Ian Rose ianr...@fullstory.com wrote:

 Hi Mark -

 Thanks for the clarification but as I'm not too familiar with the nuts &
 bolts of Cassandra I'm not sure how to apply that info to my current
 situation.  It sounds like this 

Re: clarification on 100k tombstone limit in indexes

2014-08-12 Thread Ian Rose
Makes sense - thanks again!


On Tue, Aug 12, 2014 at 9:45 AM, DuyHai Doan doanduy...@gmail.com wrote:

 Hello Ian

 So that way each index entry *will* have quite a few entries and the
 index as a whole won't grow too big.  Is my thinking correct here? -- In
 this case yes. Do not forget that for each date value, there will be 1
 corresponding index value + 10 updates. If you have an approximate count
 for a few entries, some quick math should give you an idea of how
 large the index partition is.

 I had considered an approach like this but my concern is that for any
 given minute *all* of the updates will be handled by a single node,
 right? -- If your time resolution is a minute, yes, it will be a problem.
 And depending on the insert rate, it can quickly become a bottleneck
 during this minute.

  The manual index approach suffers a lot from this bottleneck issue under
 heavy workloads; that's the main reason they implemented a distributed
 secondary index. There is no free lunch, though. What you gain in terms of
 control and tuning with the manual index, you lose on the load distribution
 side.




 On Mon, Aug 11, 2014 at 11:17 PM, Ian Rose ianr...@fullstory.com wrote:

 Hi DuyHai,

 Thanks for the detailed response!  A few responses below:

 On a side note, your usage of a secondary index is not the best one.
 Indeed, indexing the update date will lead to a situation where for one
 date, you'll mostly have one or a few matching items (assuming that the
 update date resolution is small enough and update rate is not intense).
 -- I should have mentioned this originally (slipped my mind) but to deal
 specifically with this problem I had planned to use a timestamp with a
 resolution of 1 minute (like your minute_bucket).  So that way each index
 entry *will* have quite a few entries and the index as a whole won't
 grow too big.  Is my thinking correct here?

 You'd be better off creating a manual reverse-index to track modification
 date, something like this -- I had considered an approach like this but
 my concern is that for any given minute *all* of the updates will be
 handled by a single node, right?  For example, if the minute_bucket is 2739
 then for that one minute, every single item update will flow to the node at
 HASH(2739).  Assuming I am thinking about that right, that seemed like a
 potential scaling bottleneck, which scared me off that approach.

 Cheers,
 Ian




 On Sun, Aug 10, 2014 at 5:20 PM, DuyHai Doan doanduy...@gmail.com
 wrote:

 Hello Ian

 It sounds like this 100k limit is, indeed, a global limit as opposed
 to a per-row limit -- The threshold applies to each REQUEST, not per
 partition or globally.

 The threshold does not apply to a partition (physical row) simply
 because in one request you can fetch data from many partitions (multi get
 slice). There was a JIRA about this here:
 https://issues.apache.org/jira/browse/CASSANDRA-6865

 Are these tombstones ever GCed out of the index? -- Yes they are,
 during compactions of the index column family.

 How frequently? -- That's the real pain. Indeed, you do not have any
 control over the tuning of secondary index CF compaction. As far as I know,
 the compaction settings (strategy, min/max thresholds...) are inherited
 from those of the base table.

 Now, from a quick look at your data model, it seems that you have a
 skinny partition pattern. Since you mentioned that the date is updated only
 10 times max, you should not run into the tombstone threshold issue.

 On a side note, your usage of a secondary index is not the best one.
 Indeed, indexing the update date will lead to a situation where for one
 date, you'll mostly have one or a few matching items (assuming that the
 update date resolution is small enough and update rate is not intense). It
 is the high-cardinality scenario to be avoided (
 http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html).
 Plus, the query on the index (find all items where last_updated < [now - 30
 minutes]) makes things worse since it is not an exact match but an inequality.

  You'd be better off creating a manual reverse-index to track modification
 date, something like this:

 CREATE TABLE last_updated_item (
 minute_bucket int, // format MMDDHHmm
 last_update_date timestamp,
 item_id ascii,
 PRIMARY KEY(minute_bucket, last_update_date)
 );

  The last_update_date column is quite self-explanatory. The
 minute_bucket is trickier. The idea is to split time into 30-minute
 buckets: 00:00 to 00:30 is bucket 1, 00:30 to 01:00 is bucket 2, and so on.
 For a whole day, you'd have 48 buckets. We need to put data into buckets to
 avoid ultra-wide rows, since you mentioned that there are ~10 items (so ~10
 updates) per second. Of course, 30 minutes is just an example; you can tune
 it down to a window of 5 minutes or 1 minute, depending on the insertion
 rate.





 On Sun, Aug 10, 2014 at 10:02 PM, Ian Rose ianr...@fullstory.com
 wrote:

 Hi Mark -

 Thanks for the clarification but as I'm not too 

Re: clarification on 100k tombstone limit in indexes

2014-08-12 Thread Tyler Hobbs
On Mon, Aug 11, 2014 at 4:17 PM, Ian Rose ianr...@fullstory.com wrote:


 You'd be better off creating a manual reverse-index to track modification
 date, something like this -- I had considered an approach like this but my
 concern is that for any given minute *all* of the updates will be handled
 by a single node, right?  For example, if the minute_bucket is 2739 then
 for that one minute, every single item update will flow to the node at
 HASH(2739).  Assuming I am thinking about that right, that seemed like a
 potential scaling bottleneck, which scared me off that approach.


If you're concerned about bottlenecking on one node (or set of replicas)
during the minute, add an additional integer column to the partition key
(making it a composite partition key if it isn't already).  When inserting,
randomly pick a value between, say, 0 and 10 to use for this column.  When
reading, read all 10 partitions and merge them.  (Alternatively, instead of
using a random number, you could hash the other key components and use the
lowest bits for the value.  This has the advantage of being deterministic.)
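
As a concrete sketch, reusing the last_updated_item table proposed earlier in
the thread (the "shard" column name and the fixed 0-9 range are illustrative
assumptions):

CREATE TABLE last_updated_item (
    minute_bucket int,
    shard int,                      // e.g. 0-9, picked at write time
    last_update_date timestamp,
    item_id ascii,
    PRIMARY KEY ((minute_bucket, shard), last_update_date)
);

// read side: query every shard of the bucket and merge client-side
// (2739 is the bucket value from Ian's example)
SELECT item_id, last_update_date FROM last_updated_item
WHERE minute_bucket = 2739 AND shard = 0;
// ...repeat for shard = 1 through 9...

Each shard is its own partition, so the writes for a given minute are spread
over up to 10 sets of replicas instead of landing on one.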


-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: clarification on 100k tombstone limit in indexes

2014-08-11 Thread Ian Rose
Hi DuyHai,

Thanks for the detailed response!  A few responses below:

On a side note, your usage of a secondary index is not the best one. Indeed,
indexing the update date will lead to a situation where for one date,
you'll mostly have one or a few matching items (assuming that the update
date resolution is small enough and update rate is not intense). -- I
should have mentioned this originally (slipped my mind) but to deal
specifically with this problem I had planned to use a timestamp with a
resolution of 1 minute (like your minute_bucket).  So that way each index
entry *will* have quite a few entries and the index as a whole won't grow
too big.  Is my thinking correct here?

You'd be better off creating a manual reverse-index to track modification
date, something like this -- I had considered an approach like this but my
concern is that for any given minute *all* of the updates will be handled
by a single node, right?  For example, if the minute_bucket is 2739 then
for that one minute, every single item update will flow to the node at
HASH(2739).  Assuming I am thinking about that right, that seemed like a
potential scaling bottleneck, which scared me off that approach.

Cheers,
Ian




On Sun, Aug 10, 2014 at 5:20 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Hello Ian

 It sounds like this 100k limit is, indeed, a global limit as opposed to
 a per-row limit -- The threshold applies to each REQUEST, not per
 partition or globally.

 The threshold does not apply to a partition (physical row) simply because
 in one request you can fetch data from many partitions (multi get slice).
 There was a JIRA about this here:
 https://issues.apache.org/jira/browse/CASSANDRA-6865

 Are these tombstones ever GCed out of the index? -- Yes they are,
 during compactions of the index column family.

 How frequently? -- That's the real pain. Indeed, you do not have any
 control over the tuning of secondary index CF compaction. As far as I know,
 the compaction settings (strategy, min/max thresholds...) are inherited
 from those of the base table.

 Now, from a quick look at your data model, it seems that you have a
 skinny partition pattern. Since you mentioned that the date is updated only
 10 times max, you should not run into the tombstone threshold issue.

 On a side note, your usage of a secondary index is not the best one. Indeed,
 indexing the update date will lead to a situation where for one date,
 you'll mostly have one or a few matching items (assuming that the update
 date resolution is small enough and update rate is not intense). It is the
 high-cardinality scenario to be avoided (
 http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html).
 Plus, the query on the index (find all items where last_updated < [now - 30
 minutes]) makes things worse since it is not an exact match but an inequality.

  You'd be better off creating a manual reverse-index to track modification
 date, something like this:

 CREATE TABLE last_updated_item (
 minute_bucket int, // format MMDDHHmm
 last_update_date timestamp,
 item_id ascii,
 PRIMARY KEY(minute_bucket, last_update_date)
 );

  The last_update_date column is quite self-explanatory. The minute_bucket
 is trickier. The idea is to split time into 30-minute buckets: 00:00
 to 00:30 is bucket 1, 00:30 to 01:00 is bucket 2, and so on. For a whole
 day, you'd have 48 buckets. We need to put data into buckets to avoid
 ultra-wide rows, since you mentioned that there are ~10 items (so ~10
 updates) per second. Of course, 30 minutes is just an example; you can tune
 it down to a window of 5 minutes or 1 minute, depending on the insertion
 rate.





 On Sun, Aug 10, 2014 at 10:02 PM, Ian Rose ianr...@fullstory.com wrote:

 Hi Mark -

 Thanks for the clarification but as I'm not too familiar with the nuts &
 bolts of Cassandra I'm not sure how to apply that info to my current
 situation.  It sounds like this 100k limit is, indeed, a global limit as
 opposed to a per-row limit.  Are these tombstones ever GCed out of the
 index?  How frequently?  If not, then it seems like *any* index is at risk
 of reaching this tipping point; it's just that indexes on frequently
 updated columns will reach this point faster than indexes on rarely
 updated columns.

 Basically I'm trying to get some kind of sense for what frequently
 updated
 http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html
 means quantitatively.  As written, the docs make it sound dangerous to
 create an index on a column that is *ever* deleted or updated since
 there is no sense of how frequent is too frequent.

 Cheers,
 Ian







 On Sun, Aug 10, 2014 at 3:02 PM, Mark Reddy mark.re...@boxever.com
 wrote:

 Hi Ian,

 The issue here, which relates to both normal and index column families, is
 that scanning over a large number of tombstones can cause Cassandra to fall
 over due to increased GC pressure. This pressure is caused because tombstones
 will create DeletedColumn objects which consume heap. 

Re: clarification on 100k tombstone limit in indexes

2014-08-10 Thread Mark Reddy
Hi Ian,

The issue here, which relates to both normal and index column families, is
that scanning over a large number of tombstones can cause Cassandra to fall
over due to increased GC pressure. This pressure is caused because tombstones
will create DeletedColumn objects which consume heap. Also,
these DeletedColumn objects will have to be serialized and sent back to the
coordinator, thus increasing your response times. Take, for example, a row
that sees many deletes and that you query with a limit of 100. In a worst-case
scenario you could end up reading, say, 50k tombstones to reach the 100
'live' column limit, all of which have to be put on the heap and then sent
over the wire to the coordinator. This would be considered a Cassandra
anti-pattern.[1]

With that in mind, a debug warning was added in 1.2 to inform the user when
they were querying a row with more than 1,000 tombstones [2]. Then in 2.0 the
action was taken to drop requests that reach 100k tombstones [3] rather than
just printing out a warning. This is a safety measure, as it is not advised
to perform such a query, and it is a result of most people 'doing it wrong'.

For those people who understand the risk of scanning over large numbers of
tombstones there is a configuration option in the cassandra.yaml to
increase this threshold, tombstone_failure_threshold.[4]
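
For reference, the relevant settings in that file look roughly like this
(2.0-era defaults shown; raise them only if you understand the GC cost
described above):

# cassandra.yaml
tombstone_warn_threshold: 1000       # warn when a query scans more than this many tombstones
tombstone_failure_threshold: 100000  # abort queries that scan more than this many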


Mark

[1]
http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
[2] https://issues.apache.org/jira/browse/CASSANDRA-6042
[3] https://issues.apache.org/jira/browse/CASSANDRA-6117
[4]
https://github.com/jbellis/cassandra/blob/4ac18ae805d28d8f4cb44b42e2244bfa6d2875e1/conf/cassandra.yaml#L407-L417



On Sun, Aug 10, 2014 at 7:19 PM, Ian Rose ianr...@fullstory.com wrote:

 Hi -

 On this page (
 http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html),
 the docs state:

 Do not use an index [...] On a frequently updated or deleted column


 and


 *Problems using an index on a frequently updated or deleted column*¶
 http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html?scroll=concept_ds_sgh_yzz_zj__upDatIndx

 Cassandra stores tombstones in the index until the tombstone limit reaches
 100K cells. After exceeding the tombstone limit, the query that uses the
 indexed value will fail.



 I'm afraid I don't really understand this limit from its (brief)
 description.  I also saw this recent thread
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201403.mbox/%3CCABNXB2Bf4aeoDVpMNOxJ_e7aDez2EuZswMJx=jWfb8=oyo4...@mail.gmail.com%3E
  but
 I'm afraid it didn't help me much...


 *SHORT VERSION*

 If I have tens or hundreds of thousands of rows in a keyspace, where every
 row has an indexed column that is updated O(10) times during the lifetime
 of each row, is that going to cause problems for me?  If that 100k limit is 
 *per
 row* then I should be fine but if that 100k limit is *per keyspace* then
 I'd definitely exceed it quickly.


 *FULL EXPLANATION*

 In our system, items are created at a rate of ~10/sec.  Each item is
 updated ~10 times over the next few minutes (although in rare cases the
 number of updates, and the duration, might be several times as long).  Once
 the last update is received for an item, we select it from Cassandra,
 process the data, then delete the entire row.

 The tricky bit is that sometimes (maybe 30-40% of the time) we don't
 actually know when the last update has been received so we use a timeout:
 if an item hasn't been updated for 30 minutes, then we assume it is done
 and should process it as before (select, then delete).  So I am trying to
 design a schema that will allow for efficient queries of the form "find me
 all items that have not been updated in the past 30 minutes."  We plan to
 call this query once a minute.

 Here is my tentative schema:

 CREATE TABLE items (
   item_id ascii,
   last_updated timestamp,
   item_data list<blob>,
   PRIMARY KEY (item_id)
 )
 plus an index on last_updated.
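
 For concreteness, that index would be created with something like the
 following (the index name is an illustrative choice):

 CREATE INDEX items_last_updated_idx ON items (last_updated);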

 So updates to an existing item would just be lookup by item_id, append
 new data to item_data, and set last_updated to now.  And queries to find
 items that have timed out would use the index on last_updated: find all
 items where last_updated < [now - 30 minutes].

 Assuming, that is, that the aforementioned 100k tombstone limit won't
 bring this index crashing to a halt...

 Any clarification on this limit and/or suggestions on a better way to
 model/implement this system would be greatly appreciated!

 Cheers,
 Ian




Re: clarification on 100k tombstone limit in indexes

2014-08-10 Thread Ian Rose
Hi Mark -

Thanks for the clarification but as I'm not too familiar with the nuts &
bolts of Cassandra I'm not sure how to apply that info to my current
situation.  It sounds like this 100k limit is, indeed, a global limit as
opposed to a per-row limit.  Are these tombstones ever GCed out of the
index?  How frequently?  If not, then it seems like *any* index is at risk
of reaching this tipping point; it's just that indexes on frequently
updated columns will reach this point faster than indexes on rarely
updated columns.

Basically I'm trying to get some kind of sense for what frequently updated
http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html
means quantitatively.  As written, the docs make it sound dangerous to
create an index on a column that is *ever* deleted or updated since there
is no sense of how frequent is too frequent.

Cheers,
Ian







On Sun, Aug 10, 2014 at 3:02 PM, Mark Reddy mark.re...@boxever.com wrote:

 Hi Ian,

 The issue here, which relates to both normal and index column families, is
 that scanning over a large number of tombstones can cause Cassandra to fall
 over due to increased GC pressure. This pressure is caused because tombstones
 will create DeletedColumn objects which consume heap. Also,
 these DeletedColumn objects will have to be serialized and sent back to the
 coordinator, thus increasing your response times. Take, for example, a row
 that sees many deletes and that you query with a limit of 100. In a worst-case
 scenario you could end up reading, say, 50k tombstones to reach the 100
 'live' column limit, all of which have to be put on the heap and then sent
 over the wire to the coordinator. This would be considered a Cassandra
 anti-pattern.[1]

 With that in mind, a debug warning was added in 1.2 to inform the user when
 they were querying a row with more than 1,000 tombstones [2]. Then in 2.0
 the action was taken to drop requests that reach 100k tombstones [3] rather
 than just printing out a warning. This is a safety measure, as it is not
 advised to perform such a query, and it is a result of most people 'doing it
 wrong'.

 For those people who understand the risk of scanning over large numbers of
 tombstones there is a configuration option in the cassandra.yaml to
 increase this threshold, tombstone_failure_threshold.[4]


 Mark

 [1]
 http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
 [2] https://issues.apache.org/jira/browse/CASSANDRA-6042
 [3] https://issues.apache.org/jira/browse/CASSANDRA-6117
 [4]
 https://github.com/jbellis/cassandra/blob/4ac18ae805d28d8f4cb44b42e2244bfa6d2875e1/conf/cassandra.yaml#L407-L417



 On Sun, Aug 10, 2014 at 7:19 PM, Ian Rose ianr...@fullstory.com wrote:

 Hi -

 On this page (
 http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html),
 the docs state:

 Do not use an index [...] On a frequently updated or deleted column


 and


 *Problems using an index on a frequently updated or deleted column*¶
 http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html?scroll=concept_ds_sgh_yzz_zj__upDatIndx

 Cassandra stores tombstones in the index until the tombstone limit
 reaches 100K cells. After exceeding the tombstone limit, the query that
 uses the indexed value will fail.



 I'm afraid I don't really understand this limit from its (brief)
 description.  I also saw this recent thread
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201403.mbox/%3CCABNXB2Bf4aeoDVpMNOxJ_e7aDez2EuZswMJx=jWfb8=oyo4...@mail.gmail.com%3E
  but
 I'm afraid it didn't help me much...


 *SHORT VERSION*

 If I have tens or hundreds of thousands of rows in a keyspace, where
 every row has an indexed column that is updated O(10) times during the
 lifetime of each row, is that going to cause problems for me?  If that 100k
 limit is *per row* then I should be fine but if that 100k limit is *per
 keyspace* then I'd definitely exceed it quickly.


 *FULL EXPLANATION*

 In our system, items are created at a rate of ~10/sec.  Each item is
 updated ~10 times over the next few minutes (although in rare cases the
 number of updates, and the duration, might be several times as long).  Once
 the last update is received for an item, we select it from Cassandra,
 process the data, then delete the entire row.

 The tricky bit is that sometimes (maybe 30-40% of the time) we don't
 actually know when the last update has been received so we use a timeout:
 if an item hasn't been updated for 30 minutes, then we assume it is done
 and should process it as before (select, then delete).  So I am trying to
 design a schema that will allow for efficient queries of the form "find me
 all items that have not been updated in the past 30 minutes."  We plan to
 call this query once a minute.

 Here is my tentative schema:

 CREATE TABLE items (
   item_id ascii,
   last_updated timestamp,
   item_data list<blob>,
   PRIMARY KEY (item_id)
 )
 plus an index on last_updated.

 So updates to an existing item 

Re: clarification on 100k tombstone limit in indexes

2014-08-10 Thread Mark Reddy
Hi Ian

Are these tombstones ever GCed out of the index?  How frequently?


Yes, tombstones are removed after the time specified by gc_grace_seconds
has elapsed, which by default is 10 days and is configurable. Knowing and
understanding how Cassandra handles distributed deletes is key to designing
an efficient schema if you plan to delete often. There are lots of
resources on how deletes are handled in Cassandra, take a look at these
links for example:
http://wiki.apache.org/cassandra/DistributedDeletes
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_about_deletes_c.html
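
As a sketch, that window can also be tuned per table; the one-day value below
is purely illustrative, and it should stay comfortably longer than your
repair interval:

ALTER TABLE items WITH gc_grace_seconds = 86400;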

Mark


On Sun, Aug 10, 2014 at 9:02 PM, Ian Rose ianr...@fullstory.com wrote:

 Hi Mark -

 Thanks for the clarification but as I'm not too familiar with the nuts &
 bolts of Cassandra I'm not sure how to apply that info to my current
 situation.  It sounds like this 100k limit is, indeed, a global limit as
 opposed to a per-row limit.  Are these tombstones ever GCed out of the
 index?  How frequently?  If not, then it seems like *any* index is at risk
 of reaching this tipping point; it's just that indexes on frequently
 updated columns will reach this point faster than indexes on rarely
 updated columns.

 Basically I'm trying to get some kind of sense for what frequently
 updated
 http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html
 means quantitatively.  As written, the docs make it sound dangerous to
 create an index on a column that is *ever* deleted or updated since there
 is no sense of how frequent is too frequent.

 Cheers,
 Ian







 On Sun, Aug 10, 2014 at 3:02 PM, Mark Reddy mark.re...@boxever.com
 wrote:

 Hi Ian,

 The issue here, which relates to both normal and index column families, is
 that scanning over a large number of tombstones can cause Cassandra to fall
 over due to increased GC pressure. This pressure is caused because tombstones
 will create DeletedColumn objects which consume heap. Also,
 these DeletedColumn objects will have to be serialized and sent back to the
 coordinator, thus increasing your response times. Take, for example, a row
 that sees many deletes and that you query with a limit of 100. In a worst-case
 scenario you could end up reading, say, 50k tombstones to reach the 100
 'live' column limit, all of which have to be put on the heap and then sent
 over the wire to the coordinator. This would be considered a Cassandra
 anti-pattern.[1]

 With that in mind, a debug warning was added in 1.2 to inform the user when
 they were querying a row with more than 1,000 tombstones [2]. Then in 2.0
 the action was taken to drop requests that reach 100k tombstones [3] rather
 than just printing out a warning. This is a safety measure, as it is not
 advised to perform such a query, and it is a result of most people 'doing it
 wrong'.

 For those people who understand the risk of scanning over large numbers
 of tombstones there is a configuration option in the cassandra.yaml to
 increase this threshold, tombstone_failure_threshold.[4]


 Mark

 [1]
 http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
 [2] https://issues.apache.org/jira/browse/CASSANDRA-6042
 [3] https://issues.apache.org/jira/browse/CASSANDRA-6117
 [4]
 https://github.com/jbellis/cassandra/blob/4ac18ae805d28d8f4cb44b42e2244bfa6d2875e1/conf/cassandra.yaml#L407-L417



 On Sun, Aug 10, 2014 at 7:19 PM, Ian Rose ianr...@fullstory.com wrote:

 Hi -

 On this page (
 http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html),
 the docs state:

 Do not use an index [...] On a frequently updated or deleted column


 and


 *Problems using an index on a frequently updated or deleted column*¶
 http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html?scroll=concept_ds_sgh_yzz_zj__upDatIndx

 Cassandra stores tombstones in the index until the tombstone limit
 reaches 100K cells. After exceeding the tombstone limit, the query that
 uses the indexed value will fail.



 I'm afraid I don't really understand this limit from its (brief)
 description.  I also saw this recent thread
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201403.mbox/%3CCABNXB2Bf4aeoDVpMNOxJ_e7aDez2EuZswMJx=jWfb8=oyo4...@mail.gmail.com%3E
  but
 I'm afraid it didn't help me much...


 *SHORT VERSION*

 If I have tens or hundreds of thousands of rows in a keyspace, where
 every row has an indexed column that is updated O(10) times during the
 lifetime of each row, is that going to cause problems for me?  If that 100k
 limit is *per row* then I should be fine but if that 100k limit is *per
 keyspace* then I'd definitely exceed it quickly.


 *FULL EXPLANATION*

 In our system, items are created at a rate of ~10/sec.  Each item is
 updated ~10 times over the next few minutes (although in rare cases the
 number of updates, and the duration, might be several times as long).  Once
 the last update is received for an item, we select it from Cassandra,
 process the data, then delete the entire 

Re: clarification on 100k tombstone limit in indexes

2014-08-10 Thread DuyHai Doan
Hello Ian

It sounds like this 100k limit is, indeed, a global limit as opposed to
a per-row limit -- The threshold applies to each REQUEST, not per
partition or globally.

The threshold does not apply to a partition (physical row) simply because
in one request you can fetch data from many partitions (multi get slice).
There was a JIRA about this here:
https://issues.apache.org/jira/browse/CASSANDRA-6865

Are these tombstones ever GCed out of the index? -- Yes they are,
during compactions of the index column family.

How frequently? -- That's the real pain. Indeed, you do not have any
control over the tuning of secondary index CF compaction. As far as I know,
the compaction settings (strategy, min/max thresholds...) are inherited
from those of the base table.

Now, from a quick look at your data model, it seems that you have a
skinny partition pattern. Since you mentioned that the date is updated only
10 times max, you should not run into the tombstone threshold issue.

On a side note, your usage of a secondary index is not the best one. Indeed,
indexing the update date will lead to a situation where for one date,
you'll mostly have one or a few matching items (assuming that the update
date resolution is small enough and update rate is not intense). It is the
high-cardinality scenario to be avoided (
http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html).
Plus, the query on the index (find all items where last_updated < [now - 30
minutes]) makes things worse since it is not an exact match but an inequality.

 You'd be better off creating a manual reverse-index to track modification
date, something like this:

CREATE TABLE last_updated_item (
minute_bucket int, // format MMDDHHmm
last_update_date timestamp,
item_id ascii,
PRIMARY KEY(minute_bucket, last_update_date)
);

 The last_update_date column is quite self-explanatory. The minute_bucket
is trickier. The idea is to split time into 30-minute buckets: 00:00
to 00:30 is bucket 1, 00:30 to 01:00 is bucket 2, and so on. For a whole
day, you'd have 48 buckets. We need to put data into buckets to avoid
ultra-wide rows, since you mentioned that there are ~10 items (so ~10
updates) per second. Of course, 30 minutes is just an example; you can tune
it down to a window of 5 minutes or 1 minute, depending on the insertion
rate.
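
For illustration, reading one bucket back out is then a single-partition
query (2739 is the bucket value from Ian's example elsewhere in the thread;
the cutoff literal is only an example):

SELECT item_id, last_update_date
FROM last_updated_item
WHERE minute_bucket = 2739
  AND last_update_date < '2014-08-10 11:30:00';

The caller would scan the bucket(s) that have aged past the 30-minute window
and re-check each candidate's current last_updated in the items table before
processing it, since an item updated again later also appears in a newer
bucket.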





On Sun, Aug 10, 2014 at 10:02 PM, Ian Rose ianr...@fullstory.com wrote:

 Hi Mark -

 Thanks for the clarification but as I'm not too familiar with the nuts &
 bolts of Cassandra I'm not sure how to apply that info to my current
 situation.  It sounds like this 100k limit is, indeed, a global limit as
 opposed to a per-row limit.  Are these tombstones ever GCed out of the
 index?  How frequently?  If not, then it seems like *any* index is at risk
 of reaching this tipping point; it's just that indexes on frequently
 updated columns will reach this point faster than indexes on rarely
 updated columns.

 Basically I'm trying to get some kind of sense for what frequently
 updated
 http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html
 means quantitatively.  As written, the docs make it sound dangerous to
 create an index on a column that is *ever* deleted or updated since there
 is no sense of how frequent is too frequent.

 Cheers,
 Ian







 On Sun, Aug 10, 2014 at 3:02 PM, Mark Reddy mark.re...@boxever.com
 wrote:

 Hi Ian,

 The issue here, which relates to both normal and index column families, is
 that scanning over a large number of tombstones can cause Cassandra to fall
 over due to increased GC pressure. This pressure is caused because tombstones
 will create DeletedColumn objects which consume heap. Also,
 these DeletedColumn objects will have to be serialized and sent back to the
 coordinator, thus increasing your response times. Take, for example, a row
 that sees many deletes and that you query with a limit of 100. In a worst-case
 scenario you could end up reading, say, 50k tombstones to reach the 100
 'live' column limit, all of which have to be put on the heap and then sent
 over the wire to the coordinator. This would be considered a Cassandra
 anti-pattern.[1]

 With that in mind, a debug warning was added in 1.2 to inform the user when
 they were querying a row with more than 1,000 tombstones [2]. Then in 2.0
 the action was taken to drop requests that reach 100k tombstones [3] rather
 than just printing out a warning. This is a safety measure, as it is not
 advised to perform such a query, and it is a result of most people 'doing it
 wrong'.

 For those people who understand the risk of scanning over large numbers
 of tombstones there is a configuration option in the cassandra.yaml to
 increase this threshold, tombstone_failure_threshold.[4]


 Mark

 [1]
 http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
 [2] https://issues.apache.org/jira/browse/CASSANDRA-6042
 [3] https://issues.apache.org/jira/browse/CASSANDRA-6117
 [4]