Re: Notes and questions from performing a large delete

2013-12-07 Thread Josh Dzielak
Thanks Nate. I hadn't noticed that and it definitely explains it.

It'd be nice to see that called out much more clearly. As we found out, the 
implications can be severe!

-Josh 


On Thursday, December 5, 2013 at 11:30 AM, Nate McCall wrote:

 Per the 256mb to 5mb change, check the very last section of this page:
 http://www.datastax.com/documentation/cql/3.0/webhelp/cql/cql_reference/alter_table_r.html
 
 Changing any compaction or compression option erases all previous compaction 
 or compression settings.
 
 In other words, you have to include the whole 'WITH' clause each time - in 
 the future just grab the output from 'show schema' and add/modify as needed. 
 
 I did not know this either until it happened to me as well - could probably 
 stand to be a little bit more front-and-center, IMO. 
 
 
 On Wed, Dec 4, 2013 at 2:59 PM, Josh Dzielak j...@keen.io 
 (mailto:j...@keen.io) wrote:
  We recently had a little Cassandra party I wanted to share and see if 
  anyone has notes to compare. Or can tell us what we did wrong or what we 
  could do better. :) Apologies in advance for the length of the narrative 
  here. 
  
  Task at hand: Delete about 50% of the rows in a large column family (~8TB) 
  to reclaim some disk. These rows are used only for intermediate storage.
  
  Sequence of events: 
  
  - Issue the actual deletes. This, obviously, was super-fast.
  - Nothing happens yet, which makes sense. New tombstones are not 
  immediately compacted b/c of gc_grace_seconds.
  - Adjust gc_grace_seconds down to 60 from 86400 using ALTER TABLE in CQL.
  
  - Every node started working very hard. We saw disk space start to free up. 
  It was exciting.
  - Eventually the compactions finished and we had gotten a ton of disk back. 
  - However, our SSTables were now 5Mb, not 256Mb as they had always been :(
  - We inspected the schema in CQL/Opscenter etc and sure enough 
  sstable_size_in_mb had changed to 5Mb for this CF. Previously all CFs were 
  set at 256Mb, and all other CFs still were.
  
  - At 5Mb we had a huge number of SSTables. Our next goal was to get these 
  tables back to 256Mb. 
  - First step was to update the schema back to 256Mb.
  - Figuring out how to do this in CQL was tricky, because CQL has gone 
  through a lot of changes recently and getting the docs for your version is 
  hard. Eventually we figured it out - ALTER TABLE events WITH 
  compaction={'class':'LeveledCompactionStrategy','sstable_size_in_mb':256};
  - Out of our 12 nodes, 9 acknowledged the update. The others showed the old 
  schema still.
  - The remaining 3 would not. There was no extra load on the systems, 
  operational status was very clean. All nodes could see each other.
  - For each of the remaining 3 we tried to update the schema through a local 
  cqlsh session. The same ALTER TABLE would just hang forever.
  - We restarted Cassandra on each of the 3 nodes, then did the ALTER TABLE 
  again. It worked this time. We finally had schema agreement.
  
  - Starting with just 1 node, we kicked off upgradesstables, hoping it would 
  rebuild the 5Mb tables to 256Mb tables.
  - Nothing happened. This was (afaik) because the sstable size change 
  doesn't represent a new version of schema for the sstables. So existing 
  tables are ignored.
  - We discovered the -a option for upgradesstables, which tells it to skip 
  the schema check and just do all the tables anyway.
  - We ran upgradesstables -a and things started happening. After a few hours 
  the pending compactions finished.
  - Sadly, this node was now using 3x the disk it previously had. Some 
  sstables were now 256Mb, but not all. There were tens of thousands of ~20Mb 
  tables.
  - A direct comparison to other nodes owning the same % of the ring showed 
  both the same number of sstables and the same ratio of 256Mb+ tables to 
  small tables. However, on a 'normal' node the small tables were all 5-6Mb 
  and on the fat, upgraded node, all the tables were 20Mb+. This was why the 
  fat node was taking up 3x disk overall.
  - I tried to see what was in those 20Mb files relative to the 5Mb ones but 
  sstable2json failed against our authenticated keyspace. I filed a bug 
  (https://issues.apache.org/jira/browse/CASSANDRA-6450). 
  - Had little choice here. We shut down the fat node, did a manual delete of 
  sstables, brought it back up and did a repair. It came back to the right 
  size.
  
  TL;DR / Our big questions are: 
  How could the schema have spontaneously changed from 256Mb 
  sstable_size_in_mb to 5Mb?
  How could schema propagation have failed such that only 9 of 12 nodes got the 
  change even when the cluster was healthy? Why did updating schema locally hang 
  until restart?
  What could have happened inside of upgradesstables that created the node 
  with the same ring % but 3x disk load?
  
  We're on Cassandra 1.2.8, Java 6, Ubuntu 12. Running on SSDs, 12-node 
  cluster across 2 DCs. No compression, leveled compaction. Happy to provide 
  more details. Thanks in advance for any insights into what happened or any 
  best practices we missed during this episode.

Re: Notes and questions from performing a large delete

2013-12-07 Thread Edward Capriolo
It is definitely unexpected and can be very impactful to have such
important settings reset.

On Saturday, December 7, 2013, Josh Dzielak j...@keen.io wrote:
 Thanks Nate. I hadn't noticed that and it definitely explains it.
 It'd be nice to see that called out much more clearly. As we found out,
the implications can be severe!
 -Josh

 On Thursday, December 5, 2013 at 11:30 AM, Nate McCall wrote:

 Per the 256mb to 5mb change, check the very last section of this page:

http://www.datastax.com/documentation/cql/3.0/webhelp/cql/cql_reference/alter_table_r.html

 Changing any compaction or compression option erases all previous
compaction or compression settings.
 In other words, you have to include the whole 'WITH' clause each time -
in the future just grab the output from 'show schema' and add/modify as
needed.

 I did not know this either until it happened to me as well - could
probably stand to be a little bit more front-and-center, IMO.

 On Wed, Dec 4, 2013 at 2:59 PM, Josh Dzielak j...@keen.io wrote:

 We recently had a little Cassandra party I wanted to share and see if
anyone has notes to compare. Or can tell us what we did wrong or what we
could do better. :) Apologies in advance for the length of the narrative
here.
 Task at hand: Delete about 50% of the rows in a large column family
(~8TB) to reclaim some disk. These rows are used only for intermediate
storage.
 Sequence of events:
 - Issue the actual deletes. This, obviously, was super-fast.
 - Nothing happens yet, which makes sense. New tombstones are not
immediately compacted b/c of gc_grace_seconds.
 - Adjust gc_grace_seconds down to 60 from 86400 using ALTER TABLE in CQL.
 - Every node started working very hard. We saw disk space start to free
up. It was exciting.
 - Eventually the compactions finished and we had gotten a ton of disk
back.
 - However, our SSTables were now 5Mb, not 256Mb as they had always been :(
 - We inspected the schema in CQL/Opscenter etc and sure enough
sstable_size_in_mb had changed to 5Mb for this CF. Previously all CFs were
set at 256Mb, and all other CFs still were.
 - At 5Mb we had a huge number of SSTables. Our next goal was to get these
tables back to 256Mb.
 - First step was to update the schema back to 256Mb.
 - Figuring out how to do this in CQL was tricky, because CQL has gone
through a lot of changes recently and getting the docs for your version is
hard. Eventually we figured it out - ALTER TABLE events WITH
compaction={'class':'LeveledCompactionStrategy','sstable_size_in_mb':256};
 - Out of our 12 nodes, 9 acknowledged the update. The others showed the
old schema still.
 - The remaining 3 would not. There was no extra load on the systems,
operational status was very clean. All nodes could see each other.
 - For each of the remaining 3 we tried to update the schema through a
local cqlsh session. The same ALTER TABLE would just hang forever.
 - We restarted Cassandra on each of the 3 nodes, then did the ALTER TABLE
again. It worked this time. We finally had schema agreement.
 - Starting with just 1 node, we kicked off upgradesstables, hoping it
would rebuild the 5Mb tables to 256Mb tables.
 - Nothing happened. This was (afaik) because the sstable size change
doesn't represent a new version of schema for the sstables. So existing
tables are ignored.
 - We discovered the -a option for upgradesstables, which tells it to
skip the schema check and just do all the tables anyway.
 - We ran upgradesstables -a and things started happening. After a few
hours the pending compactions finished.
 - Sadly, this node was now using 3x the disk it previously had. Some
sstables were now 256Mb, but not all. There were tens of thousands of ~20Mb
tables.
 - A direct comparison to other nodes owning the same % of the ring showed
both the same number of sstables and the same ratio of 256Mb+ tables to
small tables. However, on a 'normal' node the small tables were all 5-6Mb
and on the fat, upgraded node, all the tables were 20Mb+. This was why the
fat node was taking up 3x disk overall.
 - I tried to see what was in those 20Mb files relative to the 5Mb ones
but sstable2json failed against our authenticated keyspace. I filed a bug.
 - Had little choice here. We shut down the fat node, did a manual delete
of sstables, brought it back up and did a repair. It came back to the right
size.

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Notes and questions from performing a large delete

2013-12-05 Thread Nate McCall
Per the 256mb to 5mb change, check the very last section of this page:
http://www.datastax.com/documentation/cql/3.0/webhelp/cql/cql_reference/alter_table_r.html

Changing any compaction or compression option erases all previous
compaction or compression settings.

In other words, you have to include the whole 'WITH' clause each time - in
the future just grab the output from 'show schema' and add/modify as
needed.

I did not know this either until it happened to me as well - could probably
stand to be a little bit more front-and-center, IMO.
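
For example, here is a minimal sketch of doing exactly that, using the table name
and values from this thread (1.2-era cqlsh/CQL3 syntax; DESCRIBE TABLE is the
cqlsh analog of 'show schema'):

  -- dump the current definition, edit it, then re-apply the *whole* map
  DESCRIBE TABLE events;

  ALTER TABLE events WITH
    compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 256};

Any sub-option left out of the map falls back to its default instead of keeping
its previous value.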


On Wed, Dec 4, 2013 at 2:59 PM, Josh Dzielak j...@keen.io wrote:

 We recently had a little Cassandra party I wanted to share and see if
 anyone has notes to compare. Or can tell us what we did wrong or what we
 could do better. :) Apologies in advance for the length of the narrative
 here.

 *Task at hand*: Delete about 50% of the rows in a large column family
 (~8TB) to reclaim some disk. These rows are used only for intermediate
 storage.

 *Sequence of events*:

 - Issue the actual deletes. This, obviously, was super-fast.
 - Nothing happens yet, which makes sense. New tombstones are not
 immediately compacted b/c of gc_grace_seconds.
 - Adjust gc_grace_seconds down to 60 from 86400 using ALTER TABLE in CQL.

 - Every node started working very hard. We saw disk space start to free
 up. It was exciting.
 - Eventually the compactions finished and we had gotten a ton of disk
 back.
 - However, our SSTables were now 5Mb, not 256Mb as they had always been :(
 - We inspected the schema in CQL/Opscenter etc and sure enough
 sstable_size_in_mb had changed to 5Mb for this CF. Previously all CFs were
 set at 256Mb, and all other CFs still were.

 - At 5Mb we had a huge number of SSTables. Our next goal was to get these
 tables back to 256Mb.
 - First step was to update the schema back to 256Mb.
 - Figuring out how to do this in CQL was tricky, because CQL has gone
 through a lot of changes recently and getting the docs for your version is
 hard. Eventually we figured it out - ALTER TABLE events WITH
 compaction={'class':'LeveledCompactionStrategy','sstable_size_in_mb':256};
 - Out of our 12 nodes, 9 acknowledged the update. The others showed the
 old schema still.
 - The remaining 3 would not. There was no extra load on the systems,
 operational status was very clean. All nodes could see each other.
 - For each of the remaining 3 we tried to update the schema through a
 local cqlsh session. The same ALTER TABLE would just hang forever.
 - We restarted Cassandra on each of the 3 nodes, then did the ALTER TABLE
 again. It worked this time. We finally had schema agreement.

 - Starting with just 1 node, we kicked off upgradesstables, hoping it
 would rebuild the 5Mb tables to 256Mb tables.
 - Nothing happened. This was (afaik) because the sstable size change
 doesn't represent a new version of schema for the sstables. So existing
 tables are ignored.
 - We discovered the -a option for upgradesstables, which tells it to
 skip the schema check and just do all the tables anyway.
 - We ran upgradesstables -a and things started happening. After a few
 hours the pending compactions finished.
 - Sadly, this node was now using 3x the disk it previously had. Some
 sstables were now 256Mb, but not all. There were tens of thousands of ~20Mb
 tables.
 - A direct comparison to other nodes owning the same % of the ring showed
 both the same number of sstables and the same ratio of 256Mb+ tables to
 small tables. However, on a 'normal' node the small tables were all 5-6Mb
 and on the fat, upgraded node, all the tables were 20Mb+. This was why the
 fat node was taking up 3x disk overall.
 - I tried to see what was in those 20Mb files relative to the 5Mb ones but
 sstable2json failed against our authenticated keyspace. I filed a bug
 (https://issues.apache.org/jira/browse/CASSANDRA-6450).
 - Had little choice here. We shut down the fat node, did a manual delete
 of sstables, brought it back up and did a repair. It came back to the right
 size.

 *TL;DR* / Our big questions are:
 How could the schema have spontaneously changed from 256Mb
 sstable_size_in_mb to 5Mb?
 How could schema propagation have failed such that only 9 of 12 nodes got the
 change even when the cluster was healthy? Why did updating schema locally hang
 until restart?
 What could have happened inside of upgradesstables that created the node
 with the same ring % but 3x disk load?

 We're on Cassandra 1.2.8, Java 6, Ubuntu 12. Running on SSDs, 12-node
 cluster across 2 DCs. No compression, leveled compaction. Happy to provide
 more details. Thanks in advance for any insights into what happened or any
 best practices we missed during this episode.

 Best,
 Josh




-- 
-
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Notes and questions from performing a large delete

2013-12-04 Thread Josh Dzielak
We recently had a little Cassandra party I wanted to share and see if anyone 
has notes to compare. Or can tell us what we did wrong or what we could do 
better. :) Apologies in advance for the length of the narrative here.

Task at hand: Delete about 50% of the rows in a large column family (~8TB) to 
reclaim some disk. These rows are used only for intermediate storage.

Sequence of events:

- Issue the actual deletes. This, obviously, was super-fast.
- Nothing happens yet, which makes sense. New tombstones are not immediately 
compacted b/c of gc_grace_seconds.
- Adjust gc_grace_seconds down to 60 from 86400 using ALTER TABLE in CQL (see the sketch after this list).

- Every node started working very hard. We saw disk space start to free up. It 
was exciting.
- Eventually the compactions finished and we had gotten a ton of disk back. 
- However, our SSTables were now 5Mb, not 256Mb as they had always been :(
- We inspected the schema in CQL/Opscenter etc and sure enough 
sstable_size_in_mb had changed to 5Mb for this CF. Previously all CFs were set 
at 256Mb, and all other CFs still were.

- At 5Mb we had a huge number of SSTables. Our next goal was to get these 
tables back to 256Mb.
- First step was to update the schema back to 256Mb.
- Figuring out how to do this in CQL was tricky, because CQL has gone through a 
lot of changes recently and getting the docs for your version is hard. 
Eventually we figured it out - ALTER TABLE events WITH 
compaction={'class':'LeveledCompactionStrategy','sstable_size_in_mb':256};
- Out of our 12 nodes, 9 acknowledged the update. The others showed the old 
schema still.
- The remaining 3 would not. There was no extra load on the systems, 
operational status was very clean. All nodes could see each other.
- For each of the remaining 3 we tried to update the schema through a local 
cqlsh session. The same ALTER TABLE would just hang forever.
- We restarted Cassandra on each of the 3 nodes, then did the ALTER TABLE 
again. It worked this time. We finally had schema agreement.

- Starting with just 1 node, we kicked off upgradesstables, hoping it would 
rebuild the 5Mb tables to 256Mb tables.
- Nothing happened. This was (afaik) because the sstable size change doesn't 
represent a new version of schema for the sstables. So existing tables are 
ignored.
- We discovered the -a option for upgradesstables, which tells it to skip the 
schema check and just do all the tables anyway.
- We ran upgradesstables -a and things started happening. After a few hours the 
pending compactions finished.
- Sadly, this node was now using 3x the disk it previously had. Some sstables 
were now 256Mb, but not all. There were tens of thousands of ~20Mb tables.
- A direct comparison to other nodes owning the same % of the ring showed both 
the same number of sstables and the same ratio of 256Mb+ tables to small 
tables. However, on a 'normal' node the small tables were all 5-6Mb and on the 
fat, upgraded node, all the tables were 20Mb+. This was why the fat node was 
taking up 3x disk overall.
- I tried to see what was in those 20Mb files relative to the 5Mb ones but 
sstable2json failed against our authenticated keyspace. I filed a bug 
(https://issues.apache.org/jira/browse/CASSANDRA-6450). 
- Had little choice here. We shut down the fat node, did a manual delete of 
sstables, brought it back up and did a repair. It came back to the right size.
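
For concreteness, the statements behind the steps above were roughly the following
(paraphrased rather than exact history; keyspace name omitted):

  -- lower the tombstone grace period on the CF
  ALTER TABLE events WITH gc_grace_seconds = 60;

  -- later: put the sstable size back by re-specifying the full compaction map,
  -- not just the one sub-option
  ALTER TABLE events WITH
    compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 256};

followed by 'nodetool upgradesstables -a <keyspace> events' on each node to force
the existing sstables to be rewritten.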

TL;DR / Our big questions are:
How could the schema have spontaneously changed from 256Mb sstable_size_in_mb 
to 5Mb?
How could schema propagation have failed such that only 9 of 12 nodes got the change 
even when the cluster was healthy? Why did updating schema locally hang until 
restart?
What could have happened inside of upgradesstables that created the node with 
the same ring % but 3x disk load?
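
(Side note, in case it is useful to anyone else chasing schema disagreement:
'nodetool describecluster' lists the schema version each node currently reports,
e.g. run against any node in the cluster:

  nodetool -h <node> describecluster

which should make any nodes that never picked up a change stand out.)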

We're on Cassandra 1.2.8, Java 6, Ubuntu 12. Running on SSDs, 12-node cluster 
across 2 DCs. No compression, leveled compaction. Happy to provide more 
details. Thanks in advance for any insights into what happened or any best 
practices we missed during this episode.

Best,
Josh