[jira] [Updated] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-3143:
---------------------------------------
    Attachment: (was: 0001-global-key-cache.patch)

Global caches (key/row)
-----------------------
             Key: CASSANDRA-3143
             URL: https://issues.apache.org/jira/browse/CASSANDRA-3143
         Project: Cassandra
      Issue Type: Improvement
        Reporter: Pavel Yaskevich
        Assignee: Pavel Yaskevich
        Priority: Minor
          Labels: Core
         Fix For: 1.1

Caches are difficult to configure well as ColumnFamilies are added, similar to how memtables were difficult pre-CASSANDRA-2006.
[jira] [Updated] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-3143:
---------------------------------------
    Attachment: (was: 0003-CacheServiceMBean-and-correct-key-cache-loading.patch)
[jira] [Updated] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-3143:
---------------------------------------
    Attachment: (was: 0005-cleanup-of-the-CFMetaData-and-thrift-avro-CfDef-and-.patch)
[jira] [Updated] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-3143:
---------------------------------------
    Attachment: (was: 0004-key-row-cache-tests-and-tweaks.patch)
[jira] [Updated] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-3143:
---------------------------------------
    Attachment: (was: 0002-global-row-cache-and-ASC.readSaved-changed-to-abstra.patch)
[jira] [Updated] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-3143:
---------------------------------------
    Attachment: (was: 0006-row-key-cache-improvements-according-to-Sylvain-s-co.patch)
[jira] [Updated] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-3143:
---------------------------------------
    Attachment: 0006-row-key-cache-improvements-according-to-Sylvain-s-co.patch
                0005-cleanup-of-the-CFMetaData-and-thrift-avro-CfDef-and-.patch
                0004-key-row-cache-tests-and-tweaks.patch
                0003-CacheServiceMBean-and-correct-key-cache-loading.patch
                0002-global-row-cache-and-ASC.readSaved-changed-to-abstra.patch
                0001-global-key-cache.patch

Rebased with the latest trunk (last commit 58518301472fc99b01cfd4bcf90bf81b5f0694ee).
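Taken together, the patch titles (a global key cache, a global row cache, a CacheServiceMBean) suggest the direction of the change: one node-wide cache of each kind instead of one per ColumnFamily, so capacity is configured once rather than re-tuned every time a ColumnFamily is added. As a rough illustration only, a global key cache could be keyed by a (cfId, rowKey) pair along the following lines; the class and method names are hypothetical and eviction is elided, so this shows the shape of the idea, not the patch's actual API.

{code:java}
import java.nio.ByteBuffer;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative only: one node-wide key cache shared by every ColumnFamily,
// keyed by (cfId, rowKey), so a single capacity setting covers all CFs.
public final class GlobalKeyCache<V>
{
    static final class CacheKey
    {
        final int cfId;          // which ColumnFamily the entry belongs to
        final ByteBuffer rowKey; // the row key within that CF

        CacheKey(int cfId, ByteBuffer rowKey)
        {
            this.cfId = cfId;
            this.rowKey = rowKey;
        }

        @Override
        public boolean equals(Object o)
        {
            return o instanceof CacheKey
                   && cfId == ((CacheKey) o).cfId
                   && rowKey.equals(((CacheKey) o).rowKey);
        }

        @Override
        public int hashCode()
        {
            return Objects.hash(cfId, rowKey);
        }
    }

    // A real implementation would bound this map and evict entries;
    // that bookkeeping is elided here.
    private final ConcurrentMap<CacheKey, V> entries = new ConcurrentHashMap<>();

    public V get(int cfId, ByteBuffer rowKey)
    {
        return entries.get(new CacheKey(cfId, rowKey));
    }

    public void put(int cfId, ByteBuffer rowKey, V value)
    {
        entries.put(new CacheKey(cfId, rowKey), value);
    }
}
{code}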
[jira] [Created] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
-----------------------------------------------------------------------------------------------
                 Key: CASSANDRA-3620
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
    Affects Versions: 1.0.5
            Reporter: Dominic Williams
             Fix For: 1.1

Here is a proposal for an improved system for handling distributed deletes.

*** The Problem ***

Repair has issues:
-- Repair is expensive anyway
-- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc)
-- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
-- When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
-- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

*** Proposed Reaper Model ***

1. Tombstones do not expire, and there is no GCSeconds.
2. Tombstones have associated ACK lists, which record the replicas that have acknowledged them.
3. Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas.
4. If a cf/key/name is deleted and there is a preexisting tombstone, its ACK list is simply reset.
5. Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs.

A number of systems could be used to maintain synchronization while nodes are added/removed; these can be discussed in a separate Jira.

** Advantages **

-- The labour/administration overhead associated with running repair will be removed
-- The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
-- There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc)
-- Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically*
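The numbered model above maps naturally onto a small data structure. The following Java sketch is purely illustrative (the AckedTombstone name, identifying replicas by InetAddress, and the in-memory concurrent ACK set are all assumptions of this sketch, not anything in Cassandra or the ticket): a tombstone that never expires on its own, accumulates replica ACKs, and becomes purgeable only when its ACK list covers the full replica set.

{code:java}
import java.net.InetAddress;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of steps 1-4 of the Reaper Model; not Cassandra's
// actual tombstone representation.
public final class AckedTombstone
{
    private final long markedForDeleteAt; // deletion timestamp, as today

    // Step 2: the ACK list - replicas that have confirmed this tombstone.
    private final Set<InetAddress> acks = ConcurrentHashMap.newKeySet();

    public AckedTombstone(long markedForDeleteAt)
    {
        this.markedForDeleteAt = markedForDeleteAt;
    }

    public void ack(InetAddress replica)
    {
        acks.add(replica);
    }

    // Step 3: purgeable only once *all* replicas have acknowledged,
    // instead of after a fixed GCSeconds has elapsed (step 1).
    public boolean isPurgeable(Set<InetAddress> allReplicas)
    {
        return acks.containsAll(allReplicas);
    }

    // Step 4: a fresh delete of the same cf/key/name resets the ACK list.
    public void resetAcks()
    {
        acks.clear();
    }

    public long timestamp()
    {
        return markedForDeleteAt;
    }
}
{code}

Under this scheme a tombstone's lifetime is bounded by replica acknowledgement rather than by wall-clock time, which is exactly what removes the GCSeconds deadline described above.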
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2The Problem/h2 Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. *** Proposed Reaper Model *** 1. Tombstones do not expire, and there is no GCSeconds. 2. Tombstones have associated ACK lists, which record the replicas that have acknowledged them 3. Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. 4. If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset 5. Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira ** Advantages ** -- The labour/administration overhead associated with running repair will be removed -- The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair -- There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) -- Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. 
*** The Problem *** Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. *** Proposed Reaper Model *** 1. Tombstones do not expire, and there is no GCSeconds. 2. Tombstones have associated ACK lists, which record the replicas
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. *The Problem* Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. *** Proposed Reaper Model *** 1. Tombstones do not expire, and there is no GCSeconds. 2. Tombstones have associated ACK lists, which record the replicas that have acknowledged them 3. Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. 4. If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset 5. Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira ** Advantages ** -- The labour/administration overhead associated with running repair will be removed -- The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair -- There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) -- Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. 
h2The Problem/h2 Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. *** Proposed Reaper Model *** 1. Tombstones do not expire, and there is no GCSeconds. 2. Tombstones have associated ACK lists, which record the replicas that
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. # If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset # Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Advantages -- The labour/administration overhead associated with running repair will be removed -- The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair -- There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) -- Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. h2. 
The Problem Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Proposed Reaper Model 1. Tombstones do not expire, and there is no GCSeconds. 2. Tombstones have associated ACK lists, which record the replicas that have
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Proposed Reaper Model 1. Tombstones do not expire, and there is no GCSeconds. 2. Tombstones have associated ACK lists, which record the replicas that have acknowledged them 3. Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. 4. If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset 5. Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Advantages -- The labour/administration overhead associated with running repair will be removed -- The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair -- There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) -- Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. 
*The Problem* Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. *** Proposed Reaper Model *** 1. Tombstones do not expire, and there is no GCSeconds. 2. Tombstones have associated ACK lists, which record the replicas that have
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. # If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset # Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Advantages * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. h2. 
The Problem Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. # If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset # Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Advantages * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. h2. 
The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. # If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset # Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Advantages * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. h2. 
The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. # If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset # Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Benefits * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. h2. 
The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. # If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset # Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Benefits * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the average number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. h2. 
The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. # If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset # Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Benefits * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically was: Here is a proposal for an improved system for handling distributed deletes. h2. 
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas.
# If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
[jira] [Commented] (CASSANDRA-3589) Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x
[ https://issues.apache.org/jira/browse/CASSANDRA-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168321#comment-13168321 ] Samarth Gahire commented on CASSANDRA-3589:
---
No, I do not have any secondary indexes on any of the column families, and I have done a fair comparison and seen some performance hit in the sstable-loader utility.

Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x
---
Key: CASSANDRA-3589
URL: https://issues.apache.org/jira/browse/CASSANDRA-3589
Project: Cassandra
Issue Type: Bug
Components: Tools
Affects Versions: 1.0.0
Reporter: Samarth Gahire
Assignee: Sylvain Lebresne
Priority: Minor

We are using the sstable-generation API and the sstable-loader utility. As soon as a newer version of Cassandra is released, I test it for sstable generation and loading, comparing the time taken by both processes. Up to Cassandra 0.8.7 there was no significant change in time taken, but in all cassandra-1.0.x releases I have seen 3-4 times degraded performance in generation and 2 times degraded performance in loading. Because of this we are not upgrading Cassandra to the latest version: we process terabytes of data every day, so the time taken is very important.
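For anyone wanting to reproduce the comparison, the sstable-generation API being timed here is presumably org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter. A rough timing harness might look like the following; the keyspace, column family, path and row count are made up, and the six-argument constructor is the one from the 1.0-era bulk-loading example, so verify it against the exact version under test.

{code:java}
// Rough harness for timing sstable generation across versions (illustrative).
import java.io.File;
import java.io.IOException;

import org.apache.cassandra.db.marshal.AsciiType;
import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;

import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

public class GenerationTimer
{
    public static void main(String[] args) throws IOException
    {
        File dir = new File("/tmp/sstables/Keyspace1/Standard1"); // must already exist
        SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                dir, "Keyspace1", "Standard1", AsciiType.instance, null, 64); // 64 MB buffer

        long start = System.currentTimeMillis();
        long timestamp = start * 1000; // microsecond column timestamps
        for (int i = 0; i < 1000000; i++)
        {
            writer.newRow(bytes("key" + i));
            writer.addColumn(bytes("col"), bytes("value" + i), timestamp);
        }
        writer.close(); // flushes the final buffer

        System.out.println("generation took " + (System.currentTimeMillis() - start) + " ms");
    }
}
{code}

Running the same harness against 0.8.7 and a 1.0.x build on the same hardware would help confirm, and localize, the 3-4x regression described.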
[jira] [Created] (CASSANDRA-3621) nodetool is trying to contact old ip address
nodetool is trying to contact old ip address
---
Key: CASSANDRA-3621
URL: https://issues.apache.org/jira/browse/CASSANDRA-3621
Project: Cassandra
Issue Type: Bug
Affects Versions: 0.8.8
Environment: java 1.6.26, linux
Reporter: Zenek Kraweznik

My Cassandra used to have addresses in 10.0.1.0/24; I moved it to the 10.0.2.0/24 network (for security reasons). I want to test the new Cassandra before upgrading the production instances. I've made a snapshot and moved it to test servers (except the system/LocationInfo* files). Changes in configuration: IP addresses (seeds, listen address, etc.) and cluster name. The test servers are in the 10.0.1.0/24 network. In the logs I see that the test nodes are seeing each other, but when I try to show the ring I get this error:

casstest1:/# nodetool -h 10.0.1.211 ring
Error connection to remote JMX agent!
java.rmi.ConnectIOException: Exception creating connection to: 10.1.0.201; nested exception is:
java.net.NoRouteToHostException: No route to host
at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:614)
at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:198)
at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:110)
at javax.management.remote.rmi.RMIServerImpl_Stub.newClient(Unknown Source)
at javax.management.remote.rmi.RMIConnector.getConnection(RMIConnector.java:2329)
at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:279)
at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:140)
at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:110)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:582)
Caused by: java.net.NoRouteToHostException: No route to host
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at java.net.Socket.connect(Socket.java:478)
at java.net.Socket.<init>(Socket.java:375)
at java.net.Socket.<init>(Socket.java:189)
at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:22)
at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:128)
at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:595)
... 10 more
casstest1:/#

Old production addresses in 10.0.1.0/24 were: 10.0.1.201, 10.0.1.202, 10.0.1.203
New addresses for tests: 10.0.1.211, 10.0.1.212, 10.0.1.213
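This looks like standard JMX/RMI two-hop behaviour rather than Cassandra reading old ring state: nodetool's first hop goes to the host given on the command line, but the RMI registry there hands back a stub containing whatever address the server JVM believes it has (taken from -Djava.rmi.server.hostname if set, otherwise the node's resolved hostname), and the second hop then targets that stale address. A simplified sketch of the connection nodetool makes (7199 is the default JMX port in 0.8):

{code:java}
// Simplified version of the connection NodeProbe makes; the failure above
// happens inside connect(), on the second hop to the stub's embedded address.
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxProbe
{
    public static void main(String[] args) throws Exception
    {
        String host = args.length > 0 ? args[0] : "10.0.1.211";
        int port = 7199; // default JMX port in Cassandra 0.8

        JMXServiceURL url = new JMXServiceURL(
                String.format("service:jmx:rmi:///jndi/rmi://%s:%d/jmxrmi", host, port));
        JMXConnector jmxc = JMXConnectorFactory.connect(url, null);
        System.out.println("connected: " + jmxc.getConnectionId());
        jmxc.close();
    }
}
{code}

If that is what is happening here, setting -Djava.rmi.server.hostname=10.0.1.211 (and so on, per node) in cassandra-env.sh, or fixing hostname resolution so each test node resolves to its new address, should make nodetool work. That is a suggestion based on RMI's documented behaviour, not something confirmed in the ticket.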
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas
# New tombstones replace old tombstones and always start with an empty ACK list
# Upon deletion, a tombstone is written to a relic list, which is scavenged according to some configurable period, thereby allowing deleted tombstones to still be acknowledged (the writer acknowledges this has some of the drawbacks of GCSeconds; a sketch of such a relic list follows this update)
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
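A minimal sketch of the relic list mentioned above, assuming relics are keyed by some tombstone identifier and scavenged by age; the class and its retention model are illustrative, not part of the proposal text.

{code:java}
// Hypothetical relic list: remembers deleted tombstones for a configurable
// period so that late ACK traffic can still be answered.
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RelicList
{
    private final long retentionMillis; // the configurable scavenge period
    private final Map<String, Long> relics = new ConcurrentHashMap<String, Long>();

    public RelicList(long retentionMillis)
    {
        this.retentionMillis = retentionMillis;
    }

    // Called when a fully-acknowledged tombstone is deleted.
    public void add(String tombstoneId)
    {
        relics.put(tombstoneId, System.currentTimeMillis());
    }

    // Lets a reaper keep acknowledging a tombstone it has already deleted.
    public boolean contains(String tombstoneId)
    {
        return relics.containsKey(tombstoneId);
    }

    // Run periodically: drop entries older than the retention period. This is
    // exactly where the GCSeconds-like drawback the author concedes re-enters.
    public void scavenge()
    {
        long cutoff = System.currentTimeMillis() - retentionMillis;
        for (Iterator<Long> it = relics.values().iterator(); it.hasNext();)
        {
            if (it.next() < cutoff)
                it.remove();
        }
    }
}
{code}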
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas
# If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset
# Upon deletion, a tombstone is written to a relic list, which is scavenged according to some configurable period, thereby allowing deleted tombstones to still be acknowledged (the writer acknowledges this has some of the drawbacks of GCSeconds)
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK (sketched in code after this update)

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
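The missing-tombstone rule above is the part that lets a replica which never saw the original delete converge anyway. A sketch, reusing the hypothetical AckedTombstone from earlier; the handler shape and identifiers are assumptions:

{code:java}
// Hypothetical handler for an incoming ACK request (rule: a request to ACK a
// tombstone we have never seen re-creates it, records the requestor's ACK,
// and replies with our own ACK).
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReaperAckHandler
{
    private final Map<String, AckedTombstone> tombstones = new ConcurrentHashMap<String, AckedTombstone>();
    private final InetAddress localAddress;

    public ReaperAckHandler(InetAddress localAddress)
    {
        this.localAddress = localAddress;
    }

    // Returns the address to send back as this replica's ACK.
    public InetAddress handleAckRequest(String tombstoneId, long deleteTimestamp, InetAddress requestor)
    {
        AckedTombstone t = tombstones.get(tombstoneId);
        if (t == null)
        {
            // We missed the original delete (e.g. we were down): adopt it now.
            t = new AckedTombstone(deleteTimestamp);
            tombstones.put(tombstoneId, t);
        }
        t.ack(requestor); // the requestor has evidently seen the tombstone
        return localAddress;
    }
}
{code}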
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas
# New tombstones replace old tombstones and always start with an empty ACK list
# Upon deletion, a tombstone is written to a relic list/index, which is scavenged according to some configurable period, thereby allowing deleted tombstones to still be acknowledged (the writer acknowledges this has some of the drawbacks of GCSeconds)
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas
# Upon deletion, a tombstone is written to a super fast relic index, which is scavenged according to some configurable period, thereby allowing deleted tombstones to still be acknowledged (the writer acknowledges this has some of the drawbacks of GCSeconds)
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas
# New tombstones replace old tombstones and always start with an empty ACK list
# Upon deletion, a tombstone is written to a super fast relic index, which is scavenged according to some configurable period, thereby allowing deleted tombstones to still be acknowledged (the writer acknowledges this has some of the drawbacks of GCSeconds)
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas
# Upon deletion, a tombstone is written to a super fast relic index, which is scavenged according to some configurable period, thereby allowing deleted tombstones to still be acknowledged (the writer acknowledges this has some of the drawbacks of GCSeconds)
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas
# Upon deletion, a tombstone is written to a super fast relic index, which is scavenged according to some configurable period, thereby allowing deleted tombstones to still be acknowledged (this relic index might simply contain MD5 hashes of cf-k-n(-sn)-acks)
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when
## They have been acknowledged by all replicas
## All replicas have acknowledged receiving all acknowledgements
# Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received
# Once a tombstone has been acknowledged by all replicas, after a configurable period, the reaper asks the replicas to acknowledge that the others have received all their acknowledgements (this two-phase rule is sketched in code after this update)
## If a node is down or otherwise can't reply, this is retried after a back-off period
## If a node is asked to fully acknowledge a tombstone, and it is not ready to do so, it may try to receive outstanding acknowledgements so that it can reply with an ACK
# If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
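A sketch of the two-phase rule in this revision: phase one collects ACKs, phase two collects confirmations that every replica has received all ACKs, and only then is the tombstone reapable. Everything here (names, the phase enum) is illustrative:

{code:java}
// Hypothetical two-phase tombstone state, per the revised rules above.
import java.net.InetAddress;
import java.util.HashSet;
import java.util.Set;

public class TwoPhaseTombstone
{
    public enum Phase { COLLECTING_ACKS, COLLECTING_FULL_ACKS, REAPABLE }

    private final Set<InetAddress> acks = new HashSet<InetAddress>();      // phase 1
    private final Set<InetAddress> fullAcks = new HashSet<InetAddress>();  // phase 2

    // Phase 1: a replica acknowledges the tombstone itself.
    public synchronized Phase ack(InetAddress replica, Set<InetAddress> replicas)
    {
        acks.add(replica);
        return phase(replicas);
    }

    // Phase 2: a replica confirms it has received all acknowledgements. Only
    // counted once phase 1 is complete; otherwise the caller retries after a
    // back-off, as the proposal describes.
    public synchronized Phase fullAck(InetAddress replica, Set<InetAddress> replicas)
    {
        if (acks.containsAll(replicas))
            fullAcks.add(replica);
        return phase(replicas);
    }

    public synchronized Phase phase(Set<InetAddress> replicas)
    {
        if (fullAcks.containsAll(replicas))
            return Phase.REAPABLE;
        if (acks.containsAll(replicas))
            return Phase.COLLECTING_FULL_ACKS;
        return Phase.COLLECTING_ACKS;
    }
}
{code}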
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when
## They have been acknowledged by all replicas
## All replicas have acknowledged receiving all acknowledgements
# Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received
# Once a tombstone has been acknowledged by all replicas, after a configurable period, the reaper asks the replicas to acknowledge that the others have received all their acknowledgements
## If a node is down or otherwise can't reply, this is retried after a back-off period
## If a node is asked to fully acknowledge a tombstone, and it is not ready to do so, it may try to receive outstanding acknowledgements so that it can reply with an ACK
# When a tombstone is deleted, it is added to a fast relic index, comprised of MD5 hashes calculated from cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge that it has received all acknowledgements after it has deleted a tombstone (a sketch of the key derivation follows this update)
# The relic index is scavenged according to some configurable period
# If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK
h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
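One way the relic-index key derivation described above could look; the separator and encoding are assumptions, and the ACK list hashed here is the complete acknowledgement list at the time the tombstone was deleted:

{code:java}
// Hypothetical relic-index key: an MD5 hash over cf-key-name[-subName]-ackList.
import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class RelicKey
{
    // subName is null for standard (non-super) column families.
    public static byte[] of(String cf, String key, String name, String subName, String ackList)
    {
        String canonical = cf + "-" + key + "-" + name
                         + (subName == null ? "" : "-" + subName)
                         + "-" + ackList;
        try
        {
            return MessageDigest.getInstance("MD5").digest(canonical.getBytes("UTF-8"));
        }
        catch (NoSuchAlgorithmException e)
        {
            throw new AssertionError(e); // MD5 is mandated by the JRE spec
        }
        catch (UnsupportedEncodingException e)
        {
            throw new AssertionError(e); // as is UTF-8
        }
    }
}
{code}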
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when
## They have been acknowledged by all replicas
## All replicas have acknowledged receiving all acknowledgements
# Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance; a sketch of such a throttled reaper follows this update)
# Once a tombstone has been acknowledged by all replicas, after a configurable period, the reaper asks the replicas to acknowledge that the others have received all their acknowledgements
## If a node is down or otherwise can't reply, this is retried after a back-off period
## If a node is asked to fully acknowledge a tombstone, and it is not ready to do so, it may try to receive outstanding acknowledgements so that it can reply with an ACK
# When a tombstone is deleted, it is added to a fast relic index, comprised of MD5 hashes calculated from cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge that it has received all acknowledgements after it has deleted a tombstone
# The relic index is scavenged according to some configurable period
# If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK
Benefits * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made
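To make the two-phase deletion rule in the update above concrete, here is a minimal sketch of the per-tombstone bookkeeping it implies. Everything below is illustrative only and assumes hypothetical types (none of this is existing Cassandra code): a tombstone tracks which replicas have ACKed it and, separately, which replicas have confirmed receiving everyone's ACKs; it becomes deletable only when both sets cover the full replica set.
{code:java}
import java.net.InetAddress;
import java.util.HashSet;
import java.util.Set;

// Hypothetical per-tombstone bookkeeping for the two-phase ACK scheme.
class TombstoneAckState
{
    private final Set<InetAddress> replicas;                   // full replica set for the row
    private final Set<InetAddress> acks = new HashSet<>();     // phase 1: replicas that ACKed the tombstone
    private final Set<InetAddress> fullAcks = new HashSet<>(); // phase 2: replicas that ACKed receiving all ACKs

    TombstoneAckState(Set<InetAddress> replicas)
    {
        this.replicas = replicas;
    }

    void recordAck(InetAddress replica) { acks.add(replica); }
    void recordFullAck(InetAddress replica) { fullAcks.add(replica); }

    // Phase 1 complete: every replica has acknowledged the tombstone itself.
    boolean fullyAcknowledged() { return acks.containsAll(replicas); }

    // Phase 2 complete: every replica has also acknowledged receiving all
    // acknowledgements, so the tombstone may be deleted or marked for compaction.
    boolean deletable() { return fullyAcknowledged() && fullAcks.containsAll(replicas); }
}
{code}
The second phase is what lets a replica that has already deleted the tombstone still answer full-ACK queries, via the relic index the proposal describes.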
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # The relic index is scavenged according to some configurable period # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after the configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas (which is no big deal since this does not corrupt data)
h3. Benefits * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't
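This revision keys the relic index by MD5 hashes of cf-key-name[-subName]-ackList. The proposal does not pin down the exact byte encoding, so the sketch below simply joins the components with '-' before hashing; the helper name and delimiter are assumptions of mine, not part of the proposal.
{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative derivation of a relic-index key from its components.
final class RelicKey
{
    static byte[] of(String cf, String key, String name, String subName, String ackList)
    {
        StringBuilder sb = new StringBuilder();
        sb.append(cf).append('-').append(key).append('-').append(name);
        if (subName != null)
            sb.append('-').append(subName); // subName is optional, per cf-key-name[-subName]-ackList
        sb.append('-').append(ackList);
        try
        {
            return MessageDigest.getInstance("MD5")
                                .digest(sb.toString().getBytes(StandardCharsets.UTF_8));
        }
        catch (NoSuchAlgorithmException e)
        {
            throw new AssertionError("MD5 is available in every standard JRE", e);
        }
    }
}
{code}
Including the ACK list in the hashed material is what lets a node that has deleted a tombstone still recognise, and vouch for, a specific fully-acknowledged state.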
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. Some prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this can feel like the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Therefore ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. Some prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't great, and it is made worse when there are lots of column families or it is necessary to run a column family with a low GCSeconds to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, or increase GCSeconds, you are going to lose deletes and this can feel like the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Therefore ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with having to run repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance
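The proposal has reaper threads streaming ACK traffic continuously while throttling CPU and bandwidth. A very rough shape of such a loop is sketched below; AckQueue, AckRequest and the fixed sleep-based throttle are placeholders of my own, not anything specified by the ticket.
{code:java}
import java.util.concurrent.TimeUnit;

// Rough shape of a background reaper: drain pending ACK requests at a
// bounded rate so the work never competes with foreground load.
class ReaperThread extends Thread
{
    interface AckQueue { AckRequest take() throws InterruptedException; }
    static final class AckRequest { /* tombstone coordinates elided */ }

    private final AckQueue pending; // hypothetical queue of tombstones awaiting ACKs
    private final long pauseMillis; // crude rate limit between sends

    ReaperThread(AckQueue pending, long pauseMillis)
    {
        this.pending = pending;
        this.pauseMillis = pauseMillis;
        setDaemon(true);
        setName("tombstone-reaper");
    }

    @Override
    public void run()
    {
        try
        {
            while (!isInterrupted())
            {
                AckRequest req = pending.take();          // blocks until there is work
                send(req);                                // stream the ACK request to a replica
                TimeUnit.MILLISECONDS.sleep(pauseMillis); // throttle CPU/bandwidth usage
            }
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt(); // shut down quietly
        }
    }

    private void send(AckRequest req) { /* network send elided */ }
}
{code}
A production version would presumably throttle by bytes sent rather than a fixed pause, but the point is that the work is incremental and interruptible, unlike a repair session.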
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with having to run repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. Some prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't great, and it is made worse when there are lots of column families or it is necessary to run a column family with a low GCSeconds to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, or increase GCSeconds, you are going to lose deletes and this can feel like the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Therefore ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments it can be very difficult to make repair
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. Some prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't great, and it is made worse when there are lots of column families or it is necessary to run a column family with a low GCSeconds to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, or increase GCSeconds, you are going to lose deletes and this can feel like the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Therefore ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't desirable, and the problem is made worse when there are lots of column families in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone load. The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid losing delete operations. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Therefore ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't desirable, and the problem is made worse when there are lots of column families in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone load. The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid losing delete operations. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. It would be much better if there were no ongoing requirement to run repair to avoid data loss (or rather the potential for data to reappear), and no GCSeconds window. Ideally repair would be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, data written to a node that did not receive a copy of a delete operation (because for example it was down) can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't desirable, and problems are exacerbated when there are lots of column families in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone load (because there are many write/deletes to that column family). The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid potentially losing delete operations. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. It would be much better if there were no ongoing requirement to run repair to ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause corruption, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of corruption if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to
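The notes in this revision argue that early eviction from the relic index is harmless (the worst case is a tombstone being re-created), which is exactly the property that lets the index be size-bounded and memory-resident. Below is a minimal sketch under that assumption, using a plain LRU map keyed by the hex relic hash; the capacity constant is invented for illustration:
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Size-bounded, in-memory relic index: key is the hex MD5 relic hash,
// value is the creation timestamp (usable for periodic scavenging).
// Evicting the eldest entry early is safe by the proposal's own argument.
class RelicIndex extends LinkedHashMap<String, Long>
{
    private static final int RELIC_CAPACITY = 1_000_000; // assumed tunable

    RelicIndex()
    {
        super(16, 0.75f, true); // access-order iteration, i.e. LRU
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Long> eldest)
    {
        return size() > RELIC_CAPACITY;
    }
}
{code}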
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either by error or because of issues with Cassandra, data written to a node that did not see a later delete can reappear (and a node might miss a delete for several reasons including being down or simply dropping requests during load shedding) * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't desirable, and problems are exacerbated when there are lots of column families in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone load (because there are many write/deletes to that column family). The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid potentially losing delete operations. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. It would be much better if there were no ongoing requirement to run repair to ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList.
The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause corruption, it can be kept small, or even kept in memory * Simple to implement and predictable h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of corruption if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, data
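The last numbered rule, that an ACK request for an unknown tombstone re-creates it, records the requestor's ACK and replies with an ACK, is small enough to sketch directly. The store and tombstone types here are hypothetical stand-ins, not Cassandra classes:
{code:java}
import java.net.InetAddress;

// Sketch of: "If a reaper receives a request to ACK a tombstone that does
// not exist, it creates the tombstone and adds an ACK for the requestor,
// and replies with an ACK." All types are illustrative placeholders.
final class AckRequestHandler
{
    interface Tombstone { void recordAck(InetAddress replica); }

    interface TombstoneStore
    {
        Tombstone get(String id);
        Tombstone create(String id);
    }

    private final TombstoneStore store;

    AckRequestHandler(TombstoneStore store) { this.store = store; }

    /** Always replies with an ACK (returns true). */
    boolean handle(String tombstoneId, InetAddress requestor)
    {
        Tombstone t = store.get(tombstoneId);
        if (t == null)
            t = store.create(tombstoneId); // re-create the missing tombstone
        t.recordAck(requestor);            // the requestor has evidently seen it
        return true;                       // reply with an ACK
    }
}
{code}
Re-creating rather than rejecting is what makes the protocol converge after a node has scavenged its relic index: the cost is only a transiently resurrected tombstone, never resurrected data.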
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either by error or because of issues with Cassandra, data written to a node that did not see a later delete can reappear (and a node might miss a delete for several reasons including being down or simply dropping requests during load shedding) * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load to reduce system impact. This isn't desirable, and problems are exacerbated when there are lots of column families in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone load (because there are many write/deletes to that column family). The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid potentially losing delete operations. It would be much better if there was no ongoing requirement to run repair to ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream back ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index do not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause corruption, it can be kept small, or even kept in memory * Simple to implement and predictable h3. 
Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour and administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in the background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of corruption if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance
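For concreteness, the relic-index bookkeeping the proposal describes might look something like the following minimal sketch. All names here are hypothetical; only the idea of storing MD5 hashes of cf-key-name[-subName]-ackList comes from the proposal itself:
{noformat}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

// Toy relic index from the proposal: once a fully-acknowledged tombstone is
// deleted, remember only an MD5 of cf-key-name[-subName]-ackList so a late
// reaper ACK request can still be answered without recreating the tombstone.
public class RelicIndex
{
    private final Set<String> relics = new HashSet<>();

    public void remember(String cf, String key, String name, String ackList) throws Exception
    {
        relics.add(md5(cf + ":" + key + ":" + name + ":" + ackList));
    }

    public boolean wasDeleted(String cf, String key, String name, String ackList) throws Exception
    {
        // entries may be evicted early; a miss only means the tombstone gets
        // recreated and re-acknowledged, which the proposal notes is harmless
        return relics.contains(md5(cf + ":" + key + ":" + name + ":" + ackList));
    }

    private static String md5(String s) throws Exception
    {
        byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : d)
            sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
{noformat}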
[jira] [Commented] (CASSANDRA-3621) nodetool is trying to contact old ip address
[ https://issues.apache.org/jira/browse/CASSANDRA-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168460#comment-13168460 ] Brandon Williams commented on CASSANDRA-3621: - You most likely have a hostname resolution problem where the system's hostname still resolves to the old IP. nodetool is trying to contact old ip address Key: CASSANDRA-3621 URL: https://issues.apache.org/jira/browse/CASSANDRA-3621 Project: Cassandra Issue Type: Bug Affects Versions: 0.8.8 Environment: java 1.6.26, linux Reporter: Zenek Kraweznik My Cassandra used to have addresses in 10.0.1.0/24; I moved it to the 10.0.2.0/24 network (for security reasons). I want to test the new Cassandra before upgrading the production instances. I made a snapshot and moved it to the test servers (except the system/LocationInfo* files). Changes in configuration: IP addresses (seeds, listen address etc), cluster name. The test servers are in the 10.0.1.0/24 network. In the logs I see that the test nodes are seeing each other, but when I try to show the ring I get this error: casstest1:/# nodetool -h 10.0.1.211 ring Error connection to remote JMX agent! java.rmi.ConnectIOException: Exception creating connection to: 10.1.0.201; nested exception is: java.net.NoRouteToHostException: No route to host at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:614) at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:198) at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184) at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:110) at javax.management.remote.rmi.RMIServerImpl_Stub.newClient(Unknown Source) at javax.management.remote.rmi.RMIConnector.getConnection(RMIConnector.java:2329) at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:279) at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248) at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:140) at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:110) at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:582) Caused by: java.net.NoRouteToHostException: No route to host at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351) at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213) at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366) at java.net.Socket.connect(Socket.java:529) at java.net.Socket.connect(Socket.java:478) at java.net.Socket.<init>(Socket.java:375) at java.net.Socket.<init>(Socket.java:189) at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:22) at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:128) at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:595) ... 10 more casstest1:/# Old production addresses in 10.0.1.0/24 were: 10.0.1.201, 10.0.1.202, 10.0.1.203 New addresses for tests: 10.0.1.211, 10.0.1.212, 10.0.1.213 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
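Brandon's hypothesis is easy to verify: the RMI stub advertises whatever address local hostname resolution produces, and nodetool then tries to connect to that address. A minimal check (illustrative code, not part of Cassandra):
{noformat}
import java.net.InetAddress;

// Prints what the JVM thinks the local hostname resolves to. If this shows
// the old 10.x address, JMX/RMI will hand that address to nodetool,
// producing exactly the NoRouteToHostException above.
public class HostnameCheck
{
    public static void main(String[] args) throws Exception
    {
        InetAddress local = InetAddress.getLocalHost();
        System.out.println(local.getHostName() + " -> " + local.getHostAddress());
        // A common workaround is to pin the address RMI advertises, e.g.:
        //   -Djava.rmi.server.hostname=10.0.1.211
    }
}
{noformat}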
[jira] [Commented] (CASSANDRA-3589) Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x
[ https://issues.apache.org/jira/browse/CASSANDRA-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168467#comment-13168467 ] Jonathan Ellis commented on CASSANDRA-3589: --- Have you been able to benchmark Sylvain's patch? Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x --- Key: CASSANDRA-3589 URL: https://issues.apache.org/jira/browse/CASSANDRA-3589 Project: Cassandra Issue Type: Bug Components: Tools Affects Versions: 1.0.0 Reporter: Samarth Gahire Assignee: Sylvain Lebresne Priority: Minor We are using the sstable-generation API and the sstable-loader utility. As soon as a newer version of Cassandra is released, I test it for the time taken by both sstable generation and loading. Up to Cassandra 0.8.7 there was no significant change in the time taken, but in all of cassandra-1.0.x I have seen 3-4 times degraded performance in generation and 2 times degraded performance in loading. Because of this we are not upgrading Cassandra to the latest version: since we are processing some terabytes of data every day, the time taken is very important. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-3620: -- Affects Version/s: (was: 1.0.5) Fix Version/s: (was: 1.1) At a high level, I think it's worth trying. One big drawback is making deletes O(N**2) expensive: N acks must be written to each of the N replicas. That's 81 writes for a single delete in a cluster with 9 total replicas across 3 DCs, which is not a hypothetical situation. Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs - Key: CASSANDRA-3620 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Dominic Williams Labels: GCSeconds, deletes, distributed_deletes, merkle_trees, repair Original Estimate: 504h Remaining Estimate: 504h Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either by error or because of issues with Cassandra, data written to a node that did not see a later delete can reappear (and a node might miss a delete for several reasons including being down or simply dropping requests during load shedding) * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't desirable, and problems are exacerbated when there are lots of column families in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone load (because there are many write/deletes to that column family). The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid potentially losing delete operations. It would be much better if there was no ongoing requirement to run repair to ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList.
The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back for requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause corruption, it can be kept small, or even kept in memory * Simple to implement and predictable h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour and administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in the background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of corruption if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance
[jira] [Commented] (CASSANDRA-3511) Supercolumn key caches are not saved
[ https://issues.apache.org/jira/browse/CASSANDRA-3511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168480#comment-13168480 ] Radim Kolar commented on CASSANDRA-3511: This is also a cache save issue, because I have seen a case where, after loading a cache saved by 1.0.5, the cache is not saved anymore. I will attach another demonstration file. There are two problems: 1. The cache can be saved in an incorrect format (maybe truncated?). Save to a -tmp file and rename later? 2. Loading an incorrect cache save image causes the cache to never be saved again; the incorrect image is not overwritten by a good one. Add some kind of error check/checksum to the cache for detecting and rejecting incorrect cache save images during load. Supercolumn key caches are not saved Key: CASSANDRA-3511 URL: https://issues.apache.org/jira/browse/CASSANDRA-3511 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.0.2, 1.0.3 Reporter: Radim Kolar Priority: Minor Labels: supercolumns Attachments: rapidshare-resultcache-KeyCache Cache saving seems to be broken in 1.0.2 and 1.0.3. I have 2 CFs in a keyspace with cache saving enabled and only one gets its key cache saved. It worked perfectly in 0.8; both were saved. This one works: create column family query2 with column_type = 'Standard' and comparator = 'AsciiType' and default_validation_class = 'BytesType' and key_validation_class = 'UTF8Type' and rows_cached = 500.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 14400 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 5 and max_compaction_threshold = 10 and replicate_on_write = false and row_cache_provider = 'ConcurrentLinkedHashCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' This one does not: create column family dkb13 with column_type = 'Super' and comparator = 'LongType' and subcomparator = 'AsciiType' and default_validation_class = 'BytesType' and key_validation_class = 'UTF8Type' and rows_cached = 600.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 14400 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 5 and max_compaction_threshold = 10 and replicate_on_write = false and row_cache_provider = 'ConcurrentLinkedHashCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' On a second test system I created these 2 column families and neither of them got a single cache key saved. Both have a save period of 30 seconds, so their caches should save often. It's not that the standard column family works while the super one does not.
create column family test1 with column_type = 'Standard' and comparator = 'BytesType' and default_validation_class = 'BytesType' and key_validation_class = 'BytesType' and rows_cached = 0.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 30 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 4 and max_compaction_threshold = 32 and replicate_on_write = true and row_cache_provider = 'SerializingCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'; create column family test2 with column_type = 'Standard' and comparator = 'BytesType' and default_validation_class = 'BytesType' and key_validation_class = 'BytesType' and rows_cached = 0.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 30 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 4 and max_compaction_threshold = 32 and replicate_on_write = true and row_cache_provider = 'SerializingCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'; If this is done on purpose, for example because Cassandra 1.0 makes some heuristic decision about whether a cache should be saved, then that heuristic should be removed. Saving the cache is fast. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
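A minimal sketch of the two fixes suggested in the comment above - write to a -tmp file and rename into place, and checksum the image so a truncated save is rejected at load time. This is illustrative only, not Cassandra's actual cache-saving code:
{noformat}
import java.io.*;
import java.nio.file.*;
import java.util.zip.CRC32;

// Sketch: save-to-tmp-and-rename plus a trailing checksum, so a crash mid-save
// never clobbers a good image, and a corrupt image is rejected at load time.
public final class CacheImage
{
    public static void save(File target, byte[] image) throws IOException
    {
        File tmp = new File(target.getPath() + "-tmp");
        CRC32 crc = new CRC32();
        crc.update(image);
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(tmp)))
        {
            out.writeInt(image.length);
            out.write(image);
            out.writeLong(crc.getValue()); // checksum written last
        }
        // atomic replace: a crash before this point leaves the old image intact
        Files.move(tmp.toPath(), target.toPath(),
                   StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
    }

    public static byte[] load(File target) throws IOException
    {
        try (DataInputStream in = new DataInputStream(new FileInputStream(target)))
        {
            byte[] image = new byte[in.readInt()];
            in.readFully(image);
            CRC32 crc = new CRC32();
            crc.update(image);
            if (in.readLong() != crc.getValue())
                throw new IOException("corrupt cache image, ignoring: " + target);
            return image;
        }
    }
}
{noformat}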
[jira] [Updated] (CASSANDRA-3511) Supercolumn key caches are not saved
[ https://issues.apache.org/jira/browse/CASSANDRA-3511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated CASSANDRA-3511: --- Attachment: failed-to-save-after-load-KeyCache Supercolumn key caches are not saved Key: CASSANDRA-3511 URL: https://issues.apache.org/jira/browse/CASSANDRA-3511 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.0.2, 1.0.3 Reporter: Radim Kolar Priority: Minor Labels: supercolumns Attachments: failed-to-save-after-load-KeyCache, rapidshare-resultcache-KeyCache Cache saving seems to be broken in 1.0.2 and 1.0.3. I have 2 CFs in a keyspace with cache saving enabled and only one gets its key cache saved. It worked perfectly in 0.8; both were saved. This one works: create column family query2 with column_type = 'Standard' and comparator = 'AsciiType' and default_validation_class = 'BytesType' and key_validation_class = 'UTF8Type' and rows_cached = 500.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 14400 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 5 and max_compaction_threshold = 10 and replicate_on_write = false and row_cache_provider = 'ConcurrentLinkedHashCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' This one does not: create column family dkb13 with column_type = 'Super' and comparator = 'LongType' and subcomparator = 'AsciiType' and default_validation_class = 'BytesType' and key_validation_class = 'UTF8Type' and rows_cached = 600.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 14400 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 5 and max_compaction_threshold = 10 and replicate_on_write = false and row_cache_provider = 'ConcurrentLinkedHashCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' On a second test system I created these 2 column families and neither of them got a single cache key saved. Both have a save period of 30 seconds, so their caches should save often. It's not that the standard column family works while the super one does not.
create column family test1 with column_type = 'Standard' and comparator = 'BytesType' and default_validation_class = 'BytesType' and key_validation_class = 'BytesType' and rows_cached = 0.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 30 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 4 and max_compaction_threshold = 32 and replicate_on_write = true and row_cache_provider = 'SerializingCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'; create column family test2 with column_type = 'Standard' and comparator = 'BytesType' and default_validation_class = 'BytesType' and key_validation_class = 'BytesType' and rows_cached = 0.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 30 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 4 and max_compaction_threshold = 32 and replicate_on_write = true and row_cache_provider = 'SerializingCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'; If this is done on purpose, for example because Cassandra 1.0 makes some heuristic decision about whether a cache should be saved, then that heuristic should be removed. Saving the cache is fast. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
svn commit: r1213775 - in /cassandra/branches/cassandra-1.0: CHANGES.txt src/java/org/apache/cassandra/utils/obs/OpenBitSet.java
Author: jbellis Date: Tue Dec 13 16:38:12 2011 New Revision: 1213775 URL: http://svn.apache.org/viewvc?rev=1213775&view=rev Log: more efficient allocation of small bloom filters patch by slebresne; reviewed by jbellis for CASSANDRA-3618 Modified: cassandra/branches/cassandra-1.0/CHANGES.txt cassandra/branches/cassandra-1.0/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java Modified: cassandra/branches/cassandra-1.0/CHANGES.txt URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-1.0/CHANGES.txt?rev=1213775&r1=1213774&r2=1213775&view=diff == --- cassandra/branches/cassandra-1.0/CHANGES.txt (original) +++ cassandra/branches/cassandra-1.0/CHANGES.txt Tue Dec 13 16:38:12 2011 @@ -1,5 +1,6 @@ 1.0.7 * fix assertion when dropping a columnfamily with no sstables (CASSANDRA-3614) + * more efficient allocation of small bloom filters (CASSANDRA-3618) 1.0.6 Modified: cassandra/branches/cassandra-1.0/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-1.0/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java?rev=1213775&r1=1213774&r2=1213775&view=diff == --- cassandra/branches/cassandra-1.0/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java (original) +++ cassandra/branches/cassandra-1.0/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java Tue Dec 13 16:38:12 2011 @@ -76,6 +76,7 @@ Test system: AMD Opteron, 64 bit linux, public class OpenBitSet implements Serializable { protected long[][] bits; protected int wlen; // number of words (elements) used in the array + private final int pageCount; /** * length of bits[][] page in long[] elements. * Choosing uniform size for all sizes of bitsets fights fragmentation for very large @@ -95,13 +96,19 @@ public class OpenBitSet implements Seria public OpenBitSet(long numBits, boolean allocatePages) { wlen= bits2words(numBits); +int lastPageSize = wlen % PAGE_SIZE; +int fullPageCount = wlen / PAGE_SIZE; +pageCount = fullPageCount + (lastPageSize == 0 ? 0 : 1); -bits = new long[getPageCount()][]; - +bits = new long[pageCount][]; + if (allocatePages) { -for (int allocated=0,i=0;allocated<wlen;allocated+=PAGE_SIZE,i++) -bits[i]=new long[PAGE_SIZE]; +for (int i = 0; i < fullPageCount; ++i) +bits[i] = new long[PAGE_SIZE]; + +if (lastPageSize != 0) +bits[bits.length - 1] = new long[lastPageSize]; } } @@ -119,7 +126,7 @@ public class OpenBitSet implements Seria public int getPageCount() { - return wlen / PAGE_SIZE + 1; + return pageCount; } public long[] getPage(int pageIdx)
[jira] [Updated] (CASSANDRA-3618) OpenBitSet can allocate more bytes than it needs
[ https://issues.apache.org/jira/browse/CASSANDRA-3618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-3618: -- Reviewer: jbellis Affects Version/s: (was: 1.0.0) 1.0.1 Committed. (This affects 1.0.1+, introduced by CASSANDRA-2466.) OpenBitSet can allocate more bytes than it needs Key: CASSANDRA-3618 URL: https://issues.apache.org/jira/browse/CASSANDRA-3618 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.0.1 Reporter: Sylvain Lebresne Assignee: Sylvain Lebresne Fix For: 1.0.7 Attachments: 0001-Fix-openBitSet.patch CASSANDRA-2466 changed OpenBitSet to break big long arrays into pages. However, it always allocates full pages, each page being of size 4096 * 8 bytes. This means that we almost always allocate too many bytes, and for a row that has 1 column, the associated row bloom filter allocates 32760 more bytes than it should. This has a significant impact on performance. In a small test using the SSTableSimpleUnsortedWriter to generate rows with 1 column, 0.8 is about twice as fast as 1.0 because of that (the difference shrinks when there are more columns, obviously). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
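The 32760 figure follows directly from the page size quoted in the description (a sketch of the arithmetic; variable names are illustrative):
{noformat}
public class BloomWaste
{
    public static void main(String[] args)
    {
        int pageSizeLongs = 4096;                    // page length in longs, per OpenBitSet
        long pageBytes = pageSizeLongs * 8L;         // 32768 bytes per fully-allocated page
        long neededBytes = 8;                        // ~one long for a 1-column row's filter
        System.out.println(pageBytes - neededBytes); // 32760, the figure in the report
    }
}
{noformat}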
[jira] [Created] (CASSANDRA-3622) clean up openbitset
clean up openbitset --- Key: CASSANDRA-3622 URL: https://issues.apache.org/jira/browse/CASSANDRA-3622 Project: Cassandra Issue Type: Task Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Priority: Minor Fix For: 1.1 Our OpenBitSet no longer supports expanding the set post-construction. Should update documentation to reflect that. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-3622) clean up openbitset
[ https://issues.apache.org/jira/browse/CASSANDRA-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-3622: -- Attachment: 3622.txt Replaces get/set operations with fastGet/Set operations. Where an expanding method had no fast analogue, I removed it. (All such methods were unused.) clean up openbitset --- Key: CASSANDRA-3622 URL: https://issues.apache.org/jira/browse/CASSANDRA-3622 Project: Cassandra Issue Type: Task Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Priority: Minor Fix For: 1.1 Attachments: 3622.txt Our OpenBitSet no longer supports expanding the set post-construction. Should update documentation to reflect that. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3592) Major Compaction Incredibly Slow
[ https://issues.apache.org/jira/browse/CASSANDRA-3592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168529#comment-13168529 ] Dan Hendry commented on CASSANDRA-3592: --- I can give that a try, though I am a little confused about how it will help. CASSANDRA-3618 seems to be related to performance for column families with narrow (single column) rows. The compaction slowdown I am seeing is for CFs that are characterized by very wide rows (thousands to millions of columns per row). Major Compaction Incredibly Slow Key: CASSANDRA-3592 URL: https://issues.apache.org/jira/browse/CASSANDRA-3592 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.0.3 Environment: RHEL6 - 24 core machines 24 GB mem total, 11 GB java heap java version 1.6.0_26 6 node cluster (4@0.8.6, 2@1.0.3) Reporter: Dan Hendry Labels: compaction Twice now (on different nodes), I have observed major compaction for certain column families take *significantly* longer on 1.0.3 in comparison to 0.8.6. For example, On the 0.8.6 node, the post compaction log message: {noformat}CompactionManager.java (line 608) Compacted to XXX. 339,164,959,170 to 158,825,469,883 (~46% of original) bytes for 25,996 keys. Time: 26,934,317ms.{noformat} On the 1.0.3 node, the post compaction log message: {noformat} CompactionTask.java (line 213) Compacted to [XXX]. 222,338,354,529 to 147,751,403,084 (~66% of original) bytes for 26,100 keys at 0.562045MB/s. Time: 250,703,563ms.{noformat} So... literally an order of magnitude slower on 1.0.3 in comparison to 0.8.6. Relevant configuration settings: * compaction_throughput_mb_per_sec: 0 (why? because the compaction throttling logic as currently implemented is highly unsuitable for wide rows but that's a different issue) * in_memory_compaction_limit_in_mb: 128 Column family characteristics: * Many wide rows (~5% of rows greater than 10MB and hundreds of rows greater than 100 MB, with many small columns). * Heavy use of expiring columns - each row represents data for a particular hour so typically all columns in the row will expire together. * The significant size shrinkage as reported by the log messages is due mainly to expired data being cleaned up (I typically trigger major compaction when 30-50% of the on disk data has expired, which is about once every 3 weeks per node). * Perhaps obviously: size tiered compaction and no compression (the schema has not changed since the partial upgrade to 1.0.x) * Standard column family Performance notes during compaction: * Nice CPU usage and load average is basically the same between 0.8.6 and 1.0.3 - i.e., compaction IS running and is not getting stalled or hung up anywhere. * Compaction is IO bound on the 0.8.6 machines - the disks see heavy, constant utilization when compaction is running. * Compaction uses virtually no IO on the 1.0.3 machines - disk utilization is virtually no different when compacting vs not compacting (but at the same time, CPU usage and load average clearly indicate that compaction IS running).
Finally, I have not had time to profile more thoroughly, but jconsole always shows the following stacktrace for the active compaction thread (for the 1.0.3 machine): {noformat} Stack trace: org.apache.cassandra.db.ColumnFamilyStore.removeDeletedStandard(ColumnFamilyStore.java:851) org.apache.cassandra.db.ColumnFamilyStore.removeDeletedColumnsOnly(ColumnFamilyStore.java:835) org.apache.cassandra.db.ColumnFamilyStore.removeDeleted(ColumnFamilyStore.java:826) org.apache.cassandra.db.compaction.PrecompactedRow.removeDeletedAndOldShards(PrecompactedRow.java:77) org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:102) org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:133) org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:102) org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:87) org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:116) org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99) com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140) com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135) com.google.common.collect.Iterators$7.computeNext(Iterators.java:614) com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140) com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135) org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:172)
[jira] [Commented] (CASSANDRA-3592) Major Compaction Incredibly Slow
[ https://issues.apache.org/jira/browse/CASSANDRA-3592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168549#comment-13168549 ] Jonathan Ellis commented on CASSANDRA-3592: --- You're right, that's not likely to help. It sounded like such a good fit superficially! Major Compaction Incredibly Slow Key: CASSANDRA-3592 URL: https://issues.apache.org/jira/browse/CASSANDRA-3592 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.0.3 Environment: RHEL6 - 24 core machines 24 GB mem total, 11 GB java heap java version 1.6.0_26 6 node cluster (4@0.8.6, 2@1.0.3) Reporter: Dan Hendry Labels: compaction Twice now (on different nodes), I have observed major compaction for certain column families take *significantly* longer on 1.0.3 in comparison to 0.8.6. For example, On the 0.8.6 node, the post compaction log message: {noformat}CompactionManager.java (line 608) Compacted to XXX. 339,164,959,170 to 158,825,469,883 (~46% of original) bytes for 25,996 keys. Time: 26,934,317ms.{noformat} On the 1.0.3 node, the post compaction log message: {noformat} CompactionTask.java (line 213) Compacted to [XXX]. 222,338,354,529 to 147,751,403,084 (~66% of original) bytes for 26,100 keys at 0.562045MB/s. Time: 250,703,563ms.{noformat} So... literally an order of magnitude slower on 1.0.3 in comparison to 0.8.6. Relevant configuration settings: * compaction_throughput_mb_per_sec: 0 (why? because the compaction throttling logic as currently implemented is highly unsuitable for wide rows but that's a different issue) * in_memory_compaction_limit_in_mb: 128 Column family characteristics: * Many wide rows (~5% of rows greater than 10MB and hundreds of rows greater than 100 MB, with many small columns). * Heavy use of expiring columns - each row represents data for a particular hour so typically all columns in the row will expire together. * The significant size shrinkage as reported by the log messages is due mainly to expired data being cleaned up (I typically trigger major compaction when 30-50% of the on disk data has expired, which is about once every 3 weeks per node). * Perhaps obviously: size tiered compaction and no compression (the schema has not changed since the partial upgrade to 1.0.x) * Standard column family Performance notes during compaction: * Nice CPU usage and load average is basically the same between 0.8.6 and 1.0.3 - i.e., compaction IS running and is not getting stalled or hung up anywhere. * Compaction is IO bound on the 0.8.6 machines - the disks see heavy, constant utilization when compaction is running. * Compaction uses virtually no IO on the 1.0.3 machines - disk utilization is virtually no different when compacting vs not compacting (but at the same time, CPU usage and load average clearly indicate that compaction IS running).
Finally, I have not had time to profile more thoroughly, but jconsole always shows the following stacktrace for the active compaction thread (for the 1.0.3 machine): {noformat} Stack trace: org.apache.cassandra.db.ColumnFamilyStore.removeDeletedStandard(ColumnFamilyStore.java:851) org.apache.cassandra.db.ColumnFamilyStore.removeDeletedColumnsOnly(ColumnFamilyStore.java:835) org.apache.cassandra.db.ColumnFamilyStore.removeDeleted(ColumnFamilyStore.java:826) org.apache.cassandra.db.compaction.PrecompactedRow.removeDeletedAndOldShards(PrecompactedRow.java:77) org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:102) org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:133) org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:102) org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:87) org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:116) org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99) com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140) com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135) com.google.common.collect.Iterators$7.computeNext(Iterators.java:614) com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140) com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135) org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:172) org.apache.cassandra.db.compaction.CompactionManager$4.call(CompactionManager.java:277) java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) java.util.concurrent.FutureTask.run(FutureTask.java:138)
svn commit: r1213827 - in /cassandra/trunk: ./ contrib/ interface/thrift/gen-java/org/apache/cassandra/thrift/ src/java/org/apache/cassandra/utils/obs/
Author: slebresne Date: Tue Dec 13 18:25:48 2011 New Revision: 1213827 URL: http://svn.apache.org/viewvc?rev=1213827&view=rev Log: merge from 1.0 Modified: cassandra/trunk/ (props changed) cassandra/trunk/CHANGES.txt cassandra/trunk/contrib/ (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Column.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/InvalidRequestException.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/NotFoundException.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/SuperColumn.java (props changed) cassandra/trunk/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java Propchange: cassandra/trunk/ -- --- svn:mergeinfo (original) +++ svn:mergeinfo Tue Dec 13 18:25:48 2011 @@ -4,7 +4,7 @@ /cassandra/branches/cassandra-0.8:1090934-1125013,1125019-1198724,1198726-1206097,1206099-1211976 /cassandra/branches/cassandra-0.8.0:1125021-1130369 /cassandra/branches/cassandra-0.8.1:1101014-1125018 -/cassandra/branches/cassandra-1.0:1167085-1211978,1212284 +/cassandra/branches/cassandra-1.0:1167085-1211978,1212284,1213775 /cassandra/branches/cassandra-1.0.0:1167104-1167229,1167232-1181093,1181741,1181816,1181820,1182951,1183243 /cassandra/tags/cassandra-0.7.0-rc3:1051699-1053689 /cassandra/tags/cassandra-0.8.0-rc1:1102511-1125020 Modified: cassandra/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/cassandra/trunk/CHANGES.txt?rev=1213827&r1=1213826&r2=1213827&view=diff == --- cassandra/trunk/CHANGES.txt (original) +++ cassandra/trunk/CHANGES.txt Tue Dec 13 18:25:48 2011 @@ -24,6 +24,7 @@ * Remove columns shadowed by a deleted container even when we cannot purge (CASSANDRA-3538) * Improve memtable slice iteration performance (CASSANDRA-3545) + * more efficient allocation of small bloom filters (CASSANDRA-3618) 1.0.6 Propchange: cassandra/trunk/contrib/ -- --- svn:mergeinfo (original) +++ svn:mergeinfo Tue Dec 13 18:25:48 2011 @@ -4,7 +4,7 @@ /cassandra/branches/cassandra-0.8/contrib:1090934-1125013,1125019-1198724,1198726-1206097,1206099-1211976 /cassandra/branches/cassandra-0.8.0/contrib:1125021-1130369 /cassandra/branches/cassandra-0.8.1/contrib:1101014-1125018 -/cassandra/branches/cassandra-1.0/contrib:1167085-1211978,1212284 +/cassandra/branches/cassandra-1.0/contrib:1167085-1211978,1212284,1213775 /cassandra/branches/cassandra-1.0.0/contrib:1167104-1167229,1167232-1181093,1181741,1181816,1181820,1182951,1183243 /cassandra/tags/cassandra-0.7.0-rc3/contrib:1051699-1053689 /cassandra/tags/cassandra-0.8.0-rc1/contrib:1102511-1125020 Propchange: cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java -- --- svn:mergeinfo (original) +++ svn:mergeinfo Tue Dec 13 18:25:48 2011 @@ -4,7 +4,7 @@ /cassandra/branches/cassandra-0.8/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1090934-1125013,1125019-1198724,1198726-1206097,1206099-1211976 /cassandra/branches/cassandra-0.8.0/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1125021-1130369 /cassandra/branches/cassandra-0.8.1/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1101014-1125018 -/cassandra/branches/cassandra-1.0/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1167085-1211978,1212284
+/cassandra/branches/cassandra-1.0/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1167085-1211978,1212284,1213775 /cassandra/branches/cassandra-1.0.0/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1167104-1167229,1167232-1181093,1181741,1181816,1181820,1182951,1183243 /cassandra/tags/cassandra-0.7.0-rc3/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1051699-1053689 /cassandra/tags/cassandra-0.8.0-rc1/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1102511-1125020 Propchange: cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Column.java -- --- svn:mergeinfo (original) +++ svn:mergeinfo Tue Dec 13 18:25:48 2011 @@ -4,7 +4,7 @@ /cassandra/branches/cassandra-0.8/interface/thrift/gen-java/org/apache/cassandra/thrift/Column.java:1090934-1125013,1125019-1198724,1198726-1206097,1206099-1211976 /cassandra/branches/cassandra-0.8.0/interface/thrift/gen-java/org/apache/cassandra/thrift/Column.java:1125021-1130369
[jira] [Commented] (CASSANDRA-3622) clean up openbitset
[ https://issues.apache.org/jira/browse/CASSANDRA-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168576#comment-13168576 ] Sylvain Lebresne commented on CASSANDRA-3622: - The patch renames the fastGet/Set to get/set (which is fine), but does not update the call-sites (in BloomFilter.java). clean up openbitset --- Key: CASSANDRA-3622 URL: https://issues.apache.org/jira/browse/CASSANDRA-3622 Project: Cassandra Issue Type: Task Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Priority: Minor Fix For: 1.1 Attachments: 3622.txt Our OpenBitSet no longer supports expanding the set post-construction. Should update documentation to reflect that. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3477) cassandra takes too long to shut down when told to quit
[ https://issues.apache.org/jira/browse/CASSANDRA-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168595#comment-13168595 ] paul cannon commented on CASSANDRA-3477: Joaquin - ready to close? cassandra takes too long to shut down when told to quit --- Key: CASSANDRA-3477 URL: https://issues.apache.org/jira/browse/CASSANDRA-3477 Project: Cassandra Issue Type: Bug Components: Core Reporter: Joaquin Casares Assignee: paul cannon Priority: Minor Fix For: 1.0.6 The restart command keeps failing and never passes. The stop command seems to have completed successfully, but the process is still listed when I run 'ps auwx | grep cass'. Using the Debian6 images on Rackspace. 2 nodes are definitely showing the same behavior. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3101) Should check for errors when calling /bin/ln
[ https://issues.apache.org/jira/browse/CASSANDRA-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168605#comment-13168605 ] paul cannon commented on CASSANDRA-3101: This works, except you've taken out a logger.error() call instead of adding another one. I think it's worth logging an error for the cassandra log in both cases. Should check for errors when calling /bin/ln Key: CASSANDRA-3101 URL: https://issues.apache.org/jira/browse/CASSANDRA-3101 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 0.4 Reporter: paul cannon Assignee: Vijay Priority: Minor Labels: lhf Fix For: 1.0.6 Attachments: 0001-0001-throw-IOE-while-calling-bin-ln-v2.patch, 0001-3101-throw-IOE-while-calling-bin-ln.patch It looks like cassandra.utils.CLibrary.createHardLinkWithExec() does not check for any errors in the execution of the hard-link-making utility. This could be bad if, for example, the user has put the snapshot directory on a different filesystem from the data directory. The hard linking would fail and the sstable snapshots would not exist, but no error would be reported. It does look like errors with the more direct JNA link() call are handled correctly - an exception is thrown. The WithExec version should probably do the same thing. Definitely it would be enough to check the process exit value from /bin/ln for nonzero in the *nix case, but I don't know whether 'fsutil hardlink create' or 'cmd /c mklink /H' return nonzero on failure. For bonus points, use any output from the Process's error stream in the text of the exception, to aid in debugging problems. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
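A sketch of the kind of check the ticket asks for in the *nix case - wait for /bin/ln, treat a nonzero exit value as an error, and put the stderr output in the exception text. Illustrative only; this is not the attached patch:
{noformat}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class HardLink
{
    public static void createHardLinkWithExec(String from, String to) throws IOException
    {
        Process p = new ProcessBuilder("ln", from, to).start();
        try
        {
            if (p.waitFor() != 0)
            {
                // collect whatever ln wrote to stderr to aid debugging
                StringBuilder err = new StringBuilder();
                try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getErrorStream())))
                {
                    String line;
                    while ((line = r.readLine()) != null)
                        err.append(line).append('\n');
                }
                throw new IOException("Unable to hard link " + from + " to " + to + ": " + err);
            }
        }
        catch (InterruptedException e)
        {
            throw new IOException(e);
        }
    }
}
{noformat}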
[jira] [Commented] (CASSANDRA-3477) cassandra takes too long to shut down when told to quit
[ https://issues.apache.org/jira/browse/CASSANDRA-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168608#comment-13168608 ] Joaquin Casares commented on CASSANDRA-3477: Sure thing. Haven't seen it on 1.0.5 yet. Thanks! cassandra takes too long to shut down when told to quit --- Key: CASSANDRA-3477 URL: https://issues.apache.org/jira/browse/CASSANDRA-3477 Project: Cassandra Issue Type: Bug Components: Core Reporter: Joaquin Casares Assignee: paul cannon Priority: Minor Fix For: 1.0.6 The restart command keeps failing and never passes. The stop command seems to have completed successfully, but the process is still listed when I run 'ps auwx | grep cass'. Using the Debian6 images on Rackspace. 2 nodes are definitely showing the same behavior. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-3622) clean up openbitset
[ https://issues.apache.org/jira/browse/CASSANDRA-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-3622: -- Attachment: 3622-v2.txt Oops, that's what I get for assuming a patch against 1.0 would Just Work against 1.1. v2 attached. clean up openbitset --- Key: CASSANDRA-3622 URL: https://issues.apache.org/jira/browse/CASSANDRA-3622 Project: Cassandra Issue Type: Task Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Priority: Minor Fix For: 1.1 Attachments: 3622-v2.txt, 3622.txt Our OpenBitSet no longer supports expanding the set post-construction. Should update documentation to reflect that. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Reopened] (CASSANDRA-3477) cassandra takes too long to shut down when told to quit
[ https://issues.apache.org/jira/browse/CASSANDRA-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis reopened CASSANDRA-3477: --- cassandra takes too long to shut down when told to quit --- Key: CASSANDRA-3477 URL: https://issues.apache.org/jira/browse/CASSANDRA-3477 Project: Cassandra Issue Type: Bug Components: Core Reporter: Joaquin Casares Priority: Minor The restart command keeps failing and never passes. The stop command seems to have completed successfully, but the process is still listed when I run 'ps auwx | grep cass'. Using the Debian6 images on Rackspace. 2 nodes are definitely showing the same behavior. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (CASSANDRA-3477) cassandra takes too long to shut down when told to quit
[ https://issues.apache.org/jira/browse/CASSANDRA-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis resolved CASSANDRA-3477. --- Resolution: Cannot Reproduce Reopened/re-resolved b/c that is actually a different issue. cassandra takes too long to shut down when told to quit --- Key: CASSANDRA-3477 URL: https://issues.apache.org/jira/browse/CASSANDRA-3477 Project: Cassandra Issue Type: Bug Components: Core Reporter: Joaquin Casares Priority: Minor The restart command keeps failing and never passes. The stop command seems to have completed successfully, but the process is still listed when I run 'ps auwx | grep cass'. Using the Debian6 images on Rackspace. 2 nodes are definitely showing the same behavior. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-1391) Allow Concurrent Schema Migrations
[ https://issues.apache.org/jira/browse/CASSANDRA-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pavel Yaskevich updated CASSANDRA-1391: --- Attachment: (was: 0001-new-migration-schema-and-avro-methods-cleanup.patch) Allow Concurrent Schema Migrations -- Key: CASSANDRA-1391 URL: https://issues.apache.org/jira/browse/CASSANDRA-1391 Project: Cassandra Issue Type: Improvement Components: Core Affects Versions: 0.7.0 Reporter: Stu Hood Assignee: Pavel Yaskevich Fix For: 1.1 Attachments: CASSANDRA-1391.patch CASSANDRA-1292 fixed multiple migrations started from the same node to properly queue themselves, but it is still possible for migrations initiated on different nodes to conflict and leave the cluster in a bad state. Since the system_add/drop/rename methods are accessible directly from the client API, they should be completely safe for concurrent use. It should be possible to allow for most types of concurrent migrations by converting the UUID schema ID into a VersionVectorClock (as provided by CASSANDRA-580). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
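As a rough illustration of why a version-vector clock can help where a single UUID schema ID cannot: two migrations are safely ordered only when one clock dominates the other component-wise; otherwise they are detectably concurrent and must be merged rather than one silently clobbering the other. A toy sketch (not the CASSANDRA-580 implementation):
{noformat}
import java.net.InetAddress;
import java.util.HashMap;
import java.util.Map;

// Toy version vector: one counter per node that has issued a migration.
public class VersionVector
{
    private final Map<InetAddress, Long> counters = new HashMap<>();

    public void witness(InetAddress node)   // record a migration issued by this node
    {
        counters.merge(node, 1L, Long::sum);
    }

    /** true if this clock is >= the other on every component */
    public boolean dominates(VersionVector other)
    {
        for (Map.Entry<InetAddress, Long> e : other.counters.entrySet())
            if (counters.getOrDefault(e.getKey(), 0L) < e.getValue())
                return false;
        return true;
    }

    /** neither dominates: migrations raced and need explicit reconciliation */
    public boolean concurrentWith(VersionVector other)
    {
        return !dominates(other) && !other.dominates(this);
    }
}
{noformat}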
[jira] [Updated] (CASSANDRA-1391) Allow Concurrent Schema Migrations
[ https://issues.apache.org/jira/browse/CASSANDRA-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pavel Yaskevich updated CASSANDRA-1391: --- Attachment: (was: 0002-avro-removal.patch) Allow Concurrent Schema Migrations -- Key: CASSANDRA-1391 URL: https://issues.apache.org/jira/browse/CASSANDRA-1391 Project: Cassandra Issue Type: Improvement Components: Core Affects Versions: 0.7.0 Reporter: Stu Hood Assignee: Pavel Yaskevich Fix For: 1.1 Attachments: CASSANDRA-1391.patch CASSANDRA-1292 fixed multiple migrations started from the same node to properly queue themselves, but it is still possible for migrations initiated on different nodes to conflict and leave the cluster in a bad state. Since the system_add/drop/rename methods are accessible directly from the client API, they should be completely safe for concurrent use. It should be possible to allow for most types of concurrent migrations by converting the UUID schema ID into a VersionVectorClock (as provided by CASSANDRA-580). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-1391) Allow Concurrent Schema Migrations
[ https://issues.apache.org/jira/browse/CASSANDRA-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pavel Yaskevich updated CASSANDRA-1391: --- Attachment: 0002-avro-removal.patch 0001-new-migration-schema-and-avro-methods-cleanup.patch rebased against the latest trunk (last commit e37bd7e8d344332ff41bd1015e6018c81ca81fa3) Allow Concurrent Schema Migrations -- Key: CASSANDRA-1391 URL: https://issues.apache.org/jira/browse/CASSANDRA-1391 Project: Cassandra Issue Type: Improvement Components: Core Affects Versions: 0.7.0 Reporter: Stu Hood Assignee: Pavel Yaskevich Fix For: 1.1 Attachments: 0001-new-migration-schema-and-avro-methods-cleanup.patch, 0002-avro-removal.patch, CASSANDRA-1391.patch CASSANDRA-1292 fixed multiple migrations started from the same node to properly queue themselves, but it is still possible for migrations initiated on different nodes to conflict and leave the cluster in a bad state. Since the system_add/drop/rename methods are accessible directly from the client API, they should be completely safe for concurrent use. It should be possible to allow for most types of concurrent migrations by converting the UUID schema ID into a VersionVectorClock (as provided by CASSANDRA-580). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3213) Upgrade Thrift to 0.7.0
[ https://issues.apache.org/jira/browse/CASSANDRA-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168654#comment-13168654 ] Jake Farrell commented on CASSANDRA-3213: - Jake Luciani and I were talking about this: we're changing this to upgrade to 0.8, removing the custom THsHa server, and using the default. I'll have a patch for this shortly Upgrade Thrift to 0.7.0 --- Key: CASSANDRA-3213 URL: https://issues.apache.org/jira/browse/CASSANDRA-3213 Project: Cassandra Issue Type: Task Components: Core Reporter: Jake Farrell Assignee: Jake Farrell Priority: Trivial Labels: thrift Fix For: 1.1 Attachments: v1-0001-update-generated-thrift-code.patch, v1-0002-upgrade-thrift-jar-and-license.patch, v1-0003-update-build-xml.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (CASSANDRA-3624) Hinted Handoff - related OOM
Hinted Handoff - related OOM Key: CASSANDRA-3624 URL: https://issues.apache.org/jira/browse/CASSANDRA-3624 Project: Cassandra Issue Type: Bug Reporter: Marcus Eriksson One of our nodes had collected a lot of hints for another node, so when the dead node came back and the row mutations were read back from disk, the node died with an OOM exception (and kept dying after restart, even with increased heap (from 8G to 12G)). The heap dump contained a lot of SuperColumns and our application does not use those (but HH does). I'm guessing that each mutation is big, so that PAGE_SIZE*mutation_size does not fit in memory (will check this tomorrow). A simple fix (if my assumption above is correct) would be to reduce the PAGE_SIZE in HintedHandOffManager.java to something like 10 (or even 1?) to reduce the memory pressure. The performance hit would be small since we are doing the hinted handoff throttle delay sleep before sending every *mutation* anyway (not every page), thoughts? If anyone runs into the same problem, I got the node started again by simply removing the HintsColumnFamily* files. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
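A back-of-the-envelope version of the guess above, with both numbers deliberately hypothetical (the PAGE_SIZE value here is not the actual constant in HintedHandOffManager.java, just a value chosen to show the shape of the problem):
{noformat}
public class HintPageFootprint
{
    public static void main(String[] args)
    {
        int pageSize = 512;                      // hypothetical hints read per page
        long mutationBytes = 20L * 1024 * 1024;  // hypothetical 20 MB per stored mutation
        long pageFootprint = pageSize * mutationBytes;
        System.out.println(pageFootprint / (1024 * 1024 * 1024) + " GB held at once");
        // 10 GB here - enough to OOM the 8-12 GB heaps mentioned in the report,
        // which is why shrinking PAGE_SIZE to ~10 (or 1) relieves the pressure.
    }
}
{noformat}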
[jira] [Created] (CASSANDRA-3625) Do something about DynamicCompositeType
Do something about DynamicCompositeType --- Key: CASSANDRA-3625 URL: https://issues.apache.org/jira/browse/CASSANDRA-3625 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Sylvain Lebresne Currently, DynamicCompositeType is a super dangerous type. We cannot leave it that way or people will get hurt. Let's recall that DynamicCompositeType allows composite column names without any limitation on what each component type can be. It was added basically to allow different rows of the same column family to each store a different index. So for instance you would have: {noformat} index1: { bar:24 -> someval bar:42 -> someval foo:12 -> someval ... } index2: { 0:uuid1:3.2 -> someval 1:uuid2:2.2 -> someval ... } {noformat} where index1, index2, ... are rows. So each row has columns whose names have a similar structure (so they can be compared), but between rows the structure can be different (we never compare two columns from two different rows). But the problem is the following: what happens if, in the index1 row above, you insert a column whose name is 0:uuid1 ? There is no really meaningful way to compare bar:24 and 0:uuid1. The current implementation of DynamicCompositeType, when confronted with this, says that it is a user error and throws a MarshalException. The problem with that is that the exception is not thrown at insert time, and it *cannot* be because of the dynamic nature of the comparator. But that means that if you do insert the wrong column in the wrong row, you end up *corrupting* an sstable. It is too dangerous a behavior. And it's probably made worse by the fact that some people probably think that DynamicCompositeType should be superior to CompositeType since, you know, it's dynamic. One solution to that problem could be to decide on some random (but predictable) order between two incomparable components. For example we could decide that IntType < LongType < StringType ... Note that even if we do that, I would suggest renaming DynamicCompositeType to something that suggests that CompositeType is always preferable to DynamicCompositeType unless you're really doing very advanced stuff. Opinions? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
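A toy model of the proposed fallback ordering (illustrative only - not the actual DynamicCompositeType code): components of the same type compare normally, while components of different types fall back to a fixed, predictable order between the types, so comparison is total and never throws during compaction:
{noformat}
import java.util.Comparator;

// Sketch of the proposal: same-type components compare as usual; cross-type
// comparisons use an arbitrary but stable rule (here, the type's class name),
// standing in for a fixed ordering like IntType < LongType < StringType.
public class FallbackComponentComparator implements Comparator<Object>
{
    @SuppressWarnings("unchecked")
    public int compare(Object a, Object b)
    {
        if (a.getClass() == b.getClass() && a instanceof Comparable)
            return ((Comparable<Object>) a).compareTo(b);
        return a.getClass().getName().compareTo(b.getClass().getName());
    }
}
{noformat}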
[jira] [Created] (CASSANDRA-3626) Nodes can get stuck in UP state forever, despite being DOWN
Nodes can get stuck in UP state forever, despite being DOWN --- Key: CASSANDRA-3626 URL: https://issues.apache.org/jira/browse/CASSANDRA-3626 Project: Cassandra Issue Type: Bug Components: Core Reporter: Peter Schuller Assignee: Peter Schuller This is a proposed phrasing for an upstream ticket named "Newly discovered nodes that are down get stuck in UP state forever" (will edit w/ feedback until done): We have observed a problem with gossip whereby, when you are bootstrapping a new node (or replacing one using the replace_token support), any node in the cluster which is Down at the time the new node is started will be assumed to be Up, and then *never ever* flapped back to Down until you restart the node. This has at least two implications for replacing or bootstrapping new nodes when there are nodes down in the ring: * If the new node happens to select a node listed as UP (but in reality DOWN) as a stream source, streaming will sit there hanging forever. * If that doesn't happen (by picking another host), it will instead finish bootstrapping correctly, and begin servicing requests all the while thinking DOWN nodes are UP, and thus routing requests to them, generating timeouts. The way to get out of this is to restart the node(s) that you bootstrapped. I have tested and confirmed the symptom (that the bootstrapped node thinks other nodes are Up) using a fairly recent 1.0. The main debugging effort happened on 0.8 however, so all details below refer to 0.8 but are probably similar in 1.0. Steps to reproduce: * Bring up a cluster of >= 3 nodes. *Ensure RF is < N*, so that the cluster is operative with one node removed. * Pick two random nodes A and B. Shut them *both* off. * Wait for everyone to realize they are both off (for good measure). * Now, take node A, nuke its data directories and re-start it, such that it comes up w/ normal bootstrap (or use replace_token; didn't test that but should not affect it). * Watch how node A starts up, all the while believing node B is up, even though all other nodes in the cluster agree that B is down and B is in fact still turned off. The mechanism by which it initially goes into Up state is that the node receives a gossip response from any other node in the cluster, and GossipDigestAck2VerbHandler.doVerb() calls Gossiper.applyStateLocally(). Gossiper.applyStateLocally() doesn't have any local endpoint state for the cluster, so the else statement at the end (it's a new node) gets triggered and handleMajorStateChange() is called. handleMajorStateChange() always calls markAlive(), unless the state is a dead state (but dead here does not mean "not up"; it refers to joining/hibernate etc). So at this point the node is up in the mind of the node you just bootstrapped. Now, in each gossip round doStatusCheck() is called, which iterates over all nodes (including the one falsely Up) and, among other things, calls FailureDetector.interpret() on each node. FailureDetector.interpret() is meant to update its sense of Phi for the node, and potentially convict it. However there is a short-circuit at the top, whereby if we do not yet have any arrival window for the node, we simply return immediately. Arrival intervals are only added as a result of a FailureDetector.report() call, which never happens in this case because the initial endpoint state we added, which came from a remote node that was up, had the latest version of the gossip state (so Gossiper.reportFailureDetector() will never call report()). The result is that the node can never ever be convicted. 
Now, let's ignore for a moment the problem that a node that is actually Down will be thought to be Up temporarily for a little while. That is sub-optimal, but let's aim for a fix to the more serious problem in this ticket - which is that it stays up forever. Considered solutions: * When interpret() gets called and there is no arrival window, we could add a faked arrival window far back in time to cause the node to have history and be marked down. This works in the particular test case. The problem is that since we are not ourselves actively trying to gossip to these nodes with any particular speed, it might take a significant time before we get any kind of confirmation from someone else that it's actually Up in cases where the node actually *is* Up, so it's not clear that this is a good idea. * When interpret() gets called and there is no arrival window, we can simply convict it immediately. This has roughly similar behavior as the previous suggestion. * When interpret() gets called and there is no arrival window, we can add a faked arrival window at the current time, which will allow it to be treated as Up until the usual time has passed before we exceed the Phi conviction threshold (a sketch of this option follows below). * When interpret() gets called and
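A minimal sketch of the third option, approximating the shape of FailureDetector (field and method names here are assumptions, not the exact code): seed an arrival window at the current time when none exists, so the ordinary Phi accrual can convict the node if it never reports.

{code}
// Sketch only: give a never-reported endpoint a synthetic arrival sample at
// discovery time, so Phi can grow and eventually exceed the convict threshold.
public void interpret(InetAddress ep)
{
    ArrivalWindow heartbeatWindow = arrivalSamples.get(ep);
    if (heartbeatWindow == null)
    {
        // previously: return immediately, so the node could never be convicted
        heartbeatWindow = new ArrivalWindow(SAMPLE_SIZE);
        heartbeatWindow.add(System.currentTimeMillis()); // treat discovery time as the last heartbeat
        arrivalSamples.put(ep, heartbeatWindow);
        return; // node gets the normal grace period before conviction
    }
    double phi = heartbeatWindow.phi(System.currentTimeMillis());
    if (phi > PHI_CONVICT_THRESHOLD)
        convict(ep); // notify listeners (e.g. the Gossiper) that the endpoint is down
}
{code}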
[jira] [Updated] (CASSANDRA-3626) Nodes can get stuck in UP state forever, despite being DOWN
[ https://issues.apache.org/jira/browse/CASSANDRA-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Goffinet updated CASSANDRA-3626: -- Reviewer: lenn0x Affects Version/s: 0.8.8 1.0.5 Nodes can get stuck in UP state forever, despite being DOWN --- Key: CASSANDRA-3626 URL: https://issues.apache.org/jira/browse/CASSANDRA-3626 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 0.8.8, 1.0.5 Reporter: Peter Schuller Assignee: Peter Schuller This is a proposed phrasing for an upstream ticket named "Newly discovered nodes that are down get stuck in UP state forever" (will edit w/ feedback until done): We have observed a problem with gossip whereby, when you are bootstrapping a new node (or replacing one using the replace_token support), any node in the cluster which is Down at the time the new node is started will be assumed to be Up, and then *never ever* flapped back to Down until you restart the node. This has at least two implications for replacing or bootstrapping new nodes when there are nodes down in the ring: * If the new node happens to select a node listed as UP (but in reality DOWN) as a stream source, streaming will sit there hanging forever. * If that doesn't happen (by picking another host), it will instead finish bootstrapping correctly, and begin servicing requests all the while thinking DOWN nodes are UP, and thus routing requests to them, generating timeouts. The way to get out of this is to restart the node(s) that you bootstrapped. I have tested and confirmed the symptom (that the bootstrapped node thinks other nodes are Up) using a fairly recent 1.0. The main debugging effort happened on 0.8 however, so all details below refer to 0.8 but are probably similar in 1.0. Steps to reproduce: * Bring up a cluster of >= 3 nodes. *Ensure RF is < N*, so that the cluster is operative with one node removed. * Pick two random nodes A and B. Shut them *both* off. * Wait for everyone to realize they are both off (for good measure). * Now, take node A, nuke its data directories and re-start it, such that it comes up w/ normal bootstrap (or use replace_token; didn't test that but should not affect it). * Watch how node A starts up, all the while believing node B is up, even though all other nodes in the cluster agree that B is down and B is in fact still turned off. The mechanism by which it initially goes into Up state is that the node receives a gossip response from any other node in the cluster, and GossipDigestAck2VerbHandler.doVerb() calls Gossiper.applyStateLocally(). Gossiper.applyStateLocally() doesn't have any local endpoint state for the cluster, so the else statement at the end (it's a new node) gets triggered and handleMajorStateChange() is called. handleMajorStateChange() always calls markAlive(), unless the state is a dead state (but dead here does not mean "not up"; it refers to joining/hibernate etc). So at this point the node is up in the mind of the node you just bootstrapped. Now, in each gossip round doStatusCheck() is called, which iterates over all nodes (including the one falsely Up) and, among other things, calls FailureDetector.interpret() on each node. FailureDetector.interpret() is meant to update its sense of Phi for the node, and potentially convict it. However there is a short-circuit at the top, whereby if we do not yet have any arrival window for the node, we simply return immediately. 
Arrival intervals are only added as a result of a FailureDetector.report() call, which never happens in this case because the initial endpoint state we added, which came from a remote node that was up, had the latest version of the gossip state (so Gossiper.reportFailureDetector() will never call report()). The result is that the node can never ever be convicted. Now, let's ignore for a moment the problem that a node that is actually Down will be thought to be Up temporarily for a little while. That is sub-optimal, but let's aim for a fix to the more serious problem in this ticket - which is that it stays up forever. Considered solutions: * When interpret() gets called and there is no arrival window, we could add a faked arrival window far back in time to cause the node to have history and be marked down. This works in the particular test case. The problem is that since we are not ourselves actively trying to gossip to these nodes with any particular speed, it might take a significant time before we get any kind of confirmation from someone else that it's actually Up in cases where the node actually *is* Up, so it's not clear that this is a good idea. * When interpret() gets called and there is no arrival
[jira] [Commented] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168849#comment-13168849 ] Sylvain Lebresne commented on CASSANDRA-3143: - {quote} bq. Preceding point apart, we would at least need a way to deactivate row caching on a per-cf basis. We may also want that for key cache, though this seems less critical. My initial idea would be to either have a boolean flag if we only want to allow disabling row cache, or some multi-value caches option that could be none, key_only, row_only or all. This is going to be moved to the separate task. {quote} I'm not a fan of that idea. We just cannot release this without a way to deactivate the row cache, as this would make the row cache unusable for most users. IMHO, that's a good definition of something that should not be moved to a separate task. {quote} bq. Why did the getRowCacheKeysToSave() option disappear? Because we don't control that anymore; we rely on the cache LRU policy instead. {quote} I don't understand how relying on the cache LRU policy has anything to do with that. The initial motivation for that option is that people don't want to reload the full extent of the row cache on restart because it takes forever, but they don't want to start with cold caches either. I don't see how making the cache global changes anything on that. I agree that the number of row cache keys to save should now be a global option, but I disagree that it should be removed. Otherwise: * The code around CFS.prepareRowForCaching is weird. First, the comment seems to suggest that prepareRowForCaching is used exclusively from CacheService, while it's used below in cacheRow. It also adds a copy of the columns, which I don't think is necessary since we already copy in MappedFileDataInput. Overall I'm not sure prepareRowForCaching is useful, and CacheService.readSavedRowCache could use cacheRow directly. * I don't think CacheService.reloadKeyCache works correctly. It only populates the cache with fake values that won't get updated unless you reload the sstables, which has no reason to happen. I'm fine removing the key cache reloading altogether, but as an alternative, why not save the values of the key cache too? The thing is, I'm not very comfortable with the current 'two phase' key cache loading: if we ever have a bug in the SSTReader.load method, the actual pre-loading with -1 values will be harmful, which seems unnecessarily fragile. Saving the values on disk would avoid that. * Talking of the key cache save, the format used by the patch is really, really not compact. For each key we save the full path to the sstable, which can easily be 50 bytes. Maybe we could associate an int to each descriptor during the save and save the descriptor -> id association separately. * Still worth allowing to choose how many keys to save. * The cache sizings don't take the keys into account. For the row cache, one could make the argument that the overhead of the keys is negligible compared to the values. For the key cache however, the keys are bigger than the values. * The patch mistakenly removes the help for 'nodetool upgradesstables' (in NodeCmd.java). * Would be worth adding a global cache log line in StatusLogger. * Patch wrongly reintroduces memtable_operations and memtable_throughput to CliHelp. * The default row cache provider since 1.0 is the serializing one; this patch sets the ConcurrentLinkedHashCacheProvider instead. 
And a number of nits: * In CFS, it's probably faster/simpler to use metadata.cfId rather than Schema.instance.getId(table.name, this.columnFamily). * In CacheService, calling scheduleSaving with -1 as second argument would be slightly faster than using Integer.MAX_VALUE. * In SSTableReader.cacheKey, the assert {{key.key != null}} is useless in trunk (a DK with key == null can't be constructed). * In AbstractCassandraDaemon, there's an unnecessary import of javax.management.RuntimeErrorException. * There is some comment duplication in the yaml file. * I wonder if the reduce cache capacity thing still makes sense after this patch? * In AutoSavingCache, I think we could declare AutoSavingCache<K extends CacheKey, V> and get rid of the translateKey() method. Global caches (key/row) --- Key: CASSANDRA-3143 URL: https://issues.apache.org/jira/browse/CASSANDRA-3143 Project: Cassandra Issue Type: Improvement Reporter: Pavel Yaskevich Assignee: Pavel Yaskevich Priority: Minor Labels: Core Fix For: 1.1 Attachments: 0001-global-key-cache.patch, 0002-global-row-cache-and-ASC.readSaved-changed-to-abstra.patch, 0003-CacheServiceMBean-and-correct-key-cache-loading.patch, 0004-key-row-cache-tests-and-tweaks.patch,
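A hedged sketch of the compact save format suggested above (Pair, Descriptor and ByteBufferUtil are the real utility classes; the filenameFor usage and the on-disk framing are assumptions): assign each sstable descriptor a small int id, write the id -> path table once, then write compact (id, key) pairs.

{code}
// Sketch: avoid repeating the ~50 byte sstable path for every saved key.
void saveKeyCache(DataOutputStream out, Set<Pair<Descriptor, ByteBuffer>> keys) throws IOException
{
    Map<Descriptor, Integer> ids = new HashMap<Descriptor, Integer>();
    for (Pair<Descriptor, ByteBuffer> entry : keys)
        if (!ids.containsKey(entry.left))
            ids.put(entry.left, ids.size());

    // header: the descriptor -> id association, written once per sstable
    out.writeInt(ids.size());
    for (Map.Entry<Descriptor, Integer> e : ids.entrySet())
    {
        out.writeInt(e.getValue());
        out.writeUTF(e.getKey().filenameFor("Data.db")); // assumed accessor for the data file path
    }

    // body: one small (id, key) record per cached key
    out.writeInt(keys.size());
    for (Pair<Descriptor, ByteBuffer> entry : keys)
    {
        out.writeInt(ids.get(entry.left));
        ByteBufferUtil.writeWithLength(entry.right, out);
    }
}
{code}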
[jira] [Created] (CASSANDRA-3627) IN (...) SELECTs don't honor KEY keyword
IN (...) SELECTs don't honor KEY keyword Key: CASSANDRA-3627 URL: https://issues.apache.org/jira/browse/CASSANDRA-3627 Project: Cassandra Issue Type: Bug Components: API Affects Versions: 1.0.5, 0.8.8 Reporter: Eric Evans The WHERE clause of a SELECT ... IN (...) will not work with the KEY keyword (but does with named/aliased keys). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168855#comment-13168855 ] Dominic Williams commented on CASSANDRA-3620: - Make it optional per column family? Repair would still need to exist anyway, so we could fall back to that for cases like this. Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs - Key: CASSANDRA-3620 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Dominic Williams Labels: GCSeconds, deletes, distributed_deletes, merkle_trees, repair Original Estimate: 504h Remaining Estimate: 504h Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either by error or because of issues with Cassandra, data written to a node that did not see a later delete can reappear (and a node might miss a delete for several reasons, including being down or simply dropping requests during load shedding) * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't desirable, and problems are exacerbated when there are lots of column families in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone load (because there are many writes/deletes to that column family). The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid potentially losing delete operations. It would be much better if there were no ongoing requirement to run repair to ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. 
The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK (a sketch of this handling follows below) NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal, since this does not corrupt data) * Since early removal of entries in the relic index does not cause corruption, it can be kept small, or even kept in memory * Simple to implement and predictable h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in the background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of corruption if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number
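A sketch of how a reaper might serve ACK requests against the relic index described above (all names are hypothetical; the proposal does not prescribe an API):

{code}
// Sketch: the relic index lets a reaper ACK tombstones it has already deleted.
boolean handleAckRequest(TombstoneId id, InetAddress requestor)
{
    // hypothetical hash of cf-key-name[-subName]-ackList, as in the proposal
    byte[] relicHash = md5(id.cf, id.key, id.name, id.ackList);
    if (relicIndex.contains(relicHash))
        return true; // tombstone already reaped here: acknowledge immediately

    Tombstone t = findTombstone(id);
    if (t == null)
        t = recreateTombstone(id); // per the proposal: re-create missing tombstones, then ACK

    t.addAck(requestor);
    if (t.ackedByAllReplicas())
    {
        relicIndex.add(relicHash); // remember it so later ACK requests still succeed
        markForCompaction(t);      // tombstone can now be purged
    }
    return true;
}
{code}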
[jira] [Created] (CASSANDRA-3628) Make Pig/CassandraStorage delete functionality disabled by default and configurable
Make Pig/CassandraStorage delete functionality disabled by default and configurable --- Key: CASSANDRA-3628 URL: https://issues.apache.org/jira/browse/CASSANDRA-3628 Project: Cassandra Issue Type: Task Reporter: Jeremy Hanna Assignee: Jeremy Hanna Right now, there is a way to delete columns with the CassandraStorage loadstorefunc. In practice it is a bad idea to have that enabled by default. A scenario: you do an outer join, you don't have a value for something, and then you write out to cassandra all of the attributes of that relation. You've just inadvertently deleted a column for all the rows that didn't have that value as a result of the outer join. It can be argued that you want to be careful with how you project after the join. However, I would think disabling it by default and having a configurable property to enable it for the instances when you explicitly want to use it is the right plan. Fwiw, we had a bug in one of our scripts that did exactly as described above. It's good to fix the bug. It's bad to implicitly delete data. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3622) clean up openbitset
[ https://issues.apache.org/jira/browse/CASSANDRA-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168867#comment-13168867 ] Sylvain Lebresne commented on CASSANDRA-3622: - +1 clean up openbitset --- Key: CASSANDRA-3622 URL: https://issues.apache.org/jira/browse/CASSANDRA-3622 Project: Cassandra Issue Type: Task Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Priority: Minor Fix For: 1.1 Attachments: 3622-v2.txt, 3622.txt Our OpenBitSet no longer supports expanding the set post-construction. Should update documentation to reflect that. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3483) Support bringing up a new datacenter to existing cluster without repair
[ https://issues.apache.org/jira/browse/CASSANDRA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168924#comment-13168924 ] Sylvain Lebresne commented on CASSANDRA-3483: - I haven't applied the patch yet (it needs a rebase, preferably against trunk since that is the likely target for this), but a few comments. We could have more reuse of code between Bootstrapper and the rebuild command. Typically: * RangeStreamer.getAllRangeWithSourcesFor does essentially the same thing as Bootstrapper.getRangesWithSources, so it would be nice to do some reuse. * In rebuild, we essentially have the code of Bootstrapper.getWorkMap; again it would be nice to do some code reuse. I think we should move all of those into RangeStreamer, and ultimately Bootstrapper.bootstrap() should be just one call to rebuild with the right arguments (mostly the correct tokenMetadata instance and the myRange collection). A few nits: * rebuild code could be simplified slightly by using StorageService.getLocalRanges() * rebuild doesn't fully respect the code style. Support bringing up a new datacenter to existing cluster without repair --- Key: CASSANDRA-3483 URL: https://issues.apache.org/jira/browse/CASSANDRA-3483 Project: Cassandra Issue Type: Bug Affects Versions: 1.0.2 Reporter: Chris Goffinet Assignee: Peter Schuller Attachments: CASSANDRA-3483-0.8-prelim.txt, CASSANDRA-3483-1.0.txt Was talking to Brandon in irc, and we ran into a case where we want to bring up a new DC to an existing cluster. He suggested (from jbellis) that the way to do it currently was to set strategy options of dc2:0, then add the nodes. After the nodes are up, change the RF of dc2, and run repair. I'd like to avoid a repair as it runs AES and is a bit more intense than how bootstrap works currently by just streaming ranges from the SSTables. Would it be possible to improve this functionality (adding a new DC to an existing cluster) over the proposed method? We'd be happy to do a patch if we got some input on the best way to go about it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168929#comment-13168929 ] Pavel Yaskevich commented on CASSANDRA-3143: bq. I'm not a fan of that idea. We just cannot release this without a way to deactivate the row cache, as this would make the row cache unusable for most users. IMHO, that's a good definition of something that should not be moved to a separate task. Couldn't we do that the same way we did with compression options? I'm happy to make it a sub-task, I just want the main code to be settled before starting with that. bq. Why did the getRowCacheKeysToSave() option disappear? Is that going to have the same use case as it did per-CF? Meaning we would be saving the top of the cache, and it doesn't guarantee that the system doesn't start almost cold... {quote} Talking of the key cache save, the format used by the patch is really, really not compact. For each key we save the full path to the sstable, which can easily be 50 bytes. Maybe we could associate an int to each descriptor during the save and save the descriptor -> id association separately. * Still worth allowing to choose how many keys to save {quote} Do you think that it's worth the effort of maintaining (also persisting) such a descriptor -> id relationship exclusively for the key cache? Meaning it's already a very compact cache, e.g. even with a 50 byte descriptor we would need ~20 mb to store 20 keys... bq. The cache sizings don't take the keys into account. For the row cache, one could make the argument that the overhead of the keys is negligible compared to the values. For the key cache however, the keys are bigger than the values. We do that because CLHM only allows measuring values; to do something about it we would need to re-write the Weigher interface and change core semantics of CLHM... Global caches (key/row) --- Key: CASSANDRA-3143 URL: https://issues.apache.org/jira/browse/CASSANDRA-3143 Project: Cassandra Issue Type: Improvement Reporter: Pavel Yaskevich Assignee: Pavel Yaskevich Priority: Minor Labels: Core Fix For: 1.1 Attachments: 0001-global-key-cache.patch, 0002-global-row-cache-and-ASC.readSaved-changed-to-abstra.patch, 0003-CacheServiceMBean-and-correct-key-cache-loading.patch, 0004-key-row-cache-tests-and-tweaks.patch, 0005-cleanup-of-the-CFMetaData-and-thrift-avro-CfDef-and-.patch, 0006-row-key-cache-improvements-according-to-Sylvain-s-co.patch Caches are difficult to configure well as ColumnFamilies are added, similar to how memtables were difficult pre-CASSANDRA-2006. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
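For illustration, a hedged sketch of what weighing keys as well as values could look like (CLHM's stock Weigher only sees the value; the entry-based interface, KeyCacheKey usage and serializedSize accessor here are assumptions, not the library's API):

{code}
// Sketch: an entry-level weigher that charges for the key bytes too,
// which matters for the key cache where keys outweigh the 8-byte values.
interface EntryWeigher<K, V>
{
    int weightOf(K key, V value);
}

EntryWeigher<KeyCacheKey, Long> keyCacheWeigher = new EntryWeigher<KeyCacheKey, Long>()
{
    public int weightOf(KeyCacheKey key, Long position)
    {
        return key.serializedSize() + 8; // hypothetical: key bytes plus an 8-byte offset
    }
};
{code}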
svn commit: r1214016 - /cassandra/tags/cassandra-1.0.6/
Author: slebresne Date: Wed Dec 14 01:17:43 2011 New Revision: 1214016 URL: http://svn.apache.org/viewvc?rev=1214016&view=rev Log: Create 1.0.6 branch Added: cassandra/tags/cassandra-1.0.6/ (props changed) - copied from r1212944, cassandra/branches/cassandra-1.0/ Propchange: cassandra/tags/cassandra-1.0.6/ -- --- svn:ignore (added) +++ svn:ignore Wed Dec 14 01:17:43 2011 @@ -0,0 +1,8 @@ +.classpath +.project +.settings +temp-testng-customsuite.xml +build +build.properties +.idea +out Propchange: cassandra/tags/cassandra-1.0.6/ -- --- svn:mergeinfo (added) +++ svn:mergeinfo Wed Dec 14 01:17:43 2011 @@ -0,0 +1,16 @@ +/cassandra/branches/cassandra-0.6:922689-1052356,1052358-1053452,1053454,1053456-1131291 +/cassandra/branches/cassandra-0.7:1026516-1211709 +/cassandra/branches/cassandra-0.7.0:1053690-1055654 +/cassandra/branches/cassandra-0.8:1090934-1125013,1125019-1212854,1212938 +/cassandra/branches/cassandra-0.8.0:1125021-1130369 +/cassandra/branches/cassandra-0.8.1:1101014-1125018 +/cassandra/branches/cassandra-1.0:1167106,1167185 +/cassandra/branches/cassandra-1.0.0:1167104-1181093,1181741,1181816,1181820,1182951,1183243 +/cassandra/branches/cassandra-1.0.5:1208016 +/cassandra/tags/cassandra-0.7.0-rc3:1051699-1053689 +/cassandra/tags/cassandra-0.8.0-rc1:1102511-1125020 +/cassandra/trunk:1167085-1167102,1169870 +/incubator/cassandra/branches/cassandra-0.3:774578-796573 +/incubator/cassandra/branches/cassandra-0.4:810145-834239,834349-834350 +/incubator/cassandra/branches/cassandra-0.5:72-915439 +/incubator/cassandra/branches/cassandra-0.6:911237-922688
[jira] [Commented] (CASSANDRA-3625) Do something about DynamicCompositeType
[ https://issues.apache.org/jira/browse/CASSANDRA-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168957#comment-13168957 ] Ed Anuff commented on CASSANDRA-3625: - I don't think you mean "random but predictable" so much as "deterministic but opaque" in your description of the correct behavior. I raised this issue with DynamicCompositeType when it was introduced, and I suggested we use the alias character byte or a hash of the classname (see https://issues.apache.org/jira/browse/CASSANDRA-2231#comment-13002170 ). I still think that's the best approach. Do something about DynamicCompositeType --- Key: CASSANDRA-3625 URL: https://issues.apache.org/jira/browse/CASSANDRA-3625 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Sylvain Lebresne Currently, DynamicCompositeType is a super dangerous type. We cannot leave it that way or people will get hurt. Let's recall that DynamicCompositeType allows composite column names without any limitation on what each component type can be. It was added to basically allow different rows of the same column family to each store a different index. So for instance you would have: {noformat} index1: { bar:24 -> someval bar:42 -> someval foo:12 -> someval ... } index2: { 0:uuid1:3.2 -> someval 1:uuid2:2.2 -> someval ... } {noformat} where index1, index2, ... are rows. So each row has columns whose names have a similar structure (so they can be compared), but between rows the structure can be different (we never compare two columns from two different rows). But the problem is the following: what happens if in the index1 row above, you insert a column whose name is 0:uuid1? There is no really meaningful way to compare bar:24 and 0:uuid1. The current implementation of DynamicCompositeType, when confronted with this, says that it is a user error and throws a MarshalException. The problem with that is that the exception is not thrown at insert time, and it *cannot* be, because of the dynamic nature of the comparator. But that means that if you do insert the wrong column in the wrong row, you end up *corrupting* an sstable. That is too dangerous a behavior. And it's probably made worse by the fact that some people probably think that DynamicCompositeType should be superior to CompositeType since, you know, it's dynamic. One solution to that problem could be to decide on some random (but predictable) order between two incomparable components. For example we could decide that IntType < LongType < StringType < ... Note that even if we do that, I would suggest renaming DynamicCompositeType to something that suggests that CompositeType is always preferable to DynamicCompositeType unless you're really doing very advanced stuff. Opinions? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-1391) Allow Concurrent Schema Migrations
[ https://issues.apache.org/jira/browse/CASSANDRA-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168969#comment-13168969 ] Jonathan Ellis commented on CASSANDRA-1391: --- Thanks, Pavel. This is getting closer. But I think continuing to use UUIDs is the wrong approach. In particular, code like this means we've failed to achieve our goal: {code} if (newVersion.timestamp() <= lastVersion.timestamp()) throw new ConfigurationException("New version timestamp is not newer than the current version timestamp."); {code} If two migrations X and Y propagate through the cluster concurrently from different coordinators, some nodes will apply X first, some Y; whichever migration has a lower timestamp will then error out on the remaining nodes and we'll end up with the same kind of version conflict snafu we encounter now. Here's how I think it should work: * Coordinator turns KsDef and CfDef objects into RowMutations by applying them to the existing (local) schema. Maybe you use something like your attributesToCheck code since you already have that written. Give that mutation a normal local timestamp (FBU.timestampMicros). Then each node applying the change: * makes a deep copy of the existing schema ColumnFamily objects * calls Table.apply on the migration RowMutations * calls ColumnFamily.diff on the new schema ColumnFamily object vs the copied one. (This is where I was going above by saying let the existing resolve code do the work. No matter which order nodes apply X and Y in, they will always agree on the result after applying both. Note that this does not depend on X and Y getting correctly ordered timestamps, either.) * makes the appropriate Table + CFS + Schema changes dictated by the diff * (above obviously needs to be synchronized at least against the Table/CFS objects affected) Schema version may then be computed as an md5 of the Schema objects. (Again: the goal is that nodes can apply X and Y in any order, and we don't care. So the version needs to be entirely content-based, not time-based.) Probably the easiest way to do this is to just use CF.updateDigest. We can cut this down to the first 16 bytes if we need to cram it into a UUID, but I don't see a reason for that (the Thrift API uses Strings already). Nit: flushSystemCFs could use FBUtilities.waitOnFutures(flushes) instead of rolling its own multi-future wait. Allow Concurrent Schema Migrations -- Key: CASSANDRA-1391 URL: https://issues.apache.org/jira/browse/CASSANDRA-1391 Project: Cassandra Issue Type: Improvement Components: Core Affects Versions: 0.7.0 Reporter: Stu Hood Assignee: Pavel Yaskevich Fix For: 1.1 Attachments: 0001-new-migration-schema-and-avro-methods-cleanup.patch, 0002-avro-removal.patch, CASSANDRA-1391.patch CASSANDRA-1292 fixed multiple migrations started from the same node to properly queue themselves, but it is still possible for migrations initiated on different nodes to conflict and leave the cluster in a bad state. Since the system_add/drop/rename methods are accessible directly from the client API, they should be completely safe for concurrent use. It should be possible to allow for most types of concurrent migrations by converting the UUID schema ID into a VersionVectorClock (as provided by CASSANDRA-580). -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
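A minimal sketch of the content-based versioning described above (ColumnFamily.updateDigest is the real method named in the comment; the surrounding framing is an assumption): every node digests the same schema rows, so nodes that applied X and Y in different orders still compute the same version.

{code}
// Sketch: schema version as a pure function of schema content, not of time.
UUID computeSchemaVersion(Collection<ColumnFamily> schemaRows)
{
    try
    {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        for (ColumnFamily cf : schemaRows)
            cf.updateDigest(digest);        // mix each schema row into the hash
        // 16 md5 bytes can be crammed into a UUID if the API keeps wanting one
        return UUID.nameUUIDFromBytes(digest.digest());
    }
    catch (NoSuchAlgorithmException e)
    {
        throw new RuntimeException(e);      // MD5 is always available on a JVM
    }
}
{code}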
[jira] [Commented] (CASSANDRA-3483) Support bringing up a new datacenter to existing cluster without repair
[ https://issues.apache.org/jira/browse/CASSANDRA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168975#comment-13168975 ] Peter Schuller commented on CASSANDRA-3483: --- I'll get it rebased once it's otherwise okay. As for re-use: I had intermediate versions that tried to do this, but every time I ended up realizing that it was exploding in verbosity at the point where I was using the abstraction, so it didn't actually help. However, I think there were a few changes towards the end after which I didn't re-evaluate. I'll look at it again and see what I can do. Support bringing up a new datacenter to existing cluster without repair --- Key: CASSANDRA-3483 URL: https://issues.apache.org/jira/browse/CASSANDRA-3483 Project: Cassandra Issue Type: Bug Affects Versions: 1.0.2 Reporter: Chris Goffinet Assignee: Peter Schuller Attachments: CASSANDRA-3483-0.8-prelim.txt, CASSANDRA-3483-1.0.txt Was talking to Brandon in irc, and we ran into a case where we want to bring up a new DC to an existing cluster. He suggested (from jbellis) that the way to do it currently was to set strategy options of dc2:0, then add the nodes. After the nodes are up, change the RF of dc2, and run repair. I'd like to avoid a repair as it runs AES and is a bit more intense than how bootstrap works currently by just streaming ranges from the SSTables. Would it be possible to improve this functionality (adding a new DC to an existing cluster) over the proposed method? We'd be happy to do a patch if we got some input on the best way to go about it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
svn commit: r1214034 - in /cassandra/trunk/src/java/org/apache/cassandra/utils: BloomFilter.java obs/OpenBitSet.java
Author: jbellis Date: Wed Dec 14 02:18:44 2011 New Revision: 1214034 URL: http://svn.apache.org/viewvc?rev=1214034&view=rev Log: clean up OpenBitSet patch by jbellis; reviewed by slebresne for CASSANDRA-3622 Modified: cassandra/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java cassandra/trunk/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java Modified: cassandra/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java URL: http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java?rev=1214034&r1=1214033&r2=1214034&view=diff == --- cassandra/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java (original) +++ cassandra/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java Wed Dec 14 02:18:44 2011 @@ -113,7 +113,7 @@ public class BloomFilter extends Filter { for (long bucketIndex : getHashBuckets(key)) { -bitset.fastSet(bucketIndex); +bitset.set(bucketIndex); } } @@ -121,7 +121,7 @@ public class BloomFilter extends Filter { for (long bucketIndex : getHashBuckets(key)) { - if (!bitset.fastGet(bucketIndex)) + if (!bitset.get(bucketIndex)) { return false; } Modified: cassandra/trunk/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java URL: http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java?rev=1214034&r1=1214033&r2=1214034&view=diff == --- cassandra/trunk/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java (original) +++ cassandra/trunk/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java Wed Dec 14 02:18:44 2011 @@ -21,8 +21,10 @@ import java.util.Arrays; import java.io.Serializable; import java.util.BitSet; -/** An open BitSet implementation that allows direct access to the array of words - * storing the bits. +/** + * An open BitSet implementation that allows direct access to the arrays of words + * storing the bits. Derived from Lucene's OpenBitSet, but with a paged backing array + * (see bits declaration, below). * <p/> * Unlike java.util.bitset, the fact that bits are packed into an array of longs * is part of the interface. This allows efficient implementation of other algorithms @@ -39,77 +41,38 @@ import java.util.BitSet; * hence people re-implement their own version in order to get better performance). * If you want a safe, totally encapsulated (and slower and limited) BitSet * class, use <code>java.util.BitSet</code>. - * <p/> - * <h3>Performance Results</h3> - * - Test system: Pentium 4, Sun Java 1.5_06 -server -Xbatch -Xmx64M -<br/>BitSet size = 1,000,000 -<br/>Results are java.util.BitSet time divided by OpenBitSet time. -<table border="1"> - <tr> - <th></th> <th>cardinality</th> <th>intersect_count</th> <th>union</th> <th>nextSetBit</th> <th>get</th> <th>iterator</th> - </tr> - <tr> - <th>50% full</th> <td>3.36</td> <td>3.96</td> <td>1.44</td> <td>1.46</td> <td>1.99</td> <td>1.58</td> - </tr> - <tr> - <th>1% full</th> <td>3.31</td> <td>3.90</td> <td>&nbsp;</td> <td>1.04</td> <td>&nbsp;</td> <td>0.99</td> - </tr> -</table> -<br/> -Test system: AMD Opteron, 64 bit linux, Sun Java 1.5_06 -server -Xbatch -Xmx64M -<br/>BitSet size = 1,000,000 -<br/>Results are java.util.BitSet time divided by OpenBitSet time. 
-<table border="1"> - <tr> - <th></th> <th>cardinality</th> <th>intersect_count</th> <th>union</th> <th>nextSetBit</th> <th>get</th> <th>iterator</th> - </tr> - <tr> - <th>50% full</th> <td>2.50</td> <td>3.50</td> <td>1.00</td> <td>1.03</td> <td>1.12</td> <td>1.25</td> - </tr> - <tr> - <th>1% full</th> <td>2.51</td> <td>3.49</td> <td>&nbsp;</td> <td>1.00</td> <td>&nbsp;</td> <td>1.02</td> - </tr> -</table> */ public class OpenBitSet implements Serializable { - protected long[][] bits; - protected int wlen; // number of words (elements) used in the array - private final int pageCount; /** - * length of bits[][] page in long[] elements. - * Choosing unform size for all sizes of bitsets fight fragmentation for very large - * bloom filters. + * We break the bitset up into multiple arrays to avoid promotion failure caused by attempting to allocate + * large, contiguous arrays (CASSANDRA-2466). All sub-arrays but the last are uniformly PAGE_SIZE words; + * to avoid waste in small bloom filters (of which Cassandra has many: one per row) the last sub-array + * is sized to exactly the remaining number of words required to achieve the desired set size (CASSANDRA-3618). */ - protected static final int PAGE_SIZE= 4096; + private final long[][] bits; + private int wlen; // number of words (elements) used in the array + private final int pageCount; + private static final int PAGE_SIZE = 4096; - /** Constructs an OpenBitSet large enough to hold numBits. - * + /** + * Constructs an OpenBitSet large enough to hold numBits. * @param numBits */ public OpenBitSet(long numBits) { - this(numBits,true); - } - -
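For reference, a simplified, self-contained sketch (not the committed class) of the paged addressing the new comment describes: a bit index maps to a 64-bit word, the word maps to a (page, offset) pair in the long[][], and no single allocation ever exceeds PAGE_SIZE words.

{code}
// Simplified sketch of the paged layout from the patch above.
public class PagedBitSet
{
    private static final int PAGE_SIZE = 4096; // words (longs) per page
    private final long[][] bits;

    public PagedBitSet(long numBits)
    {
        int words = (int) ((numBits + 63) >>> 6);
        int pages = (words + PAGE_SIZE - 1) / PAGE_SIZE;
        bits = new long[pages][];
        for (int p = 0; p < pages; p++)
        {
            // the last page is sized exactly, to avoid waste in small bloom filters
            bits[p] = new long[Math.min(PAGE_SIZE, words - p * PAGE_SIZE)];
        }
    }

    public void set(long index)
    {
        int wordNum = (int) (index >> 6); // which 64-bit word holds this bit
        bits[wordNum / PAGE_SIZE][wordNum % PAGE_SIZE] |= 1L << (index & 63);
    }

    public boolean get(long index)
    {
        int wordNum = (int) (index >> 6);
        return (bits[wordNum / PAGE_SIZE][wordNum % PAGE_SIZE] & (1L << (index & 63))) != 0;
    }
}
{code}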
[jira] [Commented] (CASSANDRA-3625) Do something about DynamicCompositeType
[ https://issues.apache.org/jira/browse/CASSANDRA-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169013#comment-13169013 ] Jonathan Ellis commented on CASSANDRA-3625: --- bq. One solution to that problem could be to decide on some random (but predictable) order between two incomparable components. That's the most straightforward suggestion IMO. bq. I suggested we use the alias character byte or a hash of the classname Couldn't we just fall back to lexical sorting for non-comparable types? Might as well keep it simple. Do something about DynamicCompositeType --- Key: CASSANDRA-3625 URL: https://issues.apache.org/jira/browse/CASSANDRA-3625 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Sylvain Lebresne Currently, DynamicCompositeType is a super dangerous type. We cannot leave it that way or people will get hurt. Let's recall that DynamicCompositeType allows composite column names without any limitation on what each component type can be. It was added to basically allow different rows of the same column family to each store a different index. So for instance you would have: {noformat} index1: { bar:24 -> someval bar:42 -> someval foo:12 -> someval ... } index2: { 0:uuid1:3.2 -> someval 1:uuid2:2.2 -> someval ... } {noformat} where index1, index2, ... are rows. So each row has columns whose names have a similar structure (so they can be compared), but between rows the structure can be different (we never compare two columns from two different rows). But the problem is the following: what happens if in the index1 row above, you insert a column whose name is 0:uuid1? There is no really meaningful way to compare bar:24 and 0:uuid1. The current implementation of DynamicCompositeType, when confronted with this, says that it is a user error and throws a MarshalException. The problem with that is that the exception is not thrown at insert time, and it *cannot* be, because of the dynamic nature of the comparator. But that means that if you do insert the wrong column in the wrong row, you end up *corrupting* an sstable. That is too dangerous a behavior. And it's probably made worse by the fact that some people probably think that DynamicCompositeType should be superior to CompositeType since, you know, it's dynamic. One solution to that problem could be to decide on some random (but predictable) order between two incomparable components. For example we could decide that IntType < LongType < StringType < ... Note that even if we do that, I would suggest renaming DynamicCompositeType to something that suggests that CompositeType is always preferable to DynamicCompositeType unless you're really doing very advanced stuff. Opinions? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-3619) Use a separate writer thread for the SSTableSimpleUnsortedWriter
[ https://issues.apache.org/jira/browse/CASSANDRA-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-3619: -- Reviewer: yukim Use a separate writer thread for the SSTableSimpleUnsortedWriter Key: CASSANDRA-3619 URL: https://issues.apache.org/jira/browse/CASSANDRA-3619 Project: Cassandra Issue Type: Improvement Components: Tools Affects Versions: 0.8.1 Reporter: Sylvain Lebresne Assignee: Sylvain Lebresne Priority: Minor Fix For: 1.1 Attachments: 0001-Add-separate-writer-thread.patch Currently SSTableSimpleUnsortedWriter doesn't use any threading. This means that the thread using it is blocked while the buffered data is written on disk and that nothing is written on disk while data is added. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
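A hedged sketch of the idea (not the attached patch itself): hand each filled buffer to a dedicated writer thread over a bounded queue, so the caller can keep adding rows to a fresh buffer while the previous one is being sorted and written. Buffer, writeToSSTable, SENTINEL and bufferSizeInBytes are hypothetical stand-ins here.

{code}
// Sketch: decouple buffering from disk writes with a single writer thread.
final BlockingQueue<Buffer> writeQueue = new ArrayBlockingQueue<Buffer>(1);
private Buffer currentBuffer = new Buffer();

Thread writer = new Thread("sstable-writer")
{
    public void run()
    {
        try
        {
            Buffer b;
            while ((b = writeQueue.take()) != SENTINEL) // SENTINEL: hypothetical end-of-stream marker
                writeToSSTable(b);                      // hypothetical: sort the buffer, write one sstable
        }
        catch (InterruptedException e)
        {
            throw new AssertionError(e);
        }
    }
};

void add(ByteBuffer key, ColumnFamily cf) throws InterruptedException
{
    currentBuffer.put(key, cf);            // hypothetical in-memory buffering
    if (currentBuffer.size() >= bufferSizeInBytes)
    {
        writeQueue.put(currentBuffer);     // blocks only while the writer is still busy
        currentBuffer = new Buffer();      // keep accepting rows while the old buffer flushes
    }
}
{code}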
[jira] [Commented] (CASSANDRA-3624) Hinted Handoff - related OOM
[ https://issues.apache.org/jira/browse/CASSANDRA-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169051#comment-13169051 ] Jonathan Ellis commented on CASSANDRA-3624: --- That makes sense. (How big are your mutations?) We added adaptive page sizing back in CASSANDRA-2652, but apparently removed it for the CASSANDRA-2045 redesign. Hinted Handoff - related OOM Key: CASSANDRA-3624 URL: https://issues.apache.org/jira/browse/CASSANDRA-3624 Project: Cassandra Issue Type: Bug Reporter: Marcus Eriksson One of our nodes had collected a lot of hints for another node, so when the dead node came back and the row mutations were read back from disk, the node died with an OOM exception (and kept dying after restart, even with increased heap (from 8G to 12G)). The heap dump contained a lot of SuperColumns and our application does not use those (but HH does). I'm guessing that each mutation is big, so that PAGE_SIZE*mutation_size does not fit in memory (will check this tomorrow). A simple fix (if my assumption above is correct) would be to reduce the PAGE_SIZE in HintedHandOffManager.java to something like 10 (or even 1?) to reduce the memory pressure. The performance hit would be small since we are doing the hinted handoff throttle delay sleep before sending every *mutation* anyway (not every page), thoughts? If anyone runs into the same problem, I got the node started again by simply removing the HintsColumnFamily* files. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3622) clean up openbitset
[ https://issues.apache.org/jira/browse/CASSANDRA-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169052#comment-13169052 ] Hudson commented on CASSANDRA-3622: --- Integrated in Cassandra #1255 (See [https://builds.apache.org/job/Cassandra/1255/]) clean up OpenBitSet patch by jbellis; reviewed by slebresne for CASSANDRA-3622 jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1214034 Files : * /cassandra/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java * /cassandra/trunk/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java clean up openbitset --- Key: CASSANDRA-3622 URL: https://issues.apache.org/jira/browse/CASSANDRA-3622 Project: Cassandra Issue Type: Task Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Priority: Minor Fix For: 1.1 Attachments: 3622-v2.txt, 3622.txt Our OpenBitSet no longer supports expanding the set post-construction. Should update documentation to reflect that. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-3624) Hinted Handoff - related OOM
[ https://issues.apache.org/jira/browse/CASSANDRA-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-3624: -- Attachment: 3624.txt Patch to add back adaptive page sizing, dropping the default page size to 128 columns. Hinted Handoff - related OOM Key: CASSANDRA-3624 URL: https://issues.apache.org/jira/browse/CASSANDRA-3624 Project: Cassandra Issue Type: Bug Affects Versions: 1.0.0 Reporter: Marcus Eriksson Labels: hintedhandoff Fix For: 1.0.7 Attachments: 3624.txt One of our nodes had collected a lot of hints for another node, so when the dead node came back and the row mutations were read back from disk, the node died with an OOM exception (and kept dying after restart, even with increased heap (from 8G to 12G)). The heap dump contained a lot of SuperColumns and our application does not use those (but HH does). I'm guessing that each mutation is big, so that PAGE_SIZE*mutation_size does not fit in memory (will check this tomorrow). A simple fix (if my assumption above is correct) would be to reduce the PAGE_SIZE in HintedHandOffManager.java to something like 10 (or even 1?) to reduce the memory pressure. The performance hit would be small since we are doing the hinted handoff throttle delay sleep before sending every *mutation* anyway (not every page), thoughts? If anyone runs into the same problem, I got the node started again by simply removing the HintsColumnFamily* files. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
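A sketch of what adaptive sizing can look like (the actual change is in the attached 3624.txt; the memory budget constant here is an assumption, not the patch's value): shrink the page when the mean hint is large, so a full page stays within a fixed memory budget.

{code}
// Sketch: derive the hint page size from the mean mutation size.
int calculatePageSize(long meanHintSizeBytes)
{
    final int MAX_PAGE_SIZE = 128;                   // the new default column count
    final long PAGE_MEMORY_BUDGET = 4 * 1024 * 1024; // assumed ~4MB resident per page
    int bySize = (int) (PAGE_MEMORY_BUDGET / Math.max(1, meanHintSizeBytes));
    return Math.max(1, Math.min(MAX_PAGE_SIZE, bySize)); // never below 1 column per page
}
{code}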
[jira] [Commented] (CASSANDRA-3624) Hinted Handoff - related OOM
[ https://issues.apache.org/jira/browse/CASSANDRA-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169056#comment-13169056 ] Jonathan Ellis commented on CASSANDRA-3624: --- bq. The performance hit would be small since we are doing the hinted handoff throttle delay sleep before sending every mutation anyway True, but this is likely to change (see Jake's comments on CASSANDRA-3554). Hinted Handoff - related OOM Key: CASSANDRA-3624 URL: https://issues.apache.org/jira/browse/CASSANDRA-3624 Project: Cassandra Issue Type: Bug Affects Versions: 1.0.0 Reporter: Marcus Eriksson Assignee: Jonathan Ellis Labels: hintedhandoff Fix For: 1.0.7 Attachments: 3624.txt One of our nodes had collected a lot of hints for another node, so when the dead node came back and the row mutations were read back from disk, the node died with an OOM exception (and kept dying after restart, even with increased heap (from 8G to 12G)). The heap dump contained a lot of SuperColumns and our application does not use those (but HH does). I'm guessing that each mutation is big, so that PAGE_SIZE*mutation_size does not fit in memory (will check this tomorrow). A simple fix (if my assumption above is correct) would be to reduce the PAGE_SIZE in HintedHandOffManager.java to something like 10 (or even 1?) to reduce the memory pressure. The performance hit would be small since we are doing the hinted handoff throttle delay sleep before sending every *mutation* anyway (not every page), thoughts? If anyone runs into the same problem, I got the node started again by simply removing the HintsColumnFamily* files. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3625) Do something about DynamicCompositeType
[ https://issues.apache.org/jira/browse/CASSANDRA-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169057#comment-13169057 ] Ed Anuff commented on CASSANDRA-3625: - Each component in a composite consists of a type (either an alias byte or a Cassandra comparator type name) and the value. I'm suggesting doing the compare on the type in the case of the types not being equivalent. The comparison could be a lexical compare or a hash comparison. I think doing the compare on the component type is better, since the purpose of the composite is for slices, and if we do a lexical compare of the component values then the slices are going to have weird results in the middle of them. For example, a row that had dynamic composite columns (ed,5), (jonathan,6), and (103,32), that was sliced from (ed) to (jonathan), could have the (103,32) in the middle. If we compare on the type, that never happens. Do something about DynamicCompositeType --- Key: CASSANDRA-3625 URL: https://issues.apache.org/jira/browse/CASSANDRA-3625 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Sylvain Lebresne Currently, DynamicCompositeType is a super dangerous type. We cannot leave it that way or people will get hurt. Let's recall that DynamicCompositeType allows composite column names without any limitation on what each component type can be. It was added to basically allow different rows of the same column family to each store a different index. So for instance you would have: {noformat} index1: { bar:24 -> someval bar:42 -> someval foo:12 -> someval ... } index2: { 0:uuid1:3.2 -> someval 1:uuid2:2.2 -> someval ... } {noformat} where index1, index2, ... are rows. So each row has columns whose names have a similar structure (so they can be compared), but between rows the structure can be different (we never compare two columns from two different rows). But the problem is the following: what happens if in the index1 row above, you insert a column whose name is 0:uuid1? There is no really meaningful way to compare bar:24 and 0:uuid1. The current implementation of DynamicCompositeType, when confronted with this, says that it is a user error and throws a MarshalException. The problem with that is that the exception is not thrown at insert time, and it *cannot* be, because of the dynamic nature of the comparator. But that means that if you do insert the wrong column in the wrong row, you end up *corrupting* an sstable. That is too dangerous a behavior. And it's probably made worse by the fact that some people probably think that DynamicCompositeType should be superior to CompositeType since, you know, it's dynamic. One solution to that problem could be to decide on some random (but predictable) order between two incomparable components. For example we could decide that IntType < LongType < StringType < ... Note that even if we do that, I would suggest renaming DynamicCompositeType to something that suggests that CompositeType is always preferable to DynamicCompositeType unless you're really doing very advanced stuff. Opinions? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
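A sketch of the type-first comparison Ed describes (AbstractType is the real comparator base class; the method itself is illustrative, not proposed code): components of different types never interleave, so a slice from (ed) to (jonathan) cannot pick up (103,32).

{code}
// Sketch: order components by type first; compare values only for equal types.
int compareComponents(AbstractType leftType, ByteBuffer left,
                      AbstractType rightType, ByteBuffer right)
{
    // the "type key" could equally be the registered alias byte or a classname hash
    int cmp = leftType.getClass().getName().compareTo(rightType.getClass().getName());
    if (cmp != 0)
        return cmp;                       // different types sort apart, deterministically
    return leftType.compare(left, right); // same type: the normal typed comparison
}
{code}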
[jira] [Commented] (CASSANDRA-3625) Do something about DynamicCompositeType
[ https://issues.apache.org/jira/browse/CASSANDRA-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169060#comment-13169060 ] Jonathan Ellis commented on CASSANDRA-3625: --- bq. For example, a row that had dynamic composite columns (ed,5), (jonathan,6), and (103, 32), that was sliced from (ed) to (jonathan) could have the (103, 32) in the middle Right, but I thought we were positing that You Shouldn't Do That. In which case as long as it doesn't crash, I'm good. :)
[jira] [Commented] (CASSANDRA-3625) Do something about DynamicCompositeType
[ https://issues.apache.org/jira/browse/CASSANDRA-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169068#comment-13169068 ] Ed Anuff commented on CASSANDRA-3625: - I'm not positing that at all; I can think of a number of good reasons why it can happen and is even desirable. I'd really strongly urge we do the compare on the component type. I don't think the fix is any more complicated, and it will be much preferable from a data modelling standpoint.
[wiki.cassandra-jdbc] push by - Edited wiki page HowToBuild through web user interface. on 2011-12-13 23:50 GMT
Revision: 5280e68bfdf5 Author: john.eric.evans john.eric.ev...@gmail.com Date: Tue Dec 13 15:50:25 2011 Log: Edited wiki page HowToBuild through web user interface. http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=5280e68bfdf5&repo=wiki Modified: /HowToBuild.wiki === --- /HowToBuild.wiki Tue Dec 13 15:48:41 2011 +++ /HowToBuild.wiki Tue Dec 13 15:50:25 2011 @@ -1,3 +1,4 @@ +#labels Featured #Maven, FML. = Building =
[wiki.cassandra-jdbc] push by - Some build doc. on 2011-12-13 23:48 GMT
Revision: a8f3cd03dba3 Author: john.eric.evans john.eric.ev...@gmail.com Date: Tue Dec 13 15:48:41 2011 Log: Some build doc. http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=a8f3cd03dba3&repo=wiki Added: /HowToBuild.wiki === --- /dev/null +++ /HowToBuild.wiki Tue Dec 13 15:48:41 2011 @@ -0,0 +1,49 @@ +#Maven, FML. + += Building = + +== Satisfying Dependencies == + +The JDBC driver has a dependency on two [http://cassandra.apache.org Cassandra] jars, `cassandra-clientutil` and `cassandra-thrift`, neither of which will be available through a Maven repository until the release of Cassandra 1.1.0. In the meantime you must ~~shave a yak~~ satisfy this dependency manually. + +First, download the source and build the jar artifacts. + +{{{ +$ svn checkout https://svn.apache.org/repos/asf/cassandra/trunk cassandra +$ cd cassandra +$ ant jar +}}} + +When complete, install the artifacts to `~/.m2`: + +{{{ +mvn install:install-file -DgroupId=org.apache.cassandra \ +-DartifactId=cassandra-clientutil -Dversion=1.1-dev-SNAPSHOT -Dpackaging=jar \ +-Dfile=build/apache-cassandra-clientutil-1.1-dev-SNAPSHOT.jar +... +mvn install:install-file -DgroupId=org.apache.cassandra \ +-DartifactId=cassandra-thrift -Dversion=1.1-dev-SNAPSHOT -Dpackaging=jar \ +-Dfile=build/apache-cassandra-thrift-1.1-dev-SNAPSHOT.jar +}}} + +== Building == + +[http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/checkout Check out the source] and build with either Maven: + +{{{ +$ mvn compile +}}} + +Or ant: + +{{{ +$ ant +}}} + +== IDE == + +To generate project files for [http://www.eclipse.org/ Eclipse]: + +{{{ +$ mvn eclipse:eclipse +}}}
[cassandra-jdbc] 2 new revisions pushed by john.eri...@gmail.com on 2011-12-13 23:29 GMT
2 new revisions: Revision: 5ec85ae43461 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 12:57:26 2011 Log: add dependency on thrift (temporary?) http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=5ec85ae43461 Revision: c80adc5f4bd2 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 15:21:15 2011 Log: IN (...) is broken and requires an aliased key... http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=c80adc5f4bd2 == Revision: 5ec85ae43461 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 12:57:26 2011 Log: add dependency on thrift (temporary?) http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=5ec85ae43461 Modified: /pom.xml === --- /pom.xml Mon Nov 7 15:59:43 2011 +++ /pom.xml Tue Dec 13 12:57:26 2011 @@ -132,6 +132,11 @@ <version>1.6.1</version> <scope>test</scope> </dependency> +<dependency> + <groupId>org.apache.thrift</groupId> + <artifactId>libthrift</artifactId> + <version>0.6.1</version> +</dependency> </dependencies> <build> == Revision: c80adc5f4bd2 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 15:21:15 2011 Log: IN (...) is broken and requires an aliased key See https://issues.apache.org/jira/browse/CASSANDRA-3627 http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=c80adc5f4bd2 Modified: /src/test/java/org/apache/cassandra/cql/JdbcDriverTest.java /src/test/java/org/apache/cassandra/cql/Schema.java /src/test/java/org/apache/cassandra/cql/jdbc/PreparedStatementTest.java === --- /src/test/java/org/apache/cassandra/cql/JdbcDriverTest.java Thu Oct 13 01:56:33 2011 +++ /src/test/java/org/apache/cassandra/cql/JdbcDriverTest.java Tue Dec 13 15:21:15 2011 @@ -67,8 +67,9 @@ String[] inserts = { String.format("UPDATE Standard1 SET '%s' = '%s', '%s' = '%s' WHERE KEY = '%s'", first, firstrec, last, lastrec, jsmith), -"UPDATE JdbcInteger SET 1 = 11, 2 = 22, 42='fortytwo' WHERE KEY = '" + jsmith + "'", -"UPDATE JdbcInteger SET 3 = 33, 4 = 44 WHERE KEY = '" + jsmith + "'", +"UPDATE JdbcInteger0 SET 1 = 11, 2 = 22, 42='fortytwo' WHERE KEY = '" + jsmith + "'", +"UPDATE JdbcInteger0 SET 3 = 33, 4 = 44 WHERE KEY = '" + jsmith + "'", +"UPDATE JdbcInteger1 SET 1 = 'One', 2 = 'Two', 3 = 'Three' WHERE id = rowOne", "UPDATE JdbcLong SET 1 = 11, 2 = 22 WHERE KEY = '" + jsmith + "'", "UPDATE JdbcAscii SET 'first' = 'firstrec', last = 'lastrec' WHERE key = '" + jsmith + "'", String.format("UPDATE JdbcBytes SET '%s' = '%s', '%s' = '%s' WHERE key = '%s'", first, firstrec, last, lastrec, jsmith), @@ -133,8 +134,8 @@ { String key = bytesToHex("Integer".getBytes()); Statement stmt = con.createStatement(); -stmt.executeUpdate("update JdbcInteger set 1=36893488147419103232, 42='fortytwofortytwo' where key='" + key + "'"); -ResultSet rs = stmt.executeQuery("select 1, 2, 42 from JdbcInteger where key='" + key + "'"); +stmt.executeUpdate("update JdbcInteger0 set 1=36893488147419103232, 42='fortytwofortytwo' where key='" + key + "'"); +ResultSet rs = stmt.executeQuery("select 1, 2, 42 from JdbcInteger0 where key='" + key + "'"); assert rs.next(); assert rs.getObject(1).equals(new BigInteger("36893488147419103232")); assert rs.getString(42).equals("fortytwofortytwo") : rs.getString(42); @@ -145,7 +146,7 @@ expectedMetaData(md, 2, BigInteger.class.getName(), "JdbcInteger", Schema.KEYSPACE_NAME, 2, Types.BIGINT, JdbcInteger.class.getSimpleName(), true, false); expectedMetaData(md, 3, String.class.getName(), "JdbcInteger", Schema.KEYSPACE_NAME, 42, Types.VARCHAR, JdbcUTF8.class.getSimpleName(), false, true); -rs = stmt.executeQuery("select key, 1, 2, 42 from JdbcInteger where key='" + key + "'"); +rs = stmt.executeQuery("select key, 1, 2, 42 from JdbcInteger0 where key='" + key + "'"); assert rs.next(); assert Arrays.equals(rs.getBytes("key"), hexToBytes(key)); assert rs.getObject(1).equals(new BigInteger("36893488147419103232")); @@ -281,13 +282,13 @@ { Statement stmt = con.createStatement(); List<String> keys = Arrays.asList(jsmith); -String selectQ = "SELECT 1, 2 FROM JdbcInteger WHERE KEY='" + jsmith + "'"; +String selectQ = "SELECT 1, 2 FROM JdbcInteger0 WHERE KEY='" + jsmith + "'"; checkResultSet(stmt.executeQuery(selectQ), "Int", 1, keys, 1, 2); -selectQ = "SELECT 3, 4 FROM JdbcInteger WHERE KEY='" + jsmith + "'"; +selectQ = "SELECT 3, 4 FROM JdbcInteger0 WHERE KEY='" + jsmith + "'";
[cassandra-jdbc] 3 new revisions pushed by john.eri...@gmail.com on 2011-12-13 23:26 GMT
3 new revisions: Revision: 92cb0506c77b Author: Eric Evans e...@acunu.com Date: Tue Dec 13 12:57:26 2011 Log: add dependency on thrift (temporary?) http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=92cb0506c77b Revision: 93551543de06 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 15:20:46 2011 Log: do not hard code host/port http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=93551543de06 Revision: 6acadeb166f9 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 15:21:15 2011 Log: IN (...) is broken and requires an aliased key... http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=6acadeb166f9 == Revision: 92cb0506c77b Author: Eric Evans e...@acunu.com Date: Tue Dec 13 12:57:26 2011 Log: add dependency on thrift (temporary?) http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=92cb0506c77b Modified: /pom.xml === --- /pom.xml Thu Dec 1 10:36:57 2011 +++ /pom.xml Tue Dec 13 12:57:26 2011 @@ -132,6 +132,11 @@ <version>1.6.1</version> <scope>test</scope> </dependency> +<dependency> + <groupId>org.apache.thrift</groupId> + <artifactId>libthrift</artifactId> + <version>0.6.1</version> +</dependency> </dependencies> <build> == Revision: 93551543de06 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 15:20:46 2011 Log: do not hard code host/port http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=93551543de06 Modified: /src/test/java/org/apache/cassandra/cql/jdbc/PreparedStatementTest.java === --- /src/test/java/org/apache/cassandra/cql/jdbc/PreparedStatementTest.java Thu Dec 1 10:38:15 2011 +++ /src/test/java/org/apache/cassandra/cql/jdbc/PreparedStatementTest.java Tue Dec 13 15:20:46 2011 @@ -38,17 +38,17 @@ public class PreparedStatementTest { private static java.sql.Connection con = null; - -//private static final Schema schema = new Schema(ConnectionDetails.getHost(), ConnectionDetails.getPort()); -private static final Schema schema = new Schema("localhost", 9160); +private static final Schema schema = new Schema(ConnectionDetails.getHost(), ConnectionDetails.getPort()); @BeforeClass public static void waxOn() throws Exception { schema.createSchema(); Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver"); -con = DriverManager.getConnection(String.format("jdbc:cassandra://%s:%d/%s", ConnectionDetails.getHost(), ConnectionDetails.getPort(), Schema.KEYSPACE_NAME)); -//con = DriverManager.getConnection(String.format("jdbc:cassandra://%s:%d/%s", "localhost", 9160, Schema.KEYSPACE_NAME)); +con = DriverManager.getConnection(String.format("jdbc:cassandra://%s:%d/%s", + ConnectionDetails.getHost(), + ConnectionDetails.getPort(), + Schema.KEYSPACE_NAME)); } @Test == Revision: 6acadeb166f9 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 15:21:15 2011 Log: IN (...) is broken and requires an aliased key See https://issues.apache.org/jira/browse/CASSANDRA-3627 http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=6acadeb166f9 Modified: /src/test/java/org/apache/cassandra/cql/JdbcDriverTest.java /src/test/java/org/apache/cassandra/cql/Schema.java /src/test/java/org/apache/cassandra/cql/jdbc/PreparedStatementTest.java === --- /src/test/java/org/apache/cassandra/cql/JdbcDriverTest.java Thu Oct 13 01:56:33 2011 +++ /src/test/java/org/apache/cassandra/cql/JdbcDriverTest.java Tue Dec 13 15:21:15 2011 @@ -67,8 +67,9 @@ String[] inserts = { String.format("UPDATE Standard1 SET '%s' = '%s', '%s' = '%s' WHERE KEY = '%s'", first, firstrec, last, lastrec, jsmith), -"UPDATE JdbcInteger SET 1 = 11, 2 = 22, 42='fortytwo' WHERE KEY = '" + jsmith + "'", -"UPDATE JdbcInteger SET 3 = 33, 4 = 44 WHERE KEY = '" + jsmith + "'", +"UPDATE JdbcInteger0 SET 1 = 11, 2 = 22, 42='fortytwo' WHERE KEY = '" + jsmith + "'", +"UPDATE JdbcInteger0 SET 3 = 33, 4 = 44 WHERE KEY = '" + jsmith + "'", +"UPDATE JdbcInteger1 SET 1 = 'One', 2 = 'Two', 3 = 'Three' WHERE id = rowOne", "UPDATE JdbcLong SET 1 = 11, 2 = 22 WHERE KEY = '" + jsmith + "'", "UPDATE JdbcAscii SET 'first' = 'firstrec', last = 'lastrec' WHERE key = '" + jsmith + "'", String.format("UPDATE JdbcBytes SET '%s' =
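The limitation these commits work around can also be seen from the driver side. The snippet below is a hedged illustration only (the keyspace name, host, and row values are hypothetical): per the commit message and CASSANDRA-3627, an IN (...) clause works against a key that has been given an alias in the schema (here "id", as the JdbcInteger1 test table sets up), not against the default KEY.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hedged illustration of the IN (...) limitation noted in CASSANDRA-3627:
// the IN clause is issued against an aliased key column ("id"), not KEY.
public class InQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver");
        Connection con = DriverManager.getConnection("jdbc:cassandra://localhost:9160/TestKS");
        try {
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery(
                "SELECT 1, 2, 3 FROM JdbcInteger1 WHERE id IN ('rowOne', 'rowTwo')");
            while (rs.next())
                System.out.println(rs.getString(1)); // first selected column
        } finally {
            con.close();
        }
    }
}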
[jira] [Commented] (CASSANDRA-3625) Do something about DynamicCompositeType
[ https://issues.apache.org/jira/browse/CASSANDRA-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169082#comment-13169082 ] Matt Stump commented on CASSANDRA-3625: --- Until a long-term solution is found, would it be possible to get something into the documentation warning people away from DynamicCompositeType? It was featured rather prominently in Ed's talk, so people may mistakenly believe that DynamicCompositeType is the preferred method to create dynamic indexes.
[jira] [Commented] (CASSANDRA-3621) nodetool is trying to contact old ip address
[ https://issues.apache.org/jira/browse/CASSANDRA-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169137#comment-13169137 ] Zenek Kraweznik commented on CASSANDRA-3621: I've restored a backup on the test cluster, so the hostnames must change. nodetool is trying to contact old ip address Key: CASSANDRA-3621 URL: https://issues.apache.org/jira/browse/CASSANDRA-3621 Project: Cassandra Issue Type: Bug Affects Versions: 0.8.8 Environment: java 1.6.26, linux Reporter: Zenek Kraweznik My Cassandra used to have addresses in the 10.0.1.0/24 network; I moved it to the 10.0.2.0/24 network (for security reasons). I want to test the new Cassandra before upgrading the production instances. I've made a snapshot and moved it to the test servers (except the system/LocationInfo* files). Changes in configuration: IP addresses (seeds, listen address, etc.) and cluster name. The test servers are in the 10.0.1.0/24 network. In the logs I see that the test nodes are seeing each other, but when I try to show the ring I get this error: casstest1:/# nodetool -h 10.0.1.211 ring Error connection to remote JMX agent! java.rmi.ConnectIOException: Exception creating connection to: 10.1.0.201; nested exception is: java.net.NoRouteToHostException: No route to host at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:614) at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:198) at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184) at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:110) at javax.management.remote.rmi.RMIServerImpl_Stub.newClient(Unknown Source) at javax.management.remote.rmi.RMIConnector.getConnection(RMIConnector.java:2329) at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:279) at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248) at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:140) at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:110) at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:582) Caused by: java.net.NoRouteToHostException: No route to host at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351) at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213) at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366) at java.net.Socket.connect(Socket.java:529) at java.net.Socket.connect(Socket.java:478) at java.net.Socket.<init>(Socket.java:375) at java.net.Socket.<init>(Socket.java:189) at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:22) at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:128) at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:595) ... 10 more casstest1:/# The old production addresses in 10.0.1.0/24 were: 10.0.1.201, 10.0.1.202, 10.0.1.203. The new addresses for the tests: 10.0.1.211, 10.0.1.212, 10.0.1.213.
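The NoRouteToHostException above is characteristic of JMX over RMI, which the sketch below illustrates (a simplified stand-in, not the actual NodeProbe code): nodetool's initial registry lookup does go to the -h host, but the RMIServer stub the registry hands back embeds the address the server JVM advertised when it started (its resolved hostname, or java.rmi.server.hostname if set), so a restored node that still advertises the old production address redirects the second hop there.

import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Simplified version of what `nodetool -h 10.0.1.211 ring` does. The first
// hop goes to the address in the URL below; the second hop goes to whatever
// address the server's RMI stub embeds, which is where the stale address
// in the trace above comes from.
public class JmxConnectSketch {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://10.0.1.211:7199/jmxrmi"); // 7199: default JMX port
        JMXConnector jmxc = JMXConnectorFactory.connect(url, null);
        try {
            System.out.println(jmxc.getMBeanServerConnection().getDefaultDomain());
        } finally {
            jmxc.close();
        }
    }
}

Setting java.rmi.server.hostname to the node's current address in the server JVM's startup options is the usual way to pin that second hop.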
[jira] [Created] (CASSANDRA-3629) Bootstrapping nodes don't ensure schema is ready before continuing
Bootstrapping nodes don't ensure schema is ready before continuing -- Key: CASSANDRA-3629 URL: https://issues.apache.org/jira/browse/CASSANDRA-3629 Project: Cassandra Issue Type: Bug Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Fix For: 1.0.7 A bootstrapping node will assume that after it has slept for RING_DELAY it has received all of the schema migrations and can continue the bootstrap process. However, with a large enough number of migrations this is not sufficient and causes problems.
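A sketch of the direction a fix could take (illustrative only; the ClusterView interface and method names are hypothetical, and this is not the committed patch): rather than trusting a fixed RING_DELAY sleep, poll the gossiped schema versions until all live endpoints converge on a single version, then let the bootstrap proceed.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: wait for schema agreement instead of a fixed sleep.
public class SchemaWaitSketch {
    // Maps each distinct schema version to the endpoints reporting it,
    // in the spirit of Thrift's describe_schema_versions.
    interface ClusterView { Map<String, Set<String>> schemaVersions(); }

    static void awaitSchemaAgreement(ClusterView view, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            // a single distinct schema version across endpoints == agreement
            if (view.schemaVersions().keySet().size() <= 1)
                return;
            TimeUnit.SECONDS.sleep(1); // re-check as migrations arrive
        }
        throw new IllegalStateException("schema did not settle before bootstrap");
    }
}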
[jira] [Issue Comment Edited] (CASSANDRA-3621) nodetool is trying to contact old ip address
[ https://issues.apache.org/jira/browse/CASSANDRA-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169137#comment-13169137 ] Zenek Kraweznik edited comment on CASSANDRA-3621 at 12/14/11 7:40 AM: -- I've restored a backup on the test cluster, so the hostnames must change. And I've never used hostnames in the configuration. What is the meaning of hostname here? was (Author: zenek_kraweznik0): I've restored backup on test cluster, so hostnames must change.