[jira] [Updated] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-3143:
---------------------------------------
    Attachment: (was: 0001-global-key-cache.patch)

Global caches (key/row)
-----------------------
             Key: CASSANDRA-3143
             URL: https://issues.apache.org/jira/browse/CASSANDRA-3143
         Project: Cassandra
      Issue Type: Improvement
        Reporter: Pavel Yaskevich
        Assignee: Pavel Yaskevich
        Priority: Minor
          Labels: Core
         Fix For: 1.1

Caches are difficult to configure well as ColumnFamilies are added, similar to how memtables were difficult pre-CASSANDRA-2006.
[jira] [Updated] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-3143:
---------------------------------------
    Attachment: (was: 0003-CacheServiceMBean-and-correct-key-cache-loading.patch)
[jira] [Updated] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-3143:
---------------------------------------
    Attachment: (was: 0005-cleanup-of-the-CFMetaData-and-thrift-avro-CfDef-and-.patch)
[jira] [Updated] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-3143:
---------------------------------------
    Attachment: (was: 0004-key-row-cache-tests-and-tweaks.patch)
[jira] [Updated] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-3143:
---------------------------------------
    Attachment: (was: 0002-global-row-cache-and-ASC.readSaved-changed-to-abstra.patch)
[jira] [Updated] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-3143:
---------------------------------------
    Attachment: (was: 0006-row-key-cache-improvements-according-to-Sylvain-s-co.patch)
[jira] [Updated] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-3143:
---------------------------------------
    Attachment: 0006-row-key-cache-improvements-according-to-Sylvain-s-co.patch
                0005-cleanup-of-the-CFMetaData-and-thrift-avro-CfDef-and-.patch
                0004-key-row-cache-tests-and-tweaks.patch
                0003-CacheServiceMBean-and-correct-key-cache-loading.patch
                0002-global-row-cache-and-ASC.readSaved-changed-to-abstra.patch
                0001-global-key-cache.patch

Rebased with the latest trunk (last commit 58518301472fc99b01cfd4bcf90bf81b5f0694ee).
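Taken together, the patch titles (a global key cache, a global row cache, a CacheServiceMBean) suggest the direction of the change: one node-wide cache of each kind instead of one per ColumnFamily, so capacity is configured once rather than re-tuned every time a ColumnFamily is added. As a rough illustration only, a global key cache could be keyed by a (cfId, rowKey) pair along the following lines; the class and method names are hypothetical and eviction is elided, so this shows the shape of the idea, not the patch's actual API.

{code:java}
import java.nio.ByteBuffer;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative only: one node-wide key cache shared by every ColumnFamily,
// keyed by (cfId, rowKey), so a single capacity setting covers all CFs.
public final class GlobalKeyCache<V>
{
    static final class CacheKey
    {
        final int cfId;          // which ColumnFamily the entry belongs to
        final ByteBuffer rowKey; // the row key within that CF

        CacheKey(int cfId, ByteBuffer rowKey)
        {
            this.cfId = cfId;
            this.rowKey = rowKey;
        }

        @Override
        public boolean equals(Object o)
        {
            return o instanceof CacheKey
                   && cfId == ((CacheKey) o).cfId
                   && rowKey.equals(((CacheKey) o).rowKey);
        }

        @Override
        public int hashCode()
        {
            return Objects.hash(cfId, rowKey);
        }
    }

    // A real implementation would bound this map and evict entries;
    // that bookkeeping is elided here.
    private final ConcurrentMap<CacheKey, V> entries = new ConcurrentHashMap<>();

    public V get(int cfId, ByteBuffer rowKey)
    {
        return entries.get(new CacheKey(cfId, rowKey));
    }

    public void put(int cfId, ByteBuffer rowKey, V value)
    {
        entries.put(new CacheKey(cfId, rowKey), value);
    }
}
{code}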
[jira] [Created] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
-----------------------------------------------------------------------------------------------
                 Key: CASSANDRA-3620
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
    Affects Versions: 1.0.5
            Reporter: Dominic Williams
             Fix For: 1.1

Here is a proposal for an improved system for handling distributed deletes.

*** The Problem ***

Repair has issues:
-- Repair is expensive anyway
-- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc)
-- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
-- When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
-- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

*** Proposed Reaper Model ***

1. Tombstones do not expire, and there is no GCSeconds.
2. Tombstones have associated ACK lists, which record the replicas that have acknowledged them.
3. Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas.
4. If a cf/key/name is deleted and there is a preexisting tombstone, its ACK list is simply reset.
5. Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs.

A number of systems could be used to maintain synchronization while nodes are added/removed; these can be discussed in a separate Jira.

** Advantages **

-- The labour/administration overhead associated with running repair will be removed
-- The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
-- There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc)
-- Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically*
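The numbered model above maps naturally onto a small data structure. The following Java sketch is purely illustrative (the AckedTombstone name, identifying replicas by InetAddress, and the in-memory concurrent ACK set are all assumptions of this sketch, not anything in Cassandra or the ticket): a tombstone that never expires on its own, accumulates replica ACKs, and becomes purgeable only when its ACK list covers the full replica set.

{code:java}
import java.net.InetAddress;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of steps 1-4 of the Reaper Model; not Cassandra's
// actual tombstone representation.
public final class AckedTombstone
{
    private final long markedForDeleteAt; // deletion timestamp, as today

    // Step 2: the ACK list - replicas that have confirmed this tombstone.
    private final Set<InetAddress> acks = ConcurrentHashMap.newKeySet();

    public AckedTombstone(long markedForDeleteAt)
    {
        this.markedForDeleteAt = markedForDeleteAt;
    }

    public void ack(InetAddress replica)
    {
        acks.add(replica);
    }

    // Step 3: purgeable only once *all* replicas have acknowledged,
    // instead of after a fixed GCSeconds has elapsed (step 1).
    public boolean isPurgeable(Set<InetAddress> allReplicas)
    {
        return acks.containsAll(allReplicas);
    }

    // Step 4: a fresh delete of the same cf/key/name resets the ACK list.
    public void resetAcks()
    {
        acks.clear();
    }

    public long timestamp()
    {
        return markedForDeleteAt;
    }
}
{code}

Under this scheme a tombstone's lifetime is bounded by replica acknowledgement rather than by wall-clock time, which is exactly what removes the GCSeconds deadline described above.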
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2The Problem/h2 Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. *** Proposed Reaper Model *** 1. Tombstones do not expire, and there is no GCSeconds. 2. Tombstones have associated ACK lists, which record the replicas that have acknowledged them 3. Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. 4. If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset 5. Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira ** Advantages ** -- The labour/administration overhead associated with running repair will be removed -- The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair -- There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) -- Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. 
*** The Problem *** Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. *** Proposed Reaper Model *** 1. Tombstones do not expire, and there is no GCSeconds. 2. Tombstones have associated ACK lists, which record the replicas
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. *The Problem* Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. *** Proposed Reaper Model *** 1. Tombstones do not expire, and there is no GCSeconds. 2. Tombstones have associated ACK lists, which record the replicas that have acknowledged them 3. Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. 4. If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset 5. Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira ** Advantages ** -- The labour/administration overhead associated with running repair will be removed -- The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair -- There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) -- Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. 
h2The Problem/h2 Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. *** Proposed Reaper Model *** 1. Tombstones do not expire, and there is no GCSeconds. 2. Tombstones have associated ACK lists, which record the replicas that
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. # If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset # Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Advantages -- The labour/administration overhead associated with running repair will be removed -- The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair -- There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) -- Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. h2. 
The Problem Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Proposed Reaper Model 1. Tombstones do not expire, and there is no GCSeconds. 2. Tombstones have associated ACK lists, which record the replicas that have
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Proposed Reaper Model 1. Tombstones do not expire, and there is no GCSeconds. 2. Tombstones have associated ACK lists, which record the replicas that have acknowledged them 3. Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. 4. If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset 5. Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Advantages -- The labour/administration overhead associated with running repair will be removed -- The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair -- There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) -- Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. 
*The Problem* Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. *** Proposed Reaper Model *** 1. Tombstones do not expire, and there is no GCSeconds. 2. Tombstones have associated ACK lists, which record the replicas that have
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. # If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset # Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Advantages * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. h2. 
The Problem Repair has issues: -- Repair is expensive anyway -- Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) -- Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment -- When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear -- If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. # If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset # Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Advantages * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. h2. 
The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. # If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset # Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Advantages * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. h2. 
The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair before GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse where you have lots of column families or where you have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. # If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset # Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Benefits * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. h2. 
The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. # If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset # Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Benefits * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the average number of tombstones databases carry will improve performance, sometimes *dramatically* was: Here is a proposal for an improved system for handling distributed deletes. h2. 
The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair with the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds. # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas. # If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset # Background reaper threads constantly stream ACK requests and ACKs from other replicas and deletes tombstones that have received all their ACKs A number of systems could be used to maintain synchronization while nodes are added/removed that can be discussed in separate Jira h3. Benefits * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically was: Here is a proposal for an improved system for handling distributed deletes. h2. 
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas.
# If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
[jira] [Commented] (CASSANDRA-3589) Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x
[ https://issues.apache.org/jira/browse/CASSANDRA-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168321#comment-13168321 ] Samarth Gahire commented on CASSANDRA-3589:
---
No, I do not have any secondary indexes on any of the column families, and I have done a fair comparison and seen some performance hit in the sstable-loader utility.

Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x
---
Key: CASSANDRA-3589
URL: https://issues.apache.org/jira/browse/CASSANDRA-3589
Project: Cassandra
Issue Type: Bug
Components: Tools
Affects Versions: 1.0.0
Reporter: Samarth Gahire
Assignee: Sylvain Lebresne
Priority: Minor

We are using the sstable-generation API and the sstable-loader utility. As soon as a newer version of Cassandra is released, I test it for sstable generation and loading, comparing the time taken by both processes. Up to Cassandra 0.8.7 there was no significant change in time taken, but in all cassandra-1.0.x releases I have seen 3-4 times degraded performance in generation and 2 times degraded performance in loading. Because of this we are not upgrading Cassandra to the latest version: we process terabytes of data every day, so the time taken is very important.
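For anyone wanting to reproduce the comparison, the sstable-generation API being timed here is presumably org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter. A rough timing harness might look like the following; the keyspace, column family, path and row count are made up, and the six-argument constructor is the one from the 1.0-era bulk-loading example, so verify it against the exact version under test.

{code:java}
// Rough harness for timing sstable generation across versions (illustrative).
import java.io.File;
import java.io.IOException;

import org.apache.cassandra.db.marshal.AsciiType;
import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;

import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

public class GenerationTimer
{
    public static void main(String[] args) throws IOException
    {
        File dir = new File("/tmp/sstables/Keyspace1/Standard1"); // must already exist
        SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                dir, "Keyspace1", "Standard1", AsciiType.instance, null, 64); // 64 MB buffer

        long start = System.currentTimeMillis();
        long timestamp = start * 1000; // microsecond column timestamps
        for (int i = 0; i < 1000000; i++)
        {
            writer.newRow(bytes("key" + i));
            writer.addColumn(bytes("col"), bytes("value" + i), timestamp);
        }
        writer.close(); // flushes the final buffer

        System.out.println("generation took " + (System.currentTimeMillis() - start) + " ms");
    }
}
{code}

Running the same harness against 0.8.7 and a 1.0.x build on the same hardware would help confirm, and localize, the 3-4x regression described.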
[jira] [Created] (CASSANDRA-3621) nodetool is trying to contact old ip address
nodetool is trying to contact old ip address
---
Key: CASSANDRA-3621
URL: https://issues.apache.org/jira/browse/CASSANDRA-3621
Project: Cassandra
Issue Type: Bug
Affects Versions: 0.8.8
Environment: java 1.6.26, linux
Reporter: Zenek Kraweznik

My Cassandra used to have addresses in 10.0.1.0/24; I moved it to the 10.0.2.0/24 network (for security reasons). I want to test the new Cassandra before upgrading the production instances. I've made a snapshot and moved it to test servers (except the system/LocationInfo* files). Changes in configuration: IP addresses (seeds, listen address, etc.) and cluster name. The test servers are in the 10.0.1.0/24 network. In the logs I see that the test nodes are seeing each other, but when I try to show the ring I get this error:

casstest1:/# nodetool -h 10.0.1.211 ring
Error connection to remote JMX agent!
java.rmi.ConnectIOException: Exception creating connection to: 10.1.0.201; nested exception is:
java.net.NoRouteToHostException: No route to host
at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:614)
at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:198)
at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:110)
at javax.management.remote.rmi.RMIServerImpl_Stub.newClient(Unknown Source)
at javax.management.remote.rmi.RMIConnector.getConnection(RMIConnector.java:2329)
at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:279)
at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:140)
at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:110)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:582)
Caused by: java.net.NoRouteToHostException: No route to host
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at java.net.Socket.connect(Socket.java:478)
at java.net.Socket.<init>(Socket.java:375)
at java.net.Socket.<init>(Socket.java:189)
at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:22)
at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:128)
at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:595)
... 10 more
casstest1:/#

Old production addresses in 10.0.1.0/24 were: 10.0.1.201, 10.0.1.202, 10.0.1.203
New addresses for tests: 10.0.1.211, 10.0.1.212, 10.0.1.213
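This looks like standard JMX/RMI two-hop behaviour rather than Cassandra reading old ring state: nodetool's first hop goes to the host given on the command line, but the RMI registry there hands back a stub containing whatever address the server JVM believes it has (taken from -Djava.rmi.server.hostname if set, otherwise the node's resolved hostname), and the second hop then targets that stale address. A simplified sketch of the connection nodetool makes (7199 is the default JMX port in 0.8):

{code:java}
// Simplified version of the connection NodeProbe makes; the failure above
// happens inside connect(), on the second hop to the stub's embedded address.
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxProbe
{
    public static void main(String[] args) throws Exception
    {
        String host = args.length > 0 ? args[0] : "10.0.1.211";
        int port = 7199; // default JMX port in Cassandra 0.8

        JMXServiceURL url = new JMXServiceURL(
                String.format("service:jmx:rmi:///jndi/rmi://%s:%d/jmxrmi", host, port));
        JMXConnector jmxc = JMXConnectorFactory.connect(url, null);
        System.out.println("connected: " + jmxc.getConnectionId());
        jmxc.close();
    }
}
{code}

If that is what is happening here, setting -Djava.rmi.server.hostname=10.0.1.211 (and so on, per node) in cassandra-env.sh, or fixing hostname resolution so each test node resolves to its new address, should make nodetool work. That is a suggestion based on RMI's documented behaviour, not something confirmed in the ticket.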
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas
# New tombstones replace old tombstones and always start with an empty ACK list
# Upon deletion, a tombstone is written to a relic list, which is scavenged according to some configurable period, thereby allowing deleted tombstones to still be acknowledged (the writer acknowledges this has some of the drawbacks of GCSeconds; a sketch of such a relic list follows this update)
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
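A minimal sketch of the relic list mentioned above, assuming relics are keyed by some tombstone identifier and scavenged by age; the class and its retention model are illustrative, not part of the proposal text.

{code:java}
// Hypothetical relic list: remembers deleted tombstones for a configurable
// period so that late ACK traffic can still be answered.
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RelicList
{
    private final long retentionMillis; // the configurable scavenge period
    private final Map<String, Long> relics = new ConcurrentHashMap<String, Long>();

    public RelicList(long retentionMillis)
    {
        this.retentionMillis = retentionMillis;
    }

    // Called when a fully-acknowledged tombstone is deleted.
    public void add(String tombstoneId)
    {
        relics.put(tombstoneId, System.currentTimeMillis());
    }

    // Lets a reaper keep acknowledging a tombstone it has already deleted.
    public boolean contains(String tombstoneId)
    {
        return relics.containsKey(tombstoneId);
    }

    // Run periodically: drop entries older than the retention period. This is
    // exactly where the GCSeconds-like drawback the author concedes re-enters.
    public void scavenge()
    {
        long cutoff = System.currentTimeMillis() - retentionMillis;
        for (Iterator<Long> it = relics.values().iterator(); it.hasNext();)
        {
            if (it.next() < cutoff)
                it.remove();
        }
    }
}
{code}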
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas
# If a cf/key/name is deleted, and there is a preexisting tombstone, its ACK list is simply reset
# Upon deletion, a tombstone is written to a relic list, which is scavenged according to some configurable period, thereby allowing deleted tombstones to still be acknowledged (the writer acknowledges this has some of the drawbacks of GCSeconds)
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK (sketched in code after this update)

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
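The missing-tombstone rule above is the part that lets a replica which never saw the original delete converge anyway. A sketch, reusing the hypothetical AckedTombstone from earlier; the handler shape and identifiers are assumptions:

{code:java}
// Hypothetical handler for an incoming ACK request (rule: a request to ACK a
// tombstone we have never seen re-creates it, records the requestor's ACK,
// and replies with our own ACK).
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReaperAckHandler
{
    private final Map<String, AckedTombstone> tombstones = new ConcurrentHashMap<String, AckedTombstone>();
    private final InetAddress localAddress;

    public ReaperAckHandler(InetAddress localAddress)
    {
        this.localAddress = localAddress;
    }

    // Returns the address to send back as this replica's ACK.
    public InetAddress handleAckRequest(String tombstoneId, long deleteTimestamp, InetAddress requestor)
    {
        AckedTombstone t = tombstones.get(tombstoneId);
        if (t == null)
        {
            // We missed the original delete (e.g. we were down): adopt it now.
            t = new AckedTombstone(deleteTimestamp);
            tombstones.put(tombstoneId, t);
        }
        t.ack(requestor); // the requestor has evidently seen the tombstone
        return localAddress;
    }
}
{code}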
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas
# New tombstones replace old tombstones and always start with an empty ACK list
# Upon deletion, a tombstone is written to a relic list/index, which is scavenged according to some configurable period, thereby allowing deleted tombstones to still be acknowledged (the writer acknowledges this has some of the drawbacks of GCSeconds)
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas
# Upon deletion, a tombstone is written to a super fast relic index, which is scavenged according to some configurable period, thereby allowing deleted tombstones to still be acknowledged (the writer acknowledges this has some of the drawbacks of GCSeconds)
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas
# New tombstones replace old tombstones and always start with an empty ACK list
# Upon deletion, a tombstone is written to a super fast relic index, which is scavenged according to some configurable period, thereby allowing deleted tombstones to still be acknowledged (the writer acknowledges this has some of the drawbacks of GCSeconds)
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas
# Upon deletion, a tombstone is written to a super fast relic index, which is scavenged according to some configurable period, thereby allowing deleted tombstones to still be acknowledged (the writer acknowledges this has some of the drawbacks of GCSeconds)
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas
# Upon deletion, a tombstone is written to a super fast relic index, which is scavenged according to some configurable period, thereby allowing deleted tombstones to still be acknowledged (this relic index might simply contain MD5 hashes of cf-k-n(-sn)-acks)
# Background reaper threads constantly stream ACK requests and ACKs from other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK

A number of systems could be used to maintain synchronization while cluster nodes are added/removed.

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when
## They have been acknowledged by all replicas
## All replicas have acknowledged receiving all acknowledgements
# Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received
# Once a tombstone has been acknowledged by all replicas, after a configurable period, the reaper asks the replicas to acknowledge that the others have received all their acknowledgements (this two-phase rule is sketched in code after this update)
## If a node is down or otherwise can't reply, this is retried after a back-off period
## If a node is asked to fully acknowledge a tombstone, and it is not ready to do so, it may try to receive outstanding acknowledgements so that it can reply with an ACK
# If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK

h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
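A sketch of the two-phase rule in this revision: phase one collects ACKs, phase two collects confirmations that every replica has received all ACKs, and only then is the tombstone reapable. Everything here (names, the phase enum) is illustrative:

{code:java}
// Hypothetical two-phase tombstone state, per the revised rules above.
import java.net.InetAddress;
import java.util.HashSet;
import java.util.Set;

public class TwoPhaseTombstone
{
    public enum Phase { COLLECTING_ACKS, COLLECTING_FULL_ACKS, REAPABLE }

    private final Set<InetAddress> acks = new HashSet<InetAddress>();      // phase 1
    private final Set<InetAddress> fullAcks = new HashSet<InetAddress>();  // phase 2

    // Phase 1: a replica acknowledges the tombstone itself.
    public synchronized Phase ack(InetAddress replica, Set<InetAddress> replicas)
    {
        acks.add(replica);
        return phase(replicas);
    }

    // Phase 2: a replica confirms it has received all acknowledgements. Only
    // counted once phase 1 is complete; otherwise the caller retries after a
    // back-off, as the proposal describes.
    public synchronized Phase fullAck(InetAddress replica, Set<InetAddress> replicas)
    {
        if (acks.containsAll(replicas))
            fullAcks.add(replica);
        return phase(replicas);
    }

    public synchronized Phase phase(Set<InetAddress> replicas)
    {
        if (fullAcks.containsAll(replicas))
            return Phase.REAPABLE;
        if (acks.containsAll(replicas))
            return Phase.COLLECTING_FULL_ACKS;
        return Phase.COLLECTING_ACKS;
    }
}
{code}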
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when
## They have been acknowledged by all replicas
## All replicas have acknowledged receiving all acknowledgements
# Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received
# Once a tombstone has been acknowledged by all replicas, after a configurable period, the reaper asks the replicas to acknowledge that the others have received all their acknowledgements
## If a node is down or otherwise can't reply, this is retried after a back-off period
## If a node is asked to fully acknowledge a tombstone, and it is not ready to do so, it may try to receive outstanding acknowledgements so that it can reply with an ACK
# When a tombstone is deleted, it is added to a fast relic index, comprised of MD5 hashes calculated from cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge that it has received all acknowledgements after it has deleted a tombstone (a sketch of the key derivation follows this update)
# The relic index is scavenged according to some configurable period
# If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK
h3. Benefits

* The labour/administration overhead associated with running repair will be removed
* The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically
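One way the relic-index key derivation described above could look; the separator and encoding are assumptions, and the ACK list hashed here is the complete acknowledgement list at the time the tombstone was deleted:

{code:java}
// Hypothetical relic-index key: an MD5 hash over cf-key-name[-subName]-ackList.
import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class RelicKey
{
    // subName is null for standard (non-super) column families.
    public static byte[] of(String cf, String key, String name, String subName, String ackList)
    {
        String canonical = cf + "-" + key + "-" + name
                         + (subName == null ? "" : "-" + subName)
                         + "-" + ackList;
        try
        {
            return MessageDigest.getInstance("MD5").digest(canonical.getBytes("UTF-8"));
        }
        catch (NoSuchAlgorithmException e)
        {
            throw new AssertionError(e); // MD5 is mandated by the JRE spec
        }
        catch (UnsupportedEncodingException e)
        {
            throw new AssertionError(e); // as is UTF-8
        }
    }
}
{code}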
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620:

Description: Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:
* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear
* If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.

h2. Reaper Model Proposal

# Tombstones do not expire, and there is no GCSeconds.
# Tombstones have associated ACK lists, which record the replicas that have acknowledged them
# Tombstones are only deleted (or marked for compaction) when
## They have been acknowledged by all replicas
## All replicas have acknowledged receiving all acknowledgements
# Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance; a sketch of such a throttled reaper follows this update)
# Once a tombstone has been acknowledged by all replicas, after a configurable period, the reaper asks the replicas to acknowledge that the others have received all their acknowledgements
## If a node is down or otherwise can't reply, this is retried after a back-off period
## If a node is asked to fully acknowledge a tombstone, and it is not ready to do so, it may try to receive outstanding acknowledgements so that it can reply with an ACK
# When a tombstone is deleted, it is added to a fast relic index, comprised of MD5 hashes calculated from cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge that it has received all acknowledgements after it has deleted a tombstone
# The relic index is scavenged according to some configurable period
# If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK
Benefits * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load so you don't impact your system. This isn't great, and it is made
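To make the two-phase deletion rule in the update above concrete, here is a minimal sketch of the per-tombstone bookkeeping it implies. Everything below is illustrative only and assumes hypothetical types (none of this is existing Cassandra code): a tombstone tracks which replicas have ACKed it and, separately, which replicas have confirmed receiving everyone's ACKs; it becomes deletable only when both sets cover the full replica set.
{code:java}
import java.net.InetAddress;
import java.util.HashSet;
import java.util.Set;

// Hypothetical per-tombstone bookkeeping for the two-phase ACK scheme.
class TombstoneAckState
{
    private final Set<InetAddress> replicas;                   // full replica set for the row
    private final Set<InetAddress> acks = new HashSet<>();     // phase 1: replicas that ACKed the tombstone
    private final Set<InetAddress> fullAcks = new HashSet<>(); // phase 2: replicas that ACKed receiving all ACKs

    TombstoneAckState(Set<InetAddress> replicas)
    {
        this.replicas = replicas;
    }

    void recordAck(InetAddress replica) { acks.add(replica); }
    void recordFullAck(InetAddress replica) { fullAcks.add(replica); }

    // Phase 1 complete: every replica has acknowledged the tombstone itself.
    boolean fullyAcknowledged() { return acks.containsAll(replicas); }

    // Phase 2 complete: every replica has also acknowledged receiving all
    // acknowledgements, so the tombstone may be deleted or marked for compaction.
    boolean deletable() { return fullyAcknowledged() && fullAcks.containsAll(replicas); }
}
{code}
The second phase is what lets a replica that has already deleted the tombstone still answer full-ACK queries, via the relic index the proposal describes.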
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # The relic index is scavenged according to some configurable period # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after the configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas (which is no big deal since this does not corrupt data)
h3. Benefits * The labour/administration overhead associated with running repair will be removed * The reapers can utilize spare cycles and run constantly to prevent the load spikes and performance issues associated with repair * There will no longer be the risk of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Reducing the average number of tombstones databases carry will improve performance, sometimes dramatically was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't
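This revision keys the relic index by MD5 hashes of cf-key-name[-subName]-ackList. The proposal does not pin down the exact byte encoding, so the sketch below simply joins the components with '-' before hashing; the helper name and delimiter are assumptions of mine, not part of the proposal.
{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative derivation of a relic-index key from its components.
final class RelicKey
{
    static byte[] of(String cf, String key, String name, String subName, String ackList)
    {
        StringBuilder sb = new StringBuilder();
        sb.append(cf).append('-').append(key).append('-').append(name);
        if (subName != null)
            sb.append('-').append(subName); // subName is optional, per cf-key-name[-subName]-ackList
        sb.append('-').append(ackList);
        try
        {
            return MessageDigest.getInstance("MD5")
                                .digest(sb.toString().getBytes(StandardCharsets.UTF_8));
        }
        catch (NoSuchAlgorithmException e)
        {
            throw new AssertionError("MD5 is available in every standard JRE", e);
        }
    }
}
{code}
Including the ACK list in the hashed material is what lets a node that has deleted a tombstone still recognise, and vouch for, a specific fully-acknowledged state.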
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this is the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't impact your system. This
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. Some prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't great, and it is made worse if you have lots of column families or have to run a low GCSeconds on a column family to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, you are going to hit problems, and this can feel like the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Therefore ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments you often cannot make repair a cron job. You prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load so you don't
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. Some prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't great, and it is made worse when there are lots of column families or it is necessary to run a column family with a low GCSeconds to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, or increase GCSeconds, you are going to lose deletes and this can feel like the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Therefore ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with having to run repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance
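The proposal has reaper threads streaming ACK traffic continuously while throttling CPU and bandwidth. A very rough shape of such a loop is sketched below; AckQueue, AckRequest and the fixed sleep-based throttle are placeholders of my own, not anything specified by the ticket.
{code:java}
import java.util.concurrent.TimeUnit;

// Rough shape of a background reaper: drain pending ACK requests at a
// bounded rate so the work never competes with foreground load.
class ReaperThread extends Thread
{
    interface AckQueue { AckRequest take() throws InterruptedException; }
    static final class AckRequest { /* tombstone coordinates elided */ }

    private final AckQueue pending; // hypothetical queue of tombstones awaiting ACKs
    private final long pauseMillis; // crude rate limit between sends

    ReaperThread(AckQueue pending, long pauseMillis)
    {
        this.pending = pending;
        this.pauseMillis = pauseMillis;
        setDaemon(true);
        setName("tombstone-reaper");
    }

    @Override
    public void run()
    {
        try
        {
            while (!isInterrupted())
            {
                AckRequest req = pending.take();          // blocks until there is work
                send(req);                                // stream the ACK request to a replica
                TimeUnit.MILLISECONDS.sleep(pauseMillis); // throttle CPU/bandwidth usage
            }
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt(); // shut down quietly
        }
    }

    private void send(AckRequest req) { /* network send elided */ }
}
{code}
A production version would presumably throttle by bytes sent rather than a fixed pause, but the point is that the work is incremental and interruptible, unlike a repair session.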
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with having to run repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. Some prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't great, and it is made worse when there are lots of column families or it is necessary to run a column family with a low GCSeconds to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, or increase GCSeconds, you are going to lose deletes and this can feel like the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Therefore ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem Repair has issues: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair can often itself fail and need restarting, especially in cloud environments where a network issue might make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds, tombstones can start overloading your system Because of the foregoing, in high throughput environments it can be very difficult to make repair
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. Some prefer to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't great, and it is made worse when there are lots of column families or it is necessary to run a column family with a low GCSeconds to reduce tombstone load. You know that if you don't manage to run repair within the GCSeconds window, or increase GCSeconds, you are going to lose deletes and this can feel like the Sword of Damocles over your head. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Therefore ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't desirable, and the problem is made worse when there are lots of column families in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone load. The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid losing delete operations. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. Therefore ideally there should be no ongoing requirement to run repair to avoid data loss, and no GCSeconds. Repair should be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't desirable, and the problem is made worse when there are lots of column families in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone load. The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid losing delete operations. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. It would be much better if there were no ongoing requirement to run repair to avoid data loss (or rather the potential for data to reappear), and no GCSeconds window. Ideally repair would be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause data loss, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of data loss if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, data written to a node that did not receive a copy of a delete operation (because for example it was down) can reappear * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't desirable, and problems are exacerbated when there are lots of column families in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone load (because there are many write/deletes to that column family). The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid potentially losing delete operations. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. It would be much better if there were no ongoing requirement to run repair to ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause corruption, it can be kept small, or even kept in memory * Simple to implement and predictable
h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of corruption if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, deleted data can reappear * If you cannot run repair and have to
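The notes in this revision argue that early eviction from the relic index is harmless (the worst case is a tombstone being re-created), which is exactly the property that lets the index be size-bounded and memory-resident. Below is a minimal sketch under that assumption, using a plain LRU map keyed by the hex relic hash; the capacity constant is invented for illustration:
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Size-bounded, in-memory relic index: key is the hex MD5 relic hash,
// value is the creation timestamp (usable for periodic scavenging).
// Evicting the eldest entry early is safe by the proposal's own argument.
class RelicIndex extends LinkedHashMap<String, Long>
{
    private static final int RELIC_CAPACITY = 1_000_000; // assumed tunable

    RelicIndex()
    {
        super(16, 0.75f, true); // access-order iteration, i.e. LRU
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Long> eldest)
    {
        return size() > RELIC_CAPACITY;
    }
}
{code}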
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either by error or because of issues with Cassandra, data written to a node that did not see a later delete can reappear (and a node might miss a delete for several reasons including being down or simply dropping requests during load shedding) * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't desirable, and problems are exacerbated when there are lots of column families in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone load (because there are many write/deletes to that column family). The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid potentially losing delete operations. Running repair to deal with missing writes isn't so important, since QUORUM reads will always receive data successfully written with QUORUM. It would be much better if there were no ongoing requirement to run repair to ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList.
The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause corruption, it can be kept small, or even kept in memory * Simple to implement and predictable h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of corruption if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance was: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either because you are dumb or because of issues with Cassandra, data
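The last numbered rule, that an ACK request for an unknown tombstone re-creates it, records the requestor's ACK and replies with an ACK, is small enough to sketch directly. The store and tombstone types here are hypothetical stand-ins, not Cassandra classes:
{code:java}
import java.net.InetAddress;

// Sketch of: "If a reaper receives a request to ACK a tombstone that does
// not exist, it creates the tombstone and adds an ACK for the requestor,
// and replies with an ACK." All types are illustrative placeholders.
final class AckRequestHandler
{
    interface Tombstone { void recordAck(InetAddress replica); }

    interface TombstoneStore
    {
        Tombstone get(String id);
        Tombstone create(String id);
    }

    private final TombstoneStore store;

    AckRequestHandler(TombstoneStore store) { this.store = store; }

    /** Always replies with an ACK (returns true). */
    boolean handle(String tombstoneId, InetAddress requestor)
    {
        Tombstone t = store.get(tombstoneId);
        if (t == null)
            t = store.create(tombstoneId); // re-create the missing tombstone
        t.recordAck(requestor);            // the requestor has evidently seen it
        return true;                       // reply with an ACK
    }
}
{code}
Re-creating rather than rejecting is what makes the protocol converge after a node has scavenged its relic index: the cost is only a transiently resurrected tombstone, never resurrected data.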
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Williams updated CASSANDRA-3620: Description: Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either by error or because of issues with Cassandra, data written to a node that did not see a later delete can reappear (and a node might miss a delete for several reasons including being down or simply dropping requests during load shedding) * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping and eye on overall load to reduce system impact. This isn't desirable, and problems are exacerbated when there are lots of column families in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone load (because there are many write/deletes to that column family). The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid potentially losing delete operations. It would be much better if there was no ongoing requirement to run repair to ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream back ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index do not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause corruption, it can be kept small, or even kept in memory * Simple to implement and predictable h3. 
Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour and administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in the background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of corruption if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance
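For concreteness, the relic-index bookkeeping the proposal describes might look something like the following minimal sketch. All names here are hypothetical; only the idea of storing MD5 hashes of cf-key-name[-subName]-ackList comes from the proposal itself:
{noformat}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

// Toy relic index from the proposal: once a fully-acknowledged tombstone is
// deleted, remember only an MD5 of cf-key-name[-subName]-ackList so a late
// reaper ACK request can still be answered without recreating the tombstone.
public class RelicIndex
{
    private final Set<String> relics = new HashSet<>();

    public void remember(String cf, String key, String name, String ackList) throws Exception
    {
        relics.add(md5(cf + ":" + key + ":" + name + ":" + ackList));
    }

    public boolean wasDeleted(String cf, String key, String name, String ackList) throws Exception
    {
        // entries may be evicted early; a miss only means the tombstone gets
        // recreated and re-acknowledged, which the proposal notes is harmless
        return relics.contains(md5(cf + ":" + key + ":" + name + ":" + ackList));
    }

    private static String md5(String s) throws Exception
    {
        byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : d)
            sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
{noformat}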
[jira] [Commented] (CASSANDRA-3621) nodetool is trying to contact old ip address
[ https://issues.apache.org/jira/browse/CASSANDRA-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168460#comment-13168460 ] Brandon Williams commented on CASSANDRA-3621: - You most likely have a hostname resolution problem where the system's hostname still resolves to the old IP. nodetool is trying to contact old ip address Key: CASSANDRA-3621 URL: https://issues.apache.org/jira/browse/CASSANDRA-3621 Project: Cassandra Issue Type: Bug Affects Versions: 0.8.8 Environment: java 1.6.26, linux Reporter: Zenek Kraweznik My Cassandra used to have addresses in 10.0.1.0/24; I moved it to the 10.0.2.0/24 network (for security reasons). I want to test the new Cassandra before upgrading the production instances. I made a snapshot and moved it to the test servers (except the system/LocationInfo* files). Changes in configuration: IP addresses (seeds, listen address etc), cluster name. The test servers are in the 10.0.1.0/24 network. In the logs I see that the test nodes are seeing each other, but when I try to show the ring I get this error: casstest1:/# nodetool -h 10.0.1.211 ring Error connection to remote JMX agent! java.rmi.ConnectIOException: Exception creating connection to: 10.1.0.201; nested exception is: java.net.NoRouteToHostException: No route to host at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:614) at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:198) at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184) at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:110) at javax.management.remote.rmi.RMIServerImpl_Stub.newClient(Unknown Source) at javax.management.remote.rmi.RMIConnector.getConnection(RMIConnector.java:2329) at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:279) at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248) at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:140) at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:110) at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:582) Caused by: java.net.NoRouteToHostException: No route to host at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351) at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213) at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366) at java.net.Socket.connect(Socket.java:529) at java.net.Socket.connect(Socket.java:478) at java.net.Socket.<init>(Socket.java:375) at java.net.Socket.<init>(Socket.java:189) at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:22) at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:128) at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:595) ... 10 more casstest1:/# Old production addresses in 10.0.1.0/24 were: 10.0.1.201, 10.0.1.202, 10.0.1.203 New addresses for tests: 10.0.1.211, 10.0.1.212, 10.0.1.213 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
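Brandon's hypothesis is easy to verify: the RMI stub advertises whatever address local hostname resolution produces, and nodetool then tries to connect to that address. A minimal check (illustrative code, not part of Cassandra):
{noformat}
import java.net.InetAddress;

// Prints what the JVM thinks the local hostname resolves to. If this shows
// the old 10.x address, JMX/RMI will hand that address to nodetool,
// producing exactly the NoRouteToHostException above.
public class HostnameCheck
{
    public static void main(String[] args) throws Exception
    {
        InetAddress local = InetAddress.getLocalHost();
        System.out.println(local.getHostName() + " -> " + local.getHostAddress());
        // A common workaround is to pin the address RMI advertises, e.g.:
        //   -Djava.rmi.server.hostname=10.0.1.211
    }
}
{noformat}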
[jira] [Commented] (CASSANDRA-3589) Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x
[ https://issues.apache.org/jira/browse/CASSANDRA-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168467#comment-13168467 ] Jonathan Ellis commented on CASSANDRA-3589: --- Have you been able to benchmark Sylvain's patch? Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x --- Key: CASSANDRA-3589 URL: https://issues.apache.org/jira/browse/CASSANDRA-3589 Project: Cassandra Issue Type: Bug Components: Tools Affects Versions: 1.0.0 Reporter: Samarth Gahire Assignee: Sylvain Lebresne Priority: Minor We are using the sstable-generation API and the sstable-loader utility. As soon as a newer version of Cassandra is released, I test it for the time taken by both sstable generation and loading. Up to Cassandra 0.8.7 there was no significant change in the time taken, but in all of cassandra-1.0.x I have seen 3-4 times degraded performance in generation and 2 times degraded performance in loading. Because of this we are not upgrading Cassandra to the latest version: since we are processing some terabytes of data every day, the time taken is very important. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-3620: -- Affects Version/s: (was: 1.0.5) Fix Version/s: (was: 1.1) At a high level, I think it's worth trying. One big drawback is making deletes O(N**2) expensive: N acks must be written to each of the N replicas. That's 81 writes for a single delete in a cluster with 9 total replicas across 3 DCs, which is not a hypothetical situation. Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs - Key: CASSANDRA-3620 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Dominic Williams Labels: GCSeconds, deletes, distributed_deletes, merkle_trees, repair Original Estimate: 504h Remaining Estimate: 504h Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either by error or because of issues with Cassandra, data written to a node that did not see a later delete can reappear (and a node might miss a delete for several reasons including being down or simply dropping requests during load shedding) * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't desirable, and problems are exacerbated when there are lots of column families in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone load (because there are many write/deletes to that column family). The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid potentially losing delete operations. It would be much better if there was no ongoing requirement to run repair to ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList.
The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back for requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone and adds an ACK for the requestor, and replies with an ACK NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal since this does not corrupt data) * Since early removal of entries in the relic index does not cause corruption, it can be kept small, or even kept in memory * Simple to implement and predictable h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour and administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in the background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of corruption if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number involved in query processing, will often dramatically improve performance
[jira] [Commented] (CASSANDRA-3511) Supercolumn key caches are not saved
[ https://issues.apache.org/jira/browse/CASSANDRA-3511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168480#comment-13168480 ] Radim Kolar commented on CASSANDRA-3511: This is also a cache save issue, because I have seen a case where, after loading a cache saved by 1.0.5, the cache is not saved anymore. I will attach another demonstration file. There are two problems: 1. The cache can be saved in an incorrect format (maybe truncated?). Save to a -tmp file and rename later? 2. Loading an incorrect cache save image causes the cache to never be saved again; the incorrect image is not overwritten by a good one. Add some kind of error check/checksum to the cache for detecting and rejecting incorrect cache save images during load. Supercolumn key caches are not saved Key: CASSANDRA-3511 URL: https://issues.apache.org/jira/browse/CASSANDRA-3511 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.0.2, 1.0.3 Reporter: Radim Kolar Priority: Minor Labels: supercolumns Attachments: rapidshare-resultcache-KeyCache Cache saving seems to be broken in 1.0.2 and 1.0.3. I have 2 CFs in a keyspace with cache saving enabled and only one gets its key cache saved. It worked perfectly in 0.8; both were saved. This one works: create column family query2 with column_type = 'Standard' and comparator = 'AsciiType' and default_validation_class = 'BytesType' and key_validation_class = 'UTF8Type' and rows_cached = 500.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 14400 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 5 and max_compaction_threshold = 10 and replicate_on_write = false and row_cache_provider = 'ConcurrentLinkedHashCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' This one does not: create column family dkb13 with column_type = 'Super' and comparator = 'LongType' and subcomparator = 'AsciiType' and default_validation_class = 'BytesType' and key_validation_class = 'UTF8Type' and rows_cached = 600.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 14400 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 5 and max_compaction_threshold = 10 and replicate_on_write = false and row_cache_provider = 'ConcurrentLinkedHashCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' On a second test system I created these 2 column families and neither of them got a single cache key saved. Both have a save period of 30 seconds, so their caches should save often. It's not that the standard column family works while the super one does not.
create column family test1 with column_type = 'Standard' and comparator = 'BytesType' and default_validation_class = 'BytesType' and key_validation_class = 'BytesType' and rows_cached = 0.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 30 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 4 and max_compaction_threshold = 32 and replicate_on_write = true and row_cache_provider = 'SerializingCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'; create column family test2 with column_type = 'Standard' and comparator = 'BytesType' and default_validation_class = 'BytesType' and key_validation_class = 'BytesType' and rows_cached = 0.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 30 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 4 and max_compaction_threshold = 32 and replicate_on_write = true and row_cache_provider = 'SerializingCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'; If this is done on purpose, for example because Cassandra 1.0 makes some heuristic decision about whether a cache should be saved, then that heuristic should be removed. Saving the cache is fast. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
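A minimal sketch of the two fixes suggested in the comment above - write to a -tmp file and rename into place, and checksum the image so a truncated save is rejected at load time. This is illustrative only, not Cassandra's actual cache-saving code:
{noformat}
import java.io.*;
import java.nio.file.*;
import java.util.zip.CRC32;

// Sketch: save-to-tmp-and-rename plus a trailing checksum, so a crash mid-save
// never clobbers a good image, and a corrupt image is rejected at load time.
public final class CacheImage
{
    public static void save(File target, byte[] image) throws IOException
    {
        File tmp = new File(target.getPath() + "-tmp");
        CRC32 crc = new CRC32();
        crc.update(image);
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(tmp)))
        {
            out.writeInt(image.length);
            out.write(image);
            out.writeLong(crc.getValue()); // checksum written last
        }
        // atomic replace: a crash before this point leaves the old image intact
        Files.move(tmp.toPath(), target.toPath(),
                   StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
    }

    public static byte[] load(File target) throws IOException
    {
        try (DataInputStream in = new DataInputStream(new FileInputStream(target)))
        {
            byte[] image = new byte[in.readInt()];
            in.readFully(image);
            CRC32 crc = new CRC32();
            crc.update(image);
            if (in.readLong() != crc.getValue())
                throw new IOException("corrupt cache image, ignoring: " + target);
            return image;
        }
    }
}
{noformat}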
[jira] [Updated] (CASSANDRA-3511) Supercolumn key caches are not saved
[ https://issues.apache.org/jira/browse/CASSANDRA-3511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated CASSANDRA-3511: --- Attachment: failed-to-save-after-load-KeyCache Supercolumn key caches are not saved Key: CASSANDRA-3511 URL: https://issues.apache.org/jira/browse/CASSANDRA-3511 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.0.2, 1.0.3 Reporter: Radim Kolar Priority: Minor Labels: supercolumns Attachments: failed-to-save-after-load-KeyCache, rapidshare-resultcache-KeyCache Cache saving seems to be broken in 1.0.2 and 1.0.3. I have 2 CFs in a keyspace with cache saving enabled and only one gets its key cache saved. It worked perfectly in 0.8; both were saved. This one works: create column family query2 with column_type = 'Standard' and comparator = 'AsciiType' and default_validation_class = 'BytesType' and key_validation_class = 'UTF8Type' and rows_cached = 500.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 14400 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 5 and max_compaction_threshold = 10 and replicate_on_write = false and row_cache_provider = 'ConcurrentLinkedHashCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' This one does not: create column family dkb13 with column_type = 'Super' and comparator = 'LongType' and subcomparator = 'AsciiType' and default_validation_class = 'BytesType' and key_validation_class = 'UTF8Type' and rows_cached = 600.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 14400 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 5 and max_compaction_threshold = 10 and replicate_on_write = false and row_cache_provider = 'ConcurrentLinkedHashCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' On a second test system I created these 2 column families and neither of them got a single cache key saved. Both have a save period of 30 seconds, so their caches should save often. It's not that the standard column family works while the super one does not.
create column family test1 with column_type = 'Standard' and comparator = 'BytesType' and default_validation_class = 'BytesType' and key_validation_class = 'BytesType' and rows_cached = 0.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 30 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 4 and max_compaction_threshold = 32 and replicate_on_write = true and row_cache_provider = 'SerializingCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'; create column family test2 with column_type = 'Standard' and comparator = 'BytesType' and default_validation_class = 'BytesType' and key_validation_class = 'BytesType' and rows_cached = 0.0 and row_cache_save_period = 0 and row_cache_keys_to_save = 2147483647 and keys_cached = 20.0 and key_cache_save_period = 30 and read_repair_chance = 1.0 and gc_grace = 864000 and min_compaction_threshold = 4 and max_compaction_threshold = 32 and replicate_on_write = true and row_cache_provider = 'SerializingCacheProvider' and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'; If this is done on purpose, for example because Cassandra 1.0 makes some heuristic decision about whether a cache should be saved, then that heuristic should be removed. Saving the cache is fast. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
svn commit: r1213775 - in /cassandra/branches/cassandra-1.0: CHANGES.txt src/java/org/apache/cassandra/utils/obs/OpenBitSet.java
Author: jbellis Date: Tue Dec 13 16:38:12 2011 New Revision: 1213775 URL: http://svn.apache.org/viewvc?rev=1213775&view=rev Log: more efficient allocation of small bloom filters patch by slebresne; reviewed by jbellis for CASSANDRA-3618 Modified: cassandra/branches/cassandra-1.0/CHANGES.txt cassandra/branches/cassandra-1.0/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java Modified: cassandra/branches/cassandra-1.0/CHANGES.txt URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-1.0/CHANGES.txt?rev=1213775&r1=1213774&r2=1213775&view=diff == --- cassandra/branches/cassandra-1.0/CHANGES.txt (original) +++ cassandra/branches/cassandra-1.0/CHANGES.txt Tue Dec 13 16:38:12 2011 @@ -1,5 +1,6 @@ 1.0.7 * fix assertion when dropping a columnfamily with no sstables (CASSANDRA-3614) + * more efficient allocation of small bloom filters (CASSANDRA-3618) 1.0.6 Modified: cassandra/branches/cassandra-1.0/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java URL: http://svn.apache.org/viewvc/cassandra/branches/cassandra-1.0/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java?rev=1213775&r1=1213774&r2=1213775&view=diff == --- cassandra/branches/cassandra-1.0/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java (original) +++ cassandra/branches/cassandra-1.0/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java Tue Dec 13 16:38:12 2011 @@ -76,6 +76,7 @@ Test system: AMD Opteron, 64 bit linux, public class OpenBitSet implements Serializable { protected long[][] bits; protected int wlen; // number of words (elements) used in the array + private final int pageCount; /** * length of bits[][] page in long[] elements. * Choosing uniform size for all sizes of bitsets fights fragmentation for very large @@ -95,13 +96,19 @@ public class OpenBitSet implements Seria public OpenBitSet(long numBits, boolean allocatePages) { wlen= bits2words(numBits); +int lastPageSize = wlen % PAGE_SIZE; +int fullPageCount = wlen / PAGE_SIZE; +pageCount = fullPageCount + (lastPageSize == 0 ? 0 : 1); -bits = new long[getPageCount()][]; - +bits = new long[pageCount][]; + if (allocatePages) { -for (int allocated=0,i=0;allocated<wlen;allocated+=PAGE_SIZE,i++) -bits[i]=new long[PAGE_SIZE]; +for (int i = 0; i < fullPageCount; ++i) +bits[i] = new long[PAGE_SIZE]; + +if (lastPageSize != 0) +bits[bits.length - 1] = new long[lastPageSize]; } } @@ -119,7 +126,7 @@ public class OpenBitSet implements Seria public int getPageCount() { - return wlen / PAGE_SIZE + 1; + return pageCount; } public long[] getPage(int pageIdx)
[jira] [Updated] (CASSANDRA-3618) OpenBitSet can allocate more bytes than it needs
[ https://issues.apache.org/jira/browse/CASSANDRA-3618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-3618: -- Reviewer: jbellis Affects Version/s: (was: 1.0.0) 1.0.1 Committed. (This affects 1.0.1+, introduced by CASSANDRA-2466.) OpenBitSet can allocate more bytes than it needs Key: CASSANDRA-3618 URL: https://issues.apache.org/jira/browse/CASSANDRA-3618 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.0.1 Reporter: Sylvain Lebresne Assignee: Sylvain Lebresne Fix For: 1.0.7 Attachments: 0001-Fix-openBitSet.patch CASSANDRA-2466 changed OpenBitSet to break big long arrays into pages. However, it always allocates full pages, each page being of size 4096 * 8 bytes. This means that we almost always allocate too many bytes, and for a row that has 1 column, the associated row bloom filter allocates 32760 more bytes than it should. This has a significant impact on performance. In a small test using the SSTableSimpleUnsortedWriter to generate rows with 1 column, 0.8 is about twice as fast as 1.0 because of that (the difference shrinks when there are more columns, obviously). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
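The 32760 figure follows directly from the page size quoted in the description (a sketch of the arithmetic; variable names are illustrative):
{noformat}
public class BloomWaste
{
    public static void main(String[] args)
    {
        int pageSizeLongs = 4096;                    // page length in longs, per OpenBitSet
        long pageBytes = pageSizeLongs * 8L;         // 32768 bytes per fully-allocated page
        long neededBytes = 8;                        // ~one long for a 1-column row's filter
        System.out.println(pageBytes - neededBytes); // 32760, the figure in the report
    }
}
{noformat}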
[jira] [Created] (CASSANDRA-3622) clean up openbitset
clean up openbitset --- Key: CASSANDRA-3622 URL: https://issues.apache.org/jira/browse/CASSANDRA-3622 Project: Cassandra Issue Type: Task Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Priority: Minor Fix For: 1.1 Our OpenBitSet no longer supports expanding the set post-construction. Should update documentation to reflect that. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-3622) clean up openbitset
[ https://issues.apache.org/jira/browse/CASSANDRA-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-3622: -- Attachment: 3622.txt Replaces get/set operations with fastGet/Set operations. Where an expanding method had no fast analogue, I removed it. (All such methods were unused.) clean up openbitset --- Key: CASSANDRA-3622 URL: https://issues.apache.org/jira/browse/CASSANDRA-3622 Project: Cassandra Issue Type: Task Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Priority: Minor Fix For: 1.1 Attachments: 3622.txt Our OpenBitSet no longer supports expanding the set post-construction. Should update documentation to reflect that. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3592) Major Compaction Incredibly Slow
[ https://issues.apache.org/jira/browse/CASSANDRA-3592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168529#comment-13168529 ] Dan Hendry commented on CASSANDRA-3592: --- I can give that a try, though I am a little confused about how it will help. CASSANDRA-3618 seems to be related to performance for column families with narrow (single column) rows. The compaction slowdown I am seeing is for CFs that are characterized by very wide rows (thousands to millions of columns per row). Major Compaction Incredibly Slow Key: CASSANDRA-3592 URL: https://issues.apache.org/jira/browse/CASSANDRA-3592 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.0.3 Environment: RHEL6 - 24 core machines 24 GB mem total, 11 GB java heap java version 1.6.0_26 6 node cluster (4@0.8.6, 2@1.0.3) Reporter: Dan Hendry Labels: compaction Twice now (on different nodes), I have observed major compaction for certain column families take *significantly* longer on 1.0.3 in comparison to 0.8.6. For example, On the 0.8.6 node, the post compaction log message: {noformat}CompactionManager.java (line 608) Compacted to XXX. 339,164,959,170 to 158,825,469,883 (~46% of original) bytes for 25,996 keys. Time: 26,934,317ms.{noformat} On the 1.0.3 node, the post compaction log message: {noformat} CompactionTask.java (line 213) Compacted to [XXX]. 222,338,354,529 to 147,751,403,084 (~66% of original) bytes for 26,100 keys at 0.562045MB/s. Time: 250,703,563ms.{noformat} So... literally an order of magnitude slower on 1.0.3 in comparison to 0.8.6. Relevant configuration settings: * compaction_throughput_mb_per_sec: 0 (why? because the compaction throttling logic as currently implemented is highly unsuitable for wide rows but that's a different issue) * in_memory_compaction_limit_in_mb: 128 Column family characteristics: * Many wide rows (~5% of rows greater than 10MB and hundreds of rows greater than 100 MB, with many small columns). * Heavy use of expiring columns - each row represents data for a particular hour so typically all columns in the row will expire together. * The significant size shrinkage as reported by the log messages is due mainly to expired data being cleaned up (I typically trigger major compaction when 30-50% of the on disk data has expired, which is about once every 3 weeks per node). * Perhaps obviously: size tiered compaction and no compression (the schema has not changed since the partial upgrade to 1.0.x) * Standard column family Performance notes during compaction: * Nice CPU usage and load average is basically the same between 0.8.6 and 1.0.3 - i.e., compaction IS running and is not getting stalled or hung up anywhere. * Compaction is IO bound on the 0.8.6 machines - the disks see heavy, constant utilization when compaction is running. * Compaction uses virtually no IO on the 1.0.3 machines - disk utilization is virtually no different when compacting vs not compacting (but at the same time, CPU usage and load average clearly indicate that compaction IS running).
Finally, I have not had time to profile more thoroughly, but jconsole always shows the following stacktrace for the active compaction thread (for the 1.0.3 machine): {noformat} Stack trace: org.apache.cassandra.db.ColumnFamilyStore.removeDeletedStandard(ColumnFamilyStore.java:851) org.apache.cassandra.db.ColumnFamilyStore.removeDeletedColumnsOnly(ColumnFamilyStore.java:835) org.apache.cassandra.db.ColumnFamilyStore.removeDeleted(ColumnFamilyStore.java:826) org.apache.cassandra.db.compaction.PrecompactedRow.removeDeletedAndOldShards(PrecompactedRow.java:77) org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:102) org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:133) org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:102) org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:87) org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:116) org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99) com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140) com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135) com.google.common.collect.Iterators$7.computeNext(Iterators.java:614) com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140) com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135) org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:172)
[jira] [Commented] (CASSANDRA-3592) Major Compaction Incredibly Slow
[ https://issues.apache.org/jira/browse/CASSANDRA-3592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168549#comment-13168549 ] Jonathan Ellis commented on CASSANDRA-3592: --- You're right, that's not likely to help. It sounded like such a good fit superficially! Major Compaction Incredibly Slow Key: CASSANDRA-3592 URL: https://issues.apache.org/jira/browse/CASSANDRA-3592 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.0.3 Environment: RHEL6 - 24 core machines 24 GB mem total, 11 GB java heap java version 1.6.0_26 6 node cluster (4@0.8.6, 2@1.0.3) Reporter: Dan Hendry Labels: compaction Twice now (on different nodes), I have observed major compaction for certain column families take *significantly* longer on 1.0.3 in comparison to 0.8.6. For example, On the 0.8.6 node, the post compaction log message: {noformat}CompactionManager.java (line 608) Compacted to XXX. 339,164,959,170 to 158,825,469,883 (~46% of original) bytes for 25,996 keys. Time: 26,934,317ms.{noformat} On the 1.0.3 node, the post compaction log message: {noformat} CompactionTask.java (line 213) Compacted to [XXX]. 222,338,354,529 to 147,751,403,084 (~66% of original) bytes for 26,100 keys at 0.562045MB/s. Time: 250,703,563ms.{noformat} So... literally an order of magnitude slower on 1.0.3 in comparison to 0.8.6. Relevant configuration settings: * compaction_throughput_mb_per_sec: 0 (why? because the compaction throttling logic as currently implemented is highly unsuitable for wide rows but that's a different issue) * in_memory_compaction_limit_in_mb: 128 Column family characteristics: * Many wide rows (~5% of rows greater than 10MB and hundreds of rows greater than 100 MB, with many small columns). * Heavy use of expiring columns - each row represents data for a particular hour so typically all columns in the row will expire together. * The significant size shrinkage as reported by the log messages is due mainly to expired data being cleaned up (I typically trigger major compaction when 30-50% of the on disk data has expired, which is about once every 3 weeks per node). * Perhaps obviously: size tiered compaction and no compression (the schema has not changed since the partial upgrade to 1.0.x) * Standard column family Performance notes during compaction: * Nice CPU usage and load average is basically the same between 0.8.6 and 1.0.3 - i.e., compaction IS running and is not getting stalled or hung up anywhere. * Compaction is IO bound on the 0.8.6 machines - the disks see heavy, constant utilization when compaction is running. * Compaction uses virtually no IO on the 1.0.3 machines - disk utilization is virtually no different when compacting vs not compacting (but at the same time, CPU usage and load average clearly indicate that compaction IS running).
Finally, I have not had time to profile more thoroughly, but jconsole always shows the following stacktrace for the active compaction thread (for the 1.0.3 machine): {noformat} Stack trace: org.apache.cassandra.db.ColumnFamilyStore.removeDeletedStandard(ColumnFamilyStore.java:851) org.apache.cassandra.db.ColumnFamilyStore.removeDeletedColumnsOnly(ColumnFamilyStore.java:835) org.apache.cassandra.db.ColumnFamilyStore.removeDeleted(ColumnFamilyStore.java:826) org.apache.cassandra.db.compaction.PrecompactedRow.removeDeletedAndOldShards(PrecompactedRow.java:77) org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:102) org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:133) org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:102) org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:87) org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:116) org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99) com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140) com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135) com.google.common.collect.Iterators$7.computeNext(Iterators.java:614) com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140) com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135) org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:172) org.apache.cassandra.db.compaction.CompactionManager$4.call(CompactionManager.java:277) java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) java.util.concurrent.FutureTask.run(FutureTask.java:138)
svn commit: r1213827 - in /cassandra/trunk: ./ contrib/ interface/thrift/gen-java/org/apache/cassandra/thrift/ src/java/org/apache/cassandra/utils/obs/
Author: slebresne Date: Tue Dec 13 18:25:48 2011 New Revision: 1213827 URL: http://svn.apache.org/viewvc?rev=1213827&view=rev Log: merge from 1.0 Modified: cassandra/trunk/ (props changed) cassandra/trunk/CHANGES.txt cassandra/trunk/contrib/ (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Column.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/InvalidRequestException.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/NotFoundException.java (props changed) cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/SuperColumn.java (props changed) cassandra/trunk/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java Propchange: cassandra/trunk/ -- --- svn:mergeinfo (original) +++ svn:mergeinfo Tue Dec 13 18:25:48 2011 @@ -4,7 +4,7 @@ /cassandra/branches/cassandra-0.8:1090934-1125013,1125019-1198724,1198726-1206097,1206099-1211976 /cassandra/branches/cassandra-0.8.0:1125021-1130369 /cassandra/branches/cassandra-0.8.1:1101014-1125018 -/cassandra/branches/cassandra-1.0:1167085-1211978,1212284 +/cassandra/branches/cassandra-1.0:1167085-1211978,1212284,1213775 /cassandra/branches/cassandra-1.0.0:1167104-1167229,1167232-1181093,1181741,1181816,1181820,1182951,1183243 /cassandra/tags/cassandra-0.7.0-rc3:1051699-1053689 /cassandra/tags/cassandra-0.8.0-rc1:1102511-1125020 Modified: cassandra/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/cassandra/trunk/CHANGES.txt?rev=1213827&r1=1213826&r2=1213827&view=diff == --- cassandra/trunk/CHANGES.txt (original) +++ cassandra/trunk/CHANGES.txt Tue Dec 13 18:25:48 2011 @@ -24,6 +24,7 @@ * Remove columns shadowed by a deleted container even when we cannot purge (CASSANDRA-3538) * Improve memtable slice iteration performance (CASSANDRA-3545) + * more efficient allocation of small bloom filters (CASSANDRA-3618) 1.0.6 Propchange: cassandra/trunk/contrib/ -- --- svn:mergeinfo (original) +++ svn:mergeinfo Tue Dec 13 18:25:48 2011 @@ -4,7 +4,7 @@ /cassandra/branches/cassandra-0.8/contrib:1090934-1125013,1125019-1198724,1198726-1206097,1206099-1211976 /cassandra/branches/cassandra-0.8.0/contrib:1125021-1130369 /cassandra/branches/cassandra-0.8.1/contrib:1101014-1125018 -/cassandra/branches/cassandra-1.0/contrib:1167085-1211978,1212284 +/cassandra/branches/cassandra-1.0/contrib:1167085-1211978,1212284,1213775 /cassandra/branches/cassandra-1.0.0/contrib:1167104-1167229,1167232-1181093,1181741,1181816,1181820,1182951,1183243 /cassandra/tags/cassandra-0.7.0-rc3/contrib:1051699-1053689 /cassandra/tags/cassandra-0.8.0-rc1/contrib:1102511-1125020 Propchange: cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java -- --- svn:mergeinfo (original) +++ svn:mergeinfo Tue Dec 13 18:25:48 2011 @@ -4,7 +4,7 @@ /cassandra/branches/cassandra-0.8/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1090934-1125013,1125019-1198724,1198726-1206097,1206099-1211976 /cassandra/branches/cassandra-0.8.0/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1125021-1130369 /cassandra/branches/cassandra-0.8.1/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1101014-1125018 -/cassandra/branches/cassandra-1.0/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1167085-1211978,1212284
+/cassandra/branches/cassandra-1.0/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1167085-1211978,1212284,1213775 /cassandra/branches/cassandra-1.0.0/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1167104-1167229,1167232-1181093,1181741,1181816,1181820,1182951,1183243 /cassandra/tags/cassandra-0.7.0-rc3/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1051699-1053689 /cassandra/tags/cassandra-0.8.0-rc1/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java:1102511-1125020 Propchange: cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Column.java -- --- svn:mergeinfo (original) +++ svn:mergeinfo Tue Dec 13 18:25:48 2011 @@ -4,7 +4,7 @@ /cassandra/branches/cassandra-0.8/interface/thrift/gen-java/org/apache/cassandra/thrift/Column.java:1090934-1125013,1125019-1198724,1198726-1206097,1206099-1211976 /cassandra/branches/cassandra-0.8.0/interface/thrift/gen-java/org/apache/cassandra/thrift/Column.java:1125021-1130369
[jira] [Commented] (CASSANDRA-3622) clean up openbitset
[ https://issues.apache.org/jira/browse/CASSANDRA-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168576#comment-13168576 ] Sylvain Lebresne commented on CASSANDRA-3622: - The patch renames the fastGet/Set to get/set (which is fine), but does not update the call-sites (in BloomFilter.java). clean up openbitset --- Key: CASSANDRA-3622 URL: https://issues.apache.org/jira/browse/CASSANDRA-3622 Project: Cassandra Issue Type: Task Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Priority: Minor Fix For: 1.1 Attachments: 3622.txt Our OpenBitSet no longer supports expanding the set post-construction. Should update documentation to reflect that. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3477) cassandra takes too long to shut down when told to quit
[ https://issues.apache.org/jira/browse/CASSANDRA-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168595#comment-13168595 ] paul cannon commented on CASSANDRA-3477: Joaquin - ready to close? cassandra takes too long to shut down when told to quit --- Key: CASSANDRA-3477 URL: https://issues.apache.org/jira/browse/CASSANDRA-3477 Project: Cassandra Issue Type: Bug Components: Core Reporter: Joaquin Casares Assignee: paul cannon Priority: Minor Fix For: 1.0.6 The restart command keeps failing and never passes. The stop command seems to have completed successfully, but the process is still listed when I run 'ps auwx | grep cass'. Using the Debian6 images on Rackspace. 2 nodes are definitely showing the same behavior. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3101) Should check for errors when calling /bin/ln
[ https://issues.apache.org/jira/browse/CASSANDRA-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168605#comment-13168605 ] paul cannon commented on CASSANDRA-3101: This works, except you've taken out a logger.error() call instead of adding another one. I think it's worth logging an error for the cassandra log in both cases. Should check for errors when calling /bin/ln Key: CASSANDRA-3101 URL: https://issues.apache.org/jira/browse/CASSANDRA-3101 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 0.4 Reporter: paul cannon Assignee: Vijay Priority: Minor Labels: lhf Fix For: 1.0.6 Attachments: 0001-0001-throw-IOE-while-calling-bin-ln-v2.patch, 0001-3101-throw-IOE-while-calling-bin-ln.patch It looks like cassandra.utils.CLibrary.createHardLinkWithExec() does not check for any errors in the execution of the hard-link-making utility. This could be bad if, for example, the user has put the snapshot directory on a different filesystem from the data directory. The hard linking would fail and the sstable snapshots would not exist, but no error would be reported. It does look like errors with the more direct JNA link() call are handled correctly - an exception is thrown. The WithExec version should probably do the same thing. Definitely it would be enough to check the process exit value from /bin/ln for nonzero in the *nix case, but I don't know whether 'fsutil hardlink create' or 'cmd /c mklink /H' return nonzero on failure. For bonus points, use any output from the Process's error stream in the text of the exception, to aid in debugging problems. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
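A sketch of the kind of check the ticket asks for in the *nix case - wait for /bin/ln, treat a nonzero exit value as an error, and put the stderr output in the exception text. Illustrative only; this is not the attached patch:
{noformat}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class HardLink
{
    public static void createHardLinkWithExec(String from, String to) throws IOException
    {
        Process p = new ProcessBuilder("ln", from, to).start();
        try
        {
            if (p.waitFor() != 0)
            {
                // collect whatever ln wrote to stderr to aid debugging
                StringBuilder err = new StringBuilder();
                try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getErrorStream())))
                {
                    String line;
                    while ((line = r.readLine()) != null)
                        err.append(line).append('\n');
                }
                throw new IOException("Unable to hard link " + from + " to " + to + ": " + err);
            }
        }
        catch (InterruptedException e)
        {
            throw new IOException(e);
        }
    }
}
{noformat}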
[jira] [Commented] (CASSANDRA-3477) cassandra takes too long to shut down when told to quit
[ https://issues.apache.org/jira/browse/CASSANDRA-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168608#comment-13168608 ] Joaquin Casares commented on CASSANDRA-3477: Sure thing. Haven't seen it on 1.0.5 yet. Thanks! cassandra takes too long to shut down when told to quit --- Key: CASSANDRA-3477 URL: https://issues.apache.org/jira/browse/CASSANDRA-3477 Project: Cassandra Issue Type: Bug Components: Core Reporter: Joaquin Casares Assignee: paul cannon Priority: Minor Fix For: 1.0.6 The restart command keeps failing and never passes. The stop command seems to have completed successfully, but the process is still listed when I run 'ps auwx | grep cass'. Using the Debian6 images on Rackspace. 2 nodes are definitely showing the same behavior. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-3622) clean up openbitset
[ https://issues.apache.org/jira/browse/CASSANDRA-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-3622: -- Attachment: 3622-v2.txt Oops, that's what I get for assuming a patch against 1.0 would Just Work against 1.1. v2 attached. clean up openbitset --- Key: CASSANDRA-3622 URL: https://issues.apache.org/jira/browse/CASSANDRA-3622 Project: Cassandra Issue Type: Task Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Priority: Minor Fix For: 1.1 Attachments: 3622-v2.txt, 3622.txt Our OpenBitSet no longer supports expanding the set post-construction. Should update documentation to reflect that. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Reopened] (CASSANDRA-3477) cassandra takes too long to shut down when told to quit
[ https://issues.apache.org/jira/browse/CASSANDRA-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis reopened CASSANDRA-3477: --- cassandra takes too long to shut down when told to quit --- Key: CASSANDRA-3477 URL: https://issues.apache.org/jira/browse/CASSANDRA-3477 Project: Cassandra Issue Type: Bug Components: Core Reporter: Joaquin Casares Priority: Minor The restart command keeps failing and never passes. The stop command seems to have completed successfully, but the process is still listed when I run 'ps auwx | grep cass'. Using the Debian6 images on Rackspace. 2 nodes are definitely showing the same behavior. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (CASSANDRA-3477) cassandra takes too long to shut down when told to quit
[ https://issues.apache.org/jira/browse/CASSANDRA-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis resolved CASSANDRA-3477. --- Resolution: Cannot Reproduce Reopened/re-resolved b/c that is actually a different issue. cassandra takes too long to shut down when told to quit --- Key: CASSANDRA-3477 URL: https://issues.apache.org/jira/browse/CASSANDRA-3477 Project: Cassandra Issue Type: Bug Components: Core Reporter: Joaquin Casares Priority: Minor The restart command keeps failing and never passes. The stop command seems to have completed successfully, but the process is still listed when I run 'ps auwx | grep cass'. Using the Debian6 images on Rackspace. 2 nodes are definitely showing the same behavior. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-1391) Allow Concurrent Schema Migrations
[ https://issues.apache.org/jira/browse/CASSANDRA-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pavel Yaskevich updated CASSANDRA-1391: --- Attachment: (was: 0001-new-migration-schema-and-avro-methods-cleanup.patch) Allow Concurrent Schema Migrations -- Key: CASSANDRA-1391 URL: https://issues.apache.org/jira/browse/CASSANDRA-1391 Project: Cassandra Issue Type: Improvement Components: Core Affects Versions: 0.7.0 Reporter: Stu Hood Assignee: Pavel Yaskevich Fix For: 1.1 Attachments: CASSANDRA-1391.patch CASSANDRA-1292 fixed multiple migrations started from the same node to properly queue themselves, but it is still possible for migrations initiated on different nodes to conflict and leave the cluster in a bad state. Since the system_add/drop/rename methods are accessible directly from the client API, they should be completely safe for concurrent use. It should be possible to allow for most types of concurrent migrations by converting the UUID schema ID into a VersionVectorClock (as provided by CASSANDRA-580). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
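As a rough illustration of why a version-vector clock can help where a single UUID schema ID cannot: two migrations are safely ordered only when one clock dominates the other component-wise; otherwise they are detectably concurrent and must be merged rather than one silently clobbering the other. A toy sketch (not the CASSANDRA-580 implementation):
{noformat}
import java.net.InetAddress;
import java.util.HashMap;
import java.util.Map;

// Toy version vector: one counter per node that has issued a migration.
public class VersionVector
{
    private final Map<InetAddress, Long> counters = new HashMap<>();

    public void witness(InetAddress node)   // record a migration issued by this node
    {
        counters.merge(node, 1L, Long::sum);
    }

    /** true if this clock is >= the other on every component */
    public boolean dominates(VersionVector other)
    {
        for (Map.Entry<InetAddress, Long> e : other.counters.entrySet())
            if (counters.getOrDefault(e.getKey(), 0L) < e.getValue())
                return false;
        return true;
    }

    /** neither dominates: migrations raced and need explicit reconciliation */
    public boolean concurrentWith(VersionVector other)
    {
        return !dominates(other) && !other.dominates(this);
    }
}
{noformat}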
[jira] [Updated] (CASSANDRA-1391) Allow Concurrent Schema Migrations
[ https://issues.apache.org/jira/browse/CASSANDRA-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pavel Yaskevich updated CASSANDRA-1391: --- Attachment: (was: 0002-avro-removal.patch) Allow Concurrent Schema Migrations -- Key: CASSANDRA-1391 URL: https://issues.apache.org/jira/browse/CASSANDRA-1391 Project: Cassandra Issue Type: Improvement Components: Core Affects Versions: 0.7.0 Reporter: Stu Hood Assignee: Pavel Yaskevich Fix For: 1.1 Attachments: CASSANDRA-1391.patch CASSANDRA-1292 fixed multiple migrations started from the same node to properly queue themselves, but it is still possible for migrations initiated on different nodes to conflict and leave the cluster in a bad state. Since the system_add/drop/rename methods are accessible directly from the client API, they should be completely safe for concurrent use. It should be possible to allow for most types of concurrent migrations by converting the UUID schema ID into a VersionVectorClock (as provided by CASSANDRA-580). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-1391) Allow Concurrent Schema Migrations
[ https://issues.apache.org/jira/browse/CASSANDRA-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pavel Yaskevich updated CASSANDRA-1391: --- Attachment: 0002-avro-removal.patch 0001-new-migration-schema-and-avro-methods-cleanup.patch rebased against the latest trunk (last commit e37bd7e8d344332ff41bd1015e6018c81ca81fa3) Allow Concurrent Schema Migrations -- Key: CASSANDRA-1391 URL: https://issues.apache.org/jira/browse/CASSANDRA-1391 Project: Cassandra Issue Type: Improvement Components: Core Affects Versions: 0.7.0 Reporter: Stu Hood Assignee: Pavel Yaskevich Fix For: 1.1 Attachments: 0001-new-migration-schema-and-avro-methods-cleanup.patch, 0002-avro-removal.patch, CASSANDRA-1391.patch CASSANDRA-1292 fixed multiple migrations started from the same node to properly queue themselves, but it is still possible for migrations initiated on different nodes to conflict and leave the cluster in a bad state. Since the system_add/drop/rename methods are accessible directly from the client API, they should be completely safe for concurrent use. It should be possible to allow for most types of concurrent migrations by converting the UUID schema ID into a VersionVectorClock (as provided by CASSANDRA-580). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3213) Upgrade Thrift to 0.7.0
[ https://issues.apache.org/jira/browse/CASSANDRA-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168654#comment-13168654 ] Jake Farrell commented on CASSANDRA-3213: - Jake Luciani and I were talking about this: we're changing this to upgrade to 0.8, removing the custom THsHa server, and using the default. I'll have a patch for this shortly Upgrade Thrift to 0.7.0 --- Key: CASSANDRA-3213 URL: https://issues.apache.org/jira/browse/CASSANDRA-3213 Project: Cassandra Issue Type: Task Components: Core Reporter: Jake Farrell Assignee: Jake Farrell Priority: Trivial Labels: thrift Fix For: 1.1 Attachments: v1-0001-update-generated-thrift-code.patch, v1-0002-upgrade-thrift-jar-and-license.patch, v1-0003-update-build-xml.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (CASSANDRA-3624) Hinted Handoff - related OOM
Hinted Handoff - related OOM Key: CASSANDRA-3624 URL: https://issues.apache.org/jira/browse/CASSANDRA-3624 Project: Cassandra Issue Type: Bug Reporter: Marcus Eriksson One of our nodes had collected a lot of hints for another node, so when the dead node came back and the row mutations were read back from disk, the node died with an OOM exception (and kept dying after restart, even with increased heap (from 8G to 12G)). The heap dump contained a lot of SuperColumns and our application does not use those (but HH does). I'm guessing that each mutation is big, so that PAGE_SIZE*mutation_size does not fit in memory (will check this tomorrow). A simple fix (if my assumption above is correct) would be to reduce the PAGE_SIZE in HintedHandOffManager.java to something like 10 (or even 1?) to reduce the memory pressure. The performance hit would be small since we are doing the hinted handoff throttle delay sleep before sending every *mutation* anyway (not every page), thoughts? If anyone runs into the same problem, I got the node started again by simply removing the HintsColumnFamily* files. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
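A back-of-the-envelope version of the guess above, with both numbers deliberately hypothetical (the PAGE_SIZE value here is not the actual constant in HintedHandOffManager.java, just a value chosen to show the shape of the problem):
{noformat}
public class HintPageFootprint
{
    public static void main(String[] args)
    {
        int pageSize = 512;                      // hypothetical hints read per page
        long mutationBytes = 20L * 1024 * 1024;  // hypothetical 20 MB per stored mutation
        long pageFootprint = pageSize * mutationBytes;
        System.out.println(pageFootprint / (1024 * 1024 * 1024) + " GB held at once");
        // 10 GB here - enough to OOM the 8-12 GB heaps mentioned in the report,
        // which is why shrinking PAGE_SIZE to ~10 (or 1) relieves the pressure.
    }
}
{noformat}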
[jira] [Created] (CASSANDRA-3625) Do something about DynamicCompositeType
Do something about DynamicCompositeType --- Key: CASSANDRA-3625 URL: https://issues.apache.org/jira/browse/CASSANDRA-3625 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Sylvain Lebresne Currently, DynamicCompositeType is a super dangerous type. We cannot leave it that way or people will get hurt. Let's recall that DynamicCompositeType allows composite column names without any limitation on what each component type can be. It was added basically to allow different rows of the same column family to each store a different index. So for instance you would have: {noformat} index1: { bar:24 -> someval bar:42 -> someval foo:12 -> someval ... } index2: { 0:uuid1:3.2 -> someval 1:uuid2:2.2 -> someval ... } {noformat} where index1, index2, ... are rows. So each row has columns whose names have a similar structure (so they can be compared), but between rows the structure can be different (we never compare two columns from two different rows). But the problem is the following: what happens if, in the index1 row above, you insert a column whose name is 0:uuid1 ? There is no really meaningful way to compare bar:24 and 0:uuid1. The current implementation of DynamicCompositeType, when confronted with this, says that it is a user error and throws a MarshalException. The problem with that is that the exception is not thrown at insert time, and it *cannot* be because of the dynamic nature of the comparator. But that means that if you do insert the wrong column in the wrong row, you end up *corrupting* an sstable. It is too dangerous a behavior. And it's probably made worse by the fact that some people probably think that DynamicCompositeType should be superior to CompositeType since, you know, it's dynamic. One solution to that problem could be to decide on some random (but predictable) order between two incomparable components. For example we could decide that IntType < LongType < StringType ... Note that even if we do that, I would suggest renaming DynamicCompositeType to something that suggests that CompositeType is always preferable to DynamicCompositeType unless you're really doing very advanced stuff. Opinions? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
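A toy model of the proposed fallback ordering (illustrative only - not the actual DynamicCompositeType code): components of the same type compare normally, while components of different types fall back to a fixed, predictable order between the types, so comparison is total and never throws during compaction:
{noformat}
import java.util.Comparator;

// Sketch of the proposal: same-type components compare as usual; cross-type
// comparisons use an arbitrary but stable rule (here, the type's class name),
// standing in for a fixed ordering like IntType < LongType < StringType.
public class FallbackComponentComparator implements Comparator<Object>
{
    @SuppressWarnings("unchecked")
    public int compare(Object a, Object b)
    {
        if (a.getClass() == b.getClass() && a instanceof Comparable)
            return ((Comparable<Object>) a).compareTo(b);
        return a.getClass().getName().compareTo(b.getClass().getName());
    }
}
{noformat}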
[jira] [Created] (CASSANDRA-3626) Nodes can get stuck in UP state forever, despite being DOWN
Nodes can get stuck in UP state forever, despite being DOWN --- Key: CASSANDRA-3626 URL: https://issues.apache.org/jira/browse/CASSANDRA-3626 Project: Cassandra Issue Type: Bug Components: Core Reporter: Peter Schuller Assignee: Peter Schuller This is a proposed phrasing for an upstream ticket named "Newly discovered nodes that are down get stuck in UP state forever" (will edit w/ feedback until done): We have observed a problem with gossip whereby, when you are bootstrapping a new node (or replacing one using the replace_token support), any node in the cluster which is Down at the time the new node is started will be assumed to be Up, and then *never ever* flapped back to Down until you restart the node. This has at least two implications for replacing or bootstrapping new nodes when there are nodes down in the ring: * If the new node happens to select a node listed as UP (but in reality DOWN) as a stream source, streaming will sit there hanging forever. * If that doesn't happen (by picking another host), it will instead finish bootstrapping correctly, and begin servicing requests all the while thinking DOWN nodes are UP, and thus routing requests to them, generating timeouts. The way to get out of this is to restart the node(s) that you bootstrapped. I have tested and confirmed the symptom (that the bootstrapped node thinks other nodes are Up) using a fairly recent 1.0. The main debugging effort happened on 0.8 however, so all details below refer to 0.8 but are probably similar in 1.0. Steps to reproduce: * Bring up a cluster of >= 3 nodes. *Ensure RF is < N*, so that the cluster is operative with one node removed. * Pick two random nodes A and B. Shut them *both* off. * Wait for everyone to realize they are both off (for good measure). * Now, take node A, nuke its data directories and re-start it, such that it comes up w/ normal bootstrap (or use replace_token; didn't test that but should not affect it). * Watch how node A starts up, all the while believing node B is up, even though all other nodes in the cluster agree that B is down and B is in fact still turned off. The mechanism by which it initially goes into Up state is that the node receives a gossip response from any other node in the cluster, and GossipDigestAck2VerbHandler.doVerb() calls Gossiper.applyStateLocally(). Gossiper.applyStateLocally() doesn't have any local endpoint state for the cluster, so the else statement at the end (it's a new node) gets triggered and handleMajorStateChange() is called. handleMajorStateChange() always calls markAlive(), unless the state is a dead state (but dead here does not mean "not up"; it refers to joining/hibernate etc). So at this point the node is up in the mind of the node you just bootstrapped. Now, in each gossip round doStatusCheck() is called, which iterates over all nodes (including the one falsely Up) and, among other things, calls FailureDetector.interpret() on each node. FailureDetector.interpret() is meant to update its sense of Phi for the node, and potentially convict it. However there is a short-circuit at the top, whereby if we do not yet have any arrival window for the node, we simply return immediately. Arrival intervals are only added as a result of a FailureDetector.report() call, which never happens in this case because the initial endpoint state we added, which came from a remote node that was up, had the latest version of the gossip state (so Gossiper.reportFailureDetector() will never call report()). The result is that the node can never ever be convicted. 
Now, let's ignore for a moment the problem that a node that is actually Down will be thought to be Up temporarily for a little while. That is sub-optimal, but let's aim for a fix to the more serious problem in this ticket - which is that it stays up forever. Considered solutions: * When interpret() gets called and there is no arrival window, we could add a faked arrival window far back in time to cause the node to have history and be marked down. This works in the particular test case. The problem is that since we are not ourselves actively trying to gossip to these nodes with any particular speed, it might take a significant time before we get any kind of confirmation from someone else that it's actually Up in cases where the node actually *is* Up, so it's not clear that this is a good idea. * When interpret() gets called and there is no arrival window, we can simply convict it immediately. This has roughly similar behavior as the previous suggestion. * When interpret() gets called and there is no arrival window, we can add a faked arrival window at the current time, which will allow it to be treated as Up until the usual time has passed before we exceed the Phi conviction threshold (a sketch of this option follows below). * When interpret() gets called and
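A minimal sketch of the third option, approximating the shape of FailureDetector (field and method names here are assumptions, not the exact code): seed an arrival window at the current time when none exists, so the ordinary Phi accrual can convict the node if it never reports.

{code}
// Sketch only: give a never-reported endpoint a synthetic arrival sample at
// discovery time, so Phi can grow and eventually exceed the convict threshold.
public void interpret(InetAddress ep)
{
    ArrivalWindow heartbeatWindow = arrivalSamples.get(ep);
    if (heartbeatWindow == null)
    {
        // previously: return immediately, so the node could never be convicted
        heartbeatWindow = new ArrivalWindow(SAMPLE_SIZE);
        heartbeatWindow.add(System.currentTimeMillis()); // treat discovery time as the last heartbeat
        arrivalSamples.put(ep, heartbeatWindow);
        return; // node gets the normal grace period before conviction
    }
    double phi = heartbeatWindow.phi(System.currentTimeMillis());
    if (phi > PHI_CONVICT_THRESHOLD)
        convict(ep); // notify listeners (e.g. the Gossiper) that the endpoint is down
}
{code}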
[jira] [Updated] (CASSANDRA-3626) Nodes can get stuck in UP state forever, despite being DOWN
[ https://issues.apache.org/jira/browse/CASSANDRA-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Goffinet updated CASSANDRA-3626: -- Reviewer: lenn0x Affects Version/s: 0.8.8 1.0.5 Nodes can get stuck in UP state forever, despite being DOWN --- Key: CASSANDRA-3626 URL: https://issues.apache.org/jira/browse/CASSANDRA-3626 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 0.8.8, 1.0.5 Reporter: Peter Schuller Assignee: Peter Schuller This is a proposed phrasing for an upstream ticket named "Newly discovered nodes that are down get stuck in UP state forever" (will edit w/ feedback until done): We have observed a problem with gossip whereby, when you are bootstrapping a new node (or replacing one using the replace_token support), any node in the cluster which is Down at the time the new node is started will be assumed to be Up, and then *never ever* flapped back to Down until you restart the node. This has at least two implications for replacing or bootstrapping new nodes when there are nodes down in the ring: * If the new node happens to select a node listed as UP (but in reality DOWN) as a stream source, streaming will sit there hanging forever. * If that doesn't happen (by picking another host), it will instead finish bootstrapping correctly, and begin servicing requests all the while thinking DOWN nodes are UP, and thus routing requests to them, generating timeouts. The way to get out of this is to restart the node(s) that you bootstrapped. I have tested and confirmed the symptom (that the bootstrapped node thinks other nodes are Up) using a fairly recent 1.0. The main debugging effort happened on 0.8 however, so all details below refer to 0.8 but are probably similar in 1.0. Steps to reproduce: * Bring up a cluster of >= 3 nodes. *Ensure RF is < N*, so that the cluster is operative with one node removed. * Pick two random nodes A and B. Shut them *both* off. * Wait for everyone to realize they are both off (for good measure). * Now, take node A, nuke its data directories and re-start it, such that it comes up w/ normal bootstrap (or use replace_token; didn't test that but should not affect it). * Watch how node A starts up, all the while believing node B is up, even though all other nodes in the cluster agree that B is down and B is in fact still turned off. The mechanism by which it initially goes into Up state is that the node receives a gossip response from any other node in the cluster, and GossipDigestAck2VerbHandler.doVerb() calls Gossiper.applyStateLocally(). Gossiper.applyStateLocally() doesn't have any local endpoint state for the cluster, so the else statement at the end (it's a new node) gets triggered and handleMajorStateChange() is called. handleMajorStateChange() always calls markAlive(), unless the state is a dead state (but dead here does not mean "not up"; it refers to joining/hibernate etc). So at this point the node is up in the mind of the node you just bootstrapped. Now, in each gossip round doStatusCheck() is called, which iterates over all nodes (including the one falsely Up) and, among other things, calls FailureDetector.interpret() on each node. FailureDetector.interpret() is meant to update its sense of Phi for the node, and potentially convict it. However there is a short-circuit at the top, whereby if we do not yet have any arrival window for the node, we simply return immediately. 
Arrival intervals are only added as a result of a FailureDetector.report() call, which never happens in this case because the initial endpoint state we added, which came from a remote node that was up, had the latest version of the gossip state (so Gossiper.reportFailureDetector() will never call report()). The result is that the node can never ever be convicted. Now, let's ignore for a moment the problem that a node that is actually Down will be thought to be Up temporarily for a little while. That is sub-optimal, but let's aim for a fix to the more serious problem in this ticket - which is that it stays up forever. Considered solutions: * When interpret() gets called and there is no arrival window, we could add a faked arrival window far back in time to cause the node to have history and be marked down. This works in the particular test case. The problem is that since we are not ourselves actively trying to gossip to these nodes with any particular speed, it might take a significant time before we get any kind of confirmation from someone else that it's actually Up in cases where the node actually *is* Up, so it's not clear that this is a good idea. * When interpret() gets called and there is no arrival
[jira] [Commented] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168849#comment-13168849 ] Sylvain Lebresne commented on CASSANDRA-3143: - {quote} bq. Preceding point apart, we would at least need a way to deactivate row caching on a per-cf basis. We may also want that for key cache, though this seems less critical. My initial idea would be to either have a boolean flag if we only want to allow disabling row cache, or some multi-value caches option that could be none, key_only, row_only or all. This is going to be moved to the separate task. {quote} I'm not a fan of that idea. We just cannot release this without a way to deactivate the row cache, as this would make the row cache unusable for most users. IMHO, that's a good definition of something that should not be moved to a separate task. {quote} bq. Why did the getRowCacheKeysToSave() option disappear? Because we don't control that anymore; we rely on the cache LRU policy instead. {quote} I don't understand how relying on the cache LRU policy has anything to do with that. The initial motivation for that option is that people don't want to reload the full extent of the row cache on restart because it takes forever, but they don't want to start with cold caches either. I don't see how making the cache global changes anything on that. I agree that the number of row cache keys to save should now be a global option, but I disagree that it should be removed. Otherwise: * The code around CFS.prepareRowForCaching is weird. First, the comment seems to suggest that prepareRowForCaching is used exclusively from CacheService, while it's used below in cacheRow. It also adds a copy of the columns, which I don't think is necessary since we already copy in MappedFileDataInput. Overall I'm not sure prepareRowForCaching is useful, and CacheService.readSavedRowCache could use cacheRow directly. * I don't think CacheService.reloadKeyCache works correctly. It only populates the cache with fake values that won't get updated unless you reload the sstables, which has no reason to happen. I'm fine removing the key cache reloading altogether, but as an alternative, why not save the values of the key cache too? The thing is, I'm not very comfortable with the current 'two phase' key cache loading: if we ever have a bug in the SSTReader.load method, the actual pre-loading with -1 values will be harmful, which seems unnecessarily fragile. Saving the values on disk would avoid that. * Talking of the key cache save, the format used by the patch is really, really not compact. For each key we save the full path to the sstable, which can easily be 50 bytes. Maybe we could associate an int to each descriptor during the save and save the descriptor -> id association separately. * Still worth allowing to choose how many keys to save. * The cache sizings don't take the keys into account. For the row cache, one could make the argument that the overhead of the keys is negligible compared to the values. For the key cache however, the keys are bigger than the values. * The patch mistakenly removes the help for 'nodetool upgradesstables' (in NodeCmd.java). * Would be worth adding a global cache log line in StatusLogger. * Patch wrongly reintroduces memtable_operations and memtable_throughput to CliHelp. * The default row cache provider since 1.0 is the serializing one; this patch sets the ConcurrentLinkedHashCacheProvider instead. 
And a number of nits: * In CFS, it's probably faster/simpler to use metadata.cfId rather than Schema.instance.getId(table.name, this.columnFamily). * In CacheService, calling scheduleSaving with -1 as second argument would be slightly faster than using Integer.MAX_VALUE. * In SSTableReader.cacheKey, the assert {{key.key != null}} is useless in trunk (a DK with key == null can't be constructed). * In AbstractCassandraDaemon, there's an unnecessary import of javax.management.RuntimeErrorException. * There is some comment duplication in the yaml file. * I wonder if the reduce cache capacity thing still makes sense after this patch? * In AutoSavingCache, I think we could declare AutoSavingCache<K extends CacheKey, V> and get rid of the translateKey() method. Global caches (key/row) --- Key: CASSANDRA-3143 URL: https://issues.apache.org/jira/browse/CASSANDRA-3143 Project: Cassandra Issue Type: Improvement Reporter: Pavel Yaskevich Assignee: Pavel Yaskevich Priority: Minor Labels: Core Fix For: 1.1 Attachments: 0001-global-key-cache.patch, 0002-global-row-cache-and-ASC.readSaved-changed-to-abstra.patch, 0003-CacheServiceMBean-and-correct-key-cache-loading.patch, 0004-key-row-cache-tests-and-tweaks.patch,
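A hedged sketch of the compact save format suggested above (Pair, Descriptor and ByteBufferUtil are the real utility classes; the filenameFor usage and the on-disk framing are assumptions): assign each sstable descriptor a small int id, write the id -> path table once, then write compact (id, key) pairs.

{code}
// Sketch: avoid repeating the ~50 byte sstable path for every saved key.
void saveKeyCache(DataOutputStream out, Set<Pair<Descriptor, ByteBuffer>> keys) throws IOException
{
    Map<Descriptor, Integer> ids = new HashMap<Descriptor, Integer>();
    for (Pair<Descriptor, ByteBuffer> entry : keys)
        if (!ids.containsKey(entry.left))
            ids.put(entry.left, ids.size());

    // header: the descriptor -> id association, written once per sstable
    out.writeInt(ids.size());
    for (Map.Entry<Descriptor, Integer> e : ids.entrySet())
    {
        out.writeInt(e.getValue());
        out.writeUTF(e.getKey().filenameFor("Data.db")); // assumed accessor for the data file path
    }

    // body: one small (id, key) record per cached key
    out.writeInt(keys.size());
    for (Pair<Descriptor, ByteBuffer> entry : keys)
    {
        out.writeInt(ids.get(entry.left));
        ByteBufferUtil.writeWithLength(entry.right, out);
    }
}
{code}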
[jira] [Created] (CASSANDRA-3627) IN (...) SELECTs don't honor KEY keyword
IN (...) SELECTs don't honor KEY keyword Key: CASSANDRA-3627 URL: https://issues.apache.org/jira/browse/CASSANDRA-3627 Project: Cassandra Issue Type: Bug Components: API Affects Versions: 1.0.5, 0.8.8 Reporter: Eric Evans The WHERE clause of a SELECT ... IN (...) will not work with the KEY keyword (but does with named/aliased keys). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3620) Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168855#comment-13168855 ] Dominic Williams commented on CASSANDRA-3620: - Make it optional per column family? Repair would still need to exist anyway, so we could fall back to that for cases like this. Proposal for distributed deletes - use Reaper Model rather than GCSeconds and scheduled repairs - Key: CASSANDRA-3620 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Dominic Williams Labels: GCSeconds, deletes, distributed_deletes, merkle_trees, repair Original Estimate: 504h Remaining Estimate: 504h Here is a proposal for an improved system for handling distributed deletes. h2. The Problem There are various issues with repair: * Repair is expensive anyway * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime etc) * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment * When you fail to run repair within GCSeconds, either by error or because of issues with Cassandra, data written to a node that did not see a later delete can reappear (and a node might miss a delete for several reasons, including being down or simply dropping requests during load shedding) * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance Because of the foregoing, in high throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't desirable, and problems are exacerbated when there are lots of column families in a database or it is necessary to run a column family with a low GCSeconds to reduce tombstone load (because there are many writes/deletes to that column family). The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid potentially losing delete operations. It would be much better if there were no ongoing requirement to run repair to ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data. h2. Reaper Model Proposal # Tombstones do not expire, and there is no GCSeconds # Tombstones have associated ACK lists, which record the replicas that have acknowledged them # Tombstones are only deleted (or marked for compaction) when they have been acknowledged by all replicas # When a tombstone is deleted, it is added to a fast relic index of MD5 hashes of cf-key-name[-subName]-ackList. 
The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted # Background reaper threads constantly stream ACK requests to other nodes, and stream ACK responses back to requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance) # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK (a sketch of this handling follows below) NOTES * The existence of entries in the relic index does not affect normal query performance * If a node goes down, and comes up after a configurable relic entry timeout, the worst that can happen is that a tombstone that hasn't received all its acknowledgements is re-created across the replicas when the reaper requests their acknowledgements (which is no big deal, since this does not corrupt data) * Since early removal of entries in the relic index does not cause corruption, it can be kept small, or even kept in memory * Simple to implement and predictable h3. Planned Benefits * Operations are finely grained (reaper interruption is not an issue) * The labour/administration overhead associated with running repair can be removed * Reapers can utilize spare cycles and run constantly in the background to prevent the load spikes and performance issues associated with repair * There will no longer be the threat of corruption if repair can't be run for some reason (for example because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs preventing repair being run etc) * Deleting tombstones earlier, thereby reducing the number
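A sketch of how a reaper might serve ACK requests against the relic index described above (all names are hypothetical; the proposal does not prescribe an API):

{code}
// Sketch: the relic index lets a reaper ACK tombstones it has already deleted.
boolean handleAckRequest(TombstoneId id, InetAddress requestor)
{
    // hypothetical hash of cf-key-name[-subName]-ackList, as in the proposal
    byte[] relicHash = md5(id.cf, id.key, id.name, id.ackList);
    if (relicIndex.contains(relicHash))
        return true; // tombstone already reaped here: acknowledge immediately

    Tombstone t = findTombstone(id);
    if (t == null)
        t = recreateTombstone(id); // per the proposal: re-create missing tombstones, then ACK

    t.addAck(requestor);
    if (t.ackedByAllReplicas())
    {
        relicIndex.add(relicHash); // remember it so later ACK requests still succeed
        markForCompaction(t);      // tombstone can now be purged
    }
    return true;
}
{code}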
[jira] [Created] (CASSANDRA-3628) Make Pig/CassandraStorage delete functionality disabled by default and configurable
Make Pig/CassandraStorage delete functionality disabled by default and configurable --- Key: CASSANDRA-3628 URL: https://issues.apache.org/jira/browse/CASSANDRA-3628 Project: Cassandra Issue Type: Task Reporter: Jeremy Hanna Assignee: Jeremy Hanna Right now, there is a way to delete columns with the CassandraStorage loadstorefunc. In practice it is a bad idea to have that enabled by default. A scenario: you do an outer join, you don't have a value for something, and then you write out to cassandra all of the attributes of that relation. You've just inadvertently deleted a column for all the rows that didn't have that value as a result of the outer join. It can be argued that you want to be careful with how you project after the join. However, I would think disabling it by default and having a configurable property to enable it for the instances when you explicitly want to use it is the right plan. Fwiw, we had a bug in one of our scripts that did exactly as described above. It's good to fix the bug. It's bad to implicitly delete data. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3622) clean up openbitset
[ https://issues.apache.org/jira/browse/CASSANDRA-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168867#comment-13168867 ] Sylvain Lebresne commented on CASSANDRA-3622: - +1 clean up openbitset --- Key: CASSANDRA-3622 URL: https://issues.apache.org/jira/browse/CASSANDRA-3622 Project: Cassandra Issue Type: Task Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Priority: Minor Fix For: 1.1 Attachments: 3622-v2.txt, 3622.txt Our OpenBitSet no longer supports expanding the set post-construction. Should update documentation to reflect that. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3483) Support bringing up a new datacenter to existing cluster without repair
[ https://issues.apache.org/jira/browse/CASSANDRA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168924#comment-13168924 ] Sylvain Lebresne commented on CASSANDRA-3483: - I haven't applied the patch yet (it needs a rebase, preferably against trunk since that is the likely target for this), but a few comments. We could have more reuse of code between Bootstrapper and the rebuild command. Typically: * RangeStreamer.getAllRangeWithSourcesFor does essentially the same thing as Bootstrapper.getRangesWithSources, so it would be nice to do some reuse. * In rebuild, we essentially have the code of Bootstrapper.getWorkMap; again it would be nice to do some code reuse. I think we should move all of those into RangeStreamer, and ultimately Bootstrapper.bootstrap() should be just one call to rebuild with the right arguments (mostly the correct tokenMetadata instance and the myRange collection). A few nits: * rebuild code could be simplified slightly by using StorageService.getLocalRanges() * rebuild doesn't fully respect the code style. Support bringing up a new datacenter to existing cluster without repair --- Key: CASSANDRA-3483 URL: https://issues.apache.org/jira/browse/CASSANDRA-3483 Project: Cassandra Issue Type: Bug Affects Versions: 1.0.2 Reporter: Chris Goffinet Assignee: Peter Schuller Attachments: CASSANDRA-3483-0.8-prelim.txt, CASSANDRA-3483-1.0.txt Was talking to Brandon in irc, and we ran into a case where we want to bring up a new DC to an existing cluster. He suggested (from jbellis) that the way to do it currently was to set strategy options of dc2:0, then add the nodes. After the nodes are up, change the RF of dc2, and run repair. I'd like to avoid a repair as it runs AES and is a bit more intense than how bootstrap works currently by just streaming ranges from the SSTables. Would it be possible to improve this functionality (adding a new DC to an existing cluster) over the proposed method? We'd be happy to do a patch if we got some input on the best way to go about it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3143) Global caches (key/row)
[ https://issues.apache.org/jira/browse/CASSANDRA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168929#comment-13168929 ] Pavel Yaskevich commented on CASSANDRA-3143: bq. I'm not a fan of that idea. We just cannot release this without a way to deactivate the row cache, as this would make the row cache unusable for most users. IMHO, that's a good definition of something that should not be moved to a separate task. Couldn't we do that the same way we did with compression options? I'm happy to make it a sub-task, I just want the main code to be settled before starting with that. bq. Why did the getRowCacheKeysToSave() option disappear? Is that going to have the same use case as it did per-CF? Meaning we would be saving the top of the cache, and it doesn't guarantee that the system doesn't start almost cold... {quote} Talking of the key cache save, the format used by the patch is really, really not compact. For each key we save the full path to the sstable, which can easily be 50 bytes. Maybe we could associate an int to each descriptor during the save and save the descriptor -> id association separately. * Still worth allowing to choose how many keys to save {quote} Do you think that it's worth the effort of maintaining (also persisting) such a descriptor -> id relationship exclusively for the key cache? Meaning it's already a very compact cache, e.g. even with a 50 byte descriptor we would need ~20 mb to store 20 keys... bq. The cache sizings don't take the keys into account. For the row cache, one could make the argument that the overhead of the keys is negligible compared to the values. For the key cache however, the keys are bigger than the values. We do that because CLHM only allows measuring values; to do something about it we would need to re-write the Weigher interface and change core semantics of CLHM... Global caches (key/row) --- Key: CASSANDRA-3143 URL: https://issues.apache.org/jira/browse/CASSANDRA-3143 Project: Cassandra Issue Type: Improvement Reporter: Pavel Yaskevich Assignee: Pavel Yaskevich Priority: Minor Labels: Core Fix For: 1.1 Attachments: 0001-global-key-cache.patch, 0002-global-row-cache-and-ASC.readSaved-changed-to-abstra.patch, 0003-CacheServiceMBean-and-correct-key-cache-loading.patch, 0004-key-row-cache-tests-and-tweaks.patch, 0005-cleanup-of-the-CFMetaData-and-thrift-avro-CfDef-and-.patch, 0006-row-key-cache-improvements-according-to-Sylvain-s-co.patch Caches are difficult to configure well as ColumnFamilies are added, similar to how memtables were difficult pre-CASSANDRA-2006. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
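For illustration, a hedged sketch of what weighing keys as well as values could look like (CLHM's stock Weigher only sees the value; the entry-based interface, KeyCacheKey usage and serializedSize accessor here are assumptions, not the library's API):

{code}
// Sketch: an entry-level weigher that charges for the key bytes too,
// which matters for the key cache where keys outweigh the 8-byte values.
interface EntryWeigher<K, V>
{
    int weightOf(K key, V value);
}

EntryWeigher<KeyCacheKey, Long> keyCacheWeigher = new EntryWeigher<KeyCacheKey, Long>()
{
    public int weightOf(KeyCacheKey key, Long position)
    {
        return key.serializedSize() + 8; // hypothetical: key bytes plus an 8-byte offset
    }
};
{code}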
svn commit: r1214016 - /cassandra/tags/cassandra-1.0.6/
Author: slebresne Date: Wed Dec 14 01:17:43 2011 New Revision: 1214016 URL: http://svn.apache.org/viewvc?rev=1214016&view=rev Log: Create 1.0.6 branch Added: cassandra/tags/cassandra-1.0.6/ (props changed) - copied from r1212944, cassandra/branches/cassandra-1.0/ Propchange: cassandra/tags/cassandra-1.0.6/ -- --- svn:ignore (added) +++ svn:ignore Wed Dec 14 01:17:43 2011 @@ -0,0 +1,8 @@ +.classpath +.project +.settings +temp-testng-customsuite.xml +build +build.properties +.idea +out Propchange: cassandra/tags/cassandra-1.0.6/ -- --- svn:mergeinfo (added) +++ svn:mergeinfo Wed Dec 14 01:17:43 2011 @@ -0,0 +1,16 @@ +/cassandra/branches/cassandra-0.6:922689-1052356,1052358-1053452,1053454,1053456-1131291 +/cassandra/branches/cassandra-0.7:1026516-1211709 +/cassandra/branches/cassandra-0.7.0:1053690-1055654 +/cassandra/branches/cassandra-0.8:1090934-1125013,1125019-1212854,1212938 +/cassandra/branches/cassandra-0.8.0:1125021-1130369 +/cassandra/branches/cassandra-0.8.1:1101014-1125018 +/cassandra/branches/cassandra-1.0:1167106,1167185 +/cassandra/branches/cassandra-1.0.0:1167104-1181093,1181741,1181816,1181820,1182951,1183243 +/cassandra/branches/cassandra-1.0.5:1208016 +/cassandra/tags/cassandra-0.7.0-rc3:1051699-1053689 +/cassandra/tags/cassandra-0.8.0-rc1:1102511-1125020 +/cassandra/trunk:1167085-1167102,1169870 +/incubator/cassandra/branches/cassandra-0.3:774578-796573 +/incubator/cassandra/branches/cassandra-0.4:810145-834239,834349-834350 +/incubator/cassandra/branches/cassandra-0.5:72-915439 +/incubator/cassandra/branches/cassandra-0.6:911237-922688
[jira] [Commented] (CASSANDRA-3625) Do something about DynamicCompositeType
[ https://issues.apache.org/jira/browse/CASSANDRA-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168957#comment-13168957 ] Ed Anuff commented on CASSANDRA-3625: - I don't think you mean "random but predictable" so much as "deterministic but opaque" in your description of the correct behavior. I raised this issue with DynamicCompositeType when it was introduced, and I suggested we use the alias character byte or a hash of the classname (see https://issues.apache.org/jira/browse/CASSANDRA-2231#comment-13002170 ). I still think that's the best approach. Do something about DynamicCompositeType --- Key: CASSANDRA-3625 URL: https://issues.apache.org/jira/browse/CASSANDRA-3625 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Sylvain Lebresne Currently, DynamicCompositeType is a super dangerous type. We cannot leave it that way or people will get hurt. Let's recall that DynamicCompositeType allows composite column names without any limitation on what each component type can be. It was added to basically allow different rows of the same column family to each store a different index. So for instance you would have: {noformat} index1: { bar:24 -> someval bar:42 -> someval foo:12 -> someval ... } index2: { 0:uuid1:3.2 -> someval 1:uuid2:2.2 -> someval ... } {noformat} where index1, index2, ... are rows. So each row has columns whose names have a similar structure (so they can be compared), but between rows the structure can be different (we never compare two columns from two different rows). But the problem is the following: what happens if in the index1 row above, you insert a column whose name is 0:uuid1? There is no really meaningful way to compare bar:24 and 0:uuid1. The current implementation of DynamicCompositeType, when confronted with this, says that it is a user error and throws a MarshalException. The problem with that is that the exception is not thrown at insert time, and it *cannot* be, because of the dynamic nature of the comparator. But that means that if you do insert the wrong column in the wrong row, you end up *corrupting* an sstable. That is too dangerous a behavior. And it's probably made worse by the fact that some people probably think that DynamicCompositeType should be superior to CompositeType since, you know, it's dynamic. One solution to that problem could be to decide on some random (but predictable) order between two incomparable components. For example we could decide that IntType < LongType < StringType < ... Note that even if we do that, I would suggest renaming DynamicCompositeType to something that suggests that CompositeType is always preferable to DynamicCompositeType unless you're really doing very advanced stuff. Opinions? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-1391) Allow Concurrent Schema Migrations
[ https://issues.apache.org/jira/browse/CASSANDRA-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168969#comment-13168969 ] Jonathan Ellis commented on CASSANDRA-1391: --- Thanks, Pavel. This is getting closer. But I think continuing to use UUIDs is the wrong approach. In particular, code like this means we've failed to achieve our goal: {code} if (newVersion.timestamp() <= lastVersion.timestamp()) throw new ConfigurationException("New version timestamp is not newer than the current version timestamp."); {code} If two migrations X and Y propagate through the cluster concurrently from different coordinators, some nodes will apply X first, some Y; whichever migration has a lower timestamp will then error out on the remaining nodes and we'll end up with the same kind of version conflict snafu we encounter now. Here's how I think it should work: * Coordinator turns KsDef and CfDef objects into RowMutations by applying them to the existing (local) schema. Maybe you use something like your attributesToCheck code since you already have that written. Give that mutation a normal local timestamp (FBU.timestampMicros). Then each node applying the change: * makes a deep copy of the existing schema ColumnFamily objects * calls Table.apply on the migration RowMutations * calls ColumnFamily.diff on the new schema ColumnFamily object vs the copied one. (This is where I was going above by saying let the existing resolve code do the work. No matter which order nodes apply X and Y in, they will always agree on the result after applying both. Note that this does not depend on X and Y getting correctly ordered timestamps, either.) * makes the appropriate Table + CFS + Schema changes dictated by the diff * (above obviously needs to be synchronized at least against the Table/CFS objects affected) Schema version may then be computed as an md5 of the Schema objects. (Again: the goal is that nodes can apply X and Y in any order, and we don't care. So the version needs to be entirely content-based, not time-based.) Probably the easiest way to do this is to just use CF.updateDigest. We can cut this down to the first 16 bytes if we need to cram it into a UUID, but I don't see a reason for that (the Thrift API uses Strings already). Nit: flushSystemCFs could use FBUtilities.waitOnFutures(flushes) instead of rolling its own multi-future wait. Allow Concurrent Schema Migrations -- Key: CASSANDRA-1391 URL: https://issues.apache.org/jira/browse/CASSANDRA-1391 Project: Cassandra Issue Type: Improvement Components: Core Affects Versions: 0.7.0 Reporter: Stu Hood Assignee: Pavel Yaskevich Fix For: 1.1 Attachments: 0001-new-migration-schema-and-avro-methods-cleanup.patch, 0002-avro-removal.patch, CASSANDRA-1391.patch CASSANDRA-1292 fixed multiple migrations started from the same node to properly queue themselves, but it is still possible for migrations initiated on different nodes to conflict and leave the cluster in a bad state. Since the system_add/drop/rename methods are accessible directly from the client API, they should be completely safe for concurrent use. It should be possible to allow for most types of concurrent migrations by converting the UUID schema ID into a VersionVectorClock (as provided by CASSANDRA-580). -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
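A minimal sketch of the content-based versioning described above (ColumnFamily.updateDigest is the real method named in the comment; the surrounding framing is an assumption): every node digests the same schema rows, so nodes that applied X and Y in different orders still compute the same version.

{code}
// Sketch: schema version as a pure function of schema content, not of time.
UUID computeSchemaVersion(Collection<ColumnFamily> schemaRows)
{
    try
    {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        for (ColumnFamily cf : schemaRows)
            cf.updateDigest(digest);        // mix each schema row into the hash
        // 16 md5 bytes can be crammed into a UUID if the API keeps wanting one
        return UUID.nameUUIDFromBytes(digest.digest());
    }
    catch (NoSuchAlgorithmException e)
    {
        throw new RuntimeException(e);      // MD5 is always available on a JVM
    }
}
{code}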
[jira] [Commented] (CASSANDRA-3483) Support bringing up a new datacenter to existing cluster without repair
[ https://issues.apache.org/jira/browse/CASSANDRA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168975#comment-13168975 ] Peter Schuller commented on CASSANDRA-3483: --- I'll get it rebased once it's otherwise okay. As for re-use: I had intermediate versions that tried to do this, but every time I ended up realizing that it was exploding in verbosity at the point where I was using the abstraction, so it didn't actually help. However, I think there were a few changes towards the end after which I didn't re-evaluate. I'll look at it again and see what I can do. Support bringing up a new datacenter to existing cluster without repair --- Key: CASSANDRA-3483 URL: https://issues.apache.org/jira/browse/CASSANDRA-3483 Project: Cassandra Issue Type: Bug Affects Versions: 1.0.2 Reporter: Chris Goffinet Assignee: Peter Schuller Attachments: CASSANDRA-3483-0.8-prelim.txt, CASSANDRA-3483-1.0.txt Was talking to Brandon in irc, and we ran into a case where we want to bring up a new DC to an existing cluster. He suggested (from jbellis) that the way to do it currently was to set strategy options of dc2:0, then add the nodes. After the nodes are up, change the RF of dc2, and run repair. I'd like to avoid a repair as it runs AES and is a bit more intense than how bootstrap works currently by just streaming ranges from the SSTables. Would it be possible to improve this functionality (adding a new DC to an existing cluster) over the proposed method? We'd be happy to do a patch if we got some input on the best way to go about it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
svn commit: r1214034 - in /cassandra/trunk/src/java/org/apache/cassandra/utils: BloomFilter.java obs/OpenBitSet.java
Author: jbellis Date: Wed Dec 14 02:18:44 2011 New Revision: 1214034 URL: http://svn.apache.org/viewvc?rev=1214034&view=rev Log: clean up OpenBitSet patch by jbellis; reviewed by slebresne for CASSANDRA-3622 Modified: cassandra/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java cassandra/trunk/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java Modified: cassandra/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java URL: http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java?rev=1214034&r1=1214033&r2=1214034&view=diff == --- cassandra/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java (original) +++ cassandra/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java Wed Dec 14 02:18:44 2011 @@ -113,7 +113,7 @@ public class BloomFilter extends Filter { for (long bucketIndex : getHashBuckets(key)) { -bitset.fastSet(bucketIndex); +bitset.set(bucketIndex); } } @@ -121,7 +121,7 @@ public class BloomFilter extends Filter { for (long bucketIndex : getHashBuckets(key)) { - if (!bitset.fastGet(bucketIndex)) + if (!bitset.get(bucketIndex)) { return false; } Modified: cassandra/trunk/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java URL: http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java?rev=1214034&r1=1214033&r2=1214034&view=diff == --- cassandra/trunk/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java (original) +++ cassandra/trunk/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java Wed Dec 14 02:18:44 2011 @@ -21,8 +21,10 @@ import java.util.Arrays; import java.io.Serializable; import java.util.BitSet; -/** An open BitSet implementation that allows direct access to the array of words - * storing the bits. +/** + * An open BitSet implementation that allows direct access to the arrays of words + * storing the bits. Derived from Lucene's OpenBitSet, but with a paged backing array + * (see bits declaration, below). * <p/> * Unlike java.util.bitset, the fact that bits are packed into an array of longs * is part of the interface. This allows efficient implementation of other algorithms @@ -39,77 +41,38 @@ import java.util.BitSet; * hence people re-implement their own version in order to get better performance). * If you want a safe, totally encapsulated (and slower and limited) BitSet * class, use <code>java.util.BitSet</code>. - * <p/> - * <h3>Performance Results</h3> - * - Test system: Pentium 4, Sun Java 1.5_06 -server -Xbatch -Xmx64M -<br/>BitSet size = 1,000,000 -<br/>Results are java.util.BitSet time divided by OpenBitSet time. -<table border="1"> - <tr> - <th></th> <th>cardinality</th> <th>intersect_count</th> <th>union</th> <th>nextSetBit</th> <th>get</th> <th>iterator</th> - </tr> - <tr> - <th>50% full</th> <td>3.36</td> <td>3.96</td> <td>1.44</td> <td>1.46</td> <td>1.99</td> <td>1.58</td> - </tr> - <tr> - <th>1% full</th> <td>3.31</td> <td>3.90</td> <td>&nbsp;</td> <td>1.04</td> <td>&nbsp;</td> <td>0.99</td> - </tr> -</table> -<br/> -Test system: AMD Opteron, 64 bit linux, Sun Java 1.5_06 -server -Xbatch -Xmx64M -<br/>BitSet size = 1,000,000 -<br/>Results are java.util.BitSet time divided by OpenBitSet time. 
-<table border="1"> - <tr> - <th></th> <th>cardinality</th> <th>intersect_count</th> <th>union</th> <th>nextSetBit</th> <th>get</th> <th>iterator</th> - </tr> - <tr> - <th>50% full</th> <td>2.50</td> <td>3.50</td> <td>1.00</td> <td>1.03</td> <td>1.12</td> <td>1.25</td> - </tr> - <tr> - <th>1% full</th> <td>2.51</td> <td>3.49</td> <td>&nbsp;</td> <td>1.00</td> <td>&nbsp;</td> <td>1.02</td> - </tr> -</table> */ public class OpenBitSet implements Serializable { - protected long[][] bits; - protected int wlen; // number of words (elements) used in the array - private final int pageCount; /** - * length of bits[][] page in long[] elements. - * Choosing unform size for all sizes of bitsets fight fragmentation for very large - * bloom filters. + * We break the bitset up into multiple arrays to avoid promotion failure caused by attempting to allocate + * large, contiguous arrays (CASSANDRA-2466). All sub-arrays but the last are uniformly PAGE_SIZE words; + * to avoid waste in small bloom filters (of which Cassandra has many: one per row) the last sub-array + * is sized to exactly the remaining number of words required to achieve the desired set size (CASSANDRA-3618). */ - protected static final int PAGE_SIZE= 4096; + private final long[][] bits; + private int wlen; // number of words (elements) used in the array + private final int pageCount; + private static final int PAGE_SIZE = 4096; - /** Constructs an OpenBitSet large enough to hold numBits. - * + /** + * Constructs an OpenBitSet large enough to hold numBits. * @param numBits */ public OpenBitSet(long numBits) { - this(numBits,true); - } - -
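For reference, a simplified, self-contained sketch (not the committed class) of the paged addressing the new comment describes: a bit index maps to a 64-bit word, the word maps to a (page, offset) pair in the long[][], and no single allocation ever exceeds PAGE_SIZE words.

{code}
// Simplified sketch of the paged layout from the patch above.
public class PagedBitSet
{
    private static final int PAGE_SIZE = 4096; // words (longs) per page
    private final long[][] bits;

    public PagedBitSet(long numBits)
    {
        int words = (int) ((numBits + 63) >>> 6);
        int pages = (words + PAGE_SIZE - 1) / PAGE_SIZE;
        bits = new long[pages][];
        for (int p = 0; p < pages; p++)
        {
            // the last page is sized exactly, to avoid waste in small bloom filters
            bits[p] = new long[Math.min(PAGE_SIZE, words - p * PAGE_SIZE)];
        }
    }

    public void set(long index)
    {
        int wordNum = (int) (index >> 6); // which 64-bit word holds this bit
        bits[wordNum / PAGE_SIZE][wordNum % PAGE_SIZE] |= 1L << (index & 63);
    }

    public boolean get(long index)
    {
        int wordNum = (int) (index >> 6);
        return (bits[wordNum / PAGE_SIZE][wordNum % PAGE_SIZE] & (1L << (index & 63))) != 0;
    }
}
{code}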
[jira] [Commented] (CASSANDRA-3625) Do something about DynamicCompositeType
[ https://issues.apache.org/jira/browse/CASSANDRA-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169013#comment-13169013 ] Jonathan Ellis commented on CASSANDRA-3625: --- bq. One solution to that problem could be to decide on some random (but predictable) order between two incomparable components. That's the most straightforward suggestion IMO. bq. I suggested we use the alias character byte or a hash of the classname Couldn't we just fall back to lexical sorting for non-comparable types? Might as well keep it simple. Do something about DynamicCompositeType --- Key: CASSANDRA-3625 URL: https://issues.apache.org/jira/browse/CASSANDRA-3625 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Sylvain Lebresne Currently, DynamicCompositeType is a super dangerous type. We cannot leave it that way or people will get hurt. Let's recall that DynamicCompositeType allows composite column names without any limitation on what each component type can be. It was added to basically allow different rows of the same column family to each store a different index. So for instance you would have: {noformat} index1: { bar:24 -> someval bar:42 -> someval foo:12 -> someval ... } index2: { 0:uuid1:3.2 -> someval 1:uuid2:2.2 -> someval ... } {noformat} where index1, index2, ... are rows. So each row has columns whose names have a similar structure (so they can be compared), but between rows the structure can be different (we never compare two columns from two different rows). But the problem is the following: what happens if in the index1 row above, you insert a column whose name is 0:uuid1? There is no really meaningful way to compare bar:24 and 0:uuid1. The current implementation of DynamicCompositeType, when confronted with this, says that it is a user error and throws a MarshalException. The problem with that is that the exception is not thrown at insert time, and it *cannot* be, because of the dynamic nature of the comparator. But that means that if you do insert the wrong column in the wrong row, you end up *corrupting* an sstable. That is too dangerous a behavior. And it's probably made worse by the fact that some people probably think that DynamicCompositeType should be superior to CompositeType since, you know, it's dynamic. One solution to that problem could be to decide on some random (but predictable) order between two incomparable components. For example we could decide that IntType < LongType < StringType < ... Note that even if we do that, I would suggest renaming DynamicCompositeType to something that suggests that CompositeType is always preferable to DynamicCompositeType unless you're really doing very advanced stuff. Opinions? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-3619) Use a separate writer thread for the SSTableSimpleUnsortedWriter
[ https://issues.apache.org/jira/browse/CASSANDRA-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-3619: -- Reviewer: yukim Use a separate writer thread for the SSTableSimpleUnsortedWriter Key: CASSANDRA-3619 URL: https://issues.apache.org/jira/browse/CASSANDRA-3619 Project: Cassandra Issue Type: Improvement Components: Tools Affects Versions: 0.8.1 Reporter: Sylvain Lebresne Assignee: Sylvain Lebresne Priority: Minor Fix For: 1.1 Attachments: 0001-Add-separate-writer-thread.patch Currently SSTableSimpleUnsortedWriter doesn't use any threading. This means that the thread using it is blocked while the buffered data is written on disk and that nothing is written on disk while data is added. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
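A hedged sketch of the idea (not the attached patch itself): hand each filled buffer to a dedicated writer thread over a bounded queue, so the caller can keep adding rows to a fresh buffer while the previous one is being sorted and written. Buffer, writeToSSTable, SENTINEL and bufferSizeInBytes are hypothetical stand-ins here.

{code}
// Sketch: decouple buffering from disk writes with a single writer thread.
final BlockingQueue<Buffer> writeQueue = new ArrayBlockingQueue<Buffer>(1);
private Buffer currentBuffer = new Buffer();

Thread writer = new Thread("sstable-writer")
{
    public void run()
    {
        try
        {
            Buffer b;
            while ((b = writeQueue.take()) != SENTINEL) // SENTINEL: hypothetical end-of-stream marker
                writeToSSTable(b);                      // hypothetical: sort the buffer, write one sstable
        }
        catch (InterruptedException e)
        {
            throw new AssertionError(e);
        }
    }
};

void add(ByteBuffer key, ColumnFamily cf) throws InterruptedException
{
    currentBuffer.put(key, cf);            // hypothetical in-memory buffering
    if (currentBuffer.size() >= bufferSizeInBytes)
    {
        writeQueue.put(currentBuffer);     // blocks only while the writer is still busy
        currentBuffer = new Buffer();      // keep accepting rows while the old buffer flushes
    }
}
{code}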
[jira] [Commented] (CASSANDRA-3624) Hinted Handoff - related OOM
[ https://issues.apache.org/jira/browse/CASSANDRA-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169051#comment-13169051 ] Jonathan Ellis commented on CASSANDRA-3624: --- That makes sense. (How big are your mutations?) We added adaptive page sizing back in CASSANDRA-2652, but apparently removed it for the CASSANDRA-2045 redesign. Hinted Handoff - related OOM Key: CASSANDRA-3624 URL: https://issues.apache.org/jira/browse/CASSANDRA-3624 Project: Cassandra Issue Type: Bug Reporter: Marcus Eriksson One of our nodes had collected a lot of hints for another node, so when the dead node came back and the row mutations were read back from disk, the node died with an OOM exception (and kept dying after restart, even with increased heap (from 8G to 12G)). The heap dump contained a lot of SuperColumns and our application does not use those (but HH does). I'm guessing that each mutation is big, so that PAGE_SIZE*mutation_size does not fit in memory (will check this tomorrow). A simple fix (if my assumption above is correct) would be to reduce the PAGE_SIZE in HintedHandOffManager.java to something like 10 (or even 1?) to reduce the memory pressure. The performance hit would be small since we are doing the hinted handoff throttle delay sleep before sending every *mutation* anyway (not every page), thoughts? If anyone runs into the same problem, I got the node started again by simply removing the HintsColumnFamily* files. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3622) clean up openbitset
[ https://issues.apache.org/jira/browse/CASSANDRA-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169052#comment-13169052 ] Hudson commented on CASSANDRA-3622: --- Integrated in Cassandra #1255 (See [https://builds.apache.org/job/Cassandra/1255/]) clean up OpenBitSet patch by jbellis; reviewed by slebresne for CASSANDRA-3622 jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1214034 Files : * /cassandra/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java * /cassandra/trunk/src/java/org/apache/cassandra/utils/obs/OpenBitSet.java clean up openbitset --- Key: CASSANDRA-3622 URL: https://issues.apache.org/jira/browse/CASSANDRA-3622 Project: Cassandra Issue Type: Task Components: Core Reporter: Jonathan Ellis Assignee: Jonathan Ellis Priority: Minor Fix For: 1.1 Attachments: 3622-v2.txt, 3622.txt Our OpenBitSet no longer supports expanding the set post-construction. Should update documentation to reflect that. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-3624) Hinted Handoff - related OOM
[ https://issues.apache.org/jira/browse/CASSANDRA-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-3624: -- Attachment: 3624.txt Patch to add back adaptive page sizing, dropping the default page size to 128 columns. Hinted Handoff - related OOM Key: CASSANDRA-3624 URL: https://issues.apache.org/jira/browse/CASSANDRA-3624 Project: Cassandra Issue Type: Bug Affects Versions: 1.0.0 Reporter: Marcus Eriksson Labels: hintedhandoff Fix For: 1.0.7 Attachments: 3624.txt One of our nodes had collected a lot of hints for another node, so when the dead node came back and the row mutations were read back from disk, the node died with an OOM exception (and kept dying after restart, even with increased heap (from 8G to 12G)). The heap dump contained a lot of SuperColumns and our application does not use those (but HH does). I'm guessing that each mutation is big, so that PAGE_SIZE*mutation_size does not fit in memory (will check this tomorrow). A simple fix (if my assumption above is correct) would be to reduce the PAGE_SIZE in HintedHandOffManager.java to something like 10 (or even 1?) to reduce the memory pressure. The performance hit would be small since we are doing the hinted handoff throttle delay sleep before sending every *mutation* anyway (not every page), thoughts? If anyone runs into the same problem, I got the node started again by simply removing the HintsColumnFamily* files. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
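A sketch of what adaptive sizing can look like (the actual change is in the attached 3624.txt; the memory budget constant here is an assumption, not the patch's value): shrink the page when the mean hint is large, so a full page stays within a fixed memory budget.

{code}
// Sketch: derive the hint page size from the mean mutation size.
int calculatePageSize(long meanHintSizeBytes)
{
    final int MAX_PAGE_SIZE = 128;                   // the new default column count
    final long PAGE_MEMORY_BUDGET = 4 * 1024 * 1024; // assumed ~4MB resident per page
    int bySize = (int) (PAGE_MEMORY_BUDGET / Math.max(1, meanHintSizeBytes));
    return Math.max(1, Math.min(MAX_PAGE_SIZE, bySize)); // never below 1 column per page
}
{code}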
[jira] [Commented] (CASSANDRA-3624) Hinted Handoff - related OOM
[ https://issues.apache.org/jira/browse/CASSANDRA-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169056#comment-13169056 ] Jonathan Ellis commented on CASSANDRA-3624: --- bq. The performance hit would be small since we are doing the hinted handoff throttle delay sleep before sending every mutation anyway True, but this is likely to change (see Jake's comments on CASSANDRA-3554). Hinted Handoff - related OOM Key: CASSANDRA-3624 URL: https://issues.apache.org/jira/browse/CASSANDRA-3624 Project: Cassandra Issue Type: Bug Affects Versions: 1.0.0 Reporter: Marcus Eriksson Assignee: Jonathan Ellis Labels: hintedhandoff Fix For: 1.0.7 Attachments: 3624.txt One of our nodes had collected a lot of hints for another node, so when the dead node came back and the row mutations were read back from disk, the node died with an OOM exception (and kept dying after restart, even with increased heap (from 8G to 12G)). The heap dump contained a lot of SuperColumns and our application does not use those (but HH does). I'm guessing that each mutation is big, so that PAGE_SIZE*mutation_size does not fit in memory (will check this tomorrow). A simple fix (if my assumption above is correct) would be to reduce the PAGE_SIZE in HintedHandOffManager.java to something like 10 (or even 1?) to reduce the memory pressure. The performance hit would be small since we are doing the hinted handoff throttle delay sleep before sending every *mutation* anyway (not every page), thoughts? If anyone runs into the same problem, I got the node started again by simply removing the HintsColumnFamily* files. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-3625) Do something about DynamicCompositeType
[ https://issues.apache.org/jira/browse/CASSANDRA-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169057#comment-13169057 ] Ed Anuff commented on CASSANDRA-3625: - Each component in a composite consists of a type (either an alias byte or a Cassandra comparator type name) and the value. I'm suggesting doing the compare on the type in the case of the types not being equivalent. The comparison could be a lexical compare or a hash comparison. I think doing the compare on the component type is better, since the purpose of the composite is for slices, and if we do a lexical compare of the component values then the slices are going to have weird results in the middle of them. For example, a row that had dynamic composite columns (ed,5), (jonathan,6), and (103,32), that was sliced from (ed) to (jonathan), could have the (103,32) in the middle. If we compare on the type, that never happens. Do something about DynamicCompositeType --- Key: CASSANDRA-3625 URL: https://issues.apache.org/jira/browse/CASSANDRA-3625 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Sylvain Lebresne Currently, DynamicCompositeType is a super dangerous type. We cannot leave it that way or people will get hurt. Let's recall that DynamicCompositeType allows composite column names without any limitation on what each component type can be. It was added to basically allow different rows of the same column family to each store a different index. So for instance you would have: {noformat} index1: { bar:24 -> someval bar:42 -> someval foo:12 -> someval ... } index2: { 0:uuid1:3.2 -> someval 1:uuid2:2.2 -> someval ... } {noformat} where index1, index2, ... are rows. So each row has columns whose names have a similar structure (so they can be compared), but between rows the structure can be different (we never compare two columns from two different rows). But the problem is the following: what happens if in the index1 row above, you insert a column whose name is 0:uuid1? There is no really meaningful way to compare bar:24 and 0:uuid1. The current implementation of DynamicCompositeType, when confronted with this, says that it is a user error and throws a MarshalException. The problem with that is that the exception is not thrown at insert time, and it *cannot* be, because of the dynamic nature of the comparator. But that means that if you do insert the wrong column in the wrong row, you end up *corrupting* an sstable. That is too dangerous a behavior. And it's probably made worse by the fact that some people probably think that DynamicCompositeType should be superior to CompositeType since, you know, it's dynamic. One solution to that problem could be to decide on some random (but predictable) order between two incomparable components. For example we could decide that IntType < LongType < StringType < ... Note that even if we do that, I would suggest renaming DynamicCompositeType to something that suggests that CompositeType is always preferable to DynamicCompositeType unless you're really doing very advanced stuff. Opinions? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
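A sketch of the type-first comparison Ed describes (AbstractType is the real comparator base class; the method itself is illustrative, not proposed code): components of different types never interleave, so a slice from (ed) to (jonathan) cannot pick up (103,32).

{code}
// Sketch: order components by type first; compare values only for equal types.
int compareComponents(AbstractType leftType, ByteBuffer left,
                      AbstractType rightType, ByteBuffer right)
{
    // the "type key" could equally be the registered alias byte or a classname hash
    int cmp = leftType.getClass().getName().compareTo(rightType.getClass().getName());
    if (cmp != 0)
        return cmp;                       // different types sort apart, deterministically
    return leftType.compare(left, right); // same type: the normal typed comparison
}
{code}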
[jira] [Commented] (CASSANDRA-3625) Do something about DynamicCompositeType
[ https://issues.apache.org/jira/browse/CASSANDRA-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169060#comment-13169060 ] Jonathan Ellis commented on CASSANDRA-3625: --- bq. For example, a row that had dynamic composite columns (ed,5), (jonathan,6), and (103, 32), that was sliced from (ed) to (jonathan) could have the (103, 32) in the middle Right, but I thought we were positing that You Shouldn't Do That. In which case as long as it doesn't crash, I'm good. :)
[jira] [Commented] (CASSANDRA-3625) Do something about DynamicCompositeType
[ https://issues.apache.org/jira/browse/CASSANDRA-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169068#comment-13169068 ] Ed Anuff commented on CASSANDRA-3625: - I'm not positing that at all; I can think of a number of good reasons why it can happen and is even desirable. I'd really strongly urge we do the compare on the component type. I don't think the fix is any more complicated, and it will be much preferable from a data modelling standpoint.
[wiki.cassandra-jdbc] push by - Edited wiki page HowToBuild through web user interface. on 2011-12-13 23:50 GMT
Revision: 5280e68bfdf5 Author: john.eric.evans john.eric.ev...@gmail.com Date: Tue Dec 13 15:50:25 2011 Log: Edited wiki page HowToBuild through web user interface. http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=5280e68bfdf5&repo=wiki Modified: /HowToBuild.wiki === --- /HowToBuild.wiki Tue Dec 13 15:48:41 2011 +++ /HowToBuild.wiki Tue Dec 13 15:50:25 2011 @@ -1,3 +1,4 @@ +#labels Featured #Maven, FML. = Building =
[wiki.cassandra-jdbc] push by - Some build doc. on 2011-12-13 23:48 GMT
Revision: a8f3cd03dba3 Author: john.eric.evans john.eric.ev...@gmail.com Date: Tue Dec 13 15:48:41 2011 Log: Some build doc. http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=a8f3cd03dba3&repo=wiki Added: /HowToBuild.wiki === --- /dev/null +++ /HowToBuild.wiki Tue Dec 13 15:48:41 2011 @@ -0,0 +1,49 @@ +#Maven, FML. + += Building = + +== Satisfying Dependencies == + +The JDBC driver has a dependency on two [http://cassandra.apache.org Cassandra] jars, `cassandra-clientutil` and `cassandra-thrift`, neither of which will be available through a Maven repository until the release of Cassandra 1.1.0. In the meantime you must ~~shave a yak~~ satisfy this dependency manually. + +First, download the source and build the jar artifacts. + +{{{ +$ svn checkout https://svn.apache.org/repos/asf/cassandra/trunk cassandra +$ cd cassandra +$ ant jar +}}} + +When complete, install the artifacts to `~/.m2`: + +{{{ +mvn install:install-file -DgroupId=org.apache.cassandra \ +-DartifactId=cassandra-clientutil -Dversion=1.1-dev-SNAPSHOT -Dpackaging=jar \ +-Dfile=build/apache-cassandra-clientutil-1.1-dev-SNAPSHOT.jar +... +mvn install:install-file -DgroupId=org.apache.cassandra \ +-DartifactId=cassandra-thrift -Dversion=1.1-dev-SNAPSHOT -Dpackaging=jar \ +-Dfile=build/apache-cassandra-thrift-1.1-dev-SNAPSHOT.jar +}}} + +== Building == + +[http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/checkout Check out the source] and build with either Maven: + +{{{ +$ mvn compile +}}} + +Or ant: + +{{{ +$ ant +}}} + +== IDE == + +To generate project files for [http://www.eclipse.org/ Eclipse]: + +{{{ +$ mvn eclipse:eclipse +}}}
[cassandra-jdbc] 2 new revisions pushed by john.eri...@gmail.com on 2011-12-13 23:29 GMT
2 new revisions: Revision: 5ec85ae43461 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 12:57:26 2011 Log: add dependency on thrift (temporary?) http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=5ec85ae43461 Revision: c80adc5f4bd2 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 15:21:15 2011 Log: IN (...) is broken and requires an aliased key... http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=c80adc5f4bd2 == Revision: 5ec85ae43461 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 12:57:26 2011 Log: add dependency on thrift (temporary?) http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=5ec85ae43461 Modified: /pom.xml === --- /pom.xml Mon Nov 7 15:59:43 2011 +++ /pom.xml Tue Dec 13 12:57:26 2011 @@ -132,6 +132,11 @@ <version>1.6.1</version> <scope>test</scope> </dependency> +<dependency> + <groupId>org.apache.thrift</groupId> + <artifactId>libthrift</artifactId> + <version>0.6.1</version> +</dependency> </dependencies> <build> == Revision: c80adc5f4bd2 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 15:21:15 2011 Log: IN (...) is broken and requires an aliased key See https://issues.apache.org/jira/browse/CASSANDRA-3627 http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=c80adc5f4bd2 Modified: /src/test/java/org/apache/cassandra/cql/JdbcDriverTest.java /src/test/java/org/apache/cassandra/cql/Schema.java /src/test/java/org/apache/cassandra/cql/jdbc/PreparedStatementTest.java === --- /src/test/java/org/apache/cassandra/cql/JdbcDriverTest.java Thu Oct 13 01:56:33 2011 +++ /src/test/java/org/apache/cassandra/cql/JdbcDriverTest.java Tue Dec 13 15:21:15 2011 @@ -67,8 +67,9 @@ String[] inserts = { String.format("UPDATE Standard1 SET '%s' = '%s', '%s' = '%s' WHERE KEY = '%s'", first, firstrec, last, lastrec, jsmith), -"UPDATE JdbcInteger SET 1 = 11, 2 = 22, 42='fortytwo' WHERE KEY = '" + jsmith + "'", -"UPDATE JdbcInteger SET 3 = 33, 4 = 44 WHERE KEY = '" + jsmith + "'", +"UPDATE JdbcInteger0 SET 1 = 11, 2 = 22, 42='fortytwo' WHERE KEY = '" + jsmith + "'", +"UPDATE JdbcInteger0 SET 3 = 33, 4 = 44 WHERE KEY = '" + jsmith + "'", +"UPDATE JdbcInteger1 SET 1 = 'One', 2 = 'Two', 3 = 'Three' WHERE id = rowOne", "UPDATE JdbcLong SET 1 = 11, 2 = 22 WHERE KEY = '" + jsmith + "'", "UPDATE JdbcAscii SET 'first' = 'firstrec', last = 'lastrec' WHERE key = '" + jsmith + "'", String.format("UPDATE JdbcBytes SET '%s' = '%s', '%s' = '%s' WHERE key = '%s'", first, firstrec, last, lastrec, jsmith), @@ -133,8 +134,8 @@ { String key = bytesToHex("Integer".getBytes()); Statement stmt = con.createStatement(); -stmt.executeUpdate("update JdbcInteger set 1=36893488147419103232, 42='fortytwofortytwo' where key='" + key + "'"); -ResultSet rs = stmt.executeQuery("select 1, 2, 42 from JdbcInteger where key='" + key + "'"); +stmt.executeUpdate("update JdbcInteger0 set 1=36893488147419103232, 42='fortytwofortytwo' where key='" + key + "'"); +ResultSet rs = stmt.executeQuery("select 1, 2, 42 from JdbcInteger0 where key='" + key + "'"); assert rs.next(); assert rs.getObject(1).equals(new BigInteger("36893488147419103232")); assert rs.getString(42).equals("fortytwofortytwo") : rs.getString(42); @@ -145,7 +146,7 @@ expectedMetaData(md, 2, BigInteger.class.getName(), "JdbcInteger", Schema.KEYSPACE_NAME, 2, Types.BIGINT, JdbcInteger.class.getSimpleName(), true, false); expectedMetaData(md, 3, String.class.getName(), "JdbcInteger", Schema.KEYSPACE_NAME, 42, Types.VARCHAR, JdbcUTF8.class.getSimpleName(), false, true); -rs = stmt.executeQuery("select key, 1, 2, 42 from JdbcInteger where key='" + key + "'"); +rs = stmt.executeQuery("select key, 1, 2, 42 from JdbcInteger0 where key='" + key + "'"); assert rs.next(); assert Arrays.equals(rs.getBytes("key"), hexToBytes(key)); assert rs.getObject(1).equals(new BigInteger("36893488147419103232")); @@ -281,13 +282,13 @@ { Statement stmt = con.createStatement(); List<String> keys = Arrays.asList(jsmith); -String selectQ = "SELECT 1, 2 FROM JdbcInteger WHERE KEY='" + jsmith + "'"; +String selectQ = "SELECT 1, 2 FROM JdbcInteger0 WHERE KEY='" + jsmith + "'"; checkResultSet(stmt.executeQuery(selectQ), "Int", 1, keys, 1, 2); -selectQ = "SELECT 3, 4 FROM JdbcInteger WHERE KEY='" + jsmith + "'"; +selectQ = "SELECT 3, 4 FROM JdbcInteger0 WHERE KEY='" + jsmith + "'";
[cassandra-jdbc] 3 new revisions pushed by john.eri...@gmail.com on 2011-12-13 23:26 GMT
3 new revisions: Revision: 92cb0506c77b Author: Eric Evans e...@acunu.com Date: Tue Dec 13 12:57:26 2011 Log: add dependency on thrift (temporary?) http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=92cb0506c77b Revision: 93551543de06 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 15:20:46 2011 Log: do not hard code host/port http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=93551543de06 Revision: 6acadeb166f9 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 15:21:15 2011 Log: IN (...) is broken and requires an aliased key... http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=6acadeb166f9 == Revision: 92cb0506c77b Author: Eric Evans e...@acunu.com Date: Tue Dec 13 12:57:26 2011 Log: add dependency on thrift (temporary?) http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=92cb0506c77b Modified: /pom.xml === --- /pom.xml Thu Dec 1 10:36:57 2011 +++ /pom.xml Tue Dec 13 12:57:26 2011 @@ -132,6 +132,11 @@ <version>1.6.1</version> <scope>test</scope> </dependency> +<dependency> + <groupId>org.apache.thrift</groupId> + <artifactId>libthrift</artifactId> + <version>0.6.1</version> +</dependency> </dependencies> <build> == Revision: 93551543de06 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 15:20:46 2011 Log: do not hard code host/port http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=93551543de06 Modified: /src/test/java/org/apache/cassandra/cql/jdbc/PreparedStatementTest.java === --- /src/test/java/org/apache/cassandra/cql/jdbc/PreparedStatementTest.java Thu Dec 1 10:38:15 2011 +++ /src/test/java/org/apache/cassandra/cql/jdbc/PreparedStatementTest.java Tue Dec 13 15:20:46 2011 @@ -38,17 +38,17 @@ public class PreparedStatementTest { private static java.sql.Connection con = null; - -//private static final Schema schema = new Schema(ConnectionDetails.getHost(), ConnectionDetails.getPort()); -private static final Schema schema = new Schema("localhost", 9160); +private static final Schema schema = new Schema(ConnectionDetails.getHost(), ConnectionDetails.getPort()); @BeforeClass public static void waxOn() throws Exception { schema.createSchema(); Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver"); -con = DriverManager.getConnection(String.format("jdbc:cassandra://%s:%d/%s", ConnectionDetails.getHost(), ConnectionDetails.getPort(), Schema.KEYSPACE_NAME)); -//con = DriverManager.getConnection(String.format("jdbc:cassandra://%s:%d/%s", "localhost", 9160, Schema.KEYSPACE_NAME)); +con = DriverManager.getConnection(String.format("jdbc:cassandra://%s:%d/%s", + ConnectionDetails.getHost(), + ConnectionDetails.getPort(), + Schema.KEYSPACE_NAME)); } @Test == Revision: 6acadeb166f9 Author: Eric Evans e...@acunu.com Date: Tue Dec 13 15:21:15 2011 Log: IN (...) is broken and requires an aliased key See https://issues.apache.org/jira/browse/CASSANDRA-3627 http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/detail?r=6acadeb166f9 Modified: /src/test/java/org/apache/cassandra/cql/JdbcDriverTest.java /src/test/java/org/apache/cassandra/cql/Schema.java /src/test/java/org/apache/cassandra/cql/jdbc/PreparedStatementTest.java === --- /src/test/java/org/apache/cassandra/cql/JdbcDriverTest.java Thu Oct 13 01:56:33 2011 +++ /src/test/java/org/apache/cassandra/cql/JdbcDriverTest.java Tue Dec 13 15:21:15 2011 @@ -67,8 +67,9 @@ String[] inserts = { String.format("UPDATE Standard1 SET '%s' = '%s', '%s' = '%s' WHERE KEY = '%s'", first, firstrec, last, lastrec, jsmith), -"UPDATE JdbcInteger SET 1 = 11, 2 = 22, 42='fortytwo' WHERE KEY = '" + jsmith + "'", -"UPDATE JdbcInteger SET 3 = 33, 4 = 44 WHERE KEY = '" + jsmith + "'", +"UPDATE JdbcInteger0 SET 1 = 11, 2 = 22, 42='fortytwo' WHERE KEY = '" + jsmith + "'", +"UPDATE JdbcInteger0 SET 3 = 33, 4 = 44 WHERE KEY = '" + jsmith + "'", +"UPDATE JdbcInteger1 SET 1 = 'One', 2 = 'Two', 3 = 'Three' WHERE id = rowOne", "UPDATE JdbcLong SET 1 = 11, 2 = 22 WHERE KEY = '" + jsmith + "'", "UPDATE JdbcAscii SET 'first' = 'firstrec', last = 'lastrec' WHERE key = '" + jsmith + "'", String.format("UPDATE JdbcBytes SET '%s' =
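The limitation these commits work around can also be seen from the driver side. The snippet below is a hedged illustration only (the keyspace name, host, and row values are hypothetical): per the commit message and CASSANDRA-3627, an IN (...) clause works against a key that has been given an alias in the schema (here "id", as the JdbcInteger1 test table sets up), not against the default KEY.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hedged illustration of the IN (...) limitation noted in CASSANDRA-3627:
// the IN clause is issued against an aliased key column ("id"), not KEY.
public class InQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver");
        Connection con = DriverManager.getConnection("jdbc:cassandra://localhost:9160/TestKS");
        try {
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery(
                "SELECT 1, 2, 3 FROM JdbcInteger1 WHERE id IN ('rowOne', 'rowTwo')");
            while (rs.next())
                System.out.println(rs.getString(1)); // first selected column
        } finally {
            con.close();
        }
    }
}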
[jira] [Commented] (CASSANDRA-3625) Do something about DynamicCompositeType
[ https://issues.apache.org/jira/browse/CASSANDRA-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169082#comment-13169082 ] Matt Stump commented on CASSANDRA-3625: --- Until a long-term solution is found, would it be possible to get something into the documentation warning people away from DynamicCompositeType? It was featured rather prominently in Ed's talk, so people may mistakenly believe that DynamicCompositeType is the preferred method to create dynamic indexes.
[jira] [Commented] (CASSANDRA-3621) nodetool is trying to contact old ip address
[ https://issues.apache.org/jira/browse/CASSANDRA-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169137#comment-13169137 ] Zenek Kraweznik commented on CASSANDRA-3621: I've restored a backup on the test cluster, so the hostnames must change. nodetool is trying to contact old ip address Key: CASSANDRA-3621 URL: https://issues.apache.org/jira/browse/CASSANDRA-3621 Project: Cassandra Issue Type: Bug Affects Versions: 0.8.8 Environment: java 1.6.26, linux Reporter: Zenek Kraweznik My Cassandra used to have addresses in the 10.0.1.0/24 network; I moved it to the 10.0.2.0/24 network (for security reasons). I want to test the new Cassandra before upgrading the production instances. I've made a snapshot and moved it to the test servers (except the system/LocationInfo* files). Changes in configuration: IP addresses (seeds, listen address, etc.) and cluster name. The test servers are in the 10.0.1.0/24 network. In the logs I see that the test nodes are seeing each other, but when I try to show the ring I get this error: casstest1:/# nodetool -h 10.0.1.211 ring Error connection to remote JMX agent! java.rmi.ConnectIOException: Exception creating connection to: 10.1.0.201; nested exception is: java.net.NoRouteToHostException: No route to host at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:614) at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:198) at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184) at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:110) at javax.management.remote.rmi.RMIServerImpl_Stub.newClient(Unknown Source) at javax.management.remote.rmi.RMIConnector.getConnection(RMIConnector.java:2329) at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:279) at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248) at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:140) at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:110) at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:582) Caused by: java.net.NoRouteToHostException: No route to host at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351) at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213) at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366) at java.net.Socket.connect(Socket.java:529) at java.net.Socket.connect(Socket.java:478) at java.net.Socket.<init>(Socket.java:375) at java.net.Socket.<init>(Socket.java:189) at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:22) at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:128) at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:595) ... 10 more casstest1:/# The old production addresses in 10.0.1.0/24 were: 10.0.1.201, 10.0.1.202, 10.0.1.203. The new addresses for the tests: 10.0.1.211, 10.0.1.212, 10.0.1.213.
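The NoRouteToHostException above is characteristic of JMX over RMI, which the sketch below illustrates (a simplified stand-in, not the actual NodeProbe code): nodetool's initial registry lookup does go to the -h host, but the RMIServer stub the registry hands back embeds the address the server JVM advertised when it started (its resolved hostname, or java.rmi.server.hostname if set), so a restored node that still advertises the old production address redirects the second hop there.

import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Simplified version of what `nodetool -h 10.0.1.211 ring` does. The first
// hop goes to the address in the URL below; the second hop goes to whatever
// address the server's RMI stub embeds, which is where the stale address
// in the trace above comes from.
public class JmxConnectSketch {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://10.0.1.211:7199/jmxrmi"); // 7199: default JMX port
        JMXConnector jmxc = JMXConnectorFactory.connect(url, null);
        try {
            System.out.println(jmxc.getMBeanServerConnection().getDefaultDomain());
        } finally {
            jmxc.close();
        }
    }
}

Setting java.rmi.server.hostname to the node's current address in the server JVM's startup options is the usual way to pin that second hop.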
[jira] [Created] (CASSANDRA-3629) Bootstrapping nodes don't ensure schema is ready before continuing
Bootstrapping nodes don't ensure schema is ready before continuing -- Key: CASSANDRA-3629 URL: https://issues.apache.org/jira/browse/CASSANDRA-3629 Project: Cassandra Issue Type: Bug Components: Core Reporter: Brandon Williams Assignee: Brandon Williams Fix For: 1.0.7 A bootstrapping node will assume that after it has slept for RING_DELAY it has received all of the schema migrations and can continue the bootstrap process. However, with a large enough number of migrations this is not sufficient and causes problems.
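A sketch of the direction a fix could take (illustrative only; the ClusterView interface and method names are hypothetical, and this is not the committed patch): rather than trusting a fixed RING_DELAY sleep, poll the gossiped schema versions until all live endpoints converge on a single version, then let the bootstrap proceed.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: wait for schema agreement instead of a fixed sleep.
public class SchemaWaitSketch {
    // Maps each distinct schema version to the endpoints reporting it,
    // in the spirit of Thrift's describe_schema_versions.
    interface ClusterView { Map<String, Set<String>> schemaVersions(); }

    static void awaitSchemaAgreement(ClusterView view, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            // a single distinct schema version across endpoints == agreement
            if (view.schemaVersions().keySet().size() <= 1)
                return;
            TimeUnit.SECONDS.sleep(1); // re-check as migrations arrive
        }
        throw new IllegalStateException("schema did not settle before bootstrap");
    }
}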
[jira] [Issue Comment Edited] (CASSANDRA-3621) nodetool is trying to contact old ip address
[ https://issues.apache.org/jira/browse/CASSANDRA-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169137#comment-13169137 ] Zenek Kraweznik edited comment on CASSANDRA-3621 at 12/14/11 7:40 AM: -- I've restored a backup on the test cluster, so the hostnames must change. And I've never used hostnames in the configuration. What is the meaning of hostname here? was (Author: zenek_kraweznik0): I've restored backup on test cluster, so hostnames must change.