Re: Nodes Dying in 2.1.2

2015-01-03 Thread Phil Burress
Looks like someone else is experiencing almost exactly what we are seeing:
https://issues.apache.org/jira/browse/CASSANDRA-8552


On Mon, Dec 29, 2014 at 5:14 PM, Robert Coli rc...@eventbrite.com wrote:

 Might be https://issues.apache.org/jira/browse/CASSANDRA-8061 or one of
 the linked/duplicate tickets.

 =Rob

 On Mon, Dec 29, 2014 at 1:40 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Dec 24, 2014 at 9:41 AM, Phil Burress philtburr...@gmail.com
 wrote:

 Just upgraded our cluster from 2.1.1 to 2.1.2 and our nodes keep dying.
 The kernel is killing the process due to out of memory:


 https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/


 Appears to only occur during compactions. We've tried playing with the
 heap settings but nothing has worked thus far. We did not have this issue
 until we upgraded. Anyone else run into this or have suggestions?


 I would :

 1) see if downgrade is possible (while unsupported, it probably is
 possible) and downgrade if so
 2) search JIRA 2.1 era for related issues
 3) examine changes from 2.1.1 to 2.1.2 which relate to compaction
 4) file a JIRA describing your experience if no prior one exists

 =Rob





Re: Reply: Downgrade from 2.1.2 to 2.1.1

2014-12-31 Thread Phil Burress
Why don't you use incremental repairs? Is there a known issue with
incremental repairs in 2.1.x?

On Tue, Dec 30, 2014 at 10:22 PM, 李建奇 lijia...@jd.com wrote:

 We have also run into some problems with 2.1.2, but I think we can deal with them.

 First, we don't use incremental repair.

 Second, we restart each node after repair; that releases the tmplink sstables (one way to spot them is sketched below).

 Third, we don't use the stop COMPACTION command.



 If you read the 2.1.2 release notes, you will find it fixes some issues from 2.1.1.

 We are waiting for 2.1.3.
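
 (One way to spot the leaked tmplink sstables mentioned above; the data path below assumes a default package install:)

    # List any tmplink sstable files left behind after repair (a known 2.1.x nuisance).
    find /var/lib/cassandra/data -name '*tmplink*' -ls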




 --

 李建奇

 JD.com [Operations R&D Architect]

 =

 Mobile: 18600360358

 Address: 25F, Chaolin Plaza, Daxing District, Beijing

 Postal code: 100101

 =

 Shop online at JD.com: save money and shop with confidence!

 www.jd.com



 *From:* Phil Burress [mailto:philtburr...@gmail.com]
 *Sent:* December 31, 2014 2:53
 *To:* user@cassandra.apache.org
 *Subject:* Re: Downgrade from 2.1.2 to 2.1.1



 Thanks Rob.



 On Tue, Dec 30, 2014 at 1:38 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Dec 30, 2014 at 9:42 AM, Phil Burress philtburr...@gmail.com
 wrote:

 We are having a lot of problems with release 2.1.2. It was suggested here
 we should downgrade to 2.1.1 if possible.



 For the experts out there, do you foresee any issues in doing this?



 Not sure if advice from the person who suggested the downgrade is what
 you're looking for, but...



 The classes of risk are :



 - Incompatible changes to system keyspace values (unlikely, but possible
 in a minor version)

 - File format changes (very unlikely in a minor version)

 - Network protocol changes (very unlikely in a minor version)

 - Unexpected exceptions of other classes (monitor in logs)



 Really, read the CHANGES.txt for 2.1.2 and look for the above classes of
 risk. If you have any questions about specific tickets, feel free to ask
 on-thread?



 It's also worth pointing out that you can just downgrade a single node and
 see if it still works. If it does, and doesn't throw exceptions, you're probably fine?



 =Rob

 PS - Pro-forma disclaimer that downgrading is officially unsupported.





Re: Downgrade from 2.1.2 to 2.1.1

2014-12-30 Thread Phil Burress
Thanks Rob.

On Tue, Dec 30, 2014 at 1:38 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Dec 30, 2014 at 9:42 AM, Phil Burress philtburr...@gmail.com
 wrote:

 We are having a lot of problems with release 2.1.2. It was suggested here
 we should downgrade to 2.1.1 if possible.

 For the experts out there, do you foresee any issues in doing this?


 Not sure if advice from the person who suggested the downgrade is what
 you're looking for, but...

 The classes of risk are :

 - Incompatible changes to system keyspace values (unlikely, but possible
 in a minor version)
 - File format changes (very unlikely in a minor version)
 - Network protocol changes (very unlikely in a minor version)
 - Unexpected exceptions of other classes (monitor in logs)

 Really, read the CHANGES.txt for 2.1.2 and look for the above classes of
 risk. If you have any questions about specific tickets, feel free to ask
 on-thread?
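
 (A rough sketch for skimming the 2.1.2 block of CHANGES.txt for those risk classes; the grep patterns are illustrative only:)

    # Print the 2.1.2 section of CHANGES.txt and flag entries touching the riskier areas above.
    sed -n '/^2.1.2/,/^2.1.1/p' CHANGES.txt | grep -iE 'sstable|format|system keyspace|schema|protocol'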

 It's also worth pointing out that you can just downgrade a single node and
 see if it still works. If it does, and doesn't throw exceptions, you're probably fine?

 =Rob
 PS - Pro-forma disclaimer that downgrading is officially unsupported.



Nodes Dying in 2.1.2

2014-12-24 Thread Phil Burress
Just upgraded our cluster from 2.1.1 to 2.1.2 and our nodes keep dying. The
kernel is killing the process due to out of memory:

kernel:  Out of memory: Kill process 6267 (java) score 998 or sacrifice
child

Appears to only occur during compactions. We've tried playing with the heap
settings but nothing has worked thus far. We did not have this issue until
we upgraded. Anyone else run into this or have suggestions?

Thanks!

Phil
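
(For reference, the heap knobs in question live in conf/cassandra-env.sh; a minimal sketch with illustrative values only. The kernel OOM killer acts on total resident memory, heap plus off-heap, so capping the heap simply leaves more headroom for everything else:)

    # conf/cassandra-env.sh -- illustrative values, not a recommendation
    MAX_HEAP_SIZE="4G"      # cap the JVM heap explicitly instead of letting it auto-size
    HEAP_NEWSIZE="800M"     # young-generation size, often scaled to the CPU core count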


Re: nodetool repair -snapshot option?

2014-07-01 Thread Phil Burress
Thanks! We retrieved all the ranges and started running repair on them. We
ran through all of them but found one single range which brought the ENTIRE
cluster down. All of the other ranges ran quickly and smoothly. This one
problematic range reliably brings it down every time we try to run repair
on it. Any thoughts on why one specific range would be a troublemaker?


On Tue, Jul 1, 2014 at 11:44 AM, Ken Hancock ken.hanc...@schange.com
wrote:

 I also expanded on a script originally written by Matt Stump @ Datastax.
 The readme has the reasoning behind requiring sub-range repairs.

 https://github.com/hancockks/cassandra_range_repair




 On Mon, Jun 30, 2014 at 10:20 PM, Phil Burress philburress...@gmail.com
 wrote:

 @Paulo, this is very cool! Thanks very much for the link!


 On Mon, Jun 30, 2014 at 9:37 PM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 If you find it useful, I created a tool where you input the node IP,
 keyspace, column family, and optionally the number of partitions (default:
 32K), and it outputs the list of subranges for that node, CF, partition
 size: https://github.com/pauloricardomg/cassandra-list-subranges

 So you can basically iterate over the output of that and do subrange
 repair for each node and cf, maybe in parallel. :)


 On Mon, Jun 30, 2014 at 10:26 PM, Phil Burress philburress...@gmail.com
  wrote:

 One last question. Any tips on scripting a subrange repair?


 On Mon, Jun 30, 2014 at 7:12 PM, Phil Burress philburress...@gmail.com
  wrote:

 We are running repair -pr. We've tried subrange manually and that
 seems to work ok. I guess we'll go with that going forward. Thanks for all
 the info!


 On Mon, Jun 30, 2014 at 6:52 PM, Jaydeep Chovatia 
 chovatia.jayd...@gmail.com wrote:

 Are you running a full repair or on a subset? If you are running a full
 repair, then try running on a subset of ranges, which means less data to
 worry about during repair and is easier on the Java heap in general. You
 will have to do multiple iterations to cover the entire range, but at least
 it will work.

 -jaydeep


 On Mon, Jun 30, 2014 at 3:22 PM, Robert Coli rc...@eventbrite.com
 wrote:

 On Mon, Jun 30, 2014 at 3:08 PM, Yuki Morishita mor.y...@gmail.com
 wrote:

 Repair uses snapshot option by default since 2.0.2 (see NEWS.txt).


 As a general meta comment, the process by which operationally
 important defaults change in Cassandra seems ad-hoc and sub-optimal.

 For the record, my view was that this change, which makes repair even
 slower than it previously was, was probably overly optimistic.

 It's also weird in that it changes default behavior which has been
 unchanged since the start of Cassandra time and is therefore probably
 automated against. Why was it so critically important to switch to 
 snapshot
 repair that it needed to be shotgunned as a new default in 2.0.2?

 =Rob








 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200





 --
 *Ken Hancock *| System Architect, Advanced Advertising
 SeaChange International
 50 Nagog Park
 Acton, Massachusetts 01720
 ken.hanc...@schange.com | www.schange.com | NASDAQ:SEAC
 http://www.schange.com/en-US/Company/InvestorRelations.aspx
 Office: +1 (978) 889-3329 | [image: Google Talk:] ken.hanc...@schange.com
  | [image: Skype:]hancockks | [image: Yahoo IM:]hancockks [image:
 LinkedIn] http://www.linkedin.com/in/kenhancock

 [image: SeaChange International]
 http://www.schange.com/ This e-mail and any attachments may contain
 information which is SeaChange International confidential. The information
 enclosed is intended only for the addressees herein and may not be copied
 or forwarded without permission from SeaChange International.



nodetool repair -snapshot option?

2014-06-30 Thread Phil Burress
We are running into an issue with nodetool repair. One or more of our nodes
will die with OOM errors when running nodetool repair on a single node. I was
reading this http://www.datastax.com/dev/blog/advanced-repair-techniques
and it mentioned using the -snapshot option; however, that doesn't appear
to be an option in the version we have. We are running 2.0.7 with vnodes.
Any insight into what might be causing these OOMs and/or what version this
-snapshot option is available in?

Thanks!

Phil


Re: nodetool repair -snapshot option?

2014-06-30 Thread Phil Burress
We are running repair -pr. We've tried subrange manually and that seems to
work ok. I guess we'll go with that going forward. Thanks for all the info!


On Mon, Jun 30, 2014 at 6:52 PM, Jaydeep Chovatia 
chovatia.jayd...@gmail.com wrote:

 Are you running a full repair or on a subset? If you are running a full
 repair, then try running on a subset of ranges, which means less data to
 worry about during repair and is easier on the Java heap in general. You
 will have to do multiple iterations to cover the entire range, but at least
 it will work.

 -jaydeep


 On Mon, Jun 30, 2014 at 3:22 PM, Robert Coli rc...@eventbrite.com wrote:

 On Mon, Jun 30, 2014 at 3:08 PM, Yuki Morishita mor.y...@gmail.com
 wrote:

 Repair uses snapshot option by default since 2.0.2 (see NEWS.txt).


 As a general meta comment, the process by which operationally important
 defaults change in Cassandra seems ad-hoc and sub-optimal.

 For the record, my view was that this change, which makes repair even
 slower than it previously was, was probably overly optimistic.

 It's also weird in that it changes default behavior which has been
 unchanged since the start of Cassandra time and is therefore probably
 automated against. Why was it so critically important to switch to snapshot
 repair that it needed to be shotgunned as a new default in 2.0.2?

 =Rob






Re: nodetool repair -snapshot option?

2014-06-30 Thread Phil Burress
One last question. Any tips on scripting a subrange repair?


On Mon, Jun 30, 2014 at 7:12 PM, Phil Burress philburress...@gmail.com
wrote:

 We are running repair -pr. We've tried subrange manually and that seems to
 work ok. I guess we'll go with that going forward. Thanks for all the info!


 On Mon, Jun 30, 2014 at 6:52 PM, Jaydeep Chovatia 
 chovatia.jayd...@gmail.com wrote:

 Are you running a full repair or on a subset? If you are running a full
 repair, then try running on a subset of ranges, which means less data to
 worry about during repair and is easier on the Java heap in general. You
 will have to do multiple iterations to cover the entire range, but at least
 it will work.

 -jaydeep


 On Mon, Jun 30, 2014 at 3:22 PM, Robert Coli rc...@eventbrite.com
 wrote:

 On Mon, Jun 30, 2014 at 3:08 PM, Yuki Morishita mor.y...@gmail.com
 wrote:

 Repair uses snapshot option by default since 2.0.2 (see NEWS.txt).


 As a general meta comment, the process by which operationally important
 defaults change in Cassandra seems ad-hoc and sub-optimal.

 For the record, my view was that this change, which makes repair even
 slower than it previously was, was probably overly optimistic.

 It's also weird in that it changes default behavior which has been
 unchanged since the start of Cassandra time and is therefore probably
 automated against. Why was it so critically important to switch to snapshot
 repair that it needed to be shotgunned as a new default in 2.0.2?

 =Rob







Re: nodetool repair -snapshot option?

2014-06-30 Thread Phil Burress
@Paulo, this is very cool! Thanks very much for the link!


On Mon, Jun 30, 2014 at 9:37 PM, Paulo Ricardo Motta Gomes 
paulo.mo...@chaordicsystems.com wrote:

 If you find it useful, I created a tool where you input the node IP,
 keyspace, column family, and optionally the number of partitions (default:
 32K), and it outputs the list of subranges for that node, CF, partition
 size: https://github.com/pauloricardomg/cassandra-list-subranges

 So you can basically iterate over the output of that and do subrange
 repair for each node and cf, maybe in parallel. :)
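
 (A minimal sketch of that kind of loop; ranges.txt, the keyspace, and the column family below are placeholders to fill in, and the -st/-et flags pass the subrange bounds to nodetool repair:)

    #!/bin/bash
    # Repair one node's token subranges sequentially (sketch).
    # ranges.txt is assumed to hold one "start_token end_token" pair per line.
    KEYSPACE=my_keyspace    # placeholder: your keyspace
    CF=my_cf                # placeholder: your column family
    while read -r START END; do
        echo "Repairing ($START, $END] of $KEYSPACE.$CF"
        nodetool repair -st "$START" -et "$END" "$KEYSPACE" "$CF"
    done < ranges.txt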


 On Mon, Jun 30, 2014 at 10:26 PM, Phil Burress philburress...@gmail.com
 wrote:

 One last question. Any tips on scripting a subrange repair?


 On Mon, Jun 30, 2014 at 7:12 PM, Phil Burress philburress...@gmail.com
 wrote:

 We are running repair -pr. We've tried subrange manually and that seems
 to work ok. I guess we'll go with that going forward. Thanks for all the
 info!


 On Mon, Jun 30, 2014 at 6:52 PM, Jaydeep Chovatia 
 chovatia.jayd...@gmail.com wrote:

 Are you running a full repair or on a subset? If you are running a full
 repair, then try running on a subset of ranges, which means less data to
 worry about during repair and is easier on the Java heap in general. You
 will have to do multiple iterations to cover the entire range, but at least
 it will work.

 -jaydeep


 On Mon, Jun 30, 2014 at 3:22 PM, Robert Coli rc...@eventbrite.com
 wrote:

 On Mon, Jun 30, 2014 at 3:08 PM, Yuki Morishita mor.y...@gmail.com
 wrote:

 Repair uses snapshot option by default since 2.0.2 (see NEWS.txt).


 As a general meta comment, the process by which operationally
 important defaults change in Cassandra seems ad-hoc and sub-optimal.

 For the record, my view was that this change, which makes repair even
 slower than it previously was, was probably overly optimistic.

 It's also weird in that it changes default behavior which has been
 unchanged since the start of Cassandra time and is therefore probably
 automated against. Why was it so critically important to switch to 
 snapshot
 repair that it needed to be shotgunned as a new default in 2.0.2?

 =Rob








 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200



Ec2 Network I/O

2014-05-19 Thread Phil Burress
Has anyone experienced network I/O issues on EC2? We are seeing a lot of
these in our logs:

HintedHandOffManager.java (line 477) Timed out replaying hints to
/10.0.x.xxx; aborting (15 delivered)

and these...

Cannot handshake version with /10.0.x.xxx

and these...

java.io.IOException: Cannot proceed on repair because a neighbor
(/10.0.x.xxx) is dead: session failed

This occurs on all of our nodes, even though in all cases the host that is
being reported as down or unavailable is up and readily 'pingable'.

We are using shared tenancy on all our nodes (instance type m1.xlarge) with
cassandra 2.0.7. Any suggestions on how to debug these errors?

Is there a recommendation to move to Placement Groups for Cassandra?

Thanks!

Phil


Re: Recommended Approach for Config Changes

2014-04-28 Thread Phil Burress
Thanks for all the good info.

We have found that running drain before restarting should always be done,
even if there is not much data or I/O.

Also, we've found that nodetool drain often returns before it's finished, so
it's important to watch the logs (or OpsCenter) for it and any compaction
tasks to complete before restarting.
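
(A sketch of that sequence; the log path assumes a default package install and the restart command varies by platform:)

    nodetool drain                              # flush memtables and stop accepting writes
    nodetool compactionstats                    # re-run until pending tasks drop to 0
    tail -n 50 /var/log/cassandra/system.log    # confirm the drain has actually finished
    sudo service cassandra restart              # or however your install restarts the process
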
On Apr 25, 2014 1:09 PM, Jon Haddad j...@jonhaddad.com wrote:

 You might want to take a peek at what’s happening in the process via
 strace -p or tcpdump.  I can’t remember ever waiting an hour for a node to
 rejoin.


 On Apr 25, 2014, at 8:59 AM, Tyler Hobbs ty...@datastax.com wrote:


 On Fri, Apr 25, 2014 at 10:43 AM, Phil Burress 
 philburress...@gmail.com wrote:

 Thanks. I made a change to a single node and it took almost an hour to
 rejoin the cluster (go from DN to UP in nodetool status). The cluster is
 pretty much idle right now and has a very small dataset. Is that normal?


 Not unless it had to replay a lot of commitlogs on startup.  If you look
 at your logs and see that that's the case, you may want to run 'nodetool
 drain' before stopping the node.


 --
 Tyler Hobbs
 DataStax http://datastax.com/





Recommended Approach for Config Changes

2014-04-25 Thread Phil Burress
If I wanted to make a configuration change to a single node in a cluster,
what is the recommended approach for doing that? Is it ok to just stop that
instance, make the change and then restart it?


Re: Bootstrap Timing

2014-04-25 Thread Phil Burress
Just a follow-up on this for any interested parties. Ultimately we've
determined that the bootstrap/join process is broken in Cassandra. We ended
up creating an entirely new cluster and migrating the data.


On Mon, Apr 21, 2014 at 10:32 AM, Phil Burress philburress...@gmail.com wrote:

 The new node has managed to stay up without dying for about 24 hours
 now... but it still is in JOINING state. A new concern has popped up. Disk
 usage is at 500GB on the new node. The three original nodes have about 40GB
 each. Any ideas why this is happening?


 On Sat, Apr 19, 2014 at 9:19 PM, Phil Burress philburress...@gmail.com wrote:

 Thank you all for your advice and good info. The node has died a couple
 of times with out of memory errors. I've restarted each time but it starts
 re-running compaction and then dies again.

 Is there a better way to do this?
 On Apr 18, 2014 6:06 PM, Steven A Robenalt srobe...@stanford.edu
 wrote:

 That's what I'd be doing, but I wouldn't expect it to run for 3 days
 this time. My guess is that whatever was going wrong with the bootstrap
 when you had 3 nodes starting at once was interfering with the completion
 of the 1 remaining node of those 3. A clean bootstrap of a single node
 should complete eventually, and I would think it'll be a lot less than 3
 days. Our database is much smaller than yours at the moment, so I can't
 really guide you on how long it should take, but I'd think that others on
 the list with similar database sizes might be able to give you a better
 idea.

 Steve



 On Fri, Apr 18, 2014 at 1:43 PM, Phil Burress 
 philburress...@gmail.com wrote:

 First, I just stopped 2 of the nodes and left one running. But this
 morning, I stopped that third node, cleared out the data, restarted and let
 it rejoin again. It appears streaming is done (according to netstats),
 right now it appears to be running compaction and building secondary index
 (according to compactionstats). Just sit and wait I guess?


 On Fri, Apr 18, 2014 at 2:23 PM, Steven A Robenalt 
 srobe...@stanford.edu wrote:

 Looking back through this email chain, it looks like Phil said he
 wasn't using vnodes.

 For the record, we are using vnodes since we brought up our first
 cluster, and have not seen any issues with bootstrapping new nodes either
 to replace existing nodes, or to grow/shrink the cluster. We did adhere to
 the caveats that new nodes should not be seed nodes, and that we should
 allow each node to join the cluster completely before making any other
 changes.

 Phil, when you dropped to adding just the single node to your cluster,
 did you start over with the newly added node (blowing away the database
 created on the previous startup), or did you shut down the other 2 added
 nodes and leave the remaining one in progress to continue?

 Steve


 On Fri, Apr 18, 2014 at 10:40 AM, Robert Coli rc...@eventbrite.com wrote:

 On Fri, Apr 18, 2014 at 5:05 AM, Phil Burress 
 philburress...@gmail.com wrote:

 nodetool netstats shows 84 files. They are all at 100%. Nothing
 showing in Pending or Active for Read Repair Stats.

 I'm assuming this means it's done. But it still shows JOINING. Is
 there an undocumented step I'm missing here? This whole process seems
 broken to me.


 Lately it seems like a lot more people than usual are :

 1) using vnodes
 2) unable to bootstrap new nodes

 If I were you, I would likely file a JIRA detailing your negative
 experience with this core functionality.

 =Rob






 --
 Steve Robenalt
 Software Architect
  HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu









 --
 Steve Robenalt
 Software Architect
 HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu









Re: Bootstrap Timing

2014-04-25 Thread Phil Burress
Cassandra 2.0.6


On Fri, Apr 25, 2014 at 10:31 AM, James Rothering jrother...@codojo.me wrote:

 What version of C* is this?


 On Fri, Apr 25, 2014 at 6:55 AM, Phil Burress philburress...@gmail.com wrote:

 Just a follow-up on this for any interested parties. Ultimately we've
 determined that the bootstrap/join process is broken in Cassandra. We ended
 up creating an entirely new cluster and migrating the data.


 On Mon, Apr 21, 2014 at 10:32 AM, Phil Burress 
 philburress...@gmail.com wrote:

 The new node has managed to stay up without dying for about 24 hours
 now... but it still is in JOINING state. A new concern has popped up. Disk
 usage is at 500GB on the new node. The three original nodes have about 40GB
 each. Any ideas why this is happening?


 On Sat, Apr 19, 2014 at 9:19 PM, Phil Burress 
 philburress...@gmail.com wrote:

 Thank you all for your advice and good info. The node has died a couple
 of times with out of memory errors. I've restarted each time but it starts
 re-running compaction and then dies again.

 Is there a better way to do this?
 On Apr 18, 2014 6:06 PM, Steven A Robenalt srobe...@stanford.edu
 wrote:

 That's what I'd be doing, but I wouldn't expect it to run for 3 days
 this time. My guess is that whatever was going wrong with the bootstrap
 when you had 3 nodes starting at once was interfering with the completion
 of the 1 remaining node of those 3. A clean bootstrap of a single node
 should complete eventually, and I would think it'll be a lot less than 3
 days. Our database is much smaller than yours at the moment, so I can't
 really guide you on how long it should take, but I'd think that others on
 the list with similar database sizes might be able to give you a better
 idea.

 Steve



 On Fri, Apr 18, 2014 at 1:43 PM, Phil Burress 
 philburress...@gmail.com wrote:

 First, I just stopped 2 of the nodes and left one running. But this
 morning, I stopped that third node, cleared out the data, restarted and 
 let
 it rejoin again. It appears streaming is done (according to netstats),
 right now it appears to be running compaction and building secondary 
 index
 (according to compactionstats). Just sit and wait I guess?


 On Fri, Apr 18, 2014 at 2:23 PM, Steven A Robenalt 
 srobe...@stanford.edu wrote:

 Looking back through this email chain, it looks like Phil said he
 wasn't using vnodes.

 For the record, we are using vnodes since we brought up our first
 cluster, and have not seen any issues with bootstrapping new nodes 
 either
 to replace existing nodes, or to grow/shrink the cluster. We did adhere 
 to
 the caveats that new nodes should not be seed nodes, and that we should
 allow each node to join the cluster completely before making any other
 changes.

 Phil, when you dropped to adding just the single node to your
 cluster, did you start over with the newly added node (blowing away the
 database created on the previous startup), or did you shut down the 
 other 2
 added nodes and leave the remaining one in progress to continue?

 Steve


 On Fri, Apr 18, 2014 at 10:40 AM, Robert Coli 
 rc...@eventbrite.com wrote:

 On Fri, Apr 18, 2014 at 5:05 AM, Phil Burress 
 philburress...@gmail.com wrote:

 nodetool netstats shows 84 files. They are all at 100%. Nothing
 showing in Pending or Active for Read Repair Stats.

 I'm assuming this means it's done. But it still shows JOINING.
 Is there an undocumented step I'm missing here? This whole process 
 seems
 broken to me.


 Lately it seems like a lot more people than usual are :

 1) using vnodes
 2) unable to bootstrap new nodes

 If I were you, I would likely file a JIRA detailing your negative
 experience with this core functionality.

 =Rob






 --
 Steve Robenalt
 Software Architect
  HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu









 --
 Steve Robenalt
 Software Architect
 HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu











Re: Recommended Approach for Config Changes

2014-04-25 Thread Phil Burress
Thanks. I made a change to a single node and it took almost an hour to
rejoin the cluster (go from DN to UP in nodetool status). The cluster is
pretty much idle right now and has a very small dataset. Is that normal?


On Fri, Apr 25, 2014 at 10:08 AM, Chris Lohfink clohf...@blackbirdit.com wrote:

 Yes.

 Some changes you can manually have take effect without a restart (e.g.
 compaction throughput, things settable from JMX).  There are also config
 changes you can't really make, like switching the snitch and such, without a
 big to-do.

 ---
 Chris

 On Apr 25, 2014, at 8:53 AM, Phil Burress philburress...@gmail.com
 wrote:

  If I wanted to make a configuration change to a single node in a
 cluster, what is the recommended approach for doing that? Is it ok to just
 stop that instance, make the change and then restart it?




Re: Bootstrap Timing

2014-04-21 Thread Phil Burress
The new node has managed to stay up without dying for about 24 hours now...
but it still is in JOINING state. A new concern has popped up. Disk usage
is at 500GB on the new node. The three original nodes have about 40GB each.
Any ideas why this is happening?


On Sat, Apr 19, 2014 at 9:19 PM, Phil Burress philburress...@gmail.com wrote:

 Thank you all for your advice and good info. The node has died a couple of
 times with out of memory errors. I've restarted each time but it starts
 re-running compaction and then dies again.

 Is there a better way to do this?
 On Apr 18, 2014 6:06 PM, Steven A Robenalt srobe...@stanford.edu
 wrote:

 That's what I'd be doing, but I wouldn't expect it to run for 3 days this
 time. My guess is that whatever was going wrong with the bootstrap when you
 had 3 nodes starting at once was interfering with the completion of the 1
 remaining node of those 3. A clean bootstrap of a single node should
 complete eventually, and I would think it'll be a lot less than 3 days. Our
 database is much smaller than yours at the moment, so I can't really guide
 you on how long it should take, but I'd think that others on the list with
 similar database sizes might be able to give you a better idea.

 Steve



 On Fri, Apr 18, 2014 at 1:43 PM, Phil Burress 
 philburress...@gmail.com wrote:

 First, I just stopped 2 of the nodes and left one running. But this
 morning, I stopped that third node, cleared out the data, restarted and let
 it rejoin again. It appears streaming is done (according to netstats),
 right now it appears to be running compaction and building secondary index
 (according to compactionstats). Just sit and wait I guess?


 On Fri, Apr 18, 2014 at 2:23 PM, Steven A Robenalt 
 srobe...@stanford.edu wrote:

 Looking back through this email chain, it looks like Phil said he
 wasn't using vnodes.

 For the record, we are using vnodes since we brought up our first
 cluster, and have not seen any issues with bootstrapping new nodes either
 to replace existing nodes, or to grow/shrink the cluster. We did adhere to
 the caveats that new nodes should not be seed nodes, and that we should
 allow each node to join the cluster completely before making any other
 changes.

 Phil, when you dropped to adding just the single node to your cluster,
 did you start over with the newly added node (blowing away the database
 created on the previous startup), or did you shut down the other 2 added
 nodes and leave the remaining one in progress to continue?

 Steve


 On Fri, Apr 18, 2014 at 10:40 AM, Robert Coli rc...@eventbrite.com wrote:

 On Fri, Apr 18, 2014 at 5:05 AM, Phil Burress 
 philburress...@gmail.com wrote:

 nodetool netstats shows 84 files. They are all at 100%. Nothing
 showing in Pending or Active for Read Repair Stats.

 I'm assuming this means it's done. But it still shows JOINING. Is
 there an undocumented step I'm missing here? This whole process seems
 broken to me.


 Lately it seems like a lot more people than usual are :

 1) using vnodes
 2) unable to bootstrap new nodes

 If I were you, I would likely file a JIRA detailing your negative
 experience with this core functionality.

 =Rob






 --
 Steve Robenalt
 Software Architect
  HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu









 --
 Steve Robenalt
 Software Architect
 HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu








Re: Bootstrap Timing

2014-04-19 Thread Phil Burress
Thank you all for your advice and good info. The node has died a couple of
times with out of memory errors. I've restarted each time but it starts
re-running compaction and then dies again.

Is there a better way to do this?
On Apr 18, 2014 6:06 PM, Steven A Robenalt srobe...@stanford.edu wrote:

 That's what I'd be doing, but I wouldn't expect it to run for 3 days this
 time. My guess is that whatever was going wrong with the bootstrap when you
 had 3 nodes starting at once was interfering with the completion of the 1
 remaining node of those 3. A clean bootstrap of a single node should
 complete eventually, and I would think it'll be a lot less than 3 days. Our
 database is much smaller than yours at the moment, so I can't really guide
 you on how long it should take, but I'd think that others on the list with
 similar database sizes might be able to give you a better idea.

 Steve



 On Fri, Apr 18, 2014 at 1:43 PM, Phil Burress philburress...@gmail.com wrote:

 First, I just stopped 2 of the nodes and left one running. But this
 morning, I stopped that third node, cleared out the data, restarted and let
 it rejoin again. It appears streaming is done (according to netstats),
 right now it appears to be running compaction and building secondary index
 (according to compactionstats). Just sit and wait I guess?


 On Fri, Apr 18, 2014 at 2:23 PM, Steven A Robenalt srobe...@stanford.edu
  wrote:

 Looking back through this email chain, it looks like Phil said he wasn't
 using vnodes.

 For the record, we are using vnodes since we brought up our first
 cluster, and have not seen any issues with bootstrapping new nodes either
 to replace existing nodes, or to grow/shrink the cluster. We did adhere to
 the caveats that new nodes should not be seed nodes, and that we should
 allow each node to join the cluster completely before making any other
 changes.

 Phil, when you dropped to adding just the single node to your cluster,
 did you start over with the newly added node (blowing away the database
 created on the previous startup), or did you shut down the other 2 added
 nodes and leave the remaining one in progress to continue?

 Steve


 On Fri, Apr 18, 2014 at 10:40 AM, Robert Coli rc...@eventbrite.com wrote:

 On Fri, Apr 18, 2014 at 5:05 AM, Phil Burress philburress...@gmail.com
  wrote:

 nodetool netstats shows 84 files. They are all at 100%. Nothing
 showing in Pending or Active for Read Repair Stats.

 I'm assuming this means it's done. But it still shows JOINING. Is
 there an undocumented step I'm missing here? This whole process seems
 broken to me.


 Lately it seems like a lot more people than usual are :

 1) using vnodes
 2) unable to bootstrap new nodes

 If I were you, I would likely file a JIRA detailing your negative
 experience with this core functionality.

 =Rob






 --
 Steve Robenalt
 Software Architect
  HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu









 --
 Steve Robenalt
 Software Architect
 HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu








Re: Bootstrap Timing

2014-04-18 Thread Phil Burress
nodetool netstats shows 84 files. They are all at 100%. Nothing showing in
Pending or Active for Read Repair Stats.

I'm assuming this means it's done. But it still shows JOINING. Is there
an undocumented step I'm missing here? This whole process seems broken to
me.


On Thu, Apr 17, 2014 at 4:32 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Apr 16, 2014 at 1:56 PM, Phil Burress philburress...@gmail.com wrote:

 I've shut down two of the nodes and am bootstrapping one right now. Is
 there any way to tell when it will finish bootstrapping?


 nodetool netstats will show the progress of the streams involved, which
 could help you estimate.
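
 (One low-tech way to watch that; the grep simply hides streams that have already finished:)

    # Refreshes every 60s; files still streaming show a percentage below 100%.
    watch -n 60 'nodetool netstats | grep -v "100%"'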

 =Rob




Re: Bootstrap Timing

2014-04-18 Thread Phil Burress
First, I just stopped 2 of the nodes and left one running. But this
morning, I stopped that third node, cleared out the data, restarted and let
it rejoin again. It appears streaming is done (according to netstats),
right now it appears to be running compaction and building secondary index
(according to compactionstats). Just sit and wait I guess?


On Fri, Apr 18, 2014 at 2:23 PM, Steven A Robenalt srobe...@stanford.edu wrote:

 Looking back through this email chain, it looks like Phil said he wasn't
 using vnodes.

 For the record, we are using vnodes since we brought up our first cluster,
 and have not seen any issues with bootstrapping new nodes either to replace
 existing nodes, or to grow/shrink the cluster. We did adhere to the caveats
 that new nodes should not be seed nodes, and that we should allow each node
 to join the cluster completely before making any other changes.

 Phil, when you dropped to adding just the single node to your cluster, did
 you start over with the newly added node (blowing away the database created
 on the previous startup), or did you shut down the other 2 added nodes and
 leave the remaining one in progress to continue?

 Steve


 On Fri, Apr 18, 2014 at 10:40 AM, Robert Coli rc...@eventbrite.com wrote:

 On Fri, Apr 18, 2014 at 5:05 AM, Phil Burress 
 philburress...@gmail.com wrote:

 nodetool netstats shows 84 files. They are all at 100%. Nothing showing
 in Pending or Active for Read Repair Stats.

 I'm assuming this means it's done. But it still shows JOINING. Is
 there an undocumented step I'm missing here? This whole process seems
 broken to me.


 Lately it seems like a lot more people than usual are :

 1) using vnodes
 2) unable to bootstrap new nodes

 If I were you, I would likely file a JIRA detailing your negative
 experience with this core functionality.

 =Rob






 --
 Steve Robenalt
 Software Architect
 HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu








Bootstrap Timing

2014-04-16 Thread Phil Burress
Greetings,

How long does bootstrapping typically take? I have 3 existing nodes in our
cluster with about 40GB each. I've added three new nodes to the cluster.
They have been in bootstrap mode for a little over 3 days now. Should I be
concerned? Is there a way to tell how long it will take to finish?

Running Cassandra version 2.0.6. on Ubuntu 12.04.

Thanks very much!

Phil


Re: Bootstrap Timing

2014-04-16 Thread Phil Burress
Thanks very much for the response. I'm not using vnodes, does that matter?


On Wed, Apr 16, 2014 at 2:13 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Apr 16, 2014 at 11:10 AM, Phil Burress 
 philburress...@gmail.com wrote:

 How long does bootstrapping typically take? I have 3 existing nodes in
 our cluster with about 40GB each. I've added three new nodes to the
 cluster. They have been in bootstrap mode for a little over 3 days now.
 Should I be concerned? Is there a way to tell how long it will take to
 finish?


 Adding more than one node at a time to a cluster (especially with vnodes)
 is Not Supported. If I were you, I would stop all 3 bootstraps and then do
 one at a time.

 =Rob




Re: Bootstrap Timing

2014-04-16 Thread Phil Burress
Also, one more quick question. For the new nodes, do I add all three
existing nodes as seeds? Or just add one?


On Wed, Apr 16, 2014 at 2:16 PM, Phil Burress philburress...@gmail.com wrote:

 Thanks very much for the response. I'm not using vnodes, does that matter?


 On Wed, Apr 16, 2014 at 2:13 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Apr 16, 2014 at 11:10 AM, Phil Burress 
 philburress...@gmail.com wrote:

 How long does bootstrapping typically take? I have 3 existing nodes in
 our cluster with about 40GB each. I've added three new nodes to the
 cluster. They have been in bootstrap mode for a little over 3 days now.
 Should I be concerned? Is there a way to tell how long it will take to
 finish?


 Adding more than one node at a time to a cluster (especially with vnodes)
 is Not Supported. If I were you, I would stop all 3 bootstraps and then do
 one at a time.

  =Rob






Re: Bootstrap Timing

2014-04-16 Thread Phil Burress
Thanks!


On Wed, Apr 16, 2014 at 2:50 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Apr 16, 2014 at 11:16 AM, Phil Burress 
 philburress...@gmail.com wrote:

 Thanks very much for the response. I'm not using vnodes, does that
 matter?


 Not in your case. In some cases it is safe to bootstrap multiple nodes
 into a cluster at once AT SPECIFIC TOKENS, because there is more than one
 replica set to bootstrap them into safely. Even in this case, it is not
 recommended.


 For the new nodes, do I add all three existing nodes as seeds? Or just
 add one?


 One should be sufficient, but all three could not hurt.

 =Rob
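
(For completeness, the seed list lives in cassandra.yaml on the joining node; a sketch with placeholder addresses:)

    # cassandra.yaml on the new node (sketch; addresses are placeholders)
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.0.0.1,10.0.0.2,10.0.0.3"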




Re: Bootstrap Timing

2014-04-16 Thread Phil Burress
I've shut down two of the nodes and am bootstrapping one right now. Is
there any way to tell when it will finish bootstrapping?


On Wed, Apr 16, 2014 at 2:56 PM, Phil Burress philburress...@gmail.com wrote:

 Thanks!


 On Wed, Apr 16, 2014 at 2:50 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Apr 16, 2014 at 11:16 AM, Phil Burress 
 philburress...@gmail.com wrote:

 Thanks very much for the response. I'm not using vnodes, does that
 matter?


 Not in your case. In some cases it is safe to bootstrap multiple nodes
 into a cluster at once AT SPECIFIC TOKENS, because there is more than one
 replica set to bootstrap them into safely. Even in this case, it is not
 recommended.


 For the new nodes, do I add all three existing nodes as seeds? Or just
 add one?


 One should be sufficient, but all three could not hurt.

 =Rob