Re: Node down during move

2014-12-29 Thread Robert Coli
On Tue, Dec 23, 2014 at 12:29 AM, Jiri Horky ho...@avast.com wrote:

 just a follow-up: we have now seen this behavior multiple times. It seems
 that the receiving node loses connectivity to the cluster and thus thinks
 it is the sole online node, while the rest of the cluster sees it as the
 only offline node, right after the streaming finishes. I am not sure what
 causes this, but it is reproducible. Restarting the affected node helps.


Streaming is pretty broken throughout 1.x. Unfortunately, no one is likely
to fix whatever is wrong in your old version.

You could try tuning the phi failure detector, IIRC by increasing
phi_convict_threshold.
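
For reference, a minimal sketch of that tuning (the config path is an
assumption and varies by install; the value 12 is just an example):

```shell
# Hedged sketch: raise the failure detector's phi_convict_threshold in
# cassandra.yaml (default 8). Higher values make the detector slower to
# mark a node down, which can help on links that stall during streaming.
CONF=/etc/cassandra/cassandra.yaml   # path is an assumption

# Uncomment the setting if needed and bump it to 12
sed -i 's/^#\? *phi_convict_threshold:.*/phi_convict_threshold: 12/' "$CONF"
grep phi_convict_threshold "$CONF"
# A restart of the edited node is required for this to take effect.
```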

=Rob


Re: Node down during move

2014-12-23 Thread Jiri Horky
Hi,

just a follow-up: we have now seen this behavior multiple times. It seems
that the receiving node loses connectivity to the cluster and thus thinks
it is the sole online node, while the rest of the cluster sees it as the
only offline node, right after the streaming finishes. I am not sure what
causes this, but it is reproducible. Restarting the affected node helps.

We have 3 datacenters (RF=1 for each datacenter) where we are moving the
tokens. This happens only in one of them.

Regards
Jiri Horky


On 12/19/2014 08:20 PM, Jiri Horky wrote:
 Hi list,

 we added a new node to an existing 8-node cluster running C* 1.2.9 without
 vnodes, and because we are almost completely out of space, we are moving
 the tokens of one node after another (not in parallel). During one of these
 move operations, the receiving node died and the streaming failed:

  WARN [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,227
 StorageService.java (line 3703) Streaming to /X.Y.Z.18 failed
  INFO [RMI TCP Connection(12940)-X.Y.Z.17] 2014-12-19 19:25:56,233
 ColumnFamilyStore.java (line 629) Enqueuing flush of
 Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
  INFO [FlushWriter:3772] 2014-12-19 19:25:56,238 Memtable.java (line
 461) Writing Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
 ERROR [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,246
 CassandraDaemon.java (line 192) Exception in thread Thread[Streaming to
 /X.Y.Z.18:2,5,RMI Runtime]
 java.lang.RuntimeException: java.io.IOException: Broken pipe
 at com.google.common.base.Throwables.propagate(Throwables.java:160)
 at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.io.IOException: Broken pipe
 at sun.nio.ch.FileDispatcherImpl.write0(Native Method)

 After restarting the receiving node, we tried to perform the move again,
 but it failed with:

 Exception in thread "main" java.io.IOException: target token
 113427455640312821154458202477256070486 is already owned by another node.
 at
 org.apache.cassandra.service.StorageService.move(StorageService.java:2930)

 So we tried to move it to a token just 1 higher, to trigger the
 movement. This didn't move anything, but it finished successfully:

  INFO [Thread-5520] 2014-12-19 20:00:24,689 StreamInSession.java (line
 199) Finished streaming session 4974f3c0-87b1-11e4-bf1b-97d9ac6bd256
 from /X.Y.Z.18

 Now, it seems quite improbable that the first streaming had completed and
 the node died just after copying everything, since the ERROR was the last
 streaming-related message in the logs. Is there any way to make sure the
 data have really been moved, so that running nodetool cleanup is safe?

 Thank you.
 Jiri Horky



Node down during move

2014-12-19 Thread Jiri Horky
Hi list,

we added a new node to an existing 8-node cluster running C* 1.2.9 without
vnodes, and because we are almost completely out of space, we are moving
the tokens of one node after another (not in parallel). During one of these
move operations, the receiving node died and the streaming failed:

 WARN [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,227
StorageService.java (line 3703) Streaming to /X.Y.Z.18 failed
 INFO [RMI TCP Connection(12940)-X.Y.Z.17] 2014-12-19 19:25:56,233
ColumnFamilyStore.java (line 629) Enqueuing flush of
Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
 INFO [FlushWriter:3772] 2014-12-19 19:25:56,238 Memtable.java (line
461) Writing Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
ERROR [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,246
CassandraDaemon.java (line 192) Exception in thread Thread[Streaming to
/X.Y.Z.18:2,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Broken pipe
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)

After restarting the receiving node, we tried to perform the move again,
but it failed with:

Exception in thread "main" java.io.IOException: target token
113427455640312821154458202477256070486 is already owned by another node.
at
org.apache.cassandra.service.StorageService.move(StorageService.java:2930)

So we tried to move it to a token just 1 higher, to trigger the
movement. This didn't move anything, but it finished successfully:

 INFO [Thread-5520] 2014-12-19 20:00:24,689 StreamInSession.java (line
199) Finished streaming session 4974f3c0-87b1-11e4-bf1b-97d9ac6bd256
from /X.Y.Z.18

Now, it seems quite improbable that the first streaming had completed and
the node died just after copying everything, since the ERROR was the last
streaming-related message in the logs. Is there any way to make sure the
data have really been moved, so that running nodetool cleanup is safe?
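
One hedged way to sanity-check this before cleanup (these are standard
nodetool subcommands in 1.2; the host, keyspace, and key are placeholders):

```shell
# Sketch: sanity checks before running `nodetool cleanup`.
# 1. Confirm the whole ring agrees on ownership of the new token:
nodetool -h X.Y.Z.18 ring | grep X.Y.Z.18

# 2. Confirm no streams are still pending on either side:
nodetool -h X.Y.Z.18 netstats

# 3. Spot-check that keys in the moved range resolve to the new node:
# nodetool -h X.Y.Z.18 getendpoints <keyspace> <column_family> <key>

# Only once ownership and netstats look consistent is cleanup reasonably safe:
# nodetool -h X.Y.Z.18 cleanup
```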
   
Thank you.
Jiri Horky