You could certainly log a JIRA for the “failure node rejoin” issue (
https://issues.apache.org/jira/browse/cassandra). It sounds like
unexpected behaviour to me. However, I’m not sure it will be view
Hi Ben,
I continue to investigate the data loss issue.
I'm investigating logs and source code and trying to reproduce the data loss
issue with a simple test.
I'm also trying my destructive test with DROP instead of TRUNCATE.
BTW, I want to discuss the issue of the title "failure node rejoin"
From a quick look I couldn’t find any defects other than the ones you’ve
found that seem potentially relevant to your issue (if any one else on the
list knows of one please chime in). Maybe the next step, if you haven’t
done so already, is to check your Cassandra logs for any signs of issues
(ie
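For anyone following along, that log check is typically something like the
following on each node (a sketch; the log path varies by install):

```
# scan the Cassandra system log for signs of trouble
grep -E 'WARN|ERROR' /var/log/cassandra/system.log
```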
Thanks Ben,
I tried 2.2.8 and could reproduce the problem.
So, I'm investigating some repair and commit log bug fixes made between 2.2.8
and 3.0.9.
- CASSANDRA-12508: "nodetool repair returns status code 0 for some errors"
- CASSANDRA-12436: "Under some races commit log may incorrectly think it
There have been a few commit log bugs around in the last couple of months
so perhaps you’ve hit something that was fixed recently. It would be
interesting to know if the problem is still occurring in 2.2.8.
I suspect what is happening is that when you do your initial read (without
flush) to check the
I tried C* 3.0.9 instead of 2.2.
The data loss problem hasn't happened so far (without `nodetool flush`).
Thanks
Thanks Ben,
When I added `nodetool flush` on all nodes after step 2, the problem didn't
happen.
Did replaying old commit logs delete rows?
Perhaps the flush operation just detected that some nodes were down in
step 2 (just after truncating tables).
(Insertion and check in step 2 would succeed
Definitely sounds to me like something is not working as expected, but I
don’t really have any idea what would cause that (other than the fairly
extreme failure scenario). A couple of things I can think of to try to
narrow it down:
1) Run nodetool flush on all nodes after step 2 - that will make
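A sketch of that first suggestion (run it on each node in the cluster):

```
# flush all memtables to disk as SSTables, so the check read
# does not depend on commit log replay after a restart
nodetool flush
```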
Just to confirm, are you saying:
a) after operation 2, you select all and get 1000 rows
b) after operation 3 (which only does updates and read) you select and only
get 953 rows?
If so, that would be very unexpected. If you run your tests without killing
nodes do you get the expected (1,000) rows?
> Are you certain your tests don’t generate any overlapping inserts (by PK)?
Yes. Operation 2) also checks the number of rows just after all the
insertions.
OK. Are you certain your tests don’t generate any overlapping inserts (by
PK)? Cassandra basically treats any inserts with the same primary key as
updates (so 1000 insert operations may not necessarily result in 1000 rows
in the DB).
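To illustrate the upsert behaviour Ben describes (keyspace and table names
here are hypothetical):

```
# two INSERTs with the same primary key leave ONE row, not two
cqlsh -e "
  INSERT INTO testks.testtable (id, val) VALUES (1, 'first');
  INSERT INTO testks.testtable (id, val) VALUES (1, 'second');
  SELECT * FROM testks.testtable WHERE id = 1;"
# -> a single row (1, 'second'); the second insert overwrote the first
```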
Thanks Ben,
> 1) At what stage did you have (or expect to have) 1000 rows (and have the
> mismatch between actual and expected) - at the end of operation (2) or
> after operation (3)?
After operation 3), at operation 4), which reads all rows via cqlsh with
CL.SERIAL
> 2) What replication factor and
A couple of questions:
1) At what stage did you have (or expect to have) 1000 rows (and have the
mismatch between actual and expected) - at the end of operation (2) or
after operation (3)?
2) What replication factor and replication strategy is used by the test
keyspace? What consistency level is
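For context, the setup these questions probe might look like this (keyspace,
table, and strategy are placeholders; RF 3 and the CL.SERIAL check read are
confirmed elsewhere in the thread):

```
# a keyspace with replication_factor 3, plus the CL.SERIAL check read
cqlsh -e "
  CREATE KEYSPACE IF NOT EXISTS testks
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
cqlsh -e "CONSISTENCY SERIAL; SELECT count(*) FROM testks.testtable;"
```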
Thanks Ben,
I tried running a rebuild and repair after the failure node rejoined the
cluster as a "new" node with -Dcassandra.replace_address_first_boot.
The failure node could rejoin and I could read all rows successfully.
(Sometimes a repair failed because the node could not access another node. If
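That recovery, roughly, on the rejoined node (a sketch; rebuild's
source-datacenter argument is optional and deployment-specific):

```
# stream data back onto the rejoined node, then repair any inconsistencies
nodetool rebuild
nodetool repair
```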
OK, that’s a bit more unexpected (to me at least) but I think the solution
of running a rebuild or repair still applies.
Thanks Ben, Jeff
Sorry that my explanation confused you.
Only node1 is the seed node.
Node2, whose C* data was deleted, is NOT a seed.
I restarted the failure node (node2) after restarting the seed node (node1).
Restarting node2 then succeeded without the exception.
(I couldn't restart node2 before
The unstated "problem" here is that node1 is a seed, which implies
auto_bootstrap=false (you can't bootstrap a seed, so it was almost certainly
set up to start without bootstrapping).
That means once the data dir is wiped, it's going to start again without a
bootstrap, and make a single node
OK, sorry - I think I understand what you are asking now.
However, I’m still a little confused by your description. I think your
scenario is:
1) Stop C* on all nodes in a cluster (Nodes A,B,C)
2) Delete all data from Node A
3) Restart Node A
4) Restart Node B,C
Is this correct?
If so, this isn’t
The exception you run into is expected behavior. This is because, as Ben
pointed out, when you delete everything (including system schemas), the C*
cluster thinks you're bootstrapping a new node. However, node2's IP is
still in gossip, and this is why you see the exception.
I'm not clear on the reasoning
To cassandra, the node where you deleted the files looks like a brand new
machine. It doesn’t automatically rebuild machines to prevent accidental
replacement. You need to tell it to build the “new” machines as a
replacement for the “old” machine with that IP by setting
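Concretely, the flag named further up the thread goes into cassandra-env.sh
on the replacement node before it first starts; the IP below is a placeholder
for the old node's address:

```
# take over the token ranges of the dead node that owned this IP
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.2"
```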
Hi all,
A failure node can rejoin a cluster.
On the node, all data in /var/lib/cassandra was deleted.
Is this normal?
I can reproduce it as below.
cluster:
- C* 2.2.7
- a cluster has node1, 2, 3
- node1 is a seed
- replication_factor: 3
how to:
1) stop C* process and delete all data in
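Step 1 on the failure node amounts to something like this (a sketch; the
service name depends on the install, the data path is the one mentioned
above):

```
# stop Cassandra and wipe its entire data directory
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/*
```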