> Brutally. kill -9.
That's fine. I was thinking about reboot -f -n.

> We are wondering if the fsync of the commit log was working.
I would say yes, only because there are no other reported problems.

In this case I would not expect to see data loss. If you are still in a test 
scenario, can you try to reproduce the problem? If possible, can you reproduce 
it with a single node?
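
A rough way to picture why the shutdown method matters: with kill -9 the OS keeps running, so anything already handed to the kernel should still reach disk, while a hard reboot can drop whatever was written since the last periodic sync. The sketch below is only a toy model and not how Cassandra is implemented. The function name and parameters are made up for illustration, and it assumes syncs land on exact multiples of the sync period and that writes reach the commit log file immediately (the bounded queue is ignored).

```python
def surviving_writes(write_times, crash_time, sync_period=10, kernel_survives=True):
    """Toy model: which writes are still on disk after a crash at crash_time?

    write_times     -- seconds at which each write reached the commit log file
    kernel_survives -- True for a killed process (kill -9), False for a hard
                       reboot or power loss
    """
    written = [t for t in write_times if t <= crash_time]
    if kernel_survives:
        # The process is gone but the OS is not: data already handed to the
        # kernel is still flushed to disk eventually.
        return written
    # The OS page cache is lost too: only data covered by the last periodic
    # sync is guaranteed to be on disk (assumed to happen at multiples of
    # sync_period in this model).
    last_sync = (crash_time // sync_period) * sync_period
    return [t for t in written if t <= last_sync]


writes = [1, 9, 12, 19]
print(surviving_writes(writes, crash_time=19, kernel_survives=True))   # [1, 9, 12, 19]
print(surviving_writes(writes, crash_time=19, kernel_survives=False))  # [1, 9]
```

In this model the worst case for a hard reboot is one sync period of writes, and a killed process loses nothing that was already written to the file.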

Cheers


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/08/2012, at 11:00 AM, rubbish me <rubbish...@googlemail.com> wrote:

> Thanks, Aaron, for your reply - please see the inline.
> 
> 
> On 24 Aug 2012, at 11:04, aaron morton wrote:
> 
>>> - we are running on production linux VMs (not ideal but this is out of our 
>>> hands)
>> Is the VM doing anything wacky with the IO ?
> 
> Could be. But I thought we would ask here first. This is a bit difficult to 
> prove because we don't have control over these VMs.
> 
>>  
>> 
>>> As part of a DR exercise, we killed all 6 nodes in DC1,
>> Nice disaster. Out of interest, what was the shutdown process ?
> 
> Brutally. kill -9.
> 
> 
>> 
>>> We noticed that data that was written an hour before the exercise, around 
>>> the last memtables being flushed, was not found in DC1. 
>> To confirm, data was written to DC 1 at CL LOCAL_QUORUM before the DR 
>> exercise. 
>> 
>> Was the missing data written before or after the memtable flush ? I'm trying 
>> to understand if the data should have been in the commit log or the 
>> memtables. 
> 
> The missing data was written after the last flush. It was 
> retrievable before the DR exercise.
> 
>> 
>> Can you provide some more info on how you are detecting it is not found in 
>> DC 1?
>> 
> 
> We tried Hector with consistency level LOCAL_QUORUM. Either columns or 
> whole rows were missing.
> 
> We tried cassandra-cli on DC1 nodes, same.
> 
> However, once we ran the same query on DC2, C* must have done a 
> read repair: that particular piece of data would then appear in DC1 again.
> 
> 
>>> If we understand correctly, writes go to the commit log first and are then 
>>> synced to disk every 10s. 
>> Writes are put into a bounded queue and processed as fast as the IO can keep 
>> up. Every 10s a sync message is added to the queue. Note that the commit log 
>> segment may rotate at any time, which requires a sync. 
>> 
>> A loss of data across all nodes in a DC seems odd. If you can provide some 
>> more information we may be able to help. 
> 
> 
> We are wondering if the fsync of the commit log was working.  But we saw no 
> errors or warnings in the logs. Is there a way to verify this?
> 
> 
>> 
>> Cheers
>> 
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 24/08/2012, at 6:01 AM, rubbish me <rubbish...@googlemail.com> wrote:
>> 
>>> Hi all
>>> 
>>> First off, let's introduce the setup. 
>>> 
>>> - 6 x C* 1.1.2 in active DC (DC1), another 6 in another (DC2)
>>> - keyspace's RF=3 in each DC
>>> - Hector as client.
>>> - client talks only to DC1 unless DC1 can't serve the request, in which 
>>> case it talks only to DC2
>>> - commit log is synced periodically with the default setting of 10s. 
>>> - consistency policy = LOCAL QUORUM for both read and write. 
>>> - we are running on production linux VMs (not ideal but this is out of our 
>>> hands)
>>> -----
>>> As part of a DR exercise, we killed all 6 nodes in DC1. Hector started 
>>> talking to DC2, all the data was still there, and everything continued to 
>>> work perfectly. 
>>> 
>>> Then we brought all nodes, one by one, in DC1 up. We saw a message saying 
>>> all the commit logs were replayed. No errors reported.  We didn't run 
>>> repair at this time. 
>>> 
>>> We noticed that data that was written an hour before the exercise, around 
>>> the last memtables being flushed, was not found in DC1. 
>>> 
>>> If we understand correctly, writes go to the commit log first and are then 
>>> synced to disk every 10s, so at worst we should have lost the last 10s of 
>>> data. What could be the cause of this behaviour? 
>>> 
>>> With the blessing of C* we could recover all of this data from DC2. But we 
>>> would like to understand why. 
>>> 
>>> Many thanks in advance. 
>>> 
>>> Amy
>>> 
>>> 
>> 
> 
