[jira] [Comment Edited] (CASSANDRA-5789) Data not fully replicated with 2 nodes and replication factor 2

James Lee (JIRA) Wed, 07 Aug 2013 06:53:33 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-5789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731998#comment-13731998 ]


James Lee edited comment on CASSANDRA-5789 at 8/7/13 1:52 PM:
--------------------------------------------------------------

Repro script for the bug. Run it as follows:
-- The script assumes you have a two-node Cassandra cluster set up and running.
-- The system running the test should have Python (I used 2.7) with pycassa 
installed.
-- Run the setup stage as follows: "python CassBugRepro.py -c ip1,ip2 -s -f".  
This creates the keyspaces and writes 2M rows into them.
-- Once the above has completed, wait until all hints have been delivered (I 
checked using nodetool).
-- Then run the next stage, which does random reads/writes: "python 
CassBugRepro.py -c ip1,ip2 -r".
-- If the bug has been reproduced, you'll see output like "NotFoundException 
for DN 11055691", meaning a read failed to find something we'd previously 
successfully written.

Note that repeating the above but omitting the "-f" parameter in the setup 
stage will reduce the rate at which we initially populate the keys.  I then see 
no read failures.
                
> Data not fully replicated with 2 nodes and replication factor 2
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-5789
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5789
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.2.2, 1.2.6
>         Environment: Official Datastax Cassandra 1.2.6, running on Linux RHEL 
> 6.2.  I've seen the same behavior with Cassandra 1.2.2.
> Sun Java 1.7.0_10-b18 64-bit
> Java heap settings: -Xms8192M -Xmx8192M -Xmn2048M
>            Reporter: James Lee
>         Attachments: CassBugRepro.py
>
>
> I'm seeing a problem with a 2-node Cassandra test deployment, where it seems 
> that data isn't being replicated among the nodes as I would expect.
> The setup and test are as follows:
> - Two Cassandra nodes in the cluster (they each have themselves and the other 
> node as seeds in cassandra.yaml).
> - Create 40 keyspaces, each with simple replication strategy and 
> replication factor 2.
> - Populate 125,000 rows into each keyspace, using a pycassa client with a 
> connection pool pointed at both nodes.  These are populated with writes using 
> consistency level of 1.
> - Wait until nodetool on each node reports that there are no hinted handoffs 
> outstanding (see output below).
> - Do random reads of the rows in the keyspaces, again using a pycassa client 
> with a connection pool pointed at both nodes.  These are read using 
> consistency level 1.
> I'm finding that the vast majority of reads are successful, but a small 
> proportion (~0.1%) are returned as Not Found.  If I manually try to look up 
> those keys using cassandra-cli, I see that they are returned when querying 
> one of the nodes, but not when querying the other.  So it seems like some of 
> the rows have simply not been replicated, even though the write for these 
> rows was reported to the client as successful.
> If I reduce the rate at which the test tool initially writes data into the 
> database then I don't see any failed reads, so this seems like a load-related 
> issue.  My understanding is that if all writes were successful and there are 
> no pending hinted handoffs, then the data should be fully replicated and 
> reads should return it (even with read and write consistency of 1).
> Here's the output from nodetool on the two nodes:
> comet-mvs01:/dsc-cassandra-1.2.6# ./bin/nodetool tpstats
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> ReadStage                         0         0              2         0                 0
> RequestResponseStage              0         0         878494         0                 0
> MutationStage                     0         0        2869107         0                 0
> ReadRepairStage                   0         0              0         0                 0
> ReplicateOnWriteStage             0         0              0         0                 0
> GossipStage                       0         0           2208         0                 0
> AntiEntropyStage                  0         0              0         0                 0
> MigrationStage                    0         0            994         0                 0
> MemtablePostFlusher               0         0           4399         0                 0
> FlushWriter                       0         0           2264         0               556
> MiscStage                         0         0              0         0                 0
> commitlog_archiver                0         0              0         0                 0
> InternalResponseStage             0         0            153         0                 0
> HintedHandoff                     0         0              2         0                 0
> Message type           Dropped
> RANGE_SLICE                  0
> READ_REPAIR                  0
> BINARY                       0
> READ                         0
> MUTATION                 87655
> _TRACE                       0
> REQUEST_RESPONSE             0
> comet-mvs02:/dsc-cassandra-1.2.6# ./bin/nodetool tpstats
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> ReadStage                         0         0            868         0                 0
> RequestResponseStage              0         0        3919665         0                 0
> MutationStage                     0         0        8177325         0                 0
> ReadRepairStage                   0         0            113         0                 0
> ReplicateOnWriteStage             0         0              0         0                 0
> GossipStage                       0         0           9624         0                 0
> AntiEntropyStage                  0         0              0         0                 0
> MigrationStage                    0         0           2666         0                 0
> MemtablePostFlusher               0         0           7869         0                 0
> FlushWriter                       0         0           4273         0              1179
> MiscStage                         0         0              0         0                 0
> commitlog_archiver                0         0              0         0                 0
> InternalResponseStage             0         0            215         0                 0
> HintedHandoff                     0         0              8         0                 0
> Message type           Dropped
> RANGE_SLICE                  0
> READ_REPAIR                  0
> BINARY                       0
> READ                         0
> MUTATION                531988
> _TRACE                       0
> REQUEST_RESPONSE             0
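
The tpstats output above ends with a per-message-type "Dropped" table, and on 
both nodes the only nonzero entry is MUTATION. As a rough aid for checking this 
across nodes, here is a minimal Python sketch that pulls the nonzero dropped 
counts out of captured tpstats text. The function names parse_dropped and 
nonzero_dropped are hypothetical helpers, not part of nodetool or pycassa, and 
the parser assumes the 1.2-era layout shown in this report.

```python
import re


def parse_dropped(tpstats_text):
    """Return {message_type: dropped_count} from `nodetool tpstats` text.

    Hypothetical helper: assumes the layout shown in this report, where a
    "Message type ... Dropped" header is followed by "<NAME> <count>" rows.
    """
    dropped = {}
    in_dropped_section = False
    for raw in tpstats_text.splitlines():
        # Tolerate the "> " quote prefix used in the email.
        line = raw.lstrip("> ").strip()
        if line.startswith("Message type"):
            in_dropped_section = True
            continue
        if not in_dropped_section:
            continue  # still in the thread-pool table
        match = re.match(r"^(\w+)\s+(\d+)$", line)
        if match is None:
            break  # first non-matching line ends the dropped table
        dropped[match.group(1)] = int(match.group(2))
    return dropped


def nonzero_dropped(tpstats_text):
    """Keep only the message types with at least one dropped message."""
    return {name: n for name, n in parse_dropped(tpstats_text).items() if n > 0}


# Excerpt of the comet-mvs01 output quoted in this report.
SAMPLE = """\
> MutationStage                     0         0        2869107         0                 0
> HintedHandoff                     0         0              2         0                 0
> Message type           Dropped
> RANGE_SLICE                  0
> READ_REPAIR                  0
> BINARY                       0
> READ                         0
> MUTATION                 87655
> _TRACE                       0
> REQUEST_RESPONSE             0
"""

if __name__ == "__main__":
    print(nonzero_dropped(SAMPLE))  # {'MUTATION': 87655}
```

Running the same function over the comet-mvs02 output would report 531988 
dropped MUTATION messages.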

