Hi all,
We are running a C* 2.2.8 cluster in our production system, composed of 5 nodes in 1 DC with RF=3. Our clients mostly write with CL.ALL and read with CL.ONE (both will be switched to QUORUM soon). We face several problems while trying to persist a classical "follow" relationship. Has anyone had similar problems, or any idea what could be wrong?

*Our model:*

1) It is based on two tables, followers and followings, which should hold identical data. The first serves queries for the followers of a user; the latter for who a user is following:

  followings (uid bigint, ts timeuuid, fid bigint, PRIMARY KEY (uid, ts)) WITH CLUSTERING ORDER BY (ts DESC);
  followers (uid bigint, ts timeuuid, fid bigint, PRIMARY KEY (uid, ts)) WITH CLUSTERING ORDER BY (ts DESC);

2) Both tables have a secondary index on the fid column.

3) Naturally, a new follow relationship inserts one row into each table, and an unfollow deletes from both.

*Problems:*

1) We have a serious discrepancy between the tables. Per "nodetool cfstats", followings is 18 MB in total and followers is 19 MB. To demonstrate the problem, I fetched the followers of the most-followed user from both tables:

  A) select * from followers where uid=12345678
  B) select * from followings where fid=12345678

Using a small script on Unix, I compared the two result sets A and B:

  count(A − B) = 1247   (rows only in A)
  count(B − A) = 185    (rows only in B)
  count(A ∩ B) = 20894  (rows in both)

2) Even more interesting: if I query the followers table through the secondary index, I do not get a row that I do get when filtering on the partition key alone. To visualize it:

  select uid,ts,fid from followers where fid=X    (cannot find uid=12345678)
  A | BBB | X
  C | DDD | X
  E | FFF | X

  select uid,ts,fid from followers where uid=12345678 | grep X
  12345678 | GGG | *X*

*My thoughts:*

1) Currently we do not use batches for the inserts and deletes against the two tables. Would batches help with our problems?

2) I was first suspicious of corruption in the secondary indexes. But actually, queries through the secondary index give me results that are consistent with each other.

3) I also considered zombie rows. However, we have not had any long downtimes on our nodes. To our shame, though, we have not been running any scheduled repairs on the cluster.

4) Finally, do you think there may be a problem with our modelling?

Thanks in advance.
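P.P.S. Regarding thought 1): one way the dual writes could be made atomic is a logged batch, which guarantees that either both mutations are eventually applied or neither is. A minimal sketch with the DataStax Python driver, assuming a hypothetical keyspace "social" and contact point; writing the same timeuuid into both tables also keeps the mirrored rows aligned:

```python
import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(["127.0.0.1"])       # contact point is an assumption
session = cluster.connect("social")    # keyspace name is an assumption

ins_following = session.prepare(
    "INSERT INTO followings (uid, ts, fid) VALUES (?, ?, ?)")
ins_follower = session.prepare(
    "INSERT INTO followers (uid, ts, fid) VALUES (?, ?, ?)")

def follow(follower_uid, followed_uid):
    # One timeuuid shared by both rows keeps the mirrored entries aligned.
    ts = uuid.uuid1()
    batch = BatchStatement(batch_type=BatchType.LOGGED,
                           consistency_level=ConsistencyLevel.QUORUM)
    # The follower's followings gains the followed user...
    batch.add(ins_following, (follower_uid, ts, followed_uid))
    # ...and the followed user's followers gains the follower.
    batch.add(ins_follower, (followed_uid, ts, follower_uid))
    session.execute(batch)
```

Note that a logged batch only gives atomicity (eventual all-or-nothing), not isolation, and it does not replace repairs: existing divergence between the tables would still need anti-entropy repair to heal.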
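P.S. For reproducibility, the set comparison done by the small Unix script can be sketched in Python. The file-loading helper and the assumption that each query's fid column was dumped one id per line are hypothetical; only the counting logic matters:

```python
def compare_id_sets(a, b):
    """Counts of ids only in A, only in B, and in both."""
    return len(a - b), len(b - a), len(a & b)

def load_ids(path):
    # Assumes one id per line, e.g. cqlsh output piped through awk.
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

# Toy example (the real inputs would be the fid columns of queries A and B):
only_a, only_b, both = compare_id_sets({"1", "2", "3"}, {"3", "4"})
print(only_a, only_b, both)  # 2 1 1
```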