Hi all,

We are using a C* 2.2.8 cluster in our production system, composed of 5
nodes in 1 DC with RF=3. Our clients mostly write with CL.ALL and read with
CL.ONE (both will be switched to quorum soon).

We are facing several problems while persisting a classic "follow
relationship". Has anyone had similar problems, or any idea what could
be wrong?

1) First, our model. It is based on two tables, followers and followings,
which should mirror each other. The first is for querying the followers of
a user, the latter for querying whom a user is following.

CREATE TABLE followings (uid bigint, ts timeuuid, fid bigint,
PRIMARY KEY (uid, ts)) WITH CLUSTERING ORDER BY (ts DESC);

CREATE TABLE followers (uid bigint, ts timeuuid, fid bigint,
PRIMARY KEY (uid, ts)) WITH CLUSTERING ORDER BY (ts DESC);


2) Both tables have a secondary index on the fid column.

3) Naturally, a new follow relationship should insert one row into each
table, and a delete should remove the row from both as well.



*Problems :*

1) We have a serious discrepancy between the tables. According to
"nodetool cfstats", followings is 18 MB and followers is 19 MB in total.
To demonstrate the problem, I fetched the followers of the most-followed
user from both tables.

A) select * from followers where uid=12345678
B) select * from followings where fid=12345678

Using a small script on Unix, I compared the result sets A and B:
count( A < B ) = 1247
count( B < A ) = 185
count( A ∩ B ) = 20894


2) Even more interesting: if I query the followers table via the
secondary index, I don't get a row that I do get when filtering on the
partition key alone. Let me try to visualize it:

select uid,ts,fid from followers where fid=X (cannot find uid=12345678)
     A | BBB | X
     C | DDD | X
     E | FFF | X

select uid,ts,fid from followers where uid=12345678 | grep X
 12345678 | GGG | *X*
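
To make problem 1 concrete, the comparison of A and B amounts to something
like the sketch below. This is not the exact script that was run: the
function name and the sample ids are made up for illustration, and it
reads "A < B" above as "present in A but missing from B", which is an
assumption.

```python
def compare_follow_sets(a_ids, b_ids):
    """Return (only_in_a, only_in_b, in_both) for two iterables of user ids."""
    a, b = set(a_ids), set(b_ids)
    return a - b, b - a, a & b


# Hypothetical sample data standing in for the two query results:
#   A = fid values from "select * from followers where uid=12345678"
#   B = uid values from "select * from followings where fid=12345678"
followers_of_user = [1, 2, 3, 4]
followings_pointing_at_user = [3, 4, 5]

only_a, only_b, both = compare_follow_sets(
    followers_of_user, followings_pointing_at_user
)
print(len(only_a), len(only_b), len(both))  # prints "2 1 2"
```

If the two tables were in sync, both difference sets would be empty.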


*My thoughts :*

1) Currently, we don't use batches for the inserts and deletes on the two
tables. Would batching help with our problems?

2) At first I suspected corruption in the secondary indexes. However,
queries through the secondary index actually return consistent results.

3) I also thought this could be a case of zombie rows. However, we haven't
had any long downtime on our nodes. That said, to our shame, we haven't
been running any scheduled repairs on the cluster.

4) Finally, do you think there may be a problem with our data model?
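
Regarding thought 1, this is roughly the kind of logged batch we have in
mind, sketched as a small helper that builds the CQL text. The helper name
follow_batch_cql is hypothetical, and it assumes the same ts value is
written to both tables. Real client code would presumably use prepared
statements with bound values rather than string formatting.

```python
import uuid


def follow_batch_cql(follower_id, followed_id, ts=None):
    """Build one logged batch for 'follower_id follows followed_id'."""
    if ts is None:
        ts = uuid.uuid1()  # version-1 (time-based) UUID, usable as a timeuuid
    return (
        "BEGIN BATCH\n"
        # Row answering "whom does follower_id follow?"
        f"  INSERT INTO followings (uid, ts, fid) "
        f"VALUES ({int(follower_id)}, {ts}, {int(followed_id)});\n"
        # Mirror row answering "who follows followed_id?"
        f"  INSERT INTO followers (uid, ts, fid) "
        f"VALUES ({int(followed_id)}, {ts}, {int(follower_id)});\n"
        "APPLY BATCH;"
    )


print(follow_batch_cql(111, 12345678))
```

As far as we understand, a logged batch only guarantees that both writes
are eventually applied; it does not provide isolation between them.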


Thanks in advance.
