It would help if you could narrow it down to "here are the keys I expect to see but am not seeing," especially if you can reproduce it on a single-node cluster.
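Something along these lines would do it. This is an untested sketch that reuses the connection setup and the get_range_slice() paging from your program below; it assumes your insert process can dump the UUIDs it generated to a file (expected_keys.txt is a made-up name), and that the ColumnParent/SlicePredicate types come from the generated cassandra.ttypes module:

from thrift.transport import TSocket, TTransport
from thrift.protocol.TBinaryProtocol import TBinaryProtocolAccelerated
from cassandra import Cassandra
from cassandra.ttypes import ColumnParent, SlicePredicate, SliceRange, ConsistencyLevel

# keys the insert process claims to have written, one hex uuid per line
expected = set(line.strip() for line in open("expected_keys.txt"))

socket = TSocket.TSocket("10.212.230.176", 9160)
transport = TTransport.TBufferedTransport(socket)
client = Cassandra.Client(TBinaryProtocolAccelerated(transport))
transport.open()

column_parent = ColumnParent(column_family="thing")
predicate = SlicePredicate(slice_range=SliceRange(start="key", finish="key"))

# page through the whole key range, collecting every key the scan returns
seen = set()
start = ""
seg = 1000
while True:
    result = client.get_range_slice("gg", column_parent, predicate,
                                    start, "", seg, ConsistencyLevel.ONE)
    seen.update(r.key for r in result)
    if len(result) < seg:
        break
    start = result[-1].key

missing = expected - seen
print "expected %d, saw %d, missing %d" % (len(expected), len(seen), len(missing))
for key in sorted(missing):
    print key

If the missing set is non-empty, posting a handful of those keys (and whether they show up via get_key_range() or a direct get()) would pin things down quickly.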
On Sat, Feb 6, 2010 at 2:04 AM, Jack Culpepper <jackculpep...@gmail.com> wrote:
> Hi Jonathan,
>
> I am seeing a dramatic difference in the number of keys I can scan
> when I use these two methods.
>
> The former (deprecated) method, get_key_range(), seems to return the
> correct result. That is, it is on the right order of magnitude, around
> 500K, and if I continue to insert keys via a separate process while I
> repeatedly count them, the count grows. The recommended alternative,
> get_range_slice(), returns far fewer keys, and if I count repeatedly
> while inserting from a separate process, the count bounces around
> erratically.
>
> I am using the Python Thrift interface against a two-node setup. I am
> running the current 0.5.0 release (just upgraded from rc1, since I saw
> that another Thrift bug had been fixed). Here is my program (there are
> three commented lines to switch from one method to the other):
>
> import sys
> from time import time as now
>
> if sys.argv[1] == "count_things":
>
>     from thrift import Thrift
>     from thrift.transport import TTransport
>     from thrift.transport import TSocket
>     from thrift.protocol.TBinaryProtocol import TBinaryProtocolAccelerated
>     from cassandra import Cassandra
>     from cassandra.ttypes import ColumnParent, SlicePredicate, SliceRange, ConsistencyLevel
>
>     socket = TSocket.TSocket("10.212.230.176", 9160)
>     transport = TTransport.TBufferedTransport(socket)
>     protocol = TBinaryProtocolAccelerated(transport)
>     client = Cassandra.Client(protocol)
>
>     transport.open()
>
>     column_parent = ColumnParent(column_family="thing")
>     slice_range = SliceRange(start="key", finish="key")
>     predicate = SlicePredicate(slice_range=slice_range)
>
>     done = False
>     seg = 1000
>     start = ""
>     record_count = 0
>     startTime = now()
>
>     while not done:
>         #result = client.get_key_range("gg", "thing", start, "", seg, ConsistencyLevel.ONE)
>         result = client.get_range_slice("gg", column_parent, predicate, start, "", seg, ConsistencyLevel.ONE)
>
>         if len(result) < seg: done = True
>         #else: start = result[seg-1]
>         else: start = result[seg-1].key
>
>         record_count += len(result)
>
>         t = now()
>         dt = t - startTime
>         record_per_sec = record_count / dt
>         #print "\rstart %d now %d dt %d rec/s %.4f rec %d s %s f %s" % (startTime, t, dt, record_per_sec, record_count, result[0], result[-1]),
>         print "\rstart %d now %d dt %d rec/s %.4f rec %d s %s f %s" % (startTime, t, dt, record_per_sec, record_count, result[0].key, result[-1].key),
>     print
>
> An example of the output using get_range_slice(), without a concurrent
> insertion process -- it counts 133674 keys:
>
> start 1265440888 now 1265441098 dt 210 rec/s 636.1996 rec 133674 s 9f9dd2c0f043902f7f571942cfac3f6c28b82cec f 9ffff14fd361b981faea6a04c5ef5699a96a8d6d
>
> Using get_key_range() I get 459351 keys, and the throughput is lower:
>
> start 1265442143 now 1265443092 dt 948 rec/s 484.2775 rec 459351 s ffce8099f808d10a09db471b04793315f555ccbd f ffffffa1b5e3aeb9ca92d4d848280093bdf49892
>
> get_range_slice() seems to skip keys in each of the segments.
>
> The "thing" column family is a super column family. There are no
> errors reported to the log. The keys I am inserting are Python-generated
> UUIDs:
>
> import uuid
> key = uuid.uuid4().hex
>
> I'm not posting the program that inserts the data, but I can if that
> would help. Thanks very much,
>
> Jack
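(For completeness, this is the kind of single-node repro I had in mind: start a fresh, empty node, insert N UUID keys, then count them back with the same paging loop. Untested sketch against the 0.5 Thrift API; the keyspace, column family, super column, and column names here, "Keyspace1", "Super1", "sc", "col", are stand-ins for whatever your storage-conf.xml defines, and the node is assumed to be on localhost:9160.)

import time, uuid
from thrift.transport import TSocket, TTransport
from thrift.protocol.TBinaryProtocol import TBinaryProtocolAccelerated
from cassandra import Cassandra
from cassandra.ttypes import (ColumnParent, ColumnPath, SlicePredicate,
                              SliceRange, ConsistencyLevel)

KEYSPACE = "Keyspace1"   # stand-in names; use whatever your config defines
CF = "Super1"
N = 100000

socket = TSocket.TSocket("127.0.0.1", 9160)
transport = TTransport.TBufferedTransport(socket)
client = Cassandra.Client(TBinaryProtocolAccelerated(transport))
transport.open()

# 1. insert N fresh uuid keys, remembering exactly what was written
path = ColumnPath(column_family=CF, super_column="sc", column="col")
expected = set()
for i in xrange(N):
    key = uuid.uuid4().hex
    client.insert(KEYSPACE, key, path, "x", int(time.time() * 1e6),
                  ConsistencyLevel.ONE)
    expected.add(key)

# 2. count them back with the paging loop under test
column_parent = ColumnParent(column_family=CF)
predicate = SlicePredicate(slice_range=SliceRange(start="", finish="", count=1000))
seen = set()
start = ""
seg = 1000
while True:
    result = client.get_range_slice(KEYSPACE, column_parent, predicate,
                                    start, "", seg, ConsistencyLevel.ONE)
    seen.update(r.key for r in result)
    if len(result) < seg:
        break
    start = result[-1].key

# on a correct scan, every inserted key comes back and nothing is missing
print "inserted %d, scanned %d, missing %d" % (N, len(seen), len(expected - seen))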