Hi Jonathon,

I am seeing a dramatic difference in the number of keys I can scan
when I use these two methods.

The former (deprecated) method, get_key_range(), seems to return the
correct result. That is, it's on the right order of magnitude, around
500K, and if I continue to insert keys via a separate process while I
repeatedly count them, the count grows. The recommended alternative,
get_range_slice(), returns far fewer keys, and if I count repeatedly
while inserting from a separate process, the count bounces around
erratically.

I am using the Python Thrift interface against a two-node setup. I am
running the current 0.5.0 release (just upgraded from rc1 since I saw
some other Thrift bug was fixed). Here is my program (there are three
commented lines to switch from one method to the other):

import sys

if sys.argv[1] == "count_things":

    import time

    from thrift import Thrift
    from thrift.transport import TTransport
    from thrift.transport import TSocket
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra
    from cassandra.ttypes import ColumnParent, SlicePredicate, SliceRange, ConsistencyLevel

    socket = TSocket.TSocket("10.212.230.176", 9160)
    transport = TTransport.TBufferedTransport(socket)
    protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
    client = Cassandra.Client(protocol)

    transport.open()

    # Narrow column slice -- I only care about the row keys here.
    column_parent = ColumnParent(column_family="thing")
    slice_range = SliceRange(start="key", finish="key")
    predicate = SlicePredicate(slice_range=slice_range)

    done = False
    seg = 1000              # rows per page
    start = ""              # empty start key = beginning of the range
    record_count = 0
    startTime = time.time()

    while not done:
        #result = client.get_key_range("gg", "thing", start, "", seg, ConsistencyLevel.ONE)
        result = client.get_range_slice("gg", column_parent, predicate, start, "", seg, ConsistencyLevel.ONE)

        # A short page means we've reached the end of the key range.
        if len(result) < seg: done = True
        #else: start = result[seg-1]
        else: start = result[seg-1].key

        record_count += len(result)

        t = time.time()
        dt = t - startTime
        record_per_sec = record_count / dt
        #print "\rstart %d now %d dt %d rec/s %.4f rec %d s %s f %s"%(startTime,t,dt,record_per_sec,record_count,result[0],result[-1]),
        print "\rstart %d now %d dt %d rec/s %.4f rec %d s %s f %s"%(startTime,t,dt,record_per_sec,record_count,result[0].key,result[-1].key),
    print
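
One thing I noticed while re-reading the loop: I set the next start key
to the last key of the previous page, and if the start key is inclusive
(I believe it is for get_range_slice(), though I haven't verified that
for 0.5) the boundary row gets counted twice on every page. That would
only inflate the count slightly, so it can't explain the low numbers,
but for completeness here is an untested variant of the loop (reusing
the setup above) that drops the repeated boundary row:

prev_last = None
while not done:
    result = client.get_range_slice("gg", column_parent, predicate, start, "", seg, ConsistencyLevel.ONE)

    rows = result
    # If the start key is inclusive, the first row of this page is the same
    # row already counted as the last row of the previous page.
    if prev_last is not None and rows and rows[0].key == prev_last:
        rows = rows[1:]

    record_count += len(rows)

    if len(result) < seg:
        done = True
    else:
        prev_last = result[seg-1].key
        start = prev_last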

An example of the output using get_range_slice(), without a concurrent
insertion process -- it counts 133674 keys:

start 1265440888 now 1265441098 dt 210 rec/s 636.1996 rec 133674 s
9f9dd2c0f043902f7f571942cfac3f6c28b82cec f
9ffff14fd361b981faea6a04c5ef5699a96a8d6d

Using get_key_range() I get 459351 keys, and the throughput is lower:

start 1265442143 now 1265443092 dt 948 rec/s 484.2775 rec 459351 s
ffce8099f808d10a09db471b04793315f555ccbd f
ffffffa1b5e3aeb9ca92d4d848280093bdf49892

get_range_slice() seems to skip keys in each of the segments.
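
A rough check I could run to demonstrate this (not part of the program
above, just a sketch reusing the same client, column_parent, and
predicate) is to pull one segment of keys with each API over the same
start key and diff them:

start = ""
seg = 1000

old_keys = set(client.get_key_range("gg", "thing", start, "", seg, ConsistencyLevel.ONE))
new_rows = client.get_range_slice("gg", column_parent, predicate, start, "", seg, ConsistencyLevel.ONE)
new_keys = set(r.key for r in new_rows)

print "get_key_range:   %d keys" % len(old_keys)
print "get_range_slice: %d keys" % len(new_keys)
# keys get_key_range sees in this segment but get_range_slice does not
print "missing from get_range_slice: %d" % len(old_keys - new_keys)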

The "thing" column family is a super column. There are no errors
reported to the log. The keys I am inserting are python generated
UUIDs:

import uuid
key = uuid.uuid4().hex
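
For what it's worth, the shape of each insert is roughly this (the
super column and column names below are placeholders, not my real
schema -- just the form of the call):

import time
import uuid
from cassandra.ttypes import ColumnPath, ConsistencyLevel

key = uuid.uuid4().hex
# One column under one super column in the "thing" column family.
path = ColumnPath(column_family="thing", super_column="attrs", column="name")
client.insert("gg", key, path, "some value", int(time.time() * 1e6), ConsistencyLevel.ONE)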

I'm not posting the full program that inserts the data, but I can if
that would help. Thanks very much,

Jack
