OK, I'll try prepared statements. But while sending my statements asynchronously might speed up my client, it wouldn't improve throughput on the Cassandra nodes, would it? They're running at pretty high loads and are only about 10% idle, so my concern is that they can't handle the data any faster and that something's wrong on the server side. I don't really think anything on the client side matters for this problem.

Of course I know there are obvious hardware things I can do to improve server performance: SSDs, more RAM, more cores, etc. But I thought the servers I have would be able to handle more rows/sec than, say, MySQL, since write speed is supposed to be one of Cassandra's strengths.

On 08/19/2013 09:03 PM, John Sanda wrote:
I'd suggest using prepared statements that you initialize at application startup and switching to Session.executeAsync coupled with the Google Guava Futures API to get better throughput on the client side.
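
A minimal sketch of the asynchronous pattern suggested here: issue each insert without waiting for the previous one, but cap how many are outstanding so the client cannot flood the cluster. To keep it runnable without a cluster, the `asyncInsert` function is a hypothetical stand-in for a call such as `session.executeAsync(prepared.bind(...))`. The class name, the `maxInFlight` parameter, and the use of `CompletableFuture` in place of the driver's Guava-based futures are assumptions made for illustration, not the driver's API.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class BoundedAsyncWriter {

    /**
     * Sends every row through asyncInsert, allowing at most maxInFlight
     * unfinished requests at any time. Returns the number of inserts that
     * completed without an error.
     */
    static <R> int writeAll(List<R> rows,
                            Function<R, CompletableFuture<Void>> asyncInsert,
                            int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger succeeded = new AtomicInteger();
        for (R row : rows) {
            permits.acquireUninterruptibly();       // block while the window is full
            CompletableFuture<Void> pending;
            try {
                pending = asyncInsert.apply(row);   // hypothetical async send
            } catch (RuntimeException e) {
                permits.release();                  // the request never started
                throw e;
            }
            pending.whenComplete((ok, err) -> {
                if (err == null) {
                    succeeded.incrementAndGet();
                }
                permits.release();                  // free a slot on success or failure
            });
        }
        // Taking every permit back means all outstanding requests have finished.
        permits.acquireUninterruptibly(maxInFlight);
        permits.release(maxInFlight);
        return succeeded.get();
    }
}
```

This only changes how the client waits. If the nodes are already CPU bound, as described in this thread, it would not be expected to raise the rate the servers can absorb.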


On Mon, Aug 19, 2013 at 10:14 PM, Keith Freeman <8fo...@gmail.com> wrote:

    Sure, I've tried different numbers for batches and threads, but
    generally I'm running 10-30 threads at a time on the client, each
    sending a batch of 100 insert statements in every call, using the
    QueryBuilder.batch() API from the latest DataStax Java driver,
    then calling the Session.execute() function (synchronous) on the
    Batch.

    I can't post my code, but my client does this on each iteration:
    -- divides up the set of inserts by the number of threads
    -- stores the current time
    -- tells all the threads to send their inserts
    -- then when they've all returned checks the elapsed time

    At about 2000 rows for each iteration, 20 threads with 100 inserts
    each finish in about 1 second.  For 4000 rows, 40 threads with 100
    inserts each finish in about 1.5 - 2 seconds, and as I said all 3
    Cassandra nodes have a heavy CPU load while the client is hardly
    loaded.  I've tried 10 threads with more inserts per batch, and
    up to 60 threads with fewer, but it doesn't seem to make much
    difference.
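
The per-iteration measurement described in this message (divide the inserts across the threads, store the time, let every thread send its share, then check the elapsed time) might look roughly like the sketch below. The original code was not posted, so this is a reconstruction under assumptions: `IterationTimer`, `partition`, `timeIteration`, and `rowsPerSecond` are hypothetical names, and `sendBatch` stands in for building a batch and calling the synchronous `Session.execute()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class IterationTimer {

    /** Splits items into consecutive chunks of at most chunkSize elements. */
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, items.size());
            chunks.add(new ArrayList<>(items.subList(i, end)));
        }
        return chunks;
    }

    /**
     * Runs one iteration: divides the inserts across the threads, starts the
     * clock, lets every thread send its share, and returns the elapsed
     * nanoseconds once all of them have returned. Requires threads >= 1.
     */
    static <T> long timeIteration(List<T> inserts, int threads, Consumer<List<T>> sendBatch) {
        // Ceiling division so that no more than `threads` chunks are produced.
        int chunkSize = Math.max(1, (inserts.size() + threads - 1) / threads);
        List<Callable<Void>> tasks = new ArrayList<>();
        for (List<T> chunk : partition(inserts, chunkSize)) {
            tasks.add(() -> { sendBatch.accept(chunk); return null; });
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long start = System.nanoTime();
            for (Future<Void> f : pool.invokeAll(tasks)) {
                f.get(); // rethrows any failure from a sender thread
            }
            return System.nanoTime() - start;
        } catch (Exception e) {
            throw new IllegalStateException("iteration failed", e);
        } finally {
            pool.shutdown();
        }
    }

    /** Converts a row count and elapsed nanoseconds into rows per second. */
    static double rowsPerSecond(long rows, long elapsedNanos) {
        return rows * 1e9 / elapsedNanos;
    }
}
```

Plugging in the figures quoted above: 2000 rows in about 1 second is about 2000 rows/second, and 4000 rows in 1.5 to 2 seconds is roughly 2000 to 2700 rows/second. Doubling the offered load barely raises the achieved rate, which is consistent with the servers rather than the client being the limit.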


    On 08/19/2013 05:00 PM, Nate McCall wrote:
    How big are the batch sizes? In other words, how many rows are
    you sending per insert operation?

    Other than the above, not much else to suggest without seeing
    some example code (on pastebin, gist or similar, ideally).

    On Mon, Aug 19, 2013 at 5:49 PM, Keith Freeman <8fo...@gmail.com> wrote:

        I've got a 3-node Cassandra cluster (16 GB/4-core VMs on ESXi v5,
        on 2.5 GHz machines not shared with any other VMs).  I'm
        inserting time-series data into a single column-family using
        "wide rows" (timeuuids) and have a 3-part partition key so my
        primary key is something like ((a, b, day), in-time-uuid), x,
        y, z).

        My Java client is feeding rows (about 1k of raw data size
        each) in batches using multiple threads, and the fastest I
        can get it to run reliably is about 2000 rows/second.  Even at
        that speed, all 3 Cassandra nodes are very CPU bound, with
        loads of 6-9 each (and the client machine is hardly breaking
        a sweat).  I've tried turning off compression in my table,
        which reduced the loads slightly but not much.  There are no
        other updates or reads occurring, except from DataStax OpsCenter.

        I was expecting to be able to insert at least 10k rows/second
        with this configuration, and after a lot of reading of docs,
        blogs, and Google results, I can't really figure out what's
        slowing my client down.  When I increase the insert speed of my
        client beyond 2000/second, the server responses are just too
        slow and the client falls behind.  I had a single-node MySQL
        database that can handle 10k of these data rows/second, so I
        really feel like I'm missing something in Cassandra.  Any ideas?






--

- John
