So I tried inserting with prepared statements individually (no batch), and the load on my server nodes dropped significantly. Throughput from my client improved a bit, but only by a few percent. I was able to *almost* reach 5000 rows/sec (sort of) by also reducing the rows per insert thread to 20-50 and eliminating all overhead from the timing, i.e. timing only the tight for loop of inserts. But that's still a lot slower than I expected.

I couldn't use batches because the driver's QueryBuilder API doesn't allow prepared statements in a batch. It appears the batch itself could be prepared as a single statement, but with 40+ columns on each insert that would take some ugly code to build, so I haven't tried it yet.
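The string-building part of a prepared batch can be kept fairly small. The following is a minimal sketch, assuming the server accepts a whole `BEGIN BATCH ... APPLY BATCH` string with `?` bind markers as one prepared statement; that has not been tried in this thread. The class name, the table name `events`, and the column names `col1..col40` are hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BatchCqlBuilder {
    /** Builds one INSERT with a "?" bind marker for every column. */
    static String insertCql(String table, List<String> columns) {
        String cols = String.join(", ", columns);
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
    }

    /** Wraps rowsPerBatch copies of that INSERT in a single batch string. */
    static String batchCql(String table, List<String> columns, int rowsPerBatch) {
        String insert = insertCql(table, columns);
        StringBuilder sb = new StringBuilder("BEGIN BATCH\n");
        for (int i = 0; i < rowsPerBatch; i++) {
            sb.append("  ").append(insert).append(";\n");
        }
        return sb.append("APPLY BATCH").toString();
    }

    public static void main(String[] args) {
        // Hypothetical 40-column table; real column names would go here.
        List<String> columns = new ArrayList<>();
        for (int i = 1; i <= 40; i++) {
            columns.add("col" + i);
        }
        System.out.println(batchCql("events", columns, 2));
    }
}
```

The resulting string would be prepared once, and each execution would then bind rowsPerBatch * columns.size() values in row order.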

I'm using CL "ONE" on the inserts and RF 2 in my schema.

On 08/20/2013 08:04 AM, Nate McCall wrote:
John makes a good point re: prepared statements (I'd increase batch sizes again once you've done this as well, in separate, incremental runs of course, so you can gauge the effect of each). That should take out some of the processing overhead of statement validation in the server (some; that load spike still seems high, though).

I'd actually be really interested in what your results are after doing so. I've not tried any A/B testing of prepared statements on inserts.

Given that your load is on the server, I'm not sure adding more async indirection on the client would buy you much, though.

Also, at what RF and consistency level are you writing?


On Tue, Aug 20, 2013 at 8:56 AM, Keith Freeman <8fo...@gmail.com> wrote:

    Ok, I'll try prepared statements.   But while sending my
    statements async might speed up my client, it wouldn't improve
    throughput on the cassandra nodes would it?  They're running at
    pretty high loads and only about 10% idle, so my concern is that
    they can't handle the data any faster, so something's wrong on the
    server side.  I don't really think there's anything on the client
    side that matters for this problem.

    Of course I know there are obvious h/w things I can do to improve
    server performance: SSDs, more RAM, more cores, etc.  But I
    thought the servers I have would be able to handle more rows/sec
    than, say, MySQL, since write speed is supposed to be one of
    Cassandra's strengths.


    On 08/19/2013 09:03 PM, John Sanda wrote:
    I'd suggest using prepared statements that you initialize at
    application start up and switching to use Session.executeAsync
    coupled with Google Guava Futures API to get better throughput on
    the client side.
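A rough sketch of the async pattern suggested above, with a cap on the number of outstanding requests so the client cannot flood the nodes. This is not the driver's API: `asyncSend` is a hypothetical stand-in for a call such as Session.executeAsync, and the standard library's CompletableFuture is swapped in for the driver's Guava-based future so the example needs no dependencies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

class BoundedAsyncSender {
    /**
     * Sends every item through asyncSend, keeping at most maxInFlight
     * requests outstanding, then waits for all of them to finish.
     * Results come back in submission order.
     */
    static <T, R> List<R> sendAll(List<T> items, int maxInFlight,
                                  Function<T, CompletableFuture<R>> asyncSend) {
        Semaphore permits = new Semaphore(maxInFlight);
        List<CompletableFuture<R>> pending = new ArrayList<>();
        for (T item : items) {
            // Blocks here while the in-flight window is full.
            permits.acquireUninterruptibly();
            CompletableFuture<R> future;
            try {
                future = asyncSend.apply(item);
            } catch (RuntimeException e) {
                permits.release(); // nothing was sent, so give the permit back
                throw e;
            }
            // Free the slot when the request completes, successfully or not.
            future.whenComplete((result, error) -> permits.release());
            pending.add(future);
        }
        List<R> results = new ArrayList<>();
        for (CompletableFuture<R> future : pending) {
            // join() rethrows a failed request as CompletionException.
            results.add(future.join());
        }
        return results;
    }
}
```

With the real driver, the same shape would use a callback on the returned future to release the permit; the window size is something to tune per run.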


    On Mon, Aug 19, 2013 at 10:14 PM, Keith Freeman <8fo...@gmail.com> wrote:

        Sure, I've tried different numbers for batches and threads,
        but generally I'm running 10-30 threads at a time on the
        client, each sending a batch of 100 insert statements in
        every call, using the QueryBuilder.batch() API from the
        latest datastax java driver, then calling the
        Session.execute() function (synchronous) on the Batch.

        I can't post my code, but my client does this on each iteration:
        -- divides up the set of inserts by the number of threads
        -- stores the current time
        -- tells all the threads to send their inserts
        -- then when they've all returned checks the elapsed time
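The four steps above can be sketched roughly as follows. This is not the poster's code, which was not shared; it is a minimal stand-in in which `sender` is a hypothetical callback representing whatever each thread does with its slice (for example, building a batch and calling Session.execute()).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class InsertTimingHarness {
    /** Splits items into at most `parts` contiguous slices of near-equal size. */
    static <T> List<List<T>> partition(List<T> items, int parts) {
        List<List<T>> slices = new ArrayList<>();
        int size = (items.size() + parts - 1) / parts; // ceiling division
        for (int start = 0; start < items.size(); start += size) {
            slices.add(items.subList(start, Math.min(start + size, items.size())));
        }
        return slices;
    }

    /** Runs one iteration and returns the elapsed time in nanoseconds. */
    static <T> long timeIteration(List<T> inserts, int threads, Consumer<List<T>> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Step 1: divide the inserts by the number of threads.
            List<Callable<Void>> tasks = new ArrayList<>();
            for (List<T> slice : partition(inserts, threads)) {
                tasks.add(() -> { sender.accept(slice); return null; });
            }
            // Step 2: store the current time.
            long start = System.nanoTime();
            // Step 3: hand every slice to a thread and wait for all of them.
            for (Future<Void> done : pool.invokeAll(tasks)) {
                done.get();
            }
            // Step 4: check the elapsed time.
            return System.nanoTime() - start;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that the pool's worker threads are started lazily on first submission, so their startup cost falls inside the measured interval; reusing one pool across iterations would avoid that.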

        At about 2000 rows for each iteration, 20 threads with 100
        inserts each finish in about 1 second.  For 4000 rows, 40
        threads with 100 inserts each finish in about 1.5 - 2
        seconds, and as I said all 3 cassandra nodes have a heavy CPU
        load while the client is hardly loaded.  I've tried with 10
        threads and more inserts per batch, or up to 60 threads with
        fewer, and it doesn't seem to make much difference.


        On 08/19/2013 05:00 PM, Nate McCall wrote:
        How big are the batch sizes? In other words, how many rows
        are you sending per insert operation?

        Other than the above, not much else to suggest without
        seeing some example code (on pastebin, gist or similar,
        ideally).

        On Mon, Aug 19, 2013 at 5:49 PM, Keith Freeman <8fo...@gmail.com> wrote:

            I've got a 3-node cassandra cluster (16G/4-core VMs, ESXi
            v5 on 2.5GHz machines not shared with any other VMs).
             I'm inserting time-series data into a single
            column-family using "wide rows" (timeuuids) and have a
            3-part partition key so my primary key is something like
            ((a, b, day), in-time-uuid), x, y, z).

            My java client is feeding rows (about 1k of raw data
            size each) in batches using multiple threads, and the
            fastest I can get it to run reliably is about 2000
            rows/second.  Even at that speed, all 3 cassandra nodes
            are very CPU bound, with loads of 6-9 each (and the
            client machine is hardly breaking a sweat).  I've tried
            turning off compression in my table which reduced the
            loads slightly but not much.  There are no other updates
            or reads occurring, except the datastax opscenter.

            I was expecting to be able to insert at least 10k
            rows/second with this configuration, and after a lot of
            reading of docs, blogs, and google, can't really figure
            out what's slowing my client down.  When I increase the
            insert speed of my client beyond 2000/second, the server
            responses are just too slow and the client falls behind.
            I had a single-node MySQL database that could handle 10k
            of these data rows/second, so I really feel like I'm
            missing something in Cassandra.  Any ideas?






--
    - John


