Re: High latencies for simple queries

2015-03-31 Thread Tyler Hobbs
To clarify, that's in Cassandra 2.1+.  In 2.0 and earlier, we used
http://code.google.com/a/apache-extras.org/p/cassandra-dbapi2/ for cqlsh.

On Tue, Mar 31, 2015 at 10:40 AM, Tyler Hobbs ty...@datastax.com wrote:

 The python driver that we bundle with Cassandra for cqlsh is the normal
 python driver (https://github.com/datastax/python-driver), although
 sometimes it's patched for bugfixes or is not an official release.

 On Sat, Mar 28, 2015 at 5:36 PM, Ben Bromhead b...@instaclustr.com wrote:

 cqlsh runs on the internal cassandra python drivers: cassandra-pylib and
 cqlshlib.

 I would not recommend using them at all (nothing wrong with them, they
 are just not built with external users in mind).

 I have never used python-driver in anger so I can't comment on whether it
 is genuinely slower than the internal C* python driver, but this might be a
 question for python-driver folk.

 On 28 March 2015 at 00:34, Artur Siekielski a...@vhex.net wrote:

 On 03/28/2015 12:13 AM, Ben Bromhead wrote:

 One other thing to keep in mind / check is that doing these tests
 locally the cassandra driver will connect using the network stack,
 whereas postgres supports local connections over a unix domain socket
 (this is also enabled by default).

 Unix domain sockets are significantly faster than tcp as you don't have
 a network stack to traverse. I think any driver using libpq will attempt
 to use the domain socket when connecting locally.


 Good catch. I assured that psycopg2 connects through a TCP socket and
 the numbers increased by about 20%, but it still is an order of magnitude
 faster than Cassandra.


 But I'm going to hazard a guess something else is going on with the
 Cassandra connection as I'm able to get 0.5ms queries locally and that's
 even with trace turned on.


 Using python-driver?




 --

 Ben Bromhead

 Instaclustr | www.instaclustr.com | @instaclustr
 http://twitter.com/instaclustr | (650) 284 9692




 --
 Tyler Hobbs
 DataStax http://datastax.com/




-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: High latencies for simple queries

2015-03-31 Thread Tyler Hobbs
The python driver that we bundle with Cassandra for cqlsh is the normal
python driver (https://github.com/datastax/python-driver), although
sometimes it's patched for bugfixes or is not an official release.

On Sat, Mar 28, 2015 at 5:36 PM, Ben Bromhead b...@instaclustr.com wrote:

 cqlsh runs on the internal cassandra python drivers: cassandra-pylib and
 cqlshlib.

 I would not recommend using them at all (nothing wrong with them, they are
 just not built with external users in mind).

 I have never used python-driver in anger so I can't comment on whether it
 is genuinely slower than the internal C* python driver, but this might be a
 question for python-driver folk.

 On 28 March 2015 at 00:34, Artur Siekielski a...@vhex.net wrote:

 On 03/28/2015 12:13 AM, Ben Bromhead wrote:

 One other thing to keep in mind / check is that doing these tests
 locally the cassandra driver will connect using the network stack,
 whereas postgres supports local connections over a unix domain socket
 (this is also enabled by default).

 Unix domain sockets are significantly faster than tcp as you don't have
 a network stack to traverse. I think any driver using libpq will attempt
 to use the domain socket when connecting locally.


 Good catch. I assured that psycopg2 connects through a TCP socket and the
 numbers increased by about 20%, but it still is an order of magnitude
 faster than Cassandra.


 But I'm going to hazard a guess something else is going on with the
 Cassandra connection as I'm able to get 0.5ms queries locally and that's
 even with trace turned on.


 Using python-driver?




 --

 Ben Bromhead

 Instaclustr | www.instaclustr.com | @instaclustr
 http://twitter.com/instaclustr | (650) 284 9692




-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: High latencies for simple queries

2015-03-28 Thread Ben Bromhead
cqlsh runs on the internal cassandra python drivers: cassandra-pylib and
cqlshlib.

I would not recommend using them at all (nothing wrong with them, they are
just not built with external users in mind).

I have never used python-driver in anger so I can't comment on whether it
is genuinely slower than the internal C* python driver, but this might be a
question for python-driver folk.

On 28 March 2015 at 00:34, Artur Siekielski a...@vhex.net wrote:

 On 03/28/2015 12:13 AM, Ben Bromhead wrote:

 One other thing to keep in mind / check is that doing these tests
 locally the cassandra driver will connect using the network stack,
 whereas postgres supports local connections over a unix domain socket
 (this is also enabled by default).

 Unix domain sockets are significantly faster than tcp as you don't have
 a network stack to traverse. I think any driver using libpq will attempt
 to use the domain socket when connecting locally.


 Good catch. I assured that psycopg2 connects through a TCP socket and the
 numbers increased by about 20%, but it still is an order of magnitude
 faster than Cassandra.


 But I'm going to hazard a guess something else is going on with the
 Cassandra connection as I'm able to get 0.5ms queries locally and that's
 even with trace turned on.


 Using python-driver?




-- 

Ben Bromhead

Instaclustr | www.instaclustr.com | @instaclustr
http://twitter.com/instaclustr | (650) 284 9692


Re: High latencies for simple queries

2015-03-28 Thread Artur Siekielski

On 03/28/2015 12:13 AM, Ben Bromhead wrote:

One other thing to keep in mind / check is that doing these tests
locally the cassandra driver will connect using the network stack,
whereas postgres supports local connections over a unix domain socket
(this is also enabled by default).

Unix domain sockets are significantly faster than tcp as you don't have
a network stack to traverse. I think any driver using libpq will attempt
to use the domain socket when connecting locally.


Good catch. I made sure that psycopg2 connects through a TCP socket, and
the numbers increased by about 20%, but it is still an order of
magnitude faster than Cassandra.
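
To get a feel for how much the transport alone contributes, the comparison can be reproduced without any database. The following is a minimal sketch, not taken from the thread: it times a one-byte echo round trip over a Unix domain socket pair and over TCP loopback. The helper names (echo, measure, unix_pair, tcp_pair) and the round count are assumptions chosen for illustration, it assumes a Unix-like system, and the numbers will vary by machine.

```python
import socket
import threading
import time


def echo(conn):
    """Echo bytes back until the peer closes the connection."""
    while True:
        data = conn.recv(64)
        if not data:
            break
        conn.sendall(data)
    conn.close()


def unix_pair():
    """Return two connected Unix domain stream sockets."""
    return socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)


def tcp_pair():
    """Return two connected TCP sockets on the loopback interface."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    for sock in (client, server):
        # Disable Nagle's algorithm so single bytes are sent immediately.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return client, server


def measure(make_pair, rounds=2000):
    """Mean round-trip time in microseconds for a one-byte echo."""
    client, server = make_pair()
    worker = threading.Thread(target=echo, args=(server,), daemon=True)
    worker.start()
    try:
        start = time.perf_counter()
        for _ in range(rounds):
            client.sendall(b"x")
            client.recv(64)
        return (time.perf_counter() - start) / rounds * 1e6
    finally:
        client.close()
        worker.join(timeout=1)


results = {"unix": measure(unix_pair), "tcp": measure(tcp_pair)}
for name, micros in results.items():
    print("%s: %.1f us per round trip" % (name, micros))
```

This only isolates the socket layer; it says nothing about how libpq or the Cassandra driver use their connections.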




But I'm going to hazard a guess something else is going on with the
Cassandra connection as I'm able to get 0.5ms queries locally and that's
even with trace turned on.


Using python-driver?


High latencies for simple queries

2015-03-27 Thread Artur Siekielski
I'm running Cassandra locally and I see that the execution time for the 
simplest queries is 1-2 milliseconds. By a simple query I mean either 
INSERT or SELECT from a small table with short keys.


While this number is not high, it's about 10-20 times slower than 
Postgresql (even if INSERTs are wrapped in transactions). I know that 
the nature of Cassandra compared to Postgresql is different, but for 
some scenarios this difference can matter.


The question is: is it normal for Cassandra to have a minimum latency of 
1 millisecond?


I'm using Cassandra 2.1.2, python-driver.




Re: High latencies for simple queries

2015-03-27 Thread Tyler Hobbs
Just to check, are you concerned about minimizing that latency or
maximizing throughput?

I'll assume that latency is what you're actually concerned about.  A fair amount
of that latency is probably happening in the python driver.  Although it
can easily execute ~8k operations per second (using cpython), in some
scenarios it can be difficult to guarantee sub-ms latency for an individual
query due to how some of the internals work.  In particular, it uses
python's Conditions for cross-thread signalling (from the event loop thread
to the application thread).  Unfortunately, python's Condition
implementation includes a loop with a minimum sleep of 1ms if the Condition
isn't already set when you start the wait() call.  This is why, with a
single application thread, you will typically see a minimum of 1ms latency.
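
The cross-thread wake-up cost described above can be measured in isolation. This is a minimal sketch using only the standard library, with no driver involved: a helper thread stands in for the event loop thread and notifies a Condition that the main thread is waiting on. The function name, round count, and 2 ms pre-sleep are assumptions for illustration. As far as I know, the sleep-based polling applies to timed waits in Python 2; Python 3.2 and later block on a lock with a timeout instead, so results there may be well below 1 ms.

```python
import threading
import time


def measure_handoff_ms(rounds=50, timeout=1.0):
    """Per-round latency (ms) between notify() and the waiter resuming."""
    cond = threading.Condition()
    state = {"ready": False, "signalled_at": 0.0}
    latencies = []

    def signaller():
        # Stands in for the event loop thread delivering a response.
        time.sleep(0.002)  # give the waiter time to block first
        with cond:
            state["ready"] = True
            state["signalled_at"] = time.perf_counter()
            cond.notify()

    for _ in range(rounds):
        state["ready"] = False
        worker = threading.Thread(target=signaller)
        worker.start()
        with cond:
            # Timed wait, like an application thread waiting for a result.
            while not state["ready"]:
                cond.wait(timeout)
        latencies.append((time.perf_counter() - state["signalled_at"]) * 1000.0)
        worker.join()
    return latencies


latencies = measure_handoff_ms()
median_ms = sorted(latencies)[len(latencies) // 2]
print("median wake-up latency: %.3f ms" % median_ms)
```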

Another source of similar latencies for the python driver is the Asyncore
event loop, which is used when libev isn't available.  I would make sure
that you can use the LibevConnection class with the driver to avoid this.
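
Selecting the libev reactor looks roughly like the sketch below. It assumes the DataStax python-driver is installed with its libev extension built; the function name and the default contact point are hypothetical, and the imports sit inside the function only so that the snippet loads on machines without the driver.

```python
def make_libev_cluster(contact_points=("127.0.0.1",)):
    """Build a Cluster that uses the libev event loop instead of asyncore.

    Assumes the DataStax python-driver and its libev C extension are
    available; importing LibevConnection fails otherwise.
    """
    from cassandra.cluster import Cluster
    from cassandra.io.libevreactor import LibevConnection

    cluster = Cluster(list(contact_points))
    # Must be set before connect() so every connection uses this class.
    cluster.connection_class = LibevConnection
    return cluster


# Usage, against a locally running node:
#   session = make_libev_cluster().connect()
```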

On Fri, Mar 27, 2015 at 6:24 AM, Artur Siekielski a...@vhex.net wrote:

 I'm running Cassandra locally and I see that the execution time for the
 simplest queries is 1-2 milliseconds. By a simple query I mean either
 INSERT or SELECT from a small table with short keys.

 While this number is not high, it's about 10-20 times slower than
 Postgresql (even if INSERTs are wrapped in transactions). I know that the
 nature of Cassandra compared to Postgresql is different, but for some
 scenarios this difference can matter.

 The question is: is it normal for Cassandra to have a minimum latency of 1
 millisecond?

 I'm using Cassandra 2.1.2, python-driver.





-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: High latencies for simple queries

2015-03-27 Thread Artur Siekielski
Yes, I'm concerned about the latency. Throughput can be high even when 
using Python: http://datastax.github.io/python-driver/performance.html. 
But in my scenarios I need to run queries sequentially, so latencies 
matter. And Cassandra requires issuing more queries than SQL databases 
so these latencies can add up to a significant amount.


I was running the Asyncore event loop, because it looks like libev isn't
supported on PyPy, which I'm using. I switched to CPython and
LibevConnection for a moment and didn't notice a major speedup; the
minimum latency is still 1ms.


Overall, it looks to me like the issue is not that important, because
using multi-master, multi-dc databases always involves higher and
somewhat unpredictable latencies, so relying on sub-millisecond
latencies on production clusters is not very realistic.



On 03/27/2015 04:28 PM, Tyler Hobbs wrote:

Just to check, are you concerned about minimizing that latency or
maximizing throughput?

I'll assume that latency is what you're actually concerned about.  A fair
amount of that latency is probably happening in the python driver.
Although it can easily execute ~8k operations per second (using
cpython), in some scenarios it can be difficult to guarantee sub-ms
latency for an individual query due to how some of the internals work.
In particular, it uses python's Conditions for cross-thread signalling
(from the event loop thread to the application thread).  Unfortunately,
python's Condition implementation includes a loop with a minimum sleep
of 1ms if the Condition isn't already set when you start the wait()
call.  This is why, with a single application thread, you will typically
see a minimum of 1ms latency.

Another source of similar latencies for the python driver is the
Asyncore event loop, which is used when libev isn't available.  I would
make sure that you can use the LibevConnection class with the driver to
avoid this.

On Fri, Mar 27, 2015 at 6:24 AM, Artur Siekielski a...@vhex.net wrote:

I'm running Cassandra locally and I see that the execution time for
the simplest queries is 1-2 milliseconds. By a simple query I mean
either INSERT or SELECT from a small table with short keys.

While this number is not high, it's about 10-20 times slower than
Postgresql (even if INSERTs are wrapped in transactions). I know
that the nature of Cassandra compared to Postgresql is different,
but for some scenarios this difference can matter.

The question is: is it normal for Cassandra to have a minimum
latency of 1 millisecond?

I'm using Cassandra 2.1.2, python-driver.




Re: High latencies for simple queries

2015-03-27 Thread Artur Siekielski
I think that in your example Postgres spends most of its time waiting for
fsync() to complete. On Linux, with a battery-backed RAID controller,
it's safe to mount an ext4 filesystem with the barrier=0 option, which
improves fsync() performance a lot. I have partitions mounted with this
option, and I ran a test from Python using the psycopg2 driver. I got
the following latencies, in milliseconds:

- INSERT without COMMIT: 0.04
- INSERT with COMMIT: 0.12
- SELECT: 0.05

I'm also repeating the benchmark runs multiple times (I'm using Python's
timeit module).
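
A measurement of this kind can be sketched with timeit as follows. This is not the benchmark from this message: the helper name and the fake_query stand-in are assumptions, and in a real run the callable would execute one INSERT or SELECT through the driver under test.

```python
import timeit


def best_latency_ms(operation, number=1000, repeat=5):
    """Best mean per-call latency in milliseconds over several runs.

    Each run calls `operation` `number` times; taking the minimum over
    the runs reduces noise from other activity on the machine.
    """
    timings = timeit.repeat(operation, number=number, repeat=repeat)
    return min(timings) / number * 1000.0


def fake_query():
    # Hypothetical placeholder for a single query round trip.
    sum(range(100))


latency = best_latency_ms(fake_query)
print("best mean latency: %.4f ms" % latency)
```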


On 03/27/2015 07:58 PM, Ben Bromhead wrote:

Latency can be so variable even when testing things locally. I quickly
fired up postgres and did the following with psql:

ben=# CREATE TABLE foo(i int, j text, PRIMARY KEY(i));
CREATE TABLE
ben=# \timing
Timing is on.
ben=# INSERT INTO foo VALUES(2, 'yay');
INSERT 0 1
Time: 1.162 ms
ben=# INSERT INTO foo VALUES(3, 'yay');
INSERT 0 1
Time: 1.108 ms

I then fired up a local copy of Cassandra (2.0.12)

cqlsh> CREATE KEYSPACE foo WITH replication = { 'class' :
'SimpleStrategy', 'replication_factor' : 1 };
cqlsh> USE foo;
cqlsh:foo> CREATE TABLE foo(i int PRIMARY KEY, j text);
cqlsh:foo> TRACING ON;
Now tracing requests.
cqlsh:foo> INSERT INTO foo (i, j) VALUES (1, 'yay');





Re: High latencies for simple queries

2015-03-27 Thread Tyler Hobbs
Since you're executing queries sequentially, you may want to look into
using callback chaining to avoid the cross-thread signaling that results in
the 1ms latencies.  Basically, just use session.execute_async() and attach
a callback to the returned future that will execute your next query.  The
callback is executed on the event loop thread.  The main downsides to this
are that you need to be careful to avoid blocking the event loop thread
(including executing session.execute() or prepare()) and you need to ensure
that all exceptions raised in the callback are handled by your application
code.
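
A minimal sketch of callback chaining follows. It relies on session.execute_async() and the returned future's add_callbacks(callback=..., errback=...) from the python driver; everything else (run_chain, the stub classes, the sample statements) is hypothetical and exists only so the example runs without a cluster. The stubs complete immediately on the calling thread, whereas with the real driver the callbacks run on the event loop thread.

```python
import threading


def run_chain(session, statements, on_done, on_error):
    """Run statements one after another, each issued from the previous callback."""
    results = []
    pending = iter(statements)

    def issue_next():
        try:
            statement = next(pending)
        except StopIteration:
            on_done(results)
            return
        future = session.execute_async(statement)
        future.add_callbacks(callback=handle_rows, errback=on_error)

    def handle_rows(rows):
        # Runs on the event loop thread with the real driver: do not block
        # here, and do not call session.execute() or prepare().
        try:
            results.append(rows)
            issue_next()
        except Exception as exc:  # errors must be handled by application code
            on_error(exc)

    issue_next()


class _StubFuture:
    """Hypothetical stand-in for the driver's response future."""

    def __init__(self, rows):
        self._rows = rows

    def add_callbacks(self, callback, errback):
        callback(self._rows)  # completes immediately, on the calling thread


class _StubSession:
    """Hypothetical stand-in for a driver session."""

    def execute_async(self, statement):
        return _StubFuture([statement.upper()])


done = threading.Event()
collected = []


def finished(results):
    collected.extend(results)
    done.set()


def failed(exc):
    collected.append(exc)
    done.set()


run_chain(_StubSession(), ["select 1", "select 2"], finished, failed)
done.wait(timeout=5)  # the application thread blocks once, not per query
print(collected)
```

The application thread waits a single time for the whole chain, so the per-query cross-thread wake-up is avoided.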

On Fri, Mar 27, 2015 at 3:11 PM, Artur Siekielski a...@vhex.net wrote:

 I think that in your example Postgres spends most time on waiting for
 fsync() to complete. On Linux, for a battery-backed raid controller, it's
 safe to mount ext4 filesystem with barrier=0 option which improves
 fsync() performance a lot. I have partitions mounted with this option and I
 did a test from Python, using psycopg2 driver, and I got the following
 latencies, in milliseconds:
 - INSERT without COMMIT: 0.04
 - INSERT with COMMIT: 0.12
 - SELECT: 0.05
 I'm also repeating benchmark runs multiple times (I'm using Python's
 timeit module).


 On 03/27/2015 07:58 PM, Ben Bromhead wrote:

 Latency can be so variable even when testing things locally. I quickly
 fired up postgres and did the following with psql:

 ben=# CREATE TABLE foo(i int, j text, PRIMARY KEY(i));
 CREATE TABLE
 ben=# \timing
 Timing is on.
 ben=# INSERT INTO foo VALUES(2, 'yay');
 INSERT 0 1
 Time: 1.162 ms
 ben=# INSERT INTO foo VALUES(3, 'yay');
 INSERT 0 1
 Time: 1.108 ms

 I then fired up a local copy of Cassandra (2.0.12)

 cqlsh> CREATE KEYSPACE foo WITH replication = { 'class' :
 'SimpleStrategy', 'replication_factor' : 1 };
 cqlsh> USE foo;
 cqlsh:foo> CREATE TABLE foo(i int PRIMARY KEY, j text);
 cqlsh:foo> TRACING ON;
 Now tracing requests.
 cqlsh:foo> INSERT INTO foo (i, j) VALUES (1, 'yay');





-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: High latencies for simple queries

2015-03-27 Thread Ben Bromhead
Latency can be so variable even when testing things locally. I quickly
fired up postgres and did the following with psql:

ben=# CREATE TABLE foo(i int, j text, PRIMARY KEY(i));
CREATE TABLE
ben=# \timing
Timing is on.
ben=# INSERT INTO foo VALUES(2, 'yay');
INSERT 0 1
Time: 1.162 ms
ben=# INSERT INTO foo VALUES(3, 'yay');
INSERT 0 1
Time: 1.108 ms

I then fired up a local copy of Cassandra (2.0.12)

cqlsh> CREATE KEYSPACE foo WITH replication = { 'class' : 'SimpleStrategy',
'replication_factor' : 1 };
cqlsh> USE foo;
cqlsh:foo> CREATE TABLE foo(i int PRIMARY KEY, j text);
cqlsh:foo> TRACING ON;
Now tracing requests.
cqlsh:foo> INSERT INTO foo (i, j) VALUES (1, 'yay');

Tracing session: 7a7dced0-d4b2-11e4-b950-85c3c9bd91a0

 activity                                          | timestamp    | source    | source_elapsed
---------------------------------------------------+--------------+-----------+----------------
 execute_cql3_query                                | 11:52:55,229 | 127.0.0.1 |              0
 Parsing INSERT INTO foo (i, j) VALUES (1, 'yay'); | 11:52:55,229 | 127.0.0.1 |             43
 Preparing statement                               | 11:52:55,229 | 127.0.0.1 |            141
 Determining replicas for mutation                 | 11:52:55,229 | 127.0.0.1 |            291
 Acquiring switchLock read lock                    | 11:52:55,229 | 127.0.0.1 |            403
 Appending to commitlog                            | 11:52:55,229 | 127.0.0.1 |            413
 Adding to foo memtable                            | 11:52:55,229 | 127.0.0.1 |            432
 Request complete                                  | 11:52:55,229 | 127.0.0.1 |            541
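
The source_elapsed column in the trace is cumulative and reported in microseconds, so the time spent in each step is the difference between consecutive rows. A small sketch of that arithmetic, with the values copied from the trace and the function name chosen only for illustration:

```python
# (activity, source_elapsed in microseconds), copied from the trace.
trace = [
    ("execute_cql3_query", 0),
    ("Parsing INSERT INTO foo (i, j) VALUES (1, 'yay');", 43),
    ("Preparing statement", 141),
    ("Determining replicas for mutation", 291),
    ("Acquiring switchLock read lock", 403),
    ("Appending to commitlog", 413),
    ("Adding to foo memtable", 432),
    ("Request complete", 541),
]


def step_durations(rows):
    """Turn cumulative elapsed times into per-step durations."""
    durations = []
    previous = 0
    for activity, elapsed in rows:
        durations.append((activity, elapsed - previous))
        previous = elapsed
    return durations


durations = step_durations(trace)
total_ms = trace[-1][1] / 1000.0

for activity, micros in durations:
    print("%4d us  %s" % (micros, activity))
print("total: %.3f ms" % total_ms)  # total: 0.541 ms
```

The server-side total is about 0.54 ms, which covers only what the node itself traced and excludes driver and network time.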

All this on a MacBook Pro with 16 GB of memory and an SSD.

So ymmv?

On 27 March 2015 at 08:28, Tyler Hobbs ty...@datastax.com wrote:

 Just to check, are you concerned about minimizing that latency or
 maximizing throughput?

 I'll assume that latency is what you're actually concerned about.  A fair amount
 of that latency is probably happening in the python driver.  Although it
 can easily execute ~8k operations per second (using cpython), in some
 scenarios it can be difficult to guarantee sub-ms latency for an individual
 query due to how some of the internals work.  In particular, it uses
 python's Conditions for cross-thread signalling (from the event loop thread
 to the application thread).  Unfortunately, python's Condition
 implementation includes a loop with a minimum sleep of 1ms if the Condition
 isn't already set when you start the wait() call.  This is why, with a
 single application thread, you will typically see a minimum of 1ms latency.

 Another source of similar latencies for the python driver is the Asyncore
 event loop, which is used when libev isn't available.  I would make sure
 that you can use the LibevConnection class with the driver to avoid this.

 On Fri, Mar 27, 2015 at 6:24 AM, Artur Siekielski a...@vhex.net wrote:

 I'm running Cassandra locally and I see that the execution time for the
 simplest queries is 1-2 milliseconds. By a simple query I mean either
 INSERT or SELECT from a small table with short keys.

 While this number is not high, it's about 10-20 times slower than
 Postgresql (even if INSERTs are wrapped in transactions). I know that the
 nature of Cassandra compared to Postgresql is different, but for some
 scenarios this difference can matter.

 The question is: is it normal for Cassandra to have a minimum latency of
 1 millisecond?

 I'm using Cassandra 2.1.2, python-driver.





 --
 Tyler Hobbs
 DataStax http://datastax.com/




-- 

Ben Bromhead

Instaclustr | www.instaclustr.com | @instaclustr
http://twitter.com/instaclustr | (650) 284 9692


Re: High latencies for simple queries

2015-03-27 Thread Laing, Michael
I use callback chaining with the python driver and can confirm that it is
very fast.

You can chain the chains together to perform sequential processing. I do
this when retrieving metadata and then the referenced payload, for
example when the metadata has been inverted and the payload is larger than
we want to invert. You can also run multiple chains of chains
asynchronously, cascading state by employing the userdata of the future.

We also multiprocess, for more parallelism, and we distribute work to
multiple multiprocessing instances using a message broker for yet more
parallel activity, as well as reliability.

ml

On Fri, Mar 27, 2015 at 4:28 PM, Tyler Hobbs ty...@datastax.com wrote:

 Since you're executing queries sequentially, you may want to look into
 using callback chaining to avoid the cross-thread signaling that results in
 the 1ms latencies.  Basically, just use session.execute_async() and attach
 a callback to the returned future that will execute your next query.  The
 callback is executed on the event loop thread.  The main downsides to this
 are that you need to be careful to avoid blocking the event loop thread
 (including executing session.execute() or prepare()) and you need to ensure
 that all exceptions raised in the callback are handled by your application
 code.

 On Fri, Mar 27, 2015 at 3:11 PM, Artur Siekielski a...@vhex.net wrote:

 I think that in your example Postgres spends most time on waiting for
 fsync() to complete. On Linux, for a battery-backed raid controller, it's
 safe to mount ext4 filesystem with barrier=0 option which improves
 fsync() performance a lot. I have partitions mounted with this option and I
 did a test from Python, using psycopg2 driver, and I got the following
 latencies, in milliseconds:
 - INSERT without COMMIT: 0.04
 - INSERT with COMMIT: 0.12
 - SELECT: 0.05
 I'm also repeating benchmark runs multiple times (I'm using Python's
 timeit module).


 On 03/27/2015 07:58 PM, Ben Bromhead wrote:

 Latency can be so variable even when testing things locally. I quickly
 fired up postgres and did the following with psql:

 ben=# CREATE TABLE foo(i int, j text, PRIMARY KEY(i));
 CREATE TABLE
 ben=# \timing
 Timing is on.
 ben=# INSERT INTO foo VALUES(2, 'yay');
 INSERT 0 1
 Time: 1.162 ms
 ben=# INSERT INTO foo VALUES(3, 'yay');
 INSERT 0 1
 Time: 1.108 ms

 I then fired up a local copy of Cassandra (2.0.12)

 cqlsh> CREATE KEYSPACE foo WITH replication = { 'class' :
 'SimpleStrategy', 'replication_factor' : 1 };
 cqlsh> USE foo;
 cqlsh:foo> CREATE TABLE foo(i int PRIMARY KEY, j text);
 cqlsh:foo> TRACING ON;
 Now tracing requests.
 cqlsh:foo> INSERT INTO foo (i, j) VALUES (1, 'yay');





 --
 Tyler Hobbs
 DataStax http://datastax.com/