Re: Best way to do a multi_get using CQL

2014-06-20 Thread Marcelo Elias Del Valle
Yes, I am using the CQL datastax drivers.
It was a good advice, thanks a lot Janathan.
[]s


2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 The only case in which it might be better to use an IN clause is if
 the entire query can be satisfied from that machine.  Otherwise, go
 async.

 The native driver reuses connections and intelligently manages the
 pool for you.  It can also multiplex queries over a single connection.

 I am assuming you're using one of the datastax drivers for CQL, btw.

 Jon

 On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  This is interesting, I didn't know that!
  It might make sense then to use select = + async + token aware, I will
 try
  to change my code.
 
  But would it be a recomended solution for these cases? Any other
 options?
 
  I still would if this is the right use case for Cassandra, to look for
  random keys in a huge cluster. After all, the amount of connections to
  Cassandra will still be huge, right... Wouldn't it be a problem?
  Or when you use async the driver reuses the connection?
 
  []s
 
 
  2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
 
  If you use async and your driver is token aware, it will go to the
  proper node, rather than requiring the coordinator to do so.
 
  Realistically you're going to have a connection open to every server
  anyways.  It's the difference between you querying for the data
  directly and using a coordinator as a proxy.  It's faster to just ask
  the node with the data.
 
  On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
  marc...@s1mbi0se.com.br wrote:
   But using async queries wouldn't be even worse than using SELECT IN?
   The justification in the docs is I could query many nodes, but I would
   still
   do it.
  
   Today, I use both async queries AND SELECT IN:
  
   SELECT_ENTITY_LOOKUP = SELECT entity_id FROM  + ENTITY_LOOKUP + 
   WHERE
   name=%s and value in(%s)
  
   for name, values in identifiers.items():
  query = self.SELECT_ENTITY_LOOKUP % ('%s',
   ','.join(['%s']*len(values)))
  args = [name] + values
  query_msg = query % tuple(args)
  futures.append((query_msg, self.session.execute_async(query,
 args)))
  
   for query_msg, future in futures:
  try:
 rows = future.result(timeout=10)
 for row in rows:
   entity_ids.add(row.entity_id)
  except:
 logging.error(Query '%s' returned ERROR  % (query_msg))
 raise
  
   Using async just with select = would mean instead of 1 async query
   (example:
   in (0, 1, 2)), I would do several, one for each value of values
 array
   above.
   In my head, this would mean more connections to Cassandra and the same
   amount of work, right? What would be the advantage?
  
   []s
  
  
  
  
   2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
  
   Your other option is to fire off async queries.  It's pretty
   straightforward w/ the java or python drivers.
  
   On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
   marc...@s1mbi0se.com.br wrote:
I was taking a look at Cassandra anti-patterns list:
   
   
   
   
 http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
   
Among then is
   
SELECT ... IN or index lookups¶
   
SELECT ... IN and index lookups (formerly secondary indexes) should
be
avoided except for specific scenarios. See When not to use IN in
SELECT
and
When not to use an index in Indexing in
   
CQL for Cassandra 2.0
   
And Looking at the SELECT doc, I saw:
   
When not to use IN¶
   
The recommendations about when not to use an index apply to using
 IN
in
the
WHERE clause. Under most conditions, using IN in the WHERE clause
 is
not
recommended. Using IN can degrade performance because usually many
nodes
must be queried. For example, in a single, local data center
 cluster
having
30 nodes, a replication factor of 3, and a consistency level of
LOCAL_QUORUM, a single key query goes out to two nodes, but if the
query
uses the IN condition, the number of nodes being queried are most
likely
even higher, up to 20 nodes depending on where the keys fall in the
token
range.
   
In my system, I have a column family called entity_lookup:
   
CREATE KEYSPACE IF NOT EXISTS Identification1
  WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
  'DC1' : 3 };
USE Identification1;
   
CREATE TABLE IF NOT EXISTS entity_lookup (
  name varchar,
  value varchar,
  entity_id uuid,
  PRIMARY KEY ((name, value), entity_id));
   
And I use the following select to query it:
   
SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
   
Is this an anti-pattern?
   
If not using SELECT IN, which other way would you recomend for
lookups
like
that? I have 

Re: Best way to do a multi_get using CQL

2014-06-20 Thread Laing, Michael
However my extensive benchmarking this week of the python driver from
master shows a performance *decrease* when using 'token_aware'.

This is on 12-node, 2-datacenter, RF-3 cluster in AWS.

Also why do the work the coordinator will do for you: send all the queries,
wait for everything to come back in whatever order, and sort the result.

I would rather keep my app code simple.

But the real point is that you should benchmark in your own environment.

ml


On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle 
marc...@s1mbi0se.com.br wrote:

 Yes, I am using the CQL datastax drivers.
 It was a good advice, thanks a lot Janathan.
 []s


 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 The only case in which it might be better to use an IN clause is if
 the entire query can be satisfied from that machine.  Otherwise, go
 async.

 The native driver reuses connections and intelligently manages the
 pool for you.  It can also multiplex queries over a single connection.

 I am assuming you're using one of the datastax drivers for CQL, btw.

 Jon

 On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  This is interesting, I didn't know that!
  It might make sense then to use select = + async + token aware, I will
 try
  to change my code.
 
  But would it be a recomended solution for these cases? Any other
 options?
 
  I still would if this is the right use case for Cassandra, to look for
  random keys in a huge cluster. After all, the amount of connections to
  Cassandra will still be huge, right... Wouldn't it be a problem?
  Or when you use async the driver reuses the connection?
 
  []s
 
 
  2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
 
  If you use async and your driver is token aware, it will go to the
  proper node, rather than requiring the coordinator to do so.
 
  Realistically you're going to have a connection open to every server
  anyways.  It's the difference between you querying for the data
  directly and using a coordinator as a proxy.  It's faster to just ask
  the node with the data.
 
  On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
  marc...@s1mbi0se.com.br wrote:
   But using async queries wouldn't be even worse than using SELECT IN?
   The justification in the docs is I could query many nodes, but I
 would
   still
   do it.
  
   Today, I use both async queries AND SELECT IN:
  
   SELECT_ENTITY_LOOKUP = SELECT entity_id FROM  + ENTITY_LOOKUP + 
   WHERE
   name=%s and value in(%s)
  
   for name, values in identifiers.items():
  query = self.SELECT_ENTITY_LOOKUP % ('%s',
   ','.join(['%s']*len(values)))
  args = [name] + values
  query_msg = query % tuple(args)
  futures.append((query_msg, self.session.execute_async(query,
 args)))
  
   for query_msg, future in futures:
  try:
 rows = future.result(timeout=10)
 for row in rows:
   entity_ids.add(row.entity_id)
  except:
 logging.error(Query '%s' returned ERROR  % (query_msg))
 raise
  
   Using async just with select = would mean instead of 1 async query
   (example:
   in (0, 1, 2)), I would do several, one for each value of values
 array
   above.
   In my head, this would mean more connections to Cassandra and the
 same
   amount of work, right? What would be the advantage?
  
   []s
  
  
  
  
   2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
  
   Your other option is to fire off async queries.  It's pretty
   straightforward w/ the java or python drivers.
  
   On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
   marc...@s1mbi0se.com.br wrote:
I was taking a look at Cassandra anti-patterns list:
   
   
   
   
 http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
   
Among then is
   
SELECT ... IN or index lookups¶
   
SELECT ... IN and index lookups (formerly secondary indexes)
 should
be
avoided except for specific scenarios. See When not to use IN in
SELECT
and
When not to use an index in Indexing in
   
CQL for Cassandra 2.0
   
And Looking at the SELECT doc, I saw:
   
When not to use IN¶
   
The recommendations about when not to use an index apply to using
 IN
in
the
WHERE clause. Under most conditions, using IN in the WHERE clause
 is
not
recommended. Using IN can degrade performance because usually many
nodes
must be queried. For example, in a single, local data center
 cluster
having
30 nodes, a replication factor of 3, and a consistency level of
LOCAL_QUORUM, a single key query goes out to two nodes, but if the
query
uses the IN condition, the number of nodes being queried are most
likely
even higher, up to 20 nodes depending on where the keys fall in
 the
token
range.
   
In my system, I have a column family called entity_lookup:
   
CREATE KEYSPACE IF NOT 

Re: Best way to do a multi_get using CQL

2014-06-20 Thread Jeremy Jongsma
I've found that if you have any amount of latency between your client and
nodes, and you are executing a large batch of queries, you'll usually want
to send them together to one node unless execution time is of no concern.
The tradeoff is resource usage on the connected node vs. time to complete
all the queries, because you'll need fewer client - node network round
trips.

With large numbers of queries you will still want to make sure you split
them into manageable batches before sending them, to control memory usage
on the executing node. I've been limiting queries to batches of 100 keys in
scenarios like this.


On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael michael.la...@nytimes.com
wrote:

 However my extensive benchmarking this week of the python driver from
 master shows a performance *decrease* when using 'token_aware'.

 This is on 12-node, 2-datacenter, RF-3 cluster in AWS.

 Also why do the work the coordinator will do for you: send all the
 queries, wait for everything to come back in whatever order, and sort the
 result.

 I would rather keep my app code simple.

 But the real point is that you should benchmark in your own environment.

 ml


 On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Yes, I am using the CQL datastax drivers.
 It was a good advice, thanks a lot Janathan.
 []s


 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 The only case in which it might be better to use an IN clause is if
 the entire query can be satisfied from that machine.  Otherwise, go
 async.

 The native driver reuses connections and intelligently manages the
 pool for you.  It can also multiplex queries over a single connection.

 I am assuming you're using one of the datastax drivers for CQL, btw.

 Jon

 On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  This is interesting, I didn't know that!
  It might make sense then to use select = + async + token aware, I will
 try
  to change my code.
 
  But would it be a recomended solution for these cases? Any other
 options?
 
  I still would if this is the right use case for Cassandra, to look for
  random keys in a huge cluster. After all, the amount of connections to
  Cassandra will still be huge, right... Wouldn't it be a problem?
  Or when you use async the driver reuses the connection?
 
  []s
 
 
  2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
 
  If you use async and your driver is token aware, it will go to the
  proper node, rather than requiring the coordinator to do so.
 
  Realistically you're going to have a connection open to every server
  anyways.  It's the difference between you querying for the data
  directly and using a coordinator as a proxy.  It's faster to just ask
  the node with the data.
 
  On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
  marc...@s1mbi0se.com.br wrote:
   But using async queries wouldn't be even worse than using SELECT IN?
   The justification in the docs is I could query many nodes, but I
 would
   still
   do it.
  
   Today, I use both async queries AND SELECT IN:
  
   SELECT_ENTITY_LOOKUP = SELECT entity_id FROM  + ENTITY_LOOKUP + 
   WHERE
   name=%s and value in(%s)
  
   for name, values in identifiers.items():
  query = self.SELECT_ENTITY_LOOKUP % ('%s',
   ','.join(['%s']*len(values)))
  args = [name] + values
  query_msg = query % tuple(args)
  futures.append((query_msg, self.session.execute_async(query,
 args)))
  
   for query_msg, future in futures:
  try:
 rows = future.result(timeout=10)
 for row in rows:
   entity_ids.add(row.entity_id)
  except:
 logging.error(Query '%s' returned ERROR  % (query_msg))
 raise
  
   Using async just with select = would mean instead of 1 async query
   (example:
   in (0, 1, 2)), I would do several, one for each value of values
 array
   above.
   In my head, this would mean more connections to Cassandra and the
 same
   amount of work, right? What would be the advantage?
  
   []s
  
  
  
  
   2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
  
   Your other option is to fire off async queries.  It's pretty
   straightforward w/ the java or python drivers.
  
   On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
   marc...@s1mbi0se.com.br wrote:
I was taking a look at Cassandra anti-patterns list:
   
   
   
   
 http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
   
Among then is
   
SELECT ... IN or index lookups¶
   
SELECT ... IN and index lookups (formerly secondary indexes)
 should
be
avoided except for specific scenarios. See When not to use IN in
SELECT
and
When not to use an index in Indexing in
   
CQL for Cassandra 2.0
   
And Looking at the SELECT doc, I saw:
   
When not to use IN¶
   
The recommendations about when not to use 

Re: Best way to do a multi_get using CQL

2014-06-20 Thread Marcelo Elias Del Valle
A question, not sure if you guys know the answer:
Supose I async query 1000 rows using token aware and suppose I have 10
nodes. Suppose also each node would receive 100 row queries each.
How does async work in this case? Would it send each row query to each node
in a different connection? Different message?
I guess if there was a way to use batch with async, once you commit the
batch for the 1000 queries, it would create 1 connection to each host and
query 100 rows in a single message to each host.
This would decrease resource usage, am I wrong?

[]s


2014-06-20 12:12 GMT-03:00 Jeremy Jongsma jer...@barchart.com:

 I've found that if you have any amount of latency between your client and
 nodes, and you are executing a large batch of queries, you'll usually want
 to send them together to one node unless execution time is of no concern.
 The tradeoff is resource usage on the connected node vs. time to complete
 all the queries, because you'll need fewer client - node network round
 trips.

 With large numbers of queries you will still want to make sure you split
 them into manageable batches before sending them, to control memory usage
 on the executing node. I've been limiting queries to batches of 100 keys in
 scenarios like this.


 On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael michael.la...@nytimes.com
  wrote:

 However my extensive benchmarking this week of the python driver from
 master shows a performance *decrease* when using 'token_aware'.

 This is on 12-node, 2-datacenter, RF-3 cluster in AWS.

 Also why do the work the coordinator will do for you: send all the
 queries, wait for everything to come back in whatever order, and sort the
 result.

 I would rather keep my app code simple.

 But the real point is that you should benchmark in your own environment.

 ml


 On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Yes, I am using the CQL datastax drivers.
 It was a good advice, thanks a lot Janathan.
 []s


 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 The only case in which it might be better to use an IN clause is if
 the entire query can be satisfied from that machine.  Otherwise, go
 async.

 The native driver reuses connections and intelligently manages the
 pool for you.  It can also multiplex queries over a single connection.

 I am assuming you're using one of the datastax drivers for CQL, btw.

 Jon

 On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  This is interesting, I didn't know that!
  It might make sense then to use select = + async + token aware, I
 will try
  to change my code.
 
  But would it be a recomended solution for these cases? Any other
 options?
 
  I still would if this is the right use case for Cassandra, to look for
  random keys in a huge cluster. After all, the amount of connections to
  Cassandra will still be huge, right... Wouldn't it be a problem?
  Or when you use async the driver reuses the connection?
 
  []s
 
 
  2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
 
  If you use async and your driver is token aware, it will go to the
  proper node, rather than requiring the coordinator to do so.
 
  Realistically you're going to have a connection open to every server
  anyways.  It's the difference between you querying for the data
  directly and using a coordinator as a proxy.  It's faster to just ask
  the node with the data.
 
  On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
  marc...@s1mbi0se.com.br wrote:
   But using async queries wouldn't be even worse than using SELECT
 IN?
   The justification in the docs is I could query many nodes, but I
 would
   still
   do it.
  
   Today, I use both async queries AND SELECT IN:
  
   SELECT_ENTITY_LOOKUP = SELECT entity_id FROM  + ENTITY_LOOKUP + 
   WHERE
   name=%s and value in(%s)
  
   for name, values in identifiers.items():
  query = self.SELECT_ENTITY_LOOKUP % ('%s',
   ','.join(['%s']*len(values)))
  args = [name] + values
  query_msg = query % tuple(args)
  futures.append((query_msg, self.session.execute_async(query,
 args)))
  
   for query_msg, future in futures:
  try:
 rows = future.result(timeout=10)
 for row in rows:
   entity_ids.add(row.entity_id)
  except:
 logging.error(Query '%s' returned ERROR  % (query_msg))
 raise
  
   Using async just with select = would mean instead of 1 async query
   (example:
   in (0, 1, 2)), I would do several, one for each value of values
 array
   above.
   In my head, this would mean more connections to Cassandra and the
 same
   amount of work, right? What would be the advantage?
  
   []s
  
  
  
  
   2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
  
   Your other option is to fire off async queries.  It's pretty
   straightforward w/ the java or python drivers.
  
   On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
   

Re: Best way to do a multi_get using CQL

2014-06-20 Thread Jeremy Jongsma
That depends on the connection pooling implementation in your driver.
Astyanax will keep N connections open to each node (configurable) and route
each query in a separate message over an existing connection, waiting until
one becomes available if all are in use.


On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle 
marc...@s1mbi0se.com.br wrote:

 A question, not sure if you guys know the answer:
 Supose I async query 1000 rows using token aware and suppose I have 10
 nodes. Suppose also each node would receive 100 row queries each.
 How does async work in this case? Would it send each row query to each
 node in a different connection? Different message?
 I guess if there was a way to use batch with async, once you commit the
 batch for the 1000 queries, it would create 1 connection to each host and
 query 100 rows in a single message to each host.
 This would decrease resource usage, am I wrong?

 []s


 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma jer...@barchart.com:

 I've found that if you have any amount of latency between your client and
 nodes, and you are executing a large batch of queries, you'll usually want
 to send them together to one node unless execution time is of no concern.
 The tradeoff is resource usage on the connected node vs. time to complete
 all the queries, because you'll need fewer client - node network round
 trips.

 With large numbers of queries you will still want to make sure you split
 them into manageable batches before sending them, to control memory usage
 on the executing node. I've been limiting queries to batches of 100 keys in
 scenarios like this.


 On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael 
 michael.la...@nytimes.com wrote:

 However my extensive benchmarking this week of the python driver from
 master shows a performance *decrease* when using 'token_aware'.

 This is on 12-node, 2-datacenter, RF-3 cluster in AWS.

 Also why do the work the coordinator will do for you: send all the
 queries, wait for everything to come back in whatever order, and sort the
 result.

 I would rather keep my app code simple.

 But the real point is that you should benchmark in your own environment.

 ml


 On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Yes, I am using the CQL datastax drivers.
 It was a good advice, thanks a lot Janathan.
 []s


 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 The only case in which it might be better to use an IN clause is if
 the entire query can be satisfied from that machine.  Otherwise, go
 async.

 The native driver reuses connections and intelligently manages the
 pool for you.  It can also multiplex queries over a single connection.

 I am assuming you're using one of the datastax drivers for CQL, btw.

 Jon

 On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  This is interesting, I didn't know that!
  It might make sense then to use select = + async + token aware, I
 will try
  to change my code.
 
  But would it be a recomended solution for these cases? Any other
 options?
 
  I still would if this is the right use case for Cassandra, to look
 for
  random keys in a huge cluster. After all, the amount of connections
 to
  Cassandra will still be huge, right... Wouldn't it be a problem?
  Or when you use async the driver reuses the connection?
 
  []s
 
 
  2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
 
  If you use async and your driver is token aware, it will go to the
  proper node, rather than requiring the coordinator to do so.
 
  Realistically you're going to have a connection open to every server
  anyways.  It's the difference between you querying for the data
  directly and using a coordinator as a proxy.  It's faster to just
 ask
  the node with the data.
 
  On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
  marc...@s1mbi0se.com.br wrote:
   But using async queries wouldn't be even worse than using SELECT
 IN?
   The justification in the docs is I could query many nodes, but I
 would
   still
   do it.
  
   Today, I use both async queries AND SELECT IN:
  
   SELECT_ENTITY_LOOKUP = SELECT entity_id FROM  + ENTITY_LOOKUP +
 
   WHERE
   name=%s and value in(%s)
  
   for name, values in identifiers.items():
  query = self.SELECT_ENTITY_LOOKUP % ('%s',
   ','.join(['%s']*len(values)))
  args = [name] + values
  query_msg = query % tuple(args)
  futures.append((query_msg, self.session.execute_async(query,
 args)))
  
   for query_msg, future in futures:
  try:
 rows = future.result(timeout=10)
 for row in rows:
   entity_ids.add(row.entity_id)
  except:
 logging.error(Query '%s' returned ERROR  % (query_msg))
 raise
  
   Using async just with select = would mean instead of 1 async query
   (example:
   in (0, 1, 2)), I would do several, one for each value of values
 array
   above.
   In my head, this would mean more 

Re: Best way to do a multi_get using CQL

2014-06-20 Thread Marcelo Elias Del Valle
I am using python + CQL Driver.
I wonder how they do...
These things seems little important, but they are fundamental to get a good
performance in Cassandra...
I wish there was a simpler way to query in batches. Opening a large amount
of connections and sending 1 message at a time seems bad to me, as
sometimes you want to work with small rows.
It's no surprise Cassandra performs better when we use average row sizes.
But honestly I disagree with this part of Cassandra/Driver's design.
[]s


2014-06-20 14:37 GMT-03:00 Jeremy Jongsma jer...@barchart.com:

 That depends on the connection pooling implementation in your driver.
 Astyanax will keep N connections open to each node (configurable) and route
 each query in a separate message over an existing connection, waiting until
 one becomes available if all are in use.


 On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 A question, not sure if you guys know the answer:
 Supose I async query 1000 rows using token aware and suppose I have 10
 nodes. Suppose also each node would receive 100 row queries each.
 How does async work in this case? Would it send each row query to each
 node in a different connection? Different message?
 I guess if there was a way to use batch with async, once you commit the
 batch for the 1000 queries, it would create 1 connection to each host and
 query 100 rows in a single message to each host.
 This would decrease resource usage, am I wrong?

 []s


 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma jer...@barchart.com:

 I've found that if you have any amount of latency between your client and
 nodes, and you are executing a large batch of queries, you'll usually want
 to send them together to one node unless execution time is of no concern.
 The tradeoff is resource usage on the connected node vs. time to complete
 all the queries, because you'll need fewer client - node network round
 trips.

 With large numbers of queries you will still want to make sure you split
 them into manageable batches before sending them, to control memory usage
 on the executing node. I've been limiting queries to batches of 100 keys in
 scenarios like this.


 On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael 
 michael.la...@nytimes.com wrote:

 However my extensive benchmarking this week of the python driver from
 master shows a performance *decrease* when using 'token_aware'.

 This is on 12-node, 2-datacenter, RF-3 cluster in AWS.

 Also why do the work the coordinator will do for you: send all the
 queries, wait for everything to come back in whatever order, and sort the
 result.

 I would rather keep my app code simple.

 But the real point is that you should benchmark in your own environment.

 ml


 On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Yes, I am using the CQL datastax drivers.
 It was a good advice, thanks a lot Janathan.
 []s


 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 The only case in which it might be better to use an IN clause is if
 the entire query can be satisfied from that machine.  Otherwise, go
 async.

 The native driver reuses connections and intelligently manages the
 pool for you.  It can also multiplex queries over a single connection.

 I am assuming you're using one of the datastax drivers for CQL, btw.

 Jon

 On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  This is interesting, I didn't know that!
  It might make sense then to use select = + async + token aware, I
 will try
  to change my code.
 
  But would it be a recomended solution for these cases? Any other
 options?
 
  I still would if this is the right use case for Cassandra, to look
 for
  random keys in a huge cluster. After all, the amount of connections
 to
  Cassandra will still be huge, right... Wouldn't it be a problem?
  Or when you use async the driver reuses the connection?
 
  []s
 
 
  2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
 
  If you use async and your driver is token aware, it will go to the
  proper node, rather than requiring the coordinator to do so.
 
  Realistically you're going to have a connection open to every
 server
  anyways.  It's the difference between you querying for the data
  directly and using a coordinator as a proxy.  It's faster to just
 ask
  the node with the data.
 
  On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
  marc...@s1mbi0se.com.br wrote:
   But using async queries wouldn't be even worse than using SELECT
 IN?
   The justification in the docs is I could query many nodes, but I
 would
   still
   do it.
  
   Today, I use both async queries AND SELECT IN:
  
   SELECT_ENTITY_LOOKUP = SELECT entity_id FROM  + ENTITY_LOOKUP
 + 
   WHERE
   name=%s and value in(%s)
  
   for name, values in identifiers.items():
  query = self.SELECT_ENTITY_LOOKUP % ('%s',
   ','.join(['%s']*len(values)))
  args = [name] + values
  query_msg = 

Re: Best way to do a multi_get using CQL

2014-06-20 Thread DuyHai Doan
Well it's kind of a trade-off.

 Either you send data directly to the primary replica nodes to take
advantage of data-locality using token-aware strategy and the price to pay
is a high number of opened connections from client side.

Or you just batch data to a random node playing the coordinator role to
dispatch requests to the right nodes. The price to pay is then spike load
on 1 node (the coordinator) and intra-cluster bandwdith usage.

 The choice is yours, it has nothing to do with good or bad design.


On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle 
marc...@s1mbi0se.com.br wrote:

 I am using python + CQL Driver.
 I wonder how they do...
 These things seems little important, but they are fundamental to get a
 good performance in Cassandra...
 I wish there was a simpler way to query in batches. Opening a large amount
 of connections and sending 1 message at a time seems bad to me, as
 sometimes you want to work with small rows.
 It's no surprise Cassandra performs better when we use average row sizes.
 But honestly I disagree with this part of Cassandra/Driver's design.
 []s


 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma jer...@barchart.com:

 That depends on the connection pooling implementation in your driver.
 Astyanax will keep N connections open to each node (configurable) and route
 each query in a separate message over an existing connection, waiting until
 one becomes available if all are in use.


 On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 A question, not sure if you guys know the answer:
 Supose I async query 1000 rows using token aware and suppose I have 10
 nodes. Suppose also each node would receive 100 row queries each.
 How does async work in this case? Would it send each row query to each
 node in a different connection? Different message?
 I guess if there was a way to use batch with async, once you commit the
 batch for the 1000 queries, it would create 1 connection to each host and
 query 100 rows in a single message to each host.
 This would decrease resource usage, am I wrong?

 []s


 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma jer...@barchart.com:

 I've found that if you have any amount of latency between your client
 and nodes, and you are executing a large batch of queries, you'll usually
 want to send them together to one node unless execution time is of no
 concern. The tradeoff is resource usage on the connected node vs. time to
 complete all the queries, because you'll need fewer client - node network
 round trips.

 With large numbers of queries you will still want to make sure you
 split them into manageable batches before sending them, to control memory
 usage on the executing node. I've been limiting queries to batches of 100
 keys in scenarios like this.


 On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael 
 michael.la...@nytimes.com wrote:

 However my extensive benchmarking this week of the python driver from
 master shows a performance *decrease* when using 'token_aware'.

 This is on 12-node, 2-datacenter, RF-3 cluster in AWS.

 Also why do the work the coordinator will do for you: send all the
 queries, wait for everything to come back in whatever order, and sort the
 result.

 I would rather keep my app code simple.

 But the real point is that you should benchmark in your own
 environment.

 ml


 On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Yes, I am using the CQL datastax drivers.
 It was a good advice, thanks a lot Janathan.
 []s


 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 The only case in which it might be better to use an IN clause is if
 the entire query can be satisfied from that machine.  Otherwise, go
 async.

 The native driver reuses connections and intelligently manages the
 pool for you.  It can also multiplex queries over a single
 connection.

 I am assuming you're using one of the datastax drivers for CQL, btw.

 Jon

 On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  This is interesting, I didn't know that!
  It might make sense then to use select = + async + token aware, I
 will try
  to change my code.
 
  But would it be a recomended solution for these cases? Any other
 options?
 
  I still would if this is the right use case for Cassandra, to look
 for
  random keys in a huge cluster. After all, the amount of
 connections to
  Cassandra will still be huge, right... Wouldn't it be a problem?
  Or when you use async the driver reuses the connection?
 
  []s
 
 
  2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
 
  If you use async and your driver is token aware, it will go to the
  proper node, rather than requiring the coordinator to do so.
 
  Realistically you're going to have a connection open to every
 server
  anyways.  It's the difference between you querying for the data
  directly and using a coordinator as a proxy.  It's faster to just
 ask
  the node 

Re: Best way to do a multi_get using CQL

2014-06-20 Thread Jeremy Jongsma
There is nothing preventing that in Cassandra, it's just a matter of how
intelligent the driver API is. Submit a feature request to Astyanax or
Datastax driver projects.


On Fri, Jun 20, 2014 at 2:27 PM, Marcelo Elias Del Valle 
marc...@s1mbi0se.com.br wrote:

 The bad design part (just my opinion, no intention to offend) is not allow
 the possibility of sending batches directly to the data nodes, without
 using a coordinator.
 I would choose that option.
 []s


 2014-06-20 16:05 GMT-03:00 DuyHai Doan doanduy...@gmail.com:

 Well it's kind of a trade-off.

  Either you send data directly to the primary replica nodes to take
 advantage of data-locality using token-aware strategy and the price to pay
 is a high number of opened connections from client side.

 Or you just batch data to a random node playing the coordinator role to
 dispatch requests to the right nodes. The price to pay is then spike load
 on 1 node (the coordinator) and intra-cluster bandwdith usage.

  The choice is yours, it has nothing to do with good or bad design.


 On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 I am using python + CQL Driver.
 I wonder how they do...
 These things seems little important, but they are fundamental to get a
 good performance in Cassandra...
 I wish there was a simpler way to query in batches. Opening a large
 amount of connections and sending 1 message at a time seems bad to me, as
 sometimes you want to work with small rows.
 It's no surprise Cassandra performs better when we use average row
 sizes. But honestly I disagree with this part of Cassandra/Driver's design.
 []s


 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma jer...@barchart.com:

 That depends on the connection pooling implementation in your driver.
 Astyanax will keep N connections open to each node (configurable) and route
 each query in a separate message over an existing connection, waiting until
 one becomes available if all are in use.


 On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 A question, not sure if you guys know the answer:
 Supose I async query 1000 rows using token aware and suppose I have 10
 nodes. Suppose also each node would receive 100 row queries each.
 How does async work in this case? Would it send each row query to each
 node in a different connection? Different message?
 I guess if there was a way to use batch with async, once you commit
 the batch for the 1000 queries, it would create 1 connection to each host
 and query 100 rows in a single message to each host.
 This would decrease resource usage, am I wrong?

 []s


 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma jer...@barchart.com:

 I've found that if you have any amount of latency between your client
 and nodes, and you are executing a large batch of queries, you'll usually
 want to send them together to one node unless execution time is of no
 concern. The tradeoff is resource usage on the connected node vs. time to
 complete all the queries, because you'll need fewer client - node 
 network
 round trips.

 With large numbers of queries you will still want to make sure you
 split them into manageable batches before sending them, to control memory
 usage on the executing node. I've been limiting queries to batches of 100
 keys in scenarios like this.


 On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael 
 michael.la...@nytimes.com wrote:

 However my extensive benchmarking this week of the python driver
 from master shows a performance *decrease* when using 'token_aware'.

 This is on 12-node, 2-datacenter, RF-3 cluster in AWS.

 Also why do the work the coordinator will do for you: send all the
 queries, wait for everything to come back in whatever order, and sort 
 the
 result.

 I would rather keep my app code simple.

 But the real point is that you should benchmark in your own
 environment.

 ml


 On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle 
 marc...@s1mbi0se.com.br wrote:

 Yes, I am using the CQL datastax drivers.
 It was a good advice, thanks a lot Janathan.
 []s


 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 The only case in which it might be better to use an IN clause is if
 the entire query can be satisfied from that machine.  Otherwise, go
 async.

 The native driver reuses connections and intelligently manages the
 pool for you.  It can also multiplex queries over a single
 connection.

 I am assuming you're using one of the datastax drivers for CQL,
 btw.

 Jon

 On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  This is interesting, I didn't know that!
  It might make sense then to use select = + async + token aware,
 I will try
  to change my code.
 
  But would it be a recomended solution for these cases? Any
 other options?
 
  I still would if this is the right use case for Cassandra, to
 look for
  random keys in a huge cluster. After all, the amount of
 connections to
  Cassandra will 

Re: Best way to do a multi_get using CQL

2014-06-20 Thread Jonathan Haddad
I forgot to add that each connection can handle multiple simultaneous
queries.  This was part of the original protocol as of C* 1.2:
http://www.datastax.com/dev/blog/binary-protocol

Asynchronous: each connection can handle more than one active request
at the same time. In practice, this means that client libraries will
only need to maintain a relatively low amount of open connections to a
given Cassandra node to achieve good performance. This particularly
matters with Cassandra where a client usually wants to keep connection
to all (or at least a good part of) the nodes of the Cluster and so
having a low number of per-node connections helps scaling to large
clusters.
Technically, this is achieved by giving each messages a stream ID, and
by having responses to a request preserve the request’s stream ID.
Clients can thus send multiple requests with different stream IDs on
the same connection (i.e. without waiting for the response to a
request to send the next one) while still being able to associate each
received response to the right request, even if said responses comes
in a different order than the one in which requests were submitted.
That asynchronicity is of course optional in the sense that a client
library can still choose to use the protocol in a synchronous way if
that is simpler.

On Fri, Jun 20, 2014 at 12:30 PM, Jeremy Jongsma jer...@barchart.com wrote:
 There is nothing preventing that in Cassandra, it's just a matter of how
 intelligent the driver API is. Submit a feature request to Astyanax or
 Datastax driver projects.


 On Fri, Jun 20, 2014 at 2:27 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:

 The bad design part (just my opinion, no intention to offend) is not allow
 the possibility of sending batches directly to the data nodes, without using
 a coordinator.
 I would choose that option.
 []s


 2014-06-20 16:05 GMT-03:00 DuyHai Doan doanduy...@gmail.com:

 Well it's kind of a trade-off.

  Either you send data directly to the primary replica nodes to take
 advantage of data-locality using token-aware strategy and the price to pay
 is a high number of opened connections from client side.

 Or you just batch data to a random node playing the coordinator role to
 dispatch requests to the right nodes. The price to pay is then spike load on
 1 node (the coordinator) and intra-cluster bandwdith usage.

  The choice is yours, it has nothing to do with good or bad design.


 On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:

 I am using python + CQL Driver.
 I wonder how they do...
 These things seems little important, but they are fundamental to get a
 good performance in Cassandra...
 I wish there was a simpler way to query in batches. Opening a large
 amount of connections and sending 1 message at a time seems bad to me, as
 sometimes you want to work with small rows.
 It's no surprise Cassandra performs better when we use average row
 sizes. But honestly I disagree with this part of Cassandra/Driver's design.
 []s


 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma jer...@barchart.com:

 That depends on the connection pooling implementation in your driver.
 Astyanax will keep N connections open to each node (configurable) and 
 route
 each query in a separate message over an existing connection, waiting 
 until
 one becomes available if all are in use.


 On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:

 A question, not sure if you guys know the answer:
 Supose I async query 1000 rows using token aware and suppose I have 10
 nodes. Suppose also each node would receive 100 row queries each.
 How does async work in this case? Would it send each row query to each
 node in a different connection? Different message?
 I guess if there was a way to use batch with async, once you commit
 the batch for the 1000 queries, it would create 1 connection to each host
 and query 100 rows in a single message to each host.
 This would decrease resource usage, am I wrong?

 []s


 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma jer...@barchart.com:

 I've found that if you have any amount of latency between your client
 and nodes, and you are executing a large batch of queries, you'll 
 usually
 want to send them together to one node unless execution time is of no
 concern. The tradeoff is resource usage on the connected node vs. time 
 to
 complete all the queries, because you'll need fewer client - node 
 network
 round trips.

 With large numbers of queries you will still want to make sure you
 split them into manageable batches before sending them, to control 
 memory
 usage on the executing node. I've been limiting queries to batches of 
 100
 keys in scenarios like this.


 On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael
 michael.la...@nytimes.com wrote:

 However my extensive benchmarking this week of the python driver
 from master shows a performance decrease when using 'token_aware'.

 This is on 12-node, 

Re: Best way to do a multi_get using CQL

2014-06-19 Thread Jonathan Haddad
Your other option is to fire off async queries.  It's pretty
straightforward w/ the java or python drivers.

On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
marc...@s1mbi0se.com.br wrote:
 I was taking a look at Cassandra anti-patterns list:

 http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html

 Among then is

 SELECT ... IN or index lookups¶

 SELECT ... IN and index lookups (formerly secondary indexes) should be
 avoided except for specific scenarios. See When not to use IN in SELECT and
 When not to use an index in Indexing in

 CQL for Cassandra 2.0

 And Looking at the SELECT doc, I saw:

 When not to use IN¶

 The recommendations about when not to use an index apply to using IN in the
 WHERE clause. Under most conditions, using IN in the WHERE clause is not
 recommended. Using IN can degrade performance because usually many nodes
 must be queried. For example, in a single, local data center cluster having
 30 nodes, a replication factor of 3, and a consistency level of
 LOCAL_QUORUM, a single key query goes out to two nodes, but if the query
 uses the IN condition, the number of nodes being queried are most likely
 even higher, up to 20 nodes depending on where the keys fall in the token
 range.

 In my system, I have a column family called entity_lookup:

 CREATE KEYSPACE IF NOT EXISTS Identification1
   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
   'DC1' : 3 };
 USE Identification1;

 CREATE TABLE IF NOT EXISTS entity_lookup (
   name varchar,
   value varchar,
   entity_id uuid,
   PRIMARY KEY ((name, value), entity_id));

 And I use the following select to query it:

 SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)

 Is this an anti-pattern?

 If not using SELECT IN, which other way would you recomend for lookups like
 that? I have several values I would like to search in cassandra and they
 might not be in the same particion, as above.

 Is Cassandra the wrong tool for lookups like that?

 Best regards,
 Marcelo Valle.














-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade


Re: Best way to do a multi_get using CQL

2014-06-19 Thread Marcelo Elias Del Valle
But using async queries wouldn't be even worse than using SELECT IN?
The justification in the docs is I could query many nodes, but I would
still do it.

Today, I use both async queries AND SELECT IN:

SELECT_ENTITY_LOOKUP = SELECT entity_id FROM  + ENTITY_LOOKUP +  WHERE
name=%s and value in(%s)

for name, values in identifiers.items():
   query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
   args = [name] + values
   query_msg = query % tuple(args)
   futures.append((query_msg, self.session.execute_async(query, args)))

for query_msg, future in futures:
   try:
  rows = future.result(timeout=10)
  for row in rows:
entity_ids.add(row.entity_id)
   except:
  logging.error(Query '%s' returned ERROR  % (query_msg))
  raise

Using async just with select = would mean instead of 1 async query
(example: in (0, 1, 2)), I would do several, one for each value of values
array above.
In my head, this would mean more connections to Cassandra and the same
amount of work, right? What would be the advantage?

[]s




2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 Your other option is to fire off async queries.  It's pretty
 straightforward w/ the java or python drivers.

 On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  I was taking a look at Cassandra anti-patterns list:
 
 
 http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
 
  Among then is
 
  SELECT ... IN or index lookups¶
 
  SELECT ... IN and index lookups (formerly secondary indexes) should be
  avoided except for specific scenarios. See When not to use IN in SELECT
 and
  When not to use an index in Indexing in
 
  CQL for Cassandra 2.0
 
  And Looking at the SELECT doc, I saw:
 
  When not to use IN¶
 
  The recommendations about when not to use an index apply to using IN in
 the
  WHERE clause. Under most conditions, using IN in the WHERE clause is not
  recommended. Using IN can degrade performance because usually many nodes
  must be queried. For example, in a single, local data center cluster
 having
  30 nodes, a replication factor of 3, and a consistency level of
  LOCAL_QUORUM, a single key query goes out to two nodes, but if the query
  uses the IN condition, the number of nodes being queried are most likely
  even higher, up to 20 nodes depending on where the keys fall in the token
  range.
 
  In my system, I have a column family called entity_lookup:
 
  CREATE KEYSPACE IF NOT EXISTS Identification1
WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
'DC1' : 3 };
  USE Identification1;
 
  CREATE TABLE IF NOT EXISTS entity_lookup (
name varchar,
value varchar,
entity_id uuid,
PRIMARY KEY ((name, value), entity_id));
 
  And I use the following select to query it:
 
  SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
 
  Is this an anti-pattern?
 
  If not using SELECT IN, which other way would you recomend for lookups
 like
  that? I have several values I would like to search in cassandra and they
  might not be in the same particion, as above.
 
  Is Cassandra the wrong tool for lookups like that?
 
  Best regards,
  Marcelo Valle.
 
 
 
 
 
 
 
 
 
 
 



 --
 Jon Haddad
 http://www.rustyrazorblade.com
 skype: rustyrazorblade



Re: Best way to do a multi_get using CQL

2014-06-19 Thread Jonathan Haddad
If you use async and your driver is token aware, it will go to the
proper node, rather than requiring the coordinator to do so.

Realistically you're going to have a connection open to every server
anyways.  It's the difference between you querying for the data
directly and using a coordinator as a proxy.  It's faster to just ask
the node with the data.

On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
marc...@s1mbi0se.com.br wrote:
 But using async queries wouldn't be even worse than using SELECT IN?
 The justification in the docs is I could query many nodes, but I would still
 do it.

 Today, I use both async queries AND SELECT IN:

 SELECT_ENTITY_LOOKUP = SELECT entity_id FROM  + ENTITY_LOOKUP +  WHERE
 name=%s and value in(%s)

 for name, values in identifiers.items():
query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
args = [name] + values
query_msg = query % tuple(args)
futures.append((query_msg, self.session.execute_async(query, args)))

 for query_msg, future in futures:
try:
   rows = future.result(timeout=10)
   for row in rows:
 entity_ids.add(row.entity_id)
except:
   logging.error(Query '%s' returned ERROR  % (query_msg))
   raise

 Using async just with select = would mean instead of 1 async query (example:
 in (0, 1, 2)), I would do several, one for each value of values array
 above.
 In my head, this would mean more connections to Cassandra and the same
 amount of work, right? What would be the advantage?

 []s




 2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 Your other option is to fire off async queries.  It's pretty
 straightforward w/ the java or python drivers.

 On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  I was taking a look at Cassandra anti-patterns list:
 
 
  http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
 
  Among then is
 
  SELECT ... IN or index lookups¶
 
  SELECT ... IN and index lookups (formerly secondary indexes) should be
  avoided except for specific scenarios. See When not to use IN in SELECT
  and
  When not to use an index in Indexing in
 
  CQL for Cassandra 2.0
 
  And Looking at the SELECT doc, I saw:
 
  When not to use IN¶
 
  The recommendations about when not to use an index apply to using IN in
  the
  WHERE clause. Under most conditions, using IN in the WHERE clause is not
  recommended. Using IN can degrade performance because usually many nodes
  must be queried. For example, in a single, local data center cluster
  having
  30 nodes, a replication factor of 3, and a consistency level of
  LOCAL_QUORUM, a single key query goes out to two nodes, but if the query
  uses the IN condition, the number of nodes being queried are most likely
  even higher, up to 20 nodes depending on where the keys fall in the
  token
  range.
 
  In my system, I have a column family called entity_lookup:
 
  CREATE KEYSPACE IF NOT EXISTS Identification1
WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
'DC1' : 3 };
  USE Identification1;
 
  CREATE TABLE IF NOT EXISTS entity_lookup (
name varchar,
value varchar,
entity_id uuid,
PRIMARY KEY ((name, value), entity_id));
 
  And I use the following select to query it:
 
  SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
 
  Is this an anti-pattern?
 
  If not using SELECT IN, which other way would you recomend for lookups
  like
  that? I have several values I would like to search in cassandra and they
  might not be in the same particion, as above.
 
  Is Cassandra the wrong tool for lookups like that?
 
  Best regards,
  Marcelo Valle.
 
 
 
 
 
 
 
 
 
 
 



 --
 Jon Haddad
 http://www.rustyrazorblade.com
 skype: rustyrazorblade





-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade


Re: Best way to do a multi_get using CQL

2014-06-19 Thread Marcelo Elias Del Valle
This is interesting, I didn't know that!
It might make sense then to use select = + async + token aware, I will try
to change my code.

But would it be a recomended solution for these cases? Any other options?

I still would if this is the right use case for Cassandra, to look for
random keys in a huge cluster. After all, the amount of connections to
Cassandra will still be huge, right... Wouldn't it be a problem?
Or when you use async the driver reuses the connection?

[]s


2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 If you use async and your driver is token aware, it will go to the
 proper node, rather than requiring the coordinator to do so.

 Realistically you're going to have a connection open to every server
 anyways.  It's the difference between you querying for the data
 directly and using a coordinator as a proxy.  It's faster to just ask
 the node with the data.

 On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  But using async queries wouldn't be even worse than using SELECT IN?
  The justification in the docs is I could query many nodes, but I would
 still
  do it.
 
  Today, I use both async queries AND SELECT IN:
 
  SELECT_ENTITY_LOOKUP = SELECT entity_id FROM  + ENTITY_LOOKUP +  WHERE
  name=%s and value in(%s)
 
  for name, values in identifiers.items():
 query = self.SELECT_ENTITY_LOOKUP % ('%s',
 ','.join(['%s']*len(values)))
 args = [name] + values
 query_msg = query % tuple(args)
 futures.append((query_msg, self.session.execute_async(query, args)))
 
  for query_msg, future in futures:
 try:
rows = future.result(timeout=10)
for row in rows:
  entity_ids.add(row.entity_id)
 except:
logging.error(Query '%s' returned ERROR  % (query_msg))
raise
 
  Using async just with select = would mean instead of 1 async query
 (example:
  in (0, 1, 2)), I would do several, one for each value of values array
  above.
  In my head, this would mean more connections to Cassandra and the same
  amount of work, right? What would be the advantage?
 
  []s
 
 
 
 
  2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
 
  Your other option is to fire off async queries.  It's pretty
  straightforward w/ the java or python drivers.
 
  On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
  marc...@s1mbi0se.com.br wrote:
   I was taking a look at Cassandra anti-patterns list:
  
  
  
 http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
  
   Among then is
  
   SELECT ... IN or index lookups¶
  
   SELECT ... IN and index lookups (formerly secondary indexes) should be
   avoided except for specific scenarios. See When not to use IN in
 SELECT
   and
   When not to use an index in Indexing in
  
   CQL for Cassandra 2.0
  
   And Looking at the SELECT doc, I saw:
  
   When not to use IN¶
  
   The recommendations about when not to use an index apply to using IN
 in
   the
   WHERE clause. Under most conditions, using IN in the WHERE clause is
 not
   recommended. Using IN can degrade performance because usually many
 nodes
   must be queried. For example, in a single, local data center cluster
   having
   30 nodes, a replication factor of 3, and a consistency level of
   LOCAL_QUORUM, a single key query goes out to two nodes, but if the
 query
   uses the IN condition, the number of nodes being queried are most
 likely
   even higher, up to 20 nodes depending on where the keys fall in the
   token
   range.
  
   In my system, I have a column family called entity_lookup:
  
   CREATE KEYSPACE IF NOT EXISTS Identification1
 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
 'DC1' : 3 };
   USE Identification1;
  
   CREATE TABLE IF NOT EXISTS entity_lookup (
 name varchar,
 value varchar,
 entity_id uuid,
 PRIMARY KEY ((name, value), entity_id));
  
   And I use the following select to query it:
  
   SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
  
   Is this an anti-pattern?
  
   If not using SELECT IN, which other way would you recomend for lookups
   like
   that? I have several values I would like to search in cassandra and
 they
   might not be in the same particion, as above.
  
   Is Cassandra the wrong tool for lookups like that?
  
   Best regards,
   Marcelo Valle.
  
  
  
  
  
  
  
  
  
  
  
 
 
 
  --
  Jon Haddad
  http://www.rustyrazorblade.com
  skype: rustyrazorblade
 
 



 --
 Jon Haddad
 http://www.rustyrazorblade.com
 skype: rustyrazorblade



Re: Best way to do a multi_get using CQL

2014-06-19 Thread Jonathan Haddad
The only case in which it might be better to use an IN clause is if
the entire query can be satisfied from that machine.  Otherwise, go
async.

The native driver reuses connections and intelligently manages the
pool for you.  It can also multiplex queries over a single connection.

I am assuming you're using one of the datastax drivers for CQL, btw.

Jon

On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
marc...@s1mbi0se.com.br wrote:
 This is interesting, I didn't know that!
 It might make sense then to use select = + async + token aware, I will try
 to change my code.

 But would it be a recomended solution for these cases? Any other options?

 I still would if this is the right use case for Cassandra, to look for
 random keys in a huge cluster. After all, the amount of connections to
 Cassandra will still be huge, right... Wouldn't it be a problem?
 Or when you use async the driver reuses the connection?

 []s


 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:

 If you use async and your driver is token aware, it will go to the
 proper node, rather than requiring the coordinator to do so.

 Realistically you're going to have a connection open to every server
 anyways.  It's the difference between you querying for the data
 directly and using a coordinator as a proxy.  It's faster to just ask
 the node with the data.

 On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
 marc...@s1mbi0se.com.br wrote:
  But using async queries wouldn't be even worse than using SELECT IN?
  The justification in the docs is I could query many nodes, but I would
  still
  do it.
 
  Today, I use both async queries AND SELECT IN:
 
  SELECT_ENTITY_LOOKUP = SELECT entity_id FROM  + ENTITY_LOOKUP + 
  WHERE
  name=%s and value in(%s)
 
  for name, values in identifiers.items():
 query = self.SELECT_ENTITY_LOOKUP % ('%s',
  ','.join(['%s']*len(values)))
 args = [name] + values
 query_msg = query % tuple(args)
 futures.append((query_msg, self.session.execute_async(query, args)))
 
  for query_msg, future in futures:
 try:
rows = future.result(timeout=10)
for row in rows:
  entity_ids.add(row.entity_id)
 except:
logging.error(Query '%s' returned ERROR  % (query_msg))
raise
 
  Using async just with select = would mean instead of 1 async query
  (example:
  in (0, 1, 2)), I would do several, one for each value of values array
  above.
  In my head, this would mean more connections to Cassandra and the same
  amount of work, right? What would be the advantage?
 
  []s
 
 
 
 
  2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:
 
  Your other option is to fire off async queries.  It's pretty
  straightforward w/ the java or python drivers.
 
  On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
  marc...@s1mbi0se.com.br wrote:
   I was taking a look at Cassandra anti-patterns list:
  
  
  
   http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
  
   Among then is
  
   SELECT ... IN or index lookups¶
  
   SELECT ... IN and index lookups (formerly secondary indexes) should
   be
   avoided except for specific scenarios. See When not to use IN in
   SELECT
   and
   When not to use an index in Indexing in
  
   CQL for Cassandra 2.0
  
   And Looking at the SELECT doc, I saw:
  
   When not to use IN¶
  
   The recommendations about when not to use an index apply to using IN
   in
   the
   WHERE clause. Under most conditions, using IN in the WHERE clause is
   not
   recommended. Using IN can degrade performance because usually many
   nodes
   must be queried. For example, in a single, local data center cluster
   having
   30 nodes, a replication factor of 3, and a consistency level of
   LOCAL_QUORUM, a single key query goes out to two nodes, but if the
   query
   uses the IN condition, the number of nodes being queried are most
   likely
   even higher, up to 20 nodes depending on where the keys fall in the
   token
   range.
  
   In my system, I have a column family called entity_lookup:
  
   CREATE KEYSPACE IF NOT EXISTS Identification1
 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
 'DC1' : 3 };
   USE Identification1;
  
   CREATE TABLE IF NOT EXISTS entity_lookup (
 name varchar,
 value varchar,
 entity_id uuid,
 PRIMARY KEY ((name, value), entity_id));
  
   And I use the following select to query it:
  
   SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
  
   Is this an anti-pattern?
  
   If not using SELECT IN, which other way would you recomend for
   lookups
   like
   that? I have several values I would like to search in cassandra and
   they
   might not be in the same particion, as above.
  
   Is Cassandra the wrong tool for lookups like that?
  
   Best regards,
   Marcelo Valle.
  
  
  
  
  
  
  
  
  
  
  
 
 
 
  --
  Jon Haddad
  http://www.rustyrazorblade.com
  skype: