Re: Best way to do a multi_get using CQL
Yes, I am using the CQL datastax drivers. It was a good advice, thanks a lot Janathan. []s 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: The only case in which it might be better to use an IN clause is if the entire query can be satisfied from that machine. Otherwise, go async. The native driver reuses connections and intelligently manages the pool for you. It can also multiplex queries over a single connection. I am assuming you're using one of the datastax drivers for CQL, btw. Jon On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: This is interesting, I didn't know that! It might make sense then to use select = + async + token aware, I will try to change my code. But would it be a recomended solution for these cases? Any other options? I still would if this is the right use case for Cassandra, to look for random keys in a huge cluster. After all, the amount of connections to Cassandra will still be huge, right... Wouldn't it be a problem? Or when you use async the driver reuses the connection? []s 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node with the data. On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: But using async queries wouldn't be even worse than using SELECT IN? The justification in the docs is I could query many nodes, but I would still do it. Today, I use both async queries AND SELECT IN: SELECT_ENTITY_LOOKUP = SELECT entity_id FROM + ENTITY_LOOKUP + WHERE name=%s and value in(%s) for name, values in identifiers.items(): query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values))) args = [name] + values query_msg = query % tuple(args) futures.append((query_msg, self.session.execute_async(query, args))) for query_msg, future in futures: try: rows = future.result(timeout=10) for row in rows: entity_ids.add(row.entity_id) except: logging.error(Query '%s' returned ERROR % (query_msg)) raise Using async just with select = would mean instead of 1 async query (example: in (0, 1, 2)), I would do several, one for each value of values array above. In my head, this would mean more connections to Cassandra and the same amount of work, right? What would be the advantage? []s 2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: Your other option is to fire off async queries. It's pretty straightforward w/ the java or python drivers. On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I was taking a look at Cassandra anti-patterns list: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html Among then is SELECT ... IN or index lookups¶ SELECT ... IN and index lookups (formerly secondary indexes) should be avoided except for specific scenarios. See When not to use IN in SELECT and When not to use an index in Indexing in CQL for Cassandra 2.0 And Looking at the SELECT doc, I saw: When not to use IN¶ The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster having 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range. In my system, I have a column family called entity_lookup: CREATE KEYSPACE IF NOT EXISTS Identification1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 }; USE Identification1; CREATE TABLE IF NOT EXISTS entity_lookup ( name varchar, value varchar, entity_id uuid, PRIMARY KEY ((name, value), entity_id)); And I use the following select to query it: SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s) Is this an anti-pattern? If not using SELECT IN, which other way would you recomend for lookups like that? I have
Re: Best way to do a multi_get using CQL
However my extensive benchmarking this week of the python driver from master shows a performance *decrease* when using 'token_aware'. This is on 12-node, 2-datacenter, RF-3 cluster in AWS. Also why do the work the coordinator will do for you: send all the queries, wait for everything to come back in whatever order, and sort the result. I would rather keep my app code simple. But the real point is that you should benchmark in your own environment. ml On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Yes, I am using the CQL datastax drivers. It was a good advice, thanks a lot Janathan. []s 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: The only case in which it might be better to use an IN clause is if the entire query can be satisfied from that machine. Otherwise, go async. The native driver reuses connections and intelligently manages the pool for you. It can also multiplex queries over a single connection. I am assuming you're using one of the datastax drivers for CQL, btw. Jon On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: This is interesting, I didn't know that! It might make sense then to use select = + async + token aware, I will try to change my code. But would it be a recomended solution for these cases? Any other options? I still would if this is the right use case for Cassandra, to look for random keys in a huge cluster. After all, the amount of connections to Cassandra will still be huge, right... Wouldn't it be a problem? Or when you use async the driver reuses the connection? []s 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node with the data. On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: But using async queries wouldn't be even worse than using SELECT IN? The justification in the docs is I could query many nodes, but I would still do it. Today, I use both async queries AND SELECT IN: SELECT_ENTITY_LOOKUP = SELECT entity_id FROM + ENTITY_LOOKUP + WHERE name=%s and value in(%s) for name, values in identifiers.items(): query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values))) args = [name] + values query_msg = query % tuple(args) futures.append((query_msg, self.session.execute_async(query, args))) for query_msg, future in futures: try: rows = future.result(timeout=10) for row in rows: entity_ids.add(row.entity_id) except: logging.error(Query '%s' returned ERROR % (query_msg)) raise Using async just with select = would mean instead of 1 async query (example: in (0, 1, 2)), I would do several, one for each value of values array above. In my head, this would mean more connections to Cassandra and the same amount of work, right? What would be the advantage? []s 2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: Your other option is to fire off async queries. It's pretty straightforward w/ the java or python drivers. On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I was taking a look at Cassandra anti-patterns list: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html Among then is SELECT ... IN or index lookups¶ SELECT ... IN and index lookups (formerly secondary indexes) should be avoided except for specific scenarios. See When not to use IN in SELECT and When not to use an index in Indexing in CQL for Cassandra 2.0 And Looking at the SELECT doc, I saw: When not to use IN¶ The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster having 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range. In my system, I have a column family called entity_lookup: CREATE KEYSPACE IF NOT
Re: Best way to do a multi_get using CQL
I've found that if you have any amount of latency between your client and nodes, and you are executing a large batch of queries, you'll usually want to send them together to one node unless execution time is of no concern. The tradeoff is resource usage on the connected node vs. time to complete all the queries, because you'll need fewer client - node network round trips. With large numbers of queries you will still want to make sure you split them into manageable batches before sending them, to control memory usage on the executing node. I've been limiting queries to batches of 100 keys in scenarios like this. On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael michael.la...@nytimes.com wrote: However my extensive benchmarking this week of the python driver from master shows a performance *decrease* when using 'token_aware'. This is on 12-node, 2-datacenter, RF-3 cluster in AWS. Also why do the work the coordinator will do for you: send all the queries, wait for everything to come back in whatever order, and sort the result. I would rather keep my app code simple. But the real point is that you should benchmark in your own environment. ml On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Yes, I am using the CQL datastax drivers. It was a good advice, thanks a lot Janathan. []s 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: The only case in which it might be better to use an IN clause is if the entire query can be satisfied from that machine. Otherwise, go async. The native driver reuses connections and intelligently manages the pool for you. It can also multiplex queries over a single connection. I am assuming you're using one of the datastax drivers for CQL, btw. Jon On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: This is interesting, I didn't know that! It might make sense then to use select = + async + token aware, I will try to change my code. But would it be a recomended solution for these cases? Any other options? I still would if this is the right use case for Cassandra, to look for random keys in a huge cluster. After all, the amount of connections to Cassandra will still be huge, right... Wouldn't it be a problem? Or when you use async the driver reuses the connection? []s 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node with the data. On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: But using async queries wouldn't be even worse than using SELECT IN? The justification in the docs is I could query many nodes, but I would still do it. Today, I use both async queries AND SELECT IN: SELECT_ENTITY_LOOKUP = SELECT entity_id FROM + ENTITY_LOOKUP + WHERE name=%s and value in(%s) for name, values in identifiers.items(): query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values))) args = [name] + values query_msg = query % tuple(args) futures.append((query_msg, self.session.execute_async(query, args))) for query_msg, future in futures: try: rows = future.result(timeout=10) for row in rows: entity_ids.add(row.entity_id) except: logging.error(Query '%s' returned ERROR % (query_msg)) raise Using async just with select = would mean instead of 1 async query (example: in (0, 1, 2)), I would do several, one for each value of values array above. In my head, this would mean more connections to Cassandra and the same amount of work, right? What would be the advantage? []s 2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: Your other option is to fire off async queries. It's pretty straightforward w/ the java or python drivers. On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I was taking a look at Cassandra anti-patterns list: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html Among then is SELECT ... IN or index lookups¶ SELECT ... IN and index lookups (formerly secondary indexes) should be avoided except for specific scenarios. See When not to use IN in SELECT and When not to use an index in Indexing in CQL for Cassandra 2.0 And Looking at the SELECT doc, I saw: When not to use IN¶ The recommendations about when not to use
Re: Best way to do a multi_get using CQL
A question, not sure if you guys know the answer: Supose I async query 1000 rows using token aware and suppose I have 10 nodes. Suppose also each node would receive 100 row queries each. How does async work in this case? Would it send each row query to each node in a different connection? Different message? I guess if there was a way to use batch with async, once you commit the batch for the 1000 queries, it would create 1 connection to each host and query 100 rows in a single message to each host. This would decrease resource usage, am I wrong? []s 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma jer...@barchart.com: I've found that if you have any amount of latency between your client and nodes, and you are executing a large batch of queries, you'll usually want to send them together to one node unless execution time is of no concern. The tradeoff is resource usage on the connected node vs. time to complete all the queries, because you'll need fewer client - node network round trips. With large numbers of queries you will still want to make sure you split them into manageable batches before sending them, to control memory usage on the executing node. I've been limiting queries to batches of 100 keys in scenarios like this. On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael michael.la...@nytimes.com wrote: However my extensive benchmarking this week of the python driver from master shows a performance *decrease* when using 'token_aware'. This is on 12-node, 2-datacenter, RF-3 cluster in AWS. Also why do the work the coordinator will do for you: send all the queries, wait for everything to come back in whatever order, and sort the result. I would rather keep my app code simple. But the real point is that you should benchmark in your own environment. ml On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Yes, I am using the CQL datastax drivers. It was a good advice, thanks a lot Janathan. []s 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: The only case in which it might be better to use an IN clause is if the entire query can be satisfied from that machine. Otherwise, go async. The native driver reuses connections and intelligently manages the pool for you. It can also multiplex queries over a single connection. I am assuming you're using one of the datastax drivers for CQL, btw. Jon On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: This is interesting, I didn't know that! It might make sense then to use select = + async + token aware, I will try to change my code. But would it be a recomended solution for these cases? Any other options? I still would if this is the right use case for Cassandra, to look for random keys in a huge cluster. After all, the amount of connections to Cassandra will still be huge, right... Wouldn't it be a problem? Or when you use async the driver reuses the connection? []s 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node with the data. On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: But using async queries wouldn't be even worse than using SELECT IN? The justification in the docs is I could query many nodes, but I would still do it. Today, I use both async queries AND SELECT IN: SELECT_ENTITY_LOOKUP = SELECT entity_id FROM + ENTITY_LOOKUP + WHERE name=%s and value in(%s) for name, values in identifiers.items(): query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values))) args = [name] + values query_msg = query % tuple(args) futures.append((query_msg, self.session.execute_async(query, args))) for query_msg, future in futures: try: rows = future.result(timeout=10) for row in rows: entity_ids.add(row.entity_id) except: logging.error(Query '%s' returned ERROR % (query_msg)) raise Using async just with select = would mean instead of 1 async query (example: in (0, 1, 2)), I would do several, one for each value of values array above. In my head, this would mean more connections to Cassandra and the same amount of work, right? What would be the advantage? []s 2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: Your other option is to fire off async queries. It's pretty straightforward w/ the java or python drivers. On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
Re: Best way to do a multi_get using CQL
That depends on the connection pooling implementation in your driver. Astyanax will keep N connections open to each node (configurable) and route each query in a separate message over an existing connection, waiting until one becomes available if all are in use. On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: A question, not sure if you guys know the answer: Supose I async query 1000 rows using token aware and suppose I have 10 nodes. Suppose also each node would receive 100 row queries each. How does async work in this case? Would it send each row query to each node in a different connection? Different message? I guess if there was a way to use batch with async, once you commit the batch for the 1000 queries, it would create 1 connection to each host and query 100 rows in a single message to each host. This would decrease resource usage, am I wrong? []s 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma jer...@barchart.com: I've found that if you have any amount of latency between your client and nodes, and you are executing a large batch of queries, you'll usually want to send them together to one node unless execution time is of no concern. The tradeoff is resource usage on the connected node vs. time to complete all the queries, because you'll need fewer client - node network round trips. With large numbers of queries you will still want to make sure you split them into manageable batches before sending them, to control memory usage on the executing node. I've been limiting queries to batches of 100 keys in scenarios like this. On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael michael.la...@nytimes.com wrote: However my extensive benchmarking this week of the python driver from master shows a performance *decrease* when using 'token_aware'. This is on 12-node, 2-datacenter, RF-3 cluster in AWS. Also why do the work the coordinator will do for you: send all the queries, wait for everything to come back in whatever order, and sort the result. I would rather keep my app code simple. But the real point is that you should benchmark in your own environment. ml On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Yes, I am using the CQL datastax drivers. It was a good advice, thanks a lot Janathan. []s 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: The only case in which it might be better to use an IN clause is if the entire query can be satisfied from that machine. Otherwise, go async. The native driver reuses connections and intelligently manages the pool for you. It can also multiplex queries over a single connection. I am assuming you're using one of the datastax drivers for CQL, btw. Jon On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: This is interesting, I didn't know that! It might make sense then to use select = + async + token aware, I will try to change my code. But would it be a recomended solution for these cases? Any other options? I still would if this is the right use case for Cassandra, to look for random keys in a huge cluster. After all, the amount of connections to Cassandra will still be huge, right... Wouldn't it be a problem? Or when you use async the driver reuses the connection? []s 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node with the data. On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: But using async queries wouldn't be even worse than using SELECT IN? The justification in the docs is I could query many nodes, but I would still do it. Today, I use both async queries AND SELECT IN: SELECT_ENTITY_LOOKUP = SELECT entity_id FROM + ENTITY_LOOKUP + WHERE name=%s and value in(%s) for name, values in identifiers.items(): query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values))) args = [name] + values query_msg = query % tuple(args) futures.append((query_msg, self.session.execute_async(query, args))) for query_msg, future in futures: try: rows = future.result(timeout=10) for row in rows: entity_ids.add(row.entity_id) except: logging.error(Query '%s' returned ERROR % (query_msg)) raise Using async just with select = would mean instead of 1 async query (example: in (0, 1, 2)), I would do several, one for each value of values array above. In my head, this would mean more
Re: Best way to do a multi_get using CQL
I am using python + CQL Driver. I wonder how they do... These things seems little important, but they are fundamental to get a good performance in Cassandra... I wish there was a simpler way to query in batches. Opening a large amount of connections and sending 1 message at a time seems bad to me, as sometimes you want to work with small rows. It's no surprise Cassandra performs better when we use average row sizes. But honestly I disagree with this part of Cassandra/Driver's design. []s 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma jer...@barchart.com: That depends on the connection pooling implementation in your driver. Astyanax will keep N connections open to each node (configurable) and route each query in a separate message over an existing connection, waiting until one becomes available if all are in use. On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: A question, not sure if you guys know the answer: Supose I async query 1000 rows using token aware and suppose I have 10 nodes. Suppose also each node would receive 100 row queries each. How does async work in this case? Would it send each row query to each node in a different connection? Different message? I guess if there was a way to use batch with async, once you commit the batch for the 1000 queries, it would create 1 connection to each host and query 100 rows in a single message to each host. This would decrease resource usage, am I wrong? []s 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma jer...@barchart.com: I've found that if you have any amount of latency between your client and nodes, and you are executing a large batch of queries, you'll usually want to send them together to one node unless execution time is of no concern. The tradeoff is resource usage on the connected node vs. time to complete all the queries, because you'll need fewer client - node network round trips. With large numbers of queries you will still want to make sure you split them into manageable batches before sending them, to control memory usage on the executing node. I've been limiting queries to batches of 100 keys in scenarios like this. On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael michael.la...@nytimes.com wrote: However my extensive benchmarking this week of the python driver from master shows a performance *decrease* when using 'token_aware'. This is on 12-node, 2-datacenter, RF-3 cluster in AWS. Also why do the work the coordinator will do for you: send all the queries, wait for everything to come back in whatever order, and sort the result. I would rather keep my app code simple. But the real point is that you should benchmark in your own environment. ml On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Yes, I am using the CQL datastax drivers. It was a good advice, thanks a lot Janathan. []s 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: The only case in which it might be better to use an IN clause is if the entire query can be satisfied from that machine. Otherwise, go async. The native driver reuses connections and intelligently manages the pool for you. It can also multiplex queries over a single connection. I am assuming you're using one of the datastax drivers for CQL, btw. Jon On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: This is interesting, I didn't know that! It might make sense then to use select = + async + token aware, I will try to change my code. But would it be a recomended solution for these cases? Any other options? I still would if this is the right use case for Cassandra, to look for random keys in a huge cluster. After all, the amount of connections to Cassandra will still be huge, right... Wouldn't it be a problem? Or when you use async the driver reuses the connection? []s 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node with the data. On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: But using async queries wouldn't be even worse than using SELECT IN? The justification in the docs is I could query many nodes, but I would still do it. Today, I use both async queries AND SELECT IN: SELECT_ENTITY_LOOKUP = SELECT entity_id FROM + ENTITY_LOOKUP + WHERE name=%s and value in(%s) for name, values in identifiers.items(): query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values))) args = [name] + values query_msg =
Re: Best way to do a multi_get using CQL
Well it's kind of a trade-off. Either you send data directly to the primary replica nodes to take advantage of data-locality using token-aware strategy and the price to pay is a high number of opened connections from client side. Or you just batch data to a random node playing the coordinator role to dispatch requests to the right nodes. The price to pay is then spike load on 1 node (the coordinator) and intra-cluster bandwdith usage. The choice is yours, it has nothing to do with good or bad design. On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I am using python + CQL Driver. I wonder how they do... These things seems little important, but they are fundamental to get a good performance in Cassandra... I wish there was a simpler way to query in batches. Opening a large amount of connections and sending 1 message at a time seems bad to me, as sometimes you want to work with small rows. It's no surprise Cassandra performs better when we use average row sizes. But honestly I disagree with this part of Cassandra/Driver's design. []s 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma jer...@barchart.com: That depends on the connection pooling implementation in your driver. Astyanax will keep N connections open to each node (configurable) and route each query in a separate message over an existing connection, waiting until one becomes available if all are in use. On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: A question, not sure if you guys know the answer: Supose I async query 1000 rows using token aware and suppose I have 10 nodes. Suppose also each node would receive 100 row queries each. How does async work in this case? Would it send each row query to each node in a different connection? Different message? I guess if there was a way to use batch with async, once you commit the batch for the 1000 queries, it would create 1 connection to each host and query 100 rows in a single message to each host. This would decrease resource usage, am I wrong? []s 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma jer...@barchart.com: I've found that if you have any amount of latency between your client and nodes, and you are executing a large batch of queries, you'll usually want to send them together to one node unless execution time is of no concern. The tradeoff is resource usage on the connected node vs. time to complete all the queries, because you'll need fewer client - node network round trips. With large numbers of queries you will still want to make sure you split them into manageable batches before sending them, to control memory usage on the executing node. I've been limiting queries to batches of 100 keys in scenarios like this. On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael michael.la...@nytimes.com wrote: However my extensive benchmarking this week of the python driver from master shows a performance *decrease* when using 'token_aware'. This is on 12-node, 2-datacenter, RF-3 cluster in AWS. Also why do the work the coordinator will do for you: send all the queries, wait for everything to come back in whatever order, and sort the result. I would rather keep my app code simple. But the real point is that you should benchmark in your own environment. ml On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Yes, I am using the CQL datastax drivers. It was a good advice, thanks a lot Janathan. []s 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: The only case in which it might be better to use an IN clause is if the entire query can be satisfied from that machine. Otherwise, go async. The native driver reuses connections and intelligently manages the pool for you. It can also multiplex queries over a single connection. I am assuming you're using one of the datastax drivers for CQL, btw. Jon On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: This is interesting, I didn't know that! It might make sense then to use select = + async + token aware, I will try to change my code. But would it be a recomended solution for these cases? Any other options? I still would if this is the right use case for Cassandra, to look for random keys in a huge cluster. After all, the amount of connections to Cassandra will still be huge, right... Wouldn't it be a problem? Or when you use async the driver reuses the connection? []s 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node
Re: Best way to do a multi_get using CQL
There is nothing preventing that in Cassandra, it's just a matter of how intelligent the driver API is. Submit a feature request to Astyanax or Datastax driver projects. On Fri, Jun 20, 2014 at 2:27 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: The bad design part (just my opinion, no intention to offend) is not allow the possibility of sending batches directly to the data nodes, without using a coordinator. I would choose that option. []s 2014-06-20 16:05 GMT-03:00 DuyHai Doan doanduy...@gmail.com: Well it's kind of a trade-off. Either you send data directly to the primary replica nodes to take advantage of data-locality using token-aware strategy and the price to pay is a high number of opened connections from client side. Or you just batch data to a random node playing the coordinator role to dispatch requests to the right nodes. The price to pay is then spike load on 1 node (the coordinator) and intra-cluster bandwdith usage. The choice is yours, it has nothing to do with good or bad design. On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I am using python + CQL Driver. I wonder how they do... These things seems little important, but they are fundamental to get a good performance in Cassandra... I wish there was a simpler way to query in batches. Opening a large amount of connections and sending 1 message at a time seems bad to me, as sometimes you want to work with small rows. It's no surprise Cassandra performs better when we use average row sizes. But honestly I disagree with this part of Cassandra/Driver's design. []s 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma jer...@barchart.com: That depends on the connection pooling implementation in your driver. Astyanax will keep N connections open to each node (configurable) and route each query in a separate message over an existing connection, waiting until one becomes available if all are in use. On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: A question, not sure if you guys know the answer: Supose I async query 1000 rows using token aware and suppose I have 10 nodes. Suppose also each node would receive 100 row queries each. How does async work in this case? Would it send each row query to each node in a different connection? Different message? I guess if there was a way to use batch with async, once you commit the batch for the 1000 queries, it would create 1 connection to each host and query 100 rows in a single message to each host. This would decrease resource usage, am I wrong? []s 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma jer...@barchart.com: I've found that if you have any amount of latency between your client and nodes, and you are executing a large batch of queries, you'll usually want to send them together to one node unless execution time is of no concern. The tradeoff is resource usage on the connected node vs. time to complete all the queries, because you'll need fewer client - node network round trips. With large numbers of queries you will still want to make sure you split them into manageable batches before sending them, to control memory usage on the executing node. I've been limiting queries to batches of 100 keys in scenarios like this. On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael michael.la...@nytimes.com wrote: However my extensive benchmarking this week of the python driver from master shows a performance *decrease* when using 'token_aware'. This is on 12-node, 2-datacenter, RF-3 cluster in AWS. Also why do the work the coordinator will do for you: send all the queries, wait for everything to come back in whatever order, and sort the result. I would rather keep my app code simple. But the real point is that you should benchmark in your own environment. ml On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Yes, I am using the CQL datastax drivers. It was a good advice, thanks a lot Janathan. []s 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: The only case in which it might be better to use an IN clause is if the entire query can be satisfied from that machine. Otherwise, go async. The native driver reuses connections and intelligently manages the pool for you. It can also multiplex queries over a single connection. I am assuming you're using one of the datastax drivers for CQL, btw. Jon On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: This is interesting, I didn't know that! It might make sense then to use select = + async + token aware, I will try to change my code. But would it be a recomended solution for these cases? Any other options? I still would if this is the right use case for Cassandra, to look for random keys in a huge cluster. After all, the amount of connections to Cassandra will
Re: Best way to do a multi_get using CQL
I forgot to add that each connection can handle multiple simultaneous queries. This was part of the original protocol as of C* 1.2: http://www.datastax.com/dev/blog/binary-protocol Asynchronous: each connection can handle more than one active request at the same time. In practice, this means that client libraries will only need to maintain a relatively low amount of open connections to a given Cassandra node to achieve good performance. This particularly matters with Cassandra where a client usually wants to keep connection to all (or at least a good part of) the nodes of the Cluster and so having a low number of per-node connections helps scaling to large clusters. Technically, this is achieved by giving each messages a stream ID, and by having responses to a request preserve the request’s stream ID. Clients can thus send multiple requests with different stream IDs on the same connection (i.e. without waiting for the response to a request to send the next one) while still being able to associate each received response to the right request, even if said responses comes in a different order than the one in which requests were submitted. That asynchronicity is of course optional in the sense that a client library can still choose to use the protocol in a synchronous way if that is simpler. On Fri, Jun 20, 2014 at 12:30 PM, Jeremy Jongsma jer...@barchart.com wrote: There is nothing preventing that in Cassandra, it's just a matter of how intelligent the driver API is. Submit a feature request to Astyanax or Datastax driver projects. On Fri, Jun 20, 2014 at 2:27 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: The bad design part (just my opinion, no intention to offend) is not allow the possibility of sending batches directly to the data nodes, without using a coordinator. I would choose that option. []s 2014-06-20 16:05 GMT-03:00 DuyHai Doan doanduy...@gmail.com: Well it's kind of a trade-off. Either you send data directly to the primary replica nodes to take advantage of data-locality using token-aware strategy and the price to pay is a high number of opened connections from client side. Or you just batch data to a random node playing the coordinator role to dispatch requests to the right nodes. The price to pay is then spike load on 1 node (the coordinator) and intra-cluster bandwdith usage. The choice is yours, it has nothing to do with good or bad design. On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I am using python + CQL Driver. I wonder how they do... These things seems little important, but they are fundamental to get a good performance in Cassandra... I wish there was a simpler way to query in batches. Opening a large amount of connections and sending 1 message at a time seems bad to me, as sometimes you want to work with small rows. It's no surprise Cassandra performs better when we use average row sizes. But honestly I disagree with this part of Cassandra/Driver's design. []s 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma jer...@barchart.com: That depends on the connection pooling implementation in your driver. Astyanax will keep N connections open to each node (configurable) and route each query in a separate message over an existing connection, waiting until one becomes available if all are in use. On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: A question, not sure if you guys know the answer: Supose I async query 1000 rows using token aware and suppose I have 10 nodes. Suppose also each node would receive 100 row queries each. How does async work in this case? Would it send each row query to each node in a different connection? Different message? I guess if there was a way to use batch with async, once you commit the batch for the 1000 queries, it would create 1 connection to each host and query 100 rows in a single message to each host. This would decrease resource usage, am I wrong? []s 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma jer...@barchart.com: I've found that if you have any amount of latency between your client and nodes, and you are executing a large batch of queries, you'll usually want to send them together to one node unless execution time is of no concern. The tradeoff is resource usage on the connected node vs. time to complete all the queries, because you'll need fewer client - node network round trips. With large numbers of queries you will still want to make sure you split them into manageable batches before sending them, to control memory usage on the executing node. I've been limiting queries to batches of 100 keys in scenarios like this. On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael michael.la...@nytimes.com wrote: However my extensive benchmarking this week of the python driver from master shows a performance decrease when using 'token_aware'. This is on 12-node,
Re: Best way to do a multi_get using CQL
Your other option is to fire off async queries. It's pretty straightforward w/ the java or python drivers. On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I was taking a look at Cassandra anti-patterns list: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html Among then is SELECT ... IN or index lookups¶ SELECT ... IN and index lookups (formerly secondary indexes) should be avoided except for specific scenarios. See When not to use IN in SELECT and When not to use an index in Indexing in CQL for Cassandra 2.0 And Looking at the SELECT doc, I saw: When not to use IN¶ The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster having 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range. In my system, I have a column family called entity_lookup: CREATE KEYSPACE IF NOT EXISTS Identification1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 }; USE Identification1; CREATE TABLE IF NOT EXISTS entity_lookup ( name varchar, value varchar, entity_id uuid, PRIMARY KEY ((name, value), entity_id)); And I use the following select to query it: SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s) Is this an anti-pattern? If not using SELECT IN, which other way would you recomend for lookups like that? I have several values I would like to search in cassandra and they might not be in the same particion, as above. Is Cassandra the wrong tool for lookups like that? Best regards, Marcelo Valle. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Best way to do a multi_get using CQL
But using async queries wouldn't be even worse than using SELECT IN? The justification in the docs is I could query many nodes, but I would still do it. Today, I use both async queries AND SELECT IN: SELECT_ENTITY_LOOKUP = SELECT entity_id FROM + ENTITY_LOOKUP + WHERE name=%s and value in(%s) for name, values in identifiers.items(): query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values))) args = [name] + values query_msg = query % tuple(args) futures.append((query_msg, self.session.execute_async(query, args))) for query_msg, future in futures: try: rows = future.result(timeout=10) for row in rows: entity_ids.add(row.entity_id) except: logging.error(Query '%s' returned ERROR % (query_msg)) raise Using async just with select = would mean instead of 1 async query (example: in (0, 1, 2)), I would do several, one for each value of values array above. In my head, this would mean more connections to Cassandra and the same amount of work, right? What would be the advantage? []s 2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: Your other option is to fire off async queries. It's pretty straightforward w/ the java or python drivers. On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I was taking a look at Cassandra anti-patterns list: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html Among then is SELECT ... IN or index lookups¶ SELECT ... IN and index lookups (formerly secondary indexes) should be avoided except for specific scenarios. See When not to use IN in SELECT and When not to use an index in Indexing in CQL for Cassandra 2.0 And Looking at the SELECT doc, I saw: When not to use IN¶ The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster having 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range. In my system, I have a column family called entity_lookup: CREATE KEYSPACE IF NOT EXISTS Identification1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 }; USE Identification1; CREATE TABLE IF NOT EXISTS entity_lookup ( name varchar, value varchar, entity_id uuid, PRIMARY KEY ((name, value), entity_id)); And I use the following select to query it: SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s) Is this an anti-pattern? If not using SELECT IN, which other way would you recomend for lookups like that? I have several values I would like to search in cassandra and they might not be in the same particion, as above. Is Cassandra the wrong tool for lookups like that? Best regards, Marcelo Valle. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Best way to do a multi_get using CQL
If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node with the data. On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: But using async queries wouldn't be even worse than using SELECT IN? The justification in the docs is I could query many nodes, but I would still do it. Today, I use both async queries AND SELECT IN: SELECT_ENTITY_LOOKUP = SELECT entity_id FROM + ENTITY_LOOKUP + WHERE name=%s and value in(%s) for name, values in identifiers.items(): query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values))) args = [name] + values query_msg = query % tuple(args) futures.append((query_msg, self.session.execute_async(query, args))) for query_msg, future in futures: try: rows = future.result(timeout=10) for row in rows: entity_ids.add(row.entity_id) except: logging.error(Query '%s' returned ERROR % (query_msg)) raise Using async just with select = would mean instead of 1 async query (example: in (0, 1, 2)), I would do several, one for each value of values array above. In my head, this would mean more connections to Cassandra and the same amount of work, right? What would be the advantage? []s 2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: Your other option is to fire off async queries. It's pretty straightforward w/ the java or python drivers. On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I was taking a look at Cassandra anti-patterns list: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html Among then is SELECT ... IN or index lookups¶ SELECT ... IN and index lookups (formerly secondary indexes) should be avoided except for specific scenarios. See When not to use IN in SELECT and When not to use an index in Indexing in CQL for Cassandra 2.0 And Looking at the SELECT doc, I saw: When not to use IN¶ The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster having 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range. In my system, I have a column family called entity_lookup: CREATE KEYSPACE IF NOT EXISTS Identification1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 }; USE Identification1; CREATE TABLE IF NOT EXISTS entity_lookup ( name varchar, value varchar, entity_id uuid, PRIMARY KEY ((name, value), entity_id)); And I use the following select to query it: SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s) Is this an anti-pattern? If not using SELECT IN, which other way would you recomend for lookups like that? I have several values I would like to search in cassandra and they might not be in the same particion, as above. Is Cassandra the wrong tool for lookups like that? Best regards, Marcelo Valle. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Best way to do a multi_get using CQL
This is interesting, I didn't know that! It might make sense then to use select = + async + token aware, I will try to change my code. But would it be a recomended solution for these cases? Any other options? I still would if this is the right use case for Cassandra, to look for random keys in a huge cluster. After all, the amount of connections to Cassandra will still be huge, right... Wouldn't it be a problem? Or when you use async the driver reuses the connection? []s 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node with the data. On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: But using async queries wouldn't be even worse than using SELECT IN? The justification in the docs is I could query many nodes, but I would still do it. Today, I use both async queries AND SELECT IN: SELECT_ENTITY_LOOKUP = SELECT entity_id FROM + ENTITY_LOOKUP + WHERE name=%s and value in(%s) for name, values in identifiers.items(): query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values))) args = [name] + values query_msg = query % tuple(args) futures.append((query_msg, self.session.execute_async(query, args))) for query_msg, future in futures: try: rows = future.result(timeout=10) for row in rows: entity_ids.add(row.entity_id) except: logging.error(Query '%s' returned ERROR % (query_msg)) raise Using async just with select = would mean instead of 1 async query (example: in (0, 1, 2)), I would do several, one for each value of values array above. In my head, this would mean more connections to Cassandra and the same amount of work, right? What would be the advantage? []s 2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: Your other option is to fire off async queries. It's pretty straightforward w/ the java or python drivers. On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I was taking a look at Cassandra anti-patterns list: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html Among then is SELECT ... IN or index lookups¶ SELECT ... IN and index lookups (formerly secondary indexes) should be avoided except for specific scenarios. See When not to use IN in SELECT and When not to use an index in Indexing in CQL for Cassandra 2.0 And Looking at the SELECT doc, I saw: When not to use IN¶ The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster having 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range. In my system, I have a column family called entity_lookup: CREATE KEYSPACE IF NOT EXISTS Identification1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 }; USE Identification1; CREATE TABLE IF NOT EXISTS entity_lookup ( name varchar, value varchar, entity_id uuid, PRIMARY KEY ((name, value), entity_id)); And I use the following select to query it: SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s) Is this an anti-pattern? If not using SELECT IN, which other way would you recomend for lookups like that? I have several values I would like to search in cassandra and they might not be in the same particion, as above. Is Cassandra the wrong tool for lookups like that? Best regards, Marcelo Valle. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Best way to do a multi_get using CQL
The only case in which it might be better to use an IN clause is if the entire query can be satisfied from that machine. Otherwise, go async. The native driver reuses connections and intelligently manages the pool for you. It can also multiplex queries over a single connection. I am assuming you're using one of the datastax drivers for CQL, btw. Jon On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: This is interesting, I didn't know that! It might make sense then to use select = + async + token aware, I will try to change my code. But would it be a recomended solution for these cases? Any other options? I still would if this is the right use case for Cassandra, to look for random keys in a huge cluster. After all, the amount of connections to Cassandra will still be huge, right... Wouldn't it be a problem? Or when you use async the driver reuses the connection? []s 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node with the data. On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: But using async queries wouldn't be even worse than using SELECT IN? The justification in the docs is I could query many nodes, but I would still do it. Today, I use both async queries AND SELECT IN: SELECT_ENTITY_LOOKUP = SELECT entity_id FROM + ENTITY_LOOKUP + WHERE name=%s and value in(%s) for name, values in identifiers.items(): query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values))) args = [name] + values query_msg = query % tuple(args) futures.append((query_msg, self.session.execute_async(query, args))) for query_msg, future in futures: try: rows = future.result(timeout=10) for row in rows: entity_ids.add(row.entity_id) except: logging.error(Query '%s' returned ERROR % (query_msg)) raise Using async just with select = would mean instead of 1 async query (example: in (0, 1, 2)), I would do several, one for each value of values array above. In my head, this would mean more connections to Cassandra and the same amount of work, right? What would be the advantage? []s 2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: Your other option is to fire off async queries. It's pretty straightforward w/ the java or python drivers. On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I was taking a look at Cassandra anti-patterns list: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html Among then is SELECT ... IN or index lookups¶ SELECT ... IN and index lookups (formerly secondary indexes) should be avoided except for specific scenarios. See When not to use IN in SELECT and When not to use an index in Indexing in CQL for Cassandra 2.0 And Looking at the SELECT doc, I saw: When not to use IN¶ The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster having 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range. In my system, I have a column family called entity_lookup: CREATE KEYSPACE IF NOT EXISTS Identification1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 }; USE Identification1; CREATE TABLE IF NOT EXISTS entity_lookup ( name varchar, value varchar, entity_id uuid, PRIMARY KEY ((name, value), entity_id)); And I use the following select to query it: SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s) Is this an anti-pattern? If not using SELECT IN, which other way would you recomend for lookups like that? I have several values I would like to search in cassandra and they might not be in the same particion, as above. Is Cassandra the wrong tool for lookups like that? Best regards, Marcelo Valle. -- Jon Haddad http://www.rustyrazorblade.com skype: