That depends on the connection pooling implementation in your driver. Astyanax will keep N connections open to each node (configurable) and route each query in a separate message over an existing connection, waiting until one becomes available if all are in use.
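To make the "N connections per node, wait if all are busy" behaviour concrete, here is a toy simulation in Python. It is only a sketch: it is not Astyanax or driver code, and `NodePool`, `run_queries` and the 1 ms sleep are made-up stand-ins for a real pool and a real network round trip.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class NodePool:
    """Toy stand-in for a driver's per-node connection pool (hypothetical)."""

    def __init__(self, node, max_connections):
        self.node = node
        # One semaphore slot per open connection to this node.
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self.in_use = 0
        self.peak_in_use = 0

    def execute(self, query):
        # Block until one of the N connections to this node is free.
        with self._slots:
            with self._lock:
                self.in_use += 1
                self.peak_in_use = max(self.peak_in_use, self.in_use)
            try:
                time.sleep(0.001)  # pretend network round trip
                return (self.node, query)
            finally:
                with self._lock:
                    self.in_use -= 1


def run_queries(pools, routed_queries, workers=32):
    """routed_queries is a list of (node, query) pairs, already routed by token."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(pools[node].execute, query)
                   for node, query in routed_queries]
        # Results come back in submission order, one message per query.
        return [future.result() for future in futures]


if __name__ == "__main__":
    nodes = ["node%d" % i for i in range(10)]
    pools = {n: NodePool(n, max_connections=2) for n in nodes}
    routed = [(nodes[i % 10], "row-%d" % i) for i in range(1000)]
    results = run_queries(pools, routed)
    print(len(results), max(p.peak_in_use for p in pools.values()))
```

In this model the 1000 row queries never open more than 2 connections per node; the extra queries simply queue behind the busy ones.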
On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle <marc...@s1mbi0se.com.br> wrote:

> A question, not sure if you guys know the answer:
> Suppose I async query 1000 rows using token aware and suppose I have 10
> nodes. Suppose also each node would receive 100 row queries each.
> How does async work in this case? Would it send each row query to each
> node in a different connection? Different message?
> I guess if there was a way to use batch with async, once you commit the
> batch for the 1000 queries, it would create 1 connection to each host and
> query 100 rows in a single message to each host.
> This would decrease resource usage, am I wrong?
>
> []s
>
> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:
>
>> I've found that if you have any amount of latency between your client and
>> nodes, and you are executing a large batch of queries, you'll usually want
>> to send them together to one node unless execution time is of no concern.
>> The tradeoff is resource usage on the connected node vs. time to complete
>> all the queries, because you'll need fewer client -> node network round
>> trips.
>>
>> With large numbers of queries you will still want to make sure you split
>> them into manageable batches before sending them, to control memory usage
>> on the executing node. I've been limiting queries to batches of 100 keys in
>> scenarios like this.
>>
>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <michael.la...@nytimes.com> wrote:
>>
>>> However my extensive benchmarking this week of the python driver from
>>> master shows a performance *decrease* when using 'token_aware'.
>>>
>>> This is on a 12-node, 2-datacenter, RF-3 cluster in AWS.
>>>
>>> Also, why do the work the coordinator will do for you: send all the
>>> queries, wait for everything to come back in whatever order, and sort the
>>> result?
>>>
>>> I would rather keep my app code simple.
>>>
>>> But the real point is that you should benchmark in your own environment.
>>>
>>> ml
>>>
>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <marc...@s1mbi0se.com.br> wrote:
>>>
>>>> Yes, I am using the CQL datastax drivers.
>>>> It was good advice, thanks a lot Jonathan.
>>>> []s
>>>>
>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>
>>>>> The only case in which it might be better to use an IN clause is if
>>>>> the entire query can be satisfied from that machine. Otherwise, go
>>>>> async.
>>>>>
>>>>> The native driver reuses connections and intelligently manages the
>>>>> pool for you. It can also multiplex queries over a single connection.
>>>>>
>>>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>>>
>>>>> Jon
>>>>>
>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>>> <marc...@s1mbi0se.com.br> wrote:
>>>>> > This is interesting, I didn't know that!
>>>>> > It might make sense then to use select = + async + token aware, I will
>>>>> > try to change my code.
>>>>> >
>>>>> > But would it be a "recommended solution" for these cases? Any other
>>>>> > options?
>>>>> >
>>>>> > I still wonder if this is the right use case for Cassandra, to look for
>>>>> > random keys in a huge cluster. After all, the amount of connections to
>>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>>> > Or when you use async the driver reuses the connection?
>>>>> >
>>>>> > []s
>>>>> >
>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>> >
>>>>> >> If you use async and your driver is token aware, it will go to the
>>>>> >> proper node, rather than requiring the coordinator to do so.
>>>>> >>
>>>>> >> Realistically you're going to have a connection open to every server
>>>>> >> anyways. It's the difference between you querying for the data
>>>>> >> directly and using a coordinator as a proxy. It's faster to just ask
>>>>> >> the node with the data.
>>>>> >>
>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>>> >> <marc...@s1mbi0se.com.br> wrote:
>>>>> >> > But using async queries wouldn't be even worse than using SELECT IN?
>>>>> >> > The justification in the docs is I could query many nodes, but I would
>>>>> >> > still do it.
>>>>> >> >
>>>>> >> > Today, I use both async queries AND SELECT IN:
>>>>> >> >
>>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)"
>>>>> >> >
>>>>> >> > for name, values in identifiers.items():
>>>>> >> >     query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
>>>>> >> >     args = [name] + values
>>>>> >> >     query_msg = query % tuple(args)
>>>>> >> >     futures.append((query_msg, self.session.execute_async(query, args)))
>>>>> >> >
>>>>> >> > for query_msg, future in futures:
>>>>> >> >     try:
>>>>> >> >         rows = future.result(timeout=100000)
>>>>> >> >         for row in rows:
>>>>> >> >             entity_ids.add(row.entity_id)
>>>>> >> >     except:
>>>>> >> >         logging.error("Query '%s' returned ERROR " % (query_msg))
>>>>> >> >         raise
>>>>> >> >
>>>>> >> > Using async just with select = would mean instead of 1 async query
>>>>> >> > (example: in (0, 1, 2)), I would do several, one for each value of
>>>>> >> > "values" array above.
>>>>> >> > In my head, this would mean more connections to Cassandra and the same
>>>>> >> > amount of work, right? What would be the advantage?
>>>>> >> >
>>>>> >> > []s
>>>>> >> >
>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>> >> >
>>>>> >> >> Your other option is to fire off async queries. It's pretty
>>>>> >> >> straightforward w/ the java or python drivers.
>>>>> >> >>
>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>>> >> >> <marc...@s1mbi0se.com.br> wrote:
>>>>> >> >> > I was taking a look at Cassandra anti-patterns list:
>>>>> >> >> >
>>>>> >> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>>> >> >> >
>>>>> >> >> > Among them is
>>>>> >> >> >
>>>>> >> >> > SELECT ... IN or index lookups
>>>>> >> >> >
>>>>> >> >> > "SELECT ... IN and index lookups (formerly secondary indexes) should
>>>>> >> >> > be avoided except for specific scenarios. See When not to use IN in
>>>>> >> >> > SELECT and When not to use an index in Indexing in CQL for
>>>>> >> >> > Cassandra 2.0"
>>>>> >> >> >
>>>>> >> >> > And looking at the SELECT doc, I saw:
>>>>> >> >> >
>>>>> >> >> > When not to use IN
>>>>> >> >> >
>>>>> >> >> > "The recommendations about when not to use an index apply to using IN
>>>>> >> >> > in the WHERE clause. Under most conditions, using IN in the WHERE
>>>>> >> >> > clause is not recommended. Using IN can degrade performance because
>>>>> >> >> > usually many nodes must be queried. For example, in a single, local
>>>>> >> >> > data center cluster having 30 nodes, a replication factor of 3, and a
>>>>> >> >> > consistency level of LOCAL_QUORUM, a single key query goes out to two
>>>>> >> >> > nodes, but if the query uses the IN condition, the number of nodes
>>>>> >> >> > being queried are most likely even higher, up to 20 nodes depending
>>>>> >> >> > on where the keys fall in the token range."
>>>>> >> >> >
>>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>>> >> >> >
>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>>>> >> >> >   'DC1' : 3 };
>>>>> >> >> > USE Identification1;
>>>>> >> >> >
>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>> >> >> >   name varchar,
>>>>> >> >> >   value varchar,
>>>>> >> >> >   entity_id uuid,
>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>>> >> >> >
>>>>> >> >> > And I use the following select to query it:
>>>>> >> >> >
>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>>>>> >> >> >
>>>>> >> >> > Is this an anti-pattern?
>>>>> >> >> >
>>>>> >> >> > If not using SELECT IN, which other way would you recommend for
>>>>> >> >> > lookups like that? I have several values I would like to search in
>>>>> >> >> > cassandra and they might not be in the same partition, as above.
>>>>> >> >> >
>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>>> >> >> >
>>>>> >> >> > Best regards,
>>>>> >> >> > Marcelo Valle.
>>>>> >> >>
>>>>> >> >> --
>>>>> >> >> Jon Haddad
>>>>> >> >> http://www.rustyrazorblade.com
>>>>> >> >> skype: rustyrazorblade
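
For reference, the two suggestions from this thread (one "select =" query per key, sent async, and no more than about 100 keys at a time) could be combined roughly as below. This is a sketch under assumptions: `lookup_entity_ids`, `chunks` and `batch_size` are names invented here, and `session` is assumed to be a DataStax python driver session already connected to the keyspace that holds `entity_lookup`.

```python
SELECT_ENTITY = "SELECT entity_id FROM entity_lookup WHERE name=%s AND value=%s"


def chunks(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def lookup_entity_ids(session, identifiers, batch_size=100):
    """identifiers maps name -> list of values; returns the set of entity_ids.

    `session` only needs execute_async(query, params) returning a future
    with result(), as the DataStax python driver session does.
    """
    entity_ids = set()
    pairs = [(name, value)
             for name, values in identifiers.items()
             for value in values]
    for batch in chunks(pairs, batch_size):
        # One single-partition query per (name, value) pair;
        # at most batch_size queries are in flight at once.
        futures = [session.execute_async(SELECT_ENTITY, pair) for pair in batch]
        for future in futures:
            for row in future.result():
                entity_ids.add(row.entity_id)
    return entity_ids
```

Each query hits exactly one partition, so a token-aware driver can send it straight to a replica; whether that beats SELECT IN in practice is something to benchmark in your own environment, as noted earlier in the thread.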