I forgot to add that each connection can handle multiple simultaneous queries. This was part of the original protocol as of C* 1.2: http://www.datastax.com/dev/blog/binary-protocol
Asynchronous: each connection can handle more than one active request at the same time. In practice, this means that client libraries will only need to maintain a relatively low amount of open connections to a given Cassandra node to achieve good performance. This particularly matters with Cassandra where a client usually wants to keep connection to all (or at least a good part of) the nodes of the Cluster and so having a low number of per-node connections helps scaling to large clusters. Technically, this is achieved by giving each messages a stream ID, and by having responses to a request preserve the request’s stream ID. Clients can thus send multiple requests with different stream IDs on the same connection (i.e. without waiting for the response to a request to send the next one) while still being able to associate each received response to the right request, even if said responses comes in a different order than the one in which requests were submitted. That asynchronicity is of course optional in the sense that a client library can still choose to use the protocol in a synchronous way if that is simpler. On Fri, Jun 20, 2014 at 12:30 PM, Jeremy Jongsma <jer...@barchart.com> wrote: > There is nothing preventing that in Cassandra, it's just a matter of how > intelligent the driver API is. Submit a feature request to Astyanax or > Datastax driver projects. > > > On Fri, Jun 20, 2014 at 2:27 PM, Marcelo Elias Del Valle > <marc...@s1mbi0se.com.br> wrote: >> >> The bad design part (just my opinion, no intention to offend) is not allow >> the possibility of sending batches directly to the data nodes, without using >> a coordinator. >> I would choose that option. >> []s >> >> >> 2014-06-20 16:05 GMT-03:00 DuyHai Doan <doanduy...@gmail.com>: >>> >>> Well it's kind of a trade-off. >>> >>> Either you send data directly to the primary replica nodes to take >>> advantage of data-locality using token-aware strategy and the price to pay >>> is a high number of opened connections from client side. >>> >>> Or you just batch data to a random node playing the coordinator role to >>> dispatch requests to the right nodes. The price to pay is then spike load on >>> 1 node (the coordinator) and intra-cluster bandwdith usage. >>> >>> The choice is yours, it has nothing to do with good or bad design. >>> >>> >>> On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle >>> <marc...@s1mbi0se.com.br> wrote: >>>> >>>> I am using python + CQL Driver. >>>> I wonder how they do... >>>> These things seems little important, but they are fundamental to get a >>>> good performance in Cassandra... >>>> I wish there was a simpler way to query in batches. Opening a large >>>> amount of connections and sending 1 message at a time seems bad to me, as >>>> sometimes you want to work with small rows. >>>> It's no surprise Cassandra performs better when we use average row >>>> sizes. But honestly I disagree with this part of Cassandra/Driver's design. >>>> []s >>>> >>>> >>>> 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>: >>>> >>>>> That depends on the connection pooling implementation in your driver. >>>>> Astyanax will keep N connections open to each node (configurable) and >>>>> route >>>>> each query in a separate message over an existing connection, waiting >>>>> until >>>>> one becomes available if all are in use. >>>>> >>>>> >>>>> On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle >>>>> <marc...@s1mbi0se.com.br> wrote: >>>>>> >>>>>> A question, not sure if you guys know the answer: >>>>>> Supose I async query 1000 rows using token aware and suppose I have 10 >>>>>> nodes. Suppose also each node would receive 100 row queries each. >>>>>> How does async work in this case? Would it send each row query to each >>>>>> node in a different connection? Different message? >>>>>> I guess if there was a way to use batch with async, once you commit >>>>>> the batch for the 1000 queries, it would create 1 connection to each host >>>>>> and query 100 rows in a single message to each host. >>>>>> This would decrease resource usage, am I wrong? >>>>>> >>>>>> []s >>>>>> >>>>>> >>>>>> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>: >>>>>> >>>>>>> I've found that if you have any amount of latency between your client >>>>>>> and nodes, and you are executing a large batch of queries, you'll >>>>>>> usually >>>>>>> want to send them together to one node unless execution time is of no >>>>>>> concern. The tradeoff is resource usage on the connected node vs. time >>>>>>> to >>>>>>> complete all the queries, because you'll need fewer client -> node >>>>>>> network >>>>>>> round trips. >>>>>>> >>>>>>> With large numbers of queries you will still want to make sure you >>>>>>> split them into manageable batches before sending them, to control >>>>>>> memory >>>>>>> usage on the executing node. I've been limiting queries to batches of >>>>>>> 100 >>>>>>> keys in scenarios like this. >>>>>>> >>>>>>> >>>>>>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael >>>>>>> <michael.la...@nytimes.com> wrote: >>>>>>>> >>>>>>>> However my extensive benchmarking this week of the python driver >>>>>>>> from master shows a performance decrease when using 'token_aware'. >>>>>>>> >>>>>>>> This is on 12-node, 2-datacenter, RF-3 cluster in AWS. >>>>>>>> >>>>>>>> Also why do the work the coordinator will do for you: send all the >>>>>>>> queries, wait for everything to come back in whatever order, and sort >>>>>>>> the >>>>>>>> result. >>>>>>>> >>>>>>>> I would rather keep my app code simple. >>>>>>>> >>>>>>>> But the real point is that you should benchmark in your own >>>>>>>> environment. >>>>>>>> >>>>>>>> ml >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle >>>>>>>> <marc...@s1mbi0se.com.br> wrote: >>>>>>>>> >>>>>>>>> Yes, I am using the CQL datastax drivers. >>>>>>>>> It was a good advice, thanks a lot Janathan. >>>>>>>>> []s >>>>>>>>> >>>>>>>>> >>>>>>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>: >>>>>>>>> >>>>>>>>>> The only case in which it might be better to use an IN clause is >>>>>>>>>> if >>>>>>>>>> the entire query can be satisfied from that machine. Otherwise, >>>>>>>>>> go >>>>>>>>>> async. >>>>>>>>>> >>>>>>>>>> The native driver reuses connections and intelligently manages the >>>>>>>>>> pool for you. It can also multiplex queries over a single >>>>>>>>>> connection. >>>>>>>>>> >>>>>>>>>> I am assuming you're using one of the datastax drivers for CQL, >>>>>>>>>> btw. >>>>>>>>>> >>>>>>>>>> Jon >>>>>>>>>> >>>>>>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle >>>>>>>>>> <marc...@s1mbi0se.com.br> wrote: >>>>>>>>>> > This is interesting, I didn't know that! >>>>>>>>>> > It might make sense then to use select = + async + token aware, >>>>>>>>>> > I will try >>>>>>>>>> > to change my code. >>>>>>>>>> > >>>>>>>>>> > But would it be a "recomended solution" for these cases? Any >>>>>>>>>> > other options? >>>>>>>>>> > >>>>>>>>>> > I still would if this is the right use case for Cassandra, to >>>>>>>>>> > look for >>>>>>>>>> > random keys in a huge cluster. After all, the amount of >>>>>>>>>> > connections to >>>>>>>>>> > Cassandra will still be huge, right... Wouldn't it be a problem? >>>>>>>>>> > Or when you use async the driver reuses the connection? >>>>>>>>>> > >>>>>>>>>> > []s >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>: >>>>>>>>>> > >>>>>>>>>> >> If you use async and your driver is token aware, it will go to >>>>>>>>>> >> the >>>>>>>>>> >> proper node, rather than requiring the coordinator to do so. >>>>>>>>>> >> >>>>>>>>>> >> Realistically you're going to have a connection open to every >>>>>>>>>> >> server >>>>>>>>>> >> anyways. It's the difference between you querying for the data >>>>>>>>>> >> directly and using a coordinator as a proxy. It's faster to >>>>>>>>>> >> just ask >>>>>>>>>> >> the node with the data. >>>>>>>>>> >> >>>>>>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle >>>>>>>>>> >> <marc...@s1mbi0se.com.br> wrote: >>>>>>>>>> >> > But using async queries wouldn't be even worse than using >>>>>>>>>> >> > SELECT IN? >>>>>>>>>> >> > The justification in the docs is I could query many nodes, >>>>>>>>>> >> > but I would >>>>>>>>>> >> > still >>>>>>>>>> >> > do it. >>>>>>>>>> >> > >>>>>>>>>> >> > Today, I use both async queries AND SELECT IN: >>>>>>>>>> >> > >>>>>>>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + >>>>>>>>>> >> > ENTITY_LOOKUP + " >>>>>>>>>> >> > WHERE >>>>>>>>>> >> > name=%s and value in(%s)" >>>>>>>>>> >> > >>>>>>>>>> >> > for name, values in identifiers.items(): >>>>>>>>>> >> > query = self.SELECT_ENTITY_LOOKUP % ('%s', >>>>>>>>>> >> > ','.join(['%s']*len(values))) >>>>>>>>>> >> > args = [name] + values >>>>>>>>>> >> > query_msg = query % tuple(args) >>>>>>>>>> >> > futures.append((query_msg, >>>>>>>>>> >> > self.session.execute_async(query, args))) >>>>>>>>>> >> > >>>>>>>>>> >> > for query_msg, future in futures: >>>>>>>>>> >> > try: >>>>>>>>>> >> > rows = future.result(timeout=100000) >>>>>>>>>> >> > for row in rows: >>>>>>>>>> >> > entity_ids.add(row.entity_id) >>>>>>>>>> >> > except: >>>>>>>>>> >> > logging.error("Query '%s' returned ERROR " % >>>>>>>>>> >> > (query_msg)) >>>>>>>>>> >> > raise >>>>>>>>>> >> > >>>>>>>>>> >> > Using async just with select = would mean instead of 1 async >>>>>>>>>> >> > query >>>>>>>>>> >> > (example: >>>>>>>>>> >> > in (0, 1, 2)), I would do several, one for each value of >>>>>>>>>> >> > "values" array >>>>>>>>>> >> > above. >>>>>>>>>> >> > In my head, this would mean more connections to Cassandra and >>>>>>>>>> >> > the same >>>>>>>>>> >> > amount of work, right? What would be the advantage? >>>>>>>>>> >> > >>>>>>>>>> >> > []s >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad >>>>>>>>>> >> > <j...@jonhaddad.com>: >>>>>>>>>> >> > >>>>>>>>>> >> >> Your other option is to fire off async queries. It's pretty >>>>>>>>>> >> >> straightforward w/ the java or python drivers. >>>>>>>>>> >> >> >>>>>>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle >>>>>>>>>> >> >> <marc...@s1mbi0se.com.br> wrote: >>>>>>>>>> >> >> > I was taking a look at Cassandra anti-patterns list: >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > Among then is >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > SELECT ... IN or index lookups¶ >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > SELECT ... IN and index lookups (formerly secondary >>>>>>>>>> >> >> > indexes) should >>>>>>>>>> >> >> > be >>>>>>>>>> >> >> > avoided except for specific scenarios. See When not to use >>>>>>>>>> >> >> > IN in >>>>>>>>>> >> >> > SELECT >>>>>>>>>> >> >> > and >>>>>>>>>> >> >> > When not to use an index in Indexing in >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > CQL for Cassandra 2.0" >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > And Looking at the SELECT doc, I saw: >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > When not to use IN¶ >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > The recommendations about when not to use an index apply >>>>>>>>>> >> >> > to using IN >>>>>>>>>> >> >> > in >>>>>>>>>> >> >> > the >>>>>>>>>> >> >> > WHERE clause. Under most conditions, using IN in the WHERE >>>>>>>>>> >> >> > clause is >>>>>>>>>> >> >> > not >>>>>>>>>> >> >> > recommended. Using IN can degrade performance because >>>>>>>>>> >> >> > usually many >>>>>>>>>> >> >> > nodes >>>>>>>>>> >> >> > must be queried. For example, in a single, local data >>>>>>>>>> >> >> > center cluster >>>>>>>>>> >> >> > having >>>>>>>>>> >> >> > 30 nodes, a replication factor of 3, and a consistency >>>>>>>>>> >> >> > level of >>>>>>>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, >>>>>>>>>> >> >> > but if the >>>>>>>>>> >> >> > query >>>>>>>>>> >> >> > uses the IN condition, the number of nodes being queried >>>>>>>>>> >> >> > are most >>>>>>>>>> >> >> > likely >>>>>>>>>> >> >> > even higher, up to 20 nodes depending on where the keys >>>>>>>>>> >> >> > fall in the >>>>>>>>>> >> >> > token >>>>>>>>>> >> >> > range." >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > In my system, I have a column family called >>>>>>>>>> >> >> > "entity_lookup": >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1 >>>>>>>>>> >> >> > WITH REPLICATION = { 'class' : >>>>>>>>>> >> >> > 'NetworkTopologyStrategy', >>>>>>>>>> >> >> > 'DC1' : 3 }; >>>>>>>>>> >> >> > USE Identification1; >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup ( >>>>>>>>>> >> >> > name varchar, >>>>>>>>>> >> >> > value varchar, >>>>>>>>>> >> >> > entity_id uuid, >>>>>>>>>> >> >> > PRIMARY KEY ((name, value), entity_id)); >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > And I use the following select to query it: >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and >>>>>>>>>> >> >> > value in(%s) >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > Is this an anti-pattern? >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > If not using SELECT IN, which other way would you recomend >>>>>>>>>> >> >> > for >>>>>>>>>> >> >> > lookups >>>>>>>>>> >> >> > like >>>>>>>>>> >> >> > that? I have several values I would like to search in >>>>>>>>>> >> >> > cassandra and >>>>>>>>>> >> >> > they >>>>>>>>>> >> >> > might not be in the same particion, as above. >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > Is Cassandra the wrong tool for lookups like that? >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > Best regards, >>>>>>>>>> >> >> > Marcelo Valle. >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > >>>>>>>>>> >> >> > >>>>>>>>>> >> >> >>>>>>>>>> >> >> >>>>>>>>>> >> >> >>>>>>>>>> >> >> -- >>>>>>>>>> >> >> Jon Haddad >>>>>>>>>> >> >> http://www.rustyrazorblade.com >>>>>>>>>> >> >> skype: rustyrazorblade >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> -- >>>>>>>>>> >> Jon Haddad >>>>>>>>>> >> http://www.rustyrazorblade.com >>>>>>>>>> >> skype: rustyrazorblade >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Jon Haddad >>>>>>>>>> http://www.rustyrazorblade.com >>>>>>>>>> skype: rustyrazorblade >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade