Re: Best way to do a multi_get using CQL

Jonathan Haddad Fri, 20 Jun 2014 12:42:06 -0700

I forgot to add that each connection can handle multiple simultaneous
queries.  This was part of the original protocol as of C* 1.2:
http://www.datastax.com/dev/blog/binary-protocol


Asynchronous: each connection can handle more than one active request
at the same time. In practice, this means that client libraries will
only need to maintain a relatively low number of open connections to a
given Cassandra node to achieve good performance. This particularly
matters with Cassandra where a client usually wants to keep a connection
to all (or at least a good part of) the nodes of the Cluster and so
having a low number of per-node connections helps scaling to large
clusters.
Technically, this is achieved by giving each message a stream ID, and
by having responses to a request preserve the request’s stream ID.
Clients can thus send multiple requests with different stream IDs on
the same connection (i.e. without waiting for the response to a
request to send the next one) while still being able to associate each
received response to the right request, even if said responses come
in a different order than the one in which requests were submitted.
That asynchronicity is of course optional in the sense that a client
library can still choose to use the protocol in a synchronous way if
that is simpler.
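
To make the stream ID idea concrete, here is a toy sketch in plain Python. It
is not the real driver or the wire format: MultiplexedConnection, fake_server
and the limit of 128 stream IDs are names and values I made up for
illustration. It only shows how several requests can share one connection and
still be matched to their responses when those arrive out of order.

```python
import random


def fake_server(frames, seed=0):
    """Hypothetical server: answers every request frame, deliberately out of order."""
    responses = [(stream_id, "rows for " + query) for stream_id, query in frames]
    random.Random(seed).shuffle(responses)
    return responses


class MultiplexedConnection:
    """Toy model of one connection carrying many in-flight requests."""

    def __init__(self, max_streams=128):  # illustrative limit, an assumption
        self._free_ids = list(range(max_streams))
        self._pending = {}  # stream ID -> query still waiting for a response
        self._outbox = []   # (stream ID, query) frames not yet answered

    def send(self, query):
        # Tag the request with a free stream ID; do not wait for the response.
        stream_id = self._free_ids.pop()
        self._pending[stream_id] = query
        self._outbox.append((stream_id, query))
        return stream_id

    def flush(self):
        # Read all responses and pair each with its request via the stream ID.
        frames, self._outbox = self._outbox, []
        results = {}
        for stream_id, payload in fake_server(frames):
            query = self._pending.pop(stream_id)
            results[query] = payload
            self._free_ids.append(stream_id)  # the ID can be reused now
        return results


conn = MultiplexedConnection()
queries = ["value=%d" % i for i in range(5)]
for q in queries:
    conn.send(q)  # five requests in flight on the same connection
results = conn.flush()
```

Each entry in results ends up paired with the query that produced it, even
though the fake server shuffled the responses before returning them.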

On Fri, Jun 20, 2014 at 12:30 PM, Jeremy Jongsma <jer...@barchart.com> wrote:
> There is nothing preventing that in Cassandra; it's just a matter of how
> intelligent the driver API is. Submit a feature request to the Astyanax or
> DataStax driver projects.
>
>
> On Fri, Jun 20, 2014 at 2:27 PM, Marcelo Elias Del Valle
> <marc...@s1mbi0se.com.br> wrote:
>>
>> The bad design part (just my opinion, no intention to offend) is not allowing
>> the possibility of sending batches directly to the data nodes, without using
>> a coordinator.
>> I would choose that option.
>> []s
>>
>>
>> 2014-06-20 16:05 GMT-03:00 DuyHai Doan <doanduy...@gmail.com>:
>>>
>>> Well it's kind of a trade-off.
>>>
>>>  Either you send data directly to the primary replica nodes to take
>>> advantage of data locality using a token-aware strategy, and the price to pay
>>> is a high number of open connections from the client side.
>>>
>>> Or you just batch data to a random node playing the coordinator role to
>>> dispatch requests to the right nodes. The price to pay is then a load spike on
>>> one node (the coordinator) and intra-cluster bandwidth usage.
>>>
>>>  The choice is yours, it has nothing to do with good or bad design.
>>>
>>>
>>> On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle
>>> <marc...@s1mbi0se.com.br> wrote:
>>>>
>>>> I am using Python + the CQL driver.
>>>> I wonder how they do it...
>>>> These things may seem unimportant, but they are fundamental to getting
>>>> good performance in Cassandra...
>>>> I wish there was a simpler way to query in batches. Opening a large
>>>> number of connections and sending 1 message at a time seems bad to me, as
>>>> sometimes you want to work with small rows.
>>>> It's no surprise Cassandra performs better when we use average row
>>>> sizes. But honestly I disagree with this part of Cassandra/Driver's design.
>>>> []s
>>>>
>>>>
>>>> 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:
>>>>
>>>>> That depends on the connection pooling implementation in your driver.
>>>>> Astyanax will keep N connections open to each node (configurable) and
>>>>> route each query in a separate message over an existing connection,
>>>>> waiting until one becomes available if all are in use.
>>>>>
>>>>>
>>>>> On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle
>>>>> <marc...@s1mbi0se.com.br> wrote:
>>>>>>
>>>>>> A question, not sure if you guys know the answer:
>>>>>> Suppose I async query 1000 rows using token aware and suppose I have 10
>>>>>> nodes. Suppose also that each node would receive 100 row queries.
>>>>>> How does async work in this case? Would it send each row query to each
>>>>>> node in a different connection? Different message?
>>>>>> I guess if there was a way to use batch with async, once you commit
>>>>>> the batch for the 1000 queries, it would create 1 connection to each host
>>>>>> and query 100 rows in a single message to each host.
>>>>>> This would decrease resource usage, am I wrong?
>>>>>>
>>>>>> []s
>>>>>>
>>>>>>
>>>>>> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:
>>>>>>
>>>>>>> I've found that if you have any amount of latency between your client
>>>>>>> and nodes, and you are executing a large batch of queries, you'll
>>>>>>> usually want to send them together to one node unless execution time is
>>>>>>> of no concern. The tradeoff is resource usage on the connected node vs.
>>>>>>> time to complete all the queries, because you'll need fewer
>>>>>>> client -> node network round trips.
>>>>>>>
>>>>>>> With large numbers of queries you will still want to make sure you
>>>>>>> split them into manageable batches before sending them, to control
>>>>>>> memory usage on the executing node. I've been limiting queries to
>>>>>>> batches of 100 keys in scenarios like this.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael
>>>>>>> <michael.la...@nytimes.com> wrote:
>>>>>>>>
>>>>>>>> However, my extensive benchmarking this week of the python driver
>>>>>>>> from master shows a performance decrease when using 'token_aware'.
>>>>>>>>
>>>>>>>> This is on a 12-node, 2-datacenter, RF-3 cluster in AWS.
>>>>>>>>
>>>>>>>> Also, why do the work the coordinator will do for you: send all the
>>>>>>>> queries, wait for everything to come back in whatever order, and sort
>>>>>>>> the result?
>>>>>>>>
>>>>>>>> I would rather keep my app code simple.
>>>>>>>>
>>>>>>>> But the real point is that you should benchmark in your own
>>>>>>>> environment.
>>>>>>>>
>>>>>>>> ml
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle
>>>>>>>> <marc...@s1mbi0se.com.br> wrote:
>>>>>>>>>
>>>>>>>>> Yes, I am using the CQL DataStax drivers.
>>>>>>>>> It was good advice, thanks a lot Jonathan.
>>>>>>>>> []s
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>>>>>
>>>>>>>>>> The only case in which it might be better to use an IN clause is if
>>>>>>>>>> the entire query can be satisfied from that machine.  Otherwise, go
>>>>>>>>>> async.
>>>>>>>>>>
>>>>>>>>>> The native driver reuses connections and intelligently manages the
>>>>>>>>>> pool for you.  It can also multiplex queries over a single connection.
>>>>>>>>>>
>>>>>>>>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>>>>>>>>
>>>>>>>>>> Jon
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>>>>>>>> <marc...@s1mbi0se.com.br> wrote:
>>>>>>>>>> > This is interesting, I didn't know that!
>>>>>>>>>> > It might make sense then to use select = + async + token aware,
>>>>>>>>>> > I will try to change my code.
>>>>>>>>>> >
>>>>>>>>>> > But would it be a "recommended solution" for these cases? Any
>>>>>>>>>> > other options?
>>>>>>>>>> >
>>>>>>>>>> > I still wonder if this is the right use case for Cassandra: looking
>>>>>>>>>> > up random keys in a huge cluster. After all, the number of
>>>>>>>>>> > connections to Cassandra will still be huge, right... Wouldn't it
>>>>>>>>>> > be a problem? Or when you use async does the driver reuse the
>>>>>>>>>> > connection?
>>>>>>>>>> >
>>>>>>>>>> > []s
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>>>>>> >
>>>>>>>>>> >> If you use async and your driver is token aware, it will go to the
>>>>>>>>>> >> proper node, rather than requiring the coordinator to do so.
>>>>>>>>>> >>
>>>>>>>>>> >> Realistically you're going to have a connection open to every server
>>>>>>>>>> >> anyways.  It's the difference between you querying for the data
>>>>>>>>>> >> directly and using a coordinator as a proxy.  It's faster to just ask
>>>>>>>>>> >> the node with the data.
>>>>>>>>>> >>
>>>>>>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>>>>>>>> >> <marc...@s1mbi0se.com.br> wrote:
>>>>>>>>>> >> > But wouldn't using async queries be even worse than using
>>>>>>>>>> >> > SELECT IN?
>>>>>>>>>> >> > The justification in the docs is that I could query many nodes,
>>>>>>>>>> >> > but I would still do it.
>>>>>>>>>> >> >
>>>>>>>>>> >> > Today, I use both async queries AND SELECT IN:
>>>>>>>>>> >> >
>>>>>>>>>> >> > SELECT_ENTITY_LOOKUP = ("SELECT entity_id FROM " + ENTITY_LOOKUP +
>>>>>>>>>> >> >                         " WHERE name=%s and value in(%s)")
>>>>>>>>>> >> >
>>>>>>>>>> >> > for name, values in identifiers.items():
>>>>>>>>>> >> >     # One placeholder for name, then one per value in the IN list.
>>>>>>>>>> >> >     query = self.SELECT_ENTITY_LOOKUP % (
>>>>>>>>>> >> >         '%s', ','.join(['%s'] * len(values)))
>>>>>>>>>> >> >     args = [name] + values
>>>>>>>>>> >> >     query_msg = query % tuple(args)  # readable form, used for logging
>>>>>>>>>> >> >     futures.append(
>>>>>>>>>> >> >         (query_msg, self.session.execute_async(query, args)))
>>>>>>>>>> >> >
>>>>>>>>>> >> > for query_msg, future in futures:
>>>>>>>>>> >> >     try:
>>>>>>>>>> >> >         rows = future.result(timeout=100000)
>>>>>>>>>> >> >         for row in rows:
>>>>>>>>>> >> >             entity_ids.add(row.entity_id)
>>>>>>>>>> >> >     except:
>>>>>>>>>> >> >         logging.error("Query '%s' returned ERROR " % (query_msg))
>>>>>>>>>> >> >         raise
>>>>>>>>>> >> >
>>>>>>>>>> >> > Using async with just select = would mean that instead of 1 async
>>>>>>>>>> >> > query (example: in (0, 1, 2)), I would do several, one for each
>>>>>>>>>> >> > value of the "values" array above.
>>>>>>>>>> >> > In my head, this would mean more connections to Cassandra and the
>>>>>>>>>> >> > same amount of work, right? What would be the advantage?
>>>>>>>>>> >> >
>>>>>>>>>> >> > []s
>>>>>>>>>> >> >
>>>>>>>>>> >> >
>>>>>>>>>> >> >
>>>>>>>>>> >> >
>>>>>>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad
>>>>>>>>>> >> > <j...@jonhaddad.com>:
>>>>>>>>>> >> >
>>>>>>>>>> >> >> Your other option is to fire off async queries.  It's pretty
>>>>>>>>>> >> >> straightforward w/ the java or python drivers.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>>>>>>>> >> >> <marc...@s1mbi0se.com.br> wrote:
>>>>>>>>>> >> >> > I was taking a look at the Cassandra anti-patterns list:
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > Among them is
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > SELECT ... IN or index lookups
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > SELECT ... IN and index lookups (formerly secondary indexes) should
>>>>>>>>>> >> >> > be avoided except for specific scenarios. See When not to use IN in
>>>>>>>>>> >> >> > SELECT and When not to use an index in Indexing in CQL for
>>>>>>>>>> >> >> > Cassandra 2.0"
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > And looking at the SELECT doc, I saw:
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > When not to use IN
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > The recommendations about when not to use an index apply to using IN
>>>>>>>>>> >> >> > in the WHERE clause. Under most conditions, using IN in the WHERE
>>>>>>>>>> >> >> > clause is not recommended. Using IN can degrade performance because
>>>>>>>>>> >> >> > usually many nodes must be queried. For example, in a single, local
>>>>>>>>>> >> >> > data center cluster having 30 nodes, a replication factor of 3, and a
>>>>>>>>>> >> >> > consistency level of LOCAL_QUORUM, a single key query goes out to two
>>>>>>>>>> >> >> > nodes, but if the query uses the IN condition, the number of nodes
>>>>>>>>>> >> >> > being queried are most likely even higher, up to 20 nodes depending
>>>>>>>>>> >> >> > on where the keys fall in the token range."
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>>>>>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>>>>>>>>> >> >> >                        'DC1' : 3 };
>>>>>>>>>> >> >> > USE Identification1;
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>>>>> >> >> >   name varchar,
>>>>>>>>>> >> >> >   value varchar,
>>>>>>>>>> >> >> >   entity_id uuid,
>>>>>>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > And I use the following select to query it:
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > Is this an anti-pattern?
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > If not using SELECT IN, which other way would you recommend for
>>>>>>>>>> >> >> > lookups like that? I have several values I would like to search for
>>>>>>>>>> >> >> > in Cassandra and they might not be in the same partition, as above.
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > Best regards,
>>>>>>>>>> >> >> > Marcelo Valle.
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> --
>>>>>>>>>> >> >> Jon Haddad
>>>>>>>>>> >> >> http://www.rustyrazorblade.com
>>>>>>>>>> >> >> skype: rustyrazorblade
>>>>>>>>>> >> >
>>>>>>>>>> >> >
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> >> --
>>>>>>>>>> >> Jon Haddad
>>>>>>>>>> >> http://www.rustyrazorblade.com
>>>>>>>>>> >> skype: rustyrazorblade
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jon Haddad
>>>>>>>>>> http://www.rustyrazorblade.com
>>>>>>>>>> skype: rustyrazorblade
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>



-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
