Re: Best way to do a multi_get using CQL

Jeremy Jongsma Fri, 20 Jun 2014 10:38:42 -0700

That depends on the connection pooling implementation in your driver.
Astyanax will keep N connections open to each node (configurable) and route
each query in a separate message over an existing connection, waiting until
one becomes available if all are in use.
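
To make that concrete, here is a minimal sketch of the pooling behavior I described, written with only the Python standard library. It is not Astyanax or any real driver: NodePool, run_query, run_all and the pool size of 2 are hypothetical names and values picked for illustration. Each node gets a fixed number of connection slots, every query travels as its own message, and a query blocks until a slot on its target node is free.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class NodePool:
    """Toy stand-in for a per-node connection pool (hypothetical, not a driver API)."""

    def __init__(self, nodes, connections_per_node):
        # One semaphore per node: each permit models one open connection.
        self._slots = {n: threading.BoundedSemaphore(connections_per_node) for n in nodes}
        self._lock = threading.Lock()
        self._in_use = {n: 0 for n in nodes}
        self.peak_in_use = {n: 0 for n in nodes}

    def run_query(self, node, query):
        # Block until one of this node's connections is free.
        with self._slots[node]:
            with self._lock:
                self._in_use[node] += 1
                self.peak_in_use[node] = max(self.peak_in_use[node], self._in_use[node])
            try:
                # A real driver would write the query to the socket here.
                return (node, query)
            finally:
                with self._lock:
                    self._in_use[node] -= 1


def run_all(pool, routed_queries, workers=16):
    """Send every (node, query) pair as its own message and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(pool.run_query, node, q) for node, q in routed_queries]
        return [f.result() for f in futures]
```

With 16 worker threads and 2 slots per node, at most 2 queries are ever in flight to a given node and the rest simply wait, which is the trade-off to keep in mind when sizing the pool in whatever driver you use.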



On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle <
marc...@s1mbi0se.com.br> wrote:

> A question, not sure if you guys know the answer:
> Suppose I async query 1000 rows using token aware and suppose I have 10
> nodes. Suppose also that each node would receive 100 row queries.
> How does async work in this case? Would it send each row query to each
> node in a different connection? Different message?
> I guess if there was a way to use batch with async, once you commit the
> batch for the 1000 queries, it would create 1 connection to each host and
> query 100 rows in a single message to each host.
> This would decrease resource usage, am I wrong?
>
> []s
>
>
> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:
>
> I've found that if you have any amount of latency between your client and
>> nodes, and you are executing a large batch of queries, you'll usually want
>> to send them together to one node unless execution time is of no concern.
>> The tradeoff is resource usage on the connected node vs. time to complete
>> all the queries, because you'll need fewer client -> node network round
>> trips.
>>
>> With large numbers of queries you will still want to make sure you split
>> them into manageable batches before sending them, to control memory usage
>> on the executing node. I've been limiting queries to batches of 100 keys in
>> scenarios like this.
>>
>>
>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <
>> michael.la...@nytimes.com> wrote:
>>
>>> However my extensive benchmarking this week of the python driver from
>>> master shows a performance *decrease* when using 'token_aware'.
>>>
>>> This is on a 12-node, 2-datacenter, RF-3 cluster in AWS.
>>>
>>> Also, why do the work the coordinator will do for you: send all the
>>> queries, wait for everything to come back in whatever order, and sort the
>>> result?
>>>
>>> I would rather keep my app code simple.
>>>
>>> But the real point is that you should benchmark in your own environment.
>>>
>>> ml
>>>
>>>
>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
>>> marc...@s1mbi0se.com.br> wrote:
>>>
>>>> Yes, I am using the CQL datastax drivers.
>>>> It was good advice, thanks a lot, Jonathan.
>>>> []s
>>>>
>>>>
>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>
>>>> The only case in which it might be better to use an IN clause is if
>>>>> the entire query can be satisfied from that machine.  Otherwise, go
>>>>> async.
>>>>>
>>>>> The native driver reuses connections and intelligently manages the
>>>>> pool for you.  It can also multiplex queries over a single connection.
>>>>>
>>>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>>>
>>>>> Jon
>>>>>
>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>>> <marc...@s1mbi0se.com.br> wrote:
>>>>> > This is interesting, I didn't know that!
>>>>> > It might make sense then to use select = + async + token aware, I will
>>>>> > try to change my code.
>>>>> >
>>>>> > But would it be a "recommended solution" for these cases? Any other
>>>>> > options?
>>>>> >
>>>>> > I still wonder if this is the right use case for Cassandra, to look for
>>>>> > random keys in a huge cluster. After all, the number of connections to
>>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>>> > Or when you use async the driver reuses the connection?
>>>>> >
>>>>> > []s
>>>>> >
>>>>> >
>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>> >
>>>>> >> If you use async and your driver is token aware, it will go to the
>>>>> >> proper node, rather than requiring the coordinator to do so.
>>>>> >>
>>>>> >> Realistically you're going to have a connection open to every server
>>>>> >> anyways.  It's the difference between you querying for the data
>>>>> >> directly and using a coordinator as a proxy.  It's faster to just ask
>>>>> >> the node with the data.
>>>>> >>
>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>>> >> <marc...@s1mbi0se.com.br> wrote:
>>>>> >> > But wouldn't using async queries be even worse than using SELECT IN?
>>>>> >> > The justification in the docs is that I could query many nodes, but I
>>>>> >> > would still do that.
>>>>> >> >
>>>>> >> > Today, I use both async queries AND SELECT IN:
>>>>> >> >
>>>>> >> > SELECT_ENTITY_LOOKUP = ("SELECT entity_id FROM " + ENTITY_LOOKUP +
>>>>> >> >                         " WHERE name=%s and value in(%s)")
>>>>> >> >
>>>>> >> > for name, values in identifiers.items():
>>>>> >> >    query = self.SELECT_ENTITY_LOOKUP % (
>>>>> >> >        '%s', ','.join(['%s']*len(values)))
>>>>> >> >    args = [name] + values
>>>>> >> >    query_msg = query % tuple(args)
>>>>> >> >    futures.append(
>>>>> >> >        (query_msg, self.session.execute_async(query, args)))
>>>>> >> >
>>>>> >> > for query_msg, future in futures:
>>>>> >> >    try:
>>>>> >> >       rows = future.result(timeout=100000)
>>>>> >> >       for row in rows:
>>>>> >> >         entity_ids.add(row.entity_id)
>>>>> >> >    except:
>>>>> >> >       logging.error("Query '%s' returned ERROR " % (query_msg))
>>>>> >> >       raise
>>>>> >> >
>>>>> >> > Using async just with select = would mean instead of 1 async query
>>>>> >> > (example: in (0, 1, 2)), I would do several, one for each value of
>>>>> >> > the "values" array above.
>>>>> >> > In my head, this would mean more connections to Cassandra and the
>>>>> >> > same amount of work, right? What would be the advantage?
>>>>> >> >
>>>>> >> > []s
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>> >> >
>>>>> >> >> Your other option is to fire off async queries.  It's pretty
>>>>> >> >> straightforward w/ the java or python drivers.
>>>>> >> >>
>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>>> >> >> <marc...@s1mbi0se.com.br> wrote:
>>>>> >> >> > I was taking a look at the Cassandra anti-patterns list:
>>>>> >> >> >
>>>>> >> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>>> >> >> >
>>>>> >> >> > Among them is
>>>>> >> >> >
>>>>> >> >> > SELECT ... IN or index lookups
>>>>> >> >> >
>>>>> >> >> > "SELECT ... IN and index lookups (formerly secondary indexes) should be
>>>>> >> >> > avoided except for specific scenarios. See When not to use IN in SELECT
>>>>> >> >> > and When not to use an index in Indexing in CQL for Cassandra 2.0"
>>>>> >> >> >
>>>>> >> >> > And looking at the SELECT doc, I saw:
>>>>> >> >> >
>>>>> >> >> > When not to use IN
>>>>> >> >> >
>>>>> >> >> > "The recommendations about when not to use an index apply to using IN in
>>>>> >> >> > the WHERE clause. Under most conditions, using IN in the WHERE clause is
>>>>> >> >> > not recommended. Using IN can degrade performance because usually many
>>>>> >> >> > nodes must be queried. For example, in a single, local data center
>>>>> >> >> > cluster having 30 nodes, a replication factor of 3, and a consistency
>>>>> >> >> > level of LOCAL_QUORUM, a single key query goes out to two nodes, but if
>>>>> >> >> > the query uses the IN condition, the number of nodes being queried are
>>>>> >> >> > most likely even higher, up to 20 nodes depending on where the keys fall
>>>>> >> >> > in the token range."
>>>>> >> >> >
>>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>>> >> >> >
>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>>>> >> >> >   'DC1' : 3 };
>>>>> >> >> > USE Identification1;
>>>>> >> >> >
>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>> >> >> >   name varchar,
>>>>> >> >> >   value varchar,
>>>>> >> >> >   entity_id uuid,
>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>>> >> >> >
>>>>> >> >> > And I use the following select to query it:
>>>>> >> >> >
>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>>>>> >> >> >
>>>>> >> >> > Is this an anti-pattern?
>>>>> >> >> >
>>>>> >> >> > If not using SELECT IN, which other way would you recommend for lookups
>>>>> >> >> > like that? I have several values I would like to search in Cassandra and
>>>>> >> >> > they might not be in the same partition, as above.
>>>>> >> >> >
>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>>> >> >> >
>>>>> >> >> > Best regards,
>>>>> >> >> > Marcelo Valle.
>>>>> >> >> >
>>>>> >> >>
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> --
>>>>> >> >> Jon Haddad
>>>>> >> >> http://www.rustyrazorblade.com
>>>>> >> >> skype: rustyrazorblade
