Re: Best way to do a multi_get using CQL

Jonathan Haddad Thu, 19 Jun 2014 20:29:21 -0700

The only case in which an IN clause might be the better choice is when
the entire query can be satisfied by a single node, i.e. every key in
the IN list lives on the same replica.  Otherwise, go async.
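
A minimal sketch of the "go async" route with the Python driver,
assuming the entity_lookup table quoted further down the thread.  The
function name lookup_entity_ids is made up for illustration, and
session is expected to be a driver Session (anything that offers
prepare() and execute_async() will do).  It prepares a
single-partition SELECT and fires one async query per value instead of
one IN query:

```python
def lookup_entity_ids(session, name, values):
    """Fetch entity_ids for (name, value) pairs, one async query per partition."""
    # A prepared statement lets a token-aware driver derive the routing key
    # from the bound partition key columns (name, value).
    prepared = session.prepare(
        "SELECT entity_id FROM entity_lookup WHERE name = ? AND value = ?"
    )

    # Send every query first; execute_async returns a future immediately.
    futures = [
        (value, session.execute_async(prepared, (name, value)))
        for value in values
    ]

    entity_ids = set()
    for value, future in futures:
        # result() blocks until this query finishes and re-raises its error, if any.
        for row in future.result():
            entity_ids.add(row.entity_id)
    return entity_ids
```

Token-aware routing in the Python driver depends on the statement
carrying a routing key, which prepared statements supply automatically,
hence the prepare() call rather than string formatting.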


The native driver reuses connections and intelligently manages the
pool for you.  It can also multiplex queries over a single connection.
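
To benefit from that, the usual approach is to build the Cluster and
Session once per process and share them, rather than connecting per
query.  A rough sketch, assuming the DataStax Python driver: get_session
and its default contact point are hypothetical, 'DC1' and the keyspace
come from the schema quoted below (unquoted CQL identifiers are
lowercased), and newer driver versions prefer execution profiles over
the load_balancing_policy argument used here:

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_session(contact_points=("127.0.0.1",), local_dc="DC1",
                keyspace="identification1"):
    """Create the Cluster/Session once and hand the same Session to every caller."""
    # Imported lazily so this module still loads where the driver is not installed.
    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    cluster = Cluster(
        contact_points=list(contact_points),
        # TokenAwarePolicy wraps a child policy and prefers replicas that own
        # the partition key, so the request skips the extra coordinator hop.
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc=local_dc)
        ),
    )
    return cluster.connect(keyspace)
```

Every caller that invokes get_session() with the same arguments gets
the same Session back, so all async queries share the driver's
connection pool.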

I am assuming you're using one of the datastax drivers for CQL, btw.

Jon

On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
<marc...@s1mbi0se.com.br> wrote:
> This is interesting, I didn't know that!
> It might make sense then to use select = + async + token aware, I will try
> to change my code.
>
> But would it be a "recommended solution" for these cases? Any other options?
>
> I still wonder if this is the right use case for Cassandra: looking up
> random keys in a huge cluster. After all, the number of connections to
> Cassandra will still be huge, right? Wouldn't that be a problem?
> Or does the driver reuse the connection when you use async?
>
> []s
>
>
> 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>
>> If you use async and your driver is token aware, it will go to the
>> proper node, rather than requiring the coordinator to do so.
>>
>> Realistically you're going to have a connection open to every server
>> anyways.  It's the difference between you querying for the data
>> directly and using a coordinator as a proxy.  It's faster to just ask
>> the node with the data.
>>
>> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>> <marc...@s1mbi0se.com.br> wrote:
>> > But wouldn't using async queries be even worse than using SELECT IN?
>> > The justification in the docs is that I could end up querying many nodes,
>> > but I would still do that with async queries.
>> >
>> > Today, I use both async queries AND SELECT IN:
>> >
>> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)"
>> >
>> > for name, values in identifiers.items():
>> >    query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
>> >    args = [name] + values
>> >    query_msg = query % tuple(args)
>> >    futures.append((query_msg, self.session.execute_async(query, args)))
>> >
>> > for query_msg, future in futures:
>> >    try:
>> >       rows = future.result(timeout=100000)
>> >       for row in rows:
>> >         entity_ids.add(row.entity_id)
>> >    except Exception:
>> >       logging.error("Query '%s' returned ERROR", query_msg)
>> >       raise
>> >
>> > Using async with just select = would mean that instead of 1 async query
>> > (example: in (0, 1, 2)), I would do several, one for each value of the
>> > "values" array above.
>> > In my head, this would mean more connections to Cassandra and the same
>> > amount of work, right? What would be the advantage?
>> >
>> > []s
>> >
>> >
>> >
>> >
>> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>> >
>> >> Your other option is to fire off async queries.  It's pretty
>> >> straightforward w/ the java or python drivers.
>> >>
>> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>> >> <marc...@s1mbi0se.com.br> wrote:
>> >> > I was taking a look at Cassandra anti-patterns list:
>> >> >
>> >> >
>> >> >
>> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>> >> >
>> >> > Among them is:
>> >> >
>> >> > "SELECT ... IN or index lookups
>> >> >
>> >> > SELECT ... IN and index lookups (formerly secondary indexes) should be
>> >> > avoided except for specific scenarios. See When not to use IN in SELECT
>> >> > and When not to use an index in Indexing in CQL for Cassandra 2.0"
>> >> >
>> >> > And looking at the SELECT doc, I saw:
>> >> >
>> >> > "When not to use IN
>> >> >
>> >> > The recommendations about when not to use an index apply to using IN in
>> >> > the WHERE clause. Under most conditions, using IN in the WHERE clause is
>> >> > not recommended. Using IN can degrade performance because usually many
>> >> > nodes must be queried. For example, in a single, local data center
>> >> > cluster having 30 nodes, a replication factor of 3, and a consistency
>> >> > level of LOCAL_QUORUM, a single key query goes out to two nodes, but if
>> >> > the query uses the IN condition, the number of nodes being queried are
>> >> > most likely even higher, up to 20 nodes depending on where the keys fall
>> >> > in the token range."
>> >> >
>> >> > In my system, I have a column family called "entity_lookup":
>> >> >
>> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>> >> >   'DC1' : 3 };
>> >> > USE Identification1;
>> >> >
>> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>> >> >   name varchar,
>> >> >   value varchar,
>> >> >   entity_id uuid,
>> >> >   PRIMARY KEY ((name, value), entity_id));
>> >> >
>> >> > And I use the following select to query it:
>> >> >
>> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>> >> >
>> >> > Is this an anti-pattern?
>> >> >
>> >> > If not SELECT IN, which other way would you recommend for lookups like
>> >> > that? I have several values I would like to look up in Cassandra, and
>> >> > they might not be in the same partition, as above.
>> >> >
>> >> > Is Cassandra the wrong tool for lookups like that?
>> >> >
>> >> > Best regards,
>> >> > Marcelo Valle.
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Jon Haddad
>> >> http://www.rustyrazorblade.com
>> >> skype: rustyrazorblade
>> >
>> >
>>
>>
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> skype: rustyrazorblade
>
>



-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
