If querying by a date inequality is an important access pattern, you probably want a column that represents some time bucket (a month?), and you want that column to be part of the CQL primary key. Then, when a query comes in, you can make C* happy by specifying a date bucket to pick the C* row and the date inequality to slice the "CQL rows" -> columns. Of course this adds work for the client when dates span multiple buckets, but an open-ended date inequality is probably troublesome for massive datasets anyway.
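
A minimal sketch of the bucketing idea, assuming a month-sized bucket. The table name offerten_log_by_month, the month_bucket column, and both helper functions are hypothetical and do not appear in the thread. The sketch only builds CQL strings to show the extra client-side work when a range spans several buckets:

```python
from datetime import date
from typing import List

# Hypothetical bucketed table (an assumption, not from the thread): the
# partition key is (month_bucket, stunden) and datum is the first clustering
# column, so a date range is a contiguous slice inside one partition.
CREATE_TABLE = """
CREATE TABLE offerten_log_by_month (
    month_bucket text,      -- e.g. '2013-01'
    stunden int,
    datum timestamp,
    offerte_id int,
    klicks int,
    PRIMARY KEY ((month_bucket, stunden), datum, offerte_id)
);
"""


def month_buckets(start: date, end: date) -> List[str]:
    """Return 'YYYY-MM' labels for every month from start to end, inclusive."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}-{month:02d}")
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets


def bucket_queries(start: date, end: date, hour: int) -> List[str]:
    """Build one CQL statement per month bucket touched by the date range."""
    return [
        "SELECT * FROM offerten_log_by_month "
        f"WHERE month_bucket = '{bucket}' AND stunden = {hour} "
        f"AND datum > '{start.isoformat()}' AND datum <= '{end.isoformat()}';"
        for bucket in month_buckets(start, end)
    ]


if __name__ == "__main__":
    # One statement per bucket; the client merges the results itself.
    for query in bucket_queries(date(2013, 1, 1), date(2013, 3, 15), hour=0):
        print(query)
```

Each statement names its partition with equalities and only slices on datum, so no ALLOW FILTERING is needed. A real client would bind parameters through its driver instead of formatting strings, and the bucket size would depend on how much data lands in one month.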

On 02/03/2013 03:42 PM, Paul van Hoven wrote:
Thanks for the answer. Can anybody else answer my other two questions,
because my problem is not solved yet?

2013/2/3 Edward Capriolo <edlinuxg...@gmail.com>:
This was the issue that prompted the "ALLOW FILTERING" clause:

https://issues.apache.org/jira/browse/CASSANDRA-4915

Cassandra's storage system can only optimize certain queries.

On Sun, Feb 3, 2013 at 2:07 PM, Paul van Hoven
<paul.van.ho...@googlemail.com> wrote:
I'm not sure if I understood your answer.

When you have GB or TB of data any query that adds "ALLOW FILTERING"
will not work at scale.
1. You mean any query that requires "ALLOW FILTERING" is slow?

Secondary indexes need at least one equality. If you want to do this
at scale you might need a different design.
2. And what design would you recommend then?

3. What should the query look like so that it scales?



2013/2/3 Edward Capriolo <edlinuxg...@gmail.com>:
Secondary indexes need at least one equality. If you want to do this
at scale you might need a different design.

Using ALLOW FILTERING with LIMIT 10 simply grabs the first few
rows that happen to match your criteria.

When you have GB or TB of data, any query that adds "ALLOW FILTERING"
will not work at scale.

This is why it was added to the language. CQL lets you run some queries
that "seem fast" when you're developing with 10 rows. Without this
clause you would not know whether a query is fast because it hits a
Cassandra index or just because the results were found in
the first 10 rows.

Edward

On Sun, Feb 3, 2013 at 10:56 AM, Paul van Hoven
<paul.van.ho...@googlemail.com> wrote:
Okay, here is the schema (actually it is in German, but I translated
the column names so that it is easier to read for an international
audience):

cqlsh:demodb> describe table offerten_log_archiv;

CREATE TABLE offerten_log_archiv (
   offerte_id int PRIMARY KEY,
   aktionen int,
   angezeigt bigint,
   datum timestamp,
   gutschrift bigint,
   kampagne_id int,
   klicks int,
   klicks_ungueltig int,
   kosten bigint,
   statistik_id bigint,
   stunden int,
   werbeflaeche_id int,
   werbemittel_id int
) WITH
   bloom_filter_fp_chance=0.010000 AND
   caching='KEYS_ONLY' AND
   comment='' AND
   dclocal_read_repair_chance=0.000000 AND
   gc_grace_seconds=864000 AND
   read_repair_chance=0.100000 AND
   replicate_on_write='true' AND
   compaction={'class': 'SizeTieredCompactionStrategy'};

CREATE INDEX datum_key ON offerten_log_archiv (datum);

CREATE INDEX stunden_key ON offerten_log_archiv (stunden);

cqlsh:demodb>

This is the query I'm trying to perform:
cqlsh:demodb> select * from ola where date > '2013-01-01' and hour = 0
limit 10 allow filtering;
Request did not complete within rpc_timeout.

ola = offerten_log_archiv (table name)
hour = stunden (column name)
date = datum (column name)

I hope this information makes my problem clearer.



2013/2/3 Edward Capriolo <edlinuxg...@gmail.com>:
Without seeing your schema it is hard to say, but in some cases "ALLOW
FILTERING" might be considered "EXPECT THIS COULD BE SLOW". It could
mean the query is not hitting an index and is going to page through
large amounts of data.

On Sun, Feb 3, 2013 at 9:42 AM, Paul van Hoven
<paul.van.ho...@googlemail.com> wrote:
After figuring out how to use the ">" operator on a secondary index, I
noticed that in a column family of about 5.5 million records I get an
rpc_timeout when trying to read data from this table. In this concrete
case I want to request data newer than January 1, 2013. About 1 million
rows should match. When running the request I get a timeout error:

cqlsh:demodb> select * from ola where date > '2013-01-01' and hour = 0
limit 10 allow filtering;
Request did not complete within rpc_timeout.

Actually I find this very confusing, since I would expect an
exceptional performance gain in comparison to a similar SQL query.
Therefore, I think the query I'm performing is not appropriate for
Cassandra, although this is how I would write it for a
SQL database. So my question now is: How should I perform this query
on Cassandra?
