[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

Sylvain Lebresne (JIRA) Tue, 19 Nov 2013 01:30:21 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826335#comment-13826335
 ]


Sylvain Lebresne commented on CASSANDRA-6348:
---------------------------------------------

bq. Other than hadoop queries, It's common for user to query on multiple indexes

I sure hope you're wrong, and it certainly shouldn't be common, because 
Cassandra sucks at it. And I personally have almost never seen anyone do that 
(on the mailing list for instance). 

ALLOW FILTERING is really meant as a "don't do this unless you're just having 
fun with cqlsh on a toy database" option. Using ALLOW FILTERING in real 
production queries is wrong (at least for CQL queries, I'm not talking about 
Hadoop, which is a different problem). I'm more than happy to make the 
documentation/error message clearer about that fact if it's not.

bq. Hadoop Cql query uses "ALLOW FILTERING"

Which is kind of a problem, in the sense that it's not what ALLOW FILTERING was 
intended for and that, more generally, CQL was never designed with Hadoop in 
mind: it's a strictly real-time oriented language. So maybe we should 
re-purpose ALLOW FILTERING as "the Hadoop mode" somehow, but if we do, we 
should be explicit about it and think about how best to do that. But trying 
to shove Hadoop into something that wasn't made for it feels wrong to me.

That being said, I wonder if the "Hadoop wants to use the 2ndary indexes" 
problem couldn't be better solved, and more simply overall, by letting Hadoop 
query the 2ndary index CFS directly. That is, allow selects on the index itself 
(which would obviously require a special flag to unlock). That way, Hadoop 
would get paging over the index "for free" (which at the end of the day is the 
problem that needs solving, if I understand it correctly) and would get control 
over that paging. And it would allow Hadoop to do things like merging indexes, 
which probably make more sense on the Hadoop side than on the realtime 
side (i.e. we keep Cassandra focused on realtime queries with as little 
processing as possible, which is what it is good at).
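
To make the idea concrete, here is a minimal Python sketch of what the Hadoop 
side could do if it were allowed to page over an index directly. Everything in 
it is an assumption for illustration: the in-memory `Index` dict stands in for a 
2ndary index CF, and `fetch_index_page`, `page_index` and `merge_indexes` are 
hypothetical names, not existing Cassandra or Hadoop APIs.

```python
from typing import Dict, Iterator, List, Optional, Set

# Hypothetical stand-in for a 2ndary index CF:
# indexed value -> sorted list of base-row keys.
Index = Dict[str, List[int]]


def fetch_index_page(index: Index, value: str,
                     start_after: Optional[int], page_size: int) -> List[int]:
    """Return up to page_size keys for `value`, strictly after `start_after`."""
    keys = index.get(value, [])
    if start_after is not None:
        keys = [k for k in keys if k > start_after]
    return keys[:page_size]


def page_index(index: Index, value: str, page_size: int) -> Iterator[List[int]]:
    """Yield the index row page by page; the caller picks the page size."""
    last: Optional[int] = None
    while True:
        page = fetch_index_page(index, value, last, page_size)
        if not page:
            return
        yield page
        last = page[-1]  # resume after the last key seen


def merge_indexes(a: Index, a_value: str,
                  b: Index, b_value: str, page_size: int) -> Set[int]:
    """Client-side merge: keys present in both index rows (EQUAL clauses only)."""
    keys_a = {k for page in page_index(a, a_value, page_size) for k in page}
    keys_b = {k for page in page_index(b, b_value, page_size) for k in page}
    return keys_a & keys_b


by_city = {"paris": [1, 2, 3, 5, 8]}
by_plan = {"free": [2, 3, 4, 8, 9]}
pages = list(page_index(by_city, "paris", 2))
merged = sorted(merge_indexes(by_city, "paris", by_plan, "free", 2))
print(pages)   # [[1, 2], [3, 5], [8]]
print(merged)  # [2, 3, 8]
```

Each page is a small, bounded read, so no single request has to walk the whole 
index row, and the merge work happens on the Hadoop side rather than inside 
Cassandra.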


> TimeoutException throws if Cql query allows data filtering and index is too 
> big and it can't find the data in base CF after filtering 
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6348
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6348
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Alex Liu
>            Assignee: Alex Liu
>
> If the index row is too big and filtering can't find a matching CQL row in 
> the base CF, it keeps scanning the index row and reading the base CF until 
> the index row is scanned completely, which may take too long, and the thrift 
> server returns a TimeoutException. This is one of the reasons why we 
> shouldn't index a column if the index is too big.
> Multiple indexes merging can resolve the case where there are only EQUAL 
> clauses. (CASSANDRA-6048 addresses it).
> If the query has non-EQUAL clauses, we still need to do data filtering, which 
> might lead to a timeout exception.
> We can either disable those kinds of queries or WARN the user that data 
> filtering might lead to a timeout exception or OOM.
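
The scan behaviour described in the issue can be modelled roughly as follows. 
This is only a sketch under stated assumptions: `filtered_index_scan` is a 
hypothetical function, the dict `base_cf` stands in for the base CF, and a 
fixed budget of base reads (`max_base_reads`) replaces the real wall-clock RPC 
timeout so the example is deterministic.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]


class ReadBudgetExceeded(Exception):
    """Stand-in for the TimeoutException mentioned in the issue."""


def filtered_index_scan(index_row: List[int], base_cf: Dict[int, Row],
                        predicate: Callable[[Row], bool], limit: int,
                        max_base_reads: int) -> Tuple[List[int], int]:
    """Walk one index row, read each base row, keep those passing the filter.

    Returns (matching keys, number of base reads performed).
    """
    matches: List[int] = []
    reads = 0
    for key in index_row:
        if reads >= max_base_reads:
            raise ReadBudgetExceeded(f"gave up after {reads} base reads")
        reads += 1                      # one base-CF read per index entry
        if predicate(base_cf[key]):
            matches.append(key)
            if len(matches) >= limit:   # only a match lets the scan stop early
                break
    return matches, reads


base_cf = {k: {"age": k % 50} for k in range(10_000)}
index_row = list(range(10_000))         # one very wide index row

# A filter that matches early stops after a handful of reads.
found, reads = filtered_index_scan(
    index_row, base_cf, lambda r: r["age"] == 7, limit=1, max_base_reads=1000)
print(found, reads)  # [7] 8

# A filter that never matches keeps reading until the budget runs out.
try:
    filtered_index_scan(
        index_row, base_cf, lambda r: r["age"] > 100, limit=1, max_base_reads=1000)
    timed_out = False
except ReadBudgetExceeded:
    timed_out = True
print(timed_out)  # True
```

The cost is driven by the width of the index row rather than by the number of 
results, which is why a selective filter over a large index can time out even 
when it returns nothing.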



--
This message was sent by Atlassian JIRA
(v6.1#6144)
