[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

Alex Liu (JIRA) Fri, 22 Nov 2013 15:12:05 -0800

    [ https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830401#comment-13830401 ]


Alex Liu commented on CASSANDRA-6348:
-------------------------------------

rowsPerQuery is used only as the page size for the index CF during a 2i 
(secondary index) search.

For a CQL query, maxColumns() is the value of the LIMIT clause. If meanColumns 
is large, filter.maxColumns() / meanColumns is less than 1 (0 under integer 
division), so rowsPerQuery falls back to the minimum of 2. A page size of 2 
for the index CF is too small: we end up with too many random seeks between 
the index CF and the base CF, which is why a 2i search is sometimes so slow. 
We need to keep the index CF page size from getting too small. The goal is a 
page size large enough to reduce random seeks between the index CF and the 
base CF, but not so large that it causes OOM.

If data filtering is involved and many base CF columns don't match the 
filter, the small page size makes the problem even worse, because we have to 
page through more of the index CF.
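
To put rough numbers on the paging cost described above, here is a minimal 
sketch. It is not Cassandra code: the class name, the method name and the 
index row width of 1,000,000 entries are all made up for illustration. It 
only counts how many index CF pages have to be read when the filter matches 
nothing and the whole index row is scanned.

```java
// Illustrative only: hypothetical class, not part of Cassandra.
final class IndexPagingCost
{
    // Number of index CF pages read when every entry of the index row
    // has to be scanned (ceiling division of entries by page size).
    static long pagesToScan(long indexEntries, int pageSize)
    {
        return (indexEntries + pageSize - 1) / pageSize;
    }

    public static void main(String[] args)
    {
        long entries = 1_000_000L; // assumed index row width

        System.out.println(pagesToScan(entries, 2));      // 500000
        System.out.println(pagesToScan(entries, 10_000)); // 100
    }
}
```

Each page is followed by lookups in the base CF, so the number of pages is a 
rough proxy for the number of round trips between the two CFs.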

{code}
    public int maxRows()
    {
        return countCQL3Rows ? Integer.MAX_VALUE : maxResults;
    }

    public int maxColumns()
    {
        return countCQL3Rows ? maxResults : Integer.MAX_VALUE;
    }
{code}

For a non-CQL query:
{code}
            rowsPerQuery = Math.max(Math.min(filter.maxResults, Integer.MAX_VALUE / meanColumns), 2);
            // most likely becomes: rowsPerQuery = Math.max(filter.maxResults, 2);
            // most likely becomes: rowsPerQuery = filter.maxResults,
            // which is the same as the number of rows to fetch
{code}

For a CQL query:
{code}
            rowsPerQuery = Math.max(Math.min(Integer.MAX_VALUE, filter.maxResults / meanColumns), 2);
            // most likely becomes: rowsPerQuery = Math.max(filter.maxResults / meanColumns, 2);
            // most likely becomes: rowsPerQuery = filter.maxResults / meanColumns
            // If meanColumns is too big, the quotient can be less than 1
            // (0 under integer division), so rowsPerQuery ends up as 2.
            // If the CQL query has no LIMIT clause, it becomes
            // Integer.MAX_VALUE / meanColumns, which is a big number.
{code}
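
The two cases above can be checked with a small stand-alone sketch. The 
expression is the one quoted in this comment; the class name, the helper 
names and the sample values for the limit and meanColumns are made up for 
illustration, and meanColumns is assumed to be greater than 0.

```java
// Illustrative only: hypothetical class, not part of Cassandra.
final class IndexPageSizeDemo
{
    // The expression discussed above:
    // min(maxRows, maxColumns / meanColumns), with a floor of 2.
    static int rowsPerQuery(int maxRows, int maxColumns, int meanColumns)
    {
        return Math.max(Math.min(maxRows, maxColumns / meanColumns), 2);
    }

    // Non-CQL query: maxRows() = maxResults, maxColumns() = Integer.MAX_VALUE.
    static int forNonCql(int maxResults, int meanColumns)
    {
        return rowsPerQuery(maxResults, Integer.MAX_VALUE, meanColumns);
    }

    // CQL query: maxRows() = Integer.MAX_VALUE, maxColumns() = maxResults.
    static int forCql(int maxResults, int meanColumns)
    {
        return rowsPerQuery(Integer.MAX_VALUE, maxResults, meanColumns);
    }

    public static void main(String[] args)
    {
        System.out.println(forNonCql(100, 50));            // 100
        System.out.println(forCql(100, 50));               // 2
        System.out.println(forCql(100, 500));              // 2 (100 / 500 == 0)
        System.out.println(forCql(Integer.MAX_VALUE, 50)); // 42949672
    }
}
```

With the same limit of 100, the non-CQL path pages the index 100 entries at 
a time while the CQL path drops to the floor of 2.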

So the question is how to calculate the page size for the index CF so that 
we don't have too many random seeks between the index CF and the base CF, 
while also avoiding fetching so many index columns that we risk OOM.
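
One way to frame that trade-off, purely as a sketch, is to clamp the 
estimate between a lower and an upper bound. The bounds of 1,000 and 10,000 
below are hypothetical numbers chosen for illustration, not values proposed 
in this ticket or taken from the attached patch, and the class and method 
names are made up.

```java
// Illustrative only: hypothetical class and bounds, not part of Cassandra.
final class ClampedIndexPageSize
{
    static final int MIN_PAGE_SIZE = 1_000;  // assumed floor, limits random seeks
    static final int MAX_PAGE_SIZE = 10_000; // assumed ceiling, limits memory use

    // Clamp an estimated row count into [MIN_PAGE_SIZE, MAX_PAGE_SIZE].
    static int pageSize(int estimatedRows)
    {
        return Math.max(MIN_PAGE_SIZE, Math.min(estimatedRows, MAX_PAGE_SIZE));
    }

    public static void main(String[] args)
    {
        System.out.println(pageSize(0));                 // 1000
        System.out.println(pageSize(5_000));             // 5000
        System.out.println(pageSize(Integer.MAX_VALUE)); // 10000
    }
}
```

What the right bounds are would depend on the average index entry size and 
the available heap, which this sketch does not model.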



> TimeoutException throws if Cql query allows data filtering and index is too 
> big and it can't find the data in base CF after filtering 
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6348
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6348
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Alex Liu
>            Assignee: Alex Liu
>         Attachments: 6348.txt
>
>
> If an index row is too big and filtering can't find a matching CQL row in 
> the base CF, it keeps scanning the index row and reading from the base CF 
> until the index row has been scanned completely, which may take too long, 
> and the Thrift server returns a TimeoutException. This is one of the reasons 
> why we shouldn't index a column if the index is too big.
> Merging multiple indexes can resolve the case where there are only EQUAL 
> clauses (CASSANDRA-6048 addresses it).
> If the query has non-EQUAL clauses, we still need to do data filtering, 
> which might lead to a timeout exception.
> We can either disable those kinds of queries or WARN the user that data 
> filtering might lead to a timeout exception or OOM.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
