[
https://issues.apache.org/jira/browse/CASSANDRA-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181200#comment-13181200
]
Sylvain Lebresne commented on CASSANDRA-1956:
---------------------------------------------
bq. That, and "I want to cache a specific set of known-ahead-of-time columns
[maybe the entire row]," which is what today's row cache is mostly used for.
That is trivially handled by the filter-per-cf approach I'm advocating,
contrary to the query cache solution.
bq. I think it's a huge, huge win for a design to be able to handle both of
these, without requiring it to be specified in the schema.
Again, I really don't think specifying it in the schema is such a big deal in
that case (I insist on the "in that case"; I'm *not* pretending hand-tuning is
never a big deal), nor does it feel like a hard thing to get right.
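To make concrete what "specifying it in the schema" could look like, here is a
rough, purely illustrative sketch of a per-CF cache filter; the RowCacheFilter
name and everything in it are hypothetical, not existing code:
{code:java}
import java.nio.ByteBuffer;

// Hypothetical per-column-family cache filter: declared once in the schema,
// it describes the only slice of each row that the row cache will keep.
public final class RowCacheFilter
{
    public final ByteBuffer start;    // empty buffer = from the first column
    public final ByteBuffer finish;   // empty buffer = up to the last column
    public final boolean reversed;    // false = head of the row, true = tail
    public final int count;           // max number of columns kept per row

    public RowCacheFilter(ByteBuffer start, ByteBuffer finish, boolean reversed, int count)
    {
        this.start = start;
        this.finish = finish;
        this.reversed = reversed;
        this.count = count;
    }

    // e.g. "cache the first 100 columns of every row" for a wide-row CF
    public static RowCacheFilter head(int count)
    {
        ByteBuffer empty = ByteBuffer.allocate(0);
        return new RowCacheFilter(empty, empty, false, count);
    }
}
{code}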
Now don't get me wrong, I agree that self-tuning is great, but only if we know
how to do it correctly. In particular, to refer to some of the ideas above, I
think that if users have to think about which query to issue to get good
caching (like using select * when they really want select x, y but want to
keep the full row in cache, or being careful not to use too many different
queries for a given row because that won't play well with the cache), then 1)
it's still hand-tuning and 2) one that is imo far less convenient/intuitive.
Basically what I'm saying is that with a query cache, I see a number of
unknowns and added difficulties (what about the space taken by all those
per-query filters? how do we make sure to cache the full row when that's the
right thing to do, without any user intervention? etc.), and a number of cases
where it will be less efficient than the filter-per-cf alternative unless the
user is super careful (will that be a problem in real life? maybe not, but
maybe). On the other hand, adding a simple per-cf filter is a nice, simple
increment over what we have, and we stay in known territory while solving the
problem we want to solve.
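To illustrate the space point with a deliberately toy sketch (none of these
classes are the real cache code, and all names are made up): a query cache has
to key entries on row plus filter, so every distinct slice against the same
row becomes a separate entry, whereas a per-CF filter cache keys on the row
alone:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy query cache: one entry per (row key, filter) pair, so ten different
// slices against the same wide row mean ten cached, overlapping copies.
class QueryCache<K, F, V>
{
    private final Map<Map.Entry<K, F>, V> cache = new ConcurrentHashMap<>();

    void put(K rowKey, F filter, V columns)
    {
        cache.put(Map.entry(rowKey, filter), columns);
    }

    V get(K rowKey, F filter)
    {
        return cache.get(Map.entry(rowKey, filter));
    }
}

// Toy per-CF filter cache: the filter is fixed for the whole column family,
// so the key is just the row key and there is exactly one entry per row.
class FilteredRowCache<K, V>
{
    private final Map<K, V> cache = new ConcurrentHashMap<>();

    void put(K rowKey, V filteredColumns) { cache.put(rowKey, filteredColumns); }
    V get(K rowKey)                       { return cache.get(rowKey); }
}
{code}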
Besides, if specifying a filter with the schema is that much of a problem,
maybe we can make that choice automatically. We have stats on the average and
max row size, and we can easily start gathering some simple stats on queries,
at least enough to tell whether it's the head or the tail that we need to keep
in cache for wide rows. Though honestly, even if we do that, my preference
would still largely be to let the user override whatever automatic choice we
came up with if they wish.
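As a purely hypothetical sketch of what that automatic choice could look like
(nothing below exists in the codebase): count forward versus reversed slice
reads per CF, cache whichever end of the row is read the most, and let an
explicit schema-level filter win over the heuristic:
{code:java}
// Hypothetical stats-driven default: an explicit user setting always wins.
final class CacheFilterChooser
{
    static final class CachedSlice
    {
        final boolean reversed; // false = head of the row, true = tail
        final int count;        // columns kept per row, e.g. from avg row size

        CachedSlice(boolean reversed, int count)
        {
            this.reversed = reversed;
            this.count = count;
        }
    }

    private long headSlices; // forward slice queries observed
    private long tailSlices; // reversed slice queries observed

    void recordSlice(boolean reversed)
    {
        if (reversed) tailSlices++; else headSlices++;
    }

    CachedSlice choose(CachedSlice userOverride, int columnsToCache)
    {
        if (userOverride != null)
            return userOverride;                     // the user's choice wins
        return new CachedSlice(tailSlices > headSlices, columnsToCache);
    }
}
{code}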
> Convert row cache to row+filter cache
> -------------------------------------
>
> Key: CASSANDRA-1956
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1956
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Stu Hood
> Assignee: Vijay
> Priority: Minor
> Fix For: 1.2
>
> Attachments: 0001-1956-cache-updates-v0.patch,
> 0001-re-factor-row-cache.patch, 0001-row-cache-filter.patch,
> 0002-1956-updates-to-thrift-and-avro-v0.patch, 0002-add-query-cache.patch
>
>
> Changing the row cache to a row+filter cache would make it much more useful.
> We currently have to warn against using the row cache with wide rows, where
> the read pattern is typically a peek at the head, but this use case would be
> perfectly supported by a cache that stored only the columns matching the filter.
> Possible implementations:
> * (cop-out) Cache a single filter per row, and leave the cache key as is
> * Cache a list of filters per row, leaving the cache key as is: this is
> likely to have some gotchas for weird usage patterns, and it requires the
> list overhead
> * Change the cache key to "rowkey+filterid": basically ideal, but you need a
> secondary index to look up cache entries by rowkey so that you can keep them
> in sync with the memtable (see the sketch after this list)
> * others?
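For illustration only, and not part of the ticket text above: the secondary
index the third option needs is essentially a reverse map from row key to the
cache keys stored for that row, so a memtable update can find and drop every
cached variant. A toy sketch with made-up names:
{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy sketch of the "rowkey+filterid" option: entries are keyed on both the
// row key and a filter id, and a side index tracks which cache keys exist
// per row so a write to that row can invalidate all of them.
final class RowFilterCache<K, V>
{
    private final Map<String, V> cache = new ConcurrentHashMap<>();
    private final Map<K, Set<String>> keysByRow = new ConcurrentHashMap<>();

    private String key(K rowKey, String filterId)
    {
        return rowKey + "#" + filterId;
    }

    void put(K rowKey, String filterId, V columns)
    {
        String k = key(rowKey, filterId);
        cache.put(k, columns);
        keysByRow.computeIfAbsent(rowKey, r -> ConcurrentHashMap.newKeySet()).add(k);
    }

    V get(K rowKey, String filterId)
    {
        return cache.get(key(rowKey, filterId));
    }

    // Called on memtable updates: drop every cached filter variant of the row.
    void invalidate(K rowKey)
    {
        Set<String> keys = keysByRow.remove(rowKey);
        if (keys != null)
            for (String k : keys)
                cache.remove(k);
    }
}
{code}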