[
https://issues.apache.org/jira/browse/CASSANDRA-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903022#comment-14903022
]
Sylvain Lebresne commented on CASSANDRA-10342:
----------------------------------------------
bq. This is not actually true at the moment.
Let's be precise. We do use the time-ordered path in 3.0 much more often than
we did before 3.0 (basically, pre-3.0 we never used that path for CQL tables;
we do now). However, we do query all columns every time, which potentially
limits the efficiency of that path. Still, if a row is queried a lot,
"defragmenting" it (entirely) is going to help.
> Read defragmentation can cause unnecessary repairs
> --------------------------------------------------
>
> Key: CASSANDRA-10342
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10342
> Project: Cassandra
> Issue Type: Bug
> Reporter: Marcus Olsson
> Assignee: Marcus Eriksson
> Priority: Minor
>
> After applying the fix from CASSANDRA-10299 to the cluster, we started having
> a problem of ~20k small sstables appearing for the table with static data
> when running incremental repair.
> In the logs there were several messages about flushes for that table, one for
> each repaired range. The flushed sstables were 0.000kb in size with < 100 ops
> in each. Checking cfstats showed several writes to that table, even though we
> were only reading from it and read repair did not repair anything.
> After digging around in the codebase I noticed that defragmentation of data
> can occur while reading, depending on the query and some other conditions.
> This causes the read data to be inserted again so that it ends up in a more
> recent sstable, which can be a problem if that data was repaired using
> incremental repair. The defragmentation is done in
> [CollationController.java|https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/db/CollationController.java#L151].
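> For illustration, here is a rough, self-contained sketch of the shape of that
> check. All names here (DefragSketch, MergedRow, WriteStage, etc.) are made-up
> stand-ins, not the actual Cassandra classes, and the real logic around the
> linked line may differ in detail:
> {code:java}
> // Illustrative sketch only: models the read-defragmentation decision, not the real implementation.
> public class DefragSketch
> {
>     /** Simplified stand-in for the merged result of a single-partition read. */
>     static class MergedRow
>     {
>         final String partitionKey;
>         MergedRow(String partitionKey) { this.partitionKey = partitionKey; }
>     }
>
>     interface CompactionStrategy
>     {
>         boolean shouldDefragment();          // only STCS seems to opt in, at least in 2.1
>         int minimumCompactionThreshold();    // 4 in our case
>     }
>
>     interface WriteStage
>     {
>         // Re-inserts the merged row; it lands in the memtable and later in a brand new sstable.
>         void reinsert(MergedRow row);
>     }
>
>     /**
>      * If the read had to touch more sstables than the compaction threshold, the merged
>      * result is written back ("hoisted up") so future reads hit fewer sstables.
>      */
>     static void maybeDefragment(MergedRow row,
>                                 int sstablesIterated,
>                                 boolean autoCompactionDisabled,
>                                 CompactionStrategy strategy,
>                                 WriteStage writeStage)
>     {
>         if (sstablesIterated > strategy.minimumCompactionThreshold()
>             && !autoCompactionDisabled
>             && strategy.shouldDefragment())
>         {
>             // This is the write that shows up in cfstats and, under incremental repair,
>             // re-introduces already-repaired data into the unrepaired set.
>             writeStage.reinsert(row);
>         }
>     }
>
>     public static void main(String[] args)
>     {
>         CompactionStrategy stcs = new CompactionStrategy()
>         {
>             public boolean shouldDefragment() { return true; }
>             public int minimumCompactionThreshold() { return 4; }
>         };
>         WriteStage stage = row -> System.out.println("re-inserting partition " + row.partitionKey);
>
>         maybeDefragment(new MergedRow("pk1"), 5, false, stcs, stage); // 5 > 4: triggers write-back
>         maybeDefragment(new MergedRow("pk2"), 3, false, stcs, stage); // 3 <= 4: no write-back
>     }
> }
> {code}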
> I guess this wasn't a problem with full repairs, since I assume the digest
> should be the same even if you have two copies of the same data. But with
> incremental repair this will most probably cause a mismatch between nodes if
> that data was already repaired, since the other nodes probably won't have
> that data in their unrepaired set.
> ------
> I can add that the problems on our cluster were probably due to the fact that
> CASSANDRA-10299 caused the same data to be streamed multiple times, ending
> up in several sstables. One of the conditions for the defragmentation is that
> the number of sstables read during a read request has to be more than the
> minimum number of sstables needed for a compaction (> 4 in our case). So
> normally I don't think this would cause ~20k sstables to appear; we probably
> hit an extreme case.
> One workaround for this is to use a compaction strategy other than STCS (it
> seems to be the only affected strategy, at least in 2.1), but the solution
> might be to either make defragmentation configurable per table or avoid
> reinserting the data if any of the sstables involved in the read are repaired.
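> To illustrate the second suggestion, a minimal sketch of the kind of guard I
> have in mind (again with made-up stand-in types; the real check would need to
> look at the repairedAt metadata of the sstables that served the read):
> {code:java}
> import java.util.Arrays;
> import java.util.List;
>
> // Illustrative sketch only: skip the read write-back when any sstable that
> // contributed to the result is already marked as repaired.
> public class SkipDefragIfRepaired
> {
>     static class SSTable
>     {
>         final long repairedAt; // 0 meaning "unrepaired", as in the sstable metadata
>         SSTable(long repairedAt) { this.repairedAt = repairedAt; }
>         boolean isRepaired() { return repairedAt > 0; }
>     }
>
>     /** Only allow defragmentation when every sstable touched by the read is still unrepaired. */
>     static boolean defragmentationAllowed(List<SSTable> sstablesRead)
>     {
>         for (SSTable sstable : sstablesRead)
>             if (sstable.isRepaired())
>                 return false;
>         return true;
>     }
>
>     public static void main(String[] args)
>     {
>         List<SSTable> allUnrepaired = Arrays.asList(new SSTable(0), new SSTable(0));
>         List<SSTable> oneRepaired = Arrays.asList(new SSTable(0), new SSTable(1442830000000L));
>         System.out.println(defragmentationAllowed(allUnrepaired)); // true: safe to defragment
>         System.out.println(defragmentationAllowed(oneRepaired));   // false: skip the re-insert
>     }
> }
> {code}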
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)