[
https://issues.apache.org/jira/browse/CASSANDRA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sylvain Lebresne updated CASSANDRA-15432:
-----------------------------------------
Fix Version/s: (was: 3.11.x)
(was: 4.x)
(was: 3.0.x)
4.0-beta2
3.11.8
3.0.22
Since Version: 1.1.0
Source Control Link:
3.0:https://github.com/apache/cassandra/commit/e2ecdf268a82fa3ac0f4c9fe77ab35bca33cc72a,
3.11:https://github.com/apache/cassandra/commit/ecd23f1da5894511cccac6c8445f962f3b73f733,
trunk:https://github.com/apache/cassandra/commit/efce6b39fb557314fad0cb56b0
Resolution: Fixed
Status: Resolved (was: Ready to Commit)
Thanks for the review. CI doesn't seem to show anything new broken so committed.
> The "read defragmentation" optimization does not work
> -----------------------------------------------------
>
> Key: CASSANDRA-15432
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15432
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Local Write-Read Paths
> Reporter: Sylvain Lebresne
> Assignee: Sylvain Lebresne
> Priority: Normal
> Fix For: 3.0.22, 3.11.8, 4.0-beta2
>
>
> The so-called "read defragmentation" that has been added way back with
> CASSANDRA-2503 actually does not work, and never has. That is, the
> defragmentation writes do happen, but they only additional load on the nodes
> without helping anything, and are thus a clear negative.
> The "read defragmentation" (which only impact so-called "names queries")
> kicks in when a read hits "too many" sstables (> 4 by default), and when it
> does, it writes down the result of that read. The assumption being that the
> next read for that data would only read the newly written data, which if not
> still in memtable would at least be in a single sstable, thus speeding that
> next read.
> Unfortunately, this is not how this work. When we defrag and write the result
> of our original read, we do so with the timestamp of the data read (as we
> should, changing the timestamp would be plain wrong). And as a result,
> following reads will read that data first, but will have no way to tell that
> no more sstables should be read. Technically, the
> [{{reduceFilter}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SinglePartitionReadCommand.java#L830]
> call will not return {{null}} because the {{currentMaxTs}} will be higher
> than at least some of the data in the result, and this until we've read from
> as many sstables than in the original read.
> I see no easy way to fix this. It might be possible to make it work with
> additional per-sstable metadata, but nothing sufficiently simple and cheap to
> be worth it comes to mind. And I thus suggest simply removing that code.
> For the record, I'll note that there is actually a 2nd problem with that
> code: currently, we "defrag" a read even if we didn't got data for everything
> that the query requests. This also is "wrong" even if we ignore the first
> issue: a following read that would read the defragmented data would also have
> no way to know to not read more sstables to try to get the missing parts.
> This problem would be fixeable, but is obviously overshadowed by the previous
> one anyway.
> Anyway, as mentioned, I suggest to just remove the "optimization" (which
> again, never optimized anything) altogether, and happy to provide the simple
> patch.
> The only question might be in which versions? This impact all versions, but
> this isn't a correction bug either, "just" a performance one. So do we want
> 4.0 only or is there appetite for earlier?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]