[ 
https://issues.apache.org/jira/browse/CASSANDRA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-15432:
------------------------------------
     Bug Category: Parent values: Correctness(12982)Level 1 values: API / 
Semantic Implementation(12988)
       Complexity: Normal
      Component/s: Legacy/Local Write-Read Paths
    Discovered By: Code Inspection
         Severity: Normal
           Status: Open  (was: Triage Needed)

> The "read defragmentation" optimization does not work
> -----------------------------------------------------
>
>                 Key: CASSANDRA-15432
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15432
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Sylvain Lebresne
>            Priority: Normal
>
> The so-called "read defragmentation" that has been added way back with 
> CASSANDRA-2503 actually does not work, and never has. That is, the 
> defragmentation writes do happen, but they only additional load on the nodes 
> without helping anything, and are thus a clear negative.
> The "read defragmentation" (which only impact so-called "names queries") 
> kicks in when a read hits "too many" sstables (> 4 by default), and when it 
> does, it writes down the result of that read. The assumption being that the 
> next read for that data would only read the newly written data, which if not 
> still in memtable would at least be in a single sstable, thus speeding that 
> next read.
> Unfortunately, this is not how this work. When we defrag and write the result 
> of our original read, we do so with the timestamp of the data read (as we 
> should, changing the timestamp would be plain wrong). And as a result, 
> following reads will read that data first, but will have no way to tell that 
> no more sstables should be read. Technically, the 
> [{{reduceFilter}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SinglePartitionReadCommand.java#L830]
>  call will not return {{null}} because the {{currentMaxTs}} will be higher 
> than at least some of the data in the result, and this until we've read from 
> as many sstables than in the original read.
> I see no easy way to fix this. It might be possible to make it work with 
> additional per-sstable metadata, but nothing sufficiently simple and cheap to 
> be worth it comes to mind. And I thus suggest simply removing that code.
> For the record, I'll note that there is actually a 2nd problem with that 
> code: currently, we "defrag" a read even if we didn't got data for everything 
> that the query requests. This also is "wrong" even if we ignore the first 
> issue: a following read that would read the defragmented data would also have 
> no way to know to not read more sstables to try to get the missing parts. 
> This problem would be fixeable, but is obviously overshadowed by the previous 
> one anyway.
> Anyway, as mentioned, I suggest to just remove the "optimization" (which 
> again, never optimized anything) altogether, and happy to provide the simple 
> patch.
> The only question might be in which versions? This impact all versions, but 
> this isn't a correction bug either, "just" a performance one. So do we want 
> 4.0 only or is there appetite for earlier?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to