[jira] [Updated] (CASSANDRA-15432) The "read defragmentation" optimization does not work
[ https://issues.apache.org/jira/browse/CASSANDRA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-15432:
-----------------------------------------
    Fix Version/s: (was: 3.11.x)
                   (was: 4.x)
                   (was: 3.0.x)
                   4.0-beta2
                   3.11.8
                   3.0.22
    Since Version: 1.1.0
    Source Control Link: 3.0: https://github.com/apache/cassandra/commit/e2ecdf268a82fa3ac0f4c9fe77ab35bca33cc72a, 3.11: https://github.com/apache/cassandra/commit/ecd23f1da5894511cccac6c8445f962f3b73f733, trunk: https://github.com/apache/cassandra/commit/efce6b39fb557314fad0cb56b0
    Resolution: Fixed
    Status: Resolved  (was: Ready to Commit)

Thanks for the review. CI doesn't seem to show anything new broken, so committed.

> The "read defragmentation" optimization does not work
> -----------------------------------------------------
>
> Key: CASSANDRA-15432
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15432
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Local Write-Read Paths
> Reporter: Sylvain Lebresne
> Assignee: Sylvain Lebresne
> Priority: Normal
> Fix For: 3.0.22, 3.11.8, 4.0-beta2
>
> The so-called "read defragmentation" added way back in CASSANDRA-2503 does not actually work, and never has. That is, the defragmentation writes do happen, but they only add load on the nodes without helping anything, and are thus a clear negative.
> The "read defragmentation" (which only impacts so-called "names queries") kicks in when a read hits "too many" sstables (> 4 by default), and when it does, it writes back the result of that read. The assumption is that the next read for that data would only read the newly written data, which, if no longer in a memtable, would at least be in a single sstable, thus speeding up that next read.
> Unfortunately, this is not how it works. When we defrag and write back the result of our original read, we do so with the timestamps of the data read (as we should; changing the timestamps would be plain wrong). As a result, following reads will read that data first, but will have no way to tell that no more sstables need to be read. Technically, the [{{reduceFilter}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SinglePartitionReadCommand.java#L830] call will not return {{null}}, because {{currentMaxTs}} will be higher than at least some of the data in the result, and this holds until we've read from as many sstables as in the original read.
> I see no easy way to fix this. It might be possible to make it work with additional per-sstable metadata, but nothing sufficiently simple and cheap to be worth it comes to mind, so I suggest simply removing that code.
> For the record, I'll note that there is actually a second problem with that code: currently, we "defrag" a read even if we didn't get data for everything the query requests. This is also "wrong" even if we ignore the first issue: a following read of the defragmented data would likewise have no way to know it should not read more sstables to try to get the missing parts. This problem would be fixable, but is obviously overshadowed by the previous one anyway.
> Anyway, as mentioned, I suggest just removing the "optimization" (which, again, never optimized anything) altogether, and I'm happy to provide the simple patch.
> The only question might be which versions? This impacts all versions, but it isn't a correctness bug either, "just" a performance one. So do we want 4.0 only, or is there appetite for earlier?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
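[Editor's note] The failure mode described above can be modeled in a few lines. The sketch below is a toy simulation, not Cassandra code: the function name, data layout, and stop-early check are simplified assumptions standing in for {{reduceFilter}}. A "names query" reads sstables newest-first and stops only once every requested column is provably newer than everything still unread. Because the defragmented sstable keeps the original cell timestamps, that stop condition fires no earlier than before, so the extra write buys nothing:

```python
def read_names_query(sstables, wanted):
    """Toy names-query read: sstables are dicts of column -> timestamp."""
    # Read newest-first by max cell timestamp, as the real read path orders sstables.
    order = sorted(sstables, key=lambda s: -max(s.values()))
    result, touched = {}, 0
    for i, sst in enumerate(order):
        # Simplified stand-in for reduceFilter(): stop only when every wanted
        # column is present AND older than nothing in the still-unread sstables.
        rest_max = max(max(s.values()) for s in order[i:])
        if wanted <= result.keys() and min(result.values()) > rest_max:
            break
        touched += 1
        for col, ts in sst.items():
            if col in wanted and ts > result.get(col, -1):
                result[col] = ts
    return result, touched

# Five fragments of one partition, one column each, written at timestamps 1..5.
fragments = [{f"c{i}": i} for i in range(1, 6)]
wanted = {f"c{i}" for i in range(1, 6)}

merged, before = read_names_query(fragments, wanted)

# "Defragment": write the merged result back as one new sstable, keeping the
# original per-cell timestamps (as Cassandra must, to stay correct).
_, after = read_names_query(fragments + [dict(merged)], wanted)

print(before, after)  # 5 6: the defragged copy adds load without cutting reads
```

Even though the defragmented sstable contains every wanted column, its oldest cell (timestamp 1) can never be proven newer than the remaining sstables' max timestamps, so the read still visits them all, plus the extra copy.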
[jira] [Updated] (CASSANDRA-15432) The "read defragmentation" optimization does not work
[ https://issues.apache.org/jira/browse/CASSANDRA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksey Yeschenko updated CASSANDRA-15432:
------------------------------------------
    Status: Ready to Commit  (was: Review In Progress)

+1. Good riddance.
[jira] [Updated] (CASSANDRA-15432) The "read defragmentation" optimization does not work
[ https://issues.apache.org/jira/browse/CASSANDRA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksey Yeschenko updated CASSANDRA-15432:
------------------------------------------
    Reviewers: Aleksey Yeschenko
    Status: Review In Progress  (was: Patch Available)
[jira] [Updated] (CASSANDRA-15432) The "read defragmentation" optimization does not work
[ https://issues.apache.org/jira/browse/CASSANDRA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksey Yeschenko updated CASSANDRA-15432:
------------------------------------------
    Reviewers: Aleksey Yeschenko
[jira] [Updated] (CASSANDRA-15432) The "read defragmentation" optimization does not work
[ https://issues.apache.org/jira/browse/CASSANDRA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-15432:
-----------------------------------------
    Fix Version/s: 4.x
                   3.11.x
                   3.0.x
[jira] [Updated] (CASSANDRA-15432) The "read defragmentation" optimization does not work
[ https://issues.apache.org/jira/browse/CASSANDRA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-15432:
-----------------------------------------
    Test and Documentation Plan: No impact on testing, as this change removes code and no tests existed for the removed optimization. As far as I can tell, the optimization was not documented, so there is no impact on documentation either.
    Status: Patch Available  (was: Open)
[jira] [Updated] (CASSANDRA-15432) The "read defragmentation" optimization does not work
[ https://issues.apache.org/jira/browse/CASSANDRA-15432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-15432:
------------------------------------
    Bug Category: Parent values: Correctness(12982) / Level 1 values: API / Semantic Implementation(12988)
    Complexity: Normal
    Component/s: Legacy/Local Write-Read Paths
    Discovered By: Code Inspection
    Severity: Normal
    Status: Open  (was: Triage Needed)