[jira] [Updated] (CASSANDRA-21371) Account tombstones in paging for select queries

Dmitry Konstantinov (Jira) Thu, 14 May 2026 02:31:12 -0700


     [ https://issues.apache.org/jira/browse/CASSANDRA-21371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Dmitry Konstantinov updated CASSANDRA-21371:
--------------------------------------------
    Description: 
Currently, Cassandra counts only live rows when deciding whether a page has 
enough rows to return. As a result, if a partition has a lot of tombstones, we 
get performance issues: such queries are slow, they occupy read threads, and 
they allocate a lot of objects, which creates a GC pressure spike.

To avoid such issues we can:
 * count tombstone unfiltered items during read paging in the same way as 
live rows
 * OR introduce a byte-size limit per page which includes live rows as well as 
tombstones

This means a page can contain fewer rows than requested, or even 0 rows, but 
that does not mean there is no more data in the DB: the app still needs to 
fetch the remaining pages.
So, if we have for example 60k row tombstones, we will scan 5k of them and 
return a response to the client. This particular request can be slower from the 
client's point of view because several pages have to be fetched, but it keeps 
the DB healthy by avoiding read threads being occupied for an unpredictable 
amount of time.
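
A minimal sketch of the first option, assuming a toy in-memory model rather than Cassandra's actual read path. The names TombstoneAwarePaging, Item, Page, readPage and scanLimit are hypothetical and exist only for this illustration. Both live rows and tombstones count toward the per-page limit, so each page does a bounded amount of work even when it returns no rows:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model for illustration only; these are not Cassandra's internal classes. */
final class TombstoneAwarePaging {

    /** One item read from a partition: either a live row or a row tombstone. */
    static final class Item {
        final int key;
        final boolean live;

        Item(int key, boolean live) {
            this.key = key;
            this.live = live;
        }
    }

    /** Live keys returned to the client, the resume position, and the "more pages" flag. */
    static final class Page {
        final List<Integer> liveKeys;
        final int nextIndex;
        final boolean hasMorePages;

        Page(List<Integer> liveKeys, int nextIndex, boolean hasMorePages) {
            this.liveKeys = liveKeys;
            this.nextIndex = nextIndex;
            this.hasMorePages = hasMorePages;
        }
    }

    /**
     * Reads one page starting at index {@code start}. The page ends after
     * {@code scanLimit} items, and tombstones count toward that limit just
     * like live rows, so a page may contain fewer rows than the limit or none.
     */
    static Page readPage(List<Item> partition, int start, int scanLimit) {
        if (scanLimit <= 0) {
            throw new IllegalArgumentException("scanLimit must be positive");
        }
        List<Integer> liveKeys = new ArrayList<>();
        int index = start;
        int scanned = 0;
        while (index < partition.size() && scanned < scanLimit) {
            Item item = partition.get(index);
            if (item.live) {
                liveKeys.add(item.key);
            }
            scanned++;
            index++;
        }
        return new Page(liveKeys, index, index < partition.size());
    }

    public static void main(String[] args) {
        // 60k row tombstones followed by 10 live rows, scanned 5k items per page.
        List<Item> partition = new ArrayList<>();
        for (int i = 0; i < 60_000; i++) {
            partition.add(new Item(i, false));
        }
        for (int i = 60_000; i < 60_010; i++) {
            partition.add(new Item(i, true));
        }

        int start = 0;
        int pages = 0;
        int liveRows = 0;
        Page page;
        do {
            page = readPage(partition, start, 5_000);
            liveRows += page.liveKeys.size();
            start = page.nextIndex;
            pages++;
        } while (page.hasMorePages);

        System.out.println(pages + " pages, " + liveRows + " live rows"); // prints "13 pages, 10 live rows"
    }
}
```

In this model the first 12 pages are empty but still report that more pages exist; only the last page carries the live rows. A byte-size limit (the second option) would follow the same shape, with the loop bounded by accumulated bytes instead of an item count.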

The CQL protocol already has an explicit flag to show whether there is more 
data to return -  
[https://cassandra.apache.org/doc/latest/cassandra/reference/native-protocol.html]
 , so no changes at the CQL protocol level are expected:
{code:java}
0x0002    Has_more_pages: indicates whether this is not the last
                      page of results and more should be retrieved. If set, the
                      <paging_state> will be present. The <paging_state> is a
                      [bytes] value that should be used in QUERY/EXECUTE to
                      continue paging and retrieve the remainder of the result for
                      this query (See Section 7 for more details).
{code}
{code:java}
- Clients should not rely on the actual size of the result set returned to
    decide if there are more results to fetch or not. Instead, they should always
    check the Has_more_pages flag (unless they did not enable paging for the query
    obviously). Clients should also not assert that no result will have more than
    <result_page_size> results. While the current implementation always respects
    the exact value of <result_page_size>, we reserve the right to return
    slightly smaller or bigger pages in the future for performance reasons.
{code}
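
The client-side consequence of the protocol text above can be sketched as follows. This is an illustration under stated assumptions, not driver code: PagingClientSketch, ResultPage and fetchAll are hypothetical names, the server is faked with a lambda, and the paging state is modelled as an Integer (null when Has_more_pages is unset) instead of the opaque [bytes] value the protocol uses. The loop stops only when no paging state is returned, never because a page came back small or empty:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Illustration only; a real client would use its driver's paging API instead. */
final class PagingClientSketch {

    /** Hypothetical stand-in for one RESULT message. */
    static final class ResultPage {
        final List<String> rows;
        /** Null models Has_more_pages being unset (assumption: real paging state is opaque bytes). */
        final Integer pagingState;

        ResultPage(List<String> rows, Integer pagingState) {
            this.rows = rows;
            this.pagingState = pagingState;
        }
    }

    /**
     * Fetches every page. Termination depends only on the paging state,
     * so pages with zero rows do not end the iteration early.
     */
    static List<String> fetchAll(Function<Integer, ResultPage> executeQuery) {
        List<String> all = new ArrayList<>();
        Integer pagingState = null; // null on the first request: start from the beginning
        do {
            ResultPage page = executeQuery.apply(pagingState);
            all.addAll(page.rows);
            pagingState = page.pagingState;
        } while (pagingState != null);
        return all;
    }

    public static void main(String[] args) {
        // Fake server: some pages are empty, as they could be if tombstones counted toward the page limit.
        List<List<String>> serverPages = List.of(
                List.<String>of(),
                List.<String>of(),
                List.of("a", "b"),
                List.<String>of(),
                List.of("c"));

        Function<Integer, ResultPage> server = state -> {
            int i = (state == null) ? 0 : state;
            Integer next = (i + 1 < serverPages.size()) ? Integer.valueOf(i + 1) : null;
            return new ResultPage(serverPages.get(i), next);
        };

        System.out.println(fetchAll(server)); // prints "[a, b, c]"
    }
}
```

A client that stopped at the first empty page would return nothing in this example, which is why the flag, not the row count, has to drive the loop.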

  was:
Currently Cassandra uses only live rows to track if it is enough rows for a 
page to return. As a result if a partition has a lot of tombstones we get 
performance issues: such queries are slow and consume read threads + a lot of 
objects are allocated to handle them and it creates a GC pressure spike.

To avoid such issues we can:
 * consider tombstone unfiltered items during read paging in the same ways as 
alive rows
 * OR introduce a byte-sized limit for page which accounts alive rows as well 
as tombstones

It means we can return a smaller number of rows in a result or even 0 but it 
does not mean that there is no data anymore in DB, an app still needs to fetch 
data.
So, if we have for example 60k row tombstones, we will fetch 5k and return back 
an answer to a client, so this particular request can be slower from a client 
point of view due to several pages to fetch but it will allow to have DB healthy 
by avoiding read threads occupied for an unpredictable amount of time.

CQL protocol already has an explicit flag to show if we have more data to 
return -  
[https://cassandra.apache.org/doc/latest/cassandra/reference/native-protocol.html]
 , so no changes on CQL protocol level are expected:
{code:java}
0x0002    Has_more_pages: indicates whether this is not the last
                      page of results and more should be retrieved. If set, the
                      <paging_state> will be present. The <paging_state> is a
                      [bytes] value that should be used in QUERY/EXECUTE to
                      continue paging and retrieve the remainder of the result for
                      this query (See Section 7 for more details).
{code}
{code:java}
- Clients should not rely on the actual size of the result set returned to
    decide if there are more results to fetch or not. Instead, they should always
    check the Has_more_pages flag (unless they did not enable paging for the query
    obviously). Clients should also not assert that no result will have more than
    <result_page_size> results. While the current implementation always respects
    the exact value of <result_page_size>, we reserve the right to return
    slightly smaller or bigger pages in the future for performance reasons.
{code}


> Account tombstones in paging for select queries
> -----------------------------------------------
>
>                 Key: CASSANDRA-21371
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21371
>             Project: Apache Cassandra
>          Issue Type: New Feature
>          Components: Consistency/Coordination, Local/Other
>            Reporter: Dmitry Konstantinov
>            Priority: Normal



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
