Given the discussion so far, I propose reframing this functionality as a limit on the memory used to execute a query. We will not expose a page size measured in bytes to the client. Instead, the upper limit will act as a guardrail so that the server does not fetch more data than that.
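
To make the guardrail idea concrete, here is a minimal sketch, not the actual Cassandra implementation. The names `Page`, `fetch_page`, `read_all`, `page_size_rows` and `max_page_bytes` are hypothetical, and rows are modelled as plain byte strings. It assumes the limit is soft: a page always contains at least one row, so it can overshoot the limit by one row, and the client relies only on the "more pages" flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    rows: List[bytes]
    next_pos: int          # hypothetical paging state: index of the next unread row
    has_more_pages: bool   # the only signal the client should rely on


def fetch_page(rows: List[bytes], start: int,
               page_size_rows: int, max_page_bytes: int) -> Page:
    """Build one page, stopping at the row count or at the byte guardrail."""
    page: List[bytes] = []
    used = 0
    pos = start
    while pos < len(rows) and len(page) < page_size_rows:
        page.append(rows[pos])
        used += len(rows[pos])
        pos += 1
        # Soft guardrail: checked after adding the row, so a page always
        # makes progress even if a single row exceeds the limit.
        if used >= max_page_bytes:
            break
    return Page(page, next_pos=pos, has_more_pages=pos < len(rows))


def read_all(rows: List[bytes], page_size_rows: int,
             max_page_bytes: int) -> List[bytes]:
    """Client loop: ignores page length and follows has_more_pages only."""
    out: List[bytes] = []
    pos = 0
    while True:
        page = fetch_page(rows, pos, page_size_rows, max_page_bytes)
        out.extend(page.rows)
        if not page.has_more_pages:
            return out
        pos = page.next_pos
```

With this shape a short page is not an error: the server may return fewer rows than requested whenever the guardrail is hit, and the client still receives every row.
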
An aggregation query with grouping is a special case: for a grouped result, we would count only the columns marked as queried in the ColumnFilter (the maximum sizes of those columns within a group). This way, we can still achieve the goal of making the server more stable under heavy load. Letting the user specify a page size in bytes is indeed a separate story, as the result set size needs to be measured at a higher level, where the selectors are applied.

Thanks,
Jacek

On Tue, Jun 13, 2023 at 10:42, Benjamin Lerer <ble...@apache.org> wrote:

>> So my other question - for aggregation with the "group by" clause, we
>> return an aggregated row which is computed from a group of rows - with my
>> current implementation, it is approximated by counting the size of the
>> largest row in that group - I think it is the safest and simplest
>> approximation - wdyt?
>
> I feel that there is something that was not discussed here. The storage
> engine can return some rows that are much larger than the actual row
> returned to the user, depending on the projections being used. Therefore,
> there will only be a reliable match between the size of the page loaded
> internally and the size of the page returned to the user when the full row
> is queried without transformation. In all other cases the difference
> can be really significant. For a group by query doing a count(*), the
> approach suggested will return a page size that is totally off from what
> was requested.
>
> On Tue, Jun 13, 2023 at 07:00, Jacek Lewandowski <
> lewandowski.ja...@gmail.com> wrote:
>
>> Josh, that answers my question exactly; thank you.
>>
>> I will not implement limiting the result set in CQL (that is, by LIMIT
>> clause) and stay with just paging. Whether the page size is defined in
>> bytes or rows can be determined by a flag - there are many unused bits for
>> that.
>>
>> So my other question - for aggregation with the "group by" clause, we
>> return an aggregated row which is computed from a group of rows - with my
>> current implementation, it is approximated by counting the size of the
>> largest row in that group - I think it is the safest and simplest
>> approximation - wdyt?
>>
>> On Mon, Jun 12, 2023 at 22:55, Josh McKenzie <jmcken...@apache.org>
>> wrote:
>>
>>> As long as it is valid in the paging protocol to return a short page,
>>> but still say “there are more pages”, I think that is fine to do that.
>>>
>>> Thankfully the v3-v5 specs all make it clear that clients need to respect
>>> what the server has to say about there being more pages:
>>> https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v5.spec#L1247-L1253
>>>
>>> - Clients should not rely on the actual size of the result set
>>>   returned to decide if there are more results to fetch or not.
>>>   Instead, they should always check the Has_more_pages flag (unless
>>>   they did not enable paging for the query obviously). Clients should
>>>   also not assert that no result will have more than
>>>   <result_page_size> results. While the current implementation always
>>>   respects the exact value of <result_page_size>, we reserve the right
>>>   to return slightly smaller or bigger pages in the future for
>>>   performance reasons.
>>>
>>> On Mon, Jun 12, 2023, at 3:19 PM, Jeremiah Jordan wrote:
>>>
>>> As long as it is valid in the paging protocol to return a short page,
>>> but still say “there are more pages”, I think that is fine to do that. For
>>> an actual LIMIT that is part of the user query, I think the server must
>>> always have returned all data that fits into the LIMIT when all pages have
>>> been returned.
>>>
>>> -Jeremiah
>>>
>>> On Jun 12, 2023 at 12:56:14 PM, Josh McKenzie <jmcken...@apache.org>
>>> wrote:
>>>
>>> Yeah, my bad. I have paging on the brain. Seriously.
>>>
>>> I can't think of a use-case in which a LIMIT based on # bytes makes
>>> sense from a user perspective.
>>>
>>> On Mon, Jun 12, 2023, at 1:35 PM, Jeff Jirsa wrote:
>>>
>>> On Mon, Jun 12, 2023 at 9:50 AM Benjamin Lerer <b.le...@gmail.com>
>>> wrote:
>>>
>>> If you have rows that vary significantly in their size, your latencies
>>> could end up being pretty unpredictable using a LIMIT BY <row_count>. Being
>>> able to specify a limit by bytes at the driver / API level would allow app
>>> devs to get more deterministic results out of their interaction w/the DB if
>>> they're looking to respond back to a client within a certain time frame and
>>> / or determine next steps in the app (continue paging, stop, etc) based on
>>> how long it took to get results back.
>>>
>>> Are you talking about the page size or the LIMIT? Once the LIMIT is
>>> reached there is no "continue paging". LIMIT is also at the CQL level, not
>>> at the driver level.
>>> I can totally understand the need for a page size in bytes, but not for a
>>> LIMIT.
>>>
>>> Would only ever EXPECT to see a page size in bytes, never a LIMIT
>>> specifying bytes.
>>>
>>> I know the C-11745 ticket says LIMIT, too, but that feels very odd to me.
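
For reference, the two ways of sizing a "group by" group discussed in this thread can be compared with a small sketch. This is an illustration under assumptions, not Cassandra code: rows are modelled as dictionaries from column name to raw bytes, the function names are hypothetical, and each group is assumed to be non-empty.

```python
from typing import Dict, List, Set

Row = Dict[str, bytes]  # hypothetical row model: column name -> raw value


def largest_row_estimate(group: List[Row]) -> int:
    """Earlier approximation: the size of the largest full row in the group."""
    return max(sum(len(value) for value in row.values()) for row in group)


def queried_columns_estimate(group: List[Row], queried: Set[str]) -> int:
    """Later proposal: for each queried column, take its maximum size
    within the group, then sum over the queried columns only."""
    return sum(
        max(len(row.get(column, b"")) for row in group)
        for column in queried
    )


group = [
    {"id": b"1234", "payload": b"x" * 1000},
    {"id": b"12345678", "payload": b"y" * 10},
]
full_row_size = largest_row_estimate(group)            # counts payload too
id_only_size = queried_columns_estimate(group, {"id"})  # ignores payload
no_column_size = queried_columns_estimate(group, set())  # e.g. count(*)
```

The gap between the first estimate and the other two is the mismatch raised above: the storage engine reads whole rows, while a projection or count(*) returns far less to the user.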