Given what was said, I propose reframing this functionality as limiting the
memory used to execute a query. We would not expose a page size measured
in bytes to the client. Instead, an upper limit in bytes would act as a
guardrail so that the server does not fetch more data than that.
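To make the guardrail idea concrete, here is a minimal sketch of the intended behavior (all names are hypothetical, not code from the patch): fetching stops as soon as the accumulated byte budget is exhausted, even if that yields a short page, and the caller can then report that more pages remain.

```java
// Hypothetical sketch of a byte-based guardrail for page fetching.
public class BytesGuardrail {
    // rowSizes[i] = serialized size in bytes of row i, in iteration order.
    // Returns how many rows fit into one page under the byte budget.
    static int rowsInPage(long[] rowSizes, int startIndex, long maxBytes) {
        long used = 0;
        int count = 0;
        for (int i = startIndex; i < rowSizes.length; i++) {
            used += rowSizes[i];
            count++;                     // always include at least one row
            if (used >= maxBytes) break; // guardrail hit: stop fetching early
        }
        return count;
    }
}
```

The page may come back short of the row count the client asked for; that is fine as long as the server still signals whether more pages exist.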

An aggregation query with grouping is a special case: for a grouped result
we would count only the columns marked as queried in the ColumnFilter,
taking the maximum size of each such column within the group.
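A minimal sketch of that approximation (hypothetical names, not the actual implementation): for each column the ColumnFilter marks as queried, charge the largest serialized size seen for that column within the group, and sum those maxima.

```java
// Hypothetical illustration of sizing one aggregated (GROUP BY) row.
public class GroupSizeEstimate {
    // sizes[r][c] = serialized size of column c in row r of the group;
    // queried[c]  = whether the ColumnFilter marks column c as queried.
    static long estimate(long[][] sizes, boolean[] queried) {
        long total = 0;
        for (int c = 0; c < queried.length; c++) {
            if (!queried[c]) continue;      // skip columns not queried
            long max = 0;
            for (long[] row : sizes) {
                max = Math.max(max, row[c]); // largest value of this column in the group
            }
            total += max;
        }
        return total;
    }
}
```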

This way, we can still achieve the goal of making the server more stable
under heavy load. Letting the user specify a page size in bytes is indeed a
separate story, as the result set size needs to be measured at a higher
level, where the selectors are applied.
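For context, the protocol guarantee this relies on (clients must check the has-more-pages flag rather than assume full pages) means a driver loop looks roughly like the sketch below; the interfaces here are hypothetical, not an actual driver API.

```java
// Hypothetical driver-side paging loop: termination is driven solely by
// the has-more-pages flag, never by whether a page came back full, since
// the server is allowed to return short pages.
public class PagingClient {
    interface Page { int rowCount(); boolean hasMorePages(); }
    interface Session { Page nextPage(); }

    // Total rows fetched across all pages.
    static int fetchAll(Session session) {
        int total = 0;
        Page p;
        do {
            p = session.nextPage();
            total += p.rowCount(); // a short page does not end paging
        } while (p.hasMorePages());
        return total;
    }
}
```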

thanks,
Jacek


wt., 13 cze 2023 o 10:42 Benjamin Lerer <ble...@apache.org> napisał(a):

> So my other question - for aggregation with the "group by" clause, we
>> return an aggregated row which is computed from a group of rows - with my
>> current implementation, it is approximated by counting the size of the
>> largest row in that group - I think it is the safest and simplest
>> approximation - wdyt?
>
>
> I feel that there is something that was not discussed here. The storage
> engine can return rows that are much larger than the actual rows
> returned to the user, depending on the projections being used. Therefore
> there will only be a reliable match between the size of the page loaded
> internally and the size of the page returned to the user when the full row
> is queried without transformation. For all the other cases the difference
> can be really significant. For a GROUP BY query doing a count(*), the
> suggested approach will return a page size that is totally off from what
> was requested.
>
> Le mar. 13 juin 2023 à 07:00, Jacek Lewandowski <
> lewandowski.ja...@gmail.com> a écrit :
>
>> Josh, that answers my question exactly; thank you.
>>
>> I will not implement limiting the result set in CQL (that is, by the
>> LIMIT clause) and will stay with just paging. Whether the page size is
>> defined in bytes or rows can be determined by a flag - there are many
>> unused bits for that.
>>
>> So my other question - for aggregation with the "group by" clause, we
>> return an aggregated row which is computed from a group of rows - with my
>> current implementation, it is approximated by counting the size of the
>> largest row in that group - I think it is the safest and simplest
>> approximation - wdyt?
>>
>>
>> pon., 12 cze 2023 o 22:55 Josh McKenzie <jmcken...@apache.org>
>> napisał(a):
>>
>>> As long as it is valid in the paging protocol to return a short page,
>>> but still say “there are more pages”, I think that is fine to do that.
>>>
>>> Thankfully the v3-v5 spec all make it clear that clients need to respect
>>> what the server has to say about there being more pages:
>>> https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v5.spec#L1247-L1253
>>>
>>>   - Clients should not rely on the actual size of the result set returned
>>>     to decide if there are more results to fetch or not. Instead, they
>>>     should always check the Has_more_pages flag (unless they did not enable
>>>     paging for the query obviously). Clients should also not assert that no
>>>     result will have more than <result_page_size> results. While the
>>>     current implementation always respects the exact value of
>>>     <result_page_size>, we reserve the right to return slightly smaller or
>>>     bigger pages in the future for performance reasons.
>>>
>>>
>>> On Mon, Jun 12, 2023, at 3:19 PM, Jeremiah Jordan wrote:
>>>
>>> As long as it is valid in the paging protocol to return a short page,
>>> but still say “there are more pages”, I think that is fine to do that.  For
>>> an actual LIMIT that is part of the user query, I think the server must
>>> always have returned all data that fits into the LIMIT when all pages have
>>> been returned.
>>>
>>> -Jeremiah
>>>
>>> On Jun 12, 2023 at 12:56:14 PM, Josh McKenzie <jmcken...@apache.org>
>>> wrote:
>>>
>>>
>>> Yeah, my bad. I have paging on the brain. Seriously.
>>>
>>> I can't think of a use-case in which a LIMIT based on # bytes makes
>>> sense from a user perspective.
>>>
>>> On Mon, Jun 12, 2023, at 1:35 PM, Jeff Jirsa wrote:
>>>
>>>
>>>
>>> On Mon, Jun 12, 2023 at 9:50 AM Benjamin Lerer <b.le...@gmail.com>
>>> wrote:
>>>
>>> If you have rows that vary significantly in their size, your latencies
>>> could end up being pretty unpredictable using a LIMIT BY <row_count>. Being
>>> able to specify a limit by bytes at the driver / API level would allow app
>>> devs to get more deterministic results out of their interaction w/the DB if
>>> they're looking to respond back to a client within a certain time frame and
>>> / or determine next steps in the app (continue paging, stop, etc) based on
>>> how long it took to get results back.
>>>
>>>
>>> Are you talking about the page size or the LIMIT? Once the LIMIT is
>>> reached there is no "continue paging". LIMIT is also at the CQL level,
>>> not at the driver level.
>>> I can totally understand the need for a page size in bytes, not for a
>>> LIMIT.
>>>
>>>
>>> I would only ever EXPECT to see a page size in bytes, never a LIMIT
>>> specifying bytes.
>>>
>>> I know the C-11745 ticket says LIMIT, too, but that feels very odd to me.
>>>
>>>
>>>
>>>