Re: [DISCUSS] Limiting query results by size (CASSANDRA-11745)

Jacek Lewandowski Mon, 12 Jun 2023 02:40:35 -0700

Yes, LIMIT BY <bytes> provided by the user in CQL does not make much sense
to me either.



On Mon, 12 Jun 2023 at 11:20, Benedict <[email protected]> wrote:

> I agree that this is more suitable as a paging option, and not as a CQL
> LIMIT option.
>
> If it were to be a CQL LIMIT option though, then it should be accurate
> with respect to the result set, IMO; there shouldn’t be any further results
> that could have been returned within the LIMIT.
>
> On 12 Jun 2023, at 10:16, Benjamin Lerer <[email protected]> wrote:
>
> Thanks Jacek for raising that discussion.
>
> I do not have in mind a scenario where it could be useful to specify a
> LIMIT in bytes. The LIMIT clause is usually used when you know how many
> rows you wish to display or use. Unless somebody has a useful scenario in
> mind I do not think that there is a need for that feature.
>
> Paging in bytes makes sense to me as the paging mechanism is transparent
> for the user in most drivers. It is simply a way to optimize your memory
> usage from end to end.
>
> I do not like the approach of using both of them simultaneously because if
> you request a page with a certain number of rows and do not get it, then it
> is really confusing and can be a problem for some use cases. We have users
> keeping their session open and the page information to display pages of data.
>
> On Mon, 12 Jun 2023 at 09:08, Jacek Lewandowski <
> [email protected]> wrote:
>
>> Hi,
>>
>> I was working on limiting query results by their size expressed in bytes,
>> and some questions arose that I'd like to bring to the mailing list.
>>
>> The semantics of queries (without aggregation): data limits are applied
>> to the raw data returned from replicas. This works fine for row-count
>> limits, as the number of rows is not likely to change after
>> post-processing, but it is less accurate for size-based limits, as the cell
>> sizes may be different after post-processing (for example, due to applying
>> a transformation function, a projection, or similar).
>>
>> We can truncate the results after post-processing to stay within the
>> user-provided limit in bytes, but if the result is smaller than the limit,
>> we will not fetch more. In that case, "limit" still holds as an actual
>> limit, though it would be misleading as a page size, because we would not
>> fetch the maximum amount of data that does not exceed the page size.
>>
>> Such a problem is much more visible for "group by" queries with
>> aggregation. The paging and limiting mechanism is applied to the rows
>> rather than groups, as it has no information about how much memory a single
>> group uses. For now, I've approximated a group size as the size of the
>> largest participating row.
>>
>> The question is how the size limit expressed in bytes should be
>> interpreted: do we want to use this mechanism to let users precisely
>> control the size of the result set, or do we instead want to use it to
>> limit the amount of memory used internally for the data and prevent
>> problems (assuming size and row-count restrictions can be used
>> simultaneously, so that we stop when we reach either of the specified
>> limits)?
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-11745
>>
>> thanks,
>> - - -- --- ----- -------- -------------
>> Jacek Lewandowski
>>
>
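
For readers following the thread, here is a minimal sketch of the "stop when
we reach either limit" behaviour discussed above. It is not Cassandra code:
`take_page`, `max_rows`, and `max_bytes` are hypothetical names, rows are
modelled as plain byte strings, and sizes are measured on the raw rows (before
any post-processing), which is exactly why the returned page may end up
smaller than the byte budget once functions or projections are applied.

```python
from typing import Iterable, List, Optional, Tuple


def take_page(rows: Iterable[bytes],
              max_rows: Optional[int] = None,
              max_bytes: Optional[int] = None) -> Tuple[List[bytes], bool]:
    """Collect rows until the row limit or the byte limit would be exceeded.

    Hypothetical helper for illustration only. Returns (page, exhausted),
    where exhausted is True if the source ran out before a limit was hit.
    """
    page: List[bytes] = []
    used = 0  # total raw size of the rows accepted so far
    for row in rows:
        # Row-count limit: stop once the page already holds max_rows rows.
        if max_rows is not None and len(page) >= max_rows:
            return page, False
        # Byte limit: stop if this row would push the page over budget.
        # Assumption: the first row is always admitted, so a single
        # oversized row cannot stall paging.
        if max_bytes is not None and page and used + len(row) > max_bytes:
            return page, False
        page.append(row)
        used += len(row)
    return page, True


if __name__ == "__main__":
    data = [b"a" * 40, b"b" * 40, b"c" * 40]
    page, exhausted = take_page(data, max_rows=10, max_bytes=100)
    print(len(page), exhausted)
```

With three 40-byte rows, a 10-row limit, and a 100-byte limit, the byte limit
is reached first and the page holds two rows. This also shows the concern
raised earlier in the thread: a client that asked for 10 rows receives fewer
even though more data exists.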
