>
> If you have rows that vary significantly in their size, your latencies
> could end up being pretty unpredictable using a LIMIT BY <row_count>. Being
> able to specify a limit by bytes at the driver / API level would allow app
> devs to get more deterministic results out of their interaction w/the DB if
> they're looking to respond back to a client within a certain time frame and
> / or determine next steps in the app (continue paging, stop, etc) based on
> how long it took to get results back.


Are you talking about the page size or the LIMIT? Once the LIMIT is reached,
there is no "continue paging". LIMIT also lives at the CQL level, not at the
driver level.
I can totally understand the need for a page size in bytes, but not for a LIMIT.
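
To make the distinction concrete, here is a minimal sketch with the DataStax
Java driver 4.x (the keyspace, table, column and values are invented for
illustration). LIMIT is part of the CQL statement and caps the total result
set; the page size is a driver/protocol option that only shapes how that
result set is streamed back, today in rows:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.ResultSet;
    import com.datastax.oss.driver.api.core.cql.Row;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    public class PageSizeVsLimit {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                // LIMIT 100 belongs to the query itself: at most 100 rows, total.
                // setPageSize(20) only controls how many rows come back per page.
                SimpleStatement stmt =
                    SimpleStatement.newInstance(
                            "SELECT * FROM my_ks.my_table WHERE pk = ? LIMIT 100", 42)
                        .setPageSize(20);
                ResultSet rs = session.execute(stmt);
                for (Row row : rs) {
                    // The driver fetches further pages transparently until the
                    // CQL LIMIT (or the end of the data) is reached.
                }
            }
        }
    }

A page size expressed in bytes would slot into the same place as setPageSize,
i.e. a paging option negotiated between driver and server, not a CQL change.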

On Mon, Jun 12, 2023 at 16:25, Josh McKenzie <jmcken...@apache.org> wrote:

> I do not have in mind a scenario where it could be useful to specify a
> LIMIT in bytes. The LIMIT clause is usually used when you know how many
> rows you wish to display or use. Unless somebody has a useful scenario in
> mind I do not think that there is a need for that feature.
>
> If you have rows that vary significantly in their size, your latencies
> could end up being pretty unpredictable using a LIMIT BY <row_count>. Being
> able to specify a limit by bytes at the driver / API level would allow app
> devs to get more deterministic results out of their interaction w/the DB if
> they're looking to respond back to a client within a certain time frame and
> / or determine next steps in the app (continue paging, stop, etc) based on
> how long it took to get results back.
>
> I'm seeing similar tradeoffs working on gracefully paging over tombstones;
> there's a strong desire to be able to have more confidence in the statement
> "If I ask the server for a page of data, I'll very likely get it back
> within time X".
>
> There's an argument that it's a data modeling problem and apps should
> model differently to have more consistent row sizes and/or tombstone
> counts; I'm sympathetic to that but the more we can loosen those
> constraints on users the better their experience in my opinion.
>
> On Mon, Jun 12, 2023, at 5:39 AM, Jacek Lewandowski wrote:
>
> Yes, LIMIT BY <bytes> provided by the user in CQL does not make much sense
> to me either
>
>
> On Mon, Jun 12, 2023 at 11:20, Benedict <bened...@apache.org> wrote:
>
>
> I agree that this is more suitable as a paging option, and not as a CQL
> LIMIT option.
>
> If it were to be a CQL LIMIT option though, then it should be accurate
> regarding result set IMO; there shouldn’t be any further results that could
> have been returned within the LIMIT.
>
>
> On 12 Jun 2023, at 10:16, Benjamin Lerer <ble...@apache.org> wrote:
>
> Thanks Jacek for raising that discussion.
>
> I do not have in mind a scenario where it could be useful to specify a
> LIMIT in bytes. The LIMIT clause is usually used when you know how many
> rows you wish to display or use. Unless somebody has a useful scenario in
> mind I do not think that there is a need for that feature.
>
> Paging in bytes makes sense to me as the paging mechanism is transparent
> for the user in most drivers. It is simply a way to optimize your memory
> usage from end to end.
>
> I do not like the approach of using both of them simultaneously, because if
> you request a page with a certain number of rows and do not get it, that is
> really confusing and can be a problem for some use cases. We have users who
> keep their session open along with the paging state in order to display
> pages of data.
>
> On Mon, Jun 12, 2023 at 09:08, Jacek Lewandowski <
> lewandowski.ja...@gmail.com> wrote:
>
> Hi,
>
> I was working on limiting query results by their size expressed in bytes,
> and some questions arose that I'd like to bring to the mailing list.
>
> For queries without aggregation, the limits are applied to the raw data
> returned from the replicas. While that works fine for row-count limits, as
> the number of rows is unlikely to change during post-processing, it is less
> accurate for size-based limits because the cell sizes may differ after
> post-processing (for example due to applying some transformation function,
> projection, or whatever).
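>
> To illustrate with a toy example (not Cassandra internals): if the byte
> limit is enforced on the raw cells and a transformation is applied
> afterwards, the data actually returned can end up well over the limit that
> was checked:
>
>     // Toy illustration: post-processing changes the byte size of the data
>     // after the limit was already applied to the raw cells.
>     import java.nio.charset.StandardCharsets;
>     import java.util.List;
>     import java.util.stream.Collectors;
>
>     public class PostProcessingSizeSketch {
>         static int sizeInBytes(List<String> cells) {
>             return cells.stream()
>                     .mapToInt(c -> c.getBytes(StandardCharsets.UTF_8).length)
>                     .sum();
>         }
>
>         public static void main(String[] args) {
>             List<String> raw = List.of("aaaa", "bbbb", "cccc"); // 12 bytes, within a 16-byte limit
>             List<String> transformed = raw.stream()
>                     .map(c -> c.repeat(3))                      // hypothetical transformation
>                     .collect(Collectors.toList());
>             System.out.println(sizeInBytes(raw) + " -> " + sizeInBytes(transformed)); // 12 -> 36
>         }
>     }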
>
> We can truncate the results after post-processing to stay within the
> user-provided limit in bytes, but if the result then ends up smaller than
> the limit, we will not fetch more. In that case the meaning of "limit" as an
> actual limit still holds, though it would be misleading for a page size,
> because we would not fetch the maximum amount of data that fits within the
> page size.
>
> Such a problem is much more visible for "group by" queries with
> aggregation. The paging and limiting mechanism is applied to the rows
> rather than groups, as it has no information about how much memory a single
> group uses. For now, I've approximated a group size as the size of the
> largest participating row.
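>
> For illustration (a sketch, not the actual patch code), that approximation
> amounts to estimating a group's size as the serialized size of the largest
> row that fell into the group:
>
>     import java.util.List;
>
>     final class GroupSizeEstimate {
>         // rowSizesInBytes: serialized sizes of the rows contributing to one group
>         static long estimate(List<Long> rowSizesInBytes) {
>             return rowSizesInBytes.stream()
>                     .mapToLong(Long::longValue)
>                     .max()
>                     .orElse(0L);
>         }
>     }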
>
> The question concerns the intended interpretation of the size limit
> expressed in bytes: do we want to use this mechanism to let users precisely
> control the size of the result set, or do we instead want to use it to limit
> the amount of memory used internally for the data and to prevent problems
> (assuming that the size limit and the row-count limit can be used
> simultaneously, so that we stop when either of the specified limits is
> reached)?
>
> https://issues.apache.org/jira/browse/CASSANDRA-11745
>
> thanks,
> - - -- --- ----- -------- -------------
> Jacek Lewandowski
>
>
>
