[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950117#comment-14950117 ]

Sylvain Lebresne commented on CASSANDRA-10489:

For the "non-indexed" part, the big problem with this is paging. I don't think we can do arbitrary read-time ordering in the presence of paging, at least not given how things work. And I'm *very* opposed to supporting it only if the user disables paging, as 1) that would be really confusing, 2) that would encourage users to disable paging, which we don't want, and 3) sorting client side ain't that hard.

And at least in that "non-indexed" case, I don't buy the "it's better to do it server side if you want a LIMIT" argument: pulling 10k rows into memory to sort them when you only care about the first 10 is a stupid idea whether that's client or server side; you should use a MV (or any other smarter solution).

As for the "indexed" part, I kind of think that it's what MVs are for. Of course, if you see a much better way to do this than through MVs, feel free to share.

> arbitrary order by on partitions
>
> Key: CASSANDRA-10489
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10489
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jon Haddad
> Priority: Minor
>
> We've got aggregations, we might as well allow sorting rows within a partition on arbitrary fields. Currently the advice is "do it client side", but when combined with a LIMIT clause it makes sense to do this server side.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
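To illustrate the paging concern raised in the comment above, here is a minimal sketch in plain Python (not Cassandra code). It assumes a hypothetical partition whose rows come back in clustering order, split into fixed-size pages, and shows that sorting each page on its own does not produce a globally sorted result. The row shape, the `pages` helper, and the page size are all made up for illustration.

```python
from typing import Callable, Iterator, List, Tuple

Row = Tuple[str, int]  # (clustering_key, value); hypothetical row shape


def pages(rows: List[Row], page_size: int) -> Iterator[List[Row]]:
    """Yield rows in stored (clustering) order, one page at a time."""
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]


def sort_each_page(rows: List[Row], page_size: int,
                   key: Callable[[Row], int]) -> List[Row]:
    """Sort every page independently and concatenate the pages."""
    out: List[Row] = []
    for page in pages(rows, page_size):
        out.extend(sorted(page, key=key))
    return out


# Rows as stored, ordered by clustering key rather than by value.
stored = [("k1", 30), ("k2", 10), ("k3", 50), ("k4", 20), ("k5", 40), ("k6", 5)]


def by_value(row: Row) -> int:
    return row[1]


paged = sort_each_page(stored, page_size=3, key=by_value)
global_sort = sorted(stored, key=by_value)

# The smallest value lives on the second page, so the first page cannot be
# emitted correctly without reading the whole partition first.
print([value for _, value in paged])        # [10, 30, 50, 5, 20, 40]
print([value for _, value in global_sort])  # [5, 10, 20, 30, 40, 50]
```

In this toy model, returning a correct first page requires reading every row up front, which is the materialization cost that paging exists to avoid.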
[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950785#comment-14950785 ]

Jeremiah Jordan commented on CASSANDRA-10489:

bq. And I'm very opposed to supporting it only if the user disables paging

+1. I too don't see a good way to do this in the face of paging without ignoring paging. And then you are back to the whole reason we made paging in the first place: people were issuing queries that would materialize large data sets and crash nodes.

I agree, I think a materialized view with a different clustering order is where this should happen.
[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949492#comment-14949492 ]

Jonathan Shook commented on CASSANDRA-10489:

So, against a non-indexed field, the processing bound will be the size of the partition. If you only hold a scoreboard of LIMIT items in memory and stream through the rest, replacing items as you go, the memory requirements are lower, but the IO requirements could be substantial. If you do this with RF>1 and CL>1, then you may have semantics of result merging at the coordinator, but this should still be bounded by the result size and not the search space.

I would like for us to consider this operation for indexed fields and non-indexed fields as separate features, possibly putting the non-indexed version behind a warning or such. I'm sure some will absolutely try to sort 10^9 items with limit 10. At least they should know that it has a completely different op cost.
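The "scoreboard" idea in the comment above can be sketched with a bounded min-heap. This is only an illustration in plain Python, not how Cassandra does or would implement it. The `(sort_value, row_id)` row shape and both function names are hypothetical, and the coordinator merge is simplified to de-duplicating identical rows (real replica reconciliation would also have to compare write timestamps).

```python
import heapq
from typing import Iterable, List, Tuple

Row = Tuple[int, str]  # (sort_value, row_id); hypothetical row shape


def top_k_streaming(rows: Iterable[Row], limit: int) -> List[Row]:
    """Stream through a partition keeping only the `limit` largest rows.

    Memory is O(limit); every row is still read once, so IO is O(partition).
    """
    if limit <= 0:
        return []
    scoreboard: List[Row] = []  # min-heap: scoreboard[0] is the weakest kept row
    for row in rows:
        if len(scoreboard) < limit:
            heapq.heappush(scoreboard, row)
        elif row > scoreboard[0]:
            # Evict the weakest kept row in favour of the new one.
            heapq.heapreplace(scoreboard, row)
    return sorted(scoreboard, reverse=True)


def merge_replica_results(replica_results: Iterable[List[Row]],
                          limit: int) -> List[Row]:
    """Simplified coordinator-side merge of per-replica top-k lists.

    The work is bounded by (number of replicas * limit), not by partition size.
    """
    merged = set()
    for result in replica_results:
        merged.update(result)
    return heapq.nlargest(limit, merged)


partition = [(5, "a"), (1, "b"), (9, "c"), (7, "d"), (3, "e")]
print(top_k_streaming(partition, 2))  # [(9, 'c'), (7, 'd')]
```

The scoreboard never holds more than `limit` rows, which matches the lower memory bound described above. The loop still visits every row in the partition, which is the IO cost the comment warns about.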
[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949521#comment-14949521 ]

Jon Haddad commented on CASSANDRA-10489:

{quote}
I would like for us to consider this operation for indexed fields and non-indexed fields as separate features, possibly putting the non-indexed version behind a warning or such.
{quote}

Assuming they'd be different code paths, it would make sense to consider them separate features.

{quote}
possibly putting the non-indexed version behind a warning or such. I'm sure some will absolutely try to sort 10^9 unindexed items with limit 10. At least they should know that it has a completely different op cost.
{quote}

I'd be a little concerned that we'd generate a lot of noise if we warn every time a user sorts on a non-indexed field. I'm giving myself a 100% chance of doing this on partitions with tens to hundreds of rows, and seeing warnings every time would render them useless. I think a threshold makes more sense here, similar to tombstone_warn_threshold, and maybe even a failure similar to tombstone_failure_threshold.
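A rough sketch of the threshold idea from the comment above, in plain Python rather than Cassandra code. The setting names `SORT_WARN_THRESHOLD` and `SORT_FAILURE_THRESHOLD`, their values, and the `check_sort_cost` helper are all hypothetical, chosen only by analogy with tombstone_warn_threshold and tombstone_failure_threshold.

```python
import logging

logger = logging.getLogger("sort_guard")

# Hypothetical settings; names and values are assumptions, not real options.
SORT_WARN_THRESHOLD = 1_000
SORT_FAILURE_THRESHOLD = 100_000


class SortThresholdExceeded(Exception):
    """Raised when a read-time sort would scan too many rows."""


def check_sort_cost(rows_scanned: int,
                    warn_threshold: int = SORT_WARN_THRESHOLD,
                    failure_threshold: int = SORT_FAILURE_THRESHOLD) -> bool:
    """Return True if a warning was logged; raise if the query must fail."""
    if rows_scanned > failure_threshold:
        raise SortThresholdExceeded(
            f"sort scanned {rows_scanned} rows (limit {failure_threshold})")
    if rows_scanned > warn_threshold:
        logger.warning("sort scanned %d rows (warn threshold %d)",
                       rows_scanned, warn_threshold)
        return True
    return False  # small partitions stay silent, so warnings keep their value
```

With these made-up values, a sort over a few hundred rows logs nothing, a sort over 5,000 rows logs one warning, and a sort over 200,000 rows is rejected.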
[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949436#comment-14949436 ]

Jonathan Shook commented on CASSANDRA-10489:

Would this need to be limited to indexed (in some form) fields? Without an index, it would be difficult for the coordinator to know the bound of sorting ahead of time. Or would this be for rows selected by some indexed field with limit, and then sorted only after limit was applied?

Essentially, should we define this as a valid goal for results for which we already can know the cardinality bounds without traversing the whole partition?
[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949450#comment-14949450 ]

Jon Haddad commented on CASSANDRA-10489:

I don't see this as any different than selecting 10K rows out of a relational DB and sorting on one of the fields. I realize this could potentially be a little ridiculous if you're working on some crazy time series; at that point you'd want to have multiple tables to manage the query performance.

There are plenty of cases, however, which are limited to hundreds or thousands of rows and which would work perfectly fine with in-memory sorting. An example would be a table of hourly aggregated data where you need to find the top 10 hours (of some field, maybe it's pageviews) in a year. It's really not necessary to have a secondary table for this, and it's silly to pull back 8K rows just to sort client side and get the top 10.
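For a sense of scale in the hourly example above, here is a small Python sketch of the client-side variant under made-up data: one row per hour for a 365-day year (8,760 rows, the "8K" mentioned above), with synthetic pageview counts. The `synthetic_pageviews` function and the row shape are hypothetical stand-ins for real data.

```python
import heapq
from typing import List, Tuple

HOURS_PER_YEAR = 365 * 24  # 8,760 hourly rows in a non-leap year

HourRow = Tuple[int, int]  # (hour_of_year, pageviews); hypothetical row shape


def synthetic_pageviews(hour: int) -> int:
    """Deterministic fake pageview count, for illustration only."""
    return (hour * 7919) % 10007


def top_hours(rows: List[HourRow], limit: int = 10) -> List[HourRow]:
    """Client-side top-N by pageviews over rows that were already fetched."""
    return heapq.nlargest(limit, rows, key=lambda row: row[1])


# Client side: every hourly row is pulled back before sorting.
rows = [(hour, synthetic_pageviews(hour)) for hour in range(HOURS_PER_YEAR)]
top10 = top_hours(rows)

rows_fetched_client_side = len(rows)   # 8760
rows_needed_by_caller = len(top10)     # 10
```

The sort itself is cheap at this size. The difference between the two approaches is the 8,760 rows shipped to the client against the 10 rows the caller actually needs.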
[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949613#comment-14949613 ]

Jonathan Shook commented on CASSANDRA-10489:

I'm totally cool with a threshold warning here. But something that is easily ignored is easily ignored, like log spam. Also, if it is documented clearly in terms of op costs, I'm ok with that too. Anywhere we have a list of "these things that can be expensive if you don't understand what they are doing", this should be on it.