[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions

2015-10-09 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950117#comment-14950117
 ] 

Sylvain Lebresne commented on CASSANDRA-10489:

For the "non-indexed" part, the big problem with this is paging. I don't think
we can do arbitrary read-time ordering in the presence of paging, at least not
given how things currently work. And I'm *very* opposed to supporting it only
if the user disables paging, as 1) that would be really confusing, 2) that
would encourage users to disable paging, which we don't want, and 3) sorting
client side ain't that hard. And at least in that "non-indexed" case, I don't
buy the "it's better to do it server side if you want a LIMIT" argument:
pulling 10k rows into memory to sort them when you only care about the first 10
is a stupid idea whether that's client or server side; you should use a
materialized view (MV) or any other smarter solution.

As for the "indexed" part, I kind of think that it's what MVs are for. Of
course, if you see a much better way to do this than through MVs, feel free to
share.

> arbitrary order by on partitions
> 
>
> Key: CASSANDRA-10489
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10489
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jon Haddad
> Priority: Minor
>
> We've got aggregations, so we might as well allow sorting rows within a
> partition on arbitrary fields. Currently the advice is "do it client side",
> but when combined with a LIMIT clause it makes sense to do this server side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions

2015-10-09 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950785#comment-14950785
 ] 

Jeremiah Jordan commented on CASSANDRA-10489:

bq. And I'm very opposed to supporting it only if the user disable paging

+1. I too don't see a good way to do this in the face of paging without
ignoring paging. And then you are back to the whole reason we made paging in
the first place: people were issuing queries that would materialize large data
sets and crash nodes.

I agree, I think a materialized view with a different clustering order is where 
this should happen.
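
The paging objection can be illustrated with a small sketch. This is not
Cassandra code; the partition contents, the page size, and the function names
are all made up for illustration. It shows that sorting each page as it is read
keeps memory bounded but does not give a global order, while a correct global
order requires materializing the whole partition before the first page can be
returned.

```python
def pages(rows, page_size):
    """Yield rows in their given order, one page at a time, as a paged read would."""
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]


def sort_each_page(rows, page_size):
    """Sort within each page only: memory stays bounded by page_size,
    but the concatenated result is not globally sorted."""
    out = []
    for page in pages(rows, page_size):
        out.extend(sorted(page, key=lambda r: r[1]))
    return out


def sort_then_page(rows, page_size):
    """Globally correct order, but the whole partition is sorted (and so
    held in memory) before the first page is produced."""
    return list(pages(sorted(rows, key=lambda r: r[1]), page_size))


# Hypothetical partition: (clustering_key, pageviews), stored in clustering order.
partition = [(1, 40), (2, 10), (3, 90), (4, 20), (5, 70), (6, 30)]

per_page = sort_each_page(partition, page_size=2)
global_pages = sort_then_page(partition, page_size=2)
```

Here `per_page` starts with `(2, 10), (1, 40)`, whereas the first page of
`global_pages` is `(2, 10), (4, 20)`: the second-smallest row lives in a later
page, so it cannot be known without reading past the first page.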



[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions

2015-10-08 Thread Jonathan Shook (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949492#comment-14949492
 ] 

Jonathan Shook commented on CASSANDRA-10489:


So, against a non-indexed field, the processing bound will be the size of the
partition. If you only hold a scoreboard of LIMIT items in memory and stream
through the rest, replacing items as better ones arrive, the memory
requirements are lower, but the IO requirements could still be substantial. If
you do this with RF>1 and CL>1, then the coordinator may also have to merge
results from several replicas, but this should still be bounded by the result
size and not the search space.
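
A minimal sketch of the scoreboard idea, in Python rather than anything from
the Cassandra code base. The row shape `(sort_value, row_id)`, the function
names, and the set-based de-duplication at the coordinator are assumptions made
for illustration; real replica reconciliation is more involved.

```python
import heapq
from typing import Iterable, List, Tuple

Row = Tuple[int, str]  # (sort_value, row_id); hypothetical row shape


def top_k(rows: Iterable[Row], limit: int) -> List[Row]:
    """Stream through rows, holding at most `limit` of them in memory."""
    if limit <= 0:
        return []
    scoreboard: List[Row] = []  # min-heap: scoreboard[0] is the weakest kept row
    for row in rows:
        if len(scoreboard) < limit:
            heapq.heappush(scoreboard, row)
        elif row > scoreboard[0]:
            # The new row beats the weakest kept row, so replace it.
            heapq.heapreplace(scoreboard, row)
    return sorted(scoreboard, reverse=True)


def merge_at_coordinator(replica_results: List[List[Row]], limit: int) -> List[Row]:
    """Merge per-replica scoreboards. The input is bounded by
    number_of_replicas * limit, not by the partition size."""
    merged = set()  # assumption: identical rows from different replicas collapse
    for result in replica_results:
        merged.update(result)
    return top_k(merged, limit)
```

Every row is still read once, which is the IO cost mentioned above; only the
memory footprint is bounded by the limit.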

I would like for us to consider this operation for indexed fields and 
non-indexed fields as separate features, possibly putting the non-indexed 
version behind a warning or such. I'm sure some will absolutely try to sort 
10^9 items with limit 10. At least they should know that it has a completely 
different op cost.




[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions

2015-10-08 Thread Jon Haddad (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949521#comment-14949521
 ] 

Jon Haddad commented on CASSANDRA-10489:


{quote}
I would like for us to consider this operation for indexed fields and 
non-indexed fields as separate features, possibly putting the non-indexed 
version behind a warning or such.
{quote}

Assuming they'd be different code paths, it would make sense to consider them
separate features.

{quote}
possibly putting the non-indexed version behind a warning or such. I'm sure 
some will absolutely try to sort 10^9 unindexed items with limit 10. At least 
they should know that it has a completely different op cost.
{quote}

I'd be a little concerned we'd be generating a lot of noise if we warn every
time a user sorts on a non-indexed field. I'm giving myself a 100% chance of
doing this on partitions with tens to hundreds of rows, and seeing warnings
every time would render them useless. I think a threshold makes more sense
here, similar to tombstone_warn_threshold, and maybe even a failure similar to
tombstone_failure_threshold.
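
A rough sketch of what such a pair of thresholds could look like, written in
Python for brevity. The setting names `SORT_WARN_THRESHOLD` and
`SORT_FAILURE_THRESHOLD`, their values, and the exception type are hypothetical
and only modeled on the tombstone settings; nothing here is an existing
Cassandra option.

```python
import logging
from typing import Any, Callable, Iterable, List, TypeVar

T = TypeVar("T")
logger = logging.getLogger("read_time_sort")

# Hypothetical settings, named by analogy with tombstone_warn_threshold and
# tombstone_failure_threshold. The values are placeholders.
SORT_WARN_THRESHOLD = 1_000
SORT_FAILURE_THRESHOLD = 100_000


class SortThresholdExceeded(Exception):
    """Raised when a read-time sort scans more rows than the failure threshold."""


def guarded_sort(rows: Iterable[T], key: Callable[[T], Any],
                 warn: int = SORT_WARN_THRESHOLD,
                 fail: int = SORT_FAILURE_THRESHOLD) -> List[T]:
    """Sort rows in memory, warning or failing based on how many were scanned."""
    buffered: List[T] = []
    for row in rows:
        buffered.append(row)
        if len(buffered) > fail:
            # Abort early instead of materializing an unbounded partition.
            raise SortThresholdExceeded(
                "sort aborted after scanning more than %d rows" % fail)
    if len(buffered) > warn:
        logger.warning("sorted %d rows, above the warn threshold of %d",
                       len(buffered), warn)
    return sorted(buffered, key=key)
```

With this shape, partitions of tens to hundreds of rows sort silently, larger
ones log a warning, and very large ones fail before they can exhaust memory.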



[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions

2015-10-08 Thread Jonathan Shook (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949436#comment-14949436
 ] 

Jonathan Shook commented on CASSANDRA-10489:


Would this need to be limited to indexed (in some form) fields? Without an 
index, it would be difficult for the coordinator to know the bound of sorting 
ahead of time. Or would this be for rows selected by some indexed field with 
limit, and then sorted only after limit was applied?

Essentially, should we define this as a valid goal for results whose
cardinality bounds we can already know without traversing the whole partition?



[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions

2015-10-08 Thread Jon Haddad (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949450#comment-14949450
 ] 

Jon Haddad commented on CASSANDRA-10489:


I don't see this as any different from selecting 10K rows out of a relational
DB and sorting on one of the fields. I realize this could potentially be a
little ridiculous if you're working on some crazy time series; at that point
you'd want to have multiple tables to manage the query performance. There are
plenty of cases, however, which are limited to hundreds or thousands of rows
and would work perfectly fine with in-memory sorting. An example would be a
table of hourly aggregated data where you need to find the top 10 hours (by
some field, maybe pageviews) in a year. It's really not necessary to have a
secondary table for this, and it's silly to pull back 8K rows just to sort
client side and get the top 10.



[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions

2015-10-08 Thread Jonathan Shook (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949613#comment-14949613
 ] 

Jonathan Shook commented on CASSANDRA-10489:


I'm totally cool with a threshold warning here. But something that is easily 
ignored is easily ignored, like log spam. Also, if it is documented clearly in 
terms of op costs, I'm ok with that too. Anywhere we have a list of "these 
things that can be expensive if you don't understand what they are doing", this 
should be on it.
