I think in general statistics/sortedness would be useful to have in the Arrow spec (it has come up in the past I think most recently around Min/Max). A few thoughts:
1. We've previously hesitated to specify sort order for different types, we would need to account for that in any formalization. 2. I think it would be better to expand the specification to have flatbuffer structures to specify statistics/ordering. I don't think the current metadata facility is a great place for it because, IIUC, it requires Key-Value types of UTF8 strings and not bytes. -Micah On Tue, May 11, 2021 at 11:49 AM Julian Hyde <jhyde.apa...@gmail.com> wrote: > Note that Calcite’s Statistic interface is heavily simplified, designed to > be really simple for people to implement when they write their first table > adapter. There are more advanced forms of metadata, such as > RelMdDistribution [1] and Collation [2]. > > Since Arrow data sets will typically consist of many files, spread over > many nodes, I think it is important that the “sortedness” concept should be > combined with “distribution”. > > Consider this example: I have files that represent sales; they are > hash-partitioned by zip code and range-partitioned by date; the file name > includes the year; the data within a file that has the same zip code is > sorted by date (but two records in the same files that have different zip > codes are not necessarily in date order). > > I would like to be able to represent all of that as part of my > “sortedness”. Knowing about distribution will inform when it is safe for > operators to work on parallel streams of data, when data needs to be > shuffled and merged to compute results as opposed to merely doing a union, > and knowing about sorting informs operators when they can safely emit > partial results. > > Julian > > [1] > https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMdDistribution.htm > < > https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMdDistribution.htm > > > > [2] > https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/BuiltInMetadata.Collation.html > < > https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/BuiltInMetadata.Collation.html> > l > > > On May 11, 2021, at 11:29 AM, Andy Grove <andygrov...@gmail.com> wrote: > > > > TableProvider has a statistics method already. The approach that Calcite > > takes is to include sort order as part of statistics [1], so that could > be > > one approach to consider. > > > > We may also want to add a method to LogicalPlan for returning the sort > > order (or statistics) for a particular operator. > > > > [1] > > > https://calcite.apache.org/javadocAggregate/org/apache/calcite/schema/Statistic.html > > > > > > On Tue, May 11, 2021 at 12:14 PM Andrew Lamb <al...@influxdata.com> > wrote: > > > >> I was imagining something known at Query Planning time (e.g if the data > you > >> are reading in from a parquet file is already sorted by `time` and the > >> query calls for sorting by time, the sort can be omitted). In this > case, I > >> was thinking "how would we communicate this information to DataFusion > from > >> a TableProvider" > >> > >> Another usecase for sortedness is if you are merging two parquet files > into > >> a single sorted output and you want to know the inputs are already > sorted, > >> you can simply merge the two streams together and save quite a lot of > >> processing time and intermediate buffers. > >> > >> > >> > >> On Tue, May 11, 2021 at 2:01 PM Andy Grove <andygrov...@gmail.com> > wrote: > >> > >>> I had been planning on adding a method to DataFusion's execution plan > to > >>> indicate the sort-order of the results (if known), similar to how we > >>> currently have information about output partitioning. > >>> > >>> Would this cover your requirement or are you looking for something > >> outside > >>> the context of execution plans? > >>> > >>> On Tue, May 11, 2021 at 11:52 AM Andrew Lamb <al...@influxdata.com> > >> wrote: > >>> > >>>> We are building a system that will likely make heavy use of sorted > >> data, > >>>> and we are trying to figure out how to encode the metadata of "how is > >>> this > >>>> data sorted". We can certainly use our own custom metadata fields, but > >>>> wanted to check for prior art and gauge community interest in adding > >>>> something to Arrow. More details are on [1]. > >>>> > >>>> Recording sort-order in Schema would likely be useful for DataFusion > >> as > >>>> well (to optimize away redundant computation if the data is already > >>> sorted > >>>> or pick more efficient algorithms (e.g. a MERGING grouping operator). > >>>> > >>>> I didn't see any obvious prior art on the mailing list [2] or in JIRA > >>>> [3][4] so I figured I would ask if others had any backstory or other > >>>> reactions. > >>>> > >>>> Thank you > >>>> Andrew > >>>> > >>>> > >>>> > >>>> > >>>> [1] https://github.com/apache/arrow-rs/issues/284 > >>>> [2] > >> https://lists.apache.org/list.html?dev@arrow.apache.org:lte=1y:sort > >>>> [3] > >>>> > >>>> > >>> > >> > https://issues.apache.org/jira/browse/ARROW-12087?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20summary%20~%20sort%20ORDER%20BY%20created%20DESC > >>>> [4] > >>>> > >>>> > >>> > >> > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20description%20~%20sort%20and%20component%20in%20(format) > >>>> > >>> > >> > >