Re: [Discuss] Storing metadata about the "sortedness" of data

Julian Hyde Tue, 11 May 2021 11:49:39 -0700

Note that Calcite’s Statistic interface is heavily simplified, designed to be 
really simple for people to implement when they write their first table 
adapter. There are more advanced forms of metadata, such as RelMdDistribution 
[1] and Collation [2].


Since Arrow data sets will typically consist of many files, spread over many 
nodes, I think it is important that the “sortedness” concept should be combined 
with “distribution”.

Consider this example: I have files that represent sales; they are 
hash-partitioned by zip code and range-partitioned by date; the file name 
includes the year; the data within a file that has the same zip code is sorted 
by date (but two records in the same files that have different zip codes are 
not necessarily in date order).

I would like to be able to represent all of that as part of my “sortedness”. 
Knowing about distribution will inform when it is safe for operators to work on 
parallel streams of data, when data needs to be shuffled and merged to compute 
results as opposed to merely doing a union, and knowing about sorting informs 
operators when they can safely emit partial results.

Julian

[1] 
https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMdDistribution.htm
 
<https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMdDistribution.htm>

[2] 
https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/BuiltInMetadata.Collation.html
 
<https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/BuiltInMetadata.Collation.html>
 l

> On May 11, 2021, at 11:29 AM, Andy Grove <[email protected]> wrote:
> 
> TableProvider has a statistics method already. The approach that Calcite
> takes is to include sort order as part of statistics [1], so that could be
> one approach to consider.
> 
> We may also want to add a method to LogicalPlan for returning the sort
> order (or statistics) for a particular operator.
> 
> [1]
> https://calcite.apache.org/javadocAggregate/org/apache/calcite/schema/Statistic.html
> 
> 
> On Tue, May 11, 2021 at 12:14 PM Andrew Lamb <[email protected]> wrote:
> 
>> I was imagining something known at Query Planning time (e.g if the data you
>> are reading in from a parquet file is already sorted by `time` and the
>> query calls for sorting by time, the sort can be omitted). In this case, I
>> was thinking "how would we communicate this information to DataFusion from
>> a TableProvider"
>> 
>> Another usecase for sortedness is if you are merging two parquet files into
>> a single sorted output and you want to know the inputs are already sorted,
>> you can simply merge the two streams together and save quite a lot of
>> processing time and intermediate buffers.
>> 
>> 
>> 
>> On Tue, May 11, 2021 at 2:01 PM Andy Grove <[email protected]> wrote:
>> 
>>> I had been planning on adding a method to DataFusion's execution plan to
>>> indicate the sort-order of the results (if known), similar to how we
>>> currently have information about output partitioning.
>>> 
>>> Would this cover your requirement or are you looking for something
>> outside
>>> the context of execution plans?
>>> 
>>> On Tue, May 11, 2021 at 11:52 AM Andrew Lamb <[email protected]>
>> wrote:
>>> 
>>>> We are building a system that will likely make heavy use of sorted
>> data,
>>>> and we are trying to figure out how to encode the metadata of "how is
>>> this
>>>> data sorted". We can certainly use our own custom metadata fields, but
>>>> wanted to check for prior art and gauge community interest in adding
>>>> something to Arrow. More details are on [1].
>>>> 
>>>> Recording sort-order in Schema  would likely be useful for DataFusion
>> as
>>>> well (to optimize away redundant computation if the data is already
>>> sorted
>>>> or pick more efficient algorithms (e.g. a MERGING grouping operator).
>>>> 
>>>> I didn't see any obvious prior art on the mailing list [2] or in JIRA
>>>> [3][4] so I figured I would ask if others had any backstory or other
>>>> reactions.
>>>> 
>>>> Thank you
>>>> Andrew
>>>> 
>>>> 
>>>> 
>>>> 
>>>> [1] https://github.com/apache/arrow-rs/issues/284
>>>> [2]
>> https://lists.apache.org/[email protected]:lte=1y:sort
>>>> [3]
>>>> 
>>>> 
>>> 
>> https://issues.apache.org/jira/browse/ARROW-12087?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20summary%20~%20sort%20ORDER%20BY%20created%20DESC
>>>> [4]
>>>> 
>>>> 
>>> 
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20description%20~%20sort%20and%20component%20in%20(format)
>>>> 
>>> 
>>

Re: [Discuss] Storing metadata about the "sortedness" of data

Reply via email to