Re: [Discuss] Storing metadata about the "sortedness" of data

Micah Kornfield Tue, 11 May 2021 12:05:32 -0700

I think in general statistics/sortedness would be useful to have in the
Arrow spec (it has come up in the past I think most recently around
Min/Max).  A few thoughts:


1.  We've previously hesitated to specify sort order for different types,
we would need to account for that in any formalization.
2.  I think it would be better to expand the specification to have
flatbuffer structures to specify statistics/ordering.  I don't think the
current metadata facility is a great place for it because, IIUC, it
requires Key-Value types of UTF8 strings and not bytes.

-Micah

On Tue, May 11, 2021 at 11:49 AM Julian Hyde <[email protected]> wrote:

> Note that Calcite’s Statistic interface is heavily simplified, designed to
> be really simple for people to implement when they write their first table
> adapter. There are more advanced forms of metadata, such as
> RelMdDistribution [1] and Collation [2].
>
> Since Arrow data sets will typically consist of many files, spread over
> many nodes, I think it is important that the “sortedness” concept should be
> combined with “distribution”.
>
> Consider this example: I have files that represent sales; they are
> hash-partitioned by zip code and range-partitioned by date; the file name
> includes the year; the data within a file that has the same zip code is
> sorted by date (but two records in the same files that have different zip
> codes are not necessarily in date order).
>
> I would like to be able to represent all of that as part of my
> “sortedness”. Knowing about distribution will inform when it is safe for
> operators to work on parallel streams of data, when data needs to be
> shuffled and merged to compute results as opposed to merely doing a union,
> and knowing about sorting informs operators when they can safely emit
> partial results.
>
> Julian
>
> [1]
> https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMdDistribution.htm
> <
> https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMdDistribution.htm
> >
>
> [2]
> https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/BuiltInMetadata.Collation.html
> <
> https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/BuiltInMetadata.Collation.html>
> l
>
> > On May 11, 2021, at 11:29 AM, Andy Grove <[email protected]> wrote:
> >
> > TableProvider has a statistics method already. The approach that Calcite
> > takes is to include sort order as part of statistics [1], so that could
> be
> > one approach to consider.
> >
> > We may also want to add a method to LogicalPlan for returning the sort
> > order (or statistics) for a particular operator.
> >
> > [1]
> >
> https://calcite.apache.org/javadocAggregate/org/apache/calcite/schema/Statistic.html
> >
> >
> > On Tue, May 11, 2021 at 12:14 PM Andrew Lamb <[email protected]>
> wrote:
> >
> >> I was imagining something known at Query Planning time (e.g if the data
> you
> >> are reading in from a parquet file is already sorted by `time` and the
> >> query calls for sorting by time, the sort can be omitted). In this
> case, I
> >> was thinking "how would we communicate this information to DataFusion
> from
> >> a TableProvider"
> >>
> >> Another usecase for sortedness is if you are merging two parquet files
> into
> >> a single sorted output and you want to know the inputs are already
> sorted,
> >> you can simply merge the two streams together and save quite a lot of
> >> processing time and intermediate buffers.
> >>
> >>
> >>
> >> On Tue, May 11, 2021 at 2:01 PM Andy Grove <[email protected]>
> wrote:
> >>
> >>> I had been planning on adding a method to DataFusion's execution plan
> to
> >>> indicate the sort-order of the results (if known), similar to how we
> >>> currently have information about output partitioning.
> >>>
> >>> Would this cover your requirement or are you looking for something
> >> outside
> >>> the context of execution plans?
> >>>
> >>> On Tue, May 11, 2021 at 11:52 AM Andrew Lamb <[email protected]>
> >> wrote:
> >>>
> >>>> We are building a system that will likely make heavy use of sorted
> >> data,
> >>>> and we are trying to figure out how to encode the metadata of "how is
> >>> this
> >>>> data sorted". We can certainly use our own custom metadata fields, but
> >>>> wanted to check for prior art and gauge community interest in adding
> >>>> something to Arrow. More details are on [1].
> >>>>
> >>>> Recording sort-order in Schema  would likely be useful for DataFusion
> >> as
> >>>> well (to optimize away redundant computation if the data is already
> >>> sorted
> >>>> or pick more efficient algorithms (e.g. a MERGING grouping operator).
> >>>>
> >>>> I didn't see any obvious prior art on the mailing list [2] or in JIRA
> >>>> [3][4] so I figured I would ask if others had any backstory or other
> >>>> reactions.
> >>>>
> >>>> Thank you
> >>>> Andrew
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> [1] https://github.com/apache/arrow-rs/issues/284
> >>>> [2]
> >> https://lists.apache.org/[email protected]:lte=1y:sort
> >>>> [3]
> >>>>
> >>>>
> >>>
> >>
> https://issues.apache.org/jira/browse/ARROW-12087?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20summary%20~%20sort%20ORDER%20BY%20created%20DESC
> >>>> [4]
> >>>>
> >>>>
> >>>
> >>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20description%20~%20sort%20and%20component%20in%20(format)
> >>>>
> >>>
> >>
>
>

Re: [Discuss] Storing metadata about the "sortedness" of data

Reply via email to