Re: Any standard way for min/max values per record-batch?

Micah Kornfield Wed, 17 Feb 2021 21:08:48 -0800

>
> What does the parallel list mean?

Something like:


table RecordBatch {
  nodes: [FieldNode];
  // Statistics related to the data represented by each FieldNode.
  // This field is either length=0 or has the same length as nodes.
  statistics: [Statistic];
}
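
To make the intent concrete, here is a minimal Python sketch of how a reader might consume such a parallel list. Everything in it is an assumption for illustration: Statistic does not exist in the format today, its min_value/max_value fields are hypothetical, and may_contain is a made-up helper name. It only models the "empty or same length as nodes" rule and the resulting batch skipping.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldNode:
    length: int
    null_count: int


@dataclass
class Statistic:
    # Hypothetical structure: no such table exists in the format today.
    min_value: Optional[int]
    max_value: Optional[int]


@dataclass
class RecordBatch:
    nodes: List[FieldNode]
    statistics: List[Statistic]

    def __post_init__(self) -> None:
        # The parallel list is either empty or has one entry per FieldNode.
        if self.statistics and len(self.statistics) != len(self.nodes):
            raise ValueError("statistics must be empty or parallel to nodes")


def may_contain(batch: RecordBatch, field_index: int, lo: int, hi: int) -> bool:
    """Return False only if the batch provably has no value in [lo, hi]."""
    if not batch.statistics:
        return True  # no statistics recorded: the batch must be scanned
    stat = batch.statistics[field_index]
    if stat.min_value is None or stat.max_value is None:
        return True  # unknown bounds: the batch must be scanned
    return stat.max_value >= lo and stat.min_value <= hi


batches = [
    RecordBatch([FieldNode(100, 0)], [Statistic(0, 99)]),
    RecordBatch([FieldNode(100, 0)], [Statistic(100, 199)]),
    RecordBatch([FieldNode(100, 0)], []),  # writer that emits no statistics
]
candidates = [i for i, b in enumerate(batches) if may_contain(b, 0, 120, 130)]
```

Old writers would simply leave the list empty, so readers have to treat missing statistics as "cannot skip" rather than as an error.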

On Wed, Feb 17, 2021 at 8:34 PM Kohei KaiGai <[email protected]> wrote:

> Thanks for the clarification.
>
> > There is key-value metadata available on Message which might be able to
> > work in the short term (some sort of encoded message).  I think
> > standardizing how we store statistics per batch does make sense.
> >
> For example, a JSON array of min/max values stored as key-value metadata
> in Footer->Schema->Fields[]->custom_metadata?
> Even though the metadata field must be smaller than INT_MAX in size, I
> think it is a sufficiently portable and non-invasive approach.
>
> > We unfortunately can't add anything to field-node without breaking
> > compatibility.  But  another option would be to add a new structure as a
> > parallel list on RecordBatch itself.
> >
> > If we do add a new structure or arbitrary key-value pair we should not use
> > KeyValue but should have something where the values can be bytes.
> >
> What does the parallel list mean?
> If we had a standardized binary structure, like DictionaryBatch, to store
> statistics including min/max values, it would of course make more sense
> than text-encoded key-value metadata.
>
> Best regards,
>
> On Thu, Feb 18, 2021 at 12:37, Micah Kornfield <[email protected]> wrote:
> >
> > There is key-value metadata available on Message which might be able to
> > work in the short term (some sort of encoded message).  I think
> > standardizing how we store statistics per batch does make sense.
> >
> > We unfortunately can't add anything to field-node without breaking
> > compatibility.  But  another option would be to add a new structure as a
> > parallel list on RecordBatch itself.
> >
> > If we do add a new structure or arbitrary key-value pair we should not use
> > KeyValue but should have something where the values can be bytes.
> >
> > On Wed, Feb 17, 2021 at 7:17 PM Kohei KaiGai <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > Does Apache Arrow have any standard way to embed min/max values of the
> > > fields on a per-record-batch basis?
> > > It looks like FieldNode supports neither a dedicated min/max attribute
> > > nor custom metadata.
> > > https://github.com/apache/arrow/blob/master/format/Message.fbs#L28
> > >
> > > If we embed an array of min/max values into the custom metadata of the
> > > Field node, we may be able to implement this.
> > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L344
> > >
> > > What I would like to implement is something like the BRIN index in PostgreSQL.
> > > http://heterodb.github.io/pg-strom/brin/
> > >
> > > This index contains only the min/max values for particular block ranges,
> > > and the query executor can skip blocks that obviously don't contain the
> > > target data.
> > > If we can skip 9990 of 10000 record batches by checking metadata on a
> > > query that tries to fetch items within a very narrow timestamp range, it
> > > is a great acceleration compared with full file scans.
> > >
> > > Best regards,
> > > --
> > > HeteroDB, Inc / The PG-Strom Project
> > > KaiGai Kohei <[email protected]>
> > >
>
>
>
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei <[email protected]>
>
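
For the short-term option discussed in the quoted thread (a JSON array of min/max values kept in key-value metadata), a rough Python sketch might look like the following. The metadata key "min_max_per_batch" and both function names are assumptions made up for this example, not anything Arrow defines, and the custom metadata is modelled as a plain dict of strings rather than through a real Arrow API.

```python
import json
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical metadata key; nothing in the Arrow format standardizes it.
STATS_KEY = "min_max_per_batch"


def encode_min_max(ranges: Sequence[Tuple[int, int]]) -> str:
    """Encode one [min, max] pair per record batch as a JSON array."""
    return json.dumps([[lo, hi] for lo, hi in ranges])


def batches_to_scan(
    custom_metadata: Dict[str, str], lo: int, hi: int
) -> Optional[List[int]]:
    """Return indices of batches whose [min, max] overlaps [lo, hi].

    Returns None when no statistics are present, meaning every batch
    has to be scanned.
    """
    encoded = custom_metadata.get(STATS_KEY)
    if encoded is None:
        return None
    ranges = json.loads(encoded)
    return [
        i
        for i, (batch_min, batch_max) in enumerate(ranges)
        if batch_max >= lo and batch_min <= hi
    ]


# 10000 batches, each covering 100 consecutive timestamps.
ranges = [(i * 100, i * 100 + 99) for i in range(10000)]
field_metadata = {STATS_KEY: encode_min_max(ranges)}

# A narrow timestamp range touches only a handful of batches.
selected = batches_to_scan(field_metadata, 500000, 500999)
skipped = len(ranges) - len(selected)
```

With this synthetic layout the narrow query selects 10 batches and skips the other 9990, which is the kind of saving described in the original question. A binary structure would avoid the text encoding and would also cover non-integer types more naturally.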
