Re: Any standard way for min/max values per record-batch?

2021-07-19 Thread Kohei KaiGai
Hello,

Let me share our trial to support the min/max statistics per record batch.
https://github.com/heterodb/pg-strom/wiki/806:-Apache-Arrow-Min-Max-Statistics-Hint

The latest pg2arrow supports a --stat option that specifies the
columns for which min/max statistics are collected for each record batch.
If any are specified, the statistics are embedded in the
custom_metadata[] of the Field (in the Footer area).

The example below dumps a database table with statistics on the
lo_orderdate column.

$ pg2arrow -d postgres -o /dev/shm/flineorder_sort.arrow -t
lineorder_sort --stat=lo_orderdate --progress
RecordBatch[0]: offset=1640 length=268437080 (meta=920,
body=268436160) nitems=1303085
RecordBatch[1]: offset=268438720 length=268437080 (meta=920,
body=268436160) nitems=1303085
  :
RecordBatch[9]: offset=2415935360 length=55668888 (meta=920,
body=55667968) nitems=270231

Then you can find the custom metadata on the lo_orderdate field:
min_values and max_values.
These are comma-separated integer lists with 10 elements, one per
record batch, so the first items of min_values and max_values are the
min/max datum of record-batch[0].

$ python3
Python 3.6.8 (default, Aug 24 2020, 17:57:11)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> X = pa.RecordBatchFileReader('/dev/shm/flineorder_sort.arrow')
>>> X.schema
lo_orderkey: decimal(30, 8)
lo_linenumber: int32
lo_custkey: decimal(30, 8)
lo_partkey: int32
lo_suppkey: decimal(30, 8)
lo_orderdate: int32
  -- field metadata --
  min_values: '19920101,19920919,19930608,19940223,1994,19950730,1996' + 31
  max_values: '19920919,19930608,19940223,1994,19950730,19960417,1997' + 31
lo_orderpriority: fixed_size_binary[15]
lo_shippriority: fixed_size_binary[1]
lo_quantity: decimal(30, 8)
lo_extendedprice: decimal(30, 8)
lo_ordertotalprice: decimal(30, 8)
  :

When we scan an arrow_fdw foreign table that maps an Apache Arrow file
carrying these min/max statistics, it automatically checks the
WHERE-clause and skips record-batches that contain no rows that can
survive the filter.

postgres=# EXPLAIN ANALYZE
   SELECT count(*) FROM flineorder_sort
WHERE lo_orderpriority = '2-HIGH'
  AND lo_orderdate BETWEEN 19940101 AND 19940630;

 QUERY PLAN
--
 Aggregate  (cost=33143.09..33143.10 rows=1 width=8) (actual
time=115.591..115.593 rows=1 loops=1)
   ->  Custom Scan (GpuPreAgg) on flineorder_sort
(cost=33139.52..33142.07 rows=204 width=8) (actual
time=115.580..115.585 rows=1 loops=1)
 Reduction: NoGroup
 Outer Scan: flineorder_sort  (cost=4000.00..33139.42 rows=300
width=0) (actual time=10.682..21.555 rows=2606170 loops=1)
 Outer Scan Filter: ((lo_orderdate >= 19940101) AND
(lo_orderdate <= 19940630) AND (lo_orderpriority = '2-HIGH'::bpchar))
 Rows Removed by Outer Scan Filter: 2425885
 referenced: lo_orderdate, lo_orderpriority
 Stats-Hint: (lo_orderdate >= 19940101), (lo_orderdate <=
19940630)  [loaded: 2, skipped: 8]
 files0: /dev/shm/flineorder_sort.arrow (read: 217.52MB, size:
2357.11MB)
 Planning Time: 0.210 ms
 Execution Time: 153.508 ms
(11 rows)

This EXPLAIN ANALYZE output includes the Stats-Hint line.
It says Arrow_Fdw could use (lo_orderdate >= 19940101) and
(lo_orderdate <= 19940630) to check the min/max statistics, and
actually skipped 8 record-batches, so only 2 record-batches were
loaded.
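
The skip decision itself is simple interval logic. Here is a minimal
Python sketch of it, using hypothetical per-batch statistics shaped like
the example (the full lists are truncated in the schema dump above); this
is an illustration, not Arrow_Fdw's actual code.

```python
def surviving_batches(min_values, max_values, lo, hi):
    """Indices of record batches whose [min, max] range can intersect
    the predicate lo <= value <= hi; all other batches are skippable."""
    return [i for i, (mn, mx) in enumerate(zip(min_values, max_values))
            if mx >= lo and mn <= hi]

# Hypothetical lo_orderdate statistics for 10 record batches:
mins = [19920101, 19920919, 19930608, 19940223, 19940922,
        19950730, 19960417, 19970201, 19971105, 19980609]
maxs = [19920919, 19930608, 19940223, 19940922, 19950730,
        19960417, 19970201, 19971105, 19980609, 19980802]
loaded = surviving_batches(mins, maxs, 19940101, 19940630)
print(loaded)  # -> [2, 3]: 2 batches loaded, 8 skipped
```

Note the check is conservative: a batch whose range overlaps the
predicate may still contain no matching rows, so the per-row filter must
still run on the loaded batches, as the "Rows Removed by Outer Scan
Filter" line shows.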


We expect this to benefit IoT/M2M-grade time-series log processing,
because those workloads always carry a timestamp for each entry, and
physically close rows tend to have similar values.
These min/max statistics are not limited to Apache Arrow files
generated by pg2arrow: they can be appended by rewriting the Footer
portion alone, without relocating the record-batches. So we plan to
provide a standalone command to attach min/max statistics to existing
Apache Arrow files generated by other tools.

Best regards,

On Thu, Feb 18, 2021 at 13:33 Kohei KaiGai wrote:
>
> Thanks for the clarification.
>
> > There is key-value metadata available on Message which might be able to
> > work in the short term (some sort of encoded message).  I think
> > standardizing how we store statistics per batch does make sense.
> >
> For example, JSON array of min/max values as a key-value metadata
> in the Footer->Schema->Fields[]->custom_metadata?
> Even though each metadata value must be less than INT_MAX bytes, I think
> it is a sufficiently portable and non-invasive way.
>
> > We unfortunately can't add anything to field-node without breaking
> > compatibility.  But  another option would be to add a new structure as a
> > parallel list on RecordBatch itself.
> >
> > If we do add a new structure or arbitrary key-value pair we should not use
> > KeyValue but should have something where the values can be 

[Rust] Proposed 5.0.0 release blog

2021-07-19 Thread Andrew Lamb
Here is a PR [1] with a proposed 5.0.0 Rust release blog announcement.

Help / Comments / Contributions more than welcome

Andrew

[1] https://github.com/apache/arrow-site/pull/128


Re: [Discuss] [Rust] Arrow2/parquet2 going forward

2021-07-19 Thread Andrew Lamb
> If we do indeed have the expectation of stability over its whole public
> surface,

I certainly do not have this expectation between major releases. Who does?

I believe it is a disservice to the overall community to release two API
incompatible Rust implementations of Arrow to crates.io. It will
1. potentially confuse new users
2. split development effort
3. encourage writing more code that relies on the old API.

The Rust Arrow community has been *more than supportive* of the changes you
are proposing in arrow2 -- there is strong support for switching. The only
thing preventing that movement is for you to decide you are ready to
release it to the wider audience and then let us help you do that.

Making major public API changes for additional benefit between arrow 5.0.0
and arrow 6.0.0 (or other future versions) is perfectly compatible with
semantic versioning, as other software projects demonstrate.

Andrew

On Mon, Jul 19, 2021 at 2:08 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> Whatever its technical faults may be, projects that rely on arrow (such as
> > anything based on DataFusion, like my own) need to be supported as they
> > have made the bet on Rust Arrow.
> >
>
> 1.X versioning in Apache Arrow was never meant to represent the stability
> of the individual libraries, but only the stability of the C++/Python
> implementations and the spec. It is a misconception that the Rust
> implementation is stable and/or ready for production; its version is
> aligned with the general Apache Arrow versioning simply for historical
> reasons. Requiring arrow2 to also be marked as stable is imo just dragging
> this onwards.
>
> As primary developer of arrow2 and a contributor of some of the major
> pieces of arrow-rs, I am saying that:
>
> * arrow-rs does not have a stable API: it requires large, incompatible
> changes to even make it *safe*
> * arrow2 does not have a stable API: it requires incompatible changes to
> improve UX, performance, and functionality
> * using arrow2 core API results in faster, safer, and less error-prone code
>
> The main difference is that arrow-rs requires API changes to its core
> modules (buffer, bytes, etc), while arrow2 requires changes to its
> peripheral modules (compute and IO). This is why imo we can make arrow2
> available: expected changes will only break a small surface of the public
> API which, while incompatible, are easy to address.
>
> Which is the gist of my proposal:
>
>- Arrow2 starts its release in crates.io as 0.1
>- A major release (e.g. 0.16.2 -> 1.0.0):
>   - must be voted
>   - may be backward incompatible
>- Minor releases (e.g. 0.16.1 -> 0.17.0):
>   - must be voted
>   - may be backward incompatible
>   - may have new features
>- Patch releases (e.g. 0.16.1 -> 0.16.2):
>   - may be voted
>   - must be backward compatible
>   - may have new features
>- Minor releases may have a maintenance period (e.g. 3+ months) over
>which we guarantee security patches and feature backports.
>- Major releases have a maintenance period over which we guarantee
>security patches and feature backports according to semver 2.0.
>
> So that:
>
>- It aligns expectations with respect to the current state of the Rust
>implementation
>- it offers support to downstream dependencies that require longer-term
>stability
>- it offers room for developers to improve its API, scrutinize security,
>etc.
>
> If we do indeed have an expectation of stability over its whole public
> surface, then I suggest that we keep arrow2 in the experimental repo as it
> is today.
>
> Btw, this is why some in the Rust community recommend using smaller crates:
> so that versioning is not bound to a large public API surface and can thus
> more easily be applied to smaller surfaces. There is of course a tradeoff
> with maintenance of CI and releases.
>
> Best,
> Jorge
>
> On Sat, Jul 17, 2021 at 1:59 PM Andrew Lamb  wrote:
>
> > What if we released "beta" [1] versions of arrow on cargo at whatever
> pace
> > was necessary? That way dependent crates could opt in to bleeding edge
> > functionality / APIs.
> >
> > There is tension between full technical freedom to change APIs and the
> > needs of downstream projects for a more stable API.
> >
> > Whatever its technical faults may be, projects that rely on arrow (such
> as
> > anything based on DataFusion, like my own) need to be supported as they
> > have made the bet on Rust Arrow. I don't think we can abandon maintenance
> > on the existing codebase until we have a successor ready.
> >
> > Andrew
> >
> > p.s. I personally very much like Adam's suggestion for "Arrow 6.0 in Oct
> > 2021 be based on arrow2" but that is predicated on wanting to have arrow2
> > widely used by downstreams at that point.
> >
> > [1]
> >
> >
> https://stackoverflow.com/questions/46373028/how-to-release-a-beta-version-of-a-crate-for-limited-public-testing
> >
> >
> > On Sat, Jul 17, 2021 at 5:56 AM Adam Lippai  wrote:

Re: [Discuss] [Rust] Arrow2/parquet2 going forward

2021-07-19 Thread Jorge Cardoso Leitão
Hi,

Whatever its technical faults may be, projects that rely on arrow (such as
> anything based on DataFusion, like my own) need to be supported as they
> have made the bet on Rust Arrow.
>

1.X versioning in Apache Arrow was never meant to represent the stability
of the individual libraries, but only the stability of the C++/Python
implementations and the spec. It is a misconception that the Rust
implementation is stable and/or ready for production; its version is
aligned with the general Apache Arrow versioning simply for historical
reasons. Requiring arrow2 to also be marked as stable is imo just dragging
this onwards.

As primary developer of arrow2 and a contributor of some of the major
pieces of arrow-rs, I am saying that:

* arrow-rs does not have a stable API: it requires large, incompatible
changes to even make it *safe*
* arrow2 does not have a stable API: it requires incompatible changes to
improve UX, performance, and functionality
* using arrow2 core API results in faster, safer, and less error-prone code

The main difference is that arrow-rs requires API changes to its core
modules (buffer, bytes, etc), while arrow2 requires changes to its
peripheral modules (compute and IO). This is why imo we can make arrow2
available: expected changes will only break a small surface of the public
API which, while incompatible, are easy to address.

Which is the gist of my proposal:

   - Arrow2 starts its release in crates.io as 0.1
   - A major release (e.g. 0.16.2 -> 1.0.0):
  - must be voted
  - may be backward incompatible
   - Minor releases (e.g. 0.16.1 -> 0.17.0):
  - must be voted
  - may be backward incompatible
  - may have new features
   - Patch releases (e.g. 0.16.1 -> 0.16.2):
  - may be voted
  - must be backward compatible
  - may have new features
   - Minor releases may have a maintenance period (e.g. 3+ months) over
   which we guarantee security patches and feature backports.
   - Major releases have a maintenance period over which we guarantee
   security patches and feature backports according to semver 2.0.

So that:

   - It aligns expectations with respect to the current state of the Rust
   implementation
   - it offers support to downstream dependencies that require longer-term
   stability
   - it offers room for developers to improve its API, scrutinize security,
   etc.

If we do indeed have an expectation of stability over its whole public
surface, then I suggest that we keep arrow2 in the experimental repo as it
is today.

Btw, this is why some in the Rust community recommend using smaller crates:
so that versioning is not bound to a large public API surface and can thus
more easily be applied to smaller surfaces. There is of course a tradeoff
with maintenance of CI and releases.

Best,
Jorge

On Sat, Jul 17, 2021 at 1:59 PM Andrew Lamb  wrote:

> What if we released "beta" [1] versions of arrow on cargo at whatever pace
> was necessary? That way dependent crates could opt in to bleeding edge
> functionality / APIs.
>
> There is tension between full technical freedom to change APIs and the
> needs of downstream projects for a more stable API.
>
> Whatever its technical faults may be, projects that rely on arrow (such as
> anything based on DataFusion, like my own) need to be supported as they
> have made the bet on Rust Arrow. I don't think we can abandon maintenance
> on the existing codebase until we have a successor ready.
>
> Andrew
>
> p.s. I personally very much like Adam's suggestion for "Arrow 6.0 in Oct
> 2021 be based on arrow2" but that is predicated on wanting to have arrow2
> widely used by downstreams at that point.
>
> [1]
>
> https://stackoverflow.com/questions/46373028/how-to-release-a-beta-version-of-a-crate-for-limited-public-testing
>
>
> On Sat, Jul 17, 2021 at 5:56 AM Adam Lippai  wrote:
>
> > 5.0 is being released right now, which means from a timing perspective
> > this is the worst moment for arrow2, indeed. You'd need to wait the full
> > 3 months. On the other hand, does releasing a 6.0 beta based on arrow2
> > on Aug 1st, an rc on Sept 1st, and the stable release on Oct 1st sound
> > like a bad plan?
> >
> > I don't think a 6.0-beta release would be confusing, and dedicating most
> > of the 5.0->6.0 cycle to this change doesn't sound excessive.
> >
> > I think this approach wouldn't result in extra work (backporting the
> > important changes to the 5.1 and 5.2 releases). It only shows the
> > magnitude of this change; the work would be done by you anyway, and this
> > would just make it clear that this is a huge effort.
> >
> > Best regards,
> > Adam Lippai
> >
> > On Sat, Jul 17, 2021, 11:31 Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com
> > >
> > wrote:
> >
> > > Hi,
> > >
> > > Arrow2 and parquet2 have passed the IP clearance vote and are ready to
> be
> > > merged to apache/* repos.
> > >
> > > My plan is to merge them and PR to both of them to the latest updates
> on
> > my
> > > own repo, so that I can temporarily (and hopefully permanently) archive