Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou
> Statistics also bring in security concerns
> that are application-specific. e.g. can an algorithm trust min/max
> stats and risk producing incorrect results if the statistics are
> incorrect? A question that can't really be answered at the C Data
> Interface level.
> 
> The need for more sophisticated statistics only grows with time, so
> there is no such thing as a "simple statistics schema".
> 
> Protocols that produce/consume statistics might want to use the C Data
> Interface as a primitive for passing Arrow arrays of statistics.
> 
> ADBC might be too big of a leap in complexity now, but "we just need C
> Data Interface + statistics" is unlikely to remain true for very long
> as projects grow in complexity.
> 
> --
> Felipe
> 
> On Thu, May 23, 2024 at 9:57 AM Dewey Dunnington
>  wrote:
> >
> > Thank you for the background! I understand that these statistics are
> > important for query planning; however, I am not sure that I follow why
> > we are constrained to the ArrowSchema to represent them. The examples
> > given seem to be going through Python... would it be easier to request
> > statistics at a higher level of abstraction? There would already need
> > to be a separate mechanism to request an ArrowArrayStream with
> > statistics (unless the PyCapsule `requested_schema` argument would
> > suffice).
> >  
> > > ADBC may be a bit too large to use only for transmitting
> > > statistics. ADBC has statistics-related APIs, but it also has many
> > > other APIs.
> >
> > Some examples of producers given in the linked threads (Delta Lake,
> > Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> > can implement an ADBC driver without defining all the methods (where
> > the producer could call AdbcConnectionGetStatistics(), although
> > AdbcStatementGetStatistics() might be more relevant here and doesn't
> > exist). One example listed (using an Arrow Table as a source) seems a
> > bit light to wrap in an ADBC driver; however, it would not take much
> > code to do so, and the overhead of getting the reader via ADBC is
> > something like 100 microseconds (tested via the ADBC R package's
> > "monkey driver" which wraps an existing stream as a statement). In any
> > case, the bulk of the code is building the statistics array.
> >  
> > > How about the following schema for the
> > > statistics ArrowArray? It's based on ADBC.  
> >
> > Whatever format for statistics is decided on, I imagine it should be
> > exactly the same as the ADBC standard? (Perhaps pushing changes
> > upstream if needed?).
> >
> > On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:  
> > >
> > > Hi,
> > >  
> > > > Why not simply pass the statistics ArrowArray separately in your
> > > > producer API of choice  
> > >
> > > It seems that we should use that approach because all the
> > > feedback said so. How about the following schema for the
> > > statistics ArrowArray? It's based on ADBC.
> > >
> > > | Field Name   | Field Type| Comments |
> > > |--|---|  |
> > > | column_name  | utf8  | (1)  |
> > > | statistic_key| utf8 not null | (2)  |
> > > | statistic_value  | VALUE_SCHEMA not null |  |
> > > | statistic_is_approximate | bool not null | (3)  |
> > >
> > > 1. If null, then the statistic applies to the entire table.
> > >It's for "row_count".
> > > 2. We'll provide pre-defined keys such as "max", "min",
> > >"byte_width" and "distinct_count" but users can also use
> > >application specific keys.
> > > 3. If true, then the value is approximate or best-effort.
> > >
> > > VALUE_SCHEMA is a dense union with members:
> > >
> > > | Field Name | Field Type |
> > > |||
> > > | int64  | int64  |
> > > | uint64 | uint64 |
> > > | float64| float64|
> > > | binary | binary |
> > >
> > > If a column is an int32 column, it uses int64 for
> > > "max"/"min". We don't provide all types here. Users should
> > > use a compatible type (int64 for an int32 column) instead.
> > >
> > >
> > > Thanks,
> > > --
> > > kou
> > >

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou



Le 23/05/2024 à 16:09, Felipe Oliveira Carvalho a écrit :


Protocols that produce/consume statistics might want to use the C Data
Interface as a primitive for passing Arrow arrays of statistics.


This is also my opinion.

I think what we are slowly converging on is the need for a spec to 
describe the encoding of Arrow array statistics as Arrow arrays.


Regards

Antoine.


Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Antoine Pitrou



Hi Kou,

I agree with Dewey that this is overstretching the capabilities of the C 
Data Interface. In particular, stuffing a pointer into a metadata value and 
decreeing it immortal doesn't sound like a good design decision.


Why not simply pass the statistics ArrowArray separately in your 
producer API of choice (Dewey mentioned ADBC but it is of course just a 
possible API among others)?


Regards

Antoine.


Le 22/05/2024 à 04:37, Sutou Kouhei a écrit :

Hi,

We're discussing how to provide statistics through the C
data interface at:
https://github.com/apache/arrow/issues/38837

If you're interested in this feature, could you share your
comments?


Motivation:

We can interchange Apache Arrow data through the C data interface
within the same process. For example, we can pass Apache Arrow
data read by Apache Arrow C++ (provider) to DuckDB
(consumer) through the C data interface.

A provider may know Apache Arrow data statistics. For
example, a provider can know statistics when it reads Apache
Parquet data because Apache Parquet may provide statistics.

But a consumer can't access statistics that are known by a
producer, because there isn't a standard way to provide
statistics through the C data interface. If a consumer could
access those statistics, it could process Apache Arrow data
faster based on them.


Proposal:

https://github.com/apache/arrow/issues/38837#issuecomment-2123728784

How about providing statistics as metadata in ArrowSchema?

We reserve "ARROW" namespace for internal Apache Arrow use:

https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata


The ARROW pattern is a reserved namespace for internal
Arrow use in the custom_metadata fields. For example,
ARROW:extension:name.


So we can use "ARROW:statistics" for the metadata key.

We can represent statistics as an ArrowArray, like ADBC does.

Here is an example ArrowSchema that is for a record batch
that has "int32 column1" and "string column2":

ArrowSchema {
  .format = "+siu",
  .metadata = {
    "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
  },
  .children = {
    ArrowSchema {
      .name = "column1",
      .format = "i",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
    ArrowSchema {
      .name = "column2",
      .format = "u",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
  },
}

The metadata value (the ArrowArray* part) of '"ARROW:statistics"
=> ArrowArray*' is a base-10 string of the address of the
ArrowArray, because metadata values can only be strings. You
can't release the statistics ArrowArray*. (Its release is a
no-op function.) It follows
https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
semantics. (The base ArrowSchema owns the statistics
ArrowArray*.)
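
To make that mechanism concrete, here is a minimal consumer-side sketch in
Python. It assumes pyarrow and its experimental `_import_from_c` helper;
`stats_type` stands for whatever pyarrow DataType matches the statistics
schema below, and the function name is made up for illustration only:

import pyarrow as pa

def read_statistics(field: pa.Field, stats_type: pa.DataType):
    # The proposed metadata value is the ArrowArray address as a base-10
    # string; per the proposal its release callback is a no-op because the
    # base ArrowSchema keeps ownership.
    raw = (field.metadata or {}).get(b"ARROW:statistics")
    if raw is None:
        return None
    address = int(raw.decode())
    # _import_from_c is experimental/private in pyarrow; shown here only to
    # illustrate the pointer-in-metadata idea being discussed.
    return pa.Array._import_from_c(address, stats_type)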


The ArrowArray* for statistics uses the following schema:

| Field Name | Field Type   | Comments |
||--|  |
| key| string not null  | (1)  |
| value  | `VALUE_SCHEMA` not null  |  |
| is_approximate | bool not null| (2)  |

1. We'll provide pre-defined keys such as "max", "min",
"byte_width" and "distinct_count" but users can also use
application specific keys.

2. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type                                     | Comments |
|------------|------------------------------------------------|----------|
| int64      | int64                                          |          |
| uint64     | uint64                                         |          |
| float64    | float64                                        |          |
| value      | The same type as the ArrowSchema it belongs to | (3)      |

3. If the ArrowSchema's type is string, this type is also string.

TODO: Is "value" good name? If we refer it from the
top-level statistics schema, we need to use
"value.value". It's a bit strange...


What do you think about this proposal? Could you share your
comments?


Thanks,


Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-05-14 Thread Antoine Pitrou



I think these flags should be advisory and consumers should be free to 
ignore them. However, some consumers apparently would benefit from them 
to more faithfully represent the producer's intention.


For example, in Arrow C++, we could perhaps have an ImportDatum function 
whose actual return type would depend on which flags are set (though I'm 
not sure what the default behavior, in the absence of any flags, should be).


Regards

Antoine.


Le 25/04/2024 à 06:54, Weston Pace a écrit :

What should be done if a system doesn't have a record batch concept?  For
example, if I remember correctly, Velox works this way and only has a "row
vector" (struct array) but no equivalent to record batch.  Should these
systems reject a record batch or should they just accept it as a struct
array?

What about ArrowArrayStream?  Must it always return "record batch" or can
it return single columns? Should the stream be homogeneous or is it valid if
some arrays are single columns and some are record batches?

If a scalar comes across on its own then what is the length of the scalar?
I think this might be a reason to prefer something like REE for scalars
since the length would be encoded along with the scalar.
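
For illustration, a sketch (pyarrow, with made-up values) of the
run-end-encoded representation mentioned above, where the broadcast length
travels with the scalar:

import pyarrow as pa

# A "scalar" 7 broadcast to length 5, encoded as a run-end-encoded array so
# that the length is carried by the encoding itself.
scalar_as_ree = pa.RunEndEncodedArray.from_arrays(
    pa.array([5], type=pa.int32()),   # run ends
    pa.array([7], type=pa.int64()),   # values
)
print(len(scalar_as_ree))   # 5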


On Wed, Apr 24, 2024 at 6:21 PM Keith Kraus 
wrote:


I believe several array implementations (e.g., numpy, R) are able to

broadcast/recycle a length-1 array. Run-end-encoding is also an option that
would make that broadcast explicit without expanding the scalar.

Some libraries behave this way, e.g. Polars, but others like Pandas and
cuDF only broadcast up dimensions. That is, scalars can be broadcast across
columns or dataframes, and columns can be broadcast across dataframes, but
length-1 columns do not broadcast across columns: trying to add, say, a
length-5 and a length-1 column isn't valid, but adding a length-5 column and
a scalar is. Additionally, this differentiates between operations that are
guaranteed to return a scalar, i.e. something like a reduction of `sum()`,
versus operations that can return a length-1 column depending on the data,
i.e. `unique()`.


For UDFs: UDFs are a system-specific interface. Presumably, that

interface can encode whether an Arrow array is meant to represent a column
or scalar (or record batch or ...). Again, because Arrow doesn't define
scalars (for now...) or UDFs, the UDF interface needs to layer its own
semantics on top of Arrow.


In other words, I don't think the C Data Interface was meant to be

something where you can expect to _only_ pass the ArrowDeviceArray around
and have it encode all the semantics for a particular system, right? The
UDF example is something where the engine would pass an ArrowDeviceArray
plus additional context.

There's a growing trend of execution engines supporting Arrow-in/Arrow-out
UDFs: DuckDB, PySpark, DataFusion, etc. Many of them have different options
for passing in RecordBatches vs Arrays, and they currently rely on the Arrow
library containers in order to differentiate them.

Additionally, libcudf has some generic functions that currently use Arrow
C++ containers
(https://docs.rapids.ai/api/cudf/stable/libcudf_docs/api_docs/interop_arrow/)
for differentiating between RecordBatches, Arrays, and Scalars, which could
be moved to using the C Data Interfaces; Polars has something similar
(https://docs.pola.rs/py-polars/html/reference/api/polars.from_arrow.html)
that currently uses PyArrow containers, and you could imagine other
DataFrame libraries doing the same.

Ultimately, there's a desire to be able to move Arrow data between
different libraries, applications, frameworks, etc., and given that Arrow
implementations like C++, Rust, and Go have containers for RecordBatches,
Arrays, and Scalars, things have been built around and
differentiated by these concepts. Maybe trying to differentiate this
information at runtime isn't the correct path, but I believe there's a
demonstrated desire for being able to differentiate things in a library
agnostic way.

On Tue, Apr 23, 2024 at 8:37 PM David Li  wrote:


For scalars: Arrow doesn't define scalars. They're an implementation
concept. (They may be a *useful* one, but if we want to define them more
generally, that's a separate discussion.)

For UDFs: UDFs are a system-specific interface. Presumably, that

interface

can encode whether an Arrow array is meant to represent a column or

scalar

(or record batch or ...). Again, because Arrow doesn't define scalars

(for

now...) or UDFs, the UDF interface needs to layer its own semantics on

top

of Arrow.

In other words, I don't think the C Data Interface was meant to be
something where you can expect to _only_ pass the ArrowDeviceArray around
and have it encode all the semantics for a particular system, right? The
UDF example is something where the engine would pass an ArrowDeviceArray
plus additional context.


since we can't determine which a given ArrowArray is on its own. In the
libcudf situation, it came up with what happens if you pass a


Re: [VOTE][Format] UUID canonical extension type

2024-04-30 Thread Antoine Pitrou

+1 (binding)


Le 19/04/2024 à 22:22, Rok Mihevc a écrit :

Hi all,

Following initial requests [1][2] and recent tangential ML discussion [3] I
would like to propose a vote to add language for UUID canonical extension
type to CanonicalExtensions.rst as in PR [4] and written below.
A draft C++ and Python implementation PR can be seen here [5].

[1] https://lists.apache.org/thread/k2zvgoq62dyqmw3mj2t6ozfzhzkjkc4j
[2] https://github.com/apache/arrow/issues/15058
[3] https://lists.apache.org/thread/8d5ldl5cb7mms21rd15lhpfrv4j9no4n
[4] https://github.com/apache/arrow/pull/41299 <- proposed change
[5] https://github.com/apache/arrow/pull/37298


The vote will be open for at least 72 hours.

[ ] +1 Accept this proposal
[ ] +0
[ ] -1 Do not accept this proposal because...


UUID


* Extension name: `arrow.uuid`.

* The storage type of the extension is ``FixedSizeBinary`` with a length of
16 bytes.

.. note::
A specific UUID version is not required or guaranteed. This extension
represents
UUIDs as FixedSizeBinary(16) and does not interpret the bytes in any way.
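
For reference, a minimal sketch (not part of the voted text) of how such an
extension type could be defined and registered with pyarrow; the class name
is made up for illustration:

import pyarrow as pa

class UuidType(pa.ExtensionType):
    # The proposed arrow.uuid type: FixedSizeBinary(16) storage, no parameters.
    def __init__(self):
        super().__init__(pa.binary(16), "arrow.uuid")

    def __arrow_ext_serialize__(self):
        return b""   # no parameters to serialize

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

pa.register_extension_type(UuidType())
uuid_array = UuidType().wrap_array(pa.array([b"\x00" * 16], pa.binary(16)))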



Rok



Re: [VOTE][Format] JSON canonical extension type

2024-04-30 Thread Antoine Pitrou
+1 (binding) for the current proposal, i.e. with the RFC 8259 
requirement and the 3 current String types allowed.


Regards

Antoine.


Le 30/04/2024 à 19:26, Rok Mihevc a écrit :

Hi all, thanks for the votes and comments so far.
I've amended [1] the proposed language with the RFC-8259 requirement as it
seems to be almost unanimously requested. New language is below.
To Micah's comment regarding rejecting Binary arrays [2] - please discuss
in the PR.

Let's leave the vote open until after the May holiday.

Rok

[1]
https://github.com/apache/arrow/pull/41257/commits/594945010e3b7d393b411aad971743ffcdbdbc8e
[2] https://github.com/apache/arrow/pull/41257#discussion_r1583441040


JSON


* Extension name: `arrow.json`.

* The storage type of this extension is ``StringArray`` or
   ``LargeStringArray`` or ``StringViewArray``.
   *Only UTF-8 encoded JSON as specified in `rfc8259`_ is supported.*

* Extension type parameters:

   This type does not have any parameters.

* Description of the serialization:

   Metadata is either an empty string or a JSON string with an empty object.
   In the future, additional fields may be added, but they are not required
   to interpret the array.



Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou



I think this should be:
- a canonical extension type
- with a parameter unambiguously identifying the type for applications 
supporting it (such as "org.postgres.pg_lsn")
- with storage type left for each implementation to decide, but with a 
recommendation to use either 1) binary, 2) fixed-size-binary or 3) null.
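
As a sketch only (the class name, extension name, and serialization format
below are made up, not an agreed spec), such a parameterized extension type
could look roughly like this in pyarrow, here with binary storage:

import pyarrow as pa

class UnsupportedType(pa.ExtensionType):
    # The single parameter unambiguously identifies the original type,
    # e.g. "org.postgres.pg_lsn".
    def __init__(self, type_name: str):
        self.type_name = type_name
        super().__init__(pa.binary(), "arrow.unsupported")  # hypothetical name

    def __arrow_ext_serialize__(self):
        return self.type_name.encode("utf-8")

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls(serialized.decode("utf-8"))

pa.register_extension_type(UnsupportedType("org.postgres.pg_lsn"))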


Regards

Antoine.


Le 17/04/2024 à 16:25, Weston Pace a écrit :

people generally find use in Arrow schemas independently of concrete data.


This makes sense.  I think we do want to encourage use of Arrow as a "type
system" even if there is no data involved.  And, given that we cannot
easily change a field's data type property to "optional" it makes sense to
use a dedicated type and I so I would be in favor of such a proposal (we
may eventually add an "unknown type" concept in Substrait as well, it's
come up several times, and so we could use this in that context).

I think that I would still prefer a canonical extension type (with storage
type null) over a new dedicated type.

On Wed, Apr 17, 2024 at 5:39 AM Antoine Pitrou  wrote:



Ah! Well, I think this could be an interesting proposal, but someone
should put a more formal proposal, perhaps as a draft PR.

Regards

Antoine.


Le 17/04/2024 à 11:57, David Li a écrit :

For an unsupported/other extension type.

On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote:

What is "this proposal"?


Le 17/04/2024 à 10:38, David Li a écrit :

Should I take it that this proposal is dead in the water? While we

could define our own Unknown/Other type for say the ADBC PostgreSQL driver
it might be useful to have a singular type for consumers to latch on to.


On Fri, Apr 12, 2024, at 07:32, David Li wrote:

I think an "Other" extension type is slightly different than an
arbitrary extension type, though: the latter may be understood
downstream but the former represents a point at which a component
explicitly declares it does not know how to handle a field. In this
example, the PostgreSQL ADBC driver might be able to provide a
representation regardless, but a different driver (or say, the JDBC
adapter, which cannot necessarily get a bytestring for an arbitrary
JDBC type) may want an Other type to signal that it would fail if

asked

to provide particular columns.

On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:

Depending where your Arrow-encoded data is used, either extension
types or generic field metadata are options. We have this problem in
the ADBC Postgres driver, where we can convert *most* Postgres types
to an Arrow type but there are some others where we can't or don't
know or don't implement a conversion. Currently for these we return
opaque binary (the Postgres COPY representation of the value) but put
field metadata so that a consumer can implement a workaround for an
unsupported type. It would be arguably better to have implemented

this

as an extension type; however, field metadata felt like less of a
commitment when I first worked on this.

Cheers,

-dewey

On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
 wrote:


I was using UUID as an example. It looks like extension types

covers my original request.


From: Felipe Oliveira Carvalho 
Sent: Thursday, April 11, 2024 7:15 AM
To: dev@arrow.apache.org 
Subject: Re: Unsupported/Other Type

The OP used UUID as an example. Would that be enough or the request

is for

a flexible mechanism that allows the creation of one-off nominal

types for

very specific use-cases?

—
Felipe

On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou 

wrote:




Yes, JSON and UUID are obvious candidates for new canonical

extension

types. XML also comes to mind, but I'm not sure there's much of a

use

case for it.

Regards

Antoine.


Le 10/04/2024 à 22:55, Wes McKinney a écrit :

In the past we have discussed adding a canonical type for UUID

and JSON.

I

still think this is a good idea and could improve ergonomics in

downstream

language bindings (e.g. by exposing JSON querying function or

automatically

boxing UUIDs in built-in UUID types, like the Python uuid

library). Has

anyone done any work on this to anyone's knowledge?

On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <

emkornfi...@gmail.com>

wrote:


Hi Norman,
Arrow has a concept of extension types [1] along with the

possibility of

proposing new canonical extension types [2].  This seems to

cover the

use-cases you mention but I might be misunderstanding?

Thanks,
Micah

[1]





https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types

[2]

https://arrow.apache.org/docs/format/CanonicalExtensions.html


On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
 wrote:


Problem Description

Currently Arrow schemas can only contain columns of types

supported by

Arrow. In some cases an Arrow schema maps to an external

schema. This

can

result in the Arrow schema not being able to support all the

columns

from

the external s

Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou



Ah! Well, I think this could be an interesting proposal, but someone 
should put a more formal proposal, perhaps as a draft PR.


Regards

Antoine.


Le 17/04/2024 à 11:57, David Li a écrit :

For an unsupported/other extension type.

On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote:

What is "this proposal"?


Le 17/04/2024 à 10:38, David Li a écrit :

Should I take it that this proposal is dead in the water? While we could define 
our own Unknown/Other type for say the ADBC PostgreSQL driver it might be 
useful to have a singular type for consumers to latch on to.

On Fri, Apr 12, 2024, at 07:32, David Li wrote:

I think an "Other" extension type is slightly different than an
arbitrary extension type, though: the latter may be understood
downstream but the former represents a point at which a component
explicitly declares it does not know how to handle a field. In this
example, the PostgreSQL ADBC driver might be able to provide a
representation regardless, but a different driver (or say, the JDBC
adapter, which cannot necessarily get a bytestring for an arbitrary
JDBC type) may want an Other type to signal that it would fail if asked
to provide particular columns.

On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:

Depending where your Arrow-encoded data is used, either extension
types or generic field metadata are options. We have this problem in
the ADBC Postgres driver, where we can convert *most* Postgres types
to an Arrow type but there are some others where we can't or don't
know or don't implement a conversion. Currently for these we return
opaque binary (the Postgres COPY representation of the value) but put
field metadata so that a consumer can implement a workaround for an
unsupported type. It would be arguably better to have implemented this
as an extension type; however, field metadata felt like less of a
commitment when I first worked on this.

Cheers,

-dewey

On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
 wrote:


I was using UUID as an example. It looks like extension types covers my 
original request.

From: Felipe Oliveira Carvalho 
Sent: Thursday, April 11, 2024 7:15 AM
To: dev@arrow.apache.org 
Subject: Re: Unsupported/Other Type

The OP used UUID as an example. Would that be enough or the request is for
a flexible mechanism that allows the creation of one-off nominal types for
very specific use-cases?

—
Felipe

On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou  wrote:



Yes, JSON and UUID are obvious candidates for new canonical extension
types. XML also comes to mind, but I'm not sure there's much of a use
case for it.

Regards

Antoine.


Le 10/04/2024 à 22:55, Wes McKinney a écrit :

In the past we have discussed adding a canonical type for UUID and JSON.

I

still think this is a good idea and could improve ergonomics in

downstream

language bindings (e.g. by exposing JSON querying function or

automatically

boxing UUIDs in built-in UUID types, like the Python uuid library). Has
anyone done any work on this to anyone's knowledge?

On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield 
wrote:


Hi Norman,
Arrow has a concept of extension types [1] along with the possibility of
proposing new canonical extension types [2].  This seems to cover the
use-cases you mention but I might be misunderstanding?

Thanks,
Micah

[1]



https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types

[2] https://arrow.apache.org/docs/format/CanonicalExtensions.html

On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
 wrote:


Problem Description

Currently Arrow schemas can only contain columns of types supported by
Arrow. In some cases an Arrow schema maps to an external schema. This

can

result in the Arrow schema not being able to support all the columns

from

the external schema.

Consider an external system that contains a column of type UUID. To

model

the schema in Arrow, the user has two choices:

 1.  Do not include the UUID column in the Arrow schema

 2.  Map the column to an existing Arrow type. This will not include

the

original type information. A UUID can be mapped to a FixedSizeBinary,

but

consumers of the Arrow schema will be unable to distinguish a
FixedSizeBinary field from a UUID field.

Possible Solution

 *   Add a new type code that represents unsupported types

 *   Values for the new type are represented as variable length

binary


Some drivers can expose data even when they don’t understand the data
type. For example, the PostgreSQL driver will return the raw bytes for
fields of an unknown type. Using an explicit type lets clients know

that

they should convert values if they were able to determine the actual

data

type.

Questions

 *   What is the impact on existing clients when they encounter

fields

of

the unsupported type?

 *   Is it safe to assume that all unsupported values can safely be
converted to a variable length binary?

 *   How can we preser

Re: AW: Personal feedback on your last release on Apache Arrow ADBC 0.11.0

2024-04-17 Thread Antoine Pitrou



Out of curiosity, did you notice this by chance or do you have some kind 
of script that processes ASF mailing-list archives for possible voting 
irregularities?


Regards

Antoine.


Le 17/04/2024 à 10:44, Christofer Dutz a écrit :

When looking at whimsy, I can’t see any person named Sutou Kouhei listed as 
member of the Arrow PMC.

Cut that … I was looking for Sutou Kouhei, but it’s Kouhei Sutou … yeah … ok … 
then please ignore my mumbling ;-)

And yeah … the result now also moved to the same page … guess it was sent out a 
while after the Announce … guess that’s why I missed it.

Thanks for following up …

Chris

Von: David Li 
Datum: Mittwoch, 17. April 2024 um 10:36
An: Christofer Dutz , dev@arrow.apache.org 

Betreff: Re: Personal feedback on your last release on Apache Arrow ADBC 0.11.0
Hi Christofer,

Sutou Kouhei is part of the PMC.

Additionally, there is a result email: 
https://lists.apache.org/thread/gb5k69pd3k6lnbzw978fm7ppx1p9cx15

On Wed, Apr 17, 2024, at 16:52, Christofer Dutz wrote:

Hi all,

while reviewing your projects activity in the last quarter as part of
my preparation for today's borads meeting I came across your last vote
on Apache Arrow ADBC 0.11.0 RC0

Technically I count only 2 binding +1 votes:
- Matthew Topol
- Dewey Dunnington

All others are not part of the PMC.

I assume the Release Manager David implicitly counted himself as +1;
however, the concept of an implicit vote does not exist at Apache. If you
want to save sending an additional email, add something like "this
also counts as my +1 vote" to your email, or - even better - send an
explicit vote email.

Also, it would be good to have a RESULT email containing the result of a vote.

So right now we would need a third binding vote as soon as possible
(Possibly also for other votes, where we had the release manager
provide the missing third vote).

Chris

PS: Please keep me in CC as I'm not subscribed here.




Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou



What is "this proposal"?


Le 17/04/2024 à 10:38, David Li a écrit :

Should I take it that this proposal is dead in the water? While we could define 
our own Unknown/Other type for say the ADBC PostgreSQL driver it might be 
useful to have a singular type for consumers to latch on to.

On Fri, Apr 12, 2024, at 07:32, David Li wrote:

I think an "Other" extension type is slightly different than an
arbitrary extension type, though: the latter may be understood
downstream but the former represents a point at which a component
explicitly declares it does not know how to handle a field. In this
example, the PostgreSQL ADBC driver might be able to provide a
representation regardless, but a different driver (or say, the JDBC
adapter, which cannot necessarily get a bytestring for an arbitrary
JDBC type) may want an Other type to signal that it would fail if asked
to provide particular columns.

On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:

Depending where your Arrow-encoded data is used, either extension
types or generic field metadata are options. We have this problem in
the ADBC Postgres driver, where we can convert *most* Postgres types
to an Arrow type but there are some others where we can't or don't
know or don't implement a conversion. Currently for these we return
opaque binary (the Postgres COPY representation of the value) but put
field metadata so that a consumer can implement a workaround for an
unsupported type. It would be arguably better to have implemented this
as an extension type; however, field metadata felt like less of a
commitment when I first worked on this.

Cheers,

-dewey

On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
 wrote:


I was using UUID as an example. It looks like extension types covers my 
original request.

From: Felipe Oliveira Carvalho 
Sent: Thursday, April 11, 2024 7:15 AM
To: dev@arrow.apache.org 
Subject: Re: Unsupported/Other Type

The OP used UUID as an example. Would that be enough or the request is for
a flexible mechanism that allows the creation of one-off nominal types for
very specific use-cases?

—
Felipe

On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou  wrote:



Yes, JSON and UUID are obvious candidates for new canonical extension
types. XML also comes to mind, but I'm not sure there's much of a use
case for it.

Regards

Antoine.


Le 10/04/2024 à 22:55, Wes McKinney a écrit :

In the past we have discussed adding a canonical type for UUID and JSON.

I

still think this is a good idea and could improve ergonomics in

downstream

language bindings (e.g. by exposing JSON querying function or

automatically

boxing UUIDs in built-in UUID types, like the Python uuid library). Has
anyone done any work on this to anyone's knowledge?

On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield 
wrote:


Hi Norman,
Arrow has a concept of extension types [1] along with the possibility of
proposing new canonical extension types [2].  This seems to cover the
use-cases you mention but I might be misunderstanding?

Thanks,
Micah

[1]



https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types

[2] https://arrow.apache.org/docs/format/CanonicalExtensions.html

On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
 wrote:


Problem Description

Currently Arrow schemas can only contain columns of types supported by
Arrow. In some cases an Arrow schema maps to an external schema. This

can

result in the Arrow schema not being able to support all the columns

from

the external schema.

Consider an external system that contains a column of type UUID. To

model

the schema in Arrow, the user has two choices:

1.  Do not include the UUID column in the Arrow schema

2.  Map the column to an existing Arrow type. This will not include

the

original type information. A UUID can be mapped to a FixedSizeBinary,

but

consumers of the Arrow schema will be unable to distinguish a
FixedSizeBinary field from a UUID field.

Possible Solution

*   Add a new type code that represents unsupported types

*   Values for the new type are represented as variable length

binary


Some drivers can expose data even when they don’t understand the data
type. For example, the PostgreSQL driver will return the raw bytes for
fields of an unknown type. Using an explicit type lets clients know

that

they should convert values if they were able to determine the actual

data

type.

Questions

*   What is the impact on existing clients when they encounter

fields

of

the unsupported type?

*   Is it safe to assume that all unsupported values can safely be
converted to a variable length binary?

*   How can we preserve information about the original type?











Re: Unsupported/Other Type

2024-04-11 Thread Antoine Pitrou



One-off nominal types can already be created as application-specific 
extension types.
The specific thing about UUID, JSON and a couple other types is that 
they exist in many systems already, so a standardized way of conveying 
them with Arrow would enhance interoperation between all these systems.


Regards

Antoine.


Le 11/04/2024 à 16:15, Felipe Oliveira Carvalho a écrit :

The OP used UUID as an example. Would that be enough or the request is for
a flexible mechanism that allows the creation of one-off nominal types for
very specific use-cases?

—
Felipe

On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou  wrote:



Yes, JSON and UUID are obvious candidates for new canonical extension
types. XML also comes to mind, but I'm not sure there's much of a use
case for it.

Regards

Antoine.


Le 10/04/2024 à 22:55, Wes McKinney a écrit :

In the past we have discussed adding a canonical type for UUID and JSON.

I

still think this is a good idea and could improve ergonomics in

downstream

language bindings (e.g. by exposing JSON querying function or

automatically

boxing UUIDs in built-in UUID types, like the Python uuid library). Has
anyone done any work on this to anyone's knowledge?

On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield 
wrote:


Hi Norman,
Arrow has a concept of extension types [1] along with the possibility of
proposing new canonical extension types [2].  This seems to cover the
use-cases you mention but I might be misunderstanding?

Thanks,
Micah

[1]



https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types

[2] https://arrow.apache.org/docs/format/CanonicalExtensions.html

On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
 wrote:


Problem Description

Currently Arrow schemas can only contain columns of types supported by
Arrow. In some cases an Arrow schema maps to an external schema. This

can

result in the Arrow schema not being able to support all the columns

from

the external schema.

Consider an external system that contains a column of type UUID. To

model

the schema in Arrow, the user has two choices:

1.  Do not include the UUID column in the Arrow schema

2.  Map the column to an existing Arrow type. This will not include

the

original type information. A UUID can be mapped to a FixedSizeBinary,

but

consumers of the Arrow schema will be unable to distinguish a
FixedSizeBinary field from a UUID field.

Possible Solution

*   Add a new type code that represents unsupported types

*   Values for the new type are represented as variable length

binary


Some drivers can expose data even when they don’t understand the data
type. For example, the PostgreSQL driver will return the raw bytes for
fields of an unknown type. Using an explicit type lets clients know

that

they should convert values if they were able to determine the actual

data

type.

Questions

*   What is the impact on existing clients when they encounter

fields

of

the unsupported type?

*   Is it safe to assume that all unsupported values can safely be
converted to a variable length binary?

*   How can we preserve information about the original type?












Re: Unsupported/Other Type

2024-04-11 Thread Antoine Pitrou



Yes, JSON and UUID are obvious candidates for new canonical extension 
types. XML also comes to mind, but I'm not sure there's much of a use 
case for it.


Regards

Antoine.


Le 10/04/2024 à 22:55, Wes McKinney a écrit :

In the past we have discussed adding a canonical type for UUID and JSON. I
still think this is a good idea and could improve ergonomics in downstream
language bindings (e.g. by exposing JSON querying function or automatically
boxing UUIDs in built-in UUID types, like the Python uuid library). Has
anyone done any work on this to anyone's knowledge?

On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield 
wrote:


Hi Norman,
Arrow has a concept of extension types [1] along with the possibility of
proposing new canonical extension types [2].  This seems to cover the
use-cases you mention but I might be misunderstanding?

Thanks,
Micah

[1]

https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
[2] https://arrow.apache.org/docs/format/CanonicalExtensions.html

On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
 wrote:


Problem Description

Currently Arrow schemas can only contain columns of types supported by
Arrow. In some cases an Arrow schema maps to an external schema. This can
result in the Arrow schema not being able to support all the columns from
the external schema.

Consider an external system that contains a column of type UUID. To model
the schema in Arrow, the user has two choices:

   1.  Do not include the UUID column in the Arrow schema

   2.  Map the column to an existing Arrow type. This will not include the
original type information. A UUID can be mapped to a FixedSizeBinary, but
consumers of the Arrow schema will be unable to distinguish a
FixedSizeBinary field from a UUID field.

Possible Solution

   *   Add a new type code that represents unsupported types

   *   Values for the new type are represented as variable length binary

Some drivers can expose data even when they don’t understand the data
type. For example, the PostgreSQL driver will return the raw bytes for
fields of an unknown type. Using an explicit type lets clients know that
they should convert values if they were able to determine the actual data
type.

Questions

   *   What is the impact on existing clients when they encounter fields

of

the unsupported type?

   *   Is it safe to assume that all unsupported values can safely be
converted to a variable length binary?

   *   How can we preserve information about the original type?








Re: [RFC] Enabling data frames in disaggregated shared memory

2024-04-10 Thread Antoine Pitrou



Hello John,

Arrow IPC files can be backed quite naturally by shared memory, simply 
by memory-mapping them for reading. So if you have some pieces of shared 
memory containing Arrow IPC files, and they are reachable using a 
filesystem mount point, you're pretty much done.


You can see an example of memory-mapped read in Python at the end of 
this documentation section:

https://arrow.apache.org/docs/python/ipc.html#efficiently-writing-and-reading-arrow-data

Note: Arrow IPC files are just a way of storing Arrow columnar data on 
"disk", with enough additional metadata to interpret the data (such as 
its schema).
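
For instance, a minimal sketch of such a memory-mapped read with pyarrow
(the file path is illustrative):

import pyarrow as pa
import pyarrow.ipc as ipc

# Memory-map an Arrow IPC file (e.g. one living on a famfs/shared-memory
# mount) and read it back without copying the buffers.
with pa.memory_map("/mnt/famfs/table.arrow", "r") as source:
    table = ipc.open_file(source).read_all()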


Regards

Antoine.


Le 10/04/2024 à 04:13, John Groves a écrit :

This is a request for comments from the Arrow developer community.

I’m reaching out to start making the Arrow community aware of work that my
team at Micron has recently open-sourced. Because of the Compute Express
Link (CXL) standard, sharable disaggregated memory is coming – this is
memory shared by multiple nodes in a cluster.  Arrow and the other
zero-copy formats are a great fit for shared memory if a natural enough
access method emerges.

That’s where famfs comes in.  (Famfs stands for Fabric-Attached Memory File
System.) Famfs supports formatting shared memory as a file system that can
be simultaneously mounted from multiple hosts.

Putting zero-copy data frames in famfs files allows jobs across a cluster
to memory map data frames from a single copy in shared memory. This has the
potential to deduplicate memory while reducing or avoiding sharding and
shuffling overheads.

Famfs files can be memory-mapped and used without awareness that the files
are “special” (though creating famfs files does require special steps).
Memory mapping a famfs file provides direct access to the memory – with no
copying through the page cache.

Famfs was published in February as a Linux kernel patch and a user space
CLI and library; all are available on GitHub. The
kernel patch set has been received seriously; if legitimate use cases are
demonstrated, we expect it will make its way into mainline Linux – and we
intend to step up and maintain it.

Famfs is already usable with shared disaggregated memory (though this
memory is not commercially available yet). Conventional memory can be
shared among virtual machines today, to build (admittedly scaled down) POCs.

I am looking for the following feedback:

- Any questions are welcome, on or off-list.
- Please tell us what sorts of work flows you might try with famfs
shared memory if you had it – we are looking for ways to demonstrate use
cases.
- Help us get the word out. Are there people, groups, forums or
conferences where we should introduce this capability?
- If you are interested in testing famfs, please do – and let me know
how we can help.

Micron’s interest is in enabling an ecosystem where shared memory is
practically usable. If famfs is successful, other access methods will
surely emerge. Famfs is our attempt to enable shared memory via an
existing, ubiquitous interface – making it easy to use without having to
adopt new abstractions in advance.

Thanks for reading,
John Groves
Micron



Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-09 Thread Antoine Pitrou



It seems that perhaps this discussion should be rebooted for each 
individual component, one at a time?


Let's start with something simple and obvious, with some frequent 
contribution activity, such as perhaps Go?




Le 09/04/2024 à 14:27, Joris Van den Bossche a écrit :

I am also in favor of this idea in general and in the principle, but
(somewhat repeating others) I think we should be aware that this will
create _more_ work overall for releasing (refactoring release scripts
(at least initially), deciding which version to use for which
component, etc), and not less, given that easing the burden of the
release managers was mentioned as a goal.
So if we pursue this, it should be for the other benefits that we think this has:
1) because separate versions would be beneficial for users? (have a
clearer messaging in the version number (eg no major new version if
there were hardly any changes in a certain component, or use a
versioning scheme more common in a certain language's ecosystem, ..?)
2) because it would actually allow separate releases, even though when
initially always releasing in batch (eg being able to just do a bug
fix for go, without having to also add a tag for all others)
3) because it would make the release process more manageable / easier
to delegate? (and thus potentially easing the burden for an
_individual_ release manager, although requiring more work overall)
4) .. other things I am missing?


We could simply release C++, R, Python and C/GLib together.
...

I think that versioning will require additional thinking for libraries like 
PyArrow

I think we should maybe focus on a few more obvious cases. [i.e. not C++ and 
Python]


Yes, agreed to not focus on those. At least for PyArrow, speaking as
one of its maintainers, I am personally (at this moment) not really
interested in dealing with the complexity of allowing a decoupled
Arrow C++ and PyArrow build.

Related to the docs:


There is a dedicated documentation page for this... though the
versioning of the docs themselves would become ill-defined:
https://arrow.apache.org/docs/status.html

...

I think it would be best to hold off on Java also, in part because
of how the Java docs are integrated with the C++ and Python docs and
controlled by the version selector menu.


We should indeed consider how to handle the current documentation
site. Previously, we actually did some work to split the sphinx docs
(used for the format, dev docs, and for the Python/C++/(part of the)
Java docs) into multiple sphinx projects that could be built
separately (https://github.com/apache/arrow/issues/30627,
https://github.com/apache/arrow/pull/11980), but we abandoned that
work last year because not seeming worthwhile. But we could certainly
revive that idea, for example to at least split the format docs (and
let that have its own versioning based on the Format Version
(currently 1.4)? or just only host a single, latest version?)

Joris


Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-07 Thread Antoine Pitrou



Le 28/03/2024 à 21:42, Jacob Wujciak a écrit :


For Arrow C++ bindings like Arrow R and PyArrow having distinct versions
would require additional work to both enable the use of different versions
and ensure version compatibility is monitored and potentially updated if
needed.


We could simply release C++, R, Python and C/GLib together.


A more meta question is about the messaging that different versioning
schemes carry, as it might no longer be obvious on first glance which
versions are compatible or have the newest features.


There is a dedicated documentation page for this... though the 
versioning of the docs themselves would become ill-defined:

https://arrow.apache.org/docs/status.html

Regards

Antoine.


Re: [Format][Union] polymorphic vectors vs ADT style vectors

2024-04-02 Thread Antoine Pitrou



Thanks. The Arrow spec does support multiple union members with the same 
type, but not all implementations do. The C++ implementation should 
support it, though to my surprise we do not seem to have any tests for it.


If the Java implementation doesn't, then you can probably open an issue 
for it (and even submit a PR if you would like to tackle it).


I've also opened https://github.com/apache/arrow/issues/40947 to create 
integration tests for this.
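
For what it's worth, here is a small sketch in pyarrow (values made up)
showing a dense union with two children of the same storage type but
distinct type ids, along the lines of the `put`/`delete`/`erase` example
quoted below:

import pyarrow as pa

children = [
    pa.array([10], type=pa.int64()),     # put (payload simplified to int64)
    pa.array([1, 3], type=pa.int64()),   # delete
    pa.array([2], type=pa.int64()),      # erase: same storage type as delete
]
type_ids = pa.array([0, 1, 2, 1], type=pa.int8())   # child chosen per slot
offsets = pa.array([0, 0, 0, 1], type=pa.int32())   # offset into that child
op = pa.UnionArray.from_dense(type_ids, offsets, children,
                              field_names=["put", "delete", "erase"])
print(op.type)   # dense union with two int64 children under different type ids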


Regards

Antoine.


Le 02/04/2024 à 13:19, Finn Völkel a écrit :

Can you explain what ADT means ?


Sorry about that. ADT stands for Abstract Data Type. What do I mean by an
ADT style vector?

Let's take an example from the project I am on. We have an `op` union
vector with three child vectors `put`, `delete`, `erase`. `delete` and
`erase` have the same type but represent different things.

On Tue, 2 Apr 2024 at 13:16, Steve Kim  wrote:


Thank you for asking this question. I have the same question.

I noted a similar problem in the c++/python implementation:
https://github.com/apache/arrow/issues/19157#issuecomment-1528037394

On Tue, Apr 2, 2024, 04:30 Finn Völkel  wrote:


Hi,

my question primarily concerns the union layout described at
https://arrow.apache.org/docs/format/Columnar.html#union-layout

There are two ways to use unions:

- polymorphic vectors (world 1)
- ADT style vectors (world 2)

In world 1 you have a vector that stores different types. In the ADT

world

you could have multiple child vectors with the same type but different

type

ids in the union type vector. The difference is apparent if you want to

use

two BigIntVectors as children which doesn't exist in world 1. World 1 is

a

subset of world 2.

The spec (to my understanding) doesn’t explicitly forbid world 2, but the
implementation we have been using (Java) has been making the assumption

of

being in world 1 (a union only having ONE child of each type). We

sometimes

use union in the ADT style which has led to problems down the road.

Could someone clarify what the specification allows and what it doesn’t
allow? Could we tighten the specification after that clarification?

Best, Finn







Re: [Format][Union] polymorphic vectors vs ADT style vectors

2024-04-02 Thread Antoine Pitrou



Can you explain what ADT means ?



Le 02/04/2024 à 11:31, Finn Völkel a écrit :

Hi,

my question primarily concerns the union layout described at
https://arrow.apache.org/docs/format/Columnar.html#union-layout

There are two ways to use unions:

- polymorphic vectors (world 1)
- ADT style vectors (world 2)

In world 1 you have a vector that stores different types. In the ADT world
you could have multiple child vectors with the same type but different type
ids in the union type vector. The difference is apparent if you want to use
two BigIntVectors as children which doesn't exist in world 1. World 1 is a
subset of world 2.

The spec (to my understanding) doesn’t explicitly forbid world 2, but the
implementation we have been using (Java) has been making the assumption of
being in world 1 (a union only having ONE child of each type). We sometimes
use union in the ADT style which has led to problems down the road.

Could someone clarify what the specification allows and what it doesn’t
allow? Could we tighten the specification after that clarification?

Best, Finn



Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-25 Thread Antoine Pitrou



Regardless of whether they have different compression ratios, it doesn't 
explain why you would want a different compression *algorithm* altogether.


The choice of a compression algorithm should basically be driven by two 
concerns: the acceptable space/time tradeoff (do you want to minimize 
disk footprint and IO at the cost of more CPU processing time?), and 
compatibility with other Parquet implementations. None of those two 
concerns should be row group-dependent.


Regards

Antoine.


Le 25/03/2024 à 16:30, Gang Wu a écrit :

Sometimes rows from different row groups may have different compression
ratios when data distribution varies a lot among them. It seems to me that
a harder problem is how you would figure out that pattern before the data
is written and compressed. If that is not a problem in your case, it would
be
much easier just to make each parquet file contain only one row group and
apply different compression algorithms on a file basis.

Best,
Gang

On Sun, Mar 24, 2024 at 2:04 AM Aldrin  wrote:


Hi Andrei,

I tried finding more details on block compression in parquet (or
compression per data page) and I couldn't find anything to satisfy my
curiosity about how it can be used and how it performs.

I hate being the person to just say "test it first," so I want to also
recommend figuring out how you'd imagine the interface to be designed. Some
formats like ORC seem to have 2 compression modes (optimize for speed or
space) while parquet exposes more of the tuning knobs (according to [1]).
And to Gang's point, there's a question of what can be exposed to the
various abstraction levels (perhaps end users would never be interested in
this so it's exposed only through an advanced or internal interface).

Anyways, good luck scoping it out and feel free to iterate with the
mailing list as you try things out rather than just when finished, maybe
someone can chime in with more information and thoughts in the meantime.

[1]: https://arxiv.org/pdf/2304.05028.pdf

Sent from Proton Mail  for iOS


On Sat, Mar 23, 2024 at 05:23, Andrei Lazăr > wrote:

Hi Aldrin, thanks for taking the time to reply to my email!

In my understanding, compression on Parquet files happens on the Data Page
level for every column, meaning that even across a row group, there can be
multiple units of data compression, and most certainly there are going to
be different units of data compression across an entire Parquet file.
Therefore, what I am hoping for is that more granular compression algorithm
choices could lead to overall better compression as the data in the same
column across row groups can differ quite a lot.

At this very moment, specifying different compression algorithms per column
is supported, and in my use case it is extremely helpful, as I have some
columns (mostly containing floats) for which a compression algorithm like
Snappy (or even no compression at all) speeds up my queries significantly
compared to keeping the data compressed with something like ZSTD or GZIP.

That being said, your suggestion of writing a benchmark and sharing the
results here to support considering this approach is a great idea, I will
try doing that!

Once again, thank you for your time!

Kind regards,
Andrei

On Fri, 22 Mar 2024 at 22:12, Aldrin  wrote:


Hello!

I don't do much with compression, so I could be wrong, but I assume a
compression algorithm spans the whole column and areas of large variance
generally benefit less from the compression, but the encoding still
provides benefits across separate areas (e.g. separate row groups).

My impression is that compression will not be any better if it's
restricted to only a subset of the data and if it is only scoped to a
subset of the data then there are extra overheads you'd have beyond what
you normally would have (the same raw value would have the same encoded
value stored per row group). I suppose things like run-length encoding
won't be any less efficient, but it also wouldn't be any more efficient
(with the caveat of a raw value repeating across row groups).

A different compression for different columns isn't unreasonable, so I
think I could be easily convinced that has benefits (though would require
per-column logic that could slow other things down).

These are just my thoughts, though. Can you share the design and results
of your benchmark? Have you (or could you) prototyped anything to test it
out?

Sent from Proton Mail  for iOS


On Fri, Mar 22, 2024 at 14:36, Andrei Lazăr > wrote:

Hi Gang,

Thanks a lot for getting back to me!

So the use case I am having is relatively simple: I was playing around

with

some data and I wanted to benchmark different compression algorithms in

an

effort to speed up data retrieval in a simple Parquet based database

that I

am playing around with. Whilst doing so, I've noticed a very large

variance

in the performance of the same compression algorithm over different row
groups in my 

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-23 Thread Antoine Pitrou



Hello Andrei,

Le 23/03/2024 à 13:23, Andrei Lazăr a écrit :


At this very moment, specifying different compression algorithms per column
is supported, and in my use case it is extremely helpful, as I have some
columns (mostly containing floats) for which a compression algorithm like
Snappy (or even no compression at all) speeds up my queries significantly
compared to keeping the data compressed with something like ZSTD or GZIP.


Ok, but you are still not explaining why you would like a different 
compression algorithm *per row group*, rather than per column.
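
For reference, per-column codecs can already be selected at write time; a
minimal sketch with pyarrow (column names and file path are made up):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "floats": [1.5, 2.5, 3.5],
    "labels": ["a", "b", "c"],
})
# Per-column compression is supported today; per-row-group selection is not.
pq.write_table(
    table,
    "example.parquet",
    compression={"floats": "snappy", "labels": "zstd"},
)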


Regards

Antoine.


Re: ADBC - OS-level driver manager

2024-03-20 Thread Antoine Pitrou



Also, with ADBC driver implementations currently in flux (none of them 
has reached the "stable" status in 
https://arrow.apache.org/adbc/main/driver/status.html), it might be a 
disservice to users to implicitly fetch drivers from potentially 
outdated DLLs on the current system.


Regards

Antoine.


Le 20/03/2024 à 15:08, Matt Topol a écrit :

it seems like the current driver manager work has been largely targeting

an app-specific implementation.

Yup, that was the intention. So far discussions of ADBC having a
system-wide driver registration paradigm like ODBC have mostly been to
discuss how much we dislike that paradigm and would prefer ADBC to stay
with the app-specific approach that we currently have. :)

As of yet, no one has requested such a paradigm so the discussions haven't
gotten revived.

On Wed, Mar 20, 2024 at 9:22 AM David Coe 
wrote:


ODBC has different OS-level driver managers available on their respective
systems. It seems like the current driver manager<
https://arrow.apache.org/adbc/main/cpp/driver_manager.html> work has been
largely targeting an app-specific implementation. Have there been any
discussions of ADBC having a similar system-wide driver registration
paradigm like ODBC does?





Re: [ANNOUNCE] New Arrow committer: Bryce Mecum

2024-03-18 Thread Antoine Pitrou



Congratulations Bryce, and keep up the good work!

Regards

Antoine.

Le 18/03/2024 à 03:21, Nic Crane a écrit :

On behalf of the Arrow PMC, I'm happy to announce that Bryce Mecum has
accepted an invitation to become a committer on Apache Arrow. Welcome, and
thank you for your contributions!

Nic



Re: [VOTE] Release Apache Arrow 15.0.1 - RC0

2024-03-04 Thread Antoine Pitrou



I didn't run the release script but I'm +1 on this (binding).

Regards

Antoine.


Le 04/03/2024 à 10:05, Raúl Cumplido a écrit :

Hi,

I would like to propose the following release candidate (RC0) of Apache
Arrow version 15.0.1. This is a release consisting of 37
resolved GitHub issues[1].

This release candidate is based on commit:
5ce6ff434c1e7daaa2d7f134349f3ce4c22683da [2]

The source release rc0 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [13] for how to validate a release candidate.

See also a verification result on GitHub pull request [14].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 15.0.1
[ ] +0
[ ] -1 Do not release this as Apache Arrow 15.0.1 because...

[1]: 
https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A15.0.1+is%3Aclosed
[2]: 
https://github.com/apache/arrow/tree/5ce6ff434c1e7daaa2d7f134349f3ce4c22683da
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-15.0.1-rc0
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/15.0.1-rc0
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/15.0.1-rc0
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/15.0.1-rc0
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: 
https://github.com/apache/arrow/blob/5ce6ff434c1e7daaa2d7f134349f3ce4c22683da/CHANGELOG.md
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[14]: https://github.com/apache/arrow/pull/40211


Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-02-27 Thread Antoine Pitrou



If there's no engagement, then I'm afraid it might mean that third 
parties have no interest in this. I don't really have any solution for 
generating engagement except nagging and pinging people explicitly :-)




Le 27/02/2024 à 19:09, Matt Topol a écrit :

I would like to see the same, Antoine. Currently, given the lack of
engagement (both for OR against), I was going to take the silence as assent
and hope for non-Voltron Data PMC members to vote on this.

If anyone has any suggestions on how we could potentially generate more
engagement and discussion on this, please let me know as I want as many
parties in the community as possible to be part of this.

Thanks everyone.

--Matt

On Tue, Feb 27, 2024 at 12:48 PM Antoine Pitrou  wrote:



Hello,

I'd really like to see more engagement and criticism from non-Voltron
Data parties before this is formally adopted as an Arrow spec.

Regards

Antoine.


Le 27/02/2024 à 18:35, Matt Topol a écrit :

Hey all,

I'd like to propose a vote for us to officially adopt the protocol
described in the google doc[1] for Dissociated Arrow IPC Transports. This
proposal was originally discussed at [2]. Once this proposal is adopted, I
will work on adding the necessary documentation to the Arrow website along
with examples etc.

The vote will be open for at least 72 hours.

[ ] +1 Accept this Proposal
[ ] +0
[ ] -1 Do not accept this proposal because...

Thank you everyone!

--Matt

[1]:


https://docs.google.com/document/d/1zHbnyK1r6KHpMOtEdIg1EZKNzHx-MVgUMOzB87GuXyk/edit#heading=h.38515dnp2bdb

[2]: https://lists.apache.org/thread/tn5wt4p52f6kqjtx3tjxqd9122n4pf94







Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-02-27 Thread Antoine Pitrou



Hello,

I'd really like to see more engagement and criticism from non-Voltron 
Data parties before this is formally adopted as an Arrow spec.


Regards

Antoine.


Le 27/02/2024 à 18:35, Matt Topol a écrit :

Hey all,

I'd like to propose a vote for us to officially adopt the protocol
described in the google doc[1] for Dissociated Arrow IPC Transports. This
proposal was originally discussed at [2]. Once this proposal is adopted, I
will work on adding the necessary documentation to the Arrow website along
with examples etc.

The vote will be open for at least 72 hours.

[ ] +1 Accept this Proposal
[ ] +0
[ ] -1 Do not accept this proposal because...

Thank you everyone!

--Matt

[1]:
https://docs.google.com/document/d/1zHbnyK1r6KHpMOtEdIg1EZKNzHx-MVgUMOzB87GuXyk/edit#heading=h.38515dnp2bdb
[2]: https://lists.apache.org/thread/tn5wt4p52f6kqjtx3tjxqd9122n4pf94



Re: [DISCUSS] Arrow 15.0.1 patch release

2024-02-14 Thread Antoine Pitrou



I've added a bunch of additional closed issues for 15.0.1.

Regards

Antoine.


Le 14/02/2024 à 10:29, Raúl Cumplido a écrit :

Hi,

There are already several other issues tagged as 15.0.1. I can take
some time to create a patch release and add all those fixes.
I'll add it to the agenda for today's bi-weekly call.

Thanks,
Raúl

El mar, 13 feb 2024 a las 23:20, Antoine Pitrou () escribió:



Well, https://github.com/apache/arrow/issues/20379 makes me wonder if
anyone is using the Java Dataset bridge seriously.


Le 13/02/2024 à 21:10, Dane Pitkin a écrit :

Hi all,

Arrow Java identified an issue[1] in the 15.0.0 release. There is an
undefined symbol in the dataset module that causes a linking error at run
time. The issue is resolved[2] and I'd like to propose a patch release. We
also have an open issue to implement testing to prevent this from happening
in the future[3]. This is a major regression in the Arrow Java package, so
it would be great to get a patch released for users.

Special thanks to David Susanibar and Kou for triaging and fixing this
issue.

Thanks,
Dane

[1]https://github.com/apache/arrow/issues/39919
[2]https://github.com/apache/arrow/pull/40015
[3]https://github.com/apache/arrow/issues/40018



Re: [DISCUSS] Arrow 15.0.1 patch release

2024-02-13 Thread Antoine Pitrou



Well, https://github.com/apache/arrow/issues/20379 makes me wonder if 
anyone is using the Java Dataset bridge seriously.



Le 13/02/2024 à 21:10, Dane Pitkin a écrit :

Hi all,

Arrow Java identified an issue[1] in the 15.0.0 release. There is an
undefined symbol in the dataset module that causes a linking error at run
time. The issue is resolved[2] and I'd like to propose a patch release. We
also have an open issue to implement testing to prevent this from happening
in the future[3]. This is a major regression in the Arrow Java package, so
it would be great to get a patch released for users.

Special thanks to David Susanibar and Kou for triaging and fixing this
issue.

Thanks,
Dane

[1]https://github.com/apache/arrow/issues/39919
[2]https://github.com/apache/arrow/pull/40015
[3]https://github.com/apache/arrow/issues/40018



Re: [DISCUSS] Flight RPC: add 'fallback' URI scheme

2024-02-13 Thread Antoine Pitrou



I think the original proposal is sufficient.

Also, it is not obvious to me how one would switch from e.g. grpc+tls to 
http without an explicit server location (unless both Flight servers are 
hosted under the same port?). So the "+" proposal seems a bit weird.



Le 12/02/2024 à 23:39, David Li a écrit :

The idea is that the client would reuse the existing connection, in which case 
the protocol and such are implicit. (If the client doesn't have a connection 
anymore, it can't use the fallback anyways.)

I suppose this has the advantage that you could "fall back" to a known hostname 
with a different protocol, but I'm not sure that always applies anyways. (Correct me if 
I'm wrong Matt, but as I recall, UCX addresses aren't hostnames but rather opaque byte 
blobs, for instance.)

If we do prefer this, to avoid overloading the hostname, there's also the 
informal convention of using + in the scheme, so it could be 
arrow-flight-fallback+grpc+tls://, arrow-flight-fallback+http://, etc.

On Mon, Feb 12, 2024, at 17:03, Joel Lubinitsky wrote:

Thanks for clarifying.

Given the relationship between these two proposals, would it also be
necessary to distinguish the scheme (or schemes) supported by the
originating Flight RPC service?

If that is the case, it may be preferred to use the "host" portion of the
URI rather than the "scheme" to denote the location of the data. In this
scenario, the host "0.0.0.0" could be used. This IP address is defined in
IETF RFC1122 [1] as "This host on this network", which seems most
consistent with the intended use-case. There are some caveats to this usage
but in my experience it's not uncommon for protocols to extend the
definition of this address in their own usage.

A benefit of this convention is that the scheme remains available in the
URI to specify the transport available. For example, the following list of
locations may be included in the response:

["grpc://0.0.0.0", "ucx://0.0.0.0", "grpc://1.2.3.4", ...]

This would indicate that grpc and ucx transport is available from the
current service, grpc is available at 1.2.3.4, and possibly more
combinations of scheme/host.

[1] https://datatracker.ietf.org/doc/html/rfc1122#section-3.2.1.3
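To make the shape of this idea concrete, here is a small client-side sketch of how
such a location list could be interpreted; the 0.0.0.0 convention follows the message
above, and all host names and schemes are illustrative only:

```
from urllib.parse import urlparse

# Locations as proposed above: the scheme advertises the transport and
# "0.0.0.0" stands for "this host", i.e. re-use the current connection.
locations = ["grpc://0.0.0.0", "ucx://0.0.0.0", "grpc://1.2.3.4"]
supported_schemes = {"grpc"}  # e.g. a client without UCX installed

def resolve(locations, current_host="flight.example.com"):
    for uri in locations:
        parsed = urlparse(uri)
        if parsed.scheme not in supported_schemes:
            continue  # skip transports this client cannot use
        host = current_host if parsed.hostname == "0.0.0.0" else parsed.hostname
        return parsed.scheme, host
    raise RuntimeError("no usable location")

print(resolve(locations))  # -> ('grpc', 'flight.example.com')
```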

On Mon, Feb 12, 2024 at 2:53 PM David Li  wrote:


Ah, while I was thinking of it as useful for a fallback, I'm not
specifying it that way.  Better ideas for names would be appreciated.

The actual precedence has never been specified. All endpoints are
equivalent, so clients may use what is "best". For instance, with Matt
Topol's concurrent proposal, a GPU-enabled client may preferentially try
UCX endpoints while other clients may choose to ignore them entirely (e.g.
because they don't have UCX installed).

In practice the ADBC/JDBC drivers just scan the list left to right and try
each endpoint in turn for lack of a better heuristic.

On Mon, Feb 12, 2024, at 14:28, Joel Lubinitsky wrote:

Thanks for proposing this David.

I think the ability to include the Flight RPC service itself in the list of
endpoints from which data can be fetched is a helpful addition.

The current choice of name for the URI (arrow-flight-fallback://) seems to
imply that there is an order of precedence that should be considered in the
list of URIs. Specifically, as a developer receiving the list of locations
I might assume that I should try fetching from other locations first. If
those do not succeed, I may try the original service as a fallback.

Are these the intended semantics? If so, is there a way to include the
original service in the list of locations without the implied precedence?

Thanks,
Joel

On Mon, Feb 12, 2024 at 11:52 James Duong wrote:


This seems like a good idea, and also improves consistency with clients
that erroneously assumed that the service endpoint was always in the list
of endpoints.

From: Antoine Pitrou 
Date: Monday, February 12, 2024 at 6:05 AM
To: dev@arrow.apache.org 
Subject: Re: [DISCUSS] Flight RPC: add 'fallback' URI scheme

Hello,

This looks fine to me.

Regards

Antoine.


Le 12/02/2024 à 14:46, David Li a écrit :

Hello,

I'd like to propose a slight update to Flight RPC to make Flight SQL work
better in different deployment scenarios. Comments on the doc would be
appreciated:

https://docs.google.com/document/d/1g9M9FmsZhkewlT1mLibuceQO8ugI0-fqumVAXKFjVGg/edit?usp=sharing

The gist is that FlightEndpoint allows specifying either (1) a list of
concrete URIs to fetch data from or (2) no URIs, meaning to fetch from the
Flight RPC service itself; but it would be useful to combine both behaviors
(try these concrete URIs and fall back to the Flight RPC service itself)
without requiring the service to know its own public address.


Best,
David






Re: [ANNOUNCE] Apache Arrow nanoarrow 0.4.0 Released

2024-02-12 Thread Antoine Pitrou



Hi Dewey,

Le 12/02/2024 à 15:01, Dewey Dunnington a écrit :

Apache Arrow nanoarrow is a small C library for building and
interpreting Arrow C Data interface structures with bindings for users
of the R programming language.


Do you want to reconsider this sentence? It seems nanoarrow is starting 
to be more versatile now.


Regards

Antoine.


Re: [DISCUSS] Flight RPC: add 'fallback' URI scheme

2024-02-12 Thread Antoine Pitrou



Hello,

This looks fine to me.

Regards

Antoine.


Le 12/02/2024 à 14:46, David Li a écrit :

Hello,

I'd like to propose a slight update to Flight RPC to make Flight SQL work 
better in different deployment scenarios.  Comments on the doc would be 
appreciated:

https://docs.google.com/document/d/1g9M9FmsZhkewlT1mLibuceQO8ugI0-fqumVAXKFjVGg/edit?usp=sharing

The gist is that FlightEndpoint allows specifying either (1) a list of concrete 
URIs to fetch data from or (2) no URIs, meaning to fetch from the Flight RPC 
service itself; but it would be useful to combine both behaviors (try these 
concrete URIs and fall back to the Flight RPC service itself) without requiring 
the service to know its own public address.

Best,
David


Re: [DISCUSS] Proposal to expand Arrow Communications

2024-02-07 Thread Antoine Pitrou



I think we should find a proper descriptive name for the 
"high-performance protocol", because "high-performance" is vague and 
context-dependent, and also spreads unnecessary confusion about existing 
alternatives such as regular Arrow IPC.


I would for example propose "Dissociated Arrow IPC" to stress the idea 
that metadata and data can be on separate transports.



Le 03/02/2024 à 00:22, Matt Topol a écrit :

Hey all,

In my current work I've been experimenting and playing around with
utilizing Arrow and non-cpu memory data. While the creation of the
ArrowDeviceArray struct and the enhancements to the Arrow library Device
abstractions were necessary, there is also a need to extend the
communications specs we utilize, i.e. Flight.

Currently there is no real way to utilize Arrow Flight with shared memory
or with non-CPU memory (without an expensive Device -> Host copy first). To
this end I've done a bunch of research and toying around and came up with a
protocol to propose and a reference implementation using UCX[1]. Attached
to the proposal is also a couple extensions for Flight itself to make it
easier for users to still use Flight for metadata / dataset information and
then point consumers elsewhere to actually retrieve the data. The idea here
is that this would be a new specification for how to transport Arrow data
across these high-performance transports such as UCX / libfabric / shared
memory / etc. We wouldn't necessarily expose / directly add implementations
of the spec to the Arrow libraries, just provide reference/example
implementations.

I've written the proposal up on a google doc[2] that everyone should be
able to comment on. Once we get some community discussion on there, if
everyone is okay with it I'd like eventually do a vote on adopting this
spec and if we do, I'll then make a PR to start adding it to the Arrow
documentation, etc.

Anyways, thank you everyone in advance for your feedback and comments!

--Matt

[1]: https://github.com/openucx/ucx/
[2]:
https://docs.google.com/document/d/1zHbnyK1r6KHpMOtEdIg1EZKNzHx-MVgUMOzB87GuXyk/edit?usp=sharing



Re: [DISCUSS] Status and future of @ApacheArrow Twitter account

2024-01-27 Thread Antoine Pitrou



My 2 cents : I don't understand what an open source project gains by 
publishing on a microblogging platform.


As for Twitter specifically, its recent governance changes would be good 
reason for terminating the @ApacheArrow account, IMHO.


Regards

Antoine.


Le 27/01/2024 à 23:06, Bryce Mecum a écrit :

I noticed that the @ApacheArrow Twitter account [1] hasn't posted
since June 2023 which is around the time of the Arrow 12 release. When
I asked on Zulip [2] about who runs or has access to post as that
account, Kou indicated the account was managed using TweetDeck [3] and
that this may no longer be an option due to subscription changes.

I'm writing to get a sense of who currently has access and how the
community would like to move forward with using the account. I'm also
volunteering to help manage it.

My questions are:

- Who has access to @ApacheArrow [1]?
- Is the community still interested in engaging on Twitter?
- Is the community interested in other platforms, potentially just
engaging with them through cross-posting?

Thanks,
Bryce

[1] https://twitter.com/ApacheArrow
[2] 
https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/ApacheArrow.20Twitter.20account/near/418346643
[3] https://en.wikipedia.org/wiki/Tweetdeck


Re: [IPC] Delta Dictionary Flag Clarification for Multi-Batch IPC

2024-01-25 Thread Antoine Pitrou



Hello,

My own answers:

1) isDelta should be true only when a delta is being transmitted (to be 
appended to the existing dictionary with the same id); it should be 
false when a full dictionary is being transmitted (to replace the 
existing dictionary with the same id, if any)

2) yes, it could
3) yes
4) there's no reason it can't be valid

Regards

Antoine.
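To illustrate answer (1), a small PyArrow sketch of a two-batch stream where the second
dictionary extends the first; it assumes the stream writer emits a delta (rather than a
replacement) when emit_dictionary_deltas is enabled and the new dictionary is a
prefix-extension of the previous one:

```
import pyarrow as pa

dict1 = pa.array(["a", "b", "c"])
batch1 = pa.record_batch(
    [pa.DictionaryArray.from_arrays(pa.array([0, 1, 1, 2]), dict1)], names=["col"])
dict2 = pa.array(["a", "b", "c", "d"])  # previous dictionary plus one appended value
batch2 = pa.record_batch(
    [pa.DictionaryArray.from_arrays(pa.array([2, 3, 0, 1]), dict2)], names=["col"])

sink = pa.BufferOutputStream()
options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
with pa.ipc.new_stream(sink, batch1.schema, options=options) as writer:
    writer.write_batch(batch1)  # full dictionary [a, b, c], isDelta = false
    writer.write_batch(batch2)  # only [d] transmitted as a delta, isDelta = true

# Reading back decodes both batches against the accumulated dictionary.
for batch in pa.ipc.open_stream(sink.getvalue()):
    print(batch.column(0).to_pylist())
```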


Le 25/01/2024 à 07:25, Micah Kornfield a écrit :

Hi Chris,
My interpretations:
1) I'm not sure it is clearly defined, but my impression is the first
dictionary is never a delta dictionary (option 1)
2) I don't think they are prevented from switching state (which I supposed
is more complicated?) but hopefully not by much?
3) Dictionaries are reused across batches unless replaced.
4)  I'm not sure I understand this question.  Dictionary should be passed
independently of indexes?

Thanks,
Micah

On Fri, Jan 19, 2024 at 1:55 PM Chris Larsen 
wrote:


Hi folks,

I'm working on multi-batch dictionary with delta support in Java [1] and
would like some clarifications. Given the "isDelta" flag in the dictionary
message [2], when should this be set to "true"?

1) If we have dictionary with an ID of 1 that we want to delta encode and
it is used across multiple batches, should the initial batch have
`isDelta=false` then subsequent batches have `isDelta=true`? E.g.

batch 1, dict 1, isDelta=false, dictVector=[a, b, c], indexVector=[0, 1, 1,
2]
batch 2, dict 1, isDelta=true, dictVector=[d], indexVector=[2, 3, 0, 1]
batch 3, dict 1, isDelta=true, dictVector=[e], indexVector=[0, 4]

Or should the flag be true for the entire IPC flow? E.g.

batch 1, dict 1, isDelta=true, dictVector=[a, b, c], indexVector=[0, 1, 1,
2]
batch 2, dict 1, isDelta=true, dictVector=[d], indexVector=[2, 3, 0, 1]
batch 3, dict 1, isDelta=true, dictVector=[e], indexVector=[0, 4, 3]

Either works for me.

2) Could (in stream, not file IPCs) a single dictionary ever switch state
across batches from delta to replacement mode or vice-versa? E.g.

batch 1, dict 1, isDelta = true, dictVector=[a, b, c], indexVector=[0, 1,
1, 2]
batch 2, dict 1, isDelta = true, dictVector=[d], indexVector=[2, 3, 0, 1]
batch 3, dict 1, isDelta = false, dictVector=[c, a, d], indexVector=[0, 1,
2]

I'd like to keep the protocol and API simple and assume switching is not
allowed. This would mean the 2nd example above would be canonical.

3) Are replacement dictionaries required to be serialized for every batch
or is a dictionary re-used across batches until a replacement is received?
The CPP IPC API has 'unify_dictionaries' [3] that mentions "a column with a
dictionary type must have the same dictionary in each record batch". I
assume (and prefer) the latter, that replacements are serialized once and
re-used. E.g.

batch 1, dict 1, isDelta = false, dictVector=[a, b, c], indexVector=[0, 1,
1, 2]
batch 2, dict 1, isDelta = false, dictVector=[], indexVector=[2, 1, 0, 1]
// use previous dictionary
batch 3, dict 1, isDelta = false, dictVector=[c, a, d], indexVector=[0, 1,
2] // replacement

And I assume that 'unify_dictionaries' simply concatenates all dictionaries
into a single vector serialized in the first batch (haven't looked at the
code yet).

4) Is it valid for a delta dictionary to have an update in a subsequent
batch even though the update is not used in that batch? A silly example
would be:

batch 1, dict 1, isDelta = true, dictVector=[a, b, c], indexVector=[0, 1,
1, 2]
batch 2, dict 1, isDelta = true, dictVector=[d], indexVector=[null, null,
null, null]
batch 3, dict 1, isDelta = true, dictVector=[], indexVector=[0, 3, 2]

Thanks for your help!

[1] https://github.com/apache/arrow/pull/38423
[2] https://github.com/apache/arrow/blob/main/format/Message.fbs#L134
[3]

https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc15IpcWriteOptions18unify_dictionariesE

--


Chris Larsen





Re: [DataFusion] New Blog Post -- DataFusion 34.0

2024-01-23 Thread Antoine Pitrou



Impressive, thank you!


Le 23/01/2024 à 14:06, Andrew Lamb a écrit :

If anyone is interested, here is a new blog post about the last 6 months in
DataFusion[1] and where we are heading this year.

Andrew

[1]: https://arrow.apache.org/blog/2024/01/19/datafusion-34.0.0/



Re: [DISC] Improve Arrow Release verification process

2024-01-19 Thread Antoine Pitrou



Well, if the main objective is to just follow the ASF Release 
guidelines, then our verification process can be simplified drastically.


The ASF indeed just requires:
"""
Every ASF release MUST contain one or more source packages, which MUST 
be sufficient for a user to build and test the release provided they 
have access to the appropriate platform and tools. A source release 
SHOULD not contain compiled code.

"""

So, basically, if the source tarball is enough to compile Arrow on a 
single platform with a single set of tools, then we're ok. :-)


The rest is just an additional burden that we voluntarily inflict on 
ourselves. *Ideally*, our continuous integration should be able to 
detect any build-time or runtime issue on supported platforms. There 
have been rare cases where some issues could be detected at release time 
thanks to the release verification script, but these are a tiny portion 
of all issues routinely detected in the form of CI failures. So, there 
doesn't seem to be a reason to believe that manual release verification 
is bringing significant benefits compared to regular CI.


Regards

Antoine.


Le 19/01/2024 à 11:42, Raúl Cumplido a écrit :

Hi,

One of the challenges we have when doing a release is verification and voting.

Currently the Arrow verification process is quite long, tedious and error prone.

I would like to use this email to get feedback and user requests in
order to improve the process.

Several things already on my mind:

One thing that is quite annoying is that any flaky failure makes us
restart the process and possibly requires downloading everything
again. It would be great to have some kind of retry mechanism that
allows us to keep going from where it failed and doesn't have to redo
the previous successful jobs.

We do have a bunch of flags to do specific parts but that requires
knowledge and time to go over the different flags, etcetera so the UX
could be improved.

Based on the ASF release policy [1] in order to cast a +1 vote we have
to validate the source code packages but it is not required to
validate binaries locally. Several binaries are currently tested using
docker images and they are already tested and validated on CI. Our
documentation for release verification points to perform binary
validation. I plan to update the documentation and move it to the
official docs instead of the wiki [2].

I would appreciate input on the topic so we can improve the current process.

Thanks everyone,
Raúl

[1] https://www.apache.org/legal/release-policy.html#release-approval
[2] 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


Re: [VOTE] Release Apache Arrow 15.0.0 - RC1

2024-01-17 Thread Antoine Pitrou



Go verification fails on Ubuntu 22.04:

```
# google.golang.org/grpc
../../gopath/pkg/mod/google.golang.org/grpc@v1.58.3/server.go:2096:14: 
undefined: atomic.Int64

note: module requires Go 1.19
# github.com/apache/arrow/go/v15/arrow/avro
arrow/avro/reader_types.go:594:16: undefined: fmt.Append
note: module requires Go 1.20
```

Ubuntu 22.04 has Go 1.18.1.

Regards

Antoine.


Le 17/01/2024 à 11:58, Raúl Cumplido a écrit :

Hi,

I would like to propose the following release candidate (RC1) of Apache
Arrow version 15.0.0. This is a release consisting of 330
resolved GitHub issues[1].

This release candidate is based on commit:
a61f4af724cd06c3a9b4abd20491345997e532c0 [2]

The source release rc1 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [13] for how to validate a release candidate.

See also a verification result on GitHub pull request [14].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 15.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 15.0.0 because...

[1]: 
https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A15.0.0+is%3Aclosed
[2]: 
https://github.com/apache/arrow/tree/a61f4af724cd06c3a9b4abd20491345997e532c0
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-15.0.0-rc1
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/15.0.0-rc1
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/15.0.0-rc1
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/15.0.0-rc1
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: 
https://github.com/apache/arrow/blob/a61f4af724cd06c3a9b4abd20491345997e532c0/CHANGELOG.md
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[14]: https://github.com/apache/arrow/pull/39641


Re: [DISCUSS] Semantics of extension types

2023-12-13 Thread Antoine Pitrou



Hi,

For now, I would suggest that each implementation decides on their own 
strategy, because we don't have a clear idea of which is better (and 
extension types are probably not getting a lot of use yet).


Regards

Antoine.
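As a concrete point of reference for the discussion below, a small PyArrow sketch of an
extension type and the explicit drop-to-storage opt-in; the type itself is made up:

```
import pyarrow as pa

class UuidType(pa.ExtensionType):
    """Toy extension type whose storage is fixed_size_binary(16)."""
    def __init__(self):
        super().__init__(pa.binary(16), "example.uuid")
    def __arrow_ext_serialize__(self):
        return b""
    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

ty = UuidType()
storage = pa.array([b"0" * 16, b"1" * 16], pa.binary(16))
arr = pa.ExtensionArray.from_storage(ty, storage)

# Today, operating on the underlying storage requires explicitly dropping
# to it (via .storage here, or an explicit cast to the storage type).
plain = arr.storage
print(plain.type)  # fixed_size_binary[16]
```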


Le 13/12/2023 à 17:39, Benjamin Kietzman a écrit :

The main problem I see with adding properties to ExtensionType is I'm not
sure where that information would reside. Allowing type authors to declare
in which ways the type is equivalent (or not) to its storage is appealing,
but it seems to need an official extension field like
ARROW:extension:semantics. Otherwise I think each extension type's
semantics would need to be maintained within every implementation as well
as in a central reference (probably in Columnar.rst), which seems
unreasonable to expect of extension type authors. I'm also skeptical that
useful information could be packed into an ARROW:extension:semantics field;
even if the type can declare that ordering-as-with-storage is invalid I
don't think it'd be feasible to specify the correct ordering.

If we cannot attach this information to extension types, the question
becomes which defaults are most reasonable for engines and how can the
engine most usefully be configured outside those defaults. My own
preference would be to refuse operations other than selection or
casting-to-storage, with a runtime extensible registry of allowed implicit
casts. This will allow users of the engine to configure their extension
types as they need, and the error message raised when an implicit
cast-to-storage is not allowed could include the suggestion to register the
implicit cast. For applications built against a specific engine, this
approach would allow recovering much of the advantage of attaching
properties to an ExtensionType by including registration of implicit casts
in the ExtensionType's initialization.

On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman 
wrote:


Hello all,

Recently, a PR to arrow c++ [1] was opened to allow implicit casting from
any extension type to its storage type in acero. This raises questions
about the validity of applying operations to an extension array's storage.
For example, some extension type authors may intend different ordering for
arrays of their new type than would be applied to the array's storage or
may not intend for the type to participate in arithmetic even though its
storage could.

Suggestions/observations from discussion on that PR included:
- Extension types could provide general semantic description of storage
type equivalence [2], so that a flag on the extension type enables ordering
by storage but disables arithmetic on it
- Compute functions or kernels could be augmented with a filter declaring
which extension types are supported [3].
- Currently arrow-rs considers extension types metadata only [4], so all
kernels treat extension arrays equivalently to their storage.
- Currently arrow c++ only supports explicitly casting from an extension
type to its storage (and from storage to ext), so any operation can be
performed on an extension array's storage but it requires opting in.

Sincerely,
Ben Kietzman

[1] https://github.com/apache/arrow/pull/39200
[2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
[3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161
[4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651





Re: Java, dictionary ids and schema equality

2023-12-09 Thread Antoine Pitrou



Hi Curt,

Yes, it's a problem in the Java implementation of these tests. Ideally 
this should be fixed, but doing so would require some amount of scaffolding.


Regards

Antoine.


Le 09/12/2023 à 21:47, Curt Hagenlocher a écrit :

I've (mostly) fixed the C# implementation of dictionary IPC but I'm getting
a failing integration test. The Java checks are explicitly validating that
the dictionary IDs in the schema match the values it expects. None of the
other implementations seem to do that, though they're obviously passing and
so they're assigning dictionary IDs consistently with what the Java
implementation expects.

This seems to be because the C# implementation starts numbering
dictionaries with 1 while Java seems to expect them to start with 0. (I
have not yet validated this theory.)

But more broadly, I'm curious -- is the Java implementation being overly
pedantic here or is there an explicit expectation that the dictionary
number serialized into Flatbuffer format for files will follow a specific
ordering?

Thanks,
-Curt



Re: [VOTE] Flight SQL as experimental

2023-12-08 Thread Antoine Pitrou

+1 (binding)


Le 08/12/2023 à 20:42, David Li a écrit :

Let's start a formal vote just so we're on the same page now that we've 
discussed a few things.

I would like to propose we remove 'experimental' from Flight SQL and make it 
stable:

- Remove the 'experimental' option from the Protobuf definitions (but leave the 
option definition for future additions)
- Update specifications/documentation/implementations to no longer refer to 
Flight SQL as experimental, and describe what stable means (no 
backwards-incompatible changes)

The vote will be open for at least 72 hours.

[ ] +1
[ ] +0
[ ] -1 Keep Flight SQL experimental because...

On Fri, Dec 8, 2023, at 13:37, Weston Pace wrote:

+1

On Fri, Dec 8, 2023 at 10:33 AM Micah Kornfield 
wrote:


+1

On Fri, Dec 8, 2023 at 10:29 AM Andrew Lamb  wrote:


I agree it is time to "promote" ArrowFlightSQL to the same level as other
standards in Arrow

Now that it is used widely (we use and count on it too at InfluxData) I
agree it makes sense to remove the experimental label from the overall
spec.

It would make sense to leave experimental / caveats on any places (like
extension APIs) that are likely to change

Andrew

On Fri, Dec 8, 2023 at 11:39 AM David Li  wrote:


Yes, I think we can continue marking new features (like the bulk
ingest/session proposals) as experimental but remove it from anything
currently in the spec.

On Fri, Dec 8, 2023, at 11:36, Laurent Goujon wrote:

I'm the author of the initial pull request which triggered the

discussion.

I was focusing first on the comment in Maven pom.xml files which show

up

in

Maven Central and other places, and which got some people confused

about

the state of the driver/code. IMHO this would apply to the current
Flight/Flight SQL protocol and code as it is today. Protocol

extensions

should be still deemed experimental if still in their incubating

phase?


Laurent

On Thu, Dec 7, 2023 at 4:54 PM Micah Kornfield wrote:


This applies to mostly existing APIs (e.g. recent additions are still
experimental)? Or would it apply to everything going forward?

Thanks,
Micah

On Thu, Dec 7, 2023 at 2:25 PM David Li 

wrote:



Yes, we'd update the docs, the Protobuf definitions, and anything else
referring to Flight SQL as experimental.

On Thu, Dec 7, 2023, at 17:14, Joel Lubinitsky wrote:

The message types defined in FlightSql.proto are all marked experimental as
well. Would this include changes to any of those?

On Thu, Dec 7, 2023 at 16:43 Laurent Goujon wrote:


we have been using it with Dremio for a while now, and we consider it stable

+1 (not binding)

Laurent

On Wed, Dec 6, 2023 at 4:52 PM Matt Topol wrote:


+1, I agree with everyone else

On Wed, Dec 6, 2023 at 7:49 PM James Duong
 wrote:


+1 from me. It's used in a good number of databases now.

Get Outlook for Android

From: David Li 
Sent: Wednesday, December 6, 2023 9:59:54 AM
To: dev@arrow.apache.org 
Subject: [DISCUSS] Flight SQL as experimental

Flight SQL has been marked 'experimental' since the beginning. Given that
it's now used by a few systems for a few years now, should we remove this
qualifier? I don't expect us to be making breaking changes anymore.


This came up in a GitHub PR:

https://github.com/apache/arrow/pull/39040


-David

















Re: [DISCUSS] Conventions for transporting Arrow data over HTTP

2023-12-06 Thread Antoine Pitrou



Hi,

While this looks like a nice start, I would expect more precise 
recommendations for writing non-trivial services. Especially, one 
question is how to send both an application-specific POST request and an 
Arrow stream, or an application-specific GET response and an Arrow 
stream. This might necessitate some kind of framing layer, or a 
standardized delimiter.


Regards

Antoine.
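One pragmatic (if limited) answer to the framing question, also mentioned further down
this thread, is to inline a small Arrow IPC stream as base64 inside a JSON body; a
minimal sketch with made-up field names:

```
import base64
import json
import pyarrow as pa

table = pa.table({"x": [1, 2, 3]})
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)

# Application-specific metadata and the Arrow payload in one JSON response.
body = json.dumps({
    "row_count": table.num_rows,
    "arrow_ipc_base64": base64.b64encode(sink.getvalue().to_pybytes()).decode(),
})

# Consumer side: decode the embedded stream and read it back.
payload = json.loads(body)
result = pa.ipc.open_stream(base64.b64decode(payload["arrow_ipc_base64"])).read_all()
```

For larger results the base64 overhead argues for separate responses or a proper
framing layer, which is exactly the open question above.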



Le 05/12/2023 à 21:10, Ian Cook a écrit :

This is a continuation of the discussion entitled "[DISCUSS] Protocol for
exchanging Arrow data over REST APIs". See the previous messages at
https://lists.apache.org/thread/vfz74gv1knnhjdkro47shzd1z5g5ggnf.

To inform this discussion, I created some basic Arrow-over-HTTP client and
server examples here:
https://github.com/apache/arrow/pull/39081

My intention is to expand and improve this set of examples (with your help)
until they reflect a set of conventions that we are comfortable documenting
as recommendations.

Please take a look and add comments / suggestions in the PR.

Thanks,
Ian

On Tue, Nov 21, 2023 at 1:35 PM Dewey Dunnington
 wrote:


I also think a set of best practices for Arrow over HTTP would be a
valuable resource for the community...even if it never becomes a
specification of its own, it will be beneficial for API developers and
consumers of those APIs to have a place to look to understand how
Arrow can help improve throughput/latency/maybe other things. Possibly
something like httpbin.org but for requests/responses that use Arrow
would be helpful as well. Thank you Ian for leading this effort!

It has mostly been covered already, but in the (ubiquitous) situation
where a response contains some schema/table and some non-schema/table
information there is some tension between throughput (best served by a
JSON response plus one or more IPC stream responses) and latency (best
served by a single HTTP response? JSON? IPC with metadata/header?). In
addition to Antoine's list, I would add:

- How to serve the same table in multiple requests (e.g., to saturate
a network connection, or because separate worker nodes are generating
results anyway).
- How to inline a small schema/table into a single request with other
metadata (I have seen this done as base64-encoded IPC in JSON, but
perhaps there is a better way)

If anybody is interested in experimenting, I repurposed a previous
experiment I had as a flask app that can stream IPC to a client:

https://github.com/paleolimbot/2023-11-21_arrow-over-http-scratchpad/pull/1/files
.


- recommendations about compression


Just a note that there is also Content-Encoding: gzip (for consumers
like Arrow JS that don't currently support buffer compression but that
can leverage the facilities of the browser/http library)

Cheers!

-dewey


On Mon, Nov 20, 2023 at 8:30 PM Sutou Kouhei  wrote:


Hi,


But how is the performance?


It's faster than the original JSON based API.

I implemented Apache Arrow support for a C# client. So I measured only with
Apache Arrow C#, but the Apache Arrow based API is faster than the JSON
based API.


Have you measured the throughput of this approach to see
if it is comparable to using Flight SQL?


Sorry, I didn't measure the throughput. In this case, the elapsed time of
one request/response pair is more important than throughput. And it was
faster than the JSON based API, with enough performance.

I couldn't compare to a Flight SQL based approach because
Groonga doesn't support Flight SQL yet.


Is this approach able to saturate a fast network
connection?


I think that we can't measure this with the Groonga case, because Groonga
doesn't send data continuously. Here is one of the request patterns:

1. Groonga has log data partitioned by day
2. Groonga does full text search against one partition (2023-11-01)
3. Groonga sends the result to client as Apache Arrow
streaming format record batches
4. Groonga does full text search against the next partition (2023-11-02)
5. Groonga sends the result to client as Apache Arrow
streaming format record batches
6. ...

In this case, the result data aren't being sent continuously (search ->
send -> search -> send -> ...), so it doesn't saturate a fast network
connection.

(3. and 4. can be parallel but it's not implemented yet.)

If we optimize this approach, this approach may be able to
saturate a fast network connection.


And what about the case in which the server wants to begin sending batches
to the client before the total number of result batches / records is known?


Ah, sorry. I forgot to explain the case. Groonga uses the
above approach for it.


- server should not return the result data in the body of a response to a
query request; instead server should return a response body that gives
URI(s) at which clients can GET the result data


If we want to do this, the standard "Location" HTTP headers
may be suitable.


- transmit result data in chunks (Transfer-Encoding: chunked), with
recommendations about chunk size


Ah, sorry. I forgot to explain this case too. 

Re: [Discussion][Gandiva] Migration JIT engine from MCJIT to ORC v2

2023-12-06 Thread Antoine Pitrou



Given that MCJIT is deprecated and there doesn't seem to be a downside 
to the new APIs, migrating to ORC v2 sounds fine to me.


Just a question: does it raise the minimum supported LLVM version?

Regards

Antoine.


Le 05/12/2023 à 03:35, Yue Ni a écrit :

Hi there,

I'd like to initiate a discussion regarding the proposal to migrate the JIT
engine from LLVM MCJIT to LLVM ORC v2 [1] in Gandiva. I've provided a
concise description of the proposal in the following issue:
https://github.com/apache/arrow/issues/37848. I welcome any feedback or
comments on this topic. Please feel free to share your thoughts either here
on the mailing list or directly within the issue. Thank you for your
attention and help.

*Background:*
Gandiva currently employs MCJIT as its internal JIT engine. However, LLVM
has introduced a newer JIT API known as ORC v2/LLJIT [1], which presents
several advantages over MCJIT:

* Active Maintenance: ORC v2 is under active development and maintenance by
LLVM developers. In contrast, MCJIT is not receiving active updates and,
based on indications from LLVM developers, is slated for eventual
deprecation and removal.
* Modularity and Organization: ORC v2 boasts a more organized and modular
structure, granting users the flexibility to seamlessly integrate various
JIT components.
* Thread-Local Variable Support: ORC v2 natively supports thread-local
variables, enhancing its functionality.
* Enhanced Resource Management: When compared to MCJIT, ORC v2 provides a
more granular approach to resource management, optimizing memory usage and
code compilation.

*Proposal:*
I propose the introduction of ORC v2/LLJIT to replace MCJIT in gandiva.
There should not be any user facing change, and performance is expected to
be roughly the same.

Any feedback is appreciated. Thanks.

*References:*
[1] https://llvm.org/docs/ORCv2.html
[2] https://github.com/apache/arrow/issues/37848

Regards,
Yue Ni



Re: CIDR 2024

2023-12-06 Thread Antoine Pitrou



For the sake of clarity, it seems this is talking about the Conference 
on Innovative Data Systems Research:

https://www.cidrdb.org/cidr2024/

Regards

Antoine.


Le 06/12/2023 à 01:15, Wes McKinney a écrit :

I will also be there.

On Mon, Dec 4, 2023 at 12:58 PM Tony Wang  wrote:


I am

Get Outlook for Android

From: Curt Hagenlocher 
Sent: Monday, December 4, 2023 12:53:00 PM
To: dev@arrow.apache.org 
Subject: CIDR 2024

Who's going to CIDR in January?

(And who else is shocked that it's already going to be 2024...?)

-Curt





Re: Documentation of Breaking Changes

2023-11-21 Thread Antoine Pitrou



Hello,

Le 21/11/2023 à 22:59, Chris Thomas a écrit :


I apologize if this is not the appropriate venue for this request; if
that's the case, please let me know where I should be asking:

Earlier this month Dependabot flagged a security vulnerability with PyArrow
which prompted us to do an upgrade from v10 to v14.1 of the software.
Obviously this is a lot of major versions so the upgrade was subjected to a
bunch of tests but, alas, there was a breaking change to the way PyArrow
handled time precision that slipped through the cracks.


Can you explain which change around time precision you observed?

Regards

Antoine.


Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs

2023-11-20 Thread Antoine Pitrou



I also agree that an informal spec "how to efficiently transfer Arrow 
data over HTTP" makes sense.


Probably with several aspects:
- one-shot GET data
- streaming GET
- one-shot PUT or POST
- streaming POST
- non-Arrow prologue and epilogue (for example JSON-based metadata)
- conventions for well-known headers


Le 20/11/2023 à 15:23, David Li a écrit :

I'm with Kou: what exactly are we trying to specify?

- The HTTP mapping of Flight RPC?
- A full, locked down RPC framework like Flight RPC, but otherwise unrelated?
- Something else?

I'd also ask: do we need to specify anything in the first place? What is stopping people 
from using Arrow in their REST APIs, and what kind of interoperability are we trying to 
achieve? I would say that Flight RPC effectively has no interoperability at all - each 
project using it has its own bespoke layers on top, and the "standardized" RPC 
methods just hinder the applications that would like more control and flexibility that 
Flight RPC does not provide. The recent additions to the Flight RPC spec speak to that: 
they were meant for Flight SQL, but needed to be implemented at the Flight RPC layer; 
there is not a real abstraction layer that Flight RPC really serves.


It could consist only of a specification for how to implement
support for exchanging Arrow-formatted data in an existing REST API.


I would say that this is the only part that might make sense: once a client has 
acquired an Arrow-aware endpoint, what should be the format of the Arrow data 
it gets (whether this is just the Arrow stream format, or something fancier 
like FlightData in Flight RPC).

Separately, it might make sense to define how GraphQL works with Arrow, or 
other specific, full protocols/APIs. But I'm not sure there's much room for a 
Flight RPC equivalent for HTTP/1, if Flight RPC on its own really ever made 
sense as a full framework/protocol in the first place.

On Sat, Nov 18, 2023, at 14:17, Gavin Ray wrote:

I know that myself and a number of folks I work with would be interested in
this.

gRPC is a bit of a barrier for a lot of services.
Having a spec for doing Arrow over HTTP API's would be solid.

In my opinion, it doesn't necessarily need to be REST-ful.
Something like JSON-RPC might fit well with the existing model for Arrow
over the wire that's been implemented in things like Flight/FlightSQL.

Something else I've been interested in (I think Matt Topol has done work in
this area) is Arrow over GraphQL, too:
GraphQL and Apache Arrow: A Match Made in Data (youtube.com)


On Sat, Nov 18, 2023 at 1:52 PM Ian Cook  wrote:


Hi Kou,

I think it is too early to make a specific proposal. I hope to use this
discussion to collect more information about existing approaches. If
several viable approaches emerge from this discussion, then I think we
should make a document listing them, like you suggest.

Thank you for the information about Groonga. This type of straightforward
HTTP-based approach would work in the context of a REST API, as I
understand it.

But how is the performance? Have you measured the throughput of this
approach to see if it is comparable to using Flight SQL? Is this approach
able to saturate a fast network connection?

And what about the case in which the server wants to begin sending batches
to the client before the total number of result batches / records is known?
Would this approach work in that case? I think so but I am not sure.

If this HTTP-based type of approach is sufficiently performant and it works
in a sufficient proportion of the envisioned use cases, then perhaps the
proposed spec / protocol could be based on this approach. If so, then we
could refocus this discussion on which best practices to incorporate /
recommend, such as:
- server should not return the result data in the body of a response to a
query request; instead server should return a response body that gives
URI(s) at which clients can GET the result data
- transmit result data in chunks (Transfer-Encoding: chunked), with
recommendations about chunk size
- support range requests (Accept-Range: bytes) to allow clients to request
result ranges (or not?)
- recommendations about compression
- recommendations about TCP receive window size
- recommendation to open multiple TCP connections on very fast networks
(e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck

On the other hand, if the performance and functionality of this HTTP-based
type of approach is not sufficient, then we might consider fundamentally
different approaches.

Ian
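As a companion to the list of practices above (see the first bullet about returning
result URIs), a minimal sketch of the consumer side; the URL is hypothetical:

```
import urllib.request
import pyarrow as pa

# GET a result URI previously returned by the query endpoint, and read the
# response body directly as an Arrow IPC stream.
with urllib.request.urlopen("http://localhost:8000/results/123") as resp:
    table = pa.ipc.open_stream(resp).read_all()
print(table.schema)
```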



Re: [ANNOUNCE] New Arrow PMC member: Raúl Cumplido

2023-11-13 Thread Antoine Pitrou



Welcome Raul, we're glad to have you!

Regards

Antoine.


Le 13/11/2023 à 20:27, Andrew Lamb a écrit :

The Project Management Committee (PMC) for Apache Arrow has invited
Raúl Cumplido  to become a PMC member and we are pleased to announce
that  Raúl Cumplido has accepted.

Please join me in congratulating them.

Andrew



Re: decimal64

2023-11-09 Thread Antoine Pitrou



I would say no, because first-class float16 semantics are more easily 
provided on first-class data type.


Regards

Antoine.


Le 09/11/2023 à 18:38, Curt Hagenlocher a écrit :

It certainly could be. Would float16 be done as a canonical extension type
if it were proposed today?

On Thu, Nov 9, 2023 at 9:36 AM David Li  wrote:


cuDF has decimal32/decimal64 [1].

Would a canonical extension type [2] be appropriate here? I think that's
come up as a solution before.

[1]: https://docs.rapids.ai/api/cudf/stable/user_guide/data-types/
[2]: https://arrow.apache.org/docs/format/CanonicalExtensions.html

On Thu, Nov 9, 2023, at 11:56, Antoine Pitrou wrote:

Or they could trivially use a int64 column for that, since the scale is
fixed anyway, and you're probably not going to multiply money values
together.


Le 09/11/2023 à 17:54, Curt Hagenlocher a écrit :

If Arrow had a decimal64 type, someone could choose to use that for a
PostgreSQL money column knowing that there are edge cases where they may
get an undesired result.

On Thu, Nov 9, 2023 at 8:42 AM Antoine Pitrou 

wrote:




Le 09/11/2023 à 17:23, Curt Hagenlocher a écrit :

Or more succinctly, "111,111,111,111,111." will fit into a decimal64; would
you prevent it from being stored in one so that you can describe the column
as "decimal(18, 4)"?


That's what we do for other decimal types, see PyArrow below:
```
   >>> pa.array([111_111_111_111_111_]).cast(pa.decimal128(18, 0))
Traceback (most recent call last):
 [...]
ArrowInvalid: Precision is not great enough for the result. It should be at least 19
```










Re: decimal64

2023-11-09 Thread Antoine Pitrou



If we want to provide useful arithmetic and conversions, then full-blown 
decimal64 (and perhaps decimal32) is warranted.


If we want to easily expose and roundtrip PostgreSQL's fixed-scale money 
type with full binary precision, then I agree a canonical extension type 
is the way.


And we can of course do both.

Sidenote: I haven't seen many proposals for canonical extension types 
until now, which is a bit surprising. The barrier for standardizing a 
canonical extension type is much lower than for a new Arrow data type.


Regards

Antoine.


Le 09/11/2023 à 18:35, David Li a écrit :

cuDF has decimal32/decimal64 [1].

Would a canonical extension type [2] be appropriate here? I think that's come 
up as a solution before.

[1]: https://docs.rapids.ai/api/cudf/stable/user_guide/data-types/
[2]: https://arrow.apache.org/docs/format/CanonicalExtensions.html

On Thu, Nov 9, 2023, at 11:56, Antoine Pitrou wrote:

Or they could trivially use a int64 column for that, since the scale is
fixed anyway, and you're probably not going to multiply money values
together.


Le 09/11/2023 à 17:54, Curt Hagenlocher a écrit :

If Arrow had a decimal64 type, someone could choose to use that for a
PostgreSQL money column knowing that there are edge cases where they may
get an undesired result.

On Thu, Nov 9, 2023 at 8:42 AM Antoine Pitrou  wrote:



Le 09/11/2023 à 17:23, Curt Hagenlocher a écrit :

Or more succinctly, "111,111,111,111,111." will fit into a decimal64; would
you prevent it from being stored in one so that you can describe the column
as "decimal(18, 4)"?


That's what we do for other decimal types, see PyArrow below:
```
   >>> pa.array([111_111_111_111_111_]).cast(pa.decimal128(18, 0))
Traceback (most recent call last):
 [...]
ArrowInvalid: Precision is not great enough for the result. It should be
at least 19
```






Re: decimal64

2023-11-09 Thread Antoine Pitrou



Or they could trivially use an int64 column for that, since the scale is 
fixed anyway, and you're probably not going to multiply money values 
together.
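A tiny sketch of that int64 alternative, with made-up values and the scale fixed at 4
by convention:

```
import pyarrow as pa
import pyarrow.compute as pc

# 12_345_678 stands for 1234.5678; the scale (4) is fixed by convention.
amounts = pa.array([12_345_678, 9_900_000], type=pa.int64())
total = pc.sum(amounts).as_py()  # exact integer arithmetic
print(total / 10_000)            # rescale only for display: 2224.5678
```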



Le 09/11/2023 à 17:54, Curt Hagenlocher a écrit :

If Arrow had a decimal64 type, someone could choose to use that for a
PostgreSQL money column knowing that there are edge cases where they may
get an undesired result.

On Thu, Nov 9, 2023 at 8:42 AM Antoine Pitrou  wrote:



Le 09/11/2023 à 17:23, Curt Hagenlocher a écrit :

Or more succinctly, "111,111,111,111,111." will fit into a decimal64; would
you prevent it from being stored in one so that you can describe the column
as "decimal(18, 4)"?


That's what we do for other decimal types, see PyArrow below:
```
  >>> pa.array([111_111_111_111_111_]).cast(pa.decimal128(18, 0))
Traceback (most recent call last):
[...]
ArrowInvalid: Precision is not great enough for the result. It should be
at least 19
```






Re: decimal64

2023-11-09 Thread Antoine Pitrou



Le 09/11/2023 à 17:23, Curt Hagenlocher a écrit :

Or more succinctly,
"111,111,111,111,111." will fit into a decimal64; would you prevent it
from being stored in one so that you can describe the column as
"decimal(18, 4)"?


That's what we do for other decimal types, see PyArrow below:
```
>>> pa.array([111_111_111_111_111_]).cast(pa.decimal128(18, 0))
Traceback (most recent call last):
  [...]
ArrowInvalid: Precision is not great enough for the result. It should be 
at least 19

```



Re: [VOTE][FORMAT] Bulk ingestion support for Flight SQL

2023-11-09 Thread Antoine Pitrou



For the record, the correct PR link seems to be 
https://github.com/apache/arrow/pull/38385



Le 08/11/2023 à 21:49, David Li a écrit :

Hello,

Joel Lubi has proposed adding bulk ingestion support to Arrow Flight SQL [1]. 
This provides a path for uploading an Arrow dataset to a Flight SQL server to 
create or append to a table, without having to know the specifics of the SQL or 
Substrait support on the server. The functionality mimics similar functionality 
in ADBC.

Joel has provided reference implementations of this for C++ and Go at [2], 
along with an integration test.

The vote will be open for at least 72 hours.

[ ] +1 Accept this proposal
[ ] +0
[ ] -1 Do not accept this proposal because...

[1]: https://lists.apache.org/thread/mo98rsh20047xljrbfymrks8f2ngn49z
[2]: https://github.com/apache/arrow/pull/38256

Thanks,
David


CVE-2023-47248: PyArrow, PyArrow: Arbitrary code execution when loading a malicious data file

2023-11-08 Thread Antoine Pitrou
Severity: critical

Affected versions:

- PyArrow 0.14.0 through 14.0.0

Description:

Deserialization of untrusted data in IPC and Parquet readers in PyArrow 
versions 0.14.0 to 14.0.0 allows arbitrary code execution. An application is 
vulnerable if it reads Arrow IPC, Feather or Parquet data from untrusted 
sources (for example user-supplied input files).

This vulnerability only affects PyArrow, not other Apache Arrow implementations 
or bindings.

It is recommended that users of PyArrow upgrade to 14.0.1. Similarly, it is 
recommended that downstream libraries upgrade their dependency requirements to 
PyArrow 14.0.1 or later. PyPI packages are already available, and we hope that 
conda-forge packages will be available soon.

If it is not possible to upgrade, we provide a separate package 
`pyarrow-hotfix` that disables the vulnerability on older PyArrow versions. See 
 https://pypi.org/project/pyarrow-hotfix/  for instructions.
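The gist of those instructions is a one-line import after installing the package (see
the linked page for the authoritative details):

```
# pip install pyarrow-hotfix
import pyarrow          # an affected version, e.g. 13.0.0
import pyarrow_hotfix   # noqa: F401 -- importing it disables the unsafe code path
```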

References:

https://arrow.apache.org/
https://www.cve.org/CVERecord?id=CVE-2023-47248



Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou



Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit :

Is this buffer lengths buffer only present if the array type is Utf8View?


IIUC, the proposal would add the buffer lengths buffer for all types if the
schema's
flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid
the special case and that `n_buffers` would continue to be consistent with
IPC.


This begs the question of what happens if a consumer receives an unknown 
flag value. We haven't specified that unknown flag values should be 
ignored, so a consumer could judiciously choose to error out instead of 
potentially misinterpreting the data.


All in all, personally I'd rather we make a special case for Utf8View 
instead of adding a flag that can lead to worse interoperability.


Regards

Antoine.


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou



Le 26/10/2023 à 18:59, Dewey Dunnington a écrit :



That sounds a bit hackish to me.


Including only *some* buffer sizes in array->buffers[array->n_buffers]
special-cased for only two types (or altering the number of buffers
required by the IPC format vs. the number of buffers required by the C
Data interface) seem equally hackish to me (not that I'm opposed to
either necessarily...the alternatives really are very bad).


I think the plan for Utf8View is that `n_buffers` is incremented to 
reflect that additional buffer at the end.



How can you *not* care about buffer sizes, if you for example need to send the 
buffers over IPC?


I think IPC is the *only* operation that requires that information?
(Other than perhaps copying to another device?) I don't think there's
any barrier to accessing the content of all the array elements but I
could be mistaken.


That's true, but IPC is implemented by all major Arrow implementations, 
AFAIK :-)


Regards

Antoine.


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou



Le 26/10/2023 à 17:45, Dewey Dunnington a écrit :
The lack of buffer sizes is something that has come up for me a few 
times working with nanoarrow (which dedicates a significant amount of 
code to calculating buffer sizes, which it uses to do validation and 
more efficient copying).


By the way, this is a bit surprising since it's really 35 lines of code 
in C++ currently:


https://github.com/apache/arrow/blob/57f643c2cecca729109daae18c7a64f3a37e76e4/cpp/src/arrow/c/bridge.cc#L1721-L1754

I expect C code to not be much longer than this :-)

Regards

Antoine.


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou



Le 26/10/2023 à 17:45, Dewey Dunnington a écrit :
> A potential alternative might be to allow any ArrowArray to declare
> its buffer sizes in array->buffers[array->n_buffers], perhaps with a
> new flag in schema->flags to advertise that capability.

That sounds a bit hackish to me.

I'd rather live with the current arrangement for now. Once enough griefs 
with the C Data Interface accumulate, it will be time to think about a 
new specification.



We might want to keep the variadic buffers at the end and instead export
the buffer sizes as buffer #2? Though that's mostly stylistic...


I would prefer the buffer sizes to be after as it preserves the
connection between Columnar/IPC format and the C Data interface...the
need for buffer_sizes is more of a convenience for implementations
that care about this kind of thing than something inherent to the
array data.


How can you *not* care about buffer sizes, if you for example need to 
send the buffers over IPC?


Regards

Antoine.


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-25 Thread Antoine Pitrou



Hello,

We might want to keep the variadic buffers at the end and instead export 
the buffer sizes as buffer #2? Though that's mostly stylistic...


Regards

Antoine.


Le 25/10/2023 à 18:36, Benjamin Kietzman a écrit :

Hello all,

The C ABI does not store buffer lengths explicitly, which presents a
problem for Utf8View since buffer lengths are not trivially extractable
from other data in the array. A potential solution is to store the lengths
in an extra buffer after the variadic data buffers. I've adopted this
approach in my (currently draft) PR [1] to add c++ library import/export
for Utf8View, but I thought this warranted raising on the ML in case anyone
has a better idea.

Sincerely,
Ben Kietzman

[1]
https://github.com/bkietz/arrow/compare/37710-cxx-impl-string-view..36099-string-view-c-abi#diff-3907fc8e8c9fa4ed7268f6baa5b919e8677fb99947b7384a9f8f001174ab66eaR549
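To make the proposed layout concrete, here is a hedged sketch of how an
importer might locate that trailing buffer (illustrative only; the exact
convention is what this thread is meant to settle):

#include <cstdint>

// Assumed buffer order for an exported Utf8View array under this proposal:
//   buffers[0]                validity bitmap (may be absent)
//   buffers[1]                16-byte view structs
//   buffers[2 .. n_buffers-2] variadic character data buffers
//   buffers[n_buffers-1]      int64 lengths of the variadic data buffers
const int64_t* VariadicBufferLengths(const void** buffers, int64_t n_buffers,
                                     int64_t* n_variadic_out) {
  *n_variadic_out = n_buffers - 3;  // minus validity, views, lengths buffers
  return static_cast<const int64_t*>(buffers[n_buffers - 1]);
}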



Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-23 Thread Antoine Pitrou



Welcome Xuwei!


Le 23/10/2023 à 05:28, Sutou Kouhei a écrit :

On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu
has accepted an invitation to become a committer on Apache
Arrow. Welcome, and thank you for your contributions!



Re: [Format] C Data Interface integration testing

2023-10-19 Thread Antoine Pitrou



Hello again,

Quick update: the C++, C#, Go and Java implementations now all 
participate in C Data Interface integration testing.


(this helped us fix a few interoperability bugs, and add deterministic 
releasing of imported data in Go)


Arrow Rust does not participate yet, but given how active the community 
is being, I'm reasonably confident that they'll come to it soon :)


Regards

Antoine.


Le 26/09/2023 à 14:46, Antoine Pitrou a écrit :


Hello,

We have added some infrastructure for integration testing of the C Data
Interface between Arrow implementations. We are now testing the C++ and
Go implementations, but the goal in the future is for all major
implementations to be tested there (perhaps including nanoarrow).

- PR to add the testing infrastructure and enable the C++ implementation:
https://github.com/apache/arrow/pull/37769

- PR to enable the Go implementation
https://github.com/apache/arrow/pull/37788

Feel free to ask any questions.

Regards

Antoine.





Re: Apache Arrow file format

2023-10-18 Thread Antoine Pitrou



The fact that they describe Arrow and Feather as distinct formats 
(they're not!) with different characteristics is a bit of a bummer.



Le 18/10/2023 à 22:20, Andrew Lamb a écrit :

If you are looking for a more formal discussion and empirical analysis of
the differences, I suggest reading "A Deep Dive into Common Open Formats
for Analytical DBMSs" [1], a VLDB 2023 paper (runner-up for best paper!)
that compares and contrasts the Arrow, Parquet, ORC and Feather file formats.

[1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf

On Wed, Oct 18, 2023 at 10:10 AM Raphael Taylor-Davies
 wrote:


To further what others have already mentioned, the IPC file format is
primarily optimised for IPC use-cases, that is exchanging the entire
contents between processes. It is relatively inexpensive to encode and
decode, and supports all arrow datatypes, making it ideal for things
like spill-to-disk processing, distributed shuffles, etc...

Parquet by comparison is a storage format, optimised for space
efficiency and selective querying, with [1] containing an overview of
the various techniques the format affords. It is comparatively expensive
to encode and decode, and instead relies on index structures and
statistics to accelerate access.

Both are therefore perfectly viable options depending on your particular
use-case.

[1]:

https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/

On 18/10/2023 13:59, Dewey Dunnington wrote:

Plenty of opinions here already, but I happen to think that IPC
streams and/or Arrow File/Feather are wildly underutilized. For the
use-case where you're mostly just going to read an entire file into R
or Python it's a bit faster (and far superior to a CSV or pickling or
.rds files in R).


you're going to read all the columns for a record batch in the file, no
matter what

The metadata for every column in every record batch has to be read, but
there's nothing inherent about the format that prevents selectively
loading into memory only the required buffers. (I don't know off the top
of my head if any reader implementation actually does this).

On Wed, Oct 18, 2023 at 12:02 AM wish maple 

wrote:

Arrow IPC files are great: they focus on the in-memory representation and
direct computation. Basically, the format supports compression and
dictionary encoding, and the file can be zero-copy deserialized to the
in-memory Arrow format.

Parquet provides some strong functionality, like statistics, which can help
prune unnecessary data during scanning and avoid CPU and IO cost. And it
has highly efficient encodings, which can make a Parquet file smaller than
the Arrow IPC file for the same data. However, some Arrow data types
currently cannot be converted to corresponding Parquet types in the
arrow-cpp implementation; you can take a look at the Arrow documentation
for details.

Adam Lippai  于2023年10月18日周三 10:50写道:


Also there is https://github.com/lancedb/lance between the two formats.
Depending on the use case it can be a great choice.

Best regards
Adam Lippai

On Tue, Oct 17, 2023 at 22:44 Matt Topol 

wrote:



One benefit of the feather format (i.e. Arrow IPC file format) is the
ability to mmap the file to easily handle reading sections of a larger
than memory file of data. Since, as Felipe mentioned, the format is
focused on in-memory representation, you can easily and simply mmap the
file and use the raw bytes directly. For a large file that you only want
to read sections of, this can be beneficial for IO and memory usage.
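For instance, a minimal sketch of that access pattern with the Arrow C++
APIs (assuming arrow::io::MemoryMappedFile and
arrow::ipc::RecordBatchFileReader; error handling kept to the macros):

#include <arrow/io/file.h>
#include <arrow/ipc/reader.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>

#include <cstdint>
#include <string>

arrow::Status ReadMappedFeather(const std::string& path) {
  // Map the file instead of reading it into memory up front.
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::MemoryMappedFile::Open(
                                       path, arrow::io::FileMode::READ));
  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::ipc::RecordBatchFileReader::Open(file));
  int64_t total_rows = 0;
  for (int i = 0; i < reader->num_record_batches(); ++i) {
    // The batch's buffers point into the mapped region; the OS pages them
    // in lazily as they are touched (zero-copy).
    ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
    total_rows += batch->num_rows();
  }
  return arrow::Status::OK();
}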

Unfortunately, you are correct that it doesn't allow for easy column
projecting (you're going to read all the columns for a record batch in the
file, no matter what). So it's going to be a trade off based on your needs
as to whether it makes sense, or if you should use a file format like
Parquet instead.

-Matt


On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <
felipe...@gmail.com>
wrote:


It’s not the best since the format is really focused on in-memory
representation and direct computation, but you can do it:

https://arrow.apache.org/docs/python/feather.html

—
Felipe

On Tue, 17 Oct 2023 at 23:26 Nara 

wrote:

Hi,

Is it a good idea to use Apache Arrow as a file format? Looks like
projecting columns isn't available by default.

One of the benefits of Parquet file format is column projection, where the
IO is limited to just the columns projected.

Regards ,
Nara







Re: [VOTE][Format] C data interface format strings for Utf8View and BinaryView

2023-10-18 Thread Antoine Pitrou

+1

Le 18/10/2023 à 19:02, Benjamin Kietzman a écrit :

Hello all,

I propose "vu" and "vz" as format strings for the Utf8View and
BinaryView types in the Arrow C data interface [1].

The vote will be open for at least 72 hours.

[ ] +1 - I'm in favor of these new C data format strings
[ ] +0
[ ] -1 - I'm against adding these new format strings because

Ben Kietzman

[1] https://arrow.apache.org/docs/format/CDataInterface.html



Re: [ANNOUNCE] New Arrow PMC member: Jonathan Keane

2023-10-14 Thread Antoine Pitrou



Welcome to the PMC, Jon!

Le 14/10/2023 à 19:42, David Li a écrit :

Congrats Jon!

On Sat, Oct 14, 2023, at 13:25, Ian Cook wrote:

Congratulations Jonathan!

On Sat, Oct 14, 2023 at 13:24 Andrew Lamb  wrote:


The Project Management Committee (PMC) for Apache Arrow has invited
Jonathan Keane to become a PMC member and we are pleased to announce
that Jonathan Keane has accepted.

Congratulations and welcome!

Andrew



Re: [DISCUSS][Swift] repo for swift similar to arrow-rs

2023-10-11 Thread Antoine Pitrou



Is Arrow C++ the only reusable codebase for that or could the work be 
based on e.g. Arrow Rust instead?


I don't know the interoperability story of the Swift language, so there 
might be something that favors C++ here.


Regards

Antoine.


Le 11/10/2023 à 02:43, David Li a écrit :

I'm -0 on this without more reasoning. I don't think a large download is a 
compelling reason to split the repo, and being in the same repo doesn't mean 
you have to take a dependency on the C++ implementation. (Plus, unless there is 
enough of a community to replicate all the work done for C++ I suspect you will 
want access to Parquet, Dataset, Acero, etc.)

On Tue, Oct 10, 2023, at 17:24, Jacob Wujciak-Jens wrote:

+1 on Dewey's sentiment.

With regards to technicalities:
- a PMC member can create the repo via ASF's gitbox (I assume
'arrow-swift'?)
- the repo then needs to be configured using the '.asf.yaml'
   - which merge styles are allowed
   - branch protection rules
   - to which ml should notifications be sent
   - see [1] for more features
- CI
- PR/Issue template
- ...

What is the usual versioning scheme for swift projects and what release
cadence are you planning?

Best
Jacob


On Tue, Oct 10, 2023 at 10:25 PM Dewey Dunnington
 wrote:


Hi Alva,

I would encourage you to do whatever will make life more pleasant for
you and other contributors to the Swift Arrow implementation. I have
found development of an Arrow subproject (nanoarrow) in a separate
repository very pleasant. While I don't run integration tests there,
it's not because of any technical limitation (instead of pulling one
repo in your CI job, just pull two).

For the R bindings to Arrow, which do depend on the C++ bindings, we
do have some benefit because Arrow C++ changes that break R tend to
get fixed by the C++ contributor in their PR, rather than that
responsibility always falling on us. That said, it doesn't happen very
often, and we have informally toyed with the idea of moving out of the
monorepo to make it less intimidating for outside contributors.

Cheers,

-dewey

On Tue, Oct 10, 2023 at 2:33 PM Antoine Pitrou  wrote:



Hi Alva,

I'll let others give their opinions on the repo.

Regards

Antoine.


Le 10/10/2023 à 19:25, Alva Bandy a écrit :

Hi Antoine,

Thanks for the reply.

It would be great to get the Swift implementation added to the integration
test.  I have a task for adding the C Data Interface and I will work on
getting the integration test running for Swift after that task.  Can we
move forward with setting up the repo as long as there is a task/issue to
ensure the integration test will be run against Swift soon or would this
be a blocker?

Also, I am not sure about Julia, I have not looked into Julia’s
implementation.


Thank you,
Alva Bandy

On 2023/10/10 08:54:30 Antoine Pitrou wrote:


Hello Alva,

This is a reasonable request, but it might come with its own drawbacks
as well.

One significant drawback is that adding the Swift implementation to the
cross-implementation integration tests will be slightly more complicated.
It is very important that all Arrow implementations are
integration-tested against each other, otherwise we only have a
theoretical guarantee that they are compatible. See how this is done here:

https://arrow.apache.org/docs/dev/format/Integration.html

Unless I'm mistaken, neither Swift nor Julia are running the integration
tests.

Regards

Antoine.



Le 09/10/2023 à 22:26, Alva Bandy a écrit :

Hi,

I would like to request a repo for Arrow Swift (similar to arrow-rs).
Swift arrow is currently fully Swift and doesn't leverage the C++
libraries. One of the goals of Arrow Swift was to provide a fully Swift
impl and splitting them now would help ensure that Swift Arrow stays on
this path.

Also, the Swift Package Manager uses a git repo url to pull down a
package.  This can lead to a large download since the entire arrow repo
will be pulled down just to include Arrow Swift.  It would be great to
make this change before registering Swift Arrow with a Swift registry
(such as Swift Package Registry).

Please let me know if this is possible and if so, what would be the
process going forward.


Thank you,
Alva Bandy





Re: [DISCUSS][Swift] repo for swift similar to arrow-rs

2023-10-10 Thread Antoine Pitrou



Hi Alva,

I'll let others give their opinions on the repo.

Regards

Antoine.


Le 10/10/2023 à 19:25, Alva Bandy a écrit :

Hi Antoine,

Thanks for the reply.

It would be great to get the Swift implementation added to the integration 
test.  I have a task for adding the C Data Interface and I will work on getting 
the integration test running for Swift after that task.  Can we move forward 
with setting up the repo as long as there is a task/issue to ensure the 
integration test will be run against Swift soon or would this be a blocker?

Also, I am not sure about Julia, I have not looked into Julia’s implementation.

Thank you,
Alva Bandy

On 2023/10/10 08:54:30 Antoine Pitrou wrote:


Hello Alva,

This is a reasonable request, but it might come with its own drawbacks
as well.

One significant drawback is that adding the Swift implementation to the
cross-implementation integration tests will be slightly more complicated.
It is very important that all Arrow implementations are
integration-tested against each other, otherwise we only have a
theoretical guarantee that they are compatible. See how this is done here:
https://arrow.apache.org/docs/dev/format/Integration.html

Unless I'm mistaken, neither Swift nor Julia are running the integration
tests.

Regards

Antoine.



Le 09/10/2023 à 22:26, Alva Bandy a écrit :

Hi,

I would like to request a repo for Arrow Swift (similar to arrow-rs).  Swift 
arrow is currently fully Swift and doesn't leverage the C++ libraries. One of 
the goals of Arrow Swift was to provide a fully Swift impl and splitting them 
now would help ensure that Swift Arrow stays on this path.

Also, the Swift Package Manager uses a git repo url to pull down a package.  
This can lead to a large download since the entire arrow repo will be pulled 
down just to include Arrow Swift.  It would be great to make this change before 
registering Swift Arrow with a Swift registry (such as Swift Package Registry).

Please let me know if this is possible and if so, what would be the process 
going forward.

Thank you,
Alva Bandy



Re: [DISCUSS][Swift] repo for swift similar to arrow-rs

2023-10-10 Thread Antoine Pitrou



Hello Alva,

This is a reasonable request, but it might come with its own drawbacks 
as well.


One significant drawback is that adding the Swift implementation to the 
cross-implementation integration tests will be slightly more complicated.
It is very important that all Arrow implementations are 
integration-tested against each other, otherwise we only have a 
theoretical guarantee that they are compatible. See how this is done here:

https://arrow.apache.org/docs/dev/format/Integration.html

Unless I'm mistaken, neither Swift nor Julia are running the integration 
tests.


Regards

Antoine.



Le 09/10/2023 à 22:26, Alva Bandy a écrit :

Hi,

I would like to request a repo for Arrow Swift (similar to arrow-rs).  Swift 
arrow is currently fully Swift and doesn't leverage the C++ libraries. One of 
the goals of Arrow Swift was to provide a fully Swift impl and splitting them 
now would help ensure that Swift Arrow stays on this path.

Also, the Swift Package Manager uses a git repo url to pull down a package.  
This can lead to a large download since the entire arrow repo will be pulled 
down just to include Arrow Swift.  It would be great to make this change before 
registering Swift Arrow with a Swift registry (such as Swift Package Registry).

Please let me know if this is possible and if so, what would be the process 
going forward.

Thank you,
Alva Bandy



Re: [Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-07 Thread Antoine Pitrou



+1 from me.

But I also reiterate my plea that these existing parsers get fixed so as 
to entirely validate the format string instead of stopping early.
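To illustrate what full validation would look like, a minimal sketch (not
Arrow library code) that only accepts a list-like format string when the
whole string matches:

#include <cstring>

enum class ListKind { List, LargeList, ListView, LargeListView, Invalid };

ListKind ParseListFormat(const char* fmt) {
  if (std::strcmp(fmt, "+l") == 0) return ListKind::List;
  if (std::strcmp(fmt, "+L") == 0) return ListKind::LargeList;
  if (std::strcmp(fmt, "+vl") == 0) return ListKind::ListView;
  if (std::strcmp(fmt, "+vL") == 0) return ListKind::LargeListView;
  // A parser that stops at the first 'l' would silently accept strings
  // such as "+lv" or "+lx" as a plain list; rejecting them is the point.
  return ListKind::Invalid;
}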


Regards

Antoine.


Le 06/10/2023 à 23:26, Felipe Oliveira Carvalho a écrit :

Hello,

I'm writing to propose "+vl" and "+vL" as format strings for list-view and
large list-view arrays passing through the Arrow C data interface [1].

The previous proposal was considered a bad idea because existing parsers of
these format strings might be looking at only the first `l` (or `L`) after
the `+` and assuming the classic list format from that alone, so now I'm
proposing we start with a `+v` as this prefix is not shared with any other
existing type so far.

The vote will be open for at least 72 hours.

[ ] +1 - I'm in favor of this new C Data Format string
[ ] +0
[ ] -1 - I'm against adding this new format string because

Thanks everyone!

--
Felipe

[1] https://arrow.apache.org/docs/format/CDataInterface.html



Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-06 Thread Antoine Pitrou




Le 06/10/2023 à 17:54, Felipe Oliveira Carvalho a écrit :

Hello,

Since existing C Data Interface integrations sometimes don't parse beyond
the first `l` (or `L`) I'm going to start a new [VOTE] thread with Dewey's
suggestion:


Regardless of which format string we choose for ListView, a bug should 
certainly be reported to these implementations. A robust implementation 
should ensure the imported format string is conformant, otherwise there 
is a risk that the exporter actually meant something else.


Regards

Antoine.




+vl and +vL

If anyone objects to that and has a different suggestion, reply here so I
don't have to spam the list with too many new threads.

--
Felipe

On Thu, Oct 5, 2023 at 6:49 PM Dewey Dunnington
 wrote:


I won't belabour the point any more, but the difference in layout
between a list and a list view is consequential enough to deserve its
own top-level character in my opinion. My vote would be +1 for +vl and
+vL.

On Thu, Oct 5, 2023 at 6:40 PM Felipe Oliveira Carvalho
 wrote:



Union format strings share enough properties that having them in the
same switch case doesn't result in additional complexity...lists and
list views are completely different types (for the purposes of parsing
the format string).


Dense and sparse union differ a bit more than list and list-view.

Not starting with `+l` for list-views would be a deviation from this
pattern started by unions.



+------------------+------------------------------------+-------+
| ``+ud:I,J,...``  | dense union with type ids I,J...   |       |
+------------------+------------------------------------+-------+
| ``+us:I,J,...``  | sparse union with type ids I,J...  |       |
+------------------+------------------------------------+-------+


Is sharing prefixes an issue?

To make this more concrete, these are the parser changes for supporting
`+lv` and `+Lv` as I proposed in the beginning:

@@ -1097,9 +1101,9 @@ struct SchemaImporter {
     RETURN_NOT_OK(f_parser_.CheckHasNext());
     switch (f_parser_.Next()) {
       case 'l':
-        return ProcessListLike<ListType>();
+        return ProcessVarLengthList<false>();
       case 'L':
-        return ProcessListLike<LargeListType>();
+        return ProcessVarLengthList<true>();
       case 'w':
         return ProcessFixedSizeList();
       case 's':
@@ -1195,12 +1199,30 @@ struct SchemaImporter {
     return CheckNoChildren(type);
   }

-  template <typename ListType>
-  Status ProcessListLike() {
-    RETURN_NOT_OK(f_parser_.CheckAtEnd());
-    RETURN_NOT_OK(CheckNumChildren(1));
-    ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
-    type_ = std::make_shared<ListType>(field);
+  template <bool is_large_variation>
+  Status ProcessVarLengthList() {
+    if (f_parser_.AtEnd()) {
+      RETURN_NOT_OK(CheckNumChildren(1));
+      ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
+      if constexpr (is_large_variation) {
+        type_ = large_list(field);
+      } else {
+        type_ = list(field);
+      }
+    } else {
+      if (f_parser_.Next() == 'v') {
+        RETURN_NOT_OK(CheckNumChildren(1));
+        ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
+        if constexpr (is_large_variation) {
+          type_ = large_list_view(field);
+        } else {
+          type_ = list_view(field);
+        }
+      } else {
+        return f_parser_.Invalid();
+      }
+    }
+
     return Status::OK();
   }

--
Felipe


On Thu, Oct 5, 2023 at 5:26 PM Antoine Pitrou 

wrote:




I don't think the parsing will be a problem even in C. It's not like you
have to backtrack anyway.

+1 from me on Felipe's proposal.

Regards

Antoine.


Le 05/10/2023 à 20:33, Felipe Oliveira Carvalho a écrit :

This mailing list thread is going to be the discussion.

The union types also use two characters, so I didn’t think it would be a
problem.

—
Felipe

On Thu, 5 Oct 2023 at 15:26 Dewey Dunnington



wrote:


I'm sorry for missing earlier discussion on this or a PR into the
format where this discussion may have occurred...is there a reason
that +lv and +Lv were chosen over a single-character version (i.e.,
maybe +v and +V)? A single-character version is (slightly) easier to
parse in C.

On Thu, Oct 5, 2023 at 2:00 PM Felipe Oliveira Carvalho
 wrote:


Hello,

I'm writing to propose "+lv" and "+Lv" as format strings for list-view and
large list-view arrays passing through the Arrow C data interface [1].


The vote will be open for at least 72 hours.

[ ] +1 - I'm in favor of this new C Data Format string
[ ] +0
[ ] -1 - I'm against adding this new format string because

Thanks everyone!

--
Felipe

[1] https://arrow.apache.org/docs/format/CDataInterface.html












Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-05 Thread Antoine Pitrou



I don't think the parsing will be a problem even in C. It's not like you 
have to backtrack anyway.


+1 from me on Felipe's proposal.

Regards

Antoine.


Le 05/10/2023 à 20:33, Felipe Oliveira Carvalho a écrit :

This mailing list thread is going to be the discussion.

The union types also use two characters, so I didn’t think it would be a
problem.

—
Felipe

On Thu, 5 Oct 2023 at 15:26 Dewey Dunnington 
wrote:


I'm sorry for missing earlier discussion on this or a PR into the
format where this discussion may have occurred...is there a reason
that +lv and +Lv were chosen over a single-character version (i.e.,
maybe +v and +V)? A single-character version is (slightly) easier to
parse in C.

On Thu, Oct 5, 2023 at 2:00 PM Felipe Oliveira Carvalho
 wrote:


Hello,

I'm writing to propose "+lv" and "+Lv" as format strings for list-view

and

large list-view arrays passing through the Arrow C data interface [1].

The vote will be open for at least 72 hours.

[ ] +1 - I'm in favor of this new C Data Format string
[ ] +0
[ ] -1 - I'm against adding this new format string because

Thanks everyone!

--
Felipe

[1] https://arrow.apache.org/docs/format/CDataInterface.html






Re: [VOTE] [Format] Add app_metadata to FlightInfo and FlightEndpoint

2023-10-03 Thread Antoine Pitrou



+1 from me. It might be worth spelling out whether any relationship is 
expected between the `app_metadata` for a FlightInfo and any of the 
corresponding `FlightEndpoint`s and `FlightData` chunks.



Le 12/09/2023 à 17:48, Matt Topol a écrit :

Hey all,

I would like to propose adding a new app_metadata field to both the
FlightInfo and FlightEndpoint message types of the Arrow Flight protocol.
There has been discussion of doing so for a while and has now been brought
back up in regards to [1]. More specifically, this enables adding
application defined metadata for FlightSQL (by way of FlightInfo) which can
then be utilized to pass information such as QueryID, QueryCost, etc.

I've put up a PR to add this at [2].

The vote will be open for at least 24 hours:

[ ] +1 Add these fields to the Arrow Flight RPC protocol
[ ] +0
[ ] -1 Do not add these fields to the Arrow Flight RPC protocol because

Thanks much!
--Matt

[1]: https://github.com/apache/arrow/issues/37635
[2]: https://github.com/apache/arrow/pull/37679



Re: [DISCUSS][C++] Raw pointer string views

2023-10-03 Thread Antoine Pitrou



Le 03/10/2023 à 01:36, Matt Topol a écrit :


The cost of conversion is actually significantly higher than the actual
overhead of simply accessing the values in either representation, leading
to a high potential for bottleneck. For systems like Velox and DuckDB where
it's important to be able to return results as fast as possible, if they
have an operation with a throughput of several hundred MB/s or even G/s,
this conversion cost would become a huge bottleneck to returning results
given several cases of converting Raw Pointer views to the offset-based
views go as low as ~22MB/s.


I think you misread the benchmark numbers. It's 22 MItems/s, not 22 MB/s.
Since that number is for the kLongAndSeldomInlineable case, I assume the 
MB/s would be two or three orders of magnitude higher.


Regards

Antoine.


Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Antoine Pitrou



Even if performance were significantly better, I don't think it's a good 
enough reason to add these representations to Arrow. By construction, a 
standard cannot continuously chase the performance state of art, it has 
to weigh the benefits of performance improvements against the increased 
cost for the ecosystem (for example the cost of adapting to frequent 
standard changes and a growing standard size).


We have extension types which could reasonably be used for non-standard 
data types, especially the kind that are motivated by leading-edge 
performance research and innovation and come with unusual constraints 
(such as requiring trusting and dereferencing raw pointers embedded in 
data buffers). There could even be an argument for making some of them 
canonical extension types if there's enough anteriority in favor.


Regards

Antoine.


Le 02/10/2023 à 15:00, Raphael Taylor-Davies a écrit :

I think what would really help would be some concrete numbers, do we
have any numbers comparing the performance of the offset and pointer
based representations? If there isn't a significant performance
difference between them, would the systems that currently use a
pointer-based approach be willing to meet us in the middle and switch to
an offset based encoding? This to me feels like it would be the best
outcome for the ecosystem as a whole.

Kind Regards,

Raphael

On 02/10/2023 13:50, Antoine Pitrou wrote:


Le 01/10/2023 à 16:21, Micah Kornfield a écrit :


I would also assert that another way to reduce this risk is to add
some prose to the relevant sections of the columnar format
specification doc to clearly explain that a raw pointers variant of
the layout, while not part of the official spec, may be implemented in
some Arrow libraries.


I've lost a little context but on all the concerns of adding raw
pointers
as an official option to the spec.  But I see making raw-pointer
variants
the best path forward.

Things captured from this thread or seem obvious at least to me:
1.  Divergence of IPC spec from in-memory/C-ABI spec?
2.  More parts of the spec to cover.
3.  Incompatibility with some languages
4.  Validation (in my mind different use-cases require different
levels of
validation, so this is a little bit less of a concern in my mind).

I think the broader issue is how we think about compatibility with other
systems.  For instance, what happens if Velox and DuckDb start adding
new
divergent memory layouts?  Are we expecting to add them to the spec?


This is a slippery slope. The more Arrow has a policy of integrating
existing practices simply because they exist, the more the Arrow
format will become _à la carte_, with different implementations
choosing to implement whatever they want to spend their engineering
effort on (you can see this occur, in part, on the Parquet format with
its many different encodings, compression algorithms and a 96-bit
timestamp type).

We _have_ to think carefully about the middle- and long-term future of
the format when adopting new features.

In this instance, we are doing a large part of the effort by adopting
a string view format with variadic buffers, inlined prefixes and
offset-based views into those buffers. But some implementations with
historically different internal representations will have to share
part of the effort to align with the newly standardized format.

I don't think "we have to adjust the Arrow format so that existing
internal representations become Arrow-compliant without any
(re-)implementation effort" is a reasonable design principle.

Regards

Antoine.


Re: [VOTE][Format] Add ListView and LargeListView Arrays to Arrow Format

2023-10-02 Thread Antoine Pitrou



Hello,

+1 and thanks for working on this!

There'll probably be some minor comments to the format PR, but those 
don't deter from accepting these new layouts into the standard.


Regards

Antoine.


Le 29/09/2023 à 14:09, Felipe Oliveira Carvalho a écrit :

Hello,

I'd like to propose adding ListView and LargeListView arrays to the Arrow
format.
Previous discussion in [1][2], columnar format description and flatbuffers
changes in [3].

There are implementations available in both C++ [4] and Go [5]. I'm working
on the integration tests which I will push to one of the PR branches before
they are merged. I've made a graph illustrating how this addition affects,
in a backwards compatible way, the type predicates and inheritance chain on
the C++ implementation. [6]

The vote will be open for at least 72 hours not counting the weekend.

[ ] +1 add the proposed ListView and LargeListView types to the Apache
Arrow format
[ ] -1 do not add the proposed ListView and LargeListView types to the
Apache Arrow format
because...

Sincerely,
Felipe

[1] https://lists.apache.org/thread/r28rw5n39jwtvn08oljl09d4q2c1ysvb
[2] https://lists.apache.org/thread/dcwdzhz15fftoyj6xp89ool9vdk3rh19
[3] https://github.com/apache/arrow/pull/37877
[4] https://github.com/apache/arrow/pull/35345
[5] https://github.com/apache/arrow/pull/37468
[6] https://gist.github.com/felipecrv/3c02f3784221d946dec1b031c6d400db



Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Antoine Pitrou



Le 01/10/2023 à 16:21, Micah Kornfield a écrit :


I would also assert that another way to reduce this risk is to add
some prose to the relevant sections of the columnar format
specification doc to clearly explain that a raw pointers variant of
the layout, while not part of the official spec, may be implemented in
some Arrow libraries.


I've lost a little context but on all the concerns of adding raw pointers
as an official option to the spec.  But I see making raw-pointer variants
the best path forward.

Things captured from this thread or seem obvious at least to me:
1.  Divergence of IPC spec from in-memory/C-ABI spec?
2.  More parts of the spec to cover.
3.  Incompatibility with some languages
4.  Validation (in my mind different use-cases require different levels of
validation, so this is a little bit less of a concern in my mind).

I think the broader issue is how we think about compatibility with other
systems.  For instance, what happens if Velox and DuckDb start adding new
divergent memory layouts?  Are we expecting to add them to the spec?


This is a slippery slope. The more Arrow has a policy of integrating 
existing practices simply because they exist, the more the Arrow format 
will become _à la carte_, with different implementations choosing to 
implement whatever they want to spend their engineering effort on (you 
can see this occur, in part, on the Parquet format with its many 
different encodings, compression algorithms and a 96-bit timestamp type).


We _have_ to think carefully about the middle- and long-term future of 
the format when adopting new features.


In this instance, we are doing a large part of the effort by adopting a 
string view format with variadic buffers, inlined prefixes and 
offset-based views into those buffers. But some implementations with 
historically different internal representations will have to share part 
of the effort to align with the newly standardized format.


I don't think "we have to adjust the Arrow format so that existing 
internal representations become Arrow-compliant without any 
(re-)implementation effort" is a reasonable design principle.


Regards

Antoine.


Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Antoine Pitrou



To make things clear, any of the factory functions listed below create a 
type that maps exactly onto an Arrow columnar layout:

https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions

For example, calling `arrow::dictionary` creates a dictionary type that 
exactly represents the dictionary layout specified in 
https://arrow.apache.org/docs/dev/format/Columnar.html#dictionary-encoded-layout


Similarly, if you use any of the builders listed below, what you will 
get at the end is data that complies with the Arrow columnar specification:

https://arrow.apache.org/docs/dev/cpp/api/builder.html

All the core Arrow C++ APIs create and process data which complies with 
the Arrow specification, and which is interoperable with other Arrow 
implementations.
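As a minimal sketch of that point using the public builder API (standard
Arrow C++ calls, error handling abbreviated):

#include <arrow/api.h>

arrow::Result<std::shared_ptr<arrow::Array>> MakeSpecCompliantArray() {
  arrow::StringBuilder builder;
  // Everything appended here ends up in the standard utf8 layout
  // (validity bitmap, int32 offsets, character data).
  ARROW_RETURN_NOT_OK(builder.Append("interoperable"));
  ARROW_RETURN_NOT_OK(builder.AppendNull());
  return builder.Finish();
}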


Conversely, non-Arrow data such as CSV or Parquet (or Python lists, 
etc.) goes through dedicated converters. There is no ambiguity.



Creating top-level utilities that create non-Arrow data introduces 
confusion and ambiguity as to what Arrow is. Users who haven't studied 
the spec in detail - which is probably most users of Arrow 
implementations - will call `arrow::string_view(raw_pointers=true)` and 
might later discover that their data cannot be shared with other 
implementations (or, if it can, there will be an unsuspected conversion 
cost at the edge).


It also creates a risk of introducing a parallel Arrow-like ecosystem 
based on the superset of data layouts understood by Arrow C++. People 
may feel encouraged to code for that ecosystem, pessimizing 
interoperability with non-C++ runtimes.


Which is why I think those APIs, however convenient, also go against the 
overarching goals of the Arrow project.



If we want to keep such convenience APIs as part of Arrow C++, they 
should be clearly flagged as being non-Arrow compliant.


It could be by naming (e.g. `arrow::non_arrow_string_view()`) or by 
specific namespacing (e.g. `non_arrow::raw_pointers_string_view()`).


But, they could be also be provided by a distinct library.

Regards

Antoine.



Le 28/09/2023 à 09:01, Antoine Pitrou a écrit :


Hi Ben,

Le 27/09/2023 à 23:25, Benjamin Kietzman a écrit :


@Antoine

What this PR is creating is an "unofficial" Arrow format, with data

types exposed in Arrow C++ that are not part of the Arrow standard, but
are exposed as if they were.

We already do this in every implementation of the arrow format I'm
aware of: it's more convenient to consider dictionary as a data type
even though the spec says that it is a field property.


I'm not sure I understand your point. Dictionary encoding is part of the
Arrow spec, and considering it as a data type is an API choice that does
not violate the spec.

Raw pointers in string views is just not an Arrow format.

Regards

Antoine.


Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Antoine Pitrou



Hi Ben,

Le 27/09/2023 à 23:25, Benjamin Kietzman a écrit :


@Antoine

What this PR is creating is an "unofficial" Arrow format, with data

types exposed in Arrow C++ that are not part of the Arrow standard, but
are exposed as if they were.

We already do this in every implementation of the arrow format I'm
aware of: it's more convenient to consider dictionary as a data type
even though the spec says that it is a field property.


I'm not sure I understand your point. Dictionary encoding is part of the 
Arrow spec, and considering it as a data type is an API choice that does 
not violate the spec.


Raw pointers in string views is just not an Arrow format.

Regards

Antoine.


Re: [DISCUSS][C++] Raw pointer string views

2023-09-27 Thread Antoine Pitrou



Hello,

What this PR is creating is an "unofficial" Arrow format, with data 
types exposed in Arrow C++ that are not part of the Arrow standard, but 
are exposed as if they were. Most users will probably not read the 
official format spec, but will simply trust the official Arrow 
implementations. So the official Arrow implementations have an 
obligation to faithfully represent the Arrow format and not breed confusion.


So I'm -1 on the way the PR presents things currently.

I'm not sure how DuckDB and Velox data could be exposed, but it could be 
for example an extension type with a fixed_size_binary<16> storage type.
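A hedged sketch of that suggestion, as a hypothetical (non-canonical)
extension type whose storage is fixed_size_binary(16); the name and
semantics below are purely illustrative:

#include <arrow/extension_type.h>
#include <arrow/type.h>

#include <memory>
#include <string>

// Hypothetical extension type wrapping 16-byte raw-pointer views.
class RawStringViewType : public arrow::ExtensionType {
 public:
  RawStringViewType() : arrow::ExtensionType(arrow::fixed_size_binary(16)) {}

  std::string extension_name() const override {
    return "example.raw_string_view";  // illustrative name, not standardized
  }

  bool ExtensionEquals(const arrow::ExtensionType& other) const override {
    return other.extension_name() == extension_name();
  }

  std::shared_ptr<arrow::Array> MakeArray(
      std::shared_ptr<arrow::ArrayData> data) const override {
    return std::make_shared<arrow::ExtensionArray>(data);
  }

  arrow::Result<std::shared_ptr<arrow::DataType>> Deserialize(
      std::shared_ptr<arrow::DataType> storage_type,
      const std::string& serialized) const override {
    return std::make_shared<RawStringViewType>();
  }

  std::string Serialize() const override { return ""; }
};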


Regards

Antoine.



Le 26/09/2023 à 22:34, Benjamin Kietzman a écrit :

Hello all,

In the PR to add support for Utf8View to the c++ implementation,
I've taken the approach of allowing raw pointer views [1] alongside the
index/offset views described in the spec [2]. This was done to ease
communication with other engines such as DuckDB and Velox whose native
string representation is the raw pointer view. In order to be usable
as a utility for writing IPC files and other operations on arrow
formatted data, it is useful for the library to be able to directly
import raw pointer arrays even when immediately converting these to
the index/offset representation.

However there has been objection in review [3] since the raw pointer
representation is not part of the official format. Since data visitation
utilities are generic, IMHO this hybrid approach does not add
significantly to the complexity of the C++ library, and I feel the
aforementioned interoperability is a high priority when adding this
feature to the C++ library. It's worth noting that this interoperability
has been a stated goal of the Utf8Type since its original proposal [4]
and throughout the discussion of its adoption [5].

Sincerely,
Ben Kietzman

[1]:
https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
[2]:
https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
[3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
[4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4



[Format] C Data Interface integration testing

2023-09-26 Thread Antoine Pitrou



Hello,

We have added some infrastructure for integration testing of the C Data 
Interface between Arrow implementations. We are now testing the C++ and 
Go implementations, but the goal in the future is for all major 
implementations to be tested there (perhaps including nanoarrow).


- PR to add the testing infrastructure and enable the C++ implementation:
https://github.com/apache/arrow/pull/37769

- PR to enable the Go implementation
https://github.com/apache/arrow/pull/37788

Feel free to ask any questions.

Regards

Antoine.





Re: [DISCUSS][Gandiva] External function registry proposal

2023-09-25 Thread Antoine Pitrou



Hi Yue,

Le 25/09/2023 à 18:15, Yue Ni a écrit :



a CMake entrypoint (for example a function) making it easy for

third-party projects to compile their own functions
I can come up with a minimum CMake template so that users can compile C++
based functions, and I think if the integration happens at the LLVM IR
level, it is possible to author the functions beyond C++ languages, such as
Rust/Zig as long as the compiler can generate LLVM IR (there are other
issues that need to be addressed from the Rust experiment I made, but that
can be another proposal/PR). If we make that work, CMake is probably not so
important either since other languages can use their own build tools such
as Cargo/zig build, and we just need some documentation to describe how it
should be interfaced typically.


As long as there's a well-known and supported way to generate the code 
for external functions, then it's fine to me.


(also the required signature for these functions should be documented 
somewhere)
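For illustration, such a function is just an extern "C" symbol compiled to
LLVM bitcode (e.g. with clang -emit-llvm); the calling convention below is
an assumption for the sake of the example, not a documented Gandiva
contract:

#include <cstdint>

// Hypothetical third-party function using the simple value-in/value-out
// shape common to scalar precompiled functions.
extern "C" int64_t multiply_by_three_int64(int64_t value) { return value * 3; }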



The rest of the proposal (a specific JSON file format, a bunch of functions

to iterate directory entries in a specific layout) is IMHO off-topic for
Gandiva, and each third-party project can implement their own idioms for
the discovery of external functions

>

Could you give some more guidance on how this should work without an
external function registry containing metadata? As far as I know, for each
pre-compiled function used in an expression, Gandiva needs to lookup its
signature from the function registry, which currently is a C++ class that
is hard coded to contain 6 categories of built-in functions
(arithmetic/datetime/hash/mathops/string/datetime arithmetic). If a third
party function cannot be found in the registry, it cannot be used in the
expression. If we don't load the pre-compiled function metadata from
external files, how do we avoid Gandiva rejecting the expression when a
third party function cannot be found in the function registry? Thanks.


What I'm saying is that code to load function metadata from JSON and 
walk directories of .bc files does not belong in Gandiva. The definition 
of an external function registry can certainly belong in Gandiva, but 
how it's populated should be left to third-party projects (which then 
don't have to use JSON or a given directory layout).


Regards

Antoine.


Re: [DISCUSS][Gandiva] External function registry proposal

2023-09-25 Thread Antoine Pitrou



Hello,

Making Gandiva more extensible sounds like a worthwhile improvement.

However, I'm not sure why we would need to choose a JSON-based format 
for this. Instead, I think Gandiva could simply provide the two 
following basic-blocks:


1. a CMake entrypoint (for example a function) making it easy for 
third-party projects to compile their own functions


2. a C++ entrypoint to load a bitcode file with the corresponding 
function definition(s)


The rest of the proposal (a specific JSON file format, a bunch of 
functions to iterate directory entries in a specific layout) is IMHO 
off-topic for Gandiva, and each third-party project can implement their 
own idioms for the discovery of external functions.


I'd add that this should be documented somewhere so that it is generally 
useful, not only for the contributors of the feature.


Also, I hope that this will get more people interested in Gandiva 
maintenance.


Regards

Antoine.


Le 25/09/2023 à 16:17, Yue Ni a écrit :

Hi there,

I'd like to initiate a discussion regarding the proposal to introduce
external function registry support in Gandiva. I've provided a concise
description of the proposal in the following issue:
https://github.com/apache/arrow/issues/37753. I welcome any feedback or
comments on this topic. Please feel free to share your thoughts either here
on the mailing list or directly within the issue. Thank you for your
attention and help.

*Background:*
Our team has been leveraging Gandiva in our projects, and its performance
and capabilities have been commendable. However, we've identified a
constraint concerning the registration of functions. At present, Gandiva
necessitates that functions be registered directly within its codebase.
This method, while functional, is not the most user-friendly and presents
hurdles for those aiming to incorporate third-party functions. Direct
modifications to Gandiva's source code for such integrations can
inadvertently introduce maintenance challenges and potential versioning
conflicts down the line.

*Proposal:*
To address this limitation, I propose the introduction of an external
function registry mechanism in Gandiva. This would allow users and
developers to register and integrate custom functions without directly
modifying Gandiva's core source code. You can find more details in the
issue [1] and the PR [2].

Any feedback is appreciated. Thanks.

*References:*
[1] https://github.com/apache/arrow/issues/37753
[2] https://github.com/apache/arrow/pull/37787

Regards,
Yue Ni



Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-09-13 Thread Antoine Pitrou



Le 13/09/2023 à 02:37, Rok Mihevc a écrit :


   * **ragged_dimensions** = indices of ragged dimensions whose sizes may
 differ. Dimensions where all elements have the same size are called
 uniform dimensions. Indices are a subset of all possible dimension
 indices ([0, 1, .., N-1]).
 Ragged dimensions list can be left out. In that case all dimensions
 are assumed ragged.


It's a bit confusing that an empty list means "no ragged dimensions" but 
a missing entry means "all dimensions are ragged". This seems 
error-prone to me.


Also, to be clear, "ragged_dimensions" is only useful for data validation?

Regards

Antoine.


Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Antoine Pitrou



Hi Li,

Le 06/09/2023 à 17:55, Li Jin a écrit :

Hello,

I have been testing "What is the max rss needed to scan through ~100G of
data in a parquet stored in gcs using Arrow C++".

The current answer is about ~6G of memory which seems a bit high so I
looked into it. What I observed during the process led me to think that
there are some potential cache/memory issues in the dataset/parquet cpp
code.

Main observation:
(1) As I am scanning through the dataset, I printed out (a) memory
allocated by the memory pool from ScanOptions (b) process rss. I found that
while (a) stays pretty stable throughout the scan (stays < 1G), (b) keeps
increasing during the scan (looks linear to the number of files scanned).


RSS is typically not a very reliable indicator, because allocators tend 
to keep memory around as an allocation cache even if the application 
code deallocated it.


(in other words: the application returns memory to the allocator, but 
the allocator does not always return memory to the OS, because 
requesting memory from the OS is expensive)


You may start by trying a different memory pool (see 
https://arrow.apache.org/docs/cpp/env_vars.html#envvar-ARROW_DEFAULT_MEMORY_POOL 
for an easy way to do that).


My second suggestion is to ask the memory pool to release more memory: 
https://arrow.apache.org/docs/cpp/api/memory.html#_CPPv4N5arrow10MemoryPool13ReleaseUnusedEv
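Concretely, both suggestions amount to something like this sketch
(standard Arrow C++ API; ReleaseUnused is best-effort and may be a no-op
for some backends):

#include <arrow/memory_pool.h>
#include <iostream>

void ReportAndReleasePoolMemory() {
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  std::cout << "backend=" << pool->backend_name()
            << " bytes_allocated=" << pool->bytes_allocated() << std::endl;
  // Ask the allocator to hand unused memory back to the OS.
  pool->ReleaseUnused();
}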


If either of these two fix the apparent RSS issue, then there is no leak 
nor caching issue.


*However*, in addition to the Arrow memory pool, many small or 
medium-sized allocations (such as all Parquet Thrift metadata) will use 
the system allocator. These allocations will evade tracking by the 
memory pool.


Which leads to the question: what is your OS?

Regards

Antoine.


Re: Standardizing a Python PEP-249 extension to retrieve Arrow data

2023-09-05 Thread Antoine Pitrou



Hello Jonas,

What is the standardization model you are after? PEP 249 is marked final 
and therefore won't be updated (except for minutiae such as typos, 
markup, etc.).


Are you planning to submit a new PEP for this extension? If so, I would 
suggest starting a discussion on https://discuss.python.org/


Regards

Antoine.



Le 05/09/2023 à 09:41, Jonas Haag a écrit :

Hello,

I was asked to bring this discussion to the attention of this email list:
https://github.com/jonashaag/pep-249-arrow/issues/2

It is an initiative started by me to standardize the interfaces of Python
database connectors to retrieve PyArrow data.

Jonas



Re: [VOTE][Format] Add Utf8View Arrays to Arrow Format

2023-08-24 Thread Antoine Pitrou



+1 on the format additions

The implementations will probably need a bit more review back-and-forth.

Regards

Antoine.


Le 28/06/2023 à 21:34, Benjamin Kietzman a écrit :

Hello,

I'd like to propose adding Utf8View arrays to the arrow format.
Previous discussion in [1], columnar format description in [2],
flatbuffers changes in [3].

There are implementations available in both C++[4] and Go[5] which
exercise the new type over IPC. Utf8View format demonstrates[6]
significant performance benefits over Utf8 in common tasks.

The vote will be open for at least 72 hours.

[ ] +1 add the proposed Utf8View type to the Apache Arrow format
[ ] -1 do not add the proposed Utf8View type to the Apache Arrow format
because...

Sincerely,
Ben Kietzman

[1] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
[2]
https://github.com/apache/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/docs/source/format/Columnar.rst#variable-size-binary-view-layout
[3]
https://github.com/apache/arrow/pull/35628/files#diff-0623d567d0260222d5501b4e169141b5070eabc2ec09c3482da453a3346c5bf3
[4] https://github.com/apache/arrow/pull/35628
[5] https://github.com/apache/arrow/pull/35769
[6] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617



[Discuss][C++] A framework for contextual/implicit/ambient vars

2023-08-24 Thread Antoine Pitrou



Hello,

Arrow C++ comes with execution facilities (such as thread pools, async 
generators...) meant to unlock higher performance by hiding IO latencies 
and exploiting several CPU cores. These execution facilities also 
obscure the context in which a task is executed: you cannot simply use 
local, global or thread-local variables to store ancillary parameters.


Over the years we have started adding optional metadata that can be 
associated with tasks:


- StopToken
- TaskHints (though that doesn't seem to be used currently?)
- some people have started to ask about IO tags:
https://github.com/apache/arrow/issues/37267

However, any such additional metadata must currently be explicitly 
passed to all tasks that might make use of them.


My questions are thus:

- do we want to continue using the explicit passing style?
- on the contrary, do we want to switch to a paradigm where those, once 
set, are propagated implicitly along the task dependency flow (e.g. from 
the caller of Executor::Submit to the task submitted)

- are there useful or insightful precedents in the C++ ecosystem?

(note: a similar facility in Python is brought by "context vars":
https://docs.python.org/3/library/contextvars.html)
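To make the implicit-propagation option concrete, here is a hedged C++
sketch (not an existing Arrow API): the ambient context is captured when a
task is wrapped for submission and re-established while it runs, similar
in spirit to Python's context vars:

#include <functional>
#include <string>
#include <utility>

// Hypothetical ambient state; StopToken, TaskHints, IO tags... could live here.
struct TaskContext {
  std::string io_tag;
};

inline thread_local TaskContext current_context;

// Wrap a task so the submitter's context is visible while the task runs.
template <typename Fn>
std::function<void()> WithCurrentContext(Fn fn) {
  return [ctx = current_context, fn = std::move(fn)]() mutable {
    TaskContext saved = std::exchange(current_context, ctx);
    fn();
    current_context = saved;
  };
}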

Regards

Antoine.


Re: [Discuss] Do we need a release verification script?

2023-08-22 Thread Antoine Pitrou



And of course this is a bit pedantic, and only important if we want to 
comply with *the letter* of the ASF policies. My own personal opinion is 
that complying in spirit is enough (but I'm also not sure I understand 
the ASF's spirit :-)).


Regards

Antoine.


Le 22/08/2023 à 17:10, Antoine Pitrou a écrit :


Hmm... perhaps Flatbuffers compilation is usually more deterministic
than compiling C++ code into machine code, but that's mainly (AFAIK)
because the transformation step is much simpler in the former case, than
in the latter. The Flatbuffers compiler also has a range of options that
influence code generation, certainly with less variation than a C++
compiler, but still.

In other words, I don't think being deterministic is a good criterion to
know what "compiled code" means. There is a growing movement towards
making generation of machine code artifacts deterministic:
https://reproducible-builds.org/

Regards

Antoine.



Le 22/08/2023 à 16:47, Adam Lippai a écrit :

Compiled code usually means binaries you can’t derive in a deterministic,
verifiable way from the source code *shipped next to it*. So in this case
any developer should be able to reproduce the flatbuffers output from the
release package only.

“Caches”, multi stage compilation etc should be ok.

Best regards,
Adam Lippai

On Tue, Aug 22, 2023 at 10:40 Antoine Pitrou  wrote:



If the main impetus for the verification script is to comply with ASF
requirements, probably the script can be made much simpler, such as just
verify the GPG signatures are valid? Or perhaps this can be achieved
without a script at all.

The irony is that, however complex, our verification script doesn't seem
to check the actual ASF requirements on artifacts.

For example, we don't check that """a source release SHOULD not contain
compiled code""" (also, what does "compiled code" mean? does generated
code, e.g. by the Flatbuffers compiler, apply?)

Checking that the release """MUST be sufficient for a user to build and
test the release provided they have access to the appropriate platform
and tools""" is ill-defined and potentially tautologic, because the
"appropriate platform and tools" is too imprecise and contextual (can
the "appropriate platform and tools" contain a bunch of proprietary
software that gets linked with the binaries? Well, it can, otherwise you
can't build on Windows).

Regards

Antoine.



Le 22/08/2023 à 12:31, Raúl Cumplido a écrit :

Hi,

I do agree that currently verifying the release locally provides
little benefit for the effort we have to put in but I thought this was
required as per Apache policy:
https://www.apache.org/legal/release-policy.html#release-approval

Copying the important bit:
"""
Before casting +1 binding votes, individuals are REQUIRED to download
all signed source code packages onto their own hardware, verify that
they meet all requirements of ASF policy on releases as described
below, validate all cryptographic signatures, compile as provided, and
test the result on their own platform.
"""

I also think we should try and challenge those.

In the past we have identified some minor issues on the local
verification but I don't recall any of them being blockers for the
release.

Thanks,
Raúl

El mar, 22 ago 2023 a las 11:46, Andrew Lamb ()

escribió:


The Rust arrow implementation (arrow-rs) and DataFusion also use release
verification scripts, mostly inherited from when they were split from the
mono repo. They have found issues from time to time, for us, but those
issues are often not platform related and have not been release blockers.

Thankfully for Rust, the verification scripts don't need much maintenance
so we just continue the ceremony. However, I certainly don't think we
would lose much/any test coverage if we stopped their use.

Andrew

On Tue, Aug 22, 2023 at 4:54 AM Antoine Pitrou 

wrote:




Hello,

Abiding by the Apache Software Foundation's guidelines, every Arrow
release is voted on and requires at least 3 "binding" votes to be approved.

Also, every Arrow release vote is accompanied by a little ceremonial
where contributors and core developers run a release verification script
on their machine, wait for long minutes (sometimes an hour) and report
the results.

This ceremonial has gone on for years, and it has not really been
questioned. Yet, it's not obvious to me what it is achieving exactly.
I've been here since 2018, but I don't really understand what the
verification script is testing for, or, more importantly, *why* it is
testing for what it is testing. I'm probably not the only one?

I would like to bring the following points:

* platform compatibility is (supposed to be) exercised on Continuous
Integration; there is no understandable reason why it should be
ceremoniously tested on each developer's machine before the release

* just before

Re: [Discuss] Do we need a release verification script?

2023-08-22 Thread Antoine Pitrou



Hmm... perhaps Flatbuffers compilation is usually more deterministic 
than compiling C++ code into machine code, but that's mainly (AFAIK) 
because the transformation step is much simpler in the former case, than 
in the latter. The Flatbuffers compiler also has a range of options that 
influence code generation, certainly with less variation than a C++ 
compiler, but still.


In other words, I don't think being deterministic is a good criterion to 
know what "compiled code" means. There is a growing movement towards 
making generation of machine code artifacts deterministic:

https://reproducible-builds.org/

Regards

Antoine.



Le 22/08/2023 à 16:47, Adam Lippai a écrit :

Compiled code usually means binaries you can’t derive in a deterministic,
verifiable way from the source code *shipped next to it*. So in this case
any developer should be able to reproduce the flatbuffers output from the
release package only.

“Caches”, multi stage compilation etc should be ok.

Best regards,
Adam Lippai

On Tue, Aug 22, 2023 at 10:40 Antoine Pitrou  wrote:



If the main impetus for the verification script is to comply with ASF
requirements, probably the script can be made much simpler, such as just
verify the GPG signatures are valid? Or perhaps this can be achieved
without a script at all.

The irony is that, however complex, our verification script doesn't seem
to check the actual ASF requirements on artifacts.

For example, we don't check that """a source release SHOULD not contain
compiled code""" (also, what does "compiled code" mean? does generated
code, e.g. by the Flatbuffers compiler, apply?)

Checking that the release """MUST be sufficient for a user to build and
test the release provided they have access to the appropriate platform
and tools""" is ill-defined and potentially tautologic, because the
"appropriate platform and tools" is too imprecise and contextual (can
the "appropriate platform and tools" contain a bunch of proprietary
software that gets linked with the binaries? Well, it can, otherwise you
can't build on Windows).

Regards

Antoine.



Le 22/08/2023 à 12:31, Raúl Cumplido a écrit :

Hi,

I do agree that currently verifying the release locally provides
little benefit for the effort we have to put in but I thought this was
required as per Apache policy:
https://www.apache.org/legal/release-policy.html#release-approval

Copying the important bit:
"""
Before casting +1 binding votes, individuals are REQUIRED to download
all signed source code packages onto their own hardware, verify that
they meet all requirements of ASF policy on releases as described
below, validate all cryptographic signatures, compile as provided, and
test the result on their own platform.
"""

I also think we should try and challenge those.

In the past we have identified some minor issues on the local
verification but I don't recall any of them being blockers for the
release.

Thanks,
Raúl

El mar, 22 ago 2023 a las 11:46, Andrew Lamb () escribió:


The Rust arrow implementation (arrow-rs) and DataFusion also use release
verification scripts, mostly inherited from when they were split from the
mono repo. They have found issues from time to time, for us, but those
issues are often not platform related and have not been release blockers.

Thankfully for Rust, the verification scripts don't need much maintenance
so we just continue the ceremony. However, I certainly don't think we would
lose much/any test coverage if we stopped their use.

Andrew

On Tue, Aug 22, 2023 at 4:54 AM Antoine Pitrou  wrote:


Hello,

Abiding by the Apache Software Foundation's guidelines, every Arrow
release is voted on and requires at least 3 "binding" votes to be approved.

Also, every Arrow release vote is accompanied by a little ceremonial
where contributors and core developers run a release verification script
on their machine, wait for long minutes (sometimes an hour) and report
the results.

This ceremonial has gone on for years, and it has not really been
questioned. Yet, it's not obvious to me what it is achieving exactly.
I've been here since 2018, but I don't really understand what the
verification script is testing for, or, more importantly, *why* it is
testing for what it is testing. I'm probably not the only one?

I would like to bring the following points:

* platform compatibility is (supposed to be) exercised on Continuous
Integration; there is no understandable reason why it should be
ceremoniously tested on each developer's machine before the release

* just before a release is probably the wrong time to be testing
platform compatibility, and fixing compatibility bugs (though, of
course, it might still be better than not noticing?)

* home environments are unstable, and not all developers run the
verification script for each release, so each release is actually
verifie

Re: [Discuss] Do we need a release verification script?

2023-08-22 Thread Antoine Pitrou



If the main impetus for the verification script is to comply with ASF 
requirements, probably the script can be made much simpler, such as just 
verify the GPG signatures are valid? Or perhaps this can be achieved 
without a script at all.


The irony is that, however complex, our verification script doesn't seem 
to check the actual ASF requirements on artifacts.


For example, we don't check that """a source release SHOULD not contain 
compiled code""" (also, what does "compiled code" mean? does generated 
code, e.g. by the Flatbuffers compiler, apply?)


Checking that the release """MUST be sufficient for a user to build and 
test the release provided they have access to the appropriate platform 
and tools""" is ill-defined and potentially tautologic, because the 
"appropriate platform and tools" is too imprecise and contextual (can 
the "appropriate platform and tools" contain a bunch of proprietary 
software that gets linked with the binaries? Well, it can, otherwise you 
can't build on Windows).


Regards

Antoine.



Le 22/08/2023 à 12:31, Raúl Cumplido a écrit :

Hi,

I do agree that currently verifying the release locally provides
little benefit for the effort we have to put in but I thought this was
required as per Apache policy:
https://www.apache.org/legal/release-policy.html#release-approval

Copying the important bit:
"""
Before casting +1 binding votes, individuals are REQUIRED to download
all signed source code packages onto their own hardware, verify that
they meet all requirements of ASF policy on releases as described
below, validate all cryptographic signatures, compile as provided, and
test the result on their own platform.
"""

I also think we should try and challenge those.

In the past we have identified some minor issues on the local
verification but I don't recall any of them being blockers for the
release.

Thanks,
Raúl

El mar, 22 ago 2023 a las 11:46, Andrew Lamb () escribió:


The Rust arrow implementation (arrow-rs) and DataFusion also use release
verification scripts, mostly inherited from when they were split from the
mono repo. They have found issues from time to time, for us, but those
issues are often not platform related and have not been release blockers.

Thankfully for Rust, the verification scripts don't need much maintenance
so we just continue the ceremony. However, I certainly don't think we would
lose much/any test coverage if we stopped their use.

Andrew

On Tue, Aug 22, 2023 at 4:54 AM Antoine Pitrou  wrote:



Hello,

Abiding by the Apache Software Foundation's guidelines, every Arrow
release is voted on and requires at least 3 "binding" votes to be approved.

Also, every Arrow release vote is accompanied by a little ceremonial
where contributors and core developers run a release verification script
on their machine, wait for long minutes (sometimes an hour) and report
the results.

This ceremonial has gone on for years, and it has not really been
questioned. Yet, it's not obvious to me what it is achieving exactly.
I've been here since 2018, but I don't really understand what the
verification script is testing for, or, more importantly, *why* it is
testing for what it is testing. I'm probably not the only one?

I would like to bring the following points:

* platform compatibility is (supposed to be) exercised on Continuous
Integration; there is no understandable reason why it should be
ceremoniously tested on each developer's machine before the release

* just before a release is probably the wrong time to be testing
platform compatibility, and fixing compatibility bugs (though, of
course, it might still be better than not noticing?)

* home environments are unstable, and not all developers run the
verification script for each release, so each release is actually
verified on different, uncontrolled, platforms

* as for sanity checks on binary packages, GPG signatures, etc., there
shouldn't be any need to run them on multiple different machines, as
they are (should be?) entirely deterministic and platform-agnostic

* maintaining the verification scripts is a thankless task, in part due
to their nature (they need to track and mirror changes made in each
implementation's build chain), in part due to implementation choices

* due to the existence of the verification scripts, the release vote is
focussed on getting the script to run successfully (a very contextual
and non-reproducible result), rather than the actual *contents* of the
release

The most positive thing I can personally say about the verification
scripts is that they *may* help us trust the release is not broken? But
that's a very unqualified statement, and is very close to cargo-culting.

Regards

Antoine.



Re: [VOTE] Release Apache Arrow 13.0.0 - RC3

2023-08-22 Thread Antoine Pitrou



+1 from me (binding). The verification script failed for me, but I 
consider it not a problem (see separate discussion thread).


Regards

Antoine.


Le 18/08/2023 à 10:00, Raúl Cumplido a écrit :

Hi,

I would like to propose the following release candidate (RC3) of Apache
Arrow version 13.0.0. This is a release consisting of 440
resolved GitHub issues[1].

This release candidate is based on commit:
b7d2f7ffca66c868bd2fce5b3749c6caa002a7f0 [2]

The source release rc3 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [13] for how to validate a release candidate.

See also a verification result on GitHub pull request [14].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 13.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 13.0.0 because...

[1]: 
https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A13.0.0+is%3Aclosed
[2]: 
https://github.com/apache/arrow/tree/b7d2f7ffca66c868bd2fce5b3749c6caa002a7f0
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-13.0.0-rc3
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/13.0.0-rc3
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/13.0.0-rc3
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/13.0.0-rc3
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: 
https://github.com/apache/arrow/blob/b7d2f7ffca66c868bd2fce5b3749c6caa002a7f0/CHANGELOG.md
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[14]: https://github.com/apache/arrow/pull/37220


[Discuss] Do we need a release verification script?

2023-08-22 Thread Antoine Pitrou



Hello,

Abiding by the Apache Software Foundation's guidelines, every Arrow 
release is voted on and requires at least 3 "binding" votes to be approved.


Also, every Arrow release vote is accompanied by a little ceremonial 
where contributors and core developers run a release verification script 
on their machine, wait for long minutes (sometimes an hour) and report 
the results.


This ceremonial has gone on for years, and it has not really been 
questioned. Yet, it's not obvious to me what it is achieving exactly. 
I've been here since 2018, but I don't really understand what the 
verification script is testing for, or, more importantly, *why* it is 
testing for what it is testing. I'm probably not the only one?


I would like to bring the following points:

* platform compatibility is (supposed to be) exercised on Continuous 
Integration; there is no understandable reason why it should be 
ceremoniously tested on each developer's machine before the release


* just before a release is probably the wrong time to be testing 
platform compatibility, and fixing compatibility bugs (though, of 
course, it might still be better than not noticing?)


* home environments are unstable, and not all developers run the 
verification script for each release, so each release is actually 
verified on different, uncontrolled, platforms


* as for sanity checks on binary packages, GPG signatures, etc., there 
shouldn't be any need to run them on multiple different machines, as 
they are (should be?) entirely deterministic and platform-agnostic


* maintaining the verification scripts is a thankless task, in part due 
to their nature (they need to track and mirror changes made in each 
implementation's build chain), in part due to implementation choices


* due to the existence of the verification scripts, the release vote is 
focussed on getting the script to run successfully (a very contextual 
and non-reproducible result), rather than the actual *contents* of the 
release


The most positive thing I can personally say about the verification 
scripts is that they *may* help us trust the release is not broken? But 
that's a very unqualified statement, and is very close to cargo-culting.


Regards

Antoine.


Re: [VOTE] Release Apache Arrow 13.0.0 - RC3

2023-08-22 Thread Antoine Pitrou



Hello,

It seems the verification instructions are not up to date?
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates

I've tried to run the suggested command:

$ dev/release/verify-release-candidate.sh source 13.0.0 3

and I get the following error message:
"""
Usage:
  Verify release candidate:
    dev/release/verify-release-candidate.sh X.Y.Z RC_NUMBER
  Verify only the source distribution:
    TEST_DEFAULT=0 TEST_SOURCE=1 dev/release/verify-release-candidate.sh X.Y.Z RC_NUMBER
  Verify only the binary distributions:
    TEST_DEFAULT=0 TEST_BINARIES=1 dev/release/verify-release-candidate.sh X.Y.Z RC_NUMBER
  Verify only the wheels:
    TEST_DEFAULT=0 TEST_WHEELS=1 dev/release/verify-release-candidate.sh X.Y.Z RC_NUMBER
"""

Regards

Antoine.




Le 18/08/2023 à 10:00, Raúl Cumplido a écrit :

Hi,

I would like to propose the following release candidate (RC3) of Apache
Arrow version 13.0.0. This is a release consisting of 440
resolved GitHub issues[1].

This release candidate is based on commit:
b7d2f7ffca66c868bd2fce5b3749c6caa002a7f0 [2]

The source release rc3 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
The changelog is located at [12].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [13] for how to validate a release candidate.

See also a verification result on GitHub pull request [14].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 13.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow 13.0.0 because...

[1]: 
https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A13.0.0+is%3Aclosed
[2]: 
https://github.com/apache/arrow/tree/b7d2f7ffca66c868bd2fce5b3749c6caa002a7f0
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-13.0.0-rc3
[4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
[5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
[6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
[7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
[8]: https://apache.jfrog.io/artifactory/arrow/java-rc/13.0.0-rc3
[9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/13.0.0-rc3
[10]: https://apache.jfrog.io/artifactory/arrow/python-rc/13.0.0-rc3
[11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
[12]: 
https://github.com/apache/arrow/blob/b7d2f7ffca66c868bd2fce5b3749c6caa002a7f0/CHANGELOG.md
[13]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[14]: https://github.com/apache/arrow/pull/37220


Re: Sort a Table In C++?

2023-08-17 Thread Antoine Pitrou



Or you can simply call the "sort_indices" compute function:
https://arrow.apache.org/docs/cpp/compute.html#sorts-and-partitions
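
As a concrete illustration of that suggestion (a hedged sketch, not code from
this thread; the function name and error handling are just one way to write
it), sorting a Table by a single column via the compute API could look
roughly like this:

```
// Rough sketch: sort an arrow::Table by one column using the "sort_indices"
// kernel, then gather the rows in that order with "take".
#include <arrow/api.h>
#include <arrow/compute/api.h>

arrow::Result<std::shared_ptr<arrow::Table>> SortTableByColumn(
    const std::shared_ptr<arrow::Table>& table, const std::string& column) {
  namespace cp = arrow::compute;
  // Compute a permutation of row indices that orders the table by `column`.
  cp::SortOptions options({cp::SortKey(column, cp::SortOrder::Ascending)});
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> indices,
                        cp::SortIndices(arrow::Datum(table), options));
  // Materialize the sorted table by taking rows in the sorted order.
  ARROW_ASSIGN_OR_RAISE(arrow::Datum sorted,
                        cp::Take(arrow::Datum(table), arrow::Datum(indices)));
  return sorted.table();
}
```

For a standalone in-memory Table this avoids wrapping it in an Acero source
node at all; Acero is mainly useful when the sort is part of a larger
streaming plan, as in Ian's gist.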


Le 17/08/2023 à 23:20, Ian Cook a écrit :

Li,

Here's a standalone C++ example that constructs a Table and executes
an Acero ExecPlan to sort it:
https://gist.github.com/ianmcook/2aa9aa82e61c3ea4405450b93cf80fbc

Ian

On Thu, Aug 17, 2023 at 4:50 PM Li Jin  wrote:


Hi,

I am writing some C++ test and found myself in need for an c++ function to
sort an arrow Table. Before I go around implementing one myself, I wonder
if there is already a function that does that? (I searched the doc but
didn’t find one).

There is function in Acero can do it but I didn’t find a super easy way to
wrap a Table as An Acero source node either.

Appreciate it if someone can give some pointers.

Thanks,
Li


Re: [VOTE] Apache Arrow ADBC (API) 1.1.0

2023-08-16 Thread Antoine Pitrou



+1 (binding)


Le 14/08/2023 à 19:39, David Li a écrit :

Hello,

We have been discussing revisions [1] to the ADBC APIs, which we formerly 
decided to treat as a specification [2]. These revisions clean up various 
missing features (e.g. cancellation, error metadata) and better position ADBC 
to help different data systems interoperate (e.g. by exposing more metadata, 
like table/column statistics).

For details, see the PR at [3]. (The main file to read through is adbc.h.)

I would like to propose that the Arrow project adopt this RFC, along with the 
linked PR, as version 1.1.0 of the ADBC API standard.

Please vote to adopt the specification as described above. This is not a vote 
to release any packages; the first package release to support version 1.1.0 of 
the APIs will be 0.7.0 of the packages. (So I will not merge the linked PR 
until after we release ADBC 0.6.0.)

This vote will be open for at least 72 hours.

[ ] +1 Adopt the ADBC 1.1.0 specification
[ ]  0
[ ] -1 Do not adopt the specification because...

Thanks to Sutou Kouhei, Matt Topol, Dewey Dunnington, Antoine Pitrou, Will Ayd, 
and Will Jones for feedback on the design and various work-in-progress PRs.

[1]: https://github.com/apache/arrow-adbc/milestone/3
[2]: https://lists.apache.org/thread/s8m4l9hccfh5kqvvd2x3gxn3ry0w1ryo
[3]: https://github.com/apache/arrow-adbc/pull/971

Thank you,
David


Re: [DISCUSS][Arrow] Extension metadata encoding design

2023-08-16 Thread Antoine Pitrou



Hmm, you're right that letting the extension type peek at the entire 
metadata values would have been another solution.


That said, for protocol compatibility reasons, we cannot easily change 
this anymore.


Regards

Antoine.



Le 16/08/2023 à 17:48, Jeremy Leibs a écrit :

Thanks for the context, Antoine.

However, even in those examples, I don't really see how coercing the
metadata to a single string makes much of a difference.
I believe the main difference of what I'm proposing would be that the
ExtensionType::Deserialize interface:
https://github.com/apache/arrow/blob/main/r/src/extension.h#L49-L51

Would instead look like:
```
   arrow::Result<std::shared_ptr<DataType>> Deserialize(
       std::shared_ptr<DataType> storage_type,
       std::shared_ptr<const KeyValueMetadata> metadata) const;
```

In both of those cases though it seems like a
valid std::shared_ptr<const KeyValueMetadata> is available to be passed to the
extension.

I suspect the more challenging case might be related to DataType equality
checks? It would not be possible for generic code to know whether it can
validly do things like concatenate two extension arrays without knowledge
of which metadata keys are relevant to the extension.  That said, with the
current adhoc serialization of metadata to a string, different
encoder-implementations might still produce non-comparable strings,
resulting in falsely reported datatype mismatches, but at least avoiding
the case of false positives.

On Wed, Aug 16, 2023 at 5:19 PM Antoine Pitrou  wrote:



Hi Jeremy,

A single key makes it easier for generic code to recreate extension
types it does not know about.

Here is an example in the C++ IPC layer:

https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/ipc/metadata_internal.cc#L823-L845

Here is similar logic in the C++ bridge for the C Data Interface:

https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/c/bridge.cc#L1021-L1029

It is probably expected that many extension types will be parameter-less
(such as UUID, JSON, BSON...).

It does imply that extension types with sophisticated parameterization
must implement a custom (de)serialization mechanism themselves. I'm not
sure this tradeoff was discussed at the time, perhaps other people (Wes?
Jacques?) may chime in.

Regards

Antoine.



Le 16/08/2023 à 16:32, Jeremy Leibs a écrit :

Hello,

I've recently started working with extension types as part of our project
and I was surprised to discover that extension types are required to pack
all of their own metadata into a single string value of the
"ARROW:extension:metadata" key.

In turn this then means we have to endure arbitrary unstructured /
hard-to-validate strings with custom encodings (e.g. JSON inside
flatbuffer) when dealing with extensions.

Can anyone provide some context on the rationale for this design decision?


Given that we already have (1) a perfectly good metadata keyvalue store
already in place, and (2) established recommendations for
namespaced scoping of keys, why would we not just use that to store the
metadata for the extension. For example:

"ARROW:extension:name": "myorg.myextension",
"myorg:myextension:meta1": "value1",
"myorg:myextension:meta2": "value2",

Thanks for any insights,
Jeremy







Re: [DISCUSS][Arrow] Extension metadata encoding design

2023-08-16 Thread Antoine Pitrou



Hi Jeremy,

A single key makes it easier for generic code to recreate extension 
types it does not know about.


Here is an example in the C++ IPC layer:
https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/ipc/metadata_internal.cc#L823-L845

Here is similar logic in the C++ bridge for the C Data Interface:
https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/c/bridge.cc#L1021-L1029

It is probably expected that many extension types will be parameter-less 
(such as UUID, JSON, BSON...).


It does imply that extension types with sophisticated parameterization 
must implement a custom (de)serialization mechanism themselves. I'm not 
sure this tradeoff was discussed at the time, perhaps other people (Wes? 
Jacques?) may chime in.


Regards

Antoine.
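
To make that tradeoff concrete, here is a hedged sketch (not from this
thread; the type, its name and the "size=" encoding are purely illustrative)
of a parameterized extension type under the current single-string design,
where every parameter has to be packed into, and parsed back out of, the one
ARROW:extension:metadata value:

```
// Illustrative only: a parameterized extension type under the current design.
#include <memory>
#include <string>
#include <arrow/api.h>
#include <arrow/extension_type.h>

class FixedSizeVecType : public arrow::ExtensionType {
 public:
  explicit FixedSizeVecType(int32_t size)
      : arrow::ExtensionType(arrow::fixed_size_list(arrow::float32(), size)),
        size_(size) {}

  std::string extension_name() const override { return "myorg.fixed_size_vec"; }

  bool ExtensionEquals(const arrow::ExtensionType& other) const override {
    return extension_name() == other.extension_name() &&
           Serialize() == other.Serialize();
  }

  std::shared_ptr<arrow::Array> MakeArray(
      std::shared_ptr<arrow::ArrayData> data) const override {
    return std::make_shared<arrow::ExtensionArray>(data);
  }

  // The only place parameters can live: one opaque string.
  std::string Serialize() const override {
    return "size=" + std::to_string(size_);
  }

  // ...which every implementation has to parse back by hand.
  // (A stricter implementation would also validate storage_type.)
  arrow::Result<std::shared_ptr<arrow::DataType>> Deserialize(
      std::shared_ptr<arrow::DataType> storage_type,
      const std::string& serialized) const override {
    if (serialized.rfind("size=", 0) != 0) {
      return arrow::Status::Invalid("unexpected extension metadata: ", serialized);
    }
    const int32_t size = std::stoi(serialized.substr(5));
    return std::make_shared<FixedSizeVecType>(size);
  }

 private:
  int32_t size_;
};
```

With the multi-key approach Jeremy suggests, such parameters could instead
live as separate, individually inspectable metadata entries.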



Le 16/08/2023 à 16:32, Jeremy Leibs a écrit :

Hello,

I've recently started working with extension types as part of our project
and I was surprised to discover that extension types are required to pack
all of their own metadata into a single string value of the
"ARROW:extension:metadata" key.

In turn this then means we have to endure arbitrary unstructured /
hard-to-validate strings with custom encodings (e.g. JSON inside
flatbuffer) when dealing with extensions.

Can anyone provide some context on the rationale for this design decision?

Given that we already have (1) a perfectly good metadata keyvalue store
already in place, and (2) established recommendations for
namespaced scoping of keys, why would we not just use that to store the
metadata for the extension. For example:

"ARROW:extension:name": "myorg.myextension",
"myorg:myextension:meta1": "value1",
"myorg:myextension:meta2": "value2",

Thanks for any insights,
Jeremy



Re: [Vote][Format] C Data Interface Format string for REE

2023-08-16 Thread Antoine Pitrou



+1 from me (binding).

It would be nice to get approval from authors of other implementations 
such as Rust, C#, Javascript...


Thanks for doing this!


Le 16/08/2023 à 16:16, Matt Topol a écrit :

Hey All,

As proposed by Felipe [1] I'm starting a vote on the proposed update to the
Format Spec of adding "+r" as the format string for passing Run-End Encoded
arrays through the Arrow C Data Interface.

A PR containing an update to the C++ Arrow implementation to add support
for this format string along with documentation updates can be found here
[2].
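
For readers less familiar with the C Data Interface, here is a hedged sketch
of what the change means in practice (it assumes the proposal and the PR
above are merged; it is not code from this vote):

```
// Sketch: export a run-end-encoded type through the C Data Interface and
// inspect the resulting format strings (assumes "+r" support is merged).
#include <cassert>
#include <cstring>
#include <arrow/api.h>
#include <arrow/c/bridge.h>

int main() {
  // Run ends must be int16/int32/int64; values can be any type.
  auto type = arrow::run_end_encoded(arrow::int32(), arrow::utf8());

  ArrowSchema c_schema;
  if (!arrow::ExportType(*type, &c_schema).ok()) return 1;

  // Parent format is "+r"; the children describe run_ends ("i" = int32)
  // and values ("u" = utf8).
  assert(std::strcmp(c_schema.format, "+r") == 0);
  assert(c_schema.n_children == 2);
  assert(std::strcmp(c_schema.children[0]->name, "run_ends") == 0);
  assert(std::strcmp(c_schema.children[1]->name, "values") == 0);

  c_schema.release(&c_schema);  // we own the exported schema and must release it
  return 0;
}
```

An importer that does not recognize "+r" would reject the schema, as with any
other unknown format string.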

The vote will be open for at least 72 hours.

[ ] +1 - I'm in favor of this new C Data Format string
[ ] +0
[ ] -1 - I'm against adding this new format string because

Thanks everyone!

--Matt

[1]: https://lists.apache.org/thread/smco2mcmw2ob2msoyo84wd4oz8z5f781
[2]: https://github.com/apache/arrow/pull/37174



Re: [Format] C data interface format string for run-end encoded arrays

2023-08-15 Thread Antoine Pitrou



I think we should.

Regards

Antoine.


Le 15/08/2023 à 19:58, Matt Topol a écrit :

I'm in favor of this as the C Data format string. Though since this is
technically a format/spec change do others think we should take a vote on
this?

--Matt

On Tue, Aug 15, 2023, 12:19 PM Felipe Oliveira Carvalho  wrote:


Hello,

I'm writing to inform you that I'm proposing "+r" as format string for
run-end encoded arrays passing through the Arrow C data interface [1].

Feel free to also discuss in the linked PR with the changes to bridge.cc
and reference docs.

[1] https://arrow.apache.org/docs/format/CDataInterface.html
[2] https://github.com/apache/arrow/pull/37174

--
Felipe





Re: hashing Arrow structures

2023-07-24 Thread Antoine Pitrou



Hi,

Le 21/07/2023 à 15:58, Yaron Gvili a écrit :

A first approach I found is using `Hashing32` and `Hashing64`. This approach 
seems to be useful for hashing the fields composing a key of multiple rows when 
joining. However, it has a couple of drawbacks. One drawback is that if the 
number of distinct keys is large (like in the scale of a million or so) then 
the probability of hash collision may no longer be acceptable for some 
applications, more so when using `Hashing32`. Another drawback that I noticed 
in my experiments is that the common `N/A` and `0` integer values both hash to 
0 and thus collide.


Ouch... so if N/A does have the same hash value as a common non-null 
value (0), this should be fixed.


Also, I don't understand why there are two versions of the hash table 
("hashing32" and "hashing64" apparently). What's the rationale? How is 
the user meant to choose between them? Say a Substrait plan is being 
executed: which hashing variant is chosen and why?


I don't think 32-bit hashing is a good idea when operating on large 
data. Unless the hash function is exceptionally good, you may get lots 
of hash collisions. It's nice to have a SIMD-accelerated hash table, but 
less so if access times degenerate to O(n)...


So IMHO we should only have one hashing variant with a 64-bit output. 
And make sure it doesn't have trivial collisions on common data patterns 
(such as nulls and zeros, or clustered integer ranges).
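
To put rough numbers on that (a back-of-the-envelope birthday-bound estimate,
not a statement about the actual Hashing32/Hashing64 implementations):

```
// Birthday bound: with n keys and an ideal k-bit hash, the expected number
// of colliding pairs is roughly n * (n - 1) / 2^(k + 1).
#include <cmath>
#include <cstdio>

int main() {
  const double n = 1e6;  // "scale of a million" distinct keys, as above
  const double pairs = n * (n - 1) / 2.0;
  std::printf("expected colliding pairs, 32-bit hash: %.1f\n",
              pairs / std::pow(2.0, 32));  // ~116
  std::printf("expected colliding pairs, 64-bit hash: %.2e\n",
              pairs / std::pow(2.0, 64));  // ~2.7e-08
  return 0;
}
```

So even an ideal 32-bit hash already yields on the order of a hundred
colliding pairs at that scale, while a 64-bit hash keeps the expected count
negligible.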



A second approach I found is by serializing the Arrow structures (possibly by 
streaming) and hashing using functions in `util/hashing.h`. I didn't yet look 
into what properties these hash functions have except for the documented high 
performance. In particular, I don't know whether they have unfortunate hash 
collisions and, more generally, what is the probability of hash collision. I 
also don't know whether they are designed for efficient use in the context of 
joining.


Those hash functions shouldn't have unfortunate hash collisions, but they were not 
exercised on real-world data at the time. I have no idea whether they 
are efficient in the context of joining, as they have been written much 
earlier than our joining implementation.


Regards

Antoine.


Re: [DISCUSS] Canonical alternative layout proposal

2023-07-18 Thread Antoine Pitrou



Hello,

I'm trying to reason about the advantages and drawbacks of this 
proposal, but it seems to me that it lacks definition.


I would welcome a draft PR showcasing the changes necessary in the IPC 
format definition, and in the C Data Interface specification (no need to 
actually implement them for now :-)).



As it is, it seems that this proposal would allow us to switch from:

"""We'd like to add a more efficient physical data representation, so 
we'll introduce a new Arrow data type. Implementations may or may not 
support it, but we will progressively try to bring reference 
implementations to parity.""" (1)


to:

"""We'd like to add a more efficient physical data representation, so 
we'll introduce a new alternative layout for an existing Arrow data 
type. Implementations may or may not support it, but we will 
progressively try to bring reference implementations to parity.""" (2)


The expected advantage of (2) over (1) seems to be mainly a difference 
in how new format features are communicated. There are mainline 
features, and there are experimental / provisional features.


Regards

Antoine.



Le 13/07/2023 à 00:01, Neal Richardson a écrit :

Hi all,
As was previously raised in [1] and surfaced again in [2], there is a
proposal for representing alternative layouts. The intent, as I understand
it, is to be able to support memory layouts that some (but perhaps not all)
applications of Arrow find valuable, so that these nearly Arrow systems can
be fully Arrow-native.

I wanted to start a more focused discussion on it because I think it's
worth being considered on its own merits, but I also think this gets to the
core of what the Arrow project is and should be, and I don't want us to
lose sight of that.

To restate the proposal from [1]:

* There are one or more primary layouts
  * Existing layouts are automatically considered primary layouts, even if
    they wouldn't have been primary layouts initially (e.g. large list)
* A new layout, if it is semantically equivalent to another, is considered an
  alternative layout
* An alternative layout still has the same requirements for adoption
  (two implementations and a vote)
  * An implementation should not feel pressured to rush and implement the new
    layout. It would be good if they contribute in the discussion and consider
    the layout and vote if they feel it would be an acceptable design.
* We can define and vote and approve as many canonical alternative layouts as
  we want:
  * A canonical alternative layout should, at a minimum, have some reasonable
    justification, such as improved performance for algorithm X
* Arrow implementations MUST support the primary layouts
* An Arrow implementation MAY support a canonical alternative, however:
  * An Arrow implementation MUST first support the primary layout
  * An Arrow implementation MUST support conversion to/from the primary and
    canonical layout
  * An Arrow implementation's APIs MUST only provide data in the
    alternative layout if it is explicitly asked for (e.g. schema inference
    should prefer the primary layout).
* We can still vote for new primary layouts (e.g. promoting a canonical
  alternative) but, in these votes we don't only consider the value (e.g.
  performance) of the layout but also the interoperability. In other words,
  a layout can only become a primary layout if there is significant evidence
  that most implementations plan to adopt it.


To summarize some of the arguments against the proposal from the previous
threads, there are concerns about increasing the complexity of the Arrow
specification and the cost/burden of updating all of the Arrow
specifications to support them.

Where these discussions, both about several proposed new types and this
layout proposal, get to the core of Arrow is well expressed in the comments
on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what
matters to people more, interoperability or best-in-class performance?" And
Pedro notes that because of the overhead of converting these not-yet-Arrow
types to the Arrow C ABI is high enough that they've considered abandoning
Arrow as their interchange format. So: on the one hand, we're kinda
choosing which quality we're optimizing for, but on the other,
interoperability and performance are dependent on each other.

What I see that we're trying to do here is find a way to expand the Arrow
specification just enough so that Arrow becomes or remains the in-memory
standard everywhere, but not so much that it creates too much complexity or
burden to implement. Expand too much and you get a fragmented ecosystem
where everyone is writing subsets of the Arrow standard and so nothing is
fully compatible and the whole premise is undermined. But expand too little
and projects will abandon the standard and we've also failed.

I don't have a tidy answer, but I wanted to acknowledge the bigger issues,
and see if this helps us reason about the various proposals on the table. I
wonder if the 

Re: Webassembly?

2023-07-06 Thread Antoine Pitrou



Hi Joe,

Thank you for working on that.

The one question I have is: are you willing to help us maintain Arrow 
C++ on the long term? The logic you're adding in 
https://github.com/apache/arrow/pull/35672 is quite delicate; also I 
don't think anyone among us is a Webassembly expert, which means that we 
might break things unwillingly. So while it would be great to get Arrow 
C++ to work with WASM, a dedicated expert is needed to help maintain and 
debug WASM support in the future.


Regards

Antoine.


Le 03/07/2023 à 17:29, Joe Marshall a écrit :

Hi,

I'm a pyodide developer amongst other things (webassembly cpython interpreter) 
and I've got some PRs in progress on arrow relating to webassembly support. I 
wondered if it might be worth discussing my broader ideas for this on the list 
or at the biweekly development meeting?

So far I have 35176 in, which makes arrow run on a single thread. This is 
needed because in a lot of webassembly environments (browsers at least, 
pyodide), threading isn't available or is heavily constrained.

With that I've aimed to make it relatively transparent to users, so that things 
like datasets and acero mostly just work (but slower obviously). It's kind of 
fiddly in the arrow code but working, and means users can port things easily.

Once that is in, the plan is to submit a following pr that adds cmake presets 
for emscripten which can build the cpp libraries and pyarrow for pyodide. I've 
hacked this together in a build already, it's a bit fiddly and needs a load of 
tidying up, but I'm confident it can be done.

Essentially, I'm wanting to get this stuff in because pandas is moving towards 
arrow as a pretty much required dependency, and webassembly is a pandas 
platform, as well as being an official python platform, so it would be great
to get it working in
pyodide without us needing to maintain a load of patches. I guess it could also 
come in handy with various container platforms that are moving to webassembly.

Basically I thought it's probably worth a bit of a heads up relating to this, 
as I know the bigger picture of things is often hard to see from just pull 
requests.

Thanks
Joe










Re: Do we need CODEOWNERS ?

2023-07-05 Thread Antoine Pitrou



Thanks all for the responses.

I've opened https://github.com/apache/arrow/issues/36474 for a more 
selective approach as preferred by most. See you there :-)


Regards

Antoine.


Le 05/07/2023 à 06:22, Alenka Frim a écrit :

I agree with what was said till now.

I did agree to be added as a codeowner for the Python directory which didn't
turn out to be the best idea. As Joris mentioned, the number of
notifications
is not small. There are lots of PRs that are not Python related, but maybe
just have a test added in Python and therefore I am not capable of
reviewing.
So similarly as Dewey mentioned, most of the PRs on which I get assigned
to as a reviewer I simply ignore.

Not perfect, for sure. Hopefully that didn't in reality cause too much
bewilderment and bad experience from the side of the contributors.

But what I do like with this approach is that I am aware of most of the things
that go on in the project and could be connected to pyarrow.

To give a "vote" on the proposed way forward, I think the second option
(de-assigning themselves, and if possible pinging another core developer)
could be a good way to go. If we would be expected to give a review on each
PR we are assigned to it would be fair that I remove myself from the
CODEOWNERS file.

Best,
Alenka

On Wed, Jul 5, 2023 at 12:05 AM Will Jones  wrote:


I haven't had as much time to review the Parquet PRs, so I'll remove myself
from the CODEOWNERS for that.

I've found that I have a much easier time keeping up with PR reviews in
projects that are smaller, even if there are proportionally fewer
maintainers. I think that's the piece that appealed to me originally about
CODEOWNERS: that we could start to make there be some more clarity on how
reviewing responsibility can be divided up. But I agree it hasn't really
lived up to that hope.

On Tue, Jul 4, 2023 at 1:13 PM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:


I think it can be useful in certain cases, where the selection is
specific enough (for example if all Go related PRs are not too much for
Matt, this feature sounds useful for him. I can also imagine if you
are working on flight, just getting notifications for changes to the
flight-related files might be useful).

Personally, for myself I didn't add my name to the CODEOWNERS, because
as someone doing general pyarrow maintenance, I was thinking that
adding my name as owner of "python" directory would lead to way too
many notifications for me, and there is no obvious more specific
selection.

So if it's useful for some people, I wouldn't necessarily remove it,
as long as: 1) everyone individually evaluates for themselves whether
this is working or not (and it's fine to remove some entries again),
and 2) we know this is not a system to properly ping reviewers for all
PRs, and we still need to manually ping reviewers in other cases.

On Tue, 4 Jul 2023 at 20:11, Matt Topol  wrote:


I've found it useful for me so far since it auto adds me on any Go related
PRs so I don't need to sift through the notifications or active PRs, and
instead can easily find them in my reviews on GitHub notifications.

But if everyone else finds it more detrimental than helpful I can set up a
custom filter or something.

On Tue, Jul 4, 2023, 12:30 PM Weston Pace  wrote:



I agree the experiment isn't working very well.  I've been meaning to
change my listing from `compute` to `acero` for a while.  I'd be +1 for
just removing it though.

On Tue, Jul 4, 2023, 6:44 AM Dewey Dunnington wrote:


Just a note that for me, the main problem is that I get automatic
review requests for PRs that have nothing to do with R (I think this
happens when a rebase occurs that contained an R commit). Because that
happens a lot, it means I miss actual review requests and sometimes
mentions because they blend in. I think CODEOWNERS results in me
reviewing more PRs than if I had to set up some kind of custom
notification filter but I agree that it's not perfect.

Cheers,

-dewey

On Tue, Jul 4, 2023 at 10:04 AM Antoine Pitrou  wrote:



Hello,

Some time ago we added a `.github/CODEOWNERS` file in the main Arrow
repo. The idea is that, when specific files or directories are touched
by a PR, specific people are asked for review.

Unfortunately, it seems that, most of the time, this produces the
following effects:

1) the people who are automatically queried for review don't show up
(perhaps they simply ignore those automatic notifications)
2) when several people are assigned for review, each designated reviewer
seems to hope that the other ones will be doing the work, instead of
doing it themselves
3) contributors expect those people to show up and are therefore
bewildered when nobody comes to review their PR

Do we want to keep CODEOWNERS? If we still think it can be beneficial,
we should institute a policy where people who are listed in that file
promise to respond to review requests: 1) either by doing a revie

Do we need CODEOWNERS ?

2023-07-04 Thread Antoine Pitrou



Hello,

Some time ago we added a `.github/CODEOWNERS` file in the main Arrow 
repo. The idea is that, when specific files or directories are touched 
by a PR, specific people are asked for review.
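
(For context, a CODEOWNERS file is just a list of path patterns mapped to
GitHub users or teams; the excerpt below is hypothetical, not the actual
Arrow file.)

```
# Hypothetical CODEOWNERS excerpt: PRs touching matching paths automatically
# request review from the listed users/teams.
/go/                    @example-go-maintainer
/r/                     @example-r-maintainer
/cpp/src/arrow/acero/   @example-acero-maintainer
```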


Unfortunately, it seems that, most of the time, this produces the 
following effects:


1) the people who are automatically queried for review don't show up 
(perhaps they simply ignore those automatic notifications)
2) when several people are assigned for review, each designated reviewer 
seems to hope that the other ones will be doing the work, instead of 
doing it themselves
3) contributors expect those people to show up and are therefore 
bewildered when nobody comes to review their PR


Do we want to keep CODEOWNERS? If we still think it can be beneficial, 
we should institute a policy where people who are listed in that file 
promise to respond to review requests: 1) either by doing a review 2) or 
by de-assigning themselves, and if possible pinging another core developer.


What do you think?

Regards

Antoine.


Re: [DISCUSS] UTF-8 validation

2023-07-02 Thread Antoine Pitrou




Le 02/07/2023 à 14:00, Raphael Taylor-Davies a écrit :


More an observation than an issue, but UTF-8 validation for StringArray can be 
done very efficiently by first verifying the entire buffer, and then verifying 
the offsets correspond to the start of a UTF-8 codepoint.


Caveat: null slots could potentially contain invalid UTF-8 data. Not 
likely of course, but it should probably not be an error.


That said, yes, it is a smart strategy for the common case!

Regards

Antoine.
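
A hedged sketch of that two-step idea (illustrative only, not arrow-rs or
Arrow C++ code): a structural whole-buffer check first, then a cheap
offset-boundary check. The simplified validator below skips overlong and
surrogate corner cases and does not check offset monotonicity; per the caveat
above, a whole-buffer check also rejects invalid bytes sitting in null slots.

```
// Sketch of the two-step validation: (1) validate the whole value buffer as
// UTF-8 once, (2) check every offset lands on a codepoint boundary, i.e. not
// on a continuation byte (0b10xxxxxx).
#include <cstddef>
#include <cstdint>
#include <string_view>

// Simplified UTF-8 check: structural validation only.
inline bool LooksLikeUtf8(std::string_view s) {
  size_t i = 0;
  while (i < s.size()) {
    const uint8_t b = static_cast<uint8_t>(s[i]);
    size_t trailing;
    if (b < 0x80) trailing = 0;
    else if ((b & 0xE0) == 0xC0) trailing = 1;
    else if ((b & 0xF0) == 0xE0) trailing = 2;
    else if ((b & 0xF8) == 0xF0) trailing = 3;
    else return false;                          // invalid lead byte
    if (i + trailing >= s.size()) return false; // truncated sequence
    for (size_t j = 1; j <= trailing; ++j) {
      if ((static_cast<uint8_t>(s[i + j]) & 0xC0) != 0x80) return false;
    }
    i += trailing + 1;
  }
  return true;
}

// One pass over the values, then a cheap pass over the length + 1 offsets
// (as in the Arrow variable-size binary layout).
inline bool ValidateStringColumn(std::string_view values,
                                 const int32_t* offsets, int64_t length) {
  if (!LooksLikeUtf8(values)) return false;
  for (int64_t k = 0; k <= length; ++k) {
    const int64_t off = offsets[k];
    if (off < 0 || off > static_cast<int64_t>(values.size())) return false;
    if (off < static_cast<int64_t>(values.size()) &&
        (static_cast<uint8_t>(values[off]) & 0xC0) == 0x80) {
      return false;  // offset points into the middle of a codepoint
    }
  }
  return true;
}
```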

