Seattle Arrow meetup (adjacent to post::conf)

2024-05-29 Thread Weston Pace
I've noticed that a number of Arrow people will be in Seattle in August.  I
know there are a number of Arrow contributors that live in the Seattle area
as well.  I'd like to organize a face-to-face meetup for the Arrow
community and have created an issue for discussion[1].  I welcome any
input, feedback, or interest!

Note: as mentioned on the issue, no decisions will be made at the meetup;
this is for community building and general discussion only, and I will do my
best to make everything publicly available afterwards.

[1] https://github.com/apache/arrow/issues/41881


Re: [DISCUSS] Drop Java 8 support

2024-05-24 Thread Weston Pace
No vote is required from an ASF perspective (this is not a release).
No vote is required under Arrow conventions (this is not a spec change and
does not impact more than one implementation).

I will send a message to the parquet ML to solicit feedback.

On Fri, May 24, 2024 at 8:22 AM Laurent Goujon 
wrote:

> I would say so because it is akin to removing a large feature but maybe
> some PMC can chime in?
>
> Laurent
>
> On Tue, May 21, 2024 at 12:16 PM Dane Pitkin  wrote:
>
> > I haven't been active in Apache Parquet, but I did not see any prior
> > discussions on this topic in their Jira or dev mailing list.
> >
> > Do we think a vote is needed before officially moving forward with Java 8
> > deprecation?
> >
> > On Mon, May 20, 2024 at 12:50 PM Laurent Goujon
>  > >
> > wrote:
> >
> > > I also mentioned Apache Parquet and haven't seen someone mentioned
> > if/when
> > > Apache Parquet would transition.
> > >
> > >
> > >
> > > On Fri, May 17, 2024 at 9:07 AM Dane Pitkin 
> wrote:
> > >
> > > > Fokko, thank you for these datapoints! It's great to see how other
> low
> > > > level Java OSS projects are approaching this.
> > > >
> > > > JB, I believe yes we have formal consensus to drop Java 8 in Arrow.
> > There
> > > > was no contention in current discussions across [GitHub issues |
> Arrow
> > > > Mailing List | Community Syncs].
> > > >
> > > > We can save Java 11 deprecation for a future discussion. For users on
> > > Java
> > > > 11, I do anticipate this discussion to come shortly after Java 8
> > > > deprecation is released.
> > > >
> > > > On Fri, May 17, 2024 at 10:02 AM Fokko Driesprong 
> > > > wrote:
> > > >
> > > > > I was traveling the last few weeks, so just a follow-up from my
> end.
> > > > >
> > > > > Fokko, can you elaborate on the discussions held in other OSS
> > projects
> > > to
> > > > >> drop Java <17? How did they weigh the benefits/drawbacks for
> > dropping
> > > > both
> > > > >> Java 8 and 11 LTS versions? I'd also be curious if other projects
> > plan
> > > > to
> > > > >> support older branches with security patches.
> > > > >
> > > > >
> > > > > So, the ones that I'm involved with (including a TLDR):
> > > > >
> > > > >- Avro:
> > > > >   - (April 2024: Consensus on moving to 11+, +1 for moving to
> > 17+)
> > > > >
> > https://lists.apache.org/thread/6vbd3w5qk7mpb5lyrfyf2s0z1cymjt5w
> > > > >   - (Jan 2024: Consensus on dropping 8)
> > > > >
> > https://lists.apache.org/thread/bd39zhk655pgzfctq763vp3z4xrjpx58
> > > > >   - Iceberg:
> > > > >   - (Jan 2023: Concerns about Hive):
> > > > >
> > https://lists.apache.org/thread/hr7rdxvddw3fklfyg3dfbqbsy81hzhyk
> > > > >   - (Feb 2024: Consensus to drop Hadoop 2.x, and move to
> JDK11+,
> > > > >   also +1's for moving to 17+):
> > > > >
> > https://lists.apache.org/thread/ntrk2thvsg9tdccwd4flsdz9gg743368
> > > > >
> > > > > I think the most noteworthy (slow-moving in general):
> > > > >
> > > > >- Spark 4 supports JDK 17+
> > > > >- Hive 4 is still on Java 8
> > > > >
> > > > >
> > > > > It looks like most of the projects are looking at each other. Keep
> in
> > > > > mind that projects that still support older versions of Java can
> > > still
> > > > > use older versions of Arrow.
> > > > >
> > > > > [image: spiderman-pointing-at-spiderman.jpeg]
> > > > > (in case the image doesn't come through, that's Spiderman pointing
> at
> > > > > Spiderman)
> > > > >
> > > > > Concerning the Java 11 support, some data:
> > > > >
> > > > >- Oracle 11: support until January 2032 (extended fee has been
> > > waived)
> > > > >- Corretto 11: September 2027
> > > > >- Adoptium 11: At least Oct 2027
> > > > >- Zulu 11: Jan 2032
> > > > >- OpenJDK11: October 2024
> > > > >
> > > > > I think it is fair to support 11 for the time being, but at some
> > point,
> > > > we
> > > > > also have to move on and start exploiting the new features and make
> > > sure
> > > > > that we keep up to date. For example, Java 8 also has extended
> > support
> > > > > until 2030. Dependabot on the Iceberg project
> > > > > <
> > > >
> > >
> >
> https://github.com/apache/iceberg/pulls?q=is%3Aopen+is%3Apr+label%3Adependencies
> > > > >
> > > > > nicely shows which projects are already at JDK11+ :)
> > > > >
> > > > > Thanks Dane for driving this!
> > > > >
> > > > > Kind regards,
> > > > > Fokko
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Op vr 17 mei 2024 om 07:44 schreef Jean-Baptiste Onofré <
> > > j...@nanthrax.net
> > > > >:
> > > > >
> > > > >> Hi Dane
> > > > >>
> > > > >> Do we have a formal consensus about Java version in regards of
> arrow
> > > > >> version ?
> > > > >> I agree with the plan but just wondering if it’s ok from everyone
> > with
> > > > the
> > > > >> community.
> > > > >>
> > > > >> Regards
> > > > >> JB
> > > > >>
> > > > >> Le jeu. 16 mai 2024 à 18:05, Dane Pitkin  a
> > > écrit :
> > > > >>
> > > > >> > To 

Re: [DISCUSS] Statistics through the C data interface

2024-05-24 Thread Weston Pace
> I think what we are slowly converging on is the need for a spec to
> describe the encoding of Arrow array statistics as Arrow arrays.

This is something that has always been desired for the Arrow IPC
format too.

My preference would be (apologies if this has been mentioned before):

- Agree on how statistics should be encoded into an array (this is not
hard; we just have to agree on the field order and the data type for
null_count)
- If you need statistics in the schema, then simply encode the 1-row batch
into an IPC buffer (using the streaming format), or maybe just an IPC
RecordBatch message since the schema is fixed, and store those bytes in the
schema metadata
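
For illustration only, a minimal pyarrow sketch of that second bullet; the
statistics fields and the metadata key used here are hypothetical, not a
proposed standard:

    import pyarrow as pa

    # A hypothetical 1-row statistics batch; the field names and order are
    # illustrative only, not the agreed-upon encoding.
    stats_batch = pa.record_batch(
        {"column_name": ["x"], "min": [0], "max": [41], "null_count": [0]}
    )

    # Encode the batch with the IPC streaming format...
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, stats_batch.schema) as writer:
        writer.write_batch(stats_batch)
    stats_bytes = sink.getvalue().to_pybytes()

    # ...and store those bytes in the data schema's metadata under an
    # application-chosen key.
    data_schema = pa.schema([pa.field("x", pa.int32())]).with_metadata(
        {b"my_app:statistics": stats_bytes}
    )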



On Fri, May 24, 2024 at 1:20 AM Sutou Kouhei  wrote:

> Hi,
>
> Could you explain more about your idea? Does it propose that
> we add more callbacks to ArrowArrayStream such as
> ArrowArrayStream::get_statistics()? Or Does it propose that
> we define one more Arrow C XXX interface that wraps
> ArrowArrayStream like ArrowDeviceArray wraps ArrowArray?
>
> ArrowDeviceArray:
> https://arrow.apache.org/docs/format/CDeviceDataInterface.html
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 23 May
> 2024 06:55:40 -0700,
>   Curt Hagenlocher  wrote:
>
> >>  would it be easier to request statistics at a higher level of
> > abstraction?
> >
> > What if there were a "single table provider" level of abstraction between
> > ADBC and ArrowArrayStream as a C API; something that can report
> statistics
> > and apply simple predicates?
> >
> > On Thu, May 23, 2024 at 5:57 AM Dewey Dunnington
> >  wrote:
> >
> >> Thank you for the background! I understand that these statistics are
> >> important for query planning; however, I am not sure that I follow why
> >> we are constrained to the ArrowSchema to represent them. The examples
> >> given seem to be going through Python... would it be easier to request
> >> statistics at a higher level of abstraction? There would already need
> >> to be a separate mechanism to request an ArrowArrayStream with
> >> statistics (unless the PyCapsule `requested_schema` argument would
> >> suffice).
> >>
> >> > ADBC may be a bit larger to use only for transmitting
> >> > statistics. ADBC has statistics related APIs but it has more
> >> > other APIs.
> >>
> >> Some examples of producers given in the linked threads (Delta Lake,
> >> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> >> can implement an ADBC driver without defining all the methods (where
> >> the producer could call AdbcConnectionGetStatistics(), although
> >> AdbcStatementGetStatistics() might be more relevant here and doesn't
> >> exist). One example listed (using an Arrow Table as a source) seems a
> >> bit light to wrap in an ADBC driver; however, it would not take much
> >> code to do so and the overhead of getting the reader via ADBC is
> >> something like 100 microseconds (tested via the ADBC R package's
> >> "monkey driver" which wraps an existing stream as a statement). In any
> >> case, the bulk of the code is building the statistics array.
> >>
> >> > How about the following schema for the
> >> > statistics ArrowArray? It's based on ADBC.
> >>
> >> Whatever format for statistics is decided on, I imagine it should be
> >> exactly the same as the ADBC standard? (Perhaps pushing changes
> >> upstream if needed?).
> >>
> >> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei 
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > > Why not simply pass the statistics ArrowArray separately in your
> >> > > producer API of choice
> >> >
> >> > It seems that we should use the approach because all
> >> > feedback said so. How about the following schema for the
> >> > statistics ArrowArray? It's based on ADBC.
> >> >
> >> > | Field Name   | Field Type| Comments |
> >> > |--|---|  |
> >> > | column_name  | utf8  | (1)  |
> >> > | statistic_key| utf8 not null | (2)  |
> >> > | statistic_value  | VALUE_SCHEMA not null |  |
> >> > | statistic_is_approximate | bool not null | (3)  |
> >> >
> >> > 1. If null, then the statistic applies to the entire table.
> >> >It's for "row_count".
> >> > 2. We'll provide pre-defined keys such as "max", "min",
> >> >"byte_width" and "distinct_count" but users can also use
> >> >application specific keys.
> >> > 3. If true, then the value is approximate or best-effort.
> >> >
> >> > VALUE_SCHEMA is a dense union with members:
> >> >
> >> > | Field Name | Field Type |
> >> > |||
> >> > | int64  | int64  |
> >> > | uint64 | uint64 |
> >> > | float64| float64|
> >> > | binary | binary |
> >> >
> >> > If a column is an int32 column, it uses int64 for
> >> > "max"/"min". We don't provide all types here. Users should
> >> > use a compatible type (int64 for a int32 column) 
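
For reference, a minimal pyarrow sketch of the statistics schema quoted
above (a sketch only; the final field order and names are whatever the spec
settles on):

    import pyarrow as pa

    # Dense union of the value types listed in VALUE_SCHEMA
    value_schema = pa.dense_union([
        pa.field("int64", pa.int64()),
        pa.field("uint64", pa.uint64()),
        pa.field("float64", pa.float64()),
        pa.field("binary", pa.binary()),
    ])

    statistics_schema = pa.schema([
        # null column_name => the statistic applies to the entire table
        pa.field("column_name", pa.utf8()),
        pa.field("statistic_key", pa.utf8(), nullable=False),
        pa.field("statistic_value", value_schema, nullable=False),
        pa.field("statistic_is_approximate", pa.bool_(), nullable=False),
    ])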

Re: [VOTE] Release Apache Arrow ADBC 12 - RC4

2024-05-20 Thread Weston Pace
+1 (binding)

I also tested on Ubuntu 22.04 with USE_CONDA=1
dev/release/verify-release-candidate.sh 12 4

On Mon, May 20, 2024 at 5:20 AM David Li  wrote:

> My vote: +1 (binding)
>
> Are any other PMC members able to take a look?
>
> On Fri, May 17, 2024, at 23:36, Dewey Dunnington wrote:
> > +1 (binding)
> >
> > Tested with MacOS M1 using TEST_YUM=0 TEST_APT=0 USE_CONDA=1
> > ./verify-release-candidate.sh 12 4
> >
> > On Fri, May 17, 2024 at 9:46 AM Jean-Baptiste Onofré 
> wrote:
> >>
> >> +1 (non binding)
> >>
> >> Testing on MacOS M2.
> >>
> >> Regards
> >> JB
> >>
> >> On Wed, May 15, 2024 at 7:00 AM David Li  wrote:
> >> >
> >> > Hello,
> >> >
> >> > I would like to propose the following release candidate (RC4) of
> Apache Arrow ADBC version 12. This is a release consisting of 56 resolved
> GitHub issues [1].
> >> >
> >> > Please note that the versioning scheme has changed.  This is the 12th
> release of ADBC, and so is called version "12".  The subcomponents,
> however, are versioned independently:
> >> >
> >> > - C/C++/GLib/Go/Python/Ruby: 1.0.0
> >> > - C#: 0.12.0
> >> > - Java: 0.12.0
> >> > - R: 0.12.0
> >> > - Rust: 0.12.0
> >> >
> >> > These are the versions you will see in the source and in actual
> packages.  The next release will be "13", and the subcomponents will
> increment their versions independently (to either 1.1.0, 0.13.0, or
> 1.0.0).  At this point, there is no plan to release subcomponents
> independently from the project as a whole.
> >> >
> >> > Please note that there is a known issue when using the Flight SQL and
> Snowflake drivers at the same time on x86_64 macOS [12].
> >> >
> >> > This release candidate is based on commit:
> 50cb9de621c4d72f4aefd18237cb4b73b82f4a0e [2]
> >> >
> >> > The source release rc4 is hosted at [3].
> >> > The binary artifacts are hosted at [4][5][6][7][8].
> >> > The changelog is located at [9].
> >> >
> >> > Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [10] for how to validate a release candidate.
> >> >
> >> > See also a verification result on GitHub Actions [11].
> >> >
> >> > The vote will be open for at least 72 hours.
> >> >
> >> > [ ] +1 Release this as Apache Arrow ADBC 12
> >> > [ ] +0
> >> > [ ] -1 Do not release this as Apache Arrow ADBC 12 because...
> >> >
> >> > Note: to verify APT/YUM packages on macOS/AArch64, you must `export
> DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export
> TEST_APT=0 TEST_YUM=0`.)
> >> >
> >> > [1]:
> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+12%22+is%3Aclosed
> >> > [2]:
> https://github.com/apache/arrow-adbc/commit/50cb9de621c4d72f4aefd18237cb4b73b82f4a0e
> >> > [3]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-12-rc4/
> >> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> >> > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> >> > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> >> > [7]:
> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> >> > [8]:
> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-12-rc4
> >> > [9]:
> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-12-rc4/CHANGELOG.md
> >> > [10]:
> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> >> > [11]: https://github.com/apache/arrow-adbc/actions/runs/9089931356
> >> > [12]: https://github.com/apache/arrow-adbc/issues/1841
>


Re: [ANNOUNCE] New Arrow committer: Dane Pitkin

2024-05-07 Thread Weston Pace
Congrats Dane!

On Tue, May 7, 2024, 7:30 AM Nic Crane  wrote:

> Congrats Dane, well deserved!
>
> On Tue, 7 May 2024 at 15:16, Gang Wu  wrote:
> >
> > Congratulations Dane!
> >
> > Best,
> > Gang
> >
> > On Tue, May 7, 2024 at 10:12 PM Ian Cook  wrote:
> >
> > > Congratulations Dane!
> > >
> > > On Tue, May 7, 2024 at 10:10 AM Alenka Frim  > > .invalid>
> > > wrote:
> > >
> > > > Yay, congratulations Dane!!
> > > >
> > > > On Tue, May 7, 2024 at 4:00 PM Rok Mihevc 
> wrote:
> > > >
> > > > > Congrats Dane!
> > > > >
> > > > > Rok
> > > > >
> > > > > On Tue, May 7, 2024 at 3:57 PM wish maple 
> > > > wrote:
> > > > >
> > > > > > Congrats!
> > > > > >
> > > > > > Best,
> > > > > > Xuwei Fu
> > > > > >
> > > > > > Joris Van den Bossche 
> 于2024年5月7日周二
> > > > > 21:53写道:
> > > > > >
> > > > > > > On behalf of the Arrow PMC, I'm happy to announce that Dane
> Pitkin
> > > > has
> > > > > > > accepted an invitation to become a committer on Apache Arrow.
> > > > Welcome,
> > > > > > > and thank you for your contributions!
> > > > > > >
> > > > > > > Joris
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>


Re: [Discuss] Extension types based on canonical extension types?

2024-04-30 Thread Weston Pace
I think "inheritance" and "composition" are more concerns for
implementations than they are for spec (I could be wrong here).

So it seems that it would be sufficient to write the HLLSKETCH's canonical
definition as "this is an extension of the JSON logical type and supports
all the same storage types" and then allow implementations to use whatever
inheritance / composition scheme they want to behind the scenes.
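
As a rough illustration of what that could look like on the implementation
side, here is a minimal pyarrow sketch; the type name "myorg.hllsketch" and
the utf8 storage choice are hypothetical:

    import pyarrow as pa

    # Hypothetical HLLSKETCH extension type that reuses one of the JSON
    # extension's storage types (utf8); the storage type remains a core Arrow
    # type, so the IPC extension metadata scheme is unaffected.
    class HllSketchType(pa.ExtensionType):
        def __init__(self):
            super().__init__(pa.utf8(), "myorg.hllsketch")

        def __arrow_ext_serialize__(self):
            return b""  # no parameters

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            return HllSketchType()

    pa.register_extension_type(HllSketchType())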

On Tue, Apr 30, 2024 at 7:47 AM Matt Topol  wrote:

> I think the biggest blocker to doing this is the way that we pass extension
> types through IPC. Extension types are sent as their underlying storage
> type with metadata key-value pairs of specific keys "ARROW:extension:name"
> and "ARROW:extension:metadata". Since you can't have multiple values for
> the same key in the metadata, this would prevent the ability to define an
> extension type in terms of another extension type as you wouldn't be able
> to include the metadata for the second-level extension part.
>
> i.e. you'd be able to have "ARROW:extension:name" => "HLLSKETCH", but you
> wouldn't be able to *also* have "ARROW:extension:name" => "JSON" for its
> storage type. So the storage type needs to be a valid core Arrow data type
> for this reason.
>
> On Tue, Apr 30, 2024 at 10:16 AM Ian Cook  wrote:
>
> > The vote on adding a JSON canonical extension type [1] got me wondering:
> Is
> > it possible to define an extension type that is based on a canonical
> > extension type? If so, how?
> >
> > For example, say I wanted to define a (non-canonical) HLLSKETCH extension
> > type that corresponds to the type that Redshift uses for HyperLogLog
> > sketches and is represented as JSON [2]. Is there a way to do this by
> > building on the JSON canonical extension type?
> >
> > [1] https://lists.apache.org/thread/4dw3dnz6rjp5wz2240mn299p51d5tvtq
> > [2] https://docs.aws.amazon.com/redshift/latest/dg/r_HLLSKTECH_type.html
> >
> > Ian
> >
>


Re: [VOTE][Format] UUID canonical extension type

2024-04-30 Thread Weston Pace
+1 (binding)

On Tue, Apr 30, 2024 at 7:53 AM Rok Mihevc  wrote:

> Thanks for all the reviews and comments! I've included the big-endian
> requirement so the proposed language is now as below.
> I'll leave the vote open until after the May holiday.
>
> Rok
>
> UUID
> 
>
> * Extension name: `arrow.uuid`.
>
> * The storage type of the extension is ``FixedSizeBinary`` with a length of
> 16 bytes.
>
> .. note::
>A specific UUID version is not required or guaranteed. This extension
> represents
>UUIDs as FixedSizeBinary(16) *with big-endian notation* and does not
> interpret the bytes in any way.
>


Re: [VOTE][Format] JSON canonical extension type

2024-04-30 Thread Weston Pace
+1 (binding)

I agree we should be explicit about RFC-8259

On Mon, Apr 29, 2024 at 4:46 PM David Li  wrote:

> +1 (binding)
>
> assuming we explicitly state RFC-8259
>
> On Tue, Apr 30, 2024, at 08:02, Matt Topol wrote:
> > +1 (binding)
> >
> > On Mon, Apr 29, 2024 at 5:36 PM Ian Cook  wrote:
> >
> >> +1 (non-binding)
> >>
> >> I added a comment in the PR suggesting that we explicitly refer to
> RFC-8259
> >> in CanonicalExtensions.rst.
> >>
> >> On Mon, Apr 29, 2024 at 1:21 PM Micah Kornfield 
> >> wrote:
> >>
> >> > +1, I added a comment to the PR because I think we should recommend
> >> > implementations specifically reject parsing Binary arrays with the
> >> > annotation in-case we want to support non-UTF8 encodings in the future
> >> > (even though IIRC these aren't really JSON spec compliant).
> >> >
> >> > On Fri, Apr 19, 2024 at 1:24 PM Rok Mihevc 
> wrote:
> >> >
> >> > > Hi all,
> >> > >
> >> > > Following discussions [1][2] and preliminary implementation work (by
> >> > > Pradeep Gollakota) [3] I would like to propose a vote to add
> language
> >> for
> >> > > JSON canonical extension type to CanonicalExtensions.rst as in PR
> [4]
> >> and
> >> > > written below.
> >> > > A draft C++ implementation PR can be seen here [3].
> >> > >
> >> > > [1]
> https://lists.apache.org/thread/p3353oz6lk846pnoq6vk638tjqz2hm1j
> >> > > [2]
> https://lists.apache.org/thread/7xph3476g9rhl9mtqvn804fqf5z8yoo1
> >> > > [3] https://github.com/apache/arrow/pull/13901
> >> > > [4] https://github.com/apache/arrow/pull/41257 <- proposed change
> >> > >
> >> > >
> >> > > The vote will be open for at least 72 hours.
> >> > >
> >> > > [ ] +1 Accept this proposal
> >> > > [ ] +0
> >> > > [ ] -1 Do not accept this proposal because...
> >> > >
> >> > >
> >> > > JSON
> >> > > 
> >> > >
> >> > > * Extension name: `arrow.json`.
> >> > >
> >> > > * The storage type of this extension is ``StringArray`` or
> >> > >   ``LargeStringArray`` or ``StringViewArray``.
> >> > >   Only UTF-8 encoded JSON is supported.
> >> > >
> >> > > * Extension type parameters:
> >> > >
> >> > >   This type does not have any parameters.
> >> > >
> >> > > * Description of the serialization:
> >> > >
> >> > >   Metadata is either an empty string or a JSON string with an empty
> >> > object.
> >> > >   In the future, additional fields may be added, but they are not
> >> > required
> >> > >   to interpret the array.
> >> > >
> >> > >
> >> > >
> >> > > Rok
> >> > >
> >> >
> >>
>


Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-04-24 Thread Weston Pace
> > > In the libcudf situation, it came up with what happens if you pass a non-struct
> > > column to the from_arrow_device method which returns a cudf::table?
> > Should
> > > it error, or should it create a table with a single column?
> >
> > Presumably it should just error? I can see this being ambiguous if there
> > were an API that dynamically returned either a table or a column based on
> > the input shape (where before it would be less ambiguous since you'd
> > explicitly pass pa.RecordBatch or pa.Array, and now it would be ambiguous
> > since you only pass ArrowDeviceArray). But it doesn't sound like that's
> the
> > case?
> >
> > On Tue, Apr 23, 2024, at 11:15, Weston Pace wrote:
> > > I tend to agree with Dewey.  Using run-end-encoding to represent a
> scalar
> > > is clever and would keep the c data interface more compact.  Also, a
> > struct
> > > array is a superset of a record batch (assuming the metadata is kept in
> > the
> > > schema).  Consumers should always be able to deserialize into a struct
> > > array and then downcast to a record batch if that is what they want to
> do
> > > (raising an error if there happen to be nulls).
> > >
> > >> Depending on the function in question, it could be valid to pass a
> > struct
> > >> column vs a record batch with different results.
> > >
> > > Are there any concrete examples where this is the case?  The closest
> > > example I can think of is something like the `drop_nulls` function,
> > which,
> > > given a record batch, would choose to drop rows where any column is
> null
> > > and, given an array, only drops rows where the top-level struct is
> null.
> > > However, it might be clearer to just give the two functions different
> > names
> > > anyways.
> > >
> > > On Mon, Apr 22, 2024 at 1:01 PM Dewey Dunnington
> > >  wrote:
> > >
> > >> Thank you for the background!
> > >>
> > >> I still wonder if these distinctions are the responsibility of the
> > >> ArrowSchema to communicate (although perhaps links to the specific
> > >> discussions would help highlight use-cases that I am not envisioning).
> > >> I think these distinctions are definitely important in the contexts
> > >> you mentioned; however, I am not sure that the FFI layer is going to
> > >> be helpful.
> > >>
> > >> > In the libcudf situation, it came up with what happens if you pass a
> > >> non-struct
> > >> > column to the from_arrow_device method which returns a cudf::table?
> > >> Should
> > >> > it error, or should it create a table with a single column?
> > >>
> > >> I suppose that I would have expected two functions (one to create a
> > >> table and one to create a column). As a consumer I can't envision a
> > >> situation where I would want to import an ArrowDeviceArray but where I
> > >> would want some piece of run-time information to decide what the
> > >> return type of the function would be? (With apologies if I am missing
> > >> a piece of the discussion).
> > >>
> > >> > If A and B have different lengths, this is invalid
> > >>
> > >> I believe several array implementations (e.g., numpy, R) are able to
> > >> broadcast/recycle a length-1 array. Run-end-encoding is also an option
> > >> that would make that broadcast explicit without expanding the scalar.
> > >>
> > >> > Depending on the function in question, it could be valid to pass a
> > >> struct column vs a record batch with different results.
> > >>
> > >> If this is an important distinction for an FFI signature of a UDF,
> > >> there would probably be a struct definition for the UDF where there
> > >> would be an opportunity to make this distinction (and perhaps others
> > >> that are relevant) without loading this concept onto the existing
> > >> structs.
> > >>
> > >> > If no flags are set, then the behavior shouldn't change
> > >> > from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is set,
> then
> > it
> > >> > should error unless calling ImportRecordBatch.
> > >>
> > >> I am not sure I would have expected that (since a struct array has an
> > >> unambiguous interpretation as a record batch and as a user I've very
> > >> explicitly decided that I want one, since I'm using that function).
> > >>
> > >

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Weston Pace
> *As per Apache Parquet Community Parquet V2 is not final yet so it is not
> official . They are advising not to use Parquet V2 for writing (though
code
> is available ) .*

This would be news to me.  Parquet releases are listed (by the parquet
community) at [1]

The vote to release parquet 2.10 is here: [2]

Neither of these links mention anything about this being an experimental,
unofficial, or non-finalized release.

I understand your concern.  I believe your quotes are coming from your
discussion on the parquet mailing list here [3].  This communication is
unfortunate and confusing to me as well.

[1] https://parquet.apache.org/blog/
[2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
[3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
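
For anyone landing here with the same question: the Parquet format version
is just a writer option in pyarrow. A minimal sketch:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 2, 3]})

    # "2.6" is the current default; "1.0" restricts the writer to the widely
    # supported subset of logical types for maximum compatibility.
    pq.write_table(table, "data_default.parquet")               # version="2.6"
    pq.write_table(table, "data_compat.parquet", version="1.0")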


On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo  wrote:

> Hello Jacob,
> Thanks for the information, and my apologies for the weird format of my
> email.
>
> This is the email from the Parquet community. May I know why pyarrow is
> using Parquet V2 which is not official yet ?
>
> My question is from Parquet community V2 is not final yet so it is not
> official yet.
> "Hi Prem - Maybe I can help clarify to the best of my knowledge. Parquet V2
> as a standard isn't finalized just yet. Meaning there is no formal,
> *finalized* "contract" that specifies what it means to write data in the V2
> version. The discussions/conversations about what the final V2 standard may
> be are still in progress and are evolving.
>
> That being said, because V2 code does exist (though unfinalized), there are
> clients / tools that are writing data in the un-finalized V2 format, as
> seems to be the case with Dremio.
>
> Now, as that comment you quoted said, you can have Spark write V2 files,
> but it's worth being mindful about the fact that V2 is a moving target and
> can (and likely will) change. You can overwrite parquet.writer.version to
> specify your desired version, but it can be dangerous to produce data in a
> moving-target format. For example, let's say you write a bunch of data in
> Parquet V2, and then the community decides to make a breaking change (which
> is completely fine / allowed since V2 isn't finalized). You are now left
> having to deal with a potentially large and complicated file format update.
> That's why it's not recommended to write files in parquet v2 just yet."
>
>
> *As per Apache Parquet Community Parquet V2 is not final yet so it is not
> official . They are advising not to use Parquet V2 for writing (though code
> is available ) .*
>
>
> *As per above Spark hasn't started using Parquet V2 for writing *.
>
> May I know how an unstable /unofficial  version is being used in pyarrow ?
>
>
> On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak 
> wrote:
>
> > Hello,
> >
> > First off, please try to clean up formating of emails to be legible when
> > forwarding/quoting previous messages multiple times, especially when most
> > of the quotes do not contain any useful information. It makes it much
> > easier to parse the message and thus quicker to answer.
> >
> > The short answer is that we switched to 2.4 and more recently to 2.6 as
> > the default to enable the usage of features these versions provide. As
> you
> > have correctly quoted from the docs you can still write 1.0 if you want
> to
> > ensure compatibility with systems that can not process the 'newer'
> versions
> > yet (2.6 was released in 2018!).
> >
> > You can find the long form discussions about these changes here:
> > https://issues.apache.org/jira/browse/ARROW-12203
> > https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
> >
> > Best
> > Jacob
> >
> > On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > > Hello Team,
> > > Could you please share your thoughts about below questions?
> > > Sent from my iPhone
> > >
> > > Begin forwarded message:
> > >
> > > > From: Prem Sahoo 
> > > > Date: April 23, 2024 at 11:03:48 AM EDT
> > > > To: dev-ow...@arrow.apache.org
> > > > Subject: Re: PyArrow Using Parquet V2
> > > >
> > > > dev@arrow.apache.org
> > > > Sent from my iPhone
> > > >
> > > >>> On Apr 23, 2024, at 6:25 AM, Prem Sahoo 
> > wrote:
> > > >>>
> > > >> Hello Team,
> > > >> Could anyone please help me on below query?
> > > >> Sent from my iPhone
> > > >>
> > >  On Apr 22, 2024, at 10:01 PM, Prem Sahoo 
> > wrote:
> > > 
> > > >>> 
> > > >>> Sent from my iPhone
> > > >>>
> > > > On Apr 22, 2024, at 9:51 PM, Prem Sahoo 
> > wrote:
> > > >
> > >  
> > > 
> > > >
> > > > 
> > > > Hello Team,
> > > > I have a question regarding Parquet V2 writing thro pyarrow .
> > > > As per below Pyarrow started writing Parquet in V2 encoding.
> > > >
> >
> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > > >
> > > > version{“1.0”, “2.4”, “2.6”}, default “2.6”
> > > > Determine which Parquet logical types are available for use,
> > whether the reduced set 

Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-04-22 Thread Weston Pace
I tend to agree with Dewey.  Using run-end-encoding to represent a scalar
is clever and would keep the c data interface more compact.  Also, a struct
array is a superset of a record batch (assuming the metadata is kept in the
schema).  Consumers should always be able to deserialize into a struct
array and then downcast to a record batch if that is what they want to do
(raising an error if there happen to be nulls).
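
To make those two points concrete, a small pyarrow sketch (assuming a
pyarrow build with run-end-encoding support):

    import pyarrow as pa

    # A scalar broadcast over 5 rows, represented as a run-end encoded array
    # instead of 5 materialized copies.
    broadcast_scalar = pa.RunEndEncodedArray.from_arrays(
        pa.array([5], pa.int32()),   # run ends
        pa.array([42], pa.int64()),  # values
    )
    assert len(broadcast_scalar) == 5

    # A struct array can always be viewed as a record batch (and downcast back)
    struct_arr = pa.array([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])
    rb = pa.RecordBatch.from_struct_array(struct_arr)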

> Depending on the function in question, it could be valid to pass a struct
> column vs a record batch with different results.

Are there any concrete examples where this is the case?  The closest
example I can think of is something like the `drop_nulls` function, which,
given a record batch, would choose to drop rows where any column is null
and, given an array, only drops rows where the top-level struct is null.
However, it might be clearer to just give the two functions different names
anyways.

On Mon, Apr 22, 2024 at 1:01 PM Dewey Dunnington
 wrote:

> Thank you for the background!
>
> I still wonder if these distinctions are the responsibility of the
> ArrowSchema to communicate (although perhaps links to the specific
> discussions would help highlight use-cases that I am not envisioning).
> I think these distinctions are definitely important in the contexts
> you mentioned; however, I am not sure that the FFI layer is going to
> be helpful.
>
> > In the libcudf situation, it came up with what happens if you pass a
> non-struct
> > column to the from_arrow_device method which returns a cudf::table?
> Should
> > it error, or should it create a table with a single column?
>
> I suppose that I would have expected two functions (one to create a
> table and one to create a column). As a consumer I can't envision a
> situation where I would want to import an ArrowDeviceArray but where I
> would want some piece of run-time information to decide what the
> return type of the function would be? (With apologies if I am missing
> a piece of the discussion).
>
> > If A and B have different lengths, this is invalid
>
> I believe several array implementations (e.g., numpy, R) are able to
> broadcast/recycle a length-1 array. Run-end-encoding is also an option
> that would make that broadcast explicit without expanding the scalar.
>
> > Depending on the function in question, it could be valid to pass a
> struct column vs a record batch with different results.
>
> If this is an important distinction for an FFI signature of a UDF,
> there would probably be a struct definition for the UDF where there
> would be an opportunity to make this distinction (and perhaps others
> that are relevant) without loading this concept onto the existing
> structs.
>
> > If no flags are set, then the behavior shouldn't change
> > from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is set, then it
> > should error unless calling ImportRecordBatch.
>
> I am not sure I would have expected that (since a struct array has an
> unambiguous interpretation as a record batch and as a user I've very
> explicitly decided that I want one, since I'm using that function).
>
> In the other direction, I am not sure a producer would be able to set
> these flags without breaking backwards compatibility with earlier
> producers that did not set them (since earlier threads have suggested
> that it is good practice to error when an unsupported flag is
> encountered).
>
> On Sun, Apr 21, 2024 at 6:16 PM Matt Topol  wrote:
> >
> > First, I forgot a flag in my examples. There should also be an
> > ARROW_FLAG_SCALAR too!
> >
> > The motivation for this distinction came up from discussions during
> adding
> > support for ArrowDeviceArray to libcudf in order to better indicate the
> > difference between a cudf::table and a cudf::column which are handled
> quite
> > differently. This also relates to the fact that we currently need
> external
> > context like the explicit ImportArray() and ImportRecordBatch() functions
> > since we can't determine which a given ArrowArray is on its own. In the
> > libcudf situation, it came up with what happens if you pass a non-struct
> > column to the from_arrow_device method which returns a cudf::table?
> Should
> > it error, or should it create a table with a single column?
> >
> > The other motivation for this distinction is with UDFs in an engine that
> > uses the C data interface. When dealing with queries and engines, it
> > becomes important to be able to distinguish between a record batch, a
> > column and a scalar. For example, take the expression A + B:
> >
> > If A and B have different lengths, this is invalid, unless one of
> them
> > is a Scalar. This is because Scalars are broadcastable, columns are not.
> >
> > Depending on the function in question, it could be valid to pass a struct
> > column vs a record batch with different results. It also resolves some
> > ambiguity for UDFs and processing. For instance, given a single
> ArrowArray
> > of length 1, which is a struct: Is that a Struct 

Re: Unsupported/Other Type

2024-04-17 Thread Weston Pace
> people generally find use in Arrow schemas independently of concrete data.

This makes sense.  I think we do want to encourage use of Arrow as a "type
system" even if there is no data involved.  And, given that we cannot
easily change a field's data type property to "optional", it makes sense to
use a dedicated type, and so I would be in favor of such a proposal (we
may eventually add an "unknown type" concept in Substrait as well; it has
come up several times, and we could use this in that context).

I think that I would still prefer a canonical extension type (with storage
type null) over a new dedicated type.

On Wed, Apr 17, 2024 at 5:39 AM Antoine Pitrou  wrote:

>
> Ah! Well, I think this could be an interesting proposal, but someone
> should put a more formal proposal, perhaps as a draft PR.
>
> Regards
>
> Antoine.
>
>
> Le 17/04/2024 à 11:57, David Li a écrit :
> > For an unsupported/other extension type.
> >
> > On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote:
> >> What is "this proposal"?
> >>
> >>
> >> Le 17/04/2024 à 10:38, David Li a écrit :
> >>> Should I take it that this proposal is dead in the water? While we
> could define our own Unknown/Other type for say the ADBC PostgreSQL driver
> it might be useful to have a singular type for consumers to latch on to.
> >>>
> >>> On Fri, Apr 12, 2024, at 07:32, David Li wrote:
>  I think an "Other" extension type is slightly different than an
>  arbitrary extension type, though: the latter may be understood
>  downstream but the former represents a point at which a component
>  explicitly declares it does not know how to handle a field. In this
>  example, the PostgreSQL ADBC driver might be able to provide a
>  representation regardless, but a different driver (or say, the JDBC
>  adapter, which cannot necessarily get a bytestring for an arbitrary
>  JDBC type) may want an Other type to signal that it would fail if
> asked
>  to provide particular columns.
> 
>  On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:
> > Depending where your Arrow-encoded data is used, either extension
> > types or generic field metadata are options. We have this problem in
> > the ADBC Postgres driver, where we can convert *most* Postgres types
> > to an Arrow type but there are some others where we can't or don't
> > know or don't implement a conversion. Currently for these we return
> > opaque binary (the Postgres COPY representation of the value) but put
> > field metadata so that a consumer can implement a workaround for an
> > unsupported type. It would be arguably better to have implemented
> this
> > as an extension type; however, field metadata felt like less of a
> > commitment when I first worked on this.
> >
> > Cheers,
> >
> > -dewey
> >
> > On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
> >  wrote:
> >>
> >> I was using UUID as an example. It looks like extension types
> covers my original request.
> >> 
> >> From: Felipe Oliveira Carvalho 
> >> Sent: Thursday, April 11, 2024 7:15 AM
> >> To: dev@arrow.apache.org 
> >> Subject: Re: Unsupported/Other Type
> >>
> >> The OP used UUID as an example. Would that be enough or the request
> is for
> >> a flexible mechanism that allows the creation of one-off nominal
> types for
> >> very specific use-cases?
> >>
> >> —
> >> Felipe
> >>
> >> On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou 
> wrote:
> >>
> >>>
> >>> Yes, JSON and UUID are obvious candidates for new canonical
> extension
> >>> types. XML also comes to mind, but I'm not sure there's much of a
> use
> >>> case for it.
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>> Le 10/04/2024 à 22:55, Wes McKinney a écrit :
>  In the past we have discussed adding a canonical type for UUID
> and JSON.
> >>> I
>  still think this is a good idea and could improve ergonomics in
> >>> downstream
>  language bindings (e.g. by exposing JSON querying function or
> >>> automatically
>  boxing UUIDs in built-in UUID types, like the Python uuid
> library). Has
>  anyone done any work on this to anyone's knowledge?
> 
>  On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <
> emkornfi...@gmail.com>
>  wrote:
> 
> > Hi Norman,
> > Arrow has a concept of extension types [1] along with the
> possibility of
> > proposing new canonical extension types [2].  This seems to
> cover the
> > use-cases you mention but I might be misunderstanding?
> >
> > Thanks,
> > Micah
> >
> > [1]
> >
> >
> >>>
> https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
> > [2]
> 

Re: Unsupported/Other Type

2024-04-17 Thread Weston Pace
> may want an Other type to signal that it would fail if asked to provide
particular columns.

I interpret "would fail" to mean we are still speaking in some kind of
"planning stage" and not yet actually creating arrays.  So I don't know
that this needs to be a data type.  In other words, I see this as
`std::optional` and not a unique instance of `DataType`.

However, if you did need to actually create an array, and you wanted some
way of saying "there is no data here because I failed to interpret the
type" then maybe you could create an extension type based on the null type?

On Wed, Apr 17, 2024 at 2:57 AM David Li  wrote:

> For an unsupported/other extension type.
>
> On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote:
> > What is "this proposal"?
> >
> >
> > Le 17/04/2024 à 10:38, David Li a écrit :
> >> Should I take it that this proposal is dead in the water? While we
> could define our own Unknown/Other type for say the ADBC PostgreSQL driver
> it might be useful to have a singular type for consumers to latch on to.
> >>
> >> On Fri, Apr 12, 2024, at 07:32, David Li wrote:
> >>> I think an "Other" extension type is slightly different than an
> >>> arbitrary extension type, though: the latter may be understood
> >>> downstream but the former represents a point at which a component
> >>> explicitly declares it does not know how to handle a field. In this
> >>> example, the PostgreSQL ADBC driver might be able to provide a
> >>> representation regardless, but a different driver (or say, the JDBC
> >>> adapter, which cannot necessarily get a bytestring for an arbitrary
> >>> JDBC type) may want an Other type to signal that it would fail if asked
> >>> to provide particular columns.
> >>>
> >>> On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:
>  Depending where your Arrow-encoded data is used, either extension
>  types or generic field metadata are options. We have this problem in
>  the ADBC Postgres driver, where we can convert *most* Postgres types
>  to an Arrow type but there are some others where we can't or don't
>  know or don't implement a conversion. Currently for these we return
>  opaque binary (the Postgres COPY representation of the value) but put
>  field metadata so that a consumer can implement a workaround for an
>  unsupported type. It would be arguably better to have implemented this
>  as an extension type; however, field metadata felt like less of a
>  commitment when I first worked on this.
> 
>  Cheers,
> 
>  -dewey
> 
>  On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
>   wrote:
> >
> > I was using UUID as an example. It looks like extension types covers
> my original request.
> > 
> > From: Felipe Oliveira Carvalho 
> > Sent: Thursday, April 11, 2024 7:15 AM
> > To: dev@arrow.apache.org 
> > Subject: Re: Unsupported/Other Type
> >
> > The OP used UUID as an example. Would that be enough or the request
> is for
> > a flexible mechanism that allows the creation of one-off nominal
> types for
> > very specific use-cases?
> >
> > —
> > Felipe
> >
> > On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou 
> wrote:
> >
> >>
> >> Yes, JSON and UUID are obvious candidates for new canonical
> extension
> >> types. XML also comes to mind, but I'm not sure there's much of a
> use
> >> case for it.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> Le 10/04/2024 à 22:55, Wes McKinney a écrit :
> >>> In the past we have discussed adding a canonical type for UUID and
> JSON.
> >> I
> >>> still think this is a good idea and could improve ergonomics in
> >> downstream
> >>> language bindings (e.g. by exposing JSON querying function or
> >> automatically
> >>> boxing UUIDs in built-in UUID types, like the Python uuid
> library). Has
> >>> anyone done any work on this to anyone's knowledge?
> >>>
> >>> On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <
> emkornfi...@gmail.com>
> >>> wrote:
> >>>
>  Hi Norman,
>  Arrow has a concept of extension types [1] along with the
> possibility of
>  proposing new canonical extension types [2].  This seems to cover
> the
>  use-cases you mention but I might be misunderstanding?
> 
>  Thanks,
>  Micah
> 
>  [1]
> 
> 
> >>
> https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
>  [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
> 
>  On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
>   wrote:
> 
> > Problem Description
> >
> > Currently Arrow schemas can only contain columns of types
> supported by
> > Arrow. In some cases an Arrow schema maps to an external schema.
> This
> >> can
> > 

Re: [ANNOUNCE] New Arrow committer: Sarah Gilmore

2024-04-11 Thread Weston Pace
Congratulations!

On Thu, Apr 11, 2024 at 9:12 AM wish maple  wrote:

> Congrats!
>
> Best,
> Xuwei Fu
>
> Kevin Gurney  于2024年4月11日周四 23:22写道:
>
> > Congratulations, Sarah!! Well deserved!
> > 
> > From: Jacob Wujciak 
> > Sent: Thursday, April 11, 2024 11:14 AM
> > To: dev@arrow.apache.org 
> > Subject: Re: [ANNOUNCE] New Arrow committer: Sarah Gilmore
> >
> > Congratulations and welcome!
> >
> > Am Do., 11. Apr. 2024 um 17:11 Uhr schrieb Raúl Cumplido <
> > rau...@apache.org
> > >:
> >
> > > Congratulations Sarah!
> > >
> > > El jue, 11 abr 2024 a las 13:13, Sutou Kouhei ()
> > > escribió:
> > > >
> > > > Hi,
> > > >
> > > > On behalf of the Arrow PMC, I'm happy to announce that Sarah
> > > > Gilmore has accepted an invitation to become a committer on
> > > > Apache Arrow. Welcome, and thank you for your contributions!
> > > >
> > > > Thanks,
> > > > --
> > > > kou
> > >
> >
>


Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-08 Thread Weston Pace
> Probably major versions should match between C++ and PyArrow, but I guess
> we could have diverging minor and patch versions. Or at least patch
> versions given that
> a new minor version is usually cut for bug fixes too.

I believe even this would be difficult.  Stable ABIs are very finicky in
C++.  If the public API surface changes in any way then it can lead to
subtle bugs if pyarrow were to link against an older version.  I also am
not sure there is much advantage in trying to separate pyarrow from
arrow-cpp since they are almost always changing in lockstep (e.g. any
change to arrow-cpp enables functionality in pyarrow).

I think we should maybe focus on a few more obvious cases.

I think C#, JS, Java, and Go are the most obvious candidates to decouple.
Even then, we should probably only separate these candidates if they have
willing release managers.

C/GLib, python, and ruby are all tightly coupled to C++ at the moment and
should not be a first priority.  I would have guessed that R is also in
this list but Jacob reported in the original email that they are already
somewhat decoupled?

I don't know anything about swift or matlab.

On Mon, Apr 8, 2024 at 6:23 AM Alessandro Molina
 wrote:

> On Sun, Apr 7, 2024 at 3:06 PM Andrew Lamb  wrote:
>
> >
> > We have had separate releases / votes for Arrow Rust (and Arrow
> DataFusion)
> > and it has served us quite well. The version schemes have diverged
> > substantially from the monorepo (we are on version 51.0.0 in arrow-rs,
> for
> > example) and it doesn't seem to have caused any large confusion with
> users
> >
> >
> I think that versioning will require additional thinking for libraries like
> PyArrow, Java etc...
> For rust this is a non problem because there is no link to the C++ library,
>
> PyArrow instead is based on what the C++ library provides,
> so there is a direct link between the features provided by C++ in a
> specific version
> and the features provided in PyArrow at a specific version.
>
> More or less PyArrow 20 should have the same bug fixes that C++ 20 has,
> and diverging the two versions would lead to confusion easily.
> Probably major versions should match between C++ and PyArrow, but I guess
> we could have diverging minor and patch versions. Or at least patch
> versions given that
> a new minor version is usually cut for bug fixes too.
>


Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-04-02 Thread Weston Pace
Forgot link:

[1]
https://developer.mozilla.org/en-US/docs/WebAssembly/JavaScript_interface/Memory

On Tue, Apr 2, 2024 at 11:38 AM Weston Pace  wrote:

> Thanks for taking the time to address my concerns.
>
> > I've split the S3/HTTP URI flight pieces out into a separate document and
> > separate thing to vote on at the request of several people who wanted to
> > view these as two separate proposals to vote on. So this vote *only*
> covers
> > adopting the protocol spec as an "Experimental Protocol" so we can start
> > seeing real world usage to help refine and improve it. That said, I
> believe
> > all clients currently would reject any non-grpc URI.
>
> Ah, I was confused and my comments were mostly about the s3/http proposal.
>
> Regarding the proposal at hand, I went through it in more detail.  I don't
> know much about ucx so I considered two different use cases:
>
>  * The previously mentioned shared memory approach.  I think this is
> compelling as people have asked about shared memory communication from time
> to time and I've always suggested flight over unix sockets though that
> forces a copy.
>  * I think this could also form the basis for large transfers of arrow
> data over a wasm boundary.  Wasm has a concept of shared memory objects[1]
> and a wasm data library could use this to stream data into javascript
> without a copy.
>
> I've added a few more questions to the doc.  Either way, if we're only
> talking about an experimental protocol / suggested recommendation then I'm
> fine voting +1 on this (I'm not sure a formal vote is even needed).  I
> would want to see at least 2 implementations if we wanted to remove the
> experimental label.
>
> On Sun, Mar 31, 2024 at 2:43 PM Joel Lubinitsky 
> wrote:
>
>> +1 to the dissociated transports proposal
>>
>> On Sun, Mar 31, 2024 at 11:14 AM David Li  wrote:
>>
>> > +1 from me as before
>> >
>> > On Thu, Mar 28, 2024, at 18:06, Matt Topol wrote:
>> > >>  There is a word doc with no implementation or PR.  I think there
>> could
>> > > be an implementation / PR.
>> > >
>> > > In the word doc there is a link to a POC implementation[1] showing
>> this
>> > > protocol working with a flight service, ucx and libcudf. The key piece
>> > here
>> > > is that we're voting on adopting this protocol spec (i.e. I'll add it
>> to
>> > > the documentation website) rather than us explicitly providing full
>> > > implementations or abstractions around it. We can provide reference
>> > > implementations like the POC, but I don't think they should be in the
>> > Arrow
>> > > monorepo or else we run the risk of a lot of the same issues that
>> Flight
>> > > has: i.e. Adding anything to Flight in C++ requires fully wrapping the
>> > > grpc/flight primitives with Arrow equivalents to export which
>> increases
>> > the
>> > > maintenance burden on us and makes it more difficult for users to
>> > leverage
>> > > the underlying knobs and dials.
>> > >
>> > >> For example, does any ADBC client respect this protocol today?  If a
>> > > flight server responds with an S3/HTTP URI will the ADBC client
>> download
>> > > the files from the correct place?  Will it at least notice that the
>> URI
>> > is
>> > > not a GRPC URI and give a "I don't have a connector for downloading
>> from
>> > > HTTP/S3" error?
>> > >
>> > > I've split the S3/HTTP URI flight pieces out into a separate document
>> and
>> > > separate thing to vote on at the request of several people who wanted
>> to
>> > > view these as two separate proposals to vote on. So this vote *only*
>> > covers
>> > > adopting the protocol spec as an "Experimental Protocol" so we can
>> start
>> > > seeing real world usage to help refine and improve it. That said, I
>> > believe
>> > > all clients currently would reject any non-grpc URI.
>> > >
>> > >>   I was speaking with someone yesterday and they explained that
>> > > they ended up not choosing Flight for an internal project because
>> Flight
>> > > didn't support something called "cloud fetch" which I have now
>> learned is
>> > >
>> > > I was reading through that link, and it seems like it's pretty much
>> > > *identical* to Flight as it currently exists, except that it is using
>> > cloud
>> >

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-04-02 Thread Weston Pace
hat the google doc was easier for iterating on the
> > protocol
> > > specification than a markdown PR for the Arrow documentation as I could
> > > more visually express things without a preview of the rendered
> markdown.
> > If
> > > it would get people to be more likely to vote on this, I can write up
> the
> > > documentation markdown now and create a PR rather than waiting until we
> > > decide we're even going to adopt this protocol as an "official" arrow
> > > protocol.
> > >
> > > Lemme know if there's any other unanswered questions!
> > >
> > > --Matt
> > >
> > > [1]: https://github.com/zeroshade/cudf-flight-ucx
> > > [2]:
> > >
> >
> https://docs.google.com/document/d/1-x7tHWDzpbgmsjtTUnVXeEO4b7vMWDHTu-lzxlK9_hE/edit#heading=h.ub6lgn7s75tq
> > >
> > > On Thu, Mar 28, 2024 at 4:53 PM Weston Pace 
> > wrote:
> > >
> > >> I'm sorry for the very late reply.  Until yesterday I had no real
> > concept
> > >> of what this was talking about and so I had stayed out.
> > >>
> > >> I'm +0 only because it isn't clear what we are voting on.  There is a
> > word
> > >> doc with no implementation or PR.  I think there could be an
> > implementation
> > >> / PR.  For example, does any ADBC client respect this protocol today?
> > If a
> > >> flight server responds with an S3/HTTP URI will the ADBC client
> download
> > >> the files from the correct place?  Will it at least notice that the
> URI
> > is
> > >> not a GRPC URI and give a "I don't have a connector for downloading
> from
> > >> HTTP/S3" error?  In general, I think we do want this in Flight (see
> > >> comments below) and I am very supportive of the idea.  However, if
> > adopting
> > >> this as an experimental proposal helps move this forward then I think
> > >> that's fine.
> > >>
> > >> That being said, I do want to express support for the proposal as a
> > >> concept, at least the "disassociated transports" portion (I can't
> speak
> > to
> > >> UCX/etc.).  I was speaking with someone yesterday and they explained
> > that
> > >> they ended up not choosing Flight for an internal project because
> Flight
> > >> didn't support something called "cloud fetch" which I have now learned
> > is
> > >> [1].  I had recalled looking at this proposal before and this person
> > seemed
> > >> interested and optimistic to know this was being considered for
> Flight.
> > >> This proposal, as I understand it, should make it possible for cloud
> > >> servers to support a cloud fetch style API.  From the discussion I got
> > the
> > >> impression that this cloud fetch approach is useful and generally
> > >> applicable.
> > >>
> > >> So a big +1 for the idea of disassociated transports but I'm not sure
> > why
> > >> we need a vote to start working on it (but I'm not opposed if a vote
> > helps)
> > >>
> > >> [1]
> > >>
> > >>
> >
> https://www.databricks.com/blog/2021/08/11/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html
> > >>
> > >> On Thu, Mar 28, 2024 at 1:04 PM Matt Topol 
> > wrote:
> > >>
> > >> > I'll keep this new vote open for at least the next 72 hours. As
> before
> > >> > please reply with:
> > >> >
> > >> > [ ] +1 Accept this Proposal
> > >> > [ ] +0
> > >> > [ ] -1 Do not accept this proposal because...
> > >> >
> > >> > Thanks everyone!
> > >> >
> > >> > On Wed, Mar 27, 2024 at 7:51 PM Benjamin Kietzman <
> > bengil...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > +1
> > >> > >
> > >> > > On Tue, Mar 26, 2024, 18:36 Matt Topol 
> > wrote:
> > >> > >
> > >> > > > Should I start a new thread for a new vote? Or repeat the
> original
> > >> vote
> > >> > > > email here?
> > >> > > >
> > >> > > > Just asking since there hasn't been any responses so far.
> > >> > > >
> > >> > > > --Matt
> > >> > > >
> > >> > > > On Thu, Mar 21, 2024 a

Re: [Format][Union] polymorphic vectors vs ADT style vectors

2024-04-02 Thread Weston Pace
Wouldn't support for ADT require expressing more than 1 type id per
record?  In other words, if `put` has type id 1, `delete` has type id 2,
and `erase` has type id 3, then there is no way to express that something is (for
example) both type id 1 and type id 3 because you can only have one type id
per record.

If that understanding is correct then it seems you can always encode world
2 into world 1 by exhaustively listing out the combinations.  In other
words, `put` is the LSB, `delete` is bit 2, and `erase` is bit 3 and you
have:

7 - put/delete/erase
6 - delete/erase
5 - erase/put
4 - erase
3 - put/delete
2 - delete
1 - put
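
For reference, the ADT-style union described further down the thread
(multiple children sharing a storage type, each with its own type id) is
already expressible as a type; a small pyarrow sketch:

    import pyarrow as pa

    # Dense union where "delete" and "erase" share a storage type (int64) but
    # carry distinct type ids.
    op_type = pa.dense_union(
        [pa.field("put", pa.utf8()),
         pa.field("delete", pa.int64()),
         pa.field("erase", pa.int64())],
        type_codes=[1, 2, 3],
    )
    print(op_type)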

On Tue, Apr 2, 2024 at 4:36 AM Finn Völkel  wrote:

> I also meant Algebraic Data Type not Abstract Data Type (too many
> acronyms).
>
> On Tue, 2 Apr 2024 at 13:28, Antoine Pitrou  wrote:
>
> >
> > Thanks. The Arrow spec does support multiple union members with the same
> > type, but not all implementations do. The C++ implementation should
> > support it, though to my surprise we do not seem to have any tests for
> it.
> >
> > If the Java implementation doesn't, then you can probably open an issue
> > for it (and even submit a PR if you would like to tackle it).
> >
> > I've also opened https://github.com/apache/arrow/issues/40947 to create
> > integration tests for this.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 02/04/2024 à 13:19, Finn Völkel a écrit :
> > >> Can you explain what ADT means ?
> > >
> > > Sorry about that. ADT stands for Abstract Data Type. What do I mean by
> an
> > > ADT style vector?
> > >
> > > Let's take an example from the project I am on. We have an `op` union
> > > vector with three child vectors `put`, `delete`, `erase`. `delete` and
> > > `erase` have the same type but represent different things.
> > >
> > > On Tue, 2 Apr 2024 at 13:16, Steve Kim  wrote:
> > >
> > >> Thank you for asking this question. I have the same question.
> > >>
> > >> I noted a similar problem in the c++/python implementation:
> > >> https://github.com/apache/arrow/issues/19157#issuecomment-1528037394
> > >>
> > >> On Tue, Apr 2, 2024, 04:30 Finn Völkel  wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> my question primarily concerns the union layout described at
> > >>> https://arrow.apache.org/docs/format/Columnar.html#union-layout
> > >>>
> > >>> There are two ways to use unions:
> > >>>
> > >>> - polymorphic vectors (world 1)
> > >>> - ADT style vectors (world 2)
> > >>>
> > >>> In world 1 you have a vector that stores different types. In the ADT
> > >> world
> > >>> you could have multiple child vectors with the same type but
> different
> > >> type
> > >>> ids in the union type vector. The difference is apparent if you want
> to
> > >> use
> > >>> two BigIntVectors as children which doesn't exist in world 1. World 1
> > is
> > >> a
> > >>> subset of world 2.
> > >>>
> > >>> The spec (to my understanding) doesn’t explicitly forbid world 2, but
> > the
> > >>> implementation we have been using (Java) has been making the
> assumption
> > >> of
> > >>> being in world 1 (a union only having ONE child of each type). We
> > >> sometimes
> > >>> use union in the ADT style which has led to problems down the road.
> > >>>
> > >>> Could someone clarify what the specification allows and what it
> doesn’t
> > >>> allow? Could we tighten the specification after that clarification?
> > >>>
> > >>> Best, Finn
> > >>>
> > >>
> > >
> >
>


Re: [ANNOUNCE] New Committer Joel Lubinitsky

2024-04-01 Thread Weston Pace
Congratulations Joel!

On Mon, Apr 1, 2024 at 1:16 PM Bryce Mecum  wrote:

> Congrats, Joel!
>
> On Mon, Apr 1, 2024 at 6:59 AM Matt Topol  wrote:
> >
> > On behalf of the Arrow PMC, I'm happy to announce that Joel Lubinitsky
> has
> > accepted an invitation to become a committer on Apache Arrow. Welcome,
> and
> > thank you for your contributions!
> >
> > --Matt
>


Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-03-29 Thread Weston Pace
Thank you for bringing this up.  I'm in favor of this.  I think there are
several motivations but the main ones are:

 1. Decoupling the versions will allow components to have no release, or
only a minor release, when there are no breaking changes
 2. We do have some vote fatigue, I think, and we don't want to make that
more difficult.
 3. Anything we can do to ease the burden of release managers is good

If I understand what you are describing then I think it satisfies points 1
& 2.  I am not familiar enough with the release management process to speak
to #3.

> Voting in one thread on
> all components/a subset of components per voter and the surrounding
> technicalities is something I would like to hear some opinions on.

I am in favor of decoupling the version numbers.  I do think batched
quarterly releases are still a good thing to avoid vote fatigue.  Perhaps
we can have a single vote on a batch of version numbers (e.g. please vote
on the batched release containing CPP version X, Go version Y, JS version
Z).

> A more meta question is about the messaging that different versioning
> schemes carry, as it might no longer be obvious on first glance which
> versions are compatible or have the newest features.

I am not concerned about this.  One of the advantages of Arrow is that we
have a stable C ABI (C Data Interface) and a stable IPC mechanism (IPC
serialization) and this means that version compatibility is rarely a
difficulty or major concern.  Plus, regarding individual features, our
solution already requires a compatibility table (
https://arrow.apache.org/docs/status.html).  Changing the versioning
strategy will not make this any worse.

On Thu, Mar 28, 2024 at 1:42 PM Jacob Wujciak  wrote:

> Hello Everyone!
>
> I would like to resurface the discussion of separate
> versioning/releases/voting for monorepo components. We have previously
> touched on this topic mostly in the community meetings and spread across
> multiple, only tangential related threads. I think a focused discussion can
> be a bit more results oriented, especially now that we almost regularly
> deviate from the quarterly release cadence with minor releases. My hope is
> that discussing this and adapting our process can lower the amount of work
> required and ease the pressure on our release managers (Thank you Raúl and
> Kou!).
>
> I think the base of the topic is the separate versioning for components as
> otherwise separate releases only have limited value. From a technical
> perspective standalone implementations like Go or JS are the easiest to
> handle in that regard, they can just follow their ecosystem standards,
> which has been requested by users already (major releases in Go require
> manual editing across a code base as dependencies are usually pinned to a
> major version).
>
> For Arrow C++ bindings like Arrow R and PyArrow having distinct versions
> would require additional work to both enable the use of different versions
> and ensure version compatibility is monitored and potentially updated if
> needed.
>
> For Arrow R we have already implemented these changes for different reasons
> and have backwards compatibility with  libarrow >= 13.0.0. From a user
> standpoint of PyArrow this is likely irrelevant as most users get binary
> wheels from pypi, if a user regularly builds PyArrow from source they are
> also capable of managing potentially different libarrow version
> requirements as this is already necessary to build the package just with an
> exact version match.
>
> A more meta question is about the messaging that different versioning
> schemes carry, as it might no longer be obvious on first glance which
> versions are compatible or have the newest features. Though I would argue
> that this is a marginal concern at best, as there is no guarantee of feature
> parity between different components with the same version. Breaking that
> implicit expectation with separate versions could be seen as clearer. If a
> component only receives dependency bumps or minor bug fixes, releasing this
> component with a patch version aligns much better with expectations than a
> major version bump. In addition there are already several differently
> versioned libraries in the apache/arrow-* ecosystem that are released
> outside of the monorepo release process.  A proper support policy for each
> component would also be required but could just default to 'current major
> release' as it is now.
>
> From an ASF perspective there is no requirement to release the entire
> repository at once as the actual release artifact is the source tarball. As
> long as that is verified and voted on by the PMC it is an official release.
>
> This brings me to the release process and voting. I think it is pretty
> clear that completely decoupling all components and their release processes
> isn't feasible at the moment, mainly from a technical perspective
> (crossbow) and would likely also lead to vote fatigue. We have made efforts
> to ease the 

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-03-28 Thread Weston Pace
I'm sorry for the very late reply.  Until yesterday I had no real concept
of what this was talking about and so I had stayed out.

I'm +0 only because it isn't clear what we are voting on.  There is a word
doc with no implementation or PR.  I think there could be an implementation
/ PR.  For example, does any ADBC client respect this protocol today?  If a
Flight server responds with an S3/HTTP URI, will the ADBC client download
the files from the correct place?  Will it at least notice that the URI is
not a GRPC URI and give a "I don't have a connector for downloading from
HTTP/S3" error?  In general, I think we do want this in Flight (see
comments below) and I am very supportive of the idea.  However, if adopting
this as an experimental proposal helps move this forward then I think
that's fine.

That being said, I do want to express support for the proposal as a
concept, at least the "disassociated transports" portion (I can't speak to
UCX/etc.).  I was speaking with someone yesterday and they explained that
they ended up not choosing Flight for an internal project because Flight
didn't support something called "cloud fetch" which I have now learned is
[1].  I had recalled looking at this proposal before and this person seemed
interested and optimistic to know this was being considered for Flight.
This proposal, as I understand it, should make it possible for cloud
servers to support a cloud fetch style API.  From the discussion I got the
impression that this cloud fetch approach is useful and generally
applicable.
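
To make the shape of this concrete, here is a rough pyarrow.flight sketch of
the kind of response I mean.  The bucket URLs and dataset are made up, and I'm
assuming both that Location accepts non-gRPC URI schemes and that a client
knows to fetch them over HTTP rather than dial them as gRPC; pinning down that
behavior is exactly what the proposal is about.

```
# Rough sketch only: a Flight server answering GetFlightInfo with endpoints
# whose locations are plain HTTPS URLs to pre-staged Arrow IPC files.  The
# URLs are hypothetical, and whether a client downloads from them (rather
# than treating them as gRPC addresses) is what the dissociated-transports
# proposal needs to pin down.
import pyarrow as pa
import pyarrow.flight as flight

class CloudFetchStyleServer(flight.FlightServerBase):
    def get_flight_info(self, context, descriptor):
        schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])
        endpoints = [
            flight.FlightEndpoint(
                b"part-0",  # opaque ticket; an HTTP-fetching client may ignore it
                ["https://example-bucket.s3.amazonaws.com/results/part-0.arrows"],
            ),
            flight.FlightEndpoint(
                b"part-1",
                ["https://example-bucket.s3.amazonaws.com/results/part-1.arrows"],
            ),
        ]
        return flight.FlightInfo(schema, descriptor, endpoints, -1, -1)
```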

So a big +1 for the idea of disassociated transports but I'm not sure why
we need a vote to start working on it (but I'm not opposed if a vote helps)

[1]
https://www.databricks.com/blog/2021/08/11/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html

On Thu, Mar 28, 2024 at 1:04 PM Matt Topol  wrote:

> I'll keep this new vote open for at least the next 72 hours. As before
> please reply with:
>
> [ ] +1 Accept this Proposal
> [ ] +0
> [ ] -1 Do not accept this proposal because...
>
> Thanks everyone!
>
> On Wed, Mar 27, 2024 at 7:51 PM Benjamin Kietzman 
> wrote:
>
> > +1
> >
> > On Tue, Mar 26, 2024, 18:36 Matt Topol  wrote:
> >
> > > Should I start a new thread for a new vote? Or repeat the original vote
> > > email here?
> > >
> > > Just asking since there hasn't been any responses so far.
> > >
> > > --Matt
> > >
> > > On Thu, Mar 21, 2024 at 11:46 AM Matt Topol 
> > > wrote:
> > >
> > > > Absolutely, it will be marked experimental until we see some people
> > using
> > > > it and can get more real-world feedback.
> > > >
> > > > There's also already a couple things that will be followed-up on
> after
> > > the
> > > > initial adoption for expansion which were discussed in the comments.
> > > >
> > > > On Thu, Mar 21, 2024, 11:42 AM David Li  wrote:
> > > >
> > > >> I think let's try again. Would it be reasonable to declare this
> > > >> 'experimental' for the time being, just as we did with Flight/Flight
> > > >> SQL/etc?
> > > >>
> > > >> On Tue, Mar 19, 2024, at 15:24, Matt Topol wrote:
> > > >> > Hey All, It's been another month and we've gotten a whole bunch of
> > > >> feedback
> > > >> > and engagement on the document from a variety of individuals.
> Myself
> > > >> and a
> > > >> > few others have proactively attempted to reach out to as many
> third
> > > >> parties
> > > >> > as we could, hoping to pull more engagement also. While it would
> be
> > > >> great
> > > >> > to get even more feedback, the comments have slowed down and we
> > > haven't
> > > >> > gotten anything in a few days at this point.
> > > >> >
> > > >> > If there's no objections, I'd like to try to open up for voting
> > again
> > > to
> > > >> > officially adopt this as a protocol to add to our docs.
> > > >> >
> > > >> > Thanks all!
> > > >> >
> > > >> > --Matt
> > > >> >
> > > >> > On Sat, Mar 2, 2024 at 6:43 PM Paul Whalen 
> > > wrote:
> > > >> >
> > > >> >> Agreed that it makes sense not to focus on in-place updating for
> > this
> > > >> >> proposal.  I’m not even sure it’s a great fit as a “general
> > purpose”
> > > >> Arrow
> > > >> >> protocol, because of all the assumptions and restrictions
> required
> > as
> > > >> you
> > > >> >> noted.
> > > >> >>
> > > >> >> I took another look at the proposal and don’t think there’s
> > anything
> > > >> >> preventing in-place updating in the future - ultimately the data
> > body
> > > >> could
> > > >> >> just be in the same location for subsequent messages.
> > > >> >>
> > > >> >> Thanks!
> > > >> >> Paul
> > > >> >>
> > > >> >> On Fri, Mar 1, 2024 at 5:28 PM Matt Topol <
> zotthewiz...@gmail.com>
> > > >> wrote:
> > > >> >>
> > > >> >> > > @pgwhalen: As a potential "end user developer," (and aspiring
> > > >> >> > contributor) this
> > > >> >> > immediately excited me when I first saw it.
> > > >> >> >
> > > >> >> > Yay! Good to hear that!
> > > >> >> >
> > > >> >> > > @pgwhalen: And it wasn't clear to me whether updating batches
> > in
> > > >> 

Re: Apache Arrow Flight - From Rust to Javascript (FlightData)

2024-03-21 Thread Weston Pace
> I don't think there is currently a direct equivalent to
> `FlightRecordBatchStream` in the arrow javascript library, but you should
> be able to combine the data header + body and then read it using the
> `fromIPC` functions since it's just the Arrow IPC format

The RecordBatchReader[1] _should_ support streaming.  That being said, I
haven't personally used it in that way.  I've been doing some JS/Rust
lately but utilizing FFI and not flight.  We convert each batch into a
complete IPC file (in an in-memory buffer) and then just use
`tableFromIPC`.  We do this here[2] (in case its at all useful).

[1]
https://arrow.apache.org/docs/js/classes/Arrow_dom.RecordBatchReader.html
[2]
https://github.com/lancedb/lancedb/blob/v0.4.13/nodejs/lancedb/query.ts#L22

On Wed, Mar 20, 2024 at 9:18 AM Matt Topol 
wrote:

> I don't think there is currently a direct equivalent to
> `FlightRecordBatchStream` in the arrow javascript library, but you should
> be able to combine the data header + body and then read it using the
> `fromIPC` functions since it's just the Arrow IPC format
>
> On Fri, Mar 15, 2024 at 5:39 AM Alexander Lerche Falk 
> wrote:
>
> > Hey Arrow Dev,
> >
> > First of all: thanks for your great work.
> >
> > Is there a way to go from the FlightData structure in Javascript back to
> > Record Batches? I have a small Rust application, implementing the
> > FlightService, and streaming the data back through the do_get method.
> >
> > Maybe something equivalent to the FlightRecordBatchStream (method in
> Rust)
> > in the arrow javascript library?
> >
> > Thanks in advance and have a nice day.
> >
> >
> > Alexander Falk
> >
> > Senior Software Architect
> >
> > +45 22 95 08 64
> >
> >
> > Backstage CPH
> >
>


Re: [ANNOUNCE] New Arrow committer: Bryce Mecum

2024-03-17 Thread Weston Pace
Congratulations!

On Sun, Mar 17, 2024, 8:01 PM Jacob Wujciak  wrote:

> Congrats, well deserved!
>
> Nic Crane  schrieb am Mo., 18. März 2024, 03:24:
>
> > On behalf of the Arrow PMC, I'm happy to announce that Bryce Mecum has
> > accepted an invitation to become a committer on Apache Arrow. Welcome,
> and
> > thank you for your contributions!
> >
> > Nic
> >
>


Re: [DISCUSS] Looking for feedback on my Rust library

2024-03-14 Thread Weston Pace
Felipe's points are good.

I don't know that you need to adapt the entire ADBC; it sort of depends on
what you're after.  I see what you've got right now as more of an SQL
abstraction layer.  For example, similar to things like [1][2][3] (though 3
is more of an ORM).  If you like the SQL interface that you've come up with
then you could add, in addition to your postgres / sqlite / etc. bindings,
an ADBC implementation.  This would adapt anything that implements ADBC to
your interface.  This way you could get, in theory, free support for
backends like flight sql or snowflake, and you could replace your duckdb /
postgres backends if you wanted.

I will pass on some feedback I received recently.  If your audience is
"regular developers" (e.g. not data engineers, people building webapps, ML
apps, etc.) then they often do not know or want to speak Arrow.  They see
it as an essential component, but one that is sort of a "database internal"
or a "data engineering thing".  For example, in the python / pyarrow world
people are happy to know that arrow data is traversing their network but,
when they want to actually work with it (e.g. display results to users),
they convert it to python lists or pandas (fortunately arrow makes this
easy).

For example, if you look at postgres' rust bindings you will see that
people process results like this:

```

for row in client.query("SELECT id, name, data FROM person", &[])? {
    let id: i32 = row.get(0);
    let name: &str = row.get(1);
    let data: Option<&[u8]> = row.get(2);

    println!("found person: {} {} {:?}", id, name, data);
}

```

The `get` method can be templated to anything implementing the `FromSql`
trait.  This lets rust devs use types they are familiar with (e.g. `&str`,
`i32`, `&[u8]`) instead of having to learn a new technology (whatever
postgres is using internally).

On the other hand, if your audience is, in fact, data engineers, then that
sort of native row-based interface is going to be too inefficient.  So there
are definitely uses for both.

[1] https://sequelize.org/v3/
[2] https://docs.rs/quaint/latest/quaint/
[3] https://www.sqlalchemy.org/

On Thu, Mar 14, 2024 at 4:19 PM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:

> Two comments:
>
> ——
>
> Since this library is analogous to things like ADBC, ODBC, and JDBC, it’s
> more of a “driver” than a “connector”. This might make your life easier
> when explaining what it does.
>
> It’s not a black and white thing, but “connector” might imply networking to
> some people.
>
> I believe you delegate the networking bits of interacting with PostgreSQL
> to a Rust connector.
>
> ——
>
> This library would be more interesting if it could be a wrapper of
> language-agnostic database standards like ADBC and ODBC. The Rust compiler
> can call and expose functions that follow the C ABI — the only true code
> interface standard available on every OS/Architecture pair.
>
> This would mean that any database that exposes ADBC/ODBC can be used from
> your driver. You would still offer a rich Rust interface, but everything
> would translate to well-defined operations that vendors implement. This
> also reduces the chances of you providing things that are heavily biased
> towards the way the databases you supported initially work.
>
> —
> Felipe
>
>
>
> On Tue, 12 Mar 2024 at 09:28 Aljaž Eržen  wrote:
>
> > Hello good folks of Apache Arrow! I come looking for feedback on my
> > Rust crate connector_arrow [1], which is an Arrow database client that
> > is able to connect to multiple databases over their native protocols.
> >
> > It is very similar to ADBC, but better adapted for the Rust ecosystem,
> > as it can be compiled with plain cargo and uses established crates for
> > connecting to the databases.
> >
> > The main feedback I need is the API exposed by the library [2]. I've
> > tried to keep it minimal and it turned out much more concise than the
> > api exposed by ADBC. Have I missed important features?
> >
> > Aljaž Mur Eržen
> >
> > [1]: https://docs.rs/connector_arrow/latest/connector_arrow/
> > [2]:
> >
> https://github.com/aljazerzen/connector_arrow/blob/main/connector_arrow/src/api.rs
> >
>


Re: [VOTE] Move Arrow DataFusion Subproject to new Top Level Apache Project

2024-03-01 Thread Weston Pace
+1 (binding)

On Fri, Mar 1, 2024 at 3:33 AM Andrew Lamb  wrote:

> Hello,
>
> As we have discussed[1][2] I would like to vote on the proposal to
> create a new Apache Top Level Project for DataFusion. The text of the
> proposed resolution and background document is copy/pasted below
>
> If the community is in favor of this, we plan to submit the resolution
> to the ASF board for approval with the next Arrow report (for the
> April 2024 board meeting).
>
> The vote will be open for at least 7 days.
>
> [ ] +1 Accept this Proposal
> [ ] +0
> [ ] -1 Do not accept this proposal because...
>
> Andrew
>
> [1] https://lists.apache.org/thread/c150t1s1x0kcb3r03cjyx31kqs5oc341
> [2] https://github.com/apache/arrow-datafusion/discussions/6475
>
> -- Proposed Resolution -
>
> Resolution to Create the Apache DataFusion Project from the Apache
> Arrow DataFusion Sub Project
>
> =
>
> X. Establish the Apache DataFusion Project
>
> WHEREAS, the Board of Directors deems it to be in the best
> interests of the Foundation and consistent with the
> Foundation's purpose to establish a Project Management
> Committee charged with the creation and maintenance of
> open-source software related to an extensible query engine
> for distribution at no charge to the public.
>
> NOW, THEREFORE, BE IT RESOLVED, that a Project Management
> Committee (PMC), to be known as the "Apache DataFusion Project",
> be and hereby is established pursuant to Bylaws of the
> Foundation; and be it further
>
> RESOLVED, that the Apache DataFusion Project be and hereby is
> responsible for the creation and maintenance of software
> related to an extensible query engine; and be it further
>
> RESOLVED, that the office of "Vice President, Apache DataFusion" be
> and hereby is created, the person holding such office to
> serve at the direction of the Board of Directors as the chair
> of the Apache DataFusion Project, and to have primary responsibility
> for management of the projects within the scope of
> responsibility of the Apache DataFusion Project; and be it further
>
> RESOLVED, that the persons listed immediately below be and
> hereby are appointed to serve as the initial members of the
> Apache DataFusion Project:
>
> * Andy Grove (agr...@apache.org)
> * Andrew Lamb (al...@apache.org)
> * Daniël Heres (dhe...@apache.org)
> * Jie Wen (jake...@apache.org)
> * Kun Liu (liu...@apache.org)
> * Liang-Chi Hsieh (vii...@apache.org)
> * Qingping Hou: (ho...@apache.org)
> * Wes McKinney(w...@apache.org)
> * Will Jones (wjones...@apache.org)
>
> RESOLVED, that the Apache DataFusion Project be and hereby
> is tasked with the migration and rationalization of the Apache
> Arrow DataFusion sub-project; and be it further
>
> RESOLVED, that all responsibilities pertaining to the Apache
> Arrow DataFusion sub-project encumbered upon the
> Apache Arrow Project are hereafter discharged.
>
> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrew Lamb
> be appointed to the office of Vice President, Apache DataFusion, to
> serve in accordance with and subject to the direction of the
> Board of Directors and the Bylaws of the Foundation until
> death, resignation, retirement, removal or disqualification,
> or until a successor is appointed.
> =
>
>
> ---
>
>
> Summary:
>
> We propose creating a new top level project, Apache DataFusion, from
> an existing sub project of Apache Arrow to facilitate additional
> community and project growth.
>
> Abstract
>
> Apache Arrow DataFusion[1]  is a very fast, extensible query engine
> for building high-quality data-centric systems in Rust, using the
> Apache Arrow in-memory format. DataFusion offers SQL and Dataframe
> APIs, excellent performance, built-in support for CSV, Parquet, JSON,
> and Avro, extensive customization, and a great community.
>
> [1] https://arrow.apache.org/datafusion/
>
>
> Proposal
>
> We propose creating a new top level ASF project, Apache DataFusion,
> governed initially by a subset of the Apache Arrow project’s PMC and
> committers. The project’s code is in five existing git repositories,
> currently governed by Apache Arrow which would transfer to the new top
> level project.
>
> Background
>
> When DataFusion was initially donated to the Arrow project, it did not
> have a strong enough community to stand on its own. It has since grown
> significantly, and benefited immensely from being part of Arrow and
> nurturing of the Apache Way, and now has a community strong enough to
> stand on its own and that would benefit from focused governance
> attention.
>
> The community has discussed this idea publicly for more than 6 months
> https://github.com/apache/arrow-datafusion/discussions/6475  and
> briefly on the Arrow PMC mailing list
> https://lists.apache.org/thread/thv2jdm6640l6gm88hy8jhk5prjww0cs. As
> of the time of this writing both had exclusively positive reactions.
>

Re: [ANNOUNCE] New Arrow committer: Jay Zhan

2024-02-16 Thread Weston Pace
Congrats!

On Fri, Feb 16, 2024 at 3:07 AM Raúl Cumplido  wrote:

> Congratulations!!
>
> El vie, 16 feb 2024 a las 12:02, Daniël Heres
> () escribió:
> >
> > Congratulations!
> >
> > On Fri, Feb 16, 2024, 11:33 Metehan Yıldırım <
> metehan.yildi...@synnada.ai>
> > wrote:
> >
> > > Congrats!
> > >
> > > On Fri 16. Feb 2024 at 13:26, Andrew Lamb 
> wrote:
> > >
> > > > On behalf of the Arrow PMC, I'm happy to announce that Jay Zhan
> > > > has accepted an invitation to become a committer on Apache
> > > > Arrow. Welcome, and thank you for your contributions!
> > > >
> > > > Andrew
> > > >
> > >
>


Re: [DISC] Improve Arrow Release verification process

2024-01-21 Thread Weston Pace
+1.  There have been a few times I've attempted to run the verification
scripts.  They have failed, but I was pretty confident it was a problem
with my environment mixing with the verification script and not a problem
in the software itself, and I didn't take the time to debug the verification
script issues.  So even if there were a true issue, I doubt the manual
verification process would help me catch it.

Also, most devs seem to be on fairly consistent development environments
(Ubuntu or Macbook).  So rather than spend time allowing many people to
verify that Ubuntu works, we could probably spend that time building extra CI
environments that provide more coverage.

On Fri, Jan 19, 2024 at 1:49 PM Jacob Wujciak-Jens
 wrote:

> I concur, a minimally scoped verification script for the actual voting
> process without any binary verification etc. should be created. The ease in
> verifying a release will lower the burden to participate in the vote which
> is good for the community and will even be necessary if we ever want to
> increase release cadence as previously discussed.
>
> In my opinion it will also mean that the binaries are no longer part of the
> release, which will help in situations similar to the release of Python
> 3.12 just after 14.0.0 was released and lots of users were running into
> issues because there were no 14.0.0 wheels for 3.12.
>
> While it would still be nice to potentially make reproduction of CI errors
> easier by having better methods to restart a failed script, this is of much
> lower importance than improving the release process.
>
> Jacob
>
> On Fri, Jan 19, 2024 at 7:38 PM Andrew Lamb  wrote:
>
> > I would second this notion that manually running tests that are already
> > covered as part of CI as part of the release process is of (very) limited
> > value.
> >
> > While we do the same thing (compile and run some tests) as part of the
> Rust
> > release this has never caught any serious defect I am aware of and we
> only
> > run a subset of tests (e.g. not tests for integration with other arrow
> > versions)
> >
> > Reducing the burden for releases I think would benefit everyone.
> >
> > Andrew
> >
> > On Fri, Jan 19, 2024 at 1:08 PM Antoine Pitrou 
> wrote:
> >
> > >
> > > Well, if the main objective is to just follow the ASF Release
> > > guidelines, then our verification process can be simplified
> drastically.
> > >
> > > The ASF indeed just requires:
> > > """
> > > Every ASF release MUST contain one or more source packages, which MUST
> > > be sufficient for a user to build and test the release provided they
> > > have access to the appropriate platform and tools. A source release
> > > SHOULD not contain compiled code.
> > > """
> > >
> > > So, basically, if the source tarball is enough to compile Arrow on a
> > > single platform with a single set of tools, then we're ok. :-)
> > >
> > > The rest is just an additional burden that we voluntarily inflict on
> > > ourselves. *Ideally*, our continuous integration should be able to
> > > detect any build-time or runtime issue on supported platforms. There
> > > have been rare cases where some issues could be detected at release
> time
> > > thanks to the release verification script, but these are a tiny portion
> > > of all issues routinely detected in the form of CI failures. So, there
> > > doesn't seem to be a reason to believe that manual release verification
> > > is bringing significant benefits compared to regular CI.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 19/01/2024 à 11:42, Raúl Cumplido a écrit :
> > > > Hi,
> > > >
> > > > One of the challenges we have when doing a release is verification
> and
> > > voting.
> > > >
> > > > Currently the Arrow verification process is quite long, tedious and
> > > error prone.
> > > >
> > > > I would like to use this email to get feedback and user requests in
> > > > order to improve the process.
> > > >
> > > > Several things already on my mind:
> > > >
> > > > One thing that is quite annoying is that any flaky failure makes us
> > > > restart the process and possibly requires downloading everything
> > > > again. It would be great to have some kind of retry mechanism that
> > > > allows us to keep going from where it failed and doesn't have to redo
> > > > the previous successful jobs.
> > > >
> > > > We do have a bunch of flags to do specific parts but that requires
> > > > knowledge and time to go over the different flags, etcetera so the UX
> > > > could be improved.
> > > >
> > > > Based on the ASF release policy [1] in order to cast a +1 vote we
> have
> > > > to validate the source code packages but it is not required to
> > > > validate binaries locally. Several binaries are currently tested
> using
> > > > docker images and they are already tested and validated on CI. Our
> > > > documentation for release verification points to perform binary
> > > > validation. I plan to update the documentation and move it to the
> > > > official docs instead of the 

Re: [DISCUSS] Semantics of extension types

2023-12-14 Thread Weston Pace
I agree engines can use their own strategy.  Requiring explicit casts is
probably ok as long as it is well documented, but I think I lean slightly
towards implicitly falling back to the storage type.  I do think
people still shy away from extension types.  Adding the extension type to
an implicit cast registry is another hurdle to their use, albeit a small
one.

Substrait has a similar consideration for extension types.  They can be
declared "inherits" (meaning the storage type can be used implicitly in
compute functions) or "separate" (meaning the storage type cannot be used
implicitly in compute functions).  This would map nicely to an Arrow
metadata field.

Unfortunately, I think the truth is more nuanced than a simple
separate/inherits flag.  Take UUID for example (everyone's favorite fixed
size binary extension type).  We would definitely want to implicitly reuse
the hash, equality, and sorting functions.

However, for other functions it gets trickier.  Imagine you have a
`replace_slice` function.  Should it return a new UUID (change some bytes
in a UUID and you have a new UUID) or not (once you start changing bytes in
a UUID you no longer have a UUID)?  Or what if there were a `slice`
function?  This function should either be prohibited (you can't slice a
UUID) or should return a fixed size binary string (you can still slice it
but you no longer have a UUID).
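
As a concrete illustration, here is a small pyarrow sketch.  The extension
type below is hypothetical (not a canonical Arrow type) and only exists to
show the fall-back-to-storage behavior.

```
# Minimal sketch, assuming a hypothetical UUID extension type stored as
# fixed_size_binary(16).  The type name and (lack of) registration are
# illustrative only.
import uuid
import pyarrow as pa
import pyarrow.compute as pc

class UuidType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.binary(16), "example.uuid")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

uuid_type = UuidType()
storage = pa.array([uuid.uuid4().bytes for _ in range(4)], pa.binary(16))
uuids = pa.ExtensionArray.from_storage(uuid_type, storage)

# "Inherits"-style reuse: byte-wise ordering/equality on the storage is what
# we'd want for UUIDs anyway.
order = pc.sort_indices(uuids.storage.cast(pa.binary()))

# A structural function (slice, replace_slice, ...) is the murky case: its
# natural result is plain binary, not a UUID, so an engine has to decide
# whether to forbid it or silently drop the extension type.
```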

Given the complication I think users will always need to carefully consider
each use of an extension function no matter how smart a system is.  I'm not
convinced any metadata exists that could define the right approach
consistently across cases.  This means our choice is whether we force
users to explicitly declare each such decision or we just trust that they
are doing the proper consideration when they design their plan.  I'm not
sure there is a right answer.  One can point to the vast diversity of ways
that programming languages have approached implicit vs explicit integer
casts.

My last concern is that we rely on compute functions in operators other
than project/filter.  For example, to use a column as a key for a hash-join
we need to be able to compute the hash value and calculate equality.  To
use a column as a key for sorting we need an ordering function.  These are
places where it might be unexpected for users to insert explicit casts.  An
engine would need to make sure the error message in these cases was very
clear.

On Wed, Dec 13, 2023 at 12:54 PM Antoine Pitrou  wrote:

>
> Hi,
>
> For now, I would suggest that each implementation decides on their own
> strategy, because we don't have a clear idea of which is better (and
> extension types are probably not getting a lot of use yet).
>
> Regards
>
> Antoine.
>
>
> Le 13/12/2023 à 17:39, Benjamin Kietzman a écrit :
> > The main problem I see with adding properties to ExtensionType is I'm not
> > sure where that information would reside. Allowing type authors to
> declare
> > in which ways the type is equivalent (or not) to its storage is
> appealing,
> > but it seems to need an official extension field like
> > ARROW:extension:semantics. Otherwise I think each extension type's
> > semantics would need to be maintained within every implementation as well
> > as in a central reference (probably in Columnar.rst), which seems
> > unreasonable to expect of extension type authors. I'm also skeptical that
> > useful information could be packed into an ARROW:extension:semantics
> field;
> > even if the type can declare that ordering-as-with-storage is invalid I
> > don't think it'd be feasible to specify the correct ordering.
> >
> > If we cannot attach this information to extension types, the question
> > becomes which defaults are most reasonable for engines and how can the
> > engine most usefully be configured outside those defaults. My own
> > preference would be to refuse operations other than selection or
> > casting-to-storage, with a runtime extensible registry of allowed
> implicit
> > casts. This will allow users of the engine to configure their extension
> > types as they need, and the error message raised when an implicit
> > cast-to-storage is not allowed could include the suggestion to register
> the
> > implicit cast. For applications built against a specific engine, this
> > approach would allow recovering much of the advantage of attaching
> > properties to an ExtensionType by including registration of implicit
> casts
> > in the ExtensionType's initialization.
> >
> > On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman 
> > wrote:
> >
> >> Hello all,
> >>
> >> Recently, a PR to arrow c++ [1] was opened to allow implicit casting
> from
> >> any extension type to its storage type in acero. This raises questions
> >> about the validity of applying operations to an extension array's
> storage.
> >> For example, some extension type authors may intend different ordering
> for
> >> arrays of their new type than would be applied to the array's storage or
> >> may not 

Re: [VOTE] Flight SQL as experimental

2023-12-08 Thread Weston Pace
+1 (binding)

On Fri, Dec 8, 2023 at 1:43 PM L. C. Hsieh  wrote:

> +1 (binding)
>
> On Fri, Dec 8, 2023 at 1:27 PM Antoine Pitrou  wrote:
> >
> > +1 (binding)
> >
> >
> > Le 08/12/2023 à 20:42, David Li a écrit :
> > > Let's start a formal vote just so we're on the same page now that
> we've discussed a few things.
> > >
> > > I would like to propose we remove 'experimental' from Flight SQL and
> make it stable:
> > >
> > > - Remove the 'experimental' option from the Protobuf definitions (but
> leave the option definition for future additions)
> > > - Update specifications/documentation/implementations to no longer
> refer to Flight SQL as experimental, and describe what stable means (no
> backwards-incompatible changes)
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1
> > > [ ] +0
> > > [ ] -1 Keep Flight SQL experimental because...
> > >
> > > On Fri, Dec 8, 2023, at 13:37, Weston Pace wrote:
> > >> +1
> > >>
> > >> On Fri, Dec 8, 2023 at 10:33 AM Micah Kornfield <
> emkornfi...@gmail.com>
> > >> wrote:
> > >>
> > >>> +1
> > >>>
> > >>> On Fri, Dec 8, 2023 at 10:29 AM Andrew Lamb 
> wrote:
> > >>>
> > >>>> I agree it is time to "promote" ArrowFlightSQL to the same level as
> other
> > >>>> standards in Arrow
> > >>>>
> > >>>> Now that it is used widely (we use and count on it too at
> InfluxData) I
> > >>>> agree it makes sense to remove the experimental label from the
> overall
> > >>>> spec.
> > >>>>
> > >>>> It would make sense to leave experimental / caveats on any places
> (like
> > >>>> extension APIs) that are likely to change
> > >>>>
> > >>>> Andrew
> > >>>>
> > >>>> On Fri, Dec 8, 2023 at 11:39 AM David Li 
> wrote:
> > >>>>
> > >>>>> Yes, I think we can continue marking new features (like the bulk
> > >>>>> ingest/session proposals) as experimental but remove it from
> anything
> > >>>>> currently in the spec.
> > >>>>>
> > >>>>> On Fri, Dec 8, 2023, at 11:36, Laurent Goujon wrote:
> > >>>>>> I'm the author of the initial pull request which triggered the
> > >>>>> discussion.
> > >>>>>> I was focusing first on the comment in Maven pom.xml files which
> show
> > >>>> up
> > >>>>> in
> > >>>>>> Maven Central and other places, and which got some people confused
> > >>>> about
> > >>>>>> the state of the driver/code. IMHO this would apply to the current
> > >>>>>> Flight/Flight SQL protocol and code as it is today. Protocol
> > >>> extensions
> > >>>>>> should be still deemed experimental if still in their incubating
> > >>> phase?
> > >>>>>>
> > >>>>>> Laurent
> > >>>>>>
> > >>>>>> On Thu, Dec 7, 2023 at 4:54 PM Micah Kornfield <
> > >>> emkornfi...@gmail.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> This applies to mostly existing APIs (e.g. recent additions are
> > >>> still
> > >>>>>>> experimental)? Or would it apply to everything going forward?
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Micah
> > >>>>>>>
> > >>>>>>> On Thu, Dec 7, 2023 at 2:25 PM David Li 
> > >>> wrote:
> > >>>>>>>
> > >>>>>>>> Yes, we'd update the docs, the Protobuf definitions, and
> anything
> > >>>> else
> > >>>>>>>> referring to Flight SQL as experimental.
> > >>>>>>>>
> > >>>>>>>> On Thu, Dec 7, 2023, at 17:14, Joel Lubinitsky wrote:
> > >>>>>>>>> The message types defined in FlightSql.proto are all marked
> > >>>>>>> experimental
> > >>>>>>>> as
> > >>>>>>>>> well. Would this include changes to any of those?
> > >>>>>>>>>
> > >>>>>>>>> On Thu, Dec 7, 2023 at 16:43 Laurent Goujon
> > >>>>>  > >>>>>>>>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> we have been using it with Dremio for a while now, and we
> > >>>> consider
> > >>>>> it
> > >>>>>>>>>> stable
> > >>>>>>>>>>
> > >>>>>>>>>> +1 (not binding)
> > >>>>>>>>>>
> > >>>>>>>>>> Laurent
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, Dec 6, 2023 at 4:52 PM Matt Topol
> > >>>>>>>  > >>>>>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> +1, I agree with everyone else
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Dec 6, 2023 at 7:49 PM James Duong
> > >>>>>>>>>>>  wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> +1 from me. It's used in a good number of databases now.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Get Outlook for Android<https://aka.ms/AAb9ysg>
> > >>>>>>>>>>>> 
> > >>>>>>>>>>>> From: David Li 
> > >>>>>>>>>>>> Sent: Wednesday, December 6, 2023 9:59:54 AM
> > >>>>>>>>>>>> To: dev@arrow.apache.org 
> > >>>>>>>>>>>> Subject: [DISCUSS] Flight SQL as experimental
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Flight SQL has been marked 'experimental' since the
> > >>>> beginning.
> > >>>>>>> Given
> > >>>>>>>>>> that
> > >>>>>>>>>>>> it's now used by a few systems for a few years now, should
> > >>> we
> > >>>>>>> remove
> > >>>>>>>>>> this
> > >>>>>>>>>>>> qualifier? I don't expect us to be making breaking changes
> > >>>>>>> anymore.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This came up in a GitHub PR:
> > >>>>>>>>>> https://github.com/apache/arrow/pull/39040
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> -David
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
>


Re: [DISCUSS] Flight SQL as experimental

2023-12-08 Thread Weston Pace
+1

On Fri, Dec 8, 2023 at 10:33 AM Micah Kornfield 
wrote:

> +1
>
> On Fri, Dec 8, 2023 at 10:29 AM Andrew Lamb  wrote:
>
> > I agree it is time to "promote" ArrowFlightSQL to the same level as other
> > standards in Arrow
> >
> > Now that it is used widely (we use and count on it too at InfluxData) I
> > agree it makes sense to remove the experimental label from the overall
> > spec.
> >
> > It would make sense to leave experimental / caveats on any places (like
> > extension APIs) that are likely to change
> >
> > Andrew
> >
> > On Fri, Dec 8, 2023 at 11:39 AM David Li  wrote:
> >
> > > Yes, I think we can continue marking new features (like the bulk
> > > ingest/session proposals) as experimental but remove it from anything
> > > currently in the spec.
> > >
> > > On Fri, Dec 8, 2023, at 11:36, Laurent Goujon wrote:
> > > > I'm the author of the initial pull request which triggered the
> > > discussion.
> > > > I was focusing first on the comment in Maven pom.xml files which show
> > up
> > > in
> > > > Maven Central and other places, and which got some people confused
> > about
> > > > the state of the driver/code. IMHO this would apply to the current
> > > > Flight/Flight SQL protocol and code as it is today. Protocol
> extensions
> > > > should be still deemed experimental if still in their incubating
> phase?
> > > >
> > > > Laurent
> > > >
> > > > On Thu, Dec 7, 2023 at 4:54 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > > > wrote:
> > > >
> > > >> This applies to mostly existing APIs (e.g. recent additions are
> still
> > > >> experimental)? Or would it apply to everything going forward?
> > > >>
> > > >> Thanks,
> > > >> Micah
> > > >>
> > > >> On Thu, Dec 7, 2023 at 2:25 PM David Li 
> wrote:
> > > >>
> > > >> > Yes, we'd update the docs, the Protobuf definitions, and anything
> > else
> > > >> > referring to Flight SQL as experimental.
> > > >> >
> > > >> > On Thu, Dec 7, 2023, at 17:14, Joel Lubinitsky wrote:
> > > >> > > The message types defined in FlightSql.proto are all marked
> > > >> experimental
> > > >> > as
> > > >> > > well. Would this include changes to any of those?
> > > >> > >
> > > >> > > On Thu, Dec 7, 2023 at 16:43 Laurent Goujon
> > >  > > >> >
> > > >> > > wrote:
> > > >> > >
> > > >> > >> we have been using it with Dremio for a while now, and we
> > consider
> > > it
> > > >> > >> stable
> > > >> > >>
> > > >> > >> +1 (not binding)
> > > >> > >>
> > > >> > >> Laurent
> > > >> > >>
> > > >> > >> On Wed, Dec 6, 2023 at 4:52 PM Matt Topol
> > > >>  > > >> > >
> > > >> > >> wrote:
> > > >> > >>
> > > >> > >> > +1, I agree with everyone else
> > > >> > >> >
> > > >> > >> > On Wed, Dec 6, 2023 at 7:49 PM James Duong
> > > >> > >> >  wrote:
> > > >> > >> >
> > > >> > >> > > +1 from me. It's used in a good number of databases now.
> > > >> > >> > >
> > > >> > >> > > Get Outlook for Android
> > > >> > >> > > 
> > > >> > >> > > From: David Li 
> > > >> > >> > > Sent: Wednesday, December 6, 2023 9:59:54 AM
> > > >> > >> > > To: dev@arrow.apache.org 
> > > >> > >> > > Subject: [DISCUSS] Flight SQL as experimental
> > > >> > >> > >
> > > >> > >> > > Flight SQL has been marked 'experimental' since the
> > beginning.
> > > >> Given
> > > >> > >> that
> > > >> > >> > > it's now used by a few systems for a few years now, should
> we
> > > >> remove
> > > >> > >> this
> > > >> > >> > > qualifier? I don't expect us to be making breaking changes
> > > >> anymore.
> > > >> > >> > >
> > > >> > >> > > This came up in a GitHub PR:
> > > >> > >> https://github.com/apache/arrow/pull/39040
> > > >> > >> > >
> > > >> > >> > > -David
> > > >> > >> > >
> > > >> > >> >
> > > >> > >>
> > > >> >
> > > >>
> > >
> >
>


Re: [ANNOUNCE] New Arrow committer: Felipe Oliveira Carvalho

2023-12-07 Thread Weston Pace
Congratulations Felipe!

On Thu, Dec 7, 2023 at 8:38 AM wish maple  wrote:

> Congrats Felipe!!!
>
> Best,
> Xuwei Fu
>
> Benjamin Kietzman  于2023年12月7日周四 23:42写道:
>
> > On behalf of the Arrow PMC, I'm happy to announce that Felipe Oliveira
> > Carvalho
> > has accepted an invitation to become a committer on Apache
> > Arrow. Welcome, and thank you for your contributions!
> >
> > Ben Kietzman
> >
>


Re: [ANNOUNCE] New Arrow PMC chair: Andy Grove

2023-11-27 Thread Weston Pace
Congrats Andy!

On Mon, Nov 27, 2023, 7:31 PM wish maple  wrote:

> Congrats Andy!
>
> Best,
> Xuwei Fu
>
> Andrew Lamb  于2023年11月27日周一 20:47写道:
>
> > I am pleased to announce that the Arrow Project has a new PMC chair and
> VP
> > as per our tradition of rotating the chair once a year. I have resigned
> and
> > Andy Grove was duly elected by the PMC and approved unanimously by the
> > board.
> >
> > Please join me in congratulating Andy Grove!
> >
> > Thanks,
> > Andrew
> >
>


Re: [ANNOUNCE] New Arrow committer: James Duong

2023-11-17 Thread Weston Pace
Congratulations James

On Fri, Nov 17, 2023 at 6:07 AM Metehan Yıldırım <
metehan.yildi...@synnada.ai> wrote:

> Congratulations!
>
> On Thu, Nov 16, 2023 at 10:45 AM Sutou Kouhei  wrote:
>
> > On behalf of the Arrow PMC, I'm happy to announce that James Duong
> > has accepted an invitation to become a committer on Apache
> > Arrow. Welcome, and thank you for your contributions!
> >
> > --
> > kou
> >
> >
> >
>


Re: [ANNOUNCE] New Arrow PMC member: Raúl Cumplido

2023-11-13 Thread Weston Pace
Congratulations Raúl!

On Mon, Nov 13, 2023 at 1:34 PM Ben Harkins 
wrote:

> Congrats, Raúl!!
>
> On Mon, Nov 13, 2023 at 4:30 PM Bryce Mecum  wrote:
>
> > Congrats, Raúl!
> >
> > On Mon, Nov 13, 2023 at 10:28 AM Andrew Lamb 
> > wrote:
> > >
> > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > Raúl Cumplido  to become a PMC member and we are pleased to announce
> > > that  Raúl Cumplido has accepted.
> > >
> > > Please join me in congratulating them.
> > >
> > > Andrew
> >
>


Re: [DISCUSS][Format] C data interface for Utf8View

2023-11-07 Thread Weston Pace
+1 for the original proposal as well.

---

The (minor) problem I see with flags is that there isn't much point to this
feature if you are gating on a flag.  I'm assuming the goal is what Dewey
originally mentioned which is making buffer calculations easier.  However,
if you're gating the feature with a flag then you are either:

 * Rejecting input from producers that don't support this feature
(undesirable, better to align on one use model if we can)
 * Doing all the work anyways to handle producers that don't support the
feature

Maybe it makes sense for a long term migration (e.g. we all agree this is
something we want to move towards but we need to handle old producers in
the meantime) but we can always discuss that separately and I don't think
the benefit here is worth the confusion.

On Tue, Nov 7, 2023 at 7:46 AM Will Jones  wrote:

> I agree with the approach originally proposed by Ben. It seems like the
> most straightforward way to implement within the current protocol.
>
> On Sun, Oct 29, 2023 at 4:59 PM Dewey Dunnington
>  wrote:
>
> > In the absence of a general solution to the C data interface omitting
> > buffer sizes, I think the original proposal is the best way
> > forward...this is the first type to be added whose buffer sizes cannot
> > be calculated without looping over every element of the array; the
> > buffer sizes are needed to efficiently serialize the imported array to
> > IPC if imported by a consumer that cares about buffer sizes.
> >
> > Using a schema's flags to indicate something about a specific paired
> > array (particularly one that, if misinterpreted, would lead to a
> > crash) is a precedent that is probably not worth introducing for just
> > one type. Currently a schema is completely independent of any
> > particular ArrowArray, and I think that is a feature that is worth
> > preserving. My gripe about not having buffer sizes on the CPU to more
> > efficiently copy between devices is a concept almost certainly better
> > suited to the ArrowDeviceArray struct.
> >
> > On Fri, Oct 27, 2023 at 12:45 PM Benjamin Kietzman 
> > wrote:
> > >
> > > > This begs the question of what happens if a consumer receives an
> > unknown
> > > > flag value.
> > >
> > > It seems to me that ignoring unknown flags is the primary case to
> > consider
> > > at
> > > this point, since consumers may ignore unknown flags. Since that is the
> > > case,
> > > it seems adding any flag which would break such a consumer would be
> > > tantamount to an ABI breakage. I don't think this can be averted unless
> > all
> > > consumers are required to error out on unknown flag values.
> > >
> > > In the specific case of Utf8View it seems certain that consumers would
> > add
> > > support for the buffer sizes flag simultaneously with adding support
> for
> > the
> > > new type (since Utf8View is difficult to import otherwise), so any
> > consumer
> > > which would error out on the new flag would already be erroring out on
> an
> > > unsupported data type.
> > >
> > > > I might be the only person who has implemented
> > > > a deep copy of an ArrowSchema in C, but it does blindly pass along a
> > > > schema's flag value
> > >
> > > I think passing a schema's flag value including unknown flags is an
> > error.
> > > The ABI defines moving structures but does not define deep copying. I
> > think
> > > in order to copy deeply in terms of operations which *are* specified:
> we
> > > import then export the schema. Since this includes an export step, it
> > > should not
> > > include flags which are not supported by the exporter.
> > >
> > > On Thu, Oct 26, 2023 at 6:40 PM Antoine Pitrou 
> > wrote:
> > >
> > > >
> > > > Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit :
> > > > >> Is this buffer lengths buffer only present if the array type is
> > > > Utf8View?
> > > > >
> > > > > IIUC, the proposal would add the buffer lengths buffer for all
> types
> > if
> > > > the
> > > > > schema's
> > > > > flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to
> > avoid
> > > > > the special case and that `n_buffers` would continue to be
> consistent
> > > > with
> > > > > IPC.
> > > >
> > > > This begs the question of what happens if a consumer receives an
> > unknown
> > > > flag value. We haven't specified that unknown flag values should be
> > > > ignored, so a consumer could judiciously choose to error out instead
> of
> > > > potentially misinterpreting the data.
> > > >
> > > > All in all, personally I'd rather we make a special case for Utf8View
> > > > instead of adding a flag that can lead to worse interoperability.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> >
>


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Weston Pace
Is this buffer lengths buffer only present if the array type is Utf8View?
Or are you suggesting that other types might want to adopt this as well?

On Thu, Oct 26, 2023 at 10:00 AM Dewey Dunnington
 wrote:

> > I expect C code to not be much longer than this :-)
>
> nanoarrow's buffer-length-calculation and validation concepts are
> (perhaps inadvisably) intertwined...even with both it is not that much
> code (perhaps I was remembering how much time it took me to figure out
> which 35 lines to write :-))
>
> > That sounds a bit hackish to me.
>
> Including only *some* buffer sizes in array->buffers[array->n_buffers]
> special-cased for only two types (or altering the number of buffers
> required by the IPC format vs. the number of buffers required by the C
> Data interface) seem equally hackish to me (not that I'm opposed to
> either necessarily...the alternatives really are very bad).
>
> > How can you *not* care about buffer sizes, if you for example need to
> send the buffers over IPC?
>
> I think IPC is the *only* operation that requires that information?
> (Other than perhaps copying to another device?) I don't think there's
> any barrier to accessing the content of all the array elements but I
> could be mistaken.
>
> On Thu, Oct 26, 2023 at 1:04 PM Antoine Pitrou  wrote:
> >
> >
> > Le 26/10/2023 à 17:45, Dewey Dunnington a écrit :
> > > The lack of buffer sizes is something that has come up for me a few
> > > times working with nanoarrow (which dedicates a significant amount of
> > > code to calculating buffer sizes, which it uses to do validation and
> > > more efficient copying).
> >
> > By the way, this is a bit surprising since it's really 35 lines of code
> > in C++ currently:
> >
> >
> https://github.com/apache/arrow/blob/57f643c2cecca729109daae18c7a64f3a37e76e4/cpp/src/arrow/c/bridge.cc#L1721-L1754
> >
> > I expect C code to not be much longer than this :-)
> >
> > Regards
> >
> > Antoine.
>


Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-23 Thread Weston Pace
Congratulations Xuwei!

On Mon, Oct 23, 2023 at 3:38 AM wish maple  wrote:

> Thanks kou and every nice person in arrow community!
>
> I've learned a lot during learning and contribution to arrow and
> parquet. Thanks for everyone's help.
> Hope we can bring more fancy features in the future!
>
> Best,
> Xuwei Fu
>
> Sutou Kouhei  于2023年10月23日周一 12:48写道:
>
> > On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu
> > has accepted an invitation to become a committer on Apache
> > Arrow. Welcome, and thank you for your contributions!
> >
> > --
> > kou
> >
>


Re: Apache Arrow file format

2023-10-21 Thread Weston Pace
> Of course, what I'm really asking for is to see how Lance would compare ;-)

> P.S. The second paper [2] also talks about ML workloads (in Section 5.8)
> and GPU performance (in Section 5.9). It also cites Lance as one of the
> future formats (in Section 5.6.2).

Disclaimer: I work for LanceDb and am in no way an unbiased party.
However, since you asked:

TL;DR: Lance performs 10-20x better than orc or parquet when retrieving a
small scattered selection of rows from a large dataset.

I went ahead and reproduced the experiment in the second paper using
Lance.  Specifically, a vector search for 10 elements against the first 100
million rows of the laion 5b dataset.  There were a few details missing in
the paper (specifically around what index they used) and I ended up
training a rather underwhelming index but the performance of the index is
unrelated to the file format and so irrelevant for this discussion anyways.

Vector searches perform a CPU intensive index search against a relatively
small index (in the paper this index was kept in memory and so I did the
same for my experiment).  This identifies the rows of interest.  We then
need to perform a take operation to select those rows from storage.  This
is the part where the file format matters. So all we are really measuring
here is how long it takes to select N rows at random from a dataset.  This
is one of the use cases Lance was designed for and so it is no surprise
that it performs better.
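
For anyone who wants to poke at just the I/O portion, the core of the
measurement is a timed random take.  Below is a rough sketch with placeholder
paths; it uses pyarrow.dataset for the parquet side, and the lance side would
be the analogous lance.dataset(...).take(...) call as I understand the pylance
API.

```
# Rough sketch of the measurement: time a random take of a handful of rows
# from a large on-disk dataset.  The path and row count are placeholders.
import random
import time

import pyarrow.dataset as ds

NUM_ROWS = 100_000_000  # rows in the dataset (placeholder)
TAKE = 10               # rows returned per "query"

indices = sorted(random.sample(range(NUM_ROWS), TAKE))

dataset = ds.dataset("/data/laion_100m/parquet", format="parquet")
start = time.perf_counter()
table = dataset.take(indices)
print(f"took {table.num_rows} rows in {time.perf_counter() - start:.3f}s")
```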

Note that Lance stores data uncompressed.  However, it probably doesn't
matter in this case.  100 million rows of Laion 5B requires ~320GB.  Only
20GB of this is metadata.  The remaining 300GB is text & image embeddings.
These embeddings are, by design, not very compressible.  The entire lance
dataset required 330GB.

# Results:

The chart in the paper is quite small and uses a log scale.  I had to infer
the performance numbers for parquet & orc as best I could.  The numbers for
lance are accurate as that is what I measured.  These results are averaged
from 64 randomized queries (each iteration ran a single query to return 10
results) with the kernel's disk cache cleared (same as the paper I believe).

## S3:

Parquet: ~12,000ms
Orc: ~80,000ms
Lance: 1,696ms

## Local Storage (SSD):

Parquet: ~800ms
Orc: ~65ms (10ms spent searching index)
Lance: 61ms (59ms spent searching index)

At first glance it may seem like Lance performs about the same as Orc with
an SSD.  However, this is likely because my index was suboptimal (I did not
spend any real time tuning it since I could just look at the I/O times
directly).  The lance format spent only 2ms on I/O compared with ~55ms
spent on I/O by Orc.

# Boring Details:

Index: IVF/PQ with 1000 IVF partitions (probably should have been 10k
partitions but I'm not patient enough) and 96 PQ subvectors (1 byte per
subvector)
Hardware: Tests were performed on an r6id.2xlarge (using the attached NVME
for the SSD tests) in the same region as the S3 storage

Minor detail: The embeddings provided with the laion 5b dataset (clip
vit-l/14) were provided as float16.  Lance doesn't yet support float16 and
so I inflated these to float32 (that doubles the amount of data retrieved
so, if anything, it's just making things harder on lance)

On Thu, Oct 19, 2023 at 9:55 AM Aldrin  wrote:

> And the first paper's reference of arrow (in the references section) lists
> 2022 as the date of last access.
>
> Sent from Proton Mail  for iOS
>
>
> On Thu, Oct 19, 2023 at 18:51, Aldrin  > wrote:
>
> For context, that second referenced paper has Wes McKinney as a co-author,
> so they were much better positioned to say "the right things."
>
> Sent from Proton Mail  for iOS
>
>
> On Thu, Oct 19, 2023 at 18:38, Jin Shang  > wrote:
>
> Honestly I don't understand why this VLDB paper [1] chooses to include
> Feather in their evaluations. This paper studies OLAP DBMS file formats.
> Feather is clearly not optimized for the workload and performs badly in
> most of their benchmarks. This paper also has several inaccurate or
> outdated claims about Arrow, e.g. Arrow has no run length encoding, Arrow's
> dictionary encoding only supports string types (Table 3 and 5), Feather is
> Arrow plus dictionary encoding and compression (Section 3.2) etc. Moreover,
> the two optimizations it proposes for Arrow (in Section 8.1.1 and 8.1.3)
> are actually just two new APIs for working with Arrow data that require no
> change to the Arrow format itself. I fear that this paper may actually
> discourage DB people from using Arrow as their *in memory *format, even
> though it's the Arrow *file* format that performs badly for their workload.
>
> There is another paper "An Empirical Evaluation of Columnar Storage
> Formats" [2] covering essentially the same topic. It however chooses not to
> evaluate Arrow (in Section 2) because "Arrow is not meant for long-term
> disk storage", citing Wes McKinney's blog post [3] from six 

Re: [ANNOUNCE] New Arrow PMC member: Jonathan Keane

2023-10-15 Thread Weston Pace
Congratulations Jon!

On Sun, Oct 15, 2023, 1:51 PM Neal Richardson 
wrote:

> Congratulations!
>
> On Sun, Oct 15, 2023 at 1:35 PM Bryce Mecum  wrote:
>
> > Congratulations, Jon!
> >
> > On Sat, Oct 14, 2023 at 9:24 AM Andrew Lamb 
> wrote:
> > >
> > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > Jonathan Keane to become a PMC member and we are pleased to announce
> > > that Jonathan Keane has accepted.
> > >
> > > Congratulations and welcome!
> > >
> > > Andrew
> >
>


Re: [ANNOUNCE] New Arrow committer: Curt Hagenlocher

2023-10-15 Thread Weston Pace
Congratulations!

On Sun, Oct 15, 2023, 8:51 AM Gang Wu  wrote:

> Congrats!
>
> On Sun, Oct 15, 2023 at 10:49 PM David Li  wrote:
>
> > Congrats & welcome Curt!
> >
> > On Sun, Oct 15, 2023, at 09:03, wish maple wrote:
> > > Congratulations!
> > >
> > > On Sun, Oct 15, 2023 at 20:48, Raúl Cumplido wrote:
> > >
> > >> Congratulations and welcome!
> > >>
> > >> El dom, 15 oct 2023, 13:57, Ian Cook  escribió:
> > >>
> > >> > Congratulations Curt!
> > >> >
> > >> > On Sun, Oct 15, 2023 at 05:32 Andrew Lamb 
> > wrote:
> > >> >
> > >> > > On behalf of the Arrow PMC, I'm happy to announce that Curt
> > Hagenlocher
> > >> > > has accepted an invitation to become a committer on Apache
> > >> > > Arrow. Welcome, and thank you for your contributions!
> > >> > >
> > >> > > Andrew
> > >> > >
> > >> >
> > >>
> >
>


Re: [DISCUSS][C++] Raw pointer string views

2023-10-06 Thread Weston Pace
> I feel the broader question here is what is Arrow's intended use case -
> interchange or execution

The line between interchange and execution is not always clear.  For
example, I think we would like Arrow to be considered as a standard for UDF
libraries.

On Fri, Oct 6, 2023 at 7:34 AM Mark Raasveldt  wrote:

> For the index vs pointer question - DuckDB went with pointers as they are
> more flexible, and DuckDB was designed to consume data (and strings) from a
> wide variety of formats in a wide variety of languages. Pointers allows us
> to easily zero-copy from e.g. Python strings, R strings, Arrow strings,
> etc. The flip side of pointers is that ownership has to be handled very
> carefully. Our vector format is an execution-only format, and never leaves
> the internals of the engine. This greatly simplifies ownership as we are in
> complete control of what happens inside the engine. For an interchange
> format that is intended for handing data between engines, I can see this
> being more complicated and having verification being more important.
>
> As for the actual change:
>
> From an interchange perspective from DuckDB's side - the proposed
> zero-copy integration would definitely speed up the conversion of DuckDB
> string vectors to Arrow string vectors. In a recent benchmark that we have
> performed we have found string conversion to Arrow vectors to be a
> bottleneck in certain workloads, although we have not sufficiently
> researched if this could be improved in other ways. It is possible this can
> be alleviated without requiring changes to Arrow.
>
> However - in general, a new string vector format is only useful if
> consumers also support the format. If the consumer immediately converts the
> strings back into the standard Arrow string representation then there is no
> benefit. The change will only move where the conversion happens (from
> inside DuckDB to inside the consumer). As such, this change is only useful
> if the broader Arrow ecosystem moves towards supporting the new string
> format.
>
> From an execution perspective from DuckDB's side - it is unlikely that we
> will switch to using Arrow as an internal format at this stage of the
> project. While this change increases Arrow's utility as an intermediate
> execution format, that is more relevant to projects that are currently
> using Arrow in this manner or are planning to use Arrow in this manner.
>
> I feel the broader question here is what is Arrow's intended use case -
> interchange or execution - as they are opposed in this discussion. This
> change improves Arrow's utility as an execution format at the expense of
> more stability in the interchange format. From my perspective Arrow is more
> useful as an interchange format. When different tools communicate with each
> other a standard is required. An execution format is generally not exposed
> outside of the internals of the execution engine. Engines can do whatever
> they want here - and a standard is perhaps not as useful.
>
> On 2023/10/02 13:21:59 Andrew Lamb wrote:
> > > I don't think "we have to adjust the Arrow format so that existing
> > > internal representations become Arrow-compliant without any
> > > (re-)implementation effort" is a reasonable design principle.
> >
> > I agree with this statement from Antoine -- given the Arrow community has
> > standardized an addition to the format with StringView, I think it would
> > help to get some input from those at DuckDB and Velox on their
> perspective
> >
> > Andrew
> >
> >
> >
> >
> > On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies
> >  wrote:
> >
> > > Oh I'm with you on it being a precedent we want to be very careful
> about
> > > setting, but if there isn't a meaningful performance difference, we may
> > > be able to sidestep that discussion entirely.
> > >
> > > On 02/10/2023 14:11, Antoine Pitrou wrote:
> > > >
> > > > Even if performance were significant better, I don't think it's a
> good
> > > > enough reason to add these representations to Arrow. By construction,
> > > > a standard cannot continuously chase the performance state of art, it
> > > > has to weigh the benefits of performance improvements against the
> > > > increased cost for the ecosystem (for example the cost of adapting to
> > > > frequent standard changes and a growing standard size).
> > > >
> > > > We have extension types which could reasonably be used for
> > > > non-standard data types, especially the kind that are motivated by
> > > > leading-edge performance research and innovation and come with
> unusual
> > > > constraints (such as requiring trusting and dereferencing raw
> pointers
> > > > embedded in data buffers). There could even be an argument for making
> > > > some of them canonical extension types if there's enough anteriority
> > > > in favor.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > Le 02/10/2023 à 15:00, Raphael Taylor-Davies a écrit :
> > > >> I think what would really help would be 

Re: [Discuss][C++] A framework for contextual/implicit/ambient vars

2023-08-24 Thread Weston Pace
In other languages I have seen this called "async local"[1][2][3].  I'm not
sure of any C++ implementations.  Folly's fibers claim to have fiber-local
variables[4] but I can't find the actual code to use them.  I can't seem to
find reference to the concept in boost's asio or cppcoro.

I've also seen libraries that had higher level contexts.  For example, when
working with HTTP servers there is often a pattern to have "request
context" (e.g. you'll often find things like the IP address of the
requestor, the identity of the calling user, etc.)  I believe these are
typically built on top of thread-local or async-local variables.

I think some kind of async-local concept is a good idea.  We already do
some of this context propagation for open telemetry if I recall correctly.

> - TaskHints (though that doesn't seem to be used currently?)

I'm not aware of any usage of this.  My understanding is that this was
originally intended to provide hints to some kind of scheduler on how to
prioritize a task.  I think this concept can probably be removed.

[1]
https://learn.microsoft.com/en-us/dotnet/api/system.threading.asynclocal-1?view=net-7.0
[2] https://docs.rs/async-local/latest/async_local/
[3] https://nodejs.org/api/async_context.html
[4] https://github.com/facebook/folly/blob/main/folly/fibers/README.md
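
For a concrete picture of the propagation behavior being discussed, here is a
minimal sketch using Python's contextvars (which Antoine mentions below); the
variable name is made up and this is only meant to illustrate the semantics,
not to propose an API:

```
import asyncio
import contextvars

# An "async local": each task sees the value that was current when it was scheduled.
request_id = contextvars.ContextVar("request_id", default="<unset>")

async def task(name):
    print(name, "sees request_id =", request_id.get())

async def main():
    request_id.set("abc-123")
    # asyncio copies the current context into each task, so both tasks see
    # "abc-123" even though nothing was passed explicitly.
    await asyncio.gather(task("task1"), task("task2"))

asyncio.run(main())
```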



On Thu, Aug 24, 2023 at 6:22 AM Antoine Pitrou  wrote:

>
> Hello,
>
> Arrow C++ comes with execution facilities (such as thread pools, async
> generators...) meant to unlock higher performance by hiding IO latencies
> and exploiting several CPU cores. These execution facilities also
> obscure the context in which a task is executed: you cannot simply use
> local, global or thread-local variables to store ancillary parameters.
>
> Over the years we have started adding optional metadata that can be
> associated with tasks:
>
> - StopToken
> - TaskHints (though that doesn't seem to be used currently?)
> - some people have started to ask about IO tags:
> https://github.com/apache/arrow/issues/37267
>
> However, any such additional metadata must currently be explicitly
> passed to all tasks that might make use of them.
>
> My questions are thus:
>
> - do we want to continue using the explicit passing style?
> - on the contrary, do we want to switch to a paradigm where those, once
> set, are propagated implicitly along the task dependency flow (e.g. from
> the caller of Executor::Submit to the task submitted)
> - are there useful or insightful precedents in the C++ ecosystem?
>
> (note: a similar facility in Python is brought by "context vars":
> https://docs.python.org/3/library/contextvars.html)
>
> Regards
>
> Antoine.
>


Re: [VOTE][Format] Add Utf8View Arrays to Arrow Format

2023-08-21 Thread Weston Pace
+1

Thanks to all for the discussion and thanks to Ben for all of the great
work.


On Mon, Aug 21, 2023 at 9:16 AM wish maple  wrote:

> +1 (non-binding)
>
> It would help a lot when processing UTF-8 related data!
>
> Xuwei
>
> > On Tue, Aug 22, 2023 at 00:11, Andrew Lamb wrote:
>
> > +1
> >
> > This is a great example of collaboration
> >
> > On Sat, Aug 19, 2023 at 4:10 PM Chao Sun  wrote:
> >
> > > +1 (non-binding)!
> > >
> > > On Fri, Aug 18, 2023 at 12:59 PM Felipe Oliveira Carvalho <
> > > felipe...@gmail.com> wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > —
> > > > Felipe
> > > >
> > > > On Fri, 18 Aug 2023 at 18:48 Jacob Wujciak-Jens
> > > >  wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > On Fri, Aug 18, 2023 at 6:04 PM L. C. Hsieh 
> > wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > On Fri, Aug 18, 2023 at 5:53 AM Neal Richardson
> > > > > >  wrote:
> > > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > Thanks all for the thoughtful discussions here.
> > > > > > >
> > > > > > > Neal
> > > > > > >
> > > > > > > On Fri, Aug 18, 2023 at 4:14 AM Raphael Taylor-Davies
> > > > > > >  wrote:
> > > > > > >
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > Despite my earlier misgivings, I think this will be a
> valuable
> > > > > addition
> > > > > > > > to the specification.
> > > > > > > >
> > > > > > > > To clarify I've interpreted this as a vote on both Utf8View
> and
> > > > > > > > BinaryView as in the linked PR.
> > > > > > > >
> > > > > > > > On 28/06/2023 20:34, Benjamin Kietzman wrote:
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I'd like to propose adding Utf8View arrays to the arrow
> > format.
> > > > > > > > > Previous discussion in [1], columnar format description in
> > [2],
> > > > > > > > > flatbuffers changes in [3].
> > > > > > > > >
> > > > > > > > > There are implementations available in both C++[4] and
> Go[5]
> > > > which
> > > > > > > > > exercise the new type over IPC. Utf8View format
> > demonstrates[6]
> > > > > > > > > significant performance benefits over Utf8 in common tasks.
> > > > > > > > >
> > > > > > > > > The vote will be open for at least 72 hours.
> > > > > > > > >
> > > > > > > > > [ ] +1 add the proposed Utf8View type to the Apache Arrow
> > > format
> > > > > > > > > [ ] -1 do not add the proposed Utf8View type to the Apache
> > > Arrow
> > > > > > format
> > > > > > > > > because...
> > > > > > > > >
> > > > > > > > > Sincerely,
> > > > > > > > > Ben Kietzman
> > > > > > > > >
> > > > > > > > > [1]
> > > > > https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
> > > > > > > > > [2]
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/docs/source/format/Columnar.rst#variable-size-binary-view-layout
> > > > > > > > > [3]
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/arrow/pull/35628/files#diff-0623d567d0260222d5501b4e169141b5070eabc2ec09c3482da453a3346c5bf3
> > > > > > > > > [4] https://github.com/apache/arrow/pull/35628
> > > > > > > > > [5] https://github.com/apache/arrow/pull/35769
> > > > > > > > > [6]
> > > > > >
> https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Acero and Substrait: How to select struct field from a struct column?

2023-08-07 Thread Weston Pace
> But I can't figure out how to express "select struct field 0 from field 2
> of the original table where field 2 is a struct column"
>
> Any idea how the substrait message should look like for the above?

I believe it would be:

```
"expression": {
  "selection": {
"direct_reference": {
  "struct_field" {
"field": 2,
"child" {
  "struct_field" {  "field": 0 }
}
  }
}
"root_reference": { }
  }
}
```

To get the above I used the following python (requires [1] which could use
a review and you need some way to convert the binary substrait to json, I
used a script I have lying around):

```
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> schema = pa.schema([pa.field("points", pa.struct([pa.field("x",
pa.float64()), pa.field("y", pa.float64())]))])
>>> expr = pc.field(("points", "x"))
>>> expr.to_substrait(schema)

```

[1] https://github.com/apache/arrow/pull/34834
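
As an aside, if the end goal is just the flatten step and you are driving this
from Python rather than hand-building the Substrait, a projection with nested
field references does the same thing.  A minimal sketch (the column names here
are made up):

```
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

table = pa.table({
    "points": pa.array(
        [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}],
        type=pa.struct([("x", pa.float64()), ("y", pa.float64())])),
    "label": ["a", "b"],
})

# Pull the struct children out as top-level columns and drop the struct itself.
flattened = ds.dataset(table).to_table(columns={
    "x": pc.field(("points", "x")),
    "y": pc.field(("points", "y")),
    "label": pc.field("label"),
})
print(flattened)
```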

On Tue, Aug 1, 2023 at 1:45 PM Li Jin  wrote:

> Hi,
>
> I am recently trying to do
> (1) assign a struct type column s
> (2) flatten the struct columns (by assign v1=s[v1], v2=s[v2] and drop the s
> column)
>
> via Substrait and Acero.
>
> However, I ran into the problem where I don't know the proper substrait
> message to encode this (for (2))
>
> Normally, if I select a column from the origin table, it would look like
> this (e.g, select column index 1 from the original table):
>
> selection {
>   direct_reference {
> struct_field {
> 1
> }
>   }
> }
>
> But I can't figure out how to express "select struct field 0 from field 2
> of the original table where field 2 is a struct column"
>
> Any idea how the substrait message should look like for the above?
>


Re: [DISCUSS] Canonical alternative layout proposal

2023-08-02 Thread Weston Pace
> I would welcome a draft PR showcasing the changes necessary in the IPC
> format definition, and in the C Data Interface specification (no need to
> actually implement them for now :-)).

I've proposed something at [1].

> One sketch of an idea: define sets of types that we can call “kinds”**
> (e.g. “string kind” = {string, string view, large string, ree…},
> “list kind” = {list, large_list, list_view, large_list_view…}).

I think this is similar to the proposal with the exception that your
suggestion would require amending existing types that happen to be
alternatives to each other.  I'm not opposed to it but I think it's
compatible and we don't necessarily need all of the complexity just yet
(feel free to correct me if I'm wrong).  I don't think we need to introduce
the concept of "kind".  We already have a concept of "logical type" in the
spec.  I think what you are stating is that a single logical type may have
multiple physical layouts.  I agree.  E.g. variable size list<32>, variable
size list<64>, and REE are the physical layouts that, combined with the
logical type "string", give you "string", "large string", and "ree"

[1] https://github.com/apache/arrow/pull/37000

On Tue, Aug 1, 2023 at 1:51 AM Felipe Oliveira Carvalho 
wrote:

> A major difficulty in making the Arrow array types open for extension [1]
> is that as soon as we define an (a) universal representation* or (b)
> abstract interface, we close the door for vectorization. (a) prevents
> having new vectorization friendly formats and (b) limits the implementation
> of new vectorized operations. This is an instance of the “expression
> problem” [2].
>
> The way Arrow currently “solves” the data abstraction problem is by having
> no data abstraction — every operation takes a type and should provide
> specializations for every type. Sometimes it’s possible to re-use the same
> kernel for different types, but the general approach is that we specialize
> (in the case of C++, we sometimes can specialize by just instantiating a
> template, but that’s still an specialization).
>
> Given these constraints, what could be done?
>
> One sketch of an idea: define sets of types that we can call “kinds”**
> (e.g. “string kind” = {string, string view, large string, ree…},
> “list kind” = {list, large_list, list_view, large_list_view…}).
>
> Then when different implementations have to communicate or interoperate,
> they have to only be up to date on the list of Arrow Kinds and before data
> is moved a conversion step between types within the same kind is performed
> if required to make that communication possible.
>
> Example: a system that has a string_view Array and needs to send that array
> to a system that only understands large_string instances of the string kind
> MUST perform a conversion. This means that as long as all Arrow
> implementations understand one established type on each of the kinds, they
> can communicate.
>
> This imposes a reasonable requirement on new types: when introduced, they
> should come with conversions to the previously specified types on that
> kind.
>
> Any thoughts?
>
> —
> Felipe
> Voltron Data
>
>
> [1] https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
> [2] https://en.wikipedia.org/wiki/Expression_problem
>
> * “an array is a list of buffers and child arrays” doesn’t qualify as
> “universal representation” because it doesn’t make a commitment on what all
> the buffers and child arrays mean universally
>
> ** if kind is already taken to mean scalar/array, we can use the term
> “sort”
>
> On Mon, 31 Jul 2023 at 04:39 Gang Wu  wrote:
>
> > I am also in favor of the idea of an alternative layout. IIRC, a new
> > alternative
> > layout still goes into a process of standardization though it is the
> choice
> > of
> > each implementation to decide support now or later. I'd like to ask if we
> > can
> > provide the flexibility for implementations or downstream projects to
> > actually
> > implement a new alternative layout by means of a pluggable interface
> before
> > starting the standardization process. This is similar to promoting a
> > popular
> > extension type implemented by many users to a canonical extension type.
> > I know this is more complicated as extension type simply reuses existing
> > layout but alternative layout usually means a brand new one. For example,
> > if two projects speak Arrow and now they want to share a new layout, they
> > can simply implement a pluggable alternative layout before Arrow adopts
> it.
> > This can unblock projects to evolve and help Arrow not to be fragmented.
> >
> > Best,
> > Gang
> >
> > On Tue, Jul 18, 2023 at 10:35 PM Antoine Pitrou 
> > wrote:
> >
> > >
> > > Hello,
> > >
> > > I'm trying to reason about the advantages and drawbacks of this
> > > proposal, but it seems to me that it lacks definition.
> > >
> > > I would welcome a draft PR showcasing the changes necessary in the IPC
> > > format definition, and in the C Data Interface specification (no need
> 

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-31 Thread Weston Pace
Thanks.  This is a very helpful reproduction.

I was able to reproduce and diagnose the problem.  There is a bug on our
end and I have filed [1] to address it.  There are a lot more details in
the ticket if you are interested.  In the meantime, the only workaround I
can think of is probably to slow down the data source enough that the queue
doesn't fill up.

[1] https://github.com/apache/arrow/issues/36951
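
A minimal sketch of that workaround, wrapping a batch generator like the
`rb_generator` in the reproduction quoted below (the delay is just a knob you
would have to tune against your storage):

```
import time

def throttled(batches, delay_s=0.05):
    # Sleep between batches so the dataset writer's queue has a chance to drain.
    for batch in batches:
        yield batch
        time.sleep(delay_s)

# e.g. in the reproduction below:
# reader = pa.RecordBatchReader.from_batches(
#     schema, throttled(rb_generator(64, 32768, 100)))
```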


On Sun, Jul 30, 2023 at 8:15 PM Wenbo Hu  wrote:

> Hi,
> The following code should reproduce the problem.
>
> ```
>
> import pyarrow as pa
> import pyarrow.fs, pyarrow.dataset
>
> schema = pa.schema([("id", pa.utf8()), ("bucket", pa.uint8())])
>
>
> def rb_generator(buckets, rows, batches):
> batch = pa.record_batch(
> [[f"id-{i}" for i in range(rows)], [i % buckets for i in
> range(rows)]],
> schema=schema,
> )
>
> for i in range(batches):
> yield batch
> print(f"yielding {i}")
>
>
> if __name__ == "__main__":
> pa.set_io_thread_count(1)
> reader = pa.RecordBatchReader.from_batches(schema,
> rb_generator(64, 32768, 100))
> local_fs = pa.fs.LocalFileSystem()
>
> pa.dataset.write_dataset(
> reader,
> "/tmp/data_f",
> format="feather",
> partitioning=["bucket"],
> filesystem=local_fs,
> existing_data_behavior="overwrite_or_ignore"
> )
>
> ```
>
> On Sun, Jul 30, 2023 at 15:30, Wenbo Hu wrote:
> >
> > Hi,
> >Then another question is that "why back pressure not working on the
> > input stream of write_dataset api?". If back pressure happens on the
> > end of the acero stream for some reason (on queue stage or write
> > stage), then the input stream should backpressure as well? It should
> > keep the memory to a stable level so that the input speed would match
> > the output speed.
> >
> > Then, I made some other experiments with various io_thread_count
> > values and bucket_size (partitions/opening files).
> >
> > 1. for bucket_size to 64 and io_thread_count/cpu_count to 1, the cpu
> > is up to 100% after transferring done, but there is a very interesting
> > output.
> > * flow transferring from client to server at the very first few
> > batches are slow, less than 0.01M rows/s, then it speeds up to over 4M
> > rows/s very quickly.
> > * I think at the very beginning stage, the backpressure works
> > fine, until sometime, like the previous experiments, the backpressure
> > makes the stream into a blackhole, then the io thread input stream
> > stuck at some slow speed. (It's still writing, but takes a lot of time
> > on waiting upstream CPU partitioning threads to push batches?)
> > * from iotop, the total disk write is dropping down very slowly
> > after transferring done. But it may change over different experiments
> > with the same configuration. I think the upstream backpressure is not
> > working as expected, which makes the downstream writing keep querying.
> > I think it may reveal something, maybe at some point, the slow writing
> > enlarge the backpressure on the whole process (the write speed is
> > dropping slowly), but the slow reason of writing is the upstream is
> > already slow down.
> >
> > 2. Then I set cpu_count to 64
> > 2.1 io_thread_count to 4.
> > 2.1.1 , for bucket_size to 2/4/6, The system works fine. CPU is less
> > than 100%. Backpressure works fine, memory will not accumulated much
> > before the flow speed becomes stable.
> > 2.1.2  when bucket_size to 8, the bug comes back. After transferring
> > done, CPU is about 350% (only io thread is running?) and write from
> > iotop is about 40M/s, then dropping down very slowly.
> >
> > 2.2. then I set both io_thread to 6,
> > 2.2.1 for bucket_size to 6/8/16, The system works fine. CPU is about
> > 100%. like 2.1.1
> > 2.2.2 for bucket_size to 32, the bug comes back. CPU halts at 550%.
> >
> > 2.3 io_thread_count to 8
> > 2.3.1 for bucket_size to 16, it fails somehow. After transferring
> > done, the memory accumulated over 3G, but write speed is about 60M/s,
> > which makes it possible to wait. CPU is about 600~700%. After the
> > accumulated memory writing, CPU becomes normal.
> > 2.3.2 for bucket_size to 32, it still fails. CPU halts at 800%.
> > transferring is very fast (over 14M rows/s). the backpressure is not
> > working at all.
> >
> >
> > On Sat, Jul 29, 2023 at 01:08, Weston Pace wrote:
> > >
> > > > How many io threads are writing concurr

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-28 Thread Weston Pace
 to 1, io_thread_count to 128, CPU goes to 800% much
> slower than record2 (due to slower flow speed?), RES is growing to 30G
> to the end of transferring, average flow speed is 1.62M rows/s. Same
> happens as record 1 after transferring done.
>
> Then I'm trying to limit the flow speed before writing queue got full
> with custom flow control (sleep on reader iteration based on available
> memory) But the sleep time curve is not accurate, sometimes flow slows
> down, but the queue got full anyway.
> Then the interesting thing happens, before the queue is full (memory
> quickly grows up), the CPU is not fully used. When memory grows up
> quickly, CPU goes up as well, to 800%.
> 1. Sometimes, the writing queue can overcome, CPU will goes down after
> the memory accumulated. The writing speed recoved and memory back to
> normal.
> 2. Sometimes, it can't. IOBPS goes down sharply, and CPU never goes
> down after that.
>
> How many io threads are writing concurrently in a single write_dataset
> call? How do they schedule? The throttle code seems only one task got
> running?
> What else can I do to inspect the problem?
>
> On Fri, Jul 28, 2023 at 00:33, Weston Pace wrote:
> >
> > You'll need to measure more but generally the bottleneck for writes is
> > usually going to be the disk itself.  Unfortunately, standard OS buffered
> > I/O has some pretty negative behaviors in this case.  First I'll describe
> > what I generally see happen (the last time I profiled this was a while
> back
> > but I don't think anything substantial has changed).
> >
> > * Initially, writes are very fast.  The OS `write` call is simply a
> memcpy
> > from user space into kernel space.  The actual flushing the data from
> > kernel space to disk happens asynchronously unless you are using direct
> I/O
> > (which is not currently supported).
> > * Over time, assuming the data arrival rate is faster than the data write
> > rate, the data will accumulate in kernel memory.  For example, if you
> > continuously run the Linux `free` program you will see the `free` column
> > decrease and the `buff/cache` column decreases.  The `available` column
> > should generally stay consistent (kernel memory that is in use but can
> > technically be flushed to disk if needed is still considered "available"
> > but not "free")
> > * Once the `free` column reaches 0 then a few things happen.  First, the
> > calls to `write` are no longer fast (the write cannot complete until some
> > existing data has been flushed to disk).  Second, other processes that
> > aren't in use might start swapping their data to disk (you will generally
> > see the entire system, if it is interactive, grind to a halt).  Third, if
> > you have an OOM-killer active, it may start to kill running applications.
> > It isn't supposed to do so but there are sometimes bugs[1].
> > * Assuming the OOM killer does not kill your application then, because
> the
> > `write` calls slow down, the number of rows in the dataset writer's queue
> > will start to fill up (this is captured by the variable
> > `rows_in_flight_throttle`.
> > * Once the rows_in_flight_throttle is full it will pause and the dataset
> > writer will return an unfinished future (asking the caller to back off).
> > * Once this happens the caller will apply backpressure (if being used in
> > Acero) which will pause the reader.  This backpressure is not instant and
> > generally each running CPU thread still delivers whatever batch it is
> > working on.  These batches essentially get added to an asynchronous
> > condition variable waiting on the dataset writer queue to free up.  This
> is
> > the spot where the ThrottledAsyncTaskScheduler is used.
> >
> > The stack dump that you reported is not exactly what I would have
> expected
> > but it might still match the above description.  At this point I am just
> > sort of guessing.  When the dataset writer frees up enough to receive
> > another batch it will do what is effectively a "notify all" and all of
> the
> > compute threads are waking up and trying to add their batch to the
> dataset
> > writer.  One of these gets through, gets added to the dataset writer, and
> > then backpressure is applied again and all the requests pile up once
> > again.  It's possible that a "resume sending" signal is sent and this
> might
> > actually lead to RAM filling up more.  We could probably mitigate this by
> > adding a low water mark to the dataset writer's backpressure throttle (so
> > it doesn't send the resume signal as soon as the queue has room but waits
> > until the queu

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-27 Thread Weston Pace
You'll need to measure more but generally the bottleneck for writes is
usually going to be the disk itself.  Unfortunately, standard OS buffered
I/O has some pretty negative behaviors in this case.  First I'll describe
what I generally see happen (the last time I profiled this was a while back
but I don't think anything substantial has changed).

* Initially, writes are very fast.  The OS `write` call is simply a memcpy
from user space into kernel space.  The actual flushing the data from
kernel space to disk happens asynchronously unless you are using direct I/O
(which is not currently supported).
* Over time, assuming the data arrival rate is faster than the data write
rate, the data will accumulate in kernel memory.  For example, if you
continuously run the Linux `free` program you will see the `free` column
decrease and the `buff/cache` column decreases.  The `available` column
should generally stay consistent (kernel memory that is in use but can
technically be flushed to disk if needed is still considered "available"
but not "free")
* Once the `free` column reaches 0 then a few things happen.  First, the
calls to `write` are no longer fast (the write cannot complete until some
existing data has been flushed to disk).  Second, other processes that
aren't in use might start swapping their data to disk (you will generally
see the entire system, if it is interactive, grind to a halt).  Third, if
you have an OOM-killer active, it may start to kill running applications.
It isn't supposed to do so but there are sometimes bugs[1].
* Assuming the OOM killer does not kill your application then, because the
`write` calls slow down, the number of rows in the dataset writer's queue
will start to fill up (this is captured by the variable
`rows_in_flight_throttle`.
* Once the rows_in_flight_throttle is full it will pause and the dataset
writer will return an unfinished future (asking the caller to back off).
* Once this happens the caller will apply backpressure (if being used in
Acero) which will pause the reader.  This backpressure is not instant and
generally each running CPU thread still delivers whatever batch it is
working on.  These batches essentially get added to an asynchronous
condition variable waiting on the dataset writer queue to free up.  This is
the spot where the ThrottledAsyncTaskScheduler is used.

The stack dump that you reported is not exactly what I would have expected
but it might still match the above description.  At this point I am just
sort of guessing.  When the dataset writer frees up enough to receive
another batch it will do what is effectively a "notify all" and all of the
compute threads are waking up and trying to add their batch to the dataset
writer.  One of these gets through, gets added to the dataset writer, and
then backpressure is applied again and all the requests pile up once
again.  It's possible that a "resume sending" signal is sent and this might
actually lead to RAM filling up more.  We could probably mitigate this by
adding a low water mark to the dataset writer's backpressure throttle (so
it doesn't send the resume signal as soon as the queue has room but waits
until the queue is half full).

I'd recommend watching the output of `free` (or monitoring memory in some
other way) and verifying the above.  I'd also suggest lowering the number
of CPU threads and see how that affects performance.  If you lower the CPU
threads enough then you should eventually get it to a point where your
supply of data is slower then your writer and I wouldn't expect memory to
accumulate.  These things are solutions but might give us more clues into
what is happening.

[1]
https://unix.stackexchange.com/questions/300106/why-is-the-oom-killer-killing-processes-when-swap-is-hardly-used
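
If you don't want to keep a terminal running `free`, here is a minimal sketch
of the same monitoring from Python (Linux only, since it reads /proc/meminfo):
watch whether MemAvailable stays roughly flat while the write runs or keeps
shrinking.

```
import time

def mem_available_kb():
    # MemAvailable is roughly what `free` reports in its "available" column.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])

# Run this alongside the dataset write and watch the trend.
for _ in range(30):
    print(f"MemAvailable: {mem_available_kb()} kB")
    time.sleep(1.0)
```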

On Thu, Jul 27, 2023 at 4:53 AM Wenbo Hu  wrote:

> Hi,
> I'm using flight to receive streams from client and write to the
> storage with python `pa.dataset.write_dataset` API. The whole data is
> 1 Billion rows, over 40GB with one partition column ranges from 0~63.
> The container runs at 8-cores CPU and 4GB ram resources.
> It can be done about 160s (6M rows/s, each record batch is about
> 32K rows) for completing transferring and writing almost
> synchronously, after setting 128 for io_thread_count.
>  Then I'd like to find out the bottleneck of the system, CPU or
> RAM or storage?
> 1. I extend the ram into 32GB, then the server consumes more ram,
> the writing progress works at the beginning, then suddenly slow down
> and the data accumulated into ram until OOM.
> 2. Then I set the ram to 64GB, so that the server will not killed
> by OOM. Same happens, also, after all the data is transferred (in
> memory), the server consumes all CPU shares (800%), but still very
> slow on writing (not totally stopped, but about 100MB/minute).
> 2.1 I'm wondering if the io thread is stuck, or the computation
> task is stuck. After setting both io_thread_count and cpu_count to 32,
> I wrapped the 

Re: scheduler() and aync_scheduler() on QueryContext

2023-07-26 Thread Weston Pace
Also, if you haven't seen it yet, the 13.0.0 release adds considerably more
documentation around Acero, including the scheduler:

https://arrow.apache.org/docs/dev/cpp/acero/developer_guide.html#scheduling-and-parallelism

On Wed, Jul 26, 2023 at 10:13 AM Li Jin  wrote:

> Thanks Weston! Very helpful explanation.
>
> On Tue, Jul 25, 2023 at 6:41 PM Weston Pace  wrote:
>
> > 1) As a rule of thumb I would probably prefer `async_scheduler`.  It's
> more
> > feature rich and simpler to use and is meant to handle "long running"
> tasks
> > (e.g. 10s-100s of ms or more).
> >
> > The scheduler is a bit more complex and is intended for very fine-grained
> > scheduling.  It's currently only used in a few nodes, I think the
> hash-join
> > and the hash-group-by for things like building the hash table (after the
> > build data has been accumulated).
> >
> > 2) Neither scheduler manages threads.  Both of them rely on the executor
> in
> > ExecContext::executor().  The scheduler takes a "schedule task callback"
> > which it expects to do the actual executor submission.  The async
> scheduler
> > uses futures and virtual classes.  A "task" is something that can be
> called
> > which returns a future that will be completed when the task is complete.
> > Most of the time this is done by submitting something to an executor (in
> > return for a future).  Sometimes this is done indirectly, for example, by
> > making an async I/O call (which under the hood is usually implemented by
> > submitting something to the I/O executor).
> >
> > On Tue, Jul 25, 2023 at 2:56 PM Li Jin  wrote:
> >
> > > Hi,
> > >
> > > I am reading Acero and got confused about the use of
> > > QueryContext::scheduler() and QueryContext::async_scheduler(). So I
> have
> > a
> > > couple of questions:
> > >
> > > (1) What are the different purposes of these two?
> > > (2) Does scheduler/aysnc_scheduler own any threads inside their
> > respective
> > > classes or do they use the thread pool from ExecContext::executor()?
> > >
> > > Thanks,
> > > Li
> > >
> >
>


Re: how to make acero output order by batch index

2023-07-26 Thread Weston Pace
> Replacing ... with ... works as expected

This is, I think, because the RecordBatchSourceNode defaults to implicit
ordering (note the RecordBatchSourceNode is a SchemaSourceNode):

```
struct SchemaSourceNode : public SourceNode {
  SchemaSourceNode(ExecPlan* plan, std::shared_ptr<Schema> schema,
                   arrow::AsyncGenerator<std::optional<ExecBatch>> generator)
  : SourceNode(plan, schema, generator, Ordering::Implicit()) {}
```

It seems a little inconsistent that the RecordBatchSourceNode defaults to
implicit and the RecordBatchReader source does not.  It would be nice to
fix those (as I described above it is probably ok to assume an implicit
ordering in many cases).

On Wed, Jul 26, 2023 at 8:18 AM Weston Pace  wrote:

> > I think the key problem is that the input stream is unordered. The
> > input stream is a ArrowArrayStream imported from python side, and then
> > declared to a "record_batch_reader_source", which is a unordered
> > source node. So the behavior is expected.
> >   I think the RecordBatchReaderSourceOptions should add an ordering
> > parameter to indicate the input stream ordering. Otherwise, I need to
> > convert the record_batch_reader_source into a "record_batch_source"
> > with a record_batch generator.
>
> I agree.  Also, keep in mind that there is the "implicit order" which
> means "this source node has some kind of deterministic ordering even if it
> isn't reflected in any single column".  In other words, if you have a CSV
> file for example it will always be implicitly ordered (by line number) even
> if the line number isn't a column.  This should allow us to do things like
> "grab the first 10 rows" and get behavior the user expects even if their
> data isn't explicitly ordered.  In most cases we can assume that data has
> some kind of batch order.  The only time it does not is if the source
> itself is non-deterministic.  For example, maybe the source is some kind of
> unordered scan from an external SQL source.
>
> > Also, I'd like to have a discuss on dataset scanner, is it produce a
> > stable sequence of record batches (as an implicit ordering) when the
> > underlying storage is not changed?
>
> Yes, both the old and new scan node are capable of doing this.  The
> implicit order is given by the order of the fragments in the dataset (which
> we assume will always be consistent and, in the case of a
> FileSystemDataset, it is).  In the old scan node you need to set the
> property `require_sequenced_output` on the ScanNodeOptions to true (I
> believe the new scan node will always sequence output but this property may
> eventually exist there too).
>
> > For my situation, the downstream
> > executor may crush, then it would request to continue from a
> > intermediate state (with a restart offset). I'd like to make it into a
> > fetch node to skip heading rows, but it seems not an optimized way.
>
> Regrettably the old scan node does not have skip implemented.  It is a
> little tricky since we do not have a catalog and thus do not know how many
> rows every single file has.  So we have to calculate the skip at runtime.
> I am planning to support this in the new scan node.
>
> > Maybe I should inspect fragments in the dataset, to skip reading
> > unnecessary files, and build a FileSystemDataset on the fly?
>
> Yes, this should work today.
>
>
> On Tue, Jul 25, 2023 at 10:37 PM Wenbo Hu  wrote:
>
>> Replacing
>> ```
>> ac::Declaration source{"record_batch_reader_source",
>> ac::RecordBatchReaderSourceNodeOptions{std::move(input)}};
>> ```
>> with
>> ```
>> ac::RecordBatchSourceNodeOptions rb_source_options{
>> input->schema(), [input]() { return
>> arrow::MakeFunctionIterator([input] { return input->Next(); }); }};
>> ac::Declaration source{"record_batch_source",
>> std::move(rb_source_options)};
>> ```
>> Works as expected.
>>
>> On Wed, Jul 26, 2023 at 10:22, Wenbo Hu wrote:
>> >
>> > Hi,
>> >   I'll open a issue on the DeclareToReader problem.
>> >   I think the key problem is that the input stream is unordered. The
>> > input stream is a ArrowArrayStream imported from python side, and then
>> > declared to a "record_batch_reader_source", which is a unordered
>> > source node. So the behavior is expected.
>> >   I think the RecordBatchReaderSourceOptions should add an ordering
>> > parameter to indicate the input stream ordering. Otherwise, I need to
>> > convert the record_batch_reader_source into a "record_batch_source"
>> > with a record_batch generator.
>> >   Also, I'd like to have a di

Re: how to make acero output order by batch index

2023-07-26 Thread Weston Pace
> I think the key problem is that the input stream is unordered. The
> input stream is a ArrowArrayStream imported from python side, and then
> declared to a "record_batch_reader_source", which is a unordered
> source node. So the behavior is expected.
>   I think the RecordBatchReaderSourceOptions should add an ordering
> parameter to indicate the input stream ordering. Otherwise, I need to
> convert the record_batch_reader_source into a "record_batch_source"
> with a record_batch generator.

I agree.  Also, keep in mind that there is the "implicit order" which means
"this source node has some kind of deterministic ordering even if it isn't
reflected in any single column".  In other words, if you have a CSV file
for example it will always be implicitly ordered (by line number) even if
the line number isn't a column.  This should allow us to do things like
"grab the first 10 rows" and get behavior the user expects even if their
data isn't explicitly ordered.  In most cases we can assume that data has
some kind of batch order.  The only time it does not is if the source
itself is non-deterministic.  For example, maybe the source is some kind of
unordered scan from an external SQL source.

> Also, I'd like to have a discuss on dataset scanner, is it produce a
> stable sequence of record batches (as an implicit ordering) when the
> underlying storage is not changed?

Yes, both the old and new scan node are capable of doing this.  The
implicit order is given by the order of the fragments in the dataset (which
we assume will always be consistent and, in the case of a
FileSystemDataset, it is).  In the old scan node you need to set the
property `require_sequenced_output` on the ScanNodeOptions to true (I
believe the new scan node will always sequence output but this property may
eventually exist there too).

> For my situation, the downstream
> executor may crush, then it would request to continue from a
> intermediate state (with a restart offset). I'd like to make it into a
> fetch node to skip heading rows, but it seems not an optimized way.

Regrettably the old scan node does not have skip implemented.  It is a
little tricky since we do not have a catalog and thus do not know how many
rows every single file has.  So we have to calculate the skip at runtime.
I am planning to support this in the new scan node.

> Maybe I should inspect fragments in the dataset, to skip reading
> unnecessary files, and build a FileSystemDataset on the fly?

Yes, this should work today.
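
A minimal sketch of what that could look like (the path, format, and the number
of fragments to skip are placeholders):

```
import pyarrow.dataset as ds

dataset = ds.dataset("/path/to/data", format="feather")  # placeholder
fragments = list(dataset.get_fragments())

# Rebuild a dataset from only the fragments that still need to be processed.
remaining = ds.FileSystemDataset(
    fragments[3:],  # e.g. the first 3 files were already handled
    schema=dataset.schema,
    format=dataset.format,
    filesystem=dataset.filesystem,
)
print(remaining.count_rows())
```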


On Tue, Jul 25, 2023 at 10:37 PM Wenbo Hu  wrote:

> Replacing
> ```
> ac::Declaration source{"record_batch_reader_source",
> ac::RecordBatchReaderSourceNodeOptions{std::move(input)}};
> ```
> with
> ```
> ac::RecordBatchSourceNodeOptions rb_source_options{
> input->schema(), [input]() { return
> arrow::MakeFunctionIterator([input] { return input->Next(); }); }};
> ac::Declaration source{"record_batch_source",
> std::move(rb_source_options)};
> ```
> Works as expected.
>
> On Wed, Jul 26, 2023 at 10:22, Wenbo Hu wrote:
> >
> > Hi,
> >   I'll open a issue on the DeclareToReader problem.
> >   I think the key problem is that the input stream is unordered. The
> > input stream is a ArrowArrayStream imported from python side, and then
> > declared to a "record_batch_reader_source", which is a unordered
> > source node. So the behavior is expected.
> >   I think the RecordBatchReaderSourceOptions should add an ordering
> > parameter to indicate the input stream ordering. Otherwise, I need to
> > convert the record_batch_reader_source into a "record_batch_source"
> > with a record_batch generator.
> >   Also, I'd like to have a discuss on dataset scanner, is it produce a
> > stable sequence of record batches (as an implicit ordering) when the
> > underlying storage is not changed? For my situation, the downstream
> > executor may crush, then it would request to continue from a
> > intermediate state (with a restart offset). I'd like to make it into a
> > fetch node to skip heading rows, but it seems not an optimized way.
> > Maybe I should inspect fragments in the dataset, to skip reading
> > unnecessary files, and build a FileSystemDataset on the fly?
> >
> > On Tue, Jul 25, 2023 at 23:44, Weston Pace wrote:
> > >
> > > > Reading the source code of exec_plan.cc, DeclarationToReader called
> > > > DeclarationToRecordBatchGenerator, which ignores the sequence_output
> > > > parameter in SinkNodeOptions, also, it calls validate which should
> > > > fail if the SinkNodeOptions honors the sequence_output. Then it seems
> > > > that DeclarationToReader cannot follow the input batch order?
> > 

Re: scheduler() and aync_scheduler() on QueryContext

2023-07-25 Thread Weston Pace
1) As a rule of thumb I would probably prefer `async_scheduler`.  It's more
feature rich and simpler to use and is meant to handle "long running" tasks
(e.g. 10s-100s of ms or more).

The scheduler is a bit more complex and is intended for very fine-grained
scheduling.  It's currently only used in a few nodes, I think the hash-join
and the hash-group-by for things like building the hash table (after the
build data has been accumulated).

2) Neither scheduler manages threads.  Both of them rely on the executor in
ExecContext::executor().  The scheduler takes a "schedule task callback"
which it expects to do the actual executor submission.  The async scheduler
uses futures and virtual classes.  A "task" is something that can be called
which returns a future that will be completed when the task is complete.
Most of the time this is done by submitting something to an executor (in
return for a future).  Sometimes this is done indirectly, for example, by
making an async I/O call (which under the hood is usually implemented by
submitting something to the I/O executor).

On Tue, Jul 25, 2023 at 2:56 PM Li Jin  wrote:

> Hi,
>
> I am reading Acero and got confused about the use of
> QueryContext::scheduler() and QueryContext::async_scheduler(). So I have a
> couple of questions:
>
> (1) What are the different purposes of these two?
> (2) Does scheduler/aysnc_scheduler own any threads inside their respective
> classes or do they use the thread pool from ExecContext::executor()?
>
> Thanks,
> Li
>


Re: how to make acero output order by batch index

2023-07-25 Thread Weston Pace
> Reading the source code of exec_plan.cc, DeclarationToReader called
> DeclarationToRecordBatchGenerator, which ignores the sequence_output
> parameter in SinkNodeOptions, also, it calls validate which should
> fail if the SinkNodeOptions honors the sequence_output. Then it seems
> that DeclarationToReader cannot follow the input batch order?

These methods should not be ignoring sequence_output.  Do you want to open
a bug?  This should be a straightforward one to fix.

> Then how the substrait works in this scenario? Does it output
> disorderly as well?

Probably.  Much of internal Substrait testing is probably using
DeclarationToTable or DeclarationToBatches.  The ordered execution hasn't
been adopted widely yet because the old scanner doesn't set the batch index
and the new scanner isn't ready yet.  This limits the usefulness to data
that is already in memory (the in-memory sources do set the batch index).

I think your understanding of the concept is correct however.  Can you
share a sample plan that is not working for you?  If you use
DeclarationToTable do you get consistently ordered results?

On Tue, Jul 25, 2023 at 7:06 AM Wenbo Hu  wrote:

> Reading the source code of exec_plan.cc, DeclarationToReader called
> DeclarationToRecordBatchGenerator, which ignores the sequence_output
> parameter in SinkNodeOptions, also, it calls validate which should
> fail if the SinkNodeOptions honors the sequence_output. Then it seems
> that DeclarationToReader cannot follow the input batch order?
> Then how the substrait works in this scenario? Does it output
> disorderly as well?
>
> On Tue, Jul 25, 2023 at 19:12, Wenbo Hu wrote:
> >
> > Hi,
> > I'm trying to zip two streams with same order but different
> processes.
> > For example, the original stream comes with two column 'id' and
> > 'age', and splits into two stream processed distributedly using acero:
> > 1. hash the 'id' into a stream with single column 'bucket_id' and 2.
> > classify 'age' into ['child', 'teenage', 'adult',...]. And then zip
> > into a single stream.
> >
> >[  'id'  |  'age'  | many other columns]
> > |  ||
> >['bucket_id']   ['classify']|
> >  |  |   |
> >   [zipped_stream | many_other_columns]
> > I was expecting both bucket_id and classify can keep the same order as
> > the orginal stream before they are zipped.
> > According to document, "ordered execution" is using batch_index to
> > indicate the order of batches.
> > but acero::DeclarationToReader with a QueryOptions that sequce_output
> > is set to true does not mean that it keeps the order if the input
> > stream is not ordered. But it doesn't fail during the execution
> > (bucket_id and classify are not specify any ordering). Then How can I
> > make the acero produce a stream that keep the order as the original
> > input?
> > --
> > -
> > Best Regards,
> > Wenbo Hu,
>
>
>
> --
> -
> Best Regards,
> Wenbo Hu,
>


Re: hashing Arrow structures

2023-07-24 Thread Weston Pace
> Also, I don't understand why there are two versions of the hash table
> ("hashing32" and "hashing64" apparently). What's the rationale? How is
> the user meant to choose between them? Say a Substrait plan is being
> executed: which hashing variant is chosen and why?

It's not user-configurable.  The hash-join and hash-group-by always use the
32-bit variant.  The asof-join always uses the 64-bit variant.  I wouldn't
stress too much about the hash-join.  It is a very memory intensive
operation and my guess is that by the time you have enough keys to worry
about hash uniqueness you should probably be doing an out-of-core join
anyways.  The hash-join implementation is also fairly tolerant to duplicate
keys anyways.  I believe our hash-join performance is unlikely to be the
bottleneck in most cases.

It might make more sense to use the 64-bit variant for the group-by, as we
are normally only storing the hash-to-group-id table itself in those
cases.  Solid benchmarking would probably be needed regardless.

On Mon, Jul 24, 2023 at 1:19 AM Antoine Pitrou  wrote:

>
> Hi,
>
> Le 21/07/2023 à 15:58, Yaron Gvili a écrit :
> > A first approach I found is using `Hashing32` and `Hashing64`. This
> approach seems to be useful for hashing the fields composing a key of
> multiple rows when joining. However, it has a couple of drawbacks. One
> drawback is that if the number of distinct keys is large (like in the scale
> of a million or so) then the probability of hash collision may no longer be
> acceptable for some applications, more so when using `Hashing32`. Another
> drawback that I noticed in my experiments is that the common `N/A` and `0`
> integer values both hash to 0 and thus collide.
>
> Ouch... so if N/A does have the same hash value as a common non-null
> value (0), this should be fixed.
>
> Also, I don't understand why there are two versions of the hash table
> ("hashing32" and "hashing64" apparently). What's the rationale? How is
> the user meant to choose between them? Say a Substrait plan is being
> executed: which hashing variant is chosen and why?
>
> I don't think 32-bit hashing is a good idea when operating on large
> data. Unless the hash function is exceptionally good, you may get lots
> of hash collisions. It's nice to have a SIMD-accelerated hash table, but
> less so if access times degenerate to O(n)...
>
> So IMHO we should only have one hashing variant with a 64-bit output.
> And make sure it doesn't have trivial collisions on common data patterns
> (such as nulls and zeros, or clustered integer ranges).
>
> > A second approach I found is by serializing the Arrow structures
> (possibly by streaming) and hashing using functions in `util/hashing.h`. I
> didn't yet look into what properties these hash functions have except for
> the documented high performance. In particular, I don't know whether they
> have unfortunate hash collisions and, more generally, what is the
> probability of hash collision. I also don't know whether they are designed
> for efficient use in the context of joining.
>
> Those hash functions shouldn't have unfortunate hash, but they were not
> exercised on real-world data at the time. I have no idea whether they
> are efficient in the context of joining, as they have been written much
> earlier than our joining implementation.
>
> Regards
>
> Antoine.
>


Re: hashing Arrow structures

2023-07-21 Thread Weston Pace
Yes, those are the two main approaches to hashing in the code base that I
am aware of as well.  I haven't seen any real concrete comparison and
benchmarks between the two.  If collisions between NA and 0 are a problem
it would probably be ok to tweak the hash value of NA to something unique.
I suspect these collisions aren't an inevitable fact of the design but more
just something that has not been considered yet.

There is a third way currently in use as well by
arrow::compute::GrouperImpl.  In this class the key values from each row
are converted into a single "row format" which is stored in a std::string.
A std::unordered_map is then used for the hashing.  The GrouperFastImpl
class was created to (presumably) be faster than this.  It uses the
Hashing32 routines and stores the groups in an arrow::compute::SwissTable.
However, I think there was some benchmarking done that showed the
GrouperImpl to be faster than the GrouperFastImpl.
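
For context from the Python side, this grouper machinery is (as far as I know)
what ultimately sits behind hash group-by, e.g. Table.group_by:

```
import pyarrow as pa

table = pa.table({"key": ["a", "b", "a", "b", "a"], "value": [1, 2, 3, 4, 5]})

# Hash group-by: keys are hashed and mapped to group ids internally.
print(table.group_by("key").aggregate([("value", "sum")]))
```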

On Fri, Jul 21, 2023 at 6:59 AM Yaron Gvili  wrote:

> Hi,
>
> What are the recommended ways to hash Arrow structures? What are the pros
> and cons of each approach?
>
> Looking a bit through the code, I've so far found two different hashing
> approaches, which I describe below. Are there any others?
>
> A first approach I found is using `Hashing32` and `Hashing64`. This
> approach seems to be useful for hashing the fields composing a key of
> multiple rows when joining. However, it has a couple of drawbacks. One
> drawback is that if the number of distinct keys is large (like in the scale
> of a million or so) then the probability of hash collision may no longer be
> acceptable for some applications, more so when using `Hashing32`. Another
> drawback that I noticed in my experiments is that the common `N/A` and `0`
> integer values both hash to 0 and thus collide.
>
> A second approach I found is by serializing the Arrow structures (possibly
> by streaming) and hashing using functions in `util/hashing.h`. I didn't yet
> look into what properties these hash functions have except for the
> documented high performance. In particular, I don't know whether they have
> unfortunate hash collisions and, more generally, what is the probability of
> hash collision. I also don't know whether they are designed for efficient
> use in the context of joining.
>
>
> Cheers,
> Yaron.
>


Re: Need help on ArrayaSpan and writing C++ udf

2023-07-17 Thread Weston Pace
> I may be missing something, but why copy to *out_values++ instead of
> *out_values and add 32 to out_values afterwards? Otherwise I agree this is
> the way to go.

I agree with Jin.  You should probably be incrementing `out` by 32 each
time `VisitValue` is called.
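
To make that concrete, a minimal sketch of the visitor with the output
advanced by the fixed 32-byte width per slot (this uses a plain `uint8_t*`
instead of the double pointer; `sha256` is assumed to be your own helper):

```
#include <cstdint>
#include <cstring>
#include <string_view>
#include <arrow/status.h>

// Assumed user-provided helper from the original snippet.
void sha256(std::string_view input, uint8_t out[32]);

// Sketch only: each fixed_size_binary(32) slot occupies 32 bytes in the data
// buffer, so the output pointer advances by 32 per visited slot (nulls too,
// so valid values stay aligned with their slots).
struct BinarySha256Visitor {
  explicit BinarySha256Visitor(uint8_t* out) : out(out) {}

  arrow::Status VisitNull() {
    out += 32;  // leave the null slot's bytes untouched but keep the slot
    return arrow::Status::OK();
  }

  arrow::Status VisitValue(std::string_view v) {
    uint8_t hash[32];
    sha256(v, hash);
    std::memcpy(out, hash, 32);
    out += 32;  // move to the next 32-byte slot
    return arrow::Status::OK();
  }

  uint8_t* out;
};
```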

On Mon, Jul 17, 2023 at 6:38 AM Aldrin  wrote:

> Oh wait, I see now that you're incrementing with a uint8_t*. That could be
> fine for your own use, but you might want to make sure it aligns with the
> type of your output (Int64Array vs Int32Array).
>
> Sent from Proton Mail for iOS
>
>
> On Mon, Jul 17, 2023 at 06:20, Aldrin wrote:
>
> Hi Wenbo,
>
> An ArraySpan is like an ArrayData but does not own the data, so the
> ColumnarFormat doc that Jon shared is relevant for both.
>
> In the case of a binary format, the output ArraySpan must have at least 2
> buffers: the offsets and the contiguous binary data (values). If the output
> of your UDF is something like an Int32Array with no nulls, then I think
> you're writing output correctly.
>
> But, since your pointer is a uint8_t, I think Jin is right and `++` is
> going to move your pointer 1 byte instead of 32 bytes like you intend.
>
> Sent from Proton Mail for iOS
>
>
> On Mon, Jul 17, 2023 at 05:06, Wenbo Hu wrote:
>
> Hi Jin,
>
> > but why copy to *out_values++ instead of
> > *out_values and add 32 to out_values afterwards?
> I'm implementing the sha256 function as a scalar function, but it
> always inputs with an array, so on visitor pattern, I'll write a 32
> byte hash into the pointer and move to the next for next visit.
> Something like:
> ```
>
> struct BinarySha256Visitor {
> BinarySha256Visitor(uint8_t **out) {
> this->out = out;
> }
> arrow::Status VisitNull() {
> return arrow::Status::OK();
> }
>
> arrow::Status VisitValue(std::string_view v) {
>
> uint8_t hash[32];
> sha256(v, hash);
>
> memcpy(*out++, hash, 32);
>
> return arrow::Status::OK();
> }
>
> uint8_t ** out;
> };
>
> arrow::Status Sha256Func(cp::KernelContext *ctx, const cp::ExecSpan &batch,
> cp::ExecResult *out) {
> arrow::ArraySpanVisitor visitor;
>
> auto *out_values = out->array_span_mutable()->GetValues(1);
> BinarySha256Visitor visit(out_values);
> ARROW_RETURN_NOT_OK(visitor.Visit(batch[0].array, &visit));
>
> return arrow::Status::OK();
> }
> ```
> Is it as expected?
>
> Jin Shang  于2023年7月17日周一 19:44写道:
> >
> > Hi Wenbo,
> >
> > I'd like to known what's the *three* `buffers` are in ArraySpan. What are
> > > `1` means when `GetValues` called?
> >
> > The meaning of buffers in an ArraySpan depends on the layout of its data
> > type. FixedSizeBinary is a fixed-size primitive type, so it has two
> > buffers, one validity buffer and one data buffer. So GetValues(1) would
> > return a pointer to the data buffer.
> > Layouts of data types can be found here[1].
> >
> > what is the actual type should I get from `GetValues`?
> > >
> > Buffer data is stored as raw bytes (uint8_t) but can be reinterpreted as
> > any type to suit your need. The template parameter for GetValue is simply
> > forwarded to reinterpret_cast. There are discussions[2] on the soundness
> of
> > using uint8_t to represent bytes but it is what we use now. Since you are
> > only doing a memcpy, uint8_t should be good.
> >
> > Maybe, `auto *out_values = out->array_span_mutable()->GetValues(uint8_t
> > > *>(1);` and `memcpy(*out_values++, some_ptr, 32);`?
> > >
> > I may be missing something, but why copy to *out_values++ instead of
> > *out_values and add 32 to out_values afterwards? Otherwise I agree this
> is
> > the way to go.
> >
> > [1]
> >
> https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout
> > [2] https://github.com/apache/arrow/issues/36123
> >
> >
> > On Mon, Jul 17, 2023 at 4:44 PM Wenbo Hu  wrote:
> >
> > > Hi,
> > > I'm using Acero as the stream executor to run large scale data
> > > transformation. The core data used in UDF is `ArraySpan` in
> > > `ExecSpan`, but not much document on ArraySpan. I'd like to known
> > > what's the *three* `buffers` are in ArraySpan. What are `1` means when
> > > `GetValues` called?
> > > For input data, I can use a `ArraySpanVisitor` to iterator over
> > > different input types. But for output data, I don't know how to write
> > > to the`array_span_mutable()` if it is not a simple c_type.
> > > For example, I'm implementing a sha256 udf, which input is
> > > `arrow::utf8()` and the output is `arrow::fixed_size_binary(32)`, then
> > > how can I directly write to the out buffers and what is the actual
> > > type should I get from `GetValues`?
> > > Maybe, `auto *out_values =
> > > out->array_span_mutable()->GetValues(uint8_t *>(1);` and
> > > `memcpy(*out_values++, some_ptr, 32);`?
> > >
> > > --
> > > -
> > > Best Regards,
> > > Wenbo Hu,
> > >
>
>
>
> --
> -
> Best Regards,
> Wenbo Hu,
>
>


Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-11 Thread Weston Pace
be as interoperable as those
> implemented using Arrow. To what extent an engine favours their own
> format(s) over Arrow will be an engineering trade-off they will have to
> make, but DataFusion has found exclusively using Arrow as the
> interchange format between operators to work well.
>
> > There are now multiple implementations of a query
> > engine and I think we are seeing just the edges of this query engine
> > decomposition (e.g. using arrow-c++'s datasets to feed DuckDb or
> consuming
> > a velox task as a record batch stream into a different system) and these
> > sorts of challenges are in the forefront.
> I agree 100% that this sort of interoperability is what makes Arrow so
> compelling and something we should work very hard to preserve. This is
> the crux of my concern with standardising alternative layouts. I
> definitely hope that with time Arrow will penetrate deeper into these
> engines, perhaps in a similar manner to DataFusion, as opposed to
> primarily existing at the surface-level.
>
> [1]: https://github.com/apache/arrow-datafusion/pull/6800
>
> On 10/07/2023 11:38, Weston Pace wrote:
> >> The point I was trying to make, albeit very badly, was that these
> >> operations are typically implemented using some sort of row format [1]
> >> [2], and therefore their performance is not impacted by the array
> >> representations. I think it is both inevitable, and in fact something to
> >> be encouraged, that query engines will implement their own in-memory
> >> layouts and data structures outside of the arrow specification for
> >> specific operators, workloads, hardware, etc... This allows them to make
> >> trade-offs based on their specific application domain, whilst also
> >> ensuring that new ideas and approaches can continue to be incorporated
> >> and adopted in the broader ecosystem. However, to then seek to
> >> standardise these layouts seems to be both potentially unbounded scope
> >> creep, and also somewhat counter productive if the goal of
> >> standardisation is improved interoperability?
> > FWIW, I believe this formats are very friendly for row representation as
> > well, especially when stored as a payload (e.g. in a join).
> >
> > For your more general point though I will ask the same question I asked
> on
> > the ArrayView discussion:
> >
> > Is Arrow meant to only be used in between systems (in this case query
> > engines) or is it also meant to be used in between components of a query
> > engine?
> >
> > For example, if someone (datafusion, velox, etc.) were to come up with a
> > framework for UDFs then would batches be passed in and out of those UDFs
> in
> > the Arrow format?  If every engine has its own bespoke formats internally
> > then it seems we are placing a limit on how far things can be decomposed.
> >  From the C++ perspective, I would personally like to see Arrow be usable
> > within components.  There are now multiple implementations of a query
> > engine and I think we are seeing just the edges of this query engine
> > decomposition (e.g. using arrow-c++'s datasets to feed DuckDb or
> consuming
> > a velox task as a record batch stream into a different system) and these
> > sorts of challenges are in the forefront.
> >
> > On Fri, Jul 7, 2023 at 7:53 AM Raphael Taylor-Davies
> >  wrote:
> >
> >>> Thus the approach you
> >>> describe for validating an entire character buffer as UTF-8 then
> checking
> >>> offsets will be just as valid for Utf8View arrays as for Utf8 arrays.
> >> The difference here is that it is perhaps expected for Utf8View to have
> >> gaps in the underlying data that are not referenced as part of any
> >> value, as I had understood this to be one of its benefits over the
> >> current encoding. I think it would therefore be problematic to enforce
> >> these gaps be UTF-8.
> >>
> >>> Furthermore unlike an explicit
> >>> selection vector a kernel may decide to copy and densify dynamically if
> >> it
> >>> detects that output is getting sparse or fragmented
> >> I don't see why you couldn't do something similar to materialize a
> >> sparse selection vector, if anything being able to centralise this logic
> >> outside specific kernels would be advantageous.
> >>
> >>> Specifically sorting and equality comparison
> >>> benefit significantly from the prefix comparison fast path,
> >>> so I'd anticipate that multi column sorting and aggregations would as
> >> well
> >>
> >>

Re: Confusion on substrait AggregateRel::groupings and Arrow consumer

2023-07-10 Thread Weston Pace
Yes, that is correct.

What Substrait calls "groupings" is what is often referred to in SQL as
"grouping sets".  These allow you to compute the same aggregates but group
by different criteria.  Two very common ways of creating grouping sets are
"group by cube" and "group by rollup".  Snowflake's documentation for
rollup[1] describes the motivation quite well:

> You can think of rollup as generating multiple result sets, each
> of which (after the first) is the aggregate of the previous result
> set. So, for example, if you own a chain of retail stores, you
> might want to see the profit for:
>  * Each store.
>  * Each city (large cities might have multiple stores).
>  * Each state.
>  * Everything (all stores in all states).

Acero does not currently handle more than one grouping set.


[1] https://docs.snowflake.com/en/sql-reference/constructs/group-by-rollup

On Mon, Jul 10, 2023 at 2:22 PM Li Jin  wrote:

> Hi,
>
> I am looking at the substrait protobuf for AggregateRel as well the Acero
> substrait consumer code:
>
>
> https://github.com/apache/arrow/blob/main/cpp/src/arrow/engine/substrait/relation_internal.cc#L851
>
> https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L209
>
> Looks like in subtrait, AggregateRel can have multiple groupings and each
> grouping can have multiple expressions. Let's say now I want to "compute
> sum and mean on column A group by column B, C, D" (for Acero to execute).
> Is the right way to create one grouping with 3 expressions (direct
> reference) for "column B, C, D"?
>
> Thanks,
> Li
>


Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-10 Thread Weston Pace
ietzman
> >
> > On Sun, Jul 2, 2023 at 8:01 AM Raphael Taylor-Davies
> >   wrote:
> >
> >>> I would be interested in hearing some input from the Rust community.
> >>   A couple of thoughts:
> >>
> >> The variable number of buffers would definitely pose some challenges for
> >> the Rust implementation, the closest thing we currently have is possibly
> >> UnionArray, but even then the number of buffers is still determined
> >> statically by the DataType. I therefore also wonder about the
> possibility
> >> of always having a single backing buffer that stores the character data,
> >> including potentially a copy of the prefix. This would also avoid
> forcing a
> >> branch on access, which I would have expected to hurt performance for
> some
> >> kernels quite significantly.
> >>
> >> Whilst not really a concern for Rust, which supports unsigned types, it
> >> does seem inconsistent to use unsigned types where the rest of the
> format
> >> encourages the use of signed offsets, etc...
> >>
> >> It isn't clearly specified whether a null should have a valid set of
> >> offsets, etc... I think it is an important property of the current array
> >> layouts that, with exception to dictionaries, the data in null slots is
> >> arbitrary, i.e. can take any value, but not undefined. This allows for
> >> separate handling of the null mask and values, which can be important
> for
> >> some kernels and APIs.
> >>
> >> More an observation than an issue, but UTF-8 validation for StringArray
> >> can be done very efficiently by first verifying the entire buffer, and
> then
> >> verifying the offsets correspond to the start of a UTF-8 codepoint. This
> >> same approach would not be possible for StringView, which would need to
> >> verify individual values and would therefore be significantly more
> >> expensive. As it is UB for a Rust string to contain non-UTF-8 data, this
> >> validation is perhaps more important for Rust than for other languages.
> >>
> >> I presume that StringView will behave similarly to dictionaries in that
> >> the selection kernels will not recompute the underlying value buffers. I
> >> think this is fine, but it is perhaps worth noting this has caused
> >> confusion in the past, as people somewhat reasonably expect an array
> >> post-selection to have memory usage reflecting the smaller selection.
> This
> >> is then especially noticeable if the data is written out to IPC, and
> still
> >> contains data that was supposedly filtered out. My 2 cents is that
> explicit
> >> selection vectors are a less surprising way to defer selection than
> baking
> >> it into the array, but I also don't have any workloads where this is the
> >> major bottleneck so can't speak authoritatively here.
> >>
> >> Which leads on to my major concern with this proposal, that it adds
> >> complexity and cognitive load to the specification and implementations,
> >> whilst not meaningfully improving the performance of the operators that
> I
> >> commonly encounter as performance bottlenecks, which are multi-column
> sorts
> >> and aggregations, or the expensive string operations such as matching or
> >> parsing. If we didn't already have a string representation I would be
> more
> >> onboard, but as it stands I'm definitely on the fence, especially given
> >> selection performance can be improved in less intrusive ways using
> >> dictionaries or selection vectors.
> >>
> >> Kind Regards,
> >>
> >> Raphael Taylor-Davies
> >>
> >> On 02/07/2023 11:46, Andrew Lamb wrote:
> >>
> >>   * This is the first layout where the number of buffers depends on the
> >>     data and not the schema. I think this is the most architecturally
> >>     significant fact.
> >>
> >>   I have spent some time reading the initial proposal -- thank you for
> >> that. I now understand what Weston was saying about the "variable
> >> numbers of buffers". I wonder if you considered restricting such arrays
> >> to a single buffer (so as to make them more similar to other arrow array
> >> types that have a fixed number of buffers)?
> >>
> >> On Tue, Jun 20, 2023 at 11:33 AM Weston Pace wrote:
> >>
> >> Before I say anything else I'll say that I am in favor of this new
> layout.
> >

Re: Do we need CODEOWNERS ?

2023-07-04 Thread Weston Pace
I agree the experiment isn't working very well.  I've been meaning to
change my listing from `compute` to `acero` for a while.  I'd be +1 for
just removing it though.

On Tue, Jul 4, 2023, 6:44 AM Dewey Dunnington 
wrote:

> Just a note that for me, the main problem is that I get automatic
> review requests for PRs that have nothing to do with R (I think this
> happens when a rebase occurs that contained an R commit). Because that
> happens a lot, it means I miss actual review requests and sometimes
> mentions because they blend in. I think CODEOWNERS results in me
> reviewing more PRs than if I had to set up some kind of custom
> notification filter but I agree that it's not perfect.
>
> Cheers,
>
> -dewey
>
> On Tue, Jul 4, 2023 at 10:04 AM Antoine Pitrou  wrote:
> >
> >
> > Hello,
> >
> > Some time ago we added a `.github/CODEOWNERS` file in the main Arrow
> > repo. The idea is that, when specific files or directories are touched
> > by a PR, specific people are asked for review.
> >
> > Unfortunately, it seems that, most of the time, this produces the
> > following effects:
> >
> > 1) the people who are automatically queried for review don't show up
> > (perhaps they simply ignore those automatic notifications)
> > 2) when several people are assigned for review, each designated reviewer
> > seems to hope that the other ones will be doing the work, instead of
> > doing it themselves
> > 3) contributors expect those people to show up and are therefore
> > bewildered when nobody comes to review their PR
> >
> > Do we want to keep CODEOWNERS? If we still think it can be beneficial,
> > we should institute a policy where people who are listed in that file
> > promise to respond to review requests: 1) either by doing a review 2) or
> > by de-assigning themselves, and if possible pinging another core
> developer.
> >
> > What do you think?
> >
> > Regards
> >
> > Antoine.
>


Re: [ANNOUNCE] New Arrow committer: Kevin Gurney

2023-07-03 Thread Weston Pace
Congratulations Kevin!

On Mon, Jul 3, 2023 at 5:18 PM Sutou Kouhei  wrote:

> On behalf of the Arrow PMC, I'm happy to announce that Kevin Gurney
> has accepted an invitation to become a committer on Apache
> Arrow. Welcome, and thank you for your contributions!
>
> --
> kou
>


Re: Question about large exec batch in acero

2023-07-03 Thread Weston Pace
> is this overflow considered a bug? Or is large exec batch something that
should be avoided?

This is not a bug and it is something that should be avoided.

Some of the hash-join internals expect small batches.  I actually thought
the limit was 32Ki and not 64Ki because I think there may be some places we
are using int16_t as an index.  The reasoning is that the hash-join is
going to make multiple passes through the data (e.g. first to calculate the
hashes from the key columns and then again to encode the key columns,
etc.) and you're going to get better performance when your batches are
small enough that they fit into the CPU cache.  [1] is often given as a
reference for this idea.  Since this is the case there is not much need for
operating on larger batches.

> And does acero have any logic preventing that from happening

Yes, in the source node we take (potentially large) batches from the I/O
side of things and slice them into medium sized batches (about 1Mi rows) to
distribute across threads and then each thread iterates over that medium
sized batch in even smaller batches (32Ki rows) for actual processing.
This all happens here[2].

[1] https://db.in.tum.de/~leis/papers/morsels.pdf
[2]
https://github.com/apache/arrow/blob/6af660f48472b8b45a5e01b7136b9b040b185eb1/cpp/src/arrow/acero/source_node.cc#L120
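
If you are driving these pieces by hand rather than through a source node, a
rough sketch of the same slicing idea, assuming `ExecBatch::Slice` and with
`ConsumeBatch` standing in for whatever carved-out code you are calling:

```
#include <algorithm>
#include <cstdint>
#include <arrow/compute/exec.h>
#include <arrow/status.h>

// Placeholder for the carved-out processing code (hashing, probing, ...).
arrow::Status ConsumeBatch(const arrow::compute::ExecBatch& batch);

// Sketch only: never hand the join internals more than 32Ki rows at a time.
arrow::Status ProcessInSmallBatches(const arrow::compute::ExecBatch& big_batch) {
  constexpr int64_t kMaxRows = 32 * 1024;  // stay within 16-bit row ids
  for (int64_t offset = 0; offset < big_batch.length; offset += kMaxRows) {
    const int64_t length = std::min(kMaxRows, big_batch.length - offset);
    ARROW_RETURN_NOT_OK(ConsumeBatch(big_batch.Slice(offset, length)));
  }
  return arrow::Status::OK();
}
```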

On Mon, Jul 3, 2023 at 6:50 AM Ruoxi Sun  wrote:

> Hi folks,
>
> I've encountered a bug when doing swiss join using a big exec batch, say,
> larger than 65535 rows, on the probe side. It turns out to be that in the
> algorithm, it is using `uint16_t` to represent the index within the probe
> exec batch (the materialize_batch_ids_buf
> <
> https://github.com/apache/arrow/blob/f951f0c42040ba6f584831621864f5c23e0f023e/cpp/src/arrow/acero/swiss_join.cc#L1897C8-L1897C33
> >),
> and row id larger than 65535 will be silently overflow and cause the result
> nonsense.
>
> One thing to note is that I'm not exactly using the acero "the acero way".
> Instead I carve out some pieces of code from acero and run them
> individually. So I'm just wondering that, is this overflow considered a
> bug? Or is large exec batch something that should be avoided? (And does
> acero have any logic preventing that from happening, e.g., some wild man
> like me just throws it an arbitrary large exec batch?)
>
> Thanks.
>
> *Rossi*
>


Re: Apache Arrow | Graph Algorithms & Data Structures

2023-06-29 Thread Weston Pace
Is your use case to operate on a batch of graphs?  For example, do you have
hundreds or thousands of graphs that you need to run these algorithms on at
once?

Or is your use case to operate on a single large graph?  If it's the
single-graph case then how many nodes do you have?

If it's one graph and the graph itself is pretty small and fits into cache,
then I'm not sure the in-memory representation will matter much (though
maybe the search space is large enough to justify a different
representation)


On Thu, Jun 29, 2023 at 6:22 PM Bechir Ben Daadouch 
wrote:

> Dear Apache Arrow Dev Community,
>
> My name is Bechir, I am currently working on a project that involves
> implementing graph algorithms in Apache Arrow.
>
> The initial plan was to construct a node structure and a subsequent graph
> that would encompass all the nodes. However, I quickly realized that due to
> Apache Arrow's columnar format, this approach was not feasible.
>
> I tried a couple of things, including the implementation of the
> shortest-path algorithm. However, I rapidly discovered that manipulating
> arrow objects, particularly when applying graph algorithms, proved more
> complex than anticipated and it became very clear that I would need to
> resort to some data structures outside of what arrow offers (i.e.: Heapq
> wouldn't be possible using arrow).
>
> I also gave a shot at doing it similar to a certain SQL method (see:
> https://ibb.co/0rPGB42 ), but ran into some roadblocks there too and I
> ended up having to resort to using Pandas for some transformations.
>
> My next course of action is to experiment with compressed sparse rows,
> hoping to execute Matrix Multiplication using this method. But honestly,
> with what I know right now, I remain skeptical about the feasibility
> of it. However,
> before committing to this approach, I would greatly appreciate your opinion
> based on your experience with Apache Arrow.
>
> Thank you very much for your time.
>
> Looking forward to potentially discussing this further.
>
> Many thanks,
> Bechir
>


Re: [C++] Dealing with third party method that raises exception

2023-06-29 Thread Weston Pace
We do this quite a bit in the Arrow<->Parquet bridge, IIUC. There are
macros defined like this:

```
#define BEGIN_PARQUET_CATCH_EXCEPTIONS try {
#define END_PARQUET_CATCH_EXCEPTIONS                     \
  }                                                      \
  catch (const ::parquet::ParquetStatusException& e) {   \
    return e.status();                                   \
  }                                                      \
  catch (const ::parquet::ParquetException& e) {         \
    return ::arrow::Status::IOError(e.what());           \
  }
```

That being said, I'm not particularly fond of macros, but it works.
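
If a macro-free variant is preferred, a small helper along these lines (a
sketch, not an existing Arrow utility) covers the common case of third-party
code that throws `std::exception`:

```
#include <exception>
#include <utility>
#include <arrow/status.h>

// Sketch only: run a callable that may throw and convert exceptions into a Status.
template <typename Fn>
arrow::Status CallAndCatch(Fn&& fn) {
  try {
    std::forward<Fn>(fn)();
    return arrow::Status::OK();
  } catch (const std::exception& e) {
    return arrow::Status::UnknownError("third-party exception: ", e.what());
  } catch (...) {
    return arrow::Status::UnknownError("unknown third-party exception");
  }
}

// Usage: ARROW_RETURN_NOT_OK(CallAndCatch([&] { third_party_call(); }));
```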

On Thu, Jun 29, 2023 at 8:09 AM Li Jin  wrote:

> Thanks Antoine - the examples are useful - I can use the same pattern for
> now. Thanks for the quick response!
>
> On Thu, Jun 29, 2023 at 10:47 AM Antoine Pitrou 
> wrote:
>
> >
> > Hi Li,
> >
> > There is not currently, but it would probably be a useful small utility.
> > If you look for `std::exception` in the codebase, you'll find that there
> > a couple of places where we turn it into a Status already.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 29/06/2023 à 16:20, Li Jin a écrit :
> > > Hi,
> > >
> > > IIUC, most of the Arrow C++ code doesn't not use exceptions. My
> question
> > is
> > > are there some Arrow utility / macro that wrap the function/code that
> > might
> > > raise an exception and turn that into code that returns an arrow error
> > > Status?
> > >
> > > Thanks!
> > > Li
> > >
> >
>


Re: Question about nested columnar validity

2023-06-29 Thread Weston Pace
>> 2. For StringView and ArrayView, if the parent has `validity = false`.
>>  If they have `validity = true`, there offset might point to a
invalid
>>  position

>I have no idea, but I hope not. Ben Kietzman might want to answer more
>precisely here.

I think, for view arrays, the offsets & lengths are indeed "unspecified".
Imagine we are creating a view array programmatically from two child arrays
of starts and lengths and one of those child arrays has a null in it so we
propagate that and create a null view.  We would set the validity bit to
0.  We could probably mandate the "length == 0" if needed but I'm not sure
there is anything that would make sense to require for the "offset". It
would seem the simplest thing to do is allow it to be unspecified.  Then
the entire operation can be:

 * Set validity to lengths.validity & offsets.validity
 * Set lengths to lengths.values
 * Set offsets to offsets.values

On Thu, Jun 29, 2023 at 6:51 AM Antoine Pitrou  wrote:

>
> Le 29/06/2023 à 15:16, wish maple a écrit :
> > Sorry for being misleading. "valid" offset means that:
> > 1. For Binary Like [1] format, and List formats [2], even if the parent
> >  has `validity = false`. Their offset should be well-defined.
>
> Yes.
>
> > 2. For StringView and ArrayView, if the parent has `validity = false`.
> >  If they have `validity = true`, there offset might point to a
> invalid
> >  position
>
> I have no idea, but I hope not. Ben Kietzman might want to answer more
> precisely here.
>
> Regards
>
> Antoine.
>


Re: Question about nested columnar validity

2023-06-28 Thread Weston Pace
I agree with Antoine but I get easily confused by "valid, as in
structurally correct" and "valid, as in not null" so I want to make sure I
understand:

> The child of a nested
> array should be valid itself, independently of the parent's validity
bitmap.

A child must be "structurally correct" (e.g. validate correctly) but it can
have null values.

For example, given:
  * we have an array of type struct<{list}>
  * the third element of the parent struct array is null
Then
 * The third element in the list array could be null
 * The third element in the list array could be valid (not-null)
 * In all cases the offsets must exist for the third element in the list
array

> If it's a BinaryArray, when when it parent is not valid, would a validity
> member point to a undefined address?
>
> And if it's ListArray[3], when it parent is not valid, should it offset
and
> size be valid?

When a binary array or a list array element is null the cleanest thing to
do is to set the offsets to be the same.  So, for example, given a list
array with 5 elements, if the second item is null, the offsets could be 0, 8,
8, 12, 20, 50.

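As a concrete illustration, the builder path produces exactly that: appending
a null list slot without touching the child builder leaves the slot's two
offsets equal (a sketch using `ListBuilder`):

```
#include <memory>
#include <arrow/api.h>

// Sketch only: builds list<int32> [[1, 2], null, [3]]; the resulting offsets
// are 0, 2, 2, 3 -- the null slot (index 1) has equal start and end offsets.
arrow::Result<std::shared_ptr<arrow::Array>> BuildListWithNull() {
  arrow::ListBuilder list_builder(arrow::default_memory_pool(),
                                  std::make_shared<arrow::Int32Builder>());
  auto* values = static_cast<arrow::Int32Builder*>(list_builder.value_builder());

  ARROW_RETURN_NOT_OK(list_builder.Append());      // slot 0: [1, 2]
  ARROW_RETURN_NOT_OK(values->AppendValues({1, 2}));
  ARROW_RETURN_NOT_OK(list_builder.AppendNull());  // slot 1: null, no child values
  ARROW_RETURN_NOT_OK(list_builder.Append());      // slot 2: [3]
  ARROW_RETURN_NOT_OK(values->Append(3));

  std::shared_ptr<arrow::Array> out;
  ARROW_RETURN_NOT_OK(list_builder.Finish(&out));
  return out;
}
```
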
Question for Antoine: is it correct for the offsets of a null element to be
different? For the above example, could the offsets be 0, 8, 10, 12, 20,
50?  I think the answer is "yes, this is correct".  However, if this is the
case, then the list's values array must still have 50 items.

On Wed, Jun 28, 2023 at 8:18 AM Antoine Pitrou  wrote:

>
> Hi!
>
> Le 28/06/2023 à 17:03, wish maple a écrit :
> > Hi,
> >
> > By looking at the arrow standard, when it comes to nested structure, like
> > StructArray[1] or FixedListArray[2], when parent is not valid, the
> > correspond child leaves "undefined".
> >
> > If it's a BinaryArray, when when it parent is not valid, would a validity
> > member point to a undefined address?
> >
> > And if it's ListArray[3], when it parent is not valid, should it offset
> and
> > size be valid?
>
> Offsets and sizes always need to be valid. The child of a nested
> array should be valid itself, independently of the parent's validity
> bitmap.
>
> You can also check this with ValidateFull(): it should catch such problems.
>
> Regards
>
> Antoine.
>


Re: Enabling apache/arrow GitHub dependency graph with vcpkg

2023-06-28 Thread Weston Pace
Thanks for reaching out.  This sounds like a useful tool and I'm happy to
hear about more development around establishing supply chain awareness.
However, Arrow is an Apache Software Project and, as such, we don't manage
all of the details of our Github repository.  Some of these (including, I
believe, selection of integrations) are managed by the ASF infrastructure
team[1].

We can contact them to request this integration, but if your interest is
primarily in getting feedback around the setup and configuration of
integration, then I'm not sure we'd be very helpful as the process would be
pretty opaque to us.  You may instead want to contact the Infra team
directly.

[1] https://infra.apache.org/



On Tue, Jun 27, 2023 at 1:57 PM Michael Price
 wrote:

> Hello Apache Arrow project,
>
>
>
> The Microsoft C++ team has been working with our partners at GitHub to
> improve the C and C++ user experience on their platform. As a part of that
> effort, we have added vcpkg support for the GitHub dependency graph
> feature. We are looking for feedback from GitHub repositories, like
> apache/arrow, that are using vcpkg so we can identify improvements to this
> new feature.
>
>
>
> Enabling this feature for your repositories brings a number of benefits,
> now and in the future:
>
>
>
>   *   Visibility - Users can easily see which packages you depend on and
> their versions. This includes transitive dependencies not listed in your
> vcpkg.json manifest file.
>   *   Compliance - Generate an SBOM from GitHub that includes C and C++
> dependencies as well as other supported ecosystems.
>   *   Networking - A fully functional dependency graph allows you to not
> only see your dependencies, but also other GitHub projects that depend on
> you, letting you get an idea of how many people depend on your efforts. We
> want to hear from you if we should prioritize enabling this.
>   *   Security - The intention is to enable GitHub's secure supply chain
> features<
> https://docs.github.com/code-security/supply-chain-security/understanding-your-software-supply-chain/about-supply-chain-security>.
> Those features are not available yet, but when they are, you'll already be
> ready to use them on day one.
>
>
>
> What's Involved?
>
>
>
> If you decide to help us out, here's how that would look:
>
>   *   Enable the integration following our documentation. See GitHub
> integrations - The GitHub dependency graph<
> https://aka.ms/vcpkg-dependency-graph> more information.
>   *   Send us a follow-up email letting us know if the documentation
> worked and was clear, and what missing functionality is most important to
> you.
>   *   If you have problem enabling the integration, we'll work directly
> with you to resolve your issue.
>   *   We will schedule a brief follow-up call (15-20) with you after the
> feature is enabled to discuss your feedback.
>   *   When we make improvements, we'd like you to try them out to let us
> know if we are solving the important problems.
>   *   Eventually, we'd like to get a "thumbs up" or "thumbs down" on
> whether or not you think the feature is complete enough to no longer be an
> experiment.
>   *   We'll credit you for your help when we make the move out of
> experimental and blog about the transition to fully supported.
>
>
>
> If you are interested in collaborating with us, let us know by replying to
> this email.
>
>
>
> Thanks,
>
>
>
> Michael Price
> Product Manager, Microsoft C++ Team
>
>
>


Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-23 Thread Weston Pace
> The trouble is that Dataset was not designed to serve as a
> general-purpose unmaterialized dataframe. For example, the PyArrow
> Dataset constructor [5] exposes options for specifying a list of
> source files and a partitioning scheme, which are irrelevant for many
> of the applications that Will anticipates. And some work is needed to
> reconcile the methods of the PyArrow Dataset object [6] with the
> methods of the Table object. Some methods like filter() are exposed by
> both and behave lazily on Datasets and eagerly on Tables, as a user
> might expect. But many other Table methods are not implemented for
> Dataset though they potentially could be, and it is unclear where we
> should draw the line between adding methods to Dataset vs. encouraging
> new scanner implementations to expose options controlling what lazy
> operations should be performed as they see fit.

In my mind there is a distinction between the "compute domain" (e.g. a
pandas dataframe or something like ibis or SQL) and the "data domain" (e.g.
pyarrow datasets).  I think, in a perfect world, you could push any and all
compute up and down the chain as far as possible.  However, in practice, I
think there is a healthy set of tools and libraries that say "simple column
projection and filtering is good enough".  I would argue that there is room
for both APIs and while the temptation is always present to "shove as much
compute as you can" I think pyarrow datasets seem to have found a balance
between the two that users like.

So I would argue that this protocol may never become a general-purpose
unmaterialized dataframe and that isn't necessarily a bad thing.

> they are splittable and serializable, so that fragments can be distributed
> amongst processes / workers.

Just to clarify, the proposal currently only requires the fragments to be
serializable, correct?

On Fri, Jun 23, 2023 at 11:48 AM Will Jones  wrote:

> Thanks Ian for your extensive feedback.
>
> I strongly agree with the comments made by David,
> > Weston, and Dewey arguing that we should avoid any use of PyArrow
> > expressions in this API. Expressions are an implementation detail of
> > PyArrow, not a part of the Arrow standard. It would be much safer for
> > the initial version of this protocol to not define *any*
> > methods/arguments that take expressions.
> >
>
> I would agree with this point, if we were starting from scratch. But one of
> my goals is for this protocol to be descriptive of the existing dataset
> integrations in the ecosystem, which all currently rely on PyArrow
> expressions. For example, you'll notice in the PR that there are unit tests
> to verify the current PyArrow Dataset classes conform to this protocol,
> without changes.
>
> I think there's three routes we can go here:
>
> 1. We keep PyArrow expressions in the API initially, but once we have
> Substrait-based alternatives we deprecate the PyArrow expression support.
> This is what I intended with the current design, and I think it provides
> the most obvious migration paths for existing producers and consumers.
> 2. We keep the overall dataset API, but don't introduce the filter and
> projection arguments until we have Substrait support. I'm not sure what the
> migration path looks like for producers and consumers, but I think this
> just implicitly becomes the same as (1), but with worse documentation.
> 3. We write a protocol completely from scratch, that doesn't try to
> describe the existing dataset API. Producers and consumers would then
> migrate to use the new protocol and deprecate their existing dataset
> integrations. We could introduce a dunder method in that API (sort of like
> __arrow_array__) that would make the migration seamless from the end-user
> perspective.
>
> *Which do you all think is the best path forward?*
>
> Another concern I have is that we have not fully explained why we want
> > to use Dataset instead of RecordBatchReader [9] as the basis of this
> > protocol. I would like to see an explanation of why RecordBatchReader
> > is not sufficient for this. RecordBatchReader seems like another
> > possible way to represent "unmaterialized dataframes" and there are
> > some parallels between RecordBatch/RecordBatchReader and
> > Fragment/Dataset.
> >
>
> This is a good point. I can add a section describing the differences. The
> main ones I can think of are that: (1) Datasets are "pruneable": one can
> select a subset of columns and apply a filter on rows to avoid IO and (2)
> they are splittable and serializable, so that fragments can be distributed
> amongst processes / workers.
>
> Best,
>
> Will Jones
>
> On Fri, Jun 23, 2023 at 10:48 AM Ian Cook  wrote:
>
> > Thanks Will for this proposal!
> >
> > For anyone familiar with PyArrow, this idea has a clear intuitive
> > logic to it. It provides an expedient solution to the current lack of
> > a practical means for interchanging "unmaterialized dataframes"
> > between different Python libraries.
> >
> > To elaborate on 

Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Weston Pace
Congrats Dewey!

On Fri, Jun 23, 2023 at 9:00 AM Antoine Pitrou  wrote:

>
> Welcome to the PMC Dewey!
>
>
> Le 23/06/2023 à 16:59, Joris Van den Bossche a écrit :
> > Congrats Dewey!
> >
> > On Fri, 23 Jun 2023 at 16:54, Jacob Wujciak-Jens
> >  wrote:
> >>
> >> Well deserved! Congratulations Dewey!
> >>
> >> Ian Cook  schrieb am Fr., 23. Juni 2023, 16:32:
> >>
> >>> Congratulations Dewey!
> >>>
> >>> On Fri, Jun 23, 2023 at 10:03 AM Matt Topol 
> >>> wrote:
> 
>  Congrats Dewey!!
> 
>  On Fri, Jun 23, 2023, 9:35 AM Dane Pitkin
> 
>  wrote:
> 
> > Congrats Dewey!
> >
> > On Fri, Jun 23, 2023 at 9:15 AM Nic Crane 
> wrote:
> >
> >> Well-deserved Dewey, congratulations!
> >>
> >> On Fri, 23 Jun 2023 at 11:53, Vibhatha Abeykoon  >
> >> wrote:
> >>
> >>> Congratulations Dewey!
> >>>
> >>> On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim <
> >>> ale...@voltrondata.com
> >>> .invalid>
> >>> wrote:
> >>>
>  Congratulations Dewey!! 
> 
>  On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido <
> > raulcumpl...@gmail.com
> >>>
>  wrote:
> 
> > Congratulations Dewey!
> >
> > El vie, 23 jun 2023, 11:55, Andrew Lamb 
> >>> escribió:
> >
> >> The Project Management Committee (PMC) for Apache Arrow has
> > invited
> >> Dewey Dunnington (paleolimbot) to become a PMC member and we
> >>> are
>  pleased
> > to
> >> announce
> >> that Dewey Dunnington has accepted.
> >>
> >> Congratulations and welcome!
> >>
> >
> 
> >>>
> >>
> >
> >>>
>


Re: [DISCUSS][Format][Flight] Result set expiration support

2023-06-23 Thread Weston Pace
One small difference seems to be that Close is idempotent and Cancel is not.

> void cancel()
>  throws SQLException
>
> Cancels this Statement object if both the DBMS and driver support
aborting an SQL statement. This method can be used by one thread to cancel
a statement that is being executed by another thread.
>
> Throws:
> SQLException - if a database access error occurs or this method is
called on a closed Statement

In other words, with cancel, you can display an error to the user if the
statement is already finished (and thus was not able to be canceled).
However, I don't know if that is significant at all.

On Fri, Jun 23, 2023 at 12:17 AM Sutou Kouhei  wrote:

> Hi,
>
> Thanks for sharing your thoughts.
>
> OK. I'll change the current specifications/implementations
> to the followings:
>
> * Remove CloseFlightInfo (if nobody objects it)
> * RefreshFlightEndpoint ->
>   RenewFlightEndpoint
> * RenewFlightEndpoint(FlightEndpoint) ->
>   RenewFlightEndpoint(RenewFlightEndpointRequest)
> * CancelFlightInfo(FlightInfo) ->
>   CancelFlightInfo(CancelFlightInfoRequest)
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [DISCUSS][Format][Flight] Result set expiration support" on Thu, 22
> Jun 2023 12:51:55 -0400,
>   Matt Topol  wrote:
>
> >> That said, I think it's reasonable to only have Cancel at the protocol
> > level.
> >
> > I'd be in favor of only having Cancel too. In theory calling Cancel on
> > something that has already completed should just be equivalent to calling
> > Close anyways rather than requiring a client to guess and call Close if
> > Cancel errors or something.
> >
> >> So this may not be needed for now. How about accepting a
> >> specific request message instead of FlightEndpoint directly
> >> as "PersistFlightEndpoint" input?
> >
> > I'm also in favor of this.
> >
> >> I think Refresh was fine, but if there's confusion, I like Kou's
> > suggestion of Renew the best.
> >
> > I'm in the same boat as David here, I think Refresh was fine but like the
> > suggestion of Renew best if we want to avoid any confusion.
> >
> >
> >
> > On Thu, Jun 22, 2023 at 2:55 AM Antoine Pitrou 
> wrote:
> >
> >>
> >> Doesn't protobuf ensure forwards compatibility? Why would it break?
> >>
> >> At worse, you can include the changes necessary for it to compile
> >> cleanly, without adding support for the new fields/methods?
> >>
> >>
> >> Le 22/06/2023 à 02:16, Sutou Kouhei a écrit :
> >> > Hi,
> >> >
> >> > The following part in the original e-mail is the one:
> >> >
> >> >> https://github.com/apache/arrow/pull/36009 is an
> >> >> implementation of this proposal. The pull requests has the
> >> >> followings:
> >> >>
> >> >> 1. Format changes:
> >> >> * format/Flight.proto
> >> >>
> >>
> https://github.com/apache/arrow/pull/36009/files#diff-53b6c132dcc789483c879f667a1c675792b77aae9a056b257d6b20287bb09dba
> >> >> * format/FlightSql.proto
> >> >>
> >>
> https://github.com/apache/arrow/pull/36009/files#diff-fd4e5266a841a2b4196aadca76a4563b6770c91d400ee53b6235b96da628a01e
> >> >>
> >> >> 2. Documentation changes:
> >> >> docs/source/format/Flight.rst
> >> >>
> >>
> https://github.com/apache/arrow/pull/36009/files#diff-839518fb41e923de682e8587f0b6fdb00eb8f3361d360c2f7249284a136a7d89
> >> >
> >> > We can split the part to a separated pull request. But if we
> >> > split the part and merge the pull requests for format
> >> > related changes and implementation related changes
> >> > separately, our CI will be broken temporary. Because our
> >> > implementations use auto-generated sources that are based on
> >> > *.proto.
> >> >
> >> >
> >> > Thanks,
> >>
>


Re: Question about `minibatch`

2023-06-20 Thread Weston Pace
Those goals are somewhat compatible.  Sasha can probably correct me if I
get this wrong but my understanding is that the minibatch is just large
enough to ensure reliable vectorized execution.  It is used in some
innermost critical sections to both keep the working set small (so it fits
in L1) and avoid allocation.

In addition to ensuring things fit in L1 there is also, I believe, a side
benefit of using small loops to increase the chances of encountering
special cases (e.g. all values null or no values null) which can sometimes
save you from more complex logic.
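
As a generic illustration of the pattern (not Acero's actual code; in Acero
the constant lives in `util::MiniBatch`), the shape is roughly:

```
#include <algorithm>
#include <cstdint>

// Sketch only: process a large batch in fixed-size minibatches so that the
// per-iteration temporaries are small, fixed-size, and stay resident in L1.
constexpr int64_t kMiniBatchLength = 1 << 10;  // 1024 rows

template <typename ProcessFn>
void ForEachMiniBatch(int64_t num_rows, ProcessFn&& process) {
  // Temp vectors sized for kMiniBatchLength rows can be allocated once and
  // reused on every iteration instead of being sized for the whole batch.
  for (int64_t start = 0; start < num_rows; start += kMiniBatchLength) {
    const int64_t length = std::min(kMiniBatchLength, num_rows - start);
    process(start, length);
  }
}
```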

On Tue, Jun 20, 2023 at 7:32 PM Ruoxi Sun  wrote:

> Hi,
>
> By looking at acero code, I'm curious about the concept `minibatch` being
> used in swiss join and grouper.
> I wonder if its purpose is to proactively limit the memory size of the
> working set? Or is it the consequence of that the temp vector should be
> fix-sized (to avoid costly memory allocation)? Additionally, what's the
> impact of choosing the size of the minibatch?
>
> Really appreciate if someone can help me to clear this.
>
> Thanks.
>
> *Rossi*
>


Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-20 Thread Weston Pace
Before I say anything else I'll say that I am in favor of this new layout.
There is some existing literature on the idea (e.g. umbra) and your
benchmarks show some nice improvements.

Compared to some of the other layouts we've discussed recently (REE, list
view), I do think this layout is more unique and fundamentally different.
Perhaps most fundamentally different:

 * This is the first layout where the number of buffers depends on the data
and not the schema.  I think this is the most architecturally significant
fact.  It does require a (backwards compatible) change to the IPC format
itself, beyond just adding new type codes.  It also poses challenges in
places where we've assumed there will be at most 3 buffers (e.g. in
ArraySpan, though, as you have shown, we can work around this using a raw
pointers representation internally in those spots).

I think you've done some great work to integrate this well with Arrow-C++
and I'm convinced it can work.

I would be interested in hearing some input from the Rust community.

Ben, at one point there was some discussion that this might be a c-data
only type.  However, I believe that was based on the raw pointers
representation.  What you've proposed here, if I understand correctly, is
an index + offsets representation and it is suitable for IPC correct?
(e.g. I see that you have changes and examples in the IPC reader/writer)
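
For readers following along, the 16-byte view element in the index + offsets
variant looks roughly like this (a sketch of the layout as described in the
proposal, not code taken from the draft PR):

```
#include <cstdint>

// Sketch only: strings of 12 bytes or fewer are stored entirely inline; longer
// strings keep a 4-byte prefix plus an index and offset into one of the
// variadic data buffers (this is why the number of buffers depends on the data).
union StringViewElement {
  struct {
    int32_t length;  // <= 12
    char data[12];   // the whole string, inline (unused tail zeroed)
  } inlined;
  struct {
    int32_t length;        // > 12
    char prefix[4];        // first four bytes, handy for comparisons
    int32_t buffer_index;  // which data buffer holds the string
    int32_t offset;        // byte offset of the string in that buffer
  } ref;
};
static_assert(sizeof(StringViewElement) == 16, "each view is 16 bytes");
```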

On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman 
wrote:

> Hi Gang,
>
> I'm not sure what you mean, sorry if my answers are off base:
>
> Parquet's ByteArray will be unaffected by the addition of the string view
> type;
> all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and
> with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays
> during serialization to parquet [1].
>
> If you mean that encoding of arrow::Type::STRING_VIEW will not be as fast
> as encoding of equivalent arrow::Type::STRING, that's something I haven't
> benchmarked so I can't answer definitively. I would expect it to be faster
> than
> first converting STRING_VIEW->STRING then encoding to parquet; direct
> encoding avoids allocating and populating temporary buffers. Of course this
> only applies to cases where you need to encode an array of STRING_VIEW to
> parquet- encoding of STRING to parquet will be unaffected.
>
> Sincerely,
> Ben
>
> [1]
>
> https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179
>
> On Thu, Jun 15, 2023 at 10:34 PM Gang Wu  wrote:
>
> > Hi Ben,
> >
> > The posted benchmark [1] looks pretty good to me. However, I want to
> > raise a possible issue from the perspective of parquet-cpp. Parquet-cpp
> > uses a customized parquet::ByteArray type [2] for string/binary, I would
> > expect some regression of conversions between parquet reader/writer
> > and the proposed string view array, especially when some strings use
> > short form and others use long form.
> >
> > [1]
> >
> >
> https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575
> > [2] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> >
> > Best,
> > Gang
> >
> > On Fri, Jun 16, 2023 at 3:58 AM Will Jones 
> > wrote:
> >
> > > Cool. Thanks for doing that!
> > >
> > > On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman 
> > > wrote:
> > >
> > > > I've added https://github.com/apache/arrow/issues/36112 to track
> > > > deduplication of buffers on write.
> > > > I don't think it would require modification of the IPC format.
> > > >
> > > > Ben
> > > >
> > > > On Thu, Jun 15, 2023 at 1:30 PM Matt Topol 
> > > wrote:
> > > >
> > > > > Based on my understanding, in theory a buffer *could* be shared
> > within
> > > a
> > > > > batch since the flatbuffers message just uses an offset and length
> to
> > > > > identify the buffers.
> > > > >
> > > > > That said, I don't believe any current implementation actually does
> > > this
> > > > or
> > > > > takes advantage of this in any meaningful way.
> > > > >
> > > > > --Matt
> > > > >
> > > > > On Thu, Jun 15, 2023 at 1:00 PM Will Jones <
> will.jones...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Ben,
> > > > > >
> > > > > > It's exciting to see this move along.
> > > > > >
> > > > > > The buffers will be duplicated. If buffer duplication is becomes
> a
> > > > > concern,
> > > > > > > I'd prefer to handle
> > > > > > > that in the ipc writer. Then buffers which are duplicated could
> > be
> > > > > > detected
> > > > > > > by checking
> > > > > > > pointer identity and written only once.
> > > > > >
> > > > > >
> > > > > > Question: to be able to write buffer only once and reference in
> > > > multiple
> > > > > > arrays, does that require a change to the IPC format? Or is
> sharing
> > > > > buffers
> > > > > > within the same batch already allowed in the IPC format?
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Will Jones
> > > > > >
> > > > > > On Thu, Jun 15, 2023 at 9:03 AM 

Re: [ANNOUNCE] New Arrow PMC member: Ben Baumgold,

2023-06-20 Thread Weston Pace
Congratulations Ben!

On Tue, Jun 20, 2023 at 7:38 AM Jacob Quinn  wrote:

> Yay! Congrats Ben! Love to see more Julia folks here!
>
> -Jacob
>
> On Tue, Jun 20, 2023 at 4:15 AM Andrew Lamb  wrote:
>
> > The Project Management Committee (PMC) for Apache Arrow has invited
> > Ben Baumgold, to become a PMC member and we are pleased to announce
> > that Ben Baumgold has accepted.
> >
> > Congratulations and welcome!
> >
>


Re: pyarrow Table.from_pylist doesn;t release memory

2023-06-15 Thread Weston Pace
Note that you can ask pyarrow how much memory it thinks it is using with
the pyarrow.total_allocated_bytes[1] function.  This can be very useful for
tracking memory leaks.

I see that memory-profiler now has support for different backends. Sadly,
it doesn't look like you can register a custom backend.  Might be a fun
project if someone wanted to add a pyarrow backend for it :)

[1]
https://arrow.apache.org/docs/python/generated/pyarrow.total_allocated_bytes.html

On Thu, Jun 15, 2023 at 9:16 AM Antoine Pitrou  wrote:

>
> Hi Alex,
>
> I think you're misinterpreting the results. Yes, the RSS memory (as
> reported by memory_profiler) doesn't seem to decrease. No, it doesn't
> mean that Arrow doesn't release memory. It's actually common for memory
> allocators (such as jemalloc, or the system allocator) to keep
> deallocated pages around, because asking the kernel to recycle them is
> expensive.
>
> Unless your system is running low on memory, you shouldn't care about
> this. Trying to return memory to the kernel can actually make
> performance worse if you ask Arrow to allocate memory soon after.
>
> That said, you can try to call MemoryPool.release_unused() if these
> numbers are important to you:
>
> https://arrow.apache.org/docs/python/generated/pyarrow.MemoryPool.html#pyarrow.MemoryPool.release_unused
>
> Regards
>
> Antoine.
>
>
>
> Le 15/06/2023 à 17:39, Jerald Alex a écrit :
> > Hi Experts,
> >
> > I have come across the memory pool configurations using an environment
> > variable *ARROW_DEFAULT_MEMORY_POOL* and I tried to make use of them and
> > test it.
> >
> > I could observe improvements on macOS with the *system* memory pool but
> no
> > change on linux os. I have captured more details on GH issue
> > https://github.com/apache/arrow/issues/36100... If any one can
> highlight or
> > suggest a way to overcome this problem will be helpful. Appreciate your
> > help.!
> >
> > Regards,
> > Alex
> >
> > On Wed, Jun 14, 2023 at 9:35 PM Jerald Alex  wrote:
> >
> >> Hi Experts,
> >>
> >> Pyarrow *Table.from_pylist* does not release memory until the program
> >> terminates. I created a sample script to highlight the issue. I have
> also
> >> tried setting up `pa.jemalloc_set_decay_ms(0)` but it didn't help much.
> >> Could you please check this and let me know if there are potential
> issues /
> >> any workaround to resolve this?
> >>
> > pyarrow.__version__
> >> '12.0.0'
> >>
> >> OS Details:
> >> OS: macOS 13.4 (22F66)
> >> Kernel Version: Darwin 22.5.0
> >>
> >>
> >>
> >> Sample code to reproduce. (it needs memory_profiler)
> >>
> >> #file_name: test_exec.py
> >> import pyarrow as pa
> >> import time
> >> import random
> >> import string
> >>
> >> from memory_profiler import profile
> >>
> >> def get_sample_data():
> >>     record1 = {}
> >>     for col_id in range(15):
> >>         record1[f"column_{col_id}"] = string.ascii_letters[10 : random.randint(17, 49)]
> >>     return [record1]
> >>
> >> def construct_data(data):
> >>     count = 1
> >>     while count < 10:
> >>         pa.Table.from_pylist(data * 10)
> >>         count += 1
> >>     return True
> >>
> >> @profile
> >> def main():
> >>     data = get_sample_data()
> >>     construct_data(data)
> >>     print("construct data completed!")
> >>
> >> if __name__ == "__main__":
> >>     main()
> >>     time.sleep(600)
> >>
> >>
> >> memory_profiler output:
> >>
> >> Filename: test_exec.py
> >>
> >> Line #    Mem usage    Increment  Occurrences   Line Contents
> >> =============================================================
> >>     41     65.6 MiB     65.6 MiB            1   @profile
> >>     42                                          def main():
> >>     43     65.6 MiB      0.0 MiB            1       data = get_sample_data()
> >>     44    203.8 MiB    138.2 MiB            1       construct_data(data)
> >>     45    203.8 MiB      0.0 MiB            1       print("construct data completed!")
> >>
> >> Regards,
> >> Alex
> >>
> >
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Weston Pace
> Can't implementations add support as needed? I assume that the "depending
on what support [it] aspires to" implies this, but if a feature isn't used
in a community then it can leave it unimplemented. On the flip side, if it
is used in a community (e.g. C++) is there no way to upstream it without
the support of every community?

I think that is something that is more tolerable for something like REE or
dictionary support which is purely additive (e.g. JS and C# don't support
unions yet and can get around to it when it is important).

The challenge for this kind of "alternative layout" is that you start to
get a situation where some implementations choose "option A" and others
choose "option B" and it's not clearly a case of "this is a feature we
haven't added support for yet".

On Wed, Jun 14, 2023 at 2:01 PM Antoine Pitrou  wrote:

>
> So each community would have its own version of the Arrow format?
>
>
> Le 14/06/2023 à 22:47, Aldrin a écrit :
> >  > Arrow has at least 7 native "official" implementations... 5 bindings
> > on C++... and likely other implementations (like arrow2 in rust)
> >
> >> I think it is worth remembering that depending on what level of support
> > ListView aspires to, such an addition could require non trivial changes
> to
> > many / all of those implementations (and the APIs they expose).
> >
> > Can't implementations add support as needed? I assume that the
> > "depending on what support [it] aspires to" implies this, but if a
> > feature isn't used in a community then it can leave it unimplemented. On
> > the flip side, if it is used in a community (e.g. C++) is there no way
> > to upstream it without the support of every community?
> >
> >
> >
> > Sent from Proton Mail for iOS
> >
> >
> > On Wed, Jun 14, 2023 at 13:06, Raphael Taylor-Davies wrote:
> >> Even something relatively straightforward becomes a huge implementation
> >> effort when multiplied by a large number of codebases, users and
> >> datasets. Parquet is a great source of historical examples of the
> >> challenges of incremental changes that don't meaningfully unlock new
> >> use-cases. To take just one, Int96 was deprecated almost a decade ago,
> >> in favour of some additional metadata over an existing physical layout,
> >> and yet Int96 is still to the best of my knowledge used by Spark by
> >> default.
> >>
> >> That's not to say that I think the arrow specification should ossify and
> >> we should never change it, but I'm not hugely enthusiastic about adding
> >> encodings that are only incremental improvements over existing
> encodings.
> >>
> >> I therefore wonder if there are some new use-cases I am missing that
> >> would be unlocked by this change, and that wouldn't be supported by the
> >> dictionary proposal? Perhaps you could elaborate here? Whilst I do agree
> >> using dictionaries as proposed is perhaps a less elegant solution, I
> >> don't see anything inherently wrong with it, and if it ain't broke we
> >> really shouldn't be trying to fix it.
> >>
> >> Kind Regards,
> >>
> >> Raphael Taylor-Davies
> >>
> >> On 14 June 2023 17:52:52 BST, Felipe Oliveira Carvalho
> >>  wrote:
> >>
> >> General approach to alternative formats aside, in the specific case
> >> of ListView, I think the implementation complexity is being
> >> overestimated in these discussions. The C++ Arrow implementation
> >> shares a lot of code between List and LargeList. And with some
> >> tweaks, I'm able to share that common infrastructure for ListView as
> >> well. [1]
> >>
> >> ListView is similar to list: it doesn't require offsets to be sorted
> >> and adds an extra buffer containing sizes. For symmetry with the List
> >> and LargeList types (FixedSizeList not included), I'm going to
> >> propose we add a LargeListView. That is not part of the draft
> >> implementation yet, but seems like an obvious thing to have now that
> >> I implemented the `if_else` specialization. [2] David Li asked about
> >> this above and I can confirm now that 64-bit version of ListView
> >> (LargeListView) is in the plans.
> >>
> >> Trying to avoid re-implementing some kernels is not a good goal to
> >> chase, IMO, because kernels need tweaks to take advantage of the
> >> format.
> >>
> >> [1] https://github.com/apache/arrow/pull/35345
> >> [2] https://github.com/felipecrv/arrow/comm

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Weston Pace
jects that are implementing Arrow today
> >> are not aiming to provide complete coverage of Arrow; rather they are
> >> adopting Arrow because of its role as a standard and they are
> >> implementing only as much of the Arrow standard as they require to
> >> achieve some goal. I believe that such projects are important Arrow
> >> stakeholders, and I believe that this proposed notion of canonical
> >> alternative layouts will serve them well and will create efficiencies
> >> by standardizing implementations around a shared set of alternatives.
> >>
> >> However I think that the documentation for canonical alternative
> >> layouts should strongly encourage implementers to default to using the
> >> primary layouts defined in the core spec and only use alternative
> >> layouts in cases where the primary layouts do not meet their needs.
> >>
> >>
> >> On Sat, May 27, 2023 at 7:44 PM Micah Kornfield 
> >> wrote:
> >>>
> >>> This sounds reasonable to me but my main concern is, I'm not sure there
> >> is
> >>> a great mechanism to enforce canonical layouts don't somehow become
> >> default
> >>> (or the only implementation).
> >>>
> >>> Even for these new layouts, I think it might be worth rethinking
> binding
> >> a
> >>> layout into the schema versus having a different concept of encoding
> (and
> >>> changing some of the corresponding data structures).
> >>>
> >>>
> >>> On Mon, May 22, 2023 at 10:37 AM Weston Pace 
> >> wrote:
> >>>
> >>>> Trying to settle on one option is a fruitless endeavor.  Each type has
> >> pros
> >>>> and cons.  I would also predict that the largest existing usage of
> >> Arrow is
> >>>> shuttling data from one system to another.  The newly proposed format
> >>>> doesn't appear to have any significant advantage for that use case (if
> >>>> anything, the existing format is arguably better as it is more
> >> compact).
> >>>>
> >>>> I am very biased towards historical precedent and avoiding breaking
> >>>> changes.
> >>>>
> >>>> We have "canonical extension types", perhaps it is time for "canonical
> >>>> alternative layouts".  We could define it as such:
> >>>>
> >>>>   * There are one or more primary layouts
> >>>> * Existing layouts are automatically considered primary layouts,
> >> even if
> >>>> they wouldn't
> >>>>   have been primary layouts initially (e.g. large list)
> >>>>   * A new layout, if it is semantically equivalent to another, is
> >> considered
> >>>> an alternative layout
> >>>>   * An alternative layout still has the same requirements for adoption
> >> (two
> >>>> implementations and a vote)
> >>>> * An implementation should not feel pressured to rush and
> implement
> >> the
> >>>> new layout.
> >>>>   It would be good if they contribute in the discussion and
> >> consider the
> >>>> layout and vote
> >>>>   if they feel it would be an acceptable design.
> >>>>   * We can define and vote and approve as many canonical alternative
> >> layouts
> >>>> as we want:
> >>>> * A canonical alternative layout should, at a minimum, have some
> >>>>   reasonable justification, such as improved performance for
> >> algorithm X
> >>>>   * Arrow implementations MUST support the primary layouts
> >>>>   * An Arrow implementation MAY support a canonical alternative,
> >> however:
> >>>> * An Arrow implementation MUST first support the primary layout
> >>>> * An Arrow implementation MUST support conversion to/from the
> >> primary
> >>>> and canonical layout
> >>>> * An Arrow implementation's APIs MUST only provide data in the
> >>>> alternative
> >>>>   layout if it is explicitly asked for (e.g. schema inference
> should
> >>>> prefer the primary layout).
> >>>>   * We can still vote for new primary layouts (e.g. promoting a
> >> canonical
> >>>> alternative) but, in these
> >>>>  votes we don't only consider the value (e.g. performance) of the
> >&g

Re: Group rows in a stream of record batches by group id?

2023-06-13 Thread Weston Pace
Are you looking for something in C++ or python?  We have a thing called the
"grouper" (arrow::compute::Grouper in arrow/compute/row/grouper.h) which
(if memory serves) is the heart of the functionality in C++.  It would be
nice to add some python bindings for this functionality as this ask comes
up from pyarrow users pretty regularly.

The grouper is used in src/arrow/dataset/partition.h to partition a record
batch into groups of batches.  This is how the dataset writer writes a
partitioned dataset.  It's a good example of how you would use the grouper
for a "one batch in, one batch per group out" use case.

The grouper can also be used in a streaming situation (many batches in, one
batch per group out).  In fact, the grouper is what is used by the group by
node.  I know you recently added [1] and I'm maybe a little uncertain what
the difference is between this ask and the capabilities added in [1].

[1] https://github.com/apache/arrow/pull/35514
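
For the "many batches in, one batch per group out" shape, here is a rough
pure-pyarrow sketch (this does not use the Grouper, which has no Python
bindings yet; it just filters once per group id, so it is not the efficient
path):

```
import pyarrow as pa
import pyarrow.compute as pc

def group_rows(batches, k):
    # Sketch only: concatenate the stream, then filter once per group id.
    # Assumes the last column holds group ids in the range [0, k).
    table = pa.Table.from_batches(batches)
    group_ids = table.column(table.num_columns - 1)
    return [table.filter(pc.equal(group_ids, i)) for i in range(k)]
```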

On Tue, Jun 13, 2023 at 8:23 AM Li Jin  wrote:

> Hi,
>
> I am trying to write a function that takes a stream of record batches
> (where the last column is group id), and produces k record batches, where
> record batches k_i contain all the rows with group id == i.
>
> Pseudocode is sth like:
>
> def group_rows(batches, k) -> array[RecordBatch] {
>   builders = array[RecordBatchBuilder](k)
>   for batch in batches:
># Assuming last column is the group id
>group_ids = batch.column(-1)
>for i in batch.num_rows():
> k_i = group_ids[i]
> builders[k_i].append(batch[i])
>
>batches = array[RecordBatch](k)
>for i in range(k):
>batches[i] = builders[i].build()
>return batches
> }
>
> I wonder if there is some existing code that does something like this?
> (Specially I didn't find code that can append row/rows to a
> RecordBatchBuilder (either one row given an row index, or multiple rows
> given a list of row indices)
>


Re: [ANNOUNCE] New Arrow PMC member: Jie Wen (jakevin / jackwener)

2023-06-13 Thread Weston Pace
Congratulations

On Tue, Jun 13, 2023, 1:28 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Congratulations!
>
> On Mon, 12 Jun 2023 at 22:00, Raúl Cumplido 
> wrote:
> >
> > Congratulations Jie!!!
> >
> > El lun, 12 jun 2023, 20:35, Matt Topol 
> escribió:
> >
> > > Congrats Jie!
> > >
> > > On Sun, Jun 11, 2023 at 9:20 AM Andrew Lamb 
> wrote:
> > >
> > > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > > Jie Wen to become a PMC member and we are pleased to announce
> > > > that Jie Wen has accepted.
> > > >
> > > > Congratulations and welcome!
> > > >
> > >
>


Re: [Python] Dataset scanner fragment skip options.

2023-06-12 Thread Weston Pace
> I would like to know if it is possible to skip the specific set of
batches,
> for example, the first 10 batches and read from the 11th Batch.

This sort of API does not exist today.  You can skip files by making a
smaller dataset with fewer files (and I think, with parquet, there may even
be a way to skip row groups by creating a fragment per row group with
ParquetFileFragment).  However, there is no existing datasets API for
skipping batches or rows.

> Also, what's the fragment_scan_options in dataset scanner and how do we
> make use of it?

fragment_scan_options is the spot for configuring format-specific scan
options.  For example, with parquet, you often don't need to bother with
this and can just use the defaults (I can't remember if nullptr is fine or
if you need to set this to FileFormat::default_fragment_scan_options but I
would hope it's ok to just use nullptr).

On the other hand, formats like CSV tend to need more configuration and
tuning.  For example, setting the delimiter, skipping some header rows,
etc.  Parquet is pretty self-describing and you would only need to use the
fragment_scan_options if, for example, you need decryption or custom
control over which columns are encoded as dictionary, etc.
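
To make the CSV case concrete, something like the following sketch (the paths
and option values are made up): the delimiter lives in the format's parse
options, while per-scan knobs such as null handling or header skipping go into
the fragment scan options passed to the scanner:

```
import pyarrow.csv as csv
import pyarrow.dataset as ds

# Hypothetical pipe-delimited CSV dataset under "data/".
fmt = ds.CsvFileFormat(parse_options=csv.ParseOptions(delimiter="|"))
dataset = ds.dataset("data/", format=fmt)

# Per-scan options (read_options, e.g. skip_rows, could be supplied here too).
scan_opts = ds.CsvFragmentScanOptions(
    convert_options=csv.ConvertOptions(null_values=["NA"],
                                       strings_can_be_null=True))
table = dataset.scanner(fragment_scan_options=scan_opts).to_table()
```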

On Mon, Jun 12, 2023 at 8:11 AM Jerald Alex  wrote:

> Hi Experts,
>
> I have been using dataset.scanner to read the data with specific filter
> conditions and batch_size of 1000 to read the data.
>
> ds.scanner(filter=pc.field('a') != 3, batch_size=1000).to_batches()
>
> I would like to know if it is possible to skip the specific set of batches,
> for example, the first 10 batches and read from the 11th Batch.
>
>
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.scanner
> Also, what's the fragment_scan_options in dataset scanner and how do we
> make use of it?
>
> Really appreciate any input. thanks!
>
> Regards,
> Alex
>


Re: [ANNOUNCE] New Arrow committer: Mehmet Ozan Kabak

2023-06-08 Thread Weston Pace
Congratulations!

On Thu, Jun 8, 2023, 5:36 PM Mehmet Ozan Kabak  wrote:

> Thanks everybody. Looking to collaborate further!
>
> > On Jun 8, 2023, at 9:52 AM, Matt Topol  wrote:
> >
> > Congrats! Welcome Ozan!
> >
> > On Thu, Jun 8, 2023 at 8:53 AM Raúl Cumplido 
> wrote:
> >
> >> Congratulations and welcome!
> >>
> >> El jue, 8 jun 2023 a las 14:45, Metehan Yıldırım
> >> () escribió:
> >>>
> >>> Congrats Ozan!
> >>>
> >>> On Thu, Jun 8, 2023 at 1:09 PM Andrew Lamb 
> wrote:
> >>>
>  On behalf of the Arrow PMC, I'm happy to announce that  Mehmet Ozan
> >> Kabak
>  has accepted an invitation to become a committer on Apache
>  Arrow. Welcome, and thank you for your contributions!
> 
>  Andrew
> 
> >>
>
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Weston Pace
> This implies that each canonical alternative layout would codify a
> primary layout as its "fallback."

Yes, that was part of my proposal:

>  * A new layout, if it is semantically equivalent to another, is
considered an alternative layout

Or, to phrase it another way.  If there is not a "fallback" then it is not
an alternative layout.  It's a brand new primary layout.  I'd expect this
to be quite rare.  I can't really even hypothesize any examples.  I think
the only truly atomic layouts are fixed-width, list, and struct.

> This seems reasonable but it opens
> up some cans of worms, such as how two components communicating
> through an Arrow interface would negotiate which layout is supported

Most APIs that I'm aware of already do this.  For example,
pyarrow.parquet.read_table has a "read_dictionary" property that can be
used to control whether or not a column is returned with the dictionary
encoding.  There is no way (that I'm aware of) to get a column in REE
encoding today without explicitly requesting it.  In fact, this could be as
simple as a boolean "use_advanced_features" flag although I would
discourage something so simplistic.  The point is that arrow-compatible
software should, by default, emit types that are supported by all arrow
implementations.
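
In pyarrow terms that looks roughly like this (the file and column names are
placeholders); the alternative representation only shows up when the caller
asks for it:

```
import pyarrow.parquet as pq

# Default: columns come back in primary layouts (a plain string column here).
plain = pq.read_table("example.parquet")

# Explicit opt-in to the dictionary-encoded representation for one column.
opted_in = pq.read_table("example.parquet", read_dictionary=["tags"])
```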

Of course, there is no way to enforce this, it's just a guideline / strong
recommendation on how software should behave if it wants to state "arrow
compatible" as a feature.

On Tue, Jun 6, 2023 at 3:33 PM Ian Cook  wrote:

> Thanks Weston. That all sounds reasonable to me.
>
> >  with the caveat that the primary layout must be emitted if the user
> does not specifically request the alternative layout
>
> This implies that each canonical alternative layout would codify a
> primary layout as its "fallback." This seems reasonable but it opens
> up some cans of worms, such as how two components communicating
> through an Arrow interface would negotiate which layout is supported.
> I suppose such details should be discussed in a separate thread, but I
> raise this here just to point out that it implies an expansion in the
> scope of what Arrow interfaces can do.
>
> On Tue, Jun 6, 2023 at 6:17 PM Weston Pace  wrote:
> >
> > From Micah:
> >
> > > This sounds reasonable to me but my main concern is, I'm not sure
> there is
> > > a great mechanism to enforce canonical layouts don't somehow become
> > default
> > > (or the only implementation).
> >
> > I'm not sure I understand.  Is the concern that an alternative layout is
> > eventually
> > used more and more by implementations until it is used more often than
> the
> > primary
> > layouts?  In that case I think that is ok and we can promote the
> alternative
> > to a primary layout.
> >
> > Or is the concern that some applications will only support the
> alternative
> > layouts and
> > not the primary layout?  In that case I would argue the application is
> not
> > "arrow compatible".
> > I don't know that we prevent or enforce this today either.  An author can
> > always falsely
> > claim they support Arrow even if they are using their own bespoke format.
> >
> > From Ian:
> >
> > > It seems to me that most projects that are implementing Arrow today
> > > are not aiming to provide complete coverage of Arrow; rather they are
> > > adopting Arrow because of its role as a standard and they are
> > > implementing only as much of the Arrow standard as they require to
> > > achieve some goal. I believe that such projects are important Arrow
> > > stakeholders, and I believe that this proposed notion of canonical
> > > alternative layouts will serve them well and will create efficiencies
> > > by standardizing implementations around a shared set of alternatives.
> > >
> > > However I think that the documentation for canonical alternative
> > > layouts should strongly encourage implementers to default to using the
> > > primary layouts defined in the core spec and only use alternative
> > > layouts in cases where the primary layouts do not meet their needs.
> >
> > I'd maybe take a slightly harsher stance.  I don't think an application
> > needs to
> > support all types.  For example, an Arrow-native string processing
> library
> > might
> > not worry about the integer types.  That would be fine.  I think it would
> > still
> > be fair to call it an "arrow compatible string processing library".
> >
> > However, an application must support primary layouts in addition to
> > alternative
> > layouts.  For exam

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Weston Pace
> > This sounds reasonable to me but my main concern is, I'm not sure there is
> > a great mechanism to enforce canonical layouts don't somehow become
> > default
> > (or the only implementation).
> >
> > Even for these new layouts, I think it might be worth rethinking binding
> a
> > layout into the schema versus having a different concept of encoding (and
> > changing some of the corresponding data structures).
> >
> >
> > On Mon, May 22, 2023 at 10:37 AM Weston Pace 
> wrote:
> >
> > > Trying to settle on one option is a fruitless endeavor.  Each type has
> pros
> > > and cons.  I would also predict that the largest existing usage of
> Arrow is
> > > shuttling data from one system to another.  The newly proposed format
> > > doesn't appear to have any significant advantage for that use case (if
> > > anything, the existing format is arguably better as it is more
> compact).
> > >
> > > I am very biased towards historical precedent and avoiding breaking
> > > changes.
> > >
> > > We have "canonical extension types", perhaps it is time for "canonical
> > > alternative layouts".  We could define it as such:
> > >
> > >  * There are one or more primary layouts
> > >* Existing layouts are automatically considered primary layouts,
> even if
> > > they wouldn't
> > >  have been primary layouts initially (e.g. large list)
> > >  * A new layout, if it is semantically equivalent to another, is
> considered
> > > an alternative layout
> > >  * An alternative layout still has the same requirements for adoption
> (two
> > > implementations and a vote)
> > >* An implementation should not feel pressured to rush and implement
> the
> > > new layout.
> > >  It would be good if they contribute in the discussion and
> consider the
> > > layout and vote
> > >  if they feel it would be an acceptable design.
> > >  * We can define and vote and approve as many canonical alternative
> layouts
> > > as we want:
> > >* A canonical alternative layout should, at a minimum, have some
> > >  reasonable justification, such as improved performance for
> algorithm X
> > >  * Arrow implementations MUST support the primary layouts
> > >  * An Arrow implementation MAY support a canonical alternative,
> however:
> > >* An Arrow implementation MUST first support the primary layout
> > >* An Arrow implementation MUST support conversion to/from the
> primary
> > > and canonical layout
> > >* An Arrow implementation's APIs MUST only provide data in the
> > > alternative
> > >  layout if it is explicitly asked for (e.g. schema inference should
> > > prefer the primary layout).
> > >  * We can still vote for new primary layouts (e.g. promoting a
> canonical
> > > alternative) but, in these
> > > votes we don't only consider the value (e.g. performance) of the
> layout
> > > but also the interoperability.
> > > In other words, a layout can only become a primary layout if there
> is
> > > significant evidence that most
> > > implementations plan to adopt it.
> > >
> > > This lets us evolve support for new layouts more naturally.  We can
> > > generally assume that users will not, initially, be aware of these
> > > alternative layouts.  However, everything will just work.  They may
> start
> > > to see a performance penalty stemming from a lack of support for these
> > > layouts.  If this performance penalty becomes significant then they
> will
> > > discover it and become aware of the problem.  They can then ask
> whatever
> > > library they are using to add support for the alternative layout.  As
> > > enough users find a need for it then libraries will add support.
> > > Eventually, enough libraries will support it that we can adopt it as a
> > > primary layout.
> > >
> > > Also, it allows libraries to adopt alternative layouts more
> aggressively if
> > > they would like while still hopefully ensuring that we eventually all
> > > converge on the same implementation of the alternative layout.
> > >
> > > On Mon, May 22, 2023 at 9:35 AM Will Jones 
> > > wrote:
> > >
> > > > Hello Arrow devs,
> > > >
> > > > I don't understand why we would start deprecating features in the
> Arrow
> > > > > format. Even starting this talk might already be a bad idea
> PR-wise.
> > > > >
> > > >
> > > > I agree we don't want to make breaking changes to the Arrow format

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-02 Thread Weston Pace
That makes sense!  I can see how masked reads are useful in that kind of
approach too.  Thanks for the explanation.

On Fri, Jun 2, 2023, 8:45 AM Will Jones  wrote:

> > The main downside with using the mask (or any solution based on a filter
> > node / filtering) is that it requires that the delete indices go into the
> > plan itself.  So you need to first read the delete files and then create
> > the plan.  And, if there are many deleted rows, this can be costly.
>
> Ah, I see. I was assuming you could load the indices within the fragment
> scan, at the same time the page index was read. That's how I'm implementing
> with Lance, and how I plan to implement with Delta Lake. But if you can't
> do that, then filtering with an anti-join makes sense. You wouldn't want to
> include those in a plan.
>
> On Fri, Jun 2, 2023 at 7:38 AM Weston Pace  wrote:
>
> > Also, for clarity, I do agree with Gang that these are both valuable
> > features in their own right.  A mask makes a lot of sense for page
> indices.
> >
> > On Fri, Jun 2, 2023 at 7:36 AM Weston Pace 
> wrote:
> >
> > > > then I think the incremental cost of adding the
> > > > positional deletes to the mask is probably lower than the anti-join.
> > > Do you mean developer cost?  Then yes, I agree.  Although there may be
> > > some subtlety in the pushdown to connect a dataset filter to a parquet
> > > reader filter.
> > >
> > > The main downside with using the mask (or any solution based on a
> filter
> > > node / filtering) is that it requires that the delete indices go into
> the
> > > plan itself.  So you need to first read the delete files and then
> create
> > > the plan.  And, if there are many deleted rows, this can be costly.
> > >
> > > On Thu, Jun 1, 2023 at 7:13 PM Will Jones 
> > wrote:
> > >
> > >> That's a good point, Gang. To perform deletes, we definitely need the
> > row
> > >> index, so we'll want that regardless of whether it's used in scans.
> > >>
> > >> > I'm not sure a mask would be the ideal solution for Iceberg (though
> it
> > >> is
> > >> a reasonable feature in its own right) because I think position-based
> > >> deletes, in Iceberg, are still done using an anti-join and not a
> filter.
> > >>
> > >> For just positional deletes in isolation, I agree the mask wouldn't be
> > >> more
> > >> optimal than the anti-join. But if they end up using the mask for
> > >> filtering
> > >> with the page index, then I think the incremental cost of adding the
> > >> positional deletes to the mask is probably lower than the anti-join.
> > >>
> > >> On Thu, Jun 1, 2023 at 6:33 PM Gang Wu  wrote:
> > >>
> > >> > IMO, the adding a row_index column from the reader is orthogonal to
> > >> > the mask implementation. Table formats (e.g. Apache Iceberg and
> > >> > Delta) require the knowledge of row index to finalize row deletion.
> It
> > >> > would be trivial to natively support row index from the file reader.
> > >> >
> > >> > Best,
> > >> > Gang
> > >> >
> > >> > On Fri, Jun 2, 2023 at 3:40 AM Weston Pace 
> > >> wrote:
> > >> >
> > >> > > I agree that having a row_index is a good approach.  I'm not sure
> a
> > >> mask
> > >> > > would be the ideal solution for Iceberg (though it is a reasonable
> > >> > feature
> > >> > > in its own right) because I think position-based deletes, in
> > Iceberg,
> > >> are
> > >> > > still done using an anti-join and not a filter.
> > >> > >
> > >> > > That being said, we probably also want to implement a streaming
> > >> > merge-based
> > >> > > anti-join because I believe delete files are ordered by row_index
> > and
> > >> so
> > >> > a
> > >> > > streaming approach is likely to be much more performant.
> > >> > >
> > >> > > On Mon, May 29, 2023 at 4:01 PM Will Jones <
> will.jones...@gmail.com
> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Hi Rusty,
> > >> > > >
> > >> > > > At first glance, I think adding a row_index column would make
> > >> sense. To
> > >> > > be
> > >> > > 

Re: Add limit and offset to ScannerOption

2023-06-02 Thread Weston Pace
The simplest way to do this sort of paging today would be to create
multiple files and then you could read as few or as many files as you want.
This approach also works regardless of format.
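
For example (paths made up), each server just opens the subset of files it is
responsible for:

```
import pyarrow.dataset as ds

# Server 1 reads one half of the files, server 2 the other half.
my_files = ["data/part-50.feather", "data/part-51.feather"]
subset = ds.dataset(my_files, format="feather")
table = subset.to_table()
```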

With parquet/orc you can create multiple row groups / stripes within a
single file, and then partition amongst those.

Scan based skipping should be possible for parquet and orc, I think support
was recently added to the parquet reader for this.  However, there is not
yet support for this in datasets.

> or simply wants to load the last 500 records.

A word of caution here.  Row-based skipping in the file reader is good for
partitioning but not always good for a "last 500 records".  This is because
skipping at the file-level (in the scan) will happen before any filters are
applied.  So, for example, if a user asks for "the last 500 records with
filter x > 0" then that might not be the same thing as "the last 500
records in my file".  That sort of LIMIT operation often has to be applied
in memory and then only pushed down into the scan if there are no filters
(and no sorts).

-Weston

On Fri, Jun 2, 2023 at 5:45 AM Wenbo Hu  wrote:

> Hi,
> I'm trying to implement a data management system by python with
> arrow flight. The well designed dataset with filesystem makes the data
> management even simpler.
> But I'm facing a situation: reading range in a dataset.
> Considering a dataset stored in feather format with 1 million rows in
> a remote file system (e.g. s3), client connects to multiple flight
> servers to parallel load data (e.g. 2 servers, one do_get from head to
> half, the other do_get from half to end), or simply wants to load the
> last 500 records.
> At this point, the server needs to skip reading heading records
> for the reasons of network bandwidth and memory limitation, rather
> than transferring, loading heading records into memory and discarding
> them.
> I think modern storage format may have advantages in determining
> position of the specific range of records in the file than csv, since
> csv has to move line by line without indexes. Also, I found fragment
> related apis in the dataset, but not much documentation on that (maybe
> more related to partitioning?).
> Here is the proposal to add "limit and offset" to ScannerOption
> and Available Compute Functions and Acero, since it is also a very
> common operation in SQL as well.
> But I realize that only implementing "limit and offset" compute
> functions have little effect on my situation, since the arrow compute
> functions accept arrays/scalars as input which the loading process has
> been taken. "Limit and offset'' in ScannerOption of dataset may need
> to have a dedicated implementation rather than directly call compute
> to filter. Furthermore, Acero may also benefit from this feature for
> scansink.
>Or any other ideas for this situation?
> --
> -
> Best Regards,
> Wenbo Hu,
>


Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-02 Thread Weston Pace
Also, for clarity, I do agree with Gang that these are both valuable
features in their own right.  A mask makes a lot of sense for page indices.

On Fri, Jun 2, 2023 at 7:36 AM Weston Pace  wrote:

> > then I think the incremental cost of adding the
> > positional deletes to the mask is probably lower than the anti-join.
> Do you mean developer cost?  Then yes, I agree.  Although there may be
> some subtlety in the pushdown to connect a dataset filter to a parquet
> reader filter.
>
> The main downside with using the mask (or any solution based on a filter
> node / filtering) is that it requires that the delete indices go into the
> plan itself.  So you need to first read the delete files and then create
> the plan.  And, if there are many deleted rows, this can be costly.
>
> On Thu, Jun 1, 2023 at 7:13 PM Will Jones  wrote:
>
>> That's a good point, Gang. To perform deletes, we definitely need the row
>> index, so we'll want that regardless of whether it's used in scans.
>>
>> > I'm not sure a mask would be the ideal solution for Iceberg (though it
>> is
>> a reasonable feature in its own right) because I think position-based
>> deletes, in Iceberg, are still done using an anti-join and not a filter.
>>
>> For just positional deletes in isolation, I agree the mask wouldn't be
>> more
>> optimal than the anti-join. But if they end up using the mask for
>> filtering
>> with the page index, then I think the incremental cost of adding the
>> positional deletes to the mask is probably lower than the anti-join.
>>
>> On Thu, Jun 1, 2023 at 6:33 PM Gang Wu  wrote:
>>
>> > IMO, the adding a row_index column from the reader is orthogonal to
>> > the mask implementation. Table formats (e.g. Apache Iceberg and
>> > Delta) require the knowledge of row index to finalize row deletion. It
>> > would be trivial to natively support row index from the file reader.
>> >
>> > Best,
>> > Gang
>> >
>> > On Fri, Jun 2, 2023 at 3:40 AM Weston Pace 
>> wrote:
>> >
>> > > I agree that having a row_index is a good approach.  I'm not sure a
>> mask
>> > > would be the ideal solution for Iceberg (though it is a reasonable
>> > feature
>> > > in its own right) because I think position-based deletes, in Iceberg,
>> are
>> > > still done using an anti-join and not a filter.
>> > >
>> > > That being said, we probably also want to implement a streaming
>> > merge-based
>> > > anti-join because I believe delete files are ordered by row_index and
>> so
>> > a
>> > > streaming approach is likely to be much more performant.
>> > >
>> > > On Mon, May 29, 2023 at 4:01 PM Will Jones 
>> > > wrote:
>> > >
>> > > > Hi Rusty,
>> > > >
>> > > > At first glance, I think adding a row_index column would make
>> sense. To
>> > > be
>> > > > clear, this would be an index within a file / fragment, not across
>> > > multiple
>> > > > files, which don't necessarily have a known ordering in Acero
>> (IIUC).
>> > > >
>> > > > However, another approach would be to take a mask argument in the
>> > Parquet
>> > > > reader. We may wish to do this anyways for support for using
>> predicate
>> > > > pushdown with Parquet's page index. While Arrow C++ hasn't yet
>> > > implemented
>> > > > predicate pushdown on page index (right now just supports row
>> groups),
>> > > > Arrow Rust has and provides an API to pass in a mask to support it.
>> The
>> > > > reason for this implementation is described in the blog post
>> "Querying
>> > > > Parquet with Millisecond Latency" [1], under "Page Pruning". The
>> > > > RowSelection struct API is worth a look [2].
>> > > >
>> > > > I'm not yet sure which would be preferable, but I think adopting a
>> > > similar
>> > > > pattern to what the Rust community has done may be wise. It's
>> possible
>> > > that
>> > > > row_index is easy to implement while the mask will take time, in
>> which
>> > > case
>> > > > row_index makes sense as an interim solution.
>> > > >
>> > > > Best,
>> > > >
>> > > > Will Jones
>> > > >
>> > > > [1]
>> > > >
>> > > >
>> &

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-02 Thread Weston Pace
> then I think the incremental cost of adding the
> positional deletes to the mask is probably lower than the anti-join.
Do you mean developer cost?  Then yes, I agree.  Although there may be some
subtlety in the pushdown to connect a dataset filter to a parquet reader
filter.

The main downside with using the mask (or any solution based on a filter
node / filtering) is that it requires that the delete indices go into the
plan itself.  So you need to first read the delete files and then create
the plan.  And, if there are many deleted rows, this can be costly.

On Thu, Jun 1, 2023 at 7:13 PM Will Jones  wrote:

> That's a good point, Gang. To perform deletes, we definitely need the row
> index, so we'll want that regardless of whether it's used in scans.
>
> > I'm not sure a mask would be the ideal solution for Iceberg (though it is
> a reasonable feature in its own right) because I think position-based
> deletes, in Iceberg, are still done using an anti-join and not a filter.
>
> For just positional deletes in isolation, I agree the mask wouldn't be more
> optimal than the anti-join. But if they end up using the mask for filtering
> with the page index, then I think the incremental cost of adding the
> positional deletes to the mask is probably lower than the anti-join.
>
> On Thu, Jun 1, 2023 at 6:33 PM Gang Wu  wrote:
>
> > IMO, the adding a row_index column from the reader is orthogonal to
> > the mask implementation. Table formats (e.g. Apache Iceberg and
> > Delta) require the knowledge of row index to finalize row deletion. It
> > would be trivial to natively support row index from the file reader.
> >
> > Best,
> > Gang
> >
> > On Fri, Jun 2, 2023 at 3:40 AM Weston Pace 
> wrote:
> >
> > > I agree that having a row_index is a good approach.  I'm not sure a
> mask
> > > would be the ideal solution for Iceberg (though it is a reasonable
> > feature
> > > in its own right) because I think position-based deletes, in Iceberg,
> are
> > > still done using an anti-join and not a filter.
> > >
> > > That being said, we probably also want to implement a streaming
> > merge-based
> > > anti-join because I believe delete files are ordered by row_index and
> so
> > a
> > > streaming approach is likely to be much more performant.
> > >
> > > On Mon, May 29, 2023 at 4:01 PM Will Jones 
> > > wrote:
> > >
> > > > Hi Rusty,
> > > >
> > > > At first glance, I think adding a row_index column would make sense.
> To
> > > be
> > > > clear, this would be an index within a file / fragment, not across
> > > multiple
> > > > files, which don't necessarily have a known ordering in Acero (IIUC).
> > > >
> > > > However, another approach would be to take a mask argument in the
> > Parquet
> > > > reader. We may wish to do this anyways for support for using
> predicate
> > > > pushdown with Parquet's page index. While Arrow C++ hasn't yet
> > > implemented
> > > > predicate pushdown on page index (right now just supports row
> groups),
> > > > Arrow Rust has and provides an API to pass in a mask to support it.
> The
> > > > reason for this implementation is described in the blog post
> "Querying
> > > > Parquet with Millisecond Latency" [1], under "Page Pruning". The
> > > > RowSelection struct API is worth a look [2].
> > > >
> > > > I'm not yet sure which would be preferable, but I think adopting a
> > > similar
> > > > pattern to what the Rust community has done may be wise. It's
> possible
> > > that
> > > > row_index is easy to implement while the mask will take time, in
> which
> > > case
> > > > row_index makes sense as an interim solution.
> > > >
> > > > Best,
> > > >
> > > > Will Jones
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> > > > [2]
> > > >
> > > >
> > >
> >
> https://docs.rs/parquet/40.0.0/parquet/arrow/arrow_reader/struct.RowSelection.html
> > > >
> > > > On Mon, May 29, 2023 at 2:12 PM Rusty Conover
>  > >
> > > > wrote:
> > > >
> > > > > Hi Arrow Team,
> > > > >
> > > > > I wanted to suggest an improvement regarding Acero's Scan node.
> > > > > Currently, it provides use

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-01 Thread Weston Pace
I agree that having a row_index is a good approach.  I'm not sure a mask
would be the ideal solution for Iceberg (though it is a reasonable feature
in its own right) because I think position-based deletes, in Iceberg, are
still done using an anti-join and not a filter.

That being said, we probably also want to implement a streaming merge-based
anti-join because I believe delete files are ordered by row_index and so a
streaming approach is likely to be much more performant.
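
For reference, the anti-join shape with pyarrow tables (all data here is made
up) is just:

```
import pyarrow as pa

# Rows scanned from one file, with a per-file row index column, plus a
# positional delete file listing the row indices to drop.
scanned = pa.table({"row_index": [0, 1, 2, 3, 4],
                    "value": ["a", "b", "c", "d", "e"]})
deletes = pa.table({"row_index": [1, 3]})

# A left anti join keeps only the rows whose row_index is not in the deletes.
# (Note: the hash join does not guarantee output order.)
kept = scanned.join(deletes, keys="row_index", join_type="left anti")
```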

On Mon, May 29, 2023 at 4:01 PM Will Jones  wrote:

> Hi Rusty,
>
> At first glance, I think adding a row_index column would make sense. To be
> clear, this would be an index within a file / fragment, not across multiple
> files, which don't necessarily have a known ordering in Acero (IIUC).
>
> However, another approach would be to take a mask argument in the Parquet
> reader. We may wish to do this anyways for support for using predicate
> pushdown with Parquet's page index. While Arrow C++ hasn't yet implemented
> predicate pushdown on page index (right now just supports row groups),
> Arrow Rust has and provides an API to pass in a mask to support it. The
> reason for this implementation is described in the blog post "Querying
> Parquet with Millisecond Latency" [1], under "Page Pruning". The
> RowSelection struct API is worth a look [2].
>
> I'm not yet sure which would be preferable, but I think adopting a similar
> pattern to what the Rust community has done may be wise. It's possible that
> row_index is easy to implement while the mask will take time, in which case
> row_index makes sense as an interim solution.
>
> Best,
>
> Will Jones
>
> [1]
>
> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> [2]
>
> https://docs.rs/parquet/40.0.0/parquet/arrow/arrow_reader/struct.RowSelection.html
>
> On Mon, May 29, 2023 at 2:12 PM Rusty Conover 
> wrote:
>
> > Hi Arrow Team,
> >
> > I wanted to suggest an improvement regarding Acero's Scan node.
> > Currently, it provides useful information such as __fragment_index,
> > __batch_index, __filename, and __last_in_fragment. However, it would
> > be beneficial to have an additional column that returns an overall
> > "row index" from the source.
> >
> > The row index would start from zero and increment for each row
> > retrieved from the source, particularly in the case of Parquet files.
> > Is it currently possible to obtain this row index or would expanding
> > the Scan node's behavior be required?
> >
> > Having this row index column would be valuable in implementing support
> > for Iceberg's positional-based delete files, as outlined in the
> > following link:
> >
> > https://iceberg.apache.org/spec/#delete-formats
> >
> > While Iceberg's value-based deletes can already be performed using the
> > support for anti joins, using a projection node does not guarantee the
> > row ordering within an Acero graph. Hence, the inclusion of a
> > dedicated row index column would provide a more reliable solution in
> > this context.
> >
> > Thank you for considering this suggestion.
> >
> > Rusty
> >
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-22 Thread Weston Pace
> > > >>>>>>
> > > >>>>>> +pedroerp
> > > >>>>>>
> > > >>>>>> On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
> > > >>>>>>  wrote:
> > > >>>>>> Hi All,
> > > >>>>>>
> > > >>>>>>> if we added this, do we think many Arrow and query
> > > >>>>>>> engine implementations (for example, DataFusion) will be eager
> to
> > > >>> add
> > > >>>>>> full
> > > >>>>>>> support for the type, including compute kernels? Or are they
> > likely
> > > >>> to
> > > >>>>>> just
> > > >>>>>>> convert this type to ListArray at import boundaries?
> > > >>>>>> I can't speak for query engines in general, but at least for
> > > arrow-rs
> > > >>>>>> and by extension DataFusion, and based on my current
> understanding
> > > of
> > > >>>>>> the use-cases I would be rather hesitant to add support to the
> > > >>> kernels
> > > >>>>>> for this array type, definitely instead favouring conversion at
> > the
> > > >>>>>> edges. We already have issues with the amount of code generation
> > > >>>>>> resulting in binary bloat and long compile times, and I worry
> this
> > > >>> would
> > > >>>>>> worsen this situation whilst not really providing compelling
> > > >>> advantages
> > > >>>>>> for the vast majority of workloads that don't interact with
> Velox.
> > > >>>>>> Whilst I can definitely see that the ListView representation is
> > > >>> probably
> > > >>>>>> a better way to represent variable length lists than what arrow
> > > >>> settled
> > > >>>>>> upon, I'm not yet convinced it is sufficiently better to
> > incentivise
> > > >>>>>> broad ecosystem adoption.
> > > >>>>>>
> > > >>>>>> Kind Regards,
> > > >>>>>>
> > > >>>>>> Raphael Taylor-Davies
> > > >>>>>>
> > > >>>>>>> On 11/05/2023 21:20, Will Jones wrote:
> > > >>>>>>> Hi Felipe,
> > > >>>>>>>
> > > >>>>>>> Thanks for the additional details.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>> Velox kernels benefit from being able to append data to the
> > array
> > > >>> from
> > > >>>>>>>> different threads without care for strict ordering. Only the
> > > >>> offsets
> > > >>>>>> array
> > > >>>>>>>> has to be written according to logical order but that is
> > > >>> potentially a
> > > >>>>>> much
> > > >>>>>>>> smaller buffer than the values buffer.
> > > >>>>>>>>
> > > >>>>>>> It still seems to me like applications are still pretty niche,
> > as I
> > > >>>>>> suspect
> > > >>>>>>> in most cases the benefits are outweighed by the costs. The
> > benefit
> > > >>>> here
> > > >>>>>>> seems pretty limited: if you are trying to split work between
> > > >>> threads,
> > > >>>>>>> usually you will have other levels such as array chunks to
> > > >>> parallelize.
> > > >>>>>> And
> > > >>>>>>> if you have an incoming stream of row data, you'll want to
> append
> > > in
> > > >>>>>>> predictable order to match the order of the other arrays. Am I
> > > >>> missing
> > > >>>>>>> something?
> > > >>>>>>>
> > > >>>>>>> And, IIUC, the cost of using ListView with out-of-order values
> > over
> > > >>>

[DISCUSS] Interest in a 12.0.1 patch?

2023-05-18 Thread Weston Pace
Regrettably, 12.0.0 had a significant performance regression (I'll take the
blame for not thinking through all the use cases), most easily exposed when
writing datasets from pandas / numpy data, which is being addressed in
[1].  I believe this to be a fairly common use case and it may warrant a
12.0.1 patch.  Are there other issues that would need a patch?  Do we feel
this issue is significant enough to justify the work?

[1] https://github.com/apache/arrow/pull/35565


Re: [ANNOUNCE] New Arrow committer: Gang Wu

2023-05-15 Thread Weston Pace
Congratulations!

On Mon, May 15, 2023 at 6:34 AM Rok Mihevc  wrote:

> Congrats Gang!
>
> Rok
>
> On Mon, May 15, 2023 at 3:33 PM Sutou Kouhei  wrote:
>
> > On behalf of the Arrow PMC, I'm happy to announce that Gang
> > Wu has accepted an invitation to become a committer on
> > Apache Arrow. Welcome, and thank you for your contributions!
> >
> > Thanks,
> > --
> > kou
> >
>


Re: Re: Reusing RecordBatch objects and their memory space

2023-05-12 Thread Weston Pace
The mailing list cannot handle attachments and images.  Can you upload the
flame graphs to a gist?

On Fri, May 12, 2023 at 6:55 PM SHI BEI  wrote:

> What I meant is that shared_ptr has a large overhead, which is clearly
> reflected in the CPU flame graph. In my testing scenario, there are 10
> Parquet files, each with a size of 1.3GB and no compression applied to the
> data within the files. Each row group has 65536 rows in those files. In
> each test, all files are read 10 times to facilitate capturing the CPU
> flame graph. To verify the issue described above, I controlled the number
> of calls to the RecordBatchReader::ReadNext interface by adjusting the
> number of rows of data read each time. The CPU flame graph capture results
>  are as follows:
> 1)  batch_size = 2048
>
>
> 2) batch_size = 65536
>
>
>
> --
>
> SHI BEI
> shibei...@foxmail.com
>
>
>
>
>
> Original Message
>
> From: "Weston Pace" < weston.p...@gmail.com >;
>
> Sent: 2023/5/13 2:30
>
> To: "dev" < dev@arrow.apache.org >;
>
> Subject: Re: Reusing RecordBatch objects and their memory space
>
> I think there are perhaps various things being discussed here:
>
> * Reusing large blocks of memory
>
> I don't think the memory pools actually provide this kind of reuse (e.g.
> they aren't like "connection pools" or "thread pools"). I'm pretty sure,
> when you allocate a new buffer on a pool, it always triggers an allocation
> on the underlying allocator. Now, that being said, I think this is
> generally fine. Allocators themselves (e.g. malloc, jemalloc) will keep
> and reuse blocks of memory before returning it to the OS. Though this can
> be difficult due to things like fragmentation.
>
> One potential exception to the "let allocators handle the reuse" rule would
> be cases where you are frequently allocating buffers that are the exact
> same size (or you are ok with the buffers being larger than you need so you
> can reuse them). For example, packet pools are very common in network
> programming. In this case, you can perhaps be more efficient than the
> allocator, since you know the buffers have the same size.
>
> It's not entirely clear to me that this would be useful in reading parquet.
>
> * shared_ptr overhead
>
> Every time a shared_ptr is created there is an atomic increment of the ref
> counter. Every time it is destroyed there is an atomic decrement. These
> atomic increments/decrements introduce memory fences which can foil
> compiler optimizations and just be costly on their own.
>
> > I'm using the RecordBatchReader::ReadNext interface to read Parquet
> data in my project, and I've noticed that there are a lot of temporary
> object destructors being generated during usage.
>
> Can you clarify what you mean here? When I read this sentence I thought of
> something completely different than the previous two things mentioned :)
> At one time I had a suspicion that thrift was generating a lot of small
> allocations reading the parquet metadata and that this was leading to
> fragmentation of the system allocator (thrift's allocations do not go
> through the memory pool / jemalloc and we have a bit of a habit in datasets
> of keeping parquet metadata around to speed up future reads). I never did
> investigate this further though.
>
> On Fri, May 12, 2023 at 10:48 AM David Li wrote:
>
> > I can't find it anymore, but there is a quite old issue that made the
> same
> > observation: RecordBatch's heavy use of shared_ptr in C++ can lead to a
> lot
> > of overhead just calling destructors. That may be something to explore
> more
> > (e.g. I think someone had tried to "unbox" some of the fields in
> > RecordBatch).
> >
> > On Fri, May 12, 2023, at 13:04, Will Jones wrote:
> > > Hello,
> > >
> > > I'm not sure if there are easy ways to avoid calling the destructors.
> > > However, I would point out memory space reuse is handled through memory
> > > pools; if you have one enabled it shouldn't be handing memory back to
> the
> > > OS between each iteration.
> > >
> > > Best,
> > >
> > > Will Jones
> > >
> > > On Fri, May 12, 2023 at 9:59 AM SHI BEI wrote:
> > >
> > >> Hi community,
> > >>
> > >>
> > >> I'm using the RecordBatchReader::ReadNext interface to read Parquet
> > >> data in my project, and I've noticed that there are a lot of temporary
> > >> object destructors being generated during usage. Has the community
> > >> considered providing an interface to reuse RecordBatch objects
> > >> and their memory space for storing data?
> > >>
> > >>
> > >>
> > >>
> > >> SHI BEI
> > >> shibei...@foxmail.com
> >
>
>


Re: Reusing RecordBatch objects and their memory space

2023-05-12 Thread Weston Pace
I think there are perhaps various things being discussed here:

 * Reusing large blocks of memory

I don't think the memory pools actually provide this kind of reuse (e.g.
they aren't like "connection pools" or "thread pools").  I'm pretty sure,
when you allocate a new buffer on a pool, it always triggers an allocation
on the underlying allocator.  Now, that being said, I think this is
generally fine.  Allocators themselves (e.g. malloc, jemalloc) will keep
and reuse blocks of memory before returning it to the OS.  Though this can
be difficult due to things like fragmentation.

One potential exception to the "let allocators handle the reuse" rule would
be cases where you are frequently allocating buffers that are the exact
same size (or you are ok with the buffers being larger than you need so you
can reuse them).  For example, packet pools are very common in network
programming.  In this case, you can perhaps be more efficient than the
allocator, since you know the buffers have the same size.

It's not entirely clear to me that this would be useful in reading parquet.

 * shared_ptr overhead

Every time a shared_ptr is created there is an atomic increment of the ref
counter.  Every time it is destroyed there is an atomic decrement.  These
atomic increments/decrements introduce memory fences which can foil
compiler optimizations and just be costly on their own.

> I'm using the RecordBatchReader::ReadNext interface to read Parquet
data in my project, and I've noticed that there are a lot of temporary
object destructors being generated during usage.

Can you clarify what you mean here?  When I read this sentence I thought of
something completely different than the previous two things mentioned :)
At one time I had a suspicion that thrift was generating a lot of small
allocations reading the parquet metadata and that this was leading to
fragmentation of the system allocator (thrift's allocations do not go
through the memory pool / jemalloc and we have a bit of a habit in datasets
of keeping parquet metadata around to speed up future reads).  I never did
investigate this further though.

On Fri, May 12, 2023 at 10:48 AM David Li  wrote:

> I can't find it anymore, but there is a quite old issue that made the same
> observation: RecordBatch's heavy use of shared_ptr in C++ can lead to a lot
> of overhead just calling destructors. That may be something to explore more
> (e.g. I think someone had tried to "unbox" some of the fields in
> RecordBatch).
>
> On Fri, May 12, 2023, at 13:04, Will Jones wrote:
> > Hello,
> >
> > I'm not sure if there are easy ways to avoid calling the destructors.
> > However, I would point out memory space reuse is handled through memory
> > pools; if you have one enabled it shouldn't be handing memory back to the
> > OS between each iteration.
> >
> > Best,
> >
> > Will Jones
> >
> > On Fri, May 12, 2023 at 9:59 AM SHI BEI  wrote:
> >
> >> Hi community,
> >>
> >>
> >> I'm using the RecordBatchReader::ReadNext interface to read Parquet
> >> data in my project, and I've noticed that there are a lot of temporary
> >> object destructors being generated during usage. Has the community
> >> considered providing an interface to reuse RecordBatch objects
> >> and their memory space for storing data?
> >>
> >>
> >>
> >>
> >> SHI BEI
> >> shibei...@foxmail.com
>


Re: Freeing memory when working with static crt in windows.

2023-05-12 Thread Weston Pace
You're right that the default is delete/free.  However, the important bit
is that it needs to be the correct delete/free.  The error you described
originates from the fact that the final application has two copies of the
CRT and thus two copies of delete/free.  Since shared_ptr/unique_ptr picks
the call to delete/free at compile time this means that it should be
picking the correct delete/free.

On Fri, May 12, 2023 at 9:27 AM Arkadiy Vertleyb (BLOOMBERG/ 120 PARK) <
avertl...@bloomberg.net> wrote:

> Maybe I am wrong about shared_ptr though.  Yes, according to Scott Meyers
> it is safe, at least in tr1.  Have to see what is there now...
>
> From: dev@arrow.apache.org At: 05/12/23 12:23:48 UTC-4:00To:
> dev@arrow.apache.org
> Subject: Re: Freeing memory when working with static crt in windows.
>
> Unless you provide a custom deleter, which I don't believe arrow does, it
> is just the default delete()->free(), which exactly matches the problem I
> am having.  It would have to be a shared_ptr with an additional template
> parameter.
>
> So no, unless special care was taken by arrow developers, shared_ptr would
> not
> solve the issue :-(
>
> Also I think potentially something like returning std::vector by value
> would
> cause the same issue.
>
>
> From: dev@arrow.apache.org At: 05/12/23 11:53:34 UTC-4:00To:
> dev@arrow.apache.org
> Subject: Re: Freeing memory when working with static crt in windows.
>
> I'm not very familiar with Windows.  However, I read through [1] and that
> matches your description.
>
> I suppose I thought that a shared_ptr / unique_ptr would not have this
> problem.  I believe these smart pointers store / template a deleter as part
> of their implementation.  This seems to be reinforced by [2].
>
> [1]
>
> https://learn.microsoft.com/en-us/cpp/c-runtime-library/potential-errors-passing-crt-objects-across-dll-boundaries?view=msvc-170
> [2]
>
> https://stackoverflow.com/questions/1958643/is-it-ok-to-use-boostshared-ptr-in-dll-interface
>
> On Fri, May 12, 2023 at 8:21 AM Arkadiy Vertleyb (BLOOMBERG/ 120 PARK) <
> avertl...@bloomberg.net> wrote:
>
> > Hi all.
> >
> > In some cases arrow API allocates an object and returns a shared pointer
> > to it.  Which means the object will be deallocated on the client side.
> >
> > This represents a problem when working with a static CRT in windows
> (which
> > I am experiencing right now).
> >
> > IIUC, the way to deal with this would be to export a "free" wrapper from
> > arrow DLL and use custom deleter on the shared pointer to call this
> > wrapper, so that both allocation and deallocation happens inside the
> arrow
> > DLL itself.
> >
> > Does arrow provide this kind of facility?
> >
> > Thanks,
> > Arkadiy
> >
> >
>
>
>


Re: Freeing memory when working with static crt in windows.

2023-05-12 Thread Weston Pace
I'm not very familiar with Windows.  However, I read through [1] and that
matches your description.

I suppose I thought that a shared_ptr / unique_ptr would not have this
problem.  I believe these smart pointers store / template a deleter as part
of their implementation.  This seems to be reinforced by [2].

[1]
https://learn.microsoft.com/en-us/cpp/c-runtime-library/potential-errors-passing-crt-objects-across-dll-boundaries?view=msvc-170
[2]
https://stackoverflow.com/questions/1958643/is-it-ok-to-use-boostshared-ptr-in-dll-interface

On Fri, May 12, 2023 at 8:21 AM Arkadiy Vertleyb (BLOOMBERG/ 120 PARK) <
avertl...@bloomberg.net> wrote:

> Hi all.
>
> In some cases arrow API allocates an object and returns a shared pointer
> to it.  Which means the object will be deallocated on the client side.
>
> This represents a problem when working with a static CRT in windows (which
> I am experiencing right now).
>
> IIUC, the way to deal with this would be to export a "free" wrapper from
> arrow DLL and use custom deleter on the shared pointer to call this
> wrapper, so that both allocation and deallocation happens inside the arrow
> DLL itself.
>
> Does arrow provide this kind of facility?
>
> Thanks,
> Arkadiy
>
>


Re: [ANNOUNCE] New Arrow committer: Marco Neumann

2023-05-11 Thread Weston Pace
Congratulations!

On Thu, May 11, 2023 at 4:28 AM vin jake  wrote:

> Congratulations Marco!
>
> On Thu, May 11, 2023 at 7:18 AM Andrew Lamb  wrote:
>
> > On behalf of the Arrow PMC, I'm happy to announce that Marco Neumann
> > has accepted an invitation to become a committer on Apache
> > Arrow. Welcome, and thank you for your contributions!
> >
> > Andrew
> >
>


[Format] Is it legal to have a struct array with a shorter length than its children?

2023-05-05 Thread Weston Pace
We allow arrays to have a shorter length than their buffers.  Is it also
legal for a struct array to have a shorter length than its child arrays?
For example, in C++, I can create this today by slicing a struct array:

```
  std::shared_ptr<StructArray> my_array =
      std::dynamic_pointer_cast<StructArray>(array);
  ASSERT_EQ(my_array->length(), 4);
  ASSERT_EQ(my_array->field(0)->length(), 4);
  auto sliced = std::dynamic_pointer_cast<StructArray>(my_array->Slice(2));
  ASSERT_EQ(sliced->length(), 2);
  // Note: StructArray::field pushes its offset and length into the created array
  ASSERT_EQ(sliced->field(0)->length(), 2);
  // However, the actually ArrayData objects show the truth
  ASSERT_EQ(sliced->data()->child_data[0]->length, 4);
  // Our validator thinks this is ok
  ASSERT_OK(sliced->ValidateFull());
```

The only reference I can find in the spec is this:

> Struct: a nested layout consisting of a collection of named child fields
each
> having the same length but possibly different types.

This seems to suggest that the C++ implementation is doing something
incorrect.

I'm asking because I've started to encounter some issues relating to
this[1][2] and I'm not sure if the struct array itself is the issue or the
fact that we aren't expecting these kinds of struct arrays is the problem.

[1] https://github.com/apache/arrow/issues/35450
[2] https://github.com/apache/arrow/issues/35452


Re: [ANNOUNCE] New Arrow PMC member: Matt Topol

2023-05-03 Thread Weston Pace
Congratulations!

On Wed, May 3, 2023 at 10:47 AM Raúl Cumplido 
wrote:

> Congratulations Matt!
>
> El mié, 3 may 2023, 19:44, vin jake  escribió:
>
> > Congratulations, Matt!
> >
> > Felipe Oliveira Carvalho  于 2023年5月4日周四 01:42写道:
> >
> > > Congratulations, Matt!
> > >
> > > On Wed, 3 May 2023 at 14:37 Andrew Lamb  wrote:
> > >
> > > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > > Matt Topol (zeroshade) to become a PMC member and we are pleased to
> > > > announce
> > > > that Matt has accepted.
> > > >
> > > > Congratulations and welcome!
> > > >
> > >
> >
>


Re: [Python] Casting struct to map

2023-05-03 Thread Weston Pace
No, struct array is not naturally castable to map.  It's not something that
can be done zero-copy and I don't think anyone has encountered this need
before.  Let me make sure I understand.

The goal is to go from a type of STRUCT,
where every key in the struct has the same type, to a MAP, where
each record will have Z map entries?  This seems like it could be expressed
as a compute function.  I don't think it would be very natural as a cast
since it has a pretty strict requirement that all fields in the struct have
the same type and so it will be pretty limited.  I think you could have a
compute function as well that went the opposite direction.

I do agree with Alenka, if there is any way to create your original input
data as a map then that will have better performance.
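
As a rough illustration of what such a compute function could do (a sketch
only, not an existing kernel; it assumes every struct field has the same value
type and there are no null structs):

```
import pyarrow as pa

def struct_to_map(struct_arr):
    # Sketch: build a MapArray with one entry per struct field, per row.
    struct_type = struct_arr.type
    names = [struct_type.field(i).name for i in range(struct_type.num_fields)]
    num_rows, num_fields = len(struct_arr), len(names)
    columns = [struct_arr.field(i).to_pylist() for i in range(num_fields)]
    offsets = pa.array([i * num_fields for i in range(num_rows + 1)], pa.int32())
    keys = pa.array(names * num_rows)
    items = pa.array([columns[j][i]
                      for i in range(num_rows)
                      for j in range(num_fields)])
    return pa.MapArray.from_arrays(offsets, keys, items)

struct_col = pa.array([{"first_name": "Tyler", "last_name": "Brady"},
                       {"first_name": "Walsh", "last_name": "Weaver"}])
print(struct_to_map(struct_col))
```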

On Wed, May 3, 2023 at 4:58 AM Jerald Alex  wrote:

> Hi Alenka,
>
> Great! Thank you so much for your inputs.
>
> I have indeed tried to use a schema when creating a table from a pylist
> and it worked, but in my use case I wouldn't know the table schema
> beforehand, especially for the other columns - I need to do
> transformations before I can cast it to the expected schema. Please let me
> know if you have any other thoughts.
>
> Regards,
> Infant Alex
>
> On Wed, May 3, 2023 at 9:43 AM Alenka Frim wrote:
>
> > Hi Alex,
> >
> > passing the schema to from_pylist() method on the Table should work for
> > your example (not sure if it solves your initial problem?)
> >
> > import pyarrow as pa
> >
> > table_schema = pa.schema([pa.field("id", pa.int32()),
> > pa.field("names", pa.map_(pa.string(), pa.string()))])
> >
> > table_data = [{"id": 1, "names": {"first_name": "Tyler", "last_name": "Brady"}},
> > {"id": 2, "names": {"first_name": "Walsh", "last_name": "Weaver"}}]
> >
> > pa.Table.from_pylist(table_data, schema=table_schema)
> > # pyarrow.Table
> > # id: int32
> > # names: map<string, string>
> > # child 0, entries: struct<key: string not null, value: string> not null
> > # child 0, key: string not null
> > # child 1, value: string
> > # 
> > # id: [[1,2]]
> > # names:
> >
> >
> [[keys:["first_name","last_name"]values:["Tyler","Brady"],keys:["first_name","last_name"]values:["Walsh","Weaver"]]]
> >
> >
> > Best, Alenka
> >
> > On Wed, May 3, 2023 at 9:13 AM Jerald Alex  wrote:
> >
> > > Any inputs on this please?
> > >
> > > On Tue, May 2, 2023 at 10:03 AM Jerald Alex 
> wrote:
> > >
> > > > Hi Experts,
> > > >
> > > > Can anyone please highlight if it is possible to cast struct to map
> > type?
> > > >
> > > > I tried the following but it seems to  be producing an error as
> below.
> > > >
> > > > pyarrow.lib.ArrowNotImplementedError: Unsupported cast from
> > > > struct<first_name: string, last_name: string> to map<string, string>
> > > > using function cast_map
> > > >
> > > > Note: Snippet is just an example to show the problem.
> > > >
> > > > Code Snippet:
> > > >
> > > > table_schema = pa.schema([pa.field("id", pa.int32()),
> > > > pa.field("names", pa.map_(pa.string(), pa.string()))])
> > > >
> > > > table_data = [{"id": 1,"names": {"first_name": "Tyler", "last_name":
> > > > "Brady"}},
> > > > {"id": 2,"names": {"first_name": "Walsh", "last_name": "Weaver"}}]
> > > >
> > > > tbl = pa.Table.from_pylist(table_data)
> > > > print(tbl)
> > > > print(tbl.cast(table_schema))
> > > > print(tbl)
> > > >
> > > > Error :
> > > >
> > > > id: int64
> > > > names: struct<first_name: string, last_name: string>
> > > >   child 0, first_name: string
> > > >   child 1, last_name: string
> > > > 
> > > > id: [[1,2]]
> > > > names: [
> > > >   -- is_valid: all not null
> > > >   -- child 0 type: string
> > > > ["Tyler","Walsh"]
> > > >   -- child 1 type: string
> > > > ["Brady","Weaver"]]
> > > > Traceback (most recent call last):
> > > >   File "/Users/
> > > >
> > >
> >
> infant.a...@cognitedata.com/Documents/Github/HubOcean/demo/pyarrow_types.py
> > > ",
> > > > line 220, in <module>
> > > > print(tbl.cast(table_schema))
> > > >   File "pyarrow/table.pxi", line 3489, in pyarrow.lib.Table.cast
> > > >   File "pyarrow/table.pxi", line 523, in
> pyarrow.lib.ChunkedArray.cast
> > > >   File "/Users/
> > > >
> > >
> >
> infant.a...@cognitedata.com/Library/Caches/pypoetry/virtualenvs/demo-LzMA3Hsd-py3.10/lib/python3.10/site-packages/pyarrow/compute.py
> > > ",
> > > > line 391, in cast
> > > > return call_function("cast", [arr], options)
> > > >   File "pyarrow/_compute.pyx", line 560, in
> > > pyarrow._compute.call_function
> > > >   File "pyarrow/_compute.pyx", line 355, in
> > > pyarrow._compute.Function.call
> > > >   File "pyarrow/error.pxi", line 144, in
> > > > pyarrow.lib.pyarrow_internal_check_status
> > > >   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
> > > > pyarrow.lib.ArrowNotImplementedError: Unsupported cast from
> > > > struct<first_name: string, last_name: string> to map<string, string>
> > > > using function cast_map
> > > >
> > > > Regards,
> > > > Alex Vincent
> > > >
> > >
> >
>


Re: [DISCUSS][Format][Flight] Ordered data support

2023-04-27 Thread Weston Pace
Thank you both for the extra information.  Acero couldn't actually merge
the streams today; I was thinking more of DataFusion and Velox, which would
often want to keep the streams separate, especially if there was some kind
of filtering or transformation that could be applied before a sorted merge.

However, I also very much agree that both scenarios are valid.  First, if
there are a lot of partitions (e.g. far more than the # of parallelism
units) then you probably don't want parallel paths for all of them.

Second, as you said, simpler clients (e.g. those where all filtering is
done downstream, or those that don't need any filtering at all) will
appreciate Flight's ability to merge for them.  It makes the client more
complex, but given that clients are already doing this to some extent, it
seems worthwhile.
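
To make the client-side distinction concrete, here is a minimal sketch of
what a consumer could do with the proposed flag (using pyarrow.flight; the
`ordered` attribute stands in for the flag being proposed and doesn't exist
on FlightInfo yet, hence the getattr):

```
from concurrent.futures import ThreadPoolExecutor

import pyarrow.flight as flight


def read_endpoints(client: flight.FlightClient, info: flight.FlightInfo):
    def read_one(endpoint):
        return client.do_get(endpoint.ticket).read_all()

    if getattr(info, "ordered", False):
        # Ordered: consume the endpoints sequentially, in the order the
        # server listed them (or fetch in parallel and concatenate in order).
        return [read_one(endpoint) for endpoint in info.endpoints]
    # Unordered: fetch in parallel; the relative order of the results is not
    # meaningful, so no merging or reordering is needed.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(read_one, info.endpoints))
```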

On Thu, Apr 27, 2023 at 7:45 PM David Li  wrote:

> In addition to Kou's response:
>
> The individual endpoints have always represented a subset of a single
> stream of data. So each endpoint in a FlightInfo is a partition of the
> overall result set.
>
> Not all clients want to deal with reading all the Flight streams
> themselves and may want a single stream of data. (For example: ADBC exposes
> both paths. The JDBC driver also has to deal with this.) So some client
> libraries have to deal with the question of whether to read in parallel and
> whether to keep the result in order or not. A more advanced use case, like
> Acero, would probably read the endpoints itself and could use this flag to
> decide how to merge the streams.
>
> On Fri, Apr 28, 2023, at 09:56, Sutou Kouhei wrote:
> > Hi,
> >
> >> This seems of very limited value if, for example, the user desired
> >> DESC order, then the endpoint would return
> >>
> >> Endpoint 1: (C, B, A)
> >> Endpoint 2: (F, E, D)
> >
> > As David said, the server returns
> >
> > Endpoint 2: (F, E, D)
> > Endpoint 1: (C, B, A)
> >
> > in this case.
> >
> > Here is a use case, I think:
> >
> > A system has time series data. Each node in the system has
> > data for one day. If a client requests "SELECT * FROM data
> > WHERE server = 'server1' ORDER BY created_at DESC", the
> > system returns the followings:
> >
> > Endpoint 20230428: (DATA_FOR_2023_04_28)
> > Endpoint 20230427: (DATA_FOR_2023_04_27)
> > Endpoint 20230426: (DATA_FOR_2023_04_26)
> > ...
> >
> > If we have the "ordered" flag, the client can assume that
> > received data are sorted. In other words, if the client
> > reads data from Endpoint 20230428 -> Endpoint 20230427 ->
> > Endpoint 20230426, the data the client read is sorted.
> >
> > If we don't have the "ordered" flag and we use "the relative
> > ordering of data from different endpoints is implementation
> > defined", we can't implement a general purpose Flight based
> > client library (Flight SQL based client library, Flight SQL
> > based ADBC driver and so on). The client library will have
> > the following code:
> >
> >   # TODO: How to detect server_type?
> >   if server_type == "DB1"
> >     # DB1 returns ordered results.
> >     endpoints.each do |endpoint|
> >       yield(endpoint.read)
> >     end
> >   else
> >     # Other DBs don't return ordered results.
> >     # So, we read data in parallel for performance.
> >     threads = endpoints.collect do |endpoint|
> >       Thread.new do
> >         yield(endpoint.read)
> >       end
> >     end
> >     threads.each do |thread|
> >       thread.join
> >     end
> >   end
> >
> > The client library needs to add 'or server_type == "DB2"' to
> > 'if server_type == "DB1"' when DB2 also adds support for ordered
> > results. If ordered results are only available in DB2 2.0 or later, the
> > client library needs a further condition: 'or (server_type == "DB2" and
> > server_version > 2.0)'.
> >
> > So I think that the "ordered" flag is useful.
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "Re: [DISCUSS][Format][Flight] Ordered data support" on Thu, 27 Apr
> > 2023 10:55:32 -0400,
> >   Andrew Lamb  wrote:
> >
> >> I wonder if we have considered simply removing the statement "There is
> no
> >> ordering defined on endpoints. Hence, if the returned data has an
> ordering,
> >> it should be returned in a single endpoint." and  replacing it with
> >> something that says "the relative ordering of data from different
> endpoints
> >> is implementation defined"
> >>
> >> I am struggling to come up with a concrete use case for the "ordered"
> >> flag.
> >>
> >> The ticket references "distributed sort" but most distributed sort
> >> algorithms I know of would produce multiple sorted streams that need to
> be
> >> merged together. For example
> >>
> >> Endpoint 1: (B, C, D)
> >> Endpoint 2: (A, E, F)
> >>
> >> It is not clear how the "ordered" flag would help here
> >>
> >> If the intent is somehow to signal the client it doesn't have to merge
> >> (e.g. with data like)
> >>
> >> Endpoint 1: (A, B, C)
> >> Endpoint 2:  (D, E, F)
> >>
> >> This seems of very limited value if, for example, the user 

Re: [DISCUSS][Format][Flight] Ordered data support

2023-04-27 Thread Weston Pace
So this would be a case where multiple "endpoints" are acting as a single
"stream of batches"?  Or am I misunderstanding?

What're some scenarios where that would be done?  When would it be
preferred for the client to merge the endpoints instead of the client's
user?

On Thu, Apr 27, 2023, 3:22 PM David Li  wrote:

> The server would have to report these as multiple endpoints in all your
> examples. (There's nothing saying a particular location can only appear
> once, or that "Endpoint 2" has to come after "Endpoint 1" for the DESC
> example.)
>
> The flag tells the client if it can fetch data in parallel without regard
> to order or if it should make sure to preserve the sorting of the data.
> (The ADBC Flight SQL clients in Go, C++, etc. already had to deal with
> this.) For instance Acero may care because certain plan nodes require some
> sort of ordering to be present; knowing a Flight datasource has this
> ordering could then save having to insert a sort operation into the plan.
>
> "Implementation defined" I think would basically devolve to clients always
> making the conservative/inefficient choice, like the Go ADBC driver always
> preserving order out of concern for compatibility and Acero always sorting
> data to use order-dependent nodes.
>
> On Thu, Apr 27, 2023, at 23:55, Andrew Lamb wrote:
> > I wonder if we have considered simply removing the statement "There is no
> > ordering defined on endpoints. Hence, if the returned data has an
> ordering,
> > it should be returned in a single endpoint." and  replacing it with
> > something that says "the relative ordering of data from different
> endpoints
> > is implementation defined"
> >
> > I am struggling to come up with a concrete use case for the "ordered"
> > flag.
> >
> > The ticket references "distributed sort" but most distributed sort
> > algorithms I know of would produce multiple sorted streams that need to
> be
> > merged together. For example
> >
> > Endpoint 1: (B, C, D)
> > Endpoint 2: (A, E, F)
> >
> > It is not clear how the "ordered" flag would help here
> >
> > If the intent is somehow to signal the client it doesn't have to merge
> > (e.g. with data like)
> >
> > Endpoint 1: (A, B, C)
> > Endpoint 2:  (D, E, F)
> >
> > This seems of very limited value if, for example, the user desired
> > DESC order, then the endpoint would return
> >
> > Endpoint 1: (C, B, A)
> > Endpoint 2: (F, E, D)
> >
> > Which doesn't seem to conform to the updated definition
> >
> > Andrew
> >
> >
> > On Tue, Apr 25, 2023 at 8:56 PM Sutou Kouhei  wrote:
> >
> >> Hi,
> >>
> >> I would like to propose adding support for ordered data to
> >> Apache Arrow Flight. If anyone has comments for this
> >> proposal, please share them at here or the issue for this
> >> proposal: https://github.com/apache/arrow/issues/34852
> >>
> >> This is one of proposals in "[DISCUSS] Flight RPC/Flight
> >> SQL/ADBC enhancements":
> >>
> >>   https://lists.apache.org/thread/247z3t06mf132nocngc1jkp3oqglz7jp
> >>
> >> See also the "Flight RPC: Ordered Data" section in the
> >> design document for the proposals:
> >>
> >>
> >>
> https://docs.google.com/document/d/1jhPyPZSOo2iy0LqIJVUs9KWPyFULVFJXTILDfkadx2g/edit#
> >>
> >> Background:
> >>
> >> Currently, the endpoints within a FlightInfo explicitly have
> >> no ordering.
> >>
> >> This is unnecessarily limiting. Systems can and do implement
> >> distributed sorts, but they can't reflect this in the
> >> current specification.
> >>
> >> Proposal:
> >>
> >> Add a flag to FlightInfo. If the flag is set, the client may
> >> assume that the data is sorted in the same order as the
> >> endpoints. Otherwise, the client cannot make any assumptions
> >> (as before).
> >>
> >> This is a compatible change because the client can just
> >> ignore the flag.
> >>
> >> Implementation:
> >>
> >> https://github.com/apache/arrow/pull/35178 is an
> >> implementation of this proposal. The pull requests has the
> >> followings:
> >>
> >> 1. Format changes:
> >>
> >>
> https://github.com/apache/arrow/pull/35178/files#diff-53b6c132dcc789483c879f667a1c675792b77aae9a056b257d6b20287bb09dba
> >>* format/Flight.proto
> >>
> >> 2. Documentation changes:
> >>
> >>
> https://github.com/apache/arrow/pull/35178/files#diff-839518fb41e923de682e8587f0b6fdb00eb8f3361d360c2f7249284a136a7d89
> >>* docs/source/format/Flight.rst
> >>
> >> 3. The C++ implementation and an integration test:
> >>* cpp/src/arrow/flight/
> >>
> >> 4. The Java implementation and an integration test (thanks to David
> Li!):
> >>* java/flight/
> >>
> >> 5. The Go implementation and an integration test:
> >>* go/arrow/flight/
> >>* go/arrow/internal/flight_integration/
> >>
> >> Next:
> >>
> >> I'll start a vote for this proposal after we reach a consensus
> >> on this proposal.
> >>
> >> It's the standard process for format change.
> >> See also:
> >>
> >> * [VOTE] Formalize how to change format
> >>   https://lists.apache.org/thread/jlc4wtt09rfszlzqdl55vrc4dxzscr4c
> >> * 
