Re: [DISCUSS][C++] Raw pointer string views

2023-09-26 Thread Gang Wu
Could you please briefly describe the string layouts of DuckDB and Velox
so we can know what kind of conversion is required from the raw
pointer variant? If any engine represents a string array in the
form of something like std::vector, should we
provide a similar variant in C++ to minimize the conversion cost?

Best,
Gang

On Wed, Sep 27, 2023 at 7:09 AM Raphael Taylor-Davies
 wrote:

> I'm confused why this would need to copy string data. Assuming the
> pointers are into defined memory regions (something the C data
> interface's ownership semantics require regardless), why can't these
> memory regions just be used as buffers as-is? Conversion would then only
> require rewriting the views buffer to subtract the base pointer of the
> given buffer, which should be extremely fast.
>
> On 26 September 2023 23:34:54 BST, Matt Topol 
> wrote:
> >I believe the motivation is to avoid the cost of the data copy that would
> >have to happen to convert from a pointer based to offset based scenario.
> >Allowing the pointer-based implementation will ensure that we can maintain
> >zero-copy communication with both DuckDB and Velox in a common workflow
> >scenario.
> >
> >Converting to the offset-based version would have a cost of having to copy
> >strings from their locations to contiguous buffers which could end up
> being
> >very significant depending on the shape and size of the data. The
> >pointer-based solution wouldn't be allowed in IPC though, only across the C Data
> >interface (correct me if I'm wrong).
> >
> >--Matt
> >
> >On Tue, Sep 26, 2023, 6:09 PM Raphael Taylor-Davies
> > wrote:
> >
> >> Hi,
> >>
> >> Is the motivation here to avoid DuckDB and Velox having to duplicate the
> >> conversion logic from pointer-based to offset-based, or to allow
> >> arrow-cpp to operate directly on pointer-based arrays?
> >>
> >> If it is the former, I personally wouldn't have thought the conversion
> >> logic sufficiently complex to really warrant this?
> >>
> >> If it is the latter, I wonder if you have some benchmark numbers for
> >> converting between and operating on the differing representations? In
> >> the absence of a strong performance case, it's hard in my opinion to
> >> justify adding what will be an arrow-cpp specific extension that isn't
> >> part of the standard, with all the potential for confusion and
> >> interoperability challenges that entails.
> >>
> >> Kind Regards,
> >>
> >> Raphael
> >>
> >> On 26/09/2023 21:34, Benjamin Kietzman wrote:
> >> > Hello all,
> >> >
> >> > In the PR to add support for Utf8View to the C++ implementation,
> >> > I've taken the approach of allowing raw pointer views [1] alongside
> the
> >> > index/offset views described in the spec [2]. This was done to ease
> >> > communication with other engines such as DuckDB and Velox whose native
> >> > string representation is the raw pointer view. In order to be usable
> >> > as a utility for writing IPC files and other operations on arrow
> >> > formatted data, it is useful for the library to be able to directly
> >> > import raw pointer arrays even when immediately converting these to
> >> > the index/offset representation.
> >> >
> >> > However there has been objection in review [3] since the raw pointer
> >> > representation is not part of the official format. Since data
> visitation
> >> > utilities are generic, IMHO this hybrid approach does not add
> >> > significantly to the complexity of the C++ library, and I feel the
> >> > aforementioned interoperability is a high priority when adding this
> >> > feature to the C++ library. It's worth noting that this
> interoperability
> >> > has been a stated goal of the Utf8Type since its original proposal [4]
> >> > and throughout the discussion of its adoption [5].
> >> >
> >> > Sincerely,
> >> > Ben Kietzman
> >> >
> >> > [1]:
> >> >
> >>
> https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
> >> > [2]:
> >> >
> >>
> https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
> >> > [3]:
> https://github.com/apache/arrow/pull/37792#discussion_r1336010665
> >> > [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> >> > [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4
> >> >
> >>
>


Re: [VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0

2023-09-26 Thread Jacob Wujciak-Jens
+1 (non-binding)

Full verification with conda Arrow 13.0.0 and R 4.3 on Pop!_OS 23.04, CMake
3.27, GCC 11

On Wed, Sep 27, 2023 at 1:26 AM Bryce Mecum  wrote:

> +1 (non-binding)
>
> Verified with `./verify-release-candidate.sh 0.3.0 0` on:
> - Windows 10, x86_64, libarrow-main, MSVC 17 2022, R 4.3.1, Rtools 43
> - macOS 13.6, aarch64, libarrow 13.0.0, R 4.3.1
> - Ubuntu 23.04, aarch64, libarrow 13.0.0, R 4.2.2
>


Re: [DISCUSS][Gandiva] External function registry proposal

2023-09-26 Thread Yue Ni
> I think the key idea is to let users call Gandiva functions to register
> functions and pass necessary info explicitly to Gandiva, rather than
> letting Gandiva discover them by itself.

That makes sense. Thanks, Jin and Antoine, for your valuable feedback. I will
revise the proposal accordingly.

Regards,
Yue

On Wed, Sep 27, 2023 at 4:29 AM Jin Shang  wrote:

> I agree with Antoine that we don't need to define a JSON format or a
> directory structure for Gandiva.
> To support external functions, we essentially need two things:
> 1. Gandiva's function registry needs to be aware of the function metadata:
> We can achieve this by having a
> `FunctionRegistry::AddFunction(NativeFunction* func)` function. The
> `NativeFunction` can come from whatever source the user wants or even hard
> coded, not necessarily from JSON files. The function registry is a
> singleton so this should be easy.
> 2. The LLVM engine needs access to the function IR definition: Users should
> be able to register a string representation of IR bytecode, similar to what the
> current `Engine::LoadPreCompiledIR` does. So something like
> `LoadExternalIR(std::string_view)` is enough. Although `Engine` is not a
> singleton, we can create a global object holding external IRs and Engines
> can link them on construction.
> I think the key idea is to let users call Gandiva functions to register
> functions and pass necessary info explicitly to Gandiva, rather than
> letting Gandiva discover them by itself.
>
> On Tue, Sep 26, 2023 at 2:14 AM Yue Ni  wrote:
>
> > > The definition of an external function registry can certainly belong in
> > Gandiva, but how it's populated should be left to third-party projects
> >
> > Are you proposing a more general approach, like incorporating the
> following
> > APIs into Gandiva? (Please note that the function names/signatures are
> > tentative and just meant for illustrative purposes.)
> > 1) AddExternalFunctionRegistry(ExternalFunctionRegistry
> function_registry)
> > 2) AddFunctionBitcodeLoader(FunctionBitcodeLoader bitcode_loader)
> > Where `ExternalFunctionRegistry` can return a list of function
> definitions
> > and `FunctionBitcodeLoader` can return a list of bitcode buffers, so that
> > the specific metadata/bitcode data population logic can be moved out of
> > Gandiva? Thanks.
> >
> > Regards,
> > Yue
> >
> > On Tue, Sep 26, 2023 at 12:25 AM Antoine Pitrou 
> > wrote:
> >
> > >
> > > Hi Yue,
> > >
> > > Le 25/09/2023 à 18:15, Yue Ni a écrit :
> > > >
> > > >> a CMake entrypoint (for example a function) making it easy for
> > > > third-party projects to compile their own functions
> > > > I can come up with a minimum CMake template so that users can compile
> > C++
> > > > based functions, and I think if the integration happens at the LLVM
> IR
> > > > level, it is possible to author the functions beyond C++ languages,
> > such
> > > as
> > > > Rust/Zig as long as the compiler can generate LLVM IR (there are
> other
> > > > issues that need to be addressed from the Rust experiment I made, but
> > > that
> > > > can be another proposal/PR). If we make that work, CMake is probably
> > not
> > > so
> > > > important either since other languages can use their own build tools
> > such
> > > > as Cargo/zig build, and we just need some documentation to describe
> how
> > > it
> > > > should be interfaced typically.
> > >
> > > As long as there's a well-known and supported way to generate the code
> > > for external functions, then it's fine to me.
> > >
> > > (also the required signature for these functions should be documented
> > > somewhere)
> > >
> > > >> The rest of the proposal (a specific JSON file format, a bunch of
> > > functions
> > > > to iterate directory entries in a specific layout) is IMHO off-topic
> > for
> > > > Gandiva, and each third-party project can implement their own idioms
> > for
> > > > the discovery of external functions
> > >  >
> > > > Could you give some more guidance on how this should work without an
> > > > external function registry containing metadata? As far as I know, for
> > > each
> > > > pre-compiled function used in an expression, Gandiva needs to look up
> > its
> > > > signature from the function registry, which currently is a C++ class
> > that
> > > > is hard coded to contain 6 categories of built-in functions
> > > > (arithmetic/datetime/hash/mathops/string/datetime arithmetic). If a
> > third
> > > > party function cannot be found in the registry, it cannot be used in
> > the
> > > > expression. If we don't load the pre-compiled function metadata from
> > > > external files, how do we avoid Gandiva rejecting the expression
> when a
> > > > third party function cannot be found in the function registry?
> Thanks.
> > >
> > > What I'm saying is that code to load function metadata from JSON and
> > > walk directories of .bc files does not belong in Gandiva. The
> definition
> > > of an external function registry can certainly belong in Gandiva, but
> > > how it's populated should be left to third-party projects (which then
> > > don't have to use JSON or a given directory layout).

[DISCUSS][Flight SQL] Adding Ingest Support for Flight SQL

2023-09-26 Thread Joel Lubi
Hi devs,

I would like to open a discussion around adding support for a native
"ingest" command to the Flight SQL specification. The initial motivating
use-case for this is to be able to support ADBC ingest when using the
Flight SQL driver, which is currently not possible because the specific
UPDATE semantics cannot be generalized across all possible Flight SQL
backends.

Specifically, I am proposing to extend the Flight SQL protobuf
specification with a "CommandStatementIngest" message type. The GH issue
[1] includes a sample message definition for this command. This command
would be included in the FlightDescriptor of a DoPut call to the server,
after which the subsequent FlightData stream could be handled as a single
bulk ingest.

I would greatly appreciate thoughts and feedback on this proposal.

Thank you,
Joel Lubinitsky

[1] https://github.com/apache/arrow-adbc/issues/1107


Re: [VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0

2023-09-26 Thread Bryce Mecum
+1 (non-binding)

Verified with `./verify-release-candidate.sh 0.3.0 0` on:
- Windows 10, x86_64, libarrow-main, MSVC 17 2022, R 4.3.1, Rtools 43
- macOS 13.6, aarch64, libarrow 13.0.0, R 4.3.1
- Ubuntu 23.04, aarch64, libarrow 13.0.0, R 4.2.2


Re: [DISCUSS][C++] Raw pointer string views

2023-09-26 Thread Raphael Taylor-Davies
I'm confused why this would need to copy string data. Assuming the pointers are
into defined memory regions (something the C data interface's ownership
semantics require regardless), why can't these memory regions just be used as
buffers as-is? Conversion would then only require rewriting the views buffer to
subtract the base pointer of the given buffer, which should be extremely fast.

On 26 September 2023 23:34:54 BST, Matt Topol  wrote:
>I believe the motivation is to avoid the cost of the data copy that would
>have to happen to convert from a pointer based to offset based scenario.
>Allowing the pointer-based implementation will ensure that we can maintain
>zero-copy communication with both DuckDB and Velox in a common workflow
>scenario.
>
>Converting to the offset-based version would have a cost of having to copy
>strings from their locations to contiguous buffers which could end up being
> very significant depending on the shape and size of the data. The
> pointer-based solution wouldn't be allowed in IPC though, only across the C Data
>interface (correct me if I'm wrong).
>
>--Matt
>
>On Tue, Sep 26, 2023, 6:09 PM Raphael Taylor-Davies
> wrote:
>
>> Hi,
>>
>> Is the motivation here to avoid DuckDB and Velox having to duplicate the
>> conversion logic from pointer-based to offset-based, or to allow
>> arrow-cpp to operate directly on pointer-based arrays?
>>
>> If it is the former, I personally wouldn't have thought the conversion
>> logic sufficiently complex to really warrant this?
>>
>> If it is the latter, I wonder if you have some benchmark numbers for
>> converting between and operating on the differing representations? In
>> the absence of a strong performance case, it's hard in my opinion to
>> justify adding what will be an arrow-cpp specific extension that isn't
>> part of the standard, with all the potential for confusion and
>> interoperability challenges that entails.
>>
>> Kind Regards,
>>
>> Raphael
>>
>> On 26/09/2023 21:34, Benjamin Kietzman wrote:
>> > Hello all,
>> >
>> > In the PR to add support for Utf8View to the C++ implementation,
>> > I've taken the approach of allowing raw pointer views [1] alongside the
>> > index/offset views described in the spec [2]. This was done to ease
>> > communication with other engines such as DuckDB and Velox whose native
>> > string representation is the raw pointer view. In order to be usable
>> > as a utility for writing IPC files and other operations on arrow
>> > formatted data, it is useful for the library to be able to directly
>> > import raw pointer arrays even when immediately converting these to
>> > the index/offset representation.
>> >
>> > However there has been objection in review [3] since the raw pointer
>> > representation is not part of the official format. Since data visitation
>> > utilities are generic, IMHO this hybrid approach does not add
>> > significantly to the complexity of the C++ library, and I feel the
>> > aforementioned interoperability is a high priority when adding this
>> > feature to the C++ library. It's worth noting that this interoperability
>> > has been a stated goal of the Utf8Type since its original proposal [4]
>> > and throughout the discussion of its adoption [5].
>> >
>> > Sincerely,
>> > Ben Kietzman
>> >
>> > [1]:
>> >
>> https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
>> > [2]:
>> >
>> https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
>> > [3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
>> > [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
>> > [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4
>> >
>>


Re: [DISCUSS][C++] Raw pointer string views

2023-09-26 Thread Matt Topol
I believe the motivation is to avoid the cost of the data copy that would
have to happen to convert from a pointer based to offset based scenario.
Allowing the pointer-based implementation will ensure that we can maintain
zero-copy communication with both DuckDB and Velox in a common workflow
scenario.

Converting to the offset-based version would have a cost of having to copy
strings from their locations to contiguous buffers which could end up being
very significant depending on the shape and size of the data. The
pointer-based solution wouldn't be allowed in IPC though, only across the C Data
interface (correct me if I'm wrong).

--Matt

On Tue, Sep 26, 2023, 6:09 PM Raphael Taylor-Davies
 wrote:

> Hi,
>
> Is the motivation here to avoid DuckDB and Velox having to duplicate the
> conversion logic from pointer-based to offset-based, or to allow
> arrow-cpp to operate directly on pointer-based arrays?
>
> If it is the former, I personally wouldn't have thought the conversion
> logic sufficiently complex to really warrant this?
>
> If it is the latter, I wonder if you have some benchmark numbers for
> converting between and operating on the differing representations? In
> the absence of a strong performance case, it's hard in my opinion to
> justify adding what will be an arrow-cpp specific extension that isn't
> part of the standard, with all the potential for confusion and
> interoperability challenges that entails.
>
> Kind Regards,
>
> Raphael
>
> On 26/09/2023 21:34, Benjamin Kietzman wrote:
> > Hello all,
> >
> > In the PR to add support for Utf8View to the C++ implementation,
> > I've taken the approach of allowing raw pointer views [1] alongside the
> > index/offset views described in the spec [2]. This was done to ease
> > communication with other engines such as DuckDB and Velox whose native
> > string representation is the raw pointer view. In order to be usable
> > as a utility for writing IPC files and other operations on arrow
> > formatted data, it is useful for the library to be able to directly
> > import raw pointer arrays even when immediately converting these to
> > the index/offset representation.
> >
> > However there has been objection in review [3] since the raw pointer
> > representation is not part of the official format. Since data visitation
> > utilities are generic, IMHO this hybrid approach does not add
> > significantly to the complexity of the C++ library, and I feel the
> > aforementioned interoperability is a high priority when adding this
> > feature to the C++ library. It's worth noting that this interoperability
> > has been a stated goal of the Utf8Type since its original proposal [4]
> > and throughout the discussion of its adoption [5].
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1]:
> >
> https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
> > [2]:
> >
> https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
> > [3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
> > [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4
> >
>


Re: [VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0

2023-09-26 Thread David Li
+1

Tested on Ubuntu 20.04 LTS/x86_64, R 4.3.1

On Tue, Sep 26, 2023, at 18:05, Dane Pitkin wrote:
> +1 (non-binding)
>
> I verified successfully on MacOS 13.5 (aarch64) with:
>
> cd dev/release && ./verify-release-candidate.sh 0.3.0 0
>
>
>
> On Tue, Sep 26, 2023 at 5:30 PM Sutou Kouhei  wrote:
>
>> +1
>>
>> I ran the following command line on Debian GNU/Linux sid:
>>
>>   CMAKE_PREFIX_PATH=/tmp/local \
>> dev/release/verify-release-candidate.sh 0.3.0 0
>>
>> with:
>>
>>   * Apache Arrow C++ main
>>   * gcc (Debian 13.2.0-4) 13.2.0
>>   * R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
>>
>> Thanks,
>> --
>> kou
>>
>> In 
>>   "[VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0" on Tue, 26 Sep 2023
>> 12:23:52 -0300,
>>   Dewey Dunnington  wrote:
>>
>> > Hello,
>> >
>> > I would like to propose the following release candidate (rc0) of
>> > Apache Arrow nanoarrow [0] version 0.3.0. This is an initial release
>> > consisting of 42 resolved GitHub issues from 4 contributors [1].
>> >
>> > This release candidate is based on commit:
>> > c00cd7707bcddb4dab9a7d19bf63e87c06d36c63 [2]
>> >
>> > The source release rc0 is hosted at [3].
>> > The changelog is located at [4].
>> >
>> > Please download, verify checksums and signatures, run the unit tests,
>> > and vote on the release. See [5] for how to validate a release
>> > candidate.
>> >
>> > See also a successful suite of verification runs at [6].
>> >
>> > The vote will be open for at least 72 hours.
>> >
>> > [ ] +1 Release this as Apache Arrow nanoarrow 0.3.0
>> > [ ] +0
>> > [ ] -1 Do not release this as Apache Arrow nanoarrow 0.3.0 because...
>> >
>> > [0] https://github.com/apache/arrow-nanoarrow
>> > [1] https://github.com/apache/arrow-nanoarrow/milestone/3?closed=1
>> > [2]
>> https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.3.0-rc0
>> > [3]
>> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.3.0-rc0/
>> > [4]
>> https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.3.0-rc0/CHANGELOG.md
>> > [5]
>> https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
>> > [6] https://github.com/apache/arrow-nanoarrow/actions/runs/6314579940
>>


Re: [DISCUSS][C++] Raw pointer string views

2023-09-26 Thread Raphael Taylor-Davies

Hi,

Is the motivation here to avoid DuckDB and Velox having to duplicate the 
conversion logic from pointer-based to offset-based, or to allow 
arrow-cpp to operate directly on pointer-based arrays?


If it is the former, I personally wouldn't have thought the conversion 
logic sufficiently complex to really warrant this?


If it is the latter, I wonder if you have some benchmark numbers for 
converting between and operating on the differing representations? In 
the absence of a strong performance case, it's hard in my opinion to 
justify adding what will be an arrow-cpp specific extension that isn't 
part of the standard, with all the potential for confusion and 
interoperability challenges that entails.


Kind Regards,

Raphael

On 26/09/2023 21:34, Benjamin Kietzman wrote:

Hello all,

In the PR to add support for Utf8View to the C++ implementation,
I've taken the approach of allowing raw pointer views [1] alongside the
index/offset views described in the spec [2]. This was done to ease
communication with other engines such as DuckDB and Velox whose native
string representation is the raw pointer view. In order to be usable
as a utility for writing IPC files and other operations on arrow
formatted data, it is useful for the library to be able to directly
import raw pointer arrays even when immediately converting these to
the index/offset representation.

However there has been objection in review [3] since the raw pointer
representation is not part of the official format. Since data visitation
utilities are generic, IMHO this hybrid approach does not add
significantly to the complexity of the C++ library, and I feel the
aforementioned interoperability is a high priority when adding this
feature to the C++ library. It's worth noting that this interoperability
has been a stated goal of the Utf8Type since its original proposal [4]
and throughout the discussion of its adoption [5].

Sincerely,
Ben Kietzman

[1]:
https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
[2]:
https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
[3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
[4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4



Re: [VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0

2023-09-26 Thread Dane Pitkin
+1 (non-binding)

I verified successfully on MacOS 13.5 (aarch64) with:

cd dev/release && ./verify-release-candidate.sh 0.3.0 0



On Tue, Sep 26, 2023 at 5:30 PM Sutou Kouhei  wrote:

> +1
>
> I ran the following command line on Debian GNU/Linux sid:
>
>   CMAKE_PREFIX_PATH=/tmp/local \
> dev/release/verify-release-candidate.sh 0.3.0 0
>
> with:
>
>   * Apache Arrow C++ main
>   * gcc (Debian 13.2.0-4) 13.2.0
>   * R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
>
> Thanks,
> --
> kou
>
> In 
>   "[VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0" on Tue, 26 Sep 2023
> 12:23:52 -0300,
>   Dewey Dunnington  wrote:
>
> > Hello,
> >
> > I would like to propose the following release candidate (rc0) of
> > Apache Arrow nanoarrow [0] version 0.3.0. This is an initial release
> > consisting of 42 resolved GitHub issues from 4 contributors [1].
> >
> > This release candidate is based on commit:
> > c00cd7707bcddb4dab9a7d19bf63e87c06d36c63 [2]
> >
> > The source release rc0 is hosted at [3].
> > The changelog is located at [4].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [5] for how to validate a release
> > candidate.
> >
> > See also a successful suite of verification runs at [6].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow nanoarrow 0.3.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow nanoarrow 0.3.0 because...
> >
> > [0] https://github.com/apache/arrow-nanoarrow
> > [1] https://github.com/apache/arrow-nanoarrow/milestone/3?closed=1
> > [2]
> https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.3.0-rc0
> > [3]
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.3.0-rc0/
> > [4]
> https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.3.0-rc0/CHANGELOG.md
> > [5]
> https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
> > [6] https://github.com/apache/arrow-nanoarrow/actions/runs/6314579940
>


Re: [VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0

2023-09-26 Thread Sutou Kouhei
+1

I ran the following command line on Debian GNU/Linux sid:

  CMAKE_PREFIX_PATH=/tmp/local \
dev/release/verify-release-candidate.sh 0.3.0 0

with:

  * Apache Arrow C++ main
  * gcc (Debian 13.2.0-4) 13.2.0
  * R version 4.3.1 (2023-06-16) -- "Beagle Scouts"

Thanks,
-- 
kou

In 
  "[VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0" on Tue, 26 Sep 2023 
12:23:52 -0300,
  Dewey Dunnington  wrote:

> Hello,
> 
> I would like to propose the following release candidate (rc0) of
> Apache Arrow nanoarrow [0] version 0.3.0. This is an initial release
> consisting of 42 resolved GitHub issues from 4 contributors [1].
> 
> This release candidate is based on commit:
> c00cd7707bcddb4dab9a7d19bf63e87c06d36c63 [2]
> 
> The source release rc0 is hosted at [3].
> The changelog is located at [4].
> 
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [5] for how to validate a release
> candidate.
> 
> See also a successful suite of verification runs at [6].
> 
> The vote will be open for at least 72 hours.
> 
> [ ] +1 Release this as Apache Arrow nanoarrow 0.3.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow nanoarrow 0.3.0 because...
> 
> [0] https://github.com/apache/arrow-nanoarrow
> [1] https://github.com/apache/arrow-nanoarrow/milestone/3?closed=1
> [2] 
> https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.3.0-rc0
> [3] 
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.3.0-rc0/
> [4] 
> https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.3.0-rc0/CHANGELOG.md
> [5] https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
> [6] https://github.com/apache/arrow-nanoarrow/actions/runs/6314579940


[DISCUSS][C++] Raw pointer string views

2023-09-26 Thread Benjamin Kietzman
Hello all,

In the PR to add support for Utf8View to the C++ implementation,
I've taken the approach of allowing raw pointer views [1] alongside the
index/offset views described in the spec [2]. This was done to ease
communication with other engines such as DuckDB and Velox whose native
string representation is the raw pointer view. In order to be usable
as a utility for writing IPC files and other operations on arrow
formatted data, it is useful for the library to be able to directly
import raw pointer arrays even when immediately converting these to
the index/offset representation.

However there has been objection in review [3] since the raw pointer
representation is not part of the official format. Since data visitation
utilities are generic, IMHO this hybrid approach does not add
significantly to the complexity of the C++ library, and I feel the
aforementioned interoperability is a high priority when adding this
feature to the C++ library. It's worth noting that this interoperability
has been a stated goal of the Utf8Type since its original proposal [4]
and throughout the discussion of its adoption [5].

Sincerely,
Ben Kietzman

[1]:
https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
[2]:
https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
[3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
[4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4


Re: [DISCUSS][Gandiva] External function registry proposal

2023-09-26 Thread Jin Shang
I agree with Antoine that we don't need to define a JSON format or a
directory structure for Gandiva.
To support external functions, we essentially need two things:
1. Gandiva's function registry needs to be aware of the function metadata:
We can achieve this by having a
`FunctionRegistry::AddFunction(NativeFunction* func)` function. The
`NativeFunction` can come from whatever source the user wants or even hard
coded, not necessarily from JSON files. The function registry is a
singleton so this should be easy.
2. The LLVM engine needs access to the function IR definition: Users should
be able to register a string representation of IR bytecode, similar to what the
current `Engine::LoadPreCompiledIR` does. So something like
`LoadExternalIR(std::string_view)` is enough. Although `Engine` is not a
singleton, we can create a global object holding external IRs and Engines
can link them on construction.
I think the key idea is to let users call Gandiva functions to register
functions and pass necessary info explicitly to Gandiva, rather than
letting Gandiva discover them by itself.

On Tue, Sep 26, 2023 at 2:14 AM Yue Ni  wrote:

> > The definition of an external function registry can certainly belong in
> Gandiva, but how it's populated should be left to third-party projects
>
> Are you proposing a more general approach, like incorporating the following
> APIs into Gandiva? (Please note that the function names/signatures are
> tentative and just meant for illustrative purposes.)
> 1) AddExternalFunctionRegistry(ExternalFunctionRegistry function_registry)
> 2) AddFunctionBitcodeLoader(FunctionBitcodeLoader bitcode_loader)
> Where `ExternalFunctionRegistry` can return a list of function definitions
> and `FunctionBitcodeLoader` can return a list of bitcode buffers, so that
> the specific metadata/bitcode data population logic can be moved out of
> Gandiva? Thanks.
>
> Regards,
> Yue
>
> On Tue, Sep 26, 2023 at 12:25 AM Antoine Pitrou 
> wrote:
>
> >
> > Hi Yue,
> >
> > Le 25/09/2023 à 18:15, Yue Ni a écrit :
> > >
> > >> a CMake entrypoint (for example a function) making it easy for
> > > third-party projects to compile their own functions
> > > I can come up with a minimum CMake template so that users can compile
> C++
> > > based functions, and I think if the integration happens at the LLVM IR
> > > level, it is possible to author the functions beyond C++ languages,
> such
> > as
> > > Rust/Zig as long as the compiler can generate LLVM IR (there are other
> > > issues that need to be addressed from the Rust experiment I made, but
> > that
> > > can be another proposal/PR). If we make that work, CMake is probably
> not
> > so
> > > important either since other languages can use their own build tools
> such
> > > as Cargo/zig build, and we just need some documentation to describe how
> > it
> > > should be interfaced typically.
> >
> > As long as there's a well-known and supported way to generate the code
> > for external functions, then it's fine to me.
> >
> > (also the required signature for these functions should be documented
> > somewhere)
> >
> > >> The rest of the proposal (a specific JSON file format, a bunch of
> > >> functions to iterate directory entries in a specific layout) is IMHO
> > >> off-topic for Gandiva, and each third-party project can implement
> > >> their own idioms for the discovery of external functions
> > >
> > > Could you give some more guidance on how this should work without an
> > > external function registry containing metadata? As far as I know, for
> > > each pre-compiled function used in an expression, Gandiva needs to
> > > look up its signature from the function registry, which currently is
> > > a C++ class that is hard coded to contain 6 categories of built-in
> > > functions (arithmetic/datetime/hash/mathops/string/datetime
> > > arithmetic). If a third party function cannot be found in the
> > > registry, it cannot be used in the expression. If we don't load the
> > > pre-compiled function metadata from external files, how do we avoid
> > > Gandiva rejecting the expression when a third party function cannot
> > > be found in the function registry? Thanks.
> >
> > What I'm saying is that code to load function metadata from JSON and
> > walk directories of .bc files does not belong in Gandiva. The definition
> > of an external function registry can certainly belong in Gandiva, but
> > how it's populated should be left to third-party projects (which then
> > don't have to use JSON or a given directory layout).
> >
> > Regards
> >
> > Antoine.
> >
>


Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.7.1 RC1

2023-09-26 Thread Andrew Lamb
+1 (binding)

Verified on mac x86_64

Looks like a good release to me -- thank you Raphael

Andrew

On Tue, Sep 26, 2023 at 12:05 PM Raphael Taylor-Davies
 wrote:

> Hi,
>
> I would like to propose a release of Apache Arrow Rust Object
> Store Implementation, version 0.7.1.
>
> This release candidate is based on commit:
> 4ef7917bd57b701e30def8511b5fd8a7961f2fcf [1]
>
> The proposed release tarball and signatures are hosted at [2].
>
> The changelog is located at [3].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. There is a script [4] that automates some of
> the verification.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow Rust Object Store
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Rust Object Store because...
>
> [1]:
>
> https://github.com/apache/arrow-rs/tree/4ef7917bd57b701e30def8511b5fd8a7961f2fcf
> [2]:
>
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.7.1-rc1
> [3]:
>
> https://github.com/apache/arrow-rs/blob/4ef7917bd57b701e30def8511b5fd8a7961f2fcf/object_store/CHANGELOG.md
> [4]:
>
> https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh
>
>


Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.7.1 RC1

2023-09-26 Thread L. C. Hsieh
+1 (binding)

Verified on M1 Mac.

Thanks Raphael.


On Tue, Sep 26, 2023 at 9:05 AM Raphael Taylor-Davies
 wrote:
>
> Hi,
>
> I would like to propose a release of Apache Arrow Rust Object
> Store Implementation, version 0.7.1.
>
> This release candidate is based on commit:
> 4ef7917bd57b701e30def8511b5fd8a7961f2fcf [1]
>
> The proposed release tarball and signatures are hosted at [2].
>
> The changelog is located at [3].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. There is a script [4] that automates some of
> the verification.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow Rust Object Store
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Rust Object Store because...
>
> [1]:
> https://github.com/apache/arrow-rs/tree/4ef7917bd57b701e30def8511b5fd8a7961f2fcf
> [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.7.1-rc1
> [3]:
> https://github.com/apache/arrow-rs/blob/4ef7917bd57b701e30def8511b5fd8a7961f2fcf/object_store/CHANGELOG.md
> [4]:
> https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh
>


Re: [Format] C Data Interface integration testing

2023-09-26 Thread Dewey Dunnington
Thank you for setting this up! I look forward to adding nanoarrow as
soon as time allows.

Cheers,

-dewey

On Tue, Sep 26, 2023 at 9:48 AM Antoine Pitrou  wrote:
>
>
> Hello,
>
> We have added some infrastructure for integration testing of the C Data
> Interface between Arrow implementations. We are now testing the C++ and
> Go implementations, but the goal in the future is for all major
> implementations to be tested there (perhaps including nanoarrow).
>
> - PR to add the testing infrastructure and enable the C++ implementation:
> https://github.com/apache/arrow/pull/37769
>
> - PR to enable the Go implementation
> https://github.com/apache/arrow/pull/37788
>
> Feel free to ask any questions.
>
> Regards
>
> Antoine.
>
>
>


[VOTE][RUST] Release Apache Arrow Rust Object Store 0.7.1 RC1

2023-09-26 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.7.1.

This release candidate is based on commit:
4ef7917bd57b701e30def8511b5fd8a7961f2fcf [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: 
https://github.com/apache/arrow-rs/tree/4ef7917bd57b701e30def8511b5fd8a7961f2fcf
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.7.1-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/4ef7917bd57b701e30def8511b5fd8a7961f2fcf/object_store/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh
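[Editor's sketch] The "verify checksums and signatures" step asked for above can be illustrated with a minimal local stand-in. The file names below are invented for the demonstration; the linked verify-release-candidate.sh script [4] is the authoritative procedure and also downloads the real tarball and `.asc`/`.sha512` files from the dist URL [2]:

```shell
# Minimal illustration of checksum verification for a release candidate,
# using a local stand-in file instead of the real downloaded tarball.
set -eu

# Stand-in for the downloaded release tarball (name is hypothetical).
echo "release contents" > artifact.tar.gz

# Upstream publishes a .sha512 file alongside the tarball; simulate it.
sha512sum artifact.tar.gz > artifact.tar.gz.sha512

# The verifier re-computes the digest and checks it against the published one.
sha512sum -c artifact.tar.gz.sha512
```

A real verification additionally runs `gpg --verify` against the published `.asc` signature and the project's KEYS file, then builds and runs the unit tests, all of which the script automates.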




[VOTE] Release Apache Arrow nanoarrow 0.3.0 - RC0

2023-09-26 Thread Dewey Dunnington
Hello,

I would like to propose the following release candidate (rc0) of
Apache Arrow nanoarrow [0] version 0.3.0. This is an initial release
consisting of 42 resolved GitHub issues from 4 contributors [1].

This release candidate is based on commit:
c00cd7707bcddb4dab9a7d19bf63e87c06d36c63 [2]

The source release rc0 is hosted at [3].
The changelog is located at [4].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [5] for how to validate a release
candidate.

See also a successful suite of verification runs at [6].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow nanoarrow 0.3.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow nanoarrow 0.3.0 because...

[0] https://github.com/apache/arrow-nanoarrow
[1] https://github.com/apache/arrow-nanoarrow/milestone/3?closed=1
[2] 
https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.3.0-rc0
[3] 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.3.0-rc0/
[4] 
https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.3.0-rc0/CHANGELOG.md
[5] https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
[6] https://github.com/apache/arrow-nanoarrow/actions/runs/6314579940


[Format] C Data Interface integration testing

2023-09-26 Thread Antoine Pitrou



Hello,

We have added some infrastructure for integration testing of the C Data 
Interface between Arrow implementations. We are now testing the C++ and 
Go implementations, but the goal in the future is for all major 
implementations to be tested there (perhaps including nanoarrow).


- PR to add the testing infrastructure and enable the C++ implementation:
https://github.com/apache/arrow/pull/37769

- PR to enable the Go implementation
https://github.com/apache/arrow/pull/37788

Feel free to ask any questions.

Regards

Antoine.