Re: [DISC] Improving Arrow's database support

David Li Thu, 01 Sep 2022 13:38:11 -0700

Following up here with some specific questions:

Matt Topol added some Go definitions [1] (thanks!). I'd assume we want to vote 
on those as well?


How should the process work for Java/Go? For C/C++, I assume we'd treat it like 
the C Data Interface and copy adbc.h to format/ after a vote, and then vote on 
releases of components. Or do we really only consider the C header as the 
'format', with the others being language-specific affordances?

What about for Java and for Go? We could vote on and tag a release for Go, and 
add a documentation page that links to the Java/Go definitions at a specific 
revision (as the equivalent 'format' definition for Java/Go)? Or would we 
vendor the entire Java module/Go package as the 'format'?

Do we have a preference for versioning strategy? Should we proceed in lockstep 
with the Arrow C++ library et al. and release "ADBC 1.0.0" (the API standard) 
with "drivers version 10.0.0", or use an independent versioning scheme? (For 
example, release API standard and components at "1.0.0". Then further releases 
of components that do not change the spec would be "1.1", "1.2", ...; if/when 
we change the spec, start over with "2.0", "2.1", ...)

[1]: https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go

-David

On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
> Hi,
>
> OK. I'll send pull requests for GLib and Ruby soon.
>
>> I'm curious if you have a particular use case in mind.
>
> I don't have any production-ready use case yet but I want to
> implement an Active Record adapter for ADBC. Active Record
> is the O/R mapper for Ruby on Rails. Implementing web
> applications with Ruby on Rails is one of the major Ruby use
> cases, so providing an Active Record interface for ADBC will
> increase the number of Apache Arrow users in the Ruby community.
>
> NOTE: Generally, Ruby on Rails users don't process large
> data but they sometimes need to process large (medium?) data
> in a batch process. An Active Record adapter for ADBC may be
> useful for such use cases.
>
>> There's a little bit more API cleanup to do [1]. If you
>> have comments on that or anything else, I'd appreciate
>> them. Otherwise, pull requests would also be appreciated.
>
> OK. I'll open issues/pull requests when I find
> something. For now, I think that a "MODULE" library rather
> than a "SHARED" library, in CMake terminology [cmake], is
> better for driver modules. (I'll open an issue for this
> later.)
>
> [cmake]: https://cmake.org/cmake/help/latest/command/add_library.html
>
>
> Thanks,
> -- 
> kou
>
> In <[email protected]>
>   "Re: [DISC] Improving Arrow's database support" on Sat, 27 Aug 2022 15:28:56 -0400,
>   "David Li" <[email protected]> wrote:
>
>> I would be very happy to see GLib/Ruby bindings! I'm curious if you have a 
>> particular use case in mind. 
>> 
>> There's a little bit more API cleanup to do [1]. If you have comments on 
>> that or anything else, I'd appreciate them. Otherwise, pull requests would 
>> also be appreciated.
>> 
>> [1]: https://github.com/apache/arrow-adbc/issues/79
>> 
>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
>>> Hi,
>>>
>>> Thanks for sharing the current status!
>>> I understand.
>>>
>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
>>> before we release the first version? (I want to use ADBC
>>> from Ruby.) Or should I wait for the first release? If I can
>>> work on it now, I'll open pull requests for it.
>>>
>>> Thanks,
>>> -- 
>>> kou
>>>
>>> In <[email protected]>
>>>   "Re: [DISC] Improving Arrow's database support" on Fri, 26 Aug 2022 11:03:26 -0400,
>>>   "David Li" <[email protected]> wrote:
>>>
>>>> Thank you Kou!
>>>> 
>>>> At least initially, I don't think I'll be able to complete the Dataset 
>>>> integration in time. So 10.0.0 probably won't ship with a hard dependency. 
>>>> That said I am hoping to have PyArrow take an optional dependency (so 
>>>> Flight SQL can finally be available from Python).
>>>> 
>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>>>>> Hi,
>>>>>
>>>>> As a maintainer of Linux packages, I want apache/arrow-adbc
>>>>> to be released before apache/arrow is released so that
>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>>>>> .deb/.rpm.
>>>>>
>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>>>>> apache/arrow's .deb/.rpm needs to depend on
>>>>> apache/arrow-adbc's .deb/.rpm.)
>>>>>
>>>>> We can add .deb/.rpm related files
>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
>>>>> apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.
>>>>>
>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
>>>>>
>>>>> * https://github.com/datafusion-contrib/datafusion-c/tree/main/package
>>>>> * https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
>>>>>
>>>>> I can work on it in apache/arrow-adbc.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> -- 
>>>>> kou
>>>>>
>>>>> In <[email protected]>
>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug 2022 11:51:08 -0400,
>>>>>   "David Li" <[email protected]> wrote:
>>>>>
>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of 
>>>>>> text that follows…)
>>>>>> 
>>>>>> These are the components:
>>>>>> 
>>>>>> - Core adbc.h header
>>>>>> - Driver manager for C/C++
>>>>>> - Flight SQL-based driver
>>>>>> - Postgres-based driver (WIP)
>>>>>> - SQLite-based driver (more of a testbed for me than an actual component;
>>>>>>   I don't think we'd actually distribute this)
>>>>>> - Java core interfaces
>>>>>> - Java driver manager
>>>>>> - Java JDBC-based driver
>>>>>> - Java Flight SQL-based driver
>>>>>> - Python driver manager
>>>>>> 
>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL 
>>>>>> drivers get moved to the main Arrow repo and distributed as part of the 
>>>>>> regular Arrow releases.
>>>>>> 
>>>>>> For the rest of the components: they could be packaged individually, but 
>>>>>> versioned and released together. Also, each C/C++ driver probably needs 
>>>>>> a corresponding Python package so Python users do not have to futz with 
>>>>>> shared library configurations. (See [1].) So for instance, installing 
>>>>>> PyArrow would also give you the Flight SQL driver, and `pip install 
>>>>>> adbc_postgres` would get you the Postgres-based driver.
>>>>>> 
>>>>>> That would mean setting up separate CI, release, etc. (and eventually 
>>>>>> linking Crossbow & Conbench as well?). That does mean duplication of 
>>>>>> effort, but the trade-off is avoiding bloating the main release process 
>>>>>> even further. However, I'd like to hear from those closer to the release 
>>>>>> process on this subject - if it would make people's lives easier, we 
>>>>>> could merge everything into one repo/process.
>>>>>> 
>>>>>> Integrations would be distributed as part of their respective packages 
>>>>>> (e.g. Arrow Dataset would optionally link to the driver manager). So the 
>>>>>> "part of Arrow 10.0.0" aspect means having a stable interface for 
>>>>>> adbc.h, and getting the Flight SQL drivers into the main repo.
>>>>>> 
>>>>>> [1]: https://github.com/apache/arrow-adbc/issues/53
>>>>>> 
>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>>>>>>> "David Li" <[email protected]> wrote:
>>>>>>>> Since it's been a while, I'd like to give an update. There are also a 
>>>>>>>> few questions I have around distribution.
>>>>>>>> 
>>>>>>>> Currently:
>>>>>>>> - Supported in C, Java, and Python.
>>>>>>>> - For C/Python, there are basic drivers wrapping Flight SQL and 
>>>>>>>> SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>>>>>>>> - For Python, there are low-level bindings to the C API, and the DBAPI 
>>>>>>>> interface on top of that (+a few extension methods resembling 
>>>>>>>> DuckDB/Turbodbc).
>>>>>>>>  
>>>>>>>> There are drafts of integration with Ibis [1], DBI (R), and DuckDB. (I'd 
>>>>>>>> like to thank Hannes and Kirill for their comments, as well as 
>>>>>>>> Antoine, Dewey, and Matt here.)
>>>>>>>> 
>>>>>>>> I'd like to have this as part of 10.0.0 in some fashion. However, I'm 
>>>>>>>> not sure how we would like to handle packaging and distribution. In 
>>>>>>>> particular, there are several sub-components for each language (the 
>>>>>>>> driver manager + the drivers), increasing the work. Any thoughts here?
>>>>>>>
>>>>>>> Sorry, forgot to answer here. But I think your question is too broadly
>>>>>>> formulated. It probably deserves a case-by-case discussion, IMHO.
>>>>>>>
>>>>>>>> I'm also wondering how we want to handle this in terms of 
>>>>>>>> specification - I assume we'd consider the core header file/Java 
>>>>>>>> interfaces a spec like the C Data Interface/Flight RPC, and vote on 
>>>>>>>> them/mirror them into the format/ directory?
>>>>>>>
>>>>>>> That sounds like the right way to me indeed.
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Antoine.
