Re: [DISC] Improving Arrow's database support

David Li Wed, 14 Sep 2022 06:22:13 -0700

I put up [1] as the PR to apache/arrow to vote on. There is a bit of a circular 
dependency here: my thought is that we will vote on this, then tag the 1.0.0 
API standard on apache/arrow-adbc, and finally update the PR before merging. 
But actual releases of the packages may be a later commit/tag as we set up all 
the necessary infrastructure.


I'll start a vote thread soon unless there are comments/concerns.

Also, I plan to make a ticket to INFRA for apache/arrow-adbc, to switch the 
default commit message to "PR title + description" [2] to go along with the 
conventional commit suggestion, unless anyone has other ideas.
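
To make the link between conventional commits and semver concrete, here is a
minimal Python sketch of how a release script could derive the next version of
one component from its squashed PR titles. It is only an illustration of the
Conventional Commits rules (a "!" or a BREAKING CHANGE footer means major,
"feat" means minor, "fix" means patch). The function names bump_kind and
next_version and the example scopes are hypothetical, not an existing ADBC tool
or an agreed convention.

```python
import re

# Conventional Commits header: type, optional "(scope)", optional "!", then ": description".
HEADER = re.compile(r"^(?P<type>[a-z]+)(\((?P<scope>[^)]+)\))?(?P<breaking>!)?: .+")


def bump_kind(message: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for one squashed commit message."""
    lines = message.strip().splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        # Not a conventional commit header: no automatic bump.
        return "none"
    # A "!" after the type/scope or a BREAKING CHANGE footer marks an incompatible change.
    has_footer = any(
        line.startswith(("BREAKING CHANGE:", "BREAKING-CHANGE:")) for line in lines[1:]
    )
    if match["breaking"] or has_footer:
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] == "fix":
        return "patch"
    return "none"


def next_version(version: str, messages: list[str]) -> str:
    """Apply the largest bump found in `messages` to a MAJOR.MINOR.PATCH version."""
    major, minor, patch = (int(part) for part in version.split("."))
    kinds = {bump_kind(message) for message in messages}
    if "major" in kinds:
        return f"{major + 1}.0.0"
    if "minor" in kinds:
        return f"{major}.{minor + 1}.0"
    if "patch" in kinds:
        return f"{major}.{minor}.{patch + 1}"
    return version


if __name__ == "__main__":
    # Hypothetical commit titles for a single component.
    print(next_version("1.0.0", ["fix(c/driver_manager): handle missing symbol"]))
    print(next_version("1.0.0", ["feat(python): add DBAPI wrapper", "fix: typo"]))
```

With independent versioning, each component would run something like this over
its own commits only, so a bug fix in the driver manager and a new feature in a
driver would bump their versions separately.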

Separately, I'm setting up the Flight SQL driver now [3], which will give us 
actual Python bindings (this adds an optional runtime dependency from PyArrow 
to ADBC). After that, I would like to get back to the libpq driver [4], set up 
benchmarks, and start comparing it to the alternatives (pgeon, psycopg, 
etc.).

[1]: https://github.com/apache/arrow/pull/14079
[2]: https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/configuring-commit-squashing-for-pull-requests
[3]: https://github.com/apache/arrow/pull/14082

On Tue, Sep 13, 2022, at 15:12, David Li wrote:
> Ah, thanks for the clarification Neal!
>
> Jacob/Matt: I put up https://github.com/apache/arrow-adbc/pull/124 to 
> describe the convention but I wonder if we should partition components 
> more granularly than we have so far.
>
> On Mon, Sep 12, 2022, at 12:57, Neal Richardson wrote:
>> On Mon, Sep 12, 2022 at 12:44 PM David Li <lidav...@apache.org> wrote:
>>
>>> I like this idea. I would also like to set up some sort of automated ABI
>>> checker (the options I found were GPL/LGPL so I need to figure out how to
>>> proceed).
>>>
>>
>> You should be able to use GPL software in CI, that's no problem. You can
>> even depend on GPL software as long as it is "optional":
>> https://www.apache.org/legal/resolved.html#optional But this would not even
>> count as that since the ABI checker wouldn't be required to use the
>> software.
>>
>> Neal
>>
>>
>>>
>>> I can put up a PR later that formalizes these guidelines in
>>> CONTRIBUTING.md. It looks like there's a pre-commit hook for this sort of
>>> thing too, which'll let us enforce it in CI!
>>>
>>> On Mon, Sep 12, 2022, at 10:18, Matthew Topol wrote:
>>> > Automated semver would be ideal if we can do it.
>>> >
>>> > There are quite a lot of utilities that would automatically handle the
>>> > versioning if we're using conventional commits.
>>> >
>>> > On Mon, Sep 12 2022 at 02:26:15 PM +0200, Jacob Wujciak
>>> > <ja...@voltrondata.com.INVALID> wrote:
>>> >> +1 to independent, semver versioning for ADBC.
>>> >> I would propose we use conventional commit style [1] commit messages
>>> >> for the PR commits (I assume squash + merge) so we can automate the
>>> >> versioning or double-check manual versioning.
>>> >>
>>> >> [1]: <https://www.conventionalcommits.org/>
>>> >>
>>> >> On Thu, Sep 8, 2022 at 6:05 PM David Li <lidav...@apache.org> wrote:
>>> >>
>>> >>>  Thanks all, I've updated the header with the proposed versioning
>>> >>>  scheme.
>>> >>>
>>> >>>  At this point I believe the core definitions are ready. (Note that
>>> >>>  I'm explicitly punting on [1][2][3] here.) Absent further comments,
>>> >>>  I'd like to do the following:
>>> >>>
>>> >>>  - Start a vote on mirroring adbc.h to arrow/format, as well as adding
>>> >>>    docs/source/format/ADBC.rst that describes the header, the Java
>>> >>>    interface, the Go interface, and the versioning scheme (I will put
>>> >>>    up a PR beforehand)
>>> >>>  - Begin work on CI/packaging, with a release hopefully coinciding
>>> >>>    with Arrow 10.0.0
>>> >>>  - Begin work on changes to the main repository, also hopefully in
>>> >>>    time for 10.0.0 (moving the Flight SQL driver to be part of
>>> >>>    apache/arrow; exposing it in PyArrow; possibly also exposing Acero
>>> >>>    via ADBC)
>>> >>>
>>> >>>  [1]: <https://github.com/apache/arrow-adbc/issues/46>
>>> >>>  [2]: <https://github.com/apache/arrow-adbc/issues/55>
>>> >>>  [3]: <https://github.com/apache/arrow-adbc/issues/59>
>>> >>>
>>> >>>  On Sat, Sep 3, 2022, at 18:36, Matthew Topol wrote:
>>> >>>  > +1 from me on the strategy proposed by Kou.
>>> >>>  >
>>> >>>  > That would be my preference also. I agree it is preferable to be
>>> >>>  > versioned independently.
>>> >>>  >
>>> >>>  > --Matt
>>> >>>  >
>>> >>>  > On Sat, Sep 3, 2022, 6:24 PM Sutou Kouhei <k...@clear-code.com> wrote:
>>> >>>  >
>>> >>>  >> Hi,
>>> >>>  >>
>>> >>>  >> > Do we have a preference for versioning strategy? Should we
>>> >>>  >> > proceed in lockstep with the Arrow C++ library et al. and
>>> >>>  >> > release "ADBC 1.0.0" (the API standard) with "drivers
>>> >>>  >> > version 10.0.0", or use an independent versioning scheme?
>>> >>>  >> > (For example, release API standard and components at
>>> >>>  >> > "1.0.0". Then further releases of components that do not
>>> >>>  >> > change the spec would be "1.1", "1.2", ...; if/when we
>>> >>>  >> > change the spec, start over with "2.0", "2.1", ...)
>>> >>>  >>
>>> >>>  >> I like an independent versioning scheme. I assume that ADBC
>>> >>>  >> doesn't need backward-incompatible changes frequently. How
>>> >>>  >> about incrementing the major version only when ADBC needs
>>> >>>  >> a backward-incompatible change?
>>> >>>  >>
>>> >>>  >> e.g.:
>>> >>>  >>
>>> >>>  >>   1.  Release ADBC (the API standard) 1.0.0
>>> >>>  >>   2.  Release adbc_driver_manager 1.0.0
>>> >>>  >>   3.  Release adbc_driver_postgres 1.0.0
>>> >>>  >>   4.  Add a new feature to adbc_driver_postgres without
>>> >>>  >>       any backward incompatible changes
>>> >>>  >>   5.  Release adbc_driver_postgres 1.1.0
>>> >>>  >>   6.  Fix a bug in adbc_driver_manager without
>>> >>>  >>       any backward incompatible changes
>>> >>>  >>   7.  Release adbc_driver_manager 1.0.1
>>> >>>  >>   8.  Add a backward incompatible change to adbc_driver_manager
>>> >>>  >>   9.  Release adbc_driver_manager 2.0.0
>>> >>>  >>   10. Add a new feature to ADBC without any
>>> >>>  >>       backward incompatible changes
>>> >>>  >>   11. Release ADBC (the API standard) 1.1.0
>>> >>>  >>
>>> >>>  >>
>>> >>>  >> Thanks,
>>> >>>  >> --
>>> >>>  >> kou
>>> >>>  >>
>>> >>>  >> In <7b20d730-b85e-4818-b99e-3335c40c2...@www.fastmail.com>
>>> >>>  >>   "Re: [DISC] Improving Arrow's database support" on Thu, 01 Sep 2022 16:36:43 -0400,
>>> >>>  >>   "David Li" <lidav...@apache.org> wrote:
>>> >>>  >>
>>> >>>  >> > Following up here with some specific questions:
>>> >>>  >> >
>>> >>>  >> > Matt Topol added some Go definitions [1] (thanks!). I'd assume
>>> >>>  >> > we want to vote on those as well?
>>> >>>  >> >
>>> >>>  >> > How should the process work for Java/Go? For C/C++, I assume
>>> >>>  >> > we'd treat it like the C Data Interface and copy adbc.h to
>>> >>>  >> > format/ after a vote, and then vote on releases of components.
>>> >>>  >> > Or do we really only consider the C header as the 'format',
>>> >>>  >> > with the others being language-specific affordances?
>>> >>>  >> >
>>> >>>  >> > What about for Java and for Go? We could vote on and tag a
>>> >>>  >> > release for Go, and add a documentation page that links to the
>>> >>>  >> > Java/Go definitions at a specific revision (as the equivalent
>>> >>>  >> > 'format' definition for Java/Go)? Or would we vendor the entire
>>> >>>  >> > Java module/Go package as the 'format'?
>>> >>>  >> >
>>> >>>  >> > Do we have a preference for versioning strategy? Should we
>>> >>>  >> > proceed in lockstep with the Arrow C++ library et al. and
>>> >>>  >> > release "ADBC 1.0.0" (the API standard) with "drivers version
>>> >>>  >> > 10.0.0", or use an independent versioning scheme? (For example,
>>> >>>  >> > release API standard and components at "1.0.0". Then further
>>> >>>  >> > releases of components that do not change the spec would be
>>> >>>  >> > "1.1", "1.2", ...; if/when we change the spec, start over with
>>> >>>  >> > "2.0", "2.1", ...)
>>> >>>  >> >
>>> >>>  >> > [1]: <https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go>
>>> >>>  >> >
>>> >>>  >> > -David
>>> >>>  >> >
>>> >>>  >> > On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
>>> >>>  >> >> Hi,
>>> >>>  >> >>
>>> >>>  >> >> OK. I'll send pull requests for GLib and Ruby soon.
>>> >>>  >> >>
>>> >>>  >> >>> I'm curious if you have a particular use case in mind.
>>> >>>  >> >>
>>> >>>  >> >> I don't have any production-ready use case yet but I want to
>>> >>>  >> >> implement an Active Record adapter for ADBC. Active Record
>>> >>>  >> >> is the O/R mapper for Ruby on Rails. Implementing Web
>>> >>>  >> >> applications with Ruby on Rails is one of the major Ruby use
>>> >>>  >> >> cases, so providing an Active Record interface for ADBC will
>>> >>>  >> >> increase the number of Apache Arrow users in the Ruby
>>> >>>  >> >> community.
>>> >>>  >> >>
>>> >>>  >> >> NOTE: Generally, Ruby on Rails users don't process large
>>> >>>  >> >> data but they sometimes need to process large (medium?) data
>>> >>>  >> >> in a batch process. An Active Record adapter for ADBC may be
>>> >>>  >> >> useful for such use cases.
>>> >>>  >> >>
>>> >>>  >> >>> There's a little bit more API cleanup to do [1]. If you
>>> >>>  >> >>> have comments on that or anything else, I'd appreciate
>>> >>>  >> >>> them. Otherwise, pull requests would also be appreciated.
>>> >>>  >> >>
>>> >>>  >> >> OK. I'll open issues/pull requests when I find
>>> >>>  >> >> something. For now, I think that a "MODULE" library rather
>>> >>>  >> >> than a "SHARED" library, in CMake terminology [cmake], is
>>> >>>  >> >> better for driver modules. (I'll open an issue for this
>>> >>>  >> >> later.)
>>> >>>  >> >>
>>> >>>  >> >> [cmake]: <https://cmake.org/cmake/help/latest/command/add_library.html>
>>> >>>  >> >>
>>> >>>  >> >>
>>> >>>  >> >> Thanks,
>>> >>>  >> >> --
>>> >>>  >> >> kou
>>> >>>  >> >>
>>> >>>  >> >> In <e6380315-94aa-4dd1-8685-268edd597...@www.fastmail.com>
>>> >>>  >> >>   "Re: [DISC] Improving Arrow's database support" on Sat, 27 Aug 2022 15:28:56 -0400,
>>> >>>  >> >>   "David Li" <lidav...@apache.org> wrote:
>>> >>>  >> >>
>>> >>>  >> >>> I would be very happy to see GLib/Ruby bindings! I'm curious
>>> >>>  >> >>> if you have a particular use case in mind.
>>> >>>  >> >>>
>>> >>>  >> >>> There's a little bit more API cleanup to do [1]. If you have
>>> >>>  >> >>> comments on that or anything else, I'd appreciate them.
>>> >>>  >> >>> Otherwise, pull requests would also be appreciated.
>>> >>>  >> >>>
>>> >>>  >> >>> [1]: <https://github.com/apache/arrow-adbc/issues/79>
>>> >>>  >> >>>
>>> >>>  >> >>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
>>> >>>  >> >>>> Hi,
>>> >>>  >> >>>>
>>> >>>  >> >>>> Thanks for sharing the current status!
>>> >>>  >> >>>> I understand.
>>> >>>  >> >>>>
>>> >>>  >> >>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
>>> >>>  >> >>>> before we release the first version? (I want to use ADBC
>>> >>>  >> >>>> from Ruby.) Or should I wait for the first release? If I can
>>> >>>  >> >>>> work on it now, I'll open pull requests for it.
>>> >>>  >> >>>>
>>> >>>  >> >>>> Thanks,
>>> >>>  >> >>>> --
>>> >>>  >> >>>> kou
>>> >>>  >> >>>>
>>> >>>  >> >>>> In <8703efd9-51bd-4f91-b550-73830667d...@www.fastmail.com>
>>> >>>  >> >>>>   "Re: [DISC] Improving Arrow's database support" on Fri, 26 Aug 2022 11:03:26 -0400,
>>> >>>  >> >>>>   "David Li" <lidav...@apache.org> wrote:
>>> >>>  >> >>>>
>>> >>>  >> >>>>> Thank you Kou!
>>> >>>  >> >>>>>
>>> >>>  >> >>>>> At least initially, I don't think I'll be able to complete
>>> >>>  >> >>>>> the Dataset integration in time. So 10.0.0 probably won't
>>> >>>  >> >>>>> ship with a hard dependency. That said, I am hoping to have
>>> >>>  >> >>>>> PyArrow take an optional dependency (so Flight SQL can
>>> >>>  >> >>>>> finally be available from Python).
>>> >>>  >> >>>>>
>>> >>>  >> >>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>>> >>>  >> >>>>>> Hi,
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> As a maintainer of Linux packages, I want
>>> >>>  >> >>>>>> apache/arrow-adbc to be released before apache/arrow is
>>> >>>  >> >>>>>> released so that apache/arrow's .deb/.rpm can depend on
>>> >>>  >> >>>>>> apache/arrow-adbc's .deb/.rpm.
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>>> >>>  >> >>>>>> apache/arrow's .deb/.rpm needs to depend on
>>> >>>  >> >>>>>> apache/arrow-adbc's .deb/.rpm.)
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> We can add .deb/.rpm related files
>>> >>>  >> >>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
>>> >>>  >> >>>>>> apache/arrow-adbc to build .deb/.rpm for
>>> >>>  >> >>>>>> apache/arrow-adbc.
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> * <https://github.com/datafusion-contrib/datafusion-c/tree/main/package>
>>> >>>  >> >>>>>> * <https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml>
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> I can work on it in apache/arrow-adbc.
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> Thanks,
>>> >>>  >> >>>>>> --
>>> >>>  >> >>>>>> kou
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> In <5cbf2923-4fb4-4c5e-b11d-007209fdd...@www.fastmail.com>
>>> >>>  >> >>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug 2022 11:51:08 -0400,
>>> >>>  >> >>>>>>   "David Li" <lidav...@apache.org> wrote:
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry
>>> >>>  >> >>>>>>> for the wall of text that follows…)
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> These are the components:
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> - Core adbc.h header
>>> >>>  >> >>>>>>> - Driver manager for C/C++
>>> >>>  >> >>>>>>> - Flight SQL-based driver
>>> >>>  >> >>>>>>> - Postgres-based driver (WIP)
>>> >>>  >> >>>>>>> - SQLite-based driver (more of a testbed for me than an
>>> >>>  >> >>>>>>>   actual component - I don't think we'd actually
>>> >>>  >> >>>>>>>   distribute this)
>>> >>>  >> >>>>>>> - Java core interfaces
>>> >>>  >> >>>>>>> - Java driver manager
>>> >>>  >> >>>>>>> - Java JDBC-based driver
>>> >>>  >> >>>>>>> - Java Flight SQL-based driver
>>> >>>  >> >>>>>>> - Python driver manager
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The
>>> >>>  >> >>>>>>> Flight SQL drivers get moved to the main Arrow repo and
>>> >>>  >> >>>>>>> distributed as part of the regular Arrow releases.
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> For the rest of the components: they could be packaged
>>> >>>  >> >>>>>>> individually, but versioned and released together. Also,
>>> >>>  >> >>>>>>> each C/C++ driver probably needs a corresponding Python
>>> >>>  >> >>>>>>> package so Python users do not have to futz with shared
>>> >>>  >> >>>>>>> library configurations. (See [1].) So for instance,
>>> >>>  >> >>>>>>> installing PyArrow would also give you the Flight SQL
>>> >>>  >> >>>>>>> driver, and `pip install adbc_postgres` would get you the
>>> >>>  >> >>>>>>> Postgres-based driver.
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> That would mean setting up separate CI, release, etc.
>>> >>>  >> >>>>>>> (and eventually linking Crossbow & Conbench as well?).
>>> >>>  >> >>>>>>> That does mean duplication of effort, but the trade off
>>> >>>  >> >>>>>>> is avoiding bloating the main release process even
>>> >>>  >> >>>>>>> further. However, I'd like to hear from those closer to
>>> >>>  >> >>>>>>> the release process on this subject - if it would make
>>> >>>  >> >>>>>>> people's lives easier, we could merge everything into one
>>> >>>  >> >>>>>>> repo/process.
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> Integrations would be distributed as part of their
>>> >>>  >> >>>>>>> respective packages (e.g. Arrow Dataset would optionally
>>> >>>  >> >>>>>>> link to the driver manager). So the "part of Arrow
>>> >>>  >> >>>>>>> 10.0.0" aspect means having a stable interface for
>>> >>>  >> >>>>>>> adbc.h, and getting the Flight SQL drivers into the main
>>> >>>  >> >>>>>>> repo.
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> [1]: <https://github.com/apache/arrow-adbc/issues/53>
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>>> >>>  >> >>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>>> >>>  >> >>>>>>>> "David Li" <lidav...@apache.org> wrote:
>>> >>>  >> >>>>>>>>> Since it's been a while, I'd like to give an update.
>>> >>>  >> >>>>>>>>> There are also a few questions I have around
>>> >>>  >> >>>>>>>>> distribution.
>>> >>>  >> >>>>>>>>>
>>> >>>  >> >>>>>>>>> Currently:
>>> >>>  >> >>>>>>>>> - Supported in C, Java, and Python.
>>> >>>  >> >>>>>>>>> - For C/Python, there are basic drivers wrapping Flight
>>> >>>  >> >>>>>>>>>   SQL and SQLite, with a draft of a libpq (Postgres)
>>> >>>  >> >>>>>>>>>   driver (using nanoarrow).
>>> >>>  >> >>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight
>>> >>>  >> >>>>>>>>>   SQL.
>>> >>>  >> >>>>>>>>> - For Python, there are low-level bindings to the C
>>> >>>  >> >>>>>>>>>   API, and the DBAPI interface on top of that (+ a few
>>> >>>  >> >>>>>>>>>   extension methods resembling DuckDB/Turbodbc).
>>> >>>  >> >>>>>>>>>
>>> >>>  >> >>>>>>>>> There are drafts of integration with Ibis [1], DBI (R),
>>> >>>  >> >>>>>>>>> and DuckDB. (I'd like to thank Hannes and Kirill for
>>> >>>  >> >>>>>>>>> their comments, as well as Antoine, Dewey, and Matt
>>> >>>  >> >>>>>>>>> here.)
>>> >>>  >> >>>>>>>>>
>>> >>>  >> >>>>>>>>> I'd like to have this as part of 10.0.0 in some
>>> >>>  >> >>>>>>>>> fashion. However, I'm not sure how we would like to
>>> >>>  >> >>>>>>>>> handle packaging and distribution. In particular, there
>>> >>>  >> >>>>>>>>> are several sub-components for each language (the driver
>>> >>>  >> >>>>>>>>> manager + the drivers), increasing the work. Any
>>> >>>  >> >>>>>>>>> thoughts here?
>>> >>>  >> >>>>>>>>
>>> >>>  >> >>>>>>>> Sorry, forgot to answer here. But I think your question
>>> >>>  >> >>>>>>>> is too broadly formulated. It probably deserves a
>>> >>>  >> >>>>>>>> case-by-case discussion, IMHO.
>>> >>>  >> >>>>>>>>
>>> >>>  >> >>>>>>>>> I'm also wondering how we want to handle this in terms
>>> >>>  >> >>>>>>>>> of specification - I assume we'd consider the core
>>> >>>  >> >>>>>>>>> header file/Java interfaces a spec like the C Data
>>> >>>  >> >>>>>>>>> Interface/Flight RPC, and vote on them/mirror them into
>>> >>>  >> >>>>>>>>> the format/ directory?
>>> >>>  >> >>>>>>>>
>>> >>>  >> >>>>>>>> That sounds like the right way to me indeed.
>>> >>>  >> >>>>>>>>
>>> >>>  >> >>>>>>>> Regards
>>> >>>  >> >>>>>>>>
>>> >>>  >> >>>>>>>> Antoine.
>>> >>>  >>
>>> >>>
>>>
