Re: [DISC] Improving Arrow's database support

Jacob Wujciak Mon, 12 Sep 2022 05:26:37 -0700

+1 to independent, semver versioning for ADBC.
I would propose we use Conventional Commits style [1] messages for the PR
commits (I assume squash + merge) so we can automate the versioning, or at
least double-check manual versioning.
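
As a rough illustration of what that automation could look like, here is a
minimal sketch in Python. It assumes the squashed PR titles follow
Conventional Commits, and it maps `fix` to a patch bump, `feat` to a minor
bump, and a `!` marker or a BREAKING CHANGE footer to a major bump. The
function names, the regular expression, and the sample commit messages are
all hypothetical; this is not an existing tool in apache/arrow-adbc.

```python
import re

# Header of a Conventional Commits message: type, optional (scope),
# optional "!" marking a breaking change, then ": description".
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<bang>!)?: .+")

# Bump levels ordered from weakest to strongest.
ORDER = ["none", "patch", "minor", "major"]


def bump_for(message):
    """Return the bump ('none', 'patch', 'minor', 'major') one commit implies."""
    lines = message.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        return "none"  # not a conventional commit; leave for manual review
    breaking_footer = any(
        line.startswith(("BREAKING CHANGE:", "BREAKING-CHANGE:"))
        for line in lines[1:]
    )
    if match.group("bang") or breaking_footer:
        return "major"
    if match.group("type") == "feat":
        return "minor"
    if match.group("type") == "fix":
        return "patch"
    return "none"


def next_version(current, messages):
    """Apply the strongest implied bump to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in current.split("."))
    bump = max((bump_for(m) for m in messages), key=ORDER.index, default="none")
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current


if __name__ == "__main__":
    # Hypothetical squashed commits since the last release of one component.
    commits = ["fix(c): handle empty result set", "feat(python): add a helper"]
    print(next_version("1.0.0", commits))  # prints 1.1.0
```

Commits that do not parse yield no bump here, so the computed version is
best treated as a check against the manually chosen one rather than a
replacement for it.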


[1]: https://www.conventionalcommits.org/

On Thu, Sep 8, 2022 at 6:05 PM David Li <[email protected]> wrote:

> Thanks all, I've updated the header with the proposed versioning scheme.
>
> At this point I believe the core definitions are ready. (Note that I'm
> explicitly punting on [1][2][3] here.) Absent further comments, I'd like to
> do the following:
>
> - Start a vote on mirroring adbc.h to arrow/format, as well as adding
> docs/source/format/ADBC.rst that describes the header, the Java interface,
> the Go interface, and the versioning scheme (I will put up a PR beforehand)
> - Begin work on CI/packaging, with a release hopefully coinciding with
> Arrow 10.0.0
> - Begin work on changes to the main repository, also hopefully in time for
> 10.0.0 (moving the Flight SQL driver to be part of apache/arrow; exposing
> it in PyArrow; possibly also exposing Acero via ADBC)
>
> [1]: https://github.com/apache/arrow-adbc/issues/46
> [2]: https://github.com/apache/arrow-adbc/issues/55
> [3]: https://github.com/apache/arrow-adbc/issues/59
>
> On Sat, Sep 3, 2022, at 18:36, Matthew Topol wrote:
> > +1 from me on the strategy proposed by Kou.
> >
> > That would be my preference also. I agree it is preferable to be
> > versioned independently.
> >
> > --Matt
> >
> > On Sat, Sep 3, 2022, 6:24 PM Sutou Kouhei <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> > Do we have a preference for versioning strategy? Should we
> >> > proceed in lockstep with the Arrow C++ library et al. and
> >> > release "ADBC 1.0.0" (the API standard) with "drivers
> >> > version 10.0.0", or use an independent versioning scheme?
> >> > (For example, release API standard and components at
> >> > "1.0.0". Then further releases of components that do not
> >> > change the spec would be "1.1", "1.2", ...; if/when we
> >> > change the spec, start over with "2.0", "2.1", ...)
> >>
> >> I like an independent versioning scheme. I assume that ADBC
> >> doesn't need backward incompatible changes frequently. How
> >> about incrementing the major version only when ADBC needs
> >> a backward incompatible change?
> >>
> >> e.g.:
> >>
> >>   1.  Release ADBC (the API standard) 1.0.0
> >>   2.  Release adbc_driver_manager 1.0.0
> >>   3.  Release adbc_driver_postgres 1.0.0
> >>   4.  Add a new feature to adbc_driver_postgres without
> >>       any backward incompatible changes
> >>   5.  Release adbc_driver_postgres 1.1.0
> >>   6.  Fix a bug in adbc_driver_manager without
> >>       any backward incompatible changes
> >>   7.  Release adbc_driver_manager 1.0.1
> >>   8.  Add a backward incompatible change to adbc_driver_manager
> >>   9.  Release adbc_driver_manager 2.0.0
> >>   10. Add a new feature to ADBC without any
> >>       backward incompatible changes
> >>   11. Release ADBC (the API standard) 1.1.0
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <[email protected]>
> >>   "Re: [DISC] Improving Arrow's database support" on Thu, 01 Sep 2022 16:36:43 -0400,
> >>   "David Li" <[email protected]> wrote:
> >>
> >> > Following up here with some specific questions:
> >> >
> >> > Matt Topol added some Go definitions [1] (thanks!) I'd assume we want to
> >> > vote on those as well?
> >> >
> >> > How should the process work for Java/Go? For C/C++, I assume we'd treat
> >> > it like the C Data Interface and copy adbc.h to format/ after a vote, and
> >> > then vote on releases of components. Or do we really only consider the C
> >> > header as the 'format', with the others being language-specific
> >> > affordances?
> >> >
> >> > What about for Java and for Go? We could vote on and tag a release for
> >> > Go, and add a documentation page that links to the Java/Go definitions at
> >> > a specific revision (as the equivalent 'format' definition for Java/Go)?
> >> > Or would we vendor the entire Java module/Go package as the 'format'?
> >> >
> >> > Do we have a preference for versioning strategy? Should we proceed in
> >> > lockstep with the Arrow C++ library et al. and release "ADBC 1.0.0" (the
> >> > API standard) with "drivers version 10.0.0", or use an independent
> >> > versioning scheme? (For example, release API standard and components at
> >> > "1.0.0". Then further releases of components that do not change the spec
> >> > would be "1.1", "1.2", ...; if/when we change the spec, start over with
> >> > "2.0", "2.1", ...)
> >> >
> >> > [1]: https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go
> >> >
> >> > -David
> >> >
> >> > On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
> >> >> Hi,
> >> >>
> >> >> OK. I'll send pull requests for GLib and Ruby soon.
> >> >>
> >> >>> I'm curious if you have a particular use case in mind.
> >> >>
> >> >> I don't have a production-ready use case yet, but I want to
> >> >> implement an Active Record adapter for ADBC. Active Record
> >> >> is the O/R mapper for Ruby on Rails. Implementing Web
> >> >> applications with Ruby on Rails is one of the major Ruby use
> >> >> cases, so providing an Active Record interface for ADBC will
> >> >> increase the number of Apache Arrow users in the Ruby community.
> >> >>
> >> >> NOTE: Generally, Ruby on Rails users don't process large
> >> >> data, but they sometimes need to process large (medium?) data
> >> >> in a batch process. An Active Record adapter for ADBC may be
> >> >> useful for such use cases.
> >> >>
> >> >>> There's a little bit more API cleanup to do [1]. If you
> >> >>> have comments on that or anything else, I'd appreciate
> >> >>> them. Otherwise, pull requests would also be appreciated.
> >> >>
> >> >> OK. I'll open issues/pull requests when I find
> >> >> something. For now, I think that a "MODULE" type library
> >> >> rather than a "SHARED" type library, in CMake terminology
> >> >> [cmake], is better for driver modules. (I'll open an issue
> >> >> for this later.)
> >> >>
> >> >> [cmake]: https://cmake.org/cmake/help/latest/command/add_library.html
> >> >>
> >> >>
> >> >> Thanks,
> >> >> --
> >> >> kou
> >> >>
> >> >> In <[email protected]>
> >> >>   "Re: [DISC] Improving Arrow's database support" on Sat, 27 Aug 2022 15:28:56 -0400,
> >> >>   "David Li" <[email protected]> wrote:
> >> >>
> >> >>> I would be very happy to see GLib/Ruby bindings! I'm curious if you
> >> >>> have a particular use case in mind.
> >> >>>
> >> >>> There's a little bit more API cleanup to do [1]. If you have comments
> >> >>> on that or anything else, I'd appreciate them. Otherwise, pull requests
> >> >>> would also be appreciated.
> >> >>>
> >> >>> [1]: https://github.com/apache/arrow-adbc/issues/79
> >> >>>
> >> >>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
> >> >>>> Hi,
> >> >>>>
> >> >>>> Thanks for sharing the current status!
> >> >>>> I understand.
> >> >>>>
> >> >>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
> >> >>>> before we release the first version? (I want to use ADBC
> >> >>>> from Ruby.) Or should I wait for the first release? If I can
> >> >>>> work on it now, I'll open pull requests for it.
> >> >>>>
> >> >>>> Thanks,
> >> >>>> --
> >> >>>> kou
> >> >>>>
> >> >>>> In <[email protected]>
> >> >>>>   "Re: [DISC] Improving Arrow's database support" on Fri, 26 Aug 2022 11:03:26 -0400,
> >> >>>>   "David Li" <[email protected]> wrote:
> >> >>>>
> >> >>>>> Thank you Kou!
> >> >>>>>
> >> >>>>> At least initially, I don't think I'll be able to complete the
> >> >>>>> Dataset integration in time. So 10.0.0 probably won't ship with a hard
> >> >>>>> dependency. That said I am hoping to have PyArrow take an optional
> >> >>>>> dependency (so Flight SQL can finally be available from Python).
> >> >>>>>
> >> >>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
> >> >>>>>> Hi,
> >> >>>>>>
> >> >>>>>> As a maintainer of Linux packages, I want apache/arrow-adbc
> >> >>>>>> to be released before apache/arrow is released so that
> >> >>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
> >> >>>>>> .deb/.rpm.
> >> >>>>>>
> >> >>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
> >> >>>>>> apache/arrow's .deb/.rpm needs to depend on
> >> >>>>>> apache/arrow-adbc's .deb/.rpm.)
> >> >>>>>>
> >> >>>>>> We can add .deb/.rpm related files
> >> >>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
> >> >>>>>> apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.
> >> >>>>>>
> >> >>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
> >> >>>>>>
> >> >>>>>> * https://github.com/datafusion-contrib/datafusion-c/tree/main/package
> >> >>>>>> * https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
> >> >>>>>>
> >> >>>>>> I can work on it in apache/arrow-adbc.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Thanks,
> >> >>>>>> --
> >> >>>>>> kou
> >> >>>>>>
> >> >>>>>> In <[email protected]>
> >> >>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug 2022 11:51:08 -0400,
> >> >>>>>>   "David Li" <[email protected]> wrote:
> >> >>>>>>
> >> >>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the
> >> >>>>>>> wall of text that follows…)
> >> >>>>>>>
> >> >>>>>>> These are the components:
> >> >>>>>>>
> >> >>>>>>> - Core adbc.h header
> >> >>>>>>> - Driver manager for C/C++
> >> >>>>>>> - Flight SQL-based driver
> >> >>>>>>> - Postgres-based driver (WIP)
> >> >>>>>>> - SQLite-based driver (more of a testbed for me than an actual
> >> >>>>>>>   component - I don't think we'd actually distribute this)
> >> >>>>>>> - Java core interfaces
> >> >>>>>>> - Java driver manager
> >> >>>>>>> - Java JDBC-based driver
> >> >>>>>>> - Java Flight SQL-based driver
> >> >>>>>>> - Python driver manager
> >> >>>>>>>
> >> >>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL
> >> >>>>>>> drivers get moved to the main Arrow repo and distributed as part of the
> >> >>>>>>> regular Arrow releases.
> >> >>>>>>>
> >> >>>>>>> For the rest of the components: they could be packaged
> >> >>>>>>> individually, but versioned and released together. Also, each C/C++
> >> >>>>>>> driver probably needs a corresponding Python package so Python users do
> >> >>>>>>> not have to futz with shared library configurations. (See [1].) So for
> >> >>>>>>> instance, installing PyArrow would also give you the Flight SQL driver,
> >> >>>>>>> and `pip install adbc_postgres` would get you the Postgres-based driver.
> >> >>>>>>>
> >> >>>>>>> That would mean setting up separate CI, release, etc. (and
> >> >>>>>>> eventually linking Crossbow & Conbench as well?). That does mean
> >> >>>>>>> duplication of effort, but the trade off is avoiding bloating the main
> >> >>>>>>> release process even further. However, I'd like to hear from those
> >> >>>>>>> closer to the release process on this subject - if it would make
> >> >>>>>>> people's lives easier, we could merge everything into one repo/process.
> >> >>>>>>>
> >> >>>>>>> Integrations would be distributed as part of their respective
> >> >>>>>>> packages (e.g. Arrow Dataset would optionally link to the driver
> >> >>>>>>> manager). So the "part of Arrow 10.0.0" aspect means having a stable
> >> >>>>>>> interface for adbc.h, and getting the Flight SQL drivers into the main
> >> >>>>>>> repo.
> >> >>>>>>>
> >> >>>>>>> [1]: https://github.com/apache/arrow-adbc/issues/53
> >> >>>>>>>
> >> >>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
> >> >>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
> >> >>>>>>>> "David Li" <[email protected]> wrote:
> >> >>>>>>>>> Since it's been a while, I'd like to give an update. There are
> >> >>>>>>>>> also a few questions I have around distribution.
> >> >>>>>>>>>
> >> >>>>>>>>> Currently:
> >> >>>>>>>>> - Supported in C, Java, and Python.
> >> >>>>>>>>> - For C/Python, there are basic drivers wrapping Flight SQL and
> >> >>>>>>>>>   SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
> >> >>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
> >> >>>>>>>>> - For Python, there are low-level bindings to the C API, and the
> >> >>>>>>>>>   DBAPI interface on top of that (+a few extension methods resembling
> >> >>>>>>>>>   DuckDB/Turbodbc).
> >> >>>>>>>>>
> >> >>>>>>>>> There are drafts of integration with Ibis [1], DBI (R), and
> >> >>>>>>>>> DuckDB. (I'd like to thank Hannes and Kirill for their comments, as
> >> >>>>>>>>> well as Antoine, Dewey, and Matt here.)
> >> >>>>>>>>>
> >> >>>>>>>>> I'd like to have this as part of 10.0.0 in some fashion.
> >> >>>>>>>>> However, I'm not sure how we would like to handle packaging and
> >> >>>>>>>>> distribution. In particular, there are several sub-components for each
> >> >>>>>>>>> language (the driver manager + the drivers), increasing the work. Any
> >> >>>>>>>>> thoughts here?
> >> >>>>>>>>
> >> >>>>>>>> Sorry, forgot to answer here. But I think your question is too
> >> >>>>>>>> broadly formulated. It probably deserves a case-by-case discussion,
> >> >>>>>>>> IMHO.
> >> >>>>>>>>
> >> >>>>>>>>> I'm also wondering how we want to handle this in terms of
> >> >>>>>>>>> specification - I assume we'd consider the core header file/Java
> >> >>>>>>>>> interfaces a spec like the C Data Interface/Flight RPC, and vote on
> >> >>>>>>>>> them/mirror them into the format/ directory?
> >> >>>>>>>>
> >> >>>>>>>> That sounds like the right way to me indeed.
> >> >>>>>>>>
> >> >>>>>>>> Regards
> >> >>>>>>>>
> >> >>>>>>>> Antoine.
> >>
>
