I don't have a strong opinion here, but I have a question and a comment:

Are there any implications from a project governance perspective of
packaging Parquet and Arrow into a single shared library?

As for the comment: I'm a big +1 on trying to tease apart the circular
dependencies between Parquet/Arrow (and any other modules).  As noted
above, I think this boils down to isolating the IO and Buffer data
structures into one library and having the Arrow Array data structures in
their own separate libraries.
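
To make the dependency question concrete, here is a minimal C++ sketch
(an illustration only, not anything from the Arrow build system) that checks
whether a set of library edges contains a cycle, using Kahn's algorithm. The
edge list is the one Wes wrote in the quoted message below; `Edge` and
`HasCycle` are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// One directed edge between two libraries, written as in the thread.
using Edge = std::pair<std::string, std::string>;

// Returns true if the directed graph described by `edges` contains a cycle.
// Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
// Any node that can never be removed is on, or reachable from, a cycle.
bool HasCycle(const std::vector<Edge>& edges) {
  std::map<std::string, std::vector<std::string>> out;
  std::map<std::string, int> in_degree;
  for (const auto& e : edges) {
    out[e.first].push_back(e.second);
    in_degree[e.first];  // make sure the source node is registered
    ++in_degree[e.second];
  }

  std::queue<std::string> ready;
  for (const auto& kv : in_degree) {
    if (kv.second == 0) ready.push(kv.first);
  }

  std::size_t visited = 0;
  while (!ready.empty()) {
    const std::string node = ready.front();
    ready.pop();
    ++visited;
    for (const auto& next : out[node]) {
      if (--in_degree[next] == 0) ready.push(next);
    }
  }
  return visited != in_degree.size();
}

// The three edges listed in the quoted message.
const std::vector<Edge> kThreadEdges = {
    {"libarrow", "libparquet"},
    {"libarrow", "libarrow_dataset"},
    {"libparquet", "libarrow_dataset"},
};
```

`HasCycle(kThreadEdges)` returns false, which matches Kou's point that the
graph is not circular today. Adding a back edge such as
libarrow_dataset -> libarrow would make it return true.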

Thanks,
Micah

On Mon, Sep 16, 2019 at 7:35 PM Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
>
> If this is circular, it's a problem. But this isn't circular
> for now.
>
> I think that we can use libarrow as the fundamental shared
> library to provide a common implementation like [1] if we need
> to provide a common implementation for templates. (I think that
> we don't currently provide a common implementation for templates.)
>
> [1]
> https://github.com/apache/arrow/pull/5221/commits/e88b2579f04451d741eeddcb6697914bcc1019a6
>
> Anyway, I'm not strongly opposed to this idea. If we choose
> the single shared library approach, Linux packages, GLib bindings
> and Ruby bindings can follow the change.
>
>
> Thanks,
> --
> kou
>
> In <cajpuwmdwencjpbw+hrswaojfez7e_yci-fg2d3lwgvncf45...@mail.gmail.com>
>   "Re: [DISCUSS][C++] Rethinking our current C++ shared library (.so /
> .dll) approach" on Thu, 12 Sep 2019 13:23:01 -0500,
>   Wes McKinney <wesmck...@gmail.com> wrote:
>
> > One thing I forgot to mention:
> >
> > One of the things driving the creation of new shared libraries is
> > interdependencies. For example:
> >
> > libarrow -> libparquet
> > libarrow -> libarrow_dataset
> > libparquet -> libarrow_dataset
> >
> > With the modular LLVM-like approach this issue goes away.
> >
> > On Thu, Sep 12, 2019 at 1:16 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >> I forgot to add the link to the LLVM library listing
> >>
> >> https://gist.github.com/wesm/d13c2844db0c19477e8ee5c95e36a0dc
> >>
> >> On Thu, Sep 12, 2019 at 1:14 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >
> >> > hi folks,
> >> >
> >> > I wanted to share some concerns that I have about our current
> >> > trajectory with regards to producing shared libraries from the Arrow
> >> > build system.
> >> >
> >> > Currently, a comprehensive build produces many shared libraries:
> >> >
> >> > * libarrow
> >> > * libarrow_dataset
> >> > * libarrow_flight
> >> > * libarrow_python
> >> > * libgandiva
> >> > * libparquet
> >> > * libplasma
> >> >
> >> > There are some others. There are a number of problems with the
> >> > current approach:
> >> >
> >> > * Each DLL needs its own set of "visibility" macros to control the use
> >> > of __declspec(dllimport/dllexport) on Windows, which is necessary to
> >> > instruct the import or export of symbols between DLLs on Windows. See
> >> > e.g.
> >> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/visibility.h
> >> >
> >> > * Templates instantiated in one DLL may cause a violation of the One
> >> > Definition Rule during linking (we lost at least a day of work time
> >> > collectively to issues around this in ARROW-6244). It is good to be
> >> > able to share common template interfaces in general
> >> >
> >> > * Statically-linked dependencies in one shared lib may need to be
> >> > statically linked into another library. For example, libgandiva
> >> > statically links parts of LLVM, but we will likely have some other
> >> > code that makes use of LLVM for other purposes (it has been discussed
> >> > in the context of Avro parsing)
> >> >
> >> > Overall, my preferred solution to these issues is to move to a similar
> >> > approach to what the LLVM project does. To help understand, let me
> >> > have you first look at the libraries that come from the llvm-7-dev
> >> > package on Ubuntu
> >> >
> >> > Here we have a collection of static "module" libraries that implement
> >> > different parts of the LLVM platform. Finally, a _single_ shared
> >> > library libLLVM-7.so is produced.
> >> >
> >> > I think we should do the same thing in Apache Arrow. So we only ever
> >> > will produce a single shared library from the build. We can
> >> > additionally make the "name" of this shared library configurable to
> >> > suit different needs. For example, the default name could be simply
> >> > "libarrow.so" or something. But if someone wants to produce a
> >> > barebones Parquet shared library they can override the name to create
> >> > a "libparquet.so" that contains only the "libarrow_core.a" and
> >> > "libarrow_io.a" symbols needed for reading Parquet files.
> >> >
> >> > This would have additional benefits:
> >> >
> >> > * Use the same visibility macros for all exported C++ symbols, rather
> >> > than having to define DLL-specific visibility
> >> >
> >> > * Improved modularization of builds and linking for third party users,
> >> > similar to the way that LLVM's modular linking works, see the way that
> >> > Gandiva requests specific components from LLVM to use for static
> >> > linking
> >> > https://github.com/apache/arrow/blob/master/cpp/cmake_modules/FindLLVM.cmake#L53
> >> >
> >> > * Net simpler linking and deployment. Only one shared library to deal
> >> > with
> >> >
> >> > There are some drawbacks, however:
> >> >
> >> > * Our C++ Linux packaging approach would need to be changed to be more
> >> > LLVM-like (a single .deb/.yum package containing the C++ platform
> >> > rather than many packages as now)
> >> >
> >> > Interested to hear from other C++ developers.
> >> >
> >> > Thanks
> >> > Wes
>
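
For readers unfamiliar with the per-DLL "visibility" macros mentioned in the
quoted message, the following is a minimal sketch of what such a header
typically looks like. It is not copied from Arrow's visibility.h; the names
`MYLIB_EXPORT`, `MYLIB_EXPORTING`, `MYLIB_STATIC` and `AddOne` are
hypothetical. With several DLLs, each one needs its own copy of this block
with its own macro names, which is the duplication a single shared library
would avoid.

```cpp
// Hypothetical visibility header for one library ("mylib").
#if defined(_WIN32) || defined(__CYGWIN__)
#  if defined(MYLIB_STATIC)
     // Static build: no import/export decoration is needed.
#    define MYLIB_EXPORT
#  elif defined(MYLIB_EXPORTING)
     // Defined only while compiling the DLL itself.
#    define MYLIB_EXPORT __declspec(dllexport)
#  else
     // Consumers of the DLL import the symbols.
#    define MYLIB_EXPORT __declspec(dllimport)
#  endif
#else
   // GCC/Clang: mark the symbol visible even under -fvisibility=hidden.
#  define MYLIB_EXPORT __attribute__((visibility("default")))
#endif

// A public function decorated with the library's macro. It is inline here
// only so that the sketch compiles as a single file.
MYLIB_EXPORT inline int AddOne(int x) { return x + 1; }
```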
