Hello,

I'm actually against this proposal.

My main concern at the moment is that Arrow C++/Python is growing into a really 
heavy tool where you always have to bring along all the baggage, even when 
you're only using a small part of it. This makes it harder to use Arrow in 
projects because:

* Sheer size: the more dependencies the full build has, the larger the 
installable grows.
* Having a large number of dependencies also means that you will need to take 
care of security scanning of all of these in production settings. Even when 
you're not using the parts, you will need to check for version updates, correct 
licenses and origin of the dependencies. Having a more modular build is much 
simpler than mastering the art of convincing corporate IT.
* Dependencies of third-party libraries become less transparent. When a 
library depends only on one large libarrow.so and fails at load time with a 
missing symbol error, a user is confused and might think that the Arrow 
installation is corrupt. If the error instead reports that libarrow_flight.so 
is missing, the user is much more aware that the local build is one without 
Flight being built.
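
To illustrate the last point, here is a minimal Python sketch of how a 
downstream project could check for optional components when each one ships as 
its own shared library. It only uses ctypes.util.find_library from the standard 
library; the component names in COMPONENTS and the helper missing_components 
are assumptions made up for this example, not part of any Arrow API, and the 
real set of libraries depends on how Arrow was built.

```python
import ctypes.util

# Hypothetical list of component libraries a downstream project relies on.
COMPONENTS = ["arrow", "arrow_flight", "parquet"]


def missing_components(names, finder=ctypes.util.find_library):
    """Return the names for which the finder cannot locate a shared library.

    `finder` defaults to ctypes.util.find_library, which returns None when
    no matching library is found on the system.
    """
    return [name for name in names if finder(name) is None]
```

Calling missing_components(COMPONENTS) then gives a list of named components 
that are absent, which can be turned into a message such as "built without 
Flight". With a single libarrow.so there is no comparable per-component check; 
the first sign of a missing feature is an unresolved symbol.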

I would actually like to see the pyarrow packages split up into several 
packages in the future; making the C++ part a single shared object would 
hinder this considerably. I don't have the resources to move forward with this 
now, but as I know that I will need it, I intend to implement it at some point.

Uwe

On Tue, Sep 17, 2019, at 6:22 AM, Micah Kornfield wrote:
> I don't have a strong opinion here, but had a question and comment:
> 
> Are there any implications from a project governance perspective of
> packaging Parquet and Arrow into a single shared library?
> 
> As a comment, I'm a big +1 on trying to tease apart the circular
> dependencies between Parquet/Arrow (and any other modules).  As noted
> above, I think this boils down to isolating IO and Buffer data structures
> into 1 library and having the Arrow Array data structures in their own
> separate libraries.
> 
> Thanks,
> Micah
> 
> On Mon, Sep 16, 2019 at 7:35 PM Sutou Kouhei <k...@clear-code.com> wrote:
> 
> > Hi,
> >
> > If this is circular, it's a problem. But this isn't circular
> > for now.
> >
> > I think that we can use libarrow as the fundamental shared
> > library to provide common implementations like [1] if we need
> > to provide common implementations for templates. (I think that
> > we don't provide common implementations for templates.)
> >
> > [1]
> > https://github.com/apache/arrow/pull/5221/commits/e88b2579f04451d741eeddcb6697914bcc1019a6
> >
> > Anyway, I'm not strongly opposed to this idea. If we choose
> > one shared library approach, Linux packages, GLib bindings
> > and Ruby bindings can follow the change.
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In <cajpuwmdwencjpbw+hrswaojfez7e_yci-fg2d3lwgvncf45...@mail.gmail.com>
> >   "Re: [DISCUSS][C++] Rethinking our current C++ shared library (.so /
> > .dll) approach" on Thu, 12 Sep 2019 13:23:01 -0500,
> >   Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > One thing I forgot to mention:
> > >
> > > One of the things driving the creation of new shared libraries is
> > > interdependencies. For example:
> > >
> > > libarrow -> libparquet
> > > libarrow -> libarrow_dataset
> > > libparquet -> libarrow_dataset
> > >
> > > With the modular LLVM-like approach this issue goes away.
> > >
> > > On Thu, Sep 12, 2019 at 1:16 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >>
> > >> I forgot to add the link to the LLVM library listing
> > >>
> > >> https://gist.github.com/wesm/d13c2844db0c19477e8ee5c95e36a0dc
> > >>
> > >> On Thu, Sep 12, 2019 at 1:14 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >> >
> > >> > hi folks,
> > >> >
> > >> > I wanted to share some concerns that I have about our current
> > >> > trajectory with regards to producing shared libraries from the Arrow
> > >> > build system.
> > >> >
> > >> > Currently, a comprehensive build produces many shared libraries:
> > >> >
> > >> > * libarrow
> > >> > * libarrow_dataset
> > >> > * libarrow_flight
> > >> > * libarrow_python
> > >> > * libgandiva
> > >> > * libparquet
> > >> > * libplasma
> > >> >
> > >> > There are some others. There are a number of problems with the
> > >> > current approach:
> > >> >
> > >> > * Each DLL needs its own set of "visibility" macros to control the use
> > >> > of __declspec(dllimport/dllexport) on Windows, which is necessary to
> > >> > instruct the import or export of symbols between DLLs on Windows. See
> > >> > e.g.
> > >> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/visibility.h
> > >> >
> > >> > * Templates instantiated in one DLL may cause a violation of the One
> > >> > Definition Rule during linking (we lost at least a day of work time
> > >> > collectively to issues around this in ARROW-6244). It is good to be
> > >> > able to share common template interfaces in general
> > >> >
> > >> > * Statically-linked dependencies in one shared lib may need to be
> > >> > statically linked into another library. For example, libgandiva
> > >> > statically links parts of LLVM, but we will likely have some other
> > >> > code that makes use of LLVM for other purposes (it has been discussed
> > >> > in the context of Avro parsing)
> > >> >
> > >> > Overall, my preferred solution to these issues is to move to a similar
> > >> > approach to what the LLVM project does. To help understand, let me
> > >> > have you first look at the libraries that come from the llvm-7-dev
> > >> > package on Ubuntu
> > >> >
> > >> > Here we have a collection of static "module" libraries that implement
> > >> > different parts of the LLVM platform. Finally, a _single_ shared
> > >> > library libLLVM-7.so is produced.
> > >> >
> > >> > I think we should do the same thing in Apache Arrow. So we only ever
> > >> > will produce a single shared library from the build. We can
> > >> > additionally make the "name" of this shared library configurable to
> > >> > suit different needs. For example, the default name could be simply
> > >> > "libarrow.so" or something. But if someone wants to produce a
> > >> > barebones Parquet shared library they can override the name to create
> > >> > a "libparquet.so" that contains only the "libarrow_core.a" and
> > >> > "libarrow_io.a" symbols needed for reading Parquet files.
> > >> >
> > >> > This would have additional benefits:
> > >> >
> > >> > * Use the same visibility macros for all exported C++ symbols, rather
> > >> > than having to define DLL-specific visibility
> > >> >
> > >> > * Improved modularization of builds and linking for third party users,
> > >> > similar to the way that LLVM's modular linking works, see the way that
> > >> > Gandiva requests specific components from LLVM to use for static
> > >> > linking
> > >> > https://github.com/apache/arrow/blob/master/cpp/cmake_modules/FindLLVM.cmake#L53
> > >> >
> > >> > * Net simpler linking and deployment. Only one shared library to deal
> > >> > with
> > >> >
> > >> > There are some drawbacks, however:
> > >> >
> > >> > * Our C++ Linux packaging approach would need to be changed to be more
> > >> > LLVM-like (a single .deb/.yum package containing the C++ platform
> > >> > rather than many packages as now)
> > >> >
> > >> > Interested to hear from other C++ developers.
> > >> >
> > >> > Thanks
> > >> > Wes
> >
>
