I agree with Uwe that becoming more monolithic than we already are may become a big PR problem at some point.
Regards

Antoine.


On 17/09/2019 at 09:41, Uwe L. Korn wrote:
> Hello,
>
> I'm actually against this proposal.
>
> My main concern at the moment is that Arrow C++/Python is growing into a really
> heavy tool where you always have to bring along all the baggage even when you're
> only using a small part of it. This is a problem which makes it harder to use
> Arrow in projects because:
>
> * Simply the sheer size: the more dependencies the full build has, the
> larger the installable grows.
> * Having a large number of dependencies also means that you will need to take
> care of security scanning of all of these in production settings. Even when
> you're not using the parts, you will need to check for version updates,
> correct licenses and origin of the dependencies. Having a more modular build is
> much simpler than mastering the art of convincing corporate IT.
> * Defining dependencies from third-party libraries gets less transparent.
> When a library depends just on a large libarrow.so and fails with a missing
> symbol error, a user is confused and might think that the Arrow installation
> is corrupt, whereas if the error reports that libarrow_flight.so is missing,
> they are much more aware that their local build is one without Flight being built.
>
> I would actually like to see the pyarrow packages split up into several
> packages in the future. Making the C++ part a single shared object would
> considerably hinder this. I don't have the resources to move forward with this now,
> but as I know that I will need this, I'm going to want to implement this
> sometime.
>
> Uwe
>
> On Tue, Sep 17, 2019, at 6:22 AM, Micah Kornfield wrote:
>> I don't have a strong opinion here, but had a question and comment:
>>
>> Are there any implications from a project governance perspective of
>> packaging Parquet and Arrow into a single shared library?
>>
>> As a comment, I'm a big +1 on trying to tease apart the circular
>> dependencies between Parquet/Arrow (and any other modules).
>> As noted above, I think this boils down to isolating IO and Buffer data
>> structures into 1 library and having the Arrow Array data structures in their
>> own separate libraries.
>>
>> Thanks,
>> Micah
>>
>> On Mon, Sep 16, 2019 at 7:35 PM Sutou Kouhei <k...@clear-code.com> wrote:
>>
>>> Hi,
>>>
>>> If this is circular, it's a problem. But this isn't circular
>>> for now.
>>>
>>> I think that we can use libarrow as the fundamental shared
>>> library to provide common implementation like [1] if we need
>>> to provide common implementation for template. (I think that
>>> we don't provide common implementation for template.)
>>>
>>> [1]
>>> https://github.com/apache/arrow/pull/5221/commits/e88b2579f04451d741eeddcb6697914bcc1019a6
>>>
>>> Anyway, I'm not strongly opposed to this idea. If we choose
>>> the single shared library approach, Linux packages, GLib bindings
>>> and Ruby bindings can follow the change.
>>>
>>>
>>> Thanks,
>>> --
>>> kou
>>>
>>> In <cajpuwmdwencjpbw+hrswaojfez7e_yci-fg2d3lwgvncf45...@mail.gmail.com>
>>> "Re: [DISCUSS][C++] Rethinking our current C++ shared library (.so /
>>> .dll) approach" on Thu, 12 Sep 2019 13:23:01 -0500,
>>> Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>>> One thing I forgot to mention:
>>>>
>>>> One of the things driving the creation of new shared libraries is
>>>> interdependencies. For example:
>>>>
>>>> libarrow -> libparquet
>>>> libarrow -> libarrow_dataset
>>>> libparquet -> libarrow_dataset
>>>>
>>>> With the modular LLVM-like approach this issue goes away.
>>>>
>>>> On Thu, Sep 12, 2019 at 1:16 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>
>>>>> I forgot to add the link to the LLVM library listing
>>>>>
>>>>> https://gist.github.com/wesm/d13c2844db0c19477e8ee5c95e36a0dc
>>>>>
>>>>> On Thu, Sep 12, 2019 at 1:14 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>
>>>>>> hi folks,
>>>>>>
>>>>>> I wanted to share some concerns that I have about our current
>>>>>> trajectory with regards to producing shared libraries from the Arrow
>>>>>> build system.
>>>>>>
>>>>>> Currently, a comprehensive build produces many shared libraries:
>>>>>>
>>>>>> * libarrow
>>>>>> * libarrow_dataset
>>>>>> * libarrow_flight
>>>>>> * libarrow_python
>>>>>> * libgandiva
>>>>>> * libparquet
>>>>>> * libplasma
>>>>>>
>>>>>> There are some others. There are a number of problems with the
>>>>>> current approach:
>>>>>>
>>>>>> * Each DLL needs its own set of "visibility" macros to control the use
>>>>>> of __declspec(dllimport/dllexport) on Windows, which is necessary to
>>>>>> instruct the import or export of symbols between DLLs on Windows. See
>>>>>> e.g. https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/visibility.h
>>>>>>
>>>>>> * Templates instantiated in one DLL may cause a violation of the One
>>>>>> Definition Rule during linking (we lost at least a day of work time
>>>>>> collectively to issues around this in ARROW-6244). It is good to be
>>>>>> able to share common template interfaces in general
>>>>>>
>>>>>> * Statically-linked dependencies in one shared lib may need to be
>>>>>> statically linked into another library. For example, libgandiva
>>>>>> statically links parts of LLVM, but we will likely have some other
>>>>>> code that makes use of LLVM for other purposes (it has been discussed
>>>>>> in the context of Avro parsing)
>>>>>>
>>>>>> Overall, my preferred solution to these issues is to move to a similar
>>>>>> approach to what the LLVM project does.
>>>>>> To help understand, let me have you first look at the libraries that
>>>>>> come from the llvm-7-dev package on Ubuntu
>>>>>>
>>>>>> Here we have a collection of static "module" libraries that implement
>>>>>> different parts of the LLVM platform. Finally, a _single_ shared
>>>>>> library libLLVM-7.so is produced.
>>>>>>
>>>>>> I think we should do the same thing in Apache Arrow. So we only ever
>>>>>> will produce a single shared library from the build. We can
>>>>>> additionally make the "name" of this shared library configurable to
>>>>>> suit different needs. For example, the default name could be simply
>>>>>> "libarrow.so" or something. But if someone wants to produce a
>>>>>> barebones Parquet shared library they can override the name to create
>>>>>> a "libparquet.so" that contains only the "libarrow_core.a" and
>>>>>> "libarrow_io.a" symbols needed for reading Parquet files.
>>>>>>
>>>>>> This would have additional benefits:
>>>>>>
>>>>>> * Use the same visibility macros for all exported C++ symbols, rather
>>>>>> than having to define DLL-specific visibility
>>>>>>
>>>>>> * Improved modularization of builds and linking for third party users,
>>>>>> similar to the way that LLVM's modular linking works, see the way that
>>>>>> Gandiva requests specific components from LLVM to use for static
>>>>>> linking
>>>>>> https://github.com/apache/arrow/blob/master/cpp/cmake_modules/FindLLVM.cmake#L53
>>>>>>
>>>>>> * Net simpler linking and deployment. Only one shared library to deal
>>>>>> with
>>>>>>
>>>>>> There are some drawbacks, however:
>>>>>>
>>>>>> * Our C++ Linux packaging approach would need to be changed to be more
>>>>>> LLVM-like (a single .deb/.yum package containing the C++ platform
>>>>>> rather than many packages as now)
>>>>>>
>>>>>> Interested to hear from other C++ developers.
>>>>>>
>>>>>> Thanks
>>>>>> Wes
>>>
>>
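
The interdependencies listed in the thread (libarrow -> libparquet,
libarrow -> libarrow_dataset, libparquet -> libarrow_dataset) and the question
of whether they are circular can be checked mechanically. The following is a
minimal sketch in Python, not part of the Arrow build system. It assumes that
"A -> B" means "B links against A", which the thread does not spell out, and
the helper names link_order and is_circular are hypothetical. It relies only on
the standard library's graphlib module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter

# Edges as written in the thread. Assumption: "A -> B" is read as
# "B links against A"; the thread itself does not define the arrow.
EDGES = [
    ("libarrow", "libparquet"),
    ("libarrow", "libarrow_dataset"),
    ("libparquet", "libarrow_dataset"),
]


def link_order(edges):
    """Return the libraries ordered so that each follows its prerequisites.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    sorter = TopologicalSorter()
    for prerequisite, dependent in edges:
        # add(node, *predecessors): the dependent must come after the prerequisite.
        sorter.add(dependent, prerequisite)
    return list(sorter.static_order())


def is_circular(edges):
    """Report whether the dependency edges contain a cycle."""
    try:
        link_order(edges)
    except CycleError:
        return True
    return False


if __name__ == "__main__":
    print("link order:", link_order(EDGES))
    print("circular:", is_circular(EDGES))
```

Under that reading the three edges form a chain without a cycle. Adding a
hypothetical edge back from libarrow_dataset to libarrow would make
is_circular return True.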