I agree with Uwe that becoming more monolithic than we already are may
become a big PR problem at some point.

Regards

Antoine.


On 17/09/2019 at 09:41, Uwe L. Korn wrote:
> Hello,
> 
> I'm actually against this proposal.
> 
> My main concern at the moment is that Arrow C++/Python is growing into a really
> heavy tool where you always have to bring along all the baggage, even when you're
> only using a small part of it. This makes it harder to use Arrow in projects
> because:
> 
> * Sheer size: the more dependencies the full build has, the larger the
> installable becomes.
> * Having a large number of dependencies also means that you will need to take 
> care of security scanning of all of these in production settings. Even when 
> you're not using the parts, you will need to check for version updates, 
> correct licenses and origin of the dependencies. Having a more modular setup is
> much simpler than mastering the art of convincing corporate IT.
> * Defining dependencies from third-party libraries gets less transparent.
> When a library depends just on a large libarrow.so and starts with a missing 
> symbol error, a user is confused and might think that the Arrow installation 
> is corrupt whereas if the error reports that libarrow_flight.so is missing, 
> he is much more aware that his local build is one without Flight being built.
> 
> I would actually like to see the pyarrow packages split up into several 
> packages in the future; making the C++ part a single shared object would
> considerably hinder this. I don't have the resources to move forward with this now
> but as I know that I will need this, I'm going to want to implement this 
> sometime.
> 
> Uwe
> 
> On Tue, Sep 17, 2019, at 6:22 AM, Micah Kornfield wrote:
>> I don't have a strong opinion here, but had a question and comment:
>>
>> Are there any implications from a project governance perspective of
>> packaging Parquet and Arrow into a single shared library?
>>
>> As a comment, I'm a big +1 on trying to tease apart the circular
>> dependencies between Parquet/Arrow (and any other modules).  As noted
>> above, I think this boils down to isolating IO and Buffer data structures
>> into 1 library and having the Arrow Array data structures in their own
>> separate libraries.
>>
>> Thanks,
>> Micah
>>
>> On Mon, Sep 16, 2019 at 7:35 PM Sutou Kouhei <k...@clear-code.com> wrote:
>>
>>> Hi,
>>>
>>> If this is circular, it's a problem. But this isn't circular
>>> for now.
>>>
>>> I think that we can use libarrow as the fundamental shared
>>> library to provide common implementation like [1] if we need
>>> to provide common implementation for template. (I think that
>>> we don't provide common implementation for template.)
>>>
>>> [1]
>>> https://github.com/apache/arrow/pull/5221/commits/e88b2579f04451d741eeddcb6697914bcc1019a6
>>>
>>> Anyway, I'm not strongly opposed to this idea. If we choose
>>> one shared library approach, Linux packages, GLib bindings
>>> and Ruby bindings can follow the change.
>>>
>>>
>>> Thanks,
>>> --
>>> kou
>>>
>>> In <cajpuwmdwencjpbw+hrswaojfez7e_yci-fg2d3lwgvncf45...@mail.gmail.com>
>>>   "Re: [DISCUSS][C++] Rethinking our current C++ shared library (.so /
>>> .dll) approach" on Thu, 12 Sep 2019 13:23:01 -0500,
>>>   Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>>> One thing I forgot to mention:
>>>>
>>>> One of the things driving the creation of new shared libraries is
>>>> interdependencies. For example:
>>>>
>>>> libarrow -> libparquet
>>>> libarrow -> libarrow_dataset
>>>> libparquet -> libarrow_dataset
>>>>
>>>> With the modular LLVM-like approach this issue goes away.
>>>>
>>>> On Thu, Sep 12, 2019 at 1:16 PM Wes McKinney <wesmck...@gmail.com>
>>> wrote:
>>>>>
>>>>> I forgot to add the link to the LLVM library listing
>>>>>
>>>>> https://gist.github.com/wesm/d13c2844db0c19477e8ee5c95e36a0dc
>>>>>
>>>>> On Thu, Sep 12, 2019 at 1:14 PM Wes McKinney <wesmck...@gmail.com>
>>> wrote:
>>>>>>
>>>>>> hi folks,
>>>>>>
>>>>>> I wanted to share some concerns that I have about our current
>>>>>> trajectory with regard to producing shared libraries from the Arrow
>>>>>> build system.
>>>>>>
>>>>>> Currently, a comprehensive build produces many shared libraries:
>>>>>>
>>>>>> * libarrow
>>>>>> * libarrow_dataset
>>>>>> * libarrow_flight
>>>>>> * libarrow_python
>>>>>> * libgandiva
>>>>>> * libparquet
>>>>>> * libplasma
>>>>>>
>>>>>> There are some others. There are a number of problems with the
>>> current approach:
>>>>>>
>>>>>> * Each DLL needs its own set of "visibility" macros to control the use
>>>>>> of __declspec(dllimport/dllexport) on Windows, which is necessary to
>>>>>> instruct the import or export of symbols between DLLs on Windows. See
>>>>>> e.g.
>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/visibility.h
>>>>>>
>>>>>> * Templates instantiated in one DLL may cause a violation of the One
>>>>>> Definition Rule during linking (we lost at least a day of work time
>>>>>> collectively to issues around this in ARROW-6244). It is good to be
>>>>>> able to share common template interfaces in general
>>>>>>
>>>>>> * Statically-linked dependencies in one shared lib may need to be
>>>>>> statically linked into another library. For example, libgandiva
>>>>>> statically links parts of LLVM, but we will likely have some other
>>>>>> code that makes use of LLVM for other purposes (it has been discussed
>>>>>> in the context of Avro parsing)
>>>>>>
>>>>>> Overall, my preferred solution to these issues is to move to a similar
>>>>>> approach to what the LLVM project does. To help understand, let me
>>>>>> have you first look at the libraries that come from the llvm-7-dev
>>>>>> package on Ubuntu
>>>>>>
>>>>>> Here we have a collection of static "module" libraries that implement
>>>>>> different parts of the LLVM platform. Finally, a _single_ shared
>>>>>> library libLLVM-7.so is produced.
>>>>>>
>>>>>> I think we should do the same thing in Apache Arrow. So we only ever
>>>>>> will produce a single shared library from the build. We can
>>>>>> additionally make the "name" of this shared library configurable to
>>>>>> suit different needs. For example, the default name could be simply
>>>>>> "libarrow.so" or something. But if someone wants to produce a
>>>>>> barebones Parquet shared library they can override the name to create
>>>>>> a "libparquet.so" that contains only the "libarrow_core.a" and
>>>>>> "libarrow_io.a" symbols needed for reading Parquet files.
>>>>>>
>>>>>> This would have additional benefits:
>>>>>>
>>>>>> * Use the same visibility macros for all exported C++ symbols, rather
>>>>>> than having to define DLL-specific visibility
>>>>>>
>>>>>> * Improved modularization of builds and linking for third party users,
>>>>>> similar to the way that LLVM's modular linking works, see the way that
>>>>>> Gandiva requests specific components from LLVM to use for static
>>>>>> linking
>>> https://github.com/apache/arrow/blob/master/cpp/cmake_modules/FindLLVM.cmake#L53
>>>>>>
>>>>>> * Net simpler linking and deployment. Only one shared library to deal
>>> with
>>>>>>
>>>>>> There are some drawbacks, however:
>>>>>>
>>>>>> * Our C++ Linux packaging approach would need to be changed to be more
>>>>>> LLVM-like (a single .deb/.yum package containing the C++ platform
>>>>>> rather than many packages as now)
>>>>>>
>>>>>> Interested to hear from other C++ developers.
>>>>>>
>>>>>> Thanks
>>>>>> Wes
>>>
>>
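
For readers unfamiliar with the "visibility" macros mentioned in Wes's
original message, here is a minimal C++ sketch of what one such header
usually looks like. The names MYLIB_EXPORT, MYLIB_EXPORTING and the mylib
namespace are hypothetical and chosen only for illustration; this is not
Arrow's actual visibility.h. It assumes MSVC on Windows and GCC or Clang
elsewhere.

```cpp
// Hypothetical visibility header for a single shared library.
// MYLIB_EXPORTING would normally be defined by the build system only
// while compiling the library itself; it is defined inline here so the
// sketch compiles as one translation unit.
#define MYLIB_EXPORTING

#if defined(_WIN32) || defined(__CYGWIN__)
#  ifdef MYLIB_EXPORTING
     // Building the DLL: export the symbol.
#    define MYLIB_EXPORT __declspec(dllexport)
#  else
     // Consuming the DLL: import the symbol.
#    define MYLIB_EXPORT __declspec(dllimport)
#  endif
#else
   // GCC/Clang: make the symbol visible even under -fvisibility=hidden.
#  define MYLIB_EXPORT __attribute__((visibility("default")))
#endif

namespace mylib {

// Two illustrative "modules" that end up in the same shared library
// and therefore share the same export macro.
MYLIB_EXPORT int core_version() { return 1; }
MYLIB_EXPORT int io_version() { return core_version() + 1; }

}  // namespace mylib
```

With several shared libraries, each one needs its own copy of this header
under a distinct macro name, because a symbol exported by one DLL must be
imported by the others. With a single shared library, one macro covers
every exported symbol.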
