Hello,
I'm contacting you on behalf of the LCG Releases team at CERN. We provide a common software stack for LHCb, ATLAS, and other experiments, for use at CERN and on the worldwide computing grid.

Right now we're looking into optimizing the way we build Apache Arrow (C++ & Python) and its dependencies. Ideally we'd like to build Arrow with only the minimum set of dependencies needed to run it, and to satisfy those dependencies with packages already installed in the stack. The former would keep the stack clean; the latter would help us avoid duplication and builds that fail when mirrors go offline.

Our builds currently run with the ARROW_DEPENDENCY_SOURCE=AUTO <https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst> setting, which results in Arrow downloading duplicate and non-essential packages, as well as a dependency on external mirrors. Setting it to SYSTEM lets us avoid the downloads, but then the build fails because dependencies that we do not use are missing from the stack.

Do you know if there is a recommended way to achieve this? The problem seems to stem from the fact that all listed dependencies are downloaded, whether they are needed or not. We have considered patching out the non-essential dependencies ('double-conversion', 'GTEST', etc.) from the dependency list, as well as formally adding the unneeded dependencies to the stack in order to run with the SYSTEM setting. However, if there is a proper way to do it we would of course prefer to follow that course of action.


Any help would be much appreciated.
Kind regards,

    - Richard Bachmann
