Hello Wes and Sebastien,
First off, a correction to my earlier message: I misread the documentation and thought that 'thirdparty/download_dependencies.sh' would download all dependencies no matter what, which isn't the case. Apologies.

We were _originally_ building Arrow with the following command:

${long_path}/bin/cmake ${another_long_path}/arrow-0.14.1/src/arrow/0.14.1/cpp \
    -DARROW_USE_SSE=ON \
    -DARROW_PYTHON=ON  \
    -DCMAKE_INSTALL_PREFIX=${path_to_install_dir} \
    -DCMAKE_CXX_COMPILER=${long_path}/bin/g++ \
    -DCMAKE_CXX_STANDARD=17 \
    -DARROW_WITH_ZSTD=OFF \
    -DARROW_BUILD_TESTS=OFF \
    -DARROW_BUILD_BENCHMARKS=OFF \
    -DARROW_PARQUET=ON \
    -DCXX_COMMON_FLAGS=-march=core2 -mno-sse4.2 -mno-bmi2 -mno-bmi -mno-sse3 -mno-ssse3 \
    -DARROW_CXXFLAGS=-march=core2 -mno-sse4.2 -mno-bmi2 -mno-bmi -mno-sse3 -mno-ssse3 \
    -DBoost_NO_BOOST_CMAKE=ON \
    -DBoost_ADDITIONAL_VERSIONS=1.70

This produced the following in our build logs:
    [  7%] Performing download step (download, verify and extract) for 'rapidjson_ep'
    [  8%] Performing download step (download, verify and extract) for 'double-conversion_ep'
    [  8%] Performing download step (download, verify and extract) for 'snappy_ep'
    [  8%] Performing download step (download, verify and extract) for 'lz4_ep'
    [  8%] Performing download step (download, verify and extract) for 'jemalloc_ep'
    [  8%] Performing download step (download, verify and extract) for 'gflags_ep'
    [  9%] Performing download step (download, verify and extract) for 'thrift_ep'
    [  9%] Performing download step (download, verify and extract) for 'brotli_ep'
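
As an aside, a list like this can be pulled out of a build log with a small shell helper. This is only a sketch: `list_downloaded_deps` is a name I made up for illustration, and it assumes the log contains download-step messages of the form shown above, with the project name quoted and ending in `_ep`.

```shell
# Hypothetical helper (not part of Arrow): read build-log text on stdin and
# print the names of the external projects that had a download step.
list_downloaded_deps() {
    # Pick out each download-step message together with its quoted project
    # name (e.g. 'snappy_ep'), then strip the quotes and the _ep suffix.
    grep -o "Performing download step [^']*'[^']*_ep'" \
        | sed "s/.*'\([^']*\)_ep'/\1/" \
        | sort -u
}
```

Usage would be something like `list_downloaded_deps < build.log`, where `build.log` is a placeholder for wherever the build output is captured.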


Thank you for opening the Jira issue. I agree that the difficulty of telling why some of these packages are downloaded is a core part of the problem. In the example above, I struggled to work out why Snappy, for instance, was downloaded. The build's `projects/arrow-0.14.1/src/arrow/0.14.1/cpp/CMakeLists.txt` suggests that the ARROW_ORC setting is the likely cause, I think. Similarly, it was unclear why jemalloc, which already exists in our stack, was not taken from the system. I now understand that this is done in order to use a specific version which you can reliably patch, but clearer labeling would be nice.

In order to avoid offline mirrors interrupting builds we have taken the following steps: The packages downloaded above have now been added properly to the stack, and listed as dependencies of arrow. Arrow is now built like so:

ENVIRONMENT FLATBUFFERS_HOME=${flatbuffers_home} ARROW_JEMALLOC_URL=${local_jemalloc_tar.bz2} ${long_path}/bin/cmake ${another_long_path}/arrow-0.14.1/src/arrow/0.14.1/cpp \
    -DARROW_PYTHON=ON \
    -DCMAKE_INSTALL_PREFIX=${path_to_install_dir} \
    -DCMAKE_CXX_COMPILER=${long_path}/bin/g++ \
    -DCMAKE_CXX_STANDARD=17 \
    -DARROW_WITH_ZSTD=OFF \
    -DARROW_BUILD_TESTS=OFF \
    -DARROW_BUILD_BENCHMARKS=OFF \
    -DARROW_PARQUET=ON \
    -DRapidJSON_ROOT=${rapidjson_home} \
    -DRAPIDJSON_INCLUDE_DIR=${rapidjson_home}/include \
    "-DCXX_COMMON_FLAGS=-march=core2 -mno-sse4.2 -mno-bmi2 -mno-bmi -mno-sse3 -mno-ssse3" \
    "-DARROW_CXXFLAGS=-march=core2 -mno-sse4.2 -mno-bmi2 -mno-bmi -mno-sse3 -mno-ssse3" \
    -DBoost_NO_BOOST_CMAKE=ON \
    -DBoost_ADDITIONAL_VERSIONS=1.70

The dependencies are detected (no longer downloaded), except for jemalloc where the find function has been disabled. As a work-around the ARROW_JEMALLOC_URL is supplied to take the tarball from local storage. Thrift is now built with CMake, identically to how Arrow would do it internally, with the addition of the -fPIC flag. We will look into what features can be safely disabled for Arrow and Thrift in the future. Thank you Sebastien for the pointer to the ALICE build script.
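
Since the jemalloc workaround depends on a tarball being present in local storage, a guard along the following lines could make the build fail early with a clear message if it is missing. This is a minimal sketch under my own assumptions: `check_local_tarball` and `jemalloc_tarball` are hypothetical names, and it assumes ARROW_JEMALLOC_URL is given a plain local file path, as in the command above.

```shell
# Hypothetical helper (not part of Arrow): succeed only if the given
# tarball path exists as a regular file.
check_local_tarball() {
    # $1: path to the tarball that ARROW_JEMALLOC_URL will point at
    if [ ! -f "$1" ]; then
        echo "missing tarball: $1" >&2
        return 1
    fi
}

# Possible use before invoking cmake (jemalloc_tarball is a placeholder):
# check_local_tarball "${jemalloc_tarball}" \
#     && export ARROW_JEMALLOC_URL="${jemalloc_tarball}"
```

The check only confirms that the file exists. It does not verify the version or the contents of the tarball.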

We ended up not going for the full 'offline builds' solution of specifying all URLs, as this would introduce additional complexities in the form of a 'special' set of packages which are not version controlled like the others.

Thank you for the advice.
Kind regards:

    - Richard

On 11/7/19 5:10 PM, Wes McKinney wrote:
I just opened https://issues.apache.org/jira/browse/ARROW-7089 about
increasing transparency around what options are causing thirdparty
dependencies to be required

On Thu, Nov 7, 2019 at 10:05 AM Wes McKinney <wesmck...@gmail.com> wrote:
hi Richard,

On Thu, Nov 7, 2019 at 9:59 AM Richard Bachmann
<richard.bachm...@cern.ch> wrote:
Hello,
I'm contacting you on behalf of the LCG Releases team at CERN. We
provide a common software stack for LHCb, ATLAS and others to be used at
CERN and the worldwide computing grid.

Right now we're looking into optimizing the way we're building Apache
Arrow (C++ & Python) and its dependencies. Ideally we'd like to build
Arrow using only the minimum of necessary dependencies to run it, and to
use packages already installed in the stack to fulfill these
dependencies. The former would be nice to keep the stack clean, the
latter would help us avoid duplication and failing builds due to mirrors
going offline.

Our builds currently run with the ARROW_DEPENDENCY_SOURCE=AUTO
<https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst>
setting, which results in duplicate and non-essential packages being
downloaded by Arrow, as well as dependency on external mirrors. Setting
it to SYSTEM allows us to avoid the downloads, but then the build
process fails due to missing unused dependencies.

I'm surprised to hear this based on what I know about the build system
and from extensive local development.

Can you show the exact CMake invocation you are using and indicate
which unused dependencies are being downloaded?

This Docker minimal build shows (unless something has been recently
broken) that the project can be built with only a small number of
third party dependencies:

https://github.com/apache/arrow/tree/master/cpp/examples/minimal_build

Note that we support a fully "offline" build to allow thirdparty
dependencies to be built in an air-gapped environment:

https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst#offline-builds

Do you know if there is a recommended way to achieve this? The problem
seems to stem from the fact that all listed dependencies are downloaded,
whether they are needed or not. We have considered patching out the
non-essential dependencies ('double-conversion', 'GTEST', etc.) from the
dependency list, as well as formally adding the unneeded dependencies to
the stack in order to run with the SYSTEM setting. However, if there is
a proper way to do it we would of course prefer to follow that course of
action.

We'll be able to know more based on how you're calling CMake and with
what options, but the build system should not be downloading any
dependencies that are not needed.

Any help would be very appreciated.
Kind regards:

      - Richard Bachmann
