Hello,

I can think of two other alternatives that make it more visible what the Arrow 
core is and what the optional components are:

* Error out when no component is selected instead of building just the core 
Arrow. Here we could add an explanatory message that lists all components, with 
2-3 words per component on what it does and what it requires. This would make 
the first-time experience much better.
* Split the CMake project into several subprojects. By correctly structuring 
the CMakefiles, we should be able to separate out the Arrow components into 
separate CMake projects that can be built independently if needed while all 
using the same third-party toolchain. We would still have a top-level 
CMakeLists.txt that is invoked just like the current one, but with 
subprojects you would no longer be bound to the single top-level one. 
This would also benefit packagers, who could separate out the 
build of individual Arrow modules. Furthermore, it would make it easier 
for PoC/academic projects to just take the Arrow Core sources and drop them 
in as a CMake subproject; while this is not a good solution for 
production-grade software, it is quite common practice in research.
I really like this approach and I think this is something we should have as a 
long-term target. I'm also happy to implement it given the time, but I think 
one CMake refactor per year is the maximum I can do, and that was already eaten 
up by the dependency detection. Also, I'm unsure how much this would block 
us at the moment versus the marketing benefit of having a more modular Arrow; 
currently I'm leaning towards the view that the marketing/adoption benefit 
would be much larger, but we lack someone frustration-tolerant enough to do 
the refactoring.
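
To make the first option a bit more concrete, here is a rough sketch of such a 
check. It only uses component options that already appear in the CMake 
invocation quoted below; the list is deliberately incomplete, and the variable 
names ARROW_OPTIONAL_COMPONENTS and ARROW_ANY_COMPONENT_SELECTED as well as the 
wording of the descriptions are made up for illustration, not taken from the 
current build.

```cmake
# Illustrative sketch only: the component list is incomplete and the
# descriptions are placeholders, not the project's actual wording.
set(ARROW_OPTIONAL_COMPONENTS ARROW_COMPUTE ARROW_DATASET ARROW_JSON)

# Check whether the user switched on at least one optional component.
set(ARROW_ANY_COMPONENT_SELECTED OFF)
foreach(component IN LISTS ARROW_OPTIONAL_COMPONENTS)
  if(${component})
    set(ARROW_ANY_COMPONENT_SELECTED ON)
  endif()
endforeach()

# Stop the configure step with an overview instead of silently
# building only the core.
if(NOT ARROW_ANY_COMPONENT_SELECTED)
  message(FATAL_ERROR
          "No Arrow component was selected. Enable at least one of:\n"
          "  -DARROW_COMPUTE=ON  compute kernels\n"
          "  -DARROW_DATASET=ON  dataset API\n"
          "  -DARROW_JSON=ON     JSON support (needs RapidJSON)\n")
endif()
```

A real version would probably also need an explicit way to ask for a core-only 
build, which this sketch leaves out.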

Uwe

On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> hi folks,
> 
> Lately there seem to be more and more people suggesting that the
> optional components in the Arrow C++ project are getting in the way of
> using the "core" which implements the columnar format and IPC
> protocol. I am not sure I agree with this argument, but in general I
> think it would be a good idea to make all optional components in the
> project "opt in" rather than "opt out"
> 
> To demonstrate where things currently stand, I created a Dockerfile to
> try to make the smallest possible and most dependency-free build
> 
> https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> 
> Here is the output of this build
> 
> https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> 
> First, let's look at the CMake invocation
> 
> cmake .. -DBOOST_SOURCE=BUNDLED \
> -DARROW_BOOST_USE_SHARED=OFF \
> -DARROW_COMPUTE=OFF \
> -DARROW_DATASET=OFF \
> -DARROW_JEMALLOC=OFF \
> -DARROW_JSON=ON \
> -DARROW_USE_GLOG=OFF \
> -DARROW_WITH_BZ2=OFF \
> -DARROW_WITH_ZLIB=OFF \
> -DARROW_WITH_ZSTD=OFF \
> -DARROW_WITH_LZ4=OFF \
> -DARROW_WITH_SNAPPY=OFF \
> -DARROW_WITH_BROTLI=OFF \
> -DARROW_BUILD_UTILITIES=OFF
> 
> Aside from the issue of how to obtain and link Boost, here's a couple of 
> things:
> 
> * COMPUTE and DATASET IMHO should be off by default
> * All compression libraries should be turned off
> * GLOG should be off by default
> * Utilities should be off (they are used for integration testing)
> * Jemalloc should probably be off, but we should make it clear that
> opting in will yield better performance
> 
> I found that it wasn't possible to set ARROW_JSON=OFF without breaking
> the build. I opened ARROW-6590 to fix this
> 
> Aside from potentially changing these defaults, there's some things in
> the build that we might want to turn into optional pieces:
> 
> * We should see if we can make boost::filesystem not mandatory in the
> barebones build, if only to satisfy the peanut gallery
> * double-conversion is used in the CSV module. I think that
> double-conversion_ep and the CSV module should both be made opt-in
> * rapidjson_ep should be made optional. JSON support is only needed
> for integration testing
> 
> We could also discuss vendoring flatbuffers.h so that flatbuffers_ep
> is not mandatory.
> 
> In general, enabling optional components is primarily relevant for
> packagers. If we implement these changes, a number of package build
> scripts will have to change.
> 
> Thanks,
> Wes
>
