Hello,

I can think of two other alternatives that make it more visible what the Arrow core is and which components are optional:
* Error out when no component is selected instead of building just the core Arrow. Here we could add an explanatory message that lists all components and, for each one, says in 2-3 words what it does and what it requires. This would make the first-time experience much better.
* Split the CMake project into several subprojects. By correctly structuring the CMake files, we should be able to separate the Arrow components into CMake projects that can be built independently if needed while all using the same third-party toolchain. We would still have a top-level CMakeLists.txt that is invoked just like the current one, but with subprojects you would no longer be bound to the single top-level one. This would also benefit packagers, who could separate out the builds of individual Arrow modules. Furthermore, it would make it easier for PoC/academic projects to just take the Arrow Core sources and drop them in as a CMake subproject; while this is not a good solution for production-grade software, it is quite common practice in research.

I really like the subproject approach and I think it is something we should have as a long-term target. I'm also happy to implement it given the time, but I think one CMake refactor per year is the maximum I can do, and that was already eaten up by the dependency detection. Also, I'm unsure how much this would block us at the moment versus the marketing benefit of having a more modular Arrow; currently I'm leaning towards the view that the marketing/adoption benefit would be much larger, but we lack someone frustration-tolerant to do the refactoring.

Uwe

On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> hi folks,
>
> Lately there seem to be more and more people suggesting that the
> optional components in the Arrow C++ project are getting in the way of
> using the "core" which implements the columnar format and IPC
> protocol.
> I am not sure I agree with this argument, but in general I
> think it would be a good idea to make all optional components in the
> project "opt in" rather than "opt out"
>
> To demonstrate where things currently stand, I created a Dockerfile to
> try to make the smallest possible and most dependency-free build
>
> https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
>
> Here is the output of this build
>
> https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
>
> First, let's look at the CMake invocation
>
> cmake .. -DBOOST_SOURCE=BUNDLED \
>     -DARROW_BOOST_USE_SHARED=OFF \
>     -DARROW_COMPUTE=OFF \
>     -DARROW_DATASET=OFF \
>     -DARROW_JEMALLOC=OFF \
>     -DARROW_JSON=ON \
>     -DARROW_USE_GLOG=OFF \
>     -DARROW_WITH_BZ2=OFF \
>     -DARROW_WITH_ZLIB=OFF \
>     -DARROW_WITH_ZSTD=OFF \
>     -DARROW_WITH_LZ4=OFF \
>     -DARROW_WITH_SNAPPY=OFF \
>     -DARROW_WITH_BROTLI=OFF \
>     -DARROW_BUILD_UTILITIES=OFF
>
> Aside from the issue of how to obtain and link Boost, here's a couple of
> things:
>
> * COMPUTE and DATASET IMHO should be off by default
> * All compression libraries should be turned off
> * GLOG should be off by default
> * Utilities should be off (they are used for integration testing)
> * Jemalloc should probably be off, but we should make it clear that
> opting in will yield better performance
>
> I found that it wasn't possible to set ARROW_JSON=OFF without breaking
> the build. I opened ARROW-6590 to fix this
>
> Aside from potentially changing these defaults, there's some things in
> the build that we might want to turn into optional pieces:
>
> * We should see if we can make boost::filesystem not mandatory in the
> barebones build, if only to satisfy the peanut gallery
> * double-conversion is used in the CSV module. I think that
> double-conversion_ep and the CSV module should both be made opt-in
> * rapidjson_ep should be made optional.
> JSON support is only needed
> for integration testing
>
> We could also discuss vendoring flatbuffers.h so that flatbuffers_ep
> is not mandatory.
>
> In general, enabling optional components is primarily relevant for
> packagers. If we implement these changes, a number of package build
> scripts will have to change.
>
> Thanks,
> Wes
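
PS: To make the first alternative (erroring out when no component is selected) a bit more concrete, here is a rough sketch of what such a configure-time check might look like. The option names are taken from the cmake invocation quoted above; the project name, the help strings and the wording of the message are placeholders I made up, and the real component list would be longer. This is an illustration under those assumptions, not what the Arrow build does today.

```cmake
# Sketch only: option names come from the quoted cmake invocation; the
# project name, help strings and error text are hypothetical placeholders.
cmake_minimum_required(VERSION 3.10)
project(arrow_component_check NONE)

# Every optional component is opt-in, i.e. OFF unless requested.
option(ARROW_COMPUTE "Build the compute component" OFF)
option(ARROW_DATASET "Build the dataset component" OFF)
option(ARROW_JSON "Build JSON support (needs rapidjson)" OFF)

# Stop at configure time if nothing was selected, and list the choices
# together with a few words on what each one is and what it needs.
if(NOT (ARROW_COMPUTE OR ARROW_DATASET OR ARROW_JSON))
  message(FATAL_ERROR
    "No Arrow component selected. Enable at least one of:\n"
    "  -DARROW_COMPUTE=ON  (compute component)\n"
    "  -DARROW_DATASET=ON  (dataset component)\n"
    "  -DARROW_JSON=ON     (JSON support, needs rapidjson)")
endif()
```

With something like this, a plain `cmake ..` would stop with the component list, while `cmake .. -DARROW_JSON=ON` would continue configuring. How a core-only build would be requested explicitly is left open in this sketch.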