Both #1 and #3 seem reasonable to me. I think #2 should be avoided: the Con you listed would make contributing to Impala difficult for new contributors, and that seems more serious to me than the Cons for #1 and #3.

On Thu, Mar 10, 2016 at 10:38 AM, Henry Robinson <[email protected]> wrote:

> One of the tasks remaining before we can push Impala's code to the ASF's
> git instance is to reduce the size of the repository. Right now even a
> checkout of origin/cdh5-trunk is in the multi-GB range.
>
> The vast majority of that is in the thirdparty/ directory, which adds up
> over the git history to be pretty huge with all the various versions we've
> checked in. Removing it shrinks cdh5-trunk to ~200MB. So I propose we get
> rid of thirdparty/ altogether.
>
> There are two main dependency types in thirdparty/. The first is a
> compile-time C++ dependency like open-ldap or avro-c. These are (almost)
> all superseded by the toolchain (see
> https://github.com/cloudera/native-toolchain) build. A couple of exceptions
> are Squeasel and Mustache which don't produce their own libraries but are
> source files directly included in the Impala build. I don't see a good
> reason we couldn't move those to the toolchain as well.
>
> The other kind of dependency are the test binaries that are used when we
> start Impala's test environment (i.e. start the Hive metastore, HBase, etc,
> etc.). These are trickier to extract (they're not just JARs, but bin/hadoop
> etc. etc.). We also need to be able to change these dependencies pretty
> efficiently - the upstream ASF repo should use ASF-released artifacts here,
> but downstream vendors (like Cloudera) will want to replace the ASF
> artifacts with their own releases.
>
> Note that the Java binaries in thirdparty/ are *not* the compile-time
> dependencies for Impala's Java frontend - those are resolved via Maven.
> It's a bad thing that there's two dependency resolution mechanisms, but we
> might not be able to solve that issue right now.
>
> So what should we do with the test dependencies? I see the following
> options:
>
> 1. Put them in the native-toolchain repository. *Pros:* (almost) all
> dependency resolution comes from one place. *Cons:* native-toolchain would
> change very frequently as new releases happen.
>
> 2. Don't provide any built-in mechanism for starting a test environment. If
> you want to test Impala - set up your own Hadoop cluster instance.
> *Pros:* removes a lot of complexity *Cons:* pushes a lot of work onto the
> user, makes it harder to run self-contained tests.
>
> 3. Have a separate test-dependencies repository that does basically the
> same thing as the toolchain. *Pros:* separates out fast-moving dependencies
> from slow-moving ones *Cons:* more moving parts. HDFS would need to be in
> both repositories (as libhdfs is a compile-time dependency for the
> backend).
>
> My preference is for option #1. We can do something like the following:
>
> * Add a cmake target to 'build' a test environment (resolve test
> dependencies, start mini-cluster using checked-in scripts)
> * Add scripts to native-toolchain to download tarballs for HBase, HDFS,
> Hive and others just like compile-time dependencies. Update Impala's CMake
> scripts to use those the local toolchain directory to find binaries,
> management scripts etc.
> * During each upstream release, add any new dependencies to
> native-toolchain, and update impala.git/bin/impala-config.sh with the new
> version numbers.
>
> What does everyone think?
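
To make the option #1 steps a bit more concrete, here is a rough sketch of how a script might map version numbers (as impala-config.sh would export them) onto directories under a local toolchain checkout. This is only an illustration under stated assumptions: the function name, the component list, the IMPALA_<NAME>_VERSION variable scheme and the <name>-<version> directory layout are all hypothetical, not something the proposal or native-toolchain defines.

```python
import posixpath

# Hypothetical list of test-environment components; the real set would be
# whatever the mini-cluster scripts actually need.
TEST_COMPONENTS = ("hadoop", "hbase", "hive")


def resolve_test_dependencies(env, toolchain_dir):
    """Return {component: directory} for each test dependency.

    `env` is a mapping such as os.environ. The variable naming scheme
    (IMPALA_<NAME>_VERSION) and the "<name>-<version>" directory layout
    are assumptions made for this sketch.
    """
    resolved = {}
    missing = []
    for name in TEST_COMPONENTS:
        var = "IMPALA_{}_VERSION".format(name.upper())
        version = env.get(var)
        if not version:
            # Collect every missing variable so the error lists them all.
            missing.append(var)
            continue
        resolved[name] = posixpath.join(
            toolchain_dir, "{}-{}".format(name, version))
    if missing:
        raise KeyError("missing version variables: " + ", ".join(missing))
    return resolved


if __name__ == "__main__":
    # Made-up version numbers, purely for demonstration.
    sample_env = {
        "IMPALA_HADOOP_VERSION": "2.6.0",
        "IMPALA_HBASE_VERSION": "1.2.0",
        "IMPALA_HIVE_VERSION": "1.1.0",
    }
    for component, path in sorted(
            resolve_test_dependencies(sample_env, "/opt/toolchain").items()):
        print(component, path)
```

A downstream vendor could then swap in its own releases by changing only the exported version numbers, which is the flexibility the proposal asks for.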
