One of the tasks remaining before we can push Impala's code to the ASF's
git instance is to reduce the size of the repository. Right now even a
checkout of origin/cdh5-trunk is in the multi-GB range.

The vast majority of that is in the thirdparty/ directory, which adds up
over the git history to be pretty huge with all the various versions we've
checked in. Removing it shrinks cdh5-trunk to ~200MB. So I propose we get
rid of thirdparty/ altogether.

There are two main dependency types in thirdparty/. The first is a
compile-time C++ dependency like OpenLDAP or avro-c. These are (almost)
all superseded by the toolchain (see
https://github.com/cloudera/native-toolchain) build. A couple of exceptions
are Squeasel and Mustache which don't produce their own libraries but are
source files directly included in the Impala build. I don't see a good
reason we couldn't move those to the toolchain as well.

The other kind of dependency is the test binaries that are used when we
start Impala's test environment (i.e. starting the Hive metastore, HBase,
and so on). These are trickier to extract (they're not just JARs, but also
launcher scripts like bin/hadoop). We also need to be able to change these
dependencies pretty
efficiently - the upstream ASF repo should use ASF-released artifacts here,
but downstream vendors (like Cloudera) will want to replace the ASF
artifacts with their own releases.
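
As a rough illustration of that override point, resolution could default to
ASF release URLs while letting a downstream vendor substitute its own. All
names below (IMPALA_HADOOP_VERSION, IMPALA_HADOOP_URL, the mirror URL) are
hypothetical, not existing Impala configuration:

```shell
#!/bin/sh
# Hypothetical sketch: resolve a test-dependency tarball URL. The defaults
# point at ASF release archives; a downstream vendor (like Cloudera) would
# export IMPALA_HADOOP_URL before building to use its own artifacts instead.

# Use the existing value if set, otherwise fall back to the default.
: "${IMPALA_HADOOP_VERSION:=2.6.0}"
: "${IMPALA_HADOOP_URL:=https://archive.apache.org/dist/hadoop/common/hadoop-${IMPALA_HADOOP_VERSION}/hadoop-${IMPALA_HADOOP_VERSION}.tar.gz}"

echo "Resolved Hadoop tarball: ${IMPALA_HADOOP_URL}"
```

The same pattern would repeat per component (Hive, HBase, ...), so swapping
artifact sources is a matter of setting environment variables, not patching
scripts.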

Note that the Java binaries in thirdparty/ are *not* the compile-time
dependencies for Impala's Java frontend - those are resolved via Maven.
It's unfortunate that there are two dependency resolution mechanisms, but we
might not be able to solve that issue right now.

So what should we do with the test dependencies? I see the following
options:

1. Put them in the native-toolchain repository. *Pros:* (almost) all
dependency resolution comes from one place. *Cons:* native-toolchain would
change very frequently as new releases happen.

2. Don't provide any built-in mechanism for starting a test environment. If
you want to test Impala - set up your own Hadoop cluster instance.
*Pros:* removes a lot of complexity. *Cons:* pushes a lot of work onto the
user and makes it harder to run self-contained tests.

3. Have a separate test-dependencies repository that does basically the
same thing as the toolchain. *Pros:* separates out fast-moving dependencies
from slow-moving ones. *Cons:* more moving parts. HDFS would need to be in
both repositories (as libhdfs is a compile-time dependency for the backend).

My preference is for option #1. We can do something like the following:

* Add a cmake target to 'build' a test environment (resolve test
dependencies, start mini-cluster using checked-in scripts)
* Add scripts to native-toolchain to download tarballs for HBase, HDFS,
Hive and others, just like the compile-time dependencies. Update Impala's
CMake scripts to use the local toolchain directory to find binaries,
management scripts, etc.
* During each upstream release, add any new dependencies to
native-toolchain, and update impala.git/bin/impala-config.sh with the new
version numbers.
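
The download step in the second bullet could look something like the sketch
below. The function names, component versions, URLs, and the layout under
TOOLCHAIN_DIR are all assumptions for illustration, not the real
native-toolchain layout:

```shell
#!/bin/sh
# Hypothetical sketch of a toolchain-style fetch step for test dependencies.

# Where resolved components land; real toolchain layout may differ.
: "${TOOLCHAIN_DIR:=./toolchain}"

# Compute the directory a component/version pair should unpack into.
resolve_component_dir() {
  component="$1"; version="$2"
  echo "${TOOLCHAIN_DIR}/${component}-${version}"
}

# Fetch a component tarball unless it is already present locally.
fetch_component() {
  component="$1"; version="$2"; url="$3"
  dest="$(resolve_component_dir "$component" "$version")"
  if [ -d "$dest" ]; then
    echo "Skipping ${component}-${version}: already present"
    return 0
  fi
  # A real script would download and unpack here; this sketch only reports.
  echo "Would fetch $url into $dest"
}

fetch_component hbase 1.0.0 "https://example.org/hbase-1.0.0.tar.gz"
```

Impala's CMake scripts would then only need to know TOOLCHAIN_DIR and the
version numbers from bin/impala-config.sh to locate binaries and management
scripts.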

What does everyone think?