Both #1 and #3 seem reasonable to me. I think #2 should be avoided: the Con
you listed would make contributing to Impala difficult for new contributors,
and that is more serious than the Cons for #1 and #3.

On Thu, Mar 10, 2016 at 10:38 AM, Henry Robinson <[email protected]> wrote:

> One of the tasks remaining before we can push Impala's code to the ASF's
> git instance is to reduce the size of the repository. Right now even a
> checkout of origin/cdh5-trunk is in the multi-GB range.
>
> The vast majority of that is in the thirdparty/ directory, which adds up
> over the git history to be pretty huge with all the various versions we've
> checked in. Removing it shrinks cdh5-trunk to ~200MB. So I propose we get
> rid of thirdparty/ altogether.
>
> There are two main dependency types in thirdparty/. The first is a
> compile-time C++ dependency like open-ldap or avro-c. These are (almost)
> all superseded by the toolchain (see
> https://github.com/cloudera/native-toolchain) build. A couple of exceptions
> are Squeasel and Mustache, which don't produce their own libraries but are
> source files directly included in the Impala build. I don't see a good
> reason we couldn't move those to the toolchain as well.
>
> The other kind of dependency is the set of test binaries that are used when
> we start Impala's test environment (i.e. start the Hive metastore, HBase,
> etc.). These are trickier to extract (they're not just JARs, but bin/hadoop
> etc.). We also need to be able to change these dependencies pretty
> efficiently - the upstream ASF repo should use ASF-released artifacts here,
> but downstream vendors (like Cloudera) will want to replace the ASF
> artifacts with their own releases.
>
> Note that the Java binaries in thirdparty/ are *not* the compile-time
> dependencies for Impala's Java frontend - those are resolved via Maven.
> It's a bad thing that there are two dependency resolution mechanisms, but we
> might not be able to solve that issue right now.
>
> So what should we do with the test dependencies? I see the following
> options:
>
> 1. Put them in the native-toolchain repository. *Pros:* (almost) all
> dependency resolution comes from one place. *Cons:* native-toolchain would
> change very frequently as new releases happen.
>
> 2. Don't provide any built-in mechanism for starting a test environment. If
> you want to test Impala - set up your own Hadoop cluster instance.
> *Pros:* removes a lot of complexity. *Cons:* pushes a lot of work onto the
> user, makes it
> harder to run self-contained tests.
>
> 3. Have a separate test-dependencies repository that does basically the
> same thing as the toolchain. *Pros:* separates out fast-moving dependencies
> from slow-moving ones. *Cons:* more moving parts. HDFS would need to be in
> both repositories (as libhdfs is a compile-time dependency for the
> backend).
>
> My preference is for option #1. We can do something like the following:
>
> * Add a cmake target to 'build' a test environment (resolve test
> dependencies, start mini-cluster using checked-in scripts)
> * Add scripts to native-toolchain to download tarballs for HBase, HDFS,
> Hive and others just like compile-time dependencies. Update Impala's CMake
> scripts to use the local toolchain directory to find binaries,
> management scripts etc.
> * During each upstream release, add any new dependencies to
> native-toolchain, and update impala.git/bin/impala-config.sh with the new
> version numbers.
>
> What does everyone think?
>
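For illustration, the "download tarballs just like compile-time dependencies" step under option #1 could be sketched roughly as below. This is only a sketch under stated assumptions: the `BASE_URL` host, the `IMPALA_<NAME>_VERSION` variable naming, and the `<name>-<version>` directory layout are all hypothetical and are not taken from native-toolchain or impala-config.sh. It only computes what would be fetched and where it would be unpacked; it does not download anything.

```python
import os
from dataclasses import dataclass

# Hypothetical host; the thread does not say where tarballs would live.
BASE_URL = "https://example.invalid/test-deps"

# Test-environment components named in the thread.
COMPONENTS = ("hbase", "hdfs", "hive")


@dataclass(frozen=True)
class Dependency:
    name: str
    version: str

    @property
    def tarball(self) -> str:
        return f"{self.name}-{self.version}.tar.gz"

    @property
    def url(self) -> str:
        return f"{BASE_URL}/{self.name}/{self.tarball}"

    def install_dir(self, toolchain_root: str) -> str:
        # Assumed layout: one directory per component and version.
        return os.path.join(toolchain_root, f"{self.name}-{self.version}")


def resolve(env, components=COMPONENTS):
    """Build the dependency list from version variables in `env`."""
    deps = []
    for name in components:
        key = f"IMPALA_{name.upper()}_VERSION"  # assumed naming scheme
        if key not in env:
            raise KeyError(f"{key} is not set")
        deps.append(Dependency(name, env[key]))
    return deps
```

A downstream vendor could then swap artifacts by changing only `BASE_URL` and the version variables, e.g. `resolve(os.environ)` after sourcing the config script.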
