So my feeling on what we should be working towards:

- Support for reproducible builds using the toolchain.
- Support for building against the system versions of all dependencies
  (subject to whatever constraints about versions we agree on).
- A straightforward way to set up a working test environment.

I think native-toolchain is probably the way to go, but I suspect we'll
need to make some changes to native-toolchain at some point:

- Currently native-toolchain is designed to build every historical
  version of every package every time. At some point this will stop
  scaling as we add more packages and more versions.
- We probably eventually need a way for users of native-toolchain to get
  their source packages from somewhere other than a Cloudera-managed S3
  bucket.

I don't think we should use native-toolchain as a catch-all for all
dependencies. I think it's reasonable to add C++ libraries that we want
to be part of the reproducible native build, but I don't think it makes
sense to use the toolchain to download precompiled dependencies that
won't be part of the reproducible build. I.e. if buildall.sh doesn't
build the library from source using the toolchain's compiler, I don't
think it should be in the native toolchain.

libhdfs I think is a bit of a corner case in that it is native code that
we link with Impala that is part of the Hadoop distribution. We could
move it to the toolchain if we want to build it as a standalone library,
but I'm not sure that necessarily makes sense.

- Tim

On Thu, Mar 10, 2016 at 11:03 AM, Henry Robinson <[email protected]> wrote:

> "the upstream ASF repo should use ASF-released artifacts here"
>
> While there's precedent elsewhere in the ASF for depending on
> downstream vendor-specific artifacts, I feel pretty strongly that
> there should be a clean separation between the ASF and downstream
> dependencies.
>
> I take your point about the flexibility of choosing which toolchain
> dependencies to take. Might be a good follow-on step to allow that
> (TOOLCHAIN_MODE={ALL, COMPILE, TEST}) or something, but we can wait to
> see if this is needed by the community.
>
> On 10 March 2016 at 10:59, Matthew Jacobs <[email protected]> wrote:
>
> > Thanks for outlining these options. How does native-toolchain factor
> > into our ASF story? I.e. do we need it to be less
> > Cloudera-project-oriented, or is it OK for it to contain CDH (rather
> > than Apache Hadoop) deployments? If we're considering it to be more
> > Cloudera-focused, it seems like it could make upstream contributions
> > difficult as there wouldn't really be a non-CDH build/runtime
> > toolchain. I guess upstream contributors could fork our toolchain
> > (or start their own) and replace the CDH components? If we detach
> > the compile-time dependencies and the test runtime projects, it
> > would probably make things easier for the rest of the world as they
> > could easily take the native-toolchain, the test environment, or
> > both.
> >
> > On Thu, Mar 10, 2016 at 10:38 AM Henry Robinson <[email protected]>
> > wrote:
> >
> > > One of the tasks remaining before we can push Impala's code to the
> > > ASF's git instance is to reduce the size of the repository. Right
> > > now even a checkout of origin/cdh5-trunk is in the multi-GB range.
> > >
> > > The vast majority of that is in the thirdparty/ directory, which
> > > adds up over the git history to be pretty huge with all the
> > > various versions we've checked in. Removing it shrinks cdh5-trunk
> > > to ~200MB. So I propose we get rid of thirdparty/ altogether.
> > >
> > > There are two main dependency types in thirdparty/. The first is a
> > > compile-time C++ dependency like open-ldap or avro-c. These are
> > > (almost) all superseded by the toolchain (see
> > > https://github.com/cloudera/native-toolchain) build. A couple of
> > > exceptions are Squeasel and Mustache, which don't produce their
> > > own libraries but are source files directly included in the Impala
> > > build. I don't see a good reason we couldn't move those to the
> > > toolchain as well.
> > >
> > > The other kind of dependency is the test binaries that are used
> > > when we start Impala's test environment (i.e. start the Hive
> > > metastore, HBase, etc.). These are trickier to extract (they're
> > > not just JARs, but bin/hadoop etc.). We also need to be able to
> > > change these dependencies pretty efficiently - the upstream ASF
> > > repo should use ASF-released artifacts here, but downstream
> > > vendors (like Cloudera) will want to replace the ASF artifacts
> > > with their own releases.
> > >
> > > Note that the Java binaries in thirdparty/ are *not* the
> > > compile-time dependencies for Impala's Java frontend - those are
> > > resolved via Maven. It's a bad thing that there are two dependency
> > > resolution mechanisms, but we might not be able to solve that
> > > issue right now.
> > >
> > > So what should we do with the test dependencies? I see the
> > > following options:
> > >
> > > 1. Put them in the native-toolchain repository. *Pros:* (almost)
> > > all dependency resolution comes from one place. *Cons:*
> > > native-toolchain would change very frequently as new releases
> > > happen.
> > >
> > > 2. Don't provide any built-in mechanism for starting a test
> > > environment. If you want to test Impala - set up your own Hadoop
> > > cluster instance. *Pros:* removes a lot of complexity. *Cons:*
> > > pushes a lot of work onto the user, makes it harder to run
> > > self-contained tests.
> > >
> > > 3. Have a separate test-dependencies repository that does
> > > basically the same thing as the toolchain. *Pros:* separates out
> > > fast-moving dependencies from slow-moving ones. *Cons:* more
> > > moving parts. HDFS would need to be in both repositories (as
> > > libhdfs is a compile-time dependency for the backend).
> > >
> > > My preference is for option #1. We can do something like the
> > > following:
> > >
> > > * Add a cmake target to 'build' a test environment (resolve test
> > > dependencies, start mini-cluster using checked-in scripts)
> > > * Add scripts to native-toolchain to download tarballs for HBase,
> > > HDFS, Hive and others just like compile-time dependencies. Update
> > > Impala's CMake scripts to use the local toolchain directory to
> > > find binaries, management scripts etc.
> > > * During each upstream release, add any new dependencies to
> > > native-toolchain, and update impala.git/bin/impala-config.sh with
> > > the new version numbers.
> > >
> > > What does everyone think?
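
For reference, the TOOLCHAIN_MODE={ALL, COMPILE, TEST} idea floated in the
thread could look roughly like the sketch below. This is only an
illustration under stated assumptions: nothing like it exists in
native-toolchain as far as the thread says, the function name
select_dependencies is hypothetical, and the grouping of packages into
compile-time and test lists is an assumption based on the examples named
in the thread (with HDFS in both lists because libhdfs is a compile-time
dependency for the backend).

```python
import os

# Assumed grouping, taken from the examples mentioned in the thread.
# Squeasel and Mustache are listed here on the assumption that they
# would be moved into the toolchain.
COMPILE_DEPS = ["open-ldap", "avro-c", "squeasel", "mustache", "hdfs"]
TEST_DEPS = ["hbase", "hdfs", "hive"]


def select_dependencies(mode):
    """Return the package names to fetch for a given TOOLCHAIN_MODE.

    Hypothetical helper: 'ALL' merges both lists and drops duplicates
    while keeping first-seen order.
    """
    mode = mode.upper()
    if mode == "COMPILE":
        return list(COMPILE_DEPS)
    if mode == "TEST":
        return list(TEST_DEPS)
    if mode == "ALL":
        # dict.fromkeys preserves insertion order and removes duplicates.
        return list(dict.fromkeys(COMPILE_DEPS + TEST_DEPS))
    raise ValueError("unknown TOOLCHAIN_MODE: %r" % mode)


if __name__ == "__main__":
    # Default to ALL when the (hypothetical) variable is unset.
    for name in select_dependencies(os.environ.get("TOOLCHAIN_MODE", "ALL")):
        print(name)
```

Keeping the two lists separate would let upstream and downstream builds
swap the test-side packages without touching the compile-side ones, which
is the separation the thread is arguing about.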
