Thanks for the responses. Overall, I'm positive towards the inclusion
given your answers.
Sean Busbey wrote:
On Fri, Mar 6, 2015 at 12:03 PM, Josh Elser <[email protected]> wrote:
First off, thanks for the good-will in taking the time to ask.
My biggest concern in adopting it as a codebase would be ensuring that it
isn't another codebase dropped into contrib/ and subsequently ignored. How
do you plan to avoid this? Who do you see maintaining and running these
tests?
Well, I know I use them when we post candidates. I think it'd be nice if we
all generally got in the habit. Once they've been polished up enough to
cut a release, we could add it to, e.g., the major release procedure. That
would certainly make sure the community stays on it.
You definitely read between the lines. Having a tool for anyone's use is
a plus (I think Christopher touched on this, too). I wanted to make sure
that adoption of this as a contrib didn't immediately imply that it is
required testing. That would be a good goal for the codebase, but I
didn't want them to come as a package deal.
Some more targeted implementation observations/questions -
* Do you plan to update the scripts to work with Apache Accumulo instead
of CDH specific artifacts? e.g. [1]
Yeah, that's part of the vendor-specific-details clean up I mentioned.
FWIW, I've used this for also testing the ASF artifacts and it's worked
fine.
Cool, thanks.
* For the MapReduce job specifically, why did you write your own and not
use an existing "vetted" job like Continuous Ingest? Is there something
that the included M/R job does which is not already contained by our CI
ingest and verify jobs?
I need to be able to check that none of the data has been corrupted or
lost, and I'd prefer to do it quickly. It's possible for the CI job to have
data corrupted or dropped in a way we can't detect (namely UNREFERENCED
cells).
It's possible, but unlikely, IMO. In a test at home when a single
character was changed (by some still unknown factor), the CI verify
caught it and failed the verification phase.
The data load job is considerably easier to run (especially at scale) than
the CI job. Presuming your cluster is configured correctly, you just run the
tool script with a couple of command-line parameters and YARN/MR takes care
of the rest. It will also do this across several tables configured with our
different storage options, to make sure we have better coverage.
That is a valid point for the ingest portion. I hadn't thought about that.
The given data verify job is also more parallelizable than the existing
jobs, since each executor can handle its share of the cells on the map side
without regard for the others.
For example, on a newly deployed, unoptimized cluster I can
launch-and-forget data load + verify and it will get through ~78M cells in
each of 4 tables (312M cells total) on a low-power 5-node cluster in around
a 7 minute load + 2 minute compaction + 2 minute verify, without using
offline scans (and ~2 min of that load time is spent taking the
two-level pre-split optimization path, which isn't needed on this small
cluster). It can do more, faster, on bigger or better-tuned clusters, but the
important bit is that I can check correctness just by telling it where
Accumulo + MR is.
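For the curious, the back-of-envelope rates implied by the numbers quoted above work out like this (a quick sketch using only the figures in this thread):

```shell
# Rough throughput math for the quoted run:
# 4 tables x ~78M cells each, 7 min load + 2 min compaction + 2 min verify.
cells_per_table=78000000
tables=4
total_cells=$(( cells_per_table * tables ))   # 312,000,000 cells total

load_s=$(( 7 * 60 ))      # 7 minute load phase, in seconds
verify_s=$(( 2 * 60 ))    # 2 minute verify phase, in seconds

echo "total cells: ${total_cells}"
echo "load rate:   $(( total_cells / load_s )) cells/s"
echo "verify rate: $(( total_cells / verify_s )) cells/s"
```

So even on a small, untuned cluster, that's roughly 740K cells/s loaded and 2.6M cells/s verified.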
* It looks like the current script only works for 1.4 to 1.6? Do you plan
to support 1.5->1.6, 1.5->1.7, 1.6->1.7? How do you envision this adoption
occurring?
The current script only has comments from a couple of vendor releases. I've
used the overall tooling for ASF releases 1.4 -> 1.5 -> 1.6, 1.4 -> 1.6,
1.5 -> 1.6, and 1.6.0 -> 1.6.1.
For the most part, adding another target version is just a matter of
checking that the APIs still work. With the adoption of semver, that should
be pretty easy. I have toyed before with adding a shim layer for our API
versions and will probably revisit that once there's a 2.0.
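To make the semver point concrete, the compatibility gate could be as small as this (a hypothetical helper, not part of the current scripts; it assumes the plain semver rule that releases sharing a major version are API-compatible):

```shell
# Hypothetical helper: under semver, two releases with the same major
# version number should be API-compatible, so a driver could gate which
# API calls it makes on the major component alone.
same_major() {
  local a="${1%%.*}"   # major component of the first version string
  local b="${2%%.*}"   # major component of the second version string
  [ "$a" = "$b" ]
}

same_major 1.6.1 1.6.2 && echo "same major: no shim needed"
```

A shim layer would only need to kick in when `same_major` fails, e.g. across a 1.x -> 2.0 boundary.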
So I think adding those other supported paths will mostly be a matter of
improving the documentation. I'd like to get some ease-of-use bits
included, like downloading the release or RC tarballs after a prompt for
version numbers. At the very least, that documentation part will be part
of the post-import cleanup.
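The download step could be as simple as mapping a version number to the ASF archive URL (a sketch; the `archive.apache.org` layout here is my assumption, and RC tarballs live in per-release-manager staging areas, so those URLs would still need to be supplied by hand):

```shell
# Hypothetical helper: map a release version to its ASF archive tarball
# URL. The path layout below is an assumption about archive.apache.org;
# release-candidate tarballs are staged elsewhere and need explicit URLs.
release_url() {
  local version="$1"
  echo "https://archive.apache.org/dist/accumulo/${version}/accumulo-${version}-bin.tar.gz"
}

# Usage: curl -fLO "$(release_url 1.6.1)"
```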
* As far as exercising internal Accumulo implementation, I think you have
the basics covered. What about some more tricky things over the metadata
table (clone, import, export, merge, split table)? How might additional
functionality be added in a way that can be automatically tested?
Those would be great additions. The current compatibility test is limited
to data compatibility. Adding packages for other API hooks (like checking
that import/export works across versions) should be just a matter of
writing a driver that talks to the Accumulo API and then updating the
automated script.
At least import/export and clone should be relatively easy, to the extent
that we can leverage the data compatibility tools to put a table in a known
state and then check that other tables match.
* It seems like you have also targeted a physical set of nodes. Have you
considered actually using some virtualization platform (e.g. Vagrant) to
fully automate upgrade testing? If there is a way for a user to spin up a
few VMs to do the testing, the barrier to entry is much lower (and likely
more foolproof) than requiring the user to set up the environment.
To date, our main concern has been testing against live clusters. Mostly
that's an artifact of internal testing procedures. I'd love it if someone
who's proficient in Vagrant or Docker or whatever could help add a
lower-barrier test point.
Cool. I've been meaning to look at the stuff Wyatt posted on the user
list a week or two ago as a starting point for making an easy-to-spin-up
Accumulo instance off of a commit. I'd be very excited for a day when
upgrade testing could be nothing more than `./upgrade-test.sh 1.6.1
1.6.2-rc0`.