On Fri, Mar 6, 2015 at 12:03 PM, Josh Elser <[email protected]> wrote:
> First off, thanks for the good-will in taking the time to ask.
>
> My biggest concern in adopting it as a codebase would be ensuring that it
> isn't another codebase dropped into contrib/ and subsequently ignored. How
> do you plan to avoid this? Who do you see maintaining and running these
> tests?
>

Well, I know I use them when we post candidates. I think it'd be nice if we
all generally got in the habit. Once they've gotten polished up enough to cut
a release, we could add it to e.g. the major release procedure. That would
certainly make sure the community stays on it.

> Some more targeted implementation observations/questions -
>
> * Do you plan to update the scripts to work with Apache Accumulo instead
> of CDH specific artifacts? e.g. [1]
>

Yeah, that's part of the vendor-specific-details cleanup I mentioned. FWIW,
I've also used this for testing the ASF artifacts and it's worked fine.

> * For the MapReduce job specifically, why did you write your own and not
> use an existing "vetted" job like Continuous Ingest? Is there something
> that the included M/R job does which is not already contained by our CI
> ingest and verify jobs?
>

I need to be able to check that none of the data has been corrupted or lost,
and I'd prefer to do it quickly. It's possible for the CI job to have data
corrupted or dropped in a way we can't detect (namely UNREFERENCED cells).

The data load job is considerably easier to run (especially at scale) than
the CI job. Presuming your cluster is configured correctly, you just use the
tool script and a couple of command line parameters, and YARN/MR take care
of the rest. It will also do this across several tables configured with our
different storage options, to make sure we have better coverage.

The data verify job is also more parallelizable than the existing jobs,
since each executor can handle its share of the cells on the map side
without regard for the others (there's a rough sketch of the idea further
down).

For example, on a newly deployed, unoptimized, low-power 5-node cluster I
can launch-and-forget data load + verify, and it will get through ~78M cells
in each of 4 tables (312M cells total) in around 7 minutes of load + 2
minutes of compaction + 2 minutes of verify, without using offline scans.
(And ~2 minutes of the load time is spent taking the two-level pre-split
optimization path when it isn't needed on a cluster this small.) It can do
more, faster, on bigger or better-tuned clusters, but the important bit is
that I can check correctness just by telling it where Accumulo + MR is.

> * It looks like the current script only works for 1.4 to 1.6? Do you plan
> to support 1.5->1.6, 1.5->1.7, 1.6->1.7? How do you envision this adoption
> occurring?
>

The current script only has comments from a couple of vendor releases. I've
used the overall tooling for ASF releases 1.4 -> 1.5 -> 1.6, 1.4 -> 1.6,
1.5 -> 1.6, and 1.6.0 -> 1.6.1.

For the most part, adding another target version is just a matter of
checking whether the APIs still work. With the adoption of semver, that
should be pretty easy. I have toyed before with adding a shim layer for our
API versions and will probably revisit that once there's a 2.0. So I think
adding those other supported upgrade paths will mostly be a matter of
improving the documentation. I'd also like to get some ease-of-use bits
included, like downloading the release or RC tarballs after a prompt for
version numbers. At the very least, the documentation part will be part of
the post-import cleanup.
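To be a little more concrete about that shim idea: the rough shape I have in
mind is something like the interface below. To be clear, nothing like this
exists in the tooling today and every name in it is invented; it's only a
sketch of the design, assuming each supported release line gets a small
adapter built against that version's client API.

/**
 * Hypothetical shim: one implementation per supported Accumulo release line,
 * so the upgrade driver never touches version-specific client classes.
 */
public interface UpgradeTestShim {

  /** Connect to the cluster under test. */
  void connect(String instance, String zookeepers, String user, String password)
      throws Exception;

  /** Create a test table configured with the storage options we want covered. */
  void createTestTable(String tableName) throws Exception;

  /** Load the deterministic test data for a given seed. */
  void loadCells(String tableName, long numCells, long seed) throws Exception;

  /** Re-scan the table and return how many cells are missing or corrupted. */
  long verifyCells(String tableName, long numCells, long seed) throws Exception;
}

The automated script would then just pick the adapter that matches whichever
release it's pointed at.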
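And to expand on the verify point from further up: if every cell can be
checked in isolation, a mapper only needs its own input split. Below is a toy
illustration of that idea; it is not the actual job, the class and method
names are invented, and it assumes Guava and the Accumulo/Hadoop client jars
are on the classpath.

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

import com.google.common.hash.Hashing;

/**
 * Illustrative only: each cell's value is a deterministic function of its own
 * key plus a per-run seed, so the verify job can recompute and compare on the
 * map side, and a counter checked against the known load size catches drops.
 */
public class SelfDescribingCells {

  /** Value written for a given row/cf/cq by the load job. */
  public static Value valueFor(Text row, Text cf, Text cq, long seed) {
    byte[] digest = Hashing.murmur3_128().newHasher()
        .putLong(seed)
        .putBytes(row.getBytes(), 0, row.getLength())
        .putBytes(cf.getBytes(), 0, cf.getLength())
        .putBytes(cq.getBytes(), 0, cq.getLength())
        .hash().asBytes();
    return new Value(digest);
  }

  /** Map-side check in the verify job: recompute the expected value and compare. */
  public static boolean isIntact(Key key, Value actual, long seed) {
    Value expected = valueFor(key.getRow(), key.getColumnFamily(),
        key.getColumnQualifier(), seed);
    return expected.equals(actual);
  }
}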
> * As far as exercising internal Accumulo implementation, I think you have
> the basics covered. What about some more tricky things over the metadata
> table (clone, import, export, merge, split table)? How might additional
> functionality be added in a way that can be automatically tested?
>

Those would be great additions. The current compatibility test is limited to
data compatibility. Adding packages for other API hooks (like checking that
import/export works across versions) should just be a matter of writing a
driver that talks to the Accumulo API and then updating the automated
script. At least import/export and clone should be relatively easy, to the
extent that we can leverage the data compatibility tools to put a table in a
known state and then check that the other tables match.

> * It seems like you have also targeted a physical set of nodes. Have you
> considered actually using some virtualization platform (e.g. vagrant) to
> fully automate upgrade-testing? If there is a way that a user can spin up a
> few VMs to do the testing, the barrier to entry is much lower (and likely
> more foolproof) than requiring the user to set up the environment.
>

To date, our main concern has been testing against live clusters. Mostly
that's an artifact of internal testing procedures. I'd love it if someone
who's proficient in vagrant or docker or whatever could help add a
lower-barrier test point.

--
Sean
