[jira] [Created] (BEAM-1799) IO ITs: simplify data loading design pattern

Stephen Sisk (JIRA) Thu, 23 Mar 2017 16:25:08 -0700

Stephen Sisk created BEAM-1799:
----------------------------------

             Summary: IO ITs: simplify data loading design pattern
                 Key: BEAM-1799
                 URL: https://issues.apache.org/jira/browse/BEAM-1799
             Project: Beam
          Issue Type: Improvement
          Components: sdk-java-extensions
            Reporter: Stephen Sisk
            Assignee: Stephen Sisk



Problems with the current solution
=============================
* The IO IT data loading guidelines [1] are complicated & aren't "native junit" 
- you end up working around junit rather than working with it (I was a part of 
defining them[0], so I critique the rules with (heart) )
* Doing data loading using external tools means we have additional dependencies 
outside of the tests themselves. If we *must* use them, it's worth the time, 
but I think we have another option. I find it especially amusing since the data 
loading tools are things like ycsb which themselves are benchmarking tools ("I 
heard you like performance benchmarking, so here's a performance benchmarking 
tool to use before you use your performance benchmarking tool"), and really are 
just solving the problem of "I want to write data in parallel to this data 
store" - that sounds familiar :) 

The current guidelines also don't scale well to performance tests:
* We want to write medium sized data for perf tests - doing data loading using 
external tools means a minimum of 2 reads & writes. For the small scale ITs, 
that's not a big deal, but for the large scale tests, if we assume we're 
working with a fixed budget, more data transferred/stored ~= fewer tests.
* If you want to verify that large data sets are correct (or create them), you 
need to actually read and write those large data sets - currently, the plan is 
that data loading/testing infrastructure only runs on one machine, so those 
operations are going to be slow. We aren't working with actual large data sets, 
so it won't take too long, but it's always nice to have faster tests.

New Proposed Solution
===================
Instead of trying to test read and write separately, the test should be a 
"write, then read back what you just wrote", all using the IO under test. To 
support scenarios like "I want to run my read test repeatedly without 
re-writing the data", tests would add flags for "skipCleanUp" and 
"useExistingData".

Check out the example I wrote up [2]

I didn't want to invest much time on this before I opened a Jira/talked to 
others, so I plan on expanding on this a bit more/formalizing it in the testing 
docs.

A reminder of some context:
* The goals for the ITs & Perf tests are that they are *not* intended to be the 
place where we exercise specific scenarios. Instead, they are tripwires 
designed to find problems with code *we already believe works* (as proven by 
the unit tests) when it runs against real data store instances/runners using 
multiple nodes of both.


There are some definite disadvantages: 
* There is a class of bugs that you can miss doing this. (namely: "I mangled 
the data on the way into the data store, and then reverse-mangled it again on 
the way back out so it looks fine, even though it is bad in the db") I assume 
that many of us have tested storage code in the past, and so we've thought 
about this trade-off. In this particular environment, where it's 
expensive/tricky to do independent testing of the storage code, I think this is 
the right trade off.
* The data loading scripts cannot be re-used between languages. I think this 
will be a pretty small relative cost compared to the cost of writing the IO in 
multiple languages, so it shouldn't matter too much. I think we'll save more 
time in not needing to use external tools for loading data.
* Read-only or write-only data stores - in this case, we'll either need to 
default to the old plan, or implement data loading or verification using beam
* This assumes the data store support parallelism - in the case where the read 
or write cannot be split, we probably should limit the amount of data we 
process in the tests to what we can reasonably do on a single worker anyway.
* It's harder to debug when this fails - I agree, and part of what I hope to 
invest a little time in as I go forward is to make it easier to determine what 
the actual failure is. Presumably folks debugging a particular IO's failures 
have tools to look at that IO and will be able to quickly determine if it's 
failing on the read or write.
* As with the previously before accepted proposal, we are relying on junit's 
afterClass to do cleanups. I don't have a good answer for this - if it proves 
to be a problem, we can investigate. 
* This focuses the test exclusively on reading and writing. To address this, if 
we wanted to write other types of tests, they could either piggy back off the 
writeThenRead test, or it might be that they should be restricted to smaller 
data sets and they should be tested independently from this test and simply 
write their own data to the data store.

There are some really nice advantages:
* The test ends up being pretty simple and elegant.
* We have no external dependencies
* Read and write occurs the bare minimum number of times
* I believe we'll be able to create shared PTransforms for generating test data 
& validating test data.


[0] Past discussion of IT guidelines - 
https://lists.apache.org/thread.html/a8ea2507aee4a849cbb6cd7f3ae23fc8b47d447bd553fa01d6da6348@%3Cdev.beam.apache.org%3E
[1] Current data loading for IT guidelines - 
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m
[2] Example of writeThenRead test - 
https://github.com/ssisk/beam/blob/jdbc-it-perf/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOIT.java#L147



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Created] (BEAM-1799) IO ITs: simplify data loading design pattern

Reply via email to