[
https://issues.apache.org/jira/browse/BEAM-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ismaël Mejía updated BEAM-1799:
-------------------------------
Component/s: (was: z-do-not-use-sdk-java-extensions)
testing
> IO ITs: simplify data loading design pattern
> --------------------------------------------
>
> Key: BEAM-1799
> URL: https://issues.apache.org/jira/browse/BEAM-1799
> Project: Beam
> Issue Type: Improvement
> Components: testing
> Reporter: Stephen Sisk
> Assignee: Stephen Sisk
> Priority: Major
> Fix For: 2.0.0
>
>
> Problems with the current solution
> =============================
> * The IO IT data loading guidelines [1] are complicated & aren't "native
> junit" - you end up working around junit rather than working with it (I was a
> part of defining them[0], so I critique the rules with (heart) )
> * Doing data loading using external tools means we have additional
> dependencies outside of the tests themselves. If we *must* use them, it's
> worth the time, but I think we have another option. I find it especially
> amusing since the data loading tools are things like ycsb which themselves
> are benchmarking tools ("I heard you like performance benchmarking, so here's
> a performance benchmarking tool to use before you use your performance
> benchmarking tool"), and really are just solving the problem of "I want to
> write data in parallel to this data store" - that sounds familiar :)
> The current guidelines also don't scale well to performance tests:
> * We want to write medium sized data for perf tests - doing data loading
> using external tools means a minimum of 2 reads & writes. For the small scale
> ITs, that's not a big deal, but for the large scale tests, if we assume we're
> working with a fixed budget, more data transferred/stored ~= fewer tests.
> * If you want to verify that large data sets are correct (or create them),
> you need to actually read and write those large data sets - currently, the
> plan is that data loading/testing infrastructure only runs on one machine, so
> those operations are going to be slow. We aren't working with actual large
> data sets, so it won't take too long, but it's always nice to have faster
> tests.
> New Proposed Solution
> ===================
> Instead of trying to test read and write separately, the test should be a
> "write, then read back what you just wrote", all using the IO under test. To
> support scenarios like "I want to run my read test repeatedly without
> re-writing the data", tests would add flags for "skipCleanUp" and
> "useExistingData".
> Check out the example I wrote up [2]
> I didn't want to invest much time on this before I opened a Jira/talked to
> others, so I plan on expanding on this a bit more/formalizing it in the
> testing docs.
> A reminder of some context:
> * The goals for the ITs & Perf tests are that they are *not* intended to be
> the place where we exercise specific scenarios. Instead, they are tripwires
> designed to find problems with code *we already believe works* (as proven by
> the unit tests) when it runs against real data store instances/runners using
> multiple nodes of both.
> There are some definite disadvantages:
> * There is a class of bugs that you can miss doing this. (namely: "I mangled
> the data on the way into the data store, and then reverse-mangled it again on
> the way back out so it looks fine, even though it is bad in the db") I assume
> that many of us have tested storage code in the past, and so we've thought
> about this trade-off. In this particular environment, where it's
> expensive/tricky to do independent testing of the storage code, I think this
> is the right trade off.
> * The data loading scripts cannot be re-used between languages. I think this
> will be a pretty small relative cost compared to the cost of writing the IO
> in multiple languages, so it shouldn't matter too much. I think we'll save
> more time in not needing to use external tools for loading data.
> * Read-only or write-only data stores - in this case, we'll either need to
> default to the old plan, or implement data loading or verification using beam
> * This assumes the data store support parallelism - in the case where the
> read or write cannot be split, we probably should limit the amount of data we
> process in the tests to what we can reasonably do on a single worker anyway.
> * It's harder to debug when this fails - I agree, and part of what I hope to
> invest a little time in as I go forward is to make it easier to determine
> what the actual failure is. Presumably folks debugging a particular IO's
> failures have tools to look at that IO and will be able to quickly determine
> if it's failing on the read or write.
> * As with the previously before accepted proposal, we are relying on junit's
> afterClass to do cleanups. I don't have a good answer for this - if it proves
> to be a problem, we can investigate.
> * This focuses the test exclusively on reading and writing. To address this,
> if we wanted to write other types of tests, they could either piggy back off
> the writeThenRead test, or it might be that they should be restricted to
> smaller data sets and they should be tested independently from this test and
> simply write their own data to the data store.
> There are some really nice advantages:
> * The test ends up being pretty simple and elegant.
> * We have no external dependencies
> * Read and write occurs the bare minimum number of times
> * I believe we'll be able to create shared PTransforms for generating test
> data & validating test data.
> [0] Past discussion of IT guidelines -
> https://lists.apache.org/thread.html/a8ea2507aee4a849cbb6cd7f3ae23fc8b47d447bd553fa01d6da6348@%3Cdev.beam.apache.org%3E
> [1] Current data loading for IT guidelines -
> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m
> [2] Example of writeThenRead test -
> https://github.com/ssisk/beam/blob/jdbc-it-perf/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOIT.java#L147
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)