[
https://issues.apache.org/jira/browse/HDDS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881950#comment-16881950
]
Elek, Marton commented on HDDS-1554:
------------------------------------
Thank you very much for working on this, Eric Yang. I am very interested in
having some real / useful disk failure injection tests.
First of all, let me clarify: I am not against this patch. It introduces a new
way to execute tests, but doesn't modify any of the existing parts of the
build, therefore it's a good way to experiment with new approaches. But I
would like to share why I think that this approach introduces new problems
(even if my advice is ignored).
There are two main problems which should be solved when we create docker-based
acceptance tests:
1. How to start / stop / restart docker-based pseudo-clusters (manage the
environment)
2. How to execute the tests and do the assertions.
In the existing smoketest framework, (1.) is handled by shell scripts and (2.)
is handled by robotframework.
In this approach, (1.) is handled by maven plugins and/or java code and (2.) is
handled by java code.
I have multiple problems with this approach:
* There is no clear separation between the two areas, which makes it
impossible to run the same tests in any other environment. As an example: the
existing smoketests can be easily executed in kubernetes.
* Personally I don't think that java is the most effective way to maintain the
execution of external applications. For example, the following code can be
replaced with one or two lines of shell code (something like
`docker-compose -f "$COMPOSE_FILE" up -d >> "$LOG" 2>&1`):
{code:java}
private synchronized void startCluster(String mode) throws IOException,
    InterruptedException {
  String relPath = getClass().getProtectionDomain().getCodeSource()
      .getLocation().getFile();
  File log = new File(relPath + "../test-dir/docker-compose-up.txt");
  String composeFile = relPath + "../../target/docker-compose.yaml";
  if (mode.equals(READ_ONLY)) {
    composeFile = relPath + "../../target/docker-compose-read-only.yaml";
  }
  ProcessBuilder pb =
      new ProcessBuilder("docker-compose", "-f", composeFile, "up");
  pb.redirectErrorStream(true);
  pb.redirectOutput(Redirect.appendTo(log));
  Process p = pb.start();
  assert pb.redirectInput() == Redirect.PIPE;
  assert pb.redirectOutput().file() == log;
  assert p.getInputStream().read() == -1;
  p.waitFor(30L, TimeUnit.SECONDS);
}
{code}
I think all the mentioned goals can be achieved with the existing smoketest
framework (see my uploaded PR for an example). Both external volume management
and asserting on exceptions can be achieved.
Using the existing smoketest approach would make the maintenance of the tests
easier. And a big part of the code introduced here (docker-compose up, waiting
for safe mode, etc.) is already implemented there.
A.) Using `@FixMethodOrder(MethodSorters.NAME_ASCENDING)` is especially
error-prone and shows that we use unit tests in a way which is different from
the normal usage (usually the @Test methods should be independent; see the
@Before/@After sketch under point P below).
I know that JUnit is a good hammer, but...
B.) The current tests use the approach of external mounts to reuse the /data
directory. It introduces new problems (see the workarounds with the id/UID).
All of them can be avoided by using internal docker volumes, which makes it
possible to achieve the same without additional workarounds (ITDiskReadOnly
doesn't require direct access to the volume; ITDiskCorruption.addCorruption
can be replaced with a simple shell execution like `echo "asd" > ../db`).
C.) The unit test files contain a lot of code duplication inside and outside
the cluster.
* For example, ITDiskReadOnly.startCluster and ITDiskReadOnly.stopCluster are
almost the same.
* The logic of waiting for the safe mode is duplicated in all of the
subprojects.
D.) I would use JUnit assertions everywhere. For me, combining plain java
asserts and junit assertions is confusing (ITDiskReadOnly.startCluster/stopCluster).
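As a sketch, one of the `assert` lines from the quoted startCluster rewritten
as a JUnit assertion (a plain java assert is silently skipped unless the JVM
runs with -ea, while a JUnit assertion always runs and reports a message):
{code:java}
// Sketch only: equivalent of "assert pb.redirectOutput().file() == log"
// as a JUnit assertion (using equals() instead of reference equality).
Assert.assertEquals("stdout should be redirected to the log file",
    log, pb.redirectOutput().file());
{code}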
E.) I would use AutoCloseable try-with-resources blocks (or at least finally
blocks) to close opened resources. It would help to define the boundaries
where the resources are used. For example, in `ITDiskReadWrite.testStatFile`:
{code:java}
FileSystem fs = FileSystem.get(ozoneConf);
...
fs.close();
Assert.assertTrue("Create file failed", fs.exists(dest));
{code}
fs seems to be used after the close.
(try/catch is used sometimes but not everywhere)
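A minimal sketch of the same check with try-with-resources (assuming the same
ozoneConf and dest as in the quoted test):
{code:java}
// Sketch only: the FileSystem is closed automatically at the end of the
// block, so it cannot accidentally be used after close.
try (FileSystem fs = FileSystem.get(ozoneConf)) {
  Assert.assertTrue("Create file failed", fs.exists(dest));
}
{code}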
F.) We don't need to catch `Exception` and `fail` as that is already handled
by the JUnit framework. Just remove the try and catch block:
{code:java}
@Test
public void testUpload() {
  LOG.info("Performing upload test.");
  try {
    OzoneConfiguration ozoneConf = new OzoneConfiguration();
    FileSystem fs = FileSystem.get(ozoneConf);
    Path dest = new Path("/test");
    FSDataOutputStream out = fs.create(dest);
    byte[] b = "Hello world\n".getBytes();
    out.write(b);
    out.close();
    fs.close();
    Assert.assertTrue("Upload failed", fs.exists(dest));
  } catch (Exception e) {
    LOG.error("Upload test failed: ", e);
    fail("Upload test failed.");
  }
}
{code}
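A minimal sketch of the same test without the catch-and-fail wrapper (and with
try-with-resources, as in point E); if any call throws, JUnit reports the test
as failed anyway:
{code:java}
// Sketch only: no try/catch needed, JUnit fails the test on any
// uncaught exception and prints the stack trace.
@Test
public void testUpload() throws Exception {
  OzoneConfiguration ozoneConf = new OzoneConfiguration();
  try (FileSystem fs = FileSystem.get(ozoneConf)) {
    Path dest = new Path("/test");
    try (FSDataOutputStream out = fs.create(dest)) {
      out.write("Hello world\n".getBytes());
    }
    Assert.assertTrue("Upload failed", fs.exists(dest));
  }
}
{code}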
G.) You don't need to extend TestCase with JUnit 4; the @Test annotations are
enough.
H.) There are two kinds of configuration handling: the environment variable
based (usually with the help of docker-config) and the file based (where real
xml files are mounted). The two of them are mixed, which is confusing. For
example, in the read-write tests:
{code:yaml}
om:
  image: ${docker.image}
  user: ${UID}:${GID}
  ports:
    - 9862:9862
    - 9874:9874
  environment:
    ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
  env_file:
    - ./docker-config
  command: ["/opt/hadoop/bin/ozone","om"]
  volumes:
    - ${project.build.directory}/core-site.xml:/etc/hadoop/core-site.xml
    - ${project.build.directory}/ozone-site.xml:/etc/hadoop/ozone-site.xml
    - /etc/passwd:/etc/passwd:ro
{code}
It would be better to use just one... (With the current smoketests you don't
need to mount xml files, as all the code runs inside the cluster.)
I.) We don't need to duplicate the same configuration keys multiple times
(docker-config-dn1, docker-config-dn2, docker-config-dn3). The common keys can
be loaded from one docker-config file and the special keys (container ipc and
ratis ipc ports) can be adjusted directly from the docker-compose file
(similar to the environment variable ENSURE_OM_INITIALIZED, which is defined
only for the OM).
J.) There are RAT violations in some of the ozone-site.xml files.
K.) In the read-only test the safe mode is checked only at the second start:
{code:java}
/**
 * Test if scm would exit safe mode on read-only disk.
 * @throws IOException
 */
@Test
public void test1WaitForSafeMode() throws InterruptedException,
    IOException {
  startCluster(READ_WRITE);
  stopCluster(READ_WRITE);
  startCluster(READ_ONLY);
{code}
How can we be sure that the first cluster is started properly before shutting
it down? Or do we need a real start, or just the initialization of the om /
scm? An explicit readiness assertion would help, as in the sketch below.
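A sketch of such an assertion, assuming a hypothetical waitForSafeModeExit
helper that polls the scm until it leaves safe mode or times out:
{code:java}
// Sketch only: waitForSafeModeExit() is a hypothetical helper; the test
// fails fast if the first start never actually completed.
startCluster(READ_WRITE);
Assert.assertTrue("SCM did not leave safe mode after the first start",
    waitForSafeModeExit(30, TimeUnit.SECONDS));
stopCluster(READ_WRITE);
startCluster(READ_ONLY);
{code}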
L.) It's not clear to me what the goal of these tests is.
* ITDiskReadWrite: It is already tested by the main smoketest suite. The only
difference here is that we run the client outside the cluster + port forward.
I don't think it's a main use-case, but if it is, we can test it with the
existing tests, because it's not related to the disk failures, it's related to
the network settings.
* ITDiskCorruption: I am not sure what the expected behavior is here. I think
it depends on the caching of the file system + the caching of rocksdb. The
corrupted files may not be reread until a certain time. It's hard to do any
assertion if the expectation can't be defined well.
* ITDiskReadOnly: it checks the use case when the disk is read only, which is
fine. But I think it's more interesting to test random disk failures / random
disk slowness. I know that other followup jiras are opened; I think they can
be more realistic.
M.) ITDiskCorruption.listFiles can be simplified to:
{code:java}
f.listFiles();
{code}
Instead of:
{code:java}
return f.listFiles(new FilenameFilter() {
  public boolean accept(File dir, String name) {
    return true;
  }
});
{code}
N.) ITDiskCorruption.addCorruption
Are you sure that File.listFiles is recursive? It may return an empty list of
files, as the real files are in `metadata/scm-container.db`, not in the
`scm-container.db` directory. Can you please add an assertion that at least
one file was corrupted?
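A minimal sketch of a recursive variant with such an assertion (corrupt(path)
and metadataDir are hypothetical stand-ins for the existing corruption logic
and the mounted metadata directory):
{code:java}
// Sketch only: File.listFiles is not recursive, so walk the whole tree
// with java.nio.file.Files.walk and assert that at least one file was
// actually corrupted.
List<Path> files;
try (Stream<Path> walk = Files.walk(metadataDir.toPath())) {
  files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
}
for (Path path : files) {
  corrupt(path);
}
Assert.assertFalse("No files were corrupted", files.isEmpty());
{code}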
O.) In ITDiskReadOnly the clusters are not stopped in a finally block. In case
of an exception the cluster may remain in a running state.
P.) @Before/@After can help to start/stop the cluster before/after each test
in ITDiskReadOnly, as in the sketch below.
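A minimal sketch of that pattern (startCluster/stopCluster as in the patch):
{code:java}
// Sketch only: JUnit runs setUp before and tearDown after every @Test
// method, so the cluster is stopped even when a test throws, and the
// tests no longer depend on their execution order.
@Before
public void setUp() throws Exception {
  startCluster(READ_ONLY);
}

@After
public void tearDown() throws Exception {
  stopCluster(READ_ONLY);
}
{code}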
Q.) In ITDiskCorruption:
{code:java}
@Test
public void testUpload() {
  LOG.info("Performing upload test.");
  try {
    OzoneConfiguration ozoneConf = new OzoneConfiguration();
    FileSystem fs = FileSystem.get(ozoneConf);
    Path dest = new Path("/test");
    FSDataOutputStream out = fs.create(dest);
    byte[] b = "Hello world\n".getBytes();
    out.write(b);
    out.close();
    fs.close();
    Assert.assertTrue("Upload failed", fs.exists(dest));
    fail("Upload test should have failed.");
  } catch (IOException e) {
    Assert.assertTrue("Upload test passed.", true);
  } catch (Exception e) {
    LOG.error("Upload test failed: ", e);
    fail("Upload test failed.");
  }
}
{code}
* Here we have no information about which call the checked IOException comes
from. It could be an exception from the out.write or an exception from the
fs.exists(dest). Unfortunately, it's not clear what the expected behavior is.
* We don't need Assert.assertTrue if we expect an exception.
* The exception type can be asserted with the JUnit ExpectedException rule
[https://junit.org/junit4/javadoc/4.12/org/junit/rules/ExpectedException.html]
with less code, as in the sketch below.
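A minimal sketch with the rule (assuming the write itself is the call that is
expected to fail against the corrupted volume):
{code:java}
// Sketch only: the rule fails the test unless an IOException is thrown
// after expect() is called; placing expect() right before the write
// narrows down which call is expected to fail.
@Rule
public ExpectedException thrown = ExpectedException.none();

@Test
public void testUpload() throws Exception {
  try (FileSystem fs = FileSystem.get(new OzoneConfiguration());
       FSDataOutputStream out = fs.create(new Path("/test"))) {
    thrown.expect(IOException.class);
    out.write("Hello world\n".getBytes());
  }
}
{code}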
> Create disk tests for fault injection test
> ------------------------------------------
>
> Key: HDDS-1554
> URL: https://issues.apache.org/jira/browse/HDDS-1554
> Project: Hadoop Distributed Data Store
> Issue Type: Improvement
> Components: build
> Reporter: Eric Yang
> Assignee: Eric Yang
> Priority: Major
> Labels: pull-request-available
> Attachments: HDDS-1554.001.patch, HDDS-1554.002.patch,
> HDDS-1554.003.patch, HDDS-1554.004.patch, HDDS-1554.005.patch,
> HDDS-1554.006.patch, HDDS-1554.007.patch, HDDS-1554.008.patch,
> HDDS-1554.009.patch, HDDS-1554.010.patch, HDDS-1554.011.patch
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> The current plan for fault injection disk tests is:
> # Scenario 1 - Read/Write test
> ## Run docker-compose to bring up a cluster
> ## Initialize scm and om
> ## Upload data to Ozone cluster
> ## Verify data is correct
> ## Shutdown cluster
> # Scenario 2 - Read/Only test
> ## Repeat Scenario 1
> ## Mount data disk as read only
> ## Try to write data to Ozone cluster
> ## Validate error message is correct
> ## Shutdown cluster
> # Scenario 3 - Corruption test
> ## Repeat Scenario 2
> ## Shutdown cluster
> ## Modify data disk data
> ## Restart cluster
> ## Validate error message for read from corrupted data
> ## Validate error message for write to corrupted volume