[ 
https://issues.apache.org/jira/browse/HDDS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881950#comment-16881950
 ] 

Elek, Marton commented on HDDS-1554:
------------------------------------

Thank you very much for working on this, Eric Yang. I am very interested in 
having some real / useful disk failure injection tests.

First of all, let me clarify: I am not against this patch. It introduces a new 
way to execute tests but doesn't modify any existing part of the build, so it's 
a good way to experiment with new approaches. Still, I would like to share why 
I think this approach introduces new problems (even if my advice ends up being 
ignored).

There are two main problems to solve when we create docker-based acceptance 
tests:

1. How to start / stop / restart docker-based pseudo-clusters (manage the 
environment)
2. Execute tests and do the assertions.

In the existing smoketest framework, (1.) is handled by shell scripts and (2.) 
is handled by robotframework.

In this approach, (1.) is handled by maven plugins and/or java code and (2.) is 
handled by java code.

I have multiple problems with this approach:
 * There is no clear separation between the two areas, which makes it 
impossible to run the same tests in any other environment. By contrast, the 
existing smoketests can easily be executed in kubernetes.

 * Personally, I don't think java is the most effective way to manage the 
execution of external applications. For example, the following code could be 
replaced with one or two lines of shell:

{code:java}
   private synchronized void startCluster(String mode) throws IOException,
      InterruptedException {
    String relPath = getClass().getProtectionDomain().getCodeSource()
        .getLocation().getFile();
    File log = new File(relPath+"../test-dir/docker-compose-up.txt");
    String composeFile = relPath+"../../target/docker-compose.yaml";
    if (mode.equals(READ_ONLY)) {
      composeFile = relPath+"../../target/docker-compose-read-only.yaml";
    }
    ProcessBuilder pb =
        new ProcessBuilder("docker-compose", "-f", composeFile, "up");
    pb.redirectErrorStream(true);
    pb.redirectOutput(Redirect.appendTo(log));
    Process p = pb.start();
    assert pb.redirectInput() == Redirect.PIPE;
    assert pb.redirectOutput().file() == log;
    assert p.getInputStream().read() == -1;
    p.waitFor(30L, TimeUnit.SECONDS);
  }
{code}
I think all the mentioned goals can be achieved with the existing smoketest 
framework (see my uploaded PR for an example). External volume management and 
asserting on exceptions can both be achieved.

Using the existing smoketest approach would make the maintenance of the tests 
easier.

And a big part of the code introduced here (docker-compose up, waiting for safe 
mode, etc.) is already implemented there.

Using `@FixMethodOrder(MethodSorters.NAME_ASCENDING)` is especially error-prone 
and shows that we are using unit tests in a way that differs from normal usage 
(usually, @Test methods should be independent).

I know that Junit is a good hammer, but...

B.) The current tests use external mounts to reuse the /data directory. This 
introduces new problems (see the workarounds with the id/UID). All of them can 
be avoided by using internal docker volumes, which makes it possible to achieve 
the same result without additional workarounds (ITDiskReadOnly doesn't require 
direct access to the volume, and ITDiskCorruption.addCorruption can be replaced 
with a simple shell execution such as `echo "asd" > ../db`).
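As an illustration only (a hypothetical fragment in the compose style of the 
quoted file; the service and volume names are made up), an internal named 
volume keeps the data inside docker, so no host-side UID/GID mapping is needed:
{code:yaml}
# Hypothetical fragment: a named docker volume instead of a host mount.
# Corruption / inspection can then be done from inside a container.
datanode:
  image: ${docker.image}
  volumes:
    - datanode-data:/data
{code}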

C.) The unit test files contain a lot of code duplication, both inside and 
outside the cluster.
 * For example, ITDiskReadOnly.startCluster and ITDiskReadOnly.stopCluster are 
almost the same.
 * The logic of waiting for the safe mode is duplicated in all of the 
subprojects.

D.) I would use JUnit assertions everywhere. For me, combining plain java 
asserts and junit assertions is confusing (ITDiskReadOnly.startCluster/stopCluster).

E.) I would use AutoCloseable try-with-resources blocks (or at least finally 
blocks) to close opened resources. It would help to define the boundaries where 
the resources are used. For example, in `ITDiskReadWrite.testStatFile`:
{code:java}
  FileSystem fs = FileSystem.get(ozoneConf);
  ...
  fs.close();
  Assert.assertTrue("Create file failed", fs.exists(dest));
{code}
fs seems to be used after the close.

(try/catch is used sometimes but not everywhere.)
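A possible rewrite (just a sketch of the pattern, reusing the names from the 
snippet above): do the assertion while the FileSystem is still open, and let 
try-with-resources perform the close.
{code:java}
// FileSystem is Closeable, so try-with-resources closes it automatically;
// the assertion runs before the close, not after it.
try (FileSystem fs = FileSystem.get(ozoneConf)) {
  Assert.assertTrue("Create file failed", fs.exists(dest));
}
{code}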

F.) We don't need to catch `Exception` and `fail`, as this is already handled 
by the JUnit framework. Just remove the try/catch block:
{code:java}
  @Test
  public void testUpload() {
    LOG.info("Performing upload test.");
    try {
      OzoneConfiguration ozoneConf = new OzoneConfiguration();
      FileSystem fs = FileSystem.get(ozoneConf);
      Path dest = new Path("/test");
      FSDataOutputStream out = fs.create(dest);
      byte[] b = "Hello world\n".getBytes();
      out.write(b);
      out.close();
      fs.close();
      Assert.assertTrue("Upload failed", fs.exists(dest));
    } catch(Exception e) {
      LOG.error("Upload test failed: ", e);
      fail("Upload test failed.");
    }
  }
{code}
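A sketch of the same test without the catch block (and with the exists check 
moved before the close), relying on JUnit to report any thrown exception as a 
test failure:
{code:java}
  @Test
  public void testUpload() throws Exception {
    LOG.info("Performing upload test.");
    OzoneConfiguration ozoneConf = new OzoneConfiguration();
    // try-with-resources closes fs and out even if an assertion fails
    try (FileSystem fs = FileSystem.get(ozoneConf)) {
      Path dest = new Path("/test");
      try (FSDataOutputStream out = fs.create(dest)) {
        out.write("Hello world\n".getBytes());
      }
      Assert.assertTrue("Upload failed", fs.exists(dest));
    }
  }
{code}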
G.) You don't need to extend TestCase with JUnit4.

H.) There are two kinds of configuration handling: the environment variable 
based one (usually with the help of docker-config) and the file based one 
(where real xml files are mounted). The two are mixed, which is confusing. For 
example, in the read-write tests:
{code:yaml}
   om:
      image: ${docker.image}
      user: ${UID}:${GID}
      ports:
         - 9862:9862
         - 9874:9874
      environment:
         ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
      env_file:
          - ./docker-config
      command: ["/opt/hadoop/bin/ozone","om"]
      volumes:
        - ${project.build.directory}/core-site.xml:/etc/hadoop/core-site.xml
        - ${project.build.directory}/ozone-site.xml:/etc/hadoop/ozone-site.xml
        - /etc/passwd:/etc/passwd:ro
{code}
It would be better to use just one... (With the current smoketests you don't 
need to mount xml files, as all the code runs inside the cluster.)

I.) We don't need to duplicate the same configuration keys multiple times 
(docker-config-dn1, docker-config-dn2, docker-config-dn3). The common keys can 
be loaded from one docker-config file, and the special keys (container ipc and 
ratis ipc ports) can be adjusted directly from the docker-compose file (similar 
to the environment variable ENSURE_OM_INITIALIZED, which is defined only for 
the OM).
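A hypothetical fragment (the key names and port values here are illustrative 
only, not taken from the patch) showing the common keys loaded once from 
docker-config while only the node-specific values are set inline:
{code:yaml}
datanode_1:
  image: ${docker.image}
  env_file:
    - ./docker-config                # common keys, defined once
  environment:
    # node-specific overrides only (illustrative key names)
    OZONE-SITE.XML_dfs.container.ipc: 9859
    OZONE-SITE.XML_dfs.container.ratis.ipc: 9858
{code}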

O.) There are RAT violations in some of the ozone-site.xml files.

P.) In the read-only test, the safe mode is checked only on the second start.
{code:java}
  /**
   * Test if scm would exit safe mode on read-only disk.
   * @throws IOException
   */
  @Test
  public void test1WaitForSafeMode() throws InterruptedException,
      IOException {
    startCluster(READ_WRITE);
    stopCluster(READ_WRITE);
    startCluster(READ_ONLY);
{code}
How can we be sure that the first cluster started properly before shutting it 
down? And do we need a real start, or just the initialization of the om / scm?

Q.) It's not clear to me what the goal of these tests is.
 * ITDiskReadWrite: This is already tested by the main smoketest suite. The 
only difference here is that we run the client outside the cluster + port 
forward. I don't think it's a main use-case, but if it is, we can test it with 
the existing tests, because it's not related to disk failures; it's related to 
the network settings.

 * ITDiskCorruption: I am not sure what the expected behavior is here. I think 
it depends on the caching of the file system + the caching of rocksdb. The 
corrupted files may not be reread until a certain time. It's hard to do any 
assertion if the expectation can't be defined well.

 * ITDiskReadOnly: It checks the use case when the disk is read only, which is 
fine. But I think it's more interesting to test random disk failures / random 
disk slowness. I know that other followup jiras are opened; I think they can be 
more realistic.

J.) ITDiskCorruption.listFiles can be simplified:
{code:java}
  f.listFiles();
{code}
Instead of:
{code:java}
    return f.listFiles(new FilenameFilter() {
      public boolean accept(File dir, String name) {
        return true;
      }
    });
{code}
K.) ITDiskCorruption.addCorruption

Are you sure that File.listFiles is recursive? It may return an empty list of 
files, as the real files are in `metadata/scm-container.db`, not directly in 
the `scm-container.db` directory. Can you please add an assertion that at 
least one file has actually been corrupted?
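File.listFiles only lists the direct children of a directory, so a recursive 
walk plus the requested assertion could look like this self-contained sketch 
(java.nio only; the directory layout here is illustrative):
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CorruptionSketch {

  // Recursive replacement for File.listFiles(): collects every regular
  // file under root, including files in nested directories.
  static List<Path> filesUnder(Path root) throws IOException {
    try (Stream<Path> s = Files.walk(root)) {
      return s.filter(Files::isRegularFile).collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    // Illustrative layout: the real files live one level deeper.
    Path root = Files.createTempDirectory("scm");
    Files.createDirectories(root.resolve("metadata/scm-container.db"));
    Files.write(root.resolve("metadata/scm-container.db/CURRENT"),
        "garbage".getBytes());

    List<Path> files = filesUnder(root);
    // Assert that there is at least one file to corrupt.
    if (files.isEmpty()) {
      throw new AssertionError("no files found to corrupt");
    }
    System.out.println(files.size());
  }
}
{code}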

L.) In ITDiskReadOnly, the clusters are not stopped in a finally block. In 
case of an exception, the cluster may remain in a running state.

M.) @Before/@After can help to start/stop the cluster before/after each test 
in ITDiskReadOnly.
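A sketch of the lifecycle methods (reusing the startCluster/stopCluster names 
from the patch); because @After runs even when a test throws, this also 
addresses the cleanup concern in L.):
{code:java}
  // Start/stop once per test; @After is guaranteed to run, so the
  // cluster cannot remain running after a failed test.
  @Before
  public void setUp() throws Exception {
    startCluster(READ_ONLY);
  }

  @After
  public void tearDown() throws Exception {
    stopCluster(READ_ONLY);
  }
{code}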

N.) In ITDiskCorruption:
{code:java}
  @Test
  public void testUpload() {
    LOG.info("Performing upload test.");
    try {
      OzoneConfiguration ozoneConf = new OzoneConfiguration();
      FileSystem fs = FileSystem.get(ozoneConf);
      Path dest = new Path("/test");
      FSDataOutputStream out = fs.create(dest);
      byte[] b = "Hello world\n".getBytes();
      out.write(b);
      out.close();
      fs.close();
      Assert.assertTrue("Upload failed", fs.exists(dest));
      fail("Upload test should have failed.");
    } catch (IOException e) {
      Assert.assertTrue("Upload test passed.", true);
    } catch (Exception e) {
      LOG.error("Upload test failed: ", e);
      fail("Upload test failed.");
    }
  }
{code}
 * Here we have no information about which IOException is checked. It could be 
an exception from out.write or one from fs.exists(dest). Unfortunately, it's 
not clear what the expected behavior is.

 * We don't need Assert.assertTrue if we expect an exception.

 * The exception type can be asserted via the JUnit rule 
[https://junit.org/junit4/javadoc/4.12/org/junit/rules/ExpectedException.html] 
with less code.
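With the rule, the same expectation could be a sketch like this (assuming the 
write is the call that should fail on the corrupted disk):
{code:java}
  @Rule
  public ExpectedException thrown = ExpectedException.none();

  @Test
  public void testUpload() throws Exception {
    // Declare the expected failure up front; JUnit fails the test
    // if no IOException is thrown by the body.
    thrown.expect(IOException.class);
    try (FileSystem fs = FileSystem.get(new OzoneConfiguration());
         FSDataOutputStream out = fs.create(new Path("/test"))) {
      out.write("Hello world\n".getBytes());
    }
  }
{code}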

> Create disk tests for fault injection test
> ------------------------------------------
>
>                 Key: HDDS-1554
>                 URL: https://issues.apache.org/jira/browse/HDDS-1554
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>          Components: build
>            Reporter: Eric Yang
>            Assignee: Eric Yang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HDDS-1554.001.patch, HDDS-1554.002.patch, 
> HDDS-1554.003.patch, HDDS-1554.004.patch, HDDS-1554.005.patch, 
> HDDS-1554.006.patch, HDDS-1554.007.patch, HDDS-1554.008.patch, 
> HDDS-1554.009.patch, HDDS-1554.010.patch, HDDS-1554.011.patch
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current plan for fault injection disk tests are:
>  # Scenario 1 - Read/Write test
>  ## Run docker-compose to bring up a cluster
>  ## Initialize scm and om
>  ## Upload data to Ozone cluster
>  ## Verify data is correct
>  ## Shutdown cluster
>  # Scenario 2 - Read/Only test
>  ## Repeat Scenario 1
>  ## Mount data disk as read only
>  ## Try to write data to Ozone cluster
>  ## Validate error message is correct
>  ## Shutdown cluster
>  # Scenario 3 - Corruption test
>  ## Repeat Scenario 2
>  ## Shutdown cluster
>  ## Modify data disk data
>  ## Restart cluster
>  ## Validate error message for read from corrupted data
>  ## Validate error message for write to corrupted volume



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
