[ 
https://issues.apache.org/jira/browse/HDDS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882358#comment-16882358
 ] 

Eric Yang commented on HDDS-1554:
---------------------------------

[~elek] Thank you for the thorough review, and here are my answers:

{quote}There is no clear separation between the two areas and it makes it 
impossible to run the same tests in any other environment. As an example: 
existing smoketests can be easily executed in kubernetes.{quote}

The startCluster/stopCluster methods only use docker-compose to start a 
distributed cluster at this time. They can be extended later to launch an 
Ozone cluster in other environments, such as a YARN service, to keep the 
scope of the current feature set in check.

The smoke tests only run in a closed environment. There is no exposure to an 
outside network, and the Ozone CLI client works inside the docker container, 
which prevents the tests from interacting with the Ozone cluster from an 
external network. Embedding the Robot Framework in the docker container poses 
another risk: inseparable test artifacts in the image that cannot be removed 
later. Test artifacts make up 400+ MB of the current docker image, which is 
100% more bloated than necessary. Even if the test artifacts are separated out 
later, we have no way to be sure that the docker image can function properly 
without them, because the tests only work in docker containers.

The fault injection tests map the ports manually to the external network, so 
today they can provide better coverage for testing the docker network against 
an external environment. The smoke tests could be modified to work with an 
external network, but that would effectively double the installation of the 
test tool chain on the host system and double the shell script configuration 
templating needed to run the Robot Framework on the host. Given that maven is 
already a dev tool chain set in stone for Hadoop, and with a minimalist 
mindset to make the best use of that tool chain, I chose to stay in the maven 
environment and let the existing tool chain do the style and error checking 
for these tests.

I respect your approach of writing the smoke tests in shell + Robot Framework, 
but the fault injection tests can perform more efficiently with help from the 
Java tool chain, imho.

{quote}Using `@FixMethodOrder(MethodSorters.NAME_ASCENDING)` is especially 
error-prone and shows that we use unit test in a way which is different from 
the normal usage (usually the @Test methods should be independent){quote}

There is nothing wrong with using this feature; it was implemented based on 
popular request. It is developers who don't know how to use the feature 
correctly who create problems for themselves. Since you don't like this 
annotation, I will group the entire flow into one test case, along the lines 
of the sketch below.
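
A hedged sketch of what grouping the flow into a single ordered test case 
might look like (class and step names are illustrative, not the actual patch):

{code:java}
import org.junit.Test;

public class ITDiskReadWrite {

  /** One test method drives the whole ordered flow, so no
   *  @FixMethodOrder annotation is needed. */
  @Test
  public void testReadWriteFlow() throws Exception {
    uploadData();
    verifyData();
  }

  private void uploadData() throws Exception {
    // upload a key to the Ozone cluster (omitted)
  }

  private void verifyData() throws Exception {
    // read the key back and compare contents (omitted)
  }
}
{code}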

{quote}B.) The current tests uses the approach to use external mounts to reuse 
the /data directory. It introduces new problems (see the workarounds with the 
id/UID). All of them can be avoided with using internal docker volumes which 
makes it possible to achieve the same without additional workarounds 
(ITDiskReadOnly doesn't require direct access to the volume; 
ITDiskCorruption.addCorruption can be replaced with simple shell executions 
`echo "asd" > ../db`){quote}

There are two possible cases where data becomes read-only: B.1) the disk is 
mounted as read-only, or B.2) the data file is read-only. It would be nice to 
have distinct error messages that tell the system administrator either to 
adjust how the data disk is mounted, or that an error was made when copying 
data while servicing the server. The ITReadOnly test focuses on case B.1 and 
can be expanded to case B.2 if necessary. Using an internal path cannot 
clearly test case B.1.
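
For what it's worth, the two cases can even be distinguished programmatically; 
here is a minimal sketch using standard java.nio (the data path is 
hypothetical):

{code:java}
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadOnlyProbe {
  public static void main(String[] args) throws IOException {
    Path data = Paths.get("/data/metadata");   // hypothetical data path
    FileStore store = Files.getFileStore(data);
    if (store.isReadOnly()) {
      // case B.1: the whole volume is mounted read-only
      System.err.println("Data disk is mounted read-only; remount read-write");
    } else if (!Files.isWritable(data)) {
      // case B.2: the mount is writable but the file permissions are not
      System.err.println("Data path is not writable; check file permissions");
    }
  }
}
{code}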

{quote}C.) The unit test files contains a lot of code duplication inside and 
outside the cluster.

For example ITDiskReadOnly.startCluster and ITDiskReadOnly.stopCluster are 
almost the same.
The logic of waiting for the safe mode is duplicated in all of the 
subprojects.{quote}

Good point; I will clean this up in the next patch.

{quote}D.) I would use JUnit assertion everywhere. For me combining java and 
junit assertions are confusing (ITDiskReadOnly.startCluster/stopCluster)

E.) I would use Autoclosable try-catch blocks (or at least finally blocks) to 
close opened resources. It would help to define the boundaries where the 
resources are used.{quote}

Setup procedures are not tests. It is common to throw IOException and 
InterruptedException from a JUnit setup method to handle exceptions; that is 
the reason they are written this way.
You are correct that an autoclosable block is a nicer way to close file system 
resources, and I will make the change.
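
For example, a try-with-resources block closes the stream even when the write 
fails:

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteExample {
  public static void write(Path dest, byte[] payload) throws IOException {
    // the stream is closed automatically, even if write() throws
    try (OutputStream out = Files.newOutputStream(dest)) {
      out.write(payload);
    }
  }
}
{code}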

{quote}F.) We don't need to catch `Exceptions` and `fail` as it's already 
handled by JUnit framework. Just remove try and catch block.{quote}

I like to be able to identify the starting point of the error in the log. 
That is the reason I use try/catch and an additional string log before the 
stack trace: it helps to jump to the starting point of the error rather than 
reading the entire stack trace to locate the line number of the test case. 
Sometimes the stack trace is so long that the actual line number of the test 
case becomes hidden. I think it is more efficient for problem determination, 
but I will remove the try/catch if it looks silly to you.
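
As a possible middle ground (just a sketch, not what the current patch does), 
a JUnit TestWatcher rule could log such a marker with the failing test's name 
without wrapping every test body in try/catch:

{code:java}
import org.junit.Rule;
import org.junit.rules.TestWatcher;
import org.junit.runner.Description;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class FaultInjectionTestBase {
  private static final Logger LOG =
      LoggerFactory.getLogger(FaultInjectionTestBase.class);

  /** Logs a marker on failure so the start of the error is easy to find. */
  @Rule
  public TestWatcher watcher = new TestWatcher() {
    @Override
    protected void failed(Throwable e, Description description) {
      LOG.error(">>> Test failed: {}", description.getMethodName(), e);
    }
  };
}
{code}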

{quote}G.) You don't need to extend TestCase with JUnit4.{quote}

Ok, it was faster to have the IDE autogenerate the setUp/tearDown methods by 
extending TestCase, but this inconsequential detail can be removed.

{quote}
H.) There are two kind of configuration handling: the environment variable 
based (usually with the help of docker-config) and the file based (where real 
xml 
files are mounted). The two of them are mixed which are confusing. 
{quote}

The envtoconf.py script does not generate core-site.xml when there is nothing 
to change in core-site.xml. We now require some key/value pairs in 
core-site.xml, hence the reason to mount this resource externally may have 
faded away. I will clean this up.

{quote}I.) We don't need to duplicate the same configuration keys multiple 
times (docker-config-dn1 docker-config-dn2, docker-config-dn3). the common keys 
can be loaded from one docker-config file and the special keys (container ipc 
and ratis ipc ports) can be adjusted directly from the docker-compose (Similar 
to the environment variable ENSURE_OM_INITIALIZED which is defined only for the 
OM){quote}

This is the same point that [~arp] raised. Will clean this up.

{quote}G.) There are RAT violations in some of the ozone-site.xml files.{quote}

Will fix this.

{quote}How can we be sure that the first cluster is started properly before 
shutting it down? Or do we need a real start or just the initialization of the 
om / scm?{quote}

This is for initialization only. The logic can be smarter; I will revise it 
for the next patch.

{quote}I.) It's not clear for me what is the goal for these tests.{quote}

ITDiskReadWrite: see HDDS-1771.
ITDiskCorruption: We can define a grace period within which the disk sync 
should happen, to ensure only a minimal amount of data loss can occur. The 
test case can be improved to validate that if corruption persists for more 
than n seconds, error detection occurs; see the sketch below.
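
A rough sketch of that check, assuming a hypothetical isCorruptionDetected() 
probe that the real test would have to implement:

{code:java}
import java.util.concurrent.TimeUnit;
import static org.junit.Assert.fail;

public class DetectionWait {
  /** Polls until corruption detection fires, or fails after the grace period. */
  static void awaitDetection(long graceSeconds) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(graceSeconds);
    while (System.nanoTime() < deadline) {
      if (isCorruptionDetected()) {
        return;                       // detected within the grace period
      }
      TimeUnit.MILLISECONDS.sleep(500);
    }
    fail("Corruption not detected within " + graceSeconds + "s grace period");
  }

  private static boolean isCorruptionDetected() {
    return false;                     // hypothetical probe, to be implemented
  }
}
{code}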

{quote}ITDiskReadOnly: it checks the use case when the disk is read only, which 
is fine. But I think it's more interesting to test random disk failures / 
random disk slowness. I know that other followup jira-s are opened, I think 
they can be more realistic.{quote}

This ticket does not intend to cover every test in one patch. It is a broad 
starting point to structure the test area; it may require patience for the 
detailed tests to be developed.

{quote}J.) ITDiskCorruption.listFiles can be simplified:{quote}

This is a placeholder. There is probably a specific set of files to go after 
when refining the tests; I think the data blocks and rocksdb are good starting 
points. Do you have any suggestions other than listing all files?

{quote}K.) ITDiskCorruption.addCorruption

Are you sure that File.listFiles are recursive? It may return with an empty 
list of files, as the real files are in the `metadata/scm-container.db` not in 
the `scm-container.db` directory. Can you please add an assertion if you have 
at least one file which are corrupted?{quote}

It currently only corrupts omMetrics. I need to develop a better filter for J, 
as shown in the sketch below.
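
You are right that File.listFiles is not recursive. A hedged sketch of a 
recursive variant, with the assertion you asked for (the root path and file 
filter are illustrative):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import static org.junit.Assert.assertFalse;

public class CorruptionHelper {
  static void addCorruption(Path root) throws IOException {
    List<Path> targets;
    try (Stream<Path> walk = Files.walk(root)) {  // recursive, unlike File.listFiles
      targets = walk.filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".db")) // illustrative filter
                    .collect(Collectors.toList());
    }
    // fail fast if the filter matched nothing, per the review comment
    assertFalse("No files were corrupted", targets.isEmpty());
    for (Path p : targets) {
      Files.write(p, "garbage".getBytes(), StandardOpenOption.TRUNCATE_EXISTING);
    }
  }
}
{code}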

{quote}M.) @Before/@After can help to start/stop cluster before/after each 
tests in ITDiskReadOnly.{quote}

Will make this change where it applies.
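
A sketch of that shape (helper names reuse the illustrative 
startCluster/stopCluster pair from above):

{code:java}
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class ITDiskReadOnly {

  @Before
  public void setUp() throws Exception {
    // fresh cluster for each test; helper names are illustrative
    ClusterLifecycle.startCluster();
  }

  @After
  public void tearDown() throws Exception {
    ClusterLifecycle.stopCluster();
  }

  @Test
  public void testWriteToReadOnlyDisk() throws Exception {
    // attempt a write against a read-only mounted data disk (omitted)
  }
}
{code}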

{quote}Here we have no information about the IOException which is checked. It 
could be an exception from the out.write or an exception from the 
fs.exists(dest). Unfortunately it's not clear what is the expected 
behavior.{quote}

Proper fault detection does not seem to be in place in the application code 
yet, so it is a generic IOException at this time. It will be revised when I 
know exactly which IOException to catch.

Thanks again for the review.  Will clean up accordingly.

> Create disk tests for fault injection test
> ------------------------------------------
>
>                 Key: HDDS-1554
>                 URL: https://issues.apache.org/jira/browse/HDDS-1554
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>          Components: build
>            Reporter: Eric Yang
>            Assignee: Eric Yang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HDDS-1554.001.patch, HDDS-1554.002.patch, 
> HDDS-1554.003.patch, HDDS-1554.004.patch, HDDS-1554.005.patch, 
> HDDS-1554.006.patch, HDDS-1554.007.patch, HDDS-1554.008.patch, 
> HDDS-1554.009.patch, HDDS-1554.010.patch, HDDS-1554.011.patch
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current plan for fault injection disk tests is:
>  # Scenario 1 - Read/Write test
>  ## Run docker-compose to bring up a cluster
>  ## Initialize scm and om
>  ## Upload data to Ozone cluster
>  ## Verify data is correct
>  ## Shutdown cluster
>  # Scenario 2 - Read/Only test
>  ## Repeat Scenario 1
>  ## Mount data disk as read only
>  ## Try to write data to Ozone cluster
>  ## Validate error message is correct
>  ## Shutdown cluster
>  # Scenario 3 - Corruption test
>  ## Repeat Scenario 2
>  ## Shutdown cluster
>  ## Modify data disk data
>  ## Restart cluster
>  ## Validate error message for read from corrupted data
>  ## Validate error message for write to corrupted volume
