[
https://issues.apache.org/jira/browse/HDDS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882358#comment-16882358
]
Eric Yang commented on HDDS-1554:
---------------------------------
[~elek] Thank you for the thorough review, and here are my answers:
{quote}There are no clear separation between the two areas and it makes it
impossible to run the same tests in any other environment. As an example:
existing smoketests can be easily executed in kubernetes.{quote}
The StartCluster/StopCluster methods only use docker-compose to start a
distributed cluster at this time. They can be extended later to launch an
Ozone cluster in other environments, such as a YARN service, to keep the
scope of the current feature set in check.
The smoke tests only run in a closed environment with no exposure to an
outside network. The Ozone CLI client works inside a docker container, which
prevents the tests from interacting with the Ozone cluster from an external
network. Embedding the Robot Framework in the docker container poses another
risk: inseparable test artifacts in the docker container that cannot be
removed later. Test artifacts make up 400+ MB of the current docker image,
which is 100% more bloated than necessary. Even if the test artifacts are
separated out later, we have no way to be sure that the docker image can
function properly without them, because the tests only work in docker
containers.
The fault injection tests map the ports manually to the external network, so
today they can provide better coverage for testing the docker network from an
external environment. The smoke tests could be modified to work with an
external network, but that would effectively double the installation of the
test tool chain on the host system and double the shell script configuration
templating needed to run the Robot Framework on the host. Given that Maven is
already the established dev tool chain for Hadoop, and with a minimalist
mindset to make the best use of that tool chain, I chose to stay in the Maven
environment and let the existing tool chain do the style and error checking
for these tests.
I respect your approach of writing the smoke tests in shell + Robot
Framework, but the fault injection tests can be performed more efficiently
with the help of the Java tool chain, imho.
{quote}Using `@FixMethodOrder(MethodSorters.NAME_ASCENDING)` is especially
error-prone and shows that we use unit test in a way which is different from
the normal usage (usually the @Test methods should be independent){quote}
There is nothing wrong with using this feature; it was implemented based on
popular request. It is developers who don't know how to use the feature
correctly who create problems for themselves. Since you don't like this
annotation, I will group the entire flow into one test case.
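To illustrate, a minimal sketch of what grouping the flow into one test case could look like (method names are hypothetical, and a plain list stands in for the real cluster calls; in JUnit this would be a single @Test method instead of name-ordered test methods):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: instead of relying on
// @FixMethodOrder(MethodSorters.NAME_ASCENDING), the whole read/write
// flow is driven from one entry point so the ordering is explicit.
public class DiskReadWriteFlow {
    // Records which steps ran, in order (stands in for real cluster calls).
    static final List<String> steps = new ArrayList<>();

    static void startCluster() { steps.add("start"); }
    static void uploadData()   { steps.add("upload"); }
    static void verifyData()   { steps.add("verify"); }
    static void stopCluster()  { steps.add("stop"); }

    // In JUnit this would be the single @Test method replacing the
    // name-ordered test methods.
    public static void runFullFlow() {
        startCluster();
        try {
            uploadData();
            verifyData();
        } finally {
            stopCluster();   // cluster is always torn down
        }
    }

    public static void main(String[] args) {
        runFullFlow();
        System.out.println(steps);
    }
}
```

The ordering dependency is then visible in one place rather than encoded in method names.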
{quote}B.) The current tests use the approach of using external mounts to reuse
the /data directory. It introduces new problems (see the workarounds with the
id/UID). All of them can be avoided by using internal docker volumes which
makes it possible to achieve the same without additional workarounds
(ITDiskReadOnly doesn't require direct access to the volume;
ITDiskCorruption.addCorruption can be replaced with simple shell executions
`echo "asd" > ../db`){quote}
There are two possible cases where data becomes read-only: B.1) the disk is
mounted as read-only, or B.2) the data file is read-only. It would be nice to
have distinct error messages informing the system administrator either to
adjust the mounting of the data disk, or that an error was made when copying
data while servicing the server. The ITReadOnly test focuses on case B.1 and
can be expanded to case B.2 if necessary. Using an internal path cannot
clearly test case B.1.
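A rough sketch of how the two cases can be told apart with plain NIO (the probe methods and paths here are illustrative only, not the actual test code): a read-only mount (B.1) shows up on the FileStore, while a read-only file (B.2) shows up on the file itself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hedged sketch: distinguishing the two read-only cases with stdlib NIO.
public class ReadOnlyProbe {

    // True when the underlying volume is mounted read-only (case B.1).
    static boolean volumeReadOnly(Path p) throws IOException {
        return Files.getFileStore(p).isReadOnly();
    }

    // True when only this file is read-only (case B.2) although the
    // volume itself is writable.
    static boolean fileReadOnly(Path p) throws IOException {
        return !Files.isWritable(p) && !volumeReadOnly(p);
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("probe", ".dat");
        f.toFile().setWritable(false);          // simulate case B.2
        System.out.println("volume read-only: " + volumeReadOnly(f));
        System.out.println("file read-only:   " + fileReadOnly(f));
        f.toFile().setWritable(true);
        Files.delete(f);
    }
}
```

Distinct error messages could then be produced per case instead of one generic IOException.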
{quote}C.) The unit test files contain a lot of code duplication inside and
outside the cluster.
For example ITDiskReadOnly.startCluster and ITDiskReadOnly.stopCluster are
almost the same.
The logic of waiting for the safe mode is duplicated in all of the
subprojects.{quote}
Good point, I will clean up in next patch.
{quote}D.) I would use JUnit assertion everywhere. For me combining java and
junit assertions are confusing (ITDiskReadOnly.startCluster/stopCluster)
E.) I would use Autoclosable try-catch blocks (or at least finally blocks) to
close opened resources. It would help to define the boundaries where the
resources are used.{quote}
Setup procedures are not tests. It is common to throw IOException and
InterruptedException from a JUnit setup method; that is why they are written
this way.
You are correct that using an AutoCloseable block is a nicer way to close
file system resources, and I will make the change.
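For reference, a minimal sketch of the AutoCloseable change (the helper name and paths are illustrative, not the actual test code): the Writer is closed automatically even if the write throws, which makes the resource boundary explicit.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;

// Hedged sketch: try-with-resources closes the Writer on both the
// success and failure paths, replacing manual close() calls.
public class TryWithResources {

    static void writeData(Path dest, String data) throws IOException {
        try (Writer out = Files.newBufferedWriter(dest)) {
            out.write(data);                 // closed on success or failure
        }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("twr", ".txt");
        writeData(f, "hello");
        System.out.println(Files.readAllLines(f));
        Files.delete(f);
    }
}
```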
{quote}F.) We don't need to catch `Exceptions` and `fail` as it's already
handled by JUnit framework. Just remove try and catch block.{quote}
I like to be able to identify the starting point in the log. That is why I
use try/catch and additional string logging for the stack trace. It helps to
jump to the starting point of the error rather than reading the entire stack
trace to locate the line number of the test case. Sometimes the stack trace
is so long that the actual line number of the test case becomes hidden. I
think it is more efficient for problem determination, but I will remove the
try/catch if this looks silly to you.
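A small sketch of the pattern being described, with illustrative names (not the actual test code): catch the failure, log a greppable marker line identifying the test, then rethrow so the framework still records the failure.

```java
// Hedged sketch: a marker line makes the failing test easy to find in a
// long log, while rethrowing keeps the normal JUnit failure behavior.
public class MarkedFailure {

    static String lastMarker;   // stands in for the test log

    static void runStep(Runnable step, String testName) {
        try {
            step.run();
        } catch (RuntimeException e) {
            // The marker identifies the starting point of the error,
            // instead of scanning a long stack trace for a line number.
            lastMarker = "TEST FAILED: " + testName;
            System.err.println(lastMarker);
            throw e;            // the framework still marks the test failed
        }
    }

    public static void main(String[] args) {
        try {
            runStep(() -> { throw new RuntimeException("disk error"); },
                    "ITDiskReadOnly.testWrite");
        } catch (RuntimeException expected) {
            // failure propagated as usual
        }
    }
}
```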
{quote}G.) You don't need to extend TestCase with JUnit4.{quote}
OK, it was faster to have the IDE autogenerate the setUp/tearDown methods by
extending TestCase, but this inconsequential detail can be removed.
{quote}
H.) There are two kinds of configuration handling: the environment variable
based (usually with the help of docker-config) and the file based (where real
xml files are mounted). The two of them are mixed, which is confusing.
{quote}
The envtoconf.py script does not generate core-site.xml when there is no
change to core-site.xml. We now require some key/value pairs in
core-site.xml, so the reason to mount this resource externally may have faded
away. I will clean this up.
{quote}I.) We don't need to duplicate the same configuration keys multiple
times (docker-config-dn1 docker-config-dn2, docker-config-dn3). the common keys
can be loaded from one docker-config file and the special keys (container ipc
and ratis ipc ports) can be adjusted directly from the docker-compose (Similar
to the environment variable ENSURE_OM_INITIALIZED which is defined only for the
OM){quote}
Same point that [~arp] raised. Will clean this up.
{quote}G.) There are RAT violations in some of the ozone-site.xml files.{quote}
Will fix this.
{quote}How can we be sure that the first cluster is started properly before
shutting it down? Or do we need a real start or just the initialization of the
om / scm?{quote}
This is for initialization only. The logic can be smarter; I will revise it
in the next patch.
{quote}I.) It's not clear for me what is the goal for these tests.{quote}
ITDiskReadWrite: see HDDS-1771.
ITDiskCorruption: We can define a grace period within which disk sync should
happen, to ensure only a minimal amount of data loss can occur. The test case
can be improved to validate that if corruption persists for over n seconds,
error detection should occur.
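As a sketch of what injecting the corruption itself could look like (the helper and file names are illustrative, not the actual ITDiskCorruption code): flip a byte in a data file so that a later read can exercise the error detection path.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Hedged sketch of a corruption helper: invert one byte at a given
// offset so the file content no longer matches what was written.
public class CorruptByte {

    static void flipByte(Path file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.seek(offset);
            int b = raf.read();
            raf.seek(offset);
            raf.write(b ^ 0xFF);        // invert the byte at offset
        }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("block", ".db");
        Files.write(f, new byte[] {1, 2, 3, 4});
        flipByte(f, 2);
        System.out.println(java.util.Arrays.toString(Files.readAllBytes(f)));
        Files.delete(f);
    }
}
```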
{quote}ITDiskReadOnly: it checks the use case when the disk is read only, which
is fine. But I think it's more interesting to test random disk failures /
random disk slowness. I know that other followup jira-s are opened, I think
they can be more realistic.{quote}
This ticket does not intend to cover every test in one patch. It is a broad
starting point to structure the test area. It may require some patience for
the detailed tests to be developed.
{quote}J.) ITDiskCorruption.listFiles can be simplified:{quote}
This is a placeholder. There is probably a set of files to go after when
refining the tests. I think data blocks and RocksDB are probably good
starting points. Any suggestions other than listing all files?
{quote}K.) ITDiskCorruption.addCorruption
Are you sure that File.listFiles is recursive? It may return an empty list of
files, as the real files are in `metadata/scm-container.db`, not in the
`scm-container.db` directory. Can you please add an assertion that you have
at least one file which is corrupted?{quote}
It currently only corrupts omMetrics. A better filter needs to be developed
for J.
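On the recursion question: File.listFiles is indeed not recursive, so a helper that should reach files under `metadata/scm-container.db` needs a recursive walk. A possible sketch using stdlib NIO (the helper name is illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hedged sketch: Files.walk visits the whole directory tree, unlike
// File.listFiles which only lists the immediate children.
public class RecursiveFiles {

    static List<Path> regularFilesUnder(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("walk");
        Path sub = Files.createDirectories(root.resolve("metadata"));
        Files.write(sub.resolve("db.sst"), "data".getBytes());
        // nested file is found even though it is not a direct child of root
        System.out.println(regularFilesUnder(root).size());
    }
}
```

Asserting that the returned list is non-empty would also cover the reviewer's request that at least one file is actually corrupted.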
{quote}M.) @Before/@After can help to start/stop cluster before/after each
tests in ITDiskReadOnly.{quote}
Will make this change where it applies.
{quote}Here we have no information about the IOException which is checked. It
could be an exception from the out.write or an exception from the
fs.exists(dest). Unfortunately it's not clear what is the expected
behavior.{quote}
Proper fault detection does not seem to be in place in the application code
yet. Therefore, it is a generic IOException at this time. It will be revised
when I know exactly which IOException to catch.
Thanks again for the review. Will clean up accordingly.
> Create disk tests for fault injection test
> ------------------------------------------
>
> Key: HDDS-1554
> URL: https://issues.apache.org/jira/browse/HDDS-1554
> Project: Hadoop Distributed Data Store
> Issue Type: Improvement
> Components: build
> Reporter: Eric Yang
> Assignee: Eric Yang
> Priority: Major
> Labels: pull-request-available
> Attachments: HDDS-1554.001.patch, HDDS-1554.002.patch,
> HDDS-1554.003.patch, HDDS-1554.004.patch, HDDS-1554.005.patch,
> HDDS-1554.006.patch, HDDS-1554.007.patch, HDDS-1554.008.patch,
> HDDS-1554.009.patch, HDDS-1554.010.patch, HDDS-1554.011.patch
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> The current plan for fault injection disk tests are:
> # Scenario 1 - Read/Write test
> ## Run docker-compose to bring up a cluster
> ## Initialize scm and om
> ## Upload data to Ozone cluster
> ## Verify data is correct
> ## Shutdown cluster
> # Scenario 2 - Read/Only test
> ## Repeat Scenario 1
> ## Mount data disk as read only
> ## Try to write data to Ozone cluster
> ## Validate error message is correct
> ## Shutdown cluster
> # Scenario 3 - Corruption test
> ## Repeat Scenario 2
> ## Shutdown cluster
> ## Modify data disk data
> ## Restart cluster
> ## Validate error message for read from corrupted data
> ## Validate error message for write to corrupted volume