[ 
https://issues.apache.org/jira/browse/HDDS-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883392#comment-16883392
 ] 

Eric Yang edited comment on HDDS-1773 at 7/11/19 11:08 PM:
-----------------------------------------------------------

[~elek] Disk hang is still a good test case to verify datanode health logic.  
How about we move the patch 001 logic to HDDS-1774?

I have expressed concern about Byteman touching ASF-licensed code in the JVM.  
There is no clear answer on whether this is allowed.  Besides, it is not clear 
to me where faults could be injected in the JVM to yield the same result as 
actual disk errors.

The current approach of mounting a disk volume is a more favorable way to 
simulate disk errors.  It can be combined with device mapper to create a faulty 
virtual device that simulates real disk errors.  For example, we can create a 
virtual block device with:

{code}dd if=/dev/zero of=/var/lib/virtualblock.img bs=512 count=1048576
losetup /dev/loop0 /var/lib/virtualblock.img{code}

This creates a 512 MB file and attaches it as a loopback device.  We then 
format the loopback device and punch a 'hole' in the block device:

{code}dmsetup create errdev0
0 261144 linear /dev/loop0 0
261144 5 error
261149 787427 linear /dev/loop0 261149{code}

This creates a device called 'errdev0' (typically under /dev/mapper).  When you 
run dmsetup create errdev0, it reads the table from stdin and finishes when ^D 
is entered.

In the example above, we've made a 5-sector hole (2.5 KB) starting at sector 
261144 of the loop device.  The mapping then continues through the loop device 
as normal.
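
The table can also be generated rather than typed by hand.  The following is 
only a rough sketch: make_error_table is a hypothetical helper, not part of 
any patch, and it assumes the mapping should resume at the same sector number 
on the loop device right after the hole, so every good sector maps one-to-one.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the patch): print a
# device-mapper table that maps DEV one-to-one except for a run of
# sectors that always fail with an IO error.
# Usage: make_error_table DEV TOTAL_SECTORS HOLE_START HOLE_LEN
make_error_table() {
    dev=$1 total=$2 start=$3 len=$4
    resume=$((start + len))    # first good sector after the hole
    if [ "$start" -le 0 ] || [ "$len" -le 0 ] || [ "$resume" -ge "$total" ]; then
        echo "hole must lie strictly inside the device" >&2
        return 1
    fi
    # Each table line is: <start sector> <sector count> <target> [args]
    echo "0 $start linear $dev 0"
    echo "$start $len error"
    echo "$resume $((total - resume)) linear $dev $resume"
}

# Same geometry as the example above: 1048576 sectors, 5 bad ones at 261144.
make_error_table /dev/loop0 1048576 261144 5
```

Since dmsetup create reads the table from stdin, the output could be piped 
straight into it as root, e.g. make_error_table /dev/loop0 1048576 261144 5 | 
dmsetup create errdev0, which avoids the interactive ^D step.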

We can mount the errdev0 device into a docker container like a normal block 
device.  When Ozone writes data to errdev0, it will hit IO errors on the 
sectors that are really holes in the virtual device.  This is a more realistic 
simulation imho.

We can include a readme file with instructions for setting up the faulty 
virtual device so that users can repeat the tests.  Thoughts on this approach?



> Add intermittent IO disk test to fault injection test
> -----------------------------------------------------
>
>                 Key: HDDS-1773
>                 URL: https://issues.apache.org/jira/browse/HDDS-1773
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>            Reporter: Eric Yang
>            Priority: Major
>         Attachments: HDDS-1773.001.patch
>
>
> Disk errors can also be simulated by setting the cgroup blkio rate to 0 while 
> the Ozone cluster is running.  
> This test will be added to the corruption test project and will only be 
> performed if there is write access to the host cgroup to control the 
> throttling of disk IO.
> Expected result:
> When a datanode becomes unresponsive due to slow IO, SCM must flag the node 
> as unhealthy.


