Hi, all
cc szetszwo, shashi

We are using Ozone with Ratis for our services, and we hit an issue with a disk
running out of space. From the log, we think that Ratis ran out of space and
that the Ozone pipelines (Raft groups in Ratis) created on the full disk cannot
be closed, because closing requires taking a final snapshot. A short log is
appended below.


So I think that, as the consumer of the disk, Ratis should be able to manage
the free/used space and give some guarantee that operations are not left
partially completed because the disk ran out of space.
We could add a reserved space for each disk in Ratis and exclude disks that
have reached the defined threshold when allocating new Raft groups.
Although the problem we hit happened on the Ozone side, Ratis is the consumer
of the metadata disks, so this would be better done in Ratis.
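
To make the reserved-space idea a bit more concrete, here is a rough sketch of
the filtering step. It is only an illustration under my own assumptions: the
class name ReservedSpaceFilter, the methods hasRoom and selectCandidates, the
10 GiB reserve, and the /data2/ratis path are all hypothetical and are not
existing Ratis or Ozone APIs or settings. It relies only on
java.io.File#getUsableSpace from the JDK.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of the Ratis or Ozone API. */
final class ReservedSpaceFilter {
  private ReservedSpaceFilter() {}

  /** Returns true if the usable space is strictly above the reserved amount. */
  static boolean hasRoom(long usableBytes, long reservedBytes) {
    return usableBytes > reservedBytes;
  }

  /**
   * Keeps only the storage directories whose volume still has more usable
   * space than the reserve, so new Raft groups are not placed on full disks.
   */
  static List<File> selectCandidates(List<File> storageDirs, long reservedBytes) {
    List<File> candidates = new ArrayList<>();
    for (File dir : storageDirs) {
      // getUsableSpace() returns 0 when the path does not name a partition,
      // so missing directories are dropped for any non-negative reserve.
      if (hasRoom(dir.getUsableSpace(), reservedBytes)) {
        candidates.add(dir);
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    // 10 GiB is an arbitrary example value, not a recommended setting.
    long reservedBytes = 10L * 1024 * 1024 * 1024;
    // /data2/ratis is a made-up second volume for the example.
    List<File> dirs = Arrays.asList(new File("/data1/ratis"), new File("/data2/ratis"));
    System.out.println("Candidates for new groups: " + selectCandidates(dirs, reservedBytes));
  }
}
```

A check like this only covers placement of new groups. The reserve itself is
what would leave room for existing groups to write a final snapshot and close.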


Or does anyone have other solutions for this?


Thanks
Mark Gui


```
2021-05-25 19:10:47,171 [492bc1be-439e-45db-856f-2e58336e2528@group-B26E6BC26E24-StateMachineUpdater] INFO org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine: group-B26E6BC26E24: Taking a snapshot at:(t:5, i:419) file /data1/ratis/c5a9bc6e-fee1-48a8-9100-b26e6bc26e24/sm/snapshot.5_419
2021-05-25 19:10:47,171 [492bc1be-439e-45db-856f-2e58336e2528@group-B26E6BC26E24-StateMachineUpdater] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine: group-B26E6BC26E24: Failed to write snapshot at:(t:5, i:419) file /data1/ratis/c5a9bc6e-fee1-48a8-9100-b26e6bc26e24/sm/snapshot.5_419
2021-05-25 19:10:47,171 [492bc1be-439e-45db-856f-2e58336e2528@group-B26E6BC26E24-StateMachineUpdater] ERROR org.apache.ratis.server.impl.StateMachineUpdater: 492bc1be-439e-45db-856f-2e58336e2528@group-B26E6BC26E24-StateMachineUpdater: Failed to take snapshot
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:326)
        at org.apache.ratis.thirdparty.com.google.protobuf.CodedOutputStream$OutputStreamEncoder.doFlush(CodedOutputStream.java:3062)
        at org.apache.ratis.thirdparty.com.google.protobuf.CodedOutputStream$OutputStreamEncoder.flushIfNotAvailable(CodedOutputStream.java:3057)
        at org.apache.ratis.thirdparty.com.google.protobuf.CodedOutputStream$OutputStreamEncoder.writeUInt64NoTag(CodedOutputStream.java:2897)
        at org.apache.ratis.thirdparty.com.google.protobuf.CodedOutputStream.writeInt64NoTag(CodedOutputStream.java:414)
        at org.apache.ratis.thirdparty.com.google.protobuf.FieldSet.writeElementNoTag(FieldSet.java:657)
        at org.apache.ratis.thirdparty.com.google.protobuf.FieldSet.writeElement(FieldSet.java:634)
        at org.apache.ratis.thirdparty.com.google.protobuf.MapEntryLite.writeTo(MapEntryLite.java:110)
        at org.apache.ratis.thirdparty.com.google.protobuf.MapEntry.writeTo(MapEntry.java:154)
        at org.apache.ratis.thirdparty.com.google.protobuf.CodedOutputStream$OutputStreamEncoder.writeMessageNoTag(CodedOutputStream.java:2855)
        at org.apache.ratis.thirdparty.com.google.protobuf.CodedOutputStream$OutputStreamEncoder.writeMessage(CodedOutputStream.java:2824)
        at org.apache.ratis.thirdparty.com.google.protobuf.GeneratedMessageV3.serializeMapTo(GeneratedMessageV3.java:3224)
        at org.apache.ratis.thirdparty.com.google.protobuf.GeneratedMessageV3.serializeLongMapTo(GeneratedMessageV3.java:3140)
        at org.apache.hadoop.hdds.protocol.datanode.proto.ContainerProtos$Container2BCSIDMapProto.writeTo(ContainerProtos.java:14633)
        at org.apache.ratis.thirdparty.com.google.protobuf.AbstractMessageLite.writeTo(AbstractMessageLite.java:83)
        at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.persistContainerSet(ContainerStateMachine.java:270)
        at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.takeSnapshot(ContainerStateMachine.java:294)
        at org.apache.ratis.server.impl.StateMachineUpdater.takeSnapshot(StateMachineUpdater.java:265)
        at org.apache.ratis.server.impl.StateMachineUpdater.checkAndTakeSnapshot(StateMachineUpdater.java:257)
        at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:183)
        at java.lang.Thread.run(Thread.java:748)
```
