Some updates on the proposal, based on the discussion at the last
Community Sync and an offline discussion with Uma and Arpit:
Question I: How can we switch between different storage classes?
1. Today, it's not supported. We don't allow changing Ratis/ONE to
Ratis/THREE.
2. AWS supports it only with the PUT - copy command [1]
3. We can do the same easily (support server-side copy); see the
sketch after this list.
4. For any smarter algorithm we need to improve the block placement
policy (eg. open new containers for the blocks of different buckets;
one container should contain only blocks from the same bucket). But
it's a bigger change. I would suggest skipping this and using the same
limitation as AWS.
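For illustration, here is a hedged sketch of how a client could trigger
such a server-side copy through the S3 gateway with the AWS SDK for
Java. The endpoint, bucket and key names are placeholders, and honoring
the new storage class on copy is exactly what we would have to implement:

```java
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CopyObjectRequest;
import com.amazonaws.services.s3.model.StorageClass;

public class StorageClassCopy {
  public static void main(String[] args) {
    // S3 client pointed at an Ozone S3 gateway (placeholder endpoint).
    AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
            "http://localhost:9878", "us-east-1"))
        .enablePathStyleAccess()
        .build();

    // Copy the object onto itself, requesting a different storage class;
    // the server rewrites the data into containers of the target class.
    s3.copyObject(new CopyObjectRequest("bucket1", "key1", "bucket1", "key1")
        .withStorageClass(StorageClass.ReducedRedundancy));
  }
}
```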
Question II: We need dynamic movement between the closed and erasure
coded states.
In the document I argued that Erasure Coding is a specific type of
storage class (Ratis/THREE --> EC 6:2).
Arpit helped me understand that this is not the case. EC/cold data is a
state (similar to the Open/Closed state) and not a replication strategy.
Data can be moved from HOT to COLD manually or automatically over time.
(For example, SCM can initiate an EC process for containers which are
not frequently used, or a user can request the same for a specific
bucket.)
Based on the previous answer, we shouldn't use a specific storage-class
for EC, because there is no easy way to move between storage-classes.
Summary:
In this storage-class model, the STANDARD storage class should be
defined something like this:
RATIS/THREE ---------> CLOSED/THREE <-----------> EC 6:2
The storage class is still a definition of replication types/parameters
+ transitions, but we introduce a third status after the closed
container status, with transitions between them (triggered manually or
by the SCM).
The storage-class model still works, but COLD data is a state, not a
storage class.
If the administrator prefers, they can disable the second transition for
a specific storage-class:
RATIS/THREE ---------> CLOSED/THREE
But it means that these keys will be stored in closed containers
forever; the EC feature is completely turned off.
TODO:
1. I will update the storage-class document to clarify these points.
2. I will create an EC document to include this, which can be a good
starting point for the discussion about EC.
Thanks,
Marton
[1]: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CopyObject.html
On 5/18/20 9:44 AM, Elek, Marton wrote:
Here is a Hackmd link, if you prefer to comment it on there:
https://hackmd.io/@elek/B1j93h1s8
Marton
On 5/11/20 3:42 PM, Elek, Marton wrote:
Another topic; I am not sure if it's [DISCUSS] or [DESIGN], something
between them, and all feedback is welcome anyway.
For the sake of simplicity I just include the full text here, but will
also upload it somewhere soon.
Thanks,
Marton
----------------------------------------------------------------------
## Abstract
One of the fundamental abstractions of Ozone is the _Container_, which
is used as the unit of replication.
Containers have two flavors: _Open_ and _Closed_. Open containers are
replicated by Ratis and are writable; Closed containers are replicated
with data copy and are read-only.
In this document a new level of abstraction is proposed: the *storage
class*, which defines which types of containers should be used and what
types of transitions are supported.
## Containers in more detail
The Container is the unit of replication in Ozone. One Container can
store multiple blocks (the default container size is 5GB) and the
blocks are replicated together. Datanodes report only the replication
state of the Containers back to the Storage Container Manager (SCM),
which makes it possible to scale up to billions of objects.
The identifier of a block (BlockId) is a combination of ContainerID and
LocalID (the ID inside the container). The ContainerID can be used to
find the right Datanode which stores the data; the LocalID can be used
to find the data inside one container.
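A minimal sketch of this identifier structure (names simplified; the
real Ozone class has more fields):

```java
/**
 * Simplified sketch of a block identifier: a block is addressed by the
 * container that holds it plus a local ID inside that container.
 */
public class BlockId {
  private final long containerId; // locates the container (and thus the Datanodes)
  private final long localId;     // locates the block inside that container

  public BlockId(long containerId, long localId) {
    this.containerId = containerId;
    this.localId = localId;
  }

  public long getContainerId() { return containerId; }
  public long getLocalId() { return localId; }
}
```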
Container type defines the following:
* How to write to the containers?
* How to read from the containers?
* How to recover / replicate data in case of error?
* How to store the data on the Datanode (related to the *how to
write* question)?
The definition of *Ratis/THREE*:
* **How to write**: Call the standard Datanode RPC API on the
*Leader*. The Leader will replicate the data to the followers.
* **How to read**: Read the data from the Leader (stale reads may be
possible long-term)
* **How to replicate / recover**
    * Transient failures can be handled by a new leader election
    * Permanent degradation can't be handled (a transition to
Closed containers is required)
* **How to store**: Using the standard *KeyValueContainer* and chunk
layouts
The definition of *Closed/THREE*:
* **How to write**: Closed containers are not writable
* **How to read**: Read the data from any node (a simple RPC call to
the DN)
* **How to replicate / recover** (see the sketch below)
    * Datanodes provide a gRPC endpoint to publish containers as
compressed packages
    * The Replication Manager (SCM) can send commands to a DN to
replicate data FROM another Datanode
    * The Datanode downloads the compressed package and imports it
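A simplified sketch of this flow on the receiving Datanode (the type
and method names are illustrative, not the actual Ozone classes; the
real implementation streams over gRPC):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Illustrative flow for re-replicating a closed container. */
public class ClosedContainerReplicator {

  /** Executes a "replicate container X from Datanode Y" command from SCM. */
  public void replicate(long containerId, String sourceDatanode) throws Exception {
    // 1. Stream the compressed container package from the source Datanode.
    try (InputStream pkg = downloadPackage(sourceDatanode, containerId)) {
      Path tmp = Files.createTempFile("container-" + containerId, ".tar.gz");
      Files.copy(pkg, tmp, StandardCopyOption.REPLACE_EXISTING);

      // 2. Unpack and import the container data + metadata locally, then
      //    report the new replica back to SCM via the container report.
      importContainer(containerId, tmp);
    }
  }

  private InputStream downloadPackage(String sourceDatanode, long containerId) {
    throw new UnsupportedOperationException("gRPC download omitted in sketch");
  }

  private void importContainer(long containerId, Path packageFile) {
    throw new UnsupportedOperationException("import omitted in sketch");
  }
}
```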
The definition of *Closed/ONE*:
* **How to write**: Closed containers are not writable
* **How to read**: Read the data from any node (a simple RPC call to
the DN)
* **How to replicate / recover**: No recovery, sorry.
## Storage-class
Let's define the *storage-class* as a set of **container types** and
**transitions between the different types** during the life cycle of
the containers.
The type of a Container can be defined by the implementation type
(eg. Ratis, EC, Closed) plus additional parameters related to the type
(eg. the replication factor for Ratis, or the EC algorithm for EC
containers).
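As a hedged illustration of the model (no such types exist in Ozone
today; all names are made up for this sketch):

```java
import java.util.List;

/** A container type: an implementation plus its parameters. */
record ContainerType(String implementation, String parameters) {}
// e.g. new ContainerType("RATIS", "THREE"), new ContainerType("EC", "RS(6,2)")

/** A transition: on which trigger a container moves to which type. */
record Transition(String trigger, ContainerType target) {}

/** A storage class: the initial container type plus its transitions. */
record StorageClass(String name, ContainerType initial,
                    List<Transition> transitions) {}
```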
Today Ozone supports two storage classes (though we call them storage
classes only at the S3 level); both are sketched in code after the
definitions.
The definition of the STANDARD storage class:
* *First container type/parameters*: Ratis/THREE replicated containers
* *Transitions*: In case of any error or if the container is full,
convert to closed containers
* *Second container type/parameters*: Closed/THREE container
The definition of the REDUCED storage class:
* *First container type/parameters*: Ratis/ONE replicated containers
* *Transitions*: In case the container is full, convert to closed
containers
* *Second container type/parameters*: Closed/ONE container
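Expressed with the record types from the previous sketch, the two
current storage classes could look like this:

```java
import java.util.List;

public class BuiltinStorageClasses {
  static final ContainerType RATIS_THREE = new ContainerType("RATIS", "THREE");
  static final ContainerType RATIS_ONE = new ContainerType("RATIS", "ONE");
  static final ContainerType CLOSED_THREE = new ContainerType("CLOSED", "THREE");
  static final ContainerType CLOSED_ONE = new ContainerType("CLOSED", "ONE");

  // STANDARD: Ratis/THREE, closed on error or when full, kept as Closed/THREE.
  static final StorageClass STANDARD = new StorageClass("STANDARD", RATIS_THREE,
      List.of(new Transition("CONTAINER_FULL_OR_ERROR", CLOSED_THREE)));

  // REDUCED: Ratis/ONE, closed when full, kept as Closed/ONE (no recovery).
  static final StorageClass REDUCED = new StorageClass("REDUCED", RATIS_ONE,
      List.of(new Transition("CONTAINER_FULL", CLOSED_ONE)));
}
```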
But we can define other storage classes as well. For example, we can
define a (Ratis/THREE --> Closed/FIVE) storage class, or more specific
containers can be used for Erasure Coding or random read/write.
With this approach the storage-class becomes an adjustable abstraction
to define the rules of replication. Some key properties of this
approach:
* **Storage-class can be defined by configuration**: A storage class
is nothing more than the definition of the rules to store / replicate
containers. They can be configured in config files and changed at any
time.
* **Object creation requires a storage class**: Right now we must
define the *replication factor* and *replication type* during key
creation. These can be replaced by setting only the storage class (see
the sketch after this list).
* **Storage-class is a property of the containers**: As the unit of
replication in Ozone is the container, one specific storage-class
should be recorded for each container.
* **Changing the storage class of a key** means copying it to another
container.
* **Changing the definition of a storage class** will modify the
behavior of the Replication Manager, and -- eventually -- the
containers will be replicated in a different way.
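To illustrate the second point: today a key is created with an explicit
replication type and factor (roughly the current client call below); a
storage-class-based variant (hypothetical signature, shown commented
out) would replace both parameters with a single name:

```java
import java.util.HashMap;
import org.apache.hadoop.hdds.client.ReplicationFactor;
import org.apache.hadoop.hdds.client.ReplicationType;
import org.apache.hadoop.ozone.client.OzoneBucket;
import org.apache.hadoop.ozone.client.io.OzoneOutputStream;

public class KeyCreationExample {
  static void createKey(OzoneBucket bucket, byte[] data) throws Exception {
    // Today: replication type + factor must be passed for every key.
    try (OzoneOutputStream out = bucket.createKey(
        "key1", data.length, ReplicationType.RATIS, ReplicationFactor.THREE,
        new HashMap<>())) {
      out.write(data);
    }

    // Proposed (hypothetical API): only the storage class is named; the
    // replication rules come from the storage-class definition.
    // bucket.createKey("key2", data.length, "STANDARD");
  }
}
```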
*Note*: we already support storage classes for S3 objects; the only
difference is that this would become an Ozone-level abstraction and it
would define *all* the container types and transitions.
## Possible use-cases
First of all, we can easily configure different replication levels
with this approach (eg. Ratis/THREE --> Closed/TWO). Ratis needs a
quorum, but we can have a different replication number after closing
the containers.
We can also define topology-related transitions (eg. after closing,
one replica should be copied to a different rack) or storage-specific
constraints (right now we have only one storage implementation,
`KeyValueContainer`, but we can implement more, and the storage class
provides an easy abstraction to configure the required storage).
Datanodes can also provide different disk types for containers of a
certain storage class (eg. SSD for fast access).
## Additional use cases
In addition to these replication-related options there are two very
specific use cases where we can use storage classes. Both require more
design discussion, but here I quickly summarize some possible
directions with the help of the storage class abstraction.
### Erasure coding
To store cold data with fewer bits (less than the data * 3 replicas)
we can encode the data and store some parity bits to survive replica
loss. In a streaming situation (streaming write) this can be tricky as
we need enough chunks to do the Reed-Solomon magic. With containers we
are in a better position: after the first transition of the Open
containers we can do the EC encoding, and at that time we have enough
data to encode.
There are multiple options to do EC; the simplest way is to encode
Containers after the first transition:
Storage class COLD:
* *First container type/parameters*: Ratis/THREE replicated containers
* *Transitions*: In case of any error or if the container is full,
convert to closed containers
* *Second container type/parameters*: EC RS(6,2)
With this storage class the containers can be converted to specific EC
containers together with other containers (for example, instead of 3 x
C1, 3 x C2 and 3 x C3, the containers can be converted to C1, C2, C3 +
Parity1 + Parity2).
The exact structure of the EC containers (encoded at the container
level or in smaller parts) depends on the required recovery speed.
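To make the idea concrete, here is a toy sketch of container-level
encoding with a single XOR parity over three containers. Real RS(6,2)
needs a proper Reed-Solomon codec (for example the raw erasure coders
in hadoop-common) and produces two parity containers for six data
containers, but the shape of the transition is the same:

```java
/**
 * Toy illustration: instead of keeping 3 replicas of each container,
 * keep one copy of each plus parity. XOR gives one parity container
 * (survives one loss); RS(6,2) generalizes to 6 data + 2 parity.
 */
public class ContainerParityDemo {
  public static void main(String[] args) {
    byte[] c1 = "container-1-data".getBytes();
    byte[] c2 = "container-2-data".getBytes();
    byte[] c3 = "container-3-data".getBytes();

    byte[] parity = xor(xor(c1, c2), c3);

    // If c2 is lost, rebuild it from the remaining data + parity.
    byte[] rebuiltC2 = xor(xor(c1, c3), parity);
    System.out.println(new String(rebuiltC2)); // prints "container-2-data"
  }

  static byte[] xor(byte[] a, byte[] b) {
    // Containers would be padded to equal length before encoding.
    byte[] out = new byte[a.length];
    for (int i = 0; i < a.length; i++) {
      out[i] = (byte) (a[i] ^ b[i]);
    }
    return out;
  }
}
```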
### Random read/write
NFS / Fuse file systems might require support for random read/write,
which can be tricky as the closed containers are immutable. To change
even 1 byte in a block, the whole block must be re-created with the
new data. This can have a lot of overhead, especially in the case of
many small writes.
But writes are cheap with Ratis/THREE containers. Similar to the
existing `WriteChunk` and `PutBlock` calls, we can implement an
`UpdateChunk` call which modifies the current content of the chunk AND
replicates the change with the help of Ratis.
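A hedged sketch of what such a request could carry, mirroring the
parameters of the existing chunk-write path (this is a hypothetical
message, not an existing Ozone RPC):

```java
/**
 * Hypothetical UpdateChunk request: identifies a chunk of a block and
 * overwrites a byte range in place. The call would go through the Ratis
 * pipeline of the still-open container, so all three replicas apply the
 * same mutation in the same order.
 */
public class UpdateChunkRequest {
  private final long containerId;  // container holding the block
  private final long localId;      // block inside the container
  private final String chunkName;  // chunk of the block to modify
  private final long offset;       // byte offset inside the chunk
  private final byte[] newData;    // bytes to write at that offset

  public UpdateChunkRequest(long containerId, long localId, String chunkName,
      long offset, byte[] newData) {
    this.containerId = containerId;
    this.localId = localId;
    this.chunkName = chunkName;
    this.offset = offset;
    this.newData = newData;
  }
}
```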
Let's imagine that we solved the resiliency of Ratis pipelines: in
case of any Ratis error we can ask another Datanode to join the Ratis
ring *instead of* closing the container. I know that it can be hard to
implement, but **if** it is solved, we have an easy solution for
random reads/writes (thanks to the storage class abstraction).
If it works, we can define the following storage-class:
* **Initial container** type/parameters: Ratis/THREE
* **Transitions**: NO. They will remain "Open" forever
* **Second container type**: NONE
This would help with the random read/write problem, with some
limitations: only a few containers should be managed in this
storage-class. It's not suitable for really big data, but it can be a
very powerful addition to provide an NFS/Fuse backend for containers /
git db / SQL db.
## References
* Storage class in S3: https://aws.amazon.com/s3/storage-classes/