Hey folks,

This is a report of the discussion regarding DHT2 that took place with Jeff
Darcy and Shyam in Bangalore during February. Many of you have already been
following (or are aware to some extent of) what's planned for DHT2, but
there's no consolidated document that explains the design in detail; just
pieces of information from various talks and presentations. This report will
be followed up by the design doc, so hang in for that.

[Short report of the discussion and what to expect in the design doc..]

DHT2 server component (MDS)
---------------------------
The DHT2 client would act as a forwarder/router to the server component.
It's the server component that would drive an operation. Operations may be
compound in nature, involving local and/or remote sub-operations. The
"originator" server may forward (sub)operations to another server in the
cluster, depending on the type of the (original) operation.
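
To make the originator/forwarding split a bit more concrete, here's a rough
sketch in C of how an originator MDS might dispatch a sub-operation. The
dht2_* names and types below are made up for illustration; they are not
actual DHT2 interfaces.

    /* Illustrative sketch only: dht2_* names/types are assumptions. */
    #include <uuid/uuid.h>

    struct dht2_subop {
            uuid_t  gfid;     /* object the sub-operation acts on */
            int     op;       /* e.g. link, unlink, setattr */
            void   *args;     /* op-specific arguments */
    };

    /* hypothetical helpers */
    extern int dht2_hash_to_mds (uuid_t gfid);
    extern int dht2_execute_local (struct dht2_subop *subop);
    extern int dht2_forward_to_mds (int target, struct dht2_subop *subop);

    static int
    dht2_dispatch_subop (struct dht2_subop *subop, int this_mds)
    {
            int target = dht2_hash_to_mds (subop->gfid);  /* who owns it? */

            if (target == this_mds)
                    return dht2_execute_local (subop);    /* drive it here */

            return dht2_forward_to_mds (target, subop);   /* or forward it */
    }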

The server component also takes care of serializing operations, if required,
to ensure correctness and resilience of concurrent operations. For crash
consistency, DHT2 would first log the operation it's about to perform in a
write ahead log (WAL or journal), followed by applying the operation(s) to
the store. WAL records are marked completed after the operations are durable
on the store. Pending operations are replayed on server restart.
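
A possible shape for a WAL record and its lifecycle is sketched below; the
field names and states are assumptions for illustration, not the actual
on-disk format.

    /* Illustration only: assumed record layout and states. */
    #include <stdint.h>

    enum wal_state {
            WAL_LOGGED    = 1,  /* record persisted, op not yet applied */
            WAL_COMPLETED = 2,  /* op durable on the store */
    };

    struct wal_record {
            uint64_t        txn_id;  /* monotonically increasing txn id */
            int             op;      /* operation code (mkdir, rename, ...) */
            enum wal_state  state;
            /* ... operation payload follows ... */
    };

    /*
     * Normal path:    append (LOGGED) -> fsync -> apply to store ->
     *                 mark COMPLETED
     * Crash recovery: scan the journal on restart and replay every
     *                 record still in the LOGGED state.
     */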

Cluster Map
-----------
This is a versioned "view" of the current state of a cluster. State here
refers to the nodes (or probably sub-volumes?) which form a cluster, along
with their operational state (up, down) and weightage. A cluster map is used
to distribute objects to a set of nodes. Every entity in a cluster (clients,
servers) keeps a copy of the cluster map and consults it whenever required
(e.g., during distribution). Cluster map versions are monotonically
increasing; an epoch number is best suited for such a versioning scheme. A
master copy of the map is maintained by GlusterD (in etcd).

Operations performed by clients (and servers) carry the epoch number of
their cached version of the cluster map. Servers use this information to
validate the freshness of the cluster map.
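
In code, the map and the freshness check could look roughly like the sketch
below (the struct layout and the exact refresh policy are assumptions, not a
settled design):

    /* Sketch only: assumed map layout and epoch comparison. */
    #include <stdint.h>

    struct cluster_map {
            uint64_t epoch;  /* monotonically increasing version */
            /* node/subvolume list, up/down state, weightage ... */
    };

    static int
    compare_map_epochs (uint64_t request_epoch, uint64_t server_epoch)
    {
            if (request_epoch < server_epoch)
                    return -1;  /* client's map is stale; ask it to refresh */
            if (request_epoch > server_epoch)
                    return 1;   /* server's map is stale; refetch the master
                                   copy from GlusterD (etcd) */
            return 0;           /* both hold the same view */
    }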

Write Ahead Log
---------------
DHT2 would make use of a journal to ensure crash consistency. It's tempting
to reuse the journaling translator developed for NSR (FDL: Full Data Logging
xlator), but doing that would require FDL to handle special cases for DHT2.
Furthermore, there are plans to redesign quota to rely on journals (journaled
quota). Also, designing a system with such tight coupling makes it hard to
switch to alternate implementations (e.g., server side AFR instead of NSR).
Therefore, implementing the journal as a regular file that's treated like any
other file by all layers below would (see the sketch after the list):

+ provide more control (discard, replay, etc..) to the DHT2 server component
  over the journal

+ enable the MDS to use NSR or AFR (server side) without any modifications to
  the journaling part. NSR/AFR would treat the journal as any other file and
  keep it replicated+consistent.

- restrict taking advantage of storing the (DHT2) journal on faster storage
  (SSDs)
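
To illustrate the "regular file" point: appending a journal record is then
just an ordinary write + fsync, so nothing below DHT2 needs to treat the file
specially. The path handling and record layout below are simplified
assumptions, not the actual implementation (in practice the fd would be kept
open and writes batched).

    /* Rough sketch, assuming the journal is a plain file on the brick. */
    #include <fcntl.h>
    #include <unistd.h>

    static int
    journal_append (const char *journal_path, const void *rec, size_t len)
    {
            int fd = open (journal_path, O_WRONLY | O_APPEND | O_CREAT, 0600);
            if (fd < 0)
                    return -1;

            ssize_t ret = write (fd, rec, len);
            if (ret == (ssize_t) len)
                    ret = fsync (fd);  /* record must be durable before the
                                          operation itself is applied */
            else
                    ret = -1;

            close (fd);
            return (int) ret;
    }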

Sharding
--------
Introduce the notion of block pointers for inodes. Block pointers are
distributed by DHT2 rather than individual files/objects. This changes the
translator API extensively. Try to leverage the existing shard implementation
and see if the concept of block pointers can be used in place of treating
each shard as a separate file. Treating each shard as a separate file bloats
up the amount of tracking (on the MDS) needed for each file shard.
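
The difference in MDS-side tracking can be pictured roughly as below; the
structure names and the block size are assumptions for illustration only.

    /* Illustration only: one inode plus a block list, instead of a
     * separate file (and gfid) per shard. */
    #include <stdint.h>
    #include <uuid/uuid.h>

    #define DHT2_BLOCK_SIZE (64 * 1024 * 1024)  /* assumed shard size */

    struct dht2_block_ptr {
            uuid_t   block_id;   /* identity of the block/shard */
            uint32_t ds_subvol;  /* data subvolume the block landed on,
                                    as chosen by DHT2's distribution */
    };

    struct dht2_inode {
            uuid_t                 gfid;
            uint64_t               size;
            uint32_t               nr_blocks;
            struct dht2_block_ptr *blocks;  /* distributed by DHT2 */
    };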

Size on MDS
-----------
https://review.gerrithub.io/#/c/253517/

Leader Election
---------------
It's important for the server side DHT2 component to know, in a replicated
MDS setup, if a brick is acting as a leader. As of now this information is a
part of NSR, but it needs to be carved out as a separate translator on the
server.

[More on this in the design document]
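
The kind of interface such a carved-out translator would need to expose is
sketched below; these are hypothetical declarations, not an existing xlator
API.

    /* Hypothetical interface for a stand-alone leader-election xlator. */
    #include <stdbool.h>

    /* Is this brick currently the leader of its replica group? Only the
     * leader would drive/log operations for the group. */
    bool leader_is_local (void);

    /* Notification hook so DHT2 can react when leadership changes
     * (e.g., take over or hand off pending journal replay). */
    typedef void (*leader_change_cb_t) (bool now_leader);
    void leader_register_notify (leader_change_cb_t cb);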

Server Graph
------------
DHT2 MDSes "interact" with each other: a non-replicated MDS would typically
have the client translator loaded on the server to "talk" to the other
(N - 1) MDS nodes. When the MDS is replicated (NSR for example) then:

1) 'N - 1' NSR client component(s) are loaded on the server to talk to the
   other replica groups (when a (sub)operation needs to be performed on a
   node outside the local replica group)

         where, N == number of distribute subvolumes

2) 'N' NSR client component(s) are loaded on the client for high availability
   of the distributed MDS.
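
As a rough illustration of case (1), the MDS-side graph for N = 3 might look
something like the volfile fragment below. For brevity, plain protocol/client
translators stand in for the NSR client component(s) that would actually sit
there in a replicated setup; the "cluster/dht2" type and all volume/host
names are made up, and this is not a generated volfile.

    volume mds-1-client             # (N - 1) = 2 clients to the other
        type protocol/client        # replica groups / MDS nodes
        option remote-host mds-1.example.com
        option remote-subvolume mds-1-nsr
    end-volume

    volume mds-2-client
        type protocol/client
        option remote-host mds-2.example.com
        option remote-subvolume mds-2-nsr
    end-volume

    volume dht2-mds                 # local MDS: executes local (sub)ops,
        type cluster/dht2           # forwards the rest to the other MDSes
        subvolumes local-nsr mds-1-client mds-2-client
    end-volume

On the client side, case (2), all N such (NSR) client components would be
loaded so the client can reach any of the replicated MDSes directly.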

--

As usual, comments are more than welcome. If things are still unclear or you
want to read more, then hang in for the design doc.

Thanks,

                                Venky