Using the GFID does not work for d_off. A GFID represents an inode, while a d_off represents a directory entry. Therefore, using the GFID as an alternative to d_off breaks down when a single directory contains multiple hardlinks to the same inode.
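A minimal sketch of the collision described above (the GFIDs and file names are invented for illustration, and the offset derivation is a simplification, not GlusterFS's actual code):

```python
# Two directory entries that are hardlinks point at the same inode,
# so they share one GFID.
entries = [
    ("file_a", "9d2c1f0a"),   # (name, GFID of the underlying inode)
    ("file_b", "9d2c1f0a"),   # hardlink to the same inode -> same GFID
    ("file_c", "41e7bb35"),
]

def d_off_from_gfid(gfid):
    # Derive the directory offset from the GFID's leading bits.
    return int(gfid, 16)

offsets = [d_off_from_gfid(gfid) for _, gfid in entries]
# file_a and file_b get identical d_off values, so a client resuming a
# readdir at that offset cannot distinguish the two entries.
assert offsets[0] == offsets[1]
```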
On Tue Dec 23 2014 at 2:20:34 AM Xavier Hernandez <xhernan...@datalab.es> wrote:
> On 12/22/2014 06:41 PM, Jeff Darcy wrote:
> >> An alternative would be to convert directories into regular files from
> >> the brick point of view.
> >>
> >> The benefits of this would be:
> >>
> >> * d_off would be controlled by gluster, so all bricks would have the
> >> same d_off and order. No need to use any d_off mapping or transformation.
> >
> > I don't think a full-out change from real directories to virtual ones is
> > in the cards, but a variant of this idea might be worth exploring further.
> > If we had a *server side* component to map between on-disk d_off values
> > and those we present to clients, then it might be able to do a better job
> > than the local FS of ensuring uniqueness within the bits (e.g. 48 of them)
> > that are left over after we subtract some for a brick ID. This could be
> > enough to make the bit-stealing approach (on the client) viable. There
> > are probably some issues with failing over between replicas, which should
> > have the same files but might not have assigned the same internal d_off
> > values, but those issues might be avoidable if the d_off values are
> > deterministic with respect to GFIDs.
>
> Having a server-side xlator seems a better approach; however, I see
> some problems that need to be solved:
>
> The mapper should work on the fly (i.e. it should map the local d_off
> to the client d_off without having full knowledge of the directory
> contents). This is a good approach for really big directories because
> it doesn't require wasting large amounts of memory, but it will be hard
> to find a way to avoid duplicates, especially if we are limited to ~48
> bits.
>
> Making it based on the GFID would be a good way to have a common d_off
> between bricks; however, maintaining order will be harder. It will also
> be hard to guarantee uniqueness if the mapping is deterministic and the
> directory is very big.
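The bit-stealing transform Jeff describes could look roughly like the sketch below. The bit widths and function names are assumptions for illustration only, not GlusterFS's actual layout:

```python
# Sketch: steal a few high bits of the 64-bit d_off for a brick ID,
# leaving the remaining ~48 bits for the brick-local offset.

DOFF_BITS = 64
BRICK_BITS = 16                          # assumed width of the brick ID
LOCAL_BITS = DOFF_BITS - BRICK_BITS      # 48 bits left for the local d_off
LOCAL_MASK = (1 << LOCAL_BITS) - 1

def encode_doff(brick_id, local_doff):
    # Fails if the local filesystem hands back an offset wider than 48
    # bits -- exactly the uniqueness problem discussed in this thread.
    if local_doff > LOCAL_MASK:
        raise ValueError("local d_off does not fit in %d bits" % LOCAL_BITS)
    return (brick_id << LOCAL_BITS) | local_doff

def decode_doff(doff):
    return doff >> LOCAL_BITS, doff & LOCAL_MASK

doff = encode_doff(3, 0x1234)
assert decode_doff(doff) == (3, 0x1234)
```

A server-side mapper would only be needed because the local FS cannot be trusted to keep its d_off values inside `LOCAL_BITS`.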
> Otherwise, it would need to read the full directory contents before
> returning mapped d_off's.
>
> To minimize the collision problem, we need to solve the ordering
> problem. If we can guarantee that all bricks return directory entries
> in the same order and with the same d_off, we don't need to reserve any
> bits in d_off.
>
> I think the virtual directories solution should be the one to consider
> for 4.0. For earlier versions we can try to find an intermediate
> solution.
>
> Following your idea of a server side component, could this be useful?
>
> * Keep each directory's entries in a doubly linked list stored in
> xattrs of each inode.
>
> * Use this linked list to build the readdir answer.
>
> * Use the first 64 (or 63) bits of the gfid as the d_off.
>
> * There will be two special offsets: 0 for '.' and 1 for '..'
>
> Example (using shorter gfid's for simplicity):
>
> Directory root with gfid 0001
> Directory 'test1' inside root with gfid 1111
> Directory 'test2' inside root with gfid 2222
> Entry 'entry1' inside 'test1' with gfid 3333
> Entry 'entry2' inside 'test1' with gfid 4444
> Entry 'entry3' inside 'test2' with gfid 4444
> Entry 'entry4' inside 'test2' with gfid 5555
> Entry 'entry5' inside 'test2' with gfid 6666
>
> / (0001)
>     test1/ (1111)
>         entry1 (3333)
>         entry2 (4444)
>     test2/ (2222)
>         entry3 (4444)
>         entry4 (5555)
>         entry5 (6666)
>
> Note that entry2 and entry3 are hardlinks.
>
> xattrs of root (0001):
> trusted.dirmap.0001.next = 1111
> trusted.dirmap.0001.prev = 2222
>
> xattrs of 'test1' (1111):
> trusted.dirmap.0001.next = 2222
> trusted.dirmap.0001.prev = 0001
> trusted.dirmap.1111.next = 3333
> trusted.dirmap.1111.prev = 4444
>
> xattrs of 'test2' (2222):
> trusted.dirmap.0001.next = 0001
> trusted.dirmap.0001.prev = 1111
> trusted.dirmap.2222.next = 4444
> trusted.dirmap.2222.prev = 6666
>
> xattrs of 'entry1' (3333):
> trusted.dirmap.1111.next = 4444
> trusted.dirmap.1111.prev = 1111
>
> xattrs of 'entry2'/'entry3' (4444):
> trusted.dirmap.1111.next = 1111
> trusted.dirmap.1111.prev = 3333
> trusted.dirmap.2222.next = 5555
> trusted.dirmap.2222.prev = 2222
>
> xattrs of 'entry4' (5555):
> trusted.dirmap.2222.next = 6666
> trusted.dirmap.2222.prev = 4444
>
> xattrs of 'entry5' (6666):
> trusted.dirmap.2222.next = 2222
> trusted.dirmap.2222.prev = 5555
>
> It's easy to enumerate all entries from the beginning of a directory.
> Also, since we already return extra information from each inode in a
> directory, accessing these new xattrs doesn't represent a big impact.
>
> Given a random d_off, it's relatively easy to find a gfid that starts
> with d_off and belongs to the directory (using .glusterfs/xx/yy/...).
> It would also be easy to implement a "nearest d_off" lookup.
>
> This mechanism guarantees the same order and d_off on all bricks
> (assuming that directory modification operations are done with some
> locks held), and all management is done locally on each brick, making
> it transparent to the clients. It also has a very low probability of
> collisions (we can use 63 or 64 bits for d_off instead of 48) and does
> not require any transformation in dht/afr/ec.
>
> However, some special operation would need to be implemented to allow
> healing procedures to add a directory entry in the correct place, so
> that order is maintained between bricks when things are not ok.
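The enumeration Xavi describes can be sketched as follows, using the example directory above. The dict stands in for on-disk xattrs, and only the `next` pointers are modeled (the `prev` pointers, used for removal, are omitted for brevity); this is an illustration of the idea, not actual xlator code:

```python
# inode gfid -> {directory gfid: next gfid in that directory's list}
# Mirrors the trusted.dirmap.<dir-gfid>.next xattrs of the 'test2' example.
xattrs = {
    "2222": {"2222": "4444"},                   # 'test2' heads its own list
    "4444": {"1111": "1111", "2222": "5555"},   # hardlinked entry2/entry3
    "5555": {"2222": "6666"},
    "6666": {"2222": "2222"},                   # last entry links back
}

def list_directory(dir_gfid):
    """Walk trusted.dirmap.<dir_gfid>.next pointers from the directory
    inode around to itself, yielding entry gfids in a stable order."""
    entries = []
    cur = xattrs[dir_gfid][dir_gfid]   # first entry after the dir itself
    while cur != dir_gfid:
        entries.append(cur)
        cur = xattrs[cur][dir_gfid]
    return entries

# Any brick holding the same xattrs enumerates 'test2' in the same order,
# so the gfid-derived d_off values agree across bricks.
assert list_directory("2222") == ["4444", "5555", "6666"]
```

Note that the hardlinked inode (4444) simply carries one `next` pointer per directory it appears in, which is how the scheme accommodates multiple links to one inode.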
>
> It adds some complexity to ensure integrity, but since this job is done
> locally on each brick, it seems easier to maintain.
>
> Xavi
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel