On Thu, 26 Sep 2013 18:07:59 +0400, Vyacheslav Dubeyko wrote:
>> For the tree of xanodes, it seems a bit inefficent because it is
>> implemented over xafile (meta data file), and this introduces
>> threefold tree lookups to access the target disk block:
>>
>> xanode tree lookup --> xafile btree lookup --> DAT btree lookup
>>
>> Simple block chaining over xafile may be better for most cases.
>>
>> I think the tree-over-xafile implementation is a candidate, but please
>> try considering ideas to replace the the tree based lookup into an
>> index based access over xafile.
>>
>
> First of all, in my concept the xafile is aggregate of xanodes. Every
> xanode is accessed by index in xafile. So, namely index based access
> over xafile is used in my implementation.
I said this is "For the tree of xanodes". You are misunderstanding.
I don't comment on shared xanodes here.
Yes, of course, I know block access over xafile i_xattr can be used
for that index access. That is what we intended to prepare the
i_xattr field and the reserved extended attribute metadata file
(NILFS_XATTR_INO)
> The xafile as container of xanodes is important concept from the GC
> point of view. Data blocks and xattr blocks have different nature of
> aging. So, mixture of data and xattr blocks can be not GC
> friendly.
Yes, I know. I don't deny using xafile. The xafile approach also has
the merit of disk format compatibility. These are exactly the reason
why original nilfs design is reserving xattr inode number.
> Access to xanodes by means of index over xafile is
> important concept too. Because namely index of xanode in xafile is
> stored as xanode number in header and other structures. So, xafile
> as metadata file hides problem of translation logical index (xanode
> number) into real block number.
>
> About lookup... On-disk inode keeps xattrs by itself in the inline
> xanode case. So, we don't need to search an inline xanode in such case.
> An inode keeps xanode number (or index of xafile's node) in the case of
> shared xanode. It means that we have metadata file lookup overhead only
> (xafile btree lookup --> DAT btree lookup). An inode keeps xanode number
> of a first xanode of dedicated xanodes' tree. Only this inode possesses
> by such dedicated tree. So, dedicated tree will not big in ordinary
> case.
I am not worrying about lookup of inline xattr or shared xattr. I
inteded to comment on the tree type xattr.
> The dedicated xanodes' tree can be big if an inode will have very
> big count of xattrs (or xattrs with big content size). Every xanode has
> small reserved space for index keys. But, initially, only first xanode
> is used for storing index keys. And reserved space for index keys is
> growing in the case of necessity to have more index keys. I suppose that
> one index level will be enough for most cases. So, usually, reading of
> first xanode will be enough for getting knowledge about all leaf xanodes
> in dedicated xanodes' tree.
Ok, you are pointing out that the depth of xattr trees is expected to
be small. That is reasonable.
> Yes, (xafile btree lookup --> DAT btree lookup) can be the reason of
> overhead. But I think that we need to optimize this overhead for all
> use-cases, anyway.
Agreed.
>> For the inline xattr, I think it should not be mandatory. It should
>> be an option for users who wants faster accesses to the extended
>> attributes and can accept reformat of the file system with 256-bit
>> size inodes for that purpose.
>>
>
> Yes, sure. Inline xattrs are optional. If a user hasn't opportunity (or
> desire) for reformat NILFS2 volume then it will not use inline xattrs
> completely.
Okay.
>> > /*
>> > * struct nilfs_inline_xanode_header - header of inline node
>> > * @magic: magic number for identification
>> > * @type: node type
>> > * @entries: number of entries in node
>> > * @log_node_size: log2(node size in bytes)
>>
>> log2(node size in bytes) - 10 ?
>>
>> 4-bit width log_node_size only can represent 0~15 bytes.
>>
>
> I think that xafile's xanode size can be in range: 1KB (1024) - 32KB
> (32768). I suppose that maximum value in 32KB is enough for all cases.
> So, it needs 15 bytes for 32KB node size.
Ok, this is my misunderstanding.
4-bit width log_node_size can represent 2^0 ~ 2^15 bytes (32KB).
>> > * @flags: node flags
>> > */
>> > struct nilfs_inline_xanode_header {
>> > __le16 magic;
>>
>> Is this magic field necessary? The reserved field
>> at the end of the current disk inode is valuable.
>>
>> Setting the same magic number to all inodes seems wasteful.
>>
>
> The magic field is useful for consistency checking. And magic number
> keeps for the whole xanode. I don't think that 2-byte value wastes 4KB
> xanode, for example. Moreover, I use magic for detecting presence of
> inline xanode in extended space of on-disk inode. Of course, we can
> leave padding field at the end of on-disk inode untouched. But it needs
> to use another way of detecting inline xanode presence. What do you
> suggest?
I am not denying adding 2-bytes magic number to 4k-bytes blocks on
xafile.
What I am saying is the magic number inserted in nilfs_inode.
i_pad is guaranteed to be zero-filled by mkfs, so you don't have to
add 16-bit magic field; a few bits magic number field is enough for
the inlined xattr header.
You can do the consistency check with the few bit magic number and
whether i_xattr is zero or not. (for compatibility reason, we must
define i_xattr is zero if and only if xattr is not used.)
We agreed that inline xattr should be optional, so, I think moving
nilfs_inline_xanode_header into the extended inode area is a
reasonable option.
Anyway, I want to reserve at least a 16-bit field for inode checksum
functionality.
>> > /*
>> > * struct nilfs_xattr_inline_key - xattr key for inline node
>> > * @type: key type
>> > * @flags: key's flags
>> > * @reserved: reserved field
>> > * @name_index: attribute name index
>> > * @name_hash: name hash
>> > * @entry_offset: entry offset inside the node
>> > * @entry_size: size of entry in bytes
>> > */
>> > struct nilfs_xattr_inline_key {
>> > u8 type : 2;
>> > u8 flags : 6;
>> > u8 reserved;
>> > __le16 name_index;
>> > __le32 name_hash;
>> > __le16 entry_offset;
>> > __le16 entry_size;
>> > } __packed;
>>
>> For what purpose is the type field used ?
>
> The purpose of "type" field is to distinguish different type of xanode's
> headers or keys. Namely, for example, detecting that we have header of
> inline xanode, shared xanode or tree xanode.
That looks to be defined as the purpose of "flags" field. See the
following definitions:
> /* Xattr key's flags */
> #define NILFS_INLINE_NODE_KEY 0
> #define NILFS_SHARED_XANODE_KEY 1
> #define NILFS_XANODE_INDEX_KEY 2
> #define NILFS_XANODE_LEAF_KEY 3
>> Again, the type field. What is the difference between the type field
>> and the flags field ?
>>
>
> The type field is part of flags field. So, it can be extracted from
> flags field by means of mask.
OK, please reorganize the two fields when you will get rid of
bit-field uses.
>> > struct nilfs_xattr_index_key {
>> > u8 type : 2;
>> > u8 flags : 6;
>> > u8 reserved1;
>> > __le16 name_index;
>> > __le32 name_hash;
>> > __le64 parent;
>> > __le64 leaf;
>> > } __packed;
>>
>> Why is the "parent" field required ?
>>
>> The size of this structure is 24 bytes.
>> It fits in 16 bytes structure if the parent field is eliminable.
>> Otherwise, you need 8 more pad bytes to align on-disk structures
>> to 8 byte boundary.
>
> I need to think over the real necessity of "parent" field. But my
> initial intention was to have backpointer on index xanode that describe
> this xanode as leaf.
I know it. What I was asking about is the necessity of the "parent"
field.
> But now I doubt that keeps this value in an index
> key is a good idea. Maybe, it makes sense to keep this value in xanode's
> header in the case of real necessity in "parent" backpointer. So, I need
> to reflect about it.
Ok, please reconsider it as needed when you proceed implementation.
Regards,
Ryusuke Konishi
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html