On 25.04.2012 00:23, Mark Fasheh wrote:
> Jan, firstly thanks very much for the thorough review! I wanted to make sure
> I got the collision handling going before addressing the issues you found.
> Again thanks, and my notes are below.



> On Thu, Apr 12, 2012 at 03:08:27PM +0200, Jan Schmidt wrote:
>>> +/*
>>> + * Theoretical limit is larger, but we keep this down to a sane
>>> + * value. That should limit greatly the possibility of collisions on
>>> + * inode ref items.
>>> + */
>>> +#define BTRFS_LINK_MAX 65535U
>>
>> Do we really want an artificial limit like that?
> 
> I took a look at other file systems and this seems to be a pretty reasonable
> maximum. Ext4 and XFS in particular have similar limits.
> 
> Granted, btrfs should be better equipped to deal with large numbers of hard
> links - at least in the area of consistency checking.
> 
> That said, I think keeping this down to "only" 65 thousand hard links per
> inode should work out fine. Also it's not a large change to increase or
> remove the limit should we have to.

Agreed.

>>> -struct btrfs_inode_ref *
>>> +static int find_name_in_ext_backref(struct btrfs_path *path, const char 
>>> *name,
>>> +                              int name_len, struct btrfs_inode_extref 
>>> **ref_ret)
>>> +{
>>> +   struct extent_buffer *leaf;
>>> +   struct btrfs_inode_extref *ref;
>>> +   unsigned long ptr;
>>> +   unsigned long name_ptr;
>>> +   u32 item_size;
>>> +
>>> +   leaf = path->nodes[0];
>>> +   item_size = btrfs_item_size_nr(leaf, path->slots[0]);
>>> +   ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
>>> +   ref = (struct btrfs_inode_extref *)ptr;
>>> +
>>> +   /*
>>> +    * There isn't actually anything to "search" in an extended
>>> +    * backref - one name is directly attached to each
>>> +    * item. Instead what we're really doing here is some sort of
>>> +    * sanity check - does the name exist where it should have
>>> +    * been found. If all is well, we'll return success and the
>>> +    * inode ref object.
>>> +    */
>>
>> In that case, the name is not a good choice. However, if we change how
>> we deal with hash collisions, the name might be right and the comment
>> might go away as we really look through more than one item. If we leave
>> it like this, please rename that function.
> 
> Yeah, great catch. In the newer version as you've noted, it *is* actually a
> 'find me the name' function, so the comment (and resulting EROFS) have been
> removed.
> 
> 
>>> +   name_ptr = (unsigned long)(&ref->name);
>>> +   if (btrfs_inode_extref_name_len(leaf, ref) == name_len
>>> +       && (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0)) {
>>> +           *ref_ret = ref;
>>> +           return 1;
>>> +   }
>>> +   return 0;
>>> +}
>>> +
>>> +/*
>>> + * Figure the key offset of an extended inode ref
>>> + */
>>> +u64 btrfs_extref_key_off(u64 parent_objectid, const char *name, int 
>>> name_len)
>>
>> I'd suggest to call this one btrfs_extref_hash and move it to
>> fs/btrfs/hash.h. That way, we can also drop the include for
>> <linux/crc32c.h> above.
> 
> Done.
> 
> 
>>> +{
>>> +   return (u64) crc32c(parent_objectid, name, name_len);
>>> +}
>>> +
>>> +static struct btrfs_inode_ref *
>>>  btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans,
>>> -                   struct btrfs_root *root,
>>> -                   struct btrfs_path *path,
>>> -                   const char *name, int name_len,
>>> -                   u64 inode_objectid, u64 ref_objectid, int mod)
>>> +                  struct btrfs_root *root,
>>> +                  struct btrfs_path *path,
>>> +                  const char *name, int name_len,
>>> +                  u64 inode_objectid, u64 ref_objectid, int ins_len,
>>> +                  int cow)
>>>  {
>>> +   int ret;
>>>     struct btrfs_key key;
>>>     struct btrfs_inode_ref *ref;
>>> -   int ins_len = mod < 0 ? -1 : 0;
>>> -   int cow = mod != 0;
>>> -   int ret;
>>>  
>>>     key.objectid = inode_objectid;
>>>     key.type = BTRFS_INODE_REF_KEY;
>>> @@ -76,20 +115,137 @@ btrfs_lookup_inode_ref(struct btrfs_trans_handle 
>>> *trans,
>>>     return ref;
>>>  }
>>>  
>>> -int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
>>> +struct btrfs_inode_extref *
>>> +btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
>>> +                     struct btrfs_root *root,
>>> +                     struct btrfs_path *path,
>>> +                     const char *name, int name_len,
>>> +                     u64 inode_objectid, u64 ref_objectid, int ins_len,
>>> +                     int cow)
>>> +{
>>> +   int ret;
>>> +   struct btrfs_key key;
>>> +   struct btrfs_inode_extref *ref;
>>                                    ^^^
>> Please use some other identifier than the one we use for struct
>> btrfs_inode_ref. Suggestion: extref or eref.
> 
> Yeah I'll make sure they're all changed to extref. Makes it much easier to
> read.
> 
> 
>>> +
>>> +   key.objectid = inode_objectid;
>>> +   key.type = BTRFS_INODE_EXTREF_KEY;
>>> +   key.offset = btrfs_extref_key_off(ref_objectid, name, name_len);
>>                      ^^^^^^^^^^^^^^^^^^^^
>> Get's clearer when that is called btrfs_extref_hash.
>>
>>> +
>>> +   ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
>>> +   if (ret < 0)
>>> +           return ERR_PTR(ret);
>>> +   if (ret > 0)
>>> +           return NULL;
>>> +   if (!find_name_in_ext_backref(path, name, name_len, &ref))
>>> +           return NULL;
>>> +   return ref;
>>> +}
>>> +
>>> +int btrfs_get_inode_ref_index(struct btrfs_trans_handle *trans,
>>> +                         struct btrfs_root *root,
>>> +                         struct btrfs_path *path,
>>> +                         const char *name, int name_len,
>>> +                         u64 inode_objectid, u64 ref_objectid, int mod,
>>> +                         u64 *ret_index)
>>> +{
>>> +   struct btrfs_inode_ref *ref1;
>>> +   struct btrfs_inode_extref *ref2;
>>                                    ^^^^
>> Please, use "ref" and ("eref" or "extref").
>>
>>> +   int ins_len = mod < 0 ? -1 : 0;
>>> +   int cow = mod != 0;
>>> +
>>> +   ref1 = btrfs_lookup_inode_ref(trans, root, path, name, name_len,
>>> +                                 inode_objectid, ref_objectid, ins_len,
>>> +                                 cow);
>>> +   if (IS_ERR(ref1))
>>> +           return PTR_ERR(ref1);
>>> +
>>> +   if (ref1 != NULL) {
>>> +           *ret_index = btrfs_inode_ref_index(path->nodes[0], ref1);
>>> +           return 0;
>>> +   }
>>> +
>>> +   btrfs_release_path(path);
>>> +
>>> +   ref2 = btrfs_lookup_inode_extref(trans, root, path, name,
>>> +                                    name_len, inode_objectid,
>>> +                                    ref_objectid, ins_len, cow);
>>> +   if (IS_ERR(ref2))
>>> +           return PTR_ERR(ref2);
>>> +
>>> +   if (ref2) {
>>> +           *ret_index = btrfs_inode_extref_index(path->nodes[0], ref2);
>>> +           return 0;
>>> +   }
>>> +
>>> +   return -ENOENT;
>>> +}
>>> +
>>> +int btrfs_del_inode_extref(struct btrfs_trans_handle *trans,
>>>                        struct btrfs_root *root,
>>>                        const char *name, int name_len,
>>>                        u64 inode_objectid, u64 ref_objectid, u64 *index)
>>>  {
>>>     struct btrfs_path *path;
>>>     struct btrfs_key key;
>>> +   struct btrfs_inode_extref *ref2;
>>                                    ^^^^
>>
>>> +   struct extent_buffer *leaf;
>>> +   int ret;
>>> +
>>> +   key.objectid = inode_objectid;
>>> +   btrfs_set_key_type(&key, BTRFS_INODE_EXTREF_KEY);
>>> +   key.offset = btrfs_extref_key_off(ref_objectid, name, name_len);
>>> +
>>> +   path = btrfs_alloc_path();
>>> +   if (!path)
>>> +           return -ENOMEM;
>>> +
>>> +   path->leave_spinning = 1;
>>> +
>>> +   ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
>>> +   if (ret > 0) {
>>> +           ret = -ENOENT;
>>> +           goto out;
>>> +   } else if (ret < 0) {
>>> +           goto out;
>>> +   }
>>
>> Clearer (to me):
>>
>> if (ret > 0)
>>      ret = -ENOENT;
>> if (ret < 0)
>>      goto out;
> 
> Agreed, changed.
> 
>>
>>> +
>>> +   /*
>>> +    * Sanity check - did we find the right item for this name?
>>> +    * This should always succeed so error here will make the FS
>>> +    * readonly.
>>> +    */
>>> +   if (!find_name_in_ext_backref(path, name, name_len, &ref2)) {
>>> +           btrfs_std_error(root->fs_info, -ENOENT);
>>> +           ret = -EROFS;
>>
>> Simply returning -EROFS doesn't make the fs read-only, does it? I'd
>> suggest to return -EIO and drop the comment above.
> 
> Actually, the btrfs_std_error() line above that makes the FS readonly (in
> which case EROFS is a good return value).

Got it. -EROFS is a good choice, then.

> Btw, I think that since is this a request to delete the item that not
> finding it in the tree (even with the possiblity of collision handling) is
> still a possible corruption.
> 
> 
>>> +           goto out;
>>> +   }
>>> +
>>> +   leaf = path->nodes[0];
>>> +   if (index)
>>> +           *index = btrfs_inode_extref_index(leaf, ref2);
>>> +
>>> +   ret = btrfs_del_item(trans, root, path);
>>> +
>>> +out:
>>> +   btrfs_free_path(path);
>>> +
>>> +   return ret;
>>> +}
>>> +
>>> +int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
>>> +                   struct btrfs_root *root,
>>> +                   const char *name, int name_len,
>>> +                   u64 inode_objectid, u64 ref_objectid, u64 *index)
>>> +{
>>> +   struct btrfs_path *path;
>>> +   struct btrfs_key key;
>>>     struct btrfs_inode_ref *ref;
>>>     struct extent_buffer *leaf;
>>>     unsigned long ptr;
>>>     unsigned long item_start;
>>>     u32 item_size;
>>>     u32 sub_item_len;
>>> -   int ret;
>>> +   int ret, search_ext_refs = 0;
>>                ^^^^^^^^^^^^^^^^^^^^^
>> Documentation/CodingStyle:
>> --
>> To this end, use just one data declaration per line (no commas for
>> multiple data declarations).
>> --
>>
>> You do this multiple times in your whole patch set.
> 
> Sure, I can change that. There's lots of kernel code that blatantly ignores
> this but I suppose that's no excuse...
> 
> 
>>>     int del_len = name_len + sizeof(*ref);
>>>  
>>>     key.objectid = inode_objectid;
>>> @@ -105,12 +261,14 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle 
>>> *trans,
>>>     ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
>>>     if (ret > 0) {
>>>             ret = -ENOENT;
>>> +           search_ext_refs = 1;
>>>             goto out;
>>>     } else if (ret < 0) {
>>>             goto out;
>>>     }
>>>     if (!find_name_in_backref(path, name, name_len, &ref)) {
>>>             ret = -ENOENT;
>>> +           search_ext_refs = 1;
>>>             goto out;
>>>     }
>>>     leaf = path->nodes[0];
>>> @@ -132,6 +290,59 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle 
>>> *trans,
>>>                               item_size - sub_item_len, 1);
>>>  out:
>>>     btrfs_free_path(path);
>>> +
>>> +   if (search_ext_refs) {
>>
>> As far as I see it, you can drop search_ext_refs and simply go this way
>> on ret == -ENOENT.
> 
> I actually made it explicit on purpose. If you look at the function there's
> several places where ret can be set and then bubbled down. Instead of
> depending on the called functions never setting -ENOENT (and future
> programmers to get that it's a 'magic' value), I prefer to just explicitely
> set a single int variable.

Okay, sounds reasonable.

>>> +           /* 
>>                   ^
>> Catched by accident: trailing whitespace.
> 
> Oh thanks.
> 
>>
>>> +            * No refs were found, or we could not find the
>>> +            * name in our ref array. Find and remove the extended
>>> +            * inode ref then.
>>> +            */
>>> +           return btrfs_del_inode_extref(trans, root, name, name_len,
>>> +                                         inode_objectid, ref_objectid, 
>>> index);
>>> +   }
>>> +
>>> +   return ret;
>>> +}
>>> +
>>> +static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans,
>>> +                                struct btrfs_root *root,
>>> +                                const char *name, int name_len,
>>> +                                u64 inode_objectid, u64 ref_objectid, u64 
>>> index)
>>
>> Are we missing a check against BTRFS_LINK_MAX in this function?
> 
> No, the caller is supposed to check this, in theory before starting a
> transaction to change the FS. So btrfs_link() does the checking for us.

Is that worth a comment? Or is that clear to anybody familiar with such
things anyway (in which case I'd not add that comment)?

>>> +{
>>> +   struct btrfs_path *path;
>>> +   struct btrfs_key key;
>>> +   struct btrfs_inode_extref *ref;
>> extref
>>
>>> +   unsigned long ptr;
>>> +   int ret;
>>> +   int ins_len = name_len + sizeof(*ref);
>>> +
>>> +   key.objectid = inode_objectid;
>>> +   key.type = BTRFS_INODE_EXTREF_KEY;
>>> +   key.offset = btrfs_extref_key_off(ref_objectid, name, name_len);
>>> +
>>> +   path = btrfs_alloc_path();
>>> +   if (!path)
>>> +           return -ENOMEM;
>>> +
>>> +   path->leave_spinning = 1;
>>> +   ret = btrfs_insert_empty_item(trans, root, path, &key,
>>> +                                 ins_len);
>>> +   if (ret < 0)
>>> +           goto out;
>>> +
>>> +   ref = btrfs_item_ptr(path->nodes[0], path->slots[0],
>>> +                        struct btrfs_inode_extref);
>>> +
>>> +   btrfs_set_inode_extref_name_len(path->nodes[0], ref, name_len);
>>> +   btrfs_set_inode_extref_index(path->nodes[0], ref, index);
>>> +   btrfs_set_inode_extref_parent(path->nodes[0], ref, ref_objectid);
>>> +
>>> +   ptr = (unsigned long)&ref->name;
>>> +   write_extent_buffer(path->nodes[0], name, ptr, name_len);
>>> +   btrfs_mark_buffer_dirty(path->nodes[0]);
>>> +
>>> +out:
>>> +   btrfs_free_path(path);
>>>     return ret;
>>>  }
>>>  
>>> @@ -189,6 +400,19 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle 
>>> *trans,
>>>  
>>>  out:
>>>     btrfs_free_path(path);
>>> +
>>> +   if (ret == -EMLINK) {
>>> +           struct btrfs_super_block *disk_super = 
>>> root->fs_info->super_copy;
>>> +           /* We ran out of space in the ref array. Need to
>>> +            * add an extended ref. */
>>> +           if (btrfs_super_incompat_flags(disk_super)
>>> +               & BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
>>> +                   ret = btrfs_insert_inode_extref(trans, root, name,
>>> +                                                   name_len,
>>> +                                                   inode_objectid,
>>> +                                                   ref_objectid, index);
>>> +   }
>>> +
>>>     return ret;
>>>  }
>>>  
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 892b347..0ca525d 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -2689,7 +2689,6 @@ static struct btrfs_trans_handle 
>>> *__unlink_start_trans(struct inode *dir,
>>>     struct btrfs_trans_handle *trans;
>>>     struct btrfs_root *root = BTRFS_I(dir)->root;
>>>     struct btrfs_path *path;
>>> -   struct btrfs_inode_ref *ref;
>>>     struct btrfs_dir_item *di;
>>>     struct inode *inode = dentry->d_inode;
>>>     u64 index;
>>> @@ -2803,17 +2802,16 @@ static struct btrfs_trans_handle 
>>> *__unlink_start_trans(struct inode *dir,
>>>     }
>>>     btrfs_release_path(path);
>>>  
>>> -   ref = btrfs_lookup_inode_ref(trans, root, path,
>>> -                           dentry->d_name.name, dentry->d_name.len,
>>> -                           ino, dir_ino, 0);
>>> -   if (IS_ERR(ref)) {
>>> -           err = PTR_ERR(ref);
>>> +   ret = btrfs_get_inode_ref_index(trans, root, path, dentry->d_name.name,
>>> +                                   dentry->d_name.len, ino, dir_ino, 0, 
>>> &index);
>>
>> This line is >80 chars. Please use scripts/checkpatch.pl to catch those.
> 
> Fixed.
> 
> 
>>> +   if (ret) {
>>> +           err = ret;
>>>             goto out;
>>>     }
>>> -   BUG_ON(!ref);
>>> +
>>>     if (check_path_shared(root, path))
>>>             goto out;
>>> -   index = btrfs_inode_ref_index(path->nodes[0], ref);
>>> +
>>>     btrfs_release_path(path);
>>>  
>>>     /*
>>> @@ -4484,6 +4482,10 @@ static struct inode *btrfs_new_inode(struct 
>>> btrfs_trans_handle *trans,
>>>     btrfs_set_key_type(&key[0], BTRFS_INODE_ITEM_KEY);
>>>     key[0].offset = 0;
>>>  
>>> +   /*
>>> +    * Start new inodes with a V1 ref. This is slightly more
>>> +    * efficient for small numbers of hard links.
>>> +    */
>>
>> I'd drop that comment. extref is no "V2" kind of ref. But you can leave
>> it in if you feel it helps anyone later. 
> 
> Yeah at one point they were called 'v2' due to developer laziness :)
> 
> I've updated the comment:
> 
>       /*
>        * Start new inodes with an inode_ref. This is slightly more
>        * efficient for small numbers of hard links since they will
>        * be packed into one item. Extended refs will kick in if we
>        * add more hard links than can fit in the ref item.
>        */
> 
> 
> It could probably still be improved beyond that. Mostly I want to point out
> that we never start with extended refs so that anyone looking at the code
> will understand the 'flow' of btrfs inode references.

I think its fine like it stands above.

-Jan

>>>     key[1].objectid = objectid;
>>>     btrfs_set_key_type(&key[1], BTRFS_INODE_REF_KEY);
>>>     key[1].offset = ref_objectid;
>>> @@ -4777,7 +4779,7 @@ static int btrfs_link(struct dentry *old_dentry, 
>>> struct inode *dir,
>>>     if (root->objectid != BTRFS_I(inode)->root->objectid)
>>>             return -EXDEV;
>>>  
>>> -   if (inode->i_nlink == ~0U)
>>> +   if (inode->i_nlink >= BTRFS_LINK_MAX)
>>>             return -EMLINK;
>>>  
>>>     err = btrfs_set_inode_index(dir, &index);
>>
>> Reviewed-by: Jan Schmidt <[email protected]>
> 
> Thanks again Jan. I'll be sure to CC you on my next drop!
>       --Mark
> 
> --
> Mark Fasheh
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to