On Tue, Nov 13, 2018 at 04:29:33PM +0300, Timofey Titovets wrote:
> On Tue, 13 Nov 2018 at 04:52, Nick Terrell <terre...@fb.com>:
> >
> > > On Nov 12, 2018, at 4:33 PM, David Sterba <dste...@suse.cz> wrote:
> > >
> > > On Wed, Oct 31, 2018 at 11:11:08AM -0700, Nick Terrell wrote:
> > >> From: Jennifer Liu <jenniferliu...@fb.com>
> > >>
> > >> Adds zstd compression level support to btrfs. Zstd requires
> > >> different amounts of memory for each level, so the design had
> > >> to be modified to allow set_level() to allocate memory. We
> > >> preallocate one workspace of the maximum size to guarantee
> > >> forward progress. This feature is expected to be useful for
> > >> read-mostly filesystems, or when creating images.
> > >>
> > >> Benchmarks run in qemu on Intel x86 with a single core.
> > >> The benchmark measures the time to copy the Silesia corpus [0] to
> > >> a btrfs filesystem 10 times, then read it back.
> > >>
> > >> The two important things to note are:
> > >> - The decompression speed and memory remain constant.
> > >>   The memory required to decompress is the same as level 1.
> > >> - The compression speed and ratio will vary based on the source.
> > >>
> > >> Level   Ratio   Compression  Decompression  Compression Memory
> > >> 1       2.59    153 MB/s     112 MB/s       0.8 MB
> > >> 2       2.67    136 MB/s     113 MB/s       1.0 MB
> > >> 3       2.72    106 MB/s     115 MB/s       1.3 MB
> > >> 4       2.78     86 MB/s     109 MB/s       0.9 MB
> > >> 5       2.83     69 MB/s     109 MB/s       1.4 MB
> > >> 6       2.89     53 MB/s     110 MB/s       1.5 MB
> > >> 7       2.91     40 MB/s     112 MB/s       1.4 MB
> > >> 8       2.92     34 MB/s     110 MB/s       1.8 MB
> > >> 9       2.93     27 MB/s     109 MB/s       1.8 MB
> > >> 10      2.94     22 MB/s     109 MB/s       1.8 MB
> > >> 11      2.95     17 MB/s     114 MB/s       1.8 MB
> > >> 12      2.95     13 MB/s     113 MB/s       1.8 MB
> > >> 13      2.95     10 MB/s     111 MB/s       2.3 MB
> > >> 14      2.99      7 MB/s     110 MB/s       2.6 MB
> > >> 15      3.03      6 MB/s     110 MB/s       2.6 MB
> > >>
> > >> [0] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
> > >>
> > >> Signed-off-by: Jennifer Liu <jenniferliu...@fb.com>
> > >> Signed-off-by: Nick Terrell <terre...@fb.com>
> > >> Reviewed-by: Omar Sandoval <osan...@fb.com>
> > >> ---
> > >> v1 -> v2:
> > >> - Don't reflow the unchanged line.
> > >>
> > >>  fs/btrfs/compression.c | 169 +++++++++++++++++++++++++----------------
> > >>  fs/btrfs/compression.h |  18 +++--
> > >>  fs/btrfs/lzo.c         |   5 +-
> > >>  fs/btrfs/super.c       |   7 +-
> > >>  fs/btrfs/zlib.c        |  33 ++++----
> > >>  fs/btrfs/zstd.c        |  74 +++++++++++++-----
> > >>  6 files changed, 202 insertions(+), 104 deletions(-)
> > >>
> > >> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> > >> index 2955a4ea2fa8..b46652cb653e 100644
> > >> --- a/fs/btrfs/compression.c
> > >> +++ b/fs/btrfs/compression.c
> > >> @@ -822,9 +822,12 @@ void __init btrfs_init_compress(void)
> > >>
> > >>         /*
> > >>          * Preallocate one workspace for each compression type so
> > >> -        * we can guarantee forward progress in the worst case
> > >> +        * we can guarantee forward progress in the worst case.
> > >> +        * Provide the maximum compression level to guarantee large
> > >> +        * enough workspace.
> > >>          */
> > >> -       workspace = btrfs_compress_op[i]->alloc_workspace();
> > >> +       workspace = btrfs_compress_op[i]->alloc_workspace(
> > >> +                       btrfs_compress_op[i]->max_level);
> >
> > We provide the max level here, so we have at least one workspace per
> > compression type that is large enough.
> >
> > >>         if (IS_ERR(workspace)) {
> > >>                 pr_warn("BTRFS: cannot preallocate compression workspace, will try later\n");
> > >>         } else {
> > >> @@ -835,23 +838,78 @@ void __init btrfs_init_compress(void)
> > >>         }
> > >>  }
> > >>
> > >> +/*
> > >> + * put a workspace struct back on the list or free it if we have enough
> > >> + * idle ones sitting around
> > >> + */
> > >> +static void __free_workspace(int type, struct list_head *workspace,
> > >> +                            bool heuristic)
> > >> +{
> > >> +       int idx = type - 1;
> > >> +       struct list_head *idle_ws;
> > >> +       spinlock_t *ws_lock;
> > >> +       atomic_t *total_ws;
> > >> +       wait_queue_head_t *ws_wait;
> > >> +       int *free_ws;
> > >> +
> > >> +       if (heuristic) {
> > >> +               idle_ws = &btrfs_heuristic_ws.idle_ws;
> > >> +               ws_lock = &btrfs_heuristic_ws.ws_lock;
> > >> +               total_ws = &btrfs_heuristic_ws.total_ws;
> > >> +               ws_wait = &btrfs_heuristic_ws.ws_wait;
> > >> +               free_ws = &btrfs_heuristic_ws.free_ws;
> > >> +       } else {
> > >> +               idle_ws = &btrfs_comp_ws[idx].idle_ws;
> > >> +               ws_lock = &btrfs_comp_ws[idx].ws_lock;
> > >> +               total_ws = &btrfs_comp_ws[idx].total_ws;
> > >> +               ws_wait = &btrfs_comp_ws[idx].ws_wait;
> > >> +               free_ws = &btrfs_comp_ws[idx].free_ws;
> > >> +       }
> > >> +
> > >> +       spin_lock(ws_lock);
> > >> +       if (*free_ws <= num_online_cpus()) {
> > >> +               list_add(workspace, idle_ws);
> > >> +               (*free_ws)++;
> > >> +               spin_unlock(ws_lock);
> > >> +               goto wake;
> > >> +       }
> > >> +       spin_unlock(ws_lock);
> > >> +
> > >> +       if (heuristic)
> > >> +               free_heuristic_ws(workspace);
> > >> +       else
> > >> +               btrfs_compress_op[idx]->free_workspace(workspace);
> > >> +       atomic_dec(total_ws);
> > >> +wake:
> > >> +       cond_wake_up(ws_wait);
> > >> +}
> > >> +
> > >> +static void free_workspace(int type, struct list_head *ws)
> > >> +{
> > >> +       return __free_workspace(type, ws, false);
> > >> +}
> > >> +
> > >>  /*
> > >>   * This finds an available workspace or allocates a new one.
> > >>   * If it's not possible to allocate a new one, waits until there's one.
> > >>   * Preallocation makes a forward progress guarantees and we do not return
> > >>   * errors.
> > >>   */
> > >> -static struct list_head *__find_workspace(int type, bool heuristic)
> > >> +static struct list_head *__find_workspace(unsigned int type_level,
> > >> +                                         bool heuristic)
> > >>  {
> > >>         struct list_head *workspace;
> > >>         int cpus = num_online_cpus();
> > >> +       int type = type_level & 0xF;
> > >>         int idx = type - 1;
> > >> -       unsigned nofs_flag;
> > >> +       unsigned int level = (type_level & 0xF0) >> 4;
> > >> +       unsigned int nofs_flag;
> > >>         struct list_head *idle_ws;
> > >>         spinlock_t *ws_lock;
> > >>         atomic_t *total_ws;
> > >>         wait_queue_head_t *ws_wait;
> > >>         int *free_ws;
> > >> +       int ret;
> > >>
> > >>         if (heuristic) {
> > >>                 idle_ws = &btrfs_heuristic_ws.idle_ws;
> > >> @@ -874,8 +932,17 @@ static struct list_head *__find_workspace(int type, bool heuristic)
> > >>                 list_del(workspace);
> > >>                 (*free_ws)--;
> > >>                 spin_unlock(ws_lock);
> > >> +               if (!heuristic) {
> > >> +                       nofs_flag = memalloc_nofs_save();
> > >> +                       ret = btrfs_compress_op[idx]->set_level(workspace,
> > >> +                                                               level);
> > >> +                       memalloc_nofs_restore(nofs_flag);
> > >> +                       if (!ret) {
> > >> +                               free_workspace(type, workspace);
> > >> +                               goto again;
> > >> +                       }
> > >> +               }
> > >>                 return workspace;
> > >> -
> > >>         }
> > >>         if (atomic_read(total_ws) > cpus) {
> > >>                 DEFINE_WAIT(wait);
> > >> @@ -899,7 +966,8 @@ static struct list_head *__find_workspace(int type, bool heuristic)
> > >>         if (heuristic)
> > >>                 workspace = alloc_heuristic_ws();
> > >>         else
> > >> -               workspace = btrfs_compress_op[idx]->alloc_workspace();
> > >> +               workspace = btrfs_compress_op[idx]->alloc_workspace(level);
> > >> +
> > >>         memalloc_nofs_restore(nofs_flag);
> > >>
> > >>         if (IS_ERR(workspace)) {
> > >> @@ -930,60 +998,22 @@ static struct list_head *__find_workspace(int type, bool heuristic)
> > >>         return workspace;
> > >>  }
> > >>
> > >> -static struct list_head *find_workspace(int type)
> > >> +static struct list_head *find_workspace(unsigned int type_level)
> > >>  {
> > >> -       return __find_workspace(type, false);
> > >> +       return __find_workspace(type_level, false);
> > >>  }
> > >>
> > >> -/*
> > >> - * put a workspace struct back on the list or free it if we have enough
> > >> - * idle ones sitting around
> > >> - */
> > >> -static void __free_workspace(int type, struct list_head *workspace,
> > >> -                            bool heuristic)
> > >> +static struct list_head *find_decompression_workspace(unsigned int type)
> > >>  {
> > >> -       int idx = type - 1;
> > >> -       struct list_head *idle_ws;
> > >> -       spinlock_t *ws_lock;
> > >> -       atomic_t *total_ws;
> > >> -       wait_queue_head_t *ws_wait;
> > >> -       int *free_ws;
> > >> +       /*
> > >> +        * Use the lowest level for decompression, since we don't need to
> > >> +        * compress. This can help us save memory when using levels lower than
> > >> +        * the default level.
> > >> +        */
> > >> +       const unsigned int level = 1;
> > >> +       const unsigned int type_level = (level << 4) | (type & 0xF);
> > >>
> > >> -       if (heuristic) {
> > >> -               idle_ws = &btrfs_heuristic_ws.idle_ws;
> > >> -               ws_lock = &btrfs_heuristic_ws.ws_lock;
> > >> -               total_ws = &btrfs_heuristic_ws.total_ws;
> > >> -               ws_wait = &btrfs_heuristic_ws.ws_wait;
> > >> -               free_ws = &btrfs_heuristic_ws.free_ws;
> > >> -       } else {
> > >> -               idle_ws = &btrfs_comp_ws[idx].idle_ws;
> > >> -               ws_lock = &btrfs_comp_ws[idx].ws_lock;
> > >> -               total_ws = &btrfs_comp_ws[idx].total_ws;
> > >> -               ws_wait = &btrfs_comp_ws[idx].ws_wait;
> > >> -               free_ws = &btrfs_comp_ws[idx].free_ws;
> > >> -       }
> > >> -
> > >> -       spin_lock(ws_lock);
> > >> -       if (*free_ws <= num_online_cpus()) {
> > >> -               list_add(workspace, idle_ws);
> > >> -               (*free_ws)++;
> > >> -               spin_unlock(ws_lock);
> > >> -               goto wake;
> > >> -       }
> > >> -       spin_unlock(ws_lock);
> > >> -
> > >> -       if (heuristic)
> > >> -               free_heuristic_ws(workspace);
> > >> -       else
> > >> -               btrfs_compress_op[idx]->free_workspace(workspace);
> > >> -       atomic_dec(total_ws);
> > >> -wake:
> > >> -       cond_wake_up(ws_wait);
> > >> -}
> > >> -
> > >> -static void free_workspace(int type, struct list_head *ws)
> > >> -{
> > >> -       return __free_workspace(type, ws, false);
> > >> +       return find_workspace(type_level);
> > >>  }
> > >>
> > >>  /*
> > >> @@ -1044,9 +1074,7 @@ int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping,
> > >>         int ret;
> > >>         int type = type_level & 0xF;
> > >>
> > >> -       workspace = find_workspace(type);
> > >> -
> > >> -       btrfs_compress_op[type - 1]->set_level(workspace, type_level);
> > >> +       workspace = find_workspace(type_level);
> > >>         ret = btrfs_compress_op[type-1]->compress_pages(workspace, mapping,
> > >>                                                         start, pages,
> > >>                                                         out_pages,
> > >> @@ -1075,7 +1103,7 @@ static int btrfs_decompress_bio(struct compressed_bio *cb)
> > >>         int ret;
> > >>         int type = cb->compress_type;
> > >>
> > >> -       workspace = find_workspace(type);
> > >> +       workspace = find_decompression_workspace(type);
> > >>         ret = btrfs_compress_op[type - 1]->decompress_bio(workspace, cb);
> > >>         free_workspace(type, workspace);
> > >>
> > >> @@ -1093,7 +1121,7 @@ int btrfs_decompress(int type, unsigned char *data_in, struct page *dest_page,
> > >>         struct list_head *workspace;
> > >>         int ret;
> > >>
> > >> -       workspace = find_workspace(type);
> > >> +       workspace = find_decompression_workspace(type);
> > >>
> > >>         ret = btrfs_compress_op[type-1]->decompress(workspace, data_in,
> > >>                                                     dest_page, start_byte,
> > >> @@ -1591,12 +1619,23 @@ int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end)
> > >>
> > >>  unsigned int btrfs_compress_str2level(const char *str)
> > >>  {
> > >> -       if (strncmp(str, "zlib", 4) != 0)
> > >> +       int ret;
> > >> +       int type;
> > >> +       unsigned int level;
> > >> +
> > >> +       if (strncmp(str, "zlib", 4) == 0)
> > >> +               type = BTRFS_COMPRESS_ZLIB;
> > >> +       else if (strncmp(str, "zstd", 4) == 0)
> > >> +               type = BTRFS_COMPRESS_ZSTD;
> > >> +       else
> > >>                 return 0;
> > >>
> > >> -       /* Accepted form: zlib:1 up to zlib:9 and nothing left after the number */
> > >> -       if (str[4] == ':' && '1' <= str[5] && str[5] <= '9' && str[6] == 0)
> > >> -               return str[5] - '0';
> > >> +       if (str[4] == ':') {
> > >> +               ret = kstrtouint(str + 5, 10, &level);
> > >> +               if (ret == 0 && 0 < level &&
> > >> +                   level <= btrfs_compress_op[type-1]->max_level)
> > >> +                       return level;
> > >> +       }
> > >>
> > >> -       return BTRFS_ZLIB_DEFAULT_LEVEL;
> > >> +       return btrfs_compress_op[type-1]->default_level;
> > >>  }
> > >> diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> > >> index ddda9b80bf20..a582a4483077 100644
> > >> --- a/fs/btrfs/compression.h
> > >> +++ b/fs/btrfs/compression.h
> > >> @@ -23,8 +23,6 @@
> > >>  /* Maximum size of data before compression */
> > >>  #define BTRFS_MAX_UNCOMPRESSED         (SZ_128K)
> > >>
> > >> -#define BTRFS_ZLIB_DEFAULT_LEVEL       3
> > >> -
> > >>  struct compressed_bio {
> > >>         /* number of bios pending for this compressed extent */
> > >>         refcount_t pending_bios;
> > >> @@ -87,7 +85,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
> > >>  blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
> > >>                                 int mirror_num, unsigned long bio_flags);
> > >>
> > >> -unsigned btrfs_compress_str2level(const char *str);
> > >> +unsigned int btrfs_compress_str2level(const char *str);
> > >>
> > >>  enum btrfs_compression_type {
> > >>         BTRFS_COMPRESS_NONE  = 0,
> > >> @@ -98,7 +96,7 @@ enum btrfs_compression_type {
> > >>  };
> > >>
> > >>  struct btrfs_compress_op {
> > >> -       struct list_head *(*alloc_workspace)(void);
> > >> +       struct list_head *(*alloc_workspace)(unsigned int level);
> > >>
> > >>         void (*free_workspace)(struct list_head *workspace);
> > >>
> > >> @@ -119,7 +117,17 @@ struct btrfs_compress_op {
> > >>                            unsigned long start_byte,
> > >>                            size_t srclen, size_t destlen);
> > >>
> > >> -       void (*set_level)(struct list_head *ws, unsigned int type);
> > >> +       /*
> > >> +        * Check if memory allocated in workspace is greater than
> > >> +        * or equal to memory needed to compress with given level.
> > >> +        * If not, try to reallocate memory for the compression level.
> > >> +        * Returns true on success.
> > >> +        */
> > >> +       bool (*set_level)(struct list_head *ws, unsigned int level);
> > >> +
> > >> +       unsigned int max_level;
> > >> +
> > >> +       unsigned int default_level;
> > >>  };
> > >>
> > >>  extern const struct btrfs_compress_op btrfs_zlib_compress;
> > >> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
> > >> index b6a4cc178bee..ed9f0da8ceda 100644
> > >> --- a/fs/btrfs/lzo.c
> > >> +++ b/fs/btrfs/lzo.c
> > >> @@ -71,7 +71,7 @@ static void lzo_free_workspace(struct list_head *ws)
> > >>         kfree(workspace);
> > >>  }
> > >>
> > >> -static struct list_head *lzo_alloc_workspace(void)
> > >> +static struct list_head *lzo_alloc_workspace(unsigned int level)
> > >>  {
> > >>         struct workspace *workspace;
> > >>
> > >> @@ -485,8 +485,9 @@ static int lzo_decompress(struct list_head *ws, unsigned char *data_in,
> > >>         return ret;
> > >>  }
> > >>
> > >> -static void lzo_set_level(struct list_head *ws, unsigned int type)
> > >> +static bool lzo_set_level(struct list_head *ws, unsigned int level)
> > >>  {
> > >> +       return true;
> > >>  }
> > >>
> > >>  const struct btrfs_compress_op btrfs_lzo_compress = {
> > >> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > >> index b362b45dd757..77ebd69371f1 100644
> > >> --- a/fs/btrfs/super.c
> > >> +++ b/fs/btrfs/super.c
> > >> @@ -520,7 +520,8 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
> > >>                                 compress_type = "zlib";
> > >>                                 info->compress_type = BTRFS_COMPRESS_ZLIB;
> > >> -                               info->compress_level = BTRFS_ZLIB_DEFAULT_LEVEL;
> > >> +                               info->compress_level =
> > >> +                                       btrfs_zlib_compress.default_level;
> > >>                                 /*
> > >>                                  * args[0] contains uninitialized data since
> > >>                                  * for these tokens we don't expect any
> > >> @@ -542,9 +543,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
> > >>                                 btrfs_clear_opt(info->mount_opt, NODATASUM);
> > >>                                 btrfs_set_fs_incompat(info, COMPRESS_LZO);
> > >>                                 no_compress = 0;
> > >> -                       } else if (strcmp(args[0].from, "zstd") == 0) {
> > >> +                       } else if (strncmp(args[0].from, "zstd", 4) == 0) {
> > >>                                 compress_type = "zstd";
> > >>                                 info->compress_type = BTRFS_COMPRESS_ZSTD;
> > >> +                               info->compress_level =
> > >> +                                       btrfs_compress_str2level(args[0].from);
> > >>                                 btrfs_set_opt(info->mount_opt, COMPRESS);
> > >>                                 btrfs_clear_opt(info->mount_opt, NODATACOW);
> > >>                                 btrfs_clear_opt(info->mount_opt, NODATASUM);
> > >> diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
> > >> index 970ff3e35bb3..4c30a99b80f6 100644
> > >> --- a/fs/btrfs/zlib.c
> > >> +++ b/fs/btrfs/zlib.c
> > >> @@ -20,6 +20,9 @@
> > >>  #include <linux/refcount.h>
> > >>  #include "compression.h"
> > >>
> > >> +#define BTRFS_ZLIB_DEFAULT_LEVEL       3
> > >> +#define BTRFS_ZLIB_MAX_LEVEL           9
> > >> +
> > >>  struct workspace {
> > >>         z_stream strm;
> > >>         char *buf;
> > >> @@ -36,7 +39,19 @@ static void zlib_free_workspace(struct list_head *ws)
> > >>         kfree(workspace);
> > >>  }
> > >>
> > >> -static struct list_head *zlib_alloc_workspace(void)
> > >> +static bool zlib_set_level(struct list_head *ws, unsigned int level)
> > >> +{
> > >> +       struct workspace *workspace = list_entry(ws, struct workspace, list);
> > >> +
> > >> +       if (level > BTRFS_ZLIB_MAX_LEVEL)
> > >> +               level = BTRFS_ZLIB_MAX_LEVEL;
> > >> +
> > >> +       workspace->level = level > 0 ? level : BTRFS_ZLIB_DEFAULT_LEVEL;
> > >> +
> > >> +       return true;
> > >> +}
> > >> +
> > >> +static struct list_head *zlib_alloc_workspace(unsigned int level)
> > >>  {
> > >>         struct workspace *workspace;
> > >>         int workspacesize;
> > >> @@ -53,6 +68,7 @@ static struct list_head *zlib_alloc_workspace(void)
> > >>                 goto fail;
> > >>
> > >>         INIT_LIST_HEAD(&workspace->list);
> > >> +       zlib_set_level(&workspace->list, level);
> > >>
> > >>         return &workspace->list;
> > >>  fail:
> > >> @@ -390,22 +406,13 @@ static int zlib_decompress(struct list_head *ws, unsigned char *data_in,
> > >>         return ret;
> > >>  }
> > >>
> > >> -static void zlib_set_level(struct list_head *ws, unsigned int type)
> > >> -{
> > >> -       struct workspace *workspace = list_entry(ws, struct workspace, list);
> > >> -       unsigned level = (type & 0xF0) >> 4;
> > >> -
> > >> -       if (level > 9)
> > >> -               level = 9;
> > >> -
> > >> -       workspace->level = level > 0 ? level : 3;
> > >> -}
> > >> -
> > >>  const struct btrfs_compress_op btrfs_zlib_compress = {
> > >>         .alloc_workspace        = zlib_alloc_workspace,
> > >>         .free_workspace         = zlib_free_workspace,
> > >>         .compress_pages         = zlib_compress_pages,
> > >>         .decompress_bio         = zlib_decompress_bio,
> > >>         .decompress             = zlib_decompress,
> > >> +       .set_level              = zlib_set_level,
> > >> +       .max_level              = BTRFS_ZLIB_MAX_LEVEL,
> > >> +       .default_level          = BTRFS_ZLIB_DEFAULT_LEVEL,
> > >>  };
> > >> diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
> > >> index af6ec59972f5..e5d7c2eae65c 100644
> > >> --- a/fs/btrfs/zstd.c
> > >> +++ b/fs/btrfs/zstd.c
> > >> @@ -19,12 +19,13 @@
> > >>
> > >>  #define ZSTD_BTRFS_MAX_WINDOWLOG 17
> > >>  #define ZSTD_BTRFS_MAX_INPUT (1 << ZSTD_BTRFS_MAX_WINDOWLOG)
> > >> -#define ZSTD_BTRFS_DEFAULT_LEVEL 3
> > >> +#define BTRFS_ZSTD_DEFAULT_LEVEL 3
> > >> +#define BTRFS_ZSTD_MAX_LEVEL 15
> > >>
> > >> -static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len)
> > >> +static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len,
> > >> +                                                unsigned int level)
> > >>  {
> > >> -       ZSTD_parameters params = ZSTD_getParams(ZSTD_BTRFS_DEFAULT_LEVEL,
> > >> -                                               src_len, 0);
> > >> +       ZSTD_parameters params = ZSTD_getParams(level, src_len, 0);
> > >>
> > >>         if (params.cParams.windowLog > ZSTD_BTRFS_MAX_WINDOWLOG)
> > >>                 params.cParams.windowLog = ZSTD_BTRFS_MAX_WINDOWLOG;
> > >> @@ -37,10 +38,25 @@ struct workspace {
> > >>         size_t size;
> > >>         char *buf;
> > >>         struct list_head list;
> > >> +       unsigned int level;
> > >>         ZSTD_inBuffer in_buf;
> > >>         ZSTD_outBuffer out_buf;
> > >>  };
> > >>
> > >> +static bool zstd_reallocate_mem(struct workspace *workspace, int size)
> > >> +{
> > >> +       void *new_mem;
> > >> +
> > >> +       new_mem = kvmalloc(size, GFP_KERNEL);
> > >> +       if (new_mem) {
> > >> +               kvfree(workspace->mem);
> > >> +               workspace->mem = new_mem;
> > >> +               workspace->size = size;
> > >> +               return true;
> > >> +       }
> > >> +       return false;
> > >> +}
> > >> +
> > >>  static void zstd_free_workspace(struct list_head *ws)
> > >>  {
> > >>         struct workspace *workspace = list_entry(ws, struct workspace, list);
> > >> @@ -50,10 +66,34 @@ static void zstd_free_workspace(struct list_head *ws)
> > >>         kfree(workspace);
> > >>  }
> > >>
> > >> -static struct list_head *zstd_alloc_workspace(void)
> > >> +static bool zstd_set_level(struct list_head *ws, unsigned int level)
> > >> +{
> > >> +       struct workspace *workspace = list_entry(ws, struct workspace, list);
> > >> +       ZSTD_parameters params;
> > >> +       int size;
> > >> +
> > >> +       if (level > BTRFS_ZSTD_MAX_LEVEL)
> > >> +               level = BTRFS_ZSTD_MAX_LEVEL;
> > >> +
> > >> +       if (level == 0)
> > >> +               level = BTRFS_ZSTD_DEFAULT_LEVEL;
> > >> +
> > >> +       params = ZSTD_getParams(level, ZSTD_BTRFS_MAX_INPUT, 0);
> > >> +       size = max_t(size_t,
> > >> +                    ZSTD_CStreamWorkspaceBound(params.cParams),
> > >> +                    ZSTD_DStreamWorkspaceBound(ZSTD_BTRFS_MAX_INPUT));
> > >> +       if (size > workspace->size) {
> > >> +               if (!zstd_reallocate_mem(workspace, size))
> > >
> > > This can allocate memory, and this can happen on the writeout path,
> > > i.e. one of the reasons for that might be that the system needs more
> > > memory.
> > >
> > > By the table above, the size can be up to 2.6MiB, which is a lot in
> > > terms of kernel memory, as there must either be contiguous unmapped
> > > memory, or the virtual mappings must be created. Both are scarce
> > > resources or should be treated as such.
> > >
> > > Given that there's no logic that would try to minimize the usage for
> > > workspaces, this can allocate many workspaces of that size.
> > >
> > > Currently the workspace allocations have been moved to the early module
> > > loading phase so that they don't happen later and we don't have to
> > > allocate memory nor handle the failures. Your patch brings that back.
> >
> > Before this patch, the compression module initialization preloads one
> > workspace per algorithm. We added the compression level as a parameter
> > to the allocation, and the initialization uses the maximum level, 15.
> > This guarantees forward progress, since each algorithm will have at
> > least one workspace that works for every compression level.
> >
> > The only new failure condition is that when the compression level
> > changes, we have to reallocate the workspaces the next time we use
> > them. If we run into memory pressure, we might free every workspace but
> > one, reducing throughput but maintaining correctness. This is basically
> > the same scenario as before, but now it can occur when writing after
> > changing compression levels, not just after changing the compression
> > algorithm.
> >
> > I'm not too worried about increased memory usage. We only preallocate
> > one workspace at the maximum level (about 1 MB wasted if unused). Then
> > we only up-size the workspaces when needed. I don't expect that high
> > compression levels will be a common use case.
> >
> > The one potential problem with this solution is if the user rarely
> > mounts with level 15 (say once a day), but commonly uses level 3. Then
> > their workspaces will be sized for level 15 forever. However, this is
> > the same problem as if the user commonly uses no compression but
> > occasionally uses zstd: the zstd contexts will stick around forever.
> > Ideally we would delete workspaces that go unused for a certain amount
> > of time (leaving the single preallocated workspace). I think this
> > belongs in a separate patch; I'll plan on looking into it as a
> > follow-up.
> >
> > > The solution I'm currently thinking about can make the levels work
> > > but would be limited in throughput as a trade-off for the memory
> > > consumption.
> > >
> > > - preallocate one workspace for level 15 per mounted filesystem,
> > >   using get_free_pages_exact or kvmalloc
> > >
> > > - preallocate workspaces for the default level, the number same as
> > >   for lzo/zlib
> > >
> > > - add logic to select the zstd:15 workspace last for other
> > >   compressors, i.e. make it almost exclusive for zstd levels > default
> > >
> > > Further refinement could allocate the workspaces on demand and free
> > > them if not used. Allocation failures would still be handled
> > > gracefully at the cost of waiting for the remaining workspaces; at
> > > least one would always be available.
> >
> > Nick

Maybe I'm saying something crazy, but I think a simpler solution would be
to fall back to lower compression levels on memory starvation.

That is, any memory size used in the kernel that can't be handled by
kmalloc is a huge amount of memory (IMHO).

And if the user hits a situation where there is not enough free memory to
handle a high compression ratio, it may be better not to attempt that
compression level, and just fall back to a level whose preallocated
memory areas are available.
What if I chose the highest compression level because it's very important to me that the filesystem be as small as possible? In that case, I'd rather wait for the pre-allocated maximum level workspace to be available than fall back to worse compression. I'd imagine that most people setting the maximum compression level really want maximum compression.