On 2023/12/31 9:09, Gao Xiang wrote:
> On 2023/12/29 12:48, Chunhai Guo wrote:
>>> Hi Chunhai,
>>>
>>> On 2023/12/28 21:00, Chunhai Guo wrote:
>>>> Using a global page pool for LZ4 decompression significantly reduces
>>>> the time spent on page allocation in low memory scenarios.
>>>>
>>>> The table below shows the reduction in time spent on page allocation for
>>>> LZ4 decompression when using a global page pool. The results were
>>>> obtained from multi-app launch benchmarks on ARM64 Android devices
>>>> running the 5.15 kernel.
>>>>
>>>> +--------------+---------------+--------------+---------+
>>>> |              | w/o page pool | w/ page pool | diff    |
>>>> +--------------+---------------+--------------+---------+
>>>> | Average (ms) |     3434      |      21      | -99.38% |
>>>> +--------------+---------------+--------------+---------+
>>>>
>>>> Based on the benchmark logs, it appears that 256 pages are sufficient
>>>> for most cases, but this can be adjusted as needed. Additionally,
>>>> turning on CONFIG_EROFS_FS_DEBUG will simplify the tuning process.
>>>
>>> Thanks for the patch. I have some questions:
>>> - what pcluster sizes are you using? 4k or more?
>>
>> We currently use a 4k pcluster size.
>>
>>> - what detailed configuration are you using for the multi-app launch
>>>   workload? Such as CPU / Memory / the number of apps.
>>
>> We ran the benchmark on Android devices with the following configuration.
>> In the benchmark, we launched 16 frequently-used apps, and the camera app
>> was the last one in each round. The results in the table above were
>> obtained from the launching process of the camera app.
>>     CPU: 8 cores
>>     Memory: 8GB
>
> Is it the accumulated time of the camera app across all rounds or the
> average time of the camera app for each round?
It's the average time of the camera app for each round.

>>>> This patch currently only supports the LZ4 decompressor; other
>>>> decompressors will be supported in the next step.
>>>>
>>>> Signed-off-by: Chunhai Guo <[email protected]>
>>>> ---
>>>>  fs/erofs/compress.h     |   1 +
>>>>  fs/erofs/decompressor.c |  42 ++++++++++++--
>>>>  fs/erofs/internal.h     |   5 ++
>>>>  fs/erofs/super.c        |   1 +
>>>>  fs/erofs/utils.c        | 121 ++++++++++++++++++++++++++++++++++++++++
>>>>  5 files changed, 165 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/fs/erofs/compress.h b/fs/erofs/compress.h
>>>> index 279933e007d2..67202b97d47b 100644
>>>> --- a/fs/erofs/compress.h
>>>> +++ b/fs/erofs/compress.h
>>>> @@ -31,6 +31,7 @@ struct z_erofs_decompressor {
>>>>  /* some special page->private (unsigned long, see below) */
>>>>  #define Z_EROFS_SHORTLIVED_PAGE	(-1UL << 2)
>>>>  #define Z_EROFS_PREALLOCATED_PAGE	(-2UL << 2)
>>>> +#define Z_EROFS_POOL_PAGE		(-3UL << 2)
>>>>
>>>>  /*
>>>>   * For all pages in a pcluster, page->private should be one of
>>>> diff --git a/fs/erofs/decompressor.c b/fs/erofs/decompressor.c
>>>> index d08a6ee23ac5..41b34f01416f 100644
>>>> --- a/fs/erofs/decompressor.c
>>>> +++ b/fs/erofs/decompressor.c
>>>> @@ -54,6 +54,7 @@ static int z_erofs_load_lz4_config(struct super_block *sb,
>>>>  	sbi->lz4.max_distance_pages = distance ?
>>>>  					DIV_ROUND_UP(distance, PAGE_SIZE) + 1 :
>>>>  					LZ4_MAX_DISTANCE_PAGES;
>>>> +	erofs_global_page_pool_init();
>>>>  	return erofs_pcpubuf_growsize(sbi->lz4.max_pclusterblks);
>>>>  }
>>>>
>>>> @@ -111,15 +112,42 @@ static int z_erofs_lz4_prepare_dstpages(struct z_erofs_lz4_decompress_ctx *ctx,
>>>>  			victim = availables[--top];
>>>>  			get_page(victim);
>>>>  		} else {
>>>> -			victim = erofs_allocpage(pagepool,
>>>> +			victim = erofs_allocpage_for_decmpr(pagepool,
>>>>  						 GFP_KERNEL | __GFP_NOFAIL);
>>>
>>> For each read request, the extreme case here will be 15 pages for a 64k LZ4
>>> sliding window (60k = 64k - 4k). You could reduce the LZ4 sliding window to
>>> save more pages with a slight compression ratio loss.
>>
>> OK, we will do the test on this. However, based on the data we have, 97% of
>> the compressed pages that have been read can be decompressed to less than 4
>> pages. Therefore, we may not put too much hope on this.
>
> Yes, but I'm not sure if just 3% of compressed data denotes the majority of
> latencies. It'd be better to try it out anyway.

OK, we will do the test on this.

>>> Or, here __GFP_NOFAIL is actually unnecessary since we could bail out if
>>> allocation failed for all readahead requests and only address
>>> __read requests__. I have some plan to do this, but it's too close to the
>>> next merge window, so I plan to work this out for Linux 6.9.
>>
>> This sounds great. It looks like another optimization related to this case.
>>
>>> Anyway, I'm not saying a mempool is not a good idea, but I tend to reserve
>>> as little memory as possible if there is some other way to mitigate the
>>> same workload, since reserving memory is not free (it means 1 MiB of
>>> memory will only be used for this). Even if we do a mempool, I wonder if
>>> we could unify pcpubuf and the mempool together to make a better pool.
>>
>> I totally agree with your opinion. We use 256 pages for the worst-case
>> scenario, and 1 MiB is acceptable on 8GB devices. However, for 95% of
>> scenarios, 64 pages are sufficient and more acceptable for other devices.
>> And you are right, I will create a patch to unify the pcpubuf and mempool
>> in the next step.
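Just to illustrate the pool idea (this is only a rough sketch, not the actual
fs/erofs/utils.c hunk from this patch -- the z_erofs_pool_* names, the
page_private() chaining, and the spinlock below are placeholders), the
decompression-side helper could try the shared pool first and fall back to
the existing per-request path:

#include "internal.h"	/* for erofs_allocpage() */

/* hypothetical global pool: pages chained via ->private under a spinlock */
static struct page *z_erofs_pool_head;
static unsigned int z_erofs_pool_count;
static DEFINE_SPINLOCK(z_erofs_pool_lock);

struct page *erofs_allocpage_for_decmpr(struct page **pagepool, gfp_t gfp)
{
	struct page *page = NULL;

	spin_lock(&z_erofs_pool_lock);
	if (z_erofs_pool_head) {
		page = z_erofs_pool_head;
		z_erofs_pool_head = (struct page *)page_private(page);
		--z_erofs_pool_count;
	}
	spin_unlock(&z_erofs_pool_lock);

	if (page)
		return page;
	/* pool empty: behave exactly like the old erofs_allocpage() path */
	return erofs_allocpage(pagepool, gfp);
}

The release path could then recognize such pages via Z_EROFS_POOL_PAGE (the
compress.h hunk above) and chain them back into the pool instead of calling
put_page().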
> Anyway, if a global mempool is really needed, I'd like to add
> some new sysfs interface to change this value (by default, 0).
> Also, you could reuse some of the shortlived interfaces for the global
> pool rather than introduce another type of pages.

OK. I will make another patch for this. Thanks,

> Thanks,
> Gao Xiang
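P.S. For the sysfs knob, a very rough sketch of what such a tunable could
look like follows (the "reserved_pages" attribute name and how it would hook
into the existing fs/erofs/sysfs.c attribute tables are placeholders, not the
final interface):

#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

/* 0 == pool disabled by default, as suggested above */
static unsigned int erofs_pool_reserved_pages;

static ssize_t reserved_pages_show(struct kobject *kobj,
				   struct kobj_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%u\n", erofs_pool_reserved_pages);
}

static ssize_t reserved_pages_store(struct kobject *kobj,
				    struct kobj_attribute *attr,
				    const char *buf, size_t len)
{
	unsigned int nr;
	int err = kstrtouint(buf, 0, &nr);

	if (err)
		return err;
	/* grow or shrink the global pool to @nr pages here */
	erofs_pool_reserved_pages = nr;
	return len;
}

static struct kobj_attribute erofs_reserved_pages_attr =
	__ATTR(reserved_pages, 0644, reserved_pages_show, reserved_pages_store);

Writing 0 would keep the pool disabled, matching the suggested default.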
