Hi Stephen, hackers,

>> > With all those 'readahead' calls it certainly makes one wonder if the
>> > Linux kernel is reading more than just the block we're looking for
>> > because it thinks we're doing a sequential read and will therefore want
>> > the next few blocks when, in reality, we're going to skip past them,
>> > meaning that any readahead the kernel is doing is likely just wasted
>> > I/O.
>> I've done some quick&dirty tests with blockdev --setra/setfra 0 after 
>> spending time looking at the smgr/md/fd API changes required to find 
>> shortcut, but I'm getting actually a little bit worse timings at least on 
>> "laptop DB tests". One thing that I've noticed is that needs to be only for 
>> automatic-analyze, but not for automatic-vacuum where apparently there is 
>> some boost due to readahead.

>Interesting that you weren't seeing any benefit to disabling readahead.

I've got some free minutes and I have repeated the exercise in more realistic
and strict environment that previous one to conclude that the current situation 
is preferable:

Analyzed table was having 171GB (as reported by \dt+) + indexes: 35GB, 147GB, 
35GB, 65GB (as reported by \di+)
Linux kernel 4.14.x, 2x NVME under dm-0 (it might matter as /dev/dm-0 might is 
different layer and might have different storage settings), VG on top of dm-0, 
LV with stripe-size 8kB, ext4.
s_b=128MB, RAM=128GB (- ~30GB which were reserved for HugePages), typical 
output of PgSQL12: 
INFO:  "x": scanned 1500000 of 22395442 pages, containing 112410444 live rows 
and 0 dead rows; 1500000 rows in sample, 1678321053 estimated total rows

Hot VFS cache:
Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is spent 
heapam_scan_analyze_next_block() -> .. -> pread() which causes ~70..80MB/s as 
reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat with average 
request size per IO=25 sectors(*512/1024 = ~12kB), readahead on, hot caches, 
total elapsed ~3m
Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
Run3: Defaults, miliaria as above (hot VFS cache), total elapsed 2m:40s

No VFS cache:
Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s 
read, ondemand_readahead visible in perf, total elapsed 3m30s
Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 
3m29s
Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 
3m28s

No VFS cache, readahead off:
Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: 
reads at 33MB/s, ~13% CPU, 8.7k read IOPS @ avgrq-sz = 11 sectors (*512=5.5kB), 
total elapsed 5m59s
Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: 
as above, double-confirmed no readaheads [ 
pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total 
elapsed 5m56s
Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: 
as above, total elapsed 5m55s

One thing not clear here is maybe in future worth measuring how striped LVs are 
being 
affected by readaheads.

>Were you able to see where the time in the kernel was going when
>readahead was turned off for the ANALYZE?

Yes, my interpretation is that the time spent goes into directly block I/O 
layer waiting. 

    54.67%     1.33%  postgres  postgres            [.] FileRead
            ---FileRead
                --53.33%--__pread_nocancel
                           --50.67%--entry_SYSCALL_64_after_hwframe
                                     do_syscall_64
                                     sys_pread64
                                     |--49.33%--vfs_read
                                     |           --48.00%--__vfs_read
                                     |                     
|--45.33%--generic_file_read_iter
                                     |                     |          
|--42.67%--ondemand_readahead
                                     |                     |          |         
 __do_page_cache_readahead
                                     |                     |          |         
 |--25.33%--ext4_mpage_readpages
                                     |                     |          |         
 |          |--10.67%--submit_bio
                                     |                     |          |         
 |          |          generic_make_request
                                     |                     |          |         
 |          |          |--8.00%--blk_mq_make_request
                                     |                     |          |         
 |          |          |          |--4.00%--blk_mq_get_request
                                     |                     |          |         
 |          |          |          |          |--1.33%--blk_mq_get_tag
                                     |                     |          |         
 |          |          |          |           --1.33%--sched_clock
                                     |                     |          |         
 |          |          |          |                     xen_sched_clock
                                     |                     |          |         
 |          |          |          |                     pvclock_clocksource_read
                                     |                     |          |         
 |          |          |          |--1.33%--bio_integrity_prep
                                     |                     |          |         
 |          |          |           --1.33%--blk_account_io_start
                                     |                     |          |         
 |          |          |                     part_round_stats
                                     |                     |          |         
 |          |          |                     blk_mq_in_flight
                                     |                     |          |         
 |          |          |                     blk_mq_queue_tag_busy_iter
                                     |                     |          |         
 |          |           --2.67%--dm_make_request
                                     |                     |          |         
 |          |                     __split_and_process_bio
                                     |                     |          |         
 |          |                     __split_and_process_non_flush
                                     |                     |          |         
 |          |                     |--1.33%--__map_bio
                                     |                     |          |         
 |          |                     |          generic_make_request
                                     |                     |          |         
 |          |                     |          generic_make_request_checks
                                     |                     |          |         
 |          |                     |          percpu_counter_add_batch
                                     |                     |          |         
 |          |                      --1.33%--bio_alloc_bioset
                                     |                     |          |         
 |          |                                mempool_alloc
                                     |                     |          |         
 |          |                                kmem_cache_alloc
                                     |                     |          |         
 |          |--6.67%--ext4_map_blocks
                                     |                     |          |         
 |          |          |--4.00%--ext4_es_lookup_extent
                                     |                     |          |         
 |          |          |           --2.67%--_raw_read_lock
                                     |                     |          |         
 |          |           --2.67%--__check_block_validity.constprop.81
                                     |                     |          |         
 |          |                     ext4_data_block_valid
                                     |                     |          |         
 |           --6.67%--add_to_page_cache_lru
                                     |                     |          |         
 |                     |--4.00%--__add_to_page_cache_locked
                                     |                     |          |         
 |                     |           --1.33%--mem_cgroup_try_charge
                                     |                     |          |         
 |                     |                     get_mem_cgroup_from_mm
                                     |                     |          |         
 |                      --2.67%--__lru_cache_add
                                     |                     |          |         
 |                                pagevec_lru_move_fn
                                     |                     |          |         
 |                                __lock_text_start
                                     |                     |          |         
 |--12.00%--blk_finish_plug
                                     |                     |          |         
 |          blk_flush_plug_list
                                     |                     |          |         
 |          blk_mq_flush_plug_list
                                     |                     |          |         
 |          |--10.67%--__blk_mq_delay_run_hw_queue
                                     |                     |          |         
 |          |          __blk_mq_run_hw_queue
                                     |                     |          |         
 |          |          blk_mq_sched_dispatch_requests
                                     |                     |          |         
 |          |           --9.33%--blk_mq_dispatch_rq_list
                                     |                     |          |         
 |          |                     nvme_queue_rq
                                     |                     |          |         
 |          |                      --1.33%--blk_mq_start_request




>The VACUUM case is going to be complicated by what's in the visibility
>map. (..) 

After observing the ANALYZE readahead behavior benefit I've abandoned
the case of testing much more advanced VACUUM processing, clearly Linux 
read-ahead is beneficial in even simple cases.

>> My only idea would be that a lot of those blocks could be read 
>> asynchronously in batches (AIO) with POSIX_FADV_RANDOM issued on block-range 
>> before, so maybe the the optimization is possible, but not until we'll have 
>> AIO ;)
>
> (..)AIO is a whole other animal that's been discussed off and on
>around here but it's a much larger and more invasive change than just
>calling posix_fadvise().

Yes, I'm aware and I'm keeping my fingers crossed that maybe some day.... 

The ANALYZE just seem fit to be natural candidate to use it. The only easy 
chance 
of acceleration of stats gathering - at least to me and enduser point of view - 
is to have more parallel autoanalyze workers running to drive more I/O 
concurrency 
(by e.g. partitioning the data),  both in readahead and non-readahead 
scenarios. 
Which is a pity because 70-80% of such process sits idle. The readahead might 
read 
10x more unnecessary data, but pread() doesn't have to wait. <speculation>Once 
AIO 
would be it could throw thousands of requests without readahead and achieve 
much 
better efficiency probably</speculation>

I hope the previous simple patch goes into master and helps other people 
understand
the picture more easily.

-J.

Reply via email to