On Sat, May 6, 2017 at 7:34 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, May 4, 2017 at 10:20 PM, David Rowley
> <david.row...@2ndquadrant.com> wrote:
>> Now I'm not going to pretend that this patch is ready for prime
>> time. I've not yet worked out how to properly report sync-scan
>> locations without risking reporting later pages after reporting the
>> end of the scan. What I have at the moment could cause a report to be
>> missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch
>> size. I'm also not sure how batching like this affects read-aheads, but
>> at least the numbers above speak for something. Although none of the
>> pages in this case came from disk.
>
> This kind of approach has also been advocated within EnterpriseDB, and
> I immediately thought of the read-ahead problem.  I think we need more
> research into how Parallel Seq Scan interacts with OS readahead
> behavior on various operating systems.  It seems possible that Parallel
> Seq Scan frustrates operating system read-ahead even without this
> change on at least some systems (because maybe they can only detect
> ascending block number requests within a single process) and even more
> possible that you run into problems when the block number requests are
> no longer precisely in order (which, at present, they should be, or
> very close).  If it turns out to be a problem, either currently or
> with this patch, we might need to add explicit prefetching logic to
> Parallel Seq Scan.

I don't know much about this stuff, but I was curious enough to go
looking at the source code.  I hope someone will correct me if I'm
wrong, but here's what I could glean:

In Linux, each process that opens a file gets its own 'file'
object[1][5].  Each of those has its own 'file_ra_state'
object[2][3], used by ondemand_readahead[4] for sequential read
detection.  So I speculate that page-at-a-time parallel seq scan must
look like random access to Linux.
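
To make that speculation concrete, here is a toy model in Python.  It
is not the kernel's algorithm: ToyReadaheadState and count_sequential
are made-up names, "sequential" here just means "previous page plus
one", and handing pages out round-robin is my assumption about how
workers might interleave.

```python
class ToyReadaheadState:
    """Hypothetical per-open-file state: remembers only the last page read."""

    def __init__(self):
        self.prev_page = None

    def access(self, page):
        """Return True if this read directly follows the previous one."""
        sequential = self.prev_page is not None and page == self.prev_page + 1
        self.prev_page = page
        return sequential


def count_sequential(num_pages, num_workers):
    """Hand out pages round-robin; count reads that look sequential."""
    # One state object per worker, mirroring one 'file' object per process.
    states = [ToyReadaheadState() for _ in range(num_workers)]
    hits = 0
    for page in range(num_pages):
        worker = page % num_workers  # assumed interleaving, not measured
        if states[worker].access(page):
            hits += 1
    return hits


one_process = count_sequential(16, 1)
four_workers = count_sequential(16, 4)
print(one_process, four_workers)
```

In this model a single reader sees 15 of its 16 reads as sequential,
while each of four workers only ever sees a stride of four pages, so
none of their reads qualify.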

In FreeBSD the situation looks similar.  Each process that opens a
file gets a 'file' object[8] which has members 'f_seqcount' and
'f_nextoff'[6].  These are used by the 'sequential_heuristics'
function[7] which affects the ioflag which UFS/FFS uses to control
read ahead (see ffs_read).  So I speculate that page-at-a-time
parallel seq scan must look like random access to FreeBSD too.

In both cases I suspect that if the workers had inherited the file
descriptor (or been sent it by another process via obscure tricks), it
would actually work because they'd share the same 'file' entry, but
that's clearly not workable for md.c.
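
The difference between a shared and a separate 'file' entry can be
seen from user space through the file offset, which lives in the same
per-open object.  A small sketch, using dup() within one process as a
stand-in for inheriting a descriptor across fork(); the helper name
offset_seen_by_other is made up for illustration.  It only shows that
the offset is shared; that the read-ahead state is shared too is my
reading of the kernel sources above, not something this code proves.

```python
import os
import tempfile


def offset_seen_by_other(share_description):
    """Seek on one descriptor; report the offset seen through another."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * 8192 * 4)
        tmp.flush()
        fd_a = os.open(tmp.name, os.O_RDONLY)
        if share_description:
            fd_b = os.dup(fd_a)  # same open file description as fd_a
        else:
            fd_b = os.open(tmp.name, os.O_RDONLY)  # a fresh one
        try:
            os.lseek(fd_a, 8192, os.SEEK_SET)
            return os.lseek(fd_b, 0, os.SEEK_CUR)
        finally:
            os.close(fd_a)
            os.close(fd_b)


shared = offset_seen_by_other(True)
separate = offset_seen_by_other(False)
print(shared, separate)
```

The dup()'ed descriptor reports offset 8192 after the seek on its
sibling, while the independently opened one still reports 0.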

Experimentation required...
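
As a starting point, something like the following could be used.  It
is only a sketch: read_round_robin is a made-up helper, it uses several
independently opened descriptors in one process rather than real
parallel workers, and the 64-block scratch file is far too small (and
too cached) to show anything.  A real test would need a file much
larger than RAM, or a cold page cache, and separate processes.

```python
import os
import tempfile
import time

BLOCK = 8192  # PostgreSQL's default block size


def read_round_robin(path, num_fds):
    """Read every block once, rotating across independently opened fds."""
    fds = [os.open(path, os.O_RDONLY) for _ in range(num_fds)]
    total = 0
    start = time.perf_counter()
    try:
        nblocks = (os.path.getsize(path) + BLOCK - 1) // BLOCK
        for blk in range(nblocks):
            # Each fd has its own 'file' object, so each sees a strided scan.
            total += len(os.pread(fds[blk % num_fds], BLOCK, blk * BLOCK))
    finally:
        for fd in fds:
            os.close(fd)
    return total, time.perf_counter() - start


# Tiny scratch file, only to show the harness runs end to end.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(os.urandom(BLOCK * 64))
    tmp.flush()
    one_fd_bytes, one_fd_secs = read_round_robin(tmp.name, 1)
    four_fd_bytes, four_fd_secs = read_round_robin(tmp.name, 4)

print(one_fd_bytes, four_fd_bytes)
```

Comparing the elapsed times for num_fds=1 against larger values on a
cold cache would be one way to see whether read-ahead is being
defeated on a given OS and filesystem.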

[1] https://github.com/torvalds/linux/blob/a3719f34fdb664ffcfaec2160ef20fca7becf2ee/include/linux/fs.h#L837
[2] https://github.com/torvalds/linux/blob/a3719f34fdb664ffcfaec2160ef20fca7becf2ee/include/linux/fs.h#L858
[3] https://github.com/torvalds/linux/blob/a3719f34fdb664ffcfaec2160ef20fca7becf2ee/include/linux/fs.h#L817
[4] https://github.com/torvalds/linux/blob/a3719f34fdb664ffcfaec2160ef20fca7becf2ee/mm/readahead.c#L376
[5] http://www.makelinux.net/ldd3/chp-3-sect-3 "There can be numerous
file structures representing multiple open descriptors on a single
file, but they all point to a single inode structure."
[6] https://github.com/freebsd/freebsd/blob/7e6cabd06e6caa6a02eeb86308dc0cb3f27e10da/sys/sys/file.h#L180
[7] https://github.com/freebsd/freebsd/blob/7e6cabd06e6caa6a02eeb86308dc0cb3f27e10da/sys/kern/vfs_vnops.c#L477
[8] Page 319 of 'Design and Implementation of the FreeBSD Operating
System' 2nd Edition

-- 
Thomas Munro
http://www.enterprisedb.com

