Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..

Tobias Oberstein Tue, 24 Jan 2017 10:26:59 -0800

Hi,

 pid |                syscall                |   cnt   | cnt_per_sec
-----+---------------------------------------+---------+-------------
     | syscalls:sys_enter_lseek              | 4091584 |      136386
     | syscalls:sys_enter_newfstat           | 2054988 |       68500
     | syscalls:sys_enter_read               |  767990 |       25600
     | syscalls:sys_enter_close              |  503803 |       16793
     | syscalls:sys_enter_newstat            |  434080 |       14469
     | syscalls:sys_enter_open               |  380382 |       12679


Note: there isn't a lot of load currently (this is from production).


That doesn't really mean that much - sure it shows that lseek is
frequent, but it doesn't tell you how much impact this has to the
overall workload.

Above is on a mostly idle system ("idle" for our loads) .. when things get hot, lseek calls can reach into the millions/sec.

Doing 5 million syscalls per sec comes with overhead no matter how lightweight the syscall is, doesn't it?


Using pread instead of lseek+read halves the syscalls.

I really don't understand what you are fighting here ..

For that you'd need a generic (i.e. not syscall
tracepoint, but cpu cycle) perf profile, and look in the call graph (via
perf report --children) how much of that is below the lseek syscall.


I see. I might find time to extend our helper function f_perf_syscalls.

I'm much less against this change than Tom, but doing artificial syscall
microbenchmark seems unlikely to make a big case for using it in


This isn't a syscall benchmark, but FIO.


There's not really a difference between those, when you use fio to
benchmark seek vs pseek.


Sorry, I don't understand what you are talking about.


Fio as you appear to have used is a microbenchmark benchmarking
individual syscalls.

I am benchmarking IOPS, and while doing so, it becomes apparent that at these scales it does matter _how_ IO is done.

The most efficient way is libaio. I get 9.7 million/sec IOPS with low CPU load. Using any synchronous IO engine is slower and produces higher load.

I do understand that switching to libaio isn't going to fly for PG (completely different approach). But doing pread instead of lseek+read seems simple enough. But then, I don't know about the PG codebase ..
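
To illustrate the lseek+read vs. pread point, here is a minimal sketch assuming a plain POSIX file descriptor. The helper names read_block_seek and read_block_pread are hypothetical and are not taken from the PG codebase; this only shows the syscall pattern, not how PG actually does its IO.

```c
#define _XOPEN_SOURCE 700   /* for pread() */
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: positioned read using two syscalls.
 * lseek() moves the file offset, read() then advances it by the
 * number of bytes read.
 */
ssize_t read_block_seek(int fd, void *buf, size_t len, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, len);
}

/*
 * Hypothetical helper: the same positioned read using one syscall.
 * pread() takes the offset as an argument and leaves the file
 * offset of fd unchanged.
 */
ssize_t read_block_pread(int fd, void *buf, size_t len, off_t offset)
{
    return pread(fd, buf, len, offset);
}
```

Both helpers return the same data for the same (fd, offset, len); the second one simply needs one kernel entry per block instead of two.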


Among the synchronous methods of doing IO, psync is much better than sync.

pvsync, pvsync2 and pvsync2 + hipri (busy polling, no interrupts) are better, but the gain is smaller, and all of them are inferior to libaio.

Glad to hear it.


With 3TB RAM, huge pages are absolutely essential (otherwise, the system bogs
down in TLB etc. overhead).


I was one of the people working on adding hugepage support to pg, that's
why I was glad ;)


Ahh;) Sorry, wasn't aware. This is really invaluable. Thanks for that!

Cheers,
/Tobias




Re: [HACKERS] lseek/read/write overhead becomes visible at scale ..
