Checking on this again: is this still a possibility for merging into LMDB? This fix is still working great (improved performance) on our systems.

Thanks,
Kris

On Mon, Jun 17, 2019 at 1:04 PM Kris Zyp <kris...@gmail.com> wrote:

> Is this still being considered/reviewed? Let me know if there are any other changes you would like me to make. This patch has continued to yield significant and reliable performance improvements for us, and it seems like it would be nice for this to be available to other Windows users.
>
> On Fri, May 3, 2019 at 3:52 PM Kris Zyp <kris...@gmail.com> wrote:
>
>> For the sake of putting this in the email thread (other code discussion in GitHub), here is the latest squashed commit of the proposed patch (with the on-demand, retained overlapped array to reduce re-mallocs and opening of event handles):
>> https://github.com/kriszyp/node-lmdb/commit/726a9156662c703bf3d453aab75ee222072b990f
>>
>> Thanks,
>> Kris
>>
>> From: Kris Zyp <kris...@gmail.com>
>> Sent: April 30, 2019 12:43 PM
>> To: Howard Chu <h...@symas.com>; openldap-its@OpenLDAP.org
>> Subject: RE: (ITS#9017) Improving performance of commit sync in Windows
>>
>> > What is the point of using writemap mode if you still need to use WriteFile
>> > on every individual page?
>>
>> As I understood from the documentation, and have observed, using writemap mode is faster (and uses less temporary memory) because it doesn't require mallocs to allocate pages (docs: "This is faster and uses fewer mallocs"). To be clear though, LMDB is so incredibly fast and efficient that, in sync mode, it takes enormous transactions before the time spent allocating and creating the dirty pages with the updated b-tree is anywhere even remotely close to the time spent waiting for the disk flush, even with an SSD. But the more pertinent question is efficiency, measured in CPU cycles rather than wall-clock time (efficiency matters more than just time spent). When I ran my tests this morning of 100 (sync) transactions with 100 puts per transaction, times varied quite a bit, but running with writemap enabled typically averaged about 500ms of CPU, and with writemap disabled around 600ms. Not a huge difference, but still definitely worthwhile, I think.
>>
>> Caveat emptor: measuring LMDB performance with sync transactions on Windows is one of the most frustratingly erratic things to measure. It is sunny outside right now; times could be different when it starts raining later. But this is what I saw this morning...
>>
>> > What is the performance difference between your patch using writemap, and just
>> > not using writemap in the first place?
>>
>> Running 1000 sync transactions on a 3GB db with a single put per transaction, without writemap mode and without the patch, took about 60 seconds. With the patch and writemap mode enabled it took about 1 second! (There is no significant difference in sync times with writemap enabled or disabled with the patch.) So the difference was huge in my test. And not only that: without the patch, the CPU usage was actually *higher* during those 60 seconds (close to 100% of a core) than during the one-second execution with the patch (close to 50%).
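(For reference, a minimal sketch of the shape of such a test loop using the LMDB C API. This is not the actual benchmark code; the path, map size, and key/value contents are placeholders, and error checks are mostly omitted.)

#include <lmdb.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
	MDB_env *env;
	MDB_dbi dbi;
	MDB_txn *txn;
	MDB_val key, data;
	char keybuf[32], valbuf[32];
	int i, rc;

	mdb_env_create(&env);
	mdb_env_set_mapsize(env, (size_t)4 << 30);          /* 4GB map; assumes a 64-bit build */
	mdb_env_open(env, "./testdb", MDB_WRITEMAP, 0664);  /* drop MDB_WRITEMAP to compare modes */

	clock_t start = clock();
	for (i = 0; i < 1000; i++) {                        /* 1000 sync'd commits, one put each */
		mdb_txn_begin(env, NULL, 0, &txn);
		if (i == 0)
			mdb_dbi_open(txn, NULL, 0, &dbi);       /* main DB; handle stays valid after commit */
		sprintf(keybuf, "key-%d", i);
		sprintf(valbuf, "value-%d", i);
		key.mv_size  = strlen(keybuf);  key.mv_data  = keybuf;
		data.mv_size = strlen(valbuf);  data.mv_data = valbuf;
		mdb_put(txn, dbi, &key, &data, 0);
		rc = mdb_txn_commit(txn);                   /* the sync happens here (no MDB_NOSYNC) */
		if (rc) { fprintf(stderr, "commit failed: %d\n", rc); break; }
	}
	printf("%ld ms\n", (long)((clock() - start) * 1000 / CLOCKS_PER_SEC));
	mdb_env_close(env);
	return 0;
}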
>> Anyway, there are certainly tests I have run where the differences are not as large (doing small commits on large dbs accentuates the differences), but the patch always seems to win. It could also be that my particular configuration causes bigger differences (an SSD drive, and maybe a more fragmented file?).
>>
>> Anyway, I added error handling for the malloc, and fixed/changed the other things you suggested. I'd be happy to make any other changes you want. The updated patch is here:
>> https://github.com/kriszyp/node-lmdb/commit/25366dea9453749cf6637f43ec17b9b62094acde
>>
>> > OVERLAPPED* ov = malloc((pagecount - keep) * sizeof(OVERLAPPED));
>> > Probably this ought to just be pre-allocated based on the maximum
>> > number of dirty pages a txn allows.
>>
>> I wasn't sure I understood this comment. Are you suggesting we malloc(MDB_IDL_UM_MAX * sizeof(OVERLAPPED)) for each environment, and retain it for the life of the environment? I think that is 4MB, if my math is right, which seems like a lot of memory to keep allocated (we usually have a lot of open environments). If the goal is to reduce the number of mallocs, how about we retain the OVERLAPPED array, and only free and re-malloc it if the previous allocation wasn't large enough? Then there isn't unnecessary allocation, and we only malloc when there is a bigger transaction than any previous one. I put this together in a separate commit, as I wasn't sure if this is what you wanted (I can squash if you prefer):
>> https://github.com/kriszyp/node-lmdb/commit/2fe68fb5269c843e2e789746a17a4b2adefaac40
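(A minimal sketch of that grow-only reuse idea, for illustration only. This is not the code in the linked commit; the struct and field names are invented rather than taken from LMDB's MDB_env.)

#include <windows.h>
#include <stdlib.h>

/* Illustrative cache for the OVERLAPPED array, retained across commits. */
typedef struct {
	OVERLAPPED *ov_array;    /* retained allocation, NULL initially */
	size_t      ov_capacity; /* number of entries currently allocated */
} ov_cache;

/* Return an array with room for `needed` entries. Only free and re-malloc
 * when the retained array is too small, so steady-state commits do no
 * allocation at all; the array only grows when a transaction dirties more
 * pages than any previous one. */
static OVERLAPPED *ov_reserve(ov_cache *cache, size_t needed)
{
	if (needed > cache->ov_capacity) {
		free(cache->ov_array);
		cache->ov_array = malloc(needed * sizeof(OVERLAPPED));
		cache->ov_capacity = cache->ov_array ? needed : 0;
		if (!cache->ov_array)
			return NULL;    /* caller reports the allocation failure */
	}
	return cache->ov_array;
}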
>> Thank you for the review!
>>
>> Thanks,
>> Kris
>>
>> From: Howard Chu <h...@symas.com>
>> Sent: April 30, 2019 7:12 AM
>> To: kris...@gmail.com; openldap-its@OpenLDAP.org
>> Subject: Re: (ITS#9017) Improving performance of commit sync in Windows
>>
>> kris...@gmail.com wrote:
>> > Full_Name: Kristopher William Zyp
>> > Version: LMDB 0.9.23
>> > OS: Windows
>> > URL: https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9
>> > Submission from: (NULL) (71.199.6.148)
>> >
>> > We have seen very poor performance on the sync of commits on large databases in Windows. On databases with 2GB of data, in writemap mode, the sync of even small commits is consistently well over 100ms (without writemap it is faster, but still slow). It is expected that a sync should take some time while waiting for disk confirmation of the writes, but more concerning is that these sync operations (in writemap mode) are instead dominated by nearly 100% system CPU utilization, so operations that require sub-millisecond b-tree updates are then dominated by very large amounts of system CPU cycles during the sync phase.
>> >
>> > I think that the fundamental problem is that FlushViewOfFile seems to be an O(n) operation, where n is the size of the file (or map). I presume that Windows is scanning the entire map/file for dirty pages to flush, I'm guessing because it doesn't have an internal index of all the dirty pages for every file/map-view in the OS disk cache. Therefore, this turns into an extremely expensive, CPU-bound operation to find the dirty pages of a large file and initiate their writes, which, of course, is contrary to the whole goal of a scalable database system. FlushFileBuffers is also relatively slow. We have attempted to batch as many operations into a single transaction as possible, but this is still a very large overhead.
>> >
>> > The Windows docs for FlushFileBuffers itself warn about the inefficiencies of this function (https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-flushfilebuffers), and also point to the solution: it is much faster to write out the dirty pages with WriteFile through a sync file handle (FILE_FLAG_WRITE_THROUGH).
>> >
>> > The associated patch (https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9) is my attempt at implementing this solution for Windows. Fortunately, with the design of LMDB, this is relatively straightforward. LMDB already supports writing out dirty pages with WriteFile calls. I added a write-through handle for sending these writes directly to disk. I then made that file handle overlapped/asynchronous, so all the writes for a commit can be started in overlapped mode and (at least theoretically) transfer to the drive in parallel, and then used GetOverlappedResult to wait for their completion. So basically mdb_page_flush becomes the sync. I extended the writing of dirty pages through WriteFile to writemap mode as well (for writing meta too), so that WriteFile with write-through can be used to flush the data without ever needing to call FlushViewOfFile or FlushFileBuffers. I also implemented support for write gathering in writemap mode, where contiguous file positions imply contiguous memory (by tracking the starting position with wdp and writing contiguous pages in single operations). Sorting of the dirty list is maintained even in writemap mode for this purpose.
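(To make the mechanism concrete, here is a rough sketch of an overlapped write-through flush along the lines described above. It is not the actual patch: the function and parameter names are invented, error handling is minimal, and the per-OVERLAPPED event handles and write gathering used by the real patch are only noted in comments.)

#include <windows.h>
#include <string.h>

/* Issue all dirty-page writes on a handle opened with
 * FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED, then wait for each
 * completion, instead of paying a FlushViewOfFile/FlushFileBuffers pass
 * over the whole map. */
static BOOL flush_pages_write_through(HANDLE fd_wt,        /* write-through, overlapped handle */
                                      char *map,           /* base of the writemap */
                                      const size_t *pgnos, /* sorted dirty page numbers (illustrative type) */
                                      size_t npages, size_t pagesize,
                                      OVERLAPPED *ov)      /* array with npages entries */
{
	size_t i;
	/* Start every write. Contiguous pages could be coalesced into one
	 * WriteFile (the "write gathering" mentioned above); the real patch
	 * also attaches an event handle to each OVERLAPPED for reliable
	 * waiting. Both are omitted here for brevity. */
	for (i = 0; i < npages; i++) {
		ULONGLONG off = (ULONGLONG)pgnos[i] * pagesize;
		memset(&ov[i], 0, sizeof(OVERLAPPED));
		ov[i].Offset     = (DWORD)(off & 0xffffffff);
		ov[i].OffsetHigh = (DWORD)(off >> 32);
		if (!WriteFile(fd_wt, map + off, (DWORD)pagesize, NULL, &ov[i]) &&
		    GetLastError() != ERROR_IO_PENDING)
			return FALSE;
	}
	/* Wait for all writes. With write-through, completion means the data
	 * reached the device, so no separate FlushFileBuffers is needed. */
	for (i = 0; i < npages; i++) {
		DWORD written;
		if (!GetOverlappedResult(fd_wt, &ov[i], &written, TRUE))
			return FALSE;
	}
	return TRUE;
}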
>> What is the point of using writemap mode if you still need to use WriteFile
>> on every individual page?
>>
>> > The performance benefits of this patch, in my testing, are considerable. Writing out/syncing transactions is typically over 5x faster in writemap mode, and 2x faster in standard mode. And perhaps more importantly (especially in environments with many threads/processes), the efficiency benefits are even larger, particularly in writemap mode, where there can be a 50-100x reduction in system CPU usage with this patch. This brings Windows performance with sync'ed transactions in LMDB back into the range of "lightning" performance :).
>>
>> What is the performance difference between your patch using writemap, and just
>> not using writemap in the first place?
>>
>> --
>> -- Howard Chu
>>    CTO, Symas Corp.            http://www.symas.com
>>    Director, Highland Sun      http://highlandsun.com/hyc/
>>    Chief Architect, OpenLDAP   http://www.openldap.org/project/