> What is the point of using writemap mode if you still need to use WriteFile
> on every individual page?

As I understood from the documentation, and have observed, using writemap mode is faster (and uses less temporary memory) because it doesn't require mallocs to allocate pages (docs: "This is faster and uses fewer mallocs"). To be clear though, LMDB is so fast and efficient that, in sync mode, it takes enormous transactions before the time spent allocating and creating the dirty pages with the updated b-tree comes anywhere close to the time spent waiting for the disk flush, even with an SSD. The more pertinent question is efficiency: measuring CPU cycles consumed rather than elapsed time, since efficiency matters more than time spent alone. When I ran my tests this morning of 100 (sync) transactions with 100 puts per transaction, times varied quite a bit, but running with writemap enabled typically averaged about 500ms of CPU, versus around 600ms with writemap disabled. Not a huge difference, but still worthwhile, I think.

Caveat emptor: measuring LMDB performance with sync transactions on Windows is one of the most frustratingly erratic things to measure. It is sunny outside right now; times could be different when it starts raining later. But this is what I saw this morning...

> What is the performance difference between your patch using writemap, and just
> not using writemap in the first place?

Running 1000 sync transactions on a 3GB db with a single put per transaction, without writemap and without the patch, took about 60 seconds. With the patch and writemap mode enabled, it took about 1 second! (There is no significant difference in sync times between writemap enabled and disabled with the patch.) So the difference was huge in my test.
And not only that: without the patch, the CPU usage was actually _higher_ during those 60 seconds (close to 100% of a core) than during the one-second run with the patch (close to 50%). There are certainly tests I have run where the differences are not as large (small commits on large dbs accentuate the differences), but the patch always seems to win. It could also be that my particular configuration causes bigger differences (an SSD drive, and maybe a more fragmented file?).

Anyway, I added error handling for the malloc, and fixed/changed the other things you suggested. Happy to make any other changes you want. The updated patch is here:
https://github.com/kriszyp/node-lmdb/commit/25366dea9453749cf6637f43ec17b9b62094acde

> OVERLAPPED* ov = malloc((pagecount - keep) * sizeof(OVERLAPPED));
> Probably this ought to just be pre-allocated based on the maximum number of dirty pages a txn allows.

I wasn't sure I understood this comment. Are you suggesting we malloc(MDB_IDL_UM_MAX * sizeof(OVERLAPPED)) for each environment, and retain it for the life of the environment? I think that is 4MB, if my math is right, which seems like a lot of memory to keep allocated (we usually have a lot of open environments). If the goal is to reduce the number of mallocs, how about we retain the OVERLAPPED array, and only free and re-malloc if the previous allocation wasn't large enough? Then there is no unnecessary allocation, and we only malloc when a transaction is bigger than any previous one.
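The retain-and-grow scheme proposed above can be sketched as follows. This is an illustrative stand-alone version, not the actual patch: a plain struct stands in for Windows' OVERLAPPED so it compiles anywhere, and the `write_ctx`/`get_ov_buffer` names are invented.

```c
#include <stdlib.h>

/* Stand-in for Windows' OVERLAPPED, so this sketch builds anywhere. */
typedef struct { size_t offset; void *event; } fake_overlapped;

typedef struct {
    fake_overlapped *ov;   /* retained for the life of the environment */
    size_t ov_count;       /* capacity currently allocated */
} write_ctx;

/* Return a buffer with room for `needed` entries, reallocating only
 * when the retained one is too small. Memory use is bounded by the
 * largest transaction seen so far, not by MDB_IDL_UM_MAX. Returns
 * NULL on allocation failure (caller should abort the commit). */
static fake_overlapped *get_ov_buffer(write_ctx *ctx, size_t needed) {
    if (needed > ctx->ov_count) {
        /* realloc(NULL, n) acts as malloc on first use */
        fake_overlapped *p = realloc(ctx->ov, needed * sizeof *p);
        if (!p) return NULL;
        ctx->ov = p;
        ctx->ov_count = needed;
    }
    return ctx->ov;
}
```

Smaller transactions after a large one reuse the high-water allocation with no malloc at all, which is the trade-off described: no per-commit allocation churn, without pinning 4MB per environment up front.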
I put this together in a separate commit, as I wasn't sure if this is what you wanted (can squash if you prefer): https://github.com/kriszyp/node-lmdb/commit/2fe68fb5269c843e2e789746a17a4b2adefaac40

Thank you for the review!

Thanks,
Kris

From: Howard Chu
Sent: April 30, 2019 7:12 AM
To: kris...@gmail.com; openldap-its@OpenLDAP.org
Subject: Re: (ITS#9017) Improving performance of commit sync in Windows

kris...@gmail.com wrote:
> Full_Name: Kristopher William Zyp
> Version: LMDB 0.9.23
> OS: Windows
> URL: https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9
> Submission from: (NULL) (71.199.6.148)
>
> We have seen very poor performance on the sync of commits on large databases in
> Windows. On databases with 2GB of data, in writemap mode, the sync of even small
> commits is consistently well over 100ms (without writemap it is faster, but
> still slow). It is expected that a sync should take some time while waiting for
> disk confirmation of the writes, but more concerning is that these sync
> operations (in writemap mode) are instead dominated by nearly 100% system CPU
> utilization, so operations that require sub-millisecond b-tree updates
> are then dominated by very large amounts of system CPU cycles during
> the sync phase.
>
> I think that the fundamental problem is that FlushViewOfFile seems to be an O(n)
> operation, where n is the size of the file (or map). I presume that Windows is
> scanning the entire map/file for dirty pages to flush, I'm guessing because it
> doesn't have an internal index of all the dirty pages for every file/map-view in
> the OS disk cache. Therefore, this turns into an extremely expensive, CPU-bound
> operation to find the dirty pages of a large file and initiate their writes,
> which, of course, is contrary to the whole goal of a scalable database system.
> And FlushFileBuffers is relatively slow as well.
> We have attempted to batch
> as many operations into a single transaction as possible, but this is still a very
> large overhead.
>
> The Windows docs for FlushFileBuffers themselves warn about the inefficiencies of
> this function (https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-flushfilebuffers),
> and also point to the solution: it is much faster to write out the dirty
> pages with WriteFile through a sync file handle (FILE_FLAG_WRITE_THROUGH).
>
> The associated patch
> (https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9)
> is my attempt at implementing this solution for Windows. Fortunately, with the
> design of LMDB, this is relatively straightforward. LMDB already supports
> writing out dirty pages with WriteFile calls. I added a write-through handle for
> sending these writes directly to disk. I then made that file handle
> overlapped/asynchronous, so all the writes for a commit could be started in
> overlapped mode and (at least theoretically) transfer to the drive in parallel,
> then used GetOverlappedResult to wait for their completion. So basically
> mdb_page_flush becomes the sync. I extended the writing of dirty pages through
> WriteFile to writemap mode as well (for writing meta too), so that WriteFile
> with write-through can be used to flush the data without ever needing to call
> FlushViewOfFile or FlushFileBuffers. I also implemented support for write
> gathering in writemap mode, where contiguous file positions imply contiguous
> memory (by tracking the starting position with wdp and writing contiguous pages
> in single operations). Sorting of the dirty list is maintained even in writemap
> mode for this purpose.

What is the point of using writemap mode if you still need to use WriteFile
on every individual page?

> The performance benefits of this patch, in my testing, are considerable.
> Writing
> out/syncing transactions is typically over 5x faster in writemap mode, and 2x
> faster in standard mode. And perhaps more importantly (especially in environments
> with many threads/processes), the efficiency benefits are even larger,
> particularly in writemap mode, where there can be a 50-100x reduction in
> system CPU usage by using this patch. This brings Windows performance with
> sync'ed transactions in LMDB back into the range of "lightning" performance :).

What is the performance difference between your patch using writemap, and just
not using writemap in the first place?

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/