Checking on this again: is this still a possibility for merging into LMDB? This fix is still working great (improved performance) on our systems.

Thanks,
Kris

On Mon, Jun 17, 2019 at 1:04 PM Kris Zyp <kris...@gmail.com> wrote:

> Is this still being considered/reviewed? Let me know if there are any other changes you would like me to make. This patch has continued to yield significant and reliable performance improvements for us, and it seems like it would be nice for this to be available to other Windows users.
>
> On Fri, May 3, 2019 at 3:52 PM Kris Zyp <kris...@gmail.com> wrote:
>
>> For the sake of putting this in the email thread (other code discussion in GitHub), here is the latest squashed commit of the proposed patch (with the on-demand, retained overlapped array to reduce re-mallocs and opening of event handles):
>> https://github.com/kriszyp/node-lmdb/commit/726a9156662c703bf3d453aab75ee222072b990f
>>
>> Thanks,
>> Kris
>>
>> From: Kris Zyp <kris...@gmail.com>
>> Sent: April 30, 2019 12:43 PM
>> To: Howard Chu <h...@symas.com>; openldap-its@OpenLDAP.org
>> Subject: RE: (ITS#9017) Improving performance of commit sync in Windows
>>
>> > What is the point of using writemap mode if you still need to use WriteFile
>> > on every individual page?
>>
>> As I understood from the documentation, and have observed, using writemap mode is faster (and uses less temporary memory) because it doesn't require mallocs to allocate pages (docs: "This is faster and uses fewer mallocs"). To be clear though, LMDB is so incredibly fast and efficient that, in sync mode, it takes enormous transactions before the time spent allocating and creating the dirty pages with the updated b-tree is anywhere even remotely close to the time spent waiting for the disk flush, even with an SSD. But the more pertinent question is efficiency, measured in CPU cycles rather than wall-clock time (efficiency matters more than just time spent). When I ran my tests this morning of 100 (sync) transactions with 100 puts per transaction, times varied quite a bit, but running with writemap enabled typically averaged about 500ms of CPU, and with writemap disabled around 600ms. Not a huge difference, but still definitely worthwhile, I think.
>>
>> Caveat emptor: measuring LMDB performance with sync transactions on Windows is one of the most frustratingly erratic things to measure. It is sunny outside right now; times could be different when it starts raining later. But this is what I saw this morning...
>>
>> > What is the performance difference between your patch using writemap, and just
>> > not using writemap in the first place?
>>
>> Running 1000 sync transactions on a 3GB db with a single put per transaction, without writemap mode and without the patch, took about 60 seconds. With the patch and writemap mode enabled it took about 1 second! (There is no significant difference in sync times with writemap enabled or disabled with the patch.) So the difference was huge in my test. And not only that: without the patch, the CPU usage was actually *higher* during those 60 seconds (close to 100% of a core) than during the one-second execution with the patch (close to 50%).
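(For reference, a minimal sketch of the shape of such a test loop using the LMDB C API. This is not the actual benchmark code; the path, map size, and key/value contents are placeholders, and error checks are mostly omitted.)

#include <lmdb.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
	MDB_env *env;
	MDB_dbi dbi;
	MDB_txn *txn;
	MDB_val key, data;
	char keybuf[32], valbuf[32];
	int i, rc;

	mdb_env_create(&env);
	mdb_env_set_mapsize(env, (size_t)4 << 30);          /* 4GB map; assumes a 64-bit build */
	mdb_env_open(env, "./testdb", MDB_WRITEMAP, 0664);  /* drop MDB_WRITEMAP to compare modes */

	clock_t start = clock();
	for (i = 0; i < 1000; i++) {                        /* 1000 sync'd commits, one put each */
		mdb_txn_begin(env, NULL, 0, &txn);
		if (i == 0)
			mdb_dbi_open(txn, NULL, 0, &dbi);       /* main DB; handle stays valid after commit */
		sprintf(keybuf, "key-%d", i);
		sprintf(valbuf, "value-%d", i);
		key.mv_size  = strlen(keybuf);  key.mv_data  = keybuf;
		data.mv_size = strlen(valbuf);  data.mv_data = valbuf;
		mdb_put(txn, dbi, &key, &data, 0);
		rc = mdb_txn_commit(txn);                   /* the sync happens here (no MDB_NOSYNC) */
		if (rc) { fprintf(stderr, "commit failed: %d\n", rc); break; }
	}
	printf("%ld ms\n", (long)((clock() - start) * 1000 / CLOCKS_PER_SEC));
	mdb_env_close(env);
	return 0;
}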
>> Anyway, there are certainly tests I have run where the differences are not as large (doing small commits on large dbs accentuates the differences), but the patch always seems to win. It could also be that my particular configuration causes bigger differences (an SSD drive, and maybe a more fragmented file?).
>>
>> Anyway, I added error handling for the malloc, and fixed/changed the other things you suggested. I'd be happy to make any other changes you want. The updated patch is here:
>> https://github.com/kriszyp/node-lmdb/commit/25366dea9453749cf6637f43ec17b9b62094acde
>>
>> > OVERLAPPED* ov = malloc((pagecount - keep) * sizeof(OVERLAPPED));
>> > Probably this ought to just be pre-allocated based on the maximum
>> > number of dirty pages a txn allows.
>>
>> I wasn't sure I understood this comment. Are you suggesting we malloc(MDB_IDL_UM_MAX * sizeof(OVERLAPPED)) for each environment, and retain it for the life of the environment? I think that is 4MB, if my math is right, which seems like a lot of memory to keep allocated (we usually have a lot of open environments). If the goal is to reduce the number of mallocs, how about we retain the OVERLAPPED array, and only free and re-malloc it if the previous allocation wasn't large enough? Then there isn't unnecessary allocation, and we only malloc when there is a bigger transaction than any previous one. I put this together in a separate commit, as I wasn't sure if this is what you wanted (I can squash if you prefer):
>> https://github.com/kriszyp/node-lmdb/commit/2fe68fb5269c843e2e789746a17a4b2adefaac40
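(A minimal sketch of that grow-only reuse idea, for illustration only. This is not the code in the linked commit; the struct and field names are invented rather than taken from LMDB's MDB_env.)

#include <windows.h>
#include <stdlib.h>

/* Illustrative cache for the OVERLAPPED array, retained across commits. */
typedef struct {
	OVERLAPPED *ov_array;    /* retained allocation, NULL initially */
	size_t      ov_capacity; /* number of entries currently allocated */
} ov_cache;

/* Return an array with room for `needed` entries. Only free and re-malloc
 * when the retained array is too small, so steady-state commits do no
 * allocation at all; the array only grows when a transaction dirties more
 * pages than any previous one. */
static OVERLAPPED *ov_reserve(ov_cache *cache, size_t needed)
{
	if (needed > cache->ov_capacity) {
		free(cache->ov_array);
		cache->ov_array = malloc(needed * sizeof(OVERLAPPED));
		cache->ov_capacity = cache->ov_array ? needed : 0;
		if (!cache->ov_array)
			return NULL;    /* caller reports the allocation failure */
	}
	return cache->ov_array;
}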
>> Thank you for the review!
>>
>> Thanks,
>> Kris
>>
>> From: Howard Chu <h...@symas.com>
>> Sent: April 30, 2019 7:12 AM
>> To: kris...@gmail.com; openldap-its@OpenLDAP.org
>> Subject: Re: (ITS#9017) Improving performance of commit sync in Windows
>>
>> kris...@gmail.com wrote:
>> > Full_Name: Kristopher William Zyp
>> > Version: LMDB 0.9.23
>> > OS: Windows
>> > URL: https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9
>> > Submission from: (NULL) (71.199.6.148)
>> >
>> > We have seen very poor performance on the sync of commits on large databases in Windows. On databases with 2GB of data, in writemap mode, the sync of even small commits is consistently well over 100ms (without writemap it is faster, but still slow). It is expected that a sync should take some time while waiting for disk confirmation of the writes, but more concerning is that these sync operations (in writemap mode) are instead dominated by nearly 100% system CPU utilization, so operations that require sub-millisecond b-tree updates are then dominated by very large amounts of system CPU cycles during the sync phase.
>> >
>> > I think that the fundamental problem is that FlushViewOfFile seems to be an O(n) operation, where n is the size of the file (or map). I presume that Windows is scanning the entire map/file for dirty pages to flush, I'm guessing because it doesn't have an internal index of all the dirty pages for every file/map-view in the OS disk cache. Therefore, this turns into an extremely expensive, CPU-bound operation to find the dirty pages of a large file and initiate their writes, which, of course, is contrary to the whole goal of a scalable database system. FlushFileBuffers is also relatively slow. We have attempted to batch as many operations into a single transaction as possible, but this is still a very large overhead.
>> >
>> > The Windows docs for FlushFileBuffers itself warn about the inefficiencies of this function (https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-flushfilebuffers), and also point to the solution: it is much faster to write out the dirty pages with WriteFile through a sync file handle (FILE_FLAG_WRITE_THROUGH).
>> >
>> > The associated patch (https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9) is my attempt at implementing this solution for Windows. Fortunately, with the design of LMDB, this is relatively straightforward. LMDB already supports writing out dirty pages with WriteFile calls. I added a write-through handle for sending these writes directly to disk. I then made that file handle overlapped/asynchronous, so all the writes for a commit can be started in overlapped mode and (at least theoretically) transfer to the drive in parallel, and then used GetOverlappedResult to wait for their completion. So basically mdb_page_flush becomes the sync. I extended the writing of dirty pages through WriteFile to writemap mode as well (for writing meta too), so that WriteFile with write-through can be used to flush the data without ever needing to call FlushViewOfFile or FlushFileBuffers. I also implemented support for write gathering in writemap mode, where contiguous file positions imply contiguous memory (by tracking the starting position with wdp and writing contiguous pages in single operations). Sorting of the dirty list is maintained even in writemap mode for this purpose.
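(To make the mechanism concrete, here is a rough sketch of an overlapped write-through flush along the lines described above. It is not the actual patch: the function and parameter names are invented, error handling is minimal, and the per-OVERLAPPED event handles and write gathering used by the real patch are only noted in comments.)

#include <windows.h>
#include <string.h>

/* Issue all dirty-page writes on a handle opened with
 * FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED, then wait for each
 * completion, instead of paying a FlushViewOfFile/FlushFileBuffers pass
 * over the whole map. */
static BOOL flush_pages_write_through(HANDLE fd_wt,        /* write-through, overlapped handle */
                                      char *map,           /* base of the writemap */
                                      const size_t *pgnos, /* sorted dirty page numbers (illustrative type) */
                                      size_t npages, size_t pagesize,
                                      OVERLAPPED *ov)      /* array with npages entries */
{
	size_t i;
	/* Start every write. Contiguous pages could be coalesced into one
	 * WriteFile (the "write gathering" mentioned above); the real patch
	 * also attaches an event handle to each OVERLAPPED for reliable
	 * waiting. Both are omitted here for brevity. */
	for (i = 0; i < npages; i++) {
		ULONGLONG off = (ULONGLONG)pgnos[i] * pagesize;
		memset(&ov[i], 0, sizeof(OVERLAPPED));
		ov[i].Offset     = (DWORD)(off & 0xffffffff);
		ov[i].OffsetHigh = (DWORD)(off >> 32);
		if (!WriteFile(fd_wt, map + off, (DWORD)pagesize, NULL, &ov[i]) &&
		    GetLastError() != ERROR_IO_PENDING)
			return FALSE;
	}
	/* Wait for all writes. With write-through, completion means the data
	 * reached the device, so no separate FlushFileBuffers is needed. */
	for (i = 0; i < npages; i++) {
		DWORD written;
		if (!GetOverlappedResult(fd_wt, &ov[i], &written, TRUE))
			return FALSE;
	}
	return TRUE;
}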
>> What is the point of using writemap mode if you still need to use WriteFile
>> on every individual page?
>>
>> > The performance benefits of this patch, in my testing, are considerable. Writing out/syncing transactions is typically over 5x faster in writemap mode, and 2x faster in standard mode. And perhaps more importantly (especially in environments with many threads/processes), the efficiency benefits are even larger, particularly in writemap mode, where there can be a 50-100x reduction in system CPU usage with this patch. This brings Windows performance with sync'ed transactions in LMDB back into the range of "lightning" performance :).
>>
>> What is the performance difference between your patch using writemap, and just
>> not using writemap in the first place?
>>
>> --
>> -- Howard Chu
>>    CTO, Symas Corp.            http://www.symas.com
>>    Director, Highland Sun      http://highlandsun.com/hyc/
>>    Chief Architect, OpenLDAP   http://www.openldap.org/project/