Hello,

I opened the JIRA below for this issue. I will work on it and try to submit a patch.

[LUCENE-9889] Lucene (unexpected) fsync on existing segments
https://issues.apache.org/jira/browse/LUCENE-9889

Thanks,
Rahul

On Fri, Mar 26, 2021 at 9:56 AM Rahul Goswami <[email protected]> wrote:

> Mike,
>
> >> "But, I believe you (system locks up with MMapDirectory for your
> use-case), so there is a bug somewhere! And I wish we could get to the
> bottom of that, and fix it."
>
> Yes that's true for Windows for sure. I haven't tested it on Unix-like
> systems to that scale, so don't have any observations to report there.
>
> >> "Also, this (system locks up when using MMapDirectory) sounds different
> from the "Lucene fsyncs files that it doesn't need to" bug, right?"
>
> That's correct, they are separate issues. I just brought up the
> system-freezing-up-on-Windows point in response to Uwe's explanation
> earlier.
>
> I know I had taken it upon myself to open up a Jira for the fsync issue,
> but it got delayed from my side as I got occupied with other things
> in my day job. Will open up one later today.
>
> Thanks,
> Rahul
>
>
> On Wed, Mar 24, 2021 at 12:58 PM Michael McCandless <
> [email protected]> wrote:
>
>> MMapDirectory really should be (is supposed to be) better than
>> SimpleFSDirectory for your usage case.
>>
>> Memory mapped pages do not have to fit into your 64 GB physical space,
>> but the "hot" pages (parts of the index that you are actively querying)
>> ideally would fit mostly in free RAM on your box to have OK search
>> performance. Run with as small a JVM heap as possible so the OS has the
>> most RAM to keep such pages hot. Since you are getting OK performance with
>> SimpleFSDirectory it sounds like you do have enough free RAM for the parts
>> of the index you are searching...
>>
>> But, I believe you (system locks up with MMapDirectory for your use-case),
>> so there is a bug somewhere! And I wish we could get to the bottom of
>> that, and fix it.
>>
>> Also, this (system locks up when using MMapDirectory) sounds different
>> from the "Lucene fsyncs files that it doesn't need to" bug, right?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Mar 15, 2021 at 4:28 PM Rahul Goswami <[email protected]>
>> wrote:
>>
>>> Uwe,
>>> I understand that mmap would only map *a part* of the index from virtual
>>> address space to physical memory as and when the pages are requested.
>>> However the limitation on our side is that in most cases, we cannot ask for
>>> more than 128 GB RAM (and unfortunately even that would be a stretch) for
>>> the Solr machine.
>>>
>>> I have read and re-read the article you referenced in the past :) It's
>>> brilliantly written and did help clarify quite a few things for me I must
>>> say. However, at the end of the day, there is only so much the OS (at least
>>> Windows) can do before it starts to swap different pages in a 2-3 TB index
>>> into 64 GB of physical space, isn't that right? The CPU usage spikes to
>>> 100% at such times and the machine becomes totally unresponsive. Turning on
>>> SimpleFSDirectory at such times does rid us of this issue. I understand
>>> that we are losing out on performance by an order of magnitude compared to
>>> mmap, but I don't know any alternate solution. Also, since most of our use
>>> cases are more write-heavy than read-heavy, we can afford to compromise on
>>> the search performance due to SimpleFS.
>>>
>>> Please let me know still, if there is anything about my explanation that
>>> doesn't sound right to you.
>>>
>>> Thanks,
>>> Rahul
>>>
>>> On Mon, Mar 15, 2021 at 3:54 PM Uwe Schindler <[email protected]> wrote:
>>>
>>>> This is not true. Memory mapping does not need to load the index into
>>>> ram, so you don't need so much physical memory. Paging is done only between
>>>> index files and ram, that's what memory mapping is about.
>>>>
>>>> Please read the blog post:
>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>
>>>> Uwe
>>>>
>>>> On March 15, 2021 7:43:29 PM UTC, Rahul Goswami <
>>>> [email protected]> wrote:
>>>>>
>>>>> Mike,
>>>>> Yes I am using a 64 bit JVM on Windows. I haven't tried reproducing
>>>>> the issue on Linux yet. In the past we have had problems with mmap on
>>>>> Windows with the machine freezing. The rationale I gave to myself is the
>>>>> amount of disk and CPU activity for paging in and out must be intense for
>>>>> the OS while trying to map an index that large into 64 GB of heap. Also
>>>>> since it's an on-premise deployment, we can't expect the customers of the
>>>>> product to provide nodes with > 400 GB RAM which is what *I think* would
>>>>> be required to get a decent performance with mmap. Hence we had to switch
>>>>> to SimpleFSDirectory.
>>>>>
>>>>> As for the fsync behavior, you are right. I tried with
>>>>> NRTCachingDirectoryFactory as well which defaults to using mmap underneath
>>>>> and still makes fsync calls for already existing index files.
>>>>>
>>>>> Thanks,
>>>>> Rahul
>>>>>
>>>>> On Mon, Mar 15, 2021 at 3:15 PM Michael McCandless <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks Rahul.
>>>>>>
>>>>>> > primary reason being that memory mapping multi-terabyte indexes is
>>>>>> > not feasible through mmap
>>>>>>
>>>>>> Hmm, that is interesting -- are you using a 64 bit JVM? If so, what
>>>>>> goes wrong with such large maps? Lucene's MMapDirectory should chunk the
>>>>>> mapping to deal with ByteBuffer int only address space.
>>>>>>
>>>>>> SimpleFSDirectory usually has substantially worse performance than
>>>>>> MMapDirectory.
>>>>>>
>>>>>> Still, I suspect you would hit the same issue if you used other
>>>>>> FSDirectory implementations -- the fsync behavior should be the same.
>>>>>>
>>>>>> Mike McCandless
>>>>>>
>>>>>> http://blog.mikemccandless.com
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 12, 2021 at 1:46 PM Rahul Goswami <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Michael. For your question...yes I am running Solr on Windows
>>>>>>> and running it with SimpleFSDirectoryFactory (primary reason being that
>>>>>>> memory mapping multi-terabyte indexes is not feasible through mmap). I
>>>>>>> will create a Jira later today with the details in this thread and assign
>>>>>>> it to myself. Will take a shot at the fix.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rahul
>>>>>>>
>>>>>>> On Fri, Mar 12, 2021 at 10:00 AM Michael McCandless <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I think long ago we used to track which files were actually dirty
>>>>>>>> (we had written bytes to) and only fsync those ones. But something
>>>>>>>> went wrong with that, and at some point we "simplified" this logic, I
>>>>>>>> think on the assumption that asking the OS to fsync a file that does in
>>>>>>>> fact exist yet indeed has not changed would be harmless? But somehow it
>>>>>>>> is not in your case? Are you on Windows?
>>>>>>>>
>>>>>>>> I tried to do a bit of digital archaeology and remember what
>>>>>>>> happened here, and I came across this relevant looking issue:
>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-2328. That issue
>>>>>>>> moved tracking of which files have been written but not yet fsync'd
>>>>>>>> down from IndexWriter into FSDirectory.
>>>>>>>>
>>>>>>>> But there was another change that then removed staleFiles from
>>>>>>>> FSDirectory entirely.... still trying to find that. Aha, found it!
>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-6150. Phew Uwe was
>>>>>>>> really quite upset in that issue ;)
>>>>>>>>
>>>>>>>> I also came across this delightful related issue, showing how a
>>>>>>>> massive hurricane (Irene) can lead to finding and fixing a bug in
>>>>>>>> Lucene!
>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-3418
>>>>>>>>
>>>>>>>> > The assumption is that while the commit point is saved, no
>>>>>>>> > changes happen to the segment files in the saved generation.
>>>>>>>>
>>>>>>>> This assumption should really be true. Lucene writes the files,
>>>>>>>> append only, once, and then never changes them, once they are closed.
>>>>>>>> Pulling a commit point from Solr should further ensure that, even as
>>>>>>>> indexing continues and new segments are written, the old segments
>>>>>>>> referenced in that commit point will not be deleted. But apparently
>>>>>>>> this "harmless fsync" Lucene is doing is not so harmless in your use
>>>>>>>> case. Maybe open an issue and pull out the details from this discussion
>>>>>>>> onto it?
>>>>>>>>
>>>>>>>> Mike McCandless
>>>>>>>>
>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Mar 12, 2021 at 9:03 AM Michael Sokolov <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Also - I should have said - I think the first step here is to write a
>>>>>>>>> focused unit test that demonstrates the existence of the extra fsyncs
>>>>>>>>> that we want to eliminate. It would be awesome if you were able to
>>>>>>>>> create such a thing.
>>>>>>>>>
>>>>>>>>> On Fri, Mar 12, 2021 at 9:00 AM Michael Sokolov <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> >
>>>>>>>>> > Yes, please go ahead and open an issue. TBH I'm not sure why this is
>>>>>>>>> > happening - there may be a good reason?? But let's explore it using an
>>>>>>>>> > issue, thanks.
>>>>>>>>> >
>>>>>>>>> > On Fri, Mar 12, 2021 at 12:16 AM Rahul Goswami <[email protected]> wrote:
>>>>>>>>> > >
>>>>>>>>> > > I can create a Jira and assign it to myself if that's ok (?). I think
>>>>>>>>> > > this can help improve commit performance.
>>>>>>>>> > > Also, to answer your question, we have indexes sometimes going into
>>>>>>>>> > > multiple terabytes.
>>>>>>>>> > > Using the replication handler for backup would mean requiring a
>>>>>>>>> > > disk capacity more than 2x the index size on the machine at all
>>>>>>>>> > > times, which might not be feasible. So we directly back the index
>>>>>>>>> > > up from the Solr node to a remote repository.
>>>>>>>>> > >
>>>>>>>>> > > Thanks,
>>>>>>>>> > > Rahul
>>>>>>>>> > >
>>>>>>>>> > > On Thu, Mar 11, 2021 at 4:09 PM Michael Sokolov <[email protected]> wrote:
>>>>>>>>> > >>
>>>>>>>>> > >> Well, it certainly doesn't seem necessary to fsync files that are
>>>>>>>>> > >> unchanged and have already been fsync'ed. Maybe there's an opportunity
>>>>>>>>> > >> to improve it? On the other hand, support for external processes
>>>>>>>>> > >> reading Lucene index files isn't likely to become a feature of Lucene.
>>>>>>>>> > >> You might want to consider using Solr replication to power your
>>>>>>>>> > >> backup?
>>>>>>>>> > >>
>>>>>>>>> > >> On Thu, Mar 11, 2021 at 2:52 PM Rahul Goswami <[email protected]> wrote:
>>>>>>>>> > >> >
>>>>>>>>> > >> > Thanks Michael. I thought since this discussion is closer to the code
>>>>>>>>> > >> > than most discussions on the solr-users list, it seemed like a more
>>>>>>>>> > >> > appropriate forum. Will be mindful going forward.
>>>>>>>>> > >> > On your point about new segments, I attached a debugger and tried to
>>>>>>>>> > >> > do a new commit (just pure Solr commit, no backup process running),
>>>>>>>>> > >> > and the code indeed does fsync on a pre-existing segment file. Hence
>>>>>>>>> > >> > I was a bit baffled since it challenged my fundamental understanding
>>>>>>>>> > >> > that segment files once written are immutable, no matter what (unless
>>>>>>>>> > >> > picked up for a merge of course). Hence I thought of reaching out, in
>>>>>>>>> > >> > case there are scenarios where this might happen which I might be
>>>>>>>>> > >> > unaware of.
>>>>>>>>> > >> >
>>>>>>>>> > >> > Thanks,
>>>>>>>>> > >> > Rahul
>>>>>>>>> > >> >
>>>>>>>>> > >> > On Thu, Mar 11, 2021 at 2:38 PM Michael Sokolov <[email protected]> wrote:
>>>>>>>>> > >> >>
>>>>>>>>> > >> >> This isn't a support forum; solr-users@ might be more appropriate. On
>>>>>>>>> > >> >> that list someone might have a better idea about how the replication
>>>>>>>>> > >> >> handler gets its list of files. This would be a good list to try if
>>>>>>>>> > >> >> you wanted to propose a fix for the problem you're having. But since
>>>>>>>>> > >> >> you're here -- it looks to me as if IndexWriter indeed syncs all "new"
>>>>>>>>> > >> >> files in the current segments being committed; look in
>>>>>>>>> > >> >> IndexWriter.startCommit and SegmentInfos.files. Caveat: (1) I'm
>>>>>>>>> > >> >> looking at this code for the first time, and (2) things may have been
>>>>>>>>> > >> >> different in 7.7.2? Sorry I don't know for sure, but are you sure that
>>>>>>>>> > >> >> your backup process is not attempting to copy one of the new files?
>>>>>>>>> > >> >>
>>>>>>>>> > >> >> On Thu, Mar 11, 2021 at 1:35 PM Rahul Goswami <[email protected]> wrote:
>>>>>>>>> > >> >> >
>>>>>>>>> > >> >> > Hello,
>>>>>>>>> > >> >> > Just wanted to follow up one more time to see if this is the right
>>>>>>>>> > >> >> > forum for my question? Or is this suitable for some other mailing
>>>>>>>>> > >> >> > list?
>>>>>>>>> > >> >> >
>>>>>>>>> > >> >> > Best,
>>>>>>>>> > >> >> > Rahul
>>>>>>>>> > >> >> >
>>>>>>>>> > >> >> > On Sat, Mar 6, 2021 at 3:57 PM Rahul Goswami <[email protected]> wrote:
>>>>>>>>> > >> >> >>
>>>>>>>>> > >> >> >> Hello everyone,
>>>>>>>>> > >> >> >> Following up on my question in case anyone has any idea.
>>>>>>>>> > >> >> >> Why it's important to know this is because I am thinking of
>>>>>>>>> > >> >> >> allowing the backup process to not hold any lock on the index
>>>>>>>>> > >> >> >> files, which should allow the fsync during parallel commits. BUT,
>>>>>>>>> > >> >> >> in case doing an fsync on existing segment files in a saved commit
>>>>>>>>> > >> >> >> point DOES have an effect, it might render the backed up index in
>>>>>>>>> > >> >> >> a corrupt state.
>>>>>>>>> > >> >> >>
>>>>>>>>> > >> >> >> Thanks,
>>>>>>>>> > >> >> >> Rahul
>>>>>>>>> > >> >> >>
>>>>>>>>> > >> >> >> On Fri, Mar 5, 2021 at 3:04 PM Rahul Goswami <[email protected]> wrote:
>>>>>>>>> > >> >> >>>
>>>>>>>>> > >> >> >>> Hello,
>>>>>>>>> > >> >> >>> We have a process which backs up the index (Solr 7.7.2) on a
>>>>>>>>> > >> >> >>> schedule. The way we do it is we first save a commit point on the
>>>>>>>>> > >> >> >>> index and then using Solr's /replication handler, get the list of
>>>>>>>>> > >> >> >>> files in that generation. After the backup completes, we release
>>>>>>>>> > >> >> >>> the commit point. (Please note that this is a separate backup
>>>>>>>>> > >> >> >>> process outside of Solr and not the backup command of the
>>>>>>>>> > >> >> >>> /replication handler.)
>>>>>>>>> > >> >> >>> The assumption is that while the commit point is saved, no
>>>>>>>>> > >> >> >>> changes happen to the segment files in the saved generation.
>>>>>>>>> > >> >> >>>
>>>>>>>>> > >> >> >>> Now the issue... The backup process opens the index files in a
>>>>>>>>> > >> >> >>> shared READ mode, preventing writes. This is causing any parallel
>>>>>>>>> > >> >> >>> commits to fail, as they seem to complain about the index files
>>>>>>>>> > >> >> >>> being locked by another process (the backup process). Upon
>>>>>>>>> > >> >> >>> debugging, I see that fsync is being called during commit on
>>>>>>>>> > >> >> >>> already existing segment files, which is not expected. So, my
>>>>>>>>> > >> >> >>> question is, is there any reason for Lucene to call fsync on
>>>>>>>>> > >> >> >>> already existing segment files?
>>>>>>>>> > >> >> >>>
>>>>>>>>> > >> >> >>> The line of code I am referring to is as below:
>>>>>>>>> > >> >> >>> try (final FileChannel file = FileChannel.open(fileToSync, isDir ? StandardOpenOption.READ : StandardOpenOption.WRITE))
>>>>>>>>> > >> >> >>>
>>>>>>>>> > >> >> >>> in method fsync(Path fileToSync, boolean isDir) of the class file
>>>>>>>>> > >> >> >>>
>>>>>>>>> > >> >> >>> lucene\core\src\java\org\apache\lucene\util\IOUtils.java
>>>>>>>>> > >> >> >>>
>>>>>>>>> > >> >> >>> Thanks,
>>>>>>>>> > >> >> >>> Rahul
>>>>>>>>> > >> >>
>>>>>>>>> > >> >> ---------------------------------------------------------------------
>>>>>>>>> > >> >> To unsubscribe, e-mail: [email protected]
>>>>>>>>> > >> >> For additional commands, e-mail: [email protected]
>>>>>>>>> > >> >>
>>>>>>>>>
>>>> --
>>>> Uwe Schindler
>>>> Achterdiek 19, 28357 Bremen
>>>> https://www.thetaphi.de
>>>>
>>>
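
For reference, here is a minimal sketch of the "track which files are dirty and only fsync those" idea Mike describes earlier in the thread. It is not Lucene code and not the LUCENE-9889 patch: PendingSyncTracker, writeNewFile, sync, demo and the segment file names are hypothetical, made up for illustration. It assumes files are written once and never modified afterwards, so a file that has already been fsynced never needs to be reopened.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not a Lucene class): remembers which files still need an fsync. */
class PendingSyncTracker {
    private final Set<Path> pending = new HashSet<>();

    /** Writes a brand-new file once and records that it has not been fsynced yet. */
    void writeNewFile(Path file, String content) throws IOException {
        Files.write(file, content.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        pending.add(file);
    }

    /** Fsyncs only the still-pending files among those a commit references; returns them. */
    List<Path> sync(Iterable<Path> commitFiles) throws IOException {
        List<Path> synced = new ArrayList<>();
        for (Path file : commitFiles) {
            if (!pending.contains(file)) {
                continue; // already durable: never reopened, so no clash with a reader's lock
            }
            // Same open mode as the regular-file branch of the line quoted above.
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
                channel.force(true);
            }
            pending.remove(file);
            synced.add(file);
        }
        return synced;
    }

    /** Two commits in a row; returns how many files the second commit had to fsync. */
    static int demo() {
        try {
            Path dir = Files.createTempDirectory("pending-sync");
            PendingSyncTracker tracker = new PendingSyncTracker();

            Path seg0 = dir.resolve("_0.cfs"); // hypothetical segment file name
            tracker.writeNewFile(seg0, "segment 0");
            tracker.sync(Arrays.asList(seg0)); // first commit syncs _0.cfs

            Path seg1 = dir.resolve("_1.cfs"); // hypothetical segment file name
            tracker.writeNewFile(seg1, "segment 1");
            // The second commit references both segments but only _1.cfs is new.
            return tracker.sync(Arrays.asList(seg0, seg1)).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("files fsynced on second commit: " + demo());
    }
}
```

In this sketch the second commit never opens _0.cfs for writing, which is the step that collided with the backup process's lock in the scenario described above. Whether the actual fix should track files this way or take another route is left open here.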
