[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793431#action_12793431 ]
Marvin Humphrey commented on LUCENE-2026:
-----------------------------------------

> I guess my confusion is what are all the other benefits of using
> file-backed RAM? You can efficiently use process only concurrency
> (though shared memory is technically an option for this too), and you
> have wicked fast open times (but, you still must warm, just like
> Lucene).

Processes are Lucy's primary concurrency model. ("The OS is our JVM.")
Making process-only concurrency efficient isn't optional -- it's a *core*
*concern*.

> What else? Oh maybe the ability to inform OS not to cache
> eg the reads done when merging segments. That's one I sure wish
> Lucene could use...

Lightweight searchers mean architectural freedom. Create 2, 10, 100, 1000
Searchers without a second thought -- as many as you need for whatever app
architecture you just dreamed up -- then destroy them just as effortlessly.
Add another worker thread to your search server without having to consider
the RAM requirements of a heavy searcher object. Create a command-line app
to search a documentation index without worrying about daemonizing it. Etc.

If your normal development pattern is a single monolithic Java process, then
that freedom might not mean much to you. But with their low per-object RAM
requirements and fast opens, lightweight searchers are easy to use within a
lot of other development patterns. For example: lightweight searchers work
well for maxing out multiple CPU cores under process-only concurrency.

> In exchange you risk the OS making poor choices about what gets
> swapped out (LRU policy is too simplistic... not all pages are created
> equal),

The Linux virtual memory system, at least, is not a pure LRU.
It utilizes a page-aging algorithm which prioritizes pages that have
historically been accessed frequently, even when they have not been accessed
recently:

{panel}
http://sunsite.nus.edu.sg/LDP/LDP/tlk/node40.html

The default action when a page is first allocated, is to give it an initial
age of 3. Each time it is touched (by the memory management subsystem) it's
age is increased by 3 to a maximum of 20. Each time the Kernel swap daemon
runs it ages pages, decrementing their age by 1.
{panel}

And while that system may not be ideal from our standpoint, it's still
pretty good. In general, the operating system's virtual memory scheme is
going to work fine as designed, for us and everyone else, and minimize
memory availability wait times.

When will swapping out the term dictionary be a problem?

* For indexes where queries are made frequently, no problem.
* For systems with plenty of RAM, no problem.
* For systems that aren't very busy, no problem.
* For small indexes, no problem.

The only situation we're talking about is infrequent queries against large
indexes on busy boxes where RAM isn't abundant. Under those circumstances,
it *might* be noticeable that Lucy's term dictionary gets paged out somewhat
sooner than Lucene's.

But in general, if the term dictionary gets paged out, so what? Nobody was
using it. Maybe nobody will make another query against that index until next
week. Maybe the OS made the right decision.

OK, so there's a vulnerable bubble where the query rate against a large
index is neither too fast nor too slow, on busy machines where RAM isn't
abundant. I don't think that bubble ought to drive major architectural
decisions.

Let me turn your question on its head. What does Lucene gain in return for
the slow index opens and large process memory footprint of its heavy
searchers?

> I do love how pure the file-backed RAM approach is, but I worry that
> down the road it'll result in erratic search performance in certain
> app profiles.
If necessary, there's a straightforward remedy: slurp the relevant files
into RAM at object construction rather than mmap them. The rest of the code
won't know the difference between malloc'd RAM and mmap'd RAM. The slurped
files won't take up any more space than the analogous Lucene data
structures; more likely, they'll take up less.

That's the kind of setting we'd hide away in the IndexManager class rather
than expose as prominent API, and it would be a hint to index components
rather than an edict.

> Yeah, that you need 3 files for the string sort cache is a little
> spooky... that's 3X the chance of a page fault.

Not when using the compound format.

> But the CFS construction must also go through the filesystem (like
> Lucene) right? So you still incur IO load of creating the small
> files, then 2nd pass to consolidate.

Yes.

> I think we may need to largely take "time" out of our programming
> languages, eg switch to much more declarative code, or
> something... wanna port Lucy to Erlang?
>
> But I'm not sure process only concurrency, sharing only via
> file-backed memory, is the answer either

I think relying heavily on file-backed memory is particularly appropriate
for Lucy because the write-once file format works well with MAP_SHARED
memory segments. If files were being modified and had to be protected with
semaphores, it wouldn't be as sweet a match.

Focusing on process-only concurrency also works well for Lucy because host
threading models differ substantially and so will only be accessible via a
generalized interface from the Lucy C core. It will be difficult to tune
threading performance through that layer of indirection -- beyond the
ability of most developers, I'm guessing, since few will be experts in
multiple host threading models. In contrast, expertise in process-level
concurrency will be easier to come by and to nourish.
> Using Zoie you can make reopen time insanely fast (much faster than I
> think necessary for most apps), but at the expense of some expected
> hit to searching/indexing throughput. I don't think that's the right
> tradeoff for Lucene.

But as Jake pointed out early in the thread, Zoie achieves those insanely
fast reopens without tight coupling to IndexWriter and its components. The
auxiliary RAM index approach is well proven.

> Do you have any hard numbers on how much time it takes Lucene to load
> from a hot IO cache, populating its RAM resident data structures?

Hmm, I don't spend a lot of time working with Lucene directly, so I might
not be the person most likely to have data like that at my fingertips.
Maybe that McCandless dude can help you out; he runs a lot of benchmarks. ;)
Or maybe ask the Solr folks? I see them on solr-user all the time talking
about "MaxWarmingSearchers". ;)

> OK. Then, you are basically pooling your readers. Ie, you do allow
> in-process sharing, but only among readers.

Not sure about that. Lucy's IndexReader.reopen() would open new SegReaders
for each new segment, but they would be private to each parent PolyReader.
So if you reopened two IndexReaders at the same time after e.g. segment
"seg_12" had been added, each would create a new, private SegReader for
"seg_12".

> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
>
> One could be called a SegmentWriter and as the name says its job would
> be to write one particular index segment. The default one, just as
> today, will provide methods to add documents and flushes when its
> buffer is full. Other SegmentWriter implementations would do things
> like e.g.
> appending or copying external segments [what addIndexes*() currently
> does].
>
> The second component's job would be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
>
> API-wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
>
> Of course these should be lower-level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy-to-use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.