[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793918#action_12793918 ]
Marvin Humphrey commented on LUCENE-2026:
-----------------------------------------

> Very interesting - thanks.  So it also factors in how much the page
> was used in the past, not just how long it's been since the page was
> last used.

In theory, I think that means the term dictionary will tend to be favored
over the posting lists.  In practice... hard to say; it would be difficult
to test. :(

> Even smallish indexes can see the pages swapped out?

Yes, you're right -- the wait time to get at a small term dictionary isn't
necessarily small.  I've amended my previous post, thanks.

> And of course Java pretty much forces threads-as-concurrency (JVM
> startup time, hotspot compilation, are costly).

Yes.  Java does a lot of stuff that most operating systems can also do, but
of course provides a coherent platform-independent interface.  In Lucy we're
going to try to go back to the OS for some of the stuff that Java likes to
take over -- provided that we can develop a sane genericized interface using
configuration probing and #ifdefs.

It's nice that as long as the box is up, our OS-as-JVM is always running, so
we don't have to worry about its (quite lengthy) startup time.

> Right, this is how Lucy would force warming.

I think slurp-instead-of-mmap is orthogonal to warming, because we can warm
file-backed RAM structures by forcing them into the IO cache, using either
the cat-to-dev-null trick or something more sophisticated.  The
slurp-instead-of-mmap setting would cause warming as a side effect, but the
main point would be to attempt to persuade the virtual memory system that
certain data structures should have a higher status and not be paged out as
quickly.

> But, even within that CFS file, these three sub-files will not be
> local?  Ie you'll still have to hit three pages per "lookup" right?

They'll be next to each other in the compound file, because
CompoundFileWriter orders them alphabetically.
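(For the record, the cat-to-dev-null warming trick mentioned above boils
down to sequentially reading a file and throwing the bytes away, so the OS
pulls its pages into the IO cache.  A minimal sketch in plain java.io --
the Warmer class and warm() method are hypothetical names, not Lucy or
Lucene API:)

```java
import java.io.FileInputStream;
import java.io.IOException;

public class Warmer {
    /**
     * Sequentially read the whole file and discard the contents.
     * The reads themselves pull the file's pages into the OS IO cache.
     * Returns the number of bytes touched.
     */
    public static long warm(String path) throws IOException {
        byte[] buf = new byte[1 << 16]; // read in 64 KB chunks
        long total = 0;
        try (FileInputStream in = new FileInputStream(path)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n; // bytes are discarded; warming is the side effect
            }
        }
        return total;
    }
}
```

Whether the kernel then *keeps* those pages resident is of course up to the
virtual memory system -- which is the whole point of the
slurp-instead-of-mmap discussion.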
For big segments, though, you're right that they won't be right next to each
other, and you could incur as many as three page faults when retrieving a
sort cache value.

But what are the alternatives for variable-width data like strings?  You
need the ords array anyway for efficient comparisons, so what's left are the
offsets array and the character data.  An array of String objects isn't
going to have better locality than one solid block of memory dedicated to
offsets and another solid block of memory dedicated to file data -- and it's
no fewer derefs, even if the String object stores its character data inline;
more if it points to a separate allocation (like Lucy's CharBuf does, since
it's mutable).

For each sort cache value lookup, you're going to need to access two blocks
of memory.

  * With the array of String objects, the first is the memory block
    dedicated to the array, and the second is the memory block dedicated to
    the String object itself, which contains the character data.
  * With the file-backed block sort cache, the first memory block is the
    offsets array, and the second is the character data array.

I think the locality costs should be approximately the same... have I missed
anything?

> Write-once is good for Lucene too.

Hellyeah.

> And it seems like Lucy would not need anything crazy-os-specific wrt
> threads?

It depends on how many classes we want to make thread-safe, and it's not
just the OS -- it's the host.

The bare minimum is simply to make Lucy thread-safe as a library.  That's
pretty close, because Lucy has studiously avoided global variables whenever
possible.  The only problems that have to be addressed are the
VTable_registry Hash, race conditions when creating new subclasses via
dynamic VTable singletons, and refcounts on the VTable objects themselves.
Once those issues are taken care of, you'll be able to use Lucy objects in
separate threads with no problem, e.g. one Searcher per thread.
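(A toy model of the file-backed block sort cache described above -- one flat
block of ords, one of offsets, one of character data, rather than an array
of String objects.  Class and method names here are hypothetical; in the
real thing the three blocks would be file-backed/mmap'd rather than heap
arrays:)

```java
import java.nio.charset.StandardCharsets;

public class BlockSortCache {
    private final int[] ords;       // doc id -> sorted position
    private final int[] offsets;    // sorted position -> start in charData,
                                    // with a trailing sentinel entry
    private final byte[] charData;  // all values concatenated in sort order

    public BlockSortCache(int[] ords, int[] offsets, byte[] charData) {
        this.ords = ords;
        this.offsets = offsets;
        this.charData = charData;
    }

    /** Compare two docs by ord alone -- no string data touched at all. */
    public int compare(int docA, int docB) {
        return Integer.compare(ords[docA], ords[docB]);
    }

    /**
     * Fetch the value for a doc: one hop into the offsets block, then one
     * hop into the character data block -- the "two blocks of memory"
     * per lookup discussed above.
     */
    public String value(int doc) {
        int ord = ords[doc];
        int start = offsets[ord];
        int end = offsets[ord + 1];
        return new String(charData, start, end - start, StandardCharsets.UTF_8);
    }
}
```

Note that sorting only ever touches the ords block; the offsets and
character-data blocks are hit only when the value itself is needed.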
However, if you want to *share* Lucy objects (other than VTables) across
threads, all of a sudden we have to start thinking about "synchronized",
"volatile", etc.  Such constructs may not be efficient, or even possible,
under some threading models.

> Hmm I'd guess that field cache is slowish; deleted docs & norms are
> very fast; terms index is somewhere in between.

That jibes with my own experience.  So maybe consider file-backed sort
caches in Lucene, while keeping the status quo for everything else?

> You're right, you'd get two readers for seg_12 in that case.  By
> "pool" I meant you're tapping into all the sub-readers that the
> existing reader has opened - the reader is your pool of sub-readers.

Each unique SegReader will also have dedicated "sub-reader" objects: two
"seg_12" SegReaders means two "seg_12" DocReaders, two "seg_12"
PostingsReaders, etc.  However, all those sub-readers will share the same
file-backed RAM data, so in that sense they're pooled.

> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
>
> One could be called a SegmentWriter, and as the name says its job
> would be to write one particular index segment.  The default one, just
> as today, will provide methods to add documents and flushes when its
> buffer is full.  Other SegmentWriter implementations would do things
> like e.g. appending or copying external segments [what addIndexes*()
> currently does].
>
> The second component's job would be to manage writing the segments
> file and merging/deleting segments.  It would know about
> DeletionPolicy, MergePolicy and MergeScheduler.
> Ideally it would provide hooks that allow users to manage external
> data structures and keep them in sync with Lucene's data during
> segment merges.
>
> API-wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
>
> Of course these should be lower-level APIs for things like parallel
> indexing and related use cases.  That's why we should still provide
> easy-to-use APIs like today for people who don't need to care about
> per-segment ops during indexing.  So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org