[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792996#action_12792996 ]
Michael McCandless commented on LUCENE-2026:
--------------------------------------------

{quote}
bq. But, that's where Lucy presumably takes a perf hit. Lucene can share these in RAM, not using the filesystem as the intermediary (eg we do that today with deletions; norms/field cache/eventual CSF can do the same.) Lucy must go through the filesystem to share.

For a flush(), I don't think there's a significant penalty. The only extra costs Lucy will pay are the bookkeeping costs to update the file system state and to create the objects that read the index data. Those are real, but since we're skipping the fsync(), they're small. As far as the actual data, I don't see that there's a difference.
{quote}

But everything must go through the filesystem with Lucy... Eg, with Lucene, deletions are not written to disk until you commit. Flush doesn't write the del file, merging doesn't, etc. The deletes are carried in RAM. We could (but haven't yet -- NRT turnaround time is already plenty fast) do the same with norms, field cache, terms dict index, etc.

{quote}
Reading from memory mapped RAM isn't any slower than reading from malloc'd RAM. Right, for instance because we generally can't force the OS to pin term dictionaries in RAM, as discussed a while back. It's not an ideal situation, but Lucene's approach isn't bulletproof either, since Lucene's term dictionaries can get paged out too.
{quote}

As long as the page is hot... (in both cases!). But by using file-backed RAM (not malloc'd RAM), you're telling the OS it's OK if it chooses to swap it out. Sure, malloc'd RAM can be swapped out too... but that should be less frequent (and we can control this behavior somewhat, eg via swappiness). It's similar to using a weak vs. strong reference in Java. By using file-backed RAM you tell the OS it's fair game for swapping.

{quote}
If we have to fsync(), there'll be a cost, but in Lucene you have to pay that same cost, too. Lucene expects to get around it with IndexWriter.getReader().
In Lucy, we'll get around it by having you call flush() and then reopen a reader somewhere, often in another process. In both cases, the availability of fresh data is decoupled from the fsync. In both cases, the indexing process has to be careful about dropping data on the floor before a commit() succeeds. In both cases, it's possible to protect against unbounded corruption by rolling back to the last commit.
{quote}

The two approaches are basically the same, so, we get the same features ;) It's just that Lucy uses the filesystem for sharing, and Lucene shares through RAM.

bq. We're sure not going to throw away all the advantages of mmap and go back to reading data structures into process RAM just because of that.

I guess my confusion is: what are all the other benefits of using file-backed RAM? You can efficiently use process-only concurrency (though shared memory is technically an option for this too), and you have wicked fast open times (but you still must warm, just like Lucene). What else? Oh, maybe the ability to inform the OS *not* to cache eg the reads done when merging segments. That's one I sure wish Lucene could use...

In exchange, you risk the OS making poor choices about what gets swapped out (LRU policy is too simplistic... not all pages are created equal), must downcast all data structures to file-flat, and must share everything through the filesystem (perf hit to NRT). I do love how pure the file-backed RAM approach is, but I worry that down the road it'll result in erratic search performance in certain app profiles.

{quote}
bq. But, also risky is that all important data structures must be "file-flat", though in practice that doesn't seem like an issue so far?

It's a constraint. For instance, to support mmap, string sort caches currently require three "files" each: ords, offsets, and UTF-8 character data.
{quote}

Yeah, that you need 3 files for the string sort cache is a little spooky... that's 3X the chance of a page fault.
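The three-"file" sort cache described above can be sketched concretely. This is a minimal illustration under assumed details, not Lucy's actual on-disk format: the file names (ords, offsets, data), the little-endian int32 packing, and the value_for_doc helper are all hypothetical. The point is the shape of the lookup -- doc id to ord, ord to byte offsets, offsets to a UTF-8 slice -- with each array memory-mapped so the OS, not the process heap, owns residency:

```python
import mmap
import os
import struct
import tempfile

# Hypothetical file-flat layout (not Lucy's real format): "ords" holds
# each doc's position in sorted value order; "offsets" holds the byte
# offset of each unique value inside the concatenated UTF-8 "data" file.
values = ["apple", "banana", "cherry"]   # unique values, already sorted
doc_ords = [2, 0, 1, 0]                  # per-document ord into values

data_bytes = b"".join(v.encode("utf-8") for v in values)
offsets, pos = [], 0
for v in values:
    offsets.append(pos)
    pos += len(v.encode("utf-8"))
offsets.append(pos)                      # sentinel: end of last value

tmp = tempfile.mkdtemp()

def write(name, payload):
    path = os.path.join(tmp, name)
    with open(path, "wb") as f:
        f.write(payload)
    return path

ords_path = write("ords", struct.pack("<%di" % len(doc_ords), *doc_ords))
offs_path = write("offsets", struct.pack("<%di" % len(offsets), *offsets))
data_path = write("data", data_bytes)

def map_file(path):
    # File-backed RAM: the OS may drop these pages whenever it likes.
    f = open(path, "rb")
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

ords_mm, offs_mm, data_mm = (map_file(p) for p in (ords_path, offs_path, data_path))

def value_for_doc(doc_id):
    # Three page touches in the worst case -- one per "file", as noted
    # above. The decode() is the UTF-8 sanity check paid on escape.
    (o,) = struct.unpack_from("<i", ords_mm, doc_id * 4)
    start, end = struct.unpack_from("<2i", offs_mm, o * 4)
    return data_mm[start:end].decode("utf-8")

print(value_for_doc(0))  # cherry
print(value_for_doc(3))  # apple
```

Note that sorting itself never touches the data file: comparing two documents' ords is enough, which is why the compressed file-flat form can stay smaller than materialized string objects.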
{quote}
The compound file system makes the file proliferation bearable, though. And it's actually nice in a way to have data structures as named files, strongly separated from each other and persistent.
{quote}

But the CFS construction must also go through the filesystem (like Lucene), right? So you still incur the IO load of creating the small files, then a 2nd pass to consolidate. I agree there's a certain design purity to having the files clearly separate out the elements of the data structures, but if it means erratic search performance... function over form?

{quote}
If we were willing to ditch portability, we could cast to arrays of structs in Lucy - but so far we've just used primitives. I'd like to keep it that way, since it would be nice if the core Lucy file format was at least theoretically compatible with a pure Java implementation. But Lucy plugins could break that rule and cast to structs if desired.
{quote}

Someday we could make a Lucene codec that interacts with a Lucy index... would be a good exercise to go through, to see if the flex API really is "flex" enough...

{quote}
bq. The RAM resident things Lucene has - norms, deleted docs, terms index, field cache - seem to "cast" just fine to file-flat.

There are often benefits to keeping stuff "file-flat", particularly when the file-flat form is compressed. If we were to expand those sort caches to string objects, they'd take up more RAM than they do now.
{quote}

We're leaving them as UTF-8 by default for Lucene (with the flex changes). Still, the terms index once loaded does have silly RAM overhead... we can cut that back a fair amount though.

{quote}
I think the only significant drawback is security: we can't trust memory mapped data the way we can data which has been read into process RAM and checked on the way in. For instance, we need to perform UTF-8 sanity checking each time a string sort cache value escapes the controlled environment of the cache reader.
If the sort cache value was instead derived from an existing string in process RAM, we wouldn't need to check it.
{quote}

Sigh, that's a curious downside... so term-decode-intensive uses (merging, range queries, I guess maybe term dict lookup) take the brunt of that hit?

{quote}
bq. If we switched to an FST for the terms index I guess that could get tricky...

Hmm, I haven't been following that.
{quote}

There's not much to follow -- it's all just talk at this point. I don't think anyone's built a prototype yet ;)

{quote}
Too much work to keep up with those giganto patches for flex indexing, even though it's a subject I'm intimately acquainted with and deeply interested in. I plan to look it over when you're done and see if we can simplify it.
{quote}

And then we'll borrow back your simplifications ;) Lather, rinse, repeat.

{quote}
bq. Wouldn't shared memory be possible for process-only concurrent models?

IPC is a platform-compatibility nightmare. By restricting ourselves to communicating via the file system, we save ourselves oodles of engineering time. And on really boring, frustrating work, to boot.
{quote}

I had assumed so too, but I was surprised that Python's multiprocessing module exposes a simple API for sharing objects from parent to forked child. It's at least a counterexample (though, in all fairness, I haven't looked at the impl ;)), ie, there seems to be some hope of containing shared memory under a consistent API. I'm just pointing out that "going through the filesystem" isn't the only way to have efficient process-only concurrency. Shared memory is another option, but, yes, it has tradeoffs.

{quote}
bq. Also, what popular systems/environments have this requirement (only process level concurrency) today?

Perl's threads suck. Actually all threads suck. Perl's are just worse than average - and so many Perl binaries are compiled without them.
Java threads suck less, but they still suck - look how much engineering time you folks blow on managing that stuff. Threads are a terrible programming model. I'm not into the idea of forcing Lucy users to use threads. They should be able to use processes as their primary concurrency model if they want.
{quote}

Yes, working with threads is a nightmare (eg have a look at Java's memory model). I think the jury is still out (for our species) on just how, long term, we'll make use of concurrency with the machines. I think we may need to largely take "time" out of our programming languages, eg switch to much more declarative code, or something... wanna port Lucy to Erlang? But I'm not sure process-only concurrency, sharing only via file-backed memory, is the answer either ;)

{quote}
bq. It's wonderful that Lucy can startup really fast, but, for most apps that's not nearly as important as searching/indexing performance, right?

Depends. Total indexing throughput in both Lucene and KinoSearch has been pretty decent for a long time. However, there's been a large gap between average index update performance and worst case index update performance, especially when you factor in sort cache loading. There are plenty of applications that may not have very high throughput requirements, but where it may not be acceptable for an index update to take several seconds or several minutes every once in a while, even if it usually completes faster.

bq. I mean, you start only once, and then you handle many, many searches / index many documents, with that process, usually?

Sometimes the person who just performed the action that updated the index is the only one you care about. For instance, to use a feature request that came in from Slashdot a while back, if someone leaves a comment on your website, it's nice to have it available in the search index right away. Consistently fast index update responsiveness makes personalization of the customer experience easier.
{quote}

Turnaround time for Lucene NRT is already very fast, as is. After an immense merge it'll be at its worst, but if you warm the reader first, that won't be an issue. Using Zoie you can make reopen time insanely fast (much faster than I think necessary for most apps), but at the expense of some expected hit to searching/indexing throughput. I don't think that's the right tradeoff for Lucene. I suspect Lucy is making a similar tradeoff, ie, that search performance will be erratic due to page faults, for a smallish gain in reopen time.

Do you have any hard numbers on how much time it takes Lucene to load from a hot IO cache, populating its RAM resident data structures? I wonder in practice what extra cost we are really talking about... it's RAM-to-RAM "translation" of data structures (if the files are hot). FieldCache we just have to fix to stop doing uninversion... (ie we need CSF).

{quote}
bq. Is reopen even necessary in Lucy?

Probably. If you have a boatload of segments and a boatload of fields, you might start to see file opening and metadata parsing costs come into play. If it turns out that for some indexes reopen() can knock down the time from, say, 100 ms to 10 ms or less, I'd consider that sufficient justification.
{quote}

OK. Then, you are basically pooling your readers ;) Ie, you do allow in-process sharing, but only among readers.

> Refactoring of IndexWriter
> --------------------------
>
> Key: LUCENE-2026
> URL: https://issues.apache.org/jira/browse/LUCENE-2026
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 3.1
>
> I've been thinking for a while about refactoring the IndexWriter into two main components.
> One could be called a SegmentWriter, and as the name says its job would be to write one particular index segment. The default one, just as today, would provide methods to add documents and flush when its buffer is full.
> Other SegmentWriter implementations would do things like e.g. appending or copying external segments [what addIndexes*() currently does].
> The second component's job would be to manage writing the segments file and merging/deleting segments. It would know about DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would provide hooks that allow users to manage external data structures and keep them in sync with Lucene's data during segment merges.
> API-wise there are things we have to figure out, such as where the updateDocument() method would fit in, because its deletion part affects all segments, whereas the new document is only being added to the new segment.
> Of course these should be lower-level APIs for things like parallel indexing and related use cases. That's why we should still provide easy-to-use APIs like today for people who don't need to care about per-segment ops during indexing. So the current IndexWriter could probably keep most of its APIs and delegate to the new classes.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
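The split that the issue description proposes can be sketched in outline. Everything here is hypothetical: the class and method names (SegmentWriter, SegmentManager, SimpleIndexWriter) are illustrations of the described responsibilities, not a proposed Lucene API, and the "merge policy" is a deliberately trivial stand-in:

```python
# Hedged sketch of the proposed IndexWriter split. One component writes
# a single segment; another manages the set of live segments and merges;
# a facade keeps today's easy add_document()-style API on top.

class SegmentWriter:
    """Writes one index segment; flushes when its buffer fills."""
    def __init__(self, max_buffered_docs=2):
        self.buffer = []
        self.max_buffered_docs = max_buffered_docs
        self.flushed_segments = []

    def add_document(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.max_buffered_docs:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushed_segments.append(list(self.buffer))
            self.buffer = []

class SegmentManager:
    """Tracks live segments and applies merges. In the real proposal
    this component would consult DeletionPolicy, MergePolicy, and
    MergeScheduler, and expose hooks so users can keep external data
    structures in sync across segment merges."""
    def __init__(self):
        self.segments = []

    def register(self, segment):
        self.segments.append(segment)

    def maybe_merge(self, max_segments=2):
        # Trivial stand-in for a MergePolicy: collapse everything once
        # the segment count exceeds max_segments.
        if len(self.segments) > max_segments:
            merged = [doc for seg in self.segments for doc in seg]
            self.segments = [merged]

class SimpleIndexWriter:
    """Facade keeping today's easy API, delegating to the new classes."""
    def __init__(self):
        self.writer = SegmentWriter()
        self.manager = SegmentManager()

    def add_document(self, doc):
        self.writer.add_document(doc)
        for seg in self.writer.flushed_segments:
            self.manager.register(seg)
        self.writer.flushed_segments = []
        self.manager.maybe_merge()

w = SimpleIndexWriter()
for i in range(6):
    w.add_document({"id": i})
# Six docs flushed as three 2-doc segments, then merged down to one.
```

The awkward case the description calls out, updateDocument(), shows up clearly in this shape: the add half belongs to the SegmentWriter, while the delete half must be routed through the SegmentManager because it touches every live segment.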