You are full of crap. From your own comments in LUCENE-1458:

"The work on streamlining the term dictionary is excellent, but perhaps we can do better still. Can we design a format that allows us to rely upon the operating system's virtual memory and avoid caching in process memory altogether?

Say that we break up the index file into fixed-width blocks of 1024 bytes. Most blocks would start with a complete term/pointer pairing, though at the top of each block we'd need a status byte indicating whether the block contains a continuation from the previous block, in order to handle cases where term length exceeds the block size. For Lucy/KinoSearch our plan would be to mmap() the file, but accessing it as a stream would work, too. Seeking around the index term dictionary would involve seeking the stream to multiples of the block size and performing binary search, rather than performing binary search on an array of cached terms. There would be increased processor overhead; my guess is that since the second stage of a term dictionary seek – scanning through the primary term dictionary – involves comparatively more processor power than this, the increased costs would be acceptable."

And then you state farther down:

"Killing off the term dictionary index yields a nice improvement in code and file specification simplicity, and there's no performance penalty for our primary optimization target use case.

> We could also explore something in-between, eg it'd be nice to
> genericize MultiLevelSkipListWriter so that it could index arbitrary
> files, then we could use that to index the terms dict. You could
> choose to spend dedicated process RAM on the higher levels of the skip
> tree, and then tentatively trust IO cache for the lower levels.

That doesn't meet the design goals of bringing the cost of opening/warming an IndexReader down to near-zero and sharing backing buffers among multiple forks. It's also very complicated, which of course bothers me more than it bothers you. So I imagine we'll choose different paths."

The thing I find funny is that many are approaching these issues as if new ground is being broken.
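[Editor's note: the fixed-width block scheme quoted above is simple enough to sketch. The following is an illustrative reconstruction, not actual Lucy or Lucene code: the exact block layout (status byte, then a 2-byte term length, UTF-8 term bytes, and an 8-byte pointer into the primary term dictionary), the class, and every name in it are assumptions invented for this sketch, and a heap ByteBuffer stands in for the mmap()'d file a real implementation would use.]

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch: binary search over fixed-width term-index blocks (hypothetical layout). */
public class BlockTermIndex {
    static final int BLOCK_SIZE = 1024;
    static final byte COMPLETE = 0;      // block starts with a complete term/pointer pair
    static final byte CONTINUATION = 1;  // block continues an oversized term from the previous block

    final ByteBuffer buf; // real code would use FileChannel.map(MapMode.READ_ONLY, ...)

    BlockTermIndex(ByteBuffer buf) { this.buf = buf; }

    int numBlocks() { return buf.capacity() / BLOCK_SIZE; }

    /** First complete term in a block, or null if the block is pure continuation. */
    String firstTerm(int block) {
        int base = block * BLOCK_SIZE;
        if (buf.get(base) == CONTINUATION) return null;
        int len = buf.getShort(base + 1) & 0xFFFF;
        byte[] bytes = new byte[len];
        for (int i = 0; i < len; i++) bytes[i] = buf.get(base + 3 + i);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Greatest block whose first term is <= target: where the primary-dictionary scan starts. */
    int seekBlock(String target) {
        int lo = 0, hi = numBlocks() - 1, result = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int probe = mid;
            String term = firstTerm(probe);
            while (term == null && probe > 0) term = firstTerm(--probe); // back over continuations
            if (term != null && term.compareTo(target) <= 0) { result = probe; lo = mid + 1; }
            else hi = mid - 1;
        }
        return result;
    }

    /** Test helper: write a complete entry at the start of a block. */
    static void writeEntry(ByteBuffer buf, int block, String term, long pointer) {
        int base = block * BLOCK_SIZE;
        byte[] t = term.getBytes(StandardCharsets.UTF_8);
        buf.put(base, COMPLETE);
        buf.putShort(base + 1, (short) t.length);
        for (int i = 0; i < t.length; i++) buf.put(base + 3 + i, t[i]);
        buf.putLong(base + 3 + t.length, pointer);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(4 * BLOCK_SIZE);
        writeEntry(buf, 0, "apple", 0L);
        writeEntry(buf, 1, "kiwi", 100L);
        writeEntry(buf, 2, "mango", 200L);
        writeEntry(buf, 3, "peach", 300L);
        BlockTermIndex idx = new BlockTermIndex(buf);
        if (idx.seekBlock("lemon") != 1) throw new AssertionError();
        if (idx.seekBlock("zebra") != 3) throw new AssertionError();
        if (idx.seekBlock("aardvark") != 0) throw new AssertionError();
    }
}
```

This is the trade the comment describes: a seek costs O(log n) block probes served by the OS page cache, rather than a binary search over an array of terms cached in process memory.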
These are ALL standard, long-known issues that any database engineer has already worked with, and there are accepted designs given the applicable constraints. This is why I've tried to point folks towards alternative designs that open the door much wider to increased performance, reliability, and robustness.

Do what you like. You obviously will. This is the problem with the Lucene managers - the only problems are the ones they see, and the same goes for the solutions. If the solutions (or questions) put them outside their comfort zone, they are ignored or dismissed in a tone designed to limit any further questions (especially those that might question their ability and/or understanding).

-----Original Message-----
>From: Marvin Humphrey <mar...@rectangular.com>
>Sent: Dec 26, 2008 3:53 PM
>To: java-dev@lucene.apache.org, Robert Engels <reng...@ix.netcom.com>
>Subject: Re: Realtime Search
>
>Robert,
>
>Three exchanges ago in this thread, you made the incorrect assumption that the
>motivation behind using mmap was read speed, and that memory mapping was being
>waved around as some sort of magic wand:
>
> Is there something that I am missing? I see lots of references to
> using "memory mapped" files to "dramatically" improve performance.
>
> I don't think this is the case at all. At the lowest levels, it is
> somewhat more efficient from a CPU standpoint, but with a decent OS
> cache the IO performance difference is going to be negligible.
>
>In response, I indicated that the mmap design had been discussed in JIRA, and
>pointed you at a particular issue.
>
> There have been substantial discussions about this design in JIRA,
> notably LUCENE-1458.
>
> The "dramatic" improvement is WRT opening/reopening an IndexReader.
>
>Apparently, you did not go back to read that JIRA thread, because you
>subsequently offered a critique of a purely invented design you assumed we
>must have arrived at, and continued to argue with a straw man about read
>speed:
>
> 1. with "fixed" size terms, the additional IO (larger pages) probably
> offsets a lot of the random access benefit. This is why "compressed"
> disks on a fast machine (CPU) are often faster than "uncompressed" -
> more data is read during every IO access.
>
>While my reply did not specifically point back to LUCENE-1458 again, I hoped
>that having your foolish assumption exposed would motivate you to go back and
>read it, so that you could offer an informed critique of the *actual* design.
>I also linked to a specific comment in LUCENE-831 which explained how mmap
>applied to sort caches.
>
> Additionally, sort caches would be written at index time in three files, and
> memory mapped as laid out in
> <https://issues.apache.org/jira/browse/LUCENE-831?focusedCommentId=12656150#action_12656150>.
>
>Apparently you still didn't go back and read up, because you subsequently made
>a third incorrect assumption, this time about plans to do away with the term
>dictionary index. In response I griped about JIRA again, using slightly
>stronger but still intentionally indirect language.
>
> No. That idea was entertained briefly and quickly discarded. There seems
> to be an awful lot of irrelevant noise in the current thread arising due
> to lack of familiarity with the ongoing discussions in JIRA.
>
>Unfortunately, this must not have worked either, because you have now offered a
>fourth message based on incorrect assumptions which would have been remedied by
>bringing yourself up to date with the relevant JIRA threads.
>
>> That could very well be, but I was referencing your statement:
>>
>> "1) Design index formats that can be memory mapped rather than slurped,
>> bringing the cost of opening/reopening an IndexReader down to a
>> negligible level."
>>
>> The only reason to do this (or have it happen) is if you perform a binary
>> search on the term index.
>
>No. As discussed in LUCENE-1458, LUCENE-1483, the specific link I pointed you
>towards in LUCENE-831, the message where I provided you with that link, and
>elsewhere in this thread... loading the term dictionary index is important, but
>the cost pales in comparison to the cost of loading sort caches.
>
>> Using a 2 file system is going to be WAY slower - I'll bet lunch. It might
>> be workable if the files were on a striped drive, or put each file on a
>> different drive/controller, but requiring such specially configured hardware
>> is not a good idea. In the common case (single drive), you are going to be
>> seeking all over the place.
>
>Mike McCandless and I had an extensive debate about the pros and cons of
>depending on the OS cache to hold the term dictionary index under LUCENE-1458.
>The concerns you express here were fully addressed, and even resolved under an
>"agree to disagree" design.
>
>> Also, the mmap is only suitable for 64 bit platforms, since there is no way
>> in Java to unmap, you are going to run out of address space as segments are
>> rewritten.
>
>The discussion of how the mmap design translates from Lucy to Lucene is an
>important one, but I despair of having it if we have to rehash all of
>LUCENE-1458, LUCENE-831, and possibly LUCENE-1476 and LUCENE-1483 because you
>cannot be troubled to bring yourself up to speed before commenting.
>
>You are obviously knowledgeable on the subject of low-level memory issues. Mike
>McCandless and I ain't exactly chopped liver, though, and neither are a lot of
>other people around here who *are* bothering to keep up with the threads in
>JIRA. I request that you show the rest of us more respect. Our time is
>valuable, too.
>
>Marvin Humphrey
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>For additional commands, e-mail: java-dev-h...@lucene.apache.org
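[Editor's note: the "near-zero cost of opening/reopening an IndexReader" claim that runs through this exchange comes down to how memory mapping behaves, and can be illustrated in a few lines. This is a generic Java sketch, not code from either project; the file name and contents are made up for the example.]

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapOpenSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical index file stand-in for the sketch.
        Path file = Files.createTempFile("terms", ".idx");
        Files.write(file, new byte[]{1, 2, 3, 4});

        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // "Opening" is just establishing the mapping: no bytes are slurped
            // into process memory up front, and the OS page cache backing the
            // mapping is shared among processes that map the same file.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println(map.get(0)); // pages fault in lazily on first access
        }
        // The caveat raised in the thread: Java offers no explicit unmap, so a
        // mapping lives until GC collects the buffer; repeatedly remapping
        // rewritten segments can exhaust address space on 32-bit JVMs.
        Files.deleteIfExists(file);
    }
}
```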