On Sat, Sep 27, 2008 at 3:14 PM, Marvin Humphrey <[EMAIL PROTECTED]> wrote: > What if the index is large enough that it we exceed the system's addressable > space?
Sadly, I haven't considered this problem at all. I just naively thought it wasn't going to be a problem. And for my own uses, I think this is true: I've been looking at systems that fit in RAM running on a 64-bit OS. Either of these conditions alone should make it that addressable space is the first limit one hits. But yes, if you have a 32-bit OS with a <4GB address space and a large index, you could run up against this limit pretty quickly. Perhaps not as quickly as you'd presume, though. The way I was looking at it, it's not the total size of the index that matters, only the total size of the posting lists involved in the query. I was thinking to mmap() each posting list independently. Even so, with no stop-words and a sufficiently complex query, you could probably run out of address space with a <10GB index. My first instinct would be to ignore the problem, and declare 32-bit systems as suitable for development, testing, and small-scale home use only. My alternative instinct (perhaps no less unreasonable) would be to limit index segments to 4GB on 32-bit systems, and map/unmap on a per-segment basis. In my mind, once we need to load from disk a subset of index larger than addressable space, we've already given up on performance and are just trying to slog through without undue embarrassment. I'm sure there are use-cases where this is not the case, however. I would try very hard to avoid doing buffer bounds checking. The code will be so much simpler and faster if it is not done. If it truly needs to be done, I would figure out a way to encapsulate it within the index reader, and just guarantee to the higher levels that they can read through to end of the current posting without overstepping bounds. > Also, many posting formats won't have a theoretical maximum length. The > current implementation of ScorePosting doesn't for instance, because there's > no > ceiling on the number of positions. True, but I'd hope it's reasonable to require that the collectiive postings being used to score an individual document can fit into addressable memory, if not physical memory? While there is no theoretical limit on the length of a posting, imposing a practical limit here seems reasonable. > We should ape Lucene and give PostingList a Bulk_Read() method, which gets > called by the TermScorer subclass. I still don't think this would be a win, at least not for any postings that include position information. If you read in a large enough list to have significant avoidance of function overhead, you'll just blow your L1 or L2 cache. For match-only postings there might be a sweet spot, but I think the real win will still be from making it very fast to fill a single posting from the raw index. Nathan Kurz [EMAIL PROTECTED]
