On 11/29/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
Albert Chern wrote:
> Every time the size of the map file hits a multiple of the index
> interval, an index entry is written.  Therefore, it is possible that
> an index entry is not added for the first occurrence of a key, but one
> of the later ones.  The reader will then seek to one of those instead
> of the first.
>
> This does seem to be inconsistent with the fact that you are
> allowed to insert equal key records.

Yes, I agree that this is confusing and arguably a bug.

> I suspect perhaps the developers
> meant for MapFile records to be uniquely keyed, but in
> MapFile.Writer.checkKey() they used a > where they intended a >= or
> something.

I think what actually happened was that I originally coded it to
prohibit equal keys, then, at some point found an application
(somewhere in Nutch) where equal keys were useful, and changed MapFile
to support them, not realizing the consequences.  Sigh.  I don't know
whether Nutch still relies on this or not.

MapFile could probably be fixed by changing the way the index is
created, to write the location of the first instance of any run of
equal keys.  We could also avoid recording two instances of equal keys
in the index: for a long run of equal keys, we could wait until the key
changes before emitting a new index entry.

But then the index interval is no longer a real interval, because
index entries would be appended even within an interval whenever there
are equal keys.  I think the index interval exists to reduce the size
of the index when the MapFile is very large.  Perhaps we could
introduce something like a MultiValueMapFile or an IndexFile to do
this job, and leave MapFile to keep to its own principle.

Best regards,
Feng

Doug
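
To make the index behaviour discussed in this thread concrete, here is a minimal sketch in plain Java. It is a toy model, not the Hadoop MapFile code: the class `RunAwareIndex`, the methods `naiveIndex` and `runAwareIndex`, and the use of list positions in place of file offsets are all hypothetical names and simplifications chosen for illustration. The first method writes an index entry for every interval-th record, as described at the top of the thread. The second is one possible reading of the suggested fix: an entry points at the first record of the current run of equal keys, and no second entry is written for a key that is already the latest one in the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model only: positions are list indices, not byte offsets in a data file.
final class RunAwareIndex {

    // Interval-only indexing: one entry for every interval-th record,
    // whichever record of a run of equal keys that happens to be.
    static List<Map.Entry<String, Integer>> naiveIndex(List<String> sortedKeys, int interval) {
        List<Map.Entry<String, Integer>> index = new ArrayList<>();
        for (int pos = 0; pos < sortedKeys.size(); pos++) {
            if (pos % interval == 0) {
                index.add(Map.entry(sortedKeys.get(pos), pos));
            }
        }
        return index;
    }

    // Assumed reading of the proposed fix: entries are still considered only
    // at interval boundaries, but each one points at the start of the current
    // run of equal keys, and a key is never indexed twice in a row.
    static List<Map.Entry<String, Integer>> runAwareIndex(List<String> sortedKeys, int interval) {
        List<Map.Entry<String, Integer>> index = new ArrayList<>();
        int runStart = 0; // position of the first record with the current key
        for (int pos = 0; pos < sortedKeys.size(); pos++) {
            String key = sortedKeys.get(pos);
            if (pos > 0 && !key.equals(sortedKeys.get(pos - 1))) {
                runStart = pos; // the key changed, so a new run starts here
            }
            boolean due = pos % interval == 0;
            boolean alreadyIndexed = !index.isEmpty()
                    && index.get(index.size() - 1).getKey().equals(key);
            if (due && !alreadyIndexed) {
                index.add(Map.entry(key, runStart));
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "b", "b", "b", "c");
        System.out.println("naive:     " + naiveIndex(keys, 2));
        System.out.println("run-aware: " + runAwareIndex(keys, 2));
    }
}
```

With the keys `a, b, b, b, b, c` and an interval of 2, the interval-only index holds entries for `b` at positions 2 and 4 and none for the first `b` at position 1, which is the seek problem described above. The run-aware variant holds a single entry for `b` at position 1. In this particular reading entries are only considered at interval boundaries, so the index never gets more than one entry per interval.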