Howard Chu writes:
> To avoid wasting those 5 unneeded bits in the base, we could use them as a
> run-length counter and use multiple vectors per base. But it might be better
> to slide things over and use just a 24 bit base, and use the bottom 8 bits of
> the word as a bitmask to represent which following vectors are present. E.g.
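
Just to make sure I'm reading that right: roughly like this?  (A
throwaway sketch; the names and the exact packing are mine, not
anything from the current idl code.)

#include <stdint.h>
#include <stdio.h>

typedef uint32_t ID;   /* assuming 32-bit entry IDs and 32-bit IDL words */

/* Decode one "base + bitmask" block starting at words[0]: the top 24
 * bits of the header are the base (a multiple of 256), the low 8 bits
 * say which of the 8 possible 32-ID vectors follow.  Prints the IDs
 * and returns the number of words consumed (1 header + the vectors). */
static size_t idl_decode_block(const ID *words)
{
    ID base = words[0] & 0xffffff00u;     /* 24-bit base */
    ID mask = words[0] & 0x000000ffu;     /* presence bitmask */
    ID vec;
    size_t w = 1;
    int k, b;

    for (k = 0; k < 8; k++) {
        if (!(mask & (1u << k)))
            continue;                  /* no vector for base+32k .. base+32k+31 */
        vec = words[w++];
        for (b = 0; b < 32; b++)
            if (vec & (1u << b))
                printf("%u\n", (unsigned) (base + 32*k + b));
    }
    return w;
}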
With variable-length records you can't do a binary search for an
entry ID.  I don't know how significant that is though.  And you can
get around it, at some expense:

> ......01 means 1 vector follows, representing base+ 0-31.
> ......02 means 1 vector follows, representing base+32-63.

That can be normalized so the bottom bit == 1: (base+32, 01).  If we
also don't use entry IDs with (ID & 31) == 0, we can look at the
bottom bit of a word in the IDL to tell whether it is a vector header
or a vector word.

> That allows us to represent up to 256 entryIDs in only 288 bits
> (instead of the 16384 bits we currently need).

Though in sparse IDLs, the worst case will be 2 words per entry ID.
OTOH if we have exact ranges, the best case is 2 words for any number
of entries:-)  I don't know how they are stored now though; you
mentioned loss of precision, so I assume approximate ranges are
involved, but a glance at the code doesn't reveal where.

> That would allow us to track about 1.86M entryIDs in the same space
> we currently use to track only 64K entryIDs.
>
> It seems we would still need some kind of range representation to
> handle IDLs with more than 1.86M entries.

And some way to tell that from vectors, possibly using up another bit
as a flag in the vector header.  Alternatively, don't use IDs where
(ID & 31) == 2 (or 31) either, and use that bit in the first vector
word as a flag.

Do you have any stats or a feeling for how sparse the average IDL of
any size is?  Or how much ranges improve over lists of single IDs,
for that matter?  I suppose it matters somewhat for temporary IDLs in
memory as well as in files...

--
Hallvard
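
P.S.  To make the bottom-bit idea concrete, another throwaway sketch
(the packing of the normalized base below is made up just for
illustration and caps IDs at 1<<29; mk_header only shows the
normalization and isn't called; the point is how a lookup can resync
on header words):

#include <stdint.h>
#include <stddef.h>

#define IS_HEADER(w)  ((w) & 1u)          /* vector words never have bit 0 set */
#define HDR_BASE(w)   (((w) >> 8) << 5)   /* normalized base, a multiple of 32 */

/* Build a normalized header: slide the base up until mask bit 0 is
 * set, e.g. (base, 02) becomes (base+32, 01).  mask must be non-zero.
 * Packing is base/32 in the top 24 bits, mask in the low 8.          */
static uint32_t mk_header(uint32_t base, uint32_t mask)
{
    while (!(mask & 1u)) {
        base += 32;
        mask >>= 1;
    }
    return ((base >> 5) << 8) | mask;
}

/* Find the last header word whose base is <= id, by binary search
 * over the raw word array.  When a probe lands on a vector word we
 * step back to the nearest header; that is what bit 0 buys us.       */
static size_t idl_find_block(const uint32_t *idl, size_t n, uint32_t id)
{
    size_t lo = 0, hi = n, found = (size_t)-1;
    size_t mid, h;

    while (lo < hi) {
        mid = lo + (hi - lo) / 2;
        h = mid;
        while (h > lo && !IS_HEADER(idl[h]))
            h--;                          /* resync on a header word */

        if (!IS_HEADER(idl[h])) {
            lo = mid + 1;                 /* only vector words here; go right */
        } else if (HDR_BASE(idl[h]) <= id) {
            found = h;                    /* candidate; a later block may still fit */
            lo = mid + 1;
        } else {
            hi = h;                       /* this block starts past id; go left */
        }
    }
    return found;                /* (size_t)-1 if id precedes the first block */
}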