We have a problem with using FieldCache (or TermEnum/TermDocs directly) to pre-cache several fields. It has become a bottleneck because we open a new searchable index snapshot very frequently (every minute). Each time we get a new snapshot of our master index (basically a copy made with hard links), we use a background thread to reload a number of fields via FieldCache-style loading, then switch incoming queries over to the new searchable index. At our current index size (18M docs), it takes about 90 seconds to load all the fields we need cached, which is too long. I understand the cost comes from doing two I/O seeks per term, even though these terms are unique (or nearly unique, such as a timestamp) for each document. I'm familiar with field payloads as an alternative, but we cannot use them, and we also have an existing index that we can't modify.
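For context, here is a minimal sketch of the per-snapshot warm-up we do today, assuming a Lucene 3.x-style API (the field names are illustrative). Each FieldCache call walks the field's terms and postings to build a value-per-docID array, which is where the 90 seconds goes:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.store.FSDirectory;

    public class SnapshotWarmup {
        public static void warm(File snapshotDir) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.open(snapshotDir), true);
            // Each call does a full pass over the field's terms; with ~18M mostly
            // unique values per field, this dominates the load time.
            long[] timestamps = FieldCache.DEFAULT.getLongs(reader, "timestamp");   // illustrative field
            String[] ids = FieldCache.DEFAULT.getStrings(reader, "externalId");     // illustrative field
            // ... repeat for the other cached fields, then switch queries over to 'reader'
        }
    }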
One idea I have is to maintain a set of binary on-disk files (one for each field we want to cache) on our master index server and replicate them with each index snapshot, so they can be read directly into memory, which would be much faster than running FieldCache against each snapshot. These would be flat binary arrays whose indices correspond, by internal doc ID, to the index snapshot. As new documents arrive, we would simply append the field values to these files (for unique fields). The big question is: when do internal doc IDs change? We periodically expire (delete) documents from the master index and run infrequent optimizations (monthly). When we schedule an optimization of the master index, we can regenerate all of these files, so I know doc IDs can change then. But the key question is: during normal indexing with no optimizations, can doc IDs change because of low-level segment merges as we add new documents, given the existence of deleted documents?
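To make the sidecar-file idea concrete, here is a minimal sketch (the file format, class, and method names are all hypothetical) of loading and appending one fixed-width field file, where index i in the file holds the value for internal docID i. It only works as long as docIDs stay stable between optimizations, which is exactly what I'm asking about:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class SidecarFieldFile {
        // Load an 8-bytes-per-doc field file straight into memory; assumes the
        // file contains exactly maxDoc values, one per internal docID.
        public static long[] load(File file, int maxDoc) throws Exception {
            long[] values = new long[maxDoc];
            FileChannel ch = new FileInputStream(file).getChannel();
            try {
                ByteBuffer buf = ByteBuffer.allocate(maxDoc * 8);
                while (buf.hasRemaining() && ch.read(buf) != -1) { }
                buf.flip();
                buf.asLongBuffer().get(values);
            } finally {
                ch.close();
            }
            return values;
        }

        // On the master, append one value per newly added document so the file
        // stays aligned with docIDs -- valid only while docIDs do not shift.
        public static void append(File file, long value) throws Exception {
            FileChannel ch = new FileOutputStream(file, true).getChannel();
            try {
                ByteBuffer buf = ByteBuffer.allocate(8);
                buf.putLong(value);
                buf.flip();
                while (buf.hasRemaining()) ch.write(buf);
            } finally {
                ch.close();
            }
        }
    }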