Hi Navneet,

With the RANDOM IOContext, on modern OSes / Java versions, Lucene will hint to the OS that I/O against the memory-mapped segment will be random, using the POSIX madvise API with the MADV_RANDOM flag.
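At the Lucene level, the only difference is which IOContext you pass to Directory.openInput when the file is opened, so you can measure the effect of the hint directly on a sequential read like the checksum case you measured. Here's a rough sketch (the index path and .vec file name are made up, and exactly which IOContext constants exist depends on the Lucene version you're on):

import java.nio.file.Path;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.MMapDirectory;

public class ChecksumIoContextCompare {
  public static void main(String[] args) throws Exception {
    // Hypothetical index directory and segment file name -- point these at a real index.
    Path indexPath = Path.of("/path/to/index");
    String vecFile = "_0.vec";

    try (Directory dir = new MMapDirectory(indexPath)) {
      // Note: run each case in a fresh process (or drop the OS page cache between runs)
      // for a fair comparison; the first pass warms the page cache for the second.

      // RANDOM context: MMapDirectory may hint MADV_RANDOM, so the OS does no readahead.
      time(dir, vecFile, IOContext.RANDOM, "RANDOM");
      // READ context: sequential-friendly; the OS may readahead and pre-warm the page cache.
      time(dir, vecFile, IOContext.READ, "READ");
    }
  }

  private static void time(Directory dir, String file, IOContext ctx, String label) throws Exception {
    long start = System.nanoTime();
    try (IndexInput in = dir.openInput(file, ctx)) {
      // Reads the whole file sequentially and verifies its footer checksum.
      CodecUtil.checksumEntireFile(in);
    }
    System.out.printf("%s: %.1f ms%n", label, (System.nanoTime() - start) / 1_000_000.0);
  }
}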
For the READ IOContext, Lucene maybe hints with MADV_SEQUENTIAL, I'm not sure. Or maybe it doesn't hint anything? It's up to the OS to then take these hints and do something "interesting" to try to optimize IO and page caching based on them. I think modern Linux OSes will readahead (and pre-warm the page cache) for MADV_SEQUENTIAL? And maybe skip the page cache and readahead for MADV_RANDOM? Not certain...

For computing the checksum, which is always a sequential operation, if we use MADV_RANDOM (which is stupid), that is indeed expected to perform worse since there is no readahead pre-caching. 50% worse (what you are seeing) is indeed quite an impact ... maybe open an issue? At least for checksumming we should open even .vec files for sequential reads? But, then, if it's the same IndexInput which will then be used "normally" (e.g. for merging), we would want THAT one to be open for random access ... might be tricky to fix.

One simple workaround an application can do is to ask MMapDirectory to pre-touch all bytes/pages in .vec/.veq files (rough sketch after the quoted message below) -- this asks the OS to load all of those bytes into the page cache (if there is enough free RAM). We do this at Amazon (product search) for our production searching processes. Otherwise, paging in all .vec/.veq pages via the random access provoked by HNSW graph searching is crazy slow...

Mike McCandless

http://blog.mikemccandless.com


On Sun, Sep 29, 2024 at 4:06 AM Navneet Verma <vermanavneet...@gmail.com> wrote:

> Hi Lucene Experts,
> I wanted to understand the performance difference between opening and
> reading a whole file using an IndexInput with IOContext RANDOM vs READ.
>
> I can see that .vec files (storing the flat vectors) are opened with
> RANDOM, whereas .dvd files are opened with READ. In my testing with files
> close to 5GB in size (~1.6M docs, each doc 3072 bytes), full-file checksum
> validation is faster for a file opened with the READ context than with
> RANDOM. The time difference I am seeing is close to 50%. Hence this
> performance question: is my understanding correct?
>
> Thanks
> Navneet
>
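For reference, here is roughly what that pre-touch workaround looks like, assuming a Lucene 9.x MMapDirectory where setPreload takes a predicate over file name and IOContext (older releases only expose a boolean setPreload); the extension filter and index path are just illustrative:

import java.nio.file.Path;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.MMapDirectory;

public class PreloadVectors {
  public static void main(String[] args) throws Exception {
    Path indexPath = Path.of("/path/to/index"); // illustrative path

    try (MMapDirectory dir = new MMapDirectory(indexPath)) {
      // Ask MMapDirectory to touch every page of .vec/.veq files as they are opened,
      // pulling them into the OS page cache up front (if there is enough free RAM)
      // instead of faulting pages in one at a time during HNSW graph search.
      dir.setPreload((String name, IOContext context) ->
          name.endsWith(".vec") || name.endsWith(".veq"));

      // ... open your IndexReader / IndexSearcher over `dir` as usual ...
    }
  }
}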