Hi Navneet,

With the RANDOM IOContext, on modern OSes / Java versions, Lucene will hint to the OS that I/O against the memory-mapped segment will be random, using the POSIX madvise API with the MADV_RANDOM flag.
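At the Lucene level, the only difference is which IOContext you pass to Directory.openInput when the file is opened, so you can measure the effect of the hint directly on a sequential read like the checksum case you measured. Here's a rough sketch (the index path and .vec file name are made up, and exactly which IOContext constants exist depends on the Lucene version you're on):

import java.nio.file.Path;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.MMapDirectory;

public class ChecksumIoContextCompare {
  public static void main(String[] args) throws Exception {
    // Hypothetical index directory and segment file name -- point these at a real index.
    Path indexPath = Path.of("/path/to/index");
    String vecFile = "_0.vec";

    try (Directory dir = new MMapDirectory(indexPath)) {
      // Note: run each case in a fresh process (or drop the OS page cache between runs)
      // for a fair comparison; the first pass warms the page cache for the second.

      // RANDOM context: MMapDirectory may hint MADV_RANDOM, so the OS does no readahead.
      time(dir, vecFile, IOContext.RANDOM, "RANDOM");
      // READ context: sequential-friendly; the OS may readahead and pre-warm the page cache.
      time(dir, vecFile, IOContext.READ, "READ");
    }
  }

  private static void time(Directory dir, String file, IOContext ctx, String label) throws Exception {
    long start = System.nanoTime();
    try (IndexInput in = dir.openInput(file, ctx)) {
      // Reads the whole file sequentially and verifies its footer checksum.
      CodecUtil.checksumEntireFile(in);
    }
    System.out.printf("%s: %.1f ms%n", label, (System.nanoTime() - start) / 1_000_000.0);
  }
}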
For the READ IOContext, Lucene maybe hints with MADV_SEQUENTIAL, I'm not sure. Or maybe it doesn't hint anything? It's up to the OS to then take these hints and do something "interesting" to try to optimize IO and page caching based on them. I think modern Linux OSes will readahead (and pre-warm the page cache) for MADV_SEQUENTIAL? And maybe skip the page cache and readahead for MADV_RANDOM? Not certain...

For computing the checksum, which is always a sequential operation, if we use MADV_RANDOM (which is stupid), that is indeed expected to perform worse since there is no readahead pre-caching. 50% worse (what you are seeing) is indeed quite an impact ... maybe open an issue? At least for checksumming we should open even .vec files for sequential reads? But, then, if it's the same IndexInput which will then be used "normally" (e.g. for merging), we would want THAT one to be open for random access ... might be tricky to fix.

One simple workaround an application can do is to ask MMapDirectory to pre-touch all bytes/pages in .vec/.veq files (rough sketch after the quoted message below) -- this asks the OS to load all of those bytes into the page cache (if there is enough free RAM). We do this at Amazon (product search) for our production searching processes. Otherwise, paging in all .vec/.veq pages via the random access provoked by HNSW graph searching is crazy slow...

Mike McCandless

http://blog.mikemccandless.com


On Sun, Sep 29, 2024 at 4:06 AM Navneet Verma <vermanavneet...@gmail.com> wrote:

> Hi Lucene Experts,
> I wanted to understand the performance difference between opening and
> reading a whole file using an IndexInput with IOContext RANDOM vs READ.
>
> I can see that .vec files (storing the flat vectors) are opened with
> RANDOM, whereas .dvd files are opened with READ. In my testing with files
> close to 5GB in size (~1.6M docs, each doc 3072 bytes), full-file checksum
> validation is faster for a file opened with the READ context than with
> RANDOM. The time difference I am seeing is close to 50%. Hence this
> performance question: is my understanding correct?
>
> Thanks
> Navneet
>
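For reference, here is roughly what that pre-touch workaround looks like, assuming a Lucene 9.x MMapDirectory where setPreload takes a predicate over file name and IOContext (older releases only expose a boolean setPreload); the extension filter and index path are just illustrative:

import java.nio.file.Path;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.MMapDirectory;

public class PreloadVectors {
  public static void main(String[] args) throws Exception {
    Path indexPath = Path.of("/path/to/index"); // illustrative path

    try (MMapDirectory dir = new MMapDirectory(indexPath)) {
      // Ask MMapDirectory to touch every page of .vec/.veq files as they are opened,
      // pulling them into the OS page cache up front (if there is enough free RAM)
      // instead of faulting pages in one at a time during HNSW graph search.
      dir.setPreload((String name, IOContext context) ->
          name.endsWith(".vec") || name.endsWith(".veq"));

      // ... open your IndexReader / IndexSearcher over `dir` as usual ...
    }
  }
}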