Hi Uwe and Mike,
Thanks for the quick responses. Let me try to answer a few things here:

*In addition, in Lucene 9.12 (latest 9.x) version released today there are
some changes to ensure that checksumming is always done with
IOContext.READ_ONCE (which uses READ behind the scenes).*
I didn't find any such change for Lucene99FlatVectorsReader
<https://github.com/apache/lucene/blob/branch_9_12/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsReader.java#L68-L77>,
even though I checked BufferedChecksumIndexInput
<https://github.com/apache/lucene/blob/branch_9_12/lucene/core/src/java/org/apache/lucene/store/BufferedChecksumIndexInput.java#L31-L35>,
ChecksumIndexInput
<https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/store/ChecksumIndexInput.java#L25>,
and CodecUtil
<https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/CodecUtil.java#L606-L621>
in the 9.12 branch. Please point me to the right file if I am missing
something here. I can see the same for Lucene version 10
<https://github.com/apache/lucene/blob/branch_10_0/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsReader.java#L69-L77>
too.
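
For reference, this is the kind of sequential, READONCE-based whole-file
checksum I was expecting to find. It is only a sketch of my understanding
(it assumes the Lucene 9.x Directory.openChecksumInput(String, IOContext)
overload; in Lucene 10 the context argument goes away), not code taken from
the reader:

import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.ChecksumIndexInput;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;

class SequentialChecksumSketch {
  // Sketch only: verify a file's footer checksum through a sequential,
  // READONCE-advised input instead of the RANDOM-advised vector data input.
  static long verifyChecksum(Directory dir, String fileName) throws IOException {
    try (ChecksumIndexInput in = dir.openChecksumInput(fileName, IOContext.READONCE)) {
      // Seeking forward on a ChecksumIndexInput reads (and checksums) the
      // skipped bytes, so this streams the whole file before the footer check.
      in.seek(in.length() - CodecUtil.footerLength());
      return CodecUtil.checkFooter(in); // compares computed vs stored checksum
    }
  }
}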

Mike, on the question of what the RANDOM vs READ context is doing, we found
this information about the madvise flags online:

MADV_RANDOM: Expect page references in random order. (Hence, read ahead may
be less useful than normally.)
MADV_SEQUENTIAL: Expect page references in sequential order. (Hence, pages
in the given range can be aggressively read ahead, and may be freed soon
after they are accessed.)
MADV_WILLNEED: Expect access in the near future. (Hence, it might be a good
idea to read some pages ahead.)

This tells me that MADV_RANDOM is a poor fit for checksumming: the checksum
reads the file sequentially, so disabling readahead means more read
operations (and more time) for the same bytes.
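
To make this concrete, here is the kind of micro-benchmark I am putting
together (a sketch only: the directory and file name are placeholders, and
each context really needs a cold page cache or a fresh process, otherwise
the first pass warms the cache for the later ones):

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.MMapDirectory;

public class ChecksumContextBench {
  // args[0] = index directory, args[1] = file name inside it (e.g. a ~5GB .vec file)
  public static void main(String[] args) throws IOException {
    String[] names = {"RANDOM", "READ", "READONCE"};
    IOContext[] contexts = {IOContext.RANDOM, IOContext.READ, IOContext.READONCE};
    try (Directory dir = new MMapDirectory(Paths.get(args[0]))) {
      for (int i = 0; i < contexts.length; i++) {
        long start = System.nanoTime();
        try (IndexInput in = dir.openInput(args[1], contexts[i])) {
          // Same sequential whole-file checksum that checkIntegrity() ends up doing.
          CodecUtil.checksumEntireFile(in);
        }
        System.out.printf("%s: %d ms%n", names[i], (System.nanoTime() - start) / 1_000_000);
      }
    }
  }
}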

*One simple workaround an application can do is to ask MMapDirectory to
pre-touch all bytes/pages in .vec/.veq files -- this asks the OS to cache
all of those bytes into page cache (if there is enough free RAM).  We do
this at Amazon (product search) for our production searching processes.
Otherwise paging in all .vec/.veq pages via random access provoked through
HNSW graph searching is crazy slow...*
Did you mean the preload functionality offered by MMapDirectory here? I can
try it to see if it helps, though I doubt it will make a difference in this
case.
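
Just so we are talking about the same thing, this is what I understood the
suggestion to be (a sketch; I am assuming the predicate-based
MMapDirectory.setPreload that newer 9.x releases expose, with older ones
only having setPreload(boolean)):

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.store.MMapDirectory;

class PreloadSketch {
  // Sketch: ask MMapDirectory to pre-touch the pages of the flat vector files
  // so the OS loads them into the page cache up front (if enough free RAM exists).
  static MMapDirectory openWithVectorPreload(String indexPath) throws IOException {
    MMapDirectory dir = new MMapDirectory(Paths.get(indexPath));
    // Preload only .vec/.veq; MMapDirectory.ALL_FILES would preload everything.
    dir.setPreload((fileName, context) ->
        fileName.endsWith(".vec") || fileName.endsWith(".veq"));
    return dir;
  }
}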

On opening the issue: I am working through some reproducible benchmarks
before creating a GitHub issue. If you believe I should create the issue
first, I can do that, as it might take me some time to build the
reproducible benchmarks.

Thanks
Navneet


On Mon, Sep 30, 2024 at 3:08 AM Uwe Schindler <u...@thetaphi.de> wrote:
>
> Hi,
>
> please also note: In Lucene 10 there checksum IndexInput will always be
> opened with IOContext.READ_ONCE.
>
> If you want to sequentially read a whole index file for other reason
> than checksumming, please pass the correct IOContext. In addition, in
> Lucene 9.12 (latest 9.x) version released today there are some changes
> to ensure that checksumming is always done with IOContext.READ_ONCE
> (which uses READ behind scenes).
>
> Uwe
>
> Am 29.09.2024 um 17:09 schrieb Michael McCandless:
> > Hi Navneet,
> >
> > With RANDOM IOcontext, on modern OS's / Java versions, Lucene will hint the
> > memory mapped segment that the IO will be random using madvise POSIX API
> > with MADV_RANDOM flag.
> >
> > For READ IOContext, Lucene maybe hits with MADV_SEQUENTIAL, I'm not sure.
> > Or maybe it doesn't hint anything?
> >
> > It's up to the OS to then take these hints and do something "interesting"
> > to try to optimize IO and page caching based on these hints.  I think
> > modern Linux OSs will readahead (and pre-warm page cache) for
> > MADV_SEQUENTIAL?  And maybe skip page cache and readhead for MADV_RANDOM?
> > Not certain...
> >
> > For computing checksum, which is always a sequential operation, if we use
> > MADV_RANDOM (which is stupid), that is indeed expected to perform worse
> > since there is no readahead pre-caching.  50% worse (what you are seeing)
> > is indeed quite an impact ...
> >
> > Maybe open an issue?  At least for checksumming we should open even .vec
> > files for sequential read?  But, then, if it's the same IndexInput which
> > will then be used "normally" (e.g. for merging), we would want THAT one to
> > be open for random access ... might be tricky to fix.
> >
> > One simple workaround an application can do is to ask MMapDirectory to
> > pre-touch all bytes/pages in .vec/.veq files -- this asks the OS to cache
> > all of those bytes into page cache (if there is enough free RAM).  We do
> > this at Amazon (product search) for our production searching processes.
> > Otherwise paging in all .vec/.veq pages via random access provoked through
> > HNSW graph searching is crazy slow...
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Sun, Sep 29, 2024 at 4:06 AM Navneet Verma <vermanavneet...@gmail.com>
> > wrote:
> >
> >> Hi Lucene Experts,
> >> I wanted to understand the performance difference between opening and
> >> reading the whole file using an IndexInput with IoContext as RANDOM vs
> >> READ.
> >>
> >> I can see .vec files(storing the flat vectors) are opened with RANDOM and
> >> whereas dvd files are opened as READ. As per my testing with files close to
> >> size 5GB storing (~1.6M docs with each doc 3072 bytes), I can see that when
> >> full file checksum validation is happening for a file opened via READ
> >> context it is faster than RANDOM. The amount of time difference I am seeing
> >> is close to 50%. Hence the performance question is coming up, I wanted to
> >> understand is this understanding correct?
> >>
> >> Thanks
> >> Navneet
> >>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
