Hi all,

Some of the information earlier in this thread was incorrect. This is what
actually happens:
- The source "vec" files are indeed read twice, but for a different reason:
once to calculate the checksum and once to copy the live vectors to the
"vec_temp" file.
- The "vec_temp" file is then closed for writing and reopened for reading
with RANDOM advice. It is first copied as is to the "vec" file in the
target segment. If there's enough memory (2x the vector data size), the
file might stay in the page cache.
- After that sequential read, the "vec_temp" file is read randomly to build
the HNSW graph, and then deleted. If there was enough cache and we're
lucky, the temp file may never hit the disk at all, thanks to the OS's
write-behind.
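
To make the steps above concrete, here is a minimal sketch in plain
java.nio. It is not Lucene's code: the class and method names are
hypothetical, the "vec" layout is simplified to raw float vectors, and the
live-docs mask is a plain boolean array.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.zip.CRC32;

final class TempFileMergeSketch {

  // Source read #1: the whole file is read only to compute a checksum.
  static long checksum(Path file) throws IOException {
    CRC32 crc = new CRC32();
    crc.update(Files.readAllBytes(file)); // a real implementation would stream
    return crc.getValue();
  }

  // Fills buf completely, starting at the given absolute file position.
  static void readFully(FileChannel ch, ByteBuffer buf, long pos) throws IOException {
    while (buf.hasRemaining()) {
      int n = ch.read(buf, pos);
      if (n < 0) throw new EOFException("position " + pos);
      pos += n;
    }
  }

  // Returns the vector at merged ordinal `ord`, read randomly from the temp file.
  static float[] merge(Path source, boolean[] live, int dim, Path temp, Path target, int ord)
      throws IOException {
    int vectorBytes = dim * Float.BYTES;
    checksum(source);
    ByteBuffer buf = ByteBuffer.allocate(vectorBytes);
    // Source read #2: copy only the live vectors to the temp file.
    try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
        FileChannel out = FileChannel.open(temp, StandardOpenOption.CREATE,
            StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE)) {
      for (int i = 0; i < live.length; i++) {
        if (!live[i]) continue;
        buf.clear();
        readFully(in, buf, (long) i * vectorBytes);
        buf.flip();
        while (buf.hasRemaining()) out.write(buf);
      }
    }
    // Sequential read of the temp file: copy it as is to the target "vec" file.
    Files.copy(temp, target, StandardCopyOption.REPLACE_EXISTING);
    // Random reads of the temp file: this is where the graph would be built.
    float[] vector = new float[dim];
    try (FileChannel in = FileChannel.open(temp, StandardOpenOption.READ)) {
      buf.clear();
      readFully(in, buf, (long) ord * vectorBytes);
      buf.flip();
      buf.asFloatBuffer().get(vector);
    }
    Files.delete(temp); // the temp copy is dropped, the target copy stays
    return vector;
  }

  static float[] demo() {
    try {
      Path dir = Files.createTempDirectory("vec-sketch");
      Path source = dir.resolve("old.vec");
      Path target = dir.resolve("new.vec");
      ByteBuffer data = ByteBuffer.allocate(3 * 2 * Float.BYTES);
      for (float f : new float[] {1f, 2f, 3f, 4f, 5f, 6f}) data.putFloat(f);
      Files.write(source, data.array());
      // The middle vector is deleted, so merged ordinal 1 is the old vector 2.
      float[] v = merge(source, new boolean[] {true, false, true}, 2,
          dir.resolve("new.vec_temp"), target, 1);
      Files.delete(source);
      Files.delete(target);
      Files.delete(dir);
      return v;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(demo()));
  }
}
```

The point of the sketch is only the I/O pattern: the live vectors are
written once to the temp file and once more to the target file.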

To avoid the temporary copy, we'd need the ability to open a file for both
reading and writing. We could then copy the live vectors to the target
"vec" file and, before closing it, read it randomly to build the graph. We
can't simply close the file before reading it, because we process vector
fields one by one and the file has to stay open for writing until the last
field is done. I proposed adding read-write access to `Directory` in
another issue regarding the incorrect use of fsync (
https://github.com/apache/lucene/issues/14334), so we might kill two birds
with one stone.
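
A rough sketch of that read-write idea, again in plain java.nio rather than
through Lucene's `Directory` (which is exactly the part that would need the
new capability). The class, the method names and the raw-float layout are
my assumptions for illustration only.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

final class ReadWriteVecSketch {

  // Appends one field's live vectors and returns the offset where they start.
  static long appendField(FileChannel ch, float[][] vectors) throws IOException {
    long start = ch.position();
    for (float[] v : vectors) {
      ByteBuffer buf = ByteBuffer.allocate(v.length * Float.BYTES);
      for (float f : v) buf.putFloat(f);
      buf.flip();
      while (buf.hasRemaining()) ch.write(buf);
    }
    return start;
  }

  // Random read of vector `ord` of a field while the file is still open for writing.
  static float[] readVector(FileChannel ch, long fieldStart, int dim, int ord)
      throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES);
    long pos = fieldStart + (long) ord * dim * Float.BYTES;
    while (buf.hasRemaining()) {
      int n = ch.read(buf, pos); // positional read, does not move the write position
      if (n < 0) throw new EOFException("position " + pos);
      pos += n;
    }
    buf.flip();
    float[] v = new float[dim];
    buf.asFloatBuffer().get(v);
    return v;
  }

  static float[][] demo() {
    try {
      Path vec = Files.createTempFile("merged", ".vec");
      try (FileChannel ch =
          FileChannel.open(vec, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
        // Field A: write its vectors, then read them back randomly ("graph building").
        long a = appendField(ch, new float[][] {{1f, 2f}, {3f, 4f}});
        float[] fromA = readVector(ch, a, 2, 1);
        // Field B is appended to the same, still open, file.
        long b = appendField(ch, new float[][] {{7f, 8f, 9f}});
        float[] fromB = readVector(ch, b, 3, 0);
        return new float[][] {fromA, fromB};
      } finally {
        Files.delete(vec);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(Arrays.deepToString(demo()));
  }
}
```

Each field is written once and read in place, so no second copy of the
vectors is needed.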

Another option would be to build the graph by reading the vectors from the
`vec` files in the original segments. The only drawback I can see is some
page cache pollution from the deleted vectors. It would be interesting to
know the original motivation for the temp file; I couldn't find it in old
PRs. Randomly reading the original segments seems like only a slight
overhead to me, and I don't think that alone would justify the temp file,
but this is just a guess; I didn't benchmark anything.
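
Reading from the original segments needs a mapping from each vector's
position in the merged segment to its source segment and position there,
skipping deleted vectors. A minimal sketch of such a mapping follows. The
class name is hypothetical, and it assumes that merged order is simply the
source segments concatenated (no index sorting) and that every document
has a vector.

```java
import java.util.Arrays;

final class MergedOrdinalMap {
  private final int[] segmentOf;  // merged ordinal -> source segment index
  private final int[] localOrdOf; // merged ordinal -> ordinal inside that segment

  MergedOrdinalMap(boolean[][] livePerSegment) {
    int total = 0;
    for (boolean[] segment : livePerSegment) {
      for (boolean live : segment) {
        if (live) total++;
      }
    }
    segmentOf = new int[total];
    localOrdOf = new int[total];
    int merged = 0;
    for (int s = 0; s < livePerSegment.length; s++) {
      for (int ord = 0; ord < livePerSegment[s].length; ord++) {
        if (livePerSegment[s][ord]) { // deleted vectors get no merged ordinal
          segmentOf[merged] = s;
          localOrdOf[merged] = ord;
          merged++;
        }
      }
    }
  }

  int size() {
    return segmentOf.length;
  }

  int segment(int mergedOrd) {
    return segmentOf[mergedOrd];
  }

  int localOrd(int mergedOrd) {
    return localOrdOf[mergedOrd];
  }

  // Two source segments: {live, deleted, live} and {deleted, live}.
  static int[][] demo() {
    MergedOrdinalMap map =
        new MergedOrdinalMap(new boolean[][] {{true, false, true}, {false, true}});
    int[][] out = new int[map.size()][];
    for (int i = 0; i < map.size(); i++) {
      out[i] = new int[] {map.segment(i), map.localOrd(i)};
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.deepToString(demo()));
  }
}
```

The deleted vectors are never addressed, but they still share pages with
live ones in the old files, which is the cache pollution mentioned above.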

For my use case I actually ended up modifying the Lucene code to not write
the temp file at all and to keep the merged vectors on heap. I made sure
there is enough RAM for this by capping the merged segment size with
`TieredMergePolicy.setMaxMergedSegmentMB`.
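
As a back-of-the-envelope illustration of that sizing (not my actual
settings), the sketch below derives a segment cap from a heap budget. The
helper names are hypothetical. It assumes the worst case where the vectors
make up the whole merged segment, that several merges may run at once, and
it ignores JVM object overhead. The segment size limit is also only an
estimate on Lucene's side, so leave headroom.

```java
final class MergeHeapBudget {

  // Raw payload of float32 vectors held on heap, ignoring array headers.
  static long vectorBytes(long numVectors, int dim) {
    return Math.multiplyExact(Math.multiplyExact(numVectors, (long) dim), (long) Float.BYTES);
  }

  // Largest segment cap (in MB) so that `concurrentMerges` merges, each holding
  // a whole merged segment's worth of vectors on heap, fit into the budget.
  static double maxMergedSegmentMb(long heapBudgetBytes, int concurrentMerges) {
    return heapBudgetBytes / (double) concurrentMerges / (1024.0 * 1024.0);
  }

  public static void main(String[] args) {
    long heapBudget = 4L * 1024 * 1024 * 1024; // assumed: 4 GiB reserved for merging
    double capMb = maxMergedSegmentMb(heapBudget, 2);
    System.out.println("cap in MB: " + capMb);
    System.out.println("1M x 768-dim vectors: " + vectorBytes(1_000_000L, 768) + " bytes");
    // The cap would then be applied with:
    //   new TieredMergePolicy().setMaxMergedSegmentMB(capMb);
  }
}
```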

Viliam

On Fri, Jun 27, 2025 at 11:52 PM Viliam Ďurina <viliam.dur...@gmail.com>
wrote:

> I can confirm the temp file isn't renamed, but it's copied a second time.
> I'm on vacation next week.
>
> On Fri, Jun 27, 2025 at 9:24 PM Michael Sokolov <msoko...@gmail.com> wrote:
>
>> Right! Thanks for the pointer. It does seem like there is room for
>> improvement then, maybe Viliam wants to tackle it?
>>
>> On Fri, Jun 27, 2025 at 12:57 PM Adrien Grand <jpou...@gmail.com> wrote:
>> >
>> > Mike, I believe that the answer to your question is in this PR review
>> > comment: https://github.com/apache/lucene/pull/601#discussion_r783711025.
>> >
>> > Merging is currently implemented by looping over fields once, and
>> > merging them. Writing the vec file first would require merging flat
>> > vectors for all fields first, and then doing a second pass over all
>> > fields to create their HNSW graph. This sounds doable, but we never
>> > got to it.
>> >
>> >
>> >
>> > On Fri, Jun 27, 2025 at 2:19 PM Michael Sokolov <msoko...@gmail.com> wrote:
>> >
>> > > Without this temp file we would need to load the entire set of vectors
>> > > for the new merged segment into RAM in order to support building an
>> > > HNSW graph from it. This way we can read the vectors off the disk in
>> > > the same way we would do during normal searches.  I'm not sure, but I
>> > > think the temp file simply gets renamed into the new segment and
>> > > doesn't have to be physically copied a second time.  It would be good
>> > > to confirm that.
>> > >
>> > > On Thu, Jun 26, 2025 at 4:52 PM Viliam Ďurina <viliam.dur...@gmail.com> wrote:
>> > > >
>> > > > Hi all,
>> > > >
>> > > > I noticed that during merging in an index that contains vector
>> > > > fields, the new segment contains a temporary file with
>> > > > ".vec_temp_N.tmp" extension, which contains all the vectors being
>> > > > merged. This file is used to search for neighbors for the new HNSW
>> > > > graph. It is later deleted, and the segment will contain a ".vec"
>> > > > file with the same vectors. So vectors are copied two times and
>> > > > more space is temporarily needed on disk.
>> > > >
>> > > > In my index, the ".vec" file is 98% of the index size and the index
>> > > > is many GB. Is it really necessary to have the temp file? Couldn't
>> > > > Lucene query the "vec" file directly? I checked the code around it,
>> > > > one temp file is created per field and the temp file is probably
>> > > > deleted before starting the next field, but still, there is another
>> > > > copy of the vector, so the temp file seems unnecessary.
>> > > >
>> > > > Is there some specific need for the temp file? I might try to do a
>> > > > PR removing the need for it.
>> > > >
>> > > > Viliam
>> > >
>> > > ---------------------------------------------------------------------
>> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
>> > >
>> > >
>> >
>> > --
>> > Adrien
>>
