Hi, I was looking around for an exotic index type to try the experience of streamifying an extension, ie out-of-core code. I am totally new to pgvector, but since everyone keeps talking about it, I could not avoid picking up some basic facts in the pgconf.dev hallway track, and understood that its scans have some degree of known-order access predictability, and then also some degree of fuzzy-predictable order-not-yet-determined access too. It's also quite random in the I/O sense.
Here's a toy to streamify the known-order part. I think for the fuzzy part that links those parts together, maybe there is some way to guess when it's a reasonable time to speculatively prefetch the lowest order stuff in the pairing heap, and then deal with it if you're wrong, but I didn't try that... Someone involved in that project mentioned that it's probably not a great topic to research in practice, because real world users of HNSW use fully cached ie prewarmed indexes, because the performance is so bad otherwise. (Though maybe that argument is a little circular...). So although this patch clearly speeds up cold HSNW searches to a degree controlled by effective_io_concurrency, I'll probably look for something else. Suggestions for interesting index types to look at streamifying are very welcome! Hmm. If that's really true about HNSW though, then there may still be an opportunity to do automatic memory prefetching[1]. But then in the case of index building, "stream" is NULL in this patch anyway. It surely must also be possible to find some good places to put profitable explicit pg_mem_prefetch() calls given the predictability and the need to get only ~60ns ahead for that usage. I didn't look into that because I was trying to prove things about read_stream.c, not get involved in another project :-D Here ends my science experiment report, which I'm dropping here just in case others see useful ideas here. The main thing I learned about the read stream API is that it'd be nice to be able to reset the stream but preserve the distance (something that came up on the streaming sequential scan thread for a different reason), to deal with cases where look-ahead opportunities come in bursts but you want a longer lived stream than I used here. That is the reason the patch creates and destroys temporary streams in a loop; doh. It also provides an interesting case study for what speculative random look-ahead support might need to look like. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKNUMnqubrrz8pRBdEM8vHeSCZcNq7iqERmkt6zPtpA3g%40mail.gmail.com === setup ==== create extension vector; create or replace function random_vector(dimensions int) returns vector language sql begin atomic; select array_agg(random())::vector from generate_series(1, dimensions); end; create table t (id serial, embedding vector(6)); insert into t (embedding) select random_vector(6) from generate_series(1, 1000000); set maintenance_work_mem = '2GB'; create index on t using hnsw(embedding vector_l2_ops); === test of a hot search, assuming repeated === select embedding <-> '[0.5,0.5,0.5,0.5,0.5,0.5]'::vector from t where embedding <-> '[0.5,0.5,0.5,0.5,0.5,0.5]'::vector < 0.2 order by 1 limit 20; === test of a cold search, assuming empty caches === create or replace function test() returns void language plpgsql as $$ declare my_vec vector(6) := random_vector(6); begin perform embedding <-> my_vec from t where embedding <-> my_vec < 0.2 order by 1 limit 20; end; $$; select test(); (Make sure you remember to set effective_io_concurrency to an interesting number if you want to generate a lot of overlapping fadvise calls.)
0001-XXX-toy-use-of-read-stream-API.patch
Description: Binary data