On Wed, Sep 18, 2024 at 5:13 AM Thomas Munro <thomas.mu...@gmail.com> wrote:
> On Sun, Sep 15, 2024 at 12:14 AM Mats Kindahl <m...@timescale.com> wrote:
> > I used the combination of your patch and making the computation of
> > vacattrstats for a relation available through the API and managed to
> > implement something that I think does the right thing. (I just sampled a
> > few different statistics to check if they seem reasonable, like most common
> > vals and most common freqs.) See attached patch.
>
> Cool. I went ahead and committed that small new function and will
> mark the open item closed.

Thank you Thomas, this will help a lot.

> > I need the vacattrstats to set up the two streams for the internal
> > relations. I can just re-implement them in the same way as is already done,
> > but this seems like a small change that avoids unnecessary code duplication.
>
> Unfortunately we're not in a phase where we can make non-essential
> changes, we're right about to release and we're only committing fixes,
> and it seems like you have a way forward (albeit with some
> duplication). We can keep talking about that for v18.

Yes, I can work around this by re-implementing the same code that is
present in PostgreSQL.

> From your earlier email:
>
> > I'll take a look at the thread. I really think the ReadStream
> > abstraction is a good step in the right direction.
>
> Here's something you or your colleagues might be interested in: I was
> looking around for a fun extension to streamify as a demo of the
> technology, and I finished up writing a quick patch to streamify
> pgvector's HNSW index scan, which worked well enough to share[1] (I
> think it should in principle be able to scale with the number of graph
> connections, at least 16x), but then people told me that it's of
> limited interest because everybody knows that HNSW indexes have to fit
> in memory (I think there may also be memory prefetch streaming
> opportunities, unexamined for now). But that made me wonder what the
> people with the REALLY big indexes do for hyperdimensional graph
> search on a scale required to build Skynet, and that led me back to
> Timescale pgvectorscale[2]. I see two obvious signs that this thing
> is eminently and profitably streamifiable: (1) The stated aim is
> optimising for indexes that don't fit in memory, hence "Disk" in the
> name of the research project it is inspired by, (2) I see that
> DiskANN[3] is aggressively using libaio (Linux) and overlapped/IOCP
> (Windows). So now I am waiting patiently for a Rustacean to show up
> with patches for pgvectorscale to use ReadStream, which would already
> get read-ahead advice and vectored I/O (Linux, macOS, FreeBSD soon
> hopefully), and hopefully also provide a nice test case for the AIO
> patch set which redirects buffer reads through io_uring (Linux,
> basically the newer better libaio) or background I/O workers (other
> OSes, which works surprisingly competitively). Just BTW for
> comparison with DiskANN we have also had early POC-quality patches
> that drive AIO with overlapped/IOCP (Windows) which will eventually be
> rebased and proposed (Windows isn't really a primary target but we
> wanted to validate that the stuff we're working on has abstractions
> that will map to the obvious system APIs found in the systems
> PostgreSQL targets). For completeness, I've also had it mostly
> working on the POSIX AIO of FreeBSD, HP-UX and AIX (though we dropped
> support for those last two so that was a bit of a dead end).
>
> [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJ_7NKd46nx1wbyXWriuZSNzsTfm%2BrhEuvU6nxZi3-KVw%40mail.gmail.com
> [2] https://github.com/timescale/pgvectorscale
> [3] https://github.com/microsoft/DiskANN

Thanks Thomas, this looks really interesting. I've forwarded it to the
pgvectorscale team.

--
Best wishes,
Mats Kindahl, Timescale