Re: Use streaming read API in ANALYZE

Mats Kindahl Thu, 19 Sep 2024 23:37:10 -0700

On Wed, Sep 18, 2024 at 5:13 AM Thomas Munro <thomas.mu...@gmail.com> wrote:


> On Sun, Sep 15, 2024 at 12:14 AM Mats Kindahl <m...@timescale.com> wrote:
> > I used the combination of your patch and making the computation of
> > vacattrstats for a relation available through the API and managed to
> > implement something that I think does the right thing. (I just sampled a
> > few different statistics to check if they seem reasonable, like most common
> > vals and most common freqs.) See attached patch.
>
> Cool.  I went ahead and committed that small new function and will
> mark the open item closed.
>

Thank you Thomas, this will help a lot.


> > I need the vacattrstats to set up the two streams for the internal
> > relations. I can just re-implement them in the same way as is already done,
> > but this seems like a small change that avoids unnecessary code duplication.
>
> Unfortunately we're not in a phase where we can make non-essential
> changes, we're right about to release and we're only committing fixes,
> and it seems like you have a way forward (albeit with some
> duplication).  We can keep talking about that for v18.
>

Yes, I can work around this by re-implementing the same code that is
present in PostgreSQL.


>
> From your earlier email:
> > I'll take a look at the thread. I really think the ReadStream
> > abstraction is a good step in the right direction.
>
> Here's something you or your colleagues might be interested in: I was
> looking around for a fun extension to streamify as a demo of the
> technology, and I finished up writing a quick patch to streamify
> pgvector's HNSW index scan, which worked well enough to share[1] (I
> think it should in principle be able to scale with the number of graph
> connections, at least 16x), but then people told me that it's of
> limited interest because everybody knows that HNSW indexes have to fit
> in memory (I think there may also be memory prefetch streaming
> opportunities, unexamined for now).  But that made me wonder what the
> people with the REALLY big indexes do for hyperdimensional graph
> search on a scale required to build Skynet, and that led me back to
> Timescale pgvectorscale[2].  I see two obvious signs that this thing
> is eminently and profitably streamifiable: (1) The stated aim is
> optimising for indexes that don't fit in memory, hence "Disk" in the
> name of the research project it is inspired by, (2) I see that
> DiskANN[3] is aggressively using libaio (Linux) and overlapped/IOCP
> (Windows).  So now I am waiting patiently for a Rustacean to show up
> with patches for pgvectorscale to use ReadStream, which would already
> get read-ahead advice and vectored I/O (Linux, macOS, FreeBSD soon
> hopefully), and hopefully also provide a nice test case for the AIO
> patch set which redirects buffer reads through io_uring (Linux,
> basically the newer better libaio) or background I/O workers (other
> OSes, which works surprisingly competitively).  Just BTW for
> comparison with DiskANN we have also had early POC-quality patches
> that drive AIO with overlapped/IOCP (Windows) which will eventually be
> rebased and proposed (Windows isn't really a primary target but we
> wanted to validate that the stuff we're working on has abstractions
> that will map to the obvious system APIs found in the systems
> PostgreSQL targets).  For completeness, I've also had it mostly
> working on the POSIX AIO of FreeBSD, HP-UX and AIX (though we dropped
> support for those last two so that was a bit of a dead end).
>
> [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJ_7NKd46nx1wbyXWriuZSNzsTfm%2BrhEuvU6nxZi3-KVw%40mail.gmail.com
> [2] https://github.com/timescale/pgvectorscale
> [3] https://github.com/microsoft/DiskANN
>

Thanks Thomas, this looks really interesting. I've forwarded it to the
pgvectorscale team.
-- 
Best wishes,
Mats Kindahl, Timescale
