> > KaiGai-san,
> >
> > On 2016/02/01 10:38, Kouhei Kaigai wrote:
> > > As an aside, background of my motivation is the slide below:
> > > http://www.slideshare.net/kaigai/sqlgpussd-english
> > > (LT slides in JPUG conference last Dec)
> > >
> > > I'm under investigation of SSD-to-GPU direct feature on top of
> > > the custom-scan interface. It intends to load a bunch of data
> > > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
> > > loading onto CPU/RAM, to preprocess the data to be filtered out.
> > > It only makes sense if the target blocks are not loaded to the
> > > CPU/RAM yet, because SSD device is essentially slower than RAM.
> > > So, I like to have a reliable way to check the latest status of
> > > the shared buffer, to know whether a particular block is already
> > > loaded or not.
> >
> > Quite interesting stuff, thanks for sharing!
> >
> > I'm in no way expert on this but could this generally be attacked from the
> > smgr API perspective? Currently, we have only one implementation - md.c
> > (the hard-coded RelationData.smgr_which = 0). If we extended that and
> > provided end-to-end support so that there would be md.c alternatives to
> > storage operations, I guess that would open up opportunities for
> > extensions to specify smgr_which as an argument to ReadBufferExtended(),
> > provided there is already support in place to install md.c alternatives
> > (perhaps in .so). Of course, these are just musings and, perhaps does not
> > really concern the requirements of custom scan methods you have been
> > developing.
> >
> Thanks for your idea. Indeed, smgr hooks are a good candidate to implement
> the feature. However, what I need is a thin intermediation layer rather
> than an alternative storage engine.
> It has become clear that we need two features here.
> 1. A feature to check whether a particular block is already on the shared
>    buffer pool.
>    It is available. BufTableLookup() under the BufMappingPartitionLock
>    gives us the information we want.
> 2. A feature to suspend i/o write-out of particular blocks that are
>    registered by another concurrent backend, until they are
>    unregistered (usually at the end of P2P DMA).
>    ==> to be discussed.
> When a caller such as FlushBuffer() invokes smgrwrite(), it fetches the
> function pointer from the 'smgrsw' array, then calls smgr_write.
>   void
>   smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
>             char *buffer, bool skipFsync)
>   {
>       (*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
>                                                 buffer, skipFsync);
>   }
> If an extension overwrites the smgrsw[] array and then calls the original
> function under its own control, it can suspend the call of the original
> smgr_write until completion of P2P DMA.
> It may be the least invasive way to implement this, and it is portable to
> any future storage layers.
> What do you think? Even though it is a bit different from your
> original proposition.
I tried to draft an enhancement that realizes the above i/o write-out
suspend/resume in the least invasive way possible.

  ASSUMPTION: I intend to implement this feature as part of an extension,
      because these i/o suspend/resume checks are pure overhead for the
      core features unless an extension makes use of them.

Three functions shall be added:

extern int    GetStorageMgrNumbers(void);
extern f_smgr GetStorageMgrHandlers(int smgr_which);
extern void   SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);

As the name says, GetStorageMgrNumbers() returns the number of storage
managers currently installed. It always returns 1 right now.
GetStorageMgrHandlers() returns the currently configured f_smgr table for
the supplied smgr_which. It allows extensions to learn the current
configuration of the storage manager, even if another extension has already
modified it.
SetStorageMgrHandlers() installs the supplied 'smgr_handlers' in place of
the current one.
If an extension wants to intermediate 'smgr_write', it replaces
'smgr_write' with its own function, then calls the original function, likely
mdwrite, from that replacement function.
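
To make the shape of the three functions concrete, here is a minimal
stand-alone sketch. It is a mock, not PostgreSQL source: the f_smgr struct
is cut down to a single member with a simplified signature, and
mock_mdwrite and NSMGR are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Hypothetical, cut-down handler table; the real f_smgr has more members
 * and smgr_write takes (reln, forknum, blocknum, buffer, skipFsync). */
typedef struct f_smgr
{
    void (*smgr_write) (unsigned blocknum, const char *buffer);
} f_smgr;

/* Stand-in for mdwrite(); does nothing in this mock. */
void
mock_mdwrite(unsigned blocknum, const char *buffer)
{
    (void) blocknum;
    (void) buffer;
}

/* Only one storage manager is installed, as in the mail above. */
#define NSMGR 1
static f_smgr smgrsw[NSMGR] = {{mock_mdwrite}};

/* Number of storage managers currently installed. */
int
GetStorageMgrNumbers(void)
{
    return NSMGR;
}

/* Copy of the handler table currently configured for smgr_which. */
f_smgr
GetStorageMgrHandlers(int smgr_which)
{
    assert(smgr_which >= 0 && smgr_which < NSMGR);
    return smgrsw[smgr_which];
}

/* Install smgr_handlers in place of the current table. */
void
SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers)
{
    assert(smgr_which >= 0 && smgr_which < NSMGR);
    smgrsw[smgr_which] = smgr_handlers;
}
```

Returning the table by value means a later extension sees whatever an
earlier one installed, which is what allows hooks to be chained.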

In this case, the call chain will be:

  FlushBuffer, and others...
   +-- smgrwrite(...)
        +-- (extension's own function)
             +-- mdwrite
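
The call chain above can be sketched as a small stand-alone mock. Nothing
here is PostgreSQL source: the signatures are simplified, and
wait_for_p2p_dma, ext_smgr_write, install_write_hook and call_log are
hypothetical names. A real extension would sleep in the wait function
until the P2P DMA on that block has finished.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, cut-down handler table (the real f_smgr has more members). */
typedef struct f_smgr
{
    void (*smgr_write) (unsigned blocknum, const char *buffer);
} f_smgr;

/* Records the order of calls: 'W' = wait, 'M' = mock mdwrite. */
static char call_log[8];
static int  call_len = 0;

static void
log_call(char c)
{
    if (call_len < (int) sizeof(call_log) - 1)
        call_log[call_len++] = c;
}

/* Stand-in for mdwrite(). */
void
mock_mdwrite(unsigned blocknum, const char *buffer)
{
    (void) blocknum;
    (void) buffer;
    log_call('M');
}

static f_smgr smgrsw[1] = {{mock_mdwrite}};

f_smgr GetStorageMgrHandlers(int which) { return smgrsw[which]; }
void   SetStorageMgrHandlers(int which, f_smgr h) { smgrsw[which] = h; }

/* Handlers that were installed before the extension hooked in. */
static f_smgr original_handlers;

/* Placeholder: a real extension would block here until DMA completes. */
void
wait_for_p2p_dma(unsigned blocknum)
{
    (void) blocknum;
    log_call('W');
}

/* The extension's own function: wait first, then call the original. */
void
ext_smgr_write(unsigned blocknum, const char *buffer)
{
    wait_for_p2p_dma(blocknum);
    original_handlers.smgr_write(blocknum, buffer);
}

/* Save the current table, then replace only smgr_write. */
void
install_write_hook(void)
{
    f_smgr handlers;

    original_handlers = GetStorageMgrHandlers(0);
    handlers = original_handlers;
    handlers.smgr_write = ext_smgr_write;
    SetStorageMgrHandlers(0, handlers);
}

/* Simplified smgrwrite(): dispatches through the table, as in core. */
void
smgrwrite(unsigned blocknum, const char *buffer)
{
    smgrsw[0].smgr_write(blocknum, buffer);
}
```

After install_write_hook(), a call to smgrwrite() reaches the wait function
first and the original writer second, matching the chain drawn above.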

As long as the extension's own function blocks write i/o until the P2P DMA
issued by a concurrent process has completed, we don't need to care about
partial updates of the OS cache or the storage device.
It is not difficult for extensions to implement a feature to track/untrack
a tuple of (relFileNode, forkNum, blockNum), automatic untracking tied
to the resource-owner, and a mechanism to block the caller while P2P DMA
is in progress.
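
As a rough illustration of the track/untrack part only, the sketch below
keeps a small fixed-size table of (relfilenode, forknum, blocknum) entries
with a reference count. All names (TrackedBlock, track_block,
untrack_block, write_must_wait, MAX_TRACKED) are hypothetical. It is
single-process; a real extension would need shared memory, locking, and
the resource-owner cleanup mentioned above, none of which is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 64

/* One block under P2P DMA; refcount == 0 marks a free slot. */
typedef struct TrackedBlock
{
    uint32_t relfilenode;
    int32_t  forknum;
    uint32_t blocknum;
    int      refcount;
} TrackedBlock;

static TrackedBlock tracked[MAX_TRACKED];

/* Return the slot tracking this block, or NULL if it is not tracked. */
static TrackedBlock *
find_slot(uint32_t rel, int32_t fork, uint32_t blk)
{
    for (int i = 0; i < MAX_TRACKED; i++)
    {
        if (tracked[i].refcount > 0 &&
            tracked[i].relfilenode == rel &&
            tracked[i].forknum == fork &&
            tracked[i].blocknum == blk)
            return &tracked[i];
    }
    return NULL;
}

/* Register a block before starting DMA; false if the table is full. */
bool
track_block(uint32_t rel, int32_t fork, uint32_t blk)
{
    TrackedBlock *slot = find_slot(rel, fork, blk);

    if (slot != NULL)
    {
        slot->refcount++;
        return true;
    }
    for (int i = 0; i < MAX_TRACKED; i++)
    {
        if (tracked[i].refcount == 0)
        {
            tracked[i].relfilenode = rel;
            tracked[i].forknum = fork;
            tracked[i].blocknum = blk;
            tracked[i].refcount = 1;
            return true;
        }
    }
    return false;
}

/* Unregister a block after DMA; false if it was not tracked. */
bool
untrack_block(uint32_t rel, int32_t fork, uint32_t blk)
{
    TrackedBlock *slot = find_slot(rel, fork, blk);

    if (slot == NULL)
        return false;
    slot->refcount--;
    return true;
}

/* The smgr_write replacement would consult this before writing. */
bool
write_must_wait(uint32_t rel, int32_t fork, uint32_t blk)
{
    return find_slot(rel, fork, blk) != NULL;
}
```

The reference count lets several backends register the same block; the
write stays blocked until the last of them has unregistered it.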

On the other hand, its flexibility seems to me a bit larger than necessary
(what I want to implement is just a blocker of buffer write i/o). It may
also give people the wrong impression that this is a pluggable storage
feature.

What do folks think?

NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>
