Re: [gpfsug-discuss] mmfsd write behavior

2017-10-09 Thread Aaron Knister
Thanks, Sven.

I think my goal was for the REQ_FUA flag to be used in alignment with
the consistency expectations of the filesystem. Meaning, if I were
writing to a file on a filesystem (e.g. dd if=/dev/zero of=/gpfs/fs0/file1),
the write requests to the disk addresses containing that file's data
wouldn't be issued with REQ_FUA. However, once the file was closed, the
close() wouldn't return until a disk buffer flush had occurred. For more
important operations (e.g. metadata updates, log operations) I would
expect/suspect REQ_FUA to be issued more frequently.
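As a rough userspace illustration of that ordering (my sketch only, not
anything GPFS actually does; the path and sizes are made up), the pattern
is ordinary writes with the durability point deferred to a single flush
before close:

    /* Sketch only: ordinary writes carry no per-request durability;
     * a single explicit flush (standing in for the flush-on-close
     * behavior described above) is issued before close(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        int i;
        int fd = open("/gpfs/fs0/file1", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) { perror("open"); return 1; }
        memset(buf, 0, sizeof(buf));

        /* data writes: no REQ_FUA expected per request */
        for (i = 0; i < 1024; i++) {
            if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                perror("write");
                return 1;
            }
        }

        /* the one durability point: flush the data (and the device's
         * volatile cache, if the stack honors it) before close */
        if (fdatasync(fd) != 0)
            perror("fdatasync");

        close(fd);
        return 0;
    }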

The advantage here is that it would allow GPFS to run on top of block
devices that don't perform well with the present synchronous workload of
mmfsd (e.g. ZFS, and various other software-defined storage or hardware
appliances) but that can perform well when only periodically (e.g. every
few seconds) asked to flush pending data to disk. I also think this
would be *really* important in an FPO environment, where individual
drives will probably have their caches on by default and I'm not sure
direct I/O is sufficient to force Linux to issue SCSI SYNCHRONIZE CACHE
commands to those devices.
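For what it's worth, here's a minimal user-space sketch (device name is
just an example) of checking whether a drive advertises a volatile
write-back cache and asking the kernel to flush it; fsync() on a
block-device fd should reach blkdev_issue_flush(), though whether that
actually turns into a SYNCHRONIZE CACHE on a given stack is exactly the
question at hand:

    /* Sketch, not tested against any particular stack: report the
     * drive's cache mode from sysfs and request an explicit flush. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char mode[32] = "";
        FILE *f = fopen("/sys/block/sda/queue/write_cache", "r");
        int fd;

        if (f) {
            /* reads "write back" or "write through" */
            if (fgets(mode, sizeof(mode), f))
                printf("write cache mode: %s", mode);
            fclose(f);
        }

        fd = open("/dev/sda", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* fsync() on a block device asks the kernel to flush the
         * device's volatile cache (blkdev_issue_flush under the hood) */
        if (fsync(fd) != 0)
            perror("fsync");

        close(fd);
        return 0;
    }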

I'm guessing that this is far from easy but I figured I'd ask.

-Aaron

On 10/9/17 5:07 PM, Sven Oehme wrote:
> Hi,
> 
> Yeah, sorry, I intended to reply before my vacation and forgot about
> it, and then the vacation flushed it all away :-D
> Right now the assumption in Scale/GPFS is that the underlying storage
> doesn't have any form of volatile write cache enabled. The problem seems
> to be that even if we set REQ_FUA, some stacks or devices may not have
> implemented it at all, or not correctly, so even if we set it there is
> no guarantee that it will do what you think it does. The benefit of
> adding the flag is at least that it would allow us to blame everything
> on the underlying stack/device, but I am not sure that will make anybody
> happy if bad things happen, so a non-volatile device will still be
> required at all times underneath Scale.
> If you think we should do this, please open a PMR with the details of
> your test so it can go through the regular support path. You can mention
> me in the PMR as a reference, as we already looked at the places where
> the request would have to be added.
> 
> Sven
> 
> 
> On Mon, Oct 9, 2017 at 1:47 PM Aaron Knister wrote:
> 
> Hi Sven,
> 
> Just wondering if you've had any additional thoughts/conversations about
> this.
> 
> -Aaron
> 
> On 9/8/17 5:21 PM, Sven Oehme wrote:
> > Hi,
> >
> > The code assumption is that the underlying device has no volatile
> > write cache. I was absolutely sure we had that somewhere in the FAQ,
> > but I couldn't find it, so I will talk to somebody to correct this.
> > If I understand
> > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt
> > correctly, one could enforce this by setting REQ_FUA, but that's not
> > explicitly set today, at least I can't see it. I will discuss this
> > with one of our devs who owns this code and come back.
> >
> > sven
> >
> >
> > On Thu, Sep 7, 2017 at 8:05 PM Aaron Knister wrote:
> >
> >     Thanks Sven. I didn't think GPFS itself was caching anything on
> >     that layer, but it's my understanding that O_DIRECT isn't
> >     sufficient to force I/O to be flushed (e.g. the device itself
> >     might have a volatile caching layer). Take someone using ZFS
> >     zvols as NSDs. I can write() all day long to that zvol (even with
> >     O_DIRECT) but there is absolutely no guarantee those writes have
> >     been committed to stable storage and aren't just sitting in RAM
> >     until an fsync() occurs (or some other bio function that causes a
> >     flush). I also don't believe writing to a SATA drive with
> >     O_DIRECT will force flushes of the drive's writeback cache...
> >     although I just tested that one and it seems to actually trigger
> >     a SCSI cache sync. Interesting.
> >
> >     -Aaron
> >
> >     On 9/7/17 10:55 PM, Sven Oehme wrote:
> >      > I am not sure what exactly you are looking for, but all block
> >      > devices are opened with O_DIRECT; we never cache anything on
> >      > this layer.
> >      >
> >      >

Re: [gpfsug-discuss] mmfsd write behavior

2017-10-09 Thread Sven Oehme
Hi,

Yeah, sorry, I intended to reply before my vacation and forgot about it,
and then the vacation flushed it all away :-D
Right now the assumption in Scale/GPFS is that the underlying storage
doesn't have any form of volatile write cache enabled. The problem seems
to be that even if we set REQ_FUA, some stacks or devices may not have
implemented it at all, or not correctly, so even if we set it there is no
guarantee that it will do what you think it does. The benefit of adding
the flag is at least that it would allow us to blame everything on the
underlying stack/device, but I am not sure that will make anybody happy
if bad things happen, so a non-volatile device will still be required at
all times underneath Scale.
If you think we should do this, please open a PMR with the details of
your test so it can go through the regular support path. You can mention
me in the PMR as a reference, as we already looked at the places where
the request would have to be added.
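For reference, the kernel-side semantics described in
writeback_cache_control.txt come down to setting the flush/FUA flags on
the bio. A minimal sketch (assuming a recent kernel, and emphatically
not taken from the GPFS GPL layer) would look roughly like:

    /* Rough sketch of the flags discussed above, assuming a bio that
     * has already been allocated and populated; kernel API details
     * vary by version. Not GPFS code. */
    #include <linux/bio.h>
    #include <linux/blk_types.h>

    static void submit_durable_write(struct bio *bio)
    {
            /* REQ_PREFLUSH: flush the device's volatile cache before
             * this write is started.
             * REQ_FUA: this write's data must itself reach
             * non-volatile media before completion is signalled. */
            bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA;
            submit_bio(bio);
    }

As Sven notes above, whether a given stack or device actually honors
those flags is a separate question.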

Sven


On Mon, Oct 9, 2017 at 1:47 PM Aaron Knister wrote:

> Hi Sven,
>
> Just wondering if you've had any additional thoughts/conversations about
> this.
>
> -Aaron
>
> On 9/8/17 5:21 PM, Sven Oehme wrote:
> > Hi,
> >
> > The code assumption is that the underlying device has no volatile write
> > cache. I was absolutely sure we had that somewhere in the FAQ, but I
> > couldn't find it, so I will talk to somebody to correct this.
> > If I understand
> > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt
> > correctly, one could enforce this by setting REQ_FUA, but that's not
> > explicitly set today, at least I can't see it. I will discuss this with
> > one of our devs who owns this code and come back.
> >
> > sven
> >
> >
> > On Thu, Sep 7, 2017 at 8:05 PM Aaron Knister wrote:
> >
> > Thanks Sven. I didn't think GPFS itself was caching anything on that
> > layer, but it's my understanding that O_DIRECT isn't sufficient to
> > force I/O to be flushed (e.g. the device itself might have a volatile
> > caching layer). Take someone using ZFS zvols as NSDs. I can write()
> > all day long to that zvol (even with O_DIRECT) but there is absolutely
> > no guarantee those writes have been committed to stable storage and
> > aren't just sitting in RAM until an fsync() occurs (or some other bio
> > function that causes a flush). I also don't believe writing to a SATA
> > drive with O_DIRECT will force flushes of the drive's writeback
> > cache... although I just tested that one and it seems to actually
> > trigger a SCSI cache sync. Interesting.
> >
> > -Aaron
> >
> > On 9/7/17 10:55 PM, Sven Oehme wrote:
> >  > I am not sure what exactly you are looking for, but all block
> >  > devices are opened with O_DIRECT; we never cache anything on this
> >  > layer.
> >  >
> >  >
> >  > On Thu, Sep 7, 2017, 7:11 PM Aaron Knister wrote:
> >  >
> >  > Hi Everyone,
> >  >
> >  > This is something that's come up in the past and has recently
> >  > resurfaced with a project I've been working on, and that is-- it
> >  > seems to me as though mmfsd never attempts to flush the cache of
> >  > the block devices it's writing to (looking at blktrace output seems
> >  > to confirm this). Is this actually the case? I've looked at the GPL
> >  > headers for Linux and I don't see any sign of blkdev_fsync,
> >  > blkdev_issue_flush, WRITE_FLUSH, or REQ_FLUSH. I'm sure there are
> >  > other ways to trigger this behavior that GPFS may very well be
> >  > using that I've missed. That's why I'm asking :)
> >  >
> >  > I figure with FPO being pushed as an HDFS replacement using
> >  > commodity drives this feature has *got* to be in the code somewhere.
> >  >
> >  > -Aaron
> >  >
> >  > --
> >  > Aaron Knister
> >  > NASA Center for Climate Simulation (Code 606.2)
> >  > Goddard Space Flight Center
> >  > (301) 286-2776

Re: [gpfsug-discuss] mmfsd write behavior

2017-10-09 Thread Aaron Knister

Hi Sven,

Just wondering if you've had any additional thoughts/conversations about 
this.


-Aaron

On 9/8/17 5:21 PM, Sven Oehme wrote:

Hi,

The code assumption is that the underlying device has no volatile write
cache. I was absolutely sure we had that somewhere in the FAQ, but I
couldn't find it, so I will talk to somebody to correct this.
If I understand
https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt
correctly, one could enforce this by setting REQ_FUA, but that's not
explicitly set today, at least I can't see it. I will discuss this with
one of our devs who owns this code and come back.


sven


On Thu, Sep 7, 2017 at 8:05 PM Aaron Knister wrote:


Thanks Sven. I didn't think GPFS itself was caching anything on that
layer, but it's my understanding that O_DIRECT isn't sufficient to force
I/O to be flushed (e.g. the device itself might have a volatile caching
layer). Take someone using ZFS zvols as NSDs. I can write() all day long
to that zvol (even with O_DIRECT) but there is absolutely no guarantee
those writes have been committed to stable storage and aren't just
sitting in RAM until an fsync() occurs (or some other bio function that
causes a flush). I also don't believe writing to a SATA drive with
O_DIRECT will force flushes of the drive's writeback cache...
although I just tested that one and it seems to actually trigger a SCSI
cache sync. Interesting.

-Aaron

On 9/7/17 10:55 PM, Sven Oehme wrote:
 > I am not sure what exactly you are looking for, but all block devices
 > are opened with O_DIRECT; we never cache anything on this layer.
 >
 >
 > On Thu, Sep 7, 2017, 7:11 PM Aaron Knister wrote:
 >
 >     Hi Everyone,
 >
 >     This is something that's come up in the past and has recently
 >     resurfaced with a project I've been working on, and that is-- it
 >     seems to me as though mmfsd never attempts to flush the cache of
 >     the block devices it's writing to (looking at blktrace output seems
 >     to confirm this). Is this actually the case? I've looked at the GPL
 >     headers for Linux and I don't see any sign of blkdev_fsync,
 >     blkdev_issue_flush, WRITE_FLUSH, or REQ_FLUSH. I'm sure there are
 >     other ways to trigger this behavior that GPFS may very well be
 >     using that I've missed. That's why I'm asking :)
 >
 >     I figure with FPO being pushed as an HDFS replacement using
 >     commodity drives this feature has *got* to be in the code somewhere.
 >
 >     -Aaron
 >
 >     --
 >     Aaron Knister
 >     NASA Center for Climate Simulation (Code 606.2)
 >     Goddard Space Flight Center
 >     (301) 286-2776

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776



--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] mmfsd write behavior

2017-09-07 Thread Aaron Knister
Thanks Sven. I didn't think GPFS itself was caching anything on that
layer, but it's my understanding that O_DIRECT isn't sufficient to force
I/O to be flushed (e.g. the device itself might have a volatile caching
layer). Take someone using ZFS zvols as NSDs. I can write() all day long
to that zvol (even with O_DIRECT) but there is absolutely no guarantee
those writes have been committed to stable storage and aren't just
sitting in RAM until an fsync() occurs (or some other bio function that
causes a flush). I also don't believe writing to a SATA drive with
O_DIRECT will force flushes of the drive's writeback cache...
although I just tested that one and it seems to actually trigger a SCSI
cache sync. Interesting.
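In case it's useful to anyone reproducing that test, here's a hedged
sketch of forcing the drive-level sync explicitly by sending a SCSI
SYNCHRONIZE CACHE(10) through the SG_IO ioctl (the device path is an
example; this is my illustration, not something GPFS does):

    /* Sketch: issue SYNCHRONIZE CACHE(10) directly to a SCSI/SATA disk
     * via SG_IO, bypassing the filesystem entirely. Run against the
     * whole-disk device; requires sufficient privileges. */
    #include <fcntl.h>
    #include <scsi/sg.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned char cdb[10] = { 0x35 };   /* SYNCHRONIZE CACHE(10) */
        unsigned char sense[32];
        struct sg_io_hdr io;
        int fd = open("/dev/sda", O_RDWR);  /* example device */

        if (fd < 0) { perror("open"); return 1; }

        memset(&io, 0, sizeof(io));
        io.interface_id = 'S';
        io.cmd_len = sizeof(cdb);
        io.cmdp = cdb;
        io.dxfer_direction = SG_DXFER_NONE; /* no data transfer */
        io.sbp = sense;
        io.mx_sb_len = sizeof(sense);
        io.timeout = 20000;                 /* ms */

        if (ioctl(fd, SG_IO, &io) < 0)
            perror("SG_IO");
        else
            printf("SYNCHRONIZE CACHE sent, SCSI status 0x%x\n", io.status);

        close(fd);
        return 0;
    }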


-Aaron

On 9/7/17 10:55 PM, Sven Oehme wrote:
I am not sure what exactly you are looking for, but all block devices are
opened with O_DIRECT; we never cache anything on this layer.



On Thu, Sep 7, 2017, 7:11 PM Aaron Knister wrote:


Hi Everyone,

This is something that's come up in the past and has recently resurfaced
with a project I've been working on, and that is-- it seems to me as
though mmfsd never attempts to flush the cache of the block devices it's
writing to (looking at blktrace output seems to confirm this). Is this
actually the case? I've looked at the GPL headers for Linux and I don't
see any sign of blkdev_fsync, blkdev_issue_flush, WRITE_FLUSH, or
REQ_FLUSH. I'm sure there are other ways to trigger this behavior that
GPFS may very well be using that I've missed. That's why I'm asking :)

I figure with FPO being pushed as an HDFS replacement using commodity
drives this feature has *got* to be in the code somewhere.

-Aaron

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776



--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] mmfsd write behavior

2017-09-07 Thread Sven Oehme
I am not sure what exactly you are looking for, but all block devices are
opened with O_DIRECT; we never cache anything on this layer.
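To make the distinction in this thread concrete, here is a small sketch
(my illustration, not mmfsd code; device path is an example and writing
to it is destructive) showing that O_DIRECT only bypasses the page
cache, while durability on a drive with a volatile write-back cache
still needs an explicit flush:

    /* Sketch: O_DIRECT skips the page cache but does not by itself
     * flush the device's volatile write cache. WARNING: writes to the
     * raw device and will clobber data; illustration only. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        int fd = open("/dev/sdb", O_WRONLY | O_DIRECT);  /* example NSD */

        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT requires suitably aligned buffers and sizes */
        if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }
        memset(buf, 0, 4096);

        if (write(fd, buf, 4096) != 4096)  /* bypasses the page cache... */
            perror("write");
        if (fdatasync(fd) != 0)            /* ...but the device cache still
                                              needs an explicit flush */
            perror("fdatasync");

        free(buf);
        close(fd);
        return 0;
    }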

On Thu, Sep 7, 2017, 7:11 PM Aaron Knister wrote:

> Hi Everyone,
>
> This is something that's come up in the past and has recently resurfaced
> with a project I've been working on, and that is-- it seems to me as
> though mmfsd never attempts to flush the cache of the block devices it's
> writing to (looking at blktrace output seems to confirm this). Is this
> actually the case? I've looked at the GPL headers for Linux and I don't
> see any sign of blkdev_fsync, blkdev_issue_flush, WRITE_FLUSH, or
> REQ_FLUSH. I'm sure there are other ways to trigger this behavior that
> GPFS may very well be using that I've missed. That's why I'm asking :)
>
> I figure with FPO being pushed as an HDFS replacement using commodity
> drives this feature has *got* to be in the code somewhere.
>
> -Aaron
>
> --
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2)
> Goddard Space Flight Center
> (301) 286-2776
>
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss