Re: [LSF/MM TOPIC] The end of the DAX experiment

2019-02-17 Thread Ric Wheeler

On 2/6/19 4:12 PM, Dan Williams wrote:

Before people get too excited, this isn't a proposal to kill DAX. The
topic proposal is a discussion to resolve lingering open questions
that currently motivate ext4 and xfs to scream "EXPERIMENTAL" when the
current DAX facilities are enabled. There are 2 primary concerns to
resolve. Enumerate the remaining features/fixes, and identify a path
to implement it all without regressing any existing application use
cases.

An enumeration of remaining projects follows; please expand this list
if I missed something:

* "DAX" has no specific meaning by itself, users have 2 use cases for
"DAX" capabilities: userspace cache management via MAP_SYNC, and page
cache avoidance, where the latter aspect of DAX has no current API to
discover / use it. The project is to supplement MAP_SYNC with a
MAP_DIRECT facility and MADV_SYNC / MADV_DIRECT to indicate the same
dynamically via madvise. Similar to O_DIRECT, MAP_DIRECT would be an
application hint to avoid / minimize page cache usage, but no strict
guarantee like what MAP_SYNC provides.
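
For reference, the MAP_SYNC half of the above already exists; MAP_DIRECT and
MADV_SYNC / MADV_DIRECT do not. A minimal sketch of the existing usage, with a
made-up file path, no error handling, and the caveat that older glibc may need
<linux/mman.h> for the flag definitions:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* Hypothetical file on a DAX-capable ext4/xfs mount. */
        int fd = open("/mnt/pmem/log", O_RDWR);
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

        if (p == MAP_FAILED)
                return 1;       /* the kernel refuses MAP_SYNC mappings it cannot honor */

        memcpy(p, "hello", 5);
        /* With MAP_SYNC the page tables stay synchronous with the on-media
         * file metadata, so persisting the store only needs CPU cache
         * flushes (e.g. libpmem's pmem_persist()) -- no fsync()/msync(). */

        munmap(p, 4096);
        close(fd);
        return 0;
}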



Sounds like a great topic to me. Having just gone through a new round of USENIX 
paper reviews, it is interesting to see how many academic systems are being 
pitched in this space (and most of them don't mention the kernel-based xfs/ext4 
with DAX).


Regards,

Ric




* Resolve all "if (dax) goto fail;" patterns in the kernel. Outside of
longterm-GUP (a topic in its own right) the projects here are
XFS-reflink and XFS-realtime-device support. DAX+reflink effectively
requires a given physical page to be mapped into two different inodes
at different (page->index) offsets. The challenge is to support
DAX-reflink without violating any existing application-visible
semantics; the operating assumption / strawman to debate is that
experimental status is not blanket permission to go change existing
semantics in backwards-incompatible ways.

* Deprecate, but not remove, the DAX mount option. Too many flows
depend on the option so it will never go away, but the facility is too
coarse. Provide an option to enable MAP_SYNC and
more-likely-to-do-something-useful-MAP_DIRECT on a per-directory
basis. The current proposal is to allow this property to only be
toggled while the directory is empty to avoid the complications of
racing page invalidation with new DAX mappings.
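
For what it's worth, per-inode/per-directory DAX control did later land in
mainline as an inheritable inode flag; whether that is exactly the interface
meant here is not stated in this mail, so treat the following as a sketch of
one possible shape rather than the plan of record:

#include <sys/ioctl.h>
#include <linux/fs.h>

/* Mark a directory so that newly created files in it use DAX.  Mainline
 * exposes this as FS_XFLAG_DAX via the FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR
 * ioctls (the xfs_io / chattr "x" attribute).  dirfd is assumed to be an
 * open descriptor on the directory in question. */
int set_dax_hint(int dirfd)
{
        struct fsxattr fsx;

        if (ioctl(dirfd, FS_IOC_FSGETXATTR, &fsx) < 0)
                return -1;
        fsx.fsx_xflags |= FS_XFLAG_DAX;
        return ioctl(dirfd, FS_IOC_FSSETXATTR, &fsx);
}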


Secondary projects, i.e. important but, I would submit, not in the
critical path to removing the "experimental" designation:
* Filesystem-integrated badblock management. Hook up the media error
notifications from libnvdimm to the filesystem to allow for operations
like "list files with media errors" and "enumerate bad file offsets on
a granularity smaller than a page". Another consideration along these
lines is to integrate machine-check-handling and dynamic error
notification into a filesystem interface. I've heard complaints that
the sigaction() based mechanism to receive BUS_MCEERR_* information,
while sufficient for the "System RAM" use case, is not precise enough
for the "Persistent Memory / DAX" use case where errors are repairable
and sub-page error information is useful (see the sketch after this list).

* Userfaultfd for file-backed mappings and DAX
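
The sigaction() path criticized in the badblock item above looks roughly like
the sketch below. si_addr_lsb only reports a power-of-two granularity,
typically a whole page, which is the imprecision being complained about; the
handler itself is illustrative only.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void mce_handler(int sig, siginfo_t *si, void *ctx)
{
        /* BUS_MCEERR_AR: error hit on access; BUS_MCEERR_AO: error noticed
         * asynchronously.  si_addr_lsb encodes the blast radius as a power
         * of two -- no sub-page detail and no repair hook. */
        if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO)
                fprintf(stderr, "media error at %p, 2^%d bytes\n",
                        si->si_addr, si->si_addr_lsb);
        _exit(1);
}

int main(void)
{
        struct sigaction sa = {
                .sa_sigaction = mce_handler,
                .sa_flags = SA_SIGINFO,
        };

        sigaction(SIGBUS, &sa, NULL);
        /* ... touch a poisoned persistent-memory mapping here ... */
        return 0;
}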


Ideally all the usual DAX, persistent memory, and GUP suspects could
be in the room to discuss this:
* Jan Kara
* Dave Chinner
* Christoph Hellwig
* Jeff Moyer
* Johannes Thumshirn
* Matthew Wilcox
* John Hubbard
* Jérôme Glisse
* MM folks for the reflink vs 'struct page' vs Xarray considerations





Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks

2016-03-18 Thread Ric Wheeler

On 03/16/2016 06:23 PM, Chris Mason wrote:

On Tue, Mar 15, 2016 at 05:51:17PM -0700, Chris Mason wrote:

On Tue, Mar 15, 2016 at 07:30:14PM -0500, Eric Sandeen wrote:

On 3/15/16 7:06 PM, Linus Torvalds wrote:

On Tue, Mar 15, 2016 at 4:52 PM, Dave Chinner  wrote:

It is pretty clear that the onus is on the patch submitter to
provide justification for inclusion, not for the reviewer/Maintainer
to have to prove that the solution is unworkable.

I agree, but quite frankly, performance is a good justification.

So if Ted can give performance numbers, that's justification enough.
We've certainly taken changes with less.

I've been away from ext4 for a while, so I'm really not on top of the
mechanics of the underlying problem at the moment.

But I would say that in addition to numbers showing that ext4 has trouble
with unwritten extent conversion, we should have an explanation of
why it can't be solved in a way that doesn't open up these concerns.

XFS certainly has different mechanisms, but is the demonstrated workload
problematic on XFS (or btrfs) as well?  If not, can ext4 adopt any of the
solutions that make the workload perform better on other filesystems?

When I've benchmarked this in the past, doing small random buffered writes
into a preallocated extent was dramatically (3x or more) slower on xfs
than doing them into a fully written extent.  That was two years ago,
but I can redo it.

So I re-ran some benchmarks, with 4K O_DIRECT random ios on nvme (4.5
kernel).  This is O_DIRECT without O_SYNC.  I don't think xfs will do
commits for each IO into the prealloc file?  O_SYNC makes it much
slower, so hopefully I've got this right.

The test runs for 60 seconds, and I used an iodepth of 4:

prealloc file: 32,000 iops
overwrite: 121,000 iops

If I bump the iodepth up to 512:

prealloc file: 33,000 iops
overwrite:   279,000 iops

For streaming writes, XFS converts prealloc to written much better when
the IO isn't random.  You can start seeing the difference at 16K
sequential O_DIRECT writes, but really it's not a huge impact.  The worst
case is 4K:

prealloc file: 227MB/s
overwrite: 340MB/s

I can't think of sequential workloads where this will matter, since they
will either end up with bigger IO or the performance impact won't get
noticed.

-chris


I think that these numbers are the interesting ones, since a 4x slowdown is 
certainly significant.


If you re-run the same test after hacking XFS preallocation as Dave suggested with 
xfs_db, do we get most of the performance back?


Ric
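
Chris doesn't say which tool he used (fio would be typical), but the two cases
being compared reduce to something like the sketch below; the path, file size
and iteration count are made up, and error handling is omitted:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE       (1ULL << 30)    /* 1 GiB test file */
#define BLOCK           4096

static void random_4k_writes(int fd, long iterations)
{
        void *buf;

        posix_memalign(&buf, BLOCK, BLOCK);     /* O_DIRECT wants aligned buffers */
        memset(buf, 0xab, BLOCK);

        for (long i = 0; i < iterations; i++) {
                off_t off = (off_t)(random() % (FILE_SIZE / BLOCK)) * BLOCK;
                /* In the "prealloc" case each of these writes forces an
                 * unwritten->written extent conversion; in the "overwrite"
                 * case the extent is already written. */
                pwrite(fd, buf, BLOCK, off);
        }
        free(buf);
}

int main(void)
{
        int fd = open("/mnt/test/datafile", O_CREAT | O_RDWR | O_DIRECT, 0644);

        /* Case 1, "prealloc file": allocated but unwritten extents. */
        fallocate(fd, 0, 0, FILE_SIZE);
        random_4k_writes(fd, 100000);

        /* Case 2, "overwrite": write the whole file sequentially first, then
         * time the same random_4k_writes() pass against written extents. */

        close(fd);
        return 0;
}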






Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks

2016-03-18 Thread Ric Wheeler

On 03/17/2016 01:47 PM, Linus Torvalds wrote:

On Wed, Mar 16, 2016 at 10:18 PM, Gregory Farnum  wrote:

So we've not asked for NO_HIDE_STALE on the mailing lists, but I think
it was one of the problems Sage had using xfs in his BlueStore
implementation and was a big part of why it moved to pure userspace.
FileStore might use NO_HIDE_STALE in some places but it would be
pretty limited. When it came up at Linux FAST we were discussing how
it and similar things had been problems for us in the past and it
would've been nice if they were upstream.

Hmm.

So to me it really sounds like somebody should cook up a patch, but we
shouldn't put it in the upstream kernel until we get numbers and
actual "yes, we'd use this" from outside of google.

I say "outside of google", because inside of google not only do we not
get numbers, but google can maintain their own patch.

But maybe Ted could at least post the patch google uses, and somebody
in the Ceph community might want to at least try it out...


What *is* a big deal for
FileStore (and would be easy to take advantage of) is the thematically
similar O_NOMTIME flag, which is also about reducing metadata updates
and got blocked on similar stupid-user grounds (although not security
ones): http://thread.gmane.org/gmane.linux.kernel.api/10727.

Hmm. I don't hate that patch, because the NOATIME thing really does
wonders on many loads. NOMTIME makes sense.

It's not like you can't do this with utimes() anyway.

That said, I do wonder if people wouldn't just prefer to expand on and
improve on the lazytime.

Is there some reason you guys didn't use that?


As noted though, we've basically given up and are moving to a
pure-userspace solution as quickly as we can.

That argues against worrying about this all in the kernel unless there
are other users.

   Linus


Just a note, when Greg says "user space solution", Ceph is looking at writing 
directly to raw block devices, which is kind of a throwback to early 
enterprise database trends.


Ric
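
For readers placing "the lazytime" reference: it is a mount option (MS_LAZYTIME,
or "-o lazytime"); the device, mount point and fs type below are placeholders,
and older glibc may not define MS_LAZYTIME in <sys/mount.h>:

#include <sys/mount.h>

/* With lazytime, atime/mtime/ctime updates live in memory and are written
 * back only when the inode is written for other reasons, on fsync()/sync(),
 * at eviction, or periodically -- cutting timestamp I/O without dropping the
 * timestamps the way an O_NOMTIME flag would. */
int mount_with_lazytime(void)
{
        return mount("/dev/sdb1", "/srv/osd", "xfs",
                     MS_LAZYTIME | MS_NOATIME, NULL);
}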



Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks

2016-03-14 Thread Ric Wheeler

On 03/13/2016 07:30 PM, Dave Chinner wrote:

On Fri, Mar 11, 2016 at 04:44:16PM -0800, Linus Torvalds wrote:

On Fri, Mar 11, 2016 at 4:35 PM, Theodore Ts'o  wrote:

At the end of the day it's about whether you trust the userspace
program or not.

There's a big difference between "give the user rope", and "tie the
rope in a noose and put a banana peel so that the user might stumble
into the rope and hang himself", though.

So I do think that Dave is right that we should also strive to make
sure that our interfaces are not just secure in theory, but that they
are also good interfaces to make mistakes less likely.

At which point I have to ask: how do we safely allow filesystems to
expose stale data in files? There's a big "we need to trust
userspace" component in ever proposal that has been made so far -
that's the part I have extreme trouble with.

For example, what happens when a backup process running as root backs up a
file that has exposed stale data? Yes, we could set the "NODUMP"
flag on the inode to tell backup programs to skip backing up such
files, but we're now trusting some random userspace application
(e.g. tar, rsync, etc) not to do something we don't want it to do
with the data in that file.

AFAICT, we can't stop root from copying files that have exposed
stale data or changing their ownership without some kind of special
handling of "contains stale data" files within the kernel. At this
point we are back to needing persistent tracking of the "exposed
stale data" state in the inode as the only safe way to allow us to
expose stale data.  That's fairly ironic given that the stated
purpose of exposing stale data through fallocate is to avoid the
overhead of the existing mechanisms we use to track extents
containing stale data
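
The "NODUMP" flag Dave refers to is the existing FS_NODUMP_FL inode attribute
(what chattr +d sets); a minimal sketch of toggling it is below. Nothing in
the kernel enforces it, which is exactly his point -- whether tar, rsync or any
other tool honors it is up to that tool.

#include <sys/ioctl.h>
#include <linux/fs.h>

/* Set the per-inode "no dump" attribute on an open fd. */
int mark_nodump(int fd)
{
        int flags;

        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
                return -1;
        flags |= FS_NODUMP_FL;
        return ioctl(fd, FS_IOC_SETFLAGS, &flags);
}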


I think that once we enter this mode, the local file system has effectively 
ceded its role of preventing stale data exposure to the upper layer. In effect, 
this ceases to be a normal file system for any enabled process if we control 
this through fallocate(), or for all processes if we do the brute-force mount 
option that would be file-system wide.


That means we would not need to track this. Extents would be marked as if they 
had always had valid data (no more allocated-but-unwritten state).


In the end, that is the actual goal - move this enforcement up a layer, to the 
overlay/user space file systems that are then responsible for policing this kind 
of thing.


Regards,

Ric




I think we _should_ give users rope, but maybe we should also make
sure that there isn't some hidden rapidly spinning saw-blade right
next to the rope that the user doesn't even think about.

IMO we already have a good, safe interface that provides the rope
without the saw blades. I'm happy to be proven wrong, but IMO I
don't see that we can provide stale data exposure in a safe,
non-saw-bladey way without any kernel/filesystem side overhead.

Cheers,

Dave.




Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks

2016-03-10 Thread Ric Wheeler

On 03/11/2016 12:03 AM, Linus Torvalds wrote:

On Thu, Mar 10, 2016 at 6:58 AM, Ric Wheeler <ricwhee...@gmail.com> wrote:

What was objectionable at the time this patch was raised years back (not
just to me, but to pretty much every fs developer at LSF/MM that year)
centered on the concern that this would be viewed as a "performance" mode
and we get pressure to support this for non-privileged users. It gives any
user effectively the ability to read the block device content for previously
allocated data without restriction.

The sane way to do it would be to just check permissions of the
underlying block device.

That way, people can just set the permissions for that to whatever
they want. If google right now uses some magical group for this, they
could make the underlying block device be writable for that group.

We can do the security check at the filesystem level, because we have
sb->s_bdev->bd_inode, and if you have read and write permissions to
that inode, you might as well have permission to create an unsafe hole.

That doesn't sound very hacky to me.

Linus


I agree that this sounds quite reasonable.

Ric
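
A rough kernel-side sketch of the check Linus describes -- the helper name is
invented here and this is not code from any merged patch:

#include <linux/fs.h>

/* Allow an "expose stale blocks" request only from a caller who already has
 * read/write access to the backing block device, since such a caller could
 * read the raw blocks anyway. */
static bool may_expose_stale_blocks(struct super_block *sb)
{
        struct inode *bd_inode = sb->s_bdev->bd_inode;

        return inode_permission(bd_inode, MAY_READ | MAY_WRITE) == 0;
}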



Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks

2016-03-10 Thread Ric Wheeler

On 03/10/2016 04:38 AM, Theodore Ts'o wrote:

On Wed, Mar 09, 2016 at 02:20:31PM -0800, Gregory Farnum wrote:

I really am sensitive to the security concerns, just know that if it's
a permanent blocker you're essentially blocking out a growing category
of disk users (who run on an awfully large number of disks!).

Or they just have to use kernels with out-of-tree patches installed.  :-P

If you want to consider how many disks Google has that are using this
patch, I probably could have appealed to Linus and asked him to accept
the patch if I forced the issue.  The only reason why I didn't was
that people like Ric Wheeler threatened to have distro-specific
patches to disable the feature, and at the end of the day, I didn't
care that much.  After all, if it makes it harder for large scale
cloud companies besides Google to create more efficient userspace
cluster file systems, it's not like I was keeping the patch a secret.

So ultimately, if the Ceph developers want to make a case to Red Hat
management that this is important, great.  If not, it's not that hard
for those people who need the patch and who are running large cloud
infrastructures to simply apply the out-of-tree patch if they need it.

Cheers,

  - Ted



What was objectionable at the time this patch was raised years back (not just to 
me, but to pretty much every fs developer at LSF/MM that year) centered on the 
concern that this would be viewed as a "performance" mode and we get pressure to 
support this for non-privileged users. It gives any user effectively the ability 
to read the block device content for previously allocated data without restriction.


At the time, I also don't recall seeing the patch posted on upstream lists for 
debate or justification.


As we discussed a few weeks back, I don't object to having support for doing 
this in carefully controlled ways for things like user space file systems. In 
effect, the problem of preventing other people's data being handed over to the 
end user is taken on by that layer of code. I suspect that fits the use case at 
google and Ceph both.


Regards,

Ric
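
For readers trying to place the interface being argued about: it is usually
described as an extra fallocate() mode, commonly called FALLOC_FL_NO_HIDE_STALE;
mainline reserves the codepoint but does not implement it, so the call below is
a sketch of the out-of-tree behaviour rather than anything a stock kernel will
accept:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

#ifndef FALLOC_FL_NO_HIDE_STALE
#define FALLOC_FL_NO_HIDE_STALE 0x04    /* reserved codepoint in mainline */
#endif

/* Preallocate without unwritten-extent tracking: stale on-disk blocks become
 * readable through the file, which is why only a trusted layer (a user-space
 * cluster filesystem such as Ceph FileStore, or the Google use case) should
 * ever be allowed to do this. */
int preallocate_exposed(int fd, off_t len)
{
        return fallocate(fd, FALLOC_FL_NO_HIDE_STALE, 0, len);
}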





Re: Linux Foundation Technical Advisory Board Elections and Nomination process

2015-10-10 Thread Ric Wheeler

I would like to nominate Sage Weil with his consent.

Sage has led the Ceph project since its inception, contributed to the kernel, and 
has had an influence on projects like OpenStack.


thanks!

Ric



On 10/06/2015 01:06 PM, Grant Likely wrote:

[Resending because I messed up the first one]

The elections for five of the ten members of the Linux Foundation
Technical Advisory Board (TAB) are held every year[1]. This year the
election will be at the 2015 Kernel Summit in Seoul, South Korea
(probably on the Monday, 26 October) and will be open to all attendees
of both Kernel Summit and Korea Linux Forum.

Anyone is eligible to stand for election, simply send your nomination to:

tech-board-disc...@lists.linux-foundation.org

We currently have 3 nominees for five places:
Thomas Gleixner
Greg Kroah-Hartman
Stephen Hemminger

The deadline for receiving nominations runs up until the beginning of
the event where the election is held. (Although, please remember, if
you're not going to be present, that things go wrong with both networks
and mailing lists, so get your nomination in early.)

Grant Likely, TAB Chair

[1] TAB members sit for a term of 2 years, and half of the board is up
for election every year. Five of the seats are up for election now.
The other five are half way through their term and will be up for
election next year. The history of the TAB elections can be found
here:

https://docs.google.com/spreadsheets/d/1jGLQtul0taSRq_opYzJFALI7_34cS4RMS1_YQoTNCKA/edit#gid=0


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-22 Thread Ric Wheeler
On 01/22/2014 01:37 PM, Chris Mason wrote:
> Circling back to what we might talk about at the conference, Ric do you
> have any ideas on when these drives might hit the wild?
>
> -chris

I will poke at vendors to see if we can get someone to make a public statement,
but I cannot do that for them.

Ric



Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-22 Thread Ric Wheeler

On 01/22/2014 01:35 PM, James Bottomley wrote:

On Wed, 2014-01-22 at 13:17 -0500, Ric Wheeler wrote:

On 01/22/2014 01:13 PM, James Bottomley wrote:

On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:

On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:

On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote:

[ I like big sectors and I cannot lie ]

I think I might be sceptical, but I don't think that's showing in my
concerns ...


I really think that if we want to make progress on this one, we need
code and someone that owns it.  Nick's work was impressive, but it was
mostly there for getting rid of buffer heads.  If we have a device that
needs it and someone working to enable that device, we'll go forward
much faster.

Do we even need to do that (eliminate buffer heads)?  We cope with 4k
sector only devices just fine today because the bh mechanisms now
operate on top of the page cache and can do the RMW necessary to update
a bh in the page cache itself which allows us to do only 4k chunked
writes, so we could keep the bh system and just alter the granularity of
the page cache.


We're likely to have people mixing 4K drives and <fill in some other size here> on the same box.  We could just go with the biggest size and
use the existing bh code for the sub-pagesized blocks, but I really
hesitate to change VM fundamentals for this.

If the page cache had a variable granularity per device, that would cope
with this.  It's the variable granularity that's the VM problem.


  From a pure code point of view, it may be less work to change it once in
the VM.  But from an overall system impact point of view, it's a big
change in how the system behaves just for filesystem metadata.

Agreed, but only if we don't do RMW in the buffer cache ... which may be
a good reason to keep it.


The other question is if the drive does RMW between 4k and whatever its
physical sector size, do we need to do anything to take advantage of
it ... as in what would altering the granularity of the page cache buy
us?

The real benefit is when and how the reads get scheduled.  We're able to
do a much better job pipelining the reads, controlling our caches and
reducing write latency by having the reads done up in the OS instead of
the drive.

I agree with all of that, but my question is still can we do this by
propagating alignment and chunk size information (i.e. the physical
sector size) like we do today.  If the FS knows the optimal I/O patterns
and tries to follow them, the odd cockup won't impact performance
dramatically.  The real question is can the FS make use of this layout
information *without* changing the page cache granularity?  Only if you
answer me "no" to this do I think we need to worry about changing page
cache granularity.

Realistically, if you look at what the I/O schedulers output on a
standard (spinning rust) workload, it's mostly large transfers.
Obviously these are misaligned at the ends, but we can fix some of that
in the scheduler.  Particularly if the FS helps us with layout.  My
instinct tells me that we can fix 99% of this with layout on the FS + io
schedulers ... the remaining 1% goes to the drive as needing to do RMW
in the device, but the net impact to our throughput shouldn't be that
great.

James


I think that the key to having the file system work with larger sectors is to
create them properly aligned and use the actual, native sector size as their FS
block size. Which is pretty much back to the original challenge.

Only if you think laying out stuff requires block size changes.  If a 4k
block filesystem's allocation algorithm tried to allocate on a 16k
boundary for instance, that gets us a lot of the performance without
needing a lot of alteration.


The key here is that we cannot assume that writes happen only during 
allocation/append mode.


Unless the block size enforces it, we will have non-aligned, small block IO done 
to allocated regions that won't get coalesced.


It's not even obvious that an ignorant 4k layout is going to be so
bad ... the RMW occurs only at the ends of the transfers, not in the
middle.  If we say 16k physical block and average 128k transfers,
probabilistically we misalign on 6 out of 31 sectors (or 19% of the
time).  We can make that better by increasing the transfer size (it
comes down to 10% for 256k transfers).


This really depends on the nature of the device. Some devices could produce very 
erratic performance or even (not today, but some day) reject the IO.





Teaching each and every file system to be aligned at the storage
granularity/minimum IO size when that is larger than the physical sector size is
harder I think.

But you're making assumptions about needing larger block sizes.  I'm
asking what can we do with what we currently have?  Increasing the
transfer size is a way of mitigating the problem with no FS support
whatever.  Adding alignment to the FS layout algorithm is another.  When
you've done both of those, I think you're already at the 99% aligned
case, which is &

Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-22 Thread Ric Wheeler

On 01/22/2014 01:13 PM, James Bottomley wrote:

On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:

On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:

On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote:

[ I like big sectors and I cannot lie ]

I think I might be sceptical, but I don't think that's showing in my
concerns ...


I really think that if we want to make progress on this one, we need
code and someone that owns it.  Nick's work was impressive, but it was
mostly there for getting rid of buffer heads.  If we have a device that
needs it and someone working to enable that device, we'll go forward
much faster.

Do we even need to do that (eliminate buffer heads)?  We cope with 4k
sector only devices just fine today because the bh mechanisms now
operate on top of the page cache and can do the RMW necessary to update
a bh in the page cache itself which allows us to do only 4k chunked
writes, so we could keep the bh system and just alter the granularity of
the page cache.


We're likely to have people mixing 4K drives and <fill in some other size here> on the same box.  We could just go with the biggest size and
use the existing bh code for the sub-pagesized blocks, but I really
hesitate to change VM fundamentals for this.

If the page cache had a variable granularity per device, that would cope
with this.  It's the variable granularity that's the VM problem.


 From a pure code point of view, it may be less work to change it once in
the VM.  But from an overall system impact point of view, it's a big
change in how the system behaves just for filesystem metadata.

Agreed, but only if we don't do RMW in the buffer cache ... which may be
a good reason to keep it.


The other question is if the drive does RMW between 4k and whatever its
physical sector size, do we need to do anything to take advantage of
it ... as in what would altering the granularity of the page cache buy
us?

The real benefit is when and how the reads get scheduled.  We're able to
do a much better job pipelining the reads, controlling our caches and
reducing write latency by having the reads done up in the OS instead of
the drive.

I agree with all of that, but my question is still can we do this by
propagating alignment and chunk size information (i.e. the physical
sector size) like we do today.  If the FS knows the optimal I/O patterns
and tries to follow them, the odd cockup won't impact performance
dramatically.  The real question is can the FS make use of this layout
information *without* changing the page cache granularity?  Only if you
answer me "no" to this do I think we need to worry about changing page
cache granularity.

Realistically, if you look at what the I/O schedulers output on a
standard (spinning rust) workload, it's mostly large transfers.
Obviously these are misaligned at the ends, but we can fix some of that
in the scheduler.  Particularly if the FS helps us with layout.  My
instinct tells me that we can fix 99% of this with layout on the FS + io
schedulers ... the remaining 1% goes to the drive as needing to do RMW
in the device, but the net impact to our throughput shouldn't be that
great.

James



I think that the key to having the file system work with larger sectors is to 
create them properly aligned and use the actual, native sector size as their FS 
block size. Which is pretty much back to the original challenge.


Teaching each and every file system to be aligned at the storage 
granularity/minimum IO size when that is larger than the physical sector size is 
harder I think.


ric
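
The "alignment and chunk size information ... like we do today" that James
mentions is the block layer's queue limits, visible under
/sys/block/<dev>/queue/ and through the ioctls below; the device path is a
placeholder:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
        int fd = open("/dev/sda", O_RDONLY);
        int logical = 0;
        unsigned int physical = 0, io_min = 0, io_opt = 0;

        ioctl(fd, BLKSSZGET, &logical);         /* logical sector size */
        ioctl(fd, BLKPBSZGET, &physical);       /* physical sector size */
        ioctl(fd, BLKIOMIN, &io_min);           /* minimum I/O granularity */
        ioctl(fd, BLKIOOPT, &io_opt);           /* optimal I/O size */

        printf("logical %d, physical %u, min io %u, optimal io %u\n",
               logical, physical, io_min, io_opt);
        return 0;
}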



Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-22 Thread Ric Wheeler

On 01/22/2014 11:03 AM, James Bottomley wrote:

On Wed, 2014-01-22 at 15:14 +, Chris Mason wrote:

On Wed, 2014-01-22 at 09:34 +, Mel Gorman wrote:

On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:

One topic that has been lurking forever at the edges is the current
4k limitation for file system block sizes. Some devices in
production today and others coming soon have larger sectors and it
would be interesting to see if it is time to poke at this topic
again.


Large block support was proposed years ago by Christoph Lameter
(http://lwn.net/Articles/232757/). I think I was just getting started
in the community at the time so I do not recall any of the details. I do
believe it motivated an alternative by Nick Piggin called fsblock though
(http://lwn.net/Articles/321390/). At the very least it would be nice to
know why neither was ever merged for those of us that were not around
at the time and who may not have the chance to dive through mailing list
archives between now and March.

FWIW, I would expect that a show-stopper for any proposal is requiring
high-order allocations to succeed for the system to behave correctly.


My memory is that Nick's work just didn't have the momentum to get
pushed in.  It all seemed very reasonable though, I think our hatred of
buffered heads just wasn't yet bigger than the fear of moving away.

But, the bigger question is how big are the blocks going to be?  At some
point (64K?) we might as well just make a log structured dm target and
have a single setup for both shingled and large sector drives.

There is no real point.  Even with 4k drives today using 4k sectors in
the filesystem, we still get 512 byte writes because of journalling and
the buffer cache.


I think that you are wrong here, James. Even with 512 byte drives, the IOs we 
send down tend to be 4k or larger. Do you have traces that show this and details?




The question is what would we need to do to support these devices and
the answer is "try to send IO in x byte multiples x byte aligned" this
really becomes an ioscheduler problem, not a supporting large page
problem.

James



Not that simple.

The requirement of some of these devices is that you *never* send down a 
partial write or an unaligned write.


Also keep in mind that larger block sizes allow us to track larger files with 
smaller amounts of metadata which is a second win.


Ric



Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-22 Thread Ric Wheeler

On 01/22/2014 09:34 AM, Mel Gorman wrote:

On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote:

On 01/22/2014 04:34 AM, Mel Gorman wrote:

On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:

One topic that has been lurking forever at the edges is the current
4k limitation for file system block sizes. Some devices in
production today and others coming soon have larger sectors and it
would be interesting to see if it is time to poke at this topic
again.


Large block support was proposed years ago by Christoph Lameter
(http://lwn.net/Articles/232757/). I think I was just getting started
in the community at the time so I do not recall any of the details. I do
believe it motivated an alternative by Nick Piggin called fsblock though
(http://lwn.net/Articles/321390/). At the very least it would be nice to
know why neither was ever merged for those of us that were not around
at the time and who may not have the chance to dive through mailing list
archives between now and March.

FWIW, I would expect that a show-stopper for any proposal is requiring
high-order allocations to succeed for the system to behave correctly.


I have a somewhat hazy memory of Andrew warning us that touching
this code takes us into dark and scary places.


That is a light summary. As Andrew tends to reject patches with poor
documentation in case we forget the details in 6 months, I'm going to guess
that he does not remember the details of a discussion from 7ish years ago.
This is where Andrew swoops in with a dazzling display of his eidetic
memory just to prove me wrong.

Ric, are there any storage vendors pushing for this right now?
Is someone working on this right now or planning to? If they are, have they
looked into the history of fsblock (Nick) and large block support (Christoph)
to see if they are candidates for forward porting or reimplementation?
I ask because without that person there is a risk that the discussion
will go as follows

Topic leader: Does anyone have an objection to supporting larger block
sizes than the page size?
Room: Send patches and we'll talk.



I will have to see if I can get a storage vendor to make a public statement, but 
there are vendors hoping to see this land in Linux in the next few years. I 
assume that anyone with a shipping device will have to at least emulate the 4KB 
sector size for years to come, but that there might be a significant performance 
win for platforms that can do a larger block.


Note that Windows seems to suffer from the exact same limitation, so we are not 
alone here with the VM page size / FS block size entanglement.


ric



Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-22 Thread Ric Wheeler

On 01/22/2014 04:34 AM, Mel Gorman wrote:

On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:

One topic that has been lurking forever at the edges is the current
4k limitation for file system block sizes. Some devices in
production today and others coming soon have larger sectors and it
would be interesting to see if it is time to poke at this topic
again.


Large block support was proposed years ago by Christoph Lameter
(http://lwn.net/Articles/232757/). I think I was just getting started
in the community at the time so I do not recall any of the details. I do
believe it motivated an alternative by Nick Piggin called fsblock though
(http://lwn.net/Articles/321390/). At the very least it would be nice to
know why neither was ever merged for those of us that were not around
at the time and who may not have the chance to dive through mailing list
archives between now and March.

FWIW, I would expect that a show-stopper for any proposal is requiring
high-order allocations to succeed for the system to behave correctly.



I have a somewhat hazy memory of Andrew warning us that touching this code takes 
us into dark and scary places.


ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-22 Thread Ric Wheeler

On 01/22/2014 04:34 AM, Mel Gorman wrote:

On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:

One topic that has been lurking forever at the edges is the current
4k limitation for file system block sizes. Some devices in
production today and others coming soon have larger sectors and it
would be interesting to see if it is time to poke at this topic
again.


Large block support was proposed years ago by Christoph Lameter
(http://lwn.net/Articles/232757/). I think I was just getting started
in the community at the time so I do not recall any of the details. I do
believe it motivated an alternative by Nick Piggin called fsblock though
(http://lwn.net/Articles/321390/). At the very least it would be nice to
know why neither were never merged for those of us that were not around
at the time and who may not have the chance to dive through mailing list
archives between now and March.

FWIW, I would expect that a show-stopper for any proposal is requiring
high-order allocations to succeed for the system to behave correctly.



I have a somewhat hazy memory of Andrew warning us that touching this code takes 
us into dark and scary places.


ric

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-22 Thread Ric Wheeler

On 01/22/2014 09:34 AM, Mel Gorman wrote:

On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote:

On 01/22/2014 04:34 AM, Mel Gorman wrote:

On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:

One topic that has been lurking forever at the edges is the current
4k limitation for file system block sizes. Some devices in
production today and others coming soon have larger sectors and it
would be interesting to see if it is time to poke at this topic
again.


Large block support was proposed years ago by Christoph Lameter
(http://lwn.net/Articles/232757/). I think I was just getting started
in the community at the time so I do not recall any of the details. I do
believe it motivated an alternative by Nick Piggin called fsblock though
(http://lwn.net/Articles/321390/). At the very least it would be nice to
know why neither were never merged for those of us that were not around
at the time and who may not have the chance to dive through mailing list
archives between now and March.

FWIW, I would expect that a show-stopper for any proposal is requiring
high-order allocations to succeed for the system to behave correctly.


I have a somewhat hazy memory of Andrew warning us that touching
this code takes us into dark and scary places.


That is a light summary. As Andrew tends to reject patches with poor
documentation in case we forget the details in 6 months, I'm going to guess
that he does not remember the details of a discussion from 7ish years ago.
This is where Andrew swoops in with a dazzling display of his eidetic
memory just to prove me wrong.

Ric, are there any storage vendor that is pushing for this right now?
Is someone working on this right now or planning to? If they are, have they
looked into the history of fsblock (Nick) and large block support (Christoph)
to see if they are candidates for forward porting or reimplementation?
I ask because without that person there is a risk that the discussion
will go as follows

Topic leader: Does anyone have an objection to supporting larger block
sizes than the page size?
Room: Send patches and we'll talk.



I will have to see if I can get a storage vendor to make a public statement, but 
there are vendors hoping to see this land in Linux in the next few years. I 
assume that anyone with a shipping device will have to at least emulate the 4KB 
sector size for years to come, but that there might be a significant performance 
win for platforms that can do a larger block.


Note that windows seems to suffer from the exact same limitation, so we are not 
alone here with the vm page size/fs block size entanglement


ric

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-22 Thread Ric Wheeler

On 01/22/2014 11:03 AM, James Bottomley wrote:

On Wed, 2014-01-22 at 15:14 +, Chris Mason wrote:

On Wed, 2014-01-22 at 09:34 +, Mel Gorman wrote:

On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:

One topic that has been lurking forever at the edges is the current
4k limitation for file system block sizes. Some devices in
production today and others coming soon have larger sectors and it
would be interesting to see if it is time to poke at this topic
again.


Large block support was proposed years ago by Christoph Lameter
(http://lwn.net/Articles/232757/). I think I was just getting started
in the community at the time so I do not recall any of the details. I do
believe it motivated an alternative by Nick Piggin called fsblock though
(http://lwn.net/Articles/321390/). At the very least it would be nice to
know why neither were never merged for those of us that were not around
at the time and who may not have the chance to dive through mailing list
archives between now and March.

FWIW, I would expect that a show-stopper for any proposal is requiring
high-order allocations to succeed for the system to behave correctly.


My memory is that Nick's work just didn't have the momentum to get
pushed in.  It all seemed very reasonable though, I think our hatred of
buffered heads just wasn't yet bigger than the fear of moving away.

But, the bigger question is how big are the blocks going to be?  At some
point (64K?) we might as well just make a log structured dm target and
have a single setup for both shingled and large sector drives.

There is no real point.  Even with 4k drives today using 4k sectors in
the filesystem, we still get 512 byte writes because of journalling and
the buffer cache.


I think that you are wrong here James. Even with 512 byte drives, the IO's we 
send down tend to be 4k or larger. Do you have traces that show this and details?




The question is what would we need to do to support these devices and
the answer is try to send IO in x byte multiples x byte aligned this
really becomes an ioscheduler problem, not a supporting large page
problem.

James



Not that simple.

The requirement of some of these devices are that you *never* send down a 
partial write or an unaligned write.


Also keep in mind that larger block sizes allow us to track larger files with 
smaller amounts of metadata which is a second win.


Ric

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-22 Thread Ric Wheeler

On 01/22/2014 01:13 PM, James Bottomley wrote:

On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:

On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:

On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote:

[ I like big sectors and I cannot lie ]

I think I might be sceptical, but I don't think that's showing in my
concerns ...


I really think that if we want to make progress on this one, we need
code and someone that owns it.  Nick's work was impressive, but it was
mostly there for getting rid of buffer heads.  If we have a device that
needs it and someone working to enable that device, we'll go forward
much faster.

Do we even need to do that (eliminate buffer heads)?  We cope with 4k
sector only devices just fine today because the bh mechanisms now
operate on top of the page cache and can do the RMW necessary to update
a bh in the page cache itself which allows us to do only 4k chunked
writes, so we could keep the bh system and just alter the granularity of
the page cache.


We're likely to have people mixing 4K drives and fill in some other
size here on the same box.  We could just go with the biggest size and
use the existing bh code for the sub-pagesized blocks, but I really
hesitate to change VM fundamentals for this.

If the page cache had a variable granularity per device, that would cope
with this.  It's the variable granularity that's the VM problem.


 From a pure code point of view, it may be less work to change it once in
the VM.  But from an overall system impact point of view, it's a big
change in how the system behaves just for filesystem metadata.

Agreed, but only if we don't do RMW in the buffer cache ... which may be
a good reason to keep it.


The other question is if the drive does RMW between 4k and whatever its
physical sector size, do we need to do anything to take advantage of
it ... as in what would altering the granularity of the page cache buy
us?

The real benefit is when and how the reads get scheduled.  We're able to
do a much better job pipelining the reads, controlling our caches and
reducing write latency by having the reads done up in the OS instead of
the drive.

I agree with all of that, but my question is still can we do this by
propagating alignment and chunk size information (i.e. the physical
sector size) like we do today.  If the FS knows the optimal I/O patterns
and tries to follow them, the odd cockup won't impact performance
dramatically.  The real question is can the FS make use of this layout
information *without* changing the page cache granularity?  Only if you
answer me no to this do I think we need to worry about changing page
cache granularity.
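
For reference, the block layer already exports exactly these hints per device 
through sysfs; a minimal sketch of reading them from userspace (the device 
name "sda" is just an example):

/*
 * Print the I/O topology hints the kernel exports for a block device.
 * A filesystem or mkfs tool can use these to align its layout.
 */
#include <stdio.h>

static long read_hint(const char *name)
{
    char path[256];
    long val = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/sys/block/sda/queue/%s", name);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
    }
    return val;
}

int main(void)
{
    printf("logical_block_size:  %ld\n", read_hint("logical_block_size"));
    printf("physical_block_size: %ld\n", read_hint("physical_block_size"));
    printf("minimum_io_size:     %ld\n", read_hint("minimum_io_size"));
    printf("optimal_io_size:     %ld\n", read_hint("optimal_io_size"));
    return 0;
}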

Realistically, if you look at what the I/O schedulers output on a
standard (spinning rust) workload, it's mostly large transfers.
Obviously these are misaligned at the ends, but we can fix some of that
in the scheduler.  Particularly if the FS helps us with layout.  My
instinct tells me that we can fix 99% of this with layout on the FS + io
schedulers ... the remaining 1% goes to the drive as needing to do RMW
in the device, but the net impact to our throughput shouldn't be that
great.

James



I think that the key to having the file system work with larger sectors is to 
create them properly aligned and use the actual, native sector size as their FS 
block size, which is pretty much back to the original challenge.


Teaching each and every file system to be aligned at the storage 
granularity/minimum IO size when that is larger than the physical sector size is 
harder, I think.


ric

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-22 Thread Ric Wheeler

On 01/22/2014 01:35 PM, James Bottomley wrote:

On Wed, 2014-01-22 at 13:17 -0500, Ric Wheeler wrote:

On 01/22/2014 01:13 PM, James Bottomley wrote:

On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:

On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:

On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote:

[ I like big sectors and I cannot lie ]

I think I might be sceptical, but I don't think that's showing in my
concerns ...


I really think that if we want to make progress on this one, we need
code and someone that owns it.  Nick's work was impressive, but it was
mostly there for getting rid of buffer heads.  If we have a device that
needs it and someone working to enable that device, we'll go forward
much faster.

Do we even need to do that (eliminate buffer heads)?  We cope with
4k-sector-only devices just fine today because the bh mechanisms now
operate on top of the page cache and can do the RMW necessary to update
a bh in the page cache itself, which allows us to do only 4k chunked
writes. So we could keep the bh system and just alter the granularity of
the page cache.


We're likely to have people mixing 4K drives and (fill in some other
size here) on the same box.  We could just go with the biggest size and
use the existing bh code for the sub-pagesized blocks, but I really
hesitate to change VM fundamentals for this.

If the page cache had a variable granularity per device, that would cope
with this.  It's the variable granularity that's the VM problem.


  From a pure code point of view, it may be less work to change it once in
the VM.  But from an overall system impact point of view, it's a big
change in how the system behaves just for filesystem metadata.

Agreed, but only if we don't do RMW in the buffer cache ... which may be
a good reason to keep it.


The other question is if the drive does RMW between 4k and whatever its
physical sector size, do we need to do anything to take advantage of
it ... as in what would altering the granularity of the page cache buy
us?

The real benefit is when and how the reads get scheduled.  We're able to
do a much better job pipelining the reads, controlling our caches and
reducing write latency by having the reads done up in the OS instead of
the drive.

I agree with all of that, but my question is still can we do this by
propagating alignment and chunk size information (i.e. the physical
sector size) like we do today.  If the FS knows the optimal I/O patterns
and tries to follow them, the odd cockup won't impact performance
dramatically.  The real question is can the FS make use of this layout
information *without* changing the page cache granularity?  Only if you
answer me no to this do I think we need to worry about changing page
cache granularity.

Realistically, if you look at what the I/O schedulers output on a
standard (spinning rust) workload, it's mostly large transfers.
Obviously these are misaligned at the ends, but we can fix some of that
in the scheduler.  Particularly if the FS helps us with layout.  My
instinct tells me that we can fix 99% of this with layout on the FS + io
schedulers ... the remaining 1% goes to the drive as needing to do RMW
in the device, but the net impact to our throughput shouldn't be that
great.

James


I think that the key to having the file system work with larger
sectors is to
create them properly aligned and use the actual, native sector size as
their FS
block size, which is pretty much back to the original challenge.

Only if you think laying out stuff requires block size changes.  If a 4k
block filesystem's allocation algorithm tried to allocate on a 16k
boundary for instance, that gets us a lot of the performance without
needing a lot of alteration.


The key here is that we cannot assume that writes happen only during 
allocation/append mode.


Unless the block size enforces it, we will have non-aligned, small block IO done 
to allocated regions that won't get coalesced.


It's not even obvious that an ignorant 4k layout is going to be so
bad ... the RMW occurs only at the ends of the transfers, not in the
middle.  If we say 16k physical block and average 128k transfers,
probabilistically we misalign on 6 out of 31 sectors (or 19% of the
time).  We can make that better by increasing the transfer size (it
comes down to 10% for 256k transfers).
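
A rough sketch of the arithmetic behind those numbers, assuming each end of a 
transfer lands at a random 4k offset within a physical block, so each end needs 
a device-side read-modify-write with probability (s-1)/s, where s is the number 
of 4k sectors per physical block:

/* Estimate what fraction of a transfer ends up in device-side RMW. */
#include <stdio.h>

static double rmw_fraction(double phys_kb, double xfer_kb)
{
    double s = phys_kb / 4.0;                /* 4k sectors per physical block */
    double rmw_blocks = 2.0 * (s - 1.0) / s; /* expected partial phys blocks  */
    return rmw_blocks * phys_kb / xfer_kb;   /* share of the transfer         */
}

int main(void)
{
    /* ~19% for 128k transfers, ~9-10% for 256k, with 16k physical blocks */
    printf("128k: %.1f%%\n", 100.0 * rmw_fraction(16.0, 128.0));
    printf("256k: %.1f%%\n", 100.0 * rmw_fraction(16.0, 256.0));
    return 0;
}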


This really depends on the nature of the device. Some devices could produce very 
erratic performance or even (not today, but some day) reject the IO.





Teaching each and every file system to be aligned at the storage
granularity/minimum IO size when that is larger than the physical
sector size is
harder I think.

But you're making assumptions about needing larger block sizes.  I'm
asking what can we do with what we currently have?  Increasing the
transfer size is a way of mitigating the problem with no FS support
whatever.  Adding alignment to the FS layout algorithm is another.  When
you've done both of those, I think you're already at the 99% aligned
case

Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-22 Thread Ric Wheeler
On 01/22/2014 01:37 PM, Chris Mason wrote:
 Circling back to what we might talk about at the conference, Ric do you
 have any ideas on when these drives might hit the wild?

 -chris

I will poke at vendors to see if we can get someone to make a public statement,
but I cannot do that for them.

Ric

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-21 Thread Ric Wheeler
One topic that has been lurking forever at the edges is the current 4k 
limitation for file system block sizes. Some devices in production today and 
others coming soon have larger sectors and it would be interesting to see if it 
is time to poke at this topic again.


LSF/MM seems to be pretty much the only event of the year at which most of the key 
people will be present, so it should be a great topic for a joint session.


Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: status of block-integrity

2014-01-07 Thread Ric Wheeler

On 12/23/2013 09:35 PM, Martin K. Petersen wrote:

"Christoph" == Christoph Hellwig  writes:

Christoph> We have the block integrity code to support DIF/DIX in the
Christoph> the tree for about 5 and a half years, and we still don't
Christoph> have a single consumer of it.

What do you mean? If you have a DIX-capable HBA (lpfc, qla2xxx, zfcp)
then integrity protection is active from the block layer down. The only
code that's not currently being exercised are the tag interleaving
functions.  I was hoping the FS people would use them for back pointers
but nobody seemed to bite.


Christoph> Given that we'll have a lot of work to do in this area with
Christoph> block multiqueue I think it's time to either kill it off for
Christoph> good or make sure we can actually use and test it.

I don't understand why multiqueue would require a lot of work? It's just
an extra scatterlist per request.

And obviously, if there's anything that needs to be done in this area
I'll be happy to do so...



One of the major knocks on Linux file systems (except for btrfs) that I hear is 
the lack of full data path checksums. DIF/DIX + xfs or ext4 done right will give 
us another answer here.  I don't think it will be common; it is a request that 
comes in most often from very large storage customers.
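
For context, the protection information that DIF/DIX carries with each logical 
block is an 8-byte tuple; a sketch of its layout (the struct name is only 
illustrative, field widths per T10, exact semantics vary by DIF type):

/*
 * Per-block T10 protection information: a CRC guard tag over the data,
 * an owner-defined application tag (the field a filesystem could use for
 * back pointers, as discussed above), and a reference tag that normally
 * carries the low 32 bits of the target LBA.
 */
#include <stdint.h>

struct pi_tuple_example {
    uint16_t guard_tag; /* CRC of the data block               */
    uint16_t app_tag;   /* application/owner defined           */
    uint32_t ref_tag;   /* usually the low 32 bits of the LBA  */
};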


We do have devices that support this and are working to get more vendor testing 
done, so I would hate to see us throw out the code instead of fixing it up for 
the end users that see value here.


I think that we can get this working & agree with the call to continue this 
discussion (here and at LSF :))


Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[LSF/MM TOPIC] [ATTEND] persistent memory progress, management of storage & file systems

2014-01-06 Thread Ric Wheeler


I would like to attend this year and continue to talk about the work on enabling 
the new class of persistent memory devices. Specifically, I am very interested in 
talking about both using a block driver under our existing stack and 
progress at the file system layer (adding xip/mmap tweaks to existing file 
systems and looking at new file systems).
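
As a minimal sketch of the xip/mmap access model mentioned here, an application 
simply maps a file on such a filesystem and stores to it directly (the path and 
size below are only examples, and the file is assumed to already exist and be 
at least one page long):

/*
 * Map a file and write through the mapping; msync covers the case where
 * the filesystem still buffers the data.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;
    int fd = open("/mnt/pmem/example.dat", O_RDWR);
    char *p;

    if (fd < 0)
        return 1;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return 1;
    }
    memcpy(p, "hello, persistent world", 24);
    msync(p, len, MS_SYNC);  /* flush the update to media */
    munmap(p, len);
    close(fd);
    return 0;
}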


We also have a lot of work left to do on unifying management; it would be good 
to resync on that.


Regards,

Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?

2013-11-23 Thread Ric Wheeler

On 11/23/2013 07:22 PM, Pavel Machek wrote:

On Sat 2013-11-23 18:01:32, Ric Wheeler wrote:

On 11/23/2013 03:36 PM, Pavel Machek wrote:

On Wed 2013-11-20 08:02:33, Howard Chu wrote:

Theodore Ts'o wrote:

Historically, Intel has been really good about avoiding this, but
since they've moved to using 3rd party flash controllers, I now advise
everyone who plans to use any flash storage, regardless of the
manufacturer, to do their own explicit power fail testing (hitting the
reset button is not good enough, you need to kick the power plug out
of the wall, or better yet, use a network controlled power switch
so you can repeat the power fail test dozens or hundreds of times for
your qualification run) before using flash storage in a mission
critical situation where you care about data integrity after a power
fail event.

Speaking of which, what would you use to automate this sort of test?
I'm thinking an SSD connected by eSATA, with an external power
supply, and the host running inside a VM. Drop power to the drive at
the same time as doing a kill -9 on the VM, then you can resume the
VM pretty quickly instead of waiting for a full reboot sequence.

I was just pulling power on sata drive.

It uncovered "interesting" stuff. I plugged power back, and kernel
re-established communication with that drive, but any settings with
hdparm were forgotten. I'd say there's some room for improvement
there...

Hi Pavel,

When you drop power, your drive normally loses temporary settings
(like a change to write cache, etc).

Depending on the class of the device, there are ways to make that
permanent (look at hdparm or sdparm for details).

This is a feature of the drive and its firmware, not something we
reset in the device each time it re-appears.

Yes, and I'm arguing that is a bug (as in, < 0.01% of people are using
hdparm correctly).


Almost no end users use hdparm. Those who do should read the man page and add 
the -K flag :)


Or system scripts that tweak settings should invoke it with the right flags.

Ric


So you used hdparm to disable the write cache so that ext3 can be safely
used on your hdd. Now you have a glitch on power. Then, the system continues
to operate in dangerous mode until reboot.

I guess it would be safer not to reattach drives after power
fail... (also I wonder what this does to data integrity. Drive lost
content of its writeback cache, but kernel continues... Journal will
not prevent data corruption in this case).

Pavel


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?

2013-11-23 Thread Ric Wheeler

On 11/23/2013 03:36 PM, Pavel Machek wrote:

On Wed 2013-11-20 08:02:33, Howard Chu wrote:

Theodore Ts'o wrote:

Historically, Intel has been really good about avoiding this, but
since they've moved to using 3rd party flash controllers, I now advise
everyone who plans to use any flash storage, regardless of the
manufacturer, to do their own explicit power fail testing (hitting the
reset button is not good enough, you need to kick the power plug out
of the wall, or better yet, use a network controlled power switch
so you can repeat the power fail test dozens or hundreds of times for
your qualification run) before using flash storage in a mission
critical situation where you care about data integrity after a power
fail event.

Speaking of which, what would you use to automate this sort of test?
I'm thinking an SSD connected by eSATA, with an external power
supply, and the host running inside a VM. Drop power to the drive at
the same time as doing a kill -9 on the VM, then you can resume the
VM pretty quickly instead of waiting for a full reboot sequence.

I was just pulling power on sata drive.

It uncovered "interesting" stuff. I plugged power back, and kernel
re-established communication with that drive, but any settings with
hdparm were forgotten. I'd say there's some room for improvement
there...

Pavel


Hi Pavel,

When you drop power, your drive normally loses temporary settings (like a change 
to write cache, etc).


Depending on the class of the device, there are ways to make that permanent 
(look at hdparm or sdparm for details).


This is a feature of the drive and its firmware, not something we reset in the 
device each time it re-appears.


Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?

2013-11-23 Thread Ric Wheeler

On 11/23/2013 01:27 PM, Stefan Priebe wrote:

Hi Ric,

Am 22.11.2013 21:37, schrieb Ric Wheeler:

On 11/22/2013 03:01 PM, Stefan Priebe wrote:

Hi Christoph,
Am 21.11.2013 11:11, schrieb Christoph Hellwig:


2. Some drives may implement CMD_FLUSH to return immediately i.e. no
guarantee the data is actually on disk.


In which case they aren't spec compliant.  While I've seen countless
data integrity bugs on lower end ATA SSDs I've not seen one that simply
ignores flush.  If you'd want to cheat that bluntly you'd be better
off just claiming to not have a writeback cache.

You solve your performance problem by completely disabling any chance
of having data integrity guarantees, and do so in a way that is not
detectable for applications or users.

If you have a workload with lots of small synchronous writes disabling
the writeback cache on the disk does indeed often help, especially with
the non-queueable FLUSH on all but the most recent ATA devices.


But this isn't correct for drives with capacitors like Crucial m500,
Intel DC S3500, DC S3700, is it? Shouldn't the Linux kernel have an
option to disable this for drives like these?
/sys/block/sdX/device/ignore_flush


If you know 100% for sure that your drive has a non-volatile write
cache, you can run the file system without the flushing by mounting "-o
nobarrier".  With most devices, this is not needed since they tend to
simply ignore the flushes if they know they are power failure safe.

Block level, we did something similar for users who are not running
through a file system for SCSI devices - James added support to echo
"temporary" into the sd's device's cache_type field:

See:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2ee3e26c673e75c05ef8b914f54fadee3d7b9c88 



At least for me, this does not work. I get the same awful speed as before - also 
the I/O waits stay the same. I'm still seeing CMD flushes going to the devices.


Is there any way to check whether the temporary got accepted and works?

I simply executed:
for i in /sys/class/scsi_disk/*/cache_type; do echo $i; echo temporary write 
back >$i; done


Stefan


What kernel are you running?  This is a new addition.

Also, you can "cat" the same file to see what it says.

Regards,

Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?

2013-11-22 Thread Ric Wheeler

On 11/22/2013 03:01 PM, Stefan Priebe wrote:

Hi Christoph,
Am 21.11.2013 11:11, schrieb Christoph Hellwig:


2. Some drives may implement CMD_FLUSH to return immediately i.e. no
guarantee the data is actually on disk.


In which case they aren't spec compliant.  While I've seen countless
data integrity bugs on lower end ATA SSDs I've not seen one that simply
ignores flush.  If you'd want to cheat that bluntly you'd be better
off just claiming to not have a writeback cache.

You solve your performance problem by completely disabling any chance
of having data integrity guarantees, and do so in a way that is not
detectable for applications or users.

If you have a workload with lots of small synchronous writes disabling
the writeback cache on the disk does indeed often help, especially with
the non-queueable FLUSH on all but the most recent ATA devices.


But this isn't correct for drives with capacitors like Crucial m500, Intel DC 
S3500, DC S3700, is it? Shouldn't the Linux kernel have an option to disable 
this for drives like these?

/sys/block/sdX/device/ignore_flush


If you know 100% for sure that your drive has a non-volatile write cache, you 
can run the file system without the flushing by mounting "-o nobarrier".  With 
most devices, this is not needed since they tend to simply ignore the flushes if 
they know they are power failure safe.


Block level, we did something similar for users who are not running through a 
file system for SCSI devices - James added support to echo "temporary" into the 
sd device's cache_type field:


See:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2ee3e26c673e75c05ef8b914f54fadee3d7b9c88

Ric




Again, what your patch does is to explicitly ignore the data integrity
request from the application.  While this will usually be way faster,
it will also cause data loss.  Simply disabling the writeback cache
feature of the disk using hdparm will give you much better performance
than issuing all the FLUSH commands, especially if they are non-queued,
but without breaking the guarantee to the application.





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] update xfs maintainers

2013-11-08 Thread Ric Wheeler

On 11/08/2013 05:17 PM, Ben Myers wrote:

Hey Ric,

On Fri, Nov 08, 2013 at 05:07:45PM -0500, Ric Wheeler wrote:

On 11/08/2013 05:03 PM, Ben Myers wrote:

Hey Ric,

On Fri, Nov 08, 2013 at 03:50:21PM -0500, Ric Wheeler wrote:

On 11/08/2013 03:46 PM, Ben Myers wrote:

Hey Christoph,

On Fri, Nov 08, 2013 at 11:34:24AM -0800, Christoph Hellwig wrote:

On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote:

Mark is replacing Alex as my backup because Alex is really busy at
Linaro and asked to be taken off awhile ago.  The holiday season is
coming up and I fully intend to go off my meds, turn into Fonzy the
bear, and eat my hat.  I need someone to watch the shop while I'm off
exploring on Mars.  I trust Mark to do that because he is totally
awesome.

Doing this as a unilateral decision is not something that will win you
a fan base.

It's posted for review.


While we never had anything resembling a democracy in Linux Kernel
development, making decisions without even contacting the major
contributor is wrong, twice so if the maintainer is a relatively minor
contributor to start with.

Just because it recently came up elsewhere I'd like to recite the
definition from Trond here again:


http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html

By many of the creative roles enlisted there it's clear that Dave should
be the maintainer.  He's been the main contributor and chief architect
for XFS for many years, while the maintainers came and went at the mercy
of SGI.  This is not meant to bad mouth either of you as I think you're
doing a reasonably good job compared to other maintainers, but at the
same time the direction is set by other people that have a much longer
involvement with the project, and having them officially in control
would help us forward a lot.  It would also avoid having to spend
considerable resources to train every new generation of SGI maintainer.

Coming to an end, I would like to maintain Dave Chinner as the primary
XFS maintainer for all the work he has done as biggest contributor and
architect of XFS since longer than I can remember, and I would love to
retain Ben Myers as a co-maintainer for all the good work he has done
maintaining and reviewing patches since November 2011.

I think we're doing a decent job too.  So thanks for that much at least.  ;)

I would also like to use this post as a public venue to condemn the
unilateral smokey backroom decisions about XFS maintainership that SGI is
trying to enforce on the community.

That really didn't happen Christoph.  It's not in my tree or in a pull request.

Linus, let me know what you want to do.  I do think we're doing a fair job over
here, and (geez) I'm just trying to add Mark as my backup since Alex is too
busy.  I know the RH people want more control, and that's understandable, but
they really don't need to replace me to get their code in.  Ouch.

Thanks,
Ben

Christoph is not a Red Hat person.

Jeff is from Oracle.

This is not a Red Hat vs SGI thing,

Sorry if my read on that was wrong.

I do appreciate the work and effort you and the SGI team put in but
think that this will be a good way to keep the community happier and
even more productive going forward.


Dave simply has earned the right
to take on the formal leadership role of maintainer.

Then we're gonna need some Reviewed-bys.  ;)

Those should come from the developers, thanks!

I actually do need your Reviewed-by.   We'll try and get this one in 3.13.  ;)

Thanks,
Ben


Happy to do that - I do think that Dave mostly posts from his redhat.com 
account, but he can comment once he gets back online.


Reviewed-by: Ric Wheeler 



From: Ben Myers 

xfs: update maintainers

Add Dave as maintainer of XFS.

Signed-off-by: Ben Myers 
---
  MAINTAINERS |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/MAINTAINERS
===
--- a/MAINTAINERS   2013-11-08 15:20:18.935186245 -0600
+++ b/MAINTAINERS   2013-11-08 15:22:50.685245977 -0600
@@ -9387,8 +9387,8 @@ F:drivers/xen/*swiotlb*
  XFS FILESYSTEM
  P:Silicon Graphics Inc
+M: Dave Chinner 
  M:Ben Myers 
-M: Alex Elder 
  M:x...@oss.sgi.com
  L:x...@oss.sgi.com
  W:http://oss.sgi.com/projects/xfs


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] update xfs maintainers

2013-11-08 Thread Ric Wheeler

On 11/08/2013 05:03 PM, Ben Myers wrote:

Hey Ric,

On Fri, Nov 08, 2013 at 03:50:21PM -0500, Ric Wheeler wrote:

On 11/08/2013 03:46 PM, Ben Myers wrote:

Hey Christoph,

On Fri, Nov 08, 2013 at 11:34:24AM -0800, Christoph Hellwig wrote:

On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote:

Mark is replacing Alex as my backup because Alex is really busy at
Linaro and asked to be taken off awhile ago.  The holiday season is
coming up and I fully intend to go off my meds, turn into Fonzy the
bear, and eat my hat.  I need someone to watch the shop while I'm off
exploring on Mars.  I trust Mark to do that because he is totally
awesome.

Doing this as a unilateral decision is not something that will win you
a fan base.

It's posted for review.


While we never had anything resembling a democracy in Linux Kernel
development, making decisions without even contacting the major
contributor is wrong, twice so if the maintainer is a relatively minor
contributor to start with.

Just because it recently came up elsewhere I'd like to recite the
definition from Trond here again:


http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html

By many of the creative roles enlisted there it's clear that Dave should
be the maintainer.  He's been the main contributor and chief architect
for XFS for many years, while the maintainers came and went at the mercy
of SGI.  This is not meant to bad mouth either of you as I think you're
doing a reasonably good job compared to other maintainers, but at the
same time the direction is set by other people that have a much longer
involvement with the project, and having them officially in control
would help us forward a lot.  It would also avoid having to spend
considerable resources to train every new generation of SGI maintainer.

Coming to an end, I would like to maintain Dave Chinner as the primary
XFS maintainer for all the work he has done as biggest contributor and
architect of XFS since longer than I can remember, and I would love to
retain Ben Myers as a co-maintainer for all the good work he has done
maintaining and reviewing patches since November 2011.

I think we're doing a decent job too.  So thanks for that much at least.  ;)

I would also like to use this post as a public venue to condemn the
unilateral smokey backroom decisions about XFS maintainership that SGI is
trying to enforce on the community.

That really didn't happen Christoph.  It's not in my tree or in a pull request.

Linus, let me know what you want to do.  I do think we're doing a fair job over
here, and (geez) I'm just trying to add Mark as my backup since Alex is too
busy.  I know the RH people want more control, and that's understandable, but
they really don't need to replace me to get their code in.  Ouch.

Thanks,
Ben

Christoph is not a Red Hat person.

Jeff is from Oracle.

This is not a Red Hat vs SGI thing,

Sorry if my read on that was wrong.


I do appreciate the work and effort you and the SGI team put in but think that 
this will be a good way to keep the community happier and even more productive 
going forward.


Dave simply has earned the right
to take on the formal leadership role of maintainer.

Then we're gonna need some Reviewed-bys.  ;)


Those should come from the developers, thanks!

Ric



From: Ben Myers 

xfs: update maintainers

Add Dave as maintainer of XFS.

Signed-off-by: Ben Myers 
---
  MAINTAINERS |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/MAINTAINERS
===
--- a/MAINTAINERS   2013-11-08 15:20:18.935186245 -0600
+++ b/MAINTAINERS   2013-11-08 15:22:50.685245977 -0600
@@ -9387,8 +9387,8 @@ F:drivers/xen/*swiotlb*
  
  XFS FILESYSTEM

  P:Silicon Graphics Inc
+M: Dave Chinner 
  M:Ben Myers 
-M: Alex Elder 
  M:x...@oss.sgi.com
  L:x...@oss.sgi.com
  W:http://oss.sgi.com/projects/xfs


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: XFS leadership and a new co-maintainer candidate

2013-11-08 Thread Ric Wheeler

On 11/08/2013 03:46 PM, Ben Myers wrote:

Hey Christoph,

On Fri, Nov 08, 2013 at 11:34:24AM -0800, Christoph Hellwig wrote:

On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote:

Mark is replacing Alex as my backup because Alex is really busy at
Linaro and asked to be taken off awhile ago.  The holiday season is
coming up and I fully intend to go off my meds, turn into Fonzy the
bear, and eat my hat.  I need someone to watch the shop while I'm off
exploring on Mars.  I trust Mark to do that because he is totally
awesome.


Doing this as a unilateral decision is not something that will win you
a fan base.

It's posted for review.


While we never had anything resembling a democracy in Linux Kernel
development, making decisions without even contacting the major
contributor is wrong, twice so if the maintainer is a relatively minor
contributor to start with.

Just because it recently came up elsewhere I'd like to recite the
definition from Trond here again:


http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html

By many of the creative roles enlisted there it's clear that Dave should
be the maintainer.  He's been the main contributor and chief architect
for XFS for many years, while the maintainers came and went at the mercy
of SGI.  This is not meant to bad mouth either of you as I think you're
doing a reasonably good job compared to other maintainers, but at the
same time the direction is set by other people that have a much longer
involvement with the project, and having them officially in control
would help us forward a lot.  It would also avoid having to spend
considerable resources to train every new generation of SGI maintainer.

Coming to an end, I would like to maintain Dave Chinner as the primary
XFS maintainer for all the work he has done as biggest contributor and
architect of XFS since longer than I can remember, and I would love to
retain Ben Myers as a co-maintainer for all the good work he has done
maintaining and reviewing patches since November 2011.

I think we're doing a decent job too.  So thanks for that much at least.  ;)
  

I would also like to use this post as a public venue to condemn the
unilateral smokey backroom decisions about XFS maintainership that SGI is
trying to enforce on the community.

That really didn't happen Christoph.  It's not in my tree or in a pull request.

Linus, let me know what you want to do.  I do think we're doing a fair job over
here, and (geez) I'm just trying to add Mark as my backup since Alex is too
busy.  I know the RH people want more control, and that's understandable, but
they really don't need to replace me to get their code in.  Ouch.

Thanks,
Ben


Christoph is not a Red Hat person.

Jeff is from Oracle.

This is not a Red Hat vs SGI thing; Dave has simply earned the right to take on 
the formal leadership role of maintainer.


Regards,

Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: XFS leadership and a new co-maintainer candidate

2013-11-08 Thread Ric Wheeler

On 11/08/2013 02:34 PM, Christoph Hellwig wrote:

On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote:

Mark is replacing Alex as my backup because Alex is really busy at
Linaro and asked to be taken off awhile ago.  The holiday season is
coming up and I fully intend to go off my meds, turn into Fonzy the
bear, and eat my hat.  I need someone to watch the shop while I'm off
exploring on Mars.  I trust Mark to do that because he is totally
awesome.


Doing this as a unilateral decision is not something that will win you
a fan base.

While we never had anything resembling a democracy in Linux Kernel
development, making decisions without even contacting the major
contributor is wrong, twice so if the maintainer is a relatively minor
contributor to start with.

Just because it recently came up elsewhere I'd like to recite the
definition from Trond here again:


http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html

By many of the creative roles enlisted there it's clear that Dave should
be the maintainer.  He's been the main contributor and chief architect
for XFS for many years, while the maintainers came and went at the mercy
of SGI.  This is not meant to bad mouth either of you as I think you're
doing a reasonably good job compared to other maintainers, but at the
same time the direction is set by other people that have a much longer
involvement with the project, and having them officially in control
would help us forward a lot.  It would also avoid having to spend
considerable resources to train every new generation of SGI maintainer.

Coming to an end, I would like to maintain Dave Chinner as the primary
XFS maintainer for all the work he has done as biggest contributor and
architect of XFS since longer than I can remember, and I would love to
retain Ben Myers as a co-maintainer for all the good work he has done
maintaining and reviewing patches since November 2011.


This sounds like exactly the right thing to do to me as well,

Ric



I would also like to use this post as a public venue to condemn the
unilateral smokey backroom decisions about XFS maintainership that
SGI is trying to enforce on the community.





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: XFS leadership and a new co-maintainer candidate

2013-11-08 Thread Ric Wheeler

On 11/08/2013 01:03 PM, Ben Myers wrote:

Hey Ric,

On Fri, Nov 08, 2013 at 06:03:41AM -0500, Ric Wheeler wrote:

In the XFS community, we have 2 clear leaders in terms of
contributions of significant features and depth of knowledge -
Christoph and Dave.

If you look at the number of patches submitted by developers since
3.0 who have more than 10 patches, we get the following:

 319 Author: Dave Chinner 
 163 Author: Christoph Hellwig 
  51 Author: Christoph Hellwig 
  35 Author: Linus Torvalds 
  34 Author: Chandra Seetharaman 
  29 Author: Al Viro 
  28 Author: Brian Foster 
  25 Author: Zhi Yong Wu 
  24 Author: Jeff Liu 
  21 Author: Jie Liu 
  20 Author: Mark Tinguely 
  16 Author: Dave Chinner 
  12 Author: Eric Sandeen 
  12 Author: Carlos Maiolino 

If we as a community had more capacity for patch review, Dave's
numbers would have jumped up even higher :)

It is certainly very welcome to bring new developers into our
community, but if we are going to add a co-maintainer for XFS, we
really need to have one of our two leading developers in that role.

Mark is replacing Alex as my backup because Alex is really busy at
Linaro and asked to be taken off awhile ago.  The holiday season is
coming up and I fully intend to go off my meds, turn into Fonzy the
bear, and eat my hat.  I need someone to watch the shop while I'm off
exploring on Mars.  I trust Mark to do that because he is totally
awesome.

-Ben


I don't mean any disrespect to you or to Mark, but maintainership is something 
that you earn over time by proving yourself in the community as a developer and 
a leader of the technology on a personal level.


It is not something that gets managed by the community of developers and has the 
key role of keeping the most frequent developers engaged and happy.  That has 
not been working for us as a community lately.


Dave Chinner is the obvious person to take on the maintainer role as someone who 
has an order of magnitude more code contributed than either of you (even combined).


Christoph, if he has time, would also be an excellent candidate.

Regards,

Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


XFS leadership and a new co-maintainer candidate

2013-11-08 Thread Ric Wheeler
In the XFS community, we have 2 clear leaders in terms of contributions of 
significant features and depth of knowledge - Christoph and Dave.


If you look at the number of patches submitted by developers since 3.0 who have 
more than 10 patches, we get the following:


319 Author: Dave Chinner 
163 Author: Christoph Hellwig 
 51 Author: Christoph Hellwig 
 35 Author: Linus Torvalds 
 34 Author: Chandra Seetharaman 
 29 Author: Al Viro 
 28 Author: Brian Foster 
 25 Author: Zhi Yong Wu 
 24 Author: Jeff Liu 
 21 Author: Jie Liu 
 20 Author: Mark Tinguely 
 16 Author: Dave Chinner 
 12 Author: Eric Sandeen 
 12 Author: Carlos Maiolino 

If we as a community had more capacity for patch review, Dave's numbers would 
have jumped up even higher :)


It is certainly very welcome to bring new developers into our community, but if 
we are going to add a co-maintainer for XFS, we really need to have one of our 
two leading developers in that role.


Best regards,

Ric



On 11/07/2013 09:23 PM, Ric Wheeler wrote:

Hi Ben,

How exactly did we decide to add a new co-maintainer? Shouldn't we have some 
discussion on the list and see some substantial history of contributions?


Best regards,

Ric


On 11/07/2013 05:08 PM, Mark Tinguely wrote:

Updated maintainer info.

Signed-off-by: Ben Myers 
Reviewed-by: Mark Tinguely 
---
 MAINTAINERS |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/MAINTAINERS
===
--- a/MAINTAINERS2013-11-07 15:42:04.554561805 -0600
+++ b/MAINTAINERS2013-11-07 15:42:59.034889770 -0600
@@ -9388,7 +9388,7 @@ F:drivers/xen/*swiotlb*
 XFS FILESYSTEM
 P:Silicon Graphics Inc
 M:Ben Myers 
-M:Alex Elder 
+M:Mark Tinguely 
 M:x...@oss.sgi.com
 L:x...@oss.sgi.com
 W:http://oss.sgi.com/projects/xfs

___
xfs mailing list
x...@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


___
xfs mailing list
x...@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


XFS leadership and a new co-maintainer candidate

2013-11-08 Thread Ric Wheeler
In the XFS community, we have 2 clear leaders in terms of contributions of 
significant feaures and depth of knowledge - Christoph and Dave.


If you look at the number of patches submitted by developers since 3.0 who have 
more than 10 patches, we get the following:


319 Author: Dave Chinner dchin...@redhat.com
163 Author: Christoph Hellwig h...@infradead.org
 51 Author: Christoph Hellwig h...@lst.de
 35 Author: Linus Torvalds torva...@linux-foundation.org
 34 Author: Chandra Seetharaman sekha...@us.ibm.com
 29 Author: Al Viro v...@zeniv.linux.org.uk
 28 Author: Brian Foster bfos...@redhat.com
 25 Author: Zhi Yong Wu wu...@linux.vnet.ibm.com
 24 Author: Jeff Liu jeff@oracle.com
 21 Author: Jie Liu jeff@oracle.com
 20 Author: Mark Tinguely tingu...@sgi.com
 16 Author: Dave Chinner da...@fromorbit.com
 12 Author: Eric Sandeen sand...@redhat.com
 12 Author: Carlos Maiolino cmaiol...@redhat.com

If we as a community had more capacity for patch review, Dave's numbers would 
have jumped up even higher :)


It is certainly very welcome to bring new developers into our community, but if 
we are going to add a co-maintainer for XFS, we really need to have one of our 
two leading developers in that role.


Best regards,

Ric



On 11/07/2013 09:23 PM, Ric Wheeler wrote:

Hi Ben,

How exactly did we decide to add a new co-maintainer? Shouldn't we have some 
discussion on the list and see some substantial history of contributions?


Best regards,

Ric


On 11/07/2013 05:08 PM, Mark Tinguely wrote:

Updated maintainer info.

Signed-off-by: Ben Myers b...@sgi.com
Reviewed-by: Mark Tinguely tingu...@sgi.com
---
 MAINTAINERS |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/MAINTAINERS
===
--- a/MAINTAINERS2013-11-07 15:42:04.554561805 -0600
+++ b/MAINTAINERS2013-11-07 15:42:59.034889770 -0600
@@ -9388,7 +9388,7 @@ F:drivers/xen/*swiotlb*
 XFS FILESYSTEM
 P:Silicon Graphics Inc
 M:Ben Myers b...@sgi.com
-M:Alex Elder el...@kernel.org
+M:Mark Tinguely tingu...@sgi.com
 M:x...@oss.sgi.com
 L:x...@oss.sgi.com
 W:http://oss.sgi.com/projects/xfs

___
xfs mailing list
x...@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


___
xfs mailing list
x...@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: XFS leadership and a new co-maintainer candidate

2013-11-08 Thread Ric Wheeler

On 11/08/2013 01:03 PM, Ben Myers wrote:

Hey Ric,

On Fri, Nov 08, 2013 at 06:03:41AM -0500, Ric Wheeler wrote:

In the XFS community, we have 2 clear leaders in terms of
contributions of significant feaures and depth of knowledge -
Christoph and Dave.

If you look at the number of patches submitted by developers since
3.0 who have more than 10 patches, we get the following:

 319 Author: Dave Chinner dchin...@redhat.com
 163 Author: Christoph Hellwig h...@infradead.org
  51 Author: Christoph Hellwig h...@lst.de
  35 Author: Linus Torvalds torva...@linux-foundation.org
  34 Author: Chandra Seetharaman sekha...@us.ibm.com
  29 Author: Al Viro v...@zeniv.linux.org.uk
  28 Author: Brian Foster bfos...@redhat.com
  25 Author: Zhi Yong Wu wu...@linux.vnet.ibm.com
  24 Author: Jeff Liu jeff@oracle.com
  21 Author: Jie Liu jeff@oracle.com
  20 Author: Mark Tinguely tingu...@sgi.com
  16 Author: Dave Chinner da...@fromorbit.com
  12 Author: Eric Sandeen sand...@redhat.com
  12 Author: Carlos Maiolino cmaiol...@redhat.com

If we as a community had more capacity for patch review, Dave's
numbers would have jumped up even higher :)

It is certainly very welcome to bring new developers into our
community, but if we are going to add a co-maintainer for XFS, we
really need to have one of our two leading developers in that role.

Mark is replacing Alex as my backup because Alex is really busy at
Linaro and asked to be taken off awhile ago.  The holiday season is
coming up and I fully intend to go off my meds, turn in to Fonzy the
bear, and eat my hat.  I need someone to watch the shop while I'm off
exploring on Mars.  I trust Mark to do that because he is totally
awesome.

-Ben


I don't mean any disrespect to you or to Mark, but maintainership is something 
that you earn over time by proving yourself in the community as a developer and 
a leader of the technology on a personal level.


It is not something to be handed down by a company; it is managed by the 
community of developers, and it has the key role of keeping the most frequent 
developers engaged and happy.  That has not been working for us as a community lately.


Dave Chinner is the obvious person to take on the maintainer role as someone who 
has an order of magnitude more code contributed than either of you (even combined).


Christoph, if he has time, would also be an excellent candidate.

Regards,

Ric



Re: XFS leadership and a new co-maintainer candidate

2013-11-08 Thread Ric Wheeler

On 11/08/2013 02:34 PM, Christoph Hellwig wrote:

On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote:

Mark is replacing Alex as my backup because Alex is really busy at
Linaro and asked to be taken off awhile ago.  The holiday season is
coming up and I fully intend to go off my meds, turn in to Fonzy the
bear, and eat my hat.  I need someone to watch the shop while I'm off
exploring on Mars.  I trust Mark to do that because he is totally
awesome.


Doing this as a unilateral decision is not something that will win you
a fan base.

While we never had anything resembling a democracy in Linux kernel
development, making decisions without even contacting the major
contributor is wrong, doubly so if the maintainer is a relatively minor
contributor to start with.

Just because it recently came up elsewhere I'd like to recite the
definition from Trond here again:


http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html

By many of the creative roles listed there it's clear that Dave should
be the maintainer.  He's been the main contributor and chief architect
for XFS for many years, while the maintainers came and went at the mercy
of SGI.  This is not meant to bad-mouth either of you as I think you're
doing a reasonably good job compared to other maintainers, but at the
same time the direction is set by other people that have a much longer
involvement with the project, and having them officially in control
would help us move forward a lot.  It would also avoid having to spend
considerable resources to train every new generation of SGI maintainer.

Coming to an end, I would like to nominate Dave Chinner as the primary
XFS maintainer for all the work he has done as the biggest contributor and
architect of XFS for longer than I can remember, and I would love to
retain Ben Myers as a co-maintainer for all the good work he has done
maintaining and reviewing patches since November 2011.


This sounds like exactly the right thing to do to me as well,

Ric



I would also like to use this post as a public venue to condemn the
unilateral smokey backroom decisions about XFS maintainership that
SGI is trying to enforce on the community.







Re: XFS leadership and a new co-maintainer candidate

2013-11-08 Thread Ric Wheeler

On 11/08/2013 03:46 PM, Ben Myers wrote:

Hey Christoph,

On Fri, Nov 08, 2013 at 11:34:24AM -0800, Christoph Hellwig wrote:

On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote:

Mark is replacing Alex as my backup because Alex is really busy at
Linaro and asked to be taken off awhile ago.  The holiday season is
coming up and I fully intend to go off my meds, turn in to Fonzy the
bear, and eat my hat.  I need someone to watch the shop while I'm off
exploring on Mars.  I trust Mark to do that because he is totally
awesome.


Doing this as a unilateral decision is not something that will win you
a fan base.

It's posted for review.


While we never had anything resembling a democracy in Linux kernel
development, making decisions without even contacting the major
contributor is wrong, doubly so if the maintainer is a relatively minor
contributor to start with.

Just because it recently came up elsewhere I'd like to recite the
definition from Trond here again:


http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html

By many of the creative roles listed there it's clear that Dave should
be the maintainer.  He's been the main contributor and chief architect
for XFS for many years, while the maintainers came and went at the mercy
of SGI.  This is not meant to bad-mouth either of you as I think you're
doing a reasonably good job compared to other maintainers, but at the
same time the direction is set by other people that have a much longer
involvement with the project, and having them officially in control
would help us move forward a lot.  It would also avoid having to spend
considerable resources to train every new generation of SGI maintainer.

Coming to an end, I would like to nominate Dave Chinner as the primary
XFS maintainer for all the work he has done as the biggest contributor and
architect of XFS for longer than I can remember, and I would love to
retain Ben Myers as a co-maintainer for all the good work he has done
maintaining and reviewing patches since November 2011.

I think we're doing a decent job too.  So thanks for that much at least.  ;)
  

I would also like to use this post as a public venue to condemn the
unilateral smokey backroom decisions about XFS maintainership that SGI is
trying to enforce on the community.

That really didn't happen Christoph.  It's not in my tree or in a pull request.

Linus, let me know what you want to do.  I do think we're doing a fair job over
here, and (geez) I'm just trying to add Mark as my backup since Alex is too
busy.  I know the RH people want more control, and that's understandable, but
they really don't need to replace me to get their code in.  Ouch.

Thanks,
Ben


Christoph is not a Red Hat person.

Jeff is from Oracle.

This is not a Red Hat vs SGI thing; Dave simply has earned the right to take on 
the formal leadership role of maintainer.


Regards,

Ric



Re: [PATCH] update xfs maintainers

2013-11-08 Thread Ric Wheeler

On 11/08/2013 05:03 PM, Ben Myers wrote:

Hey Ric,

On Fri, Nov 08, 2013 at 03:50:21PM -0500, Ric Wheeler wrote:

On 11/08/2013 03:46 PM, Ben Myers wrote:

Hey Christoph,

On Fri, Nov 08, 2013 at 11:34:24AM -0800, Christoph Hellwig wrote:

On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote:

Mark is replacing Alex as my backup because Alex is really busy at
Linaro and asked to be taken off awhile ago.  The holiday season is
coming up and I fully intend to go off my meds, turn in to Fonzy the
bear, and eat my hat.  I need someone to watch the shop while I'm off
exploring on Mars.  I trust Mark to do that because he is totally
awesome.

Doing this as a unilateral decision is not something that will win you
a fan base.

It's posted for review.


While we never had anything resembling a democracy in Linux kernel
development, making decisions without even contacting the major
contributor is wrong, doubly so if the maintainer is a relatively minor
contributor to start with.

Just because it recently came up elsewhere I'd like to recite the
definition from Trond here again:


http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html

By many of the creative roles listed there it's clear that Dave should
be the maintainer.  He's been the main contributor and chief architect
for XFS for many years, while the maintainers came and went at the mercy
of SGI.  This is not meant to bad-mouth either of you as I think you're
doing a reasonably good job compared to other maintainers, but at the
same time the direction is set by other people that have a much longer
involvement with the project, and having them officially in control
would help us move forward a lot.  It would also avoid having to spend
considerable resources to train every new generation of SGI maintainer.

Coming to an end, I would like to nominate Dave Chinner as the primary
XFS maintainer for all the work he has done as the biggest contributor and
architect of XFS for longer than I can remember, and I would love to
retain Ben Myers as a co-maintainer for all the good work he has done
maintaining and reviewing patches since November 2011.

I think we're doing a decent job too.  So thanks for that much at least.  ;)

I would also like to use this post as a public venue to condemn the
unilateral smokey backroom decisions about XFS maintainership that SGI is
trying to enforce on the community.

That really didn't happen Christoph.  It's not in my tree or in a pull request.

Linus, let me know what you want to do.  I do think we're doing a fair job over
here, and (geez) I'm just trying to add Mark as my backup since Alex is too
busy.  I know the RH people want more control, and that's understandable, but
they really don't need to replace me to get their code in.  Ouch.

Thanks,
Ben

Christoph is not a Red Hat person.

Jeff is from Oracle.

This is not a Red Hat vs SGI thing,

Sorry if my read on that was wrong.


I do appreciate the work and effort you and the SGI team put in but think that 
this will be a good way to keep the community happier and even more productive 
going forward.


Dave simply has earned the right
to take on the formal leadership role of maintainer.

Then we're gonna need some Reviewed-bys.  ;)


Those should come from the developers, thanks!

Ric



From: Ben Myers b...@sgi.com

xfs: update maintainers

Add Dave as maintainer of XFS.

Signed-off-by: Ben Myers b...@sgi.com
---
  MAINTAINERS |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/MAINTAINERS
===================================================================
--- a/MAINTAINERS	2013-11-08 15:20:18.935186245 -0600
+++ b/MAINTAINERS	2013-11-08 15:22:50.685245977 -0600
@@ -9387,8 +9387,8 @@ F:	drivers/xen/*swiotlb*
 
 XFS FILESYSTEM
 P:	Silicon Graphics Inc
+M:	Dave Chinner dchin...@fromorbit.com
 M:	Ben Myers b...@sgi.com
-M:	Alex Elder el...@kernel.org
 M:	x...@oss.sgi.com
 L:	x...@oss.sgi.com
 W:	http://oss.sgi.com/projects/xfs




Re: [PATCH] update xfs maintainers

2013-11-08 Thread Ric Wheeler

On 11/08/2013 05:17 PM, Ben Myers wrote:

Hey Ric,

On Fri, Nov 08, 2013 at 05:07:45PM -0500, Ric Wheeler wrote:

On 11/08/2013 05:03 PM, Ben Myers wrote:

Hey Ric,

On Fri, Nov 08, 2013 at 03:50:21PM -0500, Ric Wheeler wrote:

On 11/08/2013 03:46 PM, Ben Myers wrote:

Hey Christoph,

On Fri, Nov 08, 2013 at 11:34:24AM -0800, Christoph Hellwig wrote:

On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote:

Mark is replacing Alex as my backup because Alex is really busy at
Linaro and asked to be taken off awhile ago.  The holiday season is
coming up and I fully intend to go off my meds, turn in to Fonzy the
bear, and eat my hat.  I need someone to watch the shop while I'm off
exploring on Mars.  I trust Mark to do that because he is totally
awesome.

Doing this as a unilateral decision is not something that will win you
a fan base.

It's posted for review.


While we never had anything resembling a democracy in Linux kernel
development, making decisions without even contacting the major
contributor is wrong, doubly so if the maintainer is a relatively minor
contributor to start with.

Just because it recently came up elsewhere I'd like to recite the
definition from Trond here again:


http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html

By many of the creative roles listed there it's clear that Dave should
be the maintainer.  He's been the main contributor and chief architect
for XFS for many years, while the maintainers came and went at the mercy
of SGI.  This is not meant to bad-mouth either of you as I think you're
doing a reasonably good job compared to other maintainers, but at the
same time the direction is set by other people that have a much longer
involvement with the project, and having them officially in control
would help us move forward a lot.  It would also avoid having to spend
considerable resources to train every new generation of SGI maintainer.

Coming to an end, I would like to nominate Dave Chinner as the primary
XFS maintainer for all the work he has done as the biggest contributor and
architect of XFS for longer than I can remember, and I would love to
retain Ben Myers as a co-maintainer for all the good work he has done
maintaining and reviewing patches since November 2011.

I think we're doing a decent job too.  So thanks for that much at least.  ;)

I would also like to use this post as a public venue to condemn the
unilateral smokey backroom decisions about XFS maintainership that SGI is
trying to enforce on the community.

That really didn't happen Christoph.  It's not in my tree or in a pull request.

Linus, let me know what you want to do.  I do think we're doing a fair job over
here, and (geez) I'm just trying to add Mark as my backup since Alex is too
busy.  I know the RH people want more control, and that's understandable, but
they really don't need to replace me to get their code in.  Ouch.

Thanks,
Ben

Christoph is not a Red Hat person.

Jeff is from Oracle.

This is not a Red Hat vs SGI thing,

Sorry if my read on that was wrong.

I do appreciate the work and effort you and the SGI team put in but
think that this will be a good way to keep the community happier and
even more productive going forward.


Dave simply has earned the right
to take on the formal leadership role of maintainer.

Then we're gonna need some Reviewed-bys.  ;)

Those should come from the developers, thanks!

I actually do need your Reviewed-by.   We'll try and get this one in 3.13.  ;)

Thanks,
Ben


Happy to do that - I do think that Dave mostly posts from his redhat.com 
account, but he can comment once he gets back online.


Reviewed-by: Ric Wheeler rwhee...@redhat.com



From: Ben Myers b...@sgi.com

xfs: update maintainers

Add Dave as maintainer of XFS.

Signed-off-by: Ben Myers b...@sgi.com
---
  MAINTAINERS |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/MAINTAINERS
===================================================================
--- a/MAINTAINERS	2013-11-08 15:20:18.935186245 -0600
+++ b/MAINTAINERS	2013-11-08 15:22:50.685245977 -0600
@@ -9387,8 +9387,8 @@ F:	drivers/xen/*swiotlb*
 
 XFS FILESYSTEM
 P:	Silicon Graphics Inc
+M:	Dave Chinner dchin...@fromorbit.com
 M:	Ben Myers b...@sgi.com
-M:	Alex Elder el...@kernel.org
 M:	x...@oss.sgi.com
 L:	x...@oss.sgi.com
 W:	http://oss.sgi.com/projects/xfs




Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Ric Wheeler

On 09/30/2013 04:00 PM, Bernd Schubert wrote:
pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own 
interface? And userspace needs to address all of them differently? 


The NFS and SCSI groups have each defined a standard which Zach's proposal 
abstracts into a common user API.


Distributed file systems tend to be rather unique and do not have similar 
standard bodies, but a lot of them could hide server specific implementations 
under the current proposed interfaces.


What is not a good idea is to drag out the core, simple copy offload discussion 
for another 5 years to pull in every odd use case :)


ric



Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Ric Wheeler

On 09/30/2013 10:46 AM, Miklos Szeredi wrote:

On Mon, Sep 30, 2013 at 4:41 PM, Ric Wheeler  wrote:

The way the array based offload (and some software side reflink) works is
not a byte-by-byte copy. We cannot assume that a valid count can be returned
or that such a count would be an indication of a sequential segment of good
data.  The whole thing would normally have to be reissued.

To make that a true assumption, you would have to mandate that in each of
the specifications (and sw targets)...

You're missing my point.

  - user issues SIZE_MAX splice request
  - fs issues *64M* (or whatever) request to offload
  - when that completes *fully* then we return 64M to userspace
  - if it completes partially, then we return an error to userspace

Again, wouldn't that work?

Thanks,
Miklos


Yes, if you send a copy offload command and it works, you can assume that it 
worked fully. It would be pretty interesting if that were not true :)


If it fails, we cannot assume anything about partial completion.

Ric



Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Ric Wheeler

On 09/30/2013 10:38 AM, Miklos Szeredi wrote:

On Mon, Sep 30, 2013 at 4:28 PM, Ric Wheeler  wrote:

On 09/30/2013 10:24 AM, Miklos Szeredi wrote:

On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler  wrote:

On 09/30/2013 10:51 AM, Miklos Szeredi wrote:

On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields 
wrote:

My other worry is about interruptibility/restartability.  Ideas?

What happens on splice(from, to, 4G) and it's a non-reflink copy?
Can the page cache copy be made restartable?   Or should splice() be
allowed to return a short count?  What happens on (non-reflink) remote
copies and huge request sizes?

If I were writing an application that required copies to be
restartable,
I'd probably use the largest possible range in the reflink case but
break the copy into smaller chunks in the splice case.


The app really doesn't want to care about that.  And it doesn't want
to care about restartability, etc..  It's something the *kernel* has
to care about.   You just can't have uninterruptible syscalls that
sleep for a "long" time, otherwise first you'll just have annoyed
users pressing ^C in vain; then, if the sleep is even longer, warnings
about task sleeping too long.

One idea is letting splice() return a short count, and so the app can
safely issue SIZE_MAX requests and the kernel can decide if it can
copy the whole file in one go or if it wants to do it in smaller
chunks.


You cannot rely on a short count. That implies that an offloaded copy
starts
at byte 0 and the short count first bytes are all valid.

Huh?

- app calls splice(from, 0, to, 0, SIZE_MAX)
   1) VFS calls ->direct_splice(from, 0,  to, 0, SIZE_MAX)
  1.a) fs reflinks the whole file in a jiffy and returns the size of
the file
  1 b) fs does copy offload of, say, 64MB and returns 64M
   2) VFS does page copy of, say, 1MB and returns 1MB
- app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
...

The point is: the app is always doing the same (incrementing offset
with the return value from splice) and the kernel can decide what is
the best size it can service within a single uninterruptible syscall.

Wouldn't that work?


No.

Keep in mind that the offload operation in (1) might fail partially. The
target file (the copy) is allocated, the question is what ranges have valid
data.

You are talking about case 1.a, right?  So if the offload copy 0-64MB
fails partially, we return failure from splice, yet some of the copy
did succeed.  Is that the problem?  Why?

Thanks,
Miklos


The way the array based offload (and some software side reflink) works is not a 
byte-by-byte copy. We cannot assume that a valid count can be returned or that 
such a count would be an indication of a sequential segment of good data.  The 
whole thing would normally have to be reissued.


To make that a true assumption, you would have to mandate that in each of the 
specifications (and sw targets)...


ric



Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Ric Wheeler

On 09/30/2013 10:24 AM, Miklos Szeredi wrote:

On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler  wrote:

On 09/30/2013 10:51 AM, Miklos Szeredi wrote:

On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields 
wrote:

My other worry is about interruptibility/restartability.  Ideas?

What happens on splice(from, to, 4G) and it's a non-reflink copy?
Can the page cache copy be made restartable?   Or should splice() be
allowed to return a short count?  What happens on (non-reflink) remote
copies and huge request sizes?

If I were writing an application that required copies to be restartable,
I'd probably use the largest possible range in the reflink case but
break the copy into smaller chunks in the splice case.


The app really doesn't want to care about that.  And it doesn't want
to care about restartability, etc..  It's something the *kernel* has
to care about.   You just can't have uninterruptible syscalls that
sleep for a "long" time, otherwise first you'll just have annoyed
users pressing ^C in vain; then, if the sleep is even longer, warnings
about task sleeping too long.

One idea is letting splice() return a short count, and so the app can
safely issue SIZE_MAX requests and the kernel can decide if it can
copy the whole file in one go or if it wants to do it in smaller
chunks.


You cannot rely on a short count. That implies that an offloaded copy starts
at byte 0 and the short count first bytes are all valid.

Huh?

- app calls splice(from, 0, to, 0, SIZE_MAX)
  1) VFS calls ->direct_splice(from, 0,  to, 0, SIZE_MAX)
 1.a) fs reflinks the whole file in a jiffy and returns the size of the file
 1 b) fs does copy offload of, say, 64MB and returns 64M
  2) VFS does page copy of, say, 1MB and returns 1MB
- app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
...

The point is: the app is always doing the same (incrementing offset
with the return value from splice) and the kernel can decide what is
the best size it can service within a single uninterruptible syscall.

Wouldn't that work?

Thanks,
Miklos


No.

Keep in mind that the offload operation in (1) might fail partially. The target 
file (the copy) is allocated, the question is what ranges have valid data.


I don't see that (2) is interesting or really needs to be done in the kernel. 
If nothing else, it tends to confuse the discussion.


ric
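
A minimal userspace sketch of the loop Miklos describes above: issue the largest 
request that is left, let the kernel decide how much it will service in one call, 
and advance the offsets by the return value. The splice()-based file-to-file copy 
debated in this thread never merged in this exact form, so the sketch assumes the 
later copy_file_range(2) syscall (Linux 4.5+, glibc 2.27+) as the entry point; the 
program structure and names are illustrative, not an interface taken from the thread.

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
		return 1;
	}

	int in = open(argv[1], O_RDONLY);
	int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	struct stat st;
	if (in < 0 || out < 0 || fstat(in, &st) < 0) {
		perror("setup");
		return 1;
	}

	off_t off_in = 0, off_out = 0;
	size_t left = st.st_size;
	while (left > 0) {
		/* Ask for everything that is left; the kernel (or the
		 * filesystem / target device behind it) picks the chunk
		 * it is willing to service in one call. */
		ssize_t n = copy_file_range(in, &off_in, out, &off_out,
					    left, 0);
		if (n < 0) {
			perror("copy_file_range");
			return 1;
		}
		if (n == 0)	/* source shrank underneath us */
			break;
		left -= n;	/* off_in/off_out were advanced by n */
	}

	close(in);
	close(out);
	return 0;
}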



Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Ric Wheeler

On 09/30/2013 10:51 AM, Miklos Szeredi wrote:

On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields  wrote:

My other worry is about interruptibility/restartability.  Ideas?

What happens on splice(from, to, 4G) and it's a non-reflink copy?
Can the page cache copy be made restartable?   Or should splice() be
allowed to return a short count?  What happens on (non-reflink) remote
copies and huge request sizes?

If I were writing an application that required copies to be restartable,
I'd probably use the largest possible range in the reflink case but
break the copy into smaller chunks in the splice case.


The app really doesn't want to care about that.  And it doesn't want
to care about restartability, etc..  It's something the *kernel* has
to care about.   You just can't have uninterruptible syscalls that
sleep for a "long" time, otherwise first you'll just have annoyed
users pressing ^C in vain; then, if the sleep is even longer, warnings
about task sleeping too long.

One idea is letting splice() return a short count, and so the app can
safely issue SIZE_MAX requests and the kernel can decide if it can
copy the whole file in one go or if it wants to do it in smaller
chunks.

Thanks,
Miklos


You cannot rely on a short count. That implies that an offloaded copy starts at 
byte 0 and the short count first bytes are all valid.


I don't believe that is in fact required by all (any?) versions of the spec :)

Best just to fail and restart the whole operation.

Ric



Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Ric Wheeler

On 09/30/2013 10:34 AM, J. Bruce Fields wrote:

On Mon, Sep 30, 2013 at 02:20:30PM +0200, Miklos Szeredi wrote:

On Sat, Sep 28, 2013 at 11:20 PM, Ric Wheeler  wrote:


I don't see the safety argument very compelling either.  There are real
semantic differences, however: ENOSPC on a write to a
(apparently) already allocated block.  That could be a bit unexpected.
Do we
need a fallocate extension to deal with shared blocks?

The above has been the case for all enterprise storage arrays ever since
the invention of snapshots. The NFSv4.2 spec does allow you to set a
per-file attribute that causes the storage server to always preallocate
enough buffers to guarantee that you can rewrite the entire file, however
the fact that we've lived without it for said 20 years leads me to believe
that demand for it is going to be limited. I haven't put it at the top of the list
of features we care to implement...

Cheers,
 Trond


I agree - this has been common behaviour for a very long time in the array
space. Even without an array,  this is the same as overwriting a block in
btrfs or any file system with a read-write LVM snapshot.

Okay, I'm convinced.

So I suggest

  - mount(..., MNT_REFLINK): *allow* splice to reflink.  If this is not
set, fall back to page cache copy.
  - splice(... SPLICE_REFLINK):  fail non-reflink copy.  With this app
can force reflink.

Both are trivial to implement and make sure that no backward
incompatibility surprises happen.

My other worry is about interruptibility/restartability.  Ideas?

What happens on splice(from, to, 4G) and it's a non-reflink copy?
Can the page cache copy be made restartable?   Or should splice() be
allowed to return a short count?  What happens on (non-reflink) remote
copies and huge request sizes?

If I were writing an application that required copies to be restartable,
I'd probably use the largest possible range in the reflink case but
break the copy into smaller chunks in the splice case.

For that reason I don't like the idea of a mount option--the choice is
something that the application probably wants to make (or at least to
know about).

The NFS COPY operation, as specified in current drafts, allows for
asynchronous copies but leaves the state of the file undefined in the
case of an aborted COPY.  I worry that agreeing on standard behavior in
the case of an abort might be difficult.

--b.


I think that this is still confusing - reflink and array copy offload should not 
be differentiated.  In effect, they should often be the same order of magnitude 
in performance and possibly even use the same or very similar techniques (just 
on different sides of the initiator/target transaction!).


It is much simpler to let the application fail if the offload (or reflink) is 
not supported and let it fall back to the traditional copy.  Then you always send 
the largest possible offload operation and do whatever you do now if that fails.


thanks!

Ric



Re: [RFC] extending splice for copy offloading

2013-09-28 Thread Ric Wheeler

On 09/28/2013 11:20 AM, Myklebust, Trond wrote:

-Original Message-
From: Miklos Szeredi [mailto:mik...@szeredi.hu]
Sent: Saturday, September 28, 2013 12:50 AM
To: Zach Brown
Cc: J. Bruce Fields; Ric Wheeler; Anna Schumaker; Kernel Mailing List; Linux-
Fsdevel; linux-...@vger.kernel.org; Myklebust, Trond; Schumaker, Bryan;
Martin K. Petersen; Jens Axboe; Mark Fasheh; Joel Becker; Eric Wong
Subject: Re: [RFC] extending splice for copy offloading

On Fri, Sep 27, 2013 at 10:50 PM, Zach Brown  wrote:

Also, I don't get the first option above at all.  The argument is
that it's safer to have more copies?  How much safety does another
copy on the same disk really give you?  Do systems that do dedup
provide interfaces to turn it off per-file?

I don't see the safety argument very compelling either.  There are real
semantic differences, however: ENOSPC on a write to a
(apparently) already allocated block.  That could be a bit unexpected.  Do we
need a fallocate extension to deal with shared blocks?

The above has been the case for all enterprise storage arrays ever since the 
invention of snapshots. The NFSv4.2 spec does allow you to set a per-file 
attribute that causes the storage server to always preallocate enough buffers 
to guarantee that you can rewrite the entire file, however the fact that we've 
lived without it for said 20 years leads me to believe that demand for it is 
going to be limited. I haven't put it at the top of the list of features we care to 
implement...

Cheers,
Trond


I agree - this has been common behaviour for a very long time in the array 
space. Even without an array,  this is the same as overwriting a block in btrfs 
or any file system with a read-write LVM snapshot.


Regards,

Ric



Re: [RFC] extending splice for copy offloading

2013-09-27 Thread Ric Wheeler

On 09/27/2013 12:47 AM, Miklos Szeredi wrote:

On Thu, Sep 26, 2013 at 11:23 PM, Ric Wheeler  wrote:

On 09/26/2013 03:53 PM, Miklos Szeredi wrote:

On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown  wrote:


But I'm not sure it's worth the effort; 99% of the use of this
interface will be copying whole files.  And for that perhaps we need a
different API, one which has been discussed some time ago:
asynchronous copyfile() returns immediately with a pollable event
descriptor indicating copy progress, and some way to cancel the copy.
And that can internally rely on ->direct_splice(), with appropriate
algorithms for determine the optimal  chunk size.

And perhaps we don't.  Perhaps we can provide this much simpler
data-plane interface that works well enough for most everyone and can
avoid going down the async rat hole, yet again.

I think either buffering or async is needed to get good performance
without too much complexity in the app (which is not good).  Buffering
works quite well for regular I/O, so maybe its the way to go here as
well.

Thanks,
Miklos


Buffering  misses the whole point of the copy offload - the idea is *not* to
read or write the actual data in the most interesting cases which offload
the operation to a smart target device or file system.

I meant buffering the COPY, not the data.  Doing the COPY
synchronously will always incur a performance penalty, the amount
depending on the latency, which can be significant with networking.

We think of write(2) as a synchronous interface, because that's the
appearance we get from all that hard work the page cache and delayed
writeback code does to make an asynchronous operation look as if it
was synchronous.  So from a userspace API perspective a sync interface
is nice, but inside we almost always have async interfaces to do the
actual work.

Thanks,
Miklos


I think that you are an order of magnitude off here in thinking about the scale 
of the operations.


An enabled, synchronous copy offload to an array (or one that turns into a 
reflink locally) is effectively the cost of the call itself. Let's say no slower 
than one IO to a S-ATA disk (10ms?) as a pessimistic guess. Realistically, that 
call is much faster than that worst case number.


Copying any substantial amount of data - like the target workload of VM images 
or media files - would be hundreds of MB's per copy and that would take seconds 
or minutes.


We should really work on getting the basic mechanism working and robust without 
any complications, then we can look at real, measured performance and see if 
there is any justification for adding complexity.


thanks!

Ric







Re: [RFC] extending splice for copy offloading

2013-09-26 Thread Ric Wheeler

On 09/26/2013 02:55 PM, Zach Brown wrote:

On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:

On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown  wrote:

A client-side copy will be slower, but I guess it does have the
advantage that the application can track progress to some degree, and
abort it fairly quickly without leaving the file in a totally undefined
state--and both might be useful if the copy's not a simple constant-time
operation.

I suppose, but can't the app achieve a nice middle ground by copying the
file in smaller syscalls?  Avoid bulk data motion back to the client,
but still get notification every, I dunno, few hundred meg?

Yes.  And if "cp"  could just be switched from a read+write syscall
pair to a single splice syscall using the same buffer size.  And then
the user would only notice that things got faster in case of server
side copy.  No problems with long blocking times (at least not much
worse than it was).

Hmm, yes, that would be a nice outcome.


However "cp" doesn't do reflinking by default, it has a switch for
that.  If we just want "cp" and the like to use splice without fearing
side effects then by default we should try to be as close to
read+write behavior as possible.  No?

I guess?  I don't find requiring --reflink hugely compelling.  But there
it is.


That's what I'm really
worrying about when you want to wire up splice to reflink by default.
I do think there should be a flag for that.  And if on the block level
some magic happens, so be it.  It's not the fs developer's worry any
more ;)

Sure.  So we'd have:

- no flag default that forbids knowingly copying with shared references
   so that it will be used by default by people who feel strongly about
   their assumptions about independent write durability.

- a flag that allows shared references for people who would otherwise
   use the file system shared reference ioctls (ocfs2 reflink, btrfs
   clone) but would like it to also do server-side read/write copies
   over nfs without additional intervention.

- a flag that requires shared references for callers who don't want
   giant copies to take forever if they aren't instant.  (The qemu guys
   asked for this at Plumbers.)

I think I can live with that.

- z


This last flag should not prevent a remote target device (NFS or SCSI array) 
copy from working, though, since they often do reflink-like operations inside of 
the remote target device.


ric
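
The "require shared references" mode in Zach's list is the one with an obvious 
present-day shape: the per-filesystem clone ioctls he mentions (btrfs clone, 
ocfs2 reflink) were later unified behind the FICLONE ioctl. A hedged sketch of 
that mode follows - clone or fail, leaving the fallback decision to the caller; 
the helper name and the exact errno list are assumptions of this sketch, not an 
API from the thread.

#include <errno.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FICLONE, Linux 4.5+ */

/*
 * "Require shared references": succeed only if dst ends up sharing
 * blocks with src.  Returns 0 on success, -1 if the filesystem cannot
 * (or will not) reflink; the caller decides whether a data copy is an
 * acceptable fallback (the "allow sharing" mode) or not (the mode the
 * qemu folks asked for).
 */
static int clone_or_fail(int src_fd, int dst_fd)
{
	if (ioctl(dst_fd, FICLONE, src_fd) == 0)
		return 0;	/* whole file now shares extents with src */

	/* errno values treated as "no reflink here" are illustrative */
	if (errno == EOPNOTSUPP || errno == EXDEV || errno == EINVAL)
		return -1;

	perror("FICLONE");	/* some other error: bad fd, I/O, ... */
	return -1;
}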




Re: [RFC] extending splice for copy offloading

2013-09-26 Thread Ric Wheeler

On 09/26/2013 03:53 PM, Miklos Szeredi wrote:

On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown  wrote:


But I'm not sure it's worth the effort; 99% of the use of this
interface will be copying whole files.  And for that perhaps we need a
different API, one which has been discussed some time ago:
asynchronous copyfile() returns immediately with a pollable event
descriptor indicating copy progress, and some way to cancel the copy.
And that can internally rely on ->direct_splice(), with appropriate
algorithms for determine the optimal  chunk size.

And perhaps we don't.  Perhaps we can provide this much simpler
data-plane interface that works well enough for most everyone and can
avoid going down the async rat hole, yet again.

I think either buffering or async is needed to get good performance
without too much complexity in the app (which is not good).  Buffering
works quite well for regular I/O, so maybe its the way to go here as
well.

Thanks,
Miklos



Buffering  misses the whole point of the copy offload - the idea is *not* to 
read or write the actual data in the most interesting cases which offload the 
operation to a smart target device or file system.


Regards,

Ric



Re: [RFC] extending splice for copy offloading

2013-09-26 Thread Ric Wheeler

On 09/26/2013 11:34 AM, J. Bruce Fields wrote:

On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:

On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown  wrote:

A client-side copy will be slower, but I guess it does have the
advantage that the application can track progress to some degree, and
abort it fairly quickly without leaving the file in a totally undefined
state--and both might be useful if the copy's not a simple constant-time
operation.

I suppose, but can't the app achieve a nice middle ground by copying the
file in smaller syscalls?  Avoid bulk data motion back to the client,
but still get notification every, I dunno, few hundred meg?

Yes.  And if "cp"  could just be switched from a read+write syscall
pair to a single splice syscall using the same buffer size.

Will the various magic fs-specific copy operations become inefficient
when the range copied is too small?

(Totally naive question, as I have no idea how they really work.)

--b.


I think that is not really possible to tell when we invoke it. How long it takes
is very much target device (or file system, etc.) dependent. It could be as
simple as a reflink copying a smallish amount of metadata, or it could fall back
to a full byte-by-byte copy.  Also note that speed is not the only impact here;
some of the mechanisms do not actually consume more space (they just increment
shared data references).


It would probably make more sense to send it off to the target device and have
it return an error when not appropriate (then the app can fall back to the
old-fashioned copy).
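
That try-the-offload, fall-back-on-error pattern is roughly what applications
ended up with.  A minimal sketch, assuming a copy_file_range()-style call (which
only appeared in later kernels) with a plain read+write fallback:

/* Sketch of the fallback pattern: ask the kernel/target to copy, and drop
 * back to an ordinary read+write loop if the offload is not supported. */
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>

static int copy_fd(int src_fd, int dst_fd, size_t len)
{
	while (len > 0) {
		ssize_t n = copy_file_range(src_fd, NULL, dst_fd, NULL, len, 0);

		if (n > 0) {
			len -= (size_t)n;	/* offloaded (possibly reflinked) chunk */
			continue;
		}
		if (n == 0)
			return 0;		/* source hit EOF early */
		if (errno == EXDEV || errno == ENOSYS || errno == EOPNOTSUPP)
			break;			/* no offload here: copy by hand */
		return -1;
	}

	while (len > 0) {			/* old-fashioned fallback */
		char buf[1 << 16];
		ssize_t r = read(src_fd, buf, len < sizeof(buf) ? len : sizeof(buf));

		if (r <= 0)
			return (int)r;
		for (ssize_t off = 0; off < r; ) {
			ssize_t w = write(dst_fd, buf + off, (size_t)(r - off));

			if (w < 0)
				return -1;
			off += w;
		}
		len -= (size_t)r;
	}
	return 0;
}

Because the offload call may copy less than asked and return the amount done,
the loop naturally gets per-chunk progress, which is the middle ground discussed
earlier in the thread.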


ric




And then
the user would only notice that things got faster in case of server
side copy.  No problems with long blocking times (at least not much
worse than it was).

However "cp" doesn't do reflinking by default, it has a switch for
that.  If we just want "cp" and the like to use splice without fearing
side effects then by default we should try to be as close to
read+write behavior as possible.  No?   That's what I'm really
worrying about when you want to wire up splice to reflink by default.
I do think there should be a flag for that.  And if on the block level
some magic happens, so be it.  It's not the fs deverloper's worry any
more ;)

Thanks,
Miklos



Re: [GIT PULL] Btrfs

2013-09-13 Thread Ric Wheeler

On 09/12/2013 11:36 AM, Chris Mason wrote:

Mark Fasheh's offline dedup work is also here.  In this case offline
means the FS is mounted and active, but the dedup work is not done
inline during file IO.   This is a building block where utilities  are
able to ask the FS to dedup a series of extents.  The kernel takes
care of verifying the data involved really is the same.  Today this
involves reading both extents, but we'll continue to evolve the patches.


Nice feature!

Just a note: the "offline" label is really confusing. In other storage products,
this is typically called "out of band", since you are online, just not doing the
dedup synchronously during the actual write :)


Ric
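
For readers who want the concrete shape of the interface: the btrfs extent-same
ioctl described above was later generalized into the VFS-level FIDEDUPERANGE
ioctl, and out-of-band dedup of a single range looks roughly like this sketch:

/* Sketch only: ask the kernel to dedup one range of src into dst.  The
 * kernel reads and compares both ranges before sharing any extents. */
#include <linux/fs.h>
#include <stdlib.h>
#include <sys/ioctl.h>

static int dedup_range(int src_fd, __u64 src_off, __u64 len,
		       int dst_fd, __u64 dst_off)
{
	struct file_dedupe_range *arg;
	int ret;

	arg = calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
	if (!arg)
		return -1;

	arg->src_offset = src_off;
	arg->src_length = len;
	arg->dest_count = 1;
	arg->info[0].dest_fd = dst_fd;
	arg->info[0].dest_offset = dst_off;

	ret = ioctl(src_fd, FIDEDUPERANGE, arg);
	if (ret == 0 && arg->info[0].status != FILE_DEDUPE_RANGE_SAME)
		ret = -1;	/* data differed (or errored); nothing was shared */
	free(arg);
	return ret;
}

The "out of band" nature is exactly this: a userspace utility finds candidate
duplicates and asks after the fact, rather than the write path doing it inline.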





Re: [RFC PATCH] scsi: Add failfast mode to avoid infinite retry loop

2013-08-23 Thread Ric Wheeler

On 08/23/2013 05:10 AM, Eiichi Tsukata wrote:

(2013/08/21 3:09), Ewan Milne wrote:

On Tue, 2013-08-20 at 16:13 +0900, Eiichi Tsukata wrote:

(2013/08/19 23:30), James Bottomley wrote:

On Mon, 2013-08-19 at 18:39 +0900, Eiichi Tsukata wrote:

Hello,

This patch adds scsi device failfast mode to avoid infinite retry loop.

Currently, scsi error handling in scsi_decide_disposition() and
scsi_io_completion() unconditionally retries on some errors. This is because
retryable errors are thought to be temporary and the scsi device is expected to
recover from them soon. Normally, such a retry policy is appropriate because
the device will soon recover from the temporary error state.
But there is no guarantee that the device is able to recover from the error
state immediately. Some hardware errors may prevent the device from recovering.
Therefore a hardware error can result in an infinite command retry loop. In fact,
a CHECK_CONDITION error with the sense key UNIT_ATTENTION caused an infinite
retry loop in our environment. As the comments in the kernel source say,
UNIT_ATTENTION means the device probably saw a power glitch and is expected
to recover from that state immediately. But it seems that a hardware error
caused a permanent UNIT_ATTENTION condition.

To solve the above problem, this patch introduces a scsi device "failfast
mode".

If failfast mode is enabled, the retry count of every scsi command is limited to
scsi->allowed (== SD_MAX_RETRIES == 5). No command is allowed to retry
indefinitely; it fails immediately once the retry count exceeds the upper limit.
Failfast mode is useful on mission critical systems which are required
to keep running flawlessly because they need to fail over to the secondary
system once they detect failures.
By default, failfast mode is disabled because the failfast policy is not suitable
for most use cases, which can accept I/O latency due to a device hardware error.

To enable failfast mode (disabled by default):
   # echo 1 > /sys/bus/scsi/devices/X:X:X:X/failfast
To disable:
   # echo 0 > /sys/bus/scsi/devices/X:X:X:X/failfast

Furthermore, I'm planning to make the upper limit count configurable.
Currently, I have two plans to implement it:
(1) set the same upper limit count for all errors.
(2) set an upper limit count per error.
The first implementation is simple and easy to implement but not flexible.
Someone may want to set a different upper limit count for each error, depending
on the scsi device they use. The second implementation satisfies such a
requirement but can be too fine-grained and annoying to configure, because
there are so many scsi error codes. The default of 5 retries may be too many
for some errors but too few for others.

Which would be the appropriate implementation?
Any comments or suggestions are welcome as usual.
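
The mechanics being proposed are small; the decision the RFC wants to change
looks roughly like the sketch below, written with stub types rather than the
real scsi midlayer structures:

/* Sketch only: the retry-cap idea from this RFC.  "failfast" stands in for
 * the proposed per-device sysfs knob, "allowed" for SD_MAX_RETRIES. */
#include <stdbool.h>

struct stub_scsi_device {
	bool failfast;		/* proposed /sys/bus/scsi/devices/X:X:X:X/failfast */
};

struct stub_scsi_cmnd {
	struct stub_scsi_device *device;
	int retries;		/* retries issued so far */
	int allowed;		/* e.g. SD_MAX_RETRIES == 5 */
};

/* Today some dispositions retry until the command exceeds its timeout; in
 * failfast mode the retry count gets a hard cap and the command then fails. */
static bool stub_may_retry(const struct stub_scsi_cmnd *cmd)
{
	if (!cmd->device->failfast)
		return true;			/* current behaviour */
	return cmd->retries < cmd->allowed;	/* failfast: bounded retries */
}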


I'm afraid you'll need to propose another solution.  We have a large
selection of commands which, by design, retry until the command exceeds
its timeout.  UA is one of those (as are most of the others you're
limiting).  How do you kick this device out of its UA return (because
that's the recovery that needs to happen)?

James




Thanks for reviewing, James.

Originally, I planned that once the retry count exceeded its limit,
a monitoring tool would stop the server, using the scsi printk error message
as a trigger.
The current failfast mode implementation is that the command fails when the
retried command exceeds its limit. However, I noticed that just printing an
error message when the retry count is exceeded, without changing the retry
logic, would be enough to stop the server and take the failover path. Still,
there is no guarantee that a userspace application can work properly under a
disk failure condition.
So now I'm considering that just calling panic() on retry excess is better.

For that reason, I propose adding a "panic_on_error" sysfs parameter; if
panic_on_error mode is enabled, the server panics immediately once it detects
that the retry limit has been exceeded. Of course, it is disabled by default.

I would appreciate it if you could give me some comments.

Eiichi
--


For what it's worth, I've seen a report of a case where a storage array
returned a CHECK CONDITION with invalid sense data, which caused the
command to be retried indefinitely.


Thank you for commenting, Ewan.
I appreciate your information about indefinite retry on CHECK CONDITION.


I'm not sure what you can do about
this, if the device won't ever complete a command without an error.
Perhaps it should be offlined after sufficiently bad behavior.

I don't think you want to panic on an error, though.  In a clustered
environment it is possible that the other systems will all fail in the
same way, for example.

-Ewan



Yes, basically the device should be offlined on error detection.
Just offlining the disk is enough when the error occurs on a disk that is not
the OS-installed system disk. Panic goes too far in that case.

However, in a clustered environment where the computers each use their own disk
and do not share the same disk, calling panic() will be suitable when an error
Re: [Ksummit-2013-discuss] [ATTEND] How to act on LKML

2013-07-21 Thread Ric Wheeler

On 07/20/2013 01:04 PM, Ben Hutchings wrote:

On Fri, 2013-07-19 at 13:42 -0500, Felipe Contreras wrote:

On Fri, Jul 19, 2013 at 7:08 AM, Ingo Molnar wrote:

* Felipe Contreras wrote:

As Linus already pointed out, not everybody has to work with everybody.

That's not the point though, the point is to potentially roughly double
the creative brain capacity of the Linux kernel project.

Unfortunately that's impossible; we all know there aren't as many
women programmers as there are men.

In some countries, though not all.

But we also know (or should realise) that the gender ratio among
programmers in general is much less unbalanced than in some free
software communities including the Linux kernel developers.



Just a couple of data points to add.

When I was in graduate school in Israel, we had more women doing their PhD than
men. Not a huge sample, but it was interesting.


The counter-sample is the number of women coding on the kernel team at Red Hat.
We are at around zero percent. Certainly a sign that we need to do better,
regardless of the broader community challenges...


Ric




Re: [Ksummit-2013-discuss] [ATTEND] scsi-mq prototype discussion

2013-07-19 Thread Ric Wheeler

On 07/17/2013 12:52 AM, James Bottomley wrote:

On Tue, 2013-07-16 at 15:15 -0600, Jens Axboe wrote:

On Tue, Jul 16 2013, Nicholas A. Bellinger wrote:

On Sat, 2013-07-13 at 06:53 +, James Bottomley wrote:

On Fri, 2013-07-12 at 12:52 +0200, Hannes Reinecke wrote:

On 07/12/2013 03:33 AM, Nicholas A. Bellinger wrote:

On Thu, 2013-07-11 at 18:02 -0700, Greg KH wrote:

On Thu, Jul 11, 2013 at 05:23:32PM -0700, Nicholas A. Bellinger wrote:

Drilling down the work items ahead of a real mainline push is high on
priority list for discussion.

The parties to be included in such a discussion are:

   - Jens Axboe (blk-mq author)
   - James Bottomley (scsi maintainer)
   - Christoph Hellwig (scsi)
   - Martin Petersen (scsi)
   - Tejun Heo (block + libata)
   - Hannes Reinecke (scsi error recovery)
   - Kent Overstreet (block, per-cpu ida)
   - Stephen Cameron (scsi-over-pcie driver)
   - Andrew Vasquez (qla2xxx LLD)
   - James Smart (lpfc LLD)

Isn't this something that should have been discussed at the storage
mini-summit a few months ago?

The scsi-mq prototype, along with blk-mq (in its current form), did not
exist a few short months ago.  ;)


  It seems very specific to one subsystem to be a kernel summit topic,
don't you think?

It's no more subsystem specific than half of the other proposals so far,
and given its reach across multiple subsystems (block, scsi, target),
and the amount of off-list interest on the topic, I think it would make
a good candidate for discussion.


And it'll open up new approaches which previously were dismissed,
like re-implementing multipathing on top of scsi-mq, giving us the
single scsi device like other UNIX systems.

Also I do think there's quite some synergy to be had, as with blk-mq
we could nail each queue to a processor, which would eliminate the
need for locking.
Which could be useful for other subsystems, too.
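
A toy illustration of the "nail each queue to a processor" point, which is what
removes the need for a shared submission lock; the structures here are invented
for the sketch and are not the actual blk-mq code:

/* Toy sketch: per-CPU submission queues.  Each CPU only touches its own
 * queue, so the enqueue fast path takes no shared lock.  (In the kernel
 * this runs with preemption disabled; userspace would need pinned threads.) */
#define _GNU_SOURCE
#include <sched.h>

#define MAX_CPUS	256
#define QUEUE_DEPTH	1024

struct toy_queue {
	void *ring[QUEUE_DEPTH];
	unsigned int head, tail;	/* only advanced from the owning CPU */
};

static struct toy_queue queues[MAX_CPUS];

static int submit_request(void *req)
{
	int cpu = sched_getcpu();
	struct toy_queue *q = &queues[(cpu < 0 ? 0 : cpu) % MAX_CPUS];

	if (q->tail - q->head >= QUEUE_DEPTH)
		return -1;			/* queue full */
	q->ring[q->tail++ % QUEUE_DEPTH] = req;
	return 0;
}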

Let's start with discussing this on the list, please, and then see where
we go from there ...


Yes, the discussion is beginning to make its way to the list.  I've
mostly been waiting for blk-mq to get a wider review before taking the
early scsi-mq prototype driver to a larger public audience.

Primarily, I'm now reaching out to the people most affected by existing
scsi_request_fn() based performance limitations.  Most of them have
abandoned existing scsi_request_fn() based logic in favor of raw block
make_request() based drivers, and are now estimating the amount of
effort to move to an scsi-mq based approach.

Regardless, as the prototype progresses over the next months, having a
face-to-face discussion with the key parties in the room would be very
helpful given the large amount of effort involved to actually make this
type of generational shift in SCSI actually happen.

There's a certain amount of overlap with the aio/O_DIRECT work as well.
But if it's not a general session, could always be a BOF or something.

I'll second the argument that most technical topics probably DO belong
in a topic related workshop. But that leaves us with basically only
process related topics at KS, I don't think it hurts to have a bit of
tech meat on the bone too. At least I personally miss that part of KS
from years gone by.

Heh well, given that most of the block mq discussions at LSF have been
you saying you really should get around to cleaning up and posting the
code, you'll understand my wanting to see that happen first ...

I suppose we could try to run a storage workshop within KS, but I think
most of the mini summit slots have already gone.  There's also Plumbers
if all slots are gone (I would say that, being biased and on the
programme committee).  Ric is running the storage and Filesystems MC:

http://www.linuxplumbersconf.org/2013/ocw/events/LPC2013/tracks/159

James



And we are still looking for suggested topics - it would be great to have the
multi-queue work at Plumbers.


You can post a proposal for it (or other topics) here:

http://www.linuxplumbersconf.org/2013/ocw/events/LPC2013/proposals

Ric





Re: [Ksummit-2013-discuss] [ATTEND] How to act on LKML

2013-07-16 Thread Ric Wheeler

On 07/16/2013 07:53 PM, Myklebust, Trond wrote:

On Tue, 2013-07-16 at 19:31 -0400, Ric Wheeler wrote:

On 07/16/2013 07:12 PM, Sarah Sharp wrote:

On Tue, Jul 16, 2013 at 06:54:59PM -0400, Steven Rostedt wrote:

On Tue, 2013-07-16 at 15:43 -0700, Sarah Sharp wrote:


Yes, that's true.  Some kernel developers are better at moderating their
comments and tone towards individuals who are "sensitive".  Others
simply don't give a shit.  So we need to figure out how to meet
somewhere in the middle, in order to establish a baseline of civility.

I have to ask this because I'm thick, and don't really understand,
but ...

What problem exactly are we trying to solve here?

Personal attacks are not cool Steve.  Some people simply don't care if a
verbal tirade is directed at them.  Others do not want anyone to attack
them personally, but they're fine with people attacking their code.

Bystanders that don't understand the kernel community structure are
discouraged from contributing because they don't want to be verbally
abused, and they really don't want to see either personal attacks or
intense belittling, demeaning comments about code.

In order to make our community better, we need to figure out where the
baseline of "good" behavior is.  We need to define what behavior we want
from both maintainers and patch submitters.  E.g. "No regressions" and
"don't break userspace" and "no personal attacks".  That needs to be
written down somewhere, and it isn't.  If it's documented somewhere,
point me to the file in Documentation.  Hint: it's not there.

That is the problem.

Sarah Sharp

The problem you are pointing out - and it is a problem - makes us less effective
as a community.

Not really. Most of the people who already work as part of this
community are completely used to it. We've created the environment, and
have no problems with it.


You should never judge success by being popular with those people who are 
already contributing and put up with things. If you did that in business, you 
would never reach new customers.




Where it could possibly be a problem is when it comes to recruiting
_new_ members to our community. Particularly so given that some
journalists take a special pleasure in reporting particularly juicy
comments and antics. That would tend to scare off a lot of gun-shy
newbies.


That is my point - recruiting new members is made harder. As someone who
manages *a lot* of upstream kernel developers, I will add that it is not just
newcomers who find this occasionally offensive and off-putting.



On the other hand, it might tend to bias our recruitment toward people
of a more "special" disposition. Perhaps we finally need the services of
a social scientist to help us find out...



To be fair, we usually do very well at this, especially with newcomers to our
community.


I think that most of the problems come up between people who know each other
quite well and are friendly with each other in person. The problem is that when
you use the language you would use with good friends over drinks to tell them
they are being stupid, and you do it on a public list, you set a tone that reaches
far beyond your intended target. All of those newcomers also read this list and
do not see it as funny or friendly.


I really don't think that we have to be politically correct or overly kind to 
make things better.


As a very low bar, we could start by trying to avoid using language that would
get you fired if you sent it in an email to someone you have power over (someone
you manage directly, or whose career you indirectly control).


Ric




Re: [Ksummit-2013-discuss] [ATTEND] How to act on LKML

2013-07-16 Thread Ric Wheeler

On 07/16/2013 07:12 PM, Sarah Sharp wrote:

On Tue, Jul 16, 2013 at 06:54:59PM -0400, Steven Rostedt wrote:

On Tue, 2013-07-16 at 15:43 -0700, Sarah Sharp wrote:


Yes, that's true.  Some kernel developers are better at moderating their
comments and tone towards individuals who are "sensitive".  Others
simply don't give a shit.  So we need to figure out how to meet
somewhere in the middle, in order to establish a baseline of civility.

I have to ask this because I'm thick, and don't really understand,
but ...

What problem exactly are we trying to solve here?

Personal attacks are not cool Steve.  Some people simply don't care if a
verbal tirade is directed at them.  Others do not want anyone to attack
them personally, but they're fine with people attacking their code.

Bystanders that don't understand the kernel community structure are
discouraged from contributing because they don't want to be verbally
abused, and they really don't want to see either personal attacks or
intense belittling, demeaning comments about code.

In order to make our community better, we need to figure out where the
baseline of "good" behavior is.  We need to define what behavior we want
from both maintainers and patch submitters.  E.g. "No regressions" and
"don't break userspace" and "no personal attacks".  That needs to be
written down somewhere, and it isn't.  If it's documented somewhere,
point me to the file in Documentation.  Hint: it's not there.

That is the problem.

Sarah Sharp


The problem you are pointing out - and it is a problem - makes us less effective 
as a community.


Getting the balance right is clearly difficult in a large, diverse community, 
but I do think that the key is to focus criticism on the code or technical 
arguments and avoid attacks on the individual.


Being direct and funny in a critique is not the core of the issue,

Ric




Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point

2013-05-16 Thread Ric Wheeler

On 05/15/2013 04:03 PM, Zach Brown wrote:

On Wed, May 15, 2013 at 07:44:05PM +, Eric Wong wrote:

Why introduce a new syscall instead of extending sys_splice?

Personally, I think it's ugly to have different operations use the same
syscall just because their arguments match.


I agree with Zach - having a system call called "splice" do copy offloads is not 
intuitive.


This is a very reasonable name for something that battled its way through
several standards bodies (for NFS and SCSI :)), so we should give it a
reasonable name of its own.


thanks!

Ric



But that preference aside, sure, if the consensus is that we'd rather
use the splice() entry point then I can duck tape the pieces together to
make it work.


If the user doesn't need an out offset, then sendfile() should also be
able to transparently utilize COPY/CLONE_RANGE, too.

Perhaps, yeah.

- z





Re: openat(..., AT_UNLINKED) was Re: New copyfile system call - discuss before LSF?

2013-03-31 Thread Ric Wheeler

On 03/31/2013 07:18 PM, Pavel Machek wrote:

Hi!


Take a look at how many actively used filesystems out there that have
some variant of sillyrename(), and explain what you want to do in those
cases.

Well. Yes, there are non-unix filesystems around. You have to deal
with silly files on them, and this will not be different.

So this would be a local POSIX filesystem only solution to a problem
that has yet to be formulated?

Problem is the classical "create temp file then delete it" is racy. See the
archives. That is a useful & common operation.

Which race are you concerned with exactly?

User wants to test for a file with name "foo.txt"

* create "foo.txt~" (or whatever)
* write contents into "foo.txt~"
* rename "foo.txt~" to "foo.txt"

Until rename is done, the file does not exist and is not complete.
You will potentially have a garbage file to clean up if the program
(or system) crashes, but that is not racy in a classic sense, right?

Well. If people rsync from you, they will start fetching incomplete
foo.txt~. Plus the garbage issue.


That is not racy, just garbage (not trying to be pedantic, just trying to 
understand). I can see that the "~" file is annoying, but we have dealt with it 
for a *long* time :)


Until it has the right name (on either the source or target system for rsync), 
it is not the file you are looking for.



This is more of a garbage clean up issue?

Also, sometimes you want a temporary "file" that is
deleted. Terminals use it for history, etc...


There you would have a race: you can of course create a file, unlink it, and
still write to it, but you would have a potential empty file issue?


Ric




Re: openat(..., AT_UNLINKED) was Re: New copyfile system call - discuss before LSF?

2013-03-31 Thread Ric Wheeler

On 03/31/2013 06:50 PM, Pavel Machek wrote:

On Sun 2013-03-31 18:44:53, Myklebust, Trond wrote:

On Sun, 2013-03-31 at 20:32 +0200, Pavel Machek wrote:

Hmm. open_deleted_file() will still need to get a directory... so it
will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
be an acceptable interface?

...and what's the big plan to make this work on anything other than ext4 and 
btrfs?

Deleted but open files are from the original unix, so it should work on
anything unixy (minix, ext, ext2, ...).

minix, ext, ext2... are not under active development and haven't been
for more than a decade.

Take a look at how many actively used filesystems out there that have
some variant of sillyrename(), and explain what you want to do in those
cases.

Well. Yes, there are non-unix filesystems around. You have to deal
with silly files on them, and this will not be different.

So this would be a local POSIX filesystem only solution to a problem
that has yet to be formulated?

Problem is the classical "create temp file then delete it" is racy. See the
archives. That is a useful & common operation.


Which race are you concerned with exactly?

User wants to test for a file with name "foo.txt"

* create "foo.txt~" (or whatever)
* write contents into "foo.txt~"
* rename "foo.txt~" to "foo.txt"

Until rename is done, the file does not exist and is not complete. You will 
potentially have a garbage file to clean up if the program (or system) crashes, 
but that is not racy in a classic sense, right?


This is more of a garbage clean up issue?

Regards,

Ric



Problem is "atomically create a file at the target location with guaranteed
right content". That's also in the archives. Looks useful if someone
does rsync from your directory.

Non-POSIX filesystems have problems handling deleted files, but that
was always the case. That's one of the reasons they are seldom used
for root filesystems.

Pavel
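
An interface very much along the lines of open_deleted_file() did land in the
kernel around this time as O_TMPFILE.  A minimal sketch, assuming a filesystem
that supports it: create a file with no name, write and sync it, and only then
link it into the namespace under its final name, so a crash leaves no garbage
and readers never see a half-written file:

/* Sketch: unnamed file first, name last (assumes O_TMPFILE support). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int publish_file(const char *dir, const char *name,
			const void *buf, size_t len)
{
	char proc_path[64];
	int dirfd = open(dir, O_DIRECTORY | O_RDONLY);
	int fd;

	if (dirfd < 0)
		return -1;
	fd = openat(dirfd, ".", O_TMPFILE | O_WRONLY, 0644);
	if (fd < 0)
		goto fail_dir;		/* no O_TMPFILE here: fall back to tmp+rename */
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0)
		goto fail_fd;

	/* Unprivileged way to give the unnamed file its final name. */
	snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
	if (linkat(AT_FDCWD, proc_path, dirfd, name, AT_SYMLINK_FOLLOW) != 0)
		goto fail_fd;
	close(fd);
	close(dirfd);
	return 0;

fail_fd:
	close(fd);
fail_dir:
	close(dirfd);
	return -1;
}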




Re: openat(..., AT_UNLINKED) was Re: New copyfile system call - discuss before LSF?

2013-03-31 Thread Ric Wheeler

On 03/31/2013 06:50 PM, Pavel Machek wrote:

On Sun 2013-03-31 18:44:53, Myklebust, Trond wrote:

On Sun, 2013-03-31 at 20:32 +0200, Pavel Machek wrote:

Hmm. open_deleted_file() will still need to get a directory... so it
will still need a path. Perhaps open(/foo/bar/mnt, O_DELETED) would
be acceptable interface?

...and what's the big plan to make this work on anything other than ext4 and 
btrfs?

Deleted but open files are from original unix, so it should work on
anything unixy (minix, ext, ext2, ...).

minix, ext, ext2... are not under active development and haven't been
for more than a decade.

Take a look at how many actively used filesystems out there that have
some variant of sillyrename(), and explain what you want to do in those
cases.

Well. Yes, there are non-unix filesystems around. You have to deal
with silly files on them, and this will not be different.

So this would be a local POSIX filesystem only solution to a problem
that has yet to be formulated?

Problem is clasical create temp file then delete it is racy. See the
archives. That is useful  common operation.


Which race are you concerned with exactly?

User wants to test for a file with name foo.txt

* create foo.txt~ (or whatever)
* write contents into foo.txt~
* rename foo.txt~ to foo.txt

Until rename is done, the file does not exists and is not complete. You will 
potentially have a garbage file to clean up if the program (or system) crashes, 
but that is not racy in a classic sense, right?


This is more of a garbage clean up issue?

Regards,

Ric



Problem is atomicaly create file at target location with guaranteed
right content. That's also in the archives. Looks useful if someone
does rsync from your directory.

Non-POSIX filesystems have problems handling deleted files, but that
was always the case. That's one of the reasons they are seldomly used
for root filesystems.

Pavel


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: openat(..., AT_UNLINKED) was Re: New copyfile system call - discuss before LSF?

2013-03-31 Thread Ric Wheeler

On 03/31/2013 07:18 PM, Pavel Machek wrote:

Hi!


Take a look at how many actively used filesystems out there that have
some variant of sillyrename(), and explain what you want to do in those
cases.

Well. Yes, there are non-unix filesystems around. You have to deal
with silly files on them, and this will not be different.

So this would be a local POSIX filesystem only solution to a problem
that has yet to be formulated?

Problem is clasical create temp file then delete it is racy. See the
archives. That is useful  common operation.

Which race are you concerned with exactly?

User wants to test for a file with name foo.txt

* create foo.txt~ (or whatever)
* write contents into foo.txt~
* rename foo.txt~ to foo.txt

Until rename is done, the file does not exists and is not complete.
You will potentially have a garbage file to clean up if the program
(or system) crashes, but that is not racy in a classic sense, right?

Well. If people rsync from you, they will start fetching incomplete
foo.txt~. Plus the garbage issue.


That is not racy, just garbage (not trying to be pedantic, just trying to 
understand). I can see that the ~ file is annoying, but we have dealt with it 
for a *long* time :)


Until it has the right name (on either the source or target system for rsync), 
it is not the file you are looking for.



This is more of a garbage clean up issue?

Also, sometimes you want a temporary file that is deleted (but still open).
Terminals use it for history, etc...


There you would have a race: you can of course create a file, unlink it, and 
still write to it, but you would have a potential empty-file issue?
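
For comparison, a sketch of the classic create-then-unlink pattern being
referred to; the gap between open() and unlink() is exactly where a crash can
leave an empty file behind (the function and file names are illustrative):

    #include <fcntl.h>
    #include <unistd.h>

    /* Classic anonymous temp file: create it, then immediately unlink it.
     * The data stays accessible until the last fd is closed, but the name
     * is briefly visible, and a crash in between leaves garbage behind. */
    int open_anonymous_temp(const char *tmpname)
    {
            int fd = open(tmpname, O_RDWR | O_CREAT | O_EXCL, 0600);
            if (fd < 0)
                    return -1;
            unlink(tmpname);   /* a crash before this line leaves an empty file */
            return fd;         /* still readable/writable through the fd */
    }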


Ric




Re: New copyfile system call - discuss before LSF?

2013-03-30 Thread Ric Wheeler

On 03/30/2013 05:57 PM, Myklebust, Trond wrote:

On Mar 30, 2013, at 5:45 PM, Pavel Machek 
  wrote:


On Sat 2013-03-30 13:08:39, Andreas Dilger wrote:

On 2013-03-30, at 12:49 PM, Pavel Machek wrote:

Hmm, really? AFAICT it would be simple to provide an
open_deleted_file("directory") syscall. You'd call open_deleted_file(),
copy the source file into it, then fsync(), then link it into the filesystem.

That should provide the atomicity properties being asked for.

Actually, the open_deleted_file() syscall is quite useful for many
different things all by itself.  Lots of applications need to create
temporary files that are unlinked at application failure (without a
race if app crashes after creating the file, but before unlinking).
It also avoids exposing temporary files into the namespace if other
applications are accessing the directory.

Hmm. open_deleted_file() will still need to get a directory... so it
will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
be an acceptable interface?
Pavel

...and what's the big plan to make this work on anything other than ext4 and 
btrfs?

Cheers,
   Trond


I know that change can be a good thing, but are we really solving a pressing 
problem, given that application developers have dealt with open/rename as the way 
to get "atomic" file creation for several decades now?
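
For the sake of comparison, here is a rough sketch of what the flow quoted
above might look like from userspace. Neither open_deleted_file() nor the
final link-back call exist; both names below are hypothetical stand-ins for
the interface still being debated:

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* HYPOTHETICAL: the syscall discussed in this thread; the prototype
     * is only a guess at what was proposed. */
    extern int open_deleted_file(const char *dir, mode_t mode);

    /* Sketch of Pavel's flow: create an unlinked file in the target
     * directory, fill and fsync it, then give it a name. */
    int install_file(const char *dir, const char *name,
                     const void *buf, size_t len)
    {
            int fd = open_deleted_file(dir, 0644);
            if (fd < 0)
                    return -1;
            if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
                    close(fd);
                    return -1;   /* nothing to clean up: the file has no name */
            }
            /* HYPOTHETICAL: the "link it into the filesystem" step; the
             * exact interface (a linkat()-style call taking the open fd)
             * was never settled in this thread. */
            /* link_deleted_file(fd, dir, name); */
            close(fd);
            return 0;
    }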


Regards,

Ric


Re: New copyfile system call - discuss before LSF?

2013-02-25 Thread Ric Wheeler

On 02/25/2013 04:14 PM, Andy Lutomirski wrote:

On 02/21/2013 02:24 PM, Zach Brown wrote:

On Thu, Feb 21, 2013 at 08:50:27PM +, Myklebust, Trond wrote:

On Thu, 2013-02-21 at 21:00 +0100, Paolo Bonzini wrote:

Il 21/02/2013 15:57, Ric Wheeler ha scritto:

sendfile64() pretty much already has the right arguments for a
"copyfile", however it would be nice to add a 'flags' parameter: the
NFSv4.2 version would use that to specify whether or not to copy file
metadata.

That would seem to be enough to me and has the advantage that it is a
relatively obvious extension to something that is at least not totally
unknown to developers.

Do we need more than that for non-NFS paths I wonder? What does reflink
need or the SCSI mechanism?

For virt we would like to be able to specify arbitrary block ranges.
Copying an entire file helps some copy operations like storage
migration.  However, it is not enough to convert the guest's offloaded
copies to host-side offloaded copies.

So how would a system call based on sendfile64() plus my flag parameter
prevent an underlying implementation from meeting your criterion?

If I'm guessing correctly, sendfile64()+flags would be annoying because
it's missing an out_fd_offset.  The host will want to offload the
guest's copies by calling sendfile on block ranges of a guest disk image
file that correspond to the mappings of the in and out files in the
guest.

You could make it work with some locking and out_fd seeking to set the
write offset before calling sendfile64()+flags, but ugh.

  ssize_t sendfile(int out_fd, int in_fd, off_t in_offset,
                   off_t out_offset, size_t count, int flags);

That seems closer.

We might also want to pre-emptively offer iovs instead of offsets,
because that's the very first thing that's going to be requested after
people prototype having to iterate calling sendfile() for each
contiguous copy region.

I thought the first thing people would ask for is to atomically create a
new file and copy the old file into it (at least on local file systems).
  The idea is that nothing should see an empty destination file, either
by race or by crash.  (This feature would perhaps be described as a
pony, but it should be implementable.)

This would be like a better link(2).

--Andy


Why would this need to be atomic? That would seem to be a very difficult 
property to provide across all target types with multi-GB sized files...


Ric



Re: New copyfile system call - discuss before LSF?

2013-02-22 Thread Ric Wheeler

On 02/22/2013 10:47 AM, Paolo Bonzini wrote:

Il 21/02/2013 23:24, Zach Brown ha scritto:

You could make it work with some locking and out_fd seeking to set the
write offset before calling sendfile64()+flags, but ugh.

  ssize_t sendfile(int out_fd, int in_fd, off_t in_offset,
                   off_t out_offset, size_t count, int flags);

That seems closer.

We might also want to pre-emptively offer iovs instead of offsets,
because that's the very first thing that's going to be requested after
people prototype having to iterate calling sendfile() for each
contiguous copy region.

Indeed, I was about to propose exactly that.  So that would be
psendfilev.  I don't think a separate psendfile syscall is useful; it can
easily be provided at the libc level.

Paolo


This seems to be suspiciously close to a clear consensus on how to move forward 
after many years of spinning our wheels. Anyone want to promote an actual patch 
before we change our collective minds?
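
For the record, a rough guess at what that interface might look like. Neither
call below exists; the range struct and both signatures are extrapolated from
the names used in this thread, and the single-range psendfile() is the kind of
thing that could indeed live entirely in libc:

    #include <stddef.h>
    #include <sys/types.h>

    /* HYPOTHETICAL: an (offset, length) pair; the thread says "iovs", but
     * for file-to-file copies the vector would carry offsets, not pointers. */
    struct file_range {
            off_t  offset;
            size_t len;
    };

    /* HYPOTHETICAL: vectored range-copy call, name taken from the thread. */
    extern ssize_t psendfilev(int out_fd, int in_fd,
                              const struct file_range *in_ranges,
                              const struct file_range *out_ranges,
                              int nr_ranges, int flags);

    /* A single-range psendfile() then needs no syscall of its own: */
    static ssize_t psendfile(int out_fd, int in_fd, off_t in_off,
                             off_t out_off, size_t count, int flags)
    {
            struct file_range in  = { .offset = in_off,  .len = count };
            struct file_range out = { .offset = out_off, .len = count };

            return psendfilev(out_fd, in_fd, &in, &out, 1, flags);
    }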


Ric



Re: New copyfile system call - discuss before LSF?

2013-02-22 Thread Ric Wheeler

On 02/21/2013 11:13 PM, Myklebust, Trond wrote:

On Thu, 2013-02-21 at 23:05 +0100, Ric Wheeler wrote:

On 02/21/2013 09:00 PM, Paolo Bonzini wrote:

Il 21/02/2013 15:57, Ric Wheeler ha scritto:

sendfile64() pretty much already has the right arguments for a
"copyfile", however it would be nice to add a 'flags' parameter: the
NFSv4.2 version would use that to specify whether or not to copy file
metadata.

That would seem to be enough to me and has the advantage that it is a
relatively obvious extension to something that is at least not totally
unknown to developers.

Do we need more than that for non-NFS paths I wonder? What does reflink
need or the SCSI mechanism?

For virt we would like to be able to specify arbitrary block ranges.
Copying an entire file helps some copy operations like storage
migration.  However, it is not enough to convert the guest's offloaded
copies to host-side offloaded copies.

Paolo

I don't think that the NFS protocol allows arbitrary ranges, but the SCSI
commands are range based.

If I remember what the Windows people said at a SNIA event a few years back,
they have a requirement that the target file be pre-allocated (at least for the
SCSI-based copy). It is not clear to me where they iterate over that target file to do
the block range copies, but I suspect it is in their kernel.

The NFSv4.2 copy offload protocol _does_ allow the copying of arbitrary
byte ranges. The main target for that functionality is indeed
virtualisation and thin provisioning of virtual machines.



For background, here is a pointer to Fred Knight's SNIA talk on the SCSI support 
for offload:


https://snia.org/sites/default/files2/SDC2011/presentations/monday/FrederickKnight_Storage_Data_Movement_Offload.pdf

and a talk from Spencer Shepler that gives some detail on the NFS spec, 
including the "server side copy" bits:


https://snia.org/sites/default/files2/SDC2011/presentations/wednesday/SpencerShepler_IETF_NFSv4_Working_Group_v4.pdf

The talks both have references to the actual specs for the gory details.

Ric





Re: New copyfile system call - discuss before LSF?

2013-02-21 Thread Ric Wheeler

On 02/21/2013 09:00 PM, Paolo Bonzini wrote:

Il 21/02/2013 15:57, Ric Wheeler ha scritto:

sendfile64() pretty much already has the right arguments for a
"copyfile", however it would be nice to add a 'flags' parameter: the
NFSv4.2 version would use that to specify whether or not to copy file
metadata.

That would seem to be enough to me and has the advantage that it is a
relatively obvious extension to something that is at least not totally
unknown to developers.

Do we need more than that for non-NFS paths I wonder? What does reflink
need or the SCSI mechanism?

For virt we would like to be able to specify arbitrary block ranges.
Copying an entire file helps some copy operations like storage
migration.  However, it is not enough to convert the guest's offloaded
copies to host-side offloaded copies.

Paolo


I don't think that the NFS protocol allows arbitrary ranges, but the SCSI 
commands are range based.


If I remember what the Windows people said at a SNIA event a few years back, 
they have a requirement that the target file be pre-allocated (at least for the 
SCSI-based copy). It is not clear to me where they iterate over that target file to do 
the block range copies, but I suspect it is in their kernel.
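
To make that loop concrete, here is a sketch of one way it could look from
userspace: pre-allocate the target, then iterate per-extent copies. The
range-based copy call is hypothetical (modelled on the offset-based sendfile()
variant discussed elsewhere in this thread); fallocate() is the only real
interface used:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/types.h>

    /* HYPOTHETICAL: offset-based copy call as discussed in this thread. */
    extern ssize_t sendfile_range(int out_fd, int in_fd, off_t in_off,
                                  off_t out_off, size_t count, int flags);

    struct extent { off_t src; off_t dst; size_t len; };

    /* Pre-allocate the whole target, then offload one copy per extent,
     * roughly the model described above for the Windows/SCSI-based copy. */
    static int copy_extents(int out_fd, int in_fd, off_t target_size,
                            const struct extent *ext, int nr)
    {
            int i;

            if (fallocate(out_fd, 0, 0, target_size) < 0)
                    return -1;
            for (i = 0; i < nr; i++)
                    if (sendfile_range(out_fd, in_fd, ext[i].src,
                                       ext[i].dst, ext[i].len, 0) < 0)
                            return -1;
            return 0;
    }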


Ric
