Re: [LSF/MM TOPIC] The end of the DAX experiment
On 2/6/19 4:12 PM, Dan Williams wrote:

Before people get too excited, this isn't a proposal to kill DAX. The topic proposal is a discussion to resolve lingering open questions that currently motivate ext4 and xfs to scream "EXPERIMENTAL" when the current DAX facilities are enabled. There are 2 primary concerns to resolve: enumerate the remaining features/fixes, and identify a path to implement it all without regressing any existing application use cases. An enumeration of remaining projects follows, please expand this list if I missed something:

* "DAX" has no specific meaning by itself; users have 2 use cases for "DAX" capabilities: userspace cache management via MAP_SYNC, and page cache avoidance, where the latter aspect of DAX has no current api to discover / use it. The project is to supplement MAP_SYNC with a MAP_DIRECT facility and MADV_SYNC / MADV_DIRECT to indicate the same dynamically via madvise. Similar to O_DIRECT, MAP_DIRECT would be an application hint to avoid / minimize page cache usage, but no strict guarantee like what MAP_SYNC provides.

Sounds like a great topic to me. Having just gone through a new round of USENIX paper reviews, it is interesting to see how many academic systems are being pitched in this space (and most of them don't mention the kernel based xfs/ext4 with dax).

Regards,

Ric

* Resolve all "if (dax) goto fail;" patterns in the kernel. Outside of longterm-GUP (a topic in its own right) the projects here are XFS-reflink and XFS-realtime-device support. DAX+reflink effectively requires a given physical page to be mapped into two different inodes at different (page->index) offsets. The challenge is to support DAX-reflink without violating any existing application visible semantics; the operating assumption / strawman to debate is that experimental status is not blanket permission to go change existing semantics in backwards incompatible ways.

* Deprecate, but not remove, the DAX mount option. Too many flows depend on the option so it will never go away, but the facility is too coarse. Provide an option to enable MAP_SYNC and more-likely-to-do-something-useful-MAP_DIRECT on a per-directory basis. The current proposal is to allow this property to only be toggled while the directory is empty, to avoid the complications of racing page invalidation with new DAX mappings.

Secondary projects, i.e. important but I would submit are not in the critical path to removing the "experimental" designation:

* Filesystem-integrated badblock management. Hook up the media error notifications from libnvdimm to the filesystem to allow for operations like "list files with media errors" and "enumerate bad file offsets on a granularity smaller than a page". Another consideration along these lines is to integrate machine-check-handling and dynamic error notification into a filesystem interface. I've heard complaints that the sigaction() based mechanism to receive BUS_MCEERR_* information, while sufficient for the "System RAM" use case, is not precise enough for the "Persistent Memory / DAX" use case, where errors are repairable and sub-page error information is useful.

* Userfaultfd for file-backed mappings and DAX

Ideally all the usual DAX, persistent memory, and GUP suspects could be in the room to discuss this:

* Jan Kara
* Dave Chinner
* Christoph Hellwig
* Jeff Moyer
* Johannes Thumshirn
* Matthew Wilcox
* John Hubbard
* Jérôme Glisse
* MM folks for the reflink vs 'struct page' vs Xarray considerations
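[Editor's note: of the split described above, the MAP_SYNC half already exists, while MAP_DIRECT is only the proposal under discussion. A minimal userspace sketch of the existing MAP_SYNC path follows; the file path is made up, and it assumes a DAX-capable filesystem mount and a glibc recent enough to expose the flags.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	int fd = open("/mnt/pmem/data", O_RDWR);  /* hypothetical DAX file */
	if (fd < 0)
		return 1;

	/*
	 * MAP_SHARED_VALIDATE makes the kernel fail with EOPNOTSUPP if
	 * MAP_SYNC cannot be honoured, instead of silently ignoring it.
	 */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");   /* not a DAX / MAP_SYNC capable mapping */
		return 1;
	}

	/* With MAP_SYNC, a store plus a CPU cache flush is enough for
	   durability; no fsync() is needed to persist this page's
	   filesystem metadata. */
	strcpy(p, "hello");
	return 0;
}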
Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks
On 03/16/2016 06:23 PM, Chris Mason wrote:

On Tue, Mar 15, 2016 at 05:51:17PM -0700, Chris Mason wrote:

On Tue, Mar 15, 2016 at 07:30:14PM -0500, Eric Sandeen wrote:

On 3/15/16 7:06 PM, Linus Torvalds wrote:

On Tue, Mar 15, 2016 at 4:52 PM, Dave Chinner wrote: It is pretty clear that the onus is on the patch submitter to provide justification for inclusion, not for the reviewer/maintainer to have to prove that the solution is unworkable.

I agree, but quite frankly, performance is a good justification. So if Ted can give performance numbers, that's justification enough. We've certainly taken changes with less.

I've been away from ext4 for a while, so I'm really not on top of the mechanics of the underlying problem at the moment. But I would say that in addition to numbers showing that ext4 has trouble with unwritten extent conversion, we should have an explanation of why it can't be solved in a way that doesn't open up these concerns. XFS certainly has different mechanisms, but is the demonstrated workload problematic on XFS (or btrfs) as well? If not, can ext4 adopt any of the solutions that make the workload perform better on other filesystems?

When I've benchmarked this in the past, doing small random buffered writes into a preallocated extent was dramatically (3x or more) slower on xfs than doing them into a fully written extent. That was two years ago, but I can redo it.

So I re-ran some benchmarks, with 4K O_DIRECT random ios on nvme (4.5 kernel). This is O_DIRECT without O_SYNC. I don't think xfs will do commits for each IO into the prealloc file? O_SYNC makes it much slower, so hopefully I've got this right. The test runs for 60 seconds, and I used an iodepth of 4:

prealloc file: 32,000 iops
overwrite: 121,000 iops

If I bump the iodepth up to 512:

prealloc file: 33,000 iops
overwrite: 279,000 iops

For streaming writes, XFS converts prealloc to written much better when the IO isn't random. You can start seeing the difference at 16K sequential O_DIRECT writes, but really it's not a huge impact. The worst case is 4K:

prealloc file: 227MB/s
overwrite: 340MB/s

I can't think of sequential workloads where this will matter, since they will either end up with bigger IO or the performance impact won't get noticed.

-chris

I think that these numbers are the interesting ones, since a 4x slowdown is certainly significant. If you do the same test after hacking XFS preallocation as Dave suggested with xfs_db, do we get most of the performance back?

Ric
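[Editor's note: as a rough illustration of what is being measured in the "prealloc" case, here is a hedged C sketch: random 4K O_DIRECT writes into an unwritten-extent file, where each write also pays for an unwritten-to-written extent conversion. This is a synchronous, fixed-iteration approximation of the timed, queued runs above (the real runs used iodepths of 4 and 512); the file name and size are arbitrary.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE (1UL << 30)   /* 1 GiB, arbitrary for illustration */
#define BLOCK     4096UL

int main(void)
{
	void *buf;
	int fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);

	if (fd < 0 || posix_memalign(&buf, BLOCK, BLOCK))
		return 1;        /* O_DIRECT requires an aligned buffer */
	memset(buf, 0, BLOCK);

	/*
	 * "prealloc" case: the extents exist but are unwritten, so
	 * every write below also triggers an extent conversion.  For
	 * the "overwrite" case, write the whole file once sequentially
	 * before the timed loop instead.
	 */
	if (fallocate(fd, 0, 0, FILE_SIZE))
		return 1;

	for (long i = 0; i < 100000; i++) {
		off_t off = (random() % (FILE_SIZE / BLOCK)) * BLOCK;
		if (pwrite(fd, buf, BLOCK, off) != (ssize_t)BLOCK)
			return 1;
	}
	close(fd);
	return 0;
}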
Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks
On 03/17/2016 01:47 PM, Linus Torvalds wrote:

On Wed, Mar 16, 2016 at 10:18 PM, Gregory Farnum wrote:

So we've not asked for NO_HIDE_STALE on the mailing lists, but I think it was one of the problems Sage had using xfs in his BlueStore implementation and was a big part of why it moved to pure userspace. FileStore might use NO_HIDE_STALE in some places but it would be pretty limited. When it came up at Linux FAST we were discussing how it and similar things had been problems for us in the past and it would've been nice if they were upstream.

Hmm. So to me it really sounds like somebody should cook up a patch, but we shouldn't put it in the upstream kernel until we get numbers and actual "yes, we'd use this" from outside of google. I say "outside of google", because inside of google not only do we not get numbers, but google can maintain their own patch. But maybe Ted could at least post the patch google uses, and somebody in the Ceph community might want to at least try it out...

What *is* a big deal for FileStore (and would be easy to take advantage of) is the thematically similar O_NOMTIME flag, which is also about reducing metadata updates and got blocked on similar stupid-user grounds (although not security ones): http://thread.gmane.org/gmane.linux.kernel.api/10727.

Hmm. I don't hate that patch, because the NOATIME thing really does wonders on many loads. NOMTIME makes sense. It's not like you can't do this with utimes() anyway. That said, I do wonder if people wouldn't just prefer to expand on and improve on lazytime. Is there some reason you guys didn't use that?

As noted though, we've basically given up and are moving to a pure-userspace solution as quickly as we can. That argues against worrying about this all in the kernel unless there are other users.

Linus

Just a note: when Greg says "user space solution", Ceph is looking at writing directly to raw block devices, which is kind of a throwback to early enterprise database trends.

Ric
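[Editor's note: to make the flags concrete: O_NOATIME is the long-merged precedent Linus mentions, while O_NOMTIME was proposed but never landed, so the second open() in this sketch is hypothetical. lazytime, the alternative he suggests, is a mount option rather than an open flag.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* O_NOATIME exists today: skip access-time updates on reads
	   (only permitted for the file's owner or with CAP_FOWNER). */
	int fd = open("object.dat", O_RDWR | O_NOATIME);

	/*
	 * int fd2 = open("object.dat", O_RDWR | O_NOMTIME);
	 *
	 * ^ hypothetical: the proposed-but-unmerged flag from the
	 * thread above, which would suppress mtime/ctime updates the
	 * same way.  The merged alternative is "mount -o lazytime",
	 * which defers timestamp writeback rather than skipping it.
	 */
	if (fd >= 0)
		close(fd);
	return 0;
}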
Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks
On 03/13/2016 07:30 PM, Dave Chinner wrote:

On Fri, Mar 11, 2016 at 04:44:16PM -0800, Linus Torvalds wrote:

On Fri, Mar 11, 2016 at 4:35 PM, Theodore Ts'o wrote: At the end of the day it's about whether you trust the userspace program or not.

There's a big difference between "give the user rope", and "tie the rope in a noose and put a banana peel so that the user might stumble into the rope and hang himself", though. So I do think that Dave is right that we should also strive to make sure that our interfaces are not just secure in theory, but that they are also good interfaces to make mistakes less likely.

At which point I have to ask: how do we safely allow filesystems to expose stale data in files? There's a big "we need to trust userspace" component in every proposal that has been made so far - that's the part I have extreme trouble with. For example, what happens when a backup process running as root backs up a file that has exposed stale data? Yes, we could set the "NODUMP" flag on the inode to tell backup programs to skip backing up such files, but we're now trusting some random userspace application (e.g. tar, rsync, etc) not to do something we don't want it to do with the data in that file. AFAICT, we can't stop root from copying files that have exposed stale data or changing their ownership without some kind of special handling of "contains stale data" files within the kernel. At this point we are back to needing persistent tracking of the "exposed stale data" state in the inode as the only safe way to allow us to expose stale data. That's fairly ironic given that the stated purpose of exposing stale data through fallocate is to avoid the overhead of the existing mechanisms we use to track extents containing stale data.

I think that once we enter this mode, the local file system has effectively ceded its role to prevent stale data exposure to the upper layer. In effect, this ceases to be a normal file system for any enabled process if we control this through fallocate(), or for all processes if we do the brute force mount option that would be file system wide. That means we would not need to track this. Extents would be marked as if they always have had valid data (no more allocated-but-unwritten state). In the end, that is the actual goal - move this enforcement up a layer for overlay/user space file systems that are then responsible for policing this kind of thing.

Regards,

Ric

I think we _should_ give users rope, but maybe we should also make sure that there isn't some hidden rapidly spinning saw-blade right next to the rope that the user doesn't even think about.

IMO we already have a good, safe interface that provides the rope without the saw blades. I'm happy to be proven wrong, but IMO I don't see that we can provide stale data exposure in a safe, non-saw-bladey way without any kernel/filesystem side overhead.

Cheers,

Dave.
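[Editor's note: the "good, safe interface" Dave refers to is plain preallocation via unwritten extents. A minimal sketch of that path follows, with the trade-off against the proposed NO_HIDE_STALE behaviour noted in comments; the file name is illustrative.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("prealloc.dat", O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return 1;

	/*
	 * Plain preallocation: the new extents are flagged unwritten,
	 * so any read of this range returns zeroes, never old block
	 * contents.  Tracking and converting that unwritten state is
	 * exactly the overhead the NO_HIDE_STALE proposal wanted to
	 * skip - at the price of exposing stale data.
	 */
	if (fallocate(fd, 0, 0, 1UL << 30))
		return 1;

	close(fd);
	return 0;
}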
Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks
On 03/11/2016 12:03 AM, Linus Torvalds wrote:

On Thu, Mar 10, 2016 at 6:58 AM, Ric Wheeler <ricwhee...@gmail.com> wrote: What was objectionable at the time this patch was raised years back (not just to me, but to pretty much every fs developer at LSF/MM that year) centered on the concern that this would be viewed as a "performance" mode and we would get pressure to support this for non-privileged users. It gives any user effectively the ability to read the block device content for previously allocated data without restriction.

The sane way to do it would be to just check permissions of the underlying block device. That way, people can just set the permissions for that to whatever they want. If google right now uses some magical group for this, they could make the underlying block device be writable for that group. We can do the security check at the filesystem level, because we have sb->s_bdev->bd_inode, and if you have read and write permissions to that inode, you might as well have permission to create an unsafe hole. That doesn't sound very hacky to me.

Linus

I agree that this sounds quite reasonable.

Ric
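[Editor's note: a rough kernel-side sketch of the check Linus describes, using names from the 4.x trees of the era. This is illustrative only, not a tested patch and not the patch Google carries.]

#include <linux/fs.h>

static bool may_expose_stale_data(struct super_block *sb)
{
	/*
	 * Only allow the unsafe fallocate mode if the caller could
	 * read and write the filesystem's underlying block device
	 * anyway, since such a caller can already see any stale
	 * blocks directly.
	 */
	return inode_permission(sb->s_bdev->bd_inode,
				MAY_READ | MAY_WRITE) == 0;
}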
Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks
On 03/10/2016 04:38 AM, Theodore Ts'o wrote:

On Wed, Mar 09, 2016 at 02:20:31PM -0800, Gregory Farnum wrote: I really am sensitive to the security concerns, just know that if it's a permanent blocker you're essentially blocking out a growing category of disk users (who run on an awfully large number of disks!).

Or they just have to use kernels with out-of-tree patches installed. :-P If you want to consider how many disks Google has that are using this patch, I probably could have appealed to Linus and asked him to accept the patch if I forced the issue. The only reason why I didn't was that people like Ric Wheeler threatened to have distro-specific patches to disable the feature, and at the end of the day, I didn't care that much. After all, if it makes it harder for large scale cloud companies besides Google to create more efficient userspace cluster file systems, it's not like I was keeping the patch a secret.

So ultimately, if the Ceph developers want to make a case to Red Hat management that this is important, great. If not, it's not that hard for those people who need the patch and who are running large cloud infrastructures to simply apply the out-of-tree patch if they need it.

Cheers,

- Ted

What was objectionable at the time this patch was raised years back (not just to me, but to pretty much every fs developer at LSF/MM that year) centered on the concern that this would be viewed as a "performance" mode and we would get pressure to support this for non-privileged users. It gives any user effectively the ability to read the block device content for previously allocated data without restriction. At the time, I also don't recall seeing the patch posted on upstream lists for debate or justification.

As we discussed a few weeks back, I don't object to having support for doing this in carefully controlled ways for things like user space file systems. In effect, the problem of preventing other people's data from being handed over to the end user is taken on by that layer of code. I suspect that fits the use case at google and Ceph both.

Regards,

Ric
Re: Linux Foundation Technical Advisory Board Elections and Nomination process
I would like to nominate Sage Weil, with his consent. Sage has led the Ceph project since its inception, contributed to the kernel, and has had an influence on projects like OpenStack.

thanks!

Ric

On 10/06/2015 01:06 PM, Grant Likely wrote:

[Resending because I messed up the first one]

The elections for five of the ten members of the Linux Foundation Technical Advisory Board (TAB) are held every year[1]. This year the election will be at the 2015 Kernel Summit in Seoul, South Korea (probably on the Monday, 26 October) and will be open to all attendees of both Kernel Summit and Korea Linux Forum. Anyone is eligible to stand for election, simply send your nomination to: tech-board-disc...@lists.linux-foundation.org

We currently have 3 nominees for five places:

Thomas Gleixner
Greg Kroah-Hartman
Stephen Hemminger

The deadline for receiving nominations is the beginning of the event where the election is held (although please remember, if you're not going to be present, that things go wrong with both networks and mailing lists, so get your nomination in early).

Grant Likely, TAB Chair

[1] TAB members sit for a term of 2 years, and half of the board is up for election every year. Five of the seats are up for election now. The other five are half way through their term and will be up for election next year. The history of the TAB elections can be found here: https://docs.google.com/spreadsheets/d/1jGLQtul0taSRq_opYzJFALI7_34cS4RMS1_YQoTNCKA/edit#gid=0
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On 01/22/2014 01:37 PM, Chris Mason wrote:
> Circling back to what we might talk about at the conference, Ric do you
> have any ideas on when these drives might hit the wild?
>
> -chris

I will poke at vendors to see if we can get someone to make a public statement, but I cannot do that for them.

Ric
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On 01/22/2014 01:35 PM, James Bottomley wrote:

On Wed, 2014-01-22 at 13:17 -0500, Ric Wheeler wrote:

On 01/22/2014 01:13 PM, James Bottomley wrote:

On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:

On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:

On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote: [ I like big sectors and I cannot lie ]

I think I might be sceptical, but I don't think that's showing in my concerns ...

I really think that if we want to make progress on this one, we need code and someone that owns it. Nick's work was impressive, but it was mostly there for getting rid of buffer heads. If we have a device that needs it and someone working to enable that device, we'll go forward much faster.

Do we even need to do that (eliminate buffer heads)? We cope with 4k sector only devices just fine today because the bh mechanisms now operate on top of the page cache and can do the RMW necessary to update a bh in the page cache itself, which allows us to do only 4k chunked writes, so we could keep the bh system and just alter the granularity of the page cache.

We're likely to have people mixing 4K drives and [fill in some other size here] on the same box. We could just go with the biggest size and use the existing bh code for the sub-pagesized blocks, but I really hesitate to change VM fundamentals for this.

If the page cache had a variable granularity per device, that would cope with this. It's the variable granularity that's the VM problem.

From a pure code point of view, it may be less work to change it once in the VM. But from an overall system impact point of view, it's a big change in how the system behaves just for filesystem metadata.

Agreed, but only if we don't do RMW in the buffer cache ... which may be a good reason to keep it. The other question is if the drive does RMW between 4k and whatever its physical sector size, do we need to do anything to take advantage of it ... as in what would altering the granularity of the page cache buy us?

The real benefit is when and how the reads get scheduled. We're able to do a much better job pipelining the reads, controlling our caches and reducing write latency by having the reads done up in the OS instead of the drive.

I agree with all of that, but my question is still can we do this by propagating alignment and chunk size information (i.e. the physical sector size) like we do today. If the FS knows the optimal I/O patterns and tries to follow them, the odd cockup won't impact performance dramatically. The real question is can the FS make use of this layout information *without* changing the page cache granularity? Only if you answer me "no" to this do I think we need to worry about changing page cache granularity.

Realistically, if you look at what the I/O schedulers output on a standard (spinning rust) workload, it's mostly large transfers. Obviously these are misaligned at the ends, but we can fix some of that in the scheduler. Particularly if the FS helps us with layout. My instinct tells me that we can fix 99% of this with layout on the FS + io schedulers ... the remaining 1% goes to the drive as needing to do RMW in the device, but the net impact to our throughput shouldn't be that great.

James

I think that the key to having the file system work with larger sectors is to create them properly aligned and use the actual, native sector size as their FS block size.

Which is pretty much back to the original challenge.

Only if you think laying out stuff requires block size changes.

If a 4k block filesystem's allocation algorithm tried to allocate on a 16k boundary for instance, that gets us a lot of the performance without needing a lot of alteration.

The key here is that we cannot assume that writes happen only during allocation/append mode. Unless the block size enforces it, we will have non-aligned, small block IO done to allocated regions that won't get coalesced.

It's not even obvious that an ignorant 4k layout is going to be so bad ... the RMW occurs only at the ends of the transfers, not in the middle. If we say 16k physical block and average 128k transfers, probabilistically we misalign on 6 out of 31 sectors (or 19% of the time). We can make that better by increasing the transfer size (it comes down to 10% for 256k transfers).

This really depends on the nature of the device. Some devices could produce very erratic performance or even (not today, but some day) reject the IO.

Teaching each and every file system to be aligned at the storage granularity/minimum IO size when that is larger than the physical sector size is harder I think.

But you're making assumptions about needing larger block sizes. I'm asking what can we do with what we currently have? Increasing the transfer size is a way of mitigating the problem with no FS support whatever. Adding alignment to the FS layout algorithm is another. When you've done both of those, I think you're already at the 99% aligned case.
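[Editor's note: the alignment and chunk size information James wants to propagate is already exported to userspace by the block layer. A small sketch of how an mkfs-style tool can read those hints today; the device path is an example, and error handling of the individual ioctls is omitted for brevity.]

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/sda", O_RDONLY);   /* example device */
	int lss = 0;
	unsigned int pbs = 0, io_min = 0, io_opt = 0;

	if (fd < 0)
		return 1;

	ioctl(fd, BLKSSZGET, &lss);     /* logical sector size */
	ioctl(fd, BLKPBSZGET, &pbs);    /* physical sector size */
	ioctl(fd, BLKIOMIN, &io_min);   /* minimum I/O granularity */
	ioctl(fd, BLKIOOPT, &io_opt);   /* optimal I/O size */

	printf("logical %d, physical %u, min_io %u, opt_io %u\n",
	       lss, pbs, io_min, io_opt);
	close(fd);
	return 0;
}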
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On 01/22/2014 01:13 PM, James Bottomley wrote:

On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:

On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:

On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote: [ I like big sectors and I cannot lie ]

I think I might be sceptical, but I don't think that's showing in my concerns ...

I really think that if we want to make progress on this one, we need code and someone that owns it. Nick's work was impressive, but it was mostly there for getting rid of buffer heads. If we have a device that needs it and someone working to enable that device, we'll go forward much faster.

Do we even need to do that (eliminate buffer heads)? We cope with 4k sector only devices just fine today because the bh mechanisms now operate on top of the page cache and can do the RMW necessary to update a bh in the page cache itself, which allows us to do only 4k chunked writes, so we could keep the bh system and just alter the granularity of the page cache.

We're likely to have people mixing 4K drives and [fill in some other size here] on the same box. We could just go with the biggest size and use the existing bh code for the sub-pagesized blocks, but I really hesitate to change VM fundamentals for this.

If the page cache had a variable granularity per device, that would cope with this. It's the variable granularity that's the VM problem.

From a pure code point of view, it may be less work to change it once in the VM. But from an overall system impact point of view, it's a big change in how the system behaves just for filesystem metadata.

Agreed, but only if we don't do RMW in the buffer cache ... which may be a good reason to keep it. The other question is if the drive does RMW between 4k and whatever its physical sector size, do we need to do anything to take advantage of it ... as in what would altering the granularity of the page cache buy us?

The real benefit is when and how the reads get scheduled. We're able to do a much better job pipelining the reads, controlling our caches and reducing write latency by having the reads done up in the OS instead of the drive.

I agree with all of that, but my question is still can we do this by propagating alignment and chunk size information (i.e. the physical sector size) like we do today. If the FS knows the optimal I/O patterns and tries to follow them, the odd cockup won't impact performance dramatically. The real question is can the FS make use of this layout information *without* changing the page cache granularity? Only if you answer me "no" to this do I think we need to worry about changing page cache granularity.

Realistically, if you look at what the I/O schedulers output on a standard (spinning rust) workload, it's mostly large transfers. Obviously these are misaligned at the ends, but we can fix some of that in the scheduler. Particularly if the FS helps us with layout. My instinct tells me that we can fix 99% of this with layout on the FS + io schedulers ... the remaining 1% goes to the drive as needing to do RMW in the device, but the net impact to our throughput shouldn't be that great.

James

I think that the key to having the file system work with larger sectors is to create them properly aligned and use the actual, native sector size as their FS block size. Which is pretty much back to the original challenge. Teaching each and every file system to be aligned at the storage granularity/minimum IO size when that is larger than the physical sector size is harder I think.

ric
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On 01/22/2014 11:03 AM, James Bottomley wrote:

On Wed, 2014-01-22 at 15:14 +0000, Chris Mason wrote:

On Wed, 2014-01-22 at 09:34 +0000, Mel Gorman wrote:

On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: One topic that has been lurking forever at the edges is the current 4k limitation for file system block sizes. Some devices in production today and others coming soon have larger sectors and it would be interesting to see if it is time to poke at this topic again.

Large block support was proposed years ago by Christoph Lameter (http://lwn.net/Articles/232757/). I think I was just getting started in the community at the time so I do not recall any of the details. I do believe it motivated an alternative by Nick Piggin called fsblock though (http://lwn.net/Articles/321390/). At the very least it would be nice to know why neither was ever merged, for those of us who were not around at the time and who may not have the chance to dive through mailing list archives between now and March. FWIW, I would expect that a show-stopper for any proposal is requiring high-order allocations to succeed for the system to behave correctly.

My memory is that Nick's work just didn't have the momentum to get pushed in. It all seemed very reasonable though, I think our hatred of buffer heads just wasn't yet bigger than the fear of moving away. But, the bigger question is how big are the blocks going to be? At some point (64K?) we might as well just make a log structured dm target and have a single setup for both shingled and large sector drives.

There is no real point. Even with 4k drives today using 4k sectors in the filesystem, we still get 512 byte writes because of journalling and the buffer cache.

I think that you are wrong here James. Even with 512 byte drives, the IO's we send down tend to be 4k or larger. Do you have traces that show this and details?

The question is what would we need to do to support these devices, and the answer is "try to send IO in x byte multiples, x byte aligned" - this really becomes an ioscheduler problem, not a supporting large page problem.

James

Not that simple. The requirement of some of these devices is that you *never* send down a partial write or an unaligned write. Also keep in mind that larger block sizes allow us to track larger files with smaller amounts of metadata, which is a second win.

Ric
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On 01/22/2014 09:34 AM, Mel Gorman wrote:

On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote:

On 01/22/2014 04:34 AM, Mel Gorman wrote:

On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: One topic that has been lurking forever at the edges is the current 4k limitation for file system block sizes. Some devices in production today and others coming soon have larger sectors and it would be interesting to see if it is time to poke at this topic again.

Large block support was proposed years ago by Christoph Lameter (http://lwn.net/Articles/232757/). I think I was just getting started in the community at the time so I do not recall any of the details. I do believe it motivated an alternative by Nick Piggin called fsblock though (http://lwn.net/Articles/321390/). At the very least it would be nice to know why neither was ever merged, for those of us who were not around at the time and who may not have the chance to dive through mailing list archives between now and March. FWIW, I would expect that a show-stopper for any proposal is requiring high-order allocations to succeed for the system to behave correctly.

I have a somewhat hazy memory of Andrew warning us that touching this code takes us into dark and scary places.

That is a light summary. As Andrew tends to reject patches with poor documentation in case we forget the details in 6 months, I'm going to guess that he does not remember the details of a discussion from 7ish years ago. This is where Andrew swoops in with a dazzling display of his eidetic memory just to prove me wrong.

Ric, is there any storage vendor pushing for this right now? Is someone working on this right now or planning to? If they are, have they looked into the history of fsblock (Nick) and large block support (Christoph) to see if they are candidates for forward porting or reimplementation? I ask because without that person there is a risk that the discussion will go as follows:

Topic leader: Does anyone have an objection to supporting larger block sizes than the page size?
Room: Send patches and we'll talk.

I will have to see if I can get a storage vendor to make a public statement, but there are vendors hoping to see this land in Linux in the next few years. I assume that anyone with a shipping device will have to at least emulate the 4KB sector size for years to come, but that there might be a significant performance win for platforms that can do a larger block.

Note that Windows seems to suffer from the exact same limitation, so we are not alone here with the vm page size/fs block size entanglement.

ric
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On 01/22/2014 04:34 AM, Mel Gorman wrote:

On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: One topic that has been lurking forever at the edges is the current 4k limitation for file system block sizes. Some devices in production today and others coming soon have larger sectors and it would be interesting to see if it is time to poke at this topic again.

Large block support was proposed years ago by Christoph Lameter (http://lwn.net/Articles/232757/). I think I was just getting started in the community at the time so I do not recall any of the details. I do believe it motivated an alternative by Nick Piggin called fsblock though (http://lwn.net/Articles/321390/). At the very least it would be nice to know why neither was ever merged, for those of us who were not around at the time and who may not have the chance to dive through mailing list archives between now and March. FWIW, I would expect that a show-stopper for any proposal is requiring high-order allocations to succeed for the system to behave correctly.

I have a somewhat hazy memory of Andrew warning us that touching this code takes us into dark and scary places.

ric
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On 01/22/2014 01:35 PM, James Bottomley wrote: On Wed, 2014-01-22 at 13:17 -0500, Ric Wheeler wrote: On 01/22/2014 01:13 PM, James Bottomley wrote: On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote: On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote: On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote: [ I like big sectors and I cannot lie ] I think I might be sceptical, but I don't think that's showing in my concerns ... I really think that if we want to make progress on this one, we need code and someone that owns it. Nick's work was impressive, but it was mostly there for getting rid of buffer heads. If we have a device that needs it and someone working to enable that device, we'll go forward much faster. Do we even need to do that (eliminate buffer heads)? We cope with 4k-sector-only devices just fine today because the bh mechanisms now operate on top of the page cache and can do the RMW necessary to update a bh in the page cache itself, which allows us to do only 4k chunked writes, so we could keep the bh system and just alter the granularity of the page cache. We're likely to have people mixing 4K drives and <fill in some other size here> on the same box. We could just go with the biggest size and use the existing bh code for the sub-pagesized blocks, but I really hesitate to change VM fundamentals for this. If the page cache had a variable granularity per device, that would cope with this. It's the variable granularity that's the VM problem. From a pure code point of view, it may be less work to change it once in the VM. But from an overall system impact point of view, it's a big change in how the system behaves just for filesystem metadata. Agreed, but only if we don't do RMW in the buffer cache ... which may be a good reason to keep it. The other question is if the drive does RMW between 4k and whatever its physical sector size, do we need to do anything to take advantage of it ... as in what would altering the granularity of the page cache buy us? The real benefit is when and how the reads get scheduled. We're able to do a much better job pipelining the reads, controlling our caches and reducing write latency by having the reads done up in the OS instead of the drive. I agree with all of that, but my question is still can we do this by propagating alignment and chunk size information (i.e. the physical sector size) like we do today. If the FS knows the optimal I/O patterns and tries to follow them, the odd cockup won't impact performance dramatically. The real question is can the FS make use of this layout information *without* changing the page cache granularity? Only if you answer me no to this do I think we need to worry about changing page cache granularity. Realistically, if you look at what the I/O schedulers output on a standard (spinning rust) workload, it's mostly large transfers. Obviously these are misaligned at the ends, but we can fix some of that in the scheduler, particularly if the FS helps us with layout. My instinct tells me that we can fix 99% of this with layout on the FS + io schedulers ... the remaining 1% goes to the drive as needing to do RMW in the device, but the net impact to our throughput shouldn't be that great. James I think that the key to having the file system work with larger sectors is to create them properly aligned and use the actual, native sector size as their FS block size. Which is pretty much back to the original challenge. Only if you think laying out stuff requires block size changes.
If a 4k block filesystem's allocation algorithm tried to allocate on a 16k boundary for instance, that gets us a lot of the performance without needing a lot of alteration. The key here is that we cannot assume that writes happen only during allocation/append mode. Unless the block size enforces it, we will have non-aligned, small block IO done to allocated regions that won't get coalesced. It's not even obvious that an ignorant 4k layout is going to be so bad ... the RMW occurs only at the ends of the transfers, not in the middle. If we say 16k physical block and average 128k transfers, probabilistically we misalign on 6 out of 31 sectors (or 19% of the time). We can make that better by increasing the transfer size (it comes down to 10% for 256k transfers). This really depends on the nature of the device. Some devices could produce very erratic performance or even (not today, but some day) reject the IO. Teaching each and every file system to be aligned at the storage granularity/minimum IO size when that is larger than the physical sector size is harder, I think. But you're making assumptions about needing larger block sizes. I'm asking what can we do with what we currently have? Increasing the transfer size is a way of mitigating the problem with no FS support whatever. Adding alignment to the FS layout algorithm is another. When you've done both of those, I think you're already at the 99% aligned case.
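As a sanity check on that arithmetic: the exact accounting behind "6 out of 31" is not spelled out above, but counting partially covered physical blocks over every possible 4k-aligned start offset gives figures in the same ballpark. A sketch, using the block and transfer sizes from the discussion:

  /* Count, over all 4k-aligned start offsets within a 16k physical
   * block, how many touched physical blocks are only partially
   * covered by a transfer (i.e. would need a read-modify-write).
   */
  #include <stdio.h>

  int main(void)
  {
          const long lbs = 4096, pbs = 16384;  /* logical / physical block */
          const long xfers[] = { 131072, 262144 };

          for (int i = 0; i < 2; i++) {
                  long T = xfers[i], partial = 0, total = 0;

                  for (long start = 0; start < pbs; start += lbs) {
                          long end = start + T;
                          long first = start / pbs, last = (end - 1) / pbs;

                          total += last - first + 1;
                          if (start % pbs)
                                  partial++;   /* misaligned head */
                          if (end % pbs)
                                  partial++;   /* misaligned tail */
                  }
                  printf("%ldk transfers: %ld of %ld physical blocks partial (%.0f%%)\n",
                         T / 1024, partial, total, 100.0 * partial / total);
          }
          return 0;
  }

This prints roughly 17% for 128k transfers and 9% for 256k, close to the 19% and 10% quoted above.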
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On 01/22/2014 01:37 PM, Chris Mason wrote: Circling back to what we might talk about at the conference, Ric, do you have any ideas on when these drives might hit the wild? -chris I will poke at vendors to see if we can get someone to make a public statement, but I cannot do that for them. Ric
[LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
One topic that has been lurking forever at the edges is the current 4k limitation for file system block sizes. Some devices in production today and others coming soon have larger sectors, and it would be interesting to see if it is time to poke at this topic again. LSF/MM seems to be pretty much the only event of the year at which most of the key people will be present, so it should be a great topic for a joint session. Ric
Re: status of block-integrity
On 12/23/2013 09:35 PM, Martin K. Petersen wrote: "Christoph" == Christoph Hellwig writes: Christoph> We have the block integrity code to support DIF/DIX in Christoph> the tree for about 5 and a half years, and we still don't Christoph> have a single consumer of it. What do you mean? If you have a DIX-capable HBA (lpfc, qla2xxx, zfcp) then integrity protection is active from the block layer down. The only code that's not currently being exercised is the tag interleaving functions. I was hoping the FS people would use them for back pointers but nobody seemed to bite. Christoph> Given that we'll have a lot of work to do in this area with Christoph> block multiqueue I think it's time to either kill it off for Christoph> good or make sure we can actually use and test it. I don't understand why multiqueue would require a lot of work? It's just an extra scatterlist per request. And obviously, if there's anything that needs to be done in this area I'll be happy to do so... One of the major knocks on Linux file systems (except for btrfs) that I hear is the lack of full data path checksums. DIF/DIX + xfs or ext4 done right will give us another answer here. I don't think it will be common; it is a request that comes in most often from very large storage customers. We do have devices that support this and are working to get more vendor testing done, so I would hate to see us throw out the code instead of fixing it up for the end users that see value here. I think that we can get this working & agree with the call to continue this discussion (here and at LSF :)) Ric
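For readers new to DIF/DIX: the protection information is an 8-byte tuple attached to each 512-byte sector, and the guard tag is a CRC16 over the sector data using polynomial 0x8BB7 (what the kernel's crc_t10dif() computes). A self-contained sketch of building one tuple in userspace; the bit-at-a-time CRC is written for clarity rather than speed, and the ref tag value is an arbitrary example:

  /* The 8-byte T10 DIF tuple DIF/DIX attaches to each 512-byte
   * sector: CRC16 guard tag, application tag, reference tag
   * (typically the low 32 bits of the LBA; big-endian on the wire).
   */
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  struct t10_dif_tuple {
          uint16_t guard;    /* CRC16 of the sector data            */
          uint16_t app_tag;  /* e.g. 0xffff means "don't check"     */
          uint32_t ref_tag;  /* usually low 32 bits of the LBA      */
  };

  static uint16_t crc_t10dif(const uint8_t *buf, size_t len)
  {
          uint16_t crc = 0;

          for (size_t i = 0; i < len; i++) {
                  crc ^= (uint16_t)buf[i] << 8;
                  for (int bit = 0; bit < 8; bit++)
                          crc = (crc & 0x8000) ? (crc << 1) ^ 0x8bb7
                                               : crc << 1;
          }
          return crc;
  }

  int main(void)
  {
          uint8_t sector[512];
          memset(sector, 0xa5, sizeof(sector));

          struct t10_dif_tuple pi = {
                  .guard   = crc_t10dif(sector, sizeof(sector)),
                  .app_tag = 0,
                  .ref_tag = 1234,   /* example LBA */
          };
          printf("guard tag: 0x%04x\n", pi.guard);
          return 0;
  }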
[LSF/MM TOPIC] [ATTEND] persistent memory progress, management of storage & file systems
I would like to attend this year and continue to talk about the work on enabling the new class of persistent memory devices. Specifically, I am very interested in talking about both using a block driver under our existing stack and also progress at the file system layer (adding xip/mmap tweaks to existing file systems and looking at new file systems). We also have a lot of work left to do on unifying management; it would be good to resync on that. Regards, Ric
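The xip/mmap model under discussion boils down to letting applications store straight to the media through a file mapping. The userspace side looks like ordinary mmap code; a sketch, assuming a file on persistent-memory-backed storage (the path is an example only):

  /* Map a file on (assumed) persistent-memory-backed storage, store
   * directly through the mapping, then msync() to make it durable.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/mnt/pmem/log", O_CREAT | O_RDWR, 0644);
          if (fd < 0 || ftruncate(fd, 4096) < 0) {
                  perror("open/ftruncate");
                  return 1;
          }

          char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
          if (p == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }

          strcpy(p, "record 1");      /* store through the mapping */
          msync(p, 4096, MS_SYNC);    /* application durability point */

          munmap(p, 4096);
          close(fd);
          return 0;
  }

With xip the stores can bypass the page cache entirely; msync() remains the application's durability point either way.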
Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
On 11/23/2013 07:22 PM, Pavel Machek wrote: On Sat 2013-11-23 18:01:32, Ric Wheeler wrote: On 11/23/2013 03:36 PM, Pavel Machek wrote: On Wed 2013-11-20 08:02:33, Howard Chu wrote: Theodore Ts'o wrote: Historically, Intel has been really good about avoiding this, but since they've moved to using 3rd party flash controllers, I now advise everyone who plans to use any flash storage, regardless of the manufacturer, to do their own explicit power fail testing (hitting the reset button is not good enough, you need to kick the power plug out of the wall, or better yet, use a network controlled power switch so you can repeat the power fail test dozens or hundreds of times for your qualification run) before using flash storage in a mission critical situation where you care about data integrity after a power fail event. Speaking of which, what would you use to automate this sort of test? I'm thinking an SSD connected by eSATA, with an external power supply, and the host running inside a VM. Drop power to the drive at the same time as doing a kill -9 on the VM, then you can resume the VM pretty quickly instead of waiting for a full reboot sequence. I was just pulling power on a sata drive. It uncovered "interesting" stuff. I plugged power back in, and the kernel re-established communication with that drive, but any settings made with hdparm were forgotten. I'd say there's some room for improvement there... Hi Pavel, When you drop power, your drive normally loses temporary settings (like a change to write cache, etc). Depending on the class of the device, there are ways to make that permanent (look at hdparm or sdparm for details). This is a feature of the drive and its firmware, not something we reset in the device each time it re-appears. Yes, and I'm arguing that is a bug (as in, < 0.01% of people are using hdparm correctly). Almost no end users use hdparm. Those who do should read the man page and add the -K flag :) Or system scripts that tweak should invoke it with the right flags. Ric So you used hdparm to disable write cache so that ext3 can be safely used on your hdd. Now you have a glitch on power. Then the system continues to operate in dangerous mode until reboot. I guess it would be safer not to reattach drives after power fail... (also I wonder what this does to data integrity. The drive lost the content of its writeback cache, but the kernel continues... The journal will not prevent data corruption in this case). Pavel
Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
On 11/23/2013 03:36 PM, Pavel Machek wrote: On Wed 2013-11-20 08:02:33, Howard Chu wrote: Theodore Ts'o wrote: Historically, Intel has been really good about avoiding this, but since they've moved to using 3rd party flash controllers, I now advise everyone who plans to use any flash storage, regardless of the manufacturer, to do their own explicit power fail testing (hitting the reset button is not good enough, you need to kick the power plug out of the wall, or better yet, use a network controlled power switch so you can repeat the power fail test dozens or hundreds of times for your qualification run) before using flash storage in a mission critical situation where you care about data integrity after a power fail event. Speaking of which, what would you use to automate this sort of test? I'm thinking an SSD connected by eSATA, with an external power supply, and the host running inside a VM. Drop power to the drive at the same time as doing a kill -9 on the VM, then you can resume the VM pretty quickly instead of waiting for a full reboot sequence. I was just pulling power on a sata drive. It uncovered "interesting" stuff. I plugged power back in, and the kernel re-established communication with that drive, but any settings made with hdparm were forgotten. I'd say there's some room for improvement there... Pavel Hi Pavel, When you drop power, your drive normally loses temporary settings (like a change to write cache, etc). Depending on the class of the device, there are ways to make that permanent (look at hdparm or sdparm for details). This is a feature of the drive and its firmware, not something we reset in the device each time it re-appears. Ric
Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
On 11/23/2013 01:27 PM, Stefan Priebe wrote: Hi Ric, On 22.11.2013 21:37, Ric Wheeler wrote: On 11/22/2013 03:01 PM, Stefan Priebe wrote: Hi Christoph, On 21.11.2013 11:11, Christoph Hellwig wrote: 2. Some drives may implement CMD_FLUSH to return immediately i.e. no guarantee the data is actually on disk. In which case they aren't spec compliant. While I've seen countless data integrity bugs on lower end ATA SSDs I've not seen one that simply ignores flush. If you'd want to cheat that bluntly you'd be better off just claiming to not have a writeback cache. You solve your performance problem by completely disabling any chance of having data integrity guarantees, and do so in a way that is not detectable for applications or users. If you have a workload with lots of small synchronous writes, disabling the writeback cache on the disk does indeed often help, especially with the non-queueable FLUSH on all but the most recent ATA devices. But this isn't correct for drives with capacitors like the Crucial m500 or Intel DC S3500 and DC S3700, is it? Shouldn't the Linux kernel have an option to disable this for drives like these? /sys/block/sdX/device/ignore_flush If you know 100% for sure that your drive has a non-volatile write cache, you can run the file system without the flushing by mounting "-o nobarrier". With most devices, this is not needed since they tend to simply ignore the flushes if they know they are power failure safe. Block level, we did something similar for users who are not running through a file system for SCSI devices - James added support to echo "temporary" into the sd device's cache_type field: See: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2ee3e26c673e75c05ef8b914f54fadee3d7b9c88 At least for me this does not work. I get the same awful speed as before - also the I/O waits stay the same. I'm still seeing CMD flushes going to the devices. Is there any way to check whether the temporary got accepted and works? I simply executed: for i in /sys/class/scsi_disk/*/cache_type; do echo $i; echo temporary write back >$i; done Stefan What kernel are you running? This is a new addition. Also, you can "cat" the same file to see what it says. Regards, Ric
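One way to answer Stefan's "did the temporary setting get accepted?" question is to do exactly what Ric suggests: write the attribute, then read it back. A sketch in C; the sysfs path is an example, and this assumes a kernel with the "temporary" cache_type support from the commit referenced above plus root privileges:

  /* Write "temporary write back" to a disk's cache_type attribute
   * and read it back to confirm the kernel accepted the setting.
   */
  #include <stdio.h>

  int main(void)
  {
          const char *path = "/sys/class/scsi_disk/0:0:0:0/cache_type";
          char buf[64] = "";
          FILE *f = fopen(path, "r+");

          if (!f) {
                  perror("fopen");
                  return 1;
          }
          fputs("temporary write back\n", f);
          fflush(f);          /* push the write to sysfs */

          rewind(f);          /* reposition before reading back */
          if (fgets(buf, sizeof(buf), f))
                  printf("cache_type now: %s", buf);  /* expect the
                          cache type to read back as "write back" */
          fclose(f);
          return 0;
  }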
Re: Why is O_DSYNC on linux so slow / what's wrong with my SSD?
On 11/22/2013 03:01 PM, Stefan Priebe wrote: Hi Christoph, On 21.11.2013 11:11, Christoph Hellwig wrote: 2. Some drives may implement CMD_FLUSH to return immediately i.e. no guarantee the data is actually on disk. In which case they aren't spec compliant. While I've seen countless data integrity bugs on lower end ATA SSDs I've not seen one that simply ignores flush. If you'd want to cheat that bluntly you'd be better off just claiming to not have a writeback cache. You solve your performance problem by completely disabling any chance of having data integrity guarantees, and do so in a way that is not detectable for applications or users. If you have a workload with lots of small synchronous writes, disabling the writeback cache on the disk does indeed often help, especially with the non-queueable FLUSH on all but the most recent ATA devices. But this isn't correct for drives with capacitors like the Crucial m500 or Intel DC S3500 and DC S3700, is it? Shouldn't the Linux kernel have an option to disable this for drives like these? /sys/block/sdX/device/ignore_flush If you know 100% for sure that your drive has a non-volatile write cache, you can run the file system without the flushing by mounting "-o nobarrier". With most devices, this is not needed since they tend to simply ignore the flushes if they know they are power failure safe. Block level, we did something similar for users who are not running through a file system for SCSI devices - James added support to echo "temporary" into the sd device's cache_type field: See: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2ee3e26c673e75c05ef8b914f54fadee3d7b9c88 Ric Again, what your patch does is to explicitly ignore the data integrity request from the application. While this will usually be way faster, it will also cause data loss. Simply disabling the writeback cache feature of the disk using hdparm will give you much better performance than issuing all the FLUSH commands, especially if they are non-queued, but without breaking the guarantee to the application.
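The cost being argued about here (a cache flush riding on every synchronous write) is easy to observe from userspace: time a loop of O_DSYNC writes with the write cache enabled versus disabled. A minimal sketch; the file path and sizes are arbitrary examples:

  /* Time O_DSYNC writes: each pwrite() must be durable before it
   * returns, which implies a flush (or FUA) on a volatile write cache.
   */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <time.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/mnt/test/dsync.dat",
                        O_CREAT | O_WRONLY | O_DSYNC, 0644);
          if (fd < 0) {
                  perror("open");
                  return 1;
          }

          char buf[4096];
          memset(buf, 0, sizeof(buf));

          struct timespec t0, t1;
          const int iters = 100;

          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (int i = 0; i < iters; i++)
                  pwrite(fd, buf, sizeof(buf), 0);
          clock_gettime(CLOCK_MONOTONIC, &t1);

          double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                      (t1.tv_nsec - t0.tv_nsec) / 1e6;
          printf("%d O_DSYNC writes: %.2f ms avg\n", iters, ms / iters);

          close(fd);
          return 0;
  }

Running this before and after toggling the drive's write cache (or the cache_type trick above) shows how much of the per-write latency is the flush.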
Re: [PATCH] update xfs maintainers
On 11/08/2013 05:17 PM, Ben Myers wrote: Hey Ric, On Fri, Nov 08, 2013 at 05:07:45PM -0500, Ric Wheeler wrote: On 11/08/2013 05:03 PM, Ben Myers wrote: Hey Ric, On Fri, Nov 08, 2013 at 03:50:21PM -0500, Ric Wheeler wrote: On 11/08/2013 03:46 PM, Ben Myers wrote: Hey Christoph, On Fri, Nov 08, 2013 at 11:34:24AM -0800, Christoph Hellwig wrote: On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote: Mark is replacing Alex as my backup because Alex is really busy at Linaro and asked to be taken off awhile ago. The holiday season is coming up and I fully intend to go off my meds, turn into Fonzy the bear, and eat my hat. I need someone to watch the shop while I'm off exploring on Mars. I trust Mark to do that because he is totally awesome. Doing this as a unilateral decision is not something that will win you a fan base. It's posted for review. While we never had anything resembling a democracy in Linux kernel development, making decisions without even contacting the major contributor is wrong, twice so if the maintainer is a relatively minor contributor to start with. Just because it recently came up elsewhere I'd like to recite the definition from Trond here again: http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html By many of the creative roles enlisted there it's clear that Dave should be the maintainer. He's been the main contributor and chief architect for XFS for many years, while the maintainers came and went at the mercy of SGI. This is not meant to bad-mouth either of you, as I think you're doing a reasonably good job compared to other maintainers, but at the same time the direction is set by other people that have a much longer involvement with the project, and having them officially in control would help us forward a lot. It would also avoid having to spend considerable resources to train every new generation of SGI maintainer. Coming to an end, I would like to nominate Dave Chinner as the primary XFS maintainer for all the work he has done as biggest contributor and architect of XFS since longer than I can remember, and I would love to retain Ben Myers as a co-maintainer for all the good work he has done maintaining and reviewing patches since November 2011. I think we're doing a decent job too. So thanks for that much at least. ;) I would also like to use this post as a public venue to condemn the unilateral smokey backroom decisions about XFS maintainership that SGI is trying to enforce on the community. That really didn't happen, Christoph. It's not in my tree or in a pull request. Linus, let me know what you want to do. I do think we're doing a fair job over here, and (geez) I'm just trying to add Mark as my backup since Alex is too busy. I know the RH people want more control, and that's understandable, but they really don't need to replace me to get their code in. Ouch. Thanks, Ben Christoph is not a Red Hat person. Jeff is from Oracle. This is not a Red Hat vs SGI thing. Sorry if my read on that was wrong. I do appreciate the work and effort you and the SGI team put in, but think that this will be a good way to keep the community happier and even more productive going forward. Dave simply has earned the right to take on the formal leadership role of maintainer. Then we're gonna need some Reviewed-bys. ;) Those should come from the developers, thanks! I actually do need your Reviewed-by. We'll try and get this one in 3.13. ;) Thanks, Ben Happy to do that - I do think that Dave mostly posts from his redhat.com account, but he can comment once he gets back online. Reviewed-by: Ric Wheeler

From: Ben Myers

xfs: update maintainers

Add Dave as maintainer of XFS.

Signed-off-by: Ben Myers
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/MAINTAINERS
===================================================================
--- a/MAINTAINERS 2013-11-08 15:20:18.935186245 -0600
+++ b/MAINTAINERS 2013-11-08 15:22:50.685245977 -0600
@@ -9387,8 +9387,8 @@ F: drivers/xen/*swiotlb*
 XFS FILESYSTEM
 P: Silicon Graphics Inc
+M: Dave Chinner
 M: Ben Myers
-M: Alex Elder
 M: x...@oss.sgi.com
 L: x...@oss.sgi.com
 W: http://oss.sgi.com/projects/xfs
Re: [PATCH] update xfs maintainers
On 11/08/2013 05:03 PM, Ben Myers wrote: Hey Ric, On Fri, Nov 08, 2013 at 03:50:21PM -0500, Ric Wheeler wrote: On 11/08/2013 03:46 PM, Ben Myers wrote: Hey Christoph, On Fri, Nov 08, 2013 at 11:34:24AM -0800, Christoph Hellwig wrote: On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote: Mark is replacing Alex as my backup because Alex is really busy at Linaro and asked to be taken off awhile ago. The holiday season is coming up and I fully intend to go off my meds, turn into Fonzy the bear, and eat my hat. I need someone to watch the shop while I'm off exploring on Mars. I trust Mark to do that because he is totally awesome. Doing this as a unilateral decision is not something that will win you a fan base. It's posted for review. While we never had anything resembling a democracy in Linux kernel development, making decisions without even contacting the major contributor is wrong, twice so if the maintainer is a relatively minor contributor to start with. Just because it recently came up elsewhere I'd like to recite the definition from Trond here again: http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html By many of the creative roles enlisted there it's clear that Dave should be the maintainer. He's been the main contributor and chief architect for XFS for many years, while the maintainers came and went at the mercy of SGI. This is not meant to bad-mouth either of you, as I think you're doing a reasonably good job compared to other maintainers, but at the same time the direction is set by other people that have a much longer involvement with the project, and having them officially in control would help us forward a lot. It would also avoid having to spend considerable resources to train every new generation of SGI maintainer. Coming to an end, I would like to nominate Dave Chinner as the primary XFS maintainer for all the work he has done as biggest contributor and architect of XFS since longer than I can remember, and I would love to retain Ben Myers as a co-maintainer for all the good work he has done maintaining and reviewing patches since November 2011. I think we're doing a decent job too. So thanks for that much at least. ;) I would also like to use this post as a public venue to condemn the unilateral smokey backroom decisions about XFS maintainership that SGI is trying to enforce on the community. That really didn't happen, Christoph. It's not in my tree or in a pull request. Linus, let me know what you want to do. I do think we're doing a fair job over here, and (geez) I'm just trying to add Mark as my backup since Alex is too busy. I know the RH people want more control, and that's understandable, but they really don't need to replace me to get their code in. Ouch. Thanks, Ben Christoph is not a Red Hat person. Jeff is from Oracle. This is not a Red Hat vs SGI thing. Sorry if my read on that was wrong. I do appreciate the work and effort you and the SGI team put in, but think that this will be a good way to keep the community happier and even more productive going forward. Dave simply has earned the right to take on the formal leadership role of maintainer. Then we're gonna need some Reviewed-bys. ;) Those should come from the developers, thanks! Ric

From: Ben Myers

xfs: update maintainers

Add Dave as maintainer of XFS.

Signed-off-by: Ben Myers
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/MAINTAINERS
===================================================================
--- a/MAINTAINERS 2013-11-08 15:20:18.935186245 -0600
+++ b/MAINTAINERS 2013-11-08 15:22:50.685245977 -0600
@@ -9387,8 +9387,8 @@ F: drivers/xen/*swiotlb*
 XFS FILESYSTEM
 P: Silicon Graphics Inc
+M: Dave Chinner
 M: Ben Myers
-M: Alex Elder
 M: x...@oss.sgi.com
 L: x...@oss.sgi.com
 W: http://oss.sgi.com/projects/xfs
Re: XFS leadership and a new co-maintainer candidate
On 11/08/2013 03:46 PM, Ben Myers wrote: Hey Christoph, On Fri, Nov 08, 2013 at 11:34:24AM -0800, Christoph Hellwig wrote: On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote: Mark is replacing Alex as my backup because Alex is really busy at Linaro and asked to be taken off awhile ago. The holiday season is coming up and I fully intend to go off my meds, turn into Fonzy the bear, and eat my hat. I need someone to watch the shop while I'm off exploring on Mars. I trust Mark to do that because he is totally awesome. Doing this as a unilateral decision is not something that will win you a fan base. It's posted for review. While we never had anything resembling a democracy in Linux kernel development, making decisions without even contacting the major contributor is wrong, twice so if the maintainer is a relatively minor contributor to start with. Just because it recently came up elsewhere I'd like to recite the definition from Trond here again: http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html By many of the creative roles enlisted there it's clear that Dave should be the maintainer. He's been the main contributor and chief architect for XFS for many years, while the maintainers came and went at the mercy of SGI. This is not meant to bad-mouth either of you, as I think you're doing a reasonably good job compared to other maintainers, but at the same time the direction is set by other people that have a much longer involvement with the project, and having them officially in control would help us forward a lot. It would also avoid having to spend considerable resources to train every new generation of SGI maintainer. Coming to an end, I would like to nominate Dave Chinner as the primary XFS maintainer for all the work he has done as biggest contributor and architect of XFS since longer than I can remember, and I would love to retain Ben Myers as a co-maintainer for all the good work he has done maintaining and reviewing patches since November 2011. I think we're doing a decent job too. So thanks for that much at least. ;) I would also like to use this post as a public venue to condemn the unilateral smokey backroom decisions about XFS maintainership that SGI is trying to enforce on the community. That really didn't happen, Christoph. It's not in my tree or in a pull request. Linus, let me know what you want to do. I do think we're doing a fair job over here, and (geez) I'm just trying to add Mark as my backup since Alex is too busy. I know the RH people want more control, and that's understandable, but they really don't need to replace me to get their code in. Ouch. Thanks, Ben Christoph is not a Red Hat person. Jeff is from Oracle. This is not a Red Hat vs SGI thing. Dave simply has earned the right to take on the formal leadership role of maintainer. Regards, Ric
Re: XFS leadership and a new co-maintainer candidate
On 11/08/2013 02:34 PM, Christoph Hellwig wrote: On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote: Mark is replacing Alex as my backup because Alex is really busy at Linaro and asked to be taken off awhile ago. The holiday season is coming up and I fully intend to go off my meds, turn into Fonzy the bear, and eat my hat. I need someone to watch the shop while I'm off exploring on Mars. I trust Mark to do that because he is totally awesome. Doing this as a unilateral decision is not something that will win you a fan base. While we never had anything resembling a democracy in Linux kernel development, making decisions without even contacting the major contributor is wrong, twice so if the maintainer is a relatively minor contributor to start with. Just because it recently came up elsewhere I'd like to recite the definition from Trond here again: http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html By many of the creative roles enlisted there it's clear that Dave should be the maintainer. He's been the main contributor and chief architect for XFS for many years, while the maintainers came and went at the mercy of SGI. This is not meant to bad-mouth either of you, as I think you're doing a reasonably good job compared to other maintainers, but at the same time the direction is set by other people that have a much longer involvement with the project, and having them officially in control would help us forward a lot. It would also avoid having to spend considerable resources to train every new generation of SGI maintainer. Coming to an end, I would like to nominate Dave Chinner as the primary XFS maintainer for all the work he has done as biggest contributor and architect of XFS since longer than I can remember, and I would love to retain Ben Myers as a co-maintainer for all the good work he has done maintaining and reviewing patches since November 2011. This sounds like exactly the right thing to do to me as well, Ric I would also like to use this post as a public venue to condemn the unilateral smokey backroom decisions about XFS maintainership that SGI is trying to enforce on the community.
Re: XFS leadership and a new co-maintainer candidate
On 11/08/2013 01:03 PM, Ben Myers wrote: Hey Ric, On Fri, Nov 08, 2013 at 06:03:41AM -0500, Ric Wheeler wrote: In the XFS community, we have 2 clear leaders in terms of contributions of significant features and depth of knowledge - Christoph and Dave. If you look at the number of patches submitted by developers since 3.0 who have more than 10 patches, we get the following:

 319 Author: Dave Chinner
 163 Author: Christoph Hellwig
  51 Author: Christoph Hellwig
  35 Author: Linus Torvalds
  34 Author: Chandra Seetharaman
  29 Author: Al Viro
  28 Author: Brian Foster
  25 Author: Zhi Yong Wu
  24 Author: Jeff Liu
  21 Author: Jie Liu
  20 Author: Mark Tinguely
  16 Author: Dave Chinner
  12 Author: Eric Sandeen
  12 Author: Carlos Maiolino

If we as a community had more capacity for patch review, Dave's numbers would have jumped up even higher :) It is certainly very welcome to bring new developers into our community, but if we are going to add a co-maintainer for XFS, we really need to have one of our two leading developers in that role. Mark is replacing Alex as my backup because Alex is really busy at Linaro and asked to be taken off awhile ago. The holiday season is coming up and I fully intend to go off my meds, turn into Fonzy the bear, and eat my hat. I need someone to watch the shop while I'm off exploring on Mars. I trust Mark to do that because he is totally awesome. -Ben I don't mean any disrespect to you or to Mark, but maintainership is something that you earn over time by proving yourself in the community as a developer and a leader of the technology on a personal level. It is not something that gets managed; it is granted by the community of developers, and it has the key role of keeping the most frequent developers engaged and happy. That has not been working for us as a community lately. Dave Chinner is the obvious person to take on the maintainer role as someone who has an order of magnitude more code contributed than either of you (even combined). Christoph, if he has time, would also be an excellent candidate. Regards, Ric
XFS leadership and a new co-maintainer candidate
In the XFS community, we have 2 clear leaders in terms of contributions of significant features and depth of knowledge - Christoph and Dave. If you look at the number of patches submitted by developers since 3.0 who have more than 10 patches, we get the following:

 319 Author: Dave Chinner
 163 Author: Christoph Hellwig
  51 Author: Christoph Hellwig
  35 Author: Linus Torvalds
  34 Author: Chandra Seetharaman
  29 Author: Al Viro
  28 Author: Brian Foster
  25 Author: Zhi Yong Wu
  24 Author: Jeff Liu
  21 Author: Jie Liu
  20 Author: Mark Tinguely
  16 Author: Dave Chinner
  12 Author: Eric Sandeen
  12 Author: Carlos Maiolino

If we as a community had more capacity for patch review, Dave's numbers would have jumped up even higher :) It is certainly very welcome to bring new developers into our community, but if we are going to add a co-maintainer for XFS, we really need to have one of our two leading developers in that role. Best regards, Ric On 11/07/2013 09:23 PM, Ric Wheeler wrote: Hi Ben, How exactly did we decide to add a new co-maintainer? Shouldn't we have some discussion on the list and see some substantial history of contributions? Best regards, Ric On 11/07/2013 05:08 PM, Mark Tinguely wrote: Updated maintainer info. Signed-off-by: Ben Myers Reviewed-by: Mark Tinguely
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/MAINTAINERS
===================================================================
--- a/MAINTAINERS 2013-11-07 15:42:04.554561805 -0600
+++ b/MAINTAINERS 2013-11-07 15:42:59.034889770 -0600
@@ -9388,7 +9388,7 @@ F: drivers/xen/*swiotlb*
 XFS FILESYSTEM
 P: Silicon Graphics Inc
 M: Ben Myers
-M: Alex Elder
+M: Mark Tinguely
 M: x...@oss.sgi.com
 L: x...@oss.sgi.com
 W: http://oss.sgi.com/projects/xfs
XFS leadership and a new co-maintainer candidate
In the XFS community, we have 2 clear leaders in terms of contributions of significant feaures and depth of knowledge - Christoph and Dave. If you look at the number of patches submitted by developers since 3.0 who have more than 10 patches, we get the following: 319 Author: Dave Chinner dchin...@redhat.com 163 Author: Christoph Hellwig h...@infradead.org 51 Author: Christoph Hellwig h...@lst.de 35 Author: Linus Torvalds torva...@linux-foundation.org 34 Author: Chandra Seetharaman sekha...@us.ibm.com 29 Author: Al Viro v...@zeniv.linux.org.uk 28 Author: Brian Foster bfos...@redhat.com 25 Author: Zhi Yong Wu wu...@linux.vnet.ibm.com 24 Author: Jeff Liu jeff@oracle.com 21 Author: Jie Liu jeff@oracle.com 20 Author: Mark Tinguely tingu...@sgi.com 16 Author: Dave Chinner da...@fromorbit.com 12 Author: Eric Sandeen sand...@redhat.com 12 Author: Carlos Maiolino cmaiol...@redhat.com If we as a community had more capacity for patch review, Dave's numbers would have jumped up even higher :) It is certainly very welcome to bring new developers into our community, but if we are going to add a co-maintainer for XFS, we really need to have one of our two leading developers in that role. Best regards, Ric On 11/07/2013 09:23 PM, Ric Wheeler wrote: Hi Ben, How exactly did we decide to add a new co-maintainer? Shouldn't we have some discussion on the list and see some substantial history of contributions? Best regards, Ric On 11/07/2013 05:08 PM, Mark Tinguely wrote: Updated maintainer info. Signed-off-by: Ben Myers b...@sgi.com Reviewed-by: Mark Tinguely tingu...@sgi.com --- MAINTAINERS |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: b/MAINTAINERS === --- a/MAINTAINERS2013-11-07 15:42:04.554561805 -0600 +++ b/MAINTAINERS2013-11-07 15:42:59.034889770 -0600 @@ -9388,7 +9388,7 @@ F:drivers/xen/*swiotlb* XFS FILESYSTEM P:Silicon Graphics Inc M:Ben Myers b...@sgi.com -M:Alex Elder el...@kernel.org +M:Mark Tinguely tingu...@sgi.com M:x...@oss.sgi.com L:x...@oss.sgi.com W:http://oss.sgi.com/projects/xfs ___ xfs mailing list x...@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ___ xfs mailing list x...@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: XFS leadership and a new co-maintainer candidate
On 11/08/2013 01:03 PM, Ben Myers wrote: Hey Ric, On Fri, Nov 08, 2013 at 06:03:41AM -0500, Ric Wheeler wrote: In the XFS community, we have 2 clear leaders in terms of contributions of significant feaures and depth of knowledge - Christoph and Dave. If you look at the number of patches submitted by developers since 3.0 who have more than 10 patches, we get the following: 319 Author: Dave Chinner dchin...@redhat.com 163 Author: Christoph Hellwig h...@infradead.org 51 Author: Christoph Hellwig h...@lst.de 35 Author: Linus Torvalds torva...@linux-foundation.org 34 Author: Chandra Seetharaman sekha...@us.ibm.com 29 Author: Al Viro v...@zeniv.linux.org.uk 28 Author: Brian Foster bfos...@redhat.com 25 Author: Zhi Yong Wu wu...@linux.vnet.ibm.com 24 Author: Jeff Liu jeff@oracle.com 21 Author: Jie Liu jeff@oracle.com 20 Author: Mark Tinguely tingu...@sgi.com 16 Author: Dave Chinner da...@fromorbit.com 12 Author: Eric Sandeen sand...@redhat.com 12 Author: Carlos Maiolino cmaiol...@redhat.com If we as a community had more capacity for patch review, Dave's numbers would have jumped up even higher :) It is certainly very welcome to bring new developers into our community, but if we are going to add a co-maintainer for XFS, we really need to have one of our two leading developers in that role. Mark is replacing Alex as my backup because Alex is really busy at Linaro and asked to be taken off awhile ago. The holiday season is coming up and I fully intend to go off my meds, turn in to Fonzy the bear, and eat my hat. I need someone to watch the shop while I'm off exploring on Mars. I trust Mark to do that because he is totally awesome. -Ben I don't mean any disrepect to you or to Mark, but maintainership is something that you earn over time by proving yourself in the community as a developer and a leader of the technology on a personal level. It is not something that gets managed by the community of developers and has the key role of keeping the most frequent developers engaged and happy. That has not been working for us as a community lately. Dave Chinner is the obvious person to take on the maintainer role as someone who has an order of magnitude more code contributed than either of you (even combined). Christoph, if he has time, would also be an excellent candidate. Regards, Ric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: XFS leadership and a new co-maintainer candidate
On 11/08/2013 02:34 PM, Christoph Hellwig wrote: On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote: Mark is replacing Alex as my backup because Alex is really busy at Linaro and asked to be taken off awhile ago. The holiday season is coming up and I fully intend to go off my meds, turn in to Fonzy the bear, and eat my hat. I need someone to watch the shop while I'm off exploring on Mars. I trust Mark to do that because he is totally awesome. Doing this as an unilateral decisions is not something that will win you a fan base. While we never had anything reassembling a democracy in Linux Kernel development making decisions without even contacting the major contributor is wrong, twice so if the maintainer is a relatively minor contributor to start with. Just because it recent came up elsewhere I'd like to recite the definition from Trond here again: http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html By many of the creative roles enlisted there it's clear that Dave should be the maintainer. He's been the main contributor and chief architect for XFS for many year, while the maintainers came and went at the mercy of SGI. This is not meant to bad mouth either of you as I think you're doing a reasonably good job compared to other maintainers, but at the same time the direction is set by other people that have a much longer involvement with the project, and having them officially in control would help us forward a lot. It would also avoid having to spend considerable resources to train every new generation of SGI maintainer. Coming to and end I would like to maintain Dave Chinner as the primary XFS maintainer for all the work he has done as biggest contributor and architect of XFS since longer than I can remember, and I would love to retain Ben Myers as a co-maintainer for all the good work he has done maintaining and reviewing patches since November 2011. This sounds like exactly the right thing to do to me as well, Ric I would also like to use this post as a public venue to condemn the unilateral smokey backroom decisions about XFS maintainership that SGI is trying to enforce on the community. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: XFS leadership and a new co-maintainer candidate
On 11/08/2013 03:46 PM, Ben Myers wrote: Hey Christoph, On Fri, Nov 08, 2013 at 11:34:24AM -0800, Christoph Hellwig wrote: On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote: Mark is replacing Alex as my backup because Alex is really busy at Linaro and asked to be taken off a while ago. The holiday season is coming up and I fully intend to go off my meds, turn into Fonzy the bear, and eat my hat. I need someone to watch the shop while I'm off exploring on Mars. I trust Mark to do that because he is totally awesome. Doing this as a unilateral decision is not something that will win you a fan base. It's posted for review. While we have never had anything resembling a democracy in Linux kernel development, making decisions without even contacting the major contributor is wrong, doubly so if the maintainer is a relatively minor contributor to start with. Because it recently came up elsewhere, I'd like to recite the definition from Trond here again: http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html By many of the criteria listed there it's clear that Dave should be the maintainer. He's been the main contributor and chief architect of XFS for many years, while the maintainers came and went at the mercy of SGI. This is not meant to bad-mouth either of you, as I think you're doing a reasonably good job compared to other maintainers, but at the same time the direction is set by other people who have a much longer involvement with the project, and having them officially in control would help us move forward a lot. It would also avoid having to spend considerable resources on training every new generation of SGI maintainer. Coming to an end, I would like to nominate Dave Chinner as the primary XFS maintainer for all the work he has done as the biggest contributor and architect of XFS for longer than I can remember, and I would love to retain Ben Myers as a co-maintainer for all the good work he has done maintaining and reviewing patches since November 2011. I think we're doing a decent job too. So thanks for that much at least. ;) I would also like to use this post as a public venue to condemn the unilateral smoky backroom decisions about XFS maintainership that SGI is trying to enforce on the community. That really didn't happen, Christoph. It's not in my tree or in a pull request. Linus, let me know what you want to do. I do think we're doing a fair job over here, and (geez) I'm just trying to add Mark as my backup since Alex is too busy. I know the RH people want more control, and that's understandable, but they really don't need to replace me to get their code in. Ouch. Thanks, Ben Christoph is not a Red Hat person. Jeff is from Oracle. This is not a Red Hat vs SGI thing; Dave simply has earned the right to take on the formal leadership role of maintainer. Regards, Ric
Re: [PATCH] update xfs maintainers
On 11/08/2013 05:03 PM, Ben Myers wrote: Hey Ric, On Fri, Nov 08, 2013 at 03:50:21PM -0500, Ric Wheeler wrote: On 11/08/2013 03:46 PM, Ben Myers wrote: Hey Christoph, On Fri, Nov 08, 2013 at 11:34:24AM -0800, Christoph Hellwig wrote: On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote: Mark is replacing Alex as my backup because Alex is really busy at Linaro and asked to be taken off a while ago. The holiday season is coming up and I fully intend to go off my meds, turn into Fonzy the bear, and eat my hat. I need someone to watch the shop while I'm off exploring on Mars. I trust Mark to do that because he is totally awesome. Doing this as a unilateral decision is not something that will win you a fan base. It's posted for review. While we have never had anything resembling a democracy in Linux kernel development, making decisions without even contacting the major contributor is wrong, doubly so if the maintainer is a relatively minor contributor to start with. Because it recently came up elsewhere, I'd like to recite the definition from Trond here again: http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html By many of the criteria listed there it's clear that Dave should be the maintainer. He's been the main contributor and chief architect of XFS for many years, while the maintainers came and went at the mercy of SGI. This is not meant to bad-mouth either of you, as I think you're doing a reasonably good job compared to other maintainers, but at the same time the direction is set by other people who have a much longer involvement with the project, and having them officially in control would help us move forward a lot. It would also avoid having to spend considerable resources on training every new generation of SGI maintainer. Coming to an end, I would like to nominate Dave Chinner as the primary XFS maintainer for all the work he has done as the biggest contributor and architect of XFS for longer than I can remember, and I would love to retain Ben Myers as a co-maintainer for all the good work he has done maintaining and reviewing patches since November 2011. I think we're doing a decent job too. So thanks for that much at least. ;) I would also like to use this post as a public venue to condemn the unilateral smoky backroom decisions about XFS maintainership that SGI is trying to enforce on the community. That really didn't happen, Christoph. It's not in my tree or in a pull request. Linus, let me know what you want to do. I do think we're doing a fair job over here, and (geez) I'm just trying to add Mark as my backup since Alex is too busy. I know the RH people want more control, and that's understandable, but they really don't need to replace me to get their code in. Ouch. Thanks, Ben Christoph is not a Red Hat person. Jeff is from Oracle. This is not a Red Hat vs SGI thing, Sorry if my read on that was wrong. I do appreciate the work and effort you and the SGI team put in, but I think that this will be a good way to keep the community happier and even more productive going forward. Dave simply has earned the right to take on the formal leadership role of maintainer. Then we're gonna need some Reviewed-bys. ;) Those should come from the developers, thanks! Ric

From: Ben Myers b...@sgi.com

xfs: update maintainers

Add Dave as maintainer of XFS.

Signed-off-by: Ben Myers b...@sgi.com
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/MAINTAINERS
===================================================================
--- a/MAINTAINERS	2013-11-08 15:20:18.935186245 -0600
+++ b/MAINTAINERS	2013-11-08 15:22:50.685245977 -0600
@@ -9387,8 +9387,8 @@ F:	drivers/xen/*swiotlb*
 XFS FILESYSTEM
 P:	Silicon Graphics Inc
+M:	Dave Chinner dchin...@fromorbit.com
 M:	Ben Myers b...@sgi.com
-M:	Alex Elder el...@kernel.org
 M:	x...@oss.sgi.com
 L:	x...@oss.sgi.com
 W:	http://oss.sgi.com/projects/xfs
Re: [PATCH] update xfs maintainers
On 11/08/2013 05:17 PM, Ben Myers wrote: Hey Ric, On Fri, Nov 08, 2013 at 05:07:45PM -0500, Ric Wheeler wrote: On 11/08/2013 05:03 PM, Ben Myers wrote: Hey Ric, On Fri, Nov 08, 2013 at 03:50:21PM -0500, Ric Wheeler wrote: On 11/08/2013 03:46 PM, Ben Myers wrote: Hey Christoph, On Fri, Nov 08, 2013 at 11:34:24AM -0800, Christoph Hellwig wrote: On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote: Mark is replacing Alex as my backup because Alex is really busy at Linaro and asked to be taken off a while ago. The holiday season is coming up and I fully intend to go off my meds, turn into Fonzy the bear, and eat my hat. I need someone to watch the shop while I'm off exploring on Mars. I trust Mark to do that because he is totally awesome. Doing this as a unilateral decision is not something that will win you a fan base. It's posted for review. While we have never had anything resembling a democracy in Linux kernel development, making decisions without even contacting the major contributor is wrong, doubly so if the maintainer is a relatively minor contributor to start with. Because it recently came up elsewhere, I'd like to recite the definition from Trond here again: http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html By many of the criteria listed there it's clear that Dave should be the maintainer. He's been the main contributor and chief architect of XFS for many years, while the maintainers came and went at the mercy of SGI. This is not meant to bad-mouth either of you, as I think you're doing a reasonably good job compared to other maintainers, but at the same time the direction is set by other people who have a much longer involvement with the project, and having them officially in control would help us move forward a lot. It would also avoid having to spend considerable resources on training every new generation of SGI maintainer. Coming to an end, I would like to nominate Dave Chinner as the primary XFS maintainer for all the work he has done as the biggest contributor and architect of XFS for longer than I can remember, and I would love to retain Ben Myers as a co-maintainer for all the good work he has done maintaining and reviewing patches since November 2011. I think we're doing a decent job too. So thanks for that much at least. ;) I would also like to use this post as a public venue to condemn the unilateral smoky backroom decisions about XFS maintainership that SGI is trying to enforce on the community. That really didn't happen, Christoph. It's not in my tree or in a pull request. Linus, let me know what you want to do. I do think we're doing a fair job over here, and (geez) I'm just trying to add Mark as my backup since Alex is too busy. I know the RH people want more control, and that's understandable, but they really don't need to replace me to get their code in. Ouch. Thanks, Ben Christoph is not a Red Hat person. Jeff is from Oracle. This is not a Red Hat vs SGI thing, Sorry if my read on that was wrong. I do appreciate the work and effort you and the SGI team put in, but I think that this will be a good way to keep the community happier and even more productive going forward. Dave simply has earned the right to take on the formal leadership role of maintainer. Then we're gonna need some Reviewed-bys. ;) Those should come from the developers, thanks! I actually do need your Reviewed-by. We'll try and get this one in 3.13. ;) Thanks, Ben Happy to do that - I do think that Dave mostly posts from his redhat.com account, but he can comment once he gets back online. Reviewed-by: Ric Wheeler rwhee...@redhat.com

From: Ben Myers b...@sgi.com

xfs: update maintainers

Add Dave as maintainer of XFS.

Signed-off-by: Ben Myers b...@sgi.com
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/MAINTAINERS
===================================================================
--- a/MAINTAINERS	2013-11-08 15:20:18.935186245 -0600
+++ b/MAINTAINERS	2013-11-08 15:22:50.685245977 -0600
@@ -9387,8 +9387,8 @@ F:	drivers/xen/*swiotlb*
 XFS FILESYSTEM
 P:	Silicon Graphics Inc
+M:	Dave Chinner dchin...@fromorbit.com
 M:	Ben Myers b...@sgi.com
-M:	Alex Elder el...@kernel.org
 M:	x...@oss.sgi.com
 L:	x...@oss.sgi.com
 W:	http://oss.sgi.com/projects/xfs
Re: [RFC] extending splice for copy offloading
On 09/30/2013 04:00 PM, Bernd Schubert wrote: pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own interface? And userspace needs to address all of them differently? The NFS and SCSI groups have each defined a standard which Zach's proposal abstracts into a common user API. Distributed file systems tend to be rather unique and do not have similar standards bodies, but a lot of them could hide server-specific implementations under the currently proposed interfaces. What is not a good idea is to drag out the core, simple copy offload discussion for another 5 years to pull in every odd use case :) ric
Re: [RFC] extending splice for copy offloading
On 09/30/2013 10:46 AM, Miklos Szeredi wrote: On Mon, Sep 30, 2013 at 4:41 PM, Ric Wheeler wrote: The way the array-based offload (and some software-side reflink) works is not a byte-by-byte copy. We cannot assume that a valid count can be returned or that such a count would be an indication of a sequential segment of good data. The whole thing would normally have to be reissued. To make that a true assumption, you would have to mandate it in each of the specifications (and sw targets)... You're missing my point. - user issues SIZE_MAX splice request - fs issues *64M* (or whatever) request to offload - when that completes *fully* then we return 64M to userspace - if it completes partially, then we return an error to userspace Again, wouldn't that work? Thanks, Miklos Yes, if you send a copy offload command and it works, you can assume that it worked fully. It would be pretty interesting if that were not true :) If it fails, we cannot assume anything about partial completion. Ric
Re: [RFC] extending splice for copy offloading
On 09/30/2013 10:38 AM, Miklos Szeredi wrote: On Mon, Sep 30, 2013 at 4:28 PM, Ric Wheeler wrote: On 09/30/2013 10:24 AM, Miklos Szeredi wrote: On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler wrote: On 09/30/2013 10:51 AM, Miklos Szeredi wrote: On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields wrote: My other worry is about interruptibility/restartability. Ideas? What happens on splice(from, to, 4G) and it's a non-reflink copy? Can the page cache copy be made restartable? Or should splice() be allowed to return a short count? What happens on (non-reflink) remote copies and huge request sizes? If I were writing an application that required copies to be restartable, I'd probably use the largest possible range in the reflink case but break the copy into smaller chunks in the splice case. The app really doesn't want to care about that. And it doesn't want to care about restartability, etc. It's something the *kernel* has to care about. You just can't have uninterruptible syscalls that sleep for a "long" time, otherwise first you'll just have annoyed users pressing ^C in vain; then, if the sleep is even longer, warnings about tasks sleeping too long. One idea is letting splice() return a short count, so the app can safely issue SIZE_MAX requests and the kernel can decide if it can copy the whole file in one go or if it wants to do it in smaller chunks. You cannot rely on a short count. That implies that an offloaded copy starts at byte 0 and that the first short-count bytes are all valid. Huh? - app calls splice(from, 0, to, 0, SIZE_MAX) 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX) 1.a) fs reflinks the whole file in a jiffy and returns the size of the file 1.b) fs does copy offload of, say, 64MB and returns 64M 2) VFS does page copy of, say, 1MB and returns 1MB - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset ... The point is: the app is always doing the same thing (incrementing the offset by the return value from splice) and the kernel can decide the best size it can service within a single uninterruptible syscall. Wouldn't that work? No. Keep in mind that the offload operation in (1) might fail partially. The target file (the copy) is allocated; the question is what ranges have valid data. You are talking about case 1.a, right? So if the offload copy of 0-64MB fails partially, we return failure from splice, yet some of the copy did succeed. Is that the problem? Why? Thanks, Miklos The way the array-based offload (and some software-side reflink) works is not a byte-by-byte copy. We cannot assume that a valid count can be returned or that such a count would be an indication of a sequential segment of good data. The whole thing would normally have to be reissued. To make that a true assumption, you would have to mandate it in each of the specifications (and sw targets)... ric
Re: [RFC] extending splice for copy offloading
On 09/30/2013 10:24 AM, Miklos Szeredi wrote: On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler wrote: On 09/30/2013 10:51 AM, Miklos Szeredi wrote: On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields wrote: My other worry is about interruptibility/restartability. Ideas? What happens on splice(from, to, 4G) and it's a non-reflink copy? Can the page cache copy be made restartable? Or should splice() be allowed to return a short count? What happens on (non-reflink) remote copies and huge request sizes? If I were writing an application that required copies to be restartable, I'd probably use the largest possible range in the reflink case but break the copy into smaller chunks in the splice case. The app really doesn't want to care about that. And it doesn't want to care about restartability, etc. It's something the *kernel* has to care about. You just can't have uninterruptible syscalls that sleep for a "long" time, otherwise first you'll just have annoyed users pressing ^C in vain; then, if the sleep is even longer, warnings about tasks sleeping too long. One idea is letting splice() return a short count, so the app can safely issue SIZE_MAX requests and the kernel can decide if it can copy the whole file in one go or if it wants to do it in smaller chunks. You cannot rely on a short count. That implies that an offloaded copy starts at byte 0 and that the first short-count bytes are all valid. Huh? - app calls splice(from, 0, to, 0, SIZE_MAX) 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX) 1.a) fs reflinks the whole file in a jiffy and returns the size of the file 1.b) fs does copy offload of, say, 64MB and returns 64M 2) VFS does page copy of, say, 1MB and returns 1MB - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset ... The point is: the app is always doing the same thing (incrementing the offset by the return value from splice) and the kernel can decide the best size it can service within a single uninterruptible syscall. Wouldn't that work? Thanks, Miklos No. Keep in mind that the offload operation in (1) might fail partially. The target file (the copy) is allocated; the question is what ranges have valid data. I don't see that (2) is interesting or really needs to be done in the kernel. If nothing else, it tends to confuse the discussion. ric
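To make the protocol Miklos describes above concrete, here is a minimal userspace sketch of the copy loop. The splice_copy() prototype stands in for the proposed regular-file-to-regular-file splice variant with explicit offsets; no syscall with this signature existed at the time of this thread, so treat it purely as an illustration of the offset-incrementing protocol, not a real kernel interface.

#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical wrapper for the proposed splice() variant: copies up to
 * len bytes between two regular files at the given offsets.  Returns
 * however many bytes the kernel chose to service in one go (a whole-file
 * reflink, a 64MB array offload, a 1MB page-cache copy, ...), 0 at EOF,
 * or -1 on error. */
extern ssize_t splice_copy(int fd_in, loff_t off_in,
                           int fd_out, loff_t off_out, size_t len);

static int copy_whole_file(int from, int to)
{
        loff_t off = 0;
        ssize_t n;

        for (;;) {
                /* Always ask for everything; the kernel picks the chunk
                 * it can service within one uninterruptible syscall. */
                n = splice_copy(from, off, to, off, SIZE_MAX);
                if (n == 0)
                        return 0;               /* EOF: copy complete */
                if (n < 0) {
                        if (errno == EINTR)
                                continue;       /* interrupted, retry */
                        return -1;
                }
                off += n;       /* advance by whatever was serviced */
        }
}

The application logic is identical across cases 1.a, 1.b, and 2 above; only the byte count serviced per call differs.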
Re: [RFC] extending splice for copy offloading
On 09/30/2013 10:51 AM, Miklos Szeredi wrote: On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields wrote: My other worry is about interruptibility/restartability. Ideas? What happens on splice(from, to, 4G) and it's a non-reflink copy? Can the page cache copy be made restartable? Or should splice() be allowed to return a short count? What happens on (non-reflink) remote copies and huge request sizes? If I were writing an application that required copies to be restartable, I'd probably use the largest possible range in the reflink case but break the copy into smaller chunks in the splice case. The app really doesn't want to care about that. And it doesn't want to care about restartability, etc. It's something the *kernel* has to care about. You just can't have uninterruptible syscalls that sleep for a "long" time, otherwise first you'll just have annoyed users pressing ^C in vain; then, if the sleep is even longer, warnings about tasks sleeping too long. One idea is letting splice() return a short count, so the app can safely issue SIZE_MAX requests and the kernel can decide if it can copy the whole file in one go or if it wants to do it in smaller chunks. Thanks, Miklos You cannot rely on a short count. That implies that an offloaded copy starts at byte 0 and that the first short-count bytes are all valid. I don't believe that is in fact required by all (any?) versions of the spec :) Best just to fail and restart the whole operation. Ric
Re: [RFC] extending splice for copy offloading
On 09/30/2013 10:34 AM, J. Bruce Fields wrote: On Mon, Sep 30, 2013 at 02:20:30PM +0200, Miklos Szeredi wrote: On Sat, Sep 28, 2013 at 11:20 PM, Ric Wheeler wrote: I don't see the safety argument as very compelling either. There are real semantic differences, however: ENOSPC on a write to an (apparently) already allocated block. That could be a bit unexpected. Do we need a fallocate extension to deal with shared blocks? The above has been the case for all enterprise storage arrays ever since the invention of snapshots. The NFSv4.2 spec does allow you to set a per-file attribute that causes the storage server to always preallocate enough buffers to guarantee that you can rewrite the entire file; however, the fact that we've lived without it for said 20 years leads me to believe that demand for it is going to be limited. I haven't put it at the top of the list of features we care to implement... Cheers, Trond I agree - this has been common behaviour for a very long time in the array space. Even without an array, this is the same as overwriting a block in btrfs or any file system with a read-write LVM snapshot. Okay, I'm convinced. So I suggest - mount(..., MNT_REFLINK): *allow* splice to reflink. If this is not set, fall back to page cache copy. - splice(... SPLICE_REFLINK): fail non-reflink copy. With this the app can force reflink. Both are trivial to implement and make sure that no backward incompatibility surprises happen. My other worry is about interruptibility/restartability. Ideas? What happens on splice(from, to, 4G) and it's a non-reflink copy? Can the page cache copy be made restartable? Or should splice() be allowed to return a short count? What happens on (non-reflink) remote copies and huge request sizes? If I were writing an application that required copies to be restartable, I'd probably use the largest possible range in the reflink case but break the copy into smaller chunks in the splice case. For that reason I don't like the idea of a mount option--the choice is something that the application probably wants to make (or at least to know about). The NFS COPY operation, as specified in current drafts, allows for asynchronous copies but leaves the state of the file undefined in the case of an aborted COPY. I worry that agreeing on standard behavior in the case of an abort might be difficult. --b. I think that this is still confusing - reflink and array copy offload should not be differentiated. In effect, they should often be within the same order of magnitude in performance and possibly even use the same or very similar techniques (just on different sides of the initiator/target transaction!). It is much simpler to let the application fail if the offload (or reflink) is not supported and let it fall back to a traditional copy. Then you always send the largest possible offload operation and do whatever you do now if that fails. thanks! Ric
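Ric's "send the largest possible offload and fall back if it fails" strategy, sketched in C. SPLICE_REFLINK and the splice_copy() wrapper are the proposed names from this thread (not merged interfaces, so both are assumptions here), and the fallback restarts the destination from scratch, since nothing can be assumed about partial completion of a failed offload.

#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>

#define SPLICE_REFLINK 0x1      /* proposed flag: fail non-offloaded copy */

/* hypothetical wrapper for the proposed flagged splice variant */
extern ssize_t splice_copy(int fd_in, loff_t off_in, int fd_out,
                           loff_t off_out, size_t len, unsigned int flags);

static int copy_file(int from, int to, size_t size)
{
        char buf[64 * 1024];
        ssize_t r;

        /* One maximal offload attempt: reflink, array, or server copy. */
        if (splice_copy(from, 0, to, 0, size, SPLICE_REFLINK) == (ssize_t)size)
                return 0;

        /* Offload unsupported or failed: the valid ranges in the target
         * are unknown, so restart the whole copy the traditional way. */
        if (lseek(from, 0, SEEK_SET) < 0 || lseek(to, 0, SEEK_SET) < 0 ||
            ftruncate(to, 0) < 0)
                return -1;

        while ((r = read(from, buf, sizeof(buf))) > 0) {
                char *p = buf;
                while (r > 0) {
                        ssize_t w = write(to, p, r);
                        if (w < 0)
                                return -1;
                        p += w;
                        r -= w;
                }
        }
        return r == 0 ? 0 : -1;
}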
Re: [RFC] extending splice for copy offloading
On 09/28/2013 11:20 AM, Myklebust, Trond wrote: -Original Message- From: Miklos Szeredi [mailto:mik...@szeredi.hu] Sent: Saturday, September 28, 2013 12:50 AM To: Zach Brown Cc: J. Bruce Fields; Ric Wheeler; Anna Schumaker; Kernel Mailing List; Linux-Fsdevel; linux-...@vger.kernel.org; Myklebust, Trond; Schumaker, Bryan; Martin K. Petersen; Jens Axboe; Mark Fasheh; Joel Becker; Eric Wong Subject: Re: [RFC] extending splice for copy offloading On Fri, Sep 27, 2013 at 10:50 PM, Zach Brown wrote: Also, I don't get the first option above at all. The argument is that it's safer to have more copies? How much safety does another copy on the same disk really give you? Do systems that do dedup provide interfaces to turn it off per-file? I don't see the safety argument as very compelling either. There are real semantic differences, however: ENOSPC on a write to an (apparently) already allocated block. That could be a bit unexpected. Do we need a fallocate extension to deal with shared blocks? The above has been the case for all enterprise storage arrays ever since the invention of snapshots. The NFSv4.2 spec does allow you to set a per-file attribute that causes the storage server to always preallocate enough buffers to guarantee that you can rewrite the entire file; however, the fact that we've lived without it for said 20 years leads me to believe that demand for it is going to be limited. I haven't put it at the top of the list of features we care to implement... Cheers, Trond I agree - this has been common behaviour for a very long time in the array space. Even without an array, this is the same as overwriting a block in btrfs or any file system with a read-write LVM snapshot. Regards, Ric
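As an aside on the fallocate question Miklos raises: much later kernels did grow exactly such an extension, FALLOC_FL_UNSHARE_RANGE, which unshares any shared blocks in a range so that a subsequent overwrite cannot fail with ENOSPC. A minimal sketch of an application using it to guard against the snapshot/reflink ENOSPC surprise follows (requires a reflink-capable filesystem such as XFS; the flag did not exist at the time of this thread):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <sys/types.h>

/* Unshare all blocks backing the first len bytes of fd so that
 * rewriting them in place cannot hit ENOSPC, even if the file was
 * reflinked or lives on a snapshotted volume. */
static int guarantee_rewrite(int fd, off_t len)
{
        if (fallocate(fd, FALLOC_FL_UNSHARE_RANGE, 0, len) < 0) {
                perror("fallocate(FALLOC_FL_UNSHARE_RANGE)");
                return -1;
        }
        return 0;
}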
Re: [RFC] extending splice for copy offloading
On 09/27/2013 12:47 AM, Miklos Szeredi wrote: On Thu, Sep 26, 2013 at 11:23 PM, Ric Wheeler wrote: On 09/26/2013 03:53 PM, Miklos Szeredi wrote: On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown wrote: But I'm not sure it's worth the effort; 99% of the use of this interface will be copying whole files. And for that perhaps we need a different API, one which has been discussed some time ago: an asynchronous copyfile() that returns immediately with a pollable event descriptor indicating copy progress, and some way to cancel the copy. And that can internally rely on ->direct_splice(), with appropriate algorithms for determining the optimal chunk size. And perhaps we don't. Perhaps we can provide this much simpler data-plane interface that works well enough for most everyone and can avoid going down the async rat hole, yet again. I think either buffering or async is needed to get good performance without too much complexity in the app (which is not good). Buffering works quite well for regular I/O, so maybe it's the way to go here as well. Thanks, Miklos Buffering misses the whole point of the copy offload - the idea is *not* to read or write the actual data in the most interesting cases, which offload the operation to a smart target device or file system. I meant buffering the COPY, not the data. Doing the COPY synchronously will always incur a performance penalty, the amount depending on the latency, which can be significant with networking. We think of write(2) as a synchronous interface, because that's the appearance we get from all that hard work the page cache and delayed writeback code do to make an asynchronous operation look as if it were synchronous. So from a userspace API perspective a sync interface is nice, but inside we almost always have async interfaces to do the actual work. Thanks, Miklos I think that you are an order of magnitude off here in thinking about the scale of the operations. An enabled, synchronous copy offload to an array (or one that turns into a reflink locally) is effectively the cost of the call itself. Let's say no slower than one IO to a S-ATA disk (10ms?) as a pessimistic guess. Realistically, that call is much faster than that worst-case number. Copying any substantial amount of data - like the target workload of VM images or media files - would be hundreds of MBs per copy, and that would take seconds or minutes. We should really work on getting the basic mechanism working and robust without any complications; then we can look at real, measured performance and see if there is any justification for adding complexity. thanks! Ric
Re: [RFC] extending splice for copy offloading
On 09/26/2013 02:55 PM, Zach Brown wrote: On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote: On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown wrote: A client-side copy will be slower, but I guess it does have the advantage that the application can track progress to some degree, and abort it fairly quickly without leaving the file in a totally undefined state--and both might be useful if the copy's not a simple constant-time operation. I suppose, but can't the app achieve a nice middle ground by copying the file in smaller syscalls? Avoid bulk data motion back to the client, but still get notification every, I dunno, few hundred meg? Yes. And if "cp" could just be switched from a read+write syscall pair to a single splice syscall using the same buffer size, then the user would only notice that things got faster in the case of server-side copy. No problems with long blocking times (at least not much worse than it was). Hmm, yes, that would be a nice outcome. However "cp" doesn't do reflinking by default; it has a switch for that. If we just want "cp" and the like to use splice without fearing side effects, then by default we should try to be as close to read+write behavior as possible. No? I guess? I don't find requiring --reflink hugely compelling. But there it is. That's what I'm really worrying about when you want to wire up splice to reflink by default. I do think there should be a flag for that. And if on the block level some magic happens, so be it. It's not the fs developer's worry any more ;) Sure. So we'd have: - a no-flag default that forbids knowingly copying with shared references, so that it will be used by default by people who feel strongly about their assumptions about independent write durability. - a flag that allows shared references for people who would otherwise use the file system shared-reference ioctls (ocfs2 reflink, btrfs clone) but would like it to also do server-side read/write copies over nfs without additional intervention. - a flag that requires shared references for callers who don't want giant copies to take forever if they aren't instant. (The qemu guys asked for this at Plumbers.) I think I can live with that. - z This last flag should not prevent a remote target device (NFS or SCSI array) copy from working, though, since they often do reflink-like operations inside of the remote target device ric
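For concreteness, Zach's three behaviours restated as a hypothetical flag triad, with Ric's caveat folded into the third. The names are invented for this sketch; no flags with exactly these semantics were merged from this thread.

/* Hypothetical copy-offload splice flags, one per case above: */
#define SPLICE_COPY_PLAIN    0x0  /* default: never knowingly share block
                                   * references; preserves assumptions about
                                   * independent write durability */
#define SPLICE_COPY_SHARED   0x1  /* sharing allowed: local reflink/clone,
                                   * or a server-side copy over NFS, with no
                                   * extra intervention by the application */
#define SPLICE_COPY_REQUIRED 0x2  /* fail unless offloaded, so giant copies
                                   * never degrade to slow data movement;
                                   * must still permit remote-target (NFS or
                                   * SCSI array) offload, which shares
                                   * references on the target side */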
Re: [RFC] extending splice for copy offloading
On 09/26/2013 03:53 PM, Miklos Szeredi wrote: On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown wrote: But I'm not sure it's worth the effort; 99% of the use of this interface will be copying whole files. And for that perhaps we need a different API, one which has been discussed some time ago: an asynchronous copyfile() that returns immediately with a pollable event descriptor indicating copy progress, and some way to cancel the copy. And that can internally rely on ->direct_splice(), with appropriate algorithms for determining the optimal chunk size. And perhaps we don't. Perhaps we can provide this much simpler data-plane interface that works well enough for most everyone and can avoid going down the async rat hole, yet again. I think either buffering or async is needed to get good performance without too much complexity in the app (which is not good). Buffering works quite well for regular I/O, so maybe it's the way to go here as well. Thanks, Miklos Buffering misses the whole point of the copy offload - the idea is *not* to read or write the actual data in the most interesting cases, which offload the operation to a smart target device or file system. Regards, Ric
Re: [RFC] extending splice for copy offloading
On 09/26/2013 11:34 AM, J. Bruce Fields wrote: On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote: On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown wrote: A client-side copy will be slower, but I guess it does have the advantage that the application can track progress to some degree, and abort it fairly quickly without leaving the file in a totally undefined state--and both might be useful if the copy's not a simple constant-time operation. I suppose, but can't the app achieve a nice middle ground by copying the file in smaller syscalls? Avoid bulk data motion back to the client, but still get notification every, I dunno, few hundred meg? Yes. And if "cp" could just be switched from a read+write syscall pair to a single splice syscall using the same buffer size. Will the various magic fs-specific copy operations become inefficient when the range copied is too small? (Totally naive question, as I have no idea how they really work.) --b. I think that it is not really possible to tell when we invoke it. It is very much target-device (or file system, etc.) dependent how long it takes. It could be as simple as a reflink copying in a smallish amount of metadata, or it could fall back to a full byte-by-byte copy. Also note that speed is not the only impact here; some of the mechanisms actually do not consume more space (they just increment shared data references). It would probably make more sense to send it off to the target device and have it return an error when not appropriate (then the app can fall back to the old-fashioned copy). ric And then the user would only notice that things got faster in case of server-side copy. No problems with long blocking times (at least not much worse than it was). However "cp" doesn't do reflinking by default; it has a switch for that. If we just want "cp" and the like to use splice without fearing side effects, then by default we should try to be as close to read+write behavior as possible. No? That's what I'm really worrying about when you want to wire up splice to reflink by default. I do think there should be a flag for that. And if on the block level some magic happens, so be it. It's not the fs developer's worry any more ;) Thanks, Miklos
Re: [RFC] extending splice for copy offloading
On 09/26/2013 11:34 AM, J. Bruce Fields wrote: On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote: On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown z...@redhat.com wrote: A client-side copy will be slower, but I guess it does have the advantage that the application can track progress to some degree, and abort it fairly quickly without leaving the file in a totally undefined state--and both might be useful if the copy's not a simple constant-time operation. I suppose, but can't the app achieve a nice middle ground by copying the file in smaller syscalls? Avoid bulk data motion back to the client, but still get notification every, I dunno, few hundred meg? Yes. And if cp could just be switched from a read+write syscall pair to a single splice syscall using the same buffer size. Will the various magic fs-specific copy operations become inefficient when the range copied is too small? (Totally naive question, as I have no idea how they really work.) --b. I think that is not really possible to tell when we invoke it. It is very much target device (or file system, etc) dependent on how long it takes. It could be as simple as a reflink copying in a smallish amount of metadata or fall back to a full byte-by-byte copy. Also note that speed is not the only impact here, some of the mechanisms actually do not consume more space (just increment shared data references). It would probably make more sense to send it off to the target device and have it return an error when not appropriate (then the app can fall back to the old fashion copy). ric And then the user would only notice that things got faster in case of server side copy. No problems with long blocking times (at least not much worse than it was). However cp doesn't do reflinking by default, it has a switch for that. If we just want cp and the like to use splice without fearing side effects then by default we should try to be as close to read+write behavior as possible. No? That's what I'm really worrying about when you want to wire up splice to reflink by default. I do think there should be a flag for that. And if on the block level some magic happens, so be it. It's not the fs deverloper's worry any more ;) Thanks, Miklos -- To unsubscribe from this list: send the line unsubscribe linux-nfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] extending splice for copy offloading
On 09/26/2013 03:53 PM, Miklos Szeredi wrote: On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown z...@redhat.com wrote: But I'm not sure it's worth the effort; 99% of the use of this interface will be copying whole files. And for that perhaps we need a different API, one which has been discussed some time ago: an asynchronous copyfile() that returns immediately with a pollable event descriptor indicating copy progress, and some way to cancel the copy. And that can internally rely on ->direct_splice(), with appropriate algorithms for determining the optimal chunk size. And perhaps we don't. Perhaps we can provide this much simpler data-plane interface that works well enough for most everyone and can avoid going down the async rat hole, yet again. I think either buffering or async is needed to get good performance without too much complexity in the app (which is not good). Buffering works quite well for regular I/O, so maybe it's the way to go here as well. Thanks, Miklos Buffering misses the whole point of the copy offload - the idea is *not* to read or write the actual data in the most interesting cases, which offload the operation to a smart target device or file system. Regards, Ric
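For concreteness, here is what the asynchronous interface Miklos describes might look like from userspace. Everything here is hypothetical - copyfile_async() and the progress record are invented names for illustration, and no such API was ever merged:

#include <poll.h>
#include <stdint.h>
#include <unistd.h>

struct copy_progress {          /* hypothetical event record */
    uint64_t bytes_copied;
    uint64_t bytes_total;
};

int copyfile_async(int src_fd, int dst_fd);  /* hypothetical syscall */

static int wait_for_copy(int src_fd, int dst_fd)
{
    int cfd = copyfile_async(src_fd, dst_fd);
    if (cfd < 0)
        return -1;

    struct pollfd pfd = { .fd = cfd, .events = POLLIN };
    struct copy_progress cp;

    for (;;) {
        poll(&pfd, 1, -1);
        if (read(cfd, &cp, sizeof(cp)) != sizeof(cp))
            break;                  /* copy failed */
        if (cp.bytes_copied == cp.bytes_total)
            break;                  /* copy complete */
        /* report progress here; closing cfd early would
         * cancel the in-flight copy */
    }
    close(cfd);
    return 0;
}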
Re: [RFC] extending splice for copy offloading
On 09/26/2013 02:55 PM, Zach Brown wrote: On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote: On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown z...@redhat.com wrote: A client-side copy will be slower, but I guess it does have the advantage that the application can track progress to some degree, and abort it fairly quickly without leaving the file in a totally undefined state--and both might be useful if the copy's not a simple constant-time operation. I suppose, but can't the app achieve a nice middle ground by copying the file in smaller syscalls? Avoid bulk data motion back to the client, but still get notification every, I dunno, few hundred meg? Yes. And if "cp" could just be switched from a read+write syscall pair to a single splice syscall using the same buffer size. And then the user would only notice that things got faster in the case of server-side copy. No problems with long blocking times (at least not much worse than it was). Hmm, yes, that would be a nice outcome. However, "cp" doesn't do reflinking by default; it has a switch for that. If we just want "cp" and the like to use splice without fearing side effects, then by default we should try to be as close to read+write behavior as possible. No? I guess? I don't find requiring --reflink hugely compelling. But there it is. That's what I'm really worrying about when you want to wire up splice to reflink by default. I do think there should be a flag for that. And if on the block level some magic happens, so be it. It's not the fs developer's worry any more ;) Sure. So we'd have:
- a no-flag default that forbids knowingly copying with shared references, so that it will be used by default by people who feel strongly about their assumptions about independent write durability.
- a flag that allows shared references, for people who would otherwise use the file system shared-reference ioctls (ocfs2 reflink, btrfs clone) but would like it to also do server-side read/write copies over nfs without additional intervention.
- a flag that requires shared references, for callers who don't want giant copies to take forever if they aren't instant. (The qemu guys asked for this at Plumbers.)
I think I can live with that. - z This last flag should not prevent a remote target device (NFS or SCSI array) copy from working, though, since they often do reflink-like operations inside of the remote target device. ric
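Zach's three modes, written out as flags, might look like the sketch below. The macro names are invented here purely for illustration - nothing like them was merged, and the copy_file_range() syscall that eventually shipped defines no flags at all, leaving the shared-reference decision to the filesystem:

/* Hypothetical flag encoding of the three copy semantics above. */
#define COPY_REFLINK_NEVER    0x0  /* default: a real data copy,
                                      preserving assumptions about
                                      independent write durability */
#define COPY_REFLINK_ALLOW    0x1  /* share references when the fs or
                                      target device can (reflink,
                                      clone, server-side copy) */
#define COPY_REFLINK_REQUIRE  0x2  /* fail rather than fall back to a
                                      slow byte-by-byte copy (the qemu
                                      request from Plumbers) */

Per Ric's closing point, a REQUIRE-style flag would still have to let a remote target accept the copy, since the host cannot see whether the target implements it with shared references internally.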
Re: [GIT PULL] Btrfs
On 09/12/2013 11:36 AM, Chris Mason wrote: Mark Fasheh's offline dedup work is also here. In this case offline means the FS is mounted and active, but the dedup work is not done inline during file IO. This is a building block where utilities are able to ask the FS to dedup a series of extents. The kernel takes care of verifying the data involved really is the same. Today this involves reading both extents, but we'll continue to evolve the patches. Nice feature! Just a note: the "offline" label is really confusing. Other storage products typically call this "out of band", since you are online, just not doing the dedup synchronously during the actual write :) Ric
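The building block Chris describes - userspace nominates extents, the kernel verifies the bytes really match before sharing them - was exposed at the time as the btrfs-specific BTRFS_IOC_FILE_EXTENT_SAME ioctl and later generalized as FIDEDUPERANGE. A minimal sketch of the generalized form, assuming a kernel and filesystem that support it:

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }

    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_RDWR);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    /* One destination range; the API takes an array so a dedup
     * utility can test many candidates against one source extent. */
    struct file_dedupe_range *arg =
        calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
    if (!arg)
        return 1;
    arg->src_offset = 0;
    arg->src_length = 128 * 1024;       /* dedup the first 128 KiB */
    arg->dest_count = 1;
    arg->info[0].dest_fd = dst;
    arg->info[0].dest_offset = 0;

    if (ioctl(src, FIDEDUPERANGE, arg) < 0) {
        perror("FIDEDUPERANGE");
        return 1;
    }

    if (arg->info[0].status == FILE_DEDUPE_RANGE_SAME)
        printf("deduped %llu bytes\n",
               (unsigned long long)arg->info[0].bytes_deduped);
    else
        printf("ranges differ; nothing shared\n");
    return 0;
}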
Re: [RFC PATCH] scsi: Add failfast mode to avoid infinite retry loop
On 08/23/2013 05:10 AM, Eiichi Tsukata wrote: (2013/08/21 3:09), Ewan Milne wrote: On Tue, 2013-08-20 at 16:13 +0900, Eiichi Tsukata wrote: (2013/08/19 23:30), James Bottomley wrote: On Mon, 2013-08-19 at 18:39 +0900, Eiichi Tsukata wrote: Hello, This patch adds a scsi device failfast mode to avoid an infinite retry loop. Currently, scsi error handling in scsi_decide_disposition() and scsi_io_completion() unconditionally retries on some errors. This is because retryable errors are thought to be temporary and the scsi device is expected to soon recover from those errors. Normally, such a retry policy is appropriate because the device will soon recover from the temporary error state. But there is no guarantee that the device is able to recover from the error state immediately; some hardware errors may prevent the device from recovering, so a hardware error can result in an infinite command retry loop. In fact, a CHECK_CONDITION error with the sense key UNIT_ATTENTION caused an infinite retry loop in our environment. As the comments in the kernel source code say, UNIT_ATTENTION means the device must have had a power glitch and is expected to recover from that state immediately. But it seems that a hardware error caused a permanent UNIT_ATTENTION error. To solve the above problem, this patch introduces a scsi device "failfast mode". If failfast mode is enabled, the retry counts of all scsi commands are limited to scsi->allowed (== SD_MAX_RETRIES == 5). All commands are prohibited from retrying infinitely, and a command immediately fails when its retry count exceeds the upper limit. Failfast mode is useful on mission-critical systems which are required to keep running flawlessly, because they need to fail over to the secondary system once they detect failures. By default, failfast mode is disabled, because a failfast policy is not suitable for most use cases, which can accept I/O latency due to device hardware errors. To enable failfast mode (default disabled): # echo 1 > /sys/bus/scsi/devices/X:X:X:X/failfast To disable: # echo 0 > /sys/bus/scsi/devices/X:X:X:X/failfast Furthermore, I'm planning to make the upper limit count configurable. Currently, I have two plans to implement it: (1) set the same upper limit count on all errors, or (2) set an upper limit count on each error. The first implementation is simple and easy to implement but not flexible. Someone may want to set a different upper limit count for each error depending on the scsi device they use. The second implementation satisfies such a requirement but can be too fine-grained and annoying to configure, because there are so many scsi error codes. The default of 5 retries may be too many for some errors but too few for others. Which would be the appropriate implementation? Any comments or suggestions are welcome as usual. I'm afraid you'll need to propose another solution. We have a large selection of commands which, by design, retry until the command exceeds its timeout. UA is one of those (as are most of the others you're limiting). How do you kick this device out of its UA return (because that's the recovery that needs to happen)? James Thanks for reviewing, James. Originally, I planned that once the retry count exceeds its limit, a monitoring tool stops the server, with the scsi printk error message as the trigger. The current failfast mode implementation is that the command fails when the retry count exceeds its limit. However, I noticed that only printing error messages when the retry count is exceeded, without changing the retry logic, will be enough to stop the server and fail over.
Though there is no guarantee that a userspace application can work properly under a disk failure condition. So now I'm considering that just calling panic() on retry excess is better. For that reason, I propose adding a "panic_on_error" option as a sysfs parameter; if panic_on_error mode is enabled, the server panics immediately once it detects retry excess. Of course, it is disabled by default. I would appreciate it if you could give me some comments. Eiichi -- For what it's worth, I've seen a report of a case where a storage array returned a CHECK CONDITION with invalid sense data, which caused the command to be retried indefinitely. Thank you for commenting, Ewan. I appreciate your information about indefinite retry on CHECK CONDITION. I'm not sure what you can do about this, if the device won't ever complete a command without an error. Perhaps it should be offlined after sufficiently bad behavior. I don't think you want to panic on an error, though. In a clustered environment it is possible that the other systems will all fail in the same way, for example. -Ewan Yes, basically the device should be offlined on error detection. Just offlining the disk is enough when an error occurs on a disk that is "not" the OS-installed system disk. Panic is going too far in such a case. However, in a clustered environment where each computer uses its own disk and they do not share the same disk, calling panic() will be suitable when an error occurs in
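A rough sketch of the cap being proposed (not the actual patch): error paths that today requeue a command unconditionally would first consult a bounded retry count. The failfast field on scsi_device is the patch's addition, and the helper below is an invented illustration; only scmd->retries and scmd->allowed are real fields of struct scsi_cmnd:

#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_device.h>

static bool failfast_should_fail(struct scsi_cmnd *scmd)
{
    struct scsi_device *sdev = scmd->device;

    if (!sdev->failfast)        /* proposed sysfs-backed flag */
        return false;

    /*
     * Cap every retryable error at scmd->allowed
     * (SD_MAX_RETRIES == 5) instead of retrying until the command
     * timeout, so a device stuck permanently in UNIT ATTENTION
     * surfaces as an I/O error that failover tooling can act on.
     */
    return ++scmd->retries > scmd->allowed;
}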
Re: [Ksummit-2013-discuss] [ATTEND] How to act on LKML
On 07/20/2013 01:04 PM, Ben Hutchings wrote: On Fri, 2013-07-19 at 13:42 -0500, Felipe Contreras wrote: On Fri, Jul 19, 2013 at 7:08 AM, Ingo Molnar wrote: * Felipe Contreras wrote: As Linus already pointed out, not everybody has to work with everybody. That's not the point though, the point is to potentially roughly double the creative brain capacity of the Linux kernel project. Unfortunately that's impossible; we all know there aren't as many women programmers as there are men. In some countries, though not all. But we also know (or should realise) that the gender ratio among programmers in general is much less unbalanced than in some free software communities, including the Linux kernel developers. Just a couple of data points to add. When I was in graduate school in Israel, we had more women doing their PhD than men. Not a huge sample, but it was interesting. The counter sample is the number of coding women we have at Red Hat on the kernel team. We are at around zero per cent. Certainly a sign that we need to do better, regardless of the broader community challenges... Ric
Re: [Ksummit-2013-discuss] [ATTEND] scsi-mq prototype discussion
On 07/17/2013 12:52 AM, James Bottomley wrote: On Tue, 2013-07-16 at 15:15 -0600, Jens Axboe wrote: On Tue, Jul 16 2013, Nicholas A. Bellinger wrote: On Sat, 2013-07-13 at 06:53 +0000, James Bottomley wrote: On Fri, 2013-07-12 at 12:52 +0200, Hannes Reinecke wrote: On 07/12/2013 03:33 AM, Nicholas A. Bellinger wrote: On Thu, 2013-07-11 at 18:02 -0700, Greg KH wrote: On Thu, Jul 11, 2013 at 05:23:32PM -0700, Nicholas A. Bellinger wrote: Drilling down the work items ahead of a real mainline push is high on the priority list for discussion. The parties to be included in such a discussion are:
- Jens Axboe (blk-mq author)
- James Bottomley (scsi maintainer)
- Christoph Hellwig (scsi)
- Martin Petersen (scsi)
- Tejun Heo (block + libata)
- Hannes Reinecke (scsi error recovery)
- Kent Overstreet (block, per-cpu ida)
- Stephen Cameron (scsi-over-pcie driver)
- Andrew Vasquez (qla2xxx LLD)
- James Smart (lpfc LLD)
Isn't this something that should have been discussed at the storage mini-summit a few months ago? The scsi-mq prototype, along with blk-mq (in its current form), did not exist a few short months ago. ;) It seems very specific to one subsystem to be a kernel summit topic, don't you think? It's no more subsystem specific than half of the other proposals so far, and given its reach across multiple subsystems (block, scsi, target) and the amount of off-list interest in the topic, I think it would make a good candidate for discussion. And it'll open up new approaches which previously were dismissed, like re-implementing multipathing on top of scsi-mq, giving us a single scsi device like other UNIX systems. Also, I do think there's quite some synergy to be had, as with blk-mq we could nail each queue to a processor, which would eliminate the need for locking. Which could be useful for other subsystems, too. Let's start with discussing this on the list, please, and then see where we go from there ... Yes, the discussion is beginning to make its way to the list. I've mostly been waiting for blk-mq to get a wider review before taking the early scsi-mq prototype driver to a larger public audience. Primarily, I'm now reaching out to the people most affected by existing scsi_request_fn() based performance limitations. Most of them have abandoned existing scsi_request_fn() based logic in favor of raw block make_request() based drivers, and are now estimating the amount of effort to move to an scsi-mq based approach. Regardless, as the prototype progresses over the next months, having a face-to-face discussion with the key parties in the room would be very helpful, given the large amount of effort involved to actually make this type of generational shift in SCSI happen. There's a certain amount of overlap with the aio/O_DIRECT work as well. But if it's not a general session, it could always be a BOF or something. I'll second the argument that most technical topics probably DO belong in a topic-related workshop. But that leaves us with basically only process-related topics at KS; I don't think it hurts to have a bit of tech meat on the bone too. At least I personally miss that part of KS from years gone by. Heh, well, given that most of the block mq discussions at LSF have been you saying you really should get around to cleaning up and posting the code, you'll understand my wanting to see that happen first ... I suppose we could try to run a storage workshop within KS, but I think most of the mini-summit slots have already gone.
There's also Plumbers if all slots are gone (I would say that, being biased and on the programme committee). Ric is running the Storage and Filesystems MC: http://www.linuxplumbersconf.org/2013/ocw/events/LPC2013/tracks/159 James And we are still looking for suggested topics - it would be great to have the multi-queue work at Plumbers. You can post a proposal for it (or other topics) here: http://www.linuxplumbersconf.org/2013/ocw/events/LPC2013/proposals Ric
Re: [Ksummit-2013-discuss] [ATTEND] How to act on LKML
On 07/16/2013 07:53 PM, Myklebust, Trond wrote: On Tue, 2013-07-16 at 19:31 -0400, Ric Wheeler wrote: On 07/16/2013 07:12 PM, Sarah Sharp wrote: On Tue, Jul 16, 2013 at 06:54:59PM -0400, Steven Rostedt wrote: On Tue, 2013-07-16 at 15:43 -0700, Sarah Sharp wrote: Yes, that's true. Some kernel developers are better at moderating their comments and tone towards individuals who are "sensitive". Others simply don't give a shit. So we need to figure out how to meet somewhere in the middle, in order to establish a baseline of civility. I have to ask this because I'm thick, and don't really understand, but ... What problem exactly are we trying to solve here? Personal attacks are not cool, Steve. Some people simply don't care if a verbal tirade is directed at them. Others do not want anyone to attack them personally, but they're fine with people attacking their code. Bystanders that don't understand the kernel community structure are discouraged from contributing because they don't want to be verbally abused, and they really don't want to see either personal attacks or intense belittling, demeaning comments about code. In order to make our community better, we need to figure out where the baseline of "good" behavior is. We need to define what behavior we want from both maintainers and patch submitters. E.g. "No regressions" and "don't break userspace" and "no personal attacks". That needs to be written down somewhere, and it isn't. If it's documented somewhere, point me to the file in Documentation. Hint: it's not there. That is the problem. Sarah Sharp The problem you are pointing out - and it is a problem - makes us less effective as a community. Not really. Most of the people who already work as part of this community are completely used to it. We've created the environment, and have no problems with it. You should never judge success by being popular with those people who are already contributing and put up with things. If you did that in business, you would never reach new customers. Where it could possibly be a problem is when it comes to recruiting _new_ members to our community. Particularly so given that some journalists take a special pleasure in reporting particularly juicy comments and antics. That would tend to scare off a lot of gun-shy newbies. That is my point - recruiting new members is made harder. As someone who manages *a lot* of upstream kernel developers, I will add that it is not just newcomers who find this occasionally offensive and off-putting. On the other hand, it might tend to bias our recruitment toward people of a more "special" disposition. Perhaps we finally need the services of a social scientist to help us find out... To be fair, we usually do very well at this, especially with newcomers to our community. I think that most of the problems come up between people who know each other quite well and are friendly with each other in person. The problem is that when you use language that you would use with good friends over drinks to tell them they are being stupid, and do that on a public list, you set a tone that reaches far beyond your intended target. All of those newcomers also read this list and do not see it as funny or friendly. I really don't think that we have to be politically correct or overly kind to make things better. As a very low bar, we could start by trying to avoid using language that would get you fired when you send off an email to someone that you have power over (either manage directly or indirectly control their career).
Ric
Re: [Ksummit-2013-discuss] [ATTEND] How to act on LKML
On 07/16/2013 07:12 PM, Sarah Sharp wrote: On Tue, Jul 16, 2013 at 06:54:59PM -0400, Steven Rostedt wrote: On Tue, 2013-07-16 at 15:43 -0700, Sarah Sharp wrote: Yes, that's true. Some kernel developers are better at moderating their comments and tone towards individuals who are "sensitive". Others simply don't give a shit. So we need to figure out how to meet somewhere in the middle, in order to establish a baseline of civility. I have to ask this because I'm thick, and don't really understand, but ... What problem exactly are we trying to solve here? Personal attacks are not cool, Steve. Some people simply don't care if a verbal tirade is directed at them. Others do not want anyone to attack them personally, but they're fine with people attacking their code. Bystanders that don't understand the kernel community structure are discouraged from contributing because they don't want to be verbally abused, and they really don't want to see either personal attacks or intense belittling, demeaning comments about code. In order to make our community better, we need to figure out where the baseline of "good" behavior is. We need to define what behavior we want from both maintainers and patch submitters. E.g. "No regressions" and "don't break userspace" and "no personal attacks". That needs to be written down somewhere, and it isn't. If it's documented somewhere, point me to the file in Documentation. Hint: it's not there. That is the problem. Sarah Sharp The problem you are pointing out - and it is a problem - makes us less effective as a community. Getting the balance right is clearly difficult in a large, diverse community, but I do think that the key is to focus criticism on the code or technical arguments and avoid attacks on the individual. Being direct and funny in a critique is not the core of the issue. Ric
Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
On 05/15/2013 04:03 PM, Zach Brown wrote: On Wed, May 15, 2013 at 07:44:05PM +0000, Eric Wong wrote: Why introduce a new syscall instead of extending sys_splice? Personally, I think it's ugly to have different operations use the same syscall just because their arguments match. I agree with Zach - having a system call called "splice" do copy offloads is not intuitive. This is a very reasonable name for something that battled its way through several standards bodies (for NFS and SCSI :)), so we should give it a reasonable name. Thanks! Ric But that preference aside, sure, if the consensus is that we'd rather use the splice() entry point, then I can duct-tape the pieces together to make it work. If the user doesn't need an out offset, then sendfile() should also be able to transparently utilize COPY/CLONE_RANGE, too. Perhaps, yeah. - z
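For reference, the shape of the entry point being discussed. The exact argument list in this RFC may differ from the sketch below, which is an assumption on my part; what is certain is that the interface which eventually shipped, copy_file_range() in Linux 4.5, settled on this general form, with NULL offsets meaning "use and update the file position", as with read/write versus pread/pwrite:

#define _GNU_SOURCE
#include <sys/types.h>

/* sketch of the proposed vfs copy entry point */
ssize_t copy_range(int fd_in, loff_t *off_in,
                   int fd_out, loff_t *off_out,
                   size_t len);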
Re: openat(..., AT_UNLINKED) was Re: New copyfile system call - discuss before LSF?
On 03/31/2013 07:18 PM, Pavel Machek wrote: Hi! Take a look at how many actively used filesystems out there have some variant of sillyrename(), and explain what you want to do in those cases. Well. Yes, there are non-unix filesystems around. You have to deal with silly files on them, and this will not be different. So this would be a local POSIX filesystem only solution to a problem that has yet to be formulated? Problem is "classical create temp file then delete it" is racy. See the archives. That is a useful & common operation. Which race are you concerned with exactly? User wants to test for a file with name "foo.txt" (see the sketch after this post):
* create "foo.txt~" (or whatever)
* write contents into "foo.txt~"
* rename "foo.txt~" to "foo.txt"
Until the rename is done, the file does not exist and is not complete. You will potentially have a garbage file to clean up if the program (or system) crashes, but that is not racy in the classic sense, right? Well. If people rsync from you, they will start fetching the incomplete foo.txt~. Plus the garbage issue. That is not racy, just garbage (not trying to be pedantic, just trying to understand). I can see that the "~" file is annoying, but we have dealt with it for a *long* time :) Until it has the right name (on either the source or target system for rsync), it is not the file you are looking for. This is more of a garbage clean-up issue? Also. Plus sometimes you want a temporary "file" that is deleted. Terminals use it for history, etc... There you would have a race; you can of course create a file and unlink it and still write to it, but you would have a potential empty-file issue? Ric
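The three steps Ric lists are the decades-old pattern in code form: write to a temporary name, make the data durable, then rename over the target. rename(2) is atomic with respect to the name, so readers see either the old complete file or the new complete file; the cost is the leftover "foo.txt~" garbage if we crash before the rename:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int write_file_atomically(const char *path,
                                 const void *buf, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s~", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len ||
        fsync(fd) != 0) {   /* data must be durable before rename */
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    /* Atomic with respect to the name: "path" is never observed
     * half-written, though "path~" can survive a crash. */
    return rename(tmp, path);
}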
Re: openat(..., AT_UNLINKED) was Re: New copyfile system call - discuss before LSF?
On 03/31/2013 06:50 PM, Pavel Machek wrote: On Sun 2013-03-31 18:44:53, Myklebust, Trond wrote: On Sun, 2013-03-31 at 20:32 +0200, Pavel Machek wrote: Hmm. open_deleted_file() will still need to get a directory... so it will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would be an acceptable interface? ...and what's the big plan to make this work on anything other than ext4 and btrfs? Deleted but open files are from original unix, so it should work on anything unixy (minix, ext, ext2, ...). minix, ext, ext2... are not under active development and haven't been for more than a decade. Take a look at how many actively used filesystems out there have some variant of sillyrename(), and explain what you want to do in those cases. Well. Yes, there are non-unix filesystems around. You have to deal with silly files on them, and this will not be different. So this would be a local POSIX filesystem only solution to a problem that has yet to be formulated? Problem is "classical create temp file then delete it" is racy. See the archives. That is a useful & common operation. Which race are you concerned with exactly? User wants to test for a file with name "foo.txt":
* create "foo.txt~" (or whatever)
* write contents into "foo.txt~"
* rename "foo.txt~" to "foo.txt"
Until the rename is done, the file does not exist and is not complete. You will potentially have a garbage file to clean up if the program (or system) crashes, but that is not racy in the classic sense, right? This is more of a garbage clean-up issue? Regards, Ric Problem is "atomically create file at target location with guaranteed right content". That's also in the archives. Looks useful if someone does rsync from your directory. Non-POSIX filesystems have problems handling deleted files, but that was always the case. That's one of the reasons they are seldom used for root filesystems. Pavel
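As it happens, the open_deleted_file() idea landed in different clothing as O_TMPFILE just a few months after this thread (Linux 3.11): open an unnamed file in a directory, write it, and only link it into the namespace once it is complete. A minimal sketch, assuming a filesystem with O_TMPFILE support:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int create_complete_file(const char *dir, const char *path,
                                const void *buf, size_t len)
{
    /* Unnamed file: if we crash here there is no garbage to
     * clean up, and no transient name for rsync to trip over. */
    int fd = open(dir, O_TMPFILE | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }

    /* Give the finished file a name via its /proc handle.
     * Unlike rename(), this fails if the name already exists. */
    char proc[64];
    snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
    int ret = linkat(AT_FDCWD, proc, AT_FDCWD, path,
                     AT_SYMLINK_FOLLOW);
    close(fd);
    return ret;
}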
Re: New copyfile system call - discuss before LSF?
On 03/30/2013 05:57 PM, Myklebust, Trond wrote: On Mar 30, 2013, at 5:45 PM, Pavel Machek wrote: On Sat 2013-03-30 13:08:39, Andreas Dilger wrote: On 2013-03-30, at 12:49 PM, Pavel Machek wrote: Hmm, really? AFAICT it would be simple to provide an open_deleted_file("directory") syscall. You'd open_deleted_file(), copy the source file into it, then fsync(), then link it into the filesystem. That should have the atomicity properties reflected. Actually, the open_deleted_file() syscall is quite useful for many different things all by itself. Lots of applications need to create temporary files that are unlinked on application failure (without a race if the app crashes after creating the file but before unlinking it). It also avoids exposing temporary files into the namespace if other applications are accessing the directory. Hmm. open_deleted_file() will still need to get a directory... so it will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would be an acceptable interface? Pavel ...and what's the big plan to make this work on anything other than ext4 and btrfs? Cheers, Trond I know that change can be a good thing, but are we really solving a pressing problem, given that application developers have dealt with open/rename as the way to get "atomic" file creation for several decades now? Regards, Ric
Re: New copyfile system call - discuss before LSF?
On 02/25/2013 04:14 PM, Andy Lutomirski wrote: On 02/21/2013 02:24 PM, Zach Brown wrote: On Thu, Feb 21, 2013 at 08:50:27PM +0000, Myklebust, Trond wrote: On Thu, 2013-02-21 at 21:00 +0100, Paolo Bonzini wrote: On 21/02/2013 15:57, Ric Wheeler wrote: sendfile64() pretty much already has the right arguments for a "copyfile"; however, it would be nice to add a 'flags' parameter: the NFSv4.2 version would use that to specify whether or not to copy file metadata. That would seem to be enough to me, and it has the advantage that it is a relatively obvious extension to something that is at least not totally unknown to developers. Do we need more than that for non-NFS paths, I wonder? What does reflink need, or the SCSI mechanism? For virt we would like to be able to specify arbitrary block ranges. Copying an entire file helps some copy operations like storage migration. However, it is not enough to convert the guest's offloaded copies to host-side offloaded copies. So how would a system call based on sendfile64() plus my flag parameter prevent an underlying implementation from meeting your criterion? If I'm guessing correctly, sendfile64()+flags would be annoying because it's missing an out_fd_offset. The host will want to offload the guest's copies by calling sendfile on block ranges of a guest disk image file that correspond to the mappings of the in and out files in the guest. You could make it work with some locking and out_fd seeking to set the write offset before calling sendfile64()+flags, but ugh. ssize_t sendfile(int out_fd, int in_fd, off_t in_offset, off_t out_offset, size_t count, int flags); That seems closer. We might also want to pre-emptively offer iovs instead of offsets, because that's the very first thing that's going to be requested after people prototype having to iterate calling sendfile() for each contiguous copy region. I thought the first thing people would ask for is to atomically create a new file and copy the old file into it (at least on local file systems). The idea is that nothing should see an empty destination file, either by race or by crash. (This feature would perhaps be described as a pony, but it should be implementable.) This would be like a better link(2). --Andy Why would this need to be atomic? That would seem to be a very difficult property to provide across all target types with multi-GB files... Ric
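A usage sketch of the extended sendfile() Zach writes out above. The six-argument prototype is quoted from the thread and was never merged; the mapping structure and helper below are invented to illustrate the virt case, where both the source and destination ranges live in the same disk image file and explicit in/out offsets are what make the call usable:

#include <sys/types.h>

/* proposed (never merged) prototype, quoted from the thread */
ssize_t sendfile(int out_fd, int in_fd,
                 off_t in_offset, off_t out_offset,
                 size_t count, int flags);

struct region {                 /* illustrative mapping record */
    off_t in_off, out_off;
    size_t len;
};

static int offload_guest_copy(int img_fd,
                              const struct region *r, int n)
{
    for (int i = 0; i < n; i++) {
        /* One call per contiguous region of the image; this
         * per-region iteration is exactly why iovs were
         * suggested as the next step. */
        if (sendfile(img_fd, img_fd, r[i].in_off,
                     r[i].out_off, r[i].len, 0) < 0)
            return -1;
    }
    return 0;
}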
Re: New copyfile system call - discuss before LSF?
On 02/22/2013 10:47 AM, Paolo Bonzini wrote:
> On 21/02/2013 23:24, Zach Brown wrote:
>> You could make it work with some locking and out_fd seeking to set
>> the write offset before calling sendfile64()+flags, but ugh.
>>
>>   ssize_t sendfile(int out_fd, int in_fd, off_t in_offset,
>>                    off_t out_offset, size_t count, int flags);
>>
>> That seems closer. We might also want to pre-emptively offer iovs
>> instead of offsets, because that's the very first thing that's going
>> to be requested after people prototype having to iterate calling
>> sendfile() for each contiguous copy region.
>
> Indeed, I was about to propose exactly that. So that would be
> psendfilev. I don't think a plain psendfile is useful, since it can
> easily be provided at the libc level.
>
> Paolo

This seems to be suspiciously close to a clear consensus on how to move
forward after many years of spinning our wheels. Anyone want to promote
an actual patch before we change our collective minds?

Ric
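[Editor's note: as a strawman only, here is one way the psendfilev
variant Paolo names could be declared. Neither psendfilev() nor struct
sendfile_seg exists; the names are invented. It also shows Paolo's point
that a non-vectored psendfile needs no kernel support, since libc can
wrap the single-segment case.]

/* Strawman declaration only; nothing here is a real syscall. */
#include <sys/types.h>

struct sendfile_seg {
	off_t  in_offset;	/* read position in in_fd */
	off_t  out_offset;	/* write position in out_fd */
	size_t count;		/* bytes to copy for this segment */
};

ssize_t psendfilev(int out_fd, int in_fd,
		   const struct sendfile_seg *vec, int vcnt, int flags);

/* The single-segment convenience wrapper libc could provide. */
static inline ssize_t psendfile(int out_fd, int in_fd, off_t in_offset,
				off_t out_offset, size_t count, int flags)
{
	struct sendfile_seg seg = {
		.in_offset  = in_offset,
		.out_offset = out_offset,
		.count      = count,
	};

	return psendfilev(out_fd, in_fd, &seg, 1, flags);
}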
Re: New copyfile system call - discuss before LSF?
On 02/21/2013 11:13 PM, Myklebust, Trond wrote:
> On Thu, 2013-02-21 at 23:05 +0100, Ric Wheeler wrote:
>> On 02/21/2013 09:00 PM, Paolo Bonzini wrote:
>>> On 21/02/2013 15:57, Ric Wheeler wrote:
>>>> sendfile64() pretty much already has the right arguments for a
>>>> "copyfile", however it would be nice to add a 'flags' parameter:
>>>> the NFSv4.2 version would use that to specify whether or not to
>>>> copy file metadata.
>>>>
>>>> That would seem to be enough to me and has the advantage that it
>>>> is a relatively obvious extension to something that is at least
>>>> not totally unknown to developers. Do we need more than that for
>>>> non-NFS paths I wonder? What does reflink need or the SCSI
>>>> mechanism?
>>>
>>> For virt we would like to be able to specify arbitrary block
>>> ranges. Copying an entire file helps some copy operations like
>>> storage migration. However, it is not enough to convert the
>>> guest's offloaded copies to host-side offloaded copies.
>>>
>>> Paolo
>>
>> I don't think that the NFS protocol allows arbitrary ranges, but the
>> SCSI commands are range based. If I remember what the Windows people
>> said at a SNIA event a few years back, they have a requirement that
>> the target file be pre-allocated (at least for the SCSI based copy).
>> Not clear to me where they iterate over that target file to do the
>> block range copies, but I suspect it is in their kernel.
>
> The NFSv4.2 copy offload protocol _does_ allow the copying of
> arbitrary byte ranges. The main target for that functionality is
> indeed virtualisation and thin provisioning of virtual machines.

For background, here is a pointer to Fred Knight's SNIA talk on the
SCSI support for offload:

https://snia.org/sites/default/files2/SDC2011/presentations/monday/FrederickKnight_Storage_Data_Movement_Offload.pdf

and a talk from Spencer Shepler that gives some detail on the NFS spec,
including the "server side copy" bits:

https://snia.org/sites/default/files2/SDC2011/presentations/wednesday/SpencerShepler_IETF_NFSv4_Working_Group_v4.pdf

The talks both have references to the actual specs for the gory details.

Ric
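[Editor's note: to make the virtualisation case concrete, a sketch of
the guest-to-host translation that precedes any ranged copy, assuming a
raw (non-sparse-format) disk image where guest block N sits at file
offset N * block_size. All names and the 512-byte block size are
assumptions; formats like qcow2 would need a real extent lookup.]

/* Illustration only: linear mapping of a guest block-range copy onto
 * offsets within a raw disk image file. */
#include <sys/types.h>

#define GUEST_BLOCK_SIZE 512	/* assumed logical block size */

struct host_range {
	off_t  in_offset;	/* source offset within the image file */
	off_t  out_offset;	/* destination offset within the image file */
	size_t count;		/* bytes to copy */
};

static struct host_range guest_copy_to_host(off_t src_lba, off_t dst_lba,
					    size_t nblocks)
{
	struct host_range r = {
		.in_offset  = src_lba * GUEST_BLOCK_SIZE,
		.out_offset = dst_lba * GUEST_BLOCK_SIZE,
		.count      = nblocks * GUEST_BLOCK_SIZE,
	};

	return r;	/* feed this to a ranged copy interface */
}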
Re: New copyfile system call - discuss before LSF?
On 02/21/2013 09:00 PM, Paolo Bonzini wrote:
> On 21/02/2013 15:57, Ric Wheeler wrote:
>> sendfile64() pretty much already has the right arguments for a
>> "copyfile", however it would be nice to add a 'flags' parameter: the
>> NFSv4.2 version would use that to specify whether or not to copy
>> file metadata.
>>
>> That would seem to be enough to me and has the advantage that it is
>> a relatively obvious extension to something that is at least not
>> totally unknown to developers. Do we need more than that for non-NFS
>> paths I wonder? What does reflink need or the SCSI mechanism?
>
> For virt we would like to be able to specify arbitrary block ranges.
> Copying an entire file helps some copy operations like storage
> migration. However, it is not enough to convert the guest's offloaded
> copies to host-side offloaded copies.
>
> Paolo

I don't think that the NFS protocol allows arbitrary ranges, but the
SCSI commands are range based. If I remember what the Windows people
said at a SNIA event a few years back, they have a requirement that the
target file be pre-allocated (at least for the SCSI based copy). Not
clear to me where they iterate over that target file to do the block
range copies, but I suspect it is in their kernel.

Ric
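[Editor's note: the pre-allocation requirement Ric describes is Windows
behaviour, not a Linux rule, but a Linux caller could reserve the
destination up front with fallocate(2) before issuing ranged copies.
A minimal sketch; prepare_target() is an invented helper name.]

/* Sketch: allocate the full destination before any range copies, so the
 * target's blocks exist up front, mirroring the requirement Ric
 * attributes to the Windows SCSI copy offload path. */
#define _GNU_SOURCE		/* for fallocate(2) on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int prepare_target(const char *path, off_t total_len)
{
	int fd = open(path, O_WRONLY | O_CREAT, 0644);

	if (fd < 0) {
		perror("open");
		return -1;
	}
	/* mode 0: allocate blocks and extend i_size to total_len */
	if (fallocate(fd, 0, 0, total_len) < 0) {
		perror("fallocate");
		close(fd);
		return -1;
	}
	return fd;	/* ready for the offloaded range copies */
}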