Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Jens Axboe wrote:
> On Sat, Jun 02 2007, Tejun Heo wrote:
> > Hello,
> >
> > Jens Axboe wrote:
> > > > Would that be very different from issuing a barrier and not waiting for its completion? For ATA and SCSI, we'll have to flush the write-back cache anyway, so I don't see how we can get a performance advantage by implementing a separate WRITE_ORDERED. I think a zero-length barrier (haven't looked at the code yet, still recovering from jet lag :-) can serve as a genuine barrier without the extra write, though.
> > >
> > > As always, it depends :-)
> > >
> > > If you are doing pure flush barriers, then there's no difference. Unless you only guarantee ordering wrt previously submitted requests, in which case you can eliminate the post flush.
> > >
> > > If you are doing ordered tags, then just setting the ordered bit is enough. That is different from the barrier in that we don't need a flush or the FUA bit set.
> >
> > Hmmm... I'm feeling dense. A zero-length barrier also requires only one flush to separate the requests before and after it (haven't looked at the code yet, will soon). Can you enlighten me?
>
> Yeah, that's what the zero-length barrier implementation I posted does. Not sure if you have a question beyond that; if so, fire away :-)

I thought you were talking about adding BIO_RW_ORDERED instead of exposing a zero-length BIO_RW_BARRIER. Sorry about the confusion. :-)

--
tejun

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Design question
I am not sure there really is a best list for this, but this is the closest I can think of.

I am working on host software for a series of cards (www.picocomputing.com). All of these cards have an FPGA; most have a processor, memory, flash and other resources. They have different flavors of FPGA, CPU and bus (CompactFlash, PCI, CardBus, Express Bus, ...). The cards can run standalone, and they can run uClinux, Linux, GreenHills, or ...

My question is about what happens when they are inserted into a Linux host (Windows host software already exists). From the host side there are several major tasks that might be performed: reading and writing target memory or I/O space, preprogramming the FPGA or flash, or reading and writing to the hardware implemented in the FPGA, which may or may not require interaction with the target OS.

The performance and bandwidth requirements of data transfers vary greatly, most being fairly mellow, but occasionally either the bandwidth or the latency requirements can be high.

Some of the above is vague. We are not developing a specific piece of hardware for a specific use. We are working on a product and development environment that has nearly infinite uses. We have evolved a virtual channel architecture that allows creating IP that goes into the FPGA and can be accessed by applications on the host. There can be multiple cards in the same host (in some applications, large numbers). Despite all of the above, things are not all that complex, just incredibly flexible.

Anyway, that is the BIG picture.

Our original implementation was a character driver for each card, specific to its bus type, with minor device numbers for reading and writing different regions, and a large collection of ioctls to handle special functions. But the vast majority of actions consist of read/write.

I have been trying to decide if it makes sense to rewrite the driver as a VFS driver, with special file names for each channel or functional unit. I have also been trying to digest enough information on sysfs to determine if that is an appropriate approach.

Basically, I am trying to decide what kind of driver provides the best potential solution.
Re: [Patch 06/18] fs/logfs/compr.c
On Sun, 3 June 2007 23:58:43 +0200, Arnd Bergmann wrote:
> On Sunday 03 June 2007, Jörn Engel wrote:
> > +#define COMPR_LEVEL 3
> > +
> > +static DEFINE_MUTEX(compr_mutex);
> > +static struct z_stream_s stream;
>
> Is there a particular reason to choose '3' as the only compression level? Should this perhaps be a per-superblock option instead?

There is no particular reason. '3' should be a reasonable value for most people. If actual users want to change this value, I can make it a mount option as well. Right now I'm just lazy and doubt the merits.

> Also, I thought I saw discussion about making the mutex and stream per-superblock, but don't remember if the idea was discarded. If it was, you might want to add it to the won't-happen list.

It was more or less discarded. As long as the sweet spot for LogFS is small systems, saving memory is more important than multithreaded performance. Will add it to the list.

Jörn

--
Joern's library part 2:
http://www.art.net/~hopkins/Don/unix-haters/tirix/embarrassing-memo.html
Re: [Patch 09/18] fs/logfs/gc.c
On Mon, 4 June 2007 00:07:36 +0200, Arnd Bergmann wrote:
> On Sunday 03 June 2007, Jörn Engel wrote:
> > +static long decay(long t0, long t, long theta)
> > +{
> > +	long shift, fac;
> > +
> > +	if (t >= 32*theta)
> > +		return 0;
> > +
> > +	shift = t/theta;
> > +	fac = theta - (t%theta)/2;
> > +	return (t0 >> shift) * fac / theta;
> > +}
>
> I think it's confusing to work with 'long' arguments here. If you actually allow larger than 32 bit arguments, that means that the gc logic behaves differently on 32 and 64 bit CPUs, which I don't think is what you intended.

Different behaviour would be fine. This function will be used to pick good candidates for garbage collection. If one segment will get chosen over another depending on BITS_PER_LONG, either one would have been a good candidate anyway.

Hmm. Maybe I should s/32/BITS_PER_LONG/ in the function.

> Also, can any of the arguments be negative? How about making them all explicit u32 and u64 variables?

That would make sense, yes.

Jörn

--
I've never met a human being who would want to read 17,000 pages of documentation, and if there was, I'd kill him to get him out of the gene pool.
-- Joseph Costello
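As a rough illustration of the explicit-width variant suggested above, here is a user-space sketch. The name `decay_u32`, the use of `<stdint.h>` types in place of the kernel's `u32`/`u64`, and the handling of a zero `theta` are all assumptions made for this example; it is not the code from the LogFS patch. The bound is written as `t / theta >= 32` so that `32 * theta` cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical fixed-width variant of decay(): t0 is halved once per
 * elapsed period theta, with a linear correction inside each period.
 * A 64-bit intermediate keeps the multiplication from overflowing.
 */
static uint32_t decay_u32(uint32_t t0, uint32_t t, uint32_t theta)
{
	uint32_t shift, fac;

	if (theta == 0)
		return 0;	/* assumption: a zero period means fully decayed */
	if (t / theta >= 32)
		return 0;	/* a 32-bit value is shifted out after 32 periods */

	shift = t / theta;
	fac = theta - (t % theta) / 2;
	return (uint32_t)(((uint64_t)(t0 >> shift) * fac) / theta);
}
```

With fixed widths the result is the same on 32-bit and 64-bit CPUs, which is the property the review comment asks for.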
Re: LogFS take four
On Mon, 4 June 2007 00:18:21 +0200, Arnd Bergmann wrote:
> On Sunday 03 June 2007, Jörn Engel wrote:
> > Unchanged:
> > o error handling
> ...
> > Won't happen (unless I get convinced to do otherwise):
> > o Change LOGFS_BUG() and LOGFS_BUG_ON() to inline functions
> >   These are macros for very much the same reasons BUG() and BUG_ON() are.
>
> I wonder how many of your LOGFS_BUG{,_ON} still remain after the error handling is in place to deal with broken file system contents. Ideally, I'd say the current LOGFS_BUG() should be replaced with a function that prints a message about the kind of error it has hit (rate-limited), potentially calls logfs_crash_dump(), and remounts the medium read-only, but does _not_ call BUG().

That sounds fairly useful, actually. Do a WARN_ON(), call logfs_crash_dump(), remount read-only and finished. Rate-limiting might be unnecessary as the read-only thing already limits the rate to 1.

I like the idea.

Jörn

--
Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it.
-- Perlis's Programming Proverb #58, SIGPLAN Notices, Sept. 1982
Re: [Patch 14/18] fs/logfs/segment.c
On Mon, 4 June 2007 00:21:41 +0200, Arnd Bergmann wrote:
> On Sunday 03 June 2007, Jörn Engel wrote:
> > +static DEFINE_MUTEX(compr_mutex);
> > +
>
> It seems you define a static compr_mutex in both segment.c and in compr.c, and always lock them both at the same time. Is that a correct observation? Is it intentional, or an oversight on your side?

Lame coding on my side. It seems to have gotten lost in my notes, but this mutex should get removed and the protected memory made per-superblock. Unlike the zlib workspace it does not consume 300k, so there is no excuse for it here.

Jörn

--
Joern's library part 9:
http://www.scl.ameslab.gov/Publications/Gus/TwelveWays.html
Re: [Patch 05/18] fs/logfs/logfs.h
On Sun, 3 June 2007 23:50:55 +0200, Arnd Bergmann wrote:
> On Sunday 03 June 2007, Jörn Engel wrote:
> > +/**
> > + * struct logfs_device_ops - device access operations
> > + *
> > + * @read: read from the device
> > + * @write: write to the device
> > + * @erase: erase part of the device
> > + */
> > +struct logfs_device_ops {
> > +	int (*read)(struct super_block *sb, loff_t ofs, size_t len, void *buf);
> > +	int (*write)(struct super_block *sb, loff_t ofs, size_t len, void *buf);
> > +	int (*erase)(struct super_block *sb, loff_t ofs, size_t len);
> > +};
>
> I wonder if there is a way to document the prototypes of these function pointers with kerneldoc, other than having a typedef for each. What brought me to this point is that I first assumed they would return the number of bytes transferred, like read/write file operations, where your functions return zero on success.

I can just add a comment about the return code in the struct documentation. For the foreseeable future there will be exactly two instances of this structure. It's not as if every driver would implement this.

Jörn

--
A defeated army first battles and then seeks victory.
-- Sun Tzu
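One way the return convention could be spelled out in the struct documentation is sketched below in user-space C. Everything here is illustrative: `example_device_ops`, the `ramdev_*` helpers, `RAMDEV_SIZE` and the `void *dev` handle (standing in for `struct super_block *`) are hypothetical names, not part of the LogFS patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/**
 * struct example_device_ops - device access operations (illustrative)
 * @read:  read @len bytes at offset @ofs into @buf
 * @write: write @len bytes from @buf at offset @ofs
 *
 * Both callbacks return 0 on success or a negative errno value.
 * They never return a byte count: a transfer is all or nothing.
 */
struct example_device_ops {
	int (*read)(void *dev, size_t ofs, size_t len, void *buf);
	int (*write)(void *dev, size_t ofs, size_t len, const void *buf);
};

#define RAMDEV_SIZE 64	/* hypothetical device size in bytes */

/* RAM-backed stand-in for a device, only to exercise the convention. */
static int ramdev_read(void *dev, size_t ofs, size_t len, void *buf)
{
	if (ofs > RAMDEV_SIZE || len > RAMDEV_SIZE - ofs)
		return -EINVAL;
	memcpy(buf, (char *)dev + ofs, len);
	return 0;
}

static int ramdev_write(void *dev, size_t ofs, size_t len, const void *buf)
{
	if (ofs > RAMDEV_SIZE || len > RAMDEV_SIZE - ofs)
		return -EINVAL;
	memcpy((char *)dev + ofs, buf, len);
	return 0;
}

static const struct example_device_ops ramdev_ops = {
	.read = ramdev_read,
	.write = ramdev_write,
};
```

A caller then tests the result against zero rather than against the requested length, which is the distinction from the `ssize_t` style of the read/write file operations.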
Re: [Patch 04/18] include/linux/logfs.h
On Sun, 3 June 2007 23:42:25 +0200, Arnd Bergmann wrote:
> On Sunday 03 June 2007, Jörn Engel wrote:
> > +struct logfs_je_spillout {
> > +	__be64 so_segment[0];
> > +}__packed;
>
> All the on-disk data structures you define in this file have naturally aligned members, so the __packed attribute is not needed.

Amen. It is purely paranoia and I don't even know who is out to get me.

> However, I think it causes gcc to generate larger and slower code on some architectures, because now it has to assume that the data structure itself has no more than byte alignment.
>
> I'd simply remove all instances of __packed therefore. In order to verify that you got it right in all cases, build with '-Wpadded -Wpacked'.

Fine with me. Will do.

Jörn

--
Ninety percent of everything is crap.
-- Sturgeon's Law
Re: [Patch 05/18] fs/logfs/logfs.h
> On Sunday 03 June 2007, Jörn Engel wrote:
> > +/**
> > + * struct logfs_device_ops - device access operations
> > + *
> > + * @read: read from the device
> > + * @write: write to the device
> > + * @erase: erase part of the device
> > + */
> > +struct logfs_device_ops {
> > +	int (*read)(struct super_block *sb, loff_t ofs, size_t len, void *buf);
> > +	int (*write)(struct super_block *sb, loff_t ofs, size_t len, void *buf);
> > +	int (*erase)(struct super_block *sb, loff_t ofs, size_t len);
> > +};
>
> I wonder if there is a way to document the prototypes of these function pointers with kerneldoc, other than having a typedef for each. What brought me to this point is that I first assumed they would return the number of bytes transferred, like read/write file operations, where your functions return zero on success.

read/write functions returning bytes written would return ssize_t, just as vfs_read and vfs_write do.

Jan
Read/write counts
I have a file system that has really odd blocking. All files have a variable-length header (basically a directory entry) at their start. Most, but not all, sectors have a small fixed-length signature as well as some link data at their start.

The net result is that the implementation would be simpler if I could just read/write the amount of data that can be done with the least amount of work, even if that is less than was requested.

If I receive a request to read 512 bytes, and I return that I have read 486, is the OS, libc, or something else going to treat that as an error, or will they come back for the rest in a subsequent call?

I thought I recalled that read()/write() returning a count less than requested is not an error.
Re: [AppArmor 38/45] AppArmor: Module and LSM hooks
On Wed 2007-05-23 18:16:45, Andreas Gruenbacher wrote:
> On Tuesday 15 May 2007 11:14, Pavel Machek wrote:
> > Why is this configurable?
>
> The maximum length of a pathname is an arbitrary limit: we don't want to allocate arbitrary amounts of kernel memory for pathnames so we introduce this limit and set it to a reasonable value. In the unlikely case that someone uses insanely long pathnames, this limit can be increased.

vfs does not have a configurable pathname limit, and I do not see what is so special about AA to require this kind of ugliness.

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [AppArmor 38/45] AppArmor: Module and LSM hooks
On Monday 04 June 2007 12:55, Pavel Machek wrote:
> On Wed 2007-05-23 18:16:45, Andreas Gruenbacher wrote:
> > On Tuesday 15 May 2007 11:14, Pavel Machek wrote:
> > > Why is this configurable?
> >
> > The maximum length of a pathname is an arbitrary limit: we don't want to allocate arbitrary amounts of kernel memory for pathnames so we introduce this limit and set it to a reasonable value. In the unlikely case that someone uses insanely long pathnames, this limit can be increased.
>
> vfs does not have a configurable pathname limit, and I do not see what is so special about AA to require this kind of ugliness.

You very well know that the vfs has a limit of PATH_MAX characters (4096) for pathnames. This means that at most that many characters can be passed at once.

I've really had enough of your perpetual unfounded rants.
Re: [AppArmor 38/45] AppArmor: Module and LSM hooks
On Mon 2007-06-04 13:25:30, Andreas Gruenbacher wrote:
> On Monday 04 June 2007 12:55, Pavel Machek wrote:
> > On Wed 2007-05-23 18:16:45, Andreas Gruenbacher wrote:
> > > On Tuesday 15 May 2007 11:14, Pavel Machek wrote:
> > > > Why is this configurable?
> > >
> > > The maximum length of a pathname is an arbitrary limit: we don't want to allocate arbitrary amounts of kernel memory for pathnames so we introduce this limit and set it to a reasonable value. In the unlikely case that someone uses insanely long pathnames, this limit can be increased.
> >
> > vfs does not have a configurable pathname limit, and I do not see what is so special about AA to require this kind of ugliness.
>
> You very well know that the vfs has a limit of PATH_MAX characters (4096) for pathnames. This means that at most that many characters can be passed at once.

Sorry then. Why not reuse PATH_MAX when it exists already?

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [AppArmor 38/45] AppArmor: Module and LSM hooks
On Monday 04 June 2007 13:35, Pavel Machek wrote:
> On Mon 2007-06-04 13:25:30, Andreas Gruenbacher wrote:
> > On Monday 04 June 2007 12:55, Pavel Machek wrote:
> > > On Wed 2007-05-23 18:16:45, Andreas Gruenbacher wrote:
> > > > On Tuesday 15 May 2007 11:14, Pavel Machek wrote:
> > > > > Why is this configurable?
> > > >
> > > > The maximum length of a pathname is an arbitrary limit: we don't want to allocate arbitrary amounts of kernel memory for pathnames so we introduce this limit and set it to a reasonable value. In the unlikely case that someone uses insanely long pathnames, this limit can be increased.
> > >
> > > vfs does not have a configurable pathname limit, and I do not see what is so special about AA to require this kind of ugliness.
> >
> > You very well know that the vfs has a limit of PATH_MAX characters (4096) for pathnames. This means that at most that many characters can be passed at once.

What users can do is something like this:

	chdir("some/long/path");
	chdir("some/even/longer/path");
	...

and the total length of the path can then exceed PATH_MAX characters. We can only accept pathnames up to some upper limit, and we need to somehow define what that limit is supposed to be. We could use PATH_MAX or some other arbitrary number. In most situations PATH_MAX will be fine, but that's not always guaranteed to be the case.

So what's wrong with making this configurable for special situations that we might run into? Module parameters are *really* dead cheap.

Andreas
Re: [AppArmor 38/45] AppArmor: Module and LSM hooks
Hi!

> > You very well know that the vfs has a limit of PATH_MAX characters (4096) for pathnames. This means that at most that many characters can be passed at once.
>
> What users can do is something like this:
>
> 	chdir("some/long/path");
> 	chdir("some/even/longer/path");
> 	...
>
> and the total length of the path can then exceed PATH_MAX characters. We can only accept pathnames up to some upper limit, and we need to somehow define what that limit is supposed to be. We could use PATH_MAX or some other arbitrary number. In most situations PATH_MAX will be fine, but that's not always guaranteed to be the case.
>
> So what's wrong with making this configurable for special situations that we might run into? Module parameters are *really* dead cheap.

Parameters are cheap, but this one is ugly.

How will the kernel work with very long paths? I'd suspect some problems if a path is 1MB long and I attempt to print it in /proc somewhere.

Perhaps the vfs should be modified not to allow such crazy paths? But placing the limit in aa is ugly.

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [Patch 04/18] include/linux/logfs.h
On Mon, 2007-06-04 at 11:12 +0200, Jörn Engel wrote:
> On Sun, 3 June 2007 23:42:25 +0200, Arnd Bergmann wrote:
> > On Sunday 03 June 2007, Jörn Engel wrote:
> > > +struct logfs_je_spillout {
> > > +	__be64 so_segment[0];
> > > +}__packed;
> >
> > All the on-disk data structures you define in this file have naturally aligned members, so the __packed attribute is not needed.
>
> Amen. It is purely paranoia and I don't even know who is out to get me.

You can _never_ know who is out to get you, or what architecture we'll be ported to next week.

The advice "don't tell the compiler what you want unless you _know_ it'll do the wrong thing otherwise" runs counter to everything we've learned, slowly and painfully, over the last few years.

We should never rely on compiler behaviour which is undocumented and unrequired. Even if you know that the ABI forces it to continue to do the right thing on the platforms you _currently_ care about, it might not do it on new platforms (or existing platforms you didn't manage to test).

It would be better if GCC had a 'nopadding' attribute which gave us what we need without the _extra_ implications about alignment. In the absence of that, though, you should at _least_ have a check on the size of the structure if you're going to drop the packed attribute.

--
dwmw2
Re: [Patch 06/18] fs/logfs/compr.c
On Mon, 2007-06-04 at 10:54 +0200, Jörn Engel wrote:
> There is no particular reason. '3' should be a reasonable value for most people. If actual users want to change this value, I can make it a mount option as well. Right now I'm just lazy and doubt the merits.

I think you probably made the right choice. If you're compressing small chunks at a time, increasing the amount of history which the compressor will keep really doesn't buy you much.

--
dwmw2
Re: [NFS] [PATCH] locks: provide a file lease method enabling cluster-coherent leases
On Sat, Jun 02, 2007 at 02:21:22PM -0400, Trond Myklebust wrote:

Currently, the lease handling is done all in the VFS, and is done prior to calling any filesystem operations. Bruce's break_lease() inode operation allows the VFS to notify the filesystem that some operation is going to be called that requires the lease to be broken.

My point is that in doing so, you are not atomic with the operation that requires the lease to be broken. Some different node may re-establish a lease while we're calling back down into the filesystem to perform the operation.

So I agree with you. The break_lease() inode operation isn't going to work. The filesystem is going to have to figure out for itself when it needs to notify other nodes that the lease needs breaking, and it needs to figure out its own methods for ensuring atomicity.

OK, I agree with you both, thanks for the explanations.

It looks to me like there's probably a race in the existing code that will allow conflicting opens and leases to be granted simultaneously if the lease request is handled just after may_open() is called.

These checks at the beginning of __setlease() are an attempt to prevent that race:

	if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
		goto out;
	if ((arg == F_WRLCK)
	    && ((atomic_read(&dentry->d_count) > 1)
		|| (atomic_read(&inode->i_count) > 1)))
		goto out;

But, for example, in the case of a simultaneous write open and RDLCK lease request, I believe the call to setlease could come after the may_open() but before the call to get_write_access() that bumps i_writecount.

--b.
Re: [Patch 04/18] include/linux/logfs.h
On Mon, 4 June 2007 14:38:23 +0100, David Woodhouse wrote:
> On Mon, 2007-06-04 at 11:12 +0200, Jörn Engel wrote:
> > On Sun, 3 June 2007 23:42:25 +0200, Arnd Bergmann wrote:
> > > On Sunday 03 June 2007, Jörn Engel wrote:
> > > > +struct logfs_je_spillout {
> > > > +	__be64 so_segment[0];
> > > > +}__packed;
> > >
> > > All the on-disk data structures you define in this file have naturally aligned members, so the __packed attribute is not needed.
> >
> > Amen. It is purely paranoia and I don't even know who is out to get me.
>
> You can _never_ know who is out to get you, or what architecture we'll be ported to next week.
>
> The advice "don't tell the compiler what you want unless you _know_ it'll do the wrong thing otherwise" runs counter to everything we've learned, slowly and painfully, over the last few years.
>
> We should never rely on compiler behaviour which is undocumented and unrequired. Even if you know that the ABI forces it to continue to do the right thing on the platforms you _currently_ care about, it might not do it on new platforms (or existing platforms you didn't manage to test).
>
> It would be better if GCC had a 'nopadding' attribute which gave us what we need without the _extra_ implications about alignment. In the absence of that, though, you should at _least_ have a check on the size of the structure if you're going to drop the packed attribute.

Adding a size check is simple enough. But given all this I'll put it at the very end of my todo list. There are many other optimizations remaining.

Jörn

--
Joern's library part 14:
http://www.sandpile.org/
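A size check of the kind discussed above might look roughly like the following user-space sketch. The struct `example_disk_record` and its fields are made up for illustration and are not taken from the LogFS headers; C11 `_Static_assert` is used here, while in-kernel code would more likely use the kernel's BUILD_BUG_ON() macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk record with naturally aligned members. */
struct example_disk_record {
	uint64_t segment;	/* expected at byte offset 0 */
	uint32_t length;	/* expected at byte offset 8 */
	uint32_t crc;		/* expected at byte offset 12 */
};

/* Fail the build if the compiler pads or rearranges the layout. */
_Static_assert(sizeof(struct example_disk_record) == 16,
	       "example_disk_record must be exactly 16 bytes on disk");
_Static_assert(offsetof(struct example_disk_record, length) == 8,
	       "length must sit at byte offset 8");
_Static_assert(offsetof(struct example_disk_record, crc) == 12,
	       "crc must sit at byte offset 12");
```

The check costs nothing at run time and catches a platform whose ABI lays the structure out differently, without the alignment side effects of the packed attribute.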
Re: [AppArmor 38/45] AppArmor: Module and LSM hooks
On Monday 04 June 2007 15:12, Pavel Machek wrote:
> How will the kernel work with very long paths? I'd suspect some problems if a path is 1MB long and I attempt to print it in /proc somewhere.

Pathnames are only used for informational purposes in the kernel, except in AppArmor of course. /proc only uses pathnames in a few places, but /proc/mounts will silently fail and produce garbage entries. That's not ideal of course; we should fix that somehow.

Note that this has nothing to do with the AppArmor discussion ...

> Perhaps the vfs should be modified not to allow such crazy paths? But placing the limit in aa is ugly.

Dream on. Redefining fundamental vfs semantics is not an option; we should rather make sure that we fail gracefully.

Considering the alternatives, I still prefer the configurable limit. That's way more useful than allowing a process to DoS the kernel with AppArmor.

Andreas
Re: Read/write counts
On Jun 04, 2007 06:20 -0400, David H. Lynch Jr. wrote:
> The net result is that the implementation would be simpler if I could just read/write the amount of data that can be done with the least amount of work, even if that is less than was requested.
>
> If I receive a request to read 512 bytes, and I return that I have read 486, is the OS, libc, or something else going to treat that as an error, or will they come back for the rest in a subsequent call?
>
> I thought I recalled that read()/write() returning a count less than requested is not an error.

It is not strictly an error to read/write less than the requested amount, but you will find that a lot of applications don't handle this correctly. They will assume that if the amount read/written is != amount requested that this is an error. Of course the opposite is also true: some applications assume that the amount requested == amount read/written and don't even check whether that is actually the case or not.

Cheers, Andreas

--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: Read/write counts
> It is not strictly an error to read/write less than the requested amount, but you will find that a lot of applications don't handle this correctly.

I'd give it a slightly different nuance. It's not an error, and it's a reasonable thing to do, but there is value in not doing it. POSIX and its predecessors back to the beginning of Unix say read()/write() don't have to transfer the full count (they must transfer at least one byte). The main reason for this choice is that it may require more resources (e.g. a memory buffer) than the system can allocate to do the whole request at once.

Programs that assume a full transfer are fairly common, but are universally regarded as either broken or just lazy, and when it does cause a problem, it is far more common to fix the application than the kernel.

Most application programs access files via libc's fread/fwrite, which don't have partial transfers. GNU libc does handle partial (kernel) reads and writes correctly. I'd be surprised if someone can name a major application that doesn't.

--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
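For reference, an application that copes with partial transfers usually wraps read() in a loop along the following lines. This is a generic sketch; the helper name `read_full` is an assumption for illustration and is not a libc function.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to `count` bytes, retrying after short reads and EINTR.
 * Returns the number of bytes read (less than `count` only at end of
 * file), or -1 with errno set on error. Bytes already read before an
 * error are not reported in this simple version.
 */
static ssize_t read_full(int fd, void *buf, size_t count)
{
	char *p = buf;
	size_t done = 0;

	while (done < count) {
		ssize_t n = read(fd, p + done, count - done);

		if (n > 0)
			done += (size_t)n;	/* partial transfer: keep going */
		else if (n == 0)
			break;			/* end of file */
		else if (errno != EINTR)
			return -1;		/* real error */
		/* n < 0 with EINTR: interrupted before any data, retry */
	}
	return (ssize_t)done;
}
```

With a loop like this, a file system that returns 486 bytes for a 512-byte request simply causes a second read() call for the remaining 26 bytes.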
Re: Read/write counts
On Mon, Jun 04, 2007 at 09:56:07AM -0700, Bryan Henderson wrote:
> Programs that assume a full transfer are fairly common, but are universally regarded as either broken or just lazy, and when it does cause a problem, it is far more common to fix the application than the kernel.

Linus has explicitly forbidden short reads from being returned. The original poster may get away with it for a specialised case, but for example, signals may not cause a return to userspace with a short read for exactly this reason.
Re: Read/write counts
On Mon, Jun 04, 2007 at 11:02:23AM -0600, Matthew Wilcox wrote:
> On Mon, Jun 04, 2007 at 09:56:07AM -0700, Bryan Henderson wrote:
> > Programs that assume a full transfer are fairly common, but are universally regarded as either broken or just lazy, and when it does cause a problem, it is far more common to fix the application than the kernel.
>
> Linus has explicitly forbidden short reads from being returned. The original poster may get away with it for a specialised case, but for example, signals may not cause a return to userspace with a short read for exactly this reason.

Hmm, I'm not sure I would go that far. Per the POSIX specification, we support the optional BSD-style restartable system calls for signals which will avoid short reads; but this is only true if SA_RESTART is passed to sigaction(). Without SA_RESTART, we will indeed return short reads, as required by POSIX.

I don't think Linus has said that short reads are always evil; I certainly can't remember him ever making that statement. Do you have a pointer to a LKML message where he's said that?

- Ted
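The SA_RESTART behaviour mentioned above is selected per handler when it is installed with sigaction(). A minimal sketch follows; the helper names `on_signal` and `install_handler` are hypothetical and chosen only for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_signal;

/* Handler that only records that a signal arrived. */
static void on_signal(int signo)
{
	(void)signo;
	got_signal = 1;
}

/*
 * Install on_signal() for `signo`. With restart != 0, a slow system
 * call such as read() that is interrupted before transferring any data
 * is restarted instead of failing with EINTR.
 */
static int install_handler(int signo, int restart)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_signal;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = restart ? SA_RESTART : 0;
	return sigaction(signo, &sa, NULL);
}
```

A program that installs its handlers without SA_RESTART has to be prepared for EINTR and for short counts from read() and write(), for instance by looping until the full amount has been transferred.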
Re: Read/write counts
Hi,

On Mon, 4 Jun 2007, Theodore Tso wrote:
> Hmm, I'm not sure I would go that far. Per the POSIX specification, we support the optional BSD-style restartable system calls for signals which will avoid short reads; but this is only true if SA_RESTART is passed to sigaction(). Without SA_RESTART, we will indeed return short reads, as required by POSIX.
>
> I don't think Linus has said that short reads are always evil; I certainly can't remember him ever making that statement. Do you have a pointer to a LKML message where he's said that?

That's the last discussion about signals and I/O I can remember:
http://www.ussg.iu.edu/hypermail/linux/kernel/0208.0/0188.html

bye, Roman
Re: Read/write counts
On Mon, Jun 04, 2007 at 08:57:16PM +0200, Roman Zippel wrote:
> On Mon, 4 Jun 2007, Theodore Tso wrote:
> > Hmm, I'm not sure I would go that far. Per the POSIX specification, we support the optional BSD-style restartable system calls for signals which will avoid short reads; but this is only true if SA_RESTART is passed to sigaction(). Without SA_RESTART, we will indeed return short reads, as required by POSIX.
> >
> > I don't think Linus has said that short reads are always evil; I certainly can't remember him ever making that statement. Do you have a pointer to a LKML message where he's said that?
>
> That's the last discussion about signals and I/O I can remember:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0208.0/0188.html

He said 'disk read', not 'read(2)'. I'd expect he means certain things like stat(2) and readdir(2) when they have to go to disk. read(2) explicitly lists EINTR as a valid result, and often folks use signals to interrupt read(2). The world certainly writes programs to expect short read(2).

Joel

--
Gone to plant a weeping willow
On the bank's green edge it will roll, roll, roll.
Sing a lullaby beside the waters.
Lovers come and go, the river roll, roll, rolls.

Joel Becker
Principal Software Developer
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127
Re: Read/write counts
On Mon, Jun 04, 2007 at 08:57:16PM +0200, Roman Zippel wrote:

> That's the last discussion about signals and I/O I can remember: http://www.ussg.iu.edu/hypermail/linux/kernel/0208.0/0188.html

Well, I think Linus was saying that we have to do both (where the signal interrupts and where it doesn't), and I agree with that:

    There are enough reasons to discourage people from using uninterruptible sleep (this f*cking application won't die when the network goes down) that I don't think this is an issue. We need to handle both cases, and while we can expand on the two cases we have now, we can't remove them.

Fortunately, although the -ERESTARTSYS framework is a little awkward (and people can shoot arrows at me for creating it 15 years ago :-), we do have a way of supporting both styles without _too_ much pain.

- Ted
Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching
On Tuesday 15 May 2007 11:20, Pavel Machek wrote:

> Hi!
>
> > > > Pathname matching, transition table loading, profile loading and manipulation.
> > >
> > > So we get small interpreter of state machines, and reason we need is 'apparmor is misdesigned and works with paths when it should have worked with handles'.
> >
> > I assume you mean labels instead of handles. AppArmor's design is around paths not labels, and independent of whether or not you like AppArmor, this design leads to a useful security model distinct from the SELinux security model (which is useful in its own ways). The differences between those models cannot be argued away, neither is a subset of the other, and neither is a misdesign. I would be thankful if you could stop spreading this lie.
>
> If you solve the 'new file problem', aa becomes subset of selinux. And I'm pretty sure patch will be nicer than this.

You are quite mistaken. SELinux turns pathnames into labels when it initially labels all files (when a policy is rolled out), whereas AppArmor computes the label of each file when a file is opened. The two models start to diverge as soon as files are renamed: in SELinux, labels stick with the files. In AppArmor, labels stick with the names. So what you advocate for is a hybrid between the SELinux and the AppArmor model, not a superset.

It could be that the SELinux folks will solve the issues they are having with new files using something better than restorecond in the future, perhaps even an in-kernel mechanism (although I somewhat doubt it). But then again, their basic model makes sense even without any live file relabeling, and so that's probably not very high up on the priority list.

Andreas
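The rename divergence described above can be illustrated with a toy model. This is only a sketch: every name in it (`toy_file`, `label_from_path`, `initial_label`, `toy_rename`) and the single "/etc/" rule are hypothetical, and it does not reflect how either SELinux or AppArmor is actually implemented. It only shows a label stored with the file versus a label derived from the current name.

```c
#include <assert.h>
#include <string.h>

/* Toy file object; all names in this sketch are hypothetical. */
struct toy_file {
    char path[64];              /* current pathname */
    const char *stored_label;   /* label-based model: kept with the file */
};

/* Name-based model: derive the label from the pathname at open time. */
const char *label_from_path(const char *path)
{
    return strncmp(path, "/etc/", 5) == 0 ? "config" : "other";
}

/* Label-based model: label once, from the pathname at labeling time. */
void initial_label(struct toy_file *f)
{
    f->stored_label = label_from_path(f->path);
}

/* Rename changes the name only; the stored label is left untouched. */
void toy_rename(struct toy_file *f, const char *new_path)
{
    assert(strlen(new_path) < sizeof f->path);
    strcpy(f->path, new_path);
}
```

After `initial_label()` on a file at "/etc/passwd" followed by `toy_rename()` to "/tmp/passwd", the stored label is still "config" while `label_from_path()` on the new name yields "other": the stored label followed the file, the derived label followed the name.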