Re: fs/udf: vm pages overlap while reading large dir
In message [EMAIL PROTECTED], Andriy Gapon writes: on 06/02/2008 18:29 Andriy Gapon said the following: Small summary of the above long description. For directory reading fs/udf performs bread() on a (underlying) device vnode. It passes block number as if block size was 512 bytes (i.e. byte_offset_within_dev/512). We have three sizes of relevance here, the sectorsize of the provider, the blocksize of the filesystem and the page size of the system. In general it is adventurous to have any of them be anything but powers of two, and it is at best ill-advised and more likely asking for trouble to do requests that are not a multiple of and aligned to the sectorsize of the provider. So up front, I would say that it is a UDF bug to do 512 byte reads off a 2k sector provider. Making the buffer cache handle this is possible, but it is not the direction we have planned for the buffercache (ie: that it should become a wrapper for using the VM system, rather than being a separately allocated cache). So if the objective is to make UDF work in the short term, your change might work; if the objective is to move FreeBSD's kernel architecture forward, the right thing to do is to fix UDF to not access subsectors. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to [EMAIL PROTECTED]
valgrind or workalike on FreeBSD/amd64 7.0/8.0?
Hi, Is there anyone working on valgrind on newer FreeBSD releases, or some work-alikes? Cheers, -- Xin LI [EMAIL PROTECTED] http://www.delphij.net/ FreeBSD - The Power to Serve!
Re: retrieve data from mbuf chain
Hi, thanks to everyone for providing me with different ideas. First I am trying to use the m_copydata() function because I think it will be easy for me to copy data back and forth (using m_copyback()). But right now I am having a problem using m_copydata(). I want to copy the data in all mbufs (only the payload, no tcp/ip header) except the first mbuf in the chain. If the payload is small enough to fit within the 1st mbuf then I don't need that either. I am getting a kernel panic (please see below). I can see the custom message "Starting m_copydata()" in the log file, so I assume the problem is due to an incorrect parameter in m_copydata(). Here is the sample of code I am trying to use:

caddr_t my_data_copy = NULL;
/* check if m_len < m_pkthdr.len */
if (m->m_len < m->m_pkthdr.len) {
        /* copy data if there is more than 1 mbuf in the chain */
        log(LOG_DEBUG, "Starting m_copydata()\n");
        m_copydata(m, m->m_len, m->m_pkthdr.len - m->m_len, my_data_copy);
        log(LOG_DEBUG, "%d bytes of data copied\n", m->m_pkthdr.len - m->m_len);
} else {
        /* skip if there is only 1 mbuf */
        /* log(LOG_DEBUG, "There must be only 1 mbuf in the chain\n"); */
}

Kernel Panic:
Feb 12 11:36:09 bsd1 /kernel: Fatal trap 12: page fault while in kernel mode
Feb 12 11:36:09 bsd1 /kernel: fault virtual address = 0x0
Feb 12 11:36:09 bsd1 /kernel: fault code = supervisor write, page not present
Feb 12 11:36:09 bsd1 /kernel: instruction pointer = 0x8:0xc024efc2
Feb 12 11:36:09 bsd1 /kernel: stack pointer = 0x10:0xd15e8d08
Feb 12 11:36:09 bsd1 /kernel: frame pointer = 0x10:0xd15e8d2c
Feb 12 11:36:09 bsd1 /kernel: code segment = base 0x0, limit 0xf, type 0x1b
Feb 12 11:36:09 bsd1 /kernel: = DPL 0, pres 1, def32 1, gran 1
Feb 12 11:36:09 bsd1 /kernel: processor eflags = interrupt enabled, resume, IOPL = 0
Feb 12 11:36:09 bsd1 /kernel: current process = 154 (ping)
Feb 12 11:36:09 bsd1 /kernel: interrupt mask=

I am using ping -s 1200 host to send larger packets so that the system creates at least 2 mbufs.
On Feb 7, 2008 3:26 PM, Sam Leffler [EMAIL PROTECTED] wrote: Biks N wrote: Hi, I am new to FreeBSD kernel programming. Currently I am trying to work on mbuf data manipulation. From my understanding: data (payload) is stored in one or more mbufs which are chained together through the m_next pointer. Now, I need to retrieve all data in an mbuf chain (mbufs linked by m_next). I am working in ip_output() in netinet/ip_output.c. Does there exist a built-in function/macro to retrieve all the data in an mbuf chain? man 9 mbuf; look for m_copydata. Sam
Re: fs/udf: vm pages overlap while reading large dir
On Tue, 12 Feb 2008, Andriy Gapon wrote: on 12/02/2008 15:11 Bruce Evans said the following: On Tue, 12 Feb 2008, Poul-Henning Kamp wrote: In message [EMAIL PROTECTED], Andriy Gapon writes: 3.1. for a fresh buf getblk would assign the following: bsize = bo->bo_bsize; offset = blkno * bsize; That's clearly wrong. If units were always 512-blocks, then anything except bsize = DEV_BSIZE would be clearly wrong. Things aren't that simple (but probably should be). Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks. O, I missed this obvious thing! Me too. Bruce, thank you for putting me back on the ground :-) Even in the UDF case, when we bread() via a file or directory vnode we pass a logical 2048-byte block number (within the file) to bread(). In this case the code in getblk() does the right thing when it calculates offset as blkno * 2048. Calculating it assuming 512-byte units would be a disaster. I think the size is mnt_stat.f_iosize in general (but not for device vnodes). And the actual reading works correctly because udf_strategy is called for such vnodes and it translates block numbers from physical to logical and also correctly re-calculates b_iooffset for actual reading. So the b_iooffset value set in breadn() (using dbtob) is completely ignored. Is this setting (and the corresponding setting for writing) ever used? I remember that this is why g_vfs_* was my primary target. It seems that I revert my opinion back: either g_vfs_open should be smart about setting bo_bsize, or g_vfs_strategy should be smart about calculating bio_offset, or individual filesystems should not depend on g_vfs_* always doing the best thing for them. In fact, ffs already overrides the setting of bo_bsize for the device vnode to a different wrong setting -- g_vfs_open() sets the sector size, and ffs_mount() changes the setting to fs_bsize, but ffs actually needs the setting to be DEV_BSIZE (I think).
Other bugs from this: - ffs_rawread wants the sector size, and it assumes that this is in bo_bsize for the device vnode, but ffs_mount() has changed this to fs_bsize. - multiple r/o mounts are supposed to work, but don't, since there is only one device vnode with a shared bufobj, but the bufobj needs to be per-file-system since all mounts write to it. Various bad things including panics result. There is a well-known panic from bo_private becoming garbage on unmount. I just noticed more possibilities for panics. bo_ops points to static storage, so it never becomes complete garbage. However, at least ffs sets it blindly early on in ffs_mountfs(), before looking at the file system to see if ffs can mount it. Thus if the file system is already mounted by another ffs, then ffs clobbers the other fs's bo_ops. The ffs mount will presumably fail, leaving bo_ops clobbered. Also, a successful g_vfs_open() has clobbered bo_ops, bo_private and bo_bsize a little earlier. g_vfs_open() is fundamental to looking at the file system, since I/O is not set up until it completes. Clobbering the pointers is most dangerous, but just clobbering bo_bsize breaks blkno calculations for any code that uses bo_bsize. Apart from these bugs, the fix for the blkno calculations for device vp's may be as simple as setting bo_bsize to DEV_BSIZE for the device vp of all disk file systems (since all disk file systems expect this size). Currently, ffs seems to be the only file system that overrides g_vfs_open()'s default of the sector size. Nothing in any of fs/, gnu/fs/ and contrib/ references bo_bsize. In any case, it seems that it is the g_vfs_* that is currently inconsistent: it sets bo_bsize to a somewhat arbitrary value, but expects that it would always be provided with correct b_iooffset (the latter being rigidly calculated via dbtob() in buf-cache code).
So this leaves filesystems dead in the water while reading via device vnode: provide blkno in 512-byte units - screw up VM cache, provide blkno in bo_bsize units - screw up reading from disk. I think I/O on the disk doesn't use bo_bsize, but only the sector size (to scale the byte count to a sector count). Otherwise, ffs's override would break I/O. geom is another place that has few references to bo_bsize -- it just clobbers it in g_vfs_open(). Not sure if the FS-s should have wits to set bo_bsize properly and/or provide proper bo_ops - we are talking about a device vnode here. Should filesystems be involved in the business of setting its properties? Not sure. Yes. They can set it more easily as they can tell g_vfs_open() what to set it to. Except, until the bugs are fixed properly, g_vfs_open() can more easily do tests to prevent clobbering. For bo_bsize and bo_ops, sharing a common value is safe and safe changes can be detected by setting to a special value on last unmount. For bo_private, a safety check would probably disallow multiple mounts (since cp is dynamically allocated (?)). Bruce
Re: fs/udf: vm pages overlap while reading large dir
on 12/02/2008 17:58 Bruce Evans said the following: On Tue, 12 Feb 2008, Andriy Gapon wrote: And the actual reading works correctly because udf_strategy is called for such vnodes and it translates block numbers from physical to logical and also correctly re-calculates b_iooffset for actual reading. So the b_iooffset value set in breadn() (using dbtob) is completely ignored. Is this setting (and the corresponding setting for writing) ever used? Bruce, replying only to this part (digesting the others): yes, it is used by g_vfs_strategy, which is a bufobj strategy after g_vfs_open. It passes b_iooffset as is to bio_offset. P.S. of course, I am speaking about the current sources - I know you have almost an OS of your own, kidding :-) -- Andriy Gapon
Re: fs/udf: vm pages overlap while reading large dir
on 12/02/2008 15:11 Bruce Evans said the following: On Tue, 12 Feb 2008, Poul-Henning Kamp wrote: In message [EMAIL PROTECTED], Andriy Gapon writes: 2.3. this code passes to bread blkno that is calculated as 4*sector, where sector is a number of a physical 2048-byte sector. [**] [**] - I think that this is a requirement of the buf-cache system, because internally it performs many calculations that seem to assume that block size is always 512. Yes, the buf-cache works in 512 byte units throughout. I missed the dbtob() conversions in vfs_bio.c when I replied previously. So blkno cannot be a cookie; it must be for a 512-block. So how did the cases where bsize != DEV_BSIZE in getblk() ever work? It seems that this is because specific VOP_STRATEGY and/or BO_STRATEGY kicked in and did the right thing, see below. 3.1. for a fresh buf getblk would assign the following: bsize = bo->bo_bsize; offset = blkno * bsize; That's clearly wrong. If units were always 512-blocks, then anything except bsize = DEV_BSIZE would be clearly wrong. Things aren't that simple (but probably should be). Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks. O, I missed this obvious thing! Bruce, thank you for putting me back on the ground :-) Even in the UDF case, when we bread() via a file or directory vnode we pass a logical 2048-byte block number (within the file) to bread(). In this case the code in getblk() does the right thing when it calculates offset as blkno * 2048. Calculating it assuming 512-byte units would be a disaster. And the actual reading works correctly because udf_strategy is called for such vnodes and it translates block numbers from physical to logical and also correctly re-calculates b_iooffset for actual reading. So the b_iooffset value set in breadn() (using dbtob) is completely ignored. I remember that this is why g_vfs_* was my primary target.
It seems that I revert my opinion back: either g_vfs_open should be smart about setting bo_bsize, or g_vfs_strategy should be smart about calculating bio_offset, or individual filesystems should not depend on g_vfs_* always doing the best thing for them. In any case, it seems that it is the g_vfs_* that is currently inconsistent: it sets bo_bsize to a somewhat arbitrary value, but expects that it would always be provided with correct b_iooffset (the latter being rigidly calculated via dbtob() in buf-cache code). So this leaves filesystems dead in the water while reading via device vnode: provide blkno in 512-byte units - screw up VM cache, provide blkno in bo_bsize units - screw up reading from disk. Not sure if the FS-s should have wits to set bo_bsize properly and/or provide proper bo_ops - we are talking about a device vnode here. Should filesystems be involved in the business of setting its properties? Not sure. But it seems that there are many possibilities for fixups and various filesystems [have to] do stuff in their unique ways (*_STRATEGY, etc). -- Andriy Gapon
Re: fs/udf: vm pages overlap while reading large dir
On Tue, 12 Feb 2008, Poul-Henning Kamp wrote: In message [EMAIL PROTECTED], Andriy Gapon writes: 2.3. this code passes to bread blkno that is calculated as 4*sector, where sector is a number of a physical 2048-byte sector. [**] [**] - I think that this is a requirement of the buf-cache system, because internally it performs many calculations that seem to assume that block size is always 512. Yes, the buf-cache works in 512 byte units throughout. I missed the dbtob() conversions in vfs_bio.c when I replied previously. So blkno cannot be a cookie; it must be for a 512-block. So how did the cases where bsize != DEV_BSIZE in getblk() ever work? 3.1. for a fresh buf getblk would assign the following: bsize = bo->bo_bsize; offset = blkno * bsize; That's clearly wrong. If units were always 512-blocks, then anything except bsize = DEV_BSIZE would be clearly wrong. Things aren't that simple (but probably should be). Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks. That seems to include nfs (client). In fact, nfs_getcacheblk() does weird scaling which seems to be mainly to compensate for weird scaling here. It calls getblk() with a bn arg that seems to be in f_iosize units. Then at the end, for the VREG case, it sets bp->b_blkno to this bn scaled to normal DEV_BSIZE units. bp->b_blkno seems to have DEV_BSIZE units for all uses of it in nfs. We need to assert that the blkno is aligned to the start of a sector and use the 512 byte units, so I guess it would be: offset = dbtob(blkno); KASSERT(!(offset & (bsize - 1)), ("suitable diagnostic")); Barely worth checking. The current bug has nothing to do with this. The offset is just wrong at this point, after using a scale factor that is inconsistent with the units of blkno, for all (?) disk (and other?) file systems whose sector size isn't 512. Bruce
Re: fs/udf: vm pages overlap while reading large dir
On Tue, 12 Feb 2008, Poul-Henning Kamp wrote: In message [EMAIL PROTECTED], Andriy Gapon writes: on 06/02/2008 18:29 Andriy Gapon said the following: Small summary of the above long description. For directory reading fs/udf performs bread() on a (underlying) device vnode. It passes block number as if block size was 512 bytes (i.e. byte_offset_within_dev/512). We have three sizes of relevance here, the sectorsize of the provider, the blocksize of the filesystem and the page size of the system. 4. The block size required for bread() and friends (almost always DEV_BSIZE). In general it is adventurous to have any of them be anything but powers of two, and it is at best ill-advised and more likely asking for trouble to do requests that are not a multiple of and aligned to the sectorsize of the provider. So up front, I would say that it is a UDF bug to do 512 byte reads off a 2k sector provider. Making the buffer cache handle this is possible, but it is not the direction we have planned for the buffercache (ie: that it should become a wrapper for using the VM system, rather than being a separately allocated cache). So if the objective is to make UDF work in the short term, your change might work; if the objective is to move FreeBSD's kernel architecture forward, the right thing to do is to fix UDF to not access subsectors. This bug has nothing to do with subsectors, and very little to do with udf. udf is just depending on bread()'s API being unbroken. That API requires starting with blocks consisting of whole sectors (else the subsequent I/O would fail) and converting to blocks of size DEV_BSIZE, except for unusual (non-disk?) file systems like nfs (and no others?). All (?) disk file systems do this conversion. The bug is just more noticeable for udf since it is used more often on disks with a block size != DEV_BSIZE, and it apparently does something that makes the bug mess up VM more than other file systems.
Of course, it might be better for the API to take blocks in units of the sector size, but that is not what it takes, and this can't be changed easily or safely. The standard macro btodb() is often used to convert from bytes to blocks of this size, and doesn't support bo_bsize or the bufobj pointer that would be needed to reach bo_bsize. ffs mostly uses its fsbtodb() macro, which converts blocks in ffs block (frag?)-sized units to blocks in DEV_BSIZE units so as to pass them to unbroken bread() and friends. ffs initializes bo_bsize to its block (not frag) size, and then never uses it directly except in ffs_rawread(), where it is used to check that the I/O is in units of whole sectors (otherwise, ffs_rawread() has DEV_BSIZE more hard-coded than the rest of ffs). The details of fsbtodb() are interesting. They show signs of previous attempts to use units of sectors. From ffs/fs.h:

% /*
%  * Turn filesystem block numbers into disk block addresses.
%  * This maps filesystem blocks to device size blocks.
%  */
% #define fsbtodb(fs, b)  ((daddr_t)(b) << (fs)->fs_fsbtodb)
% #define dbtofsb(fs, b)  ((b) >> (fs)->fs_fsbtodb)

Creation of fs_fsbtodb is left to newfs. The units of DEV_BSIZE are thus built in to the on-disk data struct (in a fairly easy to change and/or override way, but there are a lot of existing disks). From newfs.c:

% realsectorsize = sectorsize;
% if (sectorsize != DEV_BSIZE) {  /* XXX */
%         int secperblk = sectorsize / DEV_BSIZE;
%
%         sectorsize = DEV_BSIZE;
%         fssize *= secperblk;
%         if (pp != NULL)
%                 pp->p_size *= secperblk;
% }
% mkfs(pp, special);

Though mkfs() appears to support disk blocks of size sectorsize, its sectorsize parameter is hacked on here, so it always generates fs_fsbtodb and other derived parameters for disk blocks of size DEV_BSIZE, as is required for the unbroken bread() API.
msdosfs is similar to ffs here, except it constructs the shift count at mount time, to arrange for always converting to DEV_BSIZE blocks for passing to bread() and friends. udf is effectively similar, but pessimal and disorganized. For the conversion for bread(), it uses btodb(). In udf_bmap(), it constructs a shift count on every call by subtracting DEV_BSHIFT from its block shift count. In udf_strategy(), on every call it constructs a multiplier instead of a shift count, by dividing its block size by DEV_BSIZE. It's weird that the strategy routine, which will soon end up sending sectors to the disk, is converting to DEV_BSIZE units, a size that cannot be the size for udf's normal media. cd9660 uses btodb() for one conversion for bread() and (its block shift count - DEV_BSHIFT) in 7 other conversions to DEV_BSIZE units. So it's surprising that any file systems work on CDs and DVDs. Maybe the bug affects udf more because udf crosses page boundaries more. It's also surprising that the bug has such small effects. This seems to be because the
Re: fs/udf: vm pages overlap while reading large dir
on 12/02/2008 13:47 Poul-Henning Kamp said the following: In message [EMAIL PROTECTED], Andriy Gapon writes: 2.3. this code passes to bread blkno that is calculated as 4*sector, where sector is a number of a physical 2048-byte sector. [**] [**] - I think that this is a requirement of the buf-cache system, because internally it performs many calculations that seem to assume that block size is always 512. Yes, the buf-cache works in 512 byte units throughout. 3.1. for a fresh buf getblk would assign the following: bsize = bo->bo_bsize; offset = blkno * bsize; That's clearly wrong. We need to assert that the blkno is aligned to the start of a sector and use the 512 byte units, so I guess it would be: offset = dbtob(blkno); KASSERT(!(offset & (bsize - 1)), ("suitable diagnostic")); Thank you for this very insightful and neat suggestion! I think that it must work, but I will try to test it tonight on whatever media and FS-s I have available. Thank you again! P.S. hope to not get 'suitable diagnostic' from something like msdosfs :-) -- Andriy Gapon
Re: fs/udf: vm pages overlap while reading large dir
In message [EMAIL PROTECTED], Andriy Gapon writes: 2.3. this code passes to bread blkno that is calculated as 4*sector, where sector is a number of a physical 2048-byte sector. [**] [**] - I think that this is a requirement of the buf-cache system, because internally it performs many calculations that seem to assume that block size is always 512. Yes, the buf-cache works in 512 byte units throughout. 3.1. for a fresh buf getblk would assign the following: bsize = bo->bo_bsize; offset = blkno * bsize; That's clearly wrong. We need to assert that the blkno is aligned to the start of a sector and use the 512 byte units, so I guess it would be: offset = dbtob(blkno); KASSERT(!(offset & (bsize - 1)), ("suitable diagnostic")); -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence.
Re: fs/udf: vm pages overlap while reading large dir
on 12/02/2008 10:53 Poul-Henning Kamp said the following: In message [EMAIL PROTECTED], Andriy Gapon writes: on 06/02/2008 18:29 Andriy Gapon said the following: Small summary of the above long description. For directory reading fs/udf performs bread() on a (underlying) device vnode. It passes block number as if block size was 512 bytes (i.e. byte_offset_within_dev/512). We have three sizes of relevance here, the sectorsize of the provider, the blocksize of the filesystem and the page size of the system. In general it is adventurous to have any of them be anything but powers of two, and it is at best ill-advised and more likely asking for trouble to do requests that are not a multiple of and aligned to the sectorsize of the provider. So up front, I would say that it is a UDF bug to do 512 byte reads off a 2k sector provider. Making the buffer cache handle this is possible, but it is not the direction we have planned for the buffercache (ie: that it should become a wrapper for using the VM system, rather than being a separately allocated cache). So if the objective is to make UDF work in the short term, your change might work; if the objective is to move FreeBSD's kernel architecture forward, the right thing to do is to fix UDF to not access subsectors. Poul, I agree with what you say, but I think that I didn't properly explain what the UDF code is doing and why it might be important in general. Let me try to do it step-by-step (sorry if I'll say something too obvious). And please correct me if I misunderstand something in the fundamental steps. 1.1. UDF is typically used with CD/DVD media, so the provider's sector size is 2048. 1.2. udf vfs mount method calls g_vfs_open. 1.3. g_vfs_open creates a vobject for a device vnode. 1.4. g_vfs_open sets bo_bsize = pp->sectorsize in the device vnode's bufobj. 1.5. g_vfs_open also overrides bo_ops for the bufobj. 2.1. UDF directory reading code performs bread() via the device vnode. [*] 2.2.
this code passes to bread a size that is a multiple of 2048. 2.3. this code passes to bread blkno that is calculated as 4*sector, where sector is a number of a physical 2048-byte sector. [**] [*] - this seems to be an uncommon thing among filesystems. [**] - I think that this is a requirement of the buf-cache system, because internally it performs many calculations that seem to assume that block size is always 512. E.g. breadn() code has the following: bp->b_iooffset = dbtob(bp->b_blkno); -- effectively multiplies by 512 -- then bstrategy(bp); And g_vfs_strategy has the following: bip->bio_offset = bp->b_iooffset; So, if udf code would pass blkno == sector, then bio_offset would be incorrect. Or maybe g_vfs_strategy should do some translation here from b_iooffset to bio_offset taking into account bo_bsize ?? So that the actual, non-adjusted, sector number could be passed to bread() ? 3.1. for a fresh buf getblk would assign the following: bsize = bo->bo_bsize; offset = blkno * bsize; ... bp->b_blkno = bp->b_lblkno = blkno; bp->b_offset = offset; --- this is where this discussion started, so b_offset of a buffer is 4*sector*2048. This is a source of the trouble. 3.2. getblk would set bp->b_flags |= B_VMIO; if a vnode has a vobject, and our device vnode has it (step 1.3). 3.3. consequently allocbuf will execute code for the B_VMIO case. 3.4. allocbuf will lookup/allocate pages by index which is calculated from a base index of OFF_TO_IDX(bp->b_offset), which is, in essence, bp->b_offset divided by the page size, 4096. So our base index is 2*sector. 4.1. Let's assume we bread() (via the device vnode, as described above) from physical sector N and we read 6 physical sectors (6*2048 bytes).
4.2 Sectors will get mapped/tied/bound to VM pages as follows (according to the above calculations): sectors[N, N+1] -> page[2*N], sectors[N+2, N+3] -> page[2*N + 1], /* the next page */ sectors[N+4, N+5] -> page[2*N + 2] /* the next page */ 4.5 Now let's consider the next read of X sectors but now starting from sector N+1; repeating the calculations we get the following mapping: sectors[(N+1), (N+1) + 1] -> page[2*(N+1)] = page[2*N + 2] But this page already has cached data from sectors[N+4, N+5]. Problem!! Theoretical calculations show it and practical testing confirms it. So this is a proof that bread()-ing via a device vnode is broken if: C1) the vnode was set up by g_vfs_open(); C2) the sector size of the underlying geom provider is not 512, but a non-trivial multiple of it; C3) the bread size is sufficiently big; Current UDF code for directory reading is the only place known to me that meets all 3 of the above conditions (for sufficiently large directories, to meet condition C3). So I stated this, now let the wise speak. I already have a patch that makes UDF read directories via the directory vnodes. But the problem in general would remain. Maybe g_vfs_open is the correct place to change, maybe g_vfs_strategy is the place, maybe something else, maybe don't bother. I don't know. -- Andriy Gapon