Re: fs/udf: vm pages overlap while reading large dir

2008-02-12 Thread Poul-Henning Kamp
In message [EMAIL PROTECTED], Andriy Gapon writes:
> on 06/02/2008 18:29 Andriy Gapon said the following:
>> Small summary of the above long description.
>> For directory reading fs/udf performs bread() on a (underlying) device
>> vnode. It passes block number as if block size was 512 bytes (i.e.
>> byte_offset_within_dev/512).

We have three sizes of relevance here, the sectorsize of the provider,
the blocksize of the filesystem and the page size of the system.

In general it is adventurous to have any of them be anything but
powers of two, and it is at best ill-advised and more likely asking
for trouble to do requests that are not multiples of and aligned to
the sectorsize of the provider.

So up front, I would say that it is a UDF bug to do 512 byte reads off
a 2k sector provider.

Making the buffer cache handle this is possible, but it is not the
direction we have planned for the buffercache (ie: that it should
become a wrapper for using the VM system, rather than being a
separately allocated cache).

So if the objective is to make UDF work in the short term, your
change might work, if the objective is to move FreeBSD's kernel
architecture forward, the right thing to do is to fix UDF to not
access subsectors.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED] | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


valgrind or workalike on FreeBSD/amd64 7.0/8.0?

2008-02-12 Thread Xin LI

Hi,

Is there anyone working on valgrind on newer FreeBSD releases, or some
work-alikes?

Cheers,
- --
Xin LI [EMAIL PROTECTED]  http://www.delphij.net/
FreeBSD - The Power to Serve!


Re: retrieve data from mbuf chain

2008-02-12 Thread Biks N
Hi, thanks to everyone for providing me with different ideas.

First I am trying to use the m_copydata() method because I think it will
be easy for me to copy data back and forth (using m_copydataback()).

But right now I am having problem using m_copydata() function.

I want to copy data in all mbufs (only payload but no tcp/ip header)
except the first Mbuf in chain.
If payload is small enough to fit within 1st mbuf then I don't need that either.

I am getting a kernel panic (please see below).
I can see my custom message "Starting m_copydata()" in the log file,
so I assume the problem is due to an incorrect parameter in m_copydata().


here is the sample of code I am trying to use:

//
caddr_t my_data_copy = NULL;

/* check if m_len < m_pkthdr.len */
if (m->m_len < m->m_pkthdr.len) {
	/* copy data if there is more than 1 mbuf in the chain */
	log(LOG_DEBUG, "Starting m_copydata()\n");

	m_copydata(m, m->m_len, m->m_pkthdr.len - m->m_len, my_data_copy);

	log(LOG_DEBUG, "%d bytes of data copied\n",
	    m->m_pkthdr.len - m->m_len);
} else {
	/* skip if there is only 1 mbuf */
	/* log(LOG_DEBUG, "There must be only 1 mbuf in chain\n"); */
}
//


Kernel Panic:

Feb 12 11:36:09 bsd1 /kernel: Fatal trap 12: page fault while in kernel mode
Feb 12 11:36:09 bsd1 /kernel: fault virtual address = 0x0
Feb 12 11:36:09 bsd1 /kernel: fault code= supervisor
write, page not present
Feb 12 11:36:09 bsd1 /kernel: instruction pointer   = 0x8:0xc024efc2
Feb 12 11:36:09 bsd1 /kernel: stack pointer = 0x10:0xd15e8d08
Feb 12 11:36:09 bsd1 /kernel: frame pointer = 0x10:0xd15e8d2c
Feb 12 11:36:09 bsd1 /kernel: code segment  = base 0x0,
limit 0xf, type 0x1b
Feb 12 11:36:09 bsd1 /kernel: = DPL 0, pres 1, def32 1, gran 1
Feb 12 11:36:09 bsd1 /kernel: processor eflags  = interrupt enabled,
resume, IOPL = 0
Feb 12 11:36:09 bsd1 /kernel: current process   = 154 (ping)
Feb 12 11:36:09 bsd1 /kernel: interrupt mask=
Feb 12 11:36:09 bsd1 /kernel:


I am using "ping -s 1200 host" to send larger packets so that the system
creates at least 2 mbufs.
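
Could the problem be the destination pointer? The write fault at virtual
address 0x0 would be consistent with m_copydata() storing through the NULL
my_data_copy: m_copydata(9) copies into caller-supplied storage and does
not allocate it. A minimal sketch of allocating the destination first
(malloc(9) with M_TEMP is an assumption here, not the only choice):

int payload_len = m->m_pkthdr.len - m->m_len;
caddr_t my_data_copy;

if (payload_len > 0) {
	/* M_NOWAIT: this may run in the network path */
	my_data_copy = malloc(payload_len, M_TEMP, M_NOWAIT);
	if (my_data_copy != NULL) {
		m_copydata(m, m->m_len, payload_len, my_data_copy);
		/* ... use the copy; m_copyback(9) can write data back ... */
		free(my_data_copy, M_TEMP);
	}
}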



On Feb 7, 2008 3:26 PM, Sam Leffler [EMAIL PROTECTED] wrote:

 Biks N wrote:
  Hi,
 
  I am new to FreeBSD kernel programming.
 
  Currently I am trying to work on mbuf data manipulation.
 
  From my understanding: data (payload) is stored into one or more mbufs
  which are chained together through m_next pointer.
 
  Now, I need to retrieve all data in the mbuf chain (mbufs linked by
  m_next). I am working in ip_output() in netinet/ip_output.c
 
  Does there exist an inbuilt function/macro to retrieve all the data in an mbuf
  chain?
 

 man 9 mbuf; look for m_copydata.

Sam




Re: fs/udf: vm pages overlap while reading large dir

2008-02-12 Thread Bruce Evans

On Tue, 12 Feb 2008, Andriy Gapon wrote:


> on 12/02/2008 15:11 Bruce Evans said the following:

>> On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:


>>> In message [EMAIL PROTECTED], Andriy Gapon writes:



>>>> 3.1. for a fresh buf getblk would assign the following:
>>>> bsize = bo->bo_bsize;
>>>> offset = blkno * bsize;

>>> That's clearly wrong.


>> If units were always 512-blocks, then anything except bsize = DEV_BSIZE
>> would be clearly wrong.  Things aren't that simple (but probably should
>> be).  Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks.


> O, I missed this obvious thing!


Me too.


> Bruce, thank you for putting me back on the ground :-)
> Even in UDF case, when we bread() via a file or directory vnode we pass
> a logical 2048-byte block number (within the file) to bread(). In this
> case the code in getblk() does the right thing when it calculates offset
> as blkno * 2048. Calculating it assuming 512-byte units would be a disaster.


I think the size is mnt_stat.f_iosize in general (but not for device vnodes).


> And the actual reading works correctly because udf_strategy is called
> for such vnodes and it translates block numbers from physical to logical
> and also correctly re-calculates b_iooffset for actual reading. So the
> b_iooffset value set in breadn() (using dbtob) is completely ignored.


Is this setting (and the corresponding setting for writing) ever used?


> I remember that this is why g_vfs_* was my primary target.
> It seems that I revert my opinion back: either g_vfs_open should be
> smart about setting bo_bsize, or g_vfs_strategy should be smart about
> calculating bio_offset, or individual filesystems should not depend on
> g_vfs_* always doing the best thing for them.


In fact, ffs already overrides the setting of bo_bsize for the device
vnode to a different wrong setting -- g_vfs_open() sets the sector size,
and ffs_mount() changes the setting to fs_bsize, but ffs actually needs
the setting to be DEV_BSIZE (I think).  Other bugs from this:
- ffs_rawread wants the sector size, and it assumes that this is in bo_bsize
  for the device vnode, but ffs_mount() has changed this to fs_bsize.
- multiple r/o mounts are supposed to work, but don't, since there is only
  one device vnode with a shared bufobj, but the bufobj needs to be
  per-file-system since all mounts write to it.  Various bad things
  including panics result.  There is a well-known panic from bo_private
  becoming garbage on unmount.  I just noticed more possibilities for
  panics.  bo_ops points to static storage, so it never becomes complete
  garbage.  However, at least ffs sets it blindly early on in
  ffs_mountfs(), before looking at the file system to see if ffs can
  mount it.  Thus if the file system is already mounted by another
  ffs, then ffs clobbers the other fs's bo_ops.  The ffs mount will
  presumably fail, leaving bo_ops clobbered.  Also, a successful
  g_vfs_open() has clobbered bo_ops, bo_private and bo_bsize a little
  earlier.  g_vfs_open() is fundamental to looking at the file system,
  since I/O is not set up until it completes.  Clobbering the pointers
  is most dangerous, but just clobbering bo_bsize breaks blkno
  calculations for any code that uses bo_bsize.

Apart from these bugs, the fix for the blkno calculations for device
vp's may be as simple as setting bo_bsize to DEV_BSIZE for the device
vp of all disk file systems (since all disk file systems expect
this size).  Currently, ffs seems to be the only file system that
overrides g_vfs_open()'s default of the sector size.  Nothing in any
of fs/, gnu/fs/ and contrib/ references bo_bsize.
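
A minimal sketch of how that fix might look in a disk file system's mount
path, right after a successful g_vfs_open() (variable names borrowed from
ffs_mountfs() for illustration; a sketch, not a tested patch):

	g_topology_lock();
	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
	g_topology_unlock();
	if (error != 0)
		goto out;
	/* Override g_vfs_open()'s default of pp->sectorsize: the buffer
	 * cache's blkno arithmetic for device vnodes assumes DEV_BSIZE. */
	devvp->v_bufobj.bo_bsize = DEV_BSIZE;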


> In any case, it seems that it is the g_vfs_* that is currently
> inconsistent: it sets bo_bsize to a somewhat arbitrary value, but
> expects that it would always be provided with correct b_iooffset (the
> latter being rigidly calculated via dbtob() in buffer cache code). So this
> leaves filesystems dead in the water while reading via device vnode:
> provide blkno in 512-byte units -> screw up VM cache, provide blkno in
> bo_bsize units -> screw up reading from disk.


I think I/O on the disk doesn't use bo_bsize, but only the sector size
(to scale the byte count to a sector count).  Otherwise, ffs's override
would break I/O.  geom is another place that has few references to
bo_bsize -- it just clobbers it in g_vfs_open().


> Not sure if the FS-s should have wits to set bo_bsize properly and/or
> provide proper bo_ops - we are talking about a device vnode here.
> Should filesystems be involved in the business of setting its
> properties? Not sure.


Yes.  They can set it more easily as they can tell g_vfs_open() what to
set it to.  Except, until the bugs are fixed properly, g_vfs_open() can
more easily do tests to prevent clobbering.  For bo_bsize and bo_ops,
sharing a common value is safe and safe changes can be detected by
setting to a special value on last unmount.  For bo_private, a safety
check would probably disallow multiple mounts (since cp is dynamically
allocated (?)).
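
A sketch of such a check (hypothetical wrapper name; it relies on
getnewvnode() having initialized bo_private to the vnode itself):

int
g_vfs_open_checked(struct vnode *vp, struct g_consumer **cpp,
    const char *fsname, int wr)
{
	struct bufobj *bo = &vp->v_bufobj;

	/* A previous successful g_vfs_open() replaced bo_private
	 * (initially vp) with its g_consumer; don't clobber it. */
	if (bo->bo_private != vp)
		return (EBUSY);
	return (g_vfs_open(vp, cpp, fsname, wr));
}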

Bruce

Re: fs/udf: vm pages overlap while reading large dir

2008-02-12 Thread Andriy Gapon
on 12/02/2008 17:58 Bruce Evans said the following:
 On Tue, 12 Feb 2008, Andriy Gapon wrote:
 And the actual reading works correctly because udf_strategy is called
 for such vnodes and it translates block numbers from physical to logical
 and also correctly re-calculates b_iooffset for actual reading. So
 b_iooffset value set in breadn() (using dbtob) is completely ignored.
 
 Is this setting (and the corresponding setting for writing) ever used?

Bruce,

replying only to this part (digesting the others): yes, it is used by
g_vfs_strategy, which is the bufobj strategy after g_vfs_open. It passes
b_iooffset as-is to bio_offset.

P.S. of course, I am speaking about the current sources - I know you
have almost an OS of your own, kidding :-)

-- 
Andriy Gapon


Re: fs/udf: vm pages overlap while reading large dir

2008-02-12 Thread Andriy Gapon
on 12/02/2008 15:11 Bruce Evans said the following:
 On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:
 
 In message [EMAIL PROTECTED], Andriy Gapon writes:

 2.3. this code passes to bread blkno that is calculated as 4*sector,
 where sector is a number of a physical 2048-byte sector. [**]
 [**] - I think that this is a requirement of the buffer cache system, because
 internally it performs many calculations that seem to assume that block
 size is always 512.
  Yes, the buf-cache works in 512-byte units throughout.
 
  I missed the dbtob() conversions in vfs_bio.c when I replied previously.
 So blkno cannot be a cookie; it must be for a 512-block.  So how did
 the cases where bsize != DEV_BSIZE in getblk() ever work?

It seems that this is because specific VOP_STRATEGY and/or BO_STRATEGY
kicked in and did the right thing, see below.

 3.1. for a fresh buf getblk would assign the following:
 bsize = bo->bo_bsize;
 offset = blkno * bsize;
 That's clearly wrong.
 
 If units were always 512-blocks, then anything except bsize = DEV_BSIZE
 would be clearly wrong.  Things aren't that simple (but probably should
 be).  Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks.

O, I missed this obvious thing!
Bruce, thank you for putting me back on the ground :-)
Even in UDF case, when we bread() via a file or directory vnode we pass
a logical 2048-byte block number (within the file) to bread(). In this
case the code in getblk() does the right thing when it calculates offset
as blkno * 2048. Calculating it assuming 512-byte units would be a disaster.
And the actual reading works correctly because udf_strategy is called
for such vnodes and it translates block numbers from physical to logical
and also correctly re-calculates b_iooffset for actual reading. So the
b_iooffset value set in breadn() (using dbtob) is completely ignored.

I remember that this is why g_vfs_* was my primary target.
It seems that I revert my opinion back: either g_vfs_open should be
smart about setting bo_bsize, or g_vfs_strategy should be smart about
calculating bio_offset, or individual filesystems should not depend on
g_vfs_* always doing the best thing for them.

In any case, it seems that it is the g_vfs_* that is currently
inconsistent: it sets bo_bsize to a somewhat arbitrary value, but
expects that it would always be provided with correct b_iooffset (the
latter being rigidly calculated via dbtob() in buffer cache code). So this
leaves filesystems dead in the water while reading via device vnode:
provide blkno in 512-byte units -> screw up VM cache, provide blkno in
bo_bsize units -> screw up reading from disk.

Not sure if the FS-s should have wits to set bo_bsize properly and/or
provide proper bo_ops - we are talking about a device vnode here.
Should filesystems be involved in the business of setting its
properties? Not sure.
But it seems that there are many possibilities for fixups and various
filesystems [have to] do stuff in their unique ways (*_STRATEGY, etc).

-- 
Andriy Gapon


Re: fs/udf: vm pages overlap while reading large dir

2008-02-12 Thread Bruce Evans

On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:


> In message [EMAIL PROTECTED], Andriy Gapon writes:


>> 2.3. this code passes to bread blkno that is calculated as 4*sector,
>> where sector is a number of a physical 2048-byte sector. [**]
>> [**] - I think that this is a requirement of the buffer cache system, because
>> internally it performs many calculations that seem to assume that block
>> size is always 512.


> Yes, the buf-cache works in 512-byte units throughout.


I missed the dbtob() conversions in vfs_bio.c when I replied previously.
So blkno cannot be a cookie; it must be for a 512-block.  So how did
the cases where bsize != DEV_BSIZE in getblk() ever work?


>> 3.1. for a fresh buf getblk would assign the following:
>> bsize = bo->bo_bsize;
>> offset = blkno * bsize;


> That's clearly wrong.


If units were always 512-blocks, then anything except bsize = DEV_BSIZE
would be clearly wrong.  Things aren't that simple (but probably should
be).  Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks.
That seems to include nfs (client).  In fact, nfs_getcacheblk() does
weird scaling which seems to be mainly to compensate for weird scaling
here.  It calls getblk() with a bn arg that seems to be in f_iosize units.
Then at the end, for the VREG case, it sets bp->b_blkno to this bn
scaled to normal DEV_BSIZE units.  bp->b_blkno seems to have DEV_BSIZE
units for all uses of it in nfs.


> We need to assert that the blkno is aligned to the start of a sector
> and use the 512 byte units, so I guess it would be:
>
>	offset = dbtob(blkno);
>	KASSERT(!(offset & (bsize - 1)), ("suitable diagnostic"));


Barely worth checking.

The current bug has nothing to do with this.  The offset is just wrong
at this point, after using a scale factor that is inconsistent with the
units of blkno, for all (?) disk (and other?) file systems whose sector
size isn't 512.

Bruce


Re: fs/udf: vm pages overlap while reading large dir

2008-02-12 Thread Bruce Evans

On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:


> In message [EMAIL PROTECTED], Andriy Gapon writes:

>> on 06/02/2008 18:29 Andriy Gapon said the following:

>>> Small summary of the above long description.
>>> For directory reading fs/udf performs bread() on a (underlying) device
>>> vnode. It passes block number as if block size was 512 bytes (i.e.
>>> byte_offset_within_dev/512).


> We have three sizes of relevance here, the sectorsize of the provider,
> the blocksize of the filesystem and the page size of the system.


4. The block size required for bread() and friends (almost always
   DEV_BSIZE).


> In general it is adventurous to have any of them be anything but
> powers of two, and it is at best ill-advised and more likely asking
> for trouble to do requests that are not multiples of and aligned to
> the sectorsize of the provider.
>
> So up front, I would say that it is a UDF bug to do 512 byte reads off
> a 2k sector provider.
>
> Making the buffer cache handle this is possible, but it is not the
> direction we have planned for the buffercache (ie: that it should
> become a wrapper for using the VM system, rather than being a
> separately allocated cache).
>
> So if the objective is to make UDF work in the short term, your
> change might work, if the objective is to move FreeBSD's kernel
> architecture forward, the right thing to do is to fix UDF to not
> access subsectors.


This bug has nothing to do with subsectors, and very little to do with
udf.  udf is just depending on bread()'s API being unbroken.  That API
requires starting with blocks consisting of whole sectors (else the
subsequent I/O would fail) and converting to blocks of size DEV_BSIZE,
except for unusual (non-disk?) file systems like nfs (and no others?).
All (?) disk file systems do this conversion.  The bug is just more
noticeable for udf since it is used more often on disks with a block
size != DEV_BSIZE, and it apparently does something that makes the bug
mess up VM more than other file systems.

Of course, it might be better for the API to take blocks in units of
the sector size, but that is not what it takes, and this can't be
changed easily or safely.

The standard macro btodb() is often used to convert from bytes to
blocks of this size, and doesn't support bo_bsize or the bufobj pointer
that would be needed to reach bo_bsize.
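
For reference, the essence of the macro (simplified; the real definition
in sys/param.h widens the operand first):

#define	DEV_BSHIFT	9			/* log2(DEV_BSIZE) */
#define	DEV_BSIZE	(1 << DEV_BSHIFT)	/* 512 */

/* bytes -> DEV_BSIZE-unit disk blocks; note there is no bufobj
 * argument through which bo_bsize could even be consulted */
#define	btodb(bytes)	((daddr_t)((bytes) >> DEV_BSHIFT))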

ffs mostly uses its fsbtodb() macro, which converts blocks in ffs block
(frag?)-sized units to blocks in DEV_BSIZE units so as to pass them
to unbroken bread() and friends.  ffs initializes bo_bsize to its block
(not frag) size, and then never uses it directly except in ffs_rawread(),
where it is used to check that the I/O is in units of whole sectors
(otherwise, ffs_rawread() has DEV_BSIZE more hard-coded than the rest
of ffs).

The details of fsbtodb() are interesting.  They show signs of previous
attempts to use units of sectors.  From ffs/fs.h:

% /*
%  * Turn filesystem block numbers into disk block addresses.
%  * This maps filesystem blocks to device size blocks.
%  */
% #define	fsbtodb(fs, b)	((daddr_t)(b) << (fs)->fs_fsbtodb)
% #define	dbtofsb(fs, b)	((b) >> (fs)->fs_fsbtodb)

Creation of fs_fsbtodb is left to newfs.  The units of DEV_BSIZE are thus
built in to the on-disk data struct (in a fairly easy to change and/or
override way, but there are a lot of existing disks).  From newfs.c:

%	realsectorsize = sectorsize;
%	if (sectorsize != DEV_BSIZE) {		/* XXX */
%		int secperblk = sectorsize / DEV_BSIZE;
%
%		sectorsize = DEV_BSIZE;
%		fssize *= secperblk;
%		if (pp != NULL)
%			pp->p_size *= secperblk;
%	}
%	mkfs(pp, special);

Though mkfs() appears to support disk blocks of size sectorsize, its
sectorsize parameter is hacked on here, so it always generates
fs_fsbtodb and other derived parameters for disk blocks of size
DEV_BSIZE, as is required for the unbroken bread() API.
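
A worked example under the above: with 2048-byte fragments, newfs stores
a shift of 2, and fsbtodb() then produces DEV_BSIZE-unit block numbers:

	/* sketch: fs_fsbtodb == log2(2048 / DEV_BSIZE) == 2 */
	daddr_t db = fsbtodb(fs, 10);	/* 10 << 2 == 40 */
	/* bread(devvp, 40, ...) addresses byte offset 40 * 512 == 20480
	 * == 10 * 2048, the start of fragment 10, as bread() expects */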

msdosfs is similar to ffs here, except it constructs the shift count
at mount time, to arrange for always converting to DEV_BSIZE blocks
before passing them to bread() and friends.

udf is effectively similar, but pessimal and disorganized.  For the
conversion for bread(), it uses btodb().  In udf_bmap(), it constructs
a shift count on every call by subtracting DEV_BSHIFT from its block shift
count.  In udf_strategy(), on every call it constructs a multiplier
instead of a shift count, by dividing its block size by DEV_BSIZE.  It's
weird that the strategy routine, which will soon end up sending sectors
to the disk, is converting to DEV_BSIZE units, a size that cannot be the
sector size of udf's normal media.
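
Sketched side by side, the two conversions are equivalent (2048-byte udf
blocks assumed; the field names here are illustrative, not verbatim from
the source):

	int shift = udfmp->bshift - DEV_BSHIFT;	/* 11 - 9 == 2 (udf_bmap() style) */
	daddr_t db1 = (daddr_t)lblkno << shift;

	int mult = udfmp->bsize / DEV_BSIZE;	/* 2048 / 512 == 4 (udf_strategy() style) */
	daddr_t db2 = (daddr_t)lblkno * mult;

	/* db1 == db2: both are DEV_BSIZE-unit block numbers */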

cd9660 uses btodb() for one conversion for bread() and
(its block shift count - DEV_BSHIFT) in 7 other conversions to
DEV_BSIZE units.

So it's surprising that any file systems work on CDs and DVDs.  Maybe
the bug affects udf more because udf crosses page boundaries more.

It's also surprising that the bug has such small effects.  This seems
to be because the 

Re: fs/udf: vm pages overlap while reading large dir

2008-02-12 Thread Andriy Gapon
on 12/02/2008 13:47 Poul-Henning Kamp said the following:
 In message [EMAIL PROTECTED], Andriy Gapon writes:
 
 2.3. this code passes to bread blkno that is calculated as 4*sector,
 where sector is a number of a physical 2048-byte sector. [**]
 [**] - I think that this is a requirement of the buffer cache system, because
 internally it performs many calculations that seem to assume that block
 size is always 512.
 
 Yes, the buf-cache works in 512-byte units throughout.
 
 3.1. for a fresh buf getblk would assign the following:
 bsize = bo->bo_bsize;
 offset = blkno * bsize;
 
 That's clearly wrong.
 
 We need to assert that the blkno is aligned to the start of a sector
 and use the 512 byte units, so I guess it would be:
 
 offset = dbtob(blkno);
 KASSERT(!(offset & (bsize - 1)), ("suitable diagnostic"));
 
 

Thank you for this very insightful and neat suggestion!
I think that it must work, but I will try to test it tonight on whatever
media and FS-s I have available.
Thank you again!

P.S. I hope not to get a 'suitable diagnostic' from something like msdosfs :-)

-- 
Andriy Gapon


Re: fs/udf: vm pages overlap while reading large dir

2008-02-12 Thread Poul-Henning Kamp
In message [EMAIL PROTECTED], Andriy Gapon writes:

> 2.3. this code passes to bread blkno that is calculated as 4*sector,
> where sector is a number of a physical 2048-byte sector. [**]
> [**] - I think that this is a requirement of the buffer cache system, because
> internally it performs many calculations that seem to assume that block
> size is always 512.

Yes, the buf-cache works in 512-byte units throughout.

> 3.1. for a fresh buf getblk would assign the following:
> bsize = bo->bo_bsize;
> offset = blkno * bsize;

That's clearly wrong.

We need to assert that the blkno is aligned to the start of a sector
and use the 512 byte units, so I guess it would be:

	offset = dbtob(blkno);
	KASSERT(!(offset & (bsize - 1)), ("suitable diagnostic"));


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED] | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


Re: fs/udf: vm pages overlap while reading large dir

2008-02-12 Thread Andriy Gapon
on 12/02/2008 10:53 Poul-Henning Kamp said the following:
 In message [EMAIL PROTECTED], Andriy Gapon writes:
 on 06/02/2008 18:29 Andriy Gapon said the following:
 Small summary of the above long description.
 For directory reading fs/udf performs bread() on a (underlying) device
 vnode. It passes block number as if block size was 512 bytes (i.e.
 byte_offset_within_dev/512).
 
 We have three sizes of relevance here, the sectorsize of the provider,
 the blocksize of the filesystem and the page size of the system.
 
 In general it is adventurous to have any of them be anything but
 powers of two, and it is at best ill-advised and more likely asking
 for trouble to do requests that are not multiples of and aligned to
 the sectorsize of the provider.
 
 So up front, I would say that it is a UDF bug to do 512 byte reads off
 a 2k sector provider.
 
 Making the buffer cache handle this is possible, but it is not the
 direction we have planned for the buffercache (ie: that it should
 become a wrapper for using the VM system, rather than being a
 separately allocated cache).
 
 So if the objective is to make UDF work in the short term, your
 change might work, if the objective is to move FreeBSD's kernel
 architecture forward, the right thing to do is to fix UDF to not
 access subsectors.
 

Poul,

I agree with what you say, but I think that I didn't properly explain
what the UDF code is doing and why it might be important in general.
Let me try to do it step-by-step (sorry if I say something too
obvious). And please correct me if I misunderstand something in the
fundamental steps.

1.1. UDF is typically used with CD/DVD media, so provider's sector size
is 2048.
1.2. udf vfs mount method calls g_vfs_open.
1.3. g_vfs_open creates vobject for a device vnode.
1.4. g_vfs_open sets bo_bsize = pp->sectorsize in the device vnode's bufobj.
1.5. g_vfs_open also overrides bo_ops for the bufobj.

2.1. UDF directory reading code performs bread() via the device vnode. [*]
2.2. this code passes to bread a size that is multiple of 2048.
2.3. this code passes to bread blkno that is calculated as 4*sector,
where sector is a number of a physical 2048-byte sector. [**]

[*]  - this seems to be an uncommon thing among filesystems.
[**] - I think that this is a requirement of the buffer cache system, because
internally it performs many calculations that seem to assume that block
size is always 512.
E.g. breadn() code has the following:
bp->b_iooffset = dbtob(bp->b_blkno); -- effectively multiplies by 512
bstrategy(bp);
And g_vfs_strategy has the following:
bip->bio_offset = bp->b_iooffset;
So, if the udf code passed blkno == sector, then bio_offset would be
incorrect.
Or maybe g_vfs_strategy should do some translation here from b_iooffset
to bio_offset, taking into account bo_bsize? So that the actual,
non-adjusted sector number could be passed to bread()?
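
A minimal sketch of that alternative, purely for illustration (the
function name is hypothetical; the current code just copies b_iooffset):

static void
g_vfs_strategy_bsize_aware(struct bufobj *bo, struct buf *bp)
{
	struct bio *bip;

	bip = g_alloc_bio();
	/* Scale by the provider's block size instead of trusting the
	 * dbtob()-derived b_iooffset that breadn() set. */
	bip->bio_offset = (off_t)bp->b_blkno * bo->bo_bsize;
	/* ... the rest as in the current g_vfs_strategy(): fill in
	 * bio_cmd, bio_data, bio_length from bp, set the done callback,
	 * and g_io_request(bip, bo->bo_private) ... */
}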

3.1. for a fresh buf getblk would assign the following:
bsize = bo->bo_bsize;
offset = blkno * bsize;
...
bp->b_blkno = bp->b_lblkno = blkno;
bp->b_offset = offset; --- this is where this discussion started
so b_offset of a buffer is 4*sector*2048.
This is a source of the trouble.
3.2. getblk would set bp->b_flags |= B_VMIO; if a vnode has a vobject
and our device vnode has it (step 1.3).
3.3. consequently allocbuf will execute code for B_VMIO case.
3.4. allocbuf will lookup/allocate pages by index which is calculated
from base index of OFF_TO_IDX(bp->b_offset), which is, in essence,
bp->b_offset divided by page size, 4096. So our base index is 2*sector.

4.1. Let's assume we bread() (via the device vnode, as described above)
from physical sector N and we read 6 physical sectors (6*2048 bytes).
4.2 Sectors will get mapped/tied/bound to VM pages as follows
(according to the above calculations):
sectors[N, N+1] -> page[2*N],
sectors[N+2, N+3] -> page[2*N + 1], /* the next page */
sectors[N+4, N+5] -> page[2*N + 2]  /* the next page */
4.5 Now let's consider the next read of X sectors but now starting from
sector N+1; repeating the calculations we get the following mapping:
sectors[(N+1), (N+1) + 1] -> page[2*(N+1)] = page[2*N + 2]
But this page already has cached data from sectors[N+4, N+5].
Problem!!
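
The arithmetic can be checked in isolation; a self-contained userland
sketch (N = 100 is an arbitrary example sector):

#include <stdio.h>

int
main(void)
{
	long N = 100;
	long secsize = 2048, pgsize = 4096;

	/* b_offset as getblk() computes it: blkno (4*sector) * bo_bsize */
	long base1 = (4 * N * secsize) / pgsize;	/* 2*N == 200 */
	long base2 = (4 * (N + 1) * secsize) / pgsize;	/* 2*N + 2 == 202 */

	/* read 1 (6 sectors) spans pages base1 .. base1 + 2; read 2
	 * (starting at N+1) begins on page base2 == base1 + 2: the same
	 * page, but holding data from different sectors */
	printf("read 1 last page: %ld, read 2 first page: %ld\n",
	    base1 + 2, base2);
	return (0);
}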

Theoretical calculations show it and practical testing confirms that.
So this is a proof that bread()-ing via a device vnode is broken if:
C1) the vnode was set up by g_vfs_open();
C2) sector size of underlying geom provider is not 512, but any
non-trivial multiple of it;
C3) bread size is sufficiently big;

Current UDF code for directory reading is the only place known to me
that meets all three of the above conditions (for sufficiently large
directories, to meet condition C3).

So I stated this, now let the wise speak.

I already have a patch that makes UDF read directories via the directory
vnodes. But the problem in general would remain.
Maybe g_vfs_open is the correct place to change, maybe g_vfs_strategy is
the place, maybe something else, maybe don't bother. I don't know.

-- 
Andriy Gapon