Re: Why is TUNABLE_INT discouraged?

2010-08-08 Thread Garrett Cooper
2010/8/7 Ivan Voras ivo...@freebsd.org:
 2010/8/8 Dag-Erling Smørgrav d...@des.no:
 Garrett Cooper gcoo...@freebsd.org writes:
 Dag-Erling Smørgrav d...@des.no writes:
  Perhaps.  I don't remember all the details; I can't find a discussion in
  the list archives (other than me announcing the change in response to a
  bug report), but there must have been one, either on IRC or in Karlsruhe.
  In any case, I never removed TUNABLE_INT(), so...
 It does matter for integers on 64-bit vs 32-bit architectures though,
 right

 Not sure what you mean.  The original issue was that someone had used
 TUNABLE_INT() for something that was actually a memory address.  I
 changed it to TUNABLE_ULONG().  Of course, if your tunable is a boolean
 value or something like maxprocs, an int is fine - but so is a long.

Why would someone express a tunable in a memory address (not being
sarcastic... I just don't see why it makes sense right now, but if
there's a valid reason I'm more than happy to be educated :)..)?

 Semantically valid but using TUNABLE_INT to hold pointers is a
 developer bug, not the fault of the API, and there's nothing wrong
 with int as a data type in this context.

 Unless there is a real hidden danger in using TUNABLE_INT (and/or
 adding TUNABLE_UINT etc.) in the expected way, I'd vote for either
 removing the cautioning comment or rewriting it to say something like
 developers are hereby warned that ints cannot hold pointers on all
 architectures, if it is indeed such a little known fact among kernel
 developers :P

*grins cheekily in agreement* :).

Thanks,
-Garrett
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: glabel force sectorsize patch

2010-08-08 Thread Pawel Jakub Dawidek
On Sun, Aug 08, 2010 at 03:57:44AM +0200, Ivan Voras wrote:
 Hi,
 
 In order to help users having 4k sector drives which the system
 recognizes as 512 byte sector drives, I'm proposing a patch to glabel
 which enables it to use a forced sector size for its native-labeled
 providers. It is naturally only usable with glabel-native labels
 (those created by glabel label) and not partition and file system
 labels because we cannot add arbitrary new fields to metadata of those
 types.
 
 The patch is here:
 
 http://people.freebsd.org/~ivoras/diffs/glabel_ssize.patch
[...]
 This mechanism is a band-aid until there's a better way of dealing
 with 4k drives.

So why do you want to obfuscate glabel with it? For people to start
depend on it? Once we start supporting 4kB sectors what do we do with
such a change? Remove it and decrease version number? What people will
do with providers already labeled this way?

If its temporary, just allow to list providers you want to increase
sector size in /boot/loader.conf. Once we start supporting it properly
people might simply remove it from loader.conf and it should just work.

Glabel is not for that and I don't agree for such obfuscation.

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!


pgp9g74Rergrb.pgp
Description: PGP signature


Re: glabel force sectorsize patch

2010-08-08 Thread Pawel Jakub Dawidek
On Sun, Aug 08, 2010 at 02:02:17PM +0200, Ivan Voras wrote:
 On 8.8.2010 12:30, Pawel Jakub Dawidek wrote:
  So why do you want to obfuscate glabel with it? For people to start
  depend on it? Once we start supporting 4kB sectors what do we do with
  such a change? Remove it and decrease version number? What people will
  do with providers already labeled this way?
  
  If its temporary, just allow to list providers you want to increase
  sector size in /boot/loader.conf. Once we start supporting it properly
  people might simply remove it from loader.conf and it should just work.
  
  Glabel is not for that and I don't agree for such obfuscation.
 
 Of course, there are good and bad sides to it. My take on it is that the
 only bad side is that it really isn't glabel's primary function to
 (optionally) fixup geometry, while the good sides are:

It isn't its secondary function either.

 * glabel is in GENERIC and judging by the mailing lists' traffic it is
 one of the better used parts of the system so people are familiar with
 it. It is also already used as a perfectly valid fixup for device
 renaming, making both UFS and ZFS more stable for usage.

That's an excellent argument. But you know what? The em(4) is also in
GENERIC, why not to add it in there?

 * You can't really make people depend on glabel both because it is in
 GENERIC and because of it storing metadata in the last sector, making
 the rest of the drive completely usable without it in the event native
 4k sector support is grown.

I never said that. I do want people to depend on glabel, because it is
free of such ugly hacks, so I know it won't bite them in the future.

I don't want people to start depend on the fact that glabel supports
changing sector sizes.

Once we start supporting 4kB sectors properly people configuration will
stop working, because glabel won't be able to read its metadata anymore.
Your hack will break all configurations that started to depend on your
hack. In what I proposed, GEOM provider will be presented to glabel (or
any other GEOM class) as 4kB provider and everything will just work,
also after adding proper support for 4kB sectors.

 I'd like to hear comments from the wider audience. In respect with your
 comment, I will compromise: as 4k sector drives have become available
 over the counter more than 6 months ago and so far I think this is the
 first effort to give some support for them, I will commit this patch
 before 9.0 code freeze only if no other support gets developed.

I'll repeat. You won't commit this patch, because it is totally wrong
solution and can only do a lot of damage in the future.
If you look forward, even temporary solutions can be done right.

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!


pgp8NDzCMjfAM.pgp
Description: PGP signature


Re: Why is TUNABLE_INT discouraged?

2010-08-08 Thread Ed Maste
On Sun, Aug 08, 2010 at 01:30:19AM +0200, Ivan Voras wrote:

 2010/8/8 Dag-Erling Sm??rgrav d...@des.no:
  Garrett Cooper gcoo...@freebsd.org writes:
  Dag-Erling Sm??rgrav d...@des.no writes:
   Perhaps. ??I don't remember all the details; I can't find a discussion in
   the list archives (other than me announcing the change in response to a
   bug report), but there must have been one, either on IRC or in Karlsruhe.
   In any case, I never removed TUNABLE_INT(), so...
  It does matter for integers on 64-bit vs 32-bit architectures though,
  right
 
  Not sure what you mean. ??The original issue was that someone had used
  TUNABLE_INT() for something that was actually a memory address. ??I
  changed it to TUNABLE_ULONG(). ??Of course, if your tunable is a boolean
  value or something like maxprocs, an int is fine - but so is a long.
 
 Semantically valid but using TUNABLE_INT to hold pointers is a
 developer bug, not the fault of the API, and there's nothing wrong
 with int as a data type in this context.

I agree with Ivan here.  If we can't find an actual reason to
universally prefer long I'd like to remove this comment.

As a counterpoint to the preference for long I can offer a reason to
prefer int: the same variable is often exposed by both a tunable and a
sysctls, and a sysctl using long can have 32-bit compat issues.  (That
is, a 32-bit app using sysctlbyname will try to use 4 bytes for a long.)

-Ed
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: glabel force sectorsize patch

2010-08-08 Thread Pawel Jakub Dawidek
On Sun, Aug 08, 2010 at 02:57:20PM +0200, Marius Nünnerich wrote:
 On Sun, Aug 8, 2010 at 14:02, Ivan Voras ivo...@freebsd.org wrote:
  I'd like to hear comments from the wider audience. In respect with your
  comment, I will compromise: as 4k sector drives have become available
  over the counter more than 6 months ago and so far I think this is the
  first effort to give some support for them, I will commit this patch
  before 9.0 code freeze only if no other support gets developed.
 
 I do not like this at all. Even if it's just for the KISS and POLA
 principles. A geom should do one thing and do it right imo.
 Why not write a new geom class that does what you want?

New GEOM class only for sectorsize conversion that can operate on
metadata will be useful, not only to solve this particular problem.
Although keep in mind that if at some point disks will be detected and
presented as 4kB providers to the GEOM, this class won't be able to find
its metadata anymore (as it was stored in the last 512 bytes, not in the
last 4 kilobytes).

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!


pgpLef8MwEhAp.pgp
Description: PGP signature


Re: glabel force sectorsize patch

2010-08-08 Thread Marius Nünnerich
On Sun, Aug 8, 2010 at 14:02, Ivan Voras ivo...@freebsd.org wrote:
 On 8.8.2010 12:30, Pawel Jakub Dawidek wrote:
 On Sun, Aug 08, 2010 at 03:57:44AM +0200, Ivan Voras wrote:
 Hi,

 In order to help users having 4k sector drives which the system
 recognizes as 512 byte sector drives, I'm proposing a patch to glabel
 which enables it to use a forced sector size for its native-labeled
 providers. It is naturally only usable with glabel-native labels
 (those created by glabel label) and not partition and file system
 labels because we cannot add arbitrary new fields to metadata of those
 types.

 The patch is here:

 http://people.freebsd.org/~ivoras/diffs/glabel_ssize.patch
 [...]
 This mechanism is a band-aid until there's a better way of dealing
 with 4k drives.

 So why do you want to obfuscate glabel with it? For people to start
 depend on it? Once we start supporting 4kB sectors what do we do with
 such a change? Remove it and decrease version number? What people will
 do with providers already labeled this way?

 If its temporary, just allow to list providers you want to increase
 sector size in /boot/loader.conf. Once we start supporting it properly
 people might simply remove it from loader.conf and it should just work.

 Glabel is not for that and I don't agree for such obfuscation.

 Of course, there are good and bad sides to it. My take on it is that the
 only bad side is that it really isn't glabel's primary function to
 (optionally) fixup geometry, while the good sides are:

 * glabel is in GENERIC and judging by the mailing lists' traffic it is
 one of the better used parts of the system so people are familiar with
 it. It is also already used as a perfectly valid fixup for device
 renaming, making both UFS and ZFS more stable for usage.

 * You can't really make people depend on glabel both because it is in
 GENERIC and because of it storing metadata in the last sector, making
 the rest of the drive completely usable without it in the event native
 4k sector support is grown.

 I'd like to hear comments from the wider audience. In respect with your
 comment, I will compromise: as 4k sector drives have become available
 over the counter more than 6 months ago and so far I think this is the
 first effort to give some support for them, I will commit this patch
 before 9.0 code freeze only if no other support gets developed.

I do not like this at all. Even if it's just for the KISS and POLA
principles. A geom should do one thing and do it right imo.
Why not write a new geom class that does what you want?
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: sched_pin() versus PCPU_GET

2010-08-08 Thread Attilio Rao
2010/8/4  m...@freebsd.org:
 On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin j...@freebsd.org wrote:
 On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
 On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote:
  We've seen a few instances at work where witness_warn() in ast()
  indicates the sched lock is still held, but the place it claims it was
  held by is in fact sometimes not possible to keep the lock, like:
 
      thread_lock(td);
      td-td_flags = ~TDF_SELECT;
      thread_unlock(td);
 
  What I was wondering is, even though the assembly I see in objdump -S
  for witness_warn has the increment of td_pinned before the PCPU_GET:
 
  802db210:   65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
  802db217:   00 00
  802db219:   ff 83 04 01 00 00       incl   0x104(%rbx)
       * Pin the thread in order to avoid problems with thread migration.
       * Once that all verifies are passed about spinlocks ownership,
       * the thread is in a safe path and it can be unpinned.
       */
      sched_pin();
      lock_list = PCPU_GET(spinlocks);
  802db21f:   65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
  802db226:   00 00
      if (lock_list != NULL  lock_list-ll_count != 0) {
  802db228:   48 85 c0                test   %rax,%rax
       * Pin the thread in order to avoid problems with thread migration.
       * Once that all verifies are passed about spinlocks ownership,
       * the thread is in a safe path and it can be unpinned.
       */
      sched_pin();
      lock_list = PCPU_GET(spinlocks);
  802db22b:   48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
  802db232:   48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
      if (lock_list != NULL  lock_list-ll_count != 0) {
  802db239:   0f 84 ff 00 00 00       je     802db33e
  witness_warn+0x30e
  802db23f:   44 8b 60 50             mov    0x50(%rax),%r12d
 
  is it possible for the hardware to do any re-ordering here?
 
  The reason I'm suspicious is not just that the code doesn't have a
  lock leak at the indicated point, but in one instance I can see in the
  dump that the lock_list local from witness_warn is from the pcpu
  structure for CPU 0 (and I was warned about sched lock 0), but the
  thread id in panic_cpu is 2.  So clearly the thread was being migrated
  right around panic time.
 
  This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
  of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
 
  So... do we need some kind of barrier in the code for sched_pin() for
  it to really do what it claims?  Could the hardware have re-ordered
  the mov    %gs:0x48,%rax PCPU_GET to before the sched_pin()
  increment?

 Hmmm, I think it might be able to because they refer to different locations.

 Note this rule in section 8.2.2 of Volume 3A:

   • Reads may be reordered with older writes to different locations but not
     with older writes to the same location.

 It is certainly true that sparc64 could reorder with RMO.  I believe ia64
 could reorder as well.  Since sched_pin/unpin are frequently used to provide
 this sort of synchronization, we could use memory barriers in pin/unpin
 like so:

 sched_pin()
 {
       td-td_pinned = atomic_load_acq_int(td-td_pinned) + 1;
 }

 sched_unpin()
 {
       atomic_store_rel_int(td-td_pinned, td-td_pinned - 1);
 }

 We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but 
 they
 are slightly more heavyweight, though it would be more clear what is 
 happening
 I think.

 However, to actually get a race you'd have to have an interrupt fire and
 migrate you so that the speculative read was from the other CPU.  However, I
 don't think the speculative read would be preserved in that case.  The CPU
 has to return to a specific PC when it returns from the interrupt and it has
 no way of storing the state for what speculative reordering it might be
 doing, so presumably it is thrown away?  I suppose it is possible that it
 actually retires both instructions (but reordered) and then returns to the PC
 value after the read of listlocks after the interrupt.  However, in that case
 the scheduler would not migrate as it would see td_pinned != 0.  To get the
 race you have to have the interrupt take effect prior to modifying td_pinned,
 so I think the processor would have to discard the reordered read of
 listlocks so it could safely resume execution at the 'incl' instruction.

 The other nit there on x86 at least is that the incl instruction is doing
 both a read and a write and another rule in the section 8.2.2 is this:

  • Reads are not reordered with other reads.

 That would seem to prevent the read of listlocks from passing the read of
 td_pinned in the incl instruction on x86.

 I wonder how that's interpreted in the microcode, though?  I.e. if the
 incr instruction decodes to load, add, store, does the h/w allow the
 later reads to pass the final store?

 I added the 

Re: sched_pin() versus PCPU_GET

2010-08-08 Thread mdf
On Sun, Aug 8, 2010 at 2:43 PM, Attilio Rao atti...@freebsd.org wrote:
 2010/8/4  m...@freebsd.org:
 On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin j...@freebsd.org wrote:
 On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
 On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote:
  We've seen a few instances at work where witness_warn() in ast()
  indicates the sched lock is still held, but the place it claims it was
  held by is in fact sometimes not possible to keep the lock, like:
 
      thread_lock(td);
      td-td_flags = ~TDF_SELECT;
      thread_unlock(td);
 
  What I was wondering is, even though the assembly I see in objdump -S
  for witness_warn has the increment of td_pinned before the PCPU_GET:
 
  802db210:   65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
  802db217:   00 00
  802db219:   ff 83 04 01 00 00       incl   0x104(%rbx)
       * Pin the thread in order to avoid problems with thread migration.
       * Once that all verifies are passed about spinlocks ownership,
       * the thread is in a safe path and it can be unpinned.
       */
      sched_pin();
      lock_list = PCPU_GET(spinlocks);
  802db21f:   65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
  802db226:   00 00
      if (lock_list != NULL  lock_list-ll_count != 0) {
  802db228:   48 85 c0                test   %rax,%rax
       * Pin the thread in order to avoid problems with thread migration.
       * Once that all verifies are passed about spinlocks ownership,
       * the thread is in a safe path and it can be unpinned.
       */
      sched_pin();
      lock_list = PCPU_GET(spinlocks);
  802db22b:   48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
  802db232:   48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
      if (lock_list != NULL  lock_list-ll_count != 0) {
  802db239:   0f 84 ff 00 00 00       je     802db33e
  witness_warn+0x30e
  802db23f:   44 8b 60 50             mov    0x50(%rax),%r12d
 
  is it possible for the hardware to do any re-ordering here?
 
  The reason I'm suspicious is not just that the code doesn't have a
  lock leak at the indicated point, but in one instance I can see in the
  dump that the lock_list local from witness_warn is from the pcpu
  structure for CPU 0 (and I was warned about sched lock 0), but the
  thread id in panic_cpu is 2.  So clearly the thread was being migrated
  right around panic time.
 
  This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
  of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
 
  So... do we need some kind of barrier in the code for sched_pin() for
  it to really do what it claims?  Could the hardware have re-ordered
  the mov    %gs:0x48,%rax PCPU_GET to before the sched_pin()
  increment?

 Hmmm, I think it might be able to because they refer to different 
 locations.

 Note this rule in section 8.2.2 of Volume 3A:

   • Reads may be reordered with older writes to different locations but not
     with older writes to the same location.

 It is certainly true that sparc64 could reorder with RMO.  I believe ia64
 could reorder as well.  Since sched_pin/unpin are frequently used to 
 provide
 this sort of synchronization, we could use memory barriers in pin/unpin
 like so:

 sched_pin()
 {
       td-td_pinned = atomic_load_acq_int(td-td_pinned) + 1;
 }

 sched_unpin()
 {
       atomic_store_rel_int(td-td_pinned, td-td_pinned - 1);
 }

 We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but 
 they
 are slightly more heavyweight, though it would be more clear what is 
 happening
 I think.

 However, to actually get a race you'd have to have an interrupt fire and
 migrate you so that the speculative read was from the other CPU.  However, I
 don't think the speculative read would be preserved in that case.  The CPU
 has to return to a specific PC when it returns from the interrupt and it has
 no way of storing the state for what speculative reordering it might be
 doing, so presumably it is thrown away?  I suppose it is possible that it
 actually retires both instructions (but reordered) and then returns to the 
 PC
 value after the read of listlocks after the interrupt.  However, in that 
 case
 the scheduler would not migrate as it would see td_pinned != 0.  To get the
 race you have to have the interrupt take effect prior to modifying 
 td_pinned,
 so I think the processor would have to discard the reordered read of
 listlocks so it could safely resume execution at the 'incl' instruction.

 The other nit there on x86 at least is that the incl instruction is doing
 both a read and a write and another rule in the section 8.2.2 is this:

  • Reads are not reordered with other reads.

 That would seem to prevent the read of listlocks from passing the read of
 td_pinned in the incl instruction on x86.

 I wonder how that's interpreted in the microcode, though?  I.e. if the
 incr instruction decodes to load, add, 

Re: Why is TUNABLE_INT discouraged?

2010-08-08 Thread M. Warner Losh
In message: 20100808130624.gb40...@sandvine.com
Ed Maste ema...@freebsd.org writes:
: On Sun, Aug 08, 2010 at 01:30:19AM +0200, Ivan Voras wrote:
: 
:  2010/8/8 Dag-Erling Sm??rgrav d...@des.no:
:   Garrett Cooper gcoo...@freebsd.org writes:
:   Dag-Erling Sm??rgrav d...@des.no writes:
:Perhaps. ??I don't remember all the details; I can't find a discussion 
in
:the list archives (other than me announcing the change in response to a
:bug report), but there must have been one, either on IRC or in 
Karlsruhe.
:In any case, I never removed TUNABLE_INT(), so...
:   It does matter for integers on 64-bit vs 32-bit architectures though,
:   right
:  
:   Not sure what you mean. ??The original issue was that someone had used
:   TUNABLE_INT() for something that was actually a memory address. ??I
:   changed it to TUNABLE_ULONG(). ??Of course, if your tunable is a boolean
:   value or something like maxprocs, an int is fine - but so is a long.
:  
:  Semantically valid but using TUNABLE_INT to hold pointers is a
:  developer bug, not the fault of the API, and there's nothing wrong
:  with int as a data type in this context.
: 
: I agree with Ivan here.  If we can't find an actual reason to
: universally prefer long I'd like to remove this comment.
: 
: As a counterpoint to the preference for long I can offer a reason to
: prefer int: the same variable is often exposed by both a tunable and a
: sysctls, and a sysctl using long can have 32-bit compat issues.  (That
: is, a 32-bit app using sysctlbyname will try to use 4 bytes for a long.)

There's absolutely no reason to not use TUNABLE_INT for simple
integers.  Back in the deep, dark, dim past, there was no
TUNABLE_*LONG.  TUNABLE_INT was introduce in r77900 by peter.  DES
added TUNABLE_LONG in r137099, as well as adding the comment.

The comment is bogus, or at least not quite specific enough.  It has
been routinely ignored since it was added.

There's absolutely nothing wrong with TUNABLE_INT.

We really should have a TUNABLE_ADDRESS for this case, although
there's at least three types of address that we need to worry about:
Physical Addresses, Virtual Addresses and Bus Addresses.  You don't
know if ULONG or LONG is the right type for an address, or if you need
a quad (for example, 32-bit MIPS can address, through PTE entries,
addresses beyond the 32-bit barrier, so you'd need a QUAD).

I'm in favor of changing the comment to:

/*
 * Please do not use for addresses or pointers
 */

Warner


___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: glabel force sectorsize patch

2010-08-08 Thread Marius Nünnerich
On Sun, Aug 8, 2010 at 21:08, Ivan Voras ivo...@freebsd.org wrote:
 On 8.8.2010 14:57, Marius Nünnerich wrote:
 On Sun, Aug 8, 2010 at 14:02, Ivan Voras ivo...@freebsd.org wrote:

 This mechanism is a band-aid until there's a better way of dealing
 with 4k drives.

 I do not like this at all. Even if it's just for the KISS and POLA
 principles. A geom should do one thing and do it right imo.

 As the addition will not modify general behaviour of glabel but add a
 new feature (which is actually clean and trivial to implement) invisible
 to most of the users, I don't think either KISS nor POLA are in any
 danger here.

Adding a new feature maps directly to KISS, especially if the
feature is in the wrong module.
POLA: I wouldn't guess that a blocksize resizing is hidden in a _part_
of glabel. I am not using the native glabel part at all, just the
named GPT partitions as most of the users seem to prefer nower days
(and I guess will get even more traction after Dan's blog post).

 I do agree that it shouldn't be glabel's job to do this but also am
 *very* strongly against shipping 9.0 without any support for 4k drives,
 and the way I've chosen is the lesser of two evils.

I am against workarounds for stupid hardware vendors most of the time.
Especially if it's just a minority, they break pola intentionally and
is fixed easily without this kludge. Afaik if you align your
Partitions to higher values (I use 1MB for example) ufs is not having
any performance issues (I have not benchmarked this myself).

 Code and patches by others are of course welcome. I'm hoping this
 discussion will trigger someone with experience in the lower levels of
 kernel to go and finally add the drive info parsing so it gets solved
 the right way :)

 Why not write a new geom class that does what you want?

 I'm not against this approach also. Technically, if we go this way, the
 new GEOM class will be almost a line-for-line copy-paste of glabel with
 this single metadata field added, so I'd rather fold it into glabel.

I did not think of a new GEOM class that looks like glabel but one
that has no metadata stored on disk . It is then activated and
controlled by loader.conf variables. (Maybe like gnop? If I remember
correctly, I did not take a look at that class for ages).

This way you would get:
 - Your feature
 - no KISS violation, that class should be really simple
 - no POLA violation, feature is in a class with a discriptive name,
glabel is left alone
 - no metadata store problem (is it in the last 512 or 4K bytes?)
 - No problem with future compatibilty, a user would have to active
the class and it's configuration by hand, no magic here

Marius
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Why is TUNABLE_INT discouraged?

2010-08-08 Thread Garrett Cooper
On Sun, Aug 8, 2010 at 11:14 AM, M. Warner Losh i...@bsdimp.com wrote:
 In message: 20100808130624.gb40...@sandvine.com
            Ed Maste ema...@freebsd.org writes:
 : On Sun, Aug 08, 2010 at 01:30:19AM +0200, Ivan Voras wrote:
 :
 :  2010/8/8 Dag-Erling Sm??rgrav d...@des.no:
 :   Garrett Cooper gcoo...@freebsd.org writes:
 :   Dag-Erling Sm??rgrav d...@des.no writes:
 :Perhaps. ??I don't remember all the details; I can't find a 
 discussion in
 :the list archives (other than me announcing the change in response 
 to a
 :bug report), but there must have been one, either on IRC or in 
 Karlsruhe.
 :In any case, I never removed TUNABLE_INT(), so...
 :   It does matter for integers on 64-bit vs 32-bit architectures though,
 :   right
 :  
 :   Not sure what you mean. ??The original issue was that someone had used
 :   TUNABLE_INT() for something that was actually a memory address. ??I
 :   changed it to TUNABLE_ULONG(). ??Of course, if your tunable is a boolean
 :   value or something like maxprocs, an int is fine - but so is a long.
 : 
 :  Semantically valid but using TUNABLE_INT to hold pointers is a
 :  developer bug, not the fault of the API, and there's nothing wrong
 :  with int as a data type in this context.
 :
 : I agree with Ivan here.  If we can't find an actual reason to
 : universally prefer long I'd like to remove this comment.
 :
 : As a counterpoint to the preference for long I can offer a reason to
 : prefer int: the same variable is often exposed by both a tunable and a
 : sysctls, and a sysctl using long can have 32-bit compat issues.  (That
 : is, a 32-bit app using sysctlbyname will try to use 4 bytes for a long.)

 There's absolutely no reason to not use TUNABLE_INT for simple
 integers.  Back in the deep, dark, dim past, there was no
 TUNABLE_*LONG.  TUNABLE_INT was introduce in r77900 by peter.  DES
 added TUNABLE_LONG in r137099, as well as adding the comment.

 The comment is bogus, or at least not quite specific enough.  It has
 been routinely ignored since it was added.

 There's absolutely nothing wrong with TUNABLE_INT.

 We really should have a TUNABLE_ADDRESS for this case, although
 there's at least three types of address that we need to worry about:
 Physical Addresses, Virtual Addresses and Bus Addresses.  You don't
 know if ULONG or LONG is the right type for an address, or if you need
 a quad (for example, 32-bit MIPS can address, through PTE entries,
 addresses beyond the 32-bit barrier, so you'd need a QUAD).

 I'm in favor of changing the comment to:

 /*
  * Please do not use for addresses or pointers
  */

Then comes my next request: could someone please review this patch
and commit it if it's ok? This adds comments to TUNABLE_INT
(recommended by Warner) and also adds functionality for TUNABLE_UINT
as well. I have a simple module and test script attached for it.
Adding this functionality would make things more consistent with
the TUNABLE KPIs, make my experimenting with sound(4) easier, and
could help improve the sound(4) subsystem's code (I'll have to discuss
some things with ariff@, because I'm not sure whether or not some
quantities need to be signed or not).
Thanks!
-Garrett


uint_tunable.diff
Description: Binary data
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

Re: NFS server hangs (was no subject)

2010-08-08 Thread Rick Macklem
 I have a similar problem.
 
 I have a NFS server (8.0 upgraded a couple times since Feb 2010) that
 locks up
 and requires a reboot.
 
 The clients are busy vm's from VMWare ESXi using the NFS server for
 vmdk virtual
 disk storage.
 
 The ESXi reports nfs server inactive and all the vm's post disk write
 errors when
 trying to write to their disk.
 
 /etc/rc.d/nfsd restart fails to work (it can not kill the nfsd
 process)
 
 The nfsd process runs at 100% cpu at rc_lo state in top.
 
 reboot is the only fix.
 
 It has only happened under two circumstances.
 1) Installation of a VM using Windows 2008.
 2) Migrating 16 million mail messages from a physical server to a VM
 running FreeBSD with ZFS file system as a VM on the ESXi box that uses
 NFS to store the VM's ZFS disk.
 
 The NFS server uses ZFS also.

I don't think what you are seeing is the same as what others have reported.
(I have a hunch that your problem might be a replay cache problem.)

Please try the attached patch and make sure that your sys/rpc/svc.c
is at r205562 (upgrade if it isn't).

If this patch doesn't help, you could try using the experimental nfs
server (which doesn't use the generic replay cache), by adding -e to
mountd and nfsd.

Please let me know if the patch or switching to the experimental nfs
server helps, rick

--- rpc/replay.c.sav	2010-08-08 18:05:50.0 -0400
+++ rpc/replay.c	2010-08-08 18:16:43.0 -0400
@@ -90,8 +90,10 @@
 replay_setsize(struct replay_cache *rc, size_t newmaxsize)
 {
 
+	mtx_lock(rc-rc_lock);
 	rc-rc_maxsize = newmaxsize;
 	replay_prune(rc);
+	mtx_unlock(rc-rc_lock);
 }
 
 void
@@ -144,8 +146,8 @@
 	bool_t freed_one;
 
 	if (rc-rc_count = REPLAY_MAX || rc-rc_size  rc-rc_maxsize) {
-		freed_one = FALSE;
 		do {
+			freed_one = FALSE;
 			/*
 			 * Try to free an entry. Don't free in-progress entries
 			 */
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org