Re: cam / ata timeout limited to 2147 due to overflow bug?
On Fri, Aug 05, 2011 at 12:02:19AM +0100, Steven Hartland wrote: I'm working on adding security methods to camcontrol and have come up against a strange issue. It seems that the timeout value for cam, at least on ata (ahci), is limited to less than 2148 seconds. This can be seen by running:- camcontrol identify ada0 -t 2148 -v (pass0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 (pass0:ahcich0:0:0:0): CAM status: Command timeout Also seen in /var/log/messages at this time is:- Aug 4 23:29:51 cfdev kernel: ahcich0: Timeout on slot 24 Aug 4 23:29:51 cfdev kernel: ahcich0: is cs 0100 ss rs 0100 tfd d0 serr Dropping the timeout down to 2147 and the command runs fine. I've done some digging and it seems like this is implemented via:- sys/dev/ahci/ahci.c ahci_execute_transaction(struct ahci_slot *slot) { ... /* Start command execution timeout */ callout_reset(slot-timeout, (int)ccb-ccb_h.timeout * hz / 2000, (timeout_t*)ahci_timeout, slot); Now its documented that:- Non-positive values of ticks are silently converted to the value 1 So I suspect that this is what's happening resulting in an extremely small timeout instead of a large one. Now I know that passed in value to the timeout is seconds * 1000 so we should be seeing 2148000 for ccb-ccb_h.timeout now multiply that by 1000 (hz) and your over the int wrap point 2147483647. So instead of the wrap point being 2147483 seconds (24 days), I suspect because of the way this is structured its actually 2147 seconds (26mins). If this is the case the fix is likely to be something like:- callout_reset(slot-timeout, (int)(ccb-ccb_h.timeout * (hz / 2000)), For hz == 1000, hz / 2000 == 0 according to the C rules, so the result will be 0 always. Does this sound reasonable? What I don't understand is why the /2000? For reference the reason for wanting a large timeout is that a secure erase of large media could take many hours so I'm using the erase time reported by the drive for this, in my case here is 400 minutes. Currently this instantly fails with a Command timeout which is clearly not right. Regards Steve This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmas...@multiplay.co.uk. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org pgpMO4v8aH0Re.pgp Description: PGP signature
Re: cam / ata timeout limited to 2147 due to overflow bug?
Steven, good day. Fri, Aug 05, 2011 at 12:02:19AM +0100, Steven Hartland wrote: So I suspect that this is what's happening resulting in an extremely small timeout instead of a large one. Now I know that passed in value to the timeout is seconds * 1000 so we should be seeing 2148000 for ccb-ccb_h.timeout now multiply that by 1000 (hz) and your over the int wrap point 2147483647. So instead of the wrap point being 2147483 seconds (24 days), I suspect because of the way this is structured its actually 2147 seconds (26mins). If this is the case the fix is likely to be something like:- callout_reset(slot-timeout, (int)(ccb-ccb_h.timeout * (hz / 2000)), It will give you 0 timeout for all values of hz that are lower than 2000: hz is int, so you'll get integer division. Since ccb_h.timeout is u_int32_t, the proper way to handle this situation would be {{{ (u_int64_t)ccb-ccb_h.timeout * (u_int32_t)hz)/2000 }}} as long as the value of hz won't be greater than 2^32. Can you try the patch at http://codelabs.ru/fbsd/patches/ahci/AHCI-properly-convert-CAM-timeout-to-ticks.diff What I don't understand is why the /2000 It gives (timeout_in_ticks)/2. The code in ahci_timeout does the following: {{{ /* Check if slot was not being executed last time we checked. */ if (slot-state AHCI_SLOT_EXECUTING) { /* Check if slot started executing. */ sstatus = ATA_INL(ch-r_mem, AHCI_P_SACT); ccs = (ATA_INL(ch-r_mem, AHCI_P_CMD) AHCI_P_CMD_CCS_MASK) AHCI_P_CMD_CCS_SHIFT; if ((sstatus (1 slot-slot)) != 0 || ccs == slot-slot || ch-fbs_enabled) slot-state = AHCI_SLOT_EXECUTING; callout_reset(slot-timeout, (int)slot-ccb-ccb_h.timeout * hz / 2000, (timeout_t*)ahci_timeout, slot); return; } }}} So, my theory is that the first half of the timeout time is devoted to the transition from AHCI_SLOT_RUNNING - AHCI_SLOT_EXECUTING and the second one is the transition from AHCI_SLOT_RUNNING - TIMEOUT to give the whole process the duration of a full timeout. However, judging by the code, if the slot won't start executing at the first invocation of ahci_timeout that was spawned by the callout armed in ahci_execute_transaction, we can have timeouts more than for the specified amount of time. And if the slot will never start its execution, the callout will spin forever, unless I am missing something important here. May be Alexander can shed some light into this? -- Eygene Ryabinkin,,,^..^,,, [ Life's unfair - but root password helps! | codelabs.ru ] [ 82FE 06BC D497 C0DE 49EC 4FF0 16AF 9EAE 8152 ECFB | freebsd.org ] pgpECHSLq9JWo.pgp Description: PGP signature
Re: cam / ata timeout limited to 2147 due to overflow bug?
Fri, Aug 05, 2011 at 12:02:19AM +0100, Steven Hartland wrote: So I suspect that this is what's happening resulting in an extremely small timeout instead of a large one. Now I know that passed in value to the timeout is seconds * 1000 so we should be seeing 2148000 for ccb-ccb_h.timeout now multiply that by 1000 (hz) and your over the int wrap point 2147483647. So instead of the wrap point being 2147483 seconds (24 days), I suspect because of the way this is structured its actually 2147 seconds (26mins). If this is the case the fix is likely to be something like:- callout_reset(slot-timeout, (int)(ccb-ccb_h.timeout * (hz / 2000)), It will give you 0 timeout for all values of hz that are lower than 2000: hz is int, so you'll get integer division. Since ccb_h.timeout is u_int32_t, the proper way to handle this situation would be {{{ (u_int64_t)ccb-ccb_h.timeout * (u_int32_t)hz)/2000 }}} as long as the value of hz won't be greater than 2^32. Ahh of course, was late ;-) Can you try the patch at http://codelabs.ru/fbsd/patches/ahci/AHCI-properly-convert-CAM-timeout-to-ticks.diff What I don't understand is why the /2000 It gives (timeout_in_ticks)/2. The code in ahci_timeout does the following: {{{ /* Check if slot was not being executed last time we checked. */ if (slot-state AHCI_SLOT_EXECUTING) { snip.. So, my theory is that the first half of the timeout time is devoted to the transition from AHCI_SLOT_RUNNING - AHCI_SLOT_EXECUTING and the second one is the transition from AHCI_SLOT_RUNNING - TIMEOUT to give the whole process the duration of a full timeout. However, judging by the code, if the slot won't start executing at the first invocation of ahci_timeout that was spawned by the callout armed in ahci_execute_transaction, we can have timeouts more than for the specified amount of time. And if the slot will never start its execution, the callout will spin forever, unless I am missing something important here. May be Alexander can shed some light into this? Interesting thanks for the explaination. I've tried the patch and it a few cut and paste errors, which I've fixed, and confirmed it works as expected, so thanks for that :) There's also a load more drivers with the same issue so I've gone through and fixed all the occurances I can find. Here's the updated patch:- http://blog.multiplay.co.uk/dropzone/freebsd/ccb_timeout.patch Regards Steve This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmas...@multiplay.co.uk. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: cam / ata timeout limited to 2147 due to overflow bug?
Fri, Aug 05, 2011 at 10:59:43AM +0100, Steven Hartland wrote: I've tried the patch and it a few cut and paste errors, which I've fixed, Thanks for spotting that! and confirmed it works as expected, so thanks for that :) There's also a load more drivers with the same issue so I've gone through and fixed all the occurances I can find. Here's the updated patch:- http://blog.multiplay.co.uk/dropzone/freebsd/ccb_timeout.patch I had found a couple of missed drivers, fixed overlong lines and fixed the missing 10 in the sys/dev/hptrr/hptrr_os_bsd.c. Also changed ciss to have u_int32_t timeouts instead of int ones: this should not harm anything, because all passed timeouts are explicit numbers that are not larger than 10. And I had also renamed CAM_HDR_TIMEOUT_TO_TICKS to the base CAM_TIMEOUT_TO_TICKS, because it seems that every CAM timeout is 32-bit long. The new patch lives at http://codelabs.ru/fbsd/patches/cam/CAM-properly-convert-timeout-to-ticks.diff But there are some cases where the argument to the CAM_TIMEOUT_TO_TICKS is int and not u_int32_t. It should be mostly harmless for now, since the values do not exceed 2^32, but my current feeling about timeouts that are counted in milliseconds that there should be an in-kernel type for this stuff. Seems like 32-bit wide unsigned value is good for it: maximal value is around 46 days that should be fine for the millisecond-precision timeout. Through my grep session for the kernel sources I had seen other (t * hz / 1000) constructs, so I feel that the fix should be extended to cover these cases as well. I am interested in the other's opinions on this. Thanks again! -- Eygene Ryabinkin,,,^..^,,, [ Life's unfair - but root password helps! | codelabs.ru ] [ 82FE 06BC D497 C0DE 49EC 4FF0 16AF 9EAE 8152 ECFB | freebsd.org ] pgpHUczlg0J1x.pgp Description: PGP signature
Re: module_register_init fails, but driver is still loaded?
On 4 August 2011 20:23, Garrett Cooper yaneg...@gmail.com wrote: Hi hackers, I noticed that if anything fails while initializing a driver, the driver stays attached to the kernel as a module instead of being kicked when all references to the driver go to 0. Is this desired behavior (it doesn't seem like it, but I can see potential pros and cons of kicking the driver out of the kernel immediately when a failure state occurs)? I've seen this on 7.2 ~ 9-CURRENT. Example sourcecode and invocation attached below. Hi. I have cooked something that might work, though I don't know how much is it correct from locking cleanup side. Can you try it? Anyway, in its current form we cannot return error from module_register_init() because it's usually called from SYSINIT, so kldload(8) will say nonsense: can't load ./bad_module.ko: No error: 0. Index: sys/kern/kern_module.c === --- sys/kern/kern_module.c (revision 224471) +++ sys/kern/kern_module.c (working copy) @@ -112,6 +117,7 @@ module_register_init(const void *arg) const moduledata_t *data = (const moduledata_t *)arg; int error; module_t mod; + linker_file_t lf; mtx_lock(Giant); MOD_SLOCK; @@ -123,12 +129,14 @@ module_register_init(const void *arg) error = MOD_EVENT(mod, MOD_LOAD); if (error) { MOD_EVENT(mod, MOD_UNLOAD); + lf = mod-file; MOD_XLOCK; module_release(mod); MOD_XUNLOCK; printf(module_register_init: MOD_LOAD (%s, %p, %p) error %d\n, data-name, (void *)data-evhand, data-priv, error); + linker_release_module(NULL, NULL, lf); } else { MOD_XLOCK; if (mod-file) { -- wbr, pluknet ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: module_register_init fails, but driver is still loaded?
on 05/08/2011 18:21 Sergey Kandaurov said the following: On 4 August 2011 20:23, Garrett Cooper yaneg...@gmail.com wrote: Hi hackers, I noticed that if anything fails while initializing a driver, the driver stays attached to the kernel as a module instead of being kicked when all references to the driver go to 0. Is this desired behavior (it doesn't seem like it, but I can see potential pros and cons of kicking the driver out of the kernel immediately when a failure state occurs)? I've seen this on 7.2 ~ 9-CURRENT. Example sourcecode and invocation attached below. Hi. I have cooked something that might work, though I don't know how much is it correct from locking cleanup side. Can you try it? Anyway, in its current form we cannot return error from module_register_init() because it's usually called from SYSINIT, so kldload(8) will say nonsense: can't load ./bad_module.ko: No error: 0. Have you also accounted for a situation where multiple logical modules shared the same kernel loadable file? Especially when some modules load successfully while others fail. Index: sys/kern/kern_module.c === --- sys/kern/kern_module.c (revision 224471) +++ sys/kern/kern_module.c (working copy) @@ -112,6 +117,7 @@ module_register_init(const void *arg) const moduledata_t *data = (const moduledata_t *)arg; int error; module_t mod; + linker_file_t lf; mtx_lock(Giant); MOD_SLOCK; @@ -123,12 +129,14 @@ module_register_init(const void *arg) error = MOD_EVENT(mod, MOD_LOAD); if (error) { MOD_EVENT(mod, MOD_UNLOAD); + lf = mod-file; MOD_XLOCK; module_release(mod); MOD_XUNLOCK; printf(module_register_init: MOD_LOAD (%s, %p, %p) error %d\n, data-name, (void *)data-evhand, data-priv, error); + linker_release_module(NULL, NULL, lf); } else { MOD_XLOCK; if (mod-file) { -- Andriy Gapon ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
top(1) loses process user time count when threads end
I have the process that first runs in 3 threads but later two active threads exit. top(1) shows this moment this way (1 sec intervals): 30833 yuri3 760 4729M 4225M nanslp 4 0:32 88.62% app 30833 yuri3 760 4729M 4225M nanslp 6 0:34 90.92% app 30833 yuri1 960 4729M 4225M CPU11 0:03 1.17% app 30833 yuri1 980 4729M 4226M CPU11 0:04 12.89% app Process time goes down: 0:34 - 0:03. Also WCPU goes down 90.92% - 1.17% even though this process is CPU bound and does intense things right after threads exit. getrusage(2) though, called in the process, shows the correct user time. I think this is the major bug in the process time accounting. 8.2-STABLE Yuri ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org