Re: cam / ata timeout limited to 2147 due to overflow bug?

2011-08-05 Thread Kostik Belousov
On Fri, Aug 05, 2011 at 12:02:19AM +0100, Steven Hartland wrote:
 I'm working on adding security methods to camcontrol and have
 come up against a strange issue. It seems that the timeout
 value for cam, at least on ata (ahci), is limited to less than
 2148 seconds.
 
 This can be seen by running:-
 camcontrol identify ada0 -t 2148 -v
 (pass0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 
 00
 (pass0:ahcich0:0:0:0): CAM status: Command timeout
 
 Also seen in /var/log/messages at this time is:-
 Aug  4 23:29:51 cfdev kernel: ahcich0: Timeout on slot 24
 Aug  4 23:29:51 cfdev kernel: ahcich0: is  cs 0100 ss  
 rs 0100 tfd d0 serr 
 
 Dropping the timeout down to 2147 and the command runs fine.
 
 I've done some digging and it seems like this is implemented via:-
 sys/dev/ahci/ahci.c
 ahci_execute_transaction(struct ahci_slot *slot)
 {
 ...
/* Start command execution timeout */
callout_reset(slot-timeout, (int)ccb-ccb_h.timeout * hz / 2000,
(timeout_t*)ahci_timeout, slot);
 
 Now its documented that:-
 Non-positive values of ticks are silently converted to the value 1
 
 So I suspect that this is what's happening resulting in an extremely
 small timeout instead of a large one. Now I know that passed in value
 to the timeout is seconds * 1000 so we should be seeing 2148000
 for ccb-ccb_h.timeout now multiply that by 1000 (hz) and your over
 the int wrap point 2147483647.
 
 So instead of the wrap point being 2147483 seconds (24 days), I suspect
 because of the way this is structured its actually 2147 seconds (26mins).
 
 If this is the case the fix is likely to be something like:-
 callout_reset(slot-timeout, (int)(ccb-ccb_h.timeout * (hz / 2000)),
For hz == 1000, hz / 2000 == 0 according to the C rules, so the
result will be 0 always.

 
 Does this sound reasonable? What I don't understand is why the /2000?
 
 For reference the reason for wanting a large timeout is that a
 secure erase of large media could take many hours so I'm using
 the erase time reported by the drive for this, in my case here is
 400 minutes.
 
 Currently this instantly fails with a Command timeout which is
 clearly not right.
 
Regards
Steve
 
 
 This e.mail is private and confidential between Multiplay (UK) Ltd. and the 
 person or entity to whom it is addressed. In the event of misdirection, the 
 recipient is prohibited from using, copying, printing or otherwise 
 disseminating it or any information contained in it. 
 In the event of misdirection, illegible or incomplete transmission please 
 telephone +44 845 868 1337
 or return the E.mail to postmas...@multiplay.co.uk.
 
 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


pgpMO4v8aH0Re.pgp
Description: PGP signature


Re: cam / ata timeout limited to 2147 due to overflow bug?

2011-08-05 Thread Eygene Ryabinkin
Steven, good day.

Fri, Aug 05, 2011 at 12:02:19AM +0100, Steven Hartland wrote:
 So I suspect that this is what's happening resulting in an extremely
 small timeout instead of a large one. Now I know that passed in value
 to the timeout is seconds * 1000 so we should be seeing 2148000
 for ccb-ccb_h.timeout now multiply that by 1000 (hz) and your over
 the int wrap point 2147483647.
 
 So instead of the wrap point being 2147483 seconds (24 days), I suspect
 because of the way this is structured its actually 2147 seconds (26mins).
 
 If this is the case the fix is likely to be something like:-
  callout_reset(slot-timeout, (int)(ccb-ccb_h.timeout * (hz / 2000)),

It will give you 0 timeout for all values of hz that are lower than
2000: hz is int, so you'll get integer division.  Since ccb_h.timeout
is u_int32_t, the proper way to handle this situation would be
{{{
(u_int64_t)ccb-ccb_h.timeout * (u_int32_t)hz)/2000
}}}
as long as the value of hz won't be greater than 2^32.

Can you try the patch at
  
http://codelabs.ru/fbsd/patches/ahci/AHCI-properly-convert-CAM-timeout-to-ticks.diff

 What I don't understand is why the /2000

It gives (timeout_in_ticks)/2.  The code in ahci_timeout does the following:
{{{
/* Check if slot was not being executed last time we checked. */
if (slot-state  AHCI_SLOT_EXECUTING) {
/* Check if slot started executing. */
sstatus = ATA_INL(ch-r_mem, AHCI_P_SACT);
ccs = (ATA_INL(ch-r_mem, AHCI_P_CMD)  AHCI_P_CMD_CCS_MASK)
 AHCI_P_CMD_CCS_SHIFT;
if ((sstatus  (1  slot-slot)) != 0 || ccs == slot-slot ||
ch-fbs_enabled)
slot-state = AHCI_SLOT_EXECUTING;

callout_reset(slot-timeout,
(int)slot-ccb-ccb_h.timeout * hz / 2000,
(timeout_t*)ahci_timeout, slot);
return;
}
}}}

So, my theory is that the first half of the timeout time is devoted
to the transition from AHCI_SLOT_RUNNING - AHCI_SLOT_EXECUTING and
the second one is the transition from AHCI_SLOT_RUNNING - TIMEOUT
to give the whole process the duration of a full timeout.  However,
judging by the code, if the slot won't start executing at the first
invocation of ahci_timeout that was spawned by the callout armed in
ahci_execute_transaction, we can have timeouts more than for the
specified amount of time.  And if the slot will never start its
execution, the callout will spin forever, unless I am missing something
important here.

May be Alexander can shed some light into this?
-- 
Eygene Ryabinkin,,,^..^,,,
[ Life's unfair - but root password helps!   | codelabs.ru ]
[ 82FE 06BC D497 C0DE 49EC  4FF0 16AF 9EAE 8152 ECFB | freebsd.org ]


pgpECHSLq9JWo.pgp
Description: PGP signature


Re: cam / ata timeout limited to 2147 due to overflow bug?

2011-08-05 Thread Steven Hartland

Fri, Aug 05, 2011 at 12:02:19AM +0100, Steven Hartland wrote:

So I suspect that this is what's happening resulting in an extremely
small timeout instead of a large one. Now I know that passed in value
to the timeout is seconds * 1000 so we should be seeing 2148000
for ccb-ccb_h.timeout now multiply that by 1000 (hz) and your over
the int wrap point 2147483647.

So instead of the wrap point being 2147483 seconds (24 days), I suspect
because of the way this is structured its actually 2147 seconds (26mins).

If this is the case the fix is likely to be something like:-
 callout_reset(slot-timeout, (int)(ccb-ccb_h.timeout * (hz / 2000)),


It will give you 0 timeout for all values of hz that are lower than
2000: hz is int, so you'll get integer division.  Since ccb_h.timeout
is u_int32_t, the proper way to handle this situation would be
{{{
(u_int64_t)ccb-ccb_h.timeout * (u_int32_t)hz)/2000
}}}
as long as the value of hz won't be greater than 2^32.


Ahh of course, was late ;-)


Can you try the patch at
 
http://codelabs.ru/fbsd/patches/ahci/AHCI-properly-convert-CAM-timeout-to-ticks.diff


What I don't understand is why the /2000


It gives (timeout_in_ticks)/2.  The code in ahci_timeout does the following:
{{{
   /* Check if slot was not being executed last time we checked. */
   if (slot-state  AHCI_SLOT_EXECUTING) {

snip..


So, my theory is that the first half of the timeout time is devoted
to the transition from AHCI_SLOT_RUNNING - AHCI_SLOT_EXECUTING and
the second one is the transition from AHCI_SLOT_RUNNING - TIMEOUT
to give the whole process the duration of a full timeout.  However,
judging by the code, if the slot won't start executing at the first
invocation of ahci_timeout that was spawned by the callout armed in
ahci_execute_transaction, we can have timeouts more than for the
specified amount of time.  And if the slot will never start its
execution, the callout will spin forever, unless I am missing something
important here.

May be Alexander can shed some light into this?


Interesting thanks for the explaination.

I've tried the patch and it a few cut and paste errors, which I've fixed,
and confirmed it works as expected, so thanks for that :)

There's also a load more drivers with the same issue so I've gone through
and fixed all the occurances I can find. Here's the updated patch:-
http://blog.multiplay.co.uk/dropzone/freebsd/ccb_timeout.patch

   Regards
   Steve


This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. 


In the event of misdirection, illegible or incomplete transmission please 
telephone +44 845 868 1337
or return the E.mail to postmas...@multiplay.co.uk.

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: cam / ata timeout limited to 2147 due to overflow bug?

2011-08-05 Thread Eygene Ryabinkin
Fri, Aug 05, 2011 at 10:59:43AM +0100, Steven Hartland wrote:
 I've tried the patch and it a few cut and paste errors, which I've fixed,

Thanks for spotting that!

 and confirmed it works as expected, so thanks for that :)
 
 There's also a load more drivers with the same issue so I've gone through
 and fixed all the occurances I can find. Here's the updated patch:-
 http://blog.multiplay.co.uk/dropzone/freebsd/ccb_timeout.patch

I had found a couple of missed drivers, fixed overlong lines and fixed
the missing 10 in the sys/dev/hptrr/hptrr_os_bsd.c.  Also changed ciss
to have u_int32_t timeouts instead of int ones: this should not harm
anything, because all passed timeouts are explicit numbers that are
not larger than 10.  And I had also renamed
CAM_HDR_TIMEOUT_TO_TICKS to the base CAM_TIMEOUT_TO_TICKS, because it
seems that every CAM timeout is 32-bit long.  The new patch lives at
  http://codelabs.ru/fbsd/patches/cam/CAM-properly-convert-timeout-to-ticks.diff

But there are some cases where the argument to the
CAM_TIMEOUT_TO_TICKS is int and not u_int32_t.  It should be mostly
harmless for now, since the values do not exceed 2^32, but my current
feeling about timeouts that are counted in milliseconds that there
should be an in-kernel type for this stuff.  Seems like 32-bit wide
unsigned value is good for it: maximal value is around 46 days that
should be fine for the millisecond-precision timeout.

Through my grep session for the kernel sources I had seen other
(t * hz / 1000) constructs, so I feel that the fix should be
extended to cover these cases as well.

I am interested in the other's opinions on this.

Thanks again!
-- 
Eygene Ryabinkin,,,^..^,,,
[ Life's unfair - but root password helps!   | codelabs.ru ]
[ 82FE 06BC D497 C0DE 49EC  4FF0 16AF 9EAE 8152 ECFB | freebsd.org ]


pgpHUczlg0J1x.pgp
Description: PGP signature


Re: module_register_init fails, but driver is still loaded?

2011-08-05 Thread Sergey Kandaurov
On 4 August 2011 20:23, Garrett Cooper yaneg...@gmail.com wrote:
 Hi hackers,
    I noticed that if anything fails while initializing a driver, the
 driver stays attached to the kernel as a module instead of being
 kicked when all references to the driver go to 0. Is this desired
 behavior (it doesn't seem like it, but I can see potential pros and
 cons of kicking the driver out of the kernel immediately when a
 failure state occurs)? I've seen this on 7.2 ~ 9-CURRENT. Example
 sourcecode and invocation attached below.

Hi.
I have cooked something that might work, though I don't know how much
is it correct from locking  cleanup side. Can you try it? Anyway, in its
current form we cannot return error from module_register_init() because
it's usually called from SYSINIT, so kldload(8) will say nonsense:
can't load ./bad_module.ko: No error: 0.

Index: sys/kern/kern_module.c
===
--- sys/kern/kern_module.c  (revision 224471)
+++ sys/kern/kern_module.c  (working copy)
@@ -112,6 +117,7 @@ module_register_init(const void *arg)
const moduledata_t *data = (const moduledata_t *)arg;
int error;
module_t mod;
+   linker_file_t lf;

mtx_lock(Giant);
MOD_SLOCK;
@@ -123,12 +129,14 @@ module_register_init(const void *arg)
error = MOD_EVENT(mod, MOD_LOAD);
if (error) {
MOD_EVENT(mod, MOD_UNLOAD);
+   lf = mod-file;
MOD_XLOCK;
module_release(mod);
MOD_XUNLOCK;
printf(module_register_init: MOD_LOAD (%s, %p, %p) error
 %d\n, data-name, (void *)data-evhand, data-priv,
error);
+   linker_release_module(NULL, NULL, lf);
} else {
MOD_XLOCK;
if (mod-file) {


-- 
wbr,
pluknet
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: module_register_init fails, but driver is still loaded?

2011-08-05 Thread Andriy Gapon
on 05/08/2011 18:21 Sergey Kandaurov said the following:
 On 4 August 2011 20:23, Garrett Cooper yaneg...@gmail.com wrote:
 Hi hackers,
I noticed that if anything fails while initializing a driver, the
 driver stays attached to the kernel as a module instead of being
 kicked when all references to the driver go to 0. Is this desired
 behavior (it doesn't seem like it, but I can see potential pros and
 cons of kicking the driver out of the kernel immediately when a
 failure state occurs)? I've seen this on 7.2 ~ 9-CURRENT. Example
 sourcecode and invocation attached below.
 
 Hi.
 I have cooked something that might work, though I don't know how much
 is it correct from locking  cleanup side. Can you try it? Anyway, in its
 current form we cannot return error from module_register_init() because
 it's usually called from SYSINIT, so kldload(8) will say nonsense:
 can't load ./bad_module.ko: No error: 0.

Have you also accounted for a situation where multiple logical modules shared 
the
same kernel loadable file?  Especially when some modules load successfully while
others fail.

 Index: sys/kern/kern_module.c
 ===
 --- sys/kern/kern_module.c  (revision 224471)
 +++ sys/kern/kern_module.c  (working copy)
 @@ -112,6 +117,7 @@ module_register_init(const void *arg)
 const moduledata_t *data = (const moduledata_t *)arg;
 int error;
 module_t mod;
 +   linker_file_t lf;
 
 mtx_lock(Giant);
 MOD_SLOCK;
 @@ -123,12 +129,14 @@ module_register_init(const void *arg)
 error = MOD_EVENT(mod, MOD_LOAD);
 if (error) {
 MOD_EVENT(mod, MOD_UNLOAD);
 +   lf = mod-file;
 MOD_XLOCK;
 module_release(mod);
 MOD_XUNLOCK;
 printf(module_register_init: MOD_LOAD (%s, %p, %p) error
  %d\n, data-name, (void *)data-evhand, data-priv,
 error);
 +   linker_release_module(NULL, NULL, lf);
 } else {
 MOD_XLOCK;
 if (mod-file) {
 
 


-- 
Andriy Gapon
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


top(1) loses process user time count when threads end

2011-08-05 Thread Yuri
I have the process that first runs in 3 threads but later two active 
threads exit.


top(1) shows this moment this way (1 sec intervals):
30833 yuri3  760  4729M  4225M nanslp  4   0:32 88.62% app
30833 yuri3  760  4729M  4225M nanslp  6   0:34 90.92% app
30833 yuri1  960  4729M  4225M CPU11   0:03  1.17% app
30833 yuri1  980  4729M  4226M CPU11   0:04 12.89% app

Process time goes down: 0:34 - 0:03. Also WCPU goes down 90.92% - 
1.17% even though this process is CPU bound and does intense things 
right after threads exit.


getrusage(2) though, called in the process, shows the correct user time.

I think this is the major bug in the process time accounting.

8.2-STABLE

Yuri
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org