We have found a race condition in sysfs.c which occurs when unloading low-level
modules
(e.g., mlx4_ib) in the driver.
Specifically:
Although the kernel takes reference counts on sysfs files, it does not take
such counts
on modules which implement attribute reads.
For example, we have:
static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
char *buf)
{
struct port_table_attribute *tab_attr =
container_of(attr, struct port_table_attribute, attr);
u16 pkey;
ssize_t ret;
====>race condition HERE *****
ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
if (ret)
return ret;
return sprintf(buf, "0x%04x\n", pkey);
}
The sysfs file /sys/class/infiniband/<device>/ports/1/pkey/<pkey number> is
protected
from destruction while we are in show_port_pkey.
However, the underlying module which implements ib_query_pkey (in this case,
mlx4_ib) is not.
Thus, if another process is busy unloading mlx4_ib, and the time-slice of the
process
which is reading sysfs expires at the point indicated above in the code,
ib_query_pkey()
will fail with a page-fault (kernel panic), since it will not find the code
page which implements
ib_query_pkey() (inlined to the query_pkey() function in the low-level driver
virtual function table).
Now, when a low-level driver is unloaded, the following procedure (in sysfs.c)
is called:
void ib_device_unregister_sysfs(struct ib_device *device)
{
struct kobject *p, *t;
struct ib_port *port;
list_for_each_entry_safe(p, t, &device->port_list, entry) {
list_del(&p->entry);
port = container_of(p, struct ib_port, kobj);
mutex_lock(&port->mutex);
port->valid = 0;
sysfs_remove_group(p, &pma_group);
sysfs_remove_group(p, &port->pkey_group);
sysfs_remove_group(p, &port->gid_group);
mutex_unlock(&port->mutex);
kobject_put(p);
}
kobject_put(device->ports_parent);
device_unregister(&device->dev);
}
After this call, the kernel continues with unloading the low-level module.
However, until device_unregister(&device->dev) is invoked, the
sysfs attribute path for the low-level device is still valid. Hence the race
condition --
Process A Process B
--------- ---------------
1. starts unloading low-level mod
2. cat /sys/class/infiniband/...
3. Time slice runs out just before accessing
low-level
module for requested info.
4. Low-level module is fully unloaded
5. Page-fault panic when trying to access a
procedure in
the just-unloaded module.
Some attempt was made for some (but not all) of the "show" procedures to check
if the module is alive:
if (!ibdev_is_alive(p->ibdev))
return -ENODEV;
This narrows the race window considerably, but does not eliminate it. (I put
this fix in show_port_pkey(),
and was still able to generate the kernel panic).
The only way I was able to eliminate the kernel panic entirely was via a mutex
(declaration and init not shown):
static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
char *buf)
{
struct port_table_attribute *tab_attr =
container_of(attr, struct port_table_attribute, attr);
u16 pkey;
ssize_t ret;
mutex_lock(&p->mutex);
==> if (p->valid)
ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index,
&pkey);
else
ret = -EINVAL;
==> mutex_unlock(&p->mutex);
if (ret)
return ret;
return sprintf(buf, "0x%04x\n", pkey);
}
and:
void ib_device_unregister_sysfs(struct ib_device *device)
{
struct kobject *p, *t;
struct ib_port *port;
list_for_each_entry_safe(p, t, &device->port_list, entry) {
list_del(&p->entry);
port = container_of(p, struct ib_port, kobj);
==> mutex_lock(&port->mutex);
port->valid = 0;
sysfs_remove_group(p, &pma_group);
sysfs_remove_group(p, &port->pkey_group);
sysfs_remove_group(p, &port->gid_group);
==> mutex_unlock(&port->mutex);
kobject_put(p);
}
kobject_put(device->ports_parent);
device_unregister(&device->dev);
}
This is approach is fine for the port-based groups. What about class-device
attributes themselves?
I believe that the best approach is to add a sysfs_mutex to ib_device, and lock
that for ALL "show" methods
in this file.
Opinions?
- Jack
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general