Re: [PATCHv1 0/6] rdma controller support

2016-01-07 Thread Tejun Heo
Hello, Parav.

On Thu, Jan 07, 2016 at 04:43:20AM +0530, Parav Pandit wrote:
> > If different controllers can't agree upon the
> > same set of resources, which probably is a pretty good sign that this
> > isn't too well thought out to begin with,
> 
> When you said "different controller" you meant "different hw vendors", right?
> Or you meant, rdma, mem, cpu as controller here?

Different hw vendors.

> > at least make all resource
> > types defined by the controller itself and let the controllers enable
> > them selectively.
> >
> In this V1 patch, resources are defined by the IB stack and the rdma
> cgroup is a facilitator for the same.
> By doing so, IB stack modules can define new resource without really
> making changes to cgroup.
> This design also allows hw vendors to define their own resources which
> will be reviewed on the rdma mailing list anyway.
> The idea is different hw versions can have different resource support,
> so the whole intention is not about defining different resource but
> rather enabling it.
> But yes, I equally agree that by doing so, different hw controller
> vendors can define different hw resources.

How many vendors and resources are we talking about?  What I was
trying to say was that unless the number is extremely high, it'd be
far simpler to hard code them in the rdma controller and let drivers
enable the ones which apply to them.  It would require updating the
rdma cgroup controller to add new resource types but I think that'd
actually be an upside, not a downside.  There need to be some checks and
balances against adding new resource types; otherwise, it'll soon
become a mess.
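To illustrate the shape I mean (a rough sketch; every name here is made
up rather than taken from the patch set):

	/* the rdma controller owns the full, fixed list of resource types */
	enum rdmacg_resource_type {
		RDMACG_VERB_RESOURCE_QP,
		RDMACG_VERB_RESOURCE_CQ,
		RDMACG_VERB_RESOURCE_MR,
		RDMACG_VERB_RESOURCE_AH,
		RDMACG_RESOURCE_MAX,	/* adding a type means patching the controller */
	};

	struct rdmacg_device {
		/* which of the fixed types this device actually supports */
		DECLARE_BITMAP(enabled, RDMACG_RESOURCE_MAX);
	};

	static void foo_hca_register_rdmacg(struct rdmacg_device *dev)
	{
		/* a driver only flips on what its hardware can do */
		set_bit(RDMACG_VERB_RESOURCE_QP, dev->enabled);
		set_bit(RDMACG_VERB_RESOURCE_CQ, dev->enabled);
		set_bit(RDMACG_VERB_RESOURCE_MR, dev->enabled);
	}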

Thanks.

-- 
tejun


Re: [PATCHv1 2/6] IB/core: Added members to support rdma cgroup

2016-01-07 Thread Tejun Heo
On Thu, Jan 07, 2016 at 04:46:19AM +0530, Parav Pandit wrote:
> On Wed, Jan 6, 2016 at 3:26 AM, Tejun Heo <t...@kernel.org> wrote:
> > On Wed, Jan 06, 2016 at 12:28:02AM +0530, Parav Pandit wrote:
> >> Added function pointer table to store resource pool specific
> >> operation for each resource type (verb and hw).
> >> Added list node to link device to rdma cgroup so that it can
> >> participate in resource accounting and limit configuration.
> >
> > Is there any point in splitting patches 1 and 2 from 3?
> >
> Patch 2 is in IB stack, so I separated that patch out from 1. That
> makes it 3 patches.
> If you think a single patch is easier to review, let me know; I can
> respin to have one patch for these 3 smaller patches.

They don't do anything by themselves.  I think putting them into a
single patch would be easier to review.

Thanks.

-- 
tejun


Re: [PATCHv1 6/6] rdmacg: Added documentation for rdma controller.

2016-01-06 Thread Tejun Heo
Hello,

On Thu, Jan 07, 2016 at 04:14:26AM +0530, Parav Pandit wrote:
> Yes. I read through. I can see two changes to be made in the V2 version
> of this patch:
> 1. rdma.resource.verb.usage and rdma.resource.verb.limit change
> respectively to rdma.resource.verb.stat and rdma.resource.verb.max.
> 2. rdma.resource.verb.failcnt indicates failure events, which I think
> should go to events.

What's up with the ".resource" part?  Also, can't the .max file list
the available resources?  Why does it need a separate list file?
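For illustration only (device and resource names invented here, nothing
final), the conventional shape would be a single .max file which both
enumerates the limitable resources per device and accepts new limits:

	$ cat rdma.max
	mlx4_0 qp=max cq=max mr=max
	ocrdma1 qp=100 cq=200 mr=max
	$ echo "mlx4_0 qp=50" > rdma.max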

> I will roll out a new patch for events after this one as an additional
> feature and remove this feature in V2.
> 
> The rdma.resource.verb.list file is unique to the rdma cgroup, so I believe
> this is fine.

Please see above.

> We will conclude whether to have rdma.resource.hw. or not in
> other patches.
> I am of the opinion to keep "resource" and "verb" or "hw" tags around to
> keep it verbose enough to know what are we trying to control.

What does that achieve?  I feel like it's constantly getting
overengineered.

Thanks.

-- 
tejun


Re: [PATCHv1 2/6] IB/core: Added members to support rdma cgroup

2016-01-05 Thread Tejun Heo
On Wed, Jan 06, 2016 at 12:28:02AM +0530, Parav Pandit wrote:
> Added function pointer table to store resource pool specific
> operation for each resource type (verb and hw).
> Added list node to link device to rdma cgroup so that it can
> participate in resource accounting and limit configuration.

Is there any point in splitting patches 1 and 2 from 3?

-- 
tejun


Re: [PATCHv1 0/6] rdma controller support

2016-01-05 Thread Tejun Heo
Hello,

On Wed, Jan 06, 2016 at 12:28:00AM +0530, Parav Pandit wrote:
> Resources are not defined by the RDMA cgroup. Resources are defined
> by RDMA/IB stack & optionally by HCA vendor device drivers.

As I wrote before, I don't think this is a good idea.  Drivers will
inevitably add non-sensical "resources" which don't make any sense
without much scrutiny.  If different controllers can't agree upon the
same set of resources, which probably is a pretty good sign that this
isn't too well thought out to begin with, at least make all resource
types defined by the controller itself and let the controllers enable
them selectively.

Thanks.

-- 
tejun


Re: [PATCHv1 6/6] rdmacg: Added documentation for rdma controller.

2016-01-05 Thread Tejun Heo
Hello,

On Wed, Jan 06, 2016 at 12:28:06AM +0530, Parav Pandit wrote:
> +5-4-1. RDMA Interface Files
> +
> +  rdma.resource.verb.list
> +  rdma.resource.verb.limit
> +  rdma.resource.verb.usage
> +  rdma.resource.verb.failcnt
> +  rdma.resource.hw.list
> +  rdma.resource.hw.limit
> +  rdma.resource.hw.usage
> +  rdma.resource.hw.failcnt

Can you please read the rest of cgroup.txt and put the interface in
line with the common conventions followed by other controllers?

Thanks.

-- 
tejun


Re: [PATCHv1 3/6] rdmacg: implements rdma cgroup

2016-01-05 Thread Tejun Heo
Hello,

On Wed, Jan 06, 2016 at 12:28:03AM +0530, Parav Pandit wrote:
> +/* hash table to keep map of tgid to owner cgroup */
> +DEFINE_HASHTABLE(pid_cg_map_tbl, 7);
> +DEFINE_SPINLOCK(pid_cg_map_lock);/* lock to protect hash table access */
> +
> +/* Keeps mapping of pid to its owning cgroup at rdma level,
> + * This mapping doesn't change, even if process migrates from one to other
> + * rdma cgroup.
> + */
> +struct pid_cg_map {
> + struct pid *pid;/* hash key */
> + struct rdma_cgroup *cg;
> +
> + struct hlist_node hlist;/* pid to cgroup hash table link */
> + atomic_t refcnt;/* count active user tasks to figure out
> +  * when to free the memory
> +  */
> +};

Ugh, there's something clearly wrong here.  Why does the rdma
controller need to keep track of pid cgroup membership?

> +static void _dealloc_cg_rpool(struct rdma_cgroup *cg,
> +   struct cg_resource_pool *rpool)
> +{
> + spin_lock(&cg->cg_list_lock);
> +
> + /* if its started getting used by other task,
> +  * before we take the spin lock, then skip,
> +  * freeing it.
> +  */

Please follow CodingStyle.

> + if (atomic_read(&rpool->refcnt) == 0) {
> + list_del_init(&rpool->cg_list);
> + spin_unlock(&cg->cg_list_lock);
> +
> + _free_cg_rpool(rpool);
> + return;
> + }
> + spin_unlock(&cg->cg_list_lock);
> +}
> +
> +static void dealloc_cg_rpool(struct rdma_cgroup *cg,
> +  struct cg_resource_pool *rpool)
> +{
> + /* Don't free the resource pool which is created by the
> +  * user, otherwise we miss the configured limits. We don't
> +  * gain much either by splitting storage of limit and usage.
> +  * So keep it around until user deletes the limits.
> +  */
> + if (atomic_read(&rpool->creator) == RDMACG_RPOOL_CREATOR_DEFAULT)
> + _dealloc_cg_rpool(cg, rpool);

I'm pretty sure you can get away with a fixed-length array of
counters.  Please keep it simple.  It's a simple hard limit enforcer.
There's no need to create a massive dynamic infrastructure.
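A minimal sketch of that fixed-length shape (illustrative names only,
locking elided):

	struct rdmacg_resource {
		int usage;
		int limit;	/* hard cap; negative means no limit */
	};

	/* one array per (cgroup, device) pair, indexed by resource type */
	struct rdmacg_resource_pool {
		struct rdmacg_resource res[RDMACG_RESOURCE_MAX];
	};

	static int rdmacg_try_charge(struct rdmacg_resource_pool *pool,
				     int type, int num)
	{
		struct rdmacg_resource *r = &pool->res[type];

		if (r->limit >= 0 && r->usage + num > r->limit)
			return -EAGAIN;		/* hard limit hit */
		r->usage += num;
		return 0;
	}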

Thanks.

-- 
tejun


Re: RFC rdma cgroup

2015-11-24 Thread Tejun Heo
Hello, chiming in late.

On Wed, Oct 28, 2015 at 01:59:15PM +0530, Parav Pandit wrote:
> Design guidelines:
> ---
> 1. There will be new rdma cgroup for accounting rdma resources
> (instead of extending device cgroup).
> Rationale: RDMA tracks different type of resources and it functions
> differently than device cgroup. Though device cgroup could have been
> extended for more generic nature, community feels that it's better to
> create RDMA cgroup, which might have more features than just resource
> limit enforcement in future.

Yeap, it should definitely be separate from device cgroup.

> 2. RDMA cgroup will allow resource accounting, limit enforcement on
> per cgroup, per rdma device basis (instead of resource limiting across
> all devices).
> Rationale: this gives granular control when multiple devices exist in
> the system.
> 
> 3. Resources are not defined by the RDMA cgroup. Resources are defined
> by RDMA/IB subsystem and optionally by HCA vendor device drivers.
> Rationale: This allows rdma cgroup to remain constant while RDMA/IB
> subsystem can evolve without the need of rdma cgroup update. A new
> resource can be easily added by the RDMA/IB subsystem without touching
> rdma cgroup.

I'm *extremely* uncomfortable with this.  Drivers for this sort of
higher end devices tend to pull a lot of stunts for better or worse
and my gut feeling is that letting low level drivers run free with
resource definition is highly likely to lead to an unmanageable mess
in the long run.  I'd strongly urge to gather consensus on what the
resources should be across the board.

> Design:
> -
> 8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
> each cgroup will have 0 to 4 verbs resource pool and optionally 0 to 4
> hw resource pool per such device.
> (Nothing prevents having more devices and pools, but the design is
> built around this use case).

Heh, 4 seems like an arbitrary number.  idk, it feels weird to bake in
a number like 4 into the design.

> 9. Resource pool object is created in following situations.
> (a) administrative operation is done to set the limit and no previous
> resource pool exist for the device of interest for the cgroup.
> (b) no resource limits were configured, but IB/RDMA subsystem tries to
> charge the resource. so that when applications are running without
> limits and later on when limits are enforced, during uncharging, it
> correctly uncharges them, otherwise usage count will drop to negative.
> This is done using default resource pool.
> Instead of implementing any sort of time markers, default pool
> simplifies the design.

So, the usual way to deal with this is that the root cgroup is exempt
from accounting and each resource tracks where it was charged to and
frees that cgroup on release.  IOW, associate on charge and
maintain the association till release.
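A bare-bones sketch of that association (hypothetical names, locking
and error paths elided):

	struct rdma_resource {
		struct rdma_cgroup *cg;	/* set once at charge time */
	};

	static int rdma_resource_charge(struct rdma_resource *res, int type)
	{
		struct rdma_cgroup *cg = task_rdma_cgroup(current);

		if (!try_charge(cg, type))	/* root is exempt, never fails */
			return -EAGAIN;
		res->cg = cg;			/* associate on charge... */
		return 0;
	}

	static void rdma_resource_release(struct rdma_resource *res, int type)
	{
		/* ...and uncharge the same cgroup on release */
		uncharge(res->cg, type);
	}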

For interface details, please refer to the following documentation.

 
https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup.txt?h=for-4.5

Thanks.

-- 
tejun


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-14 Thread Tejun Heo
Hello, Parav.

On Mon, Sep 14, 2015 at 07:34:09PM +0530, Parav Pandit wrote:
> I missed to acknowledge your point that we need both - hard limit and
> soft limit/weight. The current patchset is only based on hard limits.
> I see that weight would be another helpful layer in the chain that we
> can implement after this as an increment, which keeps review and
> debugging manageable?

At this point, I'm very unsure that doing this as a cgroup controller
is a good direction.  From userland interface standpoint, publishing a
cgroup controller is a big commitment.  It is true that we haven't
been doing a good job of gatekeeping or polishing controller
interfaces but we're trying hard to change that and what's being
proposed in this thread doesn't really seem to be mature enough.  It's
not even clear what's being identified as resources here are things
that the users would actually care about or if it's even possible to
implement sensible resource control in the kernel via the proposed
resource restrictions.

So, I'd suggest going back to the board and figuring out what the
actual resources are, their distribution strategies should be and at
which layer such strategies can be implemented best.

Thanks.

-- 
tejun


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Tejun Heo
Hello, Parav.

On Fri, Sep 11, 2015 at 09:56:31PM +0530, Parav Pandit wrote:
> Resource runaway by an application can leave (a) the kernel and (b) other
> applications with no resources.

Yeap, that this controller would be able to prevent to a reasonable
extent.

> Both the problems are the target of this patch set by accounting via cgroup.
> 
> Performance contention can be resolved with higher level user space,
> which will tune it.

If individual applications are gonna be allowed to do that, what's to
prevent them from jacking up their limits?  So, I assume you're
thinking of a central authority overseeing distribution and enforcing
the policy through cgroups?

> Threshold and fail counters are on the way in follow on patch.

If you're planning on following what the existing memcg did in this
area, it's unlikely to go well.  Would you mind sharing what you have
in mind in the long term?  Where do you see this going?

Thanks.

-- 
tejun


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Tejun Heo
Hello, Parav.

On Fri, Sep 11, 2015 at 10:09:48PM +0530, Parav Pandit wrote:
> > If you're planning on following what the existing memcg did in this
> > area, it's unlikely to go well.  Would you mind sharing what you have
> > on mind in the long term?  Where do you see this going?
>
> At least current thoughts are: a central entity authority monitors fail
> count and the new threshold count.
> Fail count - as with similar counters, indicates how many times resource
> allocation failure occurred.
> Threshold count - indicates up to what level this resource usage has
> gone. (An application might not be able to poll on thousands of such
> resource entries.)
> So based on fail count and threshold count, it can tune it further.

So, regardless of the specific resource in question, implementing
adaptive resource distribution requires more than simple thresholds
and failcnts.  The very minimum would be a way to exert reclaim
pressure and then a way to measure how much lack of a given resource
is affecting the workload.  Maybe it can adaptively lower the limits
and then watch how often allocation fails but that's highly unlikely
to be an effective measure as it can't do anything to hoarders and the
frequency of allocation failure doesn't necessarily correlate with the
amount of impact the workload is getting (it's not a measure of
usage).

This is what I'm wary about.  The kernel-userland interface here is
cut pretty low in the stack leaving most of arbitration and management
logic in the userland, which seems to be what people wanted and that's
fine, but then you're trying to implement an intelligent resource
control layer which straddles across kernel and userland with those
low level primitives which inevitably would increase the required
interface surface as nobody has enough information.

Just to illustrate the point, please think of the alsa interface.  We
expose hardware capabilities pretty much as-is leaving management and
multiplexing to userland and there's nothing wrong with it.  It fits
better that way; however, we don't then go try to implement cgroup
controller for PCM channels.  To do any high-level resource
management, you gotta do it where the said resource is actually
managed and arbitrated.

What's the allocation frequency you're expecting?  It might be better
to just let allocations themselves go through the agent that you're
planning.  You sure can use cgroup membership to identify who's asking
tho.  Given how the whole thing is architectured, I'd suggest thinking
more about how the whole thing should turn out eventually.

Thanks.

-- 
tejun


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Tejun Heo
Hello, Doug.

On Fri, Sep 11, 2015 at 12:24:33AM -0400, Doug Ledford wrote:
> > My uneducated suspicion is that the abstraction is just not developed
> > enough.
> 
> The abstraction is 10+ years old.  It has had plenty of time to ferment
> and something better for the specific use case has not emerged.

I think that is likely more reflective of the use cases rather than
anything inherent in the concept.

> >  It should be possible to virtualize these resources through,
> > most likely, time-sharing to the level where userland simply says "I
> > want this chunk transferred there" and OS schedules the transfer
> > prioritizing competing requests.
> 
> No.  And if you think this, then you miss the *entire* point of RDMA
> technologies.  An analogy that I have used many times in presentations
> is that, in the networking world, the kernel is both a postman and a
> copy machine.  It receives all incoming packets and must sort them to
> the right recipient (the postman job) and when the user space
> application is ready to use the information it must copy it into the
> user's VM space because it couldn't just put the user's data buffer on
> the RX buffer list since each buffer might belong to anyone (the copy
> machine).  In the RDMA world, you create a new queue pair, it is often a
> long lived connection (like a socket), but it belongs now to the app and
> the app can directly queue both send and receive buffers to the card and
> on incoming packets the card will be able to know that the packet
> belongs to a specific queue pair and will immediately go to that apps
> buffer.  You can *not* do this with TCP without moving to complete TCP
> offload on the card, registration of specific sockets on the card, and
> then allowing the application to pre-register receive buffers for a
> specific socket to the card so that incoming data on the wire can go
> straight to the right place.  If you ever get to the point of "OS
> schedules the transfer" then you might as well throw RDMA out the window
> because you have totally trashed the benefit it provides.

I don't know.  This sounds like classic "this is painful so it must be
good" bare metal fantasy.  I get that rdma succeeds at bypassing a lot
of overhead.  That's great but that really isn't exclusive with having
more accessible mechanisms built on top.  The crux of cost saving is
the hardware knowing where the incoming data belongs and putting it
there directly.  Everything else is there to facilitate that and if
you're declaring that it's impossible to build accessible abstractions
for that, I can't agree with you.

Note that this is not to say that rdma should do that in the operating
system.  As you said, people have been happy with the bare abstraction
for a long time and, given relatively specialized use cases, that can
be completely fine but please do note that the lack of proper
abstraction isn't an inherent feature.  It's just easier that way and
putting in more effort hasn't been necessary.

> > It could be that given the use cases rdma might not need such level of
> > abstraction - e.g. most users want to be and are pretty close to bare
> > metal, but, if that's true, it also kinda is weird to build
> > hierarchical resource distribution scheme on top of such bare
> > abstraction.
> 
> Not really.  If you are going to have a bare abstraction, this one isn't
> really a bad one.  You have devices.  On a device, you allocate
> protection domains (PDs).  If you don't care about cross connection
> issues, you ignore this and only use one.  If you do care, this acts
> like a process's unique VM space only for RDMA buffers, it is a domain
> to protect the data of one connection from another.  Then you have queue
> pairs (QPs) which are roughly the equivalent of a socket.  Each QP has
> at least one Completion Queue where you get the events that tell you
> things have completed (although they often use two, one for send
> completions and one for receive completions).  And then you use some
> number of memory registrations (MRs) and address handles (AHs) depending
> on your usage.  Since RDMA stands for Remote Direct Memory Access, as
> you can imagine, giving a remote machine free reign to access all of the
> physical memory in your machine is a security issue.  The MRs help to
> control what memory the remote host on a specific QP has access to.  The
> AHs control how we actually route packets from ourselves to the remote host.
> 
> Here's the deal.  You might be able to create an abstraction above this
> that hides *some* of this.  But it can't hide even nearly all of it
> without loosing significant functionality.  The problem here is that you
> are thinking about RDMA connections like sockets.  They aren't.  Not
> even close.  They are "how do I allow a remote machine to directly read
> and write into my machines physical memory in an even remotely close to
> secure manner?"  These resources aren't hardware resources, they are the
> abstraction resources 

Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Tejun Heo
Hello, Parav.

On Fri, Sep 11, 2015 at 10:13:59AM +0530, Parav Pandit wrote:
> > My uneducated suspicion is that the abstraction is just not developed
> > enough.  It should be possible to virtualize these resources through,
> > most likely, time-sharing to the level where userland simply says "I
> > want this chunk transferred there" and OS schedules the transfer
> > prioritizing competing requests.
> 
> Tejun,
> That is such a perfect abstraction to have at OS level, but I am not
> sure how close to bare metal RDMA it can be.
> I have started discussion on that front as well as part of another
> thread, but it's certainly a long way to go.
> Most want to enjoy the performance benefit of the bare metal
> interfaces it provides.

Yeah, sure, I'm not trying to say that rdma needs or should do that.

> The abstraction that you mentioned exists; the only difference is that
> instead of the OS as the central entity, the higher level libraries,
> drivers and hw together do it today for the applications.

But more to the point, having resource control in the OS and actual
arbitration higher up in the stack isn't likely to lead to an effective
resource distribution scheme.

> > You kinda have to decide that upfront cuz it gets baked into the
> > interface.
> 
> Well, all the interfaces are not yet defined. Except the test and

I meant the cgroup interface.

> benchmark utilities, real world applications wouldn't really bother
> much about which device are they are going through.

Weights can work fine across multiple devices.  Hard limits don't.  It
just doesn't make any sense.  Unless you can exclude multiple device
scenarios, you'll have to implement per-device limits.

Thanks.

-- 
tejun


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-10 Thread Tejun Heo
Hello, Parav.

On Wed, Sep 09, 2015 at 09:27:40AM +0530, Parav Pandit wrote:
> This is one old white paper, but most of the reasoning still holds true on 
> RDMA.
> http://h10032.www1.hp.com/ctg/Manual/c00257031.pdf

Just read it.  Much appreciated.

...
> These resources include QP (queue pair) to transfer data, CQ
> (Completion queue) to indicate completion of data transfer operation,
> MR (memory region) to represent user application memory as source or
> destination for data transfer.
> Common resources are QP, SRQ (shared received queue), CQ, MR, AH
> (Address handle), FLOW, PD (protection domain), user context etc.

It's kinda bothering that all these are disparate resources.  I
suppose that each restriction comes from the underlying hardware and
there's no accepted higher level abstraction for these things?

> >> This patch-set allows limiting rdma resources to set of processes.
> >> It extend device cgroup controller for limiting rdma device limits.
> >
> > I don't think this belongs to devcg.  If these make sense as a set of
> > resources to be controlled via cgroup, the right way prolly would be a
> > separate controller.
> >
> 
> In the past there have been similar comments to have a dedicated cgroup
> controller for RDMA instead of merging with the device cgroup.
> I am ok with both approaches, however I prefer to utilize the device
> controller instead of spinning off a new controller for a new device
> category.
> I anticipate more such needs will arise, and for each new device
> category it might not be worth having a new cgroup controller.
> RapidIO, though much less popular, and upcoming PCIe are on the horizon
> to offer similar benefits to RDMA, and in future having one controller
> for each of them again would not be the right approach.
>
> I certainly seek your and others' inputs in this email thread here whether
> (a) to continue to extend the device cgroup (which supports character and
> block device white lists) and now RDMA devices
> or
> (b) to spin off a new controller; if so, what are the compelling reasons
> that it can provide compared to the extension.

I'm doubtful that these things are gonna be mainstream w/o building up
higher level abstractions on top and if we ever get there we won't be
talking about MR or CQ or whatever.  Also, whatever next-gen is
unlikely to have enough commonalities when the proposed resource knobs
are this low level, so let's please keep it separate, so that if/when
this goes out of fashion for one reason or another, the controller can
silently wither away too.

> The current scope of the patch is limited to RDMA resources as a first
> patch, but I am sure that there is more functionality in the
> pipe to support via this cgroup by me and others.
> So keeping at least these two aspects in mind, I need input on the
> direction - extending the device cgroup or a dedicated new controller.
> 
> In future, I anticipate that we might have sub directories under the
> device cgroup for each individual device class to control,
> such as:
>   /char
>   /block
>   /rdma
>   /pcie
>   /child_cgroup..1..N
> Each controller's cgroup access files would remain within their own
> scope. We are not there yet from a base infrastructure standpoint, but
> it is something to be done as it matures and users start using it.

I don't think that jibes with the rest of cgroup.  What generic
block or pcie attributes are directly exposed to applications and need
to be hierarchically controlled via cgroup?

Thanks.

-- 
tejun


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-10 Thread Tejun Heo
Hello, Parav.

On Fri, Sep 11, 2015 at 09:09:58AM +0530, Parav Pandit wrote:
> The fact is that user level application uses hardware resources.
> Verbs layer is software abstraction for it. Drivers are hiding how
> they implement this QP or CQ or whatever hardware resource they
> project via API layer.
> For all of the userland on top of verb layer I mentioned above, the
> common resource abstraction is these resources AH, QP, CQ, MR etc.
> Hardware (and driver) might have different view of this resource in
> their real implementation.
> For example, verb layer can say that it has 100 QPs, but hardware
> might actually have 20 QPs that driver decide how to efficiently use
> it.

My uneducated suspicion is that the abstraction is just not developed
enough.  It should be possible to virtualize these resources through,
most likely, time-sharing to the level where userland simply says "I
want this chunk transferred there" and OS schedules the transfer
prioritizing competing requests.

It could be that given the use cases rdma might not need such level of
abstraction - e.g. most users want to be and are pretty close to bare
metal, but, if that's true, it also kinda is weird to build
hierarchical resource distribution scheme on top of such bare
abstraction.

...
> > I don't know.  What's proposed in this thread seems way too low level
> > to be useful anywhere else.  Also, what if there are multiple devices?
> > Is that a problem to worry about?
>
> o.k. It doesn't have to be useful anywhere else. If it suffices the
> need of RDMA applications, it's fine for the near future.
> This patch allows limiting resources across multiple devices.
> As we go along the path, and if a requirement comes up to have a knob
> on a per device basis, that's something we can extend in future.

You kinda have to decide that upfront cuz it gets baked into the
interface.

> > I'm kinda doubtful we're gonna have too many of these.  Hardware
> > details being exposed to userland this directly isn't common.
> 
> It's common in RDMA applications. Again, they may not be real hardware
> resources; it's just the API layer which defines those RDMA constructs.

It's still a very low level of abstraction which pretty much gets
decided by what the hardware and driver decide to do.

> > I'd say keep it simple and do the minimum. :)
>
> o.k. In that case a new rdma cgroup controller which does rdma resource
> accounting is possibly the simplest form?
> Make sense?

So, this fits cgroup's purpose to certain level but it feels like
we're trying to build too much on top of something which hasn't
developed sufficiently.  I suppose it could be that this is the level
of development that rdma is gonna reach and dumb cgroup controller can
be useful for some use cases.  I don't know, so, yeah, let's keep it
simple and avoid doing crazy stuff.

Thanks.

-- 
tejun


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-10 Thread Tejun Heo
Hello, Parav.

On Thu, Sep 10, 2015 at 11:16:49PM +0530, Parav Pandit wrote:
> >> These resources include QP (queue pair) to transfer data, CQ
> >> (Completion queue) to indicate completion of data transfer operation,
> >> MR (memory region) to represent user application memory as source or
> >> destination for data transfer.
> >> Common resources are QP, SRQ (shared received queue), CQ, MR, AH
> >> (Address handle), FLOW, PD (protection domain), user context etc.
> >
> > It's kinda bothering that all these are disparate resources.
> 
> Actually not. They are linked resources. Every QP needs one or two
> associated CQs and one PD.
> Every QP will use a few MRs for data transfer.

So, if that's the case, let's please implement something higher level.
The goal is providing reasonable isolation or protection.  If that can
be achieved at a higher level of abstraction, please do that.

> Here is the good programming guide of the RDMA APIs exposed to the
> user space application.
> 
> http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf
> So first version of the cgroups patch will address the control
> operation for section 3.4.
> 
> > I suppose that each restriction comes from the underlying hardware and
> > there's no accepted higher level abstraction for these things?
>
> There is higher level abstraction which is through the verbs layer
> currently which does actually expose the hardware resource but in
> vendor agnostic way.
> There are many vendors who support these verbs layer, some of them
> which I know are Mellanox, Intel, Chelsio, Avago/Emulex whose drivers
> which support these verbs are in  kernel tree.
> 
> There is higher level APIs above the verb layer, such as MPI,
> libfabric, rsocket, rds, pgas, dapl which uses underlying verbs layer.
> They all rely on the hardware resource. All of these higher level
> abstraction is accepted and well used by certain application class. It
> would be long discussion to go over them here.

Well, the programming interface that userland builds on top doesn't
matter too much here but if there is a common resource abstraction
which can be made in terms of constructs that consumers of the
facility would care about, that likely is a better choice than
exposing whatever hardware exposes.

> > I'm doubtful that these things are gonna be mainstream w/o building up
> > higher level abstractions on top and if we ever get there we won't be
> > talking about MR or CQ or whatever.
> 
> Some of the higher level examples I gave above will adapt to resource
> allocation failures. Some are actually adaptive to a few resource
> allocation failures; they do query resources. But it's not completely
> there yet. Once we have this notion of limited resources in place, the
> abstraction layer would adapt to relatively smaller values of such
> resources.
>
> These higher level abstraction is mainstream. Its shipped at least in
> Redhat Enterprise Linux.

Again, I was talking more about resource abstraction - e.g. something
along the line of "I want N command buffers".

> > Also, whatever next-gen is
> > unlikely to have enough commonalities when the proposed resource knobs
> > are this low level,
> 
> I agree that resources won't be common in next-gen or other transports
> whenever they arrive.
> But with my existing background working on some of those transports,
> they appear similar in nature and might seek similar knobs.

I don't know.  What's proposed in this thread seems way too low level
to be useful anywhere else.  Also, what if there are multiple devices?
Is that a problem to worry about?

> In past I have discussions with Liran Liss from Mellanox as well on
> this topic and we also agreed to have such cgroup controller.
> He had a recent presentation at a Linux Foundation event proposing a
> cgroup for RDMA.
> Below is the link to it.
> http://events.linuxfoundation.org/sites/events/files/slides/containing_rdma_final.pdf
> Slides 1 to 7 and slide 13 will give you more insight to it.
> Liran and I gave a similar presentation, with fewer slides, to an RDMA
> audience at the OpenFabrics summit in March 2015.
>
> I am ok to create a separate cgroup for rdma, if the community thinks
> that way.
> My preference would still be to use the device cgroup for the above
> extensions unless there are fundamental issues that I am missing.

The thing is that they aren't related at all in any way.  There's no
reason to tie them together.  In fact, the way we did devcg is
backward.  The ideal solution would have been extending the usual ACL
to understand cgroups so that it's a natural growth of the permission
system.

You're talking about actual hardware resources.  That has nothing to
do with access permissions on device nodes.

> I would let you make the call.
> Rdma and others are just other types of devices with different
> characteristics than character or block, so one device cgroup with sub
> functionalities can allow setting knobs.
> Every device category will have its own set of knobs 

Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-08 Thread Tejun Heo
Hello, Parav.

On Tue, Sep 08, 2015 at 02:08:16AM +0530, Parav Pandit wrote:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc., due to which
> other applications in another cgroup or kernel space ULPs may not even
> get a chance to allocate any rdma resources.

Is there something simple I can read up on what each resource is?
What's the usual access control mechanism?

> This patch-set allows limiting rdma resources to set of processes.
> It extend device cgroup controller for limiting rdma device limits.

I don't think this belongs to devcg.  If these make sense as a set of
resources to be controlled via cgroup, the right way prolly would be a
separate controller.

Thanks.

-- 
tejun


Re: [PATCH v1 2/2] RDMA/ocrdma: update ocrdma module license srting

2015-07-23 Thread Tejun Heo
On Fri, Jul 24, 2015 at 05:04:00AM +0530, Devesh Sharma wrote:
> replace module_license from GPL to Dual BSD/GPL
> 
> Cc: Tejun Heo t...@kernel.org
> Cc: Duan Jiong duanj.f...@cn.fujitsu.com
> Cc: Roland Dreier rol...@purestorage.com
> Cc: Jes Sorensen jes.soren...@redhat.com
> Cc: Sasha Levin levinsasha...@gmail.com
> Cc: Dan Carpenter dan.carpen...@oracle.com
> Cc: Prarit Bhargava pra...@redhat.com
> Cc: Colin Ian King colin.k...@canonical.com
> Cc: Wei Yongjun yongjun_...@trendmicro.com.cn
> Cc: Moni Shoua mo...@mellanox.com
> Cc: Rasmus Villemoes li...@rasmusvillemoes.dk
> Cc: Li RongQing roy.qing...@gmail.com
> Cc: Devendra Naga devendra.a...@gmail.com
> Signed-off-by: Devesh Sharma devesh.sha...@avagotech.com

Ditto,

 Acked-by: Tejun Heo t...@kernel.org

Thanks.

-- 
tejun


Re: [PATCH v1 1/2] RDMA/ocrdma: update ocrdma license to dual-license

2015-07-23 Thread Tejun Heo
On Fri, Jul 24, 2015 at 05:03:59AM +0530, Devesh Sharma wrote:
> Change of license from GPLv2 to dual-license (GPLv2 and BSD 2-Clause)
> 
> Cc: Tejun Heo t...@kernel.org
> Cc: Duan Jiong duanj.f...@cn.fujitsu.com
> Cc: Roland Dreier rol...@purestorage.com
> Cc: Jes Sorensen jes.soren...@redhat.com
> Cc: Sasha Levin levinsasha...@gmail.com
> Cc: Dan Carpenter dan.carpen...@oracle.com
> Cc: Prarit Bhargava pra...@redhat.com
> Cc: Colin Ian King colin.k...@canonical.com
> Cc: Wei Yongjun yongjun_...@trendmicro.com.cn
> Cc: Moni Shoua mo...@mellanox.com
> Cc: Rasmus Villemoes li...@rasmusvillemoes.dk
> Cc: Li RongQing roy.qing...@gmail.com
> Cc: Devendra Naga devendra.a...@gmail.com
> Signed-off-by: Devesh Sharma devesh.sha...@avagotech.com

I didn't do any significant work on it but FWIW

 Acked-by: Tejun Heo t...@kernel.org

Thanks.

-- 
tejun


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-09 Thread Tejun Heo
Hello,

On Tue, Oct 08, 2013 at 02:22:16PM +0200, Alexander Gordeev wrote:
> If we talk about pSeries quota, then the current pSeries pci_enable_msix()
> implementation is racy internally and could fail if the quota went down
> *while* pci_enable_msix() is executing. In this case the loop will have to
> exit rather than retry with a lower number (what number?).

Ah, okay, so that one is already broken.

> In this regard the new scheme does not bring anything new and relies on
> the fact this race does not hit and therefore does not worry.
> 
> If we talk about quota as it has to be, then yes - the loop scheme seems
> more preferable.
> 
> Overall, looks like we just need to fix the pSeries implementation,
> if the guys want it, he-he :)

If we can't figure out a better interface for the retry case, I think
what can really help is having a simple interface for the simpler
cases.

> > The problem case is where multiple msi(x) allocation fails completely
> > because the global limit went down before inquiry and allocation.  In
> > the loop based interface, it'd retry with the lower number.
> 
> I am probably missing something here. If the global limit went down before
> inquiry then the inquiry will get what is available and try to allocate with
> that number.

Oh, I should have written "between inquiry and allocation."  Sorry.

Thanks.

-- 
tejun


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-09 Thread Tejun Heo
Hello,

On Wed, Oct 09, 2013 at 02:57:16PM +0200, Alexander Gordeev wrote:
> On Mon, Oct 07, 2013 at 02:01:11PM -0400, Tejun Heo wrote:
> > Hmmm... yeah, the race condition could be an issue as multiple msi
> > allocation might fail even if the driver can and explicitly handle
> > multiple allocation if the quota gets reduced inbetween.
> 
> BTW, should we care about the quota getting increased inbetween?
> That would entail... kind of a pci_get_msi_limit() :), but IMHO it is
> not worth it.

I think we shouldn't.  If the resource was low during a point in time
during allocation, it's fine to base the result on that - the resource
was actually low and which answer we return is just a question of
timing and both are correct.  The only reason the existing race
condition is problematic is because it may fail even if the resource
never falls below the failure point.

Thanks.

-- 
tejun


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-09 Thread Tejun Heo
On Mon, Oct 07, 2013 at 09:48:01PM +0100, Ben Hutchings wrote:
> > There is one major flaw in min-max approach - the generic MSI layer
> > will have to take decisions on exact number of MSIs to request, not
> > device drivers.
> [...]
> 
> No, the min-max functions should be implemented using the same loop that
> drivers are expected to use now.

Wheee... earlier in the thread I thought you guys were referring to
yourselves in the third person and was getting a bit worried. :)

-- 
tejun


Re: [PATCH RFC 07/77] PCI/MSI: Re-design MSI/MSI-X interrupts enablement pattern

2013-10-09 Thread Tejun Heo
Hello, Alexander.

On Tue, Oct 08, 2013 at 09:48:26AM +0200, Alexander Gordeev wrote:
> > If there are many which duplicate the above pattern, it'd probably be
> > worthwhile to provide a helper?  It's usually a good idea to reduce
> > the amount of boilerplate code in drivers.
> 
> I wanted to limit discussion in v1 to as few changes as possible.
> I 'planned' those helper(s) for a separate effort if/when the most
> important change is accepted and soaked a bit.

The thing is doing it this way generates more churn and noise.  Once
the simpler ones live behind a wrapper which can be built on the
existing interface, we can have both reduced cost and more latitude on
the complex cases.

> > If we do things this way, it breaks all drivers using this interface
> > until they're converted, right?
> 
> Right. And the rest of the series does it.

Which breaks bisection, which we shouldn't do.

> > Also, it probably isn't the best idea
> > to flip the behavior like this as this can go completely unnoticed (no
> > compiler warning or anything, the same function just behaves
> > differently).  Maybe it'd be a better idea to introduce a simpler
> > interface that most can be converted to?
> 
> Well, an *other* interface is a good idea. What do you mean by
> "simpler" here?

I'm still talking about a simpler wrapper for common cases, which is
the important part anyway.

Thanks.

-- 
tejun


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-09 Thread Tejun Heo
Hello,

On Tue, Oct 08, 2013 at 11:07:16AM +0200, Alexander Gordeev wrote:
> Multiple MSIs is just a handful of drivers, really. MSI-X impact still

Yes, so it's pretty nice to try out things there before going full-on.

> will be huge. But if we opt for a different name for the new pci_enable_msix()
> then we could first update pci/msi, then drivers (in few stages possibly)
> and finally remove the old implementation.

Yes, that probably should be the steps to follow eventually.  My point
was that you don't have to submit patches for all 7x conversions for
an RFC round.  Scanning them and seeing what would be necessary
definitely is a good idea but just giving summary of the full
conversion with several examples should be good enough before settling
on the way forward, which should be easier for all involved.

Thanks a lot for your effort!

-- 
tejun


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-07 Thread Tejun Heo
Hey, guys.

On Sun, Oct 06, 2013 at 09:10:30AM +0200, Alexander Gordeev wrote:
> On Sun, Oct 06, 2013 at 05:19:46PM +1100, Benjamin Herrenschmidt wrote:
> > On Sun, 2013-10-06 at 08:02 +0200, Alexander Gordeev wrote:
> > > In fact, in the current design to address the quota race decently the
> > > drivers would have to protect the *loop* to prevent the quota change
> > > between a pci_enable_msix() returning a positive number and the next
> > > call to pci_enable_msix() with that number. Is it doable?
> > 
> > I am not advocating for the current design, simply saying that your
> > proposal doesn't address this issue while Ben's does.

Hmmm... yeah, the race condition could be an issue as multiple msi
allocation might fail even if the driver can and explicitly handle
multiple allocation if the quota gets reduced inbetween.

> There is one major flaw in min-max approach - the generic MSI layer
> will have to take decisions on exact number of MSIs to request, not
> device drivers.

The min-max approach would actually be pretty nice for the users who
actually care about this.

> This will never work for all devices, because there might be specific
> requirements which are not covered by the min-max. That is what Ben
> described ...say, any even number within a certain range. Ben suggests
> to leave the existing loop scheme to cover such devices, which I think is
> not right.

if it could work.

> What about introducing pci_lock_msi() and pci_unlock_msi() and let device
> drivers care about their ranges and specifics in race-safe manner?
> I do not call to introduce it right now (since it appears pSeries has not
> been hitting the race for years) just as a possible alternative to Ben's
> proposal.

I don't think the same race condition would happen with the loop.  The
problem case is where multiple msi(x) allocation fails completely
because the global limit went down before inquiry and allocation.  In
the loop based interface, it'd retry with the lower number.

As long as the number of drivers which need this sort of adaptive
allocation isn't too high and the common cases can be made simple, I
don't think the complex part of interface is all that important.
Maybe we can have reserve / cancel type interface or just keep the
loop with more explicit function names (ie. try_enable or something
like that).
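To make the min-max idea concrete, such a helper could simply wrap the
retry loop drivers already use (a sketch only, not a settled interface;
it relies on the old pci_enable_msix() convention of returning a
positive count when fewer vectors are available):

	static int foo_enable_msix_range(struct pci_dev *pdev,
					 struct msix_entry *entries,
					 int minvec, int maxvec)
	{
		int nvec = maxvec;
		int rc;

		do {
			rc = pci_enable_msix(pdev, entries, nvec);
			if (rc < 0)
				return rc;		/* hard failure */
			if (rc > 0) {
				if (rc < minvec)
					return -ENOSPC;	/* quota below our minimum */
				nvec = rc;		/* retry with what's available */
			}
		} while (rc);

		return nvec;				/* number actually enabled */
	}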

Thanks.

-- 
tejun


Re: [PATCH RFC 05/77] PCI/MSI: Convert pci_msix_table_size() to a public interface

2013-10-07 Thread Tejun Heo
Hello,

On Wed, Oct 02, 2013 at 12:48:21PM +0200, Alexander Gordeev wrote:
> Make pci_msix_table_size() return an error code if the device
> does not support MSI-X. This update is needed to facilitate a
> forthcoming re-design of the MSI/MSI-X interrupts enabling pattern.
> 
> Device drivers will use this interface to obtain the maximum number
> of MSI-X interrupts the device supports and use that value in
> the following call to the pci_enable_msix() interface.
> 
> Signed-off-by: Alexander Gordeev agord...@redhat.com

Hmmm... I probably missed something but why is this necessary?  To
discern between -EINVAL and -ENOSPC?  If so, does that really matter?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 07/77] PCI/MSI: Re-design MSI/MSI-X interrupts enablement pattern

2013-10-07 Thread Tejun Heo
Hello,

On Wed, Oct 02, 2013 at 12:48:23PM +0200, Alexander Gordeev wrote:
> +static int foo_driver_enable_msi(struct foo_adapter *adapter, int nvec)
> +{
> +	rc = pci_get_msi_cap(adapter->pdev);
> +	if (rc < 0)
> +		return rc;
> +
> +	nvec = min(nvec, rc);
> +	if (nvec < FOO_DRIVER_MINIMUM_NVEC)
> +		return -ENOSPC;
> +
> +	rc = pci_enable_msi_block(adapter->pdev, nvec);
> +	return rc;
> +}

If there are many which duplicate the above pattern, it'd probably be
worthwhile to provide a helper?  It's usually a good idea to reduce
the amount of boilerplate code in drivers.

>  static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
>  {
> +	rc = pci_msix_table_size(adapter->pdev);
> +	if (rc < 0)
> +		return rc;
> +
> +	nvec = min(nvec, rc);
> +	if (nvec < FOO_DRIVER_MINIMUM_NVEC)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < nvec; i++)
> +		adapter->msix_entries[i].entry = i;
> +
> +	rc = pci_enable_msix(adapter->pdev, adapter->msix_entries, nvec);
> +	return rc;
>  }

Ditto.
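Something like the following is the kind of helper I have in mind (a
sketch; the name and exact signature are invented, not proposed here):

	static int foo_enable_msix_capped(struct pci_dev *pdev,
					  struct msix_entry *entries,
					  int minvec, int nvec)
	{
		int i, rc;

		rc = pci_msix_table_size(pdev);
		if (rc < 0)
			return rc;

		nvec = min(nvec, rc);		/* clamp to what the device has */
		if (nvec < minvec)
			return -ENOSPC;

		for (i = 0; i < nvec; i++)
			entries[i].entry = i;

		return pci_enable_msix(pdev, entries, nvec);
	}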

> @@ -975,7 +951,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
>  	if (nr_entries < 0)
>  		return nr_entries;
>  	if (nvec > nr_entries)
> -		return nr_entries;
> +		return -EINVAL;
>  
>  	/* Check for any invalid entries */
>  	for (i = 0; i < nvec; i++) {

If we do things this way, it breaks all drivers using this interface
until they're converted, right?  Also, it probably isn't the best idea
to flip the behavior like this as this can go completely unnoticed (no
compiler warning or anything, the same function just behaves
differently).  Maybe it'd be a better idea to introduce a simpler
interface that most can be converted to?

Thanks.

-- 
tejun


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-07 Thread Tejun Heo
Hello, Alexander.

On Wed, Oct 02, 2013 at 12:48:16PM +0200, Alexander Gordeev wrote:
> Alexander Gordeev (77):
>   PCI/MSI: Fix return value when populate_msi_sysfs() failed
>   PCI/MSI/PPC: Fix wrong RTAS error code reporting
>   PCI/MSI/s390: Fix single MSI only check
>   PCI/MSI/s390: Remove superfluous check of MSI type
>   PCI/MSI: Convert pci_msix_table_size() to a public interface
>   PCI/MSI: Factor out pci_get_msi_cap() interface
>   PCI/MSI: Re-design MSI/MSI-X interrupts enablement pattern
>   PCI/MSI: Get rid of pci_enable_msi_block_auto() interface
>   ahci: Update MSI/MSI-X interrupts enablement code
>   ahci: Check MRSM bit when multiple MSIs enabled
...

Whee, that's a lot more than I expected.  I was just scanning
multiple msi users.  Maybe we can stage the work in more manageable
steps so that you don't have to go through massive conversion only to
do it all over again afterwards and likewise people don't get
bombarded on each iteration?  Maybe we can first update pci / msi code
proper, msi and then msix?

Thanks.

-- 
tejun


Re: [PATCH v1 2/6] amso1100: convert to using idr_alloc_cyclic

2013-03-27 Thread Tejun Heo
On Wed, Mar 27, 2013 at 09:18:04AM -0400, Jeff Layton wrote:
>  void c2_init_qp_table(struct c2_dev *c2dev)
>  {
>  	spin_lock_init(&c2dev->qp_table.lock);
> -	idr_init(&c2dev->qp_table.idr);
> +	idr_init_cyclic(&c2dev->qp_table.idr, 0);
>  }

Why is this necessary?  In general, why is idr_init_cyclic()
necessary?

Thanks.

-- 
tejun


Re: [PATCH -next v2] IB/ipath: use GFP_NOWAIT under spin lock

2013-02-21 Thread Tejun Heo
On Wed, Feb 20, 2013 at 9:45 PM, Wei Yongjun weiyj...@gmail.com wrote:
> From: Wei Yongjun yongjun_...@trendmicro.com.cn
> 
> A spin lock is taken here so we should use GFP_NOWAIT like
> the other cases.
> 
> Signed-off-by: Wei Yongjun yongjun_...@trendmicro.com.cn

Acked-by: Tejun Heo t...@kernel.org

Thanks!

-- 
tejun


Re: [PATCH -next v2] IB/ipath: use GFP_NOWAIT under spin lock

2013-02-21 Thread Tejun Heo
On Thu, Feb 21, 2013 at 7:33 AM, Tejun Heo t...@kernel.org wrote:
> > Signed-off-by: Wei Yongjun yongjun_...@trendmicro.com.cn
> 
> Acked-by: Tejun Heo t...@kernel.org

Just noticed akpm isn't cc'd.  Can you please repost w/
a...@linux-foundation.org cc'd? All related patches are going through
-mm.

Thanks.

-- 
tejun


Re: [PATCH 2/6] idr: remove MAX_IDR_MASK and move left MAX_IDR_* into idr.c

2013-02-09 Thread Tejun Heo
Hello,

On Fri, Feb 08, 2013 at 10:09:13PM +0000, Hefty, Sean wrote:
> > Used to wrap cyclic @start.  Can be replaced with max(next, 0).
> > Note that this type of cyclic allocation using idr is buggy.  These
> > are prone to spurious -ENOSPC failure after the first wraparound.
> 
> The replacement code looks fine, but can you explain why the use is buggy?

So, if you want a cyclic allocation, the allocation should be tried in
[start, END) and then [0, start); otherwise, after the allocation
wraps for the first time, as the closer the starting point gets to
END, the chance of not finding a vacant slot in [start, END) goes
higher.  When @start equals END - 1 for the second time, if the first
END - 1 allocation is still around, you'll get -ENOSPC.
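In code, the non-buggy cyclic scheme looks roughly like this
(hypothetical helper, error handling trimmed):

	static int alloc_cyclic(struct idr *idr, void *ptr, int *next,
				int end, gfp_t gfp)
	{
		int id;

		/* try [*next, end) first... */
		id = idr_alloc(idr, ptr, *next, end, gfp);
		/* ...then wrap to [0, *next) before giving up with -ENOSPC */
		if (id == -ENOSPC)
			id = idr_alloc(idr, ptr, 0, *next, gfp);
		if (id >= 0)
			*next = id + 1;
		return id;
	}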

In practice, I don't think anyone is hitting this.  idr has always
been horribly broken when it reaches the higher range (> 1<<30 on 64bit)
so things would have broken even before the first wraparound.  It
still is a theoretical possibility which may trigger if idr is used
for, say, ipc messages or storage commands.

Thanks.

-- 
tejun


[PATCH 2/6] idr: remove MAX_IDR_MASK and move left MAX_IDR_* into idr.c

2013-02-08 Thread Tejun Heo
MAX_IDR_MASK is another weirdness in the idr interface.  As idr covers
the whole positive integer range, it's defined as 0x7fffffff or INT_MAX.

Its usage in idr_find(), idr_replace() and idr_remove() is bizarre.
They basically mask off the sign bit and operate on the rest, so if
the caller, by accident, passes in a negative number, the sign bit
will be masked off and the remaining part will be used as if that was
the input, which is worse than crashing.

The constant is visible in idr.h and there are several users in the
kernel.

* drivers/i2c/i2c-core.c:i2c_add_numbered_adapter()

  Basically used to test if adap->nr is a negative number which isn't
  -1 and returns -EINVAL if so.  idr_alloc() already has negative
  @start checking (w/ WARN_ON_ONCE), so this can go away.

* drivers/infiniband/core/cm.c:cm_alloc_id()
  drivers/infiniband/hw/mlx4/cm.c:id_map_alloc()

  Used to wrap cyclic @start.  Can be replaced with max(next, 0).
  Note that this type of cyclic allocation using idr is buggy.  These
  are prone to spurious -ENOSPC failure after the first wraparound.

* fs/super.c:get_anon_bdev()

  The ID allocated from ida is masked off before being tested whether
  it's inside valid range.  ida allocated ID can never be a negative
  number and the masking is unnecessary.

Update idr_*() functions to fail with -EINVAL when negative @id is
specified and update other MAX_IDR_MASK users as described above.
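
For illustration, a stray negative @id now fails explicitly instead of
being silently masked into an unrelated positive one (hypothetical
caller):

	/* before: -2 & MAX_IDR_MASK == 0x7ffffffe, an unrelated slot */
	/* after:  a negative @id makes idr_replace() fail outright */
	ptr = idr_replace(&some_idr, new_ptr, -2);
	if (IS_ERR(ptr))	/* now ERR_PTR(-EINVAL) */
		return PTR_ERR(ptr);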

This leaves MAX_IDR_MASK without any user, remove it and relocate
other MAX_IDR_* constants to lib/idr.c.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Jean Delvare kh...@linux-fr.org
Cc: linux-...@vger.kernel.org
Cc: Roland Dreier rol...@kernel.org
Cc: Sean Hefty sean.he...@intel.com
Cc: Hal Rosenstock hal.rosenst...@gmail.com
Cc: Marciniszyn, Mike mike.marcinis...@intel.com
Cc: Jack Morgenstein ja...@dev.mellanox.co.il
Cc: Or Gerlitz ogerl...@mellanox.com
Cc: linux-rdma@vger.kernel.org
Cc: Al Viro v...@zeniv.linux.org.uk
---
 drivers/i2c/i2c-core.c  |2 --
 drivers/infiniband/core/cm.c|2 +-
 drivers/infiniband/hw/mlx4/cm.c |2 +-
 fs/super.c  |2 +-
 include/linux/idr.h |   10 --
 lib/idr.c   |   24 +---
 6 files changed, 20 insertions(+), 22 deletions(-)

--- a/drivers/i2c/i2c-core.c
+++ b/drivers/i2c/i2c-core.c
@@ -978,8 +978,6 @@ int i2c_add_numbered_adapter(struct i2c_
 
 	if (adap->nr == -1) /* -1 means dynamically assign bus id */
 		return i2c_add_adapter(adap);
-	if (adap->nr & ~MAX_IDR_MASK)
-		return -EINVAL;
 
 	mutex_lock(&core_lock);
 	id = idr_alloc(&i2c_adapter_idr, adap, adap->nr, adap->nr + 1,
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -390,7 +390,7 @@ static int cm_alloc_id(struct cm_id_priv
 
 	id = idr_alloc(&cm.local_id_table, cm_id_priv, next_id, 0, GFP_NOWAIT);
 	if (id >= 0)
-		next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
+		next_id = max(id + 1, 0);
 
 	spin_unlock_irqrestore(&cm.lock, flags);
idr_preload_end();
--- a/drivers/infiniband/hw/mlx4/cm.c
+++ b/drivers/infiniband/hw/mlx4/cm.c
@@ -225,7 +225,7 @@ id_map_alloc(struct ib_device *ibdev, in
 
 	ret = idr_alloc(&sriov->pv_id_table, ent, next_id, 0, GFP_NOWAIT);
 	if (ret >= 0) {
-		next_id = ((unsigned)ret + 1) & MAX_IDR_MASK;
+		next_id = max(ret + 1, 0);
 		ent->pv_cm_id = (u32)ret;
 		sl_id_map_add(ibdev, ent);
 		list_add_tail(&ent->list, &sriov->cm_list);
--- a/fs/super.c
+++ b/fs/super.c
@@ -842,7 +842,7 @@ int get_anon_bdev(dev_t *p)
else if (error)
return -EAGAIN;
 
-	if ((dev & MAX_IDR_MASK) == (1 << MINORBITS)) {
+	if (dev == (1 << MINORBITS)) {
 		spin_lock(&unnamed_dev_lock);
 		ida_remove(&unnamed_dev_ida, dev);
 		if (unnamed_dev_start > dev)
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -38,16 +38,6 @@
 #define IDR_SIZE (1 << IDR_BITS)
 #define IDR_MASK ((1 << IDR_BITS)-1)
 
-#define MAX_IDR_SHIFT (sizeof(int)*8 - 1)
-#define MAX_IDR_BIT (1U << MAX_IDR_SHIFT)
-#define MAX_IDR_MASK (MAX_IDR_BIT - 1)
-
-/* Leave the possibility of an incomplete final layer */
-#define MAX_IDR_LEVEL ((MAX_IDR_SHIFT + IDR_BITS - 1) / IDR_BITS)
-
-/* Number of id_layer structs to leave in free list */
-#define MAX_IDR_FREE (MAX_IDR_LEVEL * 2)
-
 struct idr_layer {
 	unsigned long		bitmap; /* A zero bit means space here */
 	struct idr_layer __rcu	*ary[1<<IDR_BITS];
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -38,6 +38,15 @@
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
 
+#define MAX_IDR_SHIFT	(sizeof(int) * 8 - 1)
+#define MAX_IDR_BIT	(1U << MAX_IDR_SHIFT)
+
+/* Leave the possibility of an incomplete final layer */
+#define MAX_IDR_LEVEL ((MAX_IDR_SHIFT + IDR_BITS - 1) / IDR_BITS)
+
+/* Number of id_layer structs to leave in free list

[PATCH 42/77] IB/ehca: convert to idr_alloc()

2013-02-06 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Hoang-Nam Nguyen hngu...@de.ibm.com
Cc: Christoph Raisch rai...@de.ibm.com
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/ehca/ehca_cq.c | 27 +++
 drivers/infiniband/hw/ehca/ehca_qp.c | 34 +++---
 2 files changed, 22 insertions(+), 39 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c
index 8f52901..212150c 100644
--- a/drivers/infiniband/hw/ehca/ehca_cq.c
+++ b/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -128,7 +128,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 	void *vpage;
 	u32 counter;
 	u64 rpage, cqx_fec, h_ret;
-	int ipz_rc, ret, i;
+	int ipz_rc, i;
 	unsigned long flags;
 
 	if (cqe >= 0xFFFFFFFF - 64 - additional_cqe)
@@ -163,32 +163,19 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 	adapter_handle = shca->ipz_hca_handle;
 	param.eq_handle = shca->eq.ipz_eq_handle;
 
-	do {
-		if (!idr_pre_get(&ehca_cq_idr, GFP_KERNEL)) {
-			cq = ERR_PTR(-ENOMEM);
-			ehca_err(device, "Can't reserve idr nr. device=%p",
-				 device);
-			goto create_cq_exit1;
-		}
-
-		write_lock_irqsave(&ehca_cq_idr_lock, flags);
-		ret = idr_get_new(&ehca_cq_idr, my_cq, &my_cq->token);
-		write_unlock_irqrestore(&ehca_cq_idr_lock, flags);
-	} while (ret == -EAGAIN);
+	idr_preload(GFP_KERNEL);
+	write_lock_irqsave(&ehca_cq_idr_lock, flags);
+	my_cq->token = idr_alloc(&ehca_cq_idr, my_cq, 0, 0x2000000, GFP_NOWAIT);
+	write_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+	idr_preload_end();
 
-	if (ret) {
+	if (my_cq->token < 0) {
 		cq = ERR_PTR(-ENOMEM);
 		ehca_err(device, "Can't allocate new idr entry. device=%p",
 			 device);
 		goto create_cq_exit1;
 	}
 
-	if (my_cq->token > 0x1FFFFFF) {
-		cq = ERR_PTR(-ENOMEM);
-		ehca_err(device, "Invalid number of cq. device=%p", device);
-		goto create_cq_exit2;
-	}
-
/*
 * CQs maximum depth is 4GB-64, but we need additional 20 as buffer
 * for receiving errors CQEs.
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index 1493939..00d6861 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -636,30 +636,26 @@ static struct ehca_qp *internal_create_qp(
 	my_qp->send_cq =
 		container_of(init_attr->send_cq, struct ehca_cq, ib_cq);
 
-	do {
-		if (!idr_pre_get(&ehca_qp_idr, GFP_KERNEL)) {
-			ret = -ENOMEM;
-			ehca_err(pd->device, "Can't reserve idr resources.");
-			goto create_qp_exit0;
-		}
+	idr_preload(GFP_KERNEL);
+	write_lock_irqsave(&ehca_qp_idr_lock, flags);
 
-		write_lock_irqsave(&ehca_qp_idr_lock, flags);
-		ret = idr_get_new(&ehca_qp_idr, my_qp, &my_qp->token);
-		write_unlock_irqrestore(&ehca_qp_idr_lock, flags);
-	} while (ret == -EAGAIN);
+	ret = idr_alloc(&ehca_qp_idr, my_qp, 0, 0x2000000, GFP_NOWAIT);
+	if (ret >= 0)
+		my_qp->token = ret;
 
-	if (ret) {
-		ret = -ENOMEM;
-		ehca_err(pd->device, "Can't allocate new idr entry.");
+	write_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+	idr_preload_end();
+	if (ret < 0) {
+		if (ret == -ENOSPC) {
+			ret = -EINVAL;
+			ehca_err(pd->device, "Invalid number of qp");
+		} else {
+			ret = -ENOMEM;
+			ehca_err(pd->device, "Can't allocate new idr entry.");
+		}
 		goto create_qp_exit0;
 	}
 
-	if (my_qp->token > 0x1FFFFFF) {
-		ret = -EINVAL;
-		ehca_err(pd->device, "Invalid number of qp");
-		goto create_qp_exit1;
-	}
-
 	if (has_srq)
 		parms.srq_token = my_qp->token;
 
-- 
1.8.1



[PATCH 41/77] IB/cxgb4: convert to idr_alloc()

2013-02-06 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Reviewed-by: Steve Wise sw...@opengridcomputing.com
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index 9c1644f..7f862da 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -260,20 +260,21 @@ static inline int _insert_handle(struct c4iw_dev *rhp, struct idr *idr,
 				 void *handle, u32 id, int lock)
 {
 	int ret;
-	int newid;
 
-	do {
-		if (!idr_pre_get(idr, lock ? GFP_KERNEL : GFP_ATOMIC))
-			return -ENOMEM;
-		if (lock)
-			spin_lock_irq(&rhp->lock);
-		ret = idr_get_new_above(idr, handle, id, &newid);
-		BUG_ON(!ret && newid != id);
-		if (lock)
-			spin_unlock_irq(&rhp->lock);
-	} while (ret == -EAGAIN);
-
-	return ret;
+	if (lock) {
+		idr_preload(GFP_KERNEL);
+		spin_lock_irq(&rhp->lock);
+	}
+
+	ret = idr_alloc(idr, handle, id, id + 1, GFP_ATOMIC);
+
+	if (lock) {
+		spin_unlock_irq(&rhp->lock);
+		idr_preload_end();
+	}
+
+	BUG_ON(ret == -ENOSPC);
+	return ret < 0 ? ret : 0;
 }
 
 static inline int insert_handle(struct c4iw_dev *rhp, struct idr *idr,
-- 
1.8.1



[PATCH 45/77] IB/ocrdma: convert to idr_alloc()

2013-02-06 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Roland Dreier rol...@purestorage.com
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/ocrdma/ocrdma_main.c | 14 +-
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index c4e0131..48928c8 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -51,18 +51,6 @@ static DEFINE_IDR(ocrdma_dev_id);
 
 static union ib_gid ocrdma_zero_sgid;
 
-static int ocrdma_get_instance(void)
-{
-	int instance = 0;
-
-	/* Assign an unused number */
-	if (!idr_pre_get(&ocrdma_dev_id, GFP_KERNEL))
-		return -1;
-	if (idr_get_new(&ocrdma_dev_id, NULL, &instance))
-		return -1;
-	return instance;
-}
-
 void ocrdma_get_guid(struct ocrdma_dev *dev, u8 *guid)
 {
 	u8 mac_addr[6];
@@ -416,7 +404,7 @@ static struct ocrdma_dev *ocrdma_add(struct be_dev_info *dev_info)
 		goto idr_err;
 
 	memcpy(&dev->nic_info, dev_info, sizeof(*dev_info));
-	dev->id = ocrdma_get_instance();
+	dev->id = idr_alloc(&ocrdma_dev_id, NULL, 0, 0, GFP_KERNEL);
 	if (dev->id < 0)
 		goto idr_err;
 
-- 
1.8.1



[PATCH 46/77] IB/qib: convert to idr_alloc()

2013-02-06 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Mike Marciniszyn infinip...@intel.com
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/qib/qib_init.c | 21 -
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
index ddf066d..50e33aa 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -1060,22 +1060,23 @@ struct qib_devdata *qib_alloc_devdata(struct pci_dev *pdev, size_t extra)
 	struct qib_devdata *dd;
 	int ret;
 
-	if (!idr_pre_get(&qib_unit_table, GFP_KERNEL)) {
-		dd = ERR_PTR(-ENOMEM);
-		goto bail;
-	}
-
 	dd = (struct qib_devdata *) ib_alloc_device(sizeof(*dd) + extra);
 	if (!dd) {
 		dd = ERR_PTR(-ENOMEM);
 		goto bail;
 	}
 
+	idr_preload(GFP_KERNEL);
 	spin_lock_irqsave(&qib_devs_lock, flags);
-	ret = idr_get_new(&qib_unit_table, dd, &dd->unit);
-	if (ret >= 0)
+
+	ret = idr_alloc(&qib_unit_table, dd, 0, 0, GFP_NOWAIT);
+	if (ret >= 0) {
+		dd->unit = ret;
 		list_add(&dd->list, &qib_dev_list);
+	}
+
 	spin_unlock_irqrestore(&qib_devs_lock, flags);
+	idr_preload_end();
 
 	if (ret < 0) {
 		qib_early_err(&pdev->dev,
@@ -1180,11 +1181,6 @@ static int __init qlogic_ib_init(void)
 	 * the PCI subsystem.
 	 */
 	idr_init(&qib_unit_table);
-	if (!idr_pre_get(&qib_unit_table, GFP_KERNEL)) {
-		pr_err("idr_pre_get() failed\n");
-		ret = -ENOMEM;
-		goto bail_cq_wq;
-	}
 
 	ret = pci_register_driver(&qib_driver);
 	if (ret < 0) {
@@ -1199,7 +1195,6 @@ static int __init qlogic_ib_init(void)
 
 bail_unit:
 	idr_destroy(&qib_unit_table);
-bail_cq_wq:
 	destroy_workqueue(qib_cq_wq);
 bail_dev:
 	qib_dev_cleanup();
-- 
1.8.1



[PATCH 44/77] IB/mlx4: convert to idr_alloc()

2013-02-06 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Jack Morgenstein ja...@dev.mellanox.co.il
Cc: Or Gerlitz ogerl...@mellanox.com
Cc: Roland Dreier rol...@purestorage.com
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/mlx4/cm.c | 32 +++-
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
index dbc99d4..80e59ed 100644
--- a/drivers/infiniband/hw/mlx4/cm.c
+++ b/drivers/infiniband/hw/mlx4/cm.c
@@ -203,7 +203,7 @@ static void sl_id_map_add(struct ib_device *ibdev, struct id_map_entry *new)
 static struct id_map_entry *
 id_map_alloc(struct ib_device *ibdev, int slave_id, u32 sl_cm_id)
 {
-	int ret, id;
+	int ret;
 	static int next_id;
 	struct id_map_entry *ent;
 	struct mlx4_ib_sriov *sriov = &to_mdev(ibdev)->sriov;
@@ -220,25 +220,23 @@ id_map_alloc(struct ib_device *ibdev, int slave_id, u32 sl_cm_id)
 	ent->dev = to_mdev(ibdev);
 	INIT_DELAYED_WORK(&ent->timeout, id_map_ent_timeout);
 
-	do {
-		spin_lock(&to_mdev(ibdev)->sriov.id_map_lock);
-		ret = idr_get_new_above(&sriov->pv_id_table, ent,
-					next_id, &id);
-		if (!ret) {
-			next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
-			ent->pv_cm_id = (u32)id;
-			sl_id_map_add(ibdev, ent);
-		}
+	idr_preload(GFP_KERNEL);
+	spin_lock(&to_mdev(ibdev)->sriov.id_map_lock);
 
-		spin_unlock(&sriov->id_map_lock);
-	} while (ret == -EAGAIN && idr_pre_get(&sriov->pv_id_table, GFP_KERNEL));
-	/*the function idr_get_new_above can return -ENOSPC, so don't insert in that case.*/
-	if (!ret) {
-		spin_lock(&sriov->id_map_lock);
+	ret = idr_alloc(&sriov->pv_id_table, ent, next_id, 0, GFP_NOWAIT);
+	if (ret >= 0) {
+		next_id = ((unsigned)ret + 1) & MAX_IDR_MASK;
+		ent->pv_cm_id = (u32)ret;
+		sl_id_map_add(ibdev, ent);
 		list_add_tail(&ent->list, &sriov->cm_list);
-		spin_unlock(&sriov->id_map_lock);
-		return ent;
 	}
+
+	spin_unlock(&sriov->id_map_lock);
+	idr_preload_end();
+
+	if (ret >= 0)
+		return ent;
+
 	/*error flow*/
 	kfree(ent);
 	mlx4_ib_warn(ibdev, "No more space in the idr (err:0x%x)\n", ret);
-- 
1.8.1



[PATCH 43/77] IB/ipath: convert to idr_alloc()

2013-02-06 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Mike Marciniszyn infinip...@intel.com
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/ipath/ipath_driver.c | 16 
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index 7b371f5..fcdaeea 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -194,11 +194,6 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 	struct ipath_devdata *dd;
 	int ret;
 
-	if (!idr_pre_get(&unit_table, GFP_KERNEL)) {
-		dd = ERR_PTR(-ENOMEM);
-		goto bail;
-	}
-
 	dd = vzalloc(sizeof(*dd));
 	if (!dd) {
 		dd = ERR_PTR(-ENOMEM);
@@ -206,9 +201,10 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 	}
 	dd->ipath_unit = -1;
 
+	idr_preload(GFP_KERNEL);
 	spin_lock_irqsave(&ipath_devs_lock, flags);
 
-	ret = idr_get_new(&unit_table, dd, &dd->ipath_unit);
+	ret = idr_alloc(&unit_table, dd, 0, 0, GFP_KERNEL);
 	if (ret < 0) {
 		printk(KERN_ERR IPATH_DRV_NAME
 		       ": Could not allocate unit ID: error %d\n", -ret);
@@ -216,6 +212,7 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 		dd = ERR_PTR(ret);
 		goto bail_unlock;
 	}
+	dd->ipath_unit = ret;
 
 	dd->pcidev = pdev;
 	pci_set_drvdata(pdev, dd);
@@ -224,7 +221,7 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 
 bail_unlock:
 	spin_unlock_irqrestore(&ipath_devs_lock, flags);
-
+	idr_preload_end();
 bail:
 	return dd;
 }
@@ -2503,11 +2500,6 @@ static int __init infinipath_init(void)
 	 * the PCI subsystem.
 	 */
 	idr_init(&unit_table);
-	if (!idr_pre_get(&unit_table, GFP_KERNEL)) {
-		printk(KERN_ERR IPATH_DRV_NAME ": idr_pre_get() failed\n");
-		ret = -ENOMEM;
-		goto bail;
-	}
 
 	ret = pci_register_driver(&ipath_driver);
 	if (ret < 0) {
-- 
1.8.1



[PATCH 40/77] IB/cxgb3: convert to idr_alloc()

2013-02-06 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Reviewed-by: Steve Wise sw...@opengridcomputing.com
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/cxgb3/iwch.h | 24 +++-
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h
index a1c4457..8378622 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.h
+++ b/drivers/infiniband/hw/cxgb3/iwch.h
@@ -153,19 +153,17 @@ static inline int insert_handle(struct iwch_dev *rhp, struct idr *idr,
 				void *handle, u32 id)
 {
 	int ret;
-	int newid;
-
-	do {
-		if (!idr_pre_get(idr, GFP_KERNEL)) {
-			return -ENOMEM;
-		}
-		spin_lock_irq(&rhp->lock);
-		ret = idr_get_new_above(idr, handle, id, &newid);
-		BUG_ON(newid != id);
-		spin_unlock_irq(&rhp->lock);
-	} while (ret == -EAGAIN);
-
-	return ret;
+
+	idr_preload(GFP_KERNEL);
+	spin_lock_irq(&rhp->lock);
+
+	ret = idr_alloc(idr, handle, id, id + 1, GFP_NOWAIT);
+
+	spin_unlock_irq(&rhp->lock);
+	idr_preload_end();
+
+	BUG_ON(ret == -ENOSPC);
+	return ret < 0 ? ret : 0;
 }
 
 static inline void remove_handle(struct iwch_dev *rhp, struct idr *idr, u32 id)
-- 
1.8.1



[PATCH 38/77] IB/core: convert to idr_alloc()

2013-02-06 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

v2: Mike triggered WARN_ON() in idr_preload() because send_mad(),
which may be used from non-process context, was calling
idr_preload() unconditionally.  Preload iff @gfp_mask has
__GFP_WAIT.

Signed-off-by: Tejun Heo t...@kernel.org
Reviewed-by: Sean Hefty sean.he...@intel.com
Reported-by: Marciniszyn, Mike mike.marcinis...@intel.com
Cc: Roland Dreier rol...@kernel.org
Cc: Sean Hefty sean.he...@intel.com
Cc: Hal Rosenstock hal.rosenst...@gmail.com
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/core/cm.c | 22 +++---
 drivers/infiniband/core/cma.c| 24 +++-
 drivers/infiniband/core/sa_query.c   | 18 ++
 drivers/infiniband/core/ucm.c| 16 
 drivers/infiniband/core/ucma.c   | 32 
 drivers/infiniband/core/uverbs_cmd.c | 17 -
 6 files changed, 48 insertions(+), 81 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 394fea2..98281fe 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -382,20 +382,21 @@ static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av)
 static int cm_alloc_id(struct cm_id_private *cm_id_priv)
 {
 	unsigned long flags;
-	int ret, id;
+	int id;
 	static int next_id;
 
-	do {
-		spin_lock_irqsave(&cm.lock, flags);
-		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
-					next_id, &id);
-		if (!ret)
-			next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
-		spin_unlock_irqrestore(&cm.lock, flags);
-	} while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) );
+	idr_preload(GFP_KERNEL);
+	spin_lock_irqsave(&cm.lock, flags);
+
+	id = idr_alloc(&cm.local_id_table, cm_id_priv, next_id, 0, GFP_NOWAIT);
+	if (id >= 0)
+		next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
+
+	spin_unlock_irqrestore(&cm.lock, flags);
+	idr_preload_end();
 
 	cm_id_priv->id.local_id = (__force __be32)id ^ cm.random_id_operand;
-	return ret;
+	return id < 0 ? id : 0;
 }
 
 static void cm_free_id(__be32 local_id)
@@ -3844,7 +3845,6 @@ static int __init ib_cm_init(void)
cm.remote_sidr_table = RB_ROOT;
 	idr_init(&cm.local_id_table);
 	get_random_bytes(&cm.random_id_operand, sizeof cm.random_id_operand);
-	idr_pre_get(&cm.local_id_table, GFP_KERNEL);
 	INIT_LIST_HEAD(&cm.timewait_list);
 
 	ret = class_register(&cm_class);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index d789eea..c32eeaa 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2143,33 +2143,23 @@ static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv,
 			  unsigned short snum)
 {
 	struct rdma_bind_list *bind_list;
-	int port, ret;
+	int ret;
 
 	bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL);
 	if (!bind_list)
 		return -ENOMEM;
 
-	do {
-		ret = idr_get_new_above(ps, bind_list, snum, &port);
-	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
-
-	if (ret)
-		goto err1;
-
-	if (port != snum) {
-		ret = -EADDRNOTAVAIL;
-		goto err2;
-	}
+	ret = idr_alloc(ps, bind_list, snum, snum + 1, GFP_KERNEL);
+	if (ret < 0)
+		goto err;
 
 	bind_list->ps = ps;
-	bind_list->port = (unsigned short) port;
+	bind_list->port = (unsigned short)ret;
 	cma_bind_port(bind_list, id_priv);
 	return 0;
-err2:
-	idr_remove(ps, port);
-err1:
+err:
 	kfree(bind_list);
-	return ret;
+	return ret == -ENOSPC ? -EADDRNOTAVAIL : ret;
 }
 
 static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index a8905ab..934f45e 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -611,19 +611,21 @@ static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
 
 static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
 {
+	bool preload = gfp_mask & __GFP_WAIT;
 	unsigned long flags;
 	int ret, id;
 
-retry:
-	if (!idr_pre_get(&query_idr, gfp_mask))
-		return -ENOMEM;
+	if (preload)
+		idr_preload(gfp_mask);
 	spin_lock_irqsave(&idr_lock, flags);
-	ret = idr_get_new(&query_idr, query, &id);
+
+	id = idr_alloc(&query_idr, query, 0, 0, GFP_NOWAIT);
+
 	spin_unlock_irqrestore(&idr_lock, flags);
-	if (ret == -EAGAIN)
-		goto retry;
-	if (ret)
-		return ret

Re: [PATCH 27/62] infiniband/ipath: convert to idr_alloc()

2013-02-04 Thread Tejun Heo
Hello,

On Mon, Feb 04, 2013 at 04:15:52PM +0000, Marciniszyn, Mike wrote:
> I tried the branch you indicated in the initial patch cover.
> 
> When run with a qib driver, and ipoib ping of another system produces:
...
> Looks like this is tripping during the arp/neighbour path resolution:
> 
> void idr_preload(gfp_t gfp_mask)
> {
>         /*
>          * Consuming preload buffer from non-process context breaks preload
>          * allocation guarantee.  Disallow usage from those contexts.
>          */
>         WARN_ON_ONCE(in_interrupt());
> 
> Any ideas Roland?

Yeah, firewire had the same problem.  It needs to skip the preload when
!__GFP_WAIT (there's no point in preloading anyway if the caller can't
sleep).  Will send updated patches soon.

Thanks.

-- 
tejun


[PATCH v2 22/62] infiniband/core: convert to idr_alloc()

2013-02-04 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

v2: Mike triggered WARN_ON() in idr_preload() because send_mad(),
which may be used from non-process context, was calling
idr_preload() unconditionally.  Preload iff @gfp_mask has
__GFP_WAIT.

Signed-off-by: Tejun Heo t...@kernel.org
Reported-by: Marciniszyn, Mike mike.marcinis...@intel.com
Cc: Roland Dreier rol...@kernel.org
Cc: Sean Hefty sean.he...@intel.com
Cc: Hal Rosenstock hal.rosenst...@gmail.com
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/core/cm.c |   22 +++---
 drivers/infiniband/core/cma.c|   24 +++-
 drivers/infiniband/core/sa_query.c   |   18 ++
 drivers/infiniband/core/ucm.c|   16 
 drivers/infiniband/core/ucma.c   |   32 
 drivers/infiniband/core/uverbs_cmd.c |   17 -
 6 files changed, 48 insertions(+), 81 deletions(-)

--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -382,20 +382,21 @@ static int cm_init_av_by_path(struct ib_
 static int cm_alloc_id(struct cm_id_private *cm_id_priv)
 {
 	unsigned long flags;
-	int ret, id;
+	int id;
 	static int next_id;
 
-	do {
-		spin_lock_irqsave(&cm.lock, flags);
-		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
-					next_id, &id);
-		if (!ret)
-			next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
-		spin_unlock_irqrestore(&cm.lock, flags);
-	} while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) );
+	idr_preload(GFP_KERNEL);
+	spin_lock_irqsave(&cm.lock, flags);
+
+	id = idr_alloc(&cm.local_id_table, cm_id_priv, next_id, 0, GFP_NOWAIT);
+	if (id >= 0)
+		next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
+
+	spin_unlock_irqrestore(&cm.lock, flags);
+	idr_preload_end();
 
 	cm_id_priv->id.local_id = (__force __be32)id ^ cm.random_id_operand;
-	return ret;
+	return id < 0 ? id : 0;
 }
 
 static void cm_free_id(__be32 local_id)
@@ -3844,7 +3845,6 @@ static int __init ib_cm_init(void)
cm.remote_sidr_table = RB_ROOT;
 	idr_init(&cm.local_id_table);
 	get_random_bytes(&cm.random_id_operand, sizeof cm.random_id_operand);
-	idr_pre_get(&cm.local_id_table, GFP_KERNEL);
 	INIT_LIST_HEAD(&cm.timewait_list);
 
 	ret = class_register(&cm_class);
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2143,33 +2143,23 @@ static int cma_alloc_port(struct idr *ps
 			  unsigned short snum)
 {
 	struct rdma_bind_list *bind_list;
-	int port, ret;
+	int ret;
 
 	bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL);
 	if (!bind_list)
 		return -ENOMEM;
 
-	do {
-		ret = idr_get_new_above(ps, bind_list, snum, &port);
-	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
-
-	if (ret)
-		goto err1;
-
-	if (port != snum) {
-		ret = -EADDRNOTAVAIL;
-		goto err2;
-	}
+	ret = idr_alloc(ps, bind_list, snum, snum + 1, GFP_KERNEL);
+	if (ret < 0)
+		goto err;
 
 	bind_list->ps = ps;
-	bind_list->port = (unsigned short) port;
+	bind_list->port = (unsigned short)ret;
 	cma_bind_port(bind_list, id_priv);
 	return 0;
-err2:
-	idr_remove(ps, port);
-err1:
+err:
 	kfree(bind_list);
-	return ret;
+	return ret == -ENOSPC ? -EADDRNOTAVAIL : ret;
 }
 
 static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -611,19 +611,21 @@ static void init_mad(struct ib_sa_mad *m
 
 static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
 {
+	bool preload = gfp_mask & __GFP_WAIT;
 	unsigned long flags;
 	int ret, id;
 
-retry:
-	if (!idr_pre_get(&query_idr, gfp_mask))
-		return -ENOMEM;
+	if (preload)
+		idr_preload(gfp_mask);
 	spin_lock_irqsave(&idr_lock, flags);
-	ret = idr_get_new(&query_idr, query, &id);
+
+	id = idr_alloc(&query_idr, query, 0, 0, GFP_NOWAIT);
+
 	spin_unlock_irqrestore(&idr_lock, flags);
-	if (ret == -EAGAIN)
-		goto retry;
-	if (ret)
-		return ret;
+	if (preload)
+		idr_preload_end();
+	if (id < 0)
+		return id;
 
 	query->mad_buf->timeout_ms  = timeout_ms;
 	query->mad_buf->context[0] = query;
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -176,7 +176,6 @@ static void ib_ucm_cleanup_events(struct
 static struct ib_ucm_context *ib_ucm_ctx_alloc(struct ib_ucm_file *file)
 {
struct ib_ucm_context *ctx;
-   int result

Re: [PATCH 25/62] infiniband/cxgb4: convert to idr_alloc()

2013-02-03 Thread Tejun Heo
Hello,

On Sun, Feb 3, 2013 at 6:18 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>> Is there a git tree somewhere that I can use to test these patches out?
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git convert-to-idr_alloc
>
>> -   return ret;
>> +   if (lock) {
>> +           idr_preload(GFP_KERNEL);
>> +           spin_lock_irq(&rhp->lock);
>> +   }
>
> The idr_preload() needs to be above the 'if (lock)', no?

No reason to preload for ATOMIC allocation.  idr_alloc() can be called
by itself.

>> +
>> +   ret = idr_alloc(idr, handle, id, id + 1, GFP_ATOMIC);
>> +
>> +   if (lock) {
>> +           spin_unlock_irq(&rhp->lock);
>> +           idr_preload_end();
>> +   }
>
> And idr_preload_end() should be after the 'if (lock)' block methinks...

Ditto.

>> +
>> +   BUG_ON(ret == -ENOSPC);
>> +   return ret < 0 ? ret : 0;
>
> What would cause ret < 0?

It's the allocated id.  In this case, ret would either be @id or -errno.
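
To spell that out with a hypothetical caller, the translation back to
the old zero-on-success convention is:

	ret = idr_alloc(idr, handle, id, id + 1, GFP_ATOMIC);
	if (ret < 0)
		return ret;	/* -ENOMEM, or -ENOSPC if @id is already taken */
	/* otherwise ret == id: the range [id, id + 1) has exactly one slot */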

Thanks.

-- 
tejun


[PATCH 22/62] infiniband/core: convert to idr_alloc()

2013-02-02 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Roland Dreier rol...@kernel.org
Cc: Sean Hefty sean.he...@intel.com
Cc: Hal Rosenstock hal.rosenst...@gmail.com
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be
best to route these together through -mm.  Please holler if there's
any objection.  Thanks.

 drivers/infiniband/core/cm.c | 22 +++---
 drivers/infiniband/core/cma.c| 24 +++-
 drivers/infiniband/core/sa_query.c   | 15 +++
 drivers/infiniband/core/ucm.c| 16 
 drivers/infiniband/core/ucma.c   | 32 
 drivers/infiniband/core/uverbs_cmd.c | 17 -
 6 files changed, 45 insertions(+), 81 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 394fea2..98281fe 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -382,20 +382,21 @@ static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av)
 static int cm_alloc_id(struct cm_id_private *cm_id_priv)
 {
 	unsigned long flags;
-	int ret, id;
+	int id;
 	static int next_id;
 
-	do {
-		spin_lock_irqsave(&cm.lock, flags);
-		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
-					next_id, &id);
-		if (!ret)
-			next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
-		spin_unlock_irqrestore(&cm.lock, flags);
-	} while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) );
+	idr_preload(GFP_KERNEL);
+	spin_lock_irqsave(&cm.lock, flags);
+
+	id = idr_alloc(&cm.local_id_table, cm_id_priv, next_id, 0, GFP_NOWAIT);
+	if (id >= 0)
+		next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
+
+	spin_unlock_irqrestore(&cm.lock, flags);
+	idr_preload_end();
 
 	cm_id_priv->id.local_id = (__force __be32)id ^ cm.random_id_operand;
-	return ret;
+	return id < 0 ? id : 0;
 }
 
 static void cm_free_id(__be32 local_id)
@@ -3844,7 +3845,6 @@ static int __init ib_cm_init(void)
cm.remote_sidr_table = RB_ROOT;
 	idr_init(&cm.local_id_table);
 	get_random_bytes(&cm.random_id_operand, sizeof cm.random_id_operand);
-	idr_pre_get(&cm.local_id_table, GFP_KERNEL);
 	INIT_LIST_HEAD(&cm.timewait_list);
 
 	ret = class_register(&cm_class);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index d789eea..c32eeaa 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2143,33 +2143,23 @@ static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv,
 			  unsigned short snum)
 {
 	struct rdma_bind_list *bind_list;
-	int port, ret;
+	int ret;
 
 	bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL);
 	if (!bind_list)
 		return -ENOMEM;
 
-	do {
-		ret = idr_get_new_above(ps, bind_list, snum, &port);
-	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
-
-	if (ret)
-		goto err1;
-
-	if (port != snum) {
-		ret = -EADDRNOTAVAIL;
-		goto err2;
-	}
+	ret = idr_alloc(ps, bind_list, snum, snum + 1, GFP_KERNEL);
+	if (ret < 0)
+		goto err;
 
 	bind_list->ps = ps;
-	bind_list->port = (unsigned short) port;
+	bind_list->port = (unsigned short)ret;
 	cma_bind_port(bind_list, id_priv);
 	return 0;
-err2:
-	idr_remove(ps, port);
-err1:
+err:
 	kfree(bind_list);
-	return ret;
+	return ret == -ENOSPC ? -EADDRNOTAVAIL : ret;
 }
 
 static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index a8905ab..dc64bae 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -614,16 +614,15 @@ static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
 	unsigned long flags;
 	int ret, id;
 
-retry:
-	if (!idr_pre_get(&query_idr, gfp_mask))
-		return -ENOMEM;
+	idr_preload(gfp_mask);
 	spin_lock_irqsave(&idr_lock, flags);
-	ret = idr_get_new(&query_idr, query, &id);
+
+	id = idr_alloc(&query_idr, query, 0, 0, GFP_NOWAIT);
+
 	spin_unlock_irqrestore(&idr_lock, flags);
-	if (ret == -EAGAIN)
-		goto retry;
-	if (ret)
-		return ret;
+	idr_preload_end();
+	if (id < 0)
+		return id;
 
 	query->mad_buf->timeout_ms  = timeout_ms;
 	query->mad_buf->context[0] = query;
diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 49b15ac..f2f6393 100644
--- a/drivers/infiniband/core/ucm.c

[PATCH 29/62] infiniband/ocrdma: convert to idr_alloc()

2013-02-02 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Roland Dreier rol...@purestorage.com
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be
best to route these together through -mm.  Please holler if there's
any objection.  Thanks.

 drivers/infiniband/hw/ocrdma/ocrdma_main.c | 14 +-
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index c4e0131..48928c8 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -51,18 +51,6 @@ static DEFINE_IDR(ocrdma_dev_id);
 
 static union ib_gid ocrdma_zero_sgid;
 
-static int ocrdma_get_instance(void)
-{
-	int instance = 0;
-
-	/* Assign an unused number */
-	if (!idr_pre_get(&ocrdma_dev_id, GFP_KERNEL))
-		return -1;
-	if (idr_get_new(&ocrdma_dev_id, NULL, &instance))
-		return -1;
-	return instance;
-}
-
 void ocrdma_get_guid(struct ocrdma_dev *dev, u8 *guid)
 {
 	u8 mac_addr[6];
@@ -416,7 +404,7 @@ static struct ocrdma_dev *ocrdma_add(struct be_dev_info *dev_info)
 		goto idr_err;
 
 	memcpy(&dev->nic_info, dev_info, sizeof(*dev_info));
-	dev->id = ocrdma_get_instance();
+	dev->id = idr_alloc(&ocrdma_dev_id, NULL, 0, 0, GFP_KERNEL);
 	if (dev->id < 0)
 		goto idr_err;
 
-- 
1.8.1



[PATCH 28/62] infiniband/mlx4: convert to idr_alloc()

2013-02-02 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Jack Morgenstein ja...@dev.mellanox.co.il
Cc: Or Gerlitz ogerl...@mellanox.com
Cc: Roland Dreier rol...@purestorage.com
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be
best to route these together through -mm.  Please holler if there's
any objection.  Thanks.

 drivers/infiniband/hw/mlx4/cm.c | 32 +++-
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
index dbc99d4..80e59ed 100644
--- a/drivers/infiniband/hw/mlx4/cm.c
+++ b/drivers/infiniband/hw/mlx4/cm.c
@@ -203,7 +203,7 @@ static void sl_id_map_add(struct ib_device *ibdev, struct id_map_entry *new)
 static struct id_map_entry *
 id_map_alloc(struct ib_device *ibdev, int slave_id, u32 sl_cm_id)
 {
-	int ret, id;
+	int ret;
 	static int next_id;
 	struct id_map_entry *ent;
 	struct mlx4_ib_sriov *sriov = &to_mdev(ibdev)->sriov;
@@ -220,25 +220,23 @@ id_map_alloc(struct ib_device *ibdev, int slave_id, u32 sl_cm_id)
 	ent->dev = to_mdev(ibdev);
 	INIT_DELAYED_WORK(&ent->timeout, id_map_ent_timeout);
 
-	do {
-		spin_lock(&to_mdev(ibdev)->sriov.id_map_lock);
-		ret = idr_get_new_above(&sriov->pv_id_table, ent,
-					next_id, &id);
-		if (!ret) {
-			next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
-			ent->pv_cm_id = (u32)id;
-			sl_id_map_add(ibdev, ent);
-		}
+	idr_preload(GFP_KERNEL);
+	spin_lock(&to_mdev(ibdev)->sriov.id_map_lock);
 
-		spin_unlock(&sriov->id_map_lock);
-	} while (ret == -EAGAIN && idr_pre_get(&sriov->pv_id_table, GFP_KERNEL));
-	/*the function idr_get_new_above can return -ENOSPC, so don't insert in that case.*/
-	if (!ret) {
-		spin_lock(&sriov->id_map_lock);
+	ret = idr_alloc(&sriov->pv_id_table, ent, next_id, 0, GFP_NOWAIT);
+	if (ret >= 0) {
+		next_id = ((unsigned)ret + 1) & MAX_IDR_MASK;
+		ent->pv_cm_id = (u32)ret;
+		sl_id_map_add(ibdev, ent);
 		list_add_tail(&ent->list, &sriov->cm_list);
-		spin_unlock(&sriov->id_map_lock);
-		return ent;
 	}
+
+	spin_unlock(&sriov->id_map_lock);
+	idr_preload_end();
+
+	if (ret >= 0)
+		return ent;
+
 	/*error flow*/
 	kfree(ent);
 	mlx4_ib_warn(ibdev, "No more space in the idr (err:0x%x)\n", ret);
-- 
1.8.1



[PATCH 30/62] infiniband/qib: convert to idr_alloc()

2013-02-02 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Mike Marciniszyn infinip...@intel.com
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be
best to route these together through -mm.  Please holler if there's
any objection.  Thanks.

 drivers/infiniband/hw/qib/qib_init.c | 21 -
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
index ddf066d..50e33aa 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -1060,22 +1060,23 @@ struct qib_devdata *qib_alloc_devdata(struct pci_dev *pdev, size_t extra)
 	struct qib_devdata *dd;
 	int ret;
 
-	if (!idr_pre_get(&qib_unit_table, GFP_KERNEL)) {
-		dd = ERR_PTR(-ENOMEM);
-		goto bail;
-	}
-
 	dd = (struct qib_devdata *) ib_alloc_device(sizeof(*dd) + extra);
 	if (!dd) {
 		dd = ERR_PTR(-ENOMEM);
 		goto bail;
 	}
 
+	idr_preload(GFP_KERNEL);
 	spin_lock_irqsave(&qib_devs_lock, flags);
-	ret = idr_get_new(&qib_unit_table, dd, &dd->unit);
-	if (ret >= 0)
+
+	ret = idr_alloc(&qib_unit_table, dd, 0, 0, GFP_NOWAIT);
+	if (ret >= 0) {
+		dd->unit = ret;
 		list_add(&dd->list, &qib_dev_list);
+	}
+
 	spin_unlock_irqrestore(&qib_devs_lock, flags);
+	idr_preload_end();
 
 	if (ret < 0) {
 		qib_early_err(&pdev->dev,
@@ -1180,11 +1181,6 @@ static int __init qlogic_ib_init(void)
 	 * the PCI subsystem.
 	 */
 	idr_init(&qib_unit_table);
-	if (!idr_pre_get(&qib_unit_table, GFP_KERNEL)) {
-		pr_err("idr_pre_get() failed\n");
-		ret = -ENOMEM;
-		goto bail_cq_wq;
-	}
 
 	ret = pci_register_driver(&qib_driver);
 	if (ret < 0) {
@@ -1199,7 +1195,6 @@ static int __init qlogic_ib_init(void)
 
 bail_unit:
 	idr_destroy(&qib_unit_table);
-bail_cq_wq:
 	destroy_workqueue(qib_cq_wq);
 bail_dev:
 	qib_dev_cleanup();
-- 
1.8.1



[PATCH 27/62] infiniband/ipath: convert to idr_alloc()

2013-02-02 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Mike Marciniszyn infinip...@intel.com
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be
best to route these together through -mm.  Please holler if there's
any objection.  Thanks.

 drivers/infiniband/hw/ipath/ipath_driver.c | 16 
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index 7b371f5..fcdaeea 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -194,11 +194,6 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 	struct ipath_devdata *dd;
 	int ret;
 
-	if (!idr_pre_get(&unit_table, GFP_KERNEL)) {
-		dd = ERR_PTR(-ENOMEM);
-		goto bail;
-	}
-
 	dd = vzalloc(sizeof(*dd));
 	if (!dd) {
 		dd = ERR_PTR(-ENOMEM);
@@ -206,9 +201,10 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 	}
 	dd->ipath_unit = -1;
 
+	idr_preload(GFP_KERNEL);
 	spin_lock_irqsave(&ipath_devs_lock, flags);
 
-	ret = idr_get_new(&unit_table, dd, &dd->ipath_unit);
+	ret = idr_alloc(&unit_table, dd, 0, 0, GFP_KERNEL);
 	if (ret < 0) {
 		printk(KERN_ERR IPATH_DRV_NAME
 		       ": Could not allocate unit ID: error %d\n", -ret);
@@ -216,6 +212,7 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 		dd = ERR_PTR(ret);
 		goto bail_unlock;
 	}
+	dd->ipath_unit = ret;
 
 	dd->pcidev = pdev;
 	pci_set_drvdata(pdev, dd);
@@ -224,7 +221,7 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 
 bail_unlock:
 	spin_unlock_irqrestore(&ipath_devs_lock, flags);
-
+	idr_preload_end();
bail:
 	return dd;
 }
@@ -2503,11 +2500,6 @@ static int __init infinipath_init(void)
 	 * the PCI subsystem.
 	 */
 	idr_init(&unit_table);
-	if (!idr_pre_get(&unit_table, GFP_KERNEL)) {
-		printk(KERN_ERR IPATH_DRV_NAME ": idr_pre_get() failed\n");
-		ret = -ENOMEM;
-		goto bail;
-	}
 
 	ret = pci_register_driver(&ipath_driver);
 	if (ret < 0) {
-- 
1.8.1



[PATCH 23/62] infiniband/amso1100: convert to idr_alloc()

2013-02-02 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Tom Tucker t...@opengridcomputing.com
Cc: Steve Wise sw...@opengridcomputing.com
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be
best to route these together through -mm.  Please holler if there's
any objection.  Thanks.

 drivers/infiniband/hw/amso1100/c2_qp.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c b/drivers/infiniband/hw/amso1100/c2_qp.c
index 28cd5cb..0ab826b 100644
--- a/drivers/infiniband/hw/amso1100/c2_qp.c
+++ b/drivers/infiniband/hw/amso1100/c2_qp.c
@@ -382,14 +382,17 @@ static int c2_alloc_qpn(struct c2_dev *c2dev, struct c2_qp *qp)
 {
 	int ret;
 
-	do {
-		spin_lock_irq(&c2dev->qp_table.lock);
-		ret = idr_get_new_above(&c2dev->qp_table.idr, qp,
-					c2dev->qp_table.last++, &qp->qpn);
-		spin_unlock_irq(&c2dev->qp_table.lock);
-	} while ((ret == -EAGAIN) &&
-		 idr_pre_get(&c2dev->qp_table.idr, GFP_KERNEL));
-	return ret;
+	idr_preload(GFP_KERNEL);
+	spin_lock_irq(&c2dev->qp_table.lock);
+
+	ret = idr_alloc(&c2dev->qp_table.idr, qp, c2dev->qp_table.last++, 0,
+			GFP_NOWAIT);
+	if (ret >= 0)
+		qp->qpn = ret;
+
+	spin_unlock_irq(&c2dev->qp_table.lock);
+	idr_preload_end();
+	return ret < 0 ? ret : 0;
 }
 
 static void c2_free_qpn(struct c2_dev *c2dev, int qpn)
-- 
1.8.1



[PATCH 26/62] infiniband/ehca: convert to idr_alloc()

2013-02-02 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Hoang-Nam Nguyen hngu...@de.ibm.com
Cc: Christoph Raisch rai...@de.ibm.com
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be
best to route these together through -mm.  Please holler if there's
any objection.  Thanks.

 drivers/infiniband/hw/ehca/ehca_cq.c | 27 +++
 drivers/infiniband/hw/ehca/ehca_qp.c | 34 +++---
 2 files changed, 22 insertions(+), 39 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c
index 8f52901..212150c 100644
--- a/drivers/infiniband/hw/ehca/ehca_cq.c
+++ b/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -128,7 +128,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 	void *vpage;
 	u32 counter;
 	u64 rpage, cqx_fec, h_ret;
-	int ipz_rc, ret, i;
+	int ipz_rc, i;
 	unsigned long flags;
 
 	if (cqe >= 0xFFFFFFFF - 64 - additional_cqe)
@@ -163,32 +163,19 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 	adapter_handle = shca->ipz_hca_handle;
 	param.eq_handle = shca->eq.ipz_eq_handle;
 
-	do {
-		if (!idr_pre_get(&ehca_cq_idr, GFP_KERNEL)) {
-			cq = ERR_PTR(-ENOMEM);
-			ehca_err(device, "Can't reserve idr nr. device=%p",
-				 device);
-			goto create_cq_exit1;
-		}
-
-		write_lock_irqsave(&ehca_cq_idr_lock, flags);
-		ret = idr_get_new(&ehca_cq_idr, my_cq, &my_cq->token);
-		write_unlock_irqrestore(&ehca_cq_idr_lock, flags);
-	} while (ret == -EAGAIN);
+	idr_preload(GFP_KERNEL);
+	write_lock_irqsave(&ehca_cq_idr_lock, flags);
+	my_cq->token = idr_alloc(&ehca_cq_idr, my_cq, 0, 0x2000000, GFP_NOWAIT);
+	write_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+	idr_preload_end();
 
-	if (ret) {
+	if (my_cq->token < 0) {
 		cq = ERR_PTR(-ENOMEM);
 		ehca_err(device, "Can't allocate new idr entry. device=%p",
 			 device);
 		goto create_cq_exit1;
 	}
 
-	if (my_cq->token > 0x1FFFFFF) {
-		cq = ERR_PTR(-ENOMEM);
-		ehca_err(device, "Invalid number of cq. device=%p", device);
-		goto create_cq_exit2;
-	}
-
/*
 * CQs maximum depth is 4GB-64, but we need additional 20 as buffer
 * for receiving errors CQEs.
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index 1493939..00d6861 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -636,30 +636,26 @@ static struct ehca_qp *internal_create_qp(
 	my_qp->send_cq =
 		container_of(init_attr->send_cq, struct ehca_cq, ib_cq);
 
-	do {
-		if (!idr_pre_get(&ehca_qp_idr, GFP_KERNEL)) {
-			ret = -ENOMEM;
-			ehca_err(pd->device, "Can't reserve idr resources.");
-			goto create_qp_exit0;
-		}
+	idr_preload(GFP_KERNEL);
+	write_lock_irqsave(&ehca_qp_idr_lock, flags);
 
-		write_lock_irqsave(&ehca_qp_idr_lock, flags);
-		ret = idr_get_new(&ehca_qp_idr, my_qp, &my_qp->token);
-		write_unlock_irqrestore(&ehca_qp_idr_lock, flags);
-	} while (ret == -EAGAIN);
+	ret = idr_alloc(&ehca_qp_idr, my_qp, 0, 0x2000000, GFP_NOWAIT);
+	if (ret >= 0)
+		my_qp->token = ret;
 
-	if (ret) {
-		ret = -ENOMEM;
-		ehca_err(pd->device, "Can't allocate new idr entry.");
+	write_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+	idr_preload_end();
+	if (ret < 0) {
+		if (ret == -ENOSPC) {
+			ret = -EINVAL;
+			ehca_err(pd->device, "Invalid number of qp");
+		} else {
+			ret = -ENOMEM;
+			ehca_err(pd->device, "Can't allocate new idr entry.");
+		}
 		goto create_qp_exit0;
 	}
 
-	if (my_qp->token > 0x1FFFFFF) {
-		ret = -EINVAL;
-		ehca_err(pd->device, "Invalid number of qp");
-		goto create_qp_exit1;
-	}
-
 	if (has_srq)
 		parms.srq_token = my_qp->token;
 
-- 
1.8.1



[PATCH 25/62] infiniband/cxgb4: convert to idr_alloc()

2013-02-02 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Steve Wise sw...@chelsio.com
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be
best to route these together through -mm.  Please holler if there's
any objection.  Thanks.

 drivers/infiniband/hw/cxgb4/iw_cxgb4.h | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index 9c1644f..7f862da 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -260,20 +260,21 @@ static inline int _insert_handle(struct c4iw_dev *rhp, struct idr *idr,
 				 void *handle, u32 id, int lock)
 {
 	int ret;
-	int newid;
 
-	do {
-		if (!idr_pre_get(idr, lock ? GFP_KERNEL : GFP_ATOMIC))
-			return -ENOMEM;
-		if (lock)
-			spin_lock_irq(&rhp->lock);
-		ret = idr_get_new_above(idr, handle, id, &newid);
-		BUG_ON(!ret && newid != id);
-		if (lock)
-			spin_unlock_irq(&rhp->lock);
-	} while (ret == -EAGAIN);
-
-	return ret;
+	if (lock) {
+		idr_preload(GFP_KERNEL);
+		spin_lock_irq(&rhp->lock);
+	}
+
+	ret = idr_alloc(idr, handle, id, id + 1, GFP_ATOMIC);
+
+	if (lock) {
+		spin_unlock_irq(&rhp->lock);
+		idr_preload_end();
+	}
+
+	BUG_ON(ret == -ENOSPC);
+	return ret < 0 ? ret : 0;
 }
 
 static inline int insert_handle(struct c4iw_dev *rhp, struct idr *idr,
-- 
1.8.1



[PATCH 24/62] infiniband/cxgb3: convert to idr_alloc()

2013-02-02 Thread Tejun Heo
Convert to the much saner new idr interface.

Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Steve Wise sw...@chelsio.com
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be
best to route these together through -mm.  Please holler if there's
any objection.  Thanks.

 drivers/infiniband/hw/cxgb3/iwch.h | 24 +++-
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h
index a1c4457..8378622 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.h
+++ b/drivers/infiniband/hw/cxgb3/iwch.h
@@ -153,19 +153,17 @@ static inline int insert_handle(struct iwch_dev *rhp, struct idr *idr,
 				void *handle, u32 id)
 {
 	int ret;
-	int newid;
-
-	do {
-		if (!idr_pre_get(idr, GFP_KERNEL)) {
-			return -ENOMEM;
-		}
-		spin_lock_irq(&rhp->lock);
-		ret = idr_get_new_above(idr, handle, id, &newid);
-		BUG_ON(newid != id);
-		spin_unlock_irq(&rhp->lock);
-	} while (ret == -EAGAIN);
-
-	return ret;
+
+	idr_preload(GFP_KERNEL);
+	spin_lock_irq(&rhp->lock);
+
+	ret = idr_alloc(idr, handle, id, id + 1, GFP_NOWAIT);
+
+	spin_unlock_irq(&rhp->lock);
+	idr_preload_end();
+
+	BUG_ON(ret == -ENOSPC);
+	return ret < 0 ? ret : 0;
 }
 
 static inline void remove_handle(struct iwch_dev *rhp, struct idr *idr, u32 id)
-- 
1.8.1



Re: [GIT] lock scaling and bug fixes for 2.6.38

2011-01-11 Thread Tejun Heo
Hello,

On Tue, Jan 11, 2011 at 06:29:52PM +0100, Bart Van Assche wrote:
> On Tue, Jan 11, 2011 at 2:45 AM, Roland Dreier <rdre...@cisco.com> wrote:
> >
> >       git://git.kernel.org/pub/scm/linux/kernel/git/dad/srp-initiator.git srp-lock-scaling
> >
> > Thanks, pulled.
> >
> > Will try to merge Tejun's work queue patch shortly.
>
> Maybe it's a good idea to add a comment above the declaration of the
> global variable "extern struct workqueue_struct *ib_wq;" that explains
> that that variable has been added during the work queue conversion and
> that the preferred approach for new code is to define a new work queue
> instead of using ib_wq ?

But why is defining more workqueues preferred?  Unless there's a
specific need, using what's already there (system or ib_wq) should be
preferable.

Thank you.

-- 
tejun


Re: [PATCH v2.6.36-rc7] infiniband: update workqueue usage

2010-10-20 Thread Tejun Heo
Hello,

On 10/19/2010 07:22 PM, Ralph Campbell wrote:
> On Tue, 2010-10-19 at 08:24 -0700, Tejun Heo wrote:
> 
> > * qib_cq_wq is a separate singlethread workqueue.  Does the queue
> >   require strict single thread execution ordering?  IOW, does each
> >   work have to be executed in the exact queued order and no two works
> >   should execute in parallel?  Or was the singlethreadedness chosen
> >   just to reduce the number of workers?
> 
> The work functions need to be called in-order and single threaded
> or memory will be freed multiple times and other bad things.

I see, so they'll need to be converted to alloc_ordered_workqueue()
once -rc1 merge window opens up.  I'll follow up with the conversion.
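
For reference, the conversion would look something like this (sketch
only; the actual creation site and queue name may differ):

	/* an ordered wq runs at most one work at a time, in queueing order */
	qib_cq_wq = alloc_ordered_workqueue("qib_cq", 0);
	if (!qib_cq_wq)
		return -ENOMEM;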

Thanks.

-- 
tejun


Re: [PATCH v2.6.36-rc7] infiniband: update workqueue usage

2010-10-20 Thread Tejun Heo
Hello,

On 10/19/2010 08:40 PM, Bart Van Assche wrote:
> On Tue, Oct 19, 2010 at 5:24 PM, Tejun Heo <t...@kernel.org> wrote:
> [ ... ]
> > This is to prepare for deprecation of flush_scheduled_work().
> [ ... ]
> > Index: work/include/rdma/ib_verbs.h
> [ ... ]
> > +extern struct workqueue_struct *ib_wq;
> [ ... ]
> 
> This patch adds a declaration of a global variable to a public header
> file. That might be unavoidable, but it doesn't make me happy.

Hmm... that's one very interesting reason to be unhappy.  Can you
please elaborate why addition of a global variable doesn't make you
happy?

Thanks.

-- 
tejun


Re: [PATCH v2.6.36-rc7] infiniband: update workqueue usage

2010-10-20 Thread Tejun Heo
Hello,

On 10/20/2010 12:21 PM, Bart Van Assche wrote:
> > Hmm... that's one very interesting reason to be unhappy.  Can you
> > please elaborate why addition of a global variable doesn't make you
> > happy?
> 
> In the past every time I saw a global variable being added in a
> software project that meant that some software concept was not being
> abstracted properly. Which does not necessarily mean that that is the
> case with this patch.

That's too broad-brushed a position to agree with, but, well, let's talk
about that some other time.  :-)

> With the posted patch, just like with the current implementation, e.g.
> the flush_workqueue() call in the ipath driver will make that driver
> wait until work scheduled by the core/sa_query.c code finished and
> vice versa. Is that necessary ? If not, using multiple work queues for
> IB instead of one would allow to get rid of that global ib_wq
> declaration.

Yeap, if breaking it down further makes sense, definitely.  If someone
is interested, the following are the guidelines I've been following
in these conversions.

* If all the possibly pending works can be safely enumerated without
  too much difficulty and the works in question are not used during
  memory reclaim, use the system workqueue and use explicit
  flush/cancel_work[_sync]() instead of using flush_workqueue().
  Please note that flush_work_sync() isn't in mainline yet.  It's
  scheduled for the coming rc1 window.

* If works can be used during memory reclaim, there's no way around
  it.  A separate workqueue needs to be used.

* If works free themselves or are used to put the last reference of
  the containing object, they can't be flushed explicitly and thus
  need to be queued on a dedicated workqueue which serves as the
  flushing domain.
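
As a concrete illustration of the first guideline (the driver and
field names below are made up):

	/* teardown: every pending work item is known and none runs
	 * in the reclaim path, so cancel each one explicitly rather
	 * than flushing the whole system workqueue */
	cancel_work_sync(&priv->event_work);
	cancel_delayed_work_sync(&priv->poll_work);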

For this patch, I chose less intrusive path and just replaced system
wide workqueue with subsystem wide one mainly because I don't have
experience with or access to anything infiniband.  If someone wants to
push it further, I would be happy to help.

BTW, does the posted patch look safe to you?

Thanks.

-- 
tejun


[PATCH v2.6.36-rc7] infiniband: update workqueue usage

2010-10-19 Thread Tejun Heo
* ib_wq is added, which is used as the common workqueue for infiniband
  instead of the system workqueue.  All system workqueue usages
  including flush_scheduled_work() callers are converted to use and
  flush ib_wq.

* cancel_delayed_work() + flush_scheduled_work() converted to
  cancel_delayed_work_sync().

* qib_wq is removed and ib_wq is used instead.

This is to prepare for deprecation of flush_scheduled_work().

Signed-off-by: Tejun Heo <t...@kernel.org>
---
Hello,

I think this patch is safe, but I don't have any experience with or
access to infiniband stuff, so it's only compile tested.  Also, while
looking through the code, I got curious about several things.

* Can any of the works in infiniband be used during memory reclaim?

* qib_cq_wq is a separate singlethread workqueue.  Does the queue
  require strict single thread execution ordering?  IOW, does each
  work have to be executed in the exact queued order and no two works
  should execute in parallel?  Or was the singlethreadedness chosen
  just to reduce the number of workers?

* The same question for ipoib_workqueue.

Thank you.
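
For reference, the cancel_delayed_work() + flush_scheduled_work()
conversion mentioned above boils down to the following pattern (priv
and stale_work are made-up names, not from the drivers touched here):

	/* before: cancel, then flush the whole system workqueue to
	 * make sure an already-running instance has finished */
	cancel_delayed_work(&priv->stale_work);
	flush_scheduled_work();

	/* after: a single synchronous cancel, no wq-wide flush */
	cancel_delayed_work_sync(&priv->stale_work);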

 drivers/infiniband/core/cache.c|4 +--
 drivers/infiniband/core/device.c   |   11 --
 drivers/infiniband/core/sa_query.c |4 +--
 drivers/infiniband/core/umem.c |2 -
 drivers/infiniband/hw/ipath/ipath_driver.c |2 -
 drivers/infiniband/hw/ipath/ipath_user_pages.c |2 -
 drivers/infiniband/hw/qib/qib_iba7220.c|7 ++
 drivers/infiniband/hw/qib/qib_iba7322.c|   14 ++---
 drivers/infiniband/hw/qib/qib_init.c   |   26 +++--
 drivers/infiniband/hw/qib/qib_qsfp.c   |9 +++-
 drivers/infiniband/hw/qib/qib_verbs.h  |5 +---
 drivers/infiniband/ulp/srp/ib_srp.c|4 +--
 include/rdma/ib_verbs.h|3 ++
 13 files changed, 41 insertions(+), 52 deletions(-)

Index: work/drivers/infiniband/core/cache.c
===
--- work.orig/drivers/infiniband/core/cache.c
+++ work/drivers/infiniband/core/cache.c
@@ -308,7 +308,7 @@ static void ib_cache_event(struct ib_eve
 			INIT_WORK(&work->work, ib_cache_task);
 			work->device   = event->device;
 			work->port_num = event->element.port_num;
-			schedule_work(&work->work);
+			queue_work(ib_wq, &work->work);
 		}
 	}
 }
@@ -368,7 +368,7 @@ static void ib_cache_cleanup_one(struct
int p;

 	ib_unregister_event_handler(&device->cache.event_handler);
-	flush_scheduled_work();
+	flush_workqueue(ib_wq);
 
 	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
 		kfree(device->cache.pkey_cache[p]);
Index: work/drivers/infiniband/core/device.c
===
--- work.orig/drivers/infiniband/core/device.c
+++ work/drivers/infiniband/core/device.c
@@ -38,7 +38,6 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/mutex.h>
-#include <linux/workqueue.h>
 
 #include "core_priv.h"

@@ -52,6 +51,9 @@ struct ib_client_data {
void *data;
 };

+struct workqueue_struct *ib_wq;
+EXPORT_SYMBOL_GPL(ib_wq);
+
 static LIST_HEAD(device_list);
 static LIST_HEAD(client_list);

@@ -718,6 +720,10 @@ static int __init ib_core_init(void)
 {
int ret;

+	ib_wq = alloc_workqueue("infiniband", 0, 0);
+	if (!ib_wq)
+		return -ENOMEM;
+
ret = ib_sysfs_setup();
if (ret)
 		printk(KERN_WARNING "Couldn't create InfiniBand device "
 		       "class\n");
@@ -726,6 +732,7 @@ static int __init ib_core_init(void)
if (ret) {
 		printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID "
 		       "cache\n");
ib_sysfs_cleanup();
+   destroy_workqueue(ib_wq);
}

return ret;
@@ -736,7 +743,7 @@ static void __exit ib_core_cleanup(void)
ib_cache_cleanup();
ib_sysfs_cleanup();
/* Make sure that any pending umem accounting work is done. */
-   flush_scheduled_work();
+   destroy_workqueue(ib_wq);
 }

 module_init(ib_core_init);
Index: work/drivers/infiniband/core/sa_query.c
===
--- work.orig/drivers/infiniband/core/sa_query.c
+++ work/drivers/infiniband/core/sa_query.c
@@ -422,7 +422,7 @@ static void ib_sa_event(struct ib_event_
 		port->sm_ah = NULL;
 		spin_unlock_irqrestore(&port->ah_lock, flags);
 
-		schedule_work(&sa_dev->port[event->element.port_num -
-					    sa_dev->start_port].update_task);
+		queue_work(ib_wq, &sa_dev->port[event->element.port_num -
+						sa_dev->start_port].update_task);
 	}
 }
@@ -1068,7 +1068,7 @@ static void ib_sa_remove_one(struct ib_d

ib_unregister_event_handler(&sa_dev

Re: idr_get_new_exact ?

2010-09-23 Thread Tejun Heo
Hello,

On 09/23/2010 01:42 PM, Paul Mundt wrote:
> On Mon, Sep 20, 2010 at 11:26:47PM +0200, Tejun Heo wrote:
> > Hello,
> > 
> > On 09/20/2010 10:35 PM, Roland Dreier wrote:
> > > Looks fine to me as an improvement over the status quo, but I wonder how
> > > many of these places could use the radix_tree stuff instead?  If you're
> > > not using the ability of the idr code to assign an id for you, then it
> > > seems the radix_tree API is a better fit.
> > 
> > I agree.  Wouldn't those users be better off simply using a radix tree?
> 
> It could go either way. I was about to write the same function when
> playing with it for IRQ mapping, the idea being to populate the initial
> tree with sparse static vectors and then switch over to dynamic IDs for
> virtual IRQ creation. I ended up going with a radix tree for other
> reasons, though.

I see.  If there are use cases where fixed and dynamic IDs need to be
mixed, no objection from me.
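
FWIW, for purely fixed keys something like the following works
without touching idr at all (virq_map and the value being stored are
made up for illustration):

	static RADIX_TREE(virq_map, GFP_KERNEL);

	/* the caller picks the exact key -- the case an
	 * idr_get_new_exact() would otherwise cover */
	err = radix_tree_insert(&virq_map, hwirq, desc);
	if (!err)
		desc = radix_tree_lookup(&virq_map, hwirq);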

Thanks.

-- 
tejun


Re: idr_get_new_exact ?

2010-09-20 Thread Tejun Heo
Hello,

On 09/20/2010 10:35 PM, Roland Dreier wrote:
> Looks fine to me as an improvement over the status quo, but I wonder how
> many of these places could use the radix_tree stuff instead?  If you're
> not using the ability of the idr code to assign an id for you, then it
> seems the radix_tree API is a better fit.

I agree.  Wouldn't those users be better off simply using a radix tree?

Thanks.

-- 
tejun