Re: [PATCHv1 0/6] rdma controller support
Hello, Parav.

On Thu, Jan 07, 2016 at 04:43:20AM +0530, Parav Pandit wrote:
> > If different controllers can't agree upon the same set of resources,
> > which probably is a pretty good sign that this isn't too well thought
> > out to begin with,
>
> When you said "different controller" you meant "different hw vendors",
> right? Or did you mean rdma, mem, cpu as controllers here?

Different hw vendors.

> > at least make all resource types defined by the controller itself
> > and let the controllers enable them selectively.
>
> In this V1 patch, resources are defined by the IB stack and the rdma
> cgroup is a facilitator for them. By doing so, IB stack modules can
> define new resources without really making changes to the cgroup.
> This design also allows hw vendors to define their own resources,
> which will be reviewed on the rdma mailing list anyway. The idea is
> that different hw versions can have different resource support, so the
> whole intention is not about defining different resources but rather
> enabling them. But yes, I equally agree that by doing so, different hw
> vendors can define different hw resources.

How many vendors and resources are we talking about? What I was trying
to say was that unless the number is extremely high, it'd be far
simpler to hard-code them in the rdma controller and let drivers enable
the ones which apply to them. It would require updating the rdma cgroup
controller to add new resource types, but I think that'd actually be an
upside, not a downside. There need to be some checks and balances
against adding new resource types; otherwise, it'll soon become a mess.

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv1 2/6] IB/core: Added members to support rdma cgroup
On Thu, Jan 07, 2016 at 04:46:19AM +0530, Parav Pandit wrote:
> On Wed, Jan 6, 2016 at 3:26 AM, Tejun Heo <t...@kernel.org> wrote:
> > On Wed, Jan 06, 2016 at 12:28:02AM +0530, Parav Pandit wrote:
> >> Added function pointer table to store resource pool specific
> >> operations for each resource type (verb and hw).
> >> Added list node to link device to rdma cgroup so that it can
> >> participate in resource accounting and limit configuration.
> >
> > Is there any point in splitting patches 1 and 2 from 3?
>
> Patch 2 is in the IB stack, so I separated that patch out from 1.
> That makes it 3 patches. If you think a single patch is easier to
> review, let me know; I can respin to have one patch for these 3
> smaller patches.

They don't do anything by themselves. I think putting them into a
single patch would be easier to review.

Thanks.

--
tejun
Re: [PATCHv1 6/6] rdmacg: Added documentation for rdma controller.
Hello,

On Thu, Jan 07, 2016 at 04:14:26AM +0530, Parav Pandit wrote:
> Yes. I read through. I can see two changes to be made in the V2
> version of this patch.
> 1. rdma.resource.verb.usage and rdma.resource.verb.limit to change
> respectively to,
> 2. rdma.resource.verb.stat and rdma.resource.verb.max.
> 3. rdma.resource.verb.failcnt indicates failure events, which I think
> should go to events.

What's up with the ".resource" part? Also, can't the .max file list
the available resources? Why does it need a separate list file?

> I will roll out a new patch for events after this patch as an
> additional feature, and remove this feature in V2.
>
> The rdma.resource.verb.list file is unique to the rdma cgroup, so I
> believe this is fine.

Please see above.

> We will conclude whether or not to have rdma.resource.hw. in other
> patches. I am of the opinion to keep the "resource" and "verb" or
> "hw" tags around, to keep it verbose enough to know what we are
> trying to control.

What does that achieve? I feel that it's getting constantly
overengineered.

Thanks.

--
tejun
Re: [PATCHv1 2/6] IB/core: Added members to support rdma cgroup
On Wed, Jan 06, 2016 at 12:28:02AM +0530, Parav Pandit wrote:
> Added function pointer table to store resource pool specific
> operations for each resource type (verb and hw).
> Added list node to link device to rdma cgroup so that it can
> participate in resource accounting and limit configuration.

Is there any point in splitting patches 1 and 2 from 3?

--
tejun
Re: [PATCHv1 0/6] rdma controller support
Hello,

On Wed, Jan 06, 2016 at 12:28:00AM +0530, Parav Pandit wrote:
> Resources are not defined by the RDMA cgroup. Resources are defined
> by the RDMA/IB stack & optionally by HCA vendor device drivers.

As I wrote before, I don't think this is a good idea. Without much
scrutiny, drivers will inevitably add "resources" which don't make any
sense. If different controllers can't agree upon the same set of
resources, which probably is a pretty good sign that this isn't too
well thought out to begin with, at least make all resource types
defined by the controller itself and let the controllers enable them
selectively.

Thanks.

--
tejun
Re: [PATCHv1 6/6] rdmacg: Added documentation for rdma controller.
Hello,

On Wed, Jan 06, 2016 at 12:28:06AM +0530, Parav Pandit wrote:
> +5-4-1. RDMA Interface Files
> +
> +  rdma.resource.verb.list
> +  rdma.resource.verb.limit
> +  rdma.resource.verb.usage
> +  rdma.resource.verb.failcnt
> +  rdma.resource.hw.list
> +  rdma.resource.hw.limit
> +  rdma.resource.hw.usage
> +  rdma.resource.hw.failcnt

Can you please read the rest of cgroup.txt and put the interface in
line with the common conventions followed by other controllers?

Thanks.

--
tejun
Re: [PATCHv1 3/6] rdmacg: implements rdma cgroup
Hello,

On Wed, Jan 06, 2016 at 12:28:03AM +0530, Parav Pandit wrote:
> +/* hash table to keep map of tgid to owner cgroup */
> +DEFINE_HASHTABLE(pid_cg_map_tbl, 7);
> +DEFINE_SPINLOCK(pid_cg_map_lock); /* lock to protect hash table access */
> +
> +/* Keeps mapping of pid to its owning cgroup at rdma level,
> + * This mapping doesn't change, even if process migrates from one to other
> + * rdma cgroup.
> + */
> +struct pid_cg_map {
> +	struct pid *pid;		/* hash key */
> +	struct rdma_cgroup *cg;
> +
> +	struct hlist_node hlist;	/* pid to cgroup hash table link */
> +	atomic_t refcnt;		/* count active user tasks to figure out
> +					 * when to free the memory
> +					 */
> +};

Ugh, there's something clearly wrong here. Why does the rdma
controller need to keep track of pid cgroup membership?

> +static void _dealloc_cg_rpool(struct rdma_cgroup *cg,
> +			      struct cg_resource_pool *rpool)
> +{
> +	spin_lock(&cg->cg_list_lock);
> +
> +	/* if its started getting used by other task,
> +	 * before we take the spin lock, then skip,
> +	 * freeing it.
> +	 */

Please follow CodingStyle.

> +	if (atomic_read(&rpool->refcnt) == 0) {
> +		list_del_init(&rpool->cg_list);
> +		spin_unlock(&cg->cg_list_lock);
> +
> +		_free_cg_rpool(rpool);
> +		return;
> +	}
> +	spin_unlock(&cg->cg_list_lock);
> +}
> +
> +static void dealloc_cg_rpool(struct rdma_cgroup *cg,
> +			     struct cg_resource_pool *rpool)
> +{
> +	/* Don't free the resource pool which is created by the
> +	 * user, otherwise we miss the configured limits. We don't
> +	 * gain much either by splitting storage of limit and usage.
> +	 * So keep it around until user deletes the limits.
> +	 */
> +	if (atomic_read(&rpool->creator) == RDMACG_RPOOL_CREATOR_DEFAULT)
> +		_dealloc_cg_rpool(cg, rpool);

I'm pretty sure you can get away with a fixed-length array of
counters. Please keep it simple. It's a simple hard limit enforcer.
There's no need to create a massive dynamic infrastructure.

Thanks.

--
tejun
Re: RFC rdma cgroup
Hello, chiming in late.

On Wed, Oct 28, 2015 at 01:59:15PM +0530, Parav Pandit wrote:
> Design guidelines:
> ---
> 1. There will be a new rdma cgroup for accounting rdma resources
> (instead of extending the device cgroup).
> Rationale: RDMA tracks different types of resources and it functions
> differently than the device cgroup. Though the device cgroup could
> have been extended in a more generic direction, the community feels
> it's better to create an RDMA cgroup, which might have more features
> than just resource limit enforcement in the future.

Yeap, it should definitely be separate from device cgroup.

> 2. The RDMA cgroup will allow resource accounting and limit
> enforcement on a per cgroup, per rdma device basis (instead of
> resource limiting across all devices).
> Rationale: this gives granular control when multiple devices exist in
> the system.
>
> 3. Resources are not defined by the RDMA cgroup. Resources are
> defined by the RDMA/IB subsystem and optionally by HCA vendor device
> drivers.
> Rationale: This allows the rdma cgroup to remain constant while the
> RDMA/IB subsystem can evolve without needing rdma cgroup updates. A
> new resource can be easily added by the RDMA/IB subsystem without
> touching the rdma cgroup.

I'm *extremely* uncomfortable with this. Drivers for this sort of
higher-end devices tend to pull a lot of stunts for better or worse,
and my gut feeling is that letting low level drivers run free with
resource definition is highly likely to lead to an unmanageable mess
in the long run. I'd strongly urge you to gather consensus on what
the resources should be across the board.

> Design:
> -
> 8. Typically each RDMA cgroup will have 0 to 4 RDMA devices.
> Therefore each cgroup will have 0 to 4 verbs resource pools and
> optionally 0 to 4 hw resource pools per such device.
> (Nothing stops it from having more devices and pools, but the design
> is built around this use case.)

Heh, 4 seems like an arbitrary number. idk, it feels weird to bake a
number like 4 into the design.

> 9. Resource pool objects are created in the following situations.
> (a) an administrative operation is done to set the limit and no
> previous resource pool exists for the device of interest for the
> cgroup.
> (b) no resource limits were configured, but the IB/RDMA subsystem
> tries to charge the resource. That way, when applications are running
> without limits and limits are enforced later on, uncharging works
> correctly; otherwise the usage count would drop negative. This is
> done using the default resource pool.
> Instead of implementing any sort of time markers, the default pool
> simplifies the design.

So, the usual way to deal with this is that the root cgroup is exempt
from accounting and each resource tracks where it was charged to and
frees that cgroup on release. IOW, associate on charge and maintain
the association till release.

For interface details, please refer to the following documentation.

https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup.txt?h=for-4.5

Thanks.

--
tejun
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hello, Parav.

On Mon, Sep 14, 2015 at 07:34:09PM +0530, Parav Pandit wrote:
> I missed acknowledging your point that we need both - hard limit and
> soft limit/weight. The current patchset is only based on hard limits.
> I see that weight would be another helpful layer in the chain that we
> can implement after this incrementally, which keeps review and
> debugging manageable.

At this point, I'm very unsure that doing this as a cgroup controller
is a good direction. From a userland interface standpoint, publishing
a cgroup controller is a big commitment. It is true that we haven't
been doing a good job of gatekeeping or polishing controller
interfaces, but we're trying hard to change that, and what's being
proposed in this thread doesn't really seem mature enough. It's not
even clear whether what's being identified as resources here are
things that users would actually care about, or whether it's even
possible to implement sensible resource control in the kernel via the
proposed resource restrictions. So, I'd suggest going back to the
drawing board and figuring out what the actual resources are, what
their distribution strategies should be, and at which layer such
strategies can best be implemented.

Thanks.

--
tejun
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hello, Parav.

On Fri, Sep 11, 2015 at 09:56:31PM +0530, Parav Pandit wrote:
> Resource runaway by an application can lead to (a) the kernel and (b)
> other applications being left with no resources.

Yeap, that's what this controller would be able to prevent to a
reasonable extent.

> Both of those problems are targets of this patch set, via accounting
> through cgroup.
>
> Performance contention can be resolved with higher level user space,
> which will tune it.

If individual applications are gonna be allowed to do that, what's to
prevent them from jacking up their limits? So, I assume you're
thinking of a central authority overseeing distribution and enforcing
the policy through cgroups?

> Threshold and fail counters are on the way in a follow-on patch.

If you're planning on following what the existing memcg did in this
area, it's unlikely to go well. Would you mind sharing what you have
in mind in the long term? Where do you see this going?

Thanks.

--
tejun
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hello, Parav.

On Fri, Sep 11, 2015 at 10:09:48PM +0530, Parav Pandit wrote:
> > If you're planning on following what the existing memcg did in this
> > area, it's unlikely to go well. Would you mind sharing what you
> > have in mind in the long term? Where do you see this going?
>
> At least the current thoughts are: a central entity/authority
> monitors the fail count and a new threshold count.
> Fail count - as with other counters, indicates how many times
> resource failure occurred.
> Threshold count - indicates up to what level this resource usage has
> gone. (The application might not be able to poll on thousands of such
> resource entries.)
> So based on the fail count and threshold count, it can tune things
> further.

So, regardless of the specific resource in question, implementing
adaptive resource distribution requires more than simple thresholds
and failcnts. The very minimum would be a way to exert reclaim
pressure and then a way to measure how much the lack of a given
resource is affecting the workload. Maybe it can adaptively lower the
limits and then watch how often allocation fails, but that's highly
unlikely to be an effective measure, as it can't do anything about
hoarders, and the frequency of allocation failure doesn't necessarily
correlate with the amount of impact the workload is seeing (it's not
a measure of usage).

This is what I'm wary about. The kernel-userland interface here is
cut pretty low in the stack, leaving most of the arbitration and
management logic in userland, which seems to be what people wanted and
that's fine, but then you're trying to implement an intelligent
resource control layer which straddles kernel and userland with those
low level primitives, which inevitably would increase the required
interface surface, as nobody has enough information.

Just to illustrate the point, please think of the alsa interface. We
expose hardware capabilities pretty much as-is, leaving management and
multiplexing to userland, and there's nothing wrong with it. It fits
better that way; however, we don't then go try to implement a cgroup
controller for PCM channels. To do any high-level resource
management, you gotta do it where the said resource is actually
managed and arbitrated.

What's the allocation frequency you're expecting? It might be better
to just let allocations themselves go through the agent that you're
planning. You sure can use cgroup membership to identify who's asking
tho.

Given how the whole thing is architectured, I'd suggest thinking more
about how the whole thing should turn out eventually.

Thanks.

--
tejun
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hello, Doug.

On Fri, Sep 11, 2015 at 12:24:33AM -0400, Doug Ledford wrote:
> > My uneducated suspicion is that the abstraction is just not
> > developed enough.
>
> The abstraction is 10+ years old. It has had plenty of time to
> ferment, and something better for the specific use case has not
> emerged.

I think that is likely more reflective of the use cases rather than
anything inherent in the concept.

> > It should be possible to virtualize these resources through, most
> > likely, time-sharing to the level where userland simply says "I
> > want this chunk transferred there" and the OS schedules the
> > transfer prioritizing competing requests.
>
> No. And if you think this, then you miss the *entire* point of RDMA
> technologies. An analogy that I have used many times in
> presentations is that, in the networking world, the kernel is both a
> postman and a copy machine. It receives all incoming packets and
> must sort them to the right recipient (the postman job), and when the
> user space application is ready to use the information it must copy
> it into the user's VM space, because it couldn't just put the user's
> data buffer on the RX buffer list since each buffer might belong to
> anyone (the copy machine). In the RDMA world, you create a new queue
> pair; it is often a long lived connection (like a socket), but it
> belongs now to the app, and the app can directly queue both send and
> receive buffers to the card, and on incoming packets the card will
> know that the packet belongs to a specific queue pair and will
> immediately go to that app's buffer. You can *not* do this with TCP
> without moving to complete TCP offload on the card, registration of
> specific sockets on the card, and then allowing the application to
> pre-register receive buffers for a specific socket to the card so
> that incoming data on the wire can go straight to the right place.
> If you ever get to the point of "OS schedules the transfer" then you
> might as well throw RDMA out the window, because you have totally
> trashed the benefit it provides.

I don't know. This sounds like the classic "this is painful so it
must be good" bare metal fantasy. I get that rdma succeeds at
bypassing a lot of overhead. That's great, but that really isn't
exclusive with having more accessible mechanisms built on top. The
crux of the cost saving is the hardware knowing where the incoming
data belongs and putting it there directly. Everything else is there
to facilitate that, and if you're declaring that it's impossible to
build accessible abstractions for that, I can't agree with you.

Note that this is not to say that rdma should do that in the operating
system. As you said, people have been happy with the bare abstraction
for a long time and, given relatively specialized use cases, that can
be completely fine, but please do note that the lack of proper
abstraction isn't an inherent feature. It's just easier that way and
putting in more effort hasn't been necessary.

> > It could be that given the use cases rdma might not need such level
> > of abstraction - e.g. most users want to be and are pretty close to
> > bare metal, but, if that's true, it also kinda is weird to build
> > hierarchical resource distribution scheme on top of such bare
> > abstraction.
>
> Not really. If you are going to have a bare abstraction, this one
> isn't really a bad one. You have devices. On a device, you allocate
> protection domains (PDs). If you don't care about cross connection
> issues, you ignore this and only use one. If you do care, this acts
> like a process's unique VM space, only for RDMA buffers; it is a
> domain to protect the data of one connection from another. Then you
> have queue pairs (QPs), which are roughly the equivalent of a socket.
> Each QP has at least one Completion Queue where you get the events
> that tell you things have completed (although they often use two, one
> for send completions and one for receive completions). And then you
> use some number of memory registrations (MRs) and address handles
> (AHs) depending on your usage. Since RDMA stands for Remote Direct
> Memory Access, as you can imagine, giving a remote machine free rein
> to access all of the physical memory in your machine is a security
> issue. The MRs help to control what memory the remote host on a
> specific QP has access to. The AHs control how we actually route
> packets from ourselves to the remote host.
>
> Here's the deal. You might be able to create an abstraction above
> this that hides *some* of this. But it can't hide even nearly all of
> it without losing significant functionality. The problem here is
> that you are thinking about RDMA connections like sockets. They
> aren't. Not even close. They are "how do I allow a remote machine
> to directly read and write into my machine's physical memory in an
> even remotely close to secure manner?" These resources aren't
> hardware resources, they are the abstraction resources
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hello, Parav.

On Fri, Sep 11, 2015 at 10:13:59AM +0530, Parav Pandit wrote:
> > My uneducated suspicion is that the abstraction is just not
> > developed enough. It should be possible to virtualize these
> > resources through, most likely, time-sharing to the level where
> > userland simply says "I want this chunk transferred there" and the
> > OS schedules the transfer prioritizing competing requests.
>
> Tejun,
> That is such a perfect abstraction to have at the OS level, but I am
> not sure how close to bare metal RDMA it can be.
> I have started a discussion on that front as part of another thread,
> but it's certainly a long way to go.
> Most want to enjoy the performance benefit of the bare metal
> interfaces it provides.

Yeah, sure, I'm not trying to say that rdma needs or should do that.

> Such an abstraction as you mentioned exists; the only difference is
> that instead of the OS as the central entity, the higher level
> libraries, drivers and hw together do it today for the applications.

But my point is that having resource control in the OS while the
actual arbitration happens higher up in the stack isn't likely to lead
to an effective resource distribution scheme.

> > You kinda have to decide that upfront cuz it gets baked into the
> > interface.
>
> Well, all the interfaces are not yet defined.

I meant the cgroup interface.

> Except for the test and benchmark utilities, real world applications
> wouldn't really bother much about which device they are going
> through.

Weights can work fine across multiple devices. Hard limits don't. It
just doesn't make any sense. Unless you can exclude multiple device
scenarios, you'll have to implement per-device limits.

Thanks.

--
tejun
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hello, Parav.

On Wed, Sep 09, 2015 at 09:27:40AM +0530, Parav Pandit wrote:
> This is an old white paper, but most of the reasoning still holds
> true for RDMA.
> http://h10032.www1.hp.com/ctg/Manual/c00257031.pdf

Just read it. Much appreciated.

...

> These resources include QP (queue pair) to transfer data, CQ
> (completion queue) to indicate completion of a data transfer
> operation, and MR (memory region) to represent user application
> memory as source or destination for data transfer.
> Common resources are QP, SRQ (shared receive queue), CQ, MR, AH
> (address handle), FLOW, PD (protection domain), user context, etc.

It's kinda bothering that all these are disparate resources. I
suppose that each restriction comes from the underlying hardware and
there's no accepted higher level abstraction for these things?

> >> This patch-set allows limiting rdma resources to a set of
> >> processes. It extends the device cgroup controller for limiting
> >> rdma device limits.
> >
> > I don't think this belongs to devcg. If these make sense as a set
> > of resources to be controlled via cgroup, the right way prolly
> > would be a separate controller.
>
> In the past there have been similar comments to have a dedicated
> cgroup controller for RDMA instead of merging it with the device
> cgroup. I am ok with both approaches; however, I prefer to utilize
> the device controller instead of spinning off a new controller for a
> new device category.
> I anticipate more such needs will arise, and for each new device
> category it might not be worth having a new cgroup controller.
> RapidIO, though much less popular, and the upcoming PCIe are on the
> horizon to offer benefits similar to RDMA, and in the future having
> one controller for each of them would again not be the right
> approach.
>
> I certainly seek your and others' input in this email thread on
> whether
> (a) to continue to extend the device cgroup (which supports character
> and block device white lists) with RDMA devices, or
> (b) to spin off a new controller, and if so, what are the compelling
> reasons it can provide compared to an extension.

I'm doubtful that these things are gonna be mainstream without
building up higher level abstractions on top, and if we ever get there
we won't be talking about MR or CQ or whatever. Also, whatever
next-gen is unlikely to have enough commonalities when the proposed
resource knobs are this low level, so let's please keep it separate,
so that if/when this goes out of fashion for one reason or another,
the controller can silently wither away too.

> The current scope of the patch is limited to RDMA resources as a
> first patch, but in fact I am sure that there is more functionality
> in the pipe to support via this cgroup, by me and others.
> So keeping at least these two aspects in mind, I need input on the
> direction: a dedicated controller or a new one.
>
> In the future, I anticipate that we might have subdirectories in the
> device cgroup for individual device classes to control, such as:
> /char
> /block
> /rdma
> /pcie
> /child_cgroup..1..N
> Each controller's cgroup access files would remain within their own
> scope. We are not there yet from a base infrastructure standpoint,
> but it's something to be done as it matures and users start using it.

I don't think that jives with the rest of cgroup. Also, what generic
block or pcie attributes are directly exposed to applications and need
to be hierarchically controlled via cgroup?

Thanks.

--
tejun
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hello, Parav.

On Fri, Sep 11, 2015 at 09:09:58AM +0530, Parav Pandit wrote:
> The fact is that user level applications use hardware resources.
> The verbs layer is a software abstraction for them. Drivers hide how
> they implement this QP or CQ or whatever hardware resource they
> project via the API layer.
> For all of the userland on top of the verbs layer I mentioned above,
> the common resource abstraction is these resources: AH, QP, CQ, MR,
> etc.
> Hardware (and driver) might have a different view of these resources
> in their real implementation.
> For example, the verbs layer can say that it has 100 QPs, but the
> hardware might actually have 20 QPs that the driver decides how to
> use efficiently.

My uneducated suspicion is that the abstraction is just not developed
enough. It should be possible to virtualize these resources through,
most likely, time-sharing to the level where userland simply says "I
want this chunk transferred there" and the OS schedules the transfer
prioritizing competing requests.

It could be that given the use cases rdma might not need such a level
of abstraction - e.g. most users want to be and are pretty close to
bare metal - but, if that's true, it also kinda is weird to build a
hierarchical resource distribution scheme on top of such a bare
abstraction.

...

> > I don't know. What's proposed in this thread seems way too low
> > level to be useful anywhere else. Also, what if there are multiple
> > devices? Is that a problem to worry about?
>
> o.k. It doesn't have to be useful anywhere else. If it suffices the
> needs of RDMA applications, that's fine for the near future.
> This patch allows limiting resources across multiple devices.
> As we go along the path, and if a requirement comes up to have a knob
> on a per device basis, that's something we can extend in the future.

You kinda have to decide that upfront cuz it gets baked into the
interface.

> > I'm kinda doubtful we're gonna have too many of these. Hardware
> > details being exposed to userland this directly isn't common.
>
> It's common in RDMA applications. Again, they may not be real
> hardware resources; it's just the API layer which defines those RDMA
> constructs.

It's still a very low level of abstraction which pretty much gets
decided by what the hardware and driver decide to do.

> > I'd say keep it simple and do the minimum. :)
>
> o.k. In that case a new rdma cgroup controller which does rdma
> resource accounting is possibly the simplest form?
> Makes sense?

So, this fits cgroup's purpose to a certain level, but it feels like
we're trying to build too much on top of something which hasn't
developed sufficiently. I suppose it could be that this is the level
of development that rdma is gonna reach and a dumb cgroup controller
can be useful for some use cases. I don't know, so, yeah, let's keep
it simple and avoid doing crazy stuff.

Thanks.

--
tejun
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hello, Parav.

On Thu, Sep 10, 2015 at 11:16:49PM +0530, Parav Pandit wrote:
> >> These resources include QP (queue pair) to transfer data, CQ
> >> (completion queue) to indicate completion of a data transfer
> >> operation, and MR (memory region) to represent user application
> >> memory as source or destination for data transfer.
> >> Common resources are QP, SRQ (shared receive queue), CQ, MR, AH
> >> (address handle), FLOW, PD (protection domain), user context, etc.
> >
> > It's kinda bothering that all these are disparate resources.
>
> Actually not. They are linked resources. Every QP needs one or two
> associated CQs and one PD. Every QP will use a few MRs for data
> transfer.

So, if that's the case, let's please implement something higher level.
The goal is providing reasonable isolation or protection. If that can
be achieved at a higher level of abstraction, please do that.

> Here is a good programming guide for the RDMA APIs exposed to user
> space applications:
> http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf
> So the first version of the cgroups patch will address the control
> operations for section 3.4.
>
> > I suppose that each restriction comes from the underlying hardware
> > and there's no accepted higher level abstraction for these things?
>
> There is a higher level abstraction, the verbs layer, which does
> actually expose the hardware resources but in a vendor agnostic way.
> There are many vendors who support this verbs layer; some of them
> which I know are Mellanox, Intel, Chelsio, and Avago/Emulex, whose
> drivers supporting these verbs are in the kernel tree.
> There are higher level APIs above the verbs layer, such as MPI,
> libfabric, rsocket, rds, pgas, and dapl, which use the underlying
> verbs layer. They all rely on the hardware resources. All of these
> higher level abstractions are accepted and well used by certain
> application classes. It would be a long discussion to go over them
> here.

Well, the programming interface that userland builds on top doesn't
matter too much here, but if there is a common resource abstraction
which can be made in terms of constructs that consumers of the
facility would care about, that likely is a better choice than
exposing whatever the hardware exposes.

> > I'm doubtful that these things are gonna be mainstream w/o building
> > up higher level abstractions on top and if we ever get there we
> > won't be talking about MR or CQ or whatever.
>
> Some of the higher level examples I gave above will adapt to resource
> allocation failures. Some are actually adaptive to a few resource
> allocation failures; they do query resources. But it's not completely
> there yet. Once we have this notion of limited resources in place,
> the abstraction layer would adapt to relatively smaller values of
> such resources.
> These higher level abstractions are mainstream. They're shipped at
> least in Red Hat Enterprise Linux.

Again, I was talking more about resource abstraction - e.g. something
along the line of "I want N command buffers".

> > Also, whatever next-gen is unlikely to have enough commonalities
> > when the proposed resource knobs are this low level,
>
> I agree that the resources won't be common in other next-gen
> transports whenever they arrive. But from my background working on
> some of those transports, they appear similar in nature and might
> seek similar knobs.

I don't know. What's proposed in this thread seems way too low level
to be useful anywhere else. Also, what if there are multiple devices?
Is that a problem to worry about?

> In the past I have had discussions on this topic with Liran Liss
> from Mellanox as well, and we also agreed to have such a cgroup
> controller. He has a recent presentation at a Linux Foundation event
> indicating the desire to have a cgroup for RDMA. Below is the link
> to it:
> http://events.linuxfoundation.org/sites/events/files/slides/containing_rdma_final.pdf
> Slides 1 to 7 and slide 13 will give you more insight into it.
> Liran and I gave a similar presentation with fewer slides to an RDMA
> audience at the RDMA OpenFabrics summit in March 2015.
>
> I am ok with creating a separate cgroup for rdma, if the community
> thinks that way. My preference would still be to use the device
> cgroup for the above extensions unless there are fundamental issues
> that I am missing.

The thing is that they aren't related at all in any way. There's no
reason to tie them together. In fact, the way we did devcg is
backward. The ideal solution would have been extending the usual ACL
to understand cgroups so that it's a natural growth of the permission
system. You're talking about actual hardware resources. That has
nothing to do with access permissions on device nodes.

> I would let you make the call.
> Rdma and others are just other types of devices with different
> characteristics than character or block, so one device cgroup with
> sub-functionalities can allow setting knobs. Every device category
> will have its own set of knobs
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hello, Parav. On Tue, Sep 08, 2015 at 02:08:16AM +0530, Parav Pandit wrote: > Currently user space applications can easily take away all the rdma > device specific resources such as AH, CQ, QP, MR etc. Due to which other > applications in other cgroup or kernel space ULPs may not even get chance > to allocate any rdma resources. Is there something simple I can read up on what each resource is? What's the usual access control mechanism? > This patch-set allows limiting rdma resources to set of processes. > It extend device cgroup controller for limiting rdma device limits. I don't think this belongs to devcg. If these make sense as a set of resources to be controlled via cgroup, the right way prolly would be a separate controller. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 2/2] RDMA/ocrdma: update ocrdma module license srting
On Fri, Jul 24, 2015 at 05:04:00AM +0530, Devesh Sharma wrote:
> replace module_license from GPL to Dual BSD/GPL
>
> Cc: Tejun Heo t...@kernel.org
> Cc: Duan Jiong duanj.f...@cn.fujitsu.com
> Cc: Roland Dreier rol...@purestorage.com
> Cc: Jes Sorensen jes.soren...@redhat.com
> Cc: Sasha Levin levinsasha...@gmail.com
> Cc: Dan Carpenter dan.carpen...@oracle.com
> Cc: Prarit Bhargava pra...@redhat.com
> Cc: Colin Ian King colin.k...@canonical.com
> Cc: Wei Yongjun yongjun_...@trendmicro.com.cn
> Cc: Moni Shoua mo...@mellanox.com
> Cc: Rasmus Villemoes li...@rasmusvillemoes.dk
> Cc: Li RongQing roy.qing...@gmail.com
> Cc: Devendra Naga devendra.a...@gmail.com
> Signed-off-by: Devesh Sharma devesh.sha...@avagotech.com

Ditto,

Acked-by: Tejun Heo t...@kernel.org

Thanks.

--
tejun
Re: [PATCH v1 1/2] RDMA/ocrdma: update ocrdma license to dual-license
On Fri, Jul 24, 2015 at 05:03:59AM +0530, Devesh Sharma wrote:
> Change of license from GPLv2 to dual-license (GPLv2 and BSD 2-Clause)
>
> Cc: Tejun Heo t...@kernel.org
> Cc: Duan Jiong duanj.f...@cn.fujitsu.com
> Cc: Roland Dreier rol...@purestorage.com
> Cc: Jes Sorensen jes.soren...@redhat.com
> Cc: Sasha Levin levinsasha...@gmail.com
> Cc: Dan Carpenter dan.carpen...@oracle.com
> Cc: Prarit Bhargava pra...@redhat.com
> Cc: Colin Ian King colin.k...@canonical.com
> Cc: Wei Yongjun yongjun_...@trendmicro.com.cn
> Cc: Moni Shoua mo...@mellanox.com
> Cc: Rasmus Villemoes li...@rasmusvillemoes.dk
> Cc: Li RongQing roy.qing...@gmail.com
> Cc: Devendra Naga devendra.a...@gmail.com
> Signed-off-by: Devesh Sharma devesh.sha...@avagotech.com

I didn't do any significant work on it but FWIW,

Acked-by: Tejun Heo t...@kernel.org

Thanks.

--
tejun
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
Hello,

On Tue, Oct 08, 2013 at 02:22:16PM +0200, Alexander Gordeev wrote:
> If we talk about pSeries quota, then the current pSeries
> pci_enable_msix() implementation is racy internally and could fail if
> the quota went down *while* pci_enable_msix() is executing. In this
> case the loop will have to exit rather than retry with a lower number
> (what number?).

Ah, okay, so that one is already broken.

> In this regard the new scheme does not bring anything new and relies
> on the fact this race does not hit and therefore does not worry. If we
> talk about quota as it has to be, then yes - the loop scheme seems
> more preferable. Overall, looks like we just need to fix the pSeries
> implementation, if the guys want it, he-he :)

If we can't figure out a better interface for the retry case, I think what can really help is having a simple interface for the simpler cases.

> > The problem case is where multiple msi(x) allocation fails completely
> > because the global limit went down before inquiry and allocation. In
> > the loop based interface, it'd retry with the lower number.
>
> I am probably missing something here. If the global limit went down
> before inquiry then the inquiry will get what is available and try to
> allocate with that number.

Oh, I should have written "between inquiry and allocation". Sorry.

Thanks.

--
tejun
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
Hello,

On Wed, Oct 09, 2013 at 02:57:16PM +0200, Alexander Gordeev wrote:
> On Mon, Oct 07, 2013 at 02:01:11PM -0400, Tejun Heo wrote:
> > Hmmm... yeah, the race condition could be an issue as multiple msi
> > allocation might fail even if the driver can and explicitly handle
> > multiple allocation if the quota gets reduced inbetween.
>
> BTW, should we care about the quota getting increased inbetween? That
> would entail.. kind of pci_get_msi_limit() :), but IMHO it is not
> worth it.

I think we shouldn't. If the resource was low during a point in time during allocation, it's fine to base the result on that - the resource was actually low and which answer we return is just a question of timing and both are correct. The only reason the existing race condition is problematic is because it may fail even if the resource never falls below the failure point.

Thanks.

--
tejun
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On Mon, Oct 07, 2013 at 09:48:01PM +0100, Ben Hutchings wrote:
> > There is one major flaw in min-max approach - the generic MSI layer
> > will have to take decisions on exact number of MSIs to request, not
> > device drivers.
> [...]
> No, the min-max functions should be implemented using the same loop
> that drivers are expected to use now.

Wheee... earlier in the thread I thought you guys were referring to yourselves in the third person and was getting a bit worried. :)

--
tejun
Re: [PATCH RFC 07/77] PCI/MSI: Re-design MSI/MSI-X interrupts enablement pattern
Hello, Alexander.

On Tue, Oct 08, 2013 at 09:48:26AM +0200, Alexander Gordeev wrote:
> > If there are many which duplicate the above pattern, it'd probably be
> > worthwhile to provide a helper? It's usually a good idea to reduce
> > the amount of boilerplate code in drivers.
>
> I wanted to limit discussion in v1 to as little changes as possible.
> I 'planned' those helper(s) for a separate effort if/when the most
> important change is accepted and soaked a bit.

The thing is doing it this way generates more churns and noises. Once the simpler ones live behind a wrapper which can be built on the existing interface, we can have both reduced cost and more latitude on the complex cases.

> > If we do things this way, it breaks all drivers using this interface
> > until they're converted, right?
>
> Right. And the rest of the series does it.

Which breaks bisection, which we shouldn't do.

> > Also, it probably isn't the best idea to flip the behavior like this
> > as this can go completely unnoticed (no compiler warning or anything,
> > the same function just behaves differently). Maybe it'd be a better
> > idea to introduce a simpler interface that most can be converted to?
>
> Well, an *other* interface is a good idea. What do you mean with the
> simpler here?

I'm still talking about a simpler wrapper for common cases, which is the important part anyway.

Thanks.

--
tejun
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
Hello,

On Tue, Oct 08, 2013 at 11:07:16AM +0200, Alexander Gordeev wrote:
> Multiple MSIs is just a handful of drivers, really. MSI-X impact still

Yes, so it's pretty nice to try out things there before going full-on.

> will be huge. But if we opt a different name for the new
> pci_enable_msix() then we could first update pci/msi, then drivers (in
> few stages possibly) and finally remove the old implementation.

Yes, that probably should be the steps to follow eventually. My point was that you don't have to submit patches for all 7x conversions for an RFC round. Scanning them and seeing what would be necessary definitely is a good idea but just giving a summary of the full conversion with several examples should be good enough before settling on the way forward, which should be easier for all involved.

Thanks a lot for your effort!

--
tejun
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
Hey, guys.

On Sun, Oct 06, 2013 at 09:10:30AM +0200, Alexander Gordeev wrote:
> On Sun, Oct 06, 2013 at 05:19:46PM +1100, Benjamin Herrenschmidt wrote:
> > On Sun, 2013-10-06 at 08:02 +0200, Alexander Gordeev wrote:
> > > In fact, in the current design to address the quota race decently
> > > the drivers would have to protect the *loop* to prevent the quota
> > > change between a pci_enable_msix() returned a positive number and
> > > the next call to pci_enable_msix() with that number. Is it doable?
> >
> > I am not advocating for the current design, simply saying that your
> > proposal doesn't address this issue while Ben's does.

Hmmm... yeah, the race condition could be an issue as multiple msi allocation might fail even if the driver can and explicitly handle multiple allocation if the quota gets reduced inbetween.

> There is one major flaw in min-max approach - the generic MSI layer
> will have to take decisions on exact number of MSIs to request, not
> device drivers.

The min-max approach would actually be pretty nice for the users which actually care about this

> This will never work for all devices, because there might be specific
> requirements which are not covered by the min-max. That is what Ben
> described ...say, any even number within a certain range. Ben suggests
> to leave the existing loop scheme to cover such devices, which I think
> is not right.

if it could work.

> What about introducing pci_lock_msi() and pci_unlock_msi() and let
> device drivers care about their ranges and specifics in race-safe
> manner? I do not call to introduce it right now (since it appears
> pSeries has not been hitting the race for years) just as a possible
> alternative to Ben's proposal.

I don't think the same race condition would happen with the loop. The problem case is where multiple msi(x) allocation fails completely because the global limit went down before inquiry and allocation. In the loop based interface, it'd retry with the lower number.

As long as the number of drivers which need this sort of adaptive allocation isn't too high and the common cases can be made simple, I don't think the complex part of the interface is all that important. Maybe we can have a reserve / cancel type interface or just keep the loop with more explicit function names (ie. try_enable or something like that).

Thanks.

--
tejun
Re: [PATCH RFC 05/77] PCI/MSI: Convert pci_msix_table_size() to a public interface
Hello,

On Wed, Oct 02, 2013 at 12:48:21PM +0200, Alexander Gordeev wrote:
> Make pci_msix_table_size() return an error code if the device does not
> support MSI-X. This update is needed to facilitate a forthcoming
> re-design of the MSI/MSI-X interrupts enabling pattern.
>
> Device drivers will use this interface to obtain the maximum number of
> MSI-X interrupts the device supports and use that value in the
> following call to the pci_enable_msix() interface.
>
> Signed-off-by: Alexander Gordeev agord...@redhat.com

Hmmm... I probably missed something but why is this necessary? To discern between -EINVAL and -ENOSPC? If so, does that really matter?

Thanks.

--
tejun
Re: [PATCH RFC 07/77] PCI/MSI: Re-design MSI/MSI-X interrupts enablement pattern
Hello,

On Wed, Oct 02, 2013 at 12:48:23PM +0200, Alexander Gordeev wrote:
> +static int foo_driver_enable_msi(struct foo_adapter *adapter, int nvec)
> +{
> +	rc = pci_get_msi_cap(adapter->pdev);
> +	if (rc < 0)
> +		return rc;
> +
> +	nvec = min(nvec, rc);
> +	if (nvec < FOO_DRIVER_MINIMUM_NVEC)
> +		return -ENOSPC;
> +
> +	rc = pci_enable_msi_block(adapter->pdev, nvec);
> +	return rc;
> +}

If there are many which duplicate the above pattern, it'd probably be worthwhile to provide a helper? It's usually a good idea to reduce the amount of boilerplate code in drivers.

> +static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
> +{
> +	rc = pci_msix_table_size(adapter->pdev);
> +	if (rc < 0)
> +		return rc;
> +
> +	nvec = min(nvec, rc);
> +	if (nvec < FOO_DRIVER_MINIMUM_NVEC)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < nvec; i++)
> +		adapter->msix_entries[i].entry = i;
> +
> +	rc = pci_enable_msix(adapter->pdev, adapter->msix_entries, nvec);
> +	return rc;
> +}

Ditto.

> @@ -975,7 +951,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
>  	if (nr_entries < 0)
>  		return nr_entries;
>  	if (nvec > nr_entries)
> -		return nr_entries;
> +		return -EINVAL;
>
>  	/* Check for any invalid entries */
>  	for (i = 0; i < nvec; i++) {

If we do things this way, it breaks all drivers using this interface until they're converted, right? Also, it probably isn't the best idea to flip the behavior like this as this can go completely unnoticed (no compiler warning or anything, the same function just behaves differently). Maybe it'd be a better idea to introduce a simpler interface that most can be converted to?

Thanks.

--
tejun
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
Hello, Alexander.

On Wed, Oct 02, 2013 at 12:48:16PM +0200, Alexander Gordeev wrote:
> Alexander Gordeev (77):
>   PCI/MSI: Fix return value when populate_msi_sysfs() failed
>   PCI/MSI/PPC: Fix wrong RTAS error code reporting
>   PCI/MSI/s390: Fix single MSI only check
>   PCI/MSI/s390: Remove superfluous check of MSI type
>   PCI/MSI: Convert pci_msix_table_size() to a public interface
>   PCI/MSI: Factor out pci_get_msi_cap() interface
>   PCI/MSI: Re-design MSI/MSI-X interrupts enablement pattern
>   PCI/MSI: Get rid of pci_enable_msi_block_auto() interface
>   ahci: Update MSI/MSI-X interrupts enablement code
>   ahci: Check MRSM bit when multiple MSIs enabled
>   ...

Whee, that's a lot more than I expected. I was just scanning multiple msi users. Maybe we can stage the work in more manageable steps so that you don't have to go through massive conversion only to do it all over again afterwards and likewise people don't get bombarded on each iteration? Maybe we can first update pci / msi code proper, then msi and then msix?

Thanks.

--
tejun
Re: [PATCH v1 2/6] amso1100: convert to using idr_alloc_cyclic
On Wed, Mar 27, 2013 at 09:18:04AM -0400, Jeff Layton wrote:
>  void c2_init_qp_table(struct c2_dev *c2dev)
>  {
>  	spin_lock_init(&c2dev->qp_table.lock);
> -	idr_init(&c2dev->qp_table.idr);
> +	idr_init_cyclic(&c2dev->qp_table.idr, 0);
>  }

Why is this necessary? In general, why is idr_init_cyclic() necessary?

Thanks.

--
tejun
Re: [PATCH -next v2] IB/ipath: use GFP_NOWAIT under spin lock
On Wed, Feb 20, 2013 at 9:45 PM, Wei Yongjun weiyj...@gmail.com wrote:
> From: Wei Yongjun yongjun_...@trendmicro.com.cn
>
> A spin lock is taken here so we should use GFP_NOWAIT like other cases.
>
> Signed-off-by: Wei Yongjun yongjun_...@trendmicro.com.cn

Acked-by: Tejun Heo t...@kernel.org

Thanks!

--
tejun
Re: [PATCH -next v2] IB/ipath: use GFP_NOWAIT under spin lock
On Thu, Feb 21, 2013 at 7:33 AM, Tejun Heo t...@kernel.org wrote:
> > Signed-off-by: Wei Yongjun yongjun_...@trendmicro.com.cn
>
> Acked-by: Tejun Heo t...@kernel.org

Just noticed akpm isn't cc'd. Can you please repost w/ a...@linux-foundation.org cc'd? All related patches are going through -mm.

Thanks.

--
tejun
Re: [PATCH 2/6] idr: remove MAX_IDR_MASK and move left MAX_IDR_* into idr.c
Hello,

On Fri, Feb 08, 2013 at 10:09:13PM +0000, Hefty, Sean wrote:
> > Used to wrap cyclic @start. Can be replaced with max(next, 0). Note
> > that this type of cyclic allocation using idr is buggy. These are
> > prone to spurious -ENOSPC failure after the first wraparound.
>
> The replacement code looks fine, but can you explain why the use is
> buggy?

So, if you want a cyclic allocation, the allocation should be tried in [start, END) and then [0, start); otherwise, after the allocation wraps for the first time, the closer the starting point gets to END, the higher the chance of not finding a vacant slot in [start, END). When @start equals END - 1 for the second time, if the first END - 1 allocation is still around, you'll get -ENOSPC.

In practice, I don't think anyone is hitting this. idr has always been horribly broken when it reaches the higher range (> 1<<30 on 64bit) so things would have broken even before the first wraparound. It still is a theoretical possibility which may trigger if idr is used for, say, ipc messages or storage commands.

Thanks.

--
tejun
[PATCH 2/6] idr: remove MAX_IDR_MASK and move left MAX_IDR_* into idr.c
MAX_IDR_MASK is another weirdness in the idr interface. As idr covers the whole positive integer range, it's defined as 0x7fffffff or INT_MAX.

Its usage in idr_find(), idr_replace() and idr_remove() is bizarre. They basically mask off the sign bit and operate on the rest, so if the caller, by accident, passes in a negative number, the sign bit will be masked off and the remaining part will be used as if that was the input, which is worse than crashing.

The constant is visible in idr.h and there are several users in the kernel.

* drivers/i2c/i2c-core.c:i2c_add_numbered_adapter()

  Basically used to test if adap->nr is a negative number which isn't -1 and returns -EINVAL if so. idr_alloc() already has negative @start checking (w/ WARN_ON_ONCE), so this can go away.

* drivers/infiniband/core/cm.c:cm_alloc_id()
  drivers/infiniband/hw/mlx4/cm.c:id_map_alloc()

  Used to wrap cyclic @start. Can be replaced with max(next, 0). Note that this type of cyclic allocation using idr is buggy. These are prone to spurious -ENOSPC failure after the first wraparound.

* fs/super.c:get_anon_bdev()

  The ID allocated from ida is masked off before being tested whether it's inside the valid range. An ida-allocated ID can never be a negative number and the masking is unnecessary.

Update idr_*() functions to fail with -EINVAL when a negative @id is specified and update the other MAX_IDR_MASK users as described above. This leaves MAX_IDR_MASK without any user; remove it and relocate the other MAX_IDR_* constants to lib/idr.c.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Jean Delvare kh...@linux-fr.org
Cc: linux-...@vger.kernel.org
Cc: Roland Dreier rol...@kernel.org
Cc: Sean Hefty sean.he...@intel.com
Cc: Hal Rosenstock hal.rosenst...@gmail.com
Cc: Marciniszyn, Mike mike.marcinis...@intel.com
Cc: Jack Morgenstein ja...@dev.mellanox.co.il
Cc: Or Gerlitz ogerl...@mellanox.com
Cc: linux-rdma@vger.kernel.org
Cc: Al Viro v...@zeniv.linux.org.uk
---
 drivers/i2c/i2c-core.c          |  2 --
 drivers/infiniband/core/cm.c    |  2 +-
 drivers/infiniband/hw/mlx4/cm.c |  2 +-
 fs/super.c                      |  2 +-
 include/linux/idr.h             | 10 ----------
 lib/idr.c                       | 24 +++++++++++++++---------
 6 files changed, 20 insertions(+), 22 deletions(-)

--- a/drivers/i2c/i2c-core.c
+++ b/drivers/i2c/i2c-core.c
@@ -978,8 +978,6 @@ int i2c_add_numbered_adapter(struct i2c_
 	if (adap->nr == -1) /* -1 means dynamically assign bus id */
 		return i2c_add_adapter(adap);
-	if (adap->nr & ~MAX_IDR_MASK)
-		return -EINVAL;

 	mutex_lock(&core_lock);
 	id = idr_alloc(&i2c_adapter_idr, adap, adap->nr, adap->nr + 1,
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -390,7 +390,7 @@ static int cm_alloc_id(struct cm_id_priv
 	id = idr_alloc(&cm.local_id_table, cm_id_priv, next_id, 0, GFP_NOWAIT);
 	if (id >= 0)
-		next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
+		next_id = max(id + 1, 0);

 	spin_unlock_irqrestore(&cm.lock, flags);
 	idr_preload_end();
--- a/drivers/infiniband/hw/mlx4/cm.c
+++ b/drivers/infiniband/hw/mlx4/cm.c
@@ -225,7 +225,7 @@ id_map_alloc(struct ib_device *ibdev, in
 	ret = idr_alloc(&sriov->pv_id_table, ent, next_id, 0, GFP_NOWAIT);
 	if (ret >= 0) {
-		next_id = ((unsigned)ret + 1) & MAX_IDR_MASK;
+		next_id = max(ret + 1, 0);
 		ent->pv_cm_id = (u32)ret;
 		sl_id_map_add(ibdev, ent);
 		list_add_tail(&ent->list, &sriov->cm_list);
--- a/fs/super.c
+++ b/fs/super.c
@@ -842,7 +842,7 @@ int get_anon_bdev(dev_t *p)
 	else if (error)
 		return -EAGAIN;

-	if ((dev & MAX_IDR_MASK) == (1 << MINORBITS)) {
+	if (dev == (1 << MINORBITS)) {
 		spin_lock(&unnamed_dev_lock);
 		ida_remove(&unnamed_dev_ida, dev);
 		if (unnamed_dev_start > dev)
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -38,16 +38,6 @@
 #define IDR_SIZE (1 << IDR_BITS)
 #define IDR_MASK ((1 << IDR_BITS)-1)

-#define MAX_IDR_SHIFT (sizeof(int)*8 - 1)
-#define MAX_IDR_BIT (1U << MAX_IDR_SHIFT)
-#define MAX_IDR_MASK (MAX_IDR_BIT - 1)
-
-/* Leave the possibility of an incomplete final layer */
-#define MAX_IDR_LEVEL ((MAX_IDR_SHIFT + IDR_BITS - 1) / IDR_BITS)
-
-/* Number of id_layer structs to leave in free list */
-#define MAX_IDR_FREE (MAX_IDR_LEVEL * 2)
-
 struct idr_layer {
 	unsigned long		bitmap;	/* A zero bit means "space here" */
 	struct idr_layer __rcu	*ary[1<<IDR_BITS];
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -38,6 +38,15 @@
 #include <linux/percpu.h>
 #include <linux/hardirq.h>

+#define MAX_IDR_SHIFT	(sizeof(int) * 8 - 1)
+#define MAX_IDR_BIT	(1U << MAX_IDR_SHIFT)
+
+/* Leave the possibility of an incomplete final layer */
+#define MAX_IDR_LEVEL ((MAX_IDR_SHIFT + IDR_BITS - 1) / IDR_BITS)
+
+/* Number of id_layer structs to leave in free list
[PATCH 42/77] IB/ehca: convert to idr_alloc()
Convert to the much saner new idr interface. Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Hoang-Nam Nguyen hngu...@de.ibm.com
Cc: Christoph Raisch rai...@de.ibm.com
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/ehca/ehca_cq.c | 27 +++++++-------------------
 drivers/infiniband/hw/ehca/ehca_qp.c | 34 +++++++++++++---------------------
 2 files changed, 22 insertions(+), 39 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c
index 8f52901..212150c 100644
--- a/drivers/infiniband/hw/ehca/ehca_cq.c
+++ b/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -128,7 +128,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 	void *vpage;
 	u32 counter;
 	u64 rpage, cqx_fec, h_ret;
-	int ipz_rc, ret, i;
+	int ipz_rc, i;
 	unsigned long flags;

 	if (cqe >= 0xFFFFFFFF - 64 - additional_cqe)
@@ -163,32 +163,19 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 	adapter_handle = shca->ipz_hca_handle;
 	param.eq_handle = shca->eq.ipz_eq_handle;

-	do {
-		if (!idr_pre_get(&ehca_cq_idr, GFP_KERNEL)) {
-			cq = ERR_PTR(-ENOMEM);
-			ehca_err(device, "Can't reserve idr nr. device=%p",
-				 device);
-			goto create_cq_exit1;
-		}
-
-		write_lock_irqsave(&ehca_cq_idr_lock, flags);
-		ret = idr_get_new(&ehca_cq_idr, my_cq, &my_cq->token);
-		write_unlock_irqrestore(&ehca_cq_idr_lock, flags);
-	} while (ret == -EAGAIN);
+	idr_preload(GFP_KERNEL);
+	write_lock_irqsave(&ehca_cq_idr_lock, flags);
+	my_cq->token = idr_alloc(&ehca_cq_idr, my_cq, 0, 0x2000000, GFP_NOWAIT);
+	write_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+	idr_preload_end();

-	if (ret) {
+	if (my_cq->token < 0) {
 		cq = ERR_PTR(-ENOMEM);
 		ehca_err(device, "Can't allocate new idr entry. device=%p",
 			 device);
 		goto create_cq_exit1;
 	}

-	if (my_cq->token > 0x1FFFFFF) {
-		cq = ERR_PTR(-ENOMEM);
-		ehca_err(device, "Invalid number of cq. device=%p", device);
-		goto create_cq_exit2;
-	}
-
 	/*
 	 * CQs maximum depth is 4GB-64, but we need additional 20 as buffer
 	 * for receiving errors CQEs.
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index 1493939..00d6861 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -636,30 +636,26 @@ static struct ehca_qp *internal_create_qp(
 		my_qp->send_cq =
 			container_of(init_attr->send_cq, struct ehca_cq, ib_cq);

-	do {
-		if (!idr_pre_get(&ehca_qp_idr, GFP_KERNEL)) {
-			ret = -ENOMEM;
-			ehca_err(pd->device, "Can't reserve idr resources.");
-			goto create_qp_exit0;
-		}
+	idr_preload(GFP_KERNEL);
+	write_lock_irqsave(&ehca_qp_idr_lock, flags);

-		write_lock_irqsave(&ehca_qp_idr_lock, flags);
-		ret = idr_get_new(&ehca_qp_idr, my_qp, &my_qp->token);
-		write_unlock_irqrestore(&ehca_qp_idr_lock, flags);
-	} while (ret == -EAGAIN);
+	ret = idr_alloc(&ehca_qp_idr, my_qp, 0, 0x2000000, GFP_NOWAIT);
+	if (ret >= 0)
+		my_qp->token = ret;

-	if (ret) {
-		ret = -ENOMEM;
-		ehca_err(pd->device, "Can't allocate new idr entry.");
+	write_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+	idr_preload_end();
+	if (ret < 0) {
+		if (ret == -ENOSPC) {
+			ret = -EINVAL;
+			ehca_err(pd->device, "Invalid number of qp");
+		} else {
+			ret = -ENOMEM;
+			ehca_err(pd->device, "Can't allocate new idr entry.");
+		}
 		goto create_qp_exit0;
 	}

-	if (my_qp->token > 0x1FFFFFF) {
-		ret = -EINVAL;
-		ehca_err(pd->device, "Invalid number of qp");
-		goto create_qp_exit1;
-	}
-
 	if (has_srq)
 		parms.srq_token = my_qp->token;
--
1.8.1
[PATCH 41/77] IB/cxgb4: convert to idr_alloc()
Convert to the much saner new idr interface. Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Reviewed-by: Steve Wise sw...@opengridcomputing.com
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index 9c1644f..7f862da 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -260,20 +260,21 @@ static inline int _insert_handle(struct c4iw_dev *rhp, struct idr *idr,
 				 void *handle, u32 id, int lock)
 {
 	int ret;
-	int newid;

-	do {
-		if (!idr_pre_get(idr, lock ? GFP_KERNEL : GFP_ATOMIC))
-			return -ENOMEM;
-		if (lock)
-			spin_lock_irq(&rhp->lock);
-		ret = idr_get_new_above(idr, handle, id, &newid);
-		BUG_ON(!ret && newid != id);
-		if (lock)
-			spin_unlock_irq(&rhp->lock);
-	} while (ret == -EAGAIN);
-
-	return ret;
+	if (lock) {
+		idr_preload(GFP_KERNEL);
+		spin_lock_irq(&rhp->lock);
+	}
+
+	ret = idr_alloc(idr, handle, id, id + 1, GFP_ATOMIC);
+
+	if (lock) {
+		spin_unlock_irq(&rhp->lock);
+		idr_preload_end();
+	}
+
+	BUG_ON(ret == -ENOSPC);
+	return ret < 0 ? ret : 0;
 }

 static inline int insert_handle(struct c4iw_dev *rhp, struct idr *idr,
--
1.8.1
[PATCH 45/77] IB/ocrdma: convert to idr_alloc()
Convert to the much saner new idr interface. Only compile tested.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Roland Dreier rol...@purestorage.com
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/ocrdma/ocrdma_main.c | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index c4e0131..48928c8 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -51,18 +51,6 @@ static DEFINE_IDR(ocrdma_dev_id);

 static union ib_gid ocrdma_zero_sgid;

-static int ocrdma_get_instance(void)
-{
-	int instance = 0;
-
-	/* Assign an unused number */
-	if (!idr_pre_get(&ocrdma_dev_id, GFP_KERNEL))
-		return -1;
-	if (idr_get_new(&ocrdma_dev_id, NULL, &instance))
-		return -1;
-	return instance;
-}
-
 void ocrdma_get_guid(struct ocrdma_dev *dev, u8 *guid)
 {
 	u8 mac_addr[6];
@@ -416,7 +404,7 @@ static struct ocrdma_dev *ocrdma_add(struct be_dev_info *dev_info)
 		goto idr_err;

 	memcpy(&dev->nic_info, dev_info, sizeof(*dev_info));
-	dev->id = ocrdma_get_instance();
+	dev->id = idr_alloc(&ocrdma_dev_id, NULL, 0, 0, GFP_KERNEL);
 	if (dev->id < 0)
 		goto idr_err;
--
1.8.1
[PATCH 46/77] IB/qib: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Mike Marciniszyn <infinip...@intel.com>
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/qib/qib_init.c | 21 ++++++++++-----------
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
index ddf066d..50e33aa 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -1060,22 +1060,23 @@ struct qib_devdata *qib_alloc_devdata(struct pci_dev *pdev, size_t extra)
 	struct qib_devdata *dd;
 	int ret;
 
-	if (!idr_pre_get(&qib_unit_table, GFP_KERNEL)) {
-		dd = ERR_PTR(-ENOMEM);
-		goto bail;
-	}
-
 	dd = (struct qib_devdata *) ib_alloc_device(sizeof(*dd) + extra);
 	if (!dd) {
 		dd = ERR_PTR(-ENOMEM);
 		goto bail;
 	}
 
+	idr_preload(GFP_KERNEL);
 	spin_lock_irqsave(&qib_devs_lock, flags);
-	ret = idr_get_new(&qib_unit_table, dd, &dd->unit);
-	if (ret >= 0)
+
+	ret = idr_alloc(&qib_unit_table, dd, 0, 0, GFP_NOWAIT);
+	if (ret >= 0) {
+		dd->unit = ret;
 		list_add(&dd->list, &qib_dev_list);
+	}
+
 	spin_unlock_irqrestore(&qib_devs_lock, flags);
+	idr_preload_end();
 
 	if (ret < 0) {
 		qib_early_err(&pdev->dev,
@@ -1180,11 +1181,6 @@ static int __init qlogic_ib_init(void)
 	 * the PCI subsystem.
 	 */
 	idr_init(&qib_unit_table);
-	if (!idr_pre_get(&qib_unit_table, GFP_KERNEL)) {
-		pr_err("idr_pre_get() failed\n");
-		ret = -ENOMEM;
-		goto bail_cq_wq;
-	}
 
 	ret = pci_register_driver(&qib_driver);
 	if (ret < 0) {
@@ -1199,7 +1195,6 @@ static int __init qlogic_ib_init(void)
 
 bail_unit:
 	idr_destroy(&qib_unit_table);
-bail_cq_wq:
 	destroy_workqueue(qib_cq_wq);
 bail_dev:
 	qib_dev_cleanup();
-- 
1.8.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 44/77] IB/mlx4: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Jack Morgenstein <ja...@dev.mellanox.co.il>
Cc: Or Gerlitz <ogerl...@mellanox.com>
Cc: Roland Dreier <rol...@purestorage.com>
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/mlx4/cm.c | 32 +++++++++++++++-----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
index dbc99d4..80e59ed 100644
--- a/drivers/infiniband/hw/mlx4/cm.c
+++ b/drivers/infiniband/hw/mlx4/cm.c
@@ -203,7 +203,7 @@ static void sl_id_map_add(struct ib_device *ibdev, struct id_map_entry *new)
 static struct id_map_entry *
 id_map_alloc(struct ib_device *ibdev, int slave_id, u32 sl_cm_id)
 {
-	int ret, id;
+	int ret;
 	static int next_id;
 	struct id_map_entry *ent;
 	struct mlx4_ib_sriov *sriov = &to_mdev(ibdev)->sriov;
@@ -220,25 +220,23 @@ id_map_alloc(struct ib_device *ibdev, int slave_id, u32 sl_cm_id)
 	ent->dev = to_mdev(ibdev);
 	INIT_DELAYED_WORK(&ent->timeout, id_map_ent_timeout);
 
-	do {
-		spin_lock(&to_mdev(ibdev)->sriov.id_map_lock);
-		ret = idr_get_new_above(&sriov->pv_id_table, ent,
-					next_id, &id);
-		if (!ret) {
-			next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
-			ent->pv_cm_id = (u32)id;
-			sl_id_map_add(ibdev, ent);
-		}
+	idr_preload(GFP_KERNEL);
+	spin_lock(&to_mdev(ibdev)->sriov.id_map_lock);
 
-		spin_unlock(&sriov->id_map_lock);
-	} while (ret == -EAGAIN && idr_pre_get(&sriov->pv_id_table, GFP_KERNEL));
-	/*the function idr_get_new_above can return -ENOSPC, so don't insert in that case.*/
-	if (!ret) {
-		spin_lock(&sriov->id_map_lock);
+	ret = idr_alloc(&sriov->pv_id_table, ent, next_id, 0, GFP_NOWAIT);
+	if (ret >= 0) {
+		next_id = ((unsigned)ret + 1) & MAX_IDR_MASK;
+		ent->pv_cm_id = (u32)ret;
+		sl_id_map_add(ibdev, ent);
 		list_add_tail(&ent->list, &sriov->cm_list);
-		spin_unlock(&sriov->id_map_lock);
-		return ent;
 	}
+
+	spin_unlock(&sriov->id_map_lock);
+	idr_preload_end();
+
+	if (ret >= 0)
+		return ent;
+
 	/*error flow*/
 	kfree(ent);
 	mlx4_ib_warn(ibdev, "No more space in the idr (err:0x%x)\n", ret);
-- 
1.8.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 43/77] IB/ipath: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Mike Marciniszyn <infinip...@intel.com>
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/ipath/ipath_driver.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index 7b371f5..fcdaeea 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -194,11 +194,6 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 	struct ipath_devdata *dd;
 	int ret;
 
-	if (!idr_pre_get(&unit_table, GFP_KERNEL)) {
-		dd = ERR_PTR(-ENOMEM);
-		goto bail;
-	}
-
 	dd = vzalloc(sizeof(*dd));
 	if (!dd) {
 		dd = ERR_PTR(-ENOMEM);
@@ -206,9 +201,10 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 	}
 	dd->ipath_unit = -1;
 
+	idr_preload(GFP_KERNEL);
 	spin_lock_irqsave(&ipath_devs_lock, flags);
 
-	ret = idr_get_new(&unit_table, dd, &dd->ipath_unit);
+	ret = idr_alloc(&unit_table, dd, 0, 0, GFP_KERNEL);
 	if (ret < 0) {
 		printk(KERN_ERR IPATH_DRV_NAME
 		       ": Could not allocate unit ID: error %d\n", -ret);
@@ -216,6 +212,7 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 		dd = ERR_PTR(ret);
 		goto bail_unlock;
 	}
+	dd->ipath_unit = ret;
 
 	dd->pcidev = pdev;
 	pci_set_drvdata(pdev, dd);
@@ -224,7 +221,7 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 
 bail_unlock:
 	spin_unlock_irqrestore(&ipath_devs_lock, flags);
-
+	idr_preload_end();
 bail:
 	return dd;
 }
@@ -2503,11 +2500,6 @@ static int __init infinipath_init(void)
 	 * the PCI subsystem.
 	 */
 	idr_init(&unit_table);
-	if (!idr_pre_get(&unit_table, GFP_KERNEL)) {
-		printk(KERN_ERR IPATH_DRV_NAME ": idr_pre_get() failed\n");
-		ret = -ENOMEM;
-		goto bail;
-	}
 
 	ret = pci_register_driver(&ipath_driver);
 	if (ret < 0) {
-- 
1.8.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 40/77] IB/cxgb3: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

Signed-off-by: Tejun Heo <t...@kernel.org>
Reviewed-by: Steve Wise <sw...@opengridcomputing.com>
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/cxgb3/iwch.h | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h
index a1c4457..8378622 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.h
+++ b/drivers/infiniband/hw/cxgb3/iwch.h
@@ -153,19 +153,17 @@ static inline int insert_handle(struct iwch_dev *rhp, struct idr *idr,
 				void *handle, u32 id)
 {
 	int ret;
-	int newid;
-
-	do {
-		if (!idr_pre_get(idr, GFP_KERNEL)) {
-			return -ENOMEM;
-		}
-		spin_lock_irq(&rhp->lock);
-		ret = idr_get_new_above(idr, handle, id, &newid);
-		BUG_ON(newid != id);
-		spin_unlock_irq(&rhp->lock);
-	} while (ret == -EAGAIN);
-
-	return ret;
+
+	idr_preload(GFP_KERNEL);
+	spin_lock_irq(&rhp->lock);
+
+	ret = idr_alloc(idr, handle, id, id + 1, GFP_NOWAIT);
+
+	spin_unlock_irq(&rhp->lock);
+	idr_preload_end();
+
+	BUG_ON(ret == -ENOSPC);
+	return ret < 0 ? ret : 0;
 }
 
 static inline void remove_handle(struct iwch_dev *rhp, struct idr *idr, u32 id)
-- 
1.8.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 38/77] IB/core: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

v2: Mike triggered WARN_ON() in idr_preload() because send_mad(),
    which may be used from non-process context, was calling
    idr_preload() unconditionally.  Preload iff @gfp_mask has
    __GFP_WAIT.

Signed-off-by: Tejun Heo <t...@kernel.org>
Reviewed-by: Sean Hefty <sean.he...@intel.com>
Reported-by: "Marciniszyn, Mike" <mike.marcinis...@intel.com>
Cc: Roland Dreier <rol...@kernel.org>
Cc: Sean Hefty <sean.he...@intel.com>
Cc: Hal Rosenstock <hal.rosenst...@gmail.com>
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/core/cm.c         | 22 +++++++++++---------
 drivers/infiniband/core/cma.c        | 24 +++++++-------------
 drivers/infiniband/core/sa_query.c   | 18 ++++++++--------
 drivers/infiniband/core/ucm.c        | 16 ++++++--------
 drivers/infiniband/core/ucma.c       | 32 ++++++++++------------------
 drivers/infiniband/core/uverbs_cmd.c | 17 +++++++--------
 6 files changed, 48 insertions(+), 81 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 394fea2..98281fe 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -382,20 +382,21 @@ static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av)
 static int cm_alloc_id(struct cm_id_private *cm_id_priv)
 {
 	unsigned long flags;
-	int ret, id;
+	int id;
 	static int next_id;
 
-	do {
-		spin_lock_irqsave(&cm.lock, flags);
-		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
-					next_id, &id);
-		if (!ret)
-			next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
-		spin_unlock_irqrestore(&cm.lock, flags);
-	} while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) );
+	idr_preload(GFP_KERNEL);
+	spin_lock_irqsave(&cm.lock, flags);
+
+	id = idr_alloc(&cm.local_id_table, cm_id_priv, next_id, 0, GFP_NOWAIT);
+	if (id >= 0)
+		next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
+
+	spin_unlock_irqrestore(&cm.lock, flags);
+	idr_preload_end();
 
 	cm_id_priv->id.local_id = (__force __be32)id ^ cm.random_id_operand;
-	return ret;
+	return id < 0 ? id : 0;
 }
 
 static void cm_free_id(__be32 local_id)
@@ -3844,7 +3845,6 @@ static int __init ib_cm_init(void)
 	cm.remote_sidr_table = RB_ROOT;
 	idr_init(&cm.local_id_table);
 	get_random_bytes(&cm.random_id_operand, sizeof cm.random_id_operand);
-	idr_pre_get(&cm.local_id_table, GFP_KERNEL);
 	INIT_LIST_HEAD(&cm.timewait_list);
 
 	ret = class_register(&cm_class);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index d789eea..c32eeaa 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2143,33 +2143,23 @@ static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv,
 			  unsigned short snum)
 {
 	struct rdma_bind_list *bind_list;
-	int port, ret;
+	int ret;
 
 	bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL);
 	if (!bind_list)
 		return -ENOMEM;
 
-	do {
-		ret = idr_get_new_above(ps, bind_list, snum, &port);
-	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
-
-	if (ret)
-		goto err1;
-
-	if (port != snum) {
-		ret = -EADDRNOTAVAIL;
-		goto err2;
-	}
+	ret = idr_alloc(ps, bind_list, snum, snum + 1, GFP_KERNEL);
+	if (ret < 0)
+		goto err;
 
 	bind_list->ps = ps;
-	bind_list->port = (unsigned short) port;
+	bind_list->port = (unsigned short)ret;
 	cma_bind_port(bind_list, id_priv);
 	return 0;
-err2:
-	idr_remove(ps, port);
-err1:
+err:
 	kfree(bind_list);
-	return ret;
+	return ret == -ENOSPC ? -EADDRNOTAVAIL : ret;
 }
 
 static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index a8905ab..934f45e 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -611,19 +611,21 @@ static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
 
 static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
 {
+	bool preload = gfp_mask & __GFP_WAIT;
 	unsigned long flags;
 	int ret, id;
 
-retry:
-	if (!idr_pre_get(&query_idr, gfp_mask))
-		return -ENOMEM;
+	if (preload)
+		idr_preload(gfp_mask);
 	spin_lock_irqsave(&idr_lock, flags);
-	ret = idr_get_new(&query_idr, query, &id);
+
+	id = idr_alloc(&query_idr, query, 0, 0, GFP_NOWAIT);
+
 	spin_unlock_irqrestore(&idr_lock, flags);
-	if (ret == -EAGAIN)
-		goto retry;
-	if (ret)
-		return ret
Re: [PATCH 27/62] infiniband/ipath: convert to idr_alloc()
Hello,

On Mon, Feb 04, 2013 at 04:15:52PM +0000, Marciniszyn, Mike wrote:
> I tried the branch you indicated in the initial patch cover.
>
> When run with a qib driver, an ipoib ping of another system produces:
> ...
> Looks like this is tripping during the arp/neighbour path resolution:
>
>   void idr_preload(gfp_t gfp_mask)
>   {
>           /*
>            * Consuming preload buffer from non-process context breaks preload
>            * allocation guarantee.  Disallow usage from those contexts.
>            */
>           WARN_ON_ONCE(in_interrupt());
>
> Any ideas Roland?

Yeah, firewire had the same problem.  It needs to skip idr_preload() if the gfp mask doesn't have __GFP_WAIT (there's no point in preloading for atomic allocations anyway).  Will send updated patches soon.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH v2 22/62] infiniband/core: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

v2: Mike triggered WARN_ON() in idr_preload() because send_mad(),
    which may be used from non-process context, was calling
    idr_preload() unconditionally.  Preload iff @gfp_mask has
    __GFP_WAIT.

Signed-off-by: Tejun Heo <t...@kernel.org>
Reported-by: "Marciniszyn, Mike" <mike.marcinis...@intel.com>
Cc: Roland Dreier <rol...@kernel.org>
Cc: Sean Hefty <sean.he...@intel.com>
Cc: Hal Rosenstock <hal.rosenst...@gmail.com>
Cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/core/cm.c         | 22 +++++++++++---------
 drivers/infiniband/core/cma.c        | 24 +++++++-------------
 drivers/infiniband/core/sa_query.c   | 18 ++++++++--------
 drivers/infiniband/core/ucm.c        | 16 ++++++--------
 drivers/infiniband/core/ucma.c       | 32 ++++++++++------------------
 drivers/infiniband/core/uverbs_cmd.c | 17 +++++++--------
 6 files changed, 48 insertions(+), 81 deletions(-)

--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -382,20 +382,21 @@ static int cm_init_av_by_path(struct ib_
 static int cm_alloc_id(struct cm_id_private *cm_id_priv)
 {
 	unsigned long flags;
-	int ret, id;
+	int id;
 	static int next_id;
 
-	do {
-		spin_lock_irqsave(&cm.lock, flags);
-		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
-					next_id, &id);
-		if (!ret)
-			next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
-		spin_unlock_irqrestore(&cm.lock, flags);
-	} while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) );
+	idr_preload(GFP_KERNEL);
+	spin_lock_irqsave(&cm.lock, flags);
+
+	id = idr_alloc(&cm.local_id_table, cm_id_priv, next_id, 0, GFP_NOWAIT);
+	if (id >= 0)
+		next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
+
+	spin_unlock_irqrestore(&cm.lock, flags);
+	idr_preload_end();
 
 	cm_id_priv->id.local_id = (__force __be32)id ^ cm.random_id_operand;
-	return ret;
+	return id < 0 ? id : 0;
 }
 
 static void cm_free_id(__be32 local_id)
@@ -3844,7 +3845,6 @@ static int __init ib_cm_init(void)
 	cm.remote_sidr_table = RB_ROOT;
 	idr_init(&cm.local_id_table);
 	get_random_bytes(&cm.random_id_operand, sizeof cm.random_id_operand);
-	idr_pre_get(&cm.local_id_table, GFP_KERNEL);
 	INIT_LIST_HEAD(&cm.timewait_list);
 
 	ret = class_register(&cm_class);
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2143,33 +2143,23 @@ static int cma_alloc_port(struct idr *ps
 			  unsigned short snum)
 {
 	struct rdma_bind_list *bind_list;
-	int port, ret;
+	int ret;
 
 	bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL);
 	if (!bind_list)
 		return -ENOMEM;
 
-	do {
-		ret = idr_get_new_above(ps, bind_list, snum, &port);
-	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
-
-	if (ret)
-		goto err1;
-
-	if (port != snum) {
-		ret = -EADDRNOTAVAIL;
-		goto err2;
-	}
+	ret = idr_alloc(ps, bind_list, snum, snum + 1, GFP_KERNEL);
+	if (ret < 0)
+		goto err;
 
 	bind_list->ps = ps;
-	bind_list->port = (unsigned short) port;
+	bind_list->port = (unsigned short)ret;
 	cma_bind_port(bind_list, id_priv);
 	return 0;
-err2:
-	idr_remove(ps, port);
-err1:
+err:
 	kfree(bind_list);
-	return ret;
+	return ret == -ENOSPC ? -EADDRNOTAVAIL : ret;
 }
 
 static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -611,19 +611,21 @@ static void init_mad(struct ib_sa_mad *m
 
 static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
 {
+	bool preload = gfp_mask & __GFP_WAIT;
 	unsigned long flags;
 	int ret, id;
 
-retry:
-	if (!idr_pre_get(&query_idr, gfp_mask))
-		return -ENOMEM;
+	if (preload)
+		idr_preload(gfp_mask);
 	spin_lock_irqsave(&idr_lock, flags);
-	ret = idr_get_new(&query_idr, query, &id);
+
+	id = idr_alloc(&query_idr, query, 0, 0, GFP_NOWAIT);
+
 	spin_unlock_irqrestore(&idr_lock, flags);
-	if (ret == -EAGAIN)
-		goto retry;
-	if (ret)
-		return ret;
+	if (preload)
+		idr_preload_end();
+	if (id < 0)
+		return id;
 
 	query->mad_buf->timeout_ms  = timeout_ms;
 	query->mad_buf->context[0] = query;
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -176,7 +176,6 @@ static void ib_ucm_cleanup_events(struct
 static struct ib_ucm_context *ib_ucm_ctx_alloc(struct ib_ucm_file *file)
 {
 	struct ib_ucm_context *ctx;
-	int result
Re: [PATCH 25/62] infiniband/cxgb4: convert to idr_alloc()
Hello,

On Sun, Feb 3, 2013 at 6:18 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
> Is there a git tree somewhere that I can use to test these patches out?

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git convert-to-idr_alloc

>> -     return ret;
>> +     if (lock) {
>> +             idr_preload(GFP_KERNEL);
>> +             spin_lock_irq(&rhp->lock);
>> +     }
>
> The idr_preload() needs to be above the 'if (lock)', no?

No reason to preload for an ATOMIC allocation.  idr_alloc() can be
called by itself.

>> +
>> +     ret = idr_alloc(idr, handle, id, id + 1, GFP_ATOMIC);
>> +
>> +     if (lock) {
>> +             spin_unlock_irq(&rhp->lock);
>> +             idr_preload_end();
>> +     }
>
> And idr_preload_end() should be after the 'if (lock)' block methinks...

Ditto.

>> +
>> +     BUG_ON(ret == -ENOSPC);
>> +     return ret < 0 ? ret : 0;
>
> What would cause ret > 0?

It's the allocated id.  In this case, ret would either be @id or -errno.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 22/62] infiniband/core: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Roland Dreier <rol...@kernel.org>
Cc: Sean Hefty <sean.he...@intel.com>
Cc: Hal Rosenstock <hal.rosenst...@gmail.com>
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be best
to route these together through -mm.  Please holler if there's any
objection.

Thanks.

 drivers/infiniband/core/cm.c         | 22 +++++++++++---------
 drivers/infiniband/core/cma.c        | 24 +++++++-------------
 drivers/infiniband/core/sa_query.c   | 15 +++++++-------
 drivers/infiniband/core/ucm.c        | 16 ++++++--------
 drivers/infiniband/core/ucma.c       | 32 ++++++++++------------------
 drivers/infiniband/core/uverbs_cmd.c | 17 +++++++--------
 6 files changed, 45 insertions(+), 81 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 394fea2..98281fe 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -382,20 +382,21 @@ static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av)
 static int cm_alloc_id(struct cm_id_private *cm_id_priv)
 {
 	unsigned long flags;
-	int ret, id;
+	int id;
 	static int next_id;
 
-	do {
-		spin_lock_irqsave(&cm.lock, flags);
-		ret = idr_get_new_above(&cm.local_id_table, cm_id_priv,
-					next_id, &id);
-		if (!ret)
-			next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
-		spin_unlock_irqrestore(&cm.lock, flags);
-	} while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) );
+	idr_preload(GFP_KERNEL);
+	spin_lock_irqsave(&cm.lock, flags);
+
+	id = idr_alloc(&cm.local_id_table, cm_id_priv, next_id, 0, GFP_NOWAIT);
+	if (id >= 0)
+		next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
+
+	spin_unlock_irqrestore(&cm.lock, flags);
+	idr_preload_end();
 
 	cm_id_priv->id.local_id = (__force __be32)id ^ cm.random_id_operand;
-	return ret;
+	return id < 0 ? id : 0;
 }
 
 static void cm_free_id(__be32 local_id)
@@ -3844,7 +3845,6 @@ static int __init ib_cm_init(void)
 	cm.remote_sidr_table = RB_ROOT;
 	idr_init(&cm.local_id_table);
 	get_random_bytes(&cm.random_id_operand, sizeof cm.random_id_operand);
-	idr_pre_get(&cm.local_id_table, GFP_KERNEL);
 	INIT_LIST_HEAD(&cm.timewait_list);
 
 	ret = class_register(&cm_class);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index d789eea..c32eeaa 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2143,33 +2143,23 @@ static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv,
 			  unsigned short snum)
 {
 	struct rdma_bind_list *bind_list;
-	int port, ret;
+	int ret;
 
 	bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL);
 	if (!bind_list)
 		return -ENOMEM;
 
-	do {
-		ret = idr_get_new_above(ps, bind_list, snum, &port);
-	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
-
-	if (ret)
-		goto err1;
-
-	if (port != snum) {
-		ret = -EADDRNOTAVAIL;
-		goto err2;
-	}
+	ret = idr_alloc(ps, bind_list, snum, snum + 1, GFP_KERNEL);
+	if (ret < 0)
+		goto err;
 
 	bind_list->ps = ps;
-	bind_list->port = (unsigned short) port;
+	bind_list->port = (unsigned short)ret;
 	cma_bind_port(bind_list, id_priv);
 	return 0;
-err2:
-	idr_remove(ps, port);
-err1:
+err:
 	kfree(bind_list);
-	return ret;
+	return ret == -ENOSPC ? -EADDRNOTAVAIL : ret;
 }
 
 static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index a8905ab..dc64bae 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -614,16 +614,15 @@ static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
 	unsigned long flags;
 	int ret, id;
 
-retry:
-	if (!idr_pre_get(&query_idr, gfp_mask))
-		return -ENOMEM;
+	idr_preload(gfp_mask);
 	spin_lock_irqsave(&idr_lock, flags);
-	ret = idr_get_new(&query_idr, query, &id);
+
+	id = idr_alloc(&query_idr, query, 0, 0, GFP_NOWAIT);
+
 	spin_unlock_irqrestore(&idr_lock, flags);
-	if (ret == -EAGAIN)
-		goto retry;
-	if (ret)
-		return ret;
+	idr_preload_end();
+	if (id < 0)
+		return id;
 
 	query->mad_buf->timeout_ms  = timeout_ms;
 	query->mad_buf->context[0] = query;
diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 49b15ac..f2f6393 100644
--- a/drivers/infiniband/core/ucm.c
[PATCH 29/62] infiniband/ocrdma: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Roland Dreier <rol...@purestorage.com>
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be best
to route these together through -mm.  Please holler if there's any
objection.

Thanks.

 drivers/infiniband/hw/ocrdma/ocrdma_main.c | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index c4e0131..48928c8 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -51,18 +51,6 @@ static DEFINE_IDR(ocrdma_dev_id);
 
 static union ib_gid ocrdma_zero_sgid;
 
-static int ocrdma_get_instance(void)
-{
-	int instance = 0;
-
-	/* Assign an unused number */
-	if (!idr_pre_get(&ocrdma_dev_id, GFP_KERNEL))
-		return -1;
-	if (idr_get_new(&ocrdma_dev_id, NULL, &instance))
-		return -1;
-	return instance;
-}
-
 void ocrdma_get_guid(struct ocrdma_dev *dev, u8 *guid)
 {
 	u8 mac_addr[6];
@@ -416,7 +404,7 @@ static struct ocrdma_dev *ocrdma_add(struct be_dev_info *dev_info)
 		goto idr_err;
 
 	memcpy(&dev->nic_info, dev_info, sizeof(*dev_info));
-	dev->id = ocrdma_get_instance();
+	dev->id = idr_alloc(&ocrdma_dev_id, NULL, 0, 0, GFP_KERNEL);
 	if (dev->id < 0)
 		goto idr_err;
 
-- 
1.8.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 28/62] infiniband/mlx4: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Jack Morgenstein <ja...@dev.mellanox.co.il>
Cc: Or Gerlitz <ogerl...@mellanox.com>
Cc: Roland Dreier <rol...@purestorage.com>
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be best
to route these together through -mm.  Please holler if there's any
objection.

Thanks.

 drivers/infiniband/hw/mlx4/cm.c | 32 +++++++++++++++-----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
index dbc99d4..80e59ed 100644
--- a/drivers/infiniband/hw/mlx4/cm.c
+++ b/drivers/infiniband/hw/mlx4/cm.c
@@ -203,7 +203,7 @@ static void sl_id_map_add(struct ib_device *ibdev, struct id_map_entry *new)
 static struct id_map_entry *
 id_map_alloc(struct ib_device *ibdev, int slave_id, u32 sl_cm_id)
 {
-	int ret, id;
+	int ret;
 	static int next_id;
 	struct id_map_entry *ent;
 	struct mlx4_ib_sriov *sriov = &to_mdev(ibdev)->sriov;
@@ -220,25 +220,23 @@ id_map_alloc(struct ib_device *ibdev, int slave_id, u32 sl_cm_id)
 	ent->dev = to_mdev(ibdev);
 	INIT_DELAYED_WORK(&ent->timeout, id_map_ent_timeout);
 
-	do {
-		spin_lock(&to_mdev(ibdev)->sriov.id_map_lock);
-		ret = idr_get_new_above(&sriov->pv_id_table, ent,
-					next_id, &id);
-		if (!ret) {
-			next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
-			ent->pv_cm_id = (u32)id;
-			sl_id_map_add(ibdev, ent);
-		}
+	idr_preload(GFP_KERNEL);
+	spin_lock(&to_mdev(ibdev)->sriov.id_map_lock);
 
-		spin_unlock(&sriov->id_map_lock);
-	} while (ret == -EAGAIN && idr_pre_get(&sriov->pv_id_table, GFP_KERNEL));
-	/*the function idr_get_new_above can return -ENOSPC, so don't insert in that case.*/
-	if (!ret) {
-		spin_lock(&sriov->id_map_lock);
+	ret = idr_alloc(&sriov->pv_id_table, ent, next_id, 0, GFP_NOWAIT);
+	if (ret >= 0) {
+		next_id = ((unsigned)ret + 1) & MAX_IDR_MASK;
+		ent->pv_cm_id = (u32)ret;
+		sl_id_map_add(ibdev, ent);
 		list_add_tail(&ent->list, &sriov->cm_list);
-		spin_unlock(&sriov->id_map_lock);
-		return ent;
 	}
+
+	spin_unlock(&sriov->id_map_lock);
+	idr_preload_end();
+
+	if (ret >= 0)
+		return ent;
+
 	/*error flow*/
 	kfree(ent);
 	mlx4_ib_warn(ibdev, "No more space in the idr (err:0x%x)\n", ret);
-- 
1.8.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 30/62] infiniband/qib: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Mike Marciniszyn <infinip...@intel.com>
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be best
to route these together through -mm.  Please holler if there's any
objection.

Thanks.

 drivers/infiniband/hw/qib/qib_init.c | 21 ++++++++++-----------
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
index ddf066d..50e33aa 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -1060,22 +1060,23 @@ struct qib_devdata *qib_alloc_devdata(struct pci_dev *pdev, size_t extra)
 	struct qib_devdata *dd;
 	int ret;
 
-	if (!idr_pre_get(&qib_unit_table, GFP_KERNEL)) {
-		dd = ERR_PTR(-ENOMEM);
-		goto bail;
-	}
-
 	dd = (struct qib_devdata *) ib_alloc_device(sizeof(*dd) + extra);
 	if (!dd) {
 		dd = ERR_PTR(-ENOMEM);
 		goto bail;
 	}
 
+	idr_preload(GFP_KERNEL);
 	spin_lock_irqsave(&qib_devs_lock, flags);
-	ret = idr_get_new(&qib_unit_table, dd, &dd->unit);
-	if (ret >= 0)
+
+	ret = idr_alloc(&qib_unit_table, dd, 0, 0, GFP_NOWAIT);
+	if (ret >= 0) {
+		dd->unit = ret;
 		list_add(&dd->list, &qib_dev_list);
+	}
+
 	spin_unlock_irqrestore(&qib_devs_lock, flags);
+	idr_preload_end();
 
 	if (ret < 0) {
 		qib_early_err(&pdev->dev,
@@ -1180,11 +1181,6 @@ static int __init qlogic_ib_init(void)
 	 * the PCI subsystem.
 	 */
 	idr_init(&qib_unit_table);
-	if (!idr_pre_get(&qib_unit_table, GFP_KERNEL)) {
-		pr_err("idr_pre_get() failed\n");
-		ret = -ENOMEM;
-		goto bail_cq_wq;
-	}
 
 	ret = pci_register_driver(&qib_driver);
 	if (ret < 0) {
@@ -1199,7 +1195,6 @@ static int __init qlogic_ib_init(void)
 
 bail_unit:
 	idr_destroy(&qib_unit_table);
-bail_cq_wq:
 	destroy_workqueue(qib_cq_wq);
 bail_dev:
 	qib_dev_cleanup();
-- 
1.8.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 27/62] infiniband/ipath: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Mike Marciniszyn <infinip...@intel.com>
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be best
to route these together through -mm.  Please holler if there's any
objection.

Thanks.

 drivers/infiniband/hw/ipath/ipath_driver.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index 7b371f5..fcdaeea 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -194,11 +194,6 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 	struct ipath_devdata *dd;
 	int ret;
 
-	if (!idr_pre_get(&unit_table, GFP_KERNEL)) {
-		dd = ERR_PTR(-ENOMEM);
-		goto bail;
-	}
-
 	dd = vzalloc(sizeof(*dd));
 	if (!dd) {
 		dd = ERR_PTR(-ENOMEM);
@@ -206,9 +201,10 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 	}
 	dd->ipath_unit = -1;
 
+	idr_preload(GFP_KERNEL);
 	spin_lock_irqsave(&ipath_devs_lock, flags);
 
-	ret = idr_get_new(&unit_table, dd, &dd->ipath_unit);
+	ret = idr_alloc(&unit_table, dd, 0, 0, GFP_KERNEL);
 	if (ret < 0) {
 		printk(KERN_ERR IPATH_DRV_NAME
 		       ": Could not allocate unit ID: error %d\n", -ret);
@@ -216,6 +212,7 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 		dd = ERR_PTR(ret);
 		goto bail_unlock;
 	}
+	dd->ipath_unit = ret;
 
 	dd->pcidev = pdev;
 	pci_set_drvdata(pdev, dd);
@@ -224,7 +221,7 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev)
 
 bail_unlock:
 	spin_unlock_irqrestore(&ipath_devs_lock, flags);
-
+	idr_preload_end();
 bail:
 	return dd;
 }
@@ -2503,11 +2500,6 @@ static int __init infinipath_init(void)
 	 * the PCI subsystem.
 	 */
 	idr_init(&unit_table);
-	if (!idr_pre_get(&unit_table, GFP_KERNEL)) {
-		printk(KERN_ERR IPATH_DRV_NAME ": idr_pre_get() failed\n");
-		ret = -ENOMEM;
-		goto bail;
-	}
 
 	ret = pci_register_driver(&ipath_driver);
 	if (ret < 0) {
-- 
1.8.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 23/62] infiniband/amso1100: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Tom Tucker <t...@opengridcomputing.com>
Cc: Steve Wise <sw...@opengridcomputing.com>
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be best
to route these together through -mm.  Please holler if there's any
objection.

Thanks.

 drivers/infiniband/hw/amso1100/c2_qp.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c b/drivers/infiniband/hw/amso1100/c2_qp.c
index 28cd5cb..0ab826b 100644
--- a/drivers/infiniband/hw/amso1100/c2_qp.c
+++ b/drivers/infiniband/hw/amso1100/c2_qp.c
@@ -382,14 +382,17 @@ static int c2_alloc_qpn(struct c2_dev *c2dev, struct c2_qp *qp)
 {
 	int ret;
 
-	do {
-		spin_lock_irq(&c2dev->qp_table.lock);
-		ret = idr_get_new_above(&c2dev->qp_table.idr, qp,
-					c2dev->qp_table.last++, &qp->qpn);
-		spin_unlock_irq(&c2dev->qp_table.lock);
-	} while ((ret == -EAGAIN) &&
-		 idr_pre_get(&c2dev->qp_table.idr, GFP_KERNEL));
-	return ret;
+	idr_preload(GFP_KERNEL);
+	spin_lock_irq(&c2dev->qp_table.lock);
+
+	ret = idr_alloc(&c2dev->qp_table.idr, qp, c2dev->qp_table.last++, 0,
+			GFP_NOWAIT);
+	if (ret >= 0)
+		qp->qpn = ret;
+
+	spin_unlock_irq(&c2dev->qp_table.lock);
+	idr_preload_end();
+	return ret < 0 ? ret : 0;
 }
 
 static void c2_free_qpn(struct c2_dev *c2dev, int qpn)
-- 
1.8.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 26/62] infiniband/ehca: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Hoang-Nam Nguyen <hngu...@de.ibm.com>
Cc: Christoph Raisch <rai...@de.ibm.com>
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be
best to route these together through -mm.  Please holler if there's
any objection.  Thanks.

 drivers/infiniband/hw/ehca/ehca_cq.c | 27 ++++++-----------------
 drivers/infiniband/hw/ehca/ehca_qp.c | 34 ++++++++++++--------------
 2 files changed, 22 insertions(+), 39 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c
index 8f52901..212150c 100644
--- a/drivers/infiniband/hw/ehca/ehca_cq.c
+++ b/drivers/infiniband/hw/ehca/ehca_cq.c
@@ -128,7 +128,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 	void *vpage;
 	u32 counter;
 	u64 rpage, cqx_fec, h_ret;
-	int ipz_rc, ret, i;
+	int ipz_rc, i;
 	unsigned long flags;
 
 	if (cqe >= 0xFFFFFFFF - 64 - additional_cqe)
@@ -163,32 +163,19 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector,
 	adapter_handle = shca->ipz_hca_handle;
 	param.eq_handle = shca->eq.ipz_eq_handle;
 
-	do {
-		if (!idr_pre_get(&ehca_cq_idr, GFP_KERNEL)) {
-			cq = ERR_PTR(-ENOMEM);
-			ehca_err(device, "Can't reserve idr nr. device=%p",
-				 device);
-			goto create_cq_exit1;
-		}
-
-		write_lock_irqsave(&ehca_cq_idr_lock, flags);
-		ret = idr_get_new(&ehca_cq_idr, my_cq, &my_cq->token);
-		write_unlock_irqrestore(&ehca_cq_idr_lock, flags);
-	} while (ret == -EAGAIN);
+	idr_preload(GFP_KERNEL);
+	write_lock_irqsave(&ehca_cq_idr_lock, flags);
+	my_cq->token = idr_alloc(&ehca_cq_idr, my_cq, 0, 0x2000000, GFP_NOWAIT);
+	write_unlock_irqrestore(&ehca_cq_idr_lock, flags);
+	idr_preload_end();
 
-	if (ret) {
+	if (my_cq->token < 0) {
 		cq = ERR_PTR(-ENOMEM);
 		ehca_err(device, "Can't allocate new idr entry. device=%p",
 			 device);
 		goto create_cq_exit1;
 	}
 
-	if (my_cq->token > 0x1FFFFFF) {
-		cq = ERR_PTR(-ENOMEM);
-		ehca_err(device, "Invalid number of cq. device=%p", device);
-		goto create_cq_exit2;
-	}
-
 	/*
 	 * CQs maximum depth is 4GB-64, but we need additional 20 as buffer
 	 * for receiving errors CQEs.
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index 1493939..00d6861 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -636,30 +636,26 @@ static struct ehca_qp *internal_create_qp(
 		my_qp->send_cq =
 			container_of(init_attr->send_cq, struct ehca_cq, ib_cq);
 
-	do {
-		if (!idr_pre_get(&ehca_qp_idr, GFP_KERNEL)) {
-			ret = -ENOMEM;
-			ehca_err(pd->device, "Can't reserve idr resources.");
-			goto create_qp_exit0;
-		}
+	idr_preload(GFP_KERNEL);
+	write_lock_irqsave(&ehca_qp_idr_lock, flags);
 
-		write_lock_irqsave(&ehca_qp_idr_lock, flags);
-		ret = idr_get_new(&ehca_qp_idr, my_qp, &my_qp->token);
-		write_unlock_irqrestore(&ehca_qp_idr_lock, flags);
-	} while (ret == -EAGAIN);
+	ret = idr_alloc(&ehca_qp_idr, my_qp, 0, 0x2000000, GFP_NOWAIT);
+	if (ret >= 0)
+		my_qp->token = ret;
 
-	if (ret) {
-		ret = -ENOMEM;
-		ehca_err(pd->device, "Can't allocate new idr entry.");
+	write_unlock_irqrestore(&ehca_qp_idr_lock, flags);
+	idr_preload_end();
+	if (ret < 0) {
+		if (ret == -ENOSPC) {
+			ret = -EINVAL;
+			ehca_err(pd->device, "Invalid number of qp");
+		} else {
+			ret = -ENOMEM;
+			ehca_err(pd->device, "Can't allocate new idr entry.");
+		}
 		goto create_qp_exit0;
 	}
 
-	if (my_qp->token > 0x1FFFFFF) {
-		ret = -EINVAL;
-		ehca_err(pd->device, "Invalid number of qp");
-		goto create_qp_exit1;
-	}
-
 	if (has_srq)
 		parms.srq_token = my_qp->token;
-- 
1.8.1
[PATCH 25/62] infiniband/cxgb4: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Steve Wise <sw...@chelsio.com>
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be
best to route these together through -mm.  Please holler if there's
any objection.  Thanks.

 drivers/infiniband/hw/cxgb4/iw_cxgb4.h | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index 9c1644f..7f862da 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -260,20 +260,21 @@ static inline int _insert_handle(struct c4iw_dev *rhp, struct idr *idr,
 				 void *handle, u32 id, int lock)
 {
 	int ret;
-	int newid;
 
-	do {
-		if (!idr_pre_get(idr, lock ? GFP_KERNEL : GFP_ATOMIC))
-			return -ENOMEM;
-		if (lock)
-			spin_lock_irq(&rhp->lock);
-		ret = idr_get_new_above(idr, handle, id, &newid);
-		BUG_ON(!ret && newid != id);
-		if (lock)
-			spin_unlock_irq(&rhp->lock);
-	} while (ret == -EAGAIN);
-
-	return ret;
+	if (lock) {
+		idr_preload(GFP_KERNEL);
+		spin_lock_irq(&rhp->lock);
+	}
+
+	ret = idr_alloc(idr, handle, id, id + 1, GFP_ATOMIC);
+
+	if (lock) {
+		spin_unlock_irq(&rhp->lock);
+		idr_preload_end();
+	}
+
+	BUG_ON(ret == -ENOSPC);
+	return ret < 0 ? ret : 0;
 }
 
 static inline int insert_handle(struct c4iw_dev *rhp, struct idr *idr,
-- 
1.8.1
[PATCH 24/62] infiniband/cxgb3: convert to idr_alloc()
Convert to the much saner new idr interface.  Only compile tested.

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Steve Wise <sw...@chelsio.com>
Cc: linux-rdma@vger.kernel.org
---
This patch depends on earlier idr changes and I think it would be
best to route these together through -mm.  Please holler if there's
any objection.  Thanks.

 drivers/infiniband/hw/cxgb3/iwch.h | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h
index a1c4457..8378622 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.h
+++ b/drivers/infiniband/hw/cxgb3/iwch.h
@@ -153,19 +153,17 @@ static inline int insert_handle(struct iwch_dev *rhp, struct idr *idr,
 				void *handle, u32 id)
 {
 	int ret;
-	int newid;
-
-	do {
-		if (!idr_pre_get(idr, GFP_KERNEL)) {
-			return -ENOMEM;
-		}
-		spin_lock_irq(&rhp->lock);
-		ret = idr_get_new_above(idr, handle, id, &newid);
-		BUG_ON(newid != id);
-		spin_unlock_irq(&rhp->lock);
-	} while (ret == -EAGAIN);
-
-	return ret;
+
+	idr_preload(GFP_KERNEL);
+	spin_lock_irq(&rhp->lock);
+
+	ret = idr_alloc(idr, handle, id, id + 1, GFP_NOWAIT);
+
+	spin_unlock_irq(&rhp->lock);
+	idr_preload_end();
+
+	BUG_ON(ret == -ENOSPC);
+	return ret < 0 ? ret : 0;
 }
 
 static inline void remove_handle(struct iwch_dev *rhp, struct idr *idr, u32 id)
-- 
1.8.1
Re: [GIT] lock scaling and bug fixes for 2.6.38
Hello,

On Tue, Jan 11, 2011 at 06:29:52PM +0100, Bart Van Assche wrote:
> On Tue, Jan 11, 2011 at 2:45 AM, Roland Dreier <rdre...@cisco.com> wrote:
> > > git://git.kernel.org/pub/scm/linux/kernel/git/dad/srp-initiator.git srp-lock-scaling
> >
> > Thanks, pulled.  Will try to merge Tejun's work queue patch shortly.
>
> Maybe it's a good idea to add a comment above the declaration of the
> global variable "extern struct workqueue_struct *ib_wq;" that explains
> that that variable has been added during the work queue conversion and
> that the preferred approach for new code is to define a new work queue
> instead of using ib_wq?

But why is defining more workqueues preferred?  Unless there's a
specific need, using what's already there (system or ib_wq) should be
preferable.

Thank you.

-- 
tejun
Re: [PATCH v2.6.36-rc7] infiniband: update workqueue usage
Hello,

On 10/19/2010 07:22 PM, Ralph Campbell wrote:
> On Tue, 2010-10-19 at 08:24 -0700, Tejun Heo wrote:
> > * qib_cq_wq is a separate singlethread workqueue.  Does the queue
> >   require strict single thread execution ordering?  IOW, does each
> >   work have to be executed in the exact queued order and no two
> >   works should execute in parallel?  Or was the singlethreadedness
> >   chosen just to reduce the number of workers?
>
> The work functions need to be called in-order and single threaded
> or memory will be freed multiple times and other bad things.

I see, so they'll need to be converted to alloc_ordered_workqueue()
once the -rc1 merge window opens up.  I'll follow up with the
conversion.

Thanks.

-- 
tejun
Re: [PATCH v2.6.36-rc7] infiniband: update workqueue usage
Hello,

On 10/19/2010 08:40 PM, Bart Van Assche wrote:
> On Tue, Oct 19, 2010 at 5:24 PM, Tejun Heo <t...@kernel.org> wrote:
> > [ ... ]
> > This is to prepare for deprecation of flush_scheduled_work().
> > [ ... ]
> > Index: work/include/rdma/ib_verbs.h
> > [ ... ]
> > +extern struct workqueue_struct *ib_wq;
> > [ ... ]
>
> This patch adds a declaration of a global variable to a public header
> file.  That might be unavoidable, but it doesn't make me happy.

Hmm... that's one very interesting reason to be unhappy.  Can you
please elaborate why the addition of a global variable doesn't make
you happy?

Thanks.

-- 
tejun
Re: [PATCH v2.6.36-rc7] infiniband: update workqueue usage
Hello,

On 10/20/2010 12:21 PM, Bart Van Assche wrote:
> > Hmm... that's one very interesting reason to be unhappy.  Can you
> > please elaborate why the addition of a global variable doesn't make
> > you happy?
>
> In the past every time I saw a global variable being added in a
> software project that meant that some software concept was not being
> abstracted properly.  Which does not necessarily mean that that is
> the case with this patch.

That's too wide-brushed a position to agree with, but, well, let's
talk about that some other time. :-)

> With the posted patch, just like with the current implementation,
> e.g. the flush_workqueue() call in the ipath driver will make that
> driver wait until work scheduled by the core/sa_query.c code has
> finished and vice versa.  Is that necessary?  If not, using multiple
> work queues for IB instead of one would allow to get rid of that
> global ib_wq declaration.

Yeap, if breaking it down further makes sense, definitely.  If someone
is interested, the following are the guidelines I've been following in
these conversions.

* If all the possibly pending works can be safely enumerated without
  too much difficulty and the works in question are not used during
  memory reclaim, use the system workqueue and use explicit
  flush/cancel_work[_sync]() instead of using flush_workqueue().
  Please note that flush_work_sync() isn't in mainline yet.  It's
  scheduled for the coming rc1 window.

* If works can be used during memory reclaim, there's no way around
  it.  A separate workqueue needs to be used.

* If works free themselves or are used to put the last reference of
  the containing object, they can't be flushed explicitly and thus
  need to be queued on a dedicated workqueue which serves as the
  flushing domain.

For this patch, I chose the less intrusive path and just replaced the
system-wide workqueue with a subsystem-wide one, mainly because I
don't have experience with or access to anything infiniband.  If
someone wants to push it further, I would be happy to help.

BTW, does the posted patch look safe to you?

Thanks.

-- 
tejun
[PATCH v2.6.36-rc7] infiniband: update workqueue usage
* ib_wq is added, which is used as the common workqueue for infiniband
  instead of the system workqueue.  All system workqueue usages
  including flush_scheduled_work() callers are converted to use and
  flush ib_wq.

* cancel_delayed_work() + flush_scheduled_work() are converted to
  cancel_delayed_work_sync().

* qib_wq is removed and ib_wq is used instead.

This is to prepare for deprecation of flush_scheduled_work().

Signed-off-by: Tejun Heo <t...@kernel.org>
---
Hello,

I think this patch is safe but I don't have any experience with or
access to infiniband stuff and it's only compile tested.  Also, while
looking through the code, I got curious about several things.

* Can any of the works in infiniband be used during memory reclaim?

* qib_cq_wq is a separate singlethread workqueue.  Does the queue
  require strict single thread execution ordering?  IOW, does each
  work have to be executed in the exact queued order and no two works
  should execute in parallel?  Or was the singlethreadedness chosen
  just to reduce the number of workers?

* The same question for ipoib_workqueue.

Thank you.
 drivers/infiniband/core/cache.c                |    4 +--
 drivers/infiniband/core/device.c               |   11 ++++--
 drivers/infiniband/core/sa_query.c             |    4 +--
 drivers/infiniband/core/umem.c                 |    2 -
 drivers/infiniband/hw/ipath/ipath_driver.c     |    2 -
 drivers/infiniband/hw/ipath/ipath_user_pages.c |    2 -
 drivers/infiniband/hw/qib/qib_iba7220.c        |    7 ++----
 drivers/infiniband/hw/qib/qib_iba7322.c        |   14 +++------
 drivers/infiniband/hw/qib/qib_init.c           |   26 ++++-------------
 drivers/infiniband/hw/qib/qib_qsfp.c           |    9 +++----
 drivers/infiniband/hw/qib/qib_verbs.h          |    5 +---
 drivers/infiniband/ulp/srp/ib_srp.c            |    4 +--
 include/rdma/ib_verbs.h                        |    3 ++
 13 files changed, 41 insertions(+), 52 deletions(-)

Index: work/drivers/infiniband/core/cache.c
===================================================================
--- work.orig/drivers/infiniband/core/cache.c
+++ work/drivers/infiniband/core/cache.c
@@ -308,7 +308,7 @@ static void ib_cache_event(struct ib_eve
 			INIT_WORK(&work->work, ib_cache_task);
 			work->device   = event->device;
 			work->port_num = event->element.port_num;
-			schedule_work(&work->work);
+			queue_work(ib_wq, &work->work);
 		}
 	}
 }
@@ -368,7 +368,7 @@ static void ib_cache_cleanup_one(struct
 	int p;
 
 	ib_unregister_event_handler(&device->cache.event_handler);
-	flush_scheduled_work();
+	flush_workqueue(ib_wq);
 
 	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
 		kfree(device->cache.pkey_cache[p]);
Index: work/drivers/infiniband/core/device.c
===================================================================
--- work.orig/drivers/infiniband/core/device.c
+++ work/drivers/infiniband/core/device.c
@@ -38,7 +38,6 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/mutex.h>
-#include <linux/workqueue.h>
 
 #include "core_priv.h"
 
@@ -52,6 +51,9 @@ struct ib_client_data {
 	void *            data;
 };
 
+struct workqueue_struct *ib_wq;
+EXPORT_SYMBOL_GPL(ib_wq);
+
 static LIST_HEAD(device_list);
 static LIST_HEAD(client_list);
 
@@ -718,6 +720,10 @@ static int __init ib_core_init(void)
 {
 	int ret;
 
+	ib_wq = alloc_workqueue("infiniband", 0, 0);
+	if (!ib_wq)
+		return -ENOMEM;
+
 	ret = ib_sysfs_setup();
 	if (ret)
 		printk(KERN_WARNING "Couldn't create InfiniBand device class\n");
@@ -726,6 +732,7 @@ static int __init ib_core_init(void)
 	if (ret) {
 		printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n");
 		ib_sysfs_cleanup();
+		destroy_workqueue(ib_wq);
 	}
 
 	return ret;
@@ -736,7 +743,7 @@ static void __exit ib_core_cleanup(void)
 	ib_cache_cleanup();
 	ib_sysfs_cleanup();
 	/* Make sure that any pending umem accounting work is done. */
-	flush_scheduled_work();
+	destroy_workqueue(ib_wq);
 }
 
 module_init(ib_core_init);
Index: work/drivers/infiniband/core/sa_query.c
===================================================================
--- work.orig/drivers/infiniband/core/sa_query.c
+++ work/drivers/infiniband/core/sa_query.c
@@ -422,7 +422,7 @@ static void ib_sa_event(struct ib_event_
 		port->sm_ah = NULL;
 		spin_unlock_irqrestore(&port->ah_lock, flags);
 
-		schedule_work(&sa_dev->port[event->element.port_num -
+		queue_work(ib_wq, &sa_dev->port[event->element.port_num -
 					    sa_dev->start_port].update_task);
 	}
 }
@@ -1068,7 +1068,7 @@ static void ib_sa_remove_one(struct ib_d
 	ib_unregister_event_handler(&sa_dev
Re: idr_get_new_exact ?
Hello,

On 09/23/2010 01:42 PM, Paul Mundt wrote:
> On Mon, Sep 20, 2010 at 11:26:47PM +0200, Tejun Heo wrote:
> > On 09/20/2010 10:35 PM, Roland Dreier wrote:
> > > Looks fine to me as an improvement over the status quo, but I
> > > wonder how many of these places could use the radix_tree stuff
> > > instead?  If you're not using the ability of the idr code to
> > > assign an id for you, then it seems the radix_tree API is a
> > > better fit.
> >
> > I agree.  Wouldn't those users be better off simply using a radix
> > tree?
>
> It could go either way.  I was about to write the same function when
> playing with it for IRQ mapping, the idea being to populate the
> initial tree with sparse static vectors and then switch over to
> dynamic IDs for virtual IRQ creation.  I ended up going with a radix
> tree for other reasons, though.

I see.  If there are use cases where fixed and dynamic IDs need to be
mixed, no objection from me.

Thanks.

-- 
tejun
Re: idr_get_new_exact ?
Hello,

On 09/20/2010 10:35 PM, Roland Dreier wrote:
> Looks fine to me as an improvement over the status quo, but I wonder
> how many of these places could use the radix_tree stuff instead?  If
> you're not using the ability of the idr code to assign an id for you,
> then it seems the radix_tree API is a better fit.

I agree.  Wouldn't those users be better off simply using a radix
tree?

Thanks.

-- 
tejun