Hi Thomas,
On Wed, Jun 28, 2017 at 09:03:19PM +0200, Thomas Gleixner wrote:
> On Sat, 13 May 2017, Chen Yu wrote:
> > This is because:
> > 1. One of the drivers has declared many vector resources via
> >    pci_enable_msix_range(); say the driver wants to reserve 6 per
> >    logical CPU, then there would be 192 of them.
> 
> This has been solved with the managed interrupt mechanism for multi queue
> devices, which no longer migrates these interrupts. It simply shuts
> them down and restarts them when the cpu comes back online.
> 
I think the driver has already enabled multiple queues:
ethtool -l enp187s0f0
Channel parameters for enp187s0f0:
Pre-set maximums:
RX:             0
TX:             0
Other:          1
Combined:       128
Current hardware settings:
RX:             0
TX:             0
Other:          1
Combined:       32
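
For reference, below is a minimal sketch (not actual i40e code, just my
understanding of the managed affinity path; the function name is made
up) of how a multi queue driver would let the PCI core spread its queue
vectors while keeping one unmanaged vector for the "Other" interrupt:

#include <linux/interrupt.h>
#include <linux/pci.h>

/*
 * Illustrative only: keep the first vector (mailbox/"other") out of the
 * managed spreading and let the core spread the remaining queue vectors
 * across the online CPUs.
 */
static int example_request_vectors(struct pci_dev *pdev, unsigned int nr_queues)
{
	struct irq_affinity affd = { .pre_vectors = 1 };
	int nvecs;

	nvecs = pci_alloc_irq_vectors_affinity(pdev, 2, nr_queues + 1,
					       PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					       &affd);
	if (nvecs < 0)
		return nvecs;

	/*
	 * Managed vectors are shut down when their CPU goes offline and
	 * restarted when it comes back, so they do not pile up on CPU0
	 * during hotplug.
	 */
	return nvecs;
}

I will check whether i40e can be converted to this path.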
> > 2. Besides, most of the vectors for this driver are allocated
> >    on CPU0 due to the current code strategy, so there would be
> >    insufficient slots left on CPU0 to receive any migrated
> >    IRQs from the other CPUs during CPU offline.
> 
> That's a driver sillyness. I assume it's a multi queue device, so it should
> use the managed interrupt affinity code.
> 
Let me Cc my colleagues Mitch and Harshitha who are working on
this driver currently.
> > 3. Furthermore, many vectors this driver reserved do not have
> >    any IRQ handler attached.
> 
> That's silly. Why does a driver allocate more resources than required? Just
> because it can?
> 
The reason why the driver reserves these resources without using them
might be (this is just my guess) that no network cable is plugged into
the interface at the moment, although the driver has been loaded. I
will confirm whether this is the case first.
> > As a result, all vectors on CPU0 were used up and the last alive
> > CPU (31) failed to migrate its IRQs to the CPU0.
> > 
> > As we might have difficulty reducing the number of vectors reserved
> > by that driver,
> 
> Why so? If the vectors are not used, then why are they allocated in the
> first place? Can you point at the driver source please?
> 
We are testing the net/ethernet/intel/i40e driver now. Previously we
introduced a mechanism to release some vectors before S4 to work around
this issue, but as you mentioned, if these are managed multi queue
vectors, the release is done automatically during cpu hotplug. I wonder
whether the reserved vectors will also be released even when no handler
is attached?
> > there could be a compromise solution: spread the vector allocation
> > across different CPUs rather than always choosing the *first* CPU in
> > the cpumask. In this way, there would be a balanced vector
> > distribution. Because the many vectors that are reserved but unused
> > (point 3 above) are not counted during CPU offline, and they would
> > now sit on nonboot CPUs, this problem would be solved.
> 
> You are tackling the problem at the wrong end to some extent. I'm not
> against sanely spreading interrupts, but we only want to do that if there
> is a compelling reason. A broken driver does not count.
>   
Understood.
> > +static int pick_leisure_cpu(const struct cpumask *mask)
> > +{
> > +   int cpu, vector;
> > +   int min_nr_vector = NR_VECTORS;
> > +   int target_cpu = cpumask_first_and(mask, cpu_online_mask);
> > +
> > +   for_each_cpu_and(cpu, mask, cpu_online_mask) {
> > +           int nr_vectors = 0;
> > +
> > +           for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
> > +                   if (!IS_ERR_OR_NULL(per_cpu(vector_irq, cpu)[vector]))
> > +                           nr_vectors++;
> > +           }
> > +           if (nr_vectors < min_nr_vector) {
> > +                   min_nr_vector = nr_vectors;
> > +                   target_cpu = cpu;
> > +           }
> 
> That's beyond silly. Why would we have to count the available vectors over
> and over? We can keep track of the vectors available per cpu continuously
> when we allocate and free them.
> 
Yes, that would be much more efficient :)
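Something along these lines is what I have in mind (just a rough sketch
inside vector.c; the names vectors_in_use, account_vector and
pick_least_loaded_cpu are invented, not existing kernel symbols):

static DEFINE_PER_CPU(unsigned int, vectors_in_use);

/* Called from the vector allocate/free paths instead of rescanning. */
static void account_vector(int cpu, bool alloc)
{
	if (alloc)
		per_cpu(vectors_in_use, cpu)++;
	else
		per_cpu(vectors_in_use, cpu)--;
}

static int pick_least_loaded_cpu(const struct cpumask *mask)
{
	unsigned int min_used = UINT_MAX;
	int cpu, target = cpumask_first_and(mask, cpu_online_mask);

	for_each_cpu_and(cpu, mask, cpu_online_mask) {
		unsigned int used = per_cpu(vectors_in_use, cpu);

		if (used < min_used) {
			min_used = used;
			target = cpu;
		}
	}
	return target;
}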
> > +   }
> > +   return target_cpu;
> > +}
> > +
> >  static int __assign_irq_vector(int irq, struct apic_chip_data *d,
> >                            const struct cpumask *mask)
> >  {
> > @@ -131,7 +152,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
> >     /* Only try and allocate irqs on cpus that are present */
> >     cpumask_clear(d->old_domain);
> >     cpumask_clear(searched_cpumask);
> > -   cpu = cpumask_first_and(mask, cpu_online_mask);
> > +   cpu = pick_leisure_cpu(mask);
> >     while (cpu < nr_cpu_ids) {
> >             int new_cpu, offset;
> 
> I'm really not fond of this extra loop. That function is a huge nested loop
> mess already. The cpu we select upfront gets fed into
> apic->vector_allocation_domain() which might just say no because the CPU
> does not belong to it. So yes, it kinda works for your purpose, but it's
> just a bolted on 'solution'.
> 
> We really need to sit down and rewrite that allocation code from scratch,
> use per cpu bitmaps, so we can do fast search over masks and keep track of
> the vectors which are used/free per cpu.
> 
> That needs some thought because we need to respect the allocation domains,
> but there must be a saner way than looping with interrupts disabled for a
> gazillion hoops.
> 
> What the current code misses completely is to set a preference for the node
> on which the interrupt was allocated. We just blindly take something out of
> the default affinity mask which is all online cpus.
> 
> So there are a lot of shortcomings in the current implementation and we
> should be able to replace it with something better, faster and while at it
> make that spreading stuff work.
Agreed. That sounds like a good improvement; let me check the i40e
driver first and then see whether the allocation code can be revised
according to this design.
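If it helps the discussion, the per cpu bitmap part I am picturing looks
roughly like this (purely illustrative, all names invented; the
allocation domain handling would still have to sit on top of it):

/* One bitmap of vectors per CPU; a set bit means the vector is in use. */
typedef unsigned long vector_bitmap_t[BITS_TO_LONGS(NR_VECTORS)];
static DEFINE_PER_CPU(vector_bitmap_t, vector_used_map);

static int find_free_vector(int cpu)
{
	const unsigned long *map = per_cpu(vector_used_map, cpu);
	int vec = find_next_zero_bit(map, NR_VECTORS, FIRST_EXTERNAL_VECTOR);

	return vec < NR_VECTORS ? vec : -ENOSPC;
}

static void mark_vector_used(int cpu, int vec)
{
	set_bit(vec, per_cpu(vector_used_map, cpu));
}

That would also make it cheap to keep a free-vector count per cpu for
the spreading decision.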
> 
> That does not mean, that we blindly support drivers allocating a gazillion
> vectors just for nothing.
> 
Thanks,
        Yu
> Thanks,
> 
>       tglx
> 
