> This patch implements a low latency Write Combining (aka Write Coalescing) 
> work
> request path. PCIE maps User Space Doorbell BAR2 region writes to the new
> interface to SGE. SGE pulls a new message from PCIE new interface and if its a
> coalesced write work request then pushes it for processing. This patch copies
> coalesced work request to memory mapped BAR2 space.
...
> +     } else {
> +             memset(data, 0, 2 * sizeof(u64));
> +             *data += 2;
> +     }

Using memset is overkill (or rather a big overhead if it isn't
detected by the compiler). Nothing wrong with:
                (*data)[0] = 0;
                (*data)[1] = 0;
                *data += 2;
Actually, typing that, I realise that you should probably have
read *data into a local variable, and then updated it when finished.
Otherwise some of the accesses might be ones which force the
compiler to reload the value.
                
>  static inline void ring_tx_db(struct adapter *adap, struct sge_txq *q, int n)
>  {
> +     unsigned int *wr, index;
> +
>       wmb();            /* write descriptors before telling HW */
>       spin_lock(&q->db_lock);
>       if (!q->db_disabled) {
> -             t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL),
> -                          QID(q->cntxt_id) | PIDX(n));
> +             if (is_t4(adap->chip)) {
> +                     t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL),
> +                                  QID(q->cntxt_id) | PIDX(n));
> +             } else {
> +                     if (n == 1) {
> +                             index = q->pidx ? (q->pidx - 1) : (q->size - 1);
> +                             wr = (unsigned int *)&q->desc[index];
> +                             cxgb_pio_copy((u64 __iomem *)
> +                                           (adap->bar2 + q->udb + 64),
> +                                           (u64 *)wr);

Why all the casts on 'wr' ?

> +                     } else
> +                             writel(n,  adap->bar2 + q->udb + 8);
> +                     wmb();

Since you actually need memory barriers here on x86 you definitely
need a comment saying so, and it would (IMHO) better to use a
different define in the source (even if it is currently converted
to wmb() in a local header file).

Thinking further, for portability you might want to find some way
of abstracting the multi-word writes somehow.
For example, some of the ppc have a dma engine associated with the
PCIe master interface that can be used to generate large TLP.
The code would still want to spin waiting for the dma to complete
(since the transfer would be far faster than any interrupt path).

        David



--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to