It seems that this discussion has gotten deep into implementation details, 
and I think a discussion over the phone would be more efficient. Let me know 
a time that is convenient for you so that we can talk. I can summarize the 
discussion later on the alias.

But I have still tried to raise some of my questions inline. See below:

> On Wed, Oct 17, 2007 at 12:55:56PM +0800, Cathy Zhou wrote:
>>>> Leave that aside, if I understand correctly, "normal" packets would be 
>>>> send/received from the data-path-stream, and "special" packets 
>>>> (received because of the promiscuity-mode or the multicast packets) 
>>>> would be received from the shared-lower-stream.
>>>>
>>> initially the data stream is the only stream that can receive packets 
>>> (unicast + broadcast) since it's bound to a particular sap (e.g. IP). 
>>> the shared stream is not bound so it cannot receive anything.
>>>
>> Assume that there is a per-stream DLS stream, and another MAC client that 
>> requires the shared lower stream to be bound to a sap (for example, an 
>> aggregation is created over this MAC), does that mean the performance of the 
>> per-stream DLS stream would also suffer from this? Further, in this case, 
>> whether or not to filter out the duplicated messages depends not only on the 
>> state of the stream itself, but also the existence of other MAC clients. 
>> This does not seem right.
>>
> 
> first, aggr would probably mac_active_set() the softmac so the other DLS
> stream you mentioned can only be a passive (most likely snoop) stream right?
> why are you so concerned with snoop performance?
> 
> ok. let suppose it's not aggr but another upper stream that sends down
> a DL_ENABMULTI_REQ to the shared stream. yes, this will affect performance
> of other upper streams. but the performance hit most likely won't be due
> to filtering; it will be due to the cost of demultiplexing to multiple
> streams within the dlpi driver underneath softmac. from softmac upwards,
> the filtering cost will be one comparison (calling mac_ether_unicst_verify()).
> 
> let me reiterate what I suggested:
> 
> case 1
> one data stream, one shared stream, both bound to sap, with multicast
> addresses added to the shared stream, with physical promisc mode off
> on the shared stream (but DL_PROMISC_SAP could be on):
> 
> softmac_rput() filters packets as follows:
> -data path stream drops all non-unicast packets.
> -shared stream drops all unicast packets.
> 
> case 2
> same config as case 1 except physical promisc mode is ON on the
> shared stream:
> 
> there are two ways softmac_rput() could filter packets:
> simple option (hurts data path performance when there are snoops)
> -data path stream drops all packets
> -shared stream passes up all packets.
> 
It is not only the performance issue that concerns me, but also the fact 
that each lower data stream has to know the state of the other upper streams 
in order to switch between the different filter policies.

It means that whenever the state of a dls_str_t (dld_impl_t) changes, it has 
to somehow inform the lower streams of the other dld_impl_t's.

I am not sure whether that would make the code less invasive or not.

> dls filter option
> -data path stream drops all non-unicast packets
> -shared stream passes up all packets. dls_accept() drops unicast
>  packets if the dls_impl_t has a data path stream open and the
>  dls_impl_t does not belong to a snoop upper stream.
> 
This would still hurt performance, because the message parsing would happen 
on both the data path stream and the shared stream. But maybe you are right; 
maybe it does not matter.
> 
>> This one-byte comparison assumes Ethernet media type, right?
>>
> 
> yes. with the IB driver porting already under way, I don't think this is
> a problem.
> 
What I meant is that determining whether a packet is unicast, multicast, or 
broadcast, and hence whether or not to filter it out, might take more than a 
one-byte comparison for other media types.
>  
> I don't think you need to worry about the aggr case unless you really
> want to make passive streams perform well. for the promisc case, I
> think we could choose the "dls filter" option, which should not affect
> data path performance much.
> 
That is true. You have convinced me on this.

>>> also, I noticed you added a new function dld_wput_perstream_callback() 
>>> and similar putnext() code in dld_wsrv(). the reason you do this here 
>>> instead of within softmac_m_tx() is because you can't pass down the 
>>> lower stream write queue, right? 
>> dld_wput_perstream_callback() is for the per-stream scheme upper stream 
>> (which is the fast path), and the reason we don't do softmac_m_tx() is 
>> that we'd like to skip the MAC layer completely in that case, so neither 
>> mac_tx() nor mac_txloop() is called.
>>
> 
> but if we could add a third arg to the mi_tx() entry point so you could
> pass down a write queue, why do you need to bypass mac at all? the cost
> of calling dls_tx()->mt_fn() should be very similar to calling
> dld_wput_perstream_callback(). If the performance gain is not huge,
> I would really like to see this removed from dld.
> 
I used to have the code structured this way:

        wq = dsp->ds_perstream_wq;
        if (!CANPUTNEXT(wq)) {
                /* Downstream is flow-controlled; queue the message. */
                dld_tx_enqueue();
                return (mp);
        } else {
                /* Fast path: pass the message straight down. */
                putnext(wq, mp);
                return (NULL);
        }

I believe the results show that even the loss of the tail call introduces a 
performance regression. If I were assured that this regression does not 
matter, I would certainly be fine with changing it as you suggested.

> the way I see it is:
> legacy drivers are going away. legacy nics will all be EOL'ed someday if
> they haven't already. nemo, on the other hand, is going to stay and will
> continually be extended for a long time. if we make any optimizations
> in the framework now, there had better be good reasons that these are going
> to benefit most drivers, not just a small subset of them. this per-stream
> optimization is, unfortunately, benefiting only legacy drivers and in my
> opinion, its implementation is far too invasive to the framework.
> 
> I think we need to make a judgement call here, do we want to:
> -support all legacy drivers and guarantee 0% degradation under *all*
>  circumstances. if this is a requirement, then it's unfortunate that
>  we would likely have to keep the existing perstream implementation.
>  the downside of this is that eventually this code will become a
>  maintenance burden.
> 
Adding a fast path always makes the code uglier, and I do not like that hack 
either. Again, if your proposal turns out to be less hacky, I am willing to 
make the change. Based on the amount of work, though, I think it could only 
be done post-first-putback.

Thanks
- Cathy
