[crossbow-discuss] Comments on updated arch document

Narayan Venkat Thu, 11 Oct 2007 18:13:55 -0400

Hi Nicolas

Sorry for the late response. Please see some more questions/comments  
below ..


<..snip..>

>
>>
>>     Q1.2) On pg 38, there is a reference to the following flags,  
>> but which
>>          interface takes them as an argument?
>>
>>       MAC_OPEN_FLAGS_FORCE_MULTI_RINGS
>>       MAC_OPEN_FLAGS_FORCE_ONE_RING
>>
>>       It seems like these are an argument to mac_client_open(),
>>       but there is a reference mac_open() in the description see below:
>>
>>       "If MAC_OPEN_FLAGS_FORCE_MULTI_RINGS flag is set and it is not
>>        possible to allocate mbc_ncpus hardware rings, the mac_open()
>>        call will fail, otherwise the MAC layer will attempt to reserve
>>        one hardware ring for the MAC client."
>
> These flags are specified when calling mac_client_open(), not  
> mac_open().

I guess the above text will get fixed in a subsequent revision.


>>
>>     Q1.3) Are there any other flags other than the following ones?
>>
>>       MAC_OPEN_FLAGS_FORCE_MULTI_RINGS
>>       MAC_OPEN_FLAGS_FORCE_ONE_RING
>
> No.

Is there a reason this is tied to hardware rings ? We would like the
mac client open request to be extended so that it can get either all
software rings, a mix of hardware and software rings, or all HW rings in
order to match the number of CPUs specified in the client_open call.
A flag can be specified for this.

Also, according to the explanation in the doc at page 38, there is
also a case where no flags are specified. It seems that, if no flags
are specified, it will attempt to reserve one hardware ring.
It seems not to fail even if that reservation fails, but this is not
clearly specified.

>>  - Is there a way to force a software ring?
>
> Do you mean not assign a hardware ring? I think this is something  
> we could add, yes.

This is related to the above. Can you add a flag that we can use
to indicate that the client wants to use a single ring or multiple rings
without forcing hardware rings ? That way, even when the underlying
device does not have enough hardware rings, a client can get a soft
ring per CPU.

The above comment applies to this one also. The behavior without any flags
seems to attempt to reserve one h/w ring. What is the failure case ?
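
To make the request concrete, here is a minimal sketch of the allocation policy I have in mind. Everything in it is hypothetical: the policy names, the struct and the function are mine for illustration, not part of the Crossbow document. Only the "force multiple h/w rings or fail" case follows the text on pg 38; the mixed and software-only cases are the proposed additions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policies; only the FORCE_MULTI behavior is from the doc. */
enum ring_policy {
	RING_POLICY_FORCE_MULTI_HW,	/* like MAC_OPEN_FLAGS_FORCE_MULTI_RINGS */
	RING_POLICY_MIXED,		/* proposed: h/w first, s/w for the rest */
	RING_POLICY_SW_ONLY		/* proposed: soft rings only */
};

struct ring_grant {
	unsigned hw_rings;
	unsigned sw_rings;
};

/*
 * Decide how many h/w and s/w rings a client asking for ncpus-way
 * parallelism would get when hw_free hardware rings are available.
 * Returns 0 on success, -1 if the policy cannot be satisfied.
 */
int
ring_grant_compute(enum ring_policy policy, unsigned ncpus,
    unsigned hw_free, struct ring_grant *out)
{
	assert(out != NULL);
	out->hw_rings = 0;
	out->sw_rings = 0;
	if (ncpus == 0)
		return (-1);

	switch (policy) {
	case RING_POLICY_FORCE_MULTI_HW:
		if (hw_free < ncpus)
			return (-1);	/* the open fails, as on pg 38 */
		out->hw_rings = ncpus;
		return (0);
	case RING_POLICY_MIXED:
		out->hw_rings = (hw_free < ncpus) ? hw_free : ncpus;
		out->sw_rings = ncpus - out->hw_rings;
		return (0);
	case RING_POLICY_SW_ONLY:
		out->sw_rings = ncpus;
		return (0);
	}
	return (-1);
}
```

With 4 CPUs requested and 2 free hardware rings, the forced policy fails while the mixed policy yields 2 h/w and 2 s/w rings, which is the behavior we are asking for.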



>>
>>     Q1.4) Is the mbc_cpus in mac_bind_cpus_t an array of CPU ids?
>
> Yes.
>
>>
>>     Q1.5) The following description of mbc_cpus on pg 37 is not  
>> clear,
>>          especially for the non-NULL case.
>>
>>       "If mbc_cpus is NULL, the MAC layer will pick the CPUs.
>>       If mbc_cpus is non-NULL, the MAC layer will chose the CPUs.".
>
> The first one is correct. If mbc_cpus is non-NULL, the MAC layer  
> will assign the CPUs provided by the caller.

When mbc_cpus is NULL, what determines how many CPUs, and hence
how many rings, are available to this client ?

>
>>
>>     Q1.6) What is the relationship between Unicast addresses(multiple
>>          unicast set via mac_unicast_add()), Rings and CPUs?
>>      
>>       - Is there a 1:1 relation between a unicast address and a ring?
>>       - Is there a 1:1 relation between a ring and CPU?
>
> Neither. The MAC addresses will share the same rings and CPUs.

But since you are allowing multiple MAC addresses to be associated
with a client, can we add support as part of the unicast_add call
to indicate that each of these addresses should be associated with
a ring (either HW or SW) ?

>
>>
>>         -  The Rings and CPUs are tightly coupled in this interface.
>>         How can allocate multiple rings even when there is one CPU(or  
>> less
>>         number of CPUs).
>
> You don't allocate rings explicitly, you express a level of  
> parallelism instead, the framework distributes the hardware rings  
> transparently.

But the only way we can control this parallelism is by specifying
the number of CPUs in the domain. In a system capable of adding
and removing CPUs dynamically, we might want to change the parallelism
level too. The current APIs don't allow changing this. We will need
a way to specify this as an extension to client_open or via
a new API call.

Also, the document states that if mbc_ncpus HW rings cannot be allocated,
the open will fail. As I mentioned earlier, it would be nice if we could
get software rings in this case.

Also, in terms of parallelism, is this specified by the number of CPUs
or by the unique CPU IDs in the array ? What happens if I specify ncpus
entries where all IDs are the same: do I get ncpus HW rings if they are
available ? Also, can we then change the ring-to-CPU mapping when
CPUs are added to or removed from the domain ?
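
To illustrate the ambiguity, the two readings give different answers for the same array. The sketch below counts distinct CPU IDs; the helper name is hypothetical and it is not part of the MAC client API. Under the "unique IDs" reading this count would be the parallelism level, while under the other reading it would simply be mbc_ncpus.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the distinct CPU IDs in a caller-supplied array.
 * Hypothetical helper, for illustration only.
 */
unsigned
distinct_cpu_count(const int *cpus, unsigned ncpus)
{
	unsigned i, j, distinct = 0;

	assert(cpus != NULL || ncpus == 0);
	for (i = 0; i < ncpus; i++) {
		/* Look for an earlier entry with the same ID. */
		for (j = 0; j < i; j++) {
			if (cpus[j] == cpus[i])
				break;
		}
		if (j == i)	/* none found: first time we see this ID */
			distinct++;
	}
	return (distinct);
}
```

For an array of four entries that all name CPU 3, this returns 1, whereas mbc_ncpus would be 4. The document should say which of the two numbers drives the ring count.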

>
>>       - When there are multiple CPUs and multiple unicast addresses,
>>         is there address fanout per CPU?
>
> See 2 answers above.

This would be a very useful feature, as it would allow clients to associate
each ring with a MAC address. Currently the only way to do this is
to make separate mac_client_open calls, associate each client with a ring,
and then bind it to a MAC address.

>
>>
>>     Q1.7) How is the binding of CPUs via mac_bind_cpus_t is co- 
>> ordinated
>>          with CPU DR(on the platforms that support them)?
>
> The MAC layer will be notified of the removal of the CPU and will  
> stop using it for its worker threads and interrupts.

That is purely error handling. We need to be able to use
more CPUs and improve the level of parallelism when CPUs are added.
The reverse is true when CPUs are removed. When the MAC layer is
notified about CPUs going away, does it remove the rings associated
with those CPUs ?

>
>>
>>       NOTE: CPU DR is already a supported feature on LDoms.
>>      
>>     Q1.8) LDoms requires the CPU binding to be changed dynamically,
>>           how can this be accomplished ?
>
> This cannot be done with the API as documented today. It seems that  
> you are looking for a call to change the set of CPUs assigned to  
> the MAC client, is that what you are asking for?

See 1.7

<..snip..>

>>
>>     Q1.10) Can the mac client interface be extended to support  
>> creating
>>            a client based on ether_type? This is required for mac  
>> clients
>>         like fiberchannel over ethernet.
>
> No, each MAC client corresponds to a MAC level entity which is  
> defined by its MAC address. Multiple ether types can be supported  
> on top of a MAC client.

Devices like the Niagara2 NIU allow classification of packets using
parameters like the ether_type. How can a mac_client take advantage
of such functionality ?

<..snip..>

>>      Q2.1) The section 4.5 describes "By value" type which is used
>>           to set a specific MAC address by the MAC client. But there
>>           is no equivalent addr_type definition under mac_unicast_add()
>>           interface.
>
> MAC_UNICAST_VALUE is missing from the list, this is what you are  
> looking for.

I presume this will be documented in the next revision.

>>
>>           NOTE: LDoms requires the MAC addresses that are allocated
>>           by LDom manager be used by the network device. So, LDoms
>>           will not use any other addr_type other than "By value" type.
>
> That's fine.
>
>>
>>      Q2.2) Is there an impact to the multiaddress_capab_t.maddr_add()/
>>           maddr_remove() interfaces? Are these being obsoleted or
>>           going away?
>
> The capability will stay, and the framework will continue to use  
> that capability to query and control the allocation of MAC address  
> slots. However that interface is not intended to be used by drivers  
> which should use the MAC client interfaces instead.

OK.


>>
>>      Q2.3) A system with many domains (aka LDoms) with virtual network
>>            devices requires the use of a large number of layer2
>>            addresses; this will exhaust the h/w slots available on most
>>            standard NICs. How can a client take advantage of the layer2
>>            filtering provided by NICs like NII-NIU/Neptune? Specifically,
>>            this will help in avoiding the programming of the device into
>>            promiscuous mode etc. Currently there are no interfaces that
>>            seem to provide such ability.
>
> Yes, this is a situation we are aware of. We've talked on this list  
> about having multiple VNICs sharing the same MAC address, and  
> identified by their IP address instead. However this needs to be  
> scoped and defined further before we can commit on providing that  
> functionality.
>

The current APIs only allow adding as many addresses as the
number of slots available. Beyond that, the adapter is put
in promisc mode. Instead, can you add the capability to specify
when to use a filter and when to take up a slot in the HW ?


>>
>>      Q2.4) Clients will need the ability to specify whether
>>            mac_unicast_add() is allowed to go into promiscuous mode
>>            or not. An error return value is required if no h/w MAC
>>            address slot is available.
>
> OK, I will add a flag.

Thanks ..
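
For reference, this is roughly the behavior we expect from that flag, combined with the slot exhaustion case discussed above. It is only a sketch under my own assumptions: the flag name, the table structure and the ENOSPC return value are hypothetical and do not come from the document.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define	ADDR_NO_PROMISC	0x1	/* hypothetical: never fall back to promisc */

/* Hypothetical bookkeeping for one device's unicast addresses. */
struct addr_table {
	unsigned hw_slots_total;	/* h/w MAC address slots on the NIC */
	unsigned hw_slots_used;
	unsigned sw_filtered;		/* addresses matched in software */
	bool promisc;			/* device is in promiscuous mode */
};

/*
 * Add one unicast address. Use a free h/w slot when there is one.
 * When the slots are exhausted, either fail (ADDR_NO_PROMISC) or
 * switch the device to promiscuous mode and filter in software.
 * Returns 0 on success or an errno value.
 */
int
addr_add(struct addr_table *t, unsigned flags)
{
	assert(t != NULL);
	if (t->hw_slots_used < t->hw_slots_total) {
		t->hw_slots_used++;
		return (0);
	}
	if (flags & ADDR_NO_PROMISC)
		return (ENOSPC);
	t->promisc = true;
	t->sw_filtered++;
	return (0);
}
```

The point is that the caller, not the framework, decides whether running out of slots is an error or a reason to go promiscuous.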

<..snip..>


>>
>>      Q2.6) Can it be assumed that every address added to a client is
>>              processed in a separate ring (either h/w ring or s/w  
>> ring)?
>
> No, all the MAC addresses for a client will share the same ring(s).  
> If there's a need to have a different set of rings associated with  
> a MAC address, then a different MAC client should be created.

What happens when a single client has multiple rings and multiple
MAC addresses ? How is the mapping done in that case ? Would it
be possible in that case to request a 1-to-1 mapping and reserve
a ring for each address ?


>>      Q2.7) How are the multiple addresses per client maintained, is it
>>           done in the MAC layer or does it bypass the MAC layer and get
>>           passed to h/w directly.
>
> Since the action of reserving the MAC address is triggered by a  
> call to the MAC layer, the MAC layer cannot be bypassed. The MAC  
> layer will use the multiple MAC address capability exposed by the  
> driver to reserve a new MAC address slot.

What if the driver does not expose that capability ? Will the unicast_add
call fail ? Is the MAC layer essentially reflecting the capability of
the underlying hardware or does it provide the ability to have multiple
addresses irrespective of whether the HW has multiple slots or not ?

>>
>>      Q2.8) Can unlimited number of mac addresses be assigned to a MAC
>>           client? What are the software/hardware features that limit  
>> this?
>
> Memory that can be allocated by the kernel.

So even if the underlying device runs out of slots, the MAC layer will
maintain all the addresses associated with that client ? How does it then
manage and associate these addresses with the rings allocated for
this client ? What does it do in both software and hardware to
filter the addresses for this client ? Also, which addresses get HW slots
and which don't ? And if you run out of slots, does the HW go into
promisc mode ?

>>
>> 3) Rings related:
>>      (Crossbow-virt.pdf  Section 5.3 Pg 43)
>>        mac_rint_t *mac_rings_list_get(mac_client_handle_t mch,
>>                uint_t nrings);
>>        void mac_rings_list_free(mac_rings_t *rings, uint_t nrings);
>>        uint16_t mac_ring_get_flags(mac_ring_t ring);
>>
>>
>>     QUESTIONS:
>>
>>      Q3.1) All of these interfaces are now categorized as project- 
>> private
>>            API. What motivated this change. These interfaces need to be
>>               more open.
>
> The MAC layer will do the allocation of hardware resources to the  
> various MAC clients and their flows. Instead of having each MAC  
> client manage its own set of resources, the resources are allocated  
> to MAC clients based on their needs, for example the degree of  
> parallelism expressed through mac_client_open(). If you have  
> specific functional requirements that are not satisfied by the  
> current document, please list them.

Currently rings are hidden resources entirely managed by the MAC layer
and clients have no visibility. All the client gets to do is request a
degree of parallelism. Providing APIs that allow clients to see how
rings were allocated would be useful.


>
>>         Q3.2) The mac_rings_list_get() is only  for h/w rings, is  
>> there
>>           an equivalent interface to obtain s/w ring information.
>>           Or this interface can be extended return both h/w ring
>>           or s/w ring information.
>
> The interface will evolve to provide that information, but it will  
> remain project private. It is provided here FYI but will change in  
> future revisions of the document.

So the expectation is that the ring APIs should not be used by clients,
and that rings are only an internal resource managed by the MAC layer ?

>
>>      Q3.3) Are the mac_resource_set() and mac_resources() interfaces
>>           going away?        
>
> Yes, they will be replaced by different interfaces. But note that  
> they are already project private in Nevada and were not supposed to  
> be used by other ON components.

Agreed, but the new Crossbow API offers no other way
to take advantage of multiple rings. There is one generic RX callback,
but no way to associate a callback with a specific ring. This
is a limitation of the existing API, and support should be added to
the open API list so that we can process traffic independently. Having
some additional APIs to expose this would be very useful.


>>      Q3.4) What is the action taken when no free h/w ring available.
>>           As per the documentation of mac_rings_list_get(), if no h/w
>>           ring available, it returns NULL. In such case, how does
>>           mac_unicast_add() behave when NULL is passed for rings?
>
> mac_unicast_add() no longer takes rings. This will be handled  
> transparently to the MAC clients by using a default ring and  
> falling back to software classification.

So when there are multiple rings -- and these rings are associated
with all the addresses - there is no pre-defined mapping ? That can
be inefficient as the hardware has the ability to associate each ring
with an address. Can we extend the MAC apis to allow a client to choose
between the default behavior and binding addresses to rings ?

>>      Q3.5) Are there any interfaces other than the above mac_rings_xxx
>>           interfaces that are available to deal with MAC rings?
>
> Not available to MAC clients. The set of project private interfaces  
> might evolve as we refine the design.

I would like to see some of the functionality exported via these
private MAC interfaces promoted to an open API. Even if the API
cannot be moved over, can we extend the APIs to provide hints ?

>
>>      Q3.6) Is the mac_rings_list_get() returns the list of mac rings
>>           assigned to the client at the time of client open. How can
>>           this be changed after the client is open.
>
> The set of assigned rings may change. The details on the APIs  
> needed to support this still need to be defined, but they will  
> remain project private.

So you are saying a client cannot rely on how many rings are
available to it ? This can change outside the
client's control ? Is the removal of CPUs from the system one case
in which this will happen ?

>>      Q3.7) Assigning h/w rings to a specific MAC address limits the
>>           bandwidth to the number of rings that are assigned to that
>>           address. Is there a way to not to bind h/w rings specific
>>           to MAC address so that the bandwidth could be used by
>>           any mac client depending on the traffic?
>
> See Q1.3.

Not sure what you mean. Are you suggesting that some mac addresses
will have SW rings and others will be associated to HW rings ?


>>
>> 4) Receive callback related:
>>      (Crossbow-virt.pdf  Section 5.2.5 Pg 40)
>>      int mac_rx_set(mac_client_handle_t mch, mac_rx_fn_t rx_fn,
>>          void *arg);
>>      int mac_rx_clear(mac_client_handle_t mch);
>>
>>     QUESTIONS:
>>
>>      Q4.1) How can a client get rx callback  per ring that is  
>> assigned
>>           to the mac client? This will allow parallel processing
>>           and improve the performance. Such a feature is already
>>           being used in the current implementation of LDoms vSwitch
>>           driver and the mac_xxx interfaces should support such an
>>              ability.
>
> The parallel processing will still happen. I.e. if multiple  
> hardware rings or software rings are assigned to a MAC clients,  
> multiple connections associated with that MAC client will be spread  
> across these rings.

So with multiple rings, there will be concurrent callbacks to the
rx_fn, each with packets in the corresponding ring ? Also, will each
callback be able to determine which ring it came from ?
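
What we are after looks roughly like the following. This is a sketch only: the types and function names are hypothetical and are not part of mac_rx_set() or any documented interface. It shows a receive function registered per ring, with the ring identifier handed to the callback so that each ring can be processed independently.

```c
#include <assert.h>
#include <stddef.h>

#define	MAX_RINGS	8	/* arbitrary limit for the sketch */

/* Hypothetical per-ring receive callback: gets the ring it fired for. */
typedef void (*ring_rx_fn_t)(void *arg, unsigned ring_id, unsigned npkts);

struct ring_rx_table {
	ring_rx_fn_t fn[MAX_RINGS];
	void *arg[MAX_RINGS];
};

/* Register fn/arg for one ring. Returns 0, or -1 for a bad ring id. */
int
ring_rx_set(struct ring_rx_table *t, unsigned ring_id, ring_rx_fn_t fn,
    void *arg)
{
	assert(t != NULL);
	if (ring_id >= MAX_RINGS)
		return (-1);
	t->fn[ring_id] = fn;
	t->arg[ring_id] = arg;
	return (0);
}

/* Deliver npkts packets from ring_id to the callback registered for it. */
int
ring_rx_deliver(const struct ring_rx_table *t, unsigned ring_id,
    unsigned npkts)
{
	assert(t != NULL);
	if (ring_id >= MAX_RINGS || t->fn[ring_id] == NULL)
		return (-1);
	t->fn[ring_id](t->arg[ring_id], ring_id, npkts);
	return (0);
}

/* Example callback: arg points to an array of per-ring packet counters. */
void
count_rx(void *arg, unsigned ring_id, unsigned npkts)
{
	unsigned *counts = arg;

	counts[ring_id] += npkts;
}
```

Even if per-ring registration is not acceptable, passing the ring identifier to the single generic rx_fn would already let a client such as vSwitch keep per-ring state.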

>
>>      Q4.2) How can a client get a separate callback for a defined type of
>>           traffic, such as different SAP numbers etc. This will
>>           be useful to provide out of the band type packet processing
>>              or related services.
>
> This will be supported by a MAC flow API built on top of the MAC  
> client API. The flow API will be described by a separate document.

So if a client wants to use the flow API, will it need to layer itself on
the flow API and not on the MAC client API directly ? Can you give me more
information on what this layering will look like ? Also, when do you
expect the flow API doc to be available ?

<..snip..>

>>
>> 5) Transmit related:
>>
>>      (Crossbow-virt.pdf  Section 5.2.7 Pg 41)
>>      mblk_t *mac_tx(mac_client_handle_t *mch, mblk_t *mp, uint64_t hint);
>>
>>    QUESTIONS:
>>
>>      Q5.1) What are the valid values for the 'hint' argument?
>>            From the description on pg 42, NULL seems to be
>>            a valid value. Is it safe to assume that the 'hint' is a
>>               ring-id, if so, a NULL value of 0 will conflict with a
>>               ring-id of 0.
>
> The hint can be any 64 bit value, but it must always be the same  
> value for the packets corresponding to the same connection to avoid  
> reordering. TCP and UDP for example pass the connection pointer as  
> the hint, which allows us to avoid packet inspection for these  
> protocols.

Can you clarify what the hint is being used for ? Is it similar
to the case below, where a hash will be applied to the hint value
to pick a TX ring ?

>>
>>      Q5.2) If NULL specified as a 'hint', how is the tx ring
>>            selected?
>
> In this case mac_tx() will parse the packet headers and hash on the  
> header information to select a transmit ring.
>

Is the goal here to somehow bifurcate traffic being sent by a client
via the interface ? The algorithm is pre-determined by the MAC layer,
and either the hint or a hash of the packet headers will be used to
determine the Tx ring for transmit ? Is it possible for a client to request
a specific ring, or is the only way to do this to pick a unique hint ?
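
My current understanding of the hint, written down as a sketch so you can correct it. This is an assumption on my part, not something the document states: the hint is hashed and reduced modulo the number of Tx rings. The mixing step is the well-known 64-bit finalizer from MurmurHash3; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit finalizer from MurmurHash3; spreads nearby hints apart. */
static uint64_t
mix64(uint64_t x)
{
	x ^= x >> 33;
	x *= UINT64_C(0xff51afd7ed558ccd);
	x ^= x >> 33;
	x *= UINT64_C(0xc4ceb9fe1a85ec53);
	x ^= x >> 33;
	return (x);
}

/*
 * Map a 64-bit hint to a Tx ring index in [0, nrings).
 * The same hint always maps to the same ring, which is what keeps
 * packets of one connection in order.
 */
unsigned
tx_ring_for_hint(uint64_t hint, unsigned nrings)
{
	assert(nrings > 0);
	return ((unsigned)(mix64(hint) % nrings));
}
```

If this is close to what mac_tx() does, then a unique hint only pins a flow to some ring; it does not let the client choose which one, hence the question above.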

>>
>>      Q5.3) The 'hint' argument description says the following.
>>                    What is the meaning of a connection in this context and
>>            how to identify this?
>>
>>            "The hint must be the same for packets of the same  
>> connection."
>
> It can be a TCP connection for example. This is required to avoid  
> reordering of packets for the same connection.

OK ..

>
>> 6) Multicast addresses related:
>>      (Crossbow-virt.pdf  Section 5.2.6 Pg 41)
>>      int mac_multicast_add(mac_client_handle_t mch, const uint8_t  
>> *addr);
>>      int mac_multicast_remove(mac_client_handle_t mch, const uint8_t  
>> *addr);
>>
>>
>>         No comments at this point.
>>
>> 7) Promiscuous mode related:
>>
>>      (Crossbow-virt.pdf  Section 5.2.8 Pg 42)
>>      It's not clear if the above interface will be available or not,
>>      but two new interfaces are added:
>>
>>      int mac_promisc_add(mac_client_handle_t mch, mac_promisc_type
>>          promisc_type, mac_promisc_fn_t promisc_fn, void *arg,
>>          mac_promisc_handle_t *php);
>>      int mac_promisc_remove(mac_client_handle_t mch,
>>          mac_promisc_handle_t *ph);
>>
>>              MAC_PROMISC_ALL - send all packets
>>              MAC_PROMISC_MULTI - only broadcast and multicast
>>
>>      Maybe mac_promisc_add(MAC_PROMISC_ALL) will force the device
>>      to operate in promiscuous mode.
>
> Both need to, since the device needs to be in promiscuous mode also  
> to receive all multicast traffic.

What do you mean by "Both need to" ? In addition to the above interface
to enable promisc mode, will the existing promisc_set() interface
be removed ? Also, PROMISC_ALL is unicast+multicast, whereas MULTI
is only all multicast traffic ?

>
>>
>>     QUESTIONS:
>>
>>      Q7.1) According to the section 4.6, the promiscuous mode  
>> operates
>>           in the layer2 switch model. When choosing the promiscuous mode
>>              model can it be either layer2 switch model or shared
>>           ethernet model?
>

Comments ?


>>      Q7.2) From the explanation of mac_promisc_add(), it seems like
>>           the mac_promisc_add() could be called without setting
>>           MAC address via mac_unicast_add(). Is this correct?
>>           If so, what is the expected behaviour?
>
> Currently we provide the same semantics as a switched environment,  
> i.e. a MAC client will see the same traffic that would be seen by a  
> NIC connected to a switch.
>

Is there a way to see only the multicast traffic associated with all MAC
clients, i.e. the union of all mac_client multicast_add addresses ? The MULTI
promisc option seems more a way to weed out unicast and broadcast traffic
on the wire and pass all wire multicast traffic up, including traffic the
system may not be interested in ? Is this the case ?

> What we would also like to provide is the ability to for a MAC  
> client to obtain all the traffic going in and out of the box, as  
> well as the traffic exchanged between MAC clients. The non-unicast  
> address was part of that solution.
>

OK ..

> Another option would be to generalize this with the shared ethernet  
> model, and allow a MAC client to specify that it wants to observe  
> all traffic via a separate promiscuous type. I need to see how this  
> can be added to the API.

This would be very useful. How about something like MAC_PROMISC_CLIENTS ?

>
>>
>> 8) Statistics related:
>>
>>              Q8.1) Is the mac_stat_get() interface being obsoleted or  
>> changed?
>>            If so, what is the new equivalent interface?
>
> Yes, there will be a new MAC client interface. The MAC layer will  
> also maintain per-MAC client statistics for MAC client specific  
> statistics such as number of packets sent/received, etc. I need to  
> add that interface to the document.

OK -- when will this be available ? In the next doc update ? Until then,
should we continue to use the mac_stat_get() interface ?

>> GENERAL QUESTIONS:
>> ==================
>>
>>      Qg.1) Are there any GLDv3 MAC client interfaces that are being
>>            obsoleted(provided by the Nemo framework) but not documented
>>            in this doc?
>
> The MAC client interface was project private, and most of the  
> interface is being completely revamped by Crossbow. The set of MAC  
> client API available to ON consolidation components is described by  
> section 5.2 of the document. Any other MAC client API are still  
> project private.

So all interfaces that were part of GLDv3 will be replaced with the
interfaces specified here. Anything that is not specified here should
not be used moving forward ?


>>      Qg.2) Are there any changes to the MAC driver interfaces or being
>>           obsoleted?
>
> The changes made to the driver API will be published as part of a  
> separate forthcoming document.

How soon will this doc become available ? I see that the latest doc
Kais published has some new interfaces for ring support. I presume
there will be a separate doc for the generic MAC driver interfaces ?

>>
>>      Qg.3) There are no MAC client interfaces to specify bandwidth
>>           attributes. From the section 4.7, it seems like they are
>>           implemented as part of VNIC and not as MAC client interfaces.
>>           If this is the case, how can the bandwidth attributes be
>>              specified?
>
> They are not documented yet, but will be specified as arguments to  
> mac_client_open().

Can we expect to see this in the next revision of this document ?

>
>>
>>      Qg.4) When will the classification interface be fully documented
>>            for review?
>
> There will be separate documents for the MAC driver classification  
> interfaces, and for the MAC client flow APIs.

When will this be available ?

Thanks
-Narayan
