Hi Narayan,

On Oct 17, 2007, at 5:53 PM, Narayan Venkat wrote:

> Hi Nicolas
>
> Thanks for the response ..
>
>> In general it seems that the issues being discussed are related to
>> either (1) the port of the existing LDOM functionality to the
>> updated MAC client interfaces introduced by Crossbow, or (2) the
>> addition of new LDOM functionality based on these new APIs.
>>
>
> The primary goal for us is to address the first case - the port of the
> existing LDom functionality already released to customers.

Yes, these are the issues that we need to address for the initial  
putback.

> But there
> is new LDoms functionality on the horizon that must be addressed as
> well.
> The new features are expected to show up in the same time frame as
> Crossbow.

If there's additional functionality needed by LDOMs that has an impact
on Crossbow, then the functional and schedule requirements for these
new features need to be clearly communicated to us. If additional work
is needed on Crossbow, then we need to discuss the vehicle for these
changes; the potential impact of that work on the existing Crossbow
schedule, other existing projects, and future funded projects needs to
be evaluated and tracked, and the effort needs to be staffed. Unless
you give us that information we can't plan for that work, and we
cannot get into detailed design discussions before the functional
requirements are clearly communicated.

>> So given my latest response below, what are in your opinion the
>> remaining issues which are directly related to the port of the
>> existing LDOM functionality to the new MAC API?
>>
>
> Quick summary of the issues that still need resolving for the port:
>
> - How does the ring-to-CPU mapping change when DR happens

I think this was answered at length below. Do you have any remaining  
issues on this topic?

> - How multiple MAC addresses assigned to a single client correspond
>      to the rings owned by the client

I don't agree that this is required for the port as part of the  
initial Crossbow putback. From what I could tell LDOMs today doesn't  
allow multiple MAC addresses to be assigned to vnets in the first place.

> - Usage of HW mac addr slots in the NIC and automatic switching
>      to layer-2 filtering and promisc mode.

I think I answered your questions about this point below. Also, today
there's no hardware classification done for LDOMs, whether in
promiscuous mode or not, so I don't understand why you consider this a
requirement for the initial port.

> - Separation of incoming traffic and association with Rx rings

"Association with Rx rings" is a bit vague.

As I described in my previous emails and below, individual
connections will be fanned out to the rings that are members of the
group associated with a MAC client, or the fanout will be done in
software between soft rings.

> - Tx ring allocation and relation to Rx ring and level of
>      parallelism

See Q5.2 below.

<snip>

>>> Also, according to the explanation on page 38 of the doc, there is
>>> also a case where no flags are specified. It seems that, if no flags
>>> are specified, it will attempt to reserve one hardware ring.
>>> It seems not to fail even if such a reservation fails, but this is
>>> not clearly specified.
>>>
>>
>> If you don't specify the ONE_RING flag, and a hardware ring cannot
>> be reserved for the MAC client, then the MAC client will share the
>> default ring with other MAC clients.
>>
>
> In the current design, if there are N rings and a client open
> is done requesting a HW ring, it will get one assigned from the
> N-1 available rings. If it does not request one, it will still be
> mapped to a HW ring (if available), or else be assigned a SW ring
> fanned out from the default HW ring - correct ?

Correct.

<snip>

>>> The above comment applies to this one also. The behavior without any
>>> flags seems to be to attempt to reserve one h/w ring. What is the
>>> failure case ?
>>>
>>
>> Without any flag, we try to allocate N hardware rings. If that
>> fails, then we try to allocate 1 hardware ring and do fanout to N
>> soft rings, if that fails, then we share the default hardware ring,
>> and do fanout to N soft rings.
>>
>
> This is the part that I was missing. So when a client requests N CPUs
> it does get N rings - either all SW or all HW rings. In the case of SW
> rings it might be a fanout from an allocated HW ring or from the
> default HW ring. This is clear from chapter 4.x ..

Correct.
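
To make sure we are reading chapter 4.x the same way, here is a rough
sketch of that fallback in C. This is illustrative only, not the
actual MAC layer code, and the helper functions are hypothetical
placeholders:

    /*
     * Sketch of the RX ring allocation fallback, illustrative only.
     * The helpers below are hypothetical placeholders; the real
     * logic lives inside the MAC layer.
     */
    extern int  reserve_hw_rings(mac_client_handle_t, uint_t);     /* hypothetical */
    extern void fanout_to_soft_rings(mac_client_handle_t, uint_t); /* hypothetical */
    extern void share_default_hw_ring(mac_client_handle_t);        /* hypothetical */

    static void
    rx_ring_setup(mac_client_handle_t mch, uint_t ncpus)
    {
            /* 1. Try to reserve ncpus dedicated hardware rings. */
            if (reserve_hw_rings(mch, ncpus) == 0)
                    return;

            /* 2. Otherwise try a single dedicated hardware ring ... */
            if (reserve_hw_rings(mch, 1) == 0) {
                    /* ... fanned out in software to ncpus soft rings. */
                    fanout_to_soft_rings(mch, ncpus);
                    return;
            }

            /* 3. Last resort: share the default hardware ring ... */
            share_default_hw_ring(mch);
            /* ... still fanned out to ncpus soft rings. */
            fanout_to_soft_rings(mch, ncpus);
    }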

>
> Couple more questions:
>
> - If a NIC has only 1 HW ring, this will essentially become
>    the default HW ring. All other requests are fanned out from this
>    HW ring to the SW rings ?

Yes.

> - If a NIC has N HW rings, only (N-1) rings are available for
>    allotment, 1 is always reserved as the default HW ring ?

Yes.

> - If a NIC has 2 free HW rings, and a client requests 3 rings,
>    the mapping today will be 1 HW ring to 3 SW rings, correct ? This
>    will leave the other HW ring free ?

Yes.

>>>> The first one is correct. If mbc_cpus is non-NULL, the MAC layer
>>>> will assign the CPUs provided by the caller.
>>>>
>>>
>>> When the mbc_cpus is NULL what determines how many CPUs and hence
>>> the number of rings available to this client.
>>>
>>
>> mbc_ncpus.
>>
>
> Can you document this ? It is not clear from the doc that mbc_ncpus
> still controls the ring allotment and the degree of parallelism even
> when mbc_cpus is NULL ..

I will update the document to make it clearer.

> Can we set the ncpus value to a number
> greater than the actual number of CPUs in the domain ? Will the MAC
> layer create the requested number of rings to match ncpus, or will it
> limit the rings to the number of actual CPUs in the domain ?

We use the specified ncpus.
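
Roughly, what I'll add to the doc is along these lines. The field
names follow the current draft; I'm assuming here that mbc_cpus is an
array of processorid_t, and the exact layout may still change:

    /*
     * Illustrative fragment only; field names per the current draft,
     * exact types/layout may still change.
     */
    mac_bind_cpu_t  mbc;
    processorid_t   cpus[4] = { 0, 1, 2, 3 };

    /*
     * Case 1: let the MAC layer pick the CPUs. mbc_ncpus alone
     * determines the degree of parallelism (here 4 rings, HW or SW).
     */
    mbc.mbc_ncpus = 4;
    mbc.mbc_cpus = NULL;

    /*
     * Case 2: bind explicitly to CPUs 0-3. The CPU ids must be
     * unique; duplicates will be treated as an error (see below).
     */
    mbc.mbc_ncpus = 4;
    mbc.mbc_cpus = cpus;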

>>>>>     Q1.6) What is the relationship between Unicast addresses
>>>>> (multiple
>>>>>          unicast set via mac_unicast_add()), Rings and CPUs?
>>>>>           
>>>>>            - Is there a 1:1 relation between a unicast address and a
>>>>> ring?
>>>>>    - Is there a 1:1 relation between a ring and CPU?
>>>>>
>>>>
>>>> Neither. The MAC addresses will share the same rings and CPUs.
>>>>
>>>
>>> But since you are allowing multiple MAC addresses to be associated
>>> with a client, can we add support as part of the unicast_add call
>>> to indicate that each of these addresses should be associated with
>>> a ring (either HW or SW) ?
>>>
>>
>> No, each client is associated with a group of hardware rings or
>> soft rings. Each group or set of rings corresponds to a set of
>> unicast MAC addresses. The bandwidth limits are set on a per MAC
>> client basis. This maps to how hardware NICs do their
>> classification and fanout.
>>
>> If you want a separate set of rings for different MAC addresses,
>> then you create a new MAC client.
>>
>
> If this is the case, what is the real value in being able to assign
> multiple addresses to a client ? Especially when a single client has
> multiple MAC addresses, coalescing the pkts into a single stream
> has less benefit than separating the traffic for each address onto
> its own ring. Since a client has many rings and addresses, being
> able to treat these as something other than a group of addresses
> associated with a group of rings would be useful.

Of course there's value in assigning multiple MAC addresses to a
single client even if you share one or more rings within that client.
If such a MAC client maps to a front-end driver instance in another
domain (vnet in your case, xnf for Xen), that domain can then create
multiple VNICs on top of that front-end driver, assign a MAC address
to each of these VNICs without having to put the underlying hardware
in promiscuous mode, and establish a path between the VNICs on the
domain itself without having to cross the hypervisor boundary.

That said, I don't think anything forces you to have a 1:1 mapping  
between a vnet instance and a MAC client in the service domain. If  
what you are trying to do is have separate sets of rings for the  
clients of a vnet based on their MAC addresses, you could have a vnet  
map to multiple MAC clients in the service domain, with their own MAC  
addresses and separate groups of rings. That vnet instance would then  
register groups of rings corresponding to the MAC clients in the  
service domain. These groups would have their own set of rings, and  
the groups would be assigned by the MAC layer to the MAC clients of  
vnet.

> For instance on N2-NIU, if you assign a RDC group to a mac_client,  
> this
> group can still contain one or more rings. The group also can be
> assigned
> multiple MAC addresses. Will the number of groups limit the mac  
> clients
> that can be created for the specific device ? In that case we will  
> want
> the ability to have traffic from separate MAC addresses spread across
> the rings in this mac_client.

We will create one group per MAC client, as long as hardware  
resources are available. We reserve one group as the default group,  
and software classification will be used on top of the default group  
to spread the traffic to multiple software rings assigned to the MAC  
clients sharing the same group.

The hardware associates multiple MAC addresses per group, and each  
group maps to a MAC client. Once the hardware finds the group  
associated with a MAC address, it spreads traffic across the rings  
assigned to that group according to a computed hash on the inbound  
packet headers.

>>>>>         -  The Rings and CPUs are tightly coupled in this
>>>>> interface.
>>>>>      How can we allocate multiple rings even when there is one CPU
>>>>> (or fewer CPUs) ?
>>>>>
>>>>
>>>> You don't allocate rings explicitly, you express a level of
>>>> parallelism instead, the framework distributes the hardware rings
>>>> transparently.
>>>>
>>>
>>> But the only way we can control this parallelism is by specifying
>>> the number of CPUs in the domain. But in a system capable of adding
>>> and removing CPUs dynamically, we might want to change the
>>> parallelism level too. The current APIs don't allow changing this.
>>> We will need a way to specify this as an extension to client_open
>>> or via a new API call.
>>>
>>
>> So you want an API which allows you to change the actual
>> mac_bind_cpu_t for a client which has been already opened? I think
>> we can do that.
>>
>
> Exactly. As a result it should also allocate more rings to correspond
> to the current set of CPUs. The reverse also should be true. We should
> similarly be able to reduce the mbc_ncpus when CPUs are removed from
> the system.

OK.
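
To be concrete, I'm thinking of something along these lines; the name
and prototype below are purely illustrative and still to be defined:

    /*
     * Purely illustrative -- name and prototype TBD. Rebinds an
     * already-open MAC client to a new CPU set; the MAC layer would
     * grow or shrink the client's ring set to match the new
     * mbc_ncpus, and rebind worker threads/interrupts accordingly.
     */
    extern int mac_client_cpus_rebind(mac_client_handle_t mch,
        const mac_bind_cpu_t *mbc);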


>
> <..snip..>
>
>>> Also, in terms of parallelism, is this specified by the no. of CPUs
>>> or by unique CPU IDs in the array ? What happens if I specify ncpus
>>> where all IDs are the same - do I get ncpus HW rings if they are
>>> available ? Also, can we then change the ring-to-CPU mapping when
>>> more CPUs are added/removed to/from the domain ?
>>>
>>
>> There should be no duplicate CPU ids in the array.
>>
>
> Will this be checked and an error returned by client_open ? Can you
> also note in the doc that this is an error ?

Yes. I will update the doc.

>>>>>     Q1.7) How is the binding of CPUs via mac_bind_cpus_t co-
>>>>> ordinated
>>>>>          with CPU DR (on the platforms that support it)?
>>>>>
>>>>
>>>> The MAC layer will be notified of the removal of the CPU and will
>>>> stop using it for its worker threads and interrupts.
>>>>
>>>
>>> That is purely error handling. We need the ability to use
>>> more CPUs and improve the level of parallelism when CPUs are added.
>>> The reverse is true when CPUs are removed. When the MAC layer is
>>> notified about CPUs going away, does it remove the rings associated
>>> with those CPUs ?
>>>
>>
>> I was not talking specifically about error handling. If the MAC
>> layer bound a ring worker thread or interrupt to a CPU and that CPU
>> is going away, the MAC layer will move that thread or interrupt to
>> a different CPU.
>>
>
> So if a client_open was done with only one CPU in the mbc_cpus array,
> and this CPU is going away, an alternate CPU will be picked in the
> same manner as if the client_open had been done with mbc_cpus=NULL ?

Yes.

>> The API discussed in Q1.6 above would allow a MAC client to
>> increase the number of CPUs if it detects that CPUs were added to
>> the system.
>>
>
> Does this only allow specifying an increased CPU count, or does it
> also allow the client to specify the CPUs to use for the mapping when
> more CPUs are added ?

It would allow the new CPU set to be specified, but we need to figure  
out the details of that API.

>>>>>     Q1.10) Can the mac client interface be extended to support
>>>>> creating
>>>>>            a client based on ether_type? This is required for
>>>>> mac clients
>>>>>      like fiberchannel over ethernet.
>>>>>
>>>>
>>>> No, each MAC client corresponds to a MAC level entity which is
>>>> defined by its MAC address. Multiple ether types can be supported
>>>> on top of a MAC client.
>>>>
>>>
>>> Devices like the Niagara2 NIU allow classification of packets using
>>> parameters like the ether_type. How can a mac_client take advantage
>>> of such a functionality.
>>>
>>
>> The fact that a particular hardware implementation can do
>> classification on a specific header field of a packet doesn't
>> necessarily mean that a MAC client needs to be associated with that
>> field.
>>
>> Today the SAP demultiplexing is done by DLS on top of MAC clients.
>> At some point in the future we may make use of hardware
>> classification to offload that demultiplexing, but that can be done
>> at a level above the MAC layer, maintaining the separation between
>> MAC clients (and what defines them: MAC addresses and VLANs) and
>> SAP demultiplexing.
>>
>
> Agreed, that makes sense for SAP demultiplexing. In the near future,
> opening clients based on ether_type will be important, particularly
> for FCoE. Interfaces shipping this time next year will be supporting
> FCoE, and the Leadville stack will need to open a client based on the
> FCoE ether_type.

The fact that there's a need to do demultiplexing based on SAPs
doesn't necessarily mean that the SAP needs to be associated with the
MAC client directly. I think this falls into the "future projects"
category where the functional requirements need to be clarified.
We'll need to work with the FCoE folks on this. At least our current
design doesn't prevent that demultiplexing from being added to the
MAC layer in the future.

>>>>>           Q2.2) Is there an impact to the
>>>>> multiaddress_capab_t.maddr_add()/
>>>>>                maddr_remove() interfaces? Are these being obsoleted or
>>>>>        going away?
>>>>>
>>>>
>>>> The capability will stay, and the framework will continue to use
>>>> that capability to query and control the allocation of MAC
>>>> address slots. However that interface is not intended to be used
>>>> by drivers which should use the MAC client interfaces instead.
>>>>
>>>
>>> OK.
>>>
>>
>> Since my last reply Kais and Roamer have been working on the design
>> for the new driver interface. Their proposal removes the multiple
>> MAC address capability as it is known today. You should read their
>> design document, which is available at
>> http://www.opensolaris.org/os/project/crossbow/Docs/virtual_resources.pdf
>>
>
> Thanks. I saw the email too and reviewing the doc now ..

Actually, since we're covering this topic, can you clarify for us how
LDOMs obtains the factory MAC addresses from the interfaces? It
doesn't seem that you go through the multiple MAC address capability
to do this. We will change that part of the multiple MAC address
capability as well, so we need to know if you depend on that interface
to obtain the factory MAC addresses.

>>>>>   Q2.3) A system with many domains (aka LDoms) with virtual network
>>>>>         devices requires the use of a large number of layer-2
>>>>>         addresses; this will exhaust the h/w slots available on
>>>>>         most standard NICs. How can a client take advantage of the
>>>>>         layer-2 filtering provided by NICs like N2-NIU/Neptune?
>>>>>         Specifically, this will help in avoiding programming the
>>>>>         device into promiscuous mode, etc. Currently there are no
>>>>>         interfaces that seem to provide such ability.
>>>>>
>>>>
>>>> Yes, this is a situation we are aware of. We've talked on this
>>>> list about having multiple VNICs sharing the same MAC address,
>>>> and identified by their IP address instead. However this needs to
>>>> be scoped and defined further before we can commit to providing
>>>> that functionality.
>>>>
>>>>
>>>
>>> The current APIs only allow adding as many addresses as the
>>> number of slots available. Beyond this, it will put the adapter
>>> in promisc mode. Instead, can you add the capability to specify
>>> when to use a filter and when to take up a slot in the HW ?
>>>
>>
>> Do you mean that you want to be able to specify that a
>> mac_unicast_add() should put the NIC in promiscuous mode even
>> though there are MAC address slots available? What is the use case
>> for this?
>>
>
> No, that does not make any sense. I am not asking for that. The number
> of mac addresses that can be added across all mac clients is  
> restricted
> to the total number of HW slots in the NIC - correct ? If this is not
> the case, does the MAC layer put the card in promisc mode to filter
> the MAC addresses in SW ?

Yes, we put the card in promiscuous mode if we run out of MAC address  
slots.

> In the case of HW that allows layer-2 filtering, is there a way for
> the MAC layer to take advantage of this, instead of putting the NIC
> in promisc mode, especially when we run out of HW slots on the NIC ?

By layer-2 filtering I guess you mean hardware classification. You
still need to put the NIC in promiscuous mode so that it starts
receiving traffic for the MAC addresses which do not fit in the
hardware MAC address slots. In the case of the NIU I believe that the
packets will still be classified to the right groups even if the card
is in promiscuous mode.

>>>>>   Q2.7) How are the multiple addresses per client maintained? Is
>>>>> it done
>>>>>        in the MAC layer, or does it bypass the MAC layer and get
>>>>>        passed to h/w directly?
>>>>>
>>>>
>>>> Since the action of reserving the MAC address is triggered by a
>>>> call to the MAC layer, the MAC layer cannot be bypassed. The MAC
>>>> layer will use the multiple MAC address capability exposed by the
>>>> driver to reserve a new MAC address slot.
>>>>
>>>
>>> What if the driver does not expose that capability. Will the
>>> unicast_add
>>> call fail ? Is the MAC layer essentially reflecting the  
>>> capability of
>>> the underlying hardware or does it provide the ability to have
>>> multiple
>>> addresses irrespective of whether the HW has multiple slots or not ?
>>>
>>
>> The request will still succeed if the number of MAC address slots
>> is exhausted, or if the underlying NIC doesn't support the multiple
>> MAC address capability. However in these cases, the MAC layer will
>> transparently put the NIC in promiscuous mode in order to receive
>> traffic for that new MAC unicast address.
>>
>
> Can we take advantage of other HW capabilities like address
> filtering ? See the comment above wrt Q2.3. Also, there are cases
> where we don't want to switch to promisc mode automatically. Can we
> add a flag to the unicast_add call and get an error instead of the
> automatic switching ?

Yes, I thought I already agreed to add it but forgot to document the  
flag.
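
For the doc, the flag would look something like this; the name and
value below are placeholders that I still need to pick:

    /*
     * Placeholder only -- name and value TBD. When passed to
     * mac_unicast_add(), the call fails (e.g. ENOSPC) if no hardware
     * MAC address slot is available, instead of transparently putting
     * the NIC in promiscuous mode.
     */
    #define MAC_UNICAST_NO_AUTO_PROMISC     0x0010  /* hypothetical */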

> Since there is an API
> for forcing promisc mode, we can explicitly request to switch to  
> promisc
> mode using that API when needed.

You lost me here. There's no separate API to force promisc mode.
There's an API to add promiscuous callbacks, which is different from
the older API.

>>>>>   Q2.8) Can an unlimited number of mac addresses be assigned to a
>>>>>        MAC client? What are the software/hardware features that
>>>>>        limit this?
>>>>>
>>>>
>>>> Memory that can be allocated by the kernel.
>>>>
>>>
>>> So even if the underlying device runs out of slots, the MAC layer
>>> will maintain all the addresses associated with that client. How
>>> does it then manage and associate these addresses with the rings
>>> allocated for this client ? What does it do in both software and
>>> hardware to filter the addresses for this client ? Also, which
>>> addresses get HW slots and which don't ? And if you run out of
>>> slots, does the HW go to promisc mode ?
>>>
>>
>> Each MAC client is associated with a group of rings. Each group of
>> rings is therefore associated with a set of MAC addresses. If a
>> client needs to be associated with more than one MAC address, then
>> the corresponding group needs to be associated with the same set of
>> addresses. If the hardware runs out of MAC addresses, then the NIC
>> is put in promiscuous mode. The allocation of slots is on a first
>> come first served basis.
>>
>
> So the HW slots are global across all MAC clients. Since it is FCFS,
> one client can potentially consume all HW slots ? Also, since
> transitioning the NIC to promisc mode has an impact on all clients,
> I think the mac layer should try to do slightly better than FCFS and
> do something like fair-share, so that it does not give one client
> all the HW slots ? Also, add a flag to prevent automatic switching
> to promisc mode.

There are cases where internally we might come up with some
algorithms to distribute the slots fairly across the clients;
however, we want to be able to tune these algorithms as we gain
experience with the framework, and avoid pushing the complexity of
managing these shared resources to the clients.

>>>>> 3) Rings related:
>>>>>   (Crossbow-virt.pdf  Section 5.3 Pg 43)
>>>>>     mac_ring_t *mac_rings_list_get(mac_client_handle_t mch,
>>>>>             uint_t nrings);
>>>>>     void mac_rings_list_free(mac_rings_t *rings, uint_t nrings);
>>>>>     uint16_t mac_ring_get_flags(mac_ring_t ring);
>>>>>
>>>>>
>>>>>     QUESTIONS:
>>>>>
>>>>>           Q3.1) All of these interfaces are now categorized as
>>>>> project-private
>>>>>         API. What motivated this change? These interfaces need to
>>>>>               be more open.
>>>>>
>>>>
>>>> The MAC layer will do the allocation of hardware resources to the
>>>> various MAC clients and their flows. Instead of having each MAC
>>>> client manage its own set of resources, the resources are
>>>> allocated to MAC clients based on their needs, for example the
>>>> degree of parallelism expressed through mac_client_open(). If you
>>>> have specific functional requirements that are not satisfied by
>>>> the current document, please list them.
>>>>
>>>
>>> Currently rings are hidden resources entirely managed by the mac
>>> layer, and clients have no visibility. All the client gets to do is
>>> request a degree of parallelism. Providing APIs that allow clients
>>> to see how rings were allocated would be useful.
>>>
>>
>> Why? What is the functional requirement?
>>
>
> A client otherwise does not know whether its parallelism request is
> met using HW rings or SW rings. HW is obviously better than SW. In
> the case it gets the latter, it might choose to reduce the degree of
> parallelism so that it gets all HW rings. Having said that, since the
> current APIs allow for requesting only HW rings, we can always try HW
> first and then ask for SW rings only if the first request fails and
> the client is OK with SW rings. Some visibility into this in the
> future will positively help with optimizations.

OK, if you can go with the algorithm that you just described for now  
that would be great. We'll look into how we can improve the  
visibility of resource availability in the future, however we won't  
be able to add this feature for our initial putback.

>>>>>         Q3.2) The mac_rings_list_get() is only for h/w rings;
>>>>> is there
>>>>>        an equivalent interface to obtain s/w ring information ?
>>>>>        Or can this interface be extended to return both h/w ring
>>>>>        and s/w ring information ?
>>>>>
>>>>
>>>> The interface will evolve to provide that information, but it
>>>> will remain project private. It is provided here FYI but will
>>>> change in future revisions of the document.
>>>>
>>>
>>> So the expectation is that ring APIs should not be used by clients
>>> and it
>>> is only an internal MAC layer resource managed by it ?
>>>
>>
>> Yes, the MAC layer does the allocation of resources to MAC clients
>> and their flows.
>>
>
> Some visibility into this will help in both perf monitoring and
> policy correction. Instead of looking at this from a single OS
> instance point of view, if we see this from the perspective of
> different OS instances, having more info can help better tune for
> varying traffic loads. Can some of this be made available via some
> type of stats-like interface ?

That's a very complex problem. Even today in the single OS instance  
case we don't fully self-tune according to the workload. Providing  
this type of capability falls outside the scope of our initial  
putback. We need an architecture first before starting to export kstats.

<snip>

>>>>>   Q3.6) Does mac_rings_list_get() return the list of mac rings
>>>>>        assigned to the client at the time of client open ? How can
>>>>>        this be changed after the client is open ?
>>>>>
>>>>
>>>> The set of assigned rings may change. The details on the APIs
>>>> needed to support this still need to be defined, but they will
>>>> remain project private.
>>>>
>>>
>>> So you are saying there is no way to rely on how many rings are
>>> available to a particular client. This will change without the
>>> client's control ? Is CPUs being removed from the system a case
>>> under which this will happen ?
>>>
>>
>> The flags taken by mac_client_open() allow some control by the MAC
>> client, see Q1.3. If the client specified that a given CPU be
>> assigned to the client, we could block DR'ing out that CPU until
>> the MAC client releases it. What is your requirement here?
>>
>
> I don't think you want to block DR. DR of CPUs happens outside the
> scope of the kernel, normally from an external control point like a
> data mgmt center. This control point has little visibility into which
> CPU in a domain is being used by a MAC client. So instead of
> preventing DR from happening, the mac_client should be notified that
> it might lose some of its rings. Alternatively, you can handle this
> the same way as when the client_open is done with mbc_cpus=NULL and
> ncpus > 0: the mac layer can redistribute the rings across the
> remaining CPUs in the system instead of reducing the number of rings
> the client currently has.

Thanks for your input on this; we will rebind the thread to one of
the remaining CPUs.
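
In other words, essentially the behavior you describe. A sketch of
the idea, illustrative only, with all helper names hypothetical:

    /*
     * Sketch of the CPU DR handling described above -- illustrative
     * only, not the actual MAC layer code.
     */
    extern processorid_t ring_bound_cpu(mac_ring_t);              /* hypothetical */
    extern void rebind_ring_worker(mac_ring_t, processorid_t);    /* hypothetical */

    static void
    client_cpu_offline(mac_ring_t *rings, uint_t nrings,
        processorid_t dying, processorid_t *remaining, uint_t nremaining)
    {
            uint_t i;

            /*
             * Keep the same number of rings; only move the worker
             * threads/interrupts bound to the departing CPU onto the
             * remaining CPUs, round-robin.
             */
            for (i = 0; i < nrings; i++) {
                    if (ring_bound_cpu(rings[i]) == dying)
                            rebind_ring_worker(rings[i],
                                remaining[i % nremaining]);
            }
    }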

>>>>>   Q3.7) Assigning h/w rings to a specific MAC address limits the
>>>>>        bandwidth to the number of rings that are assigned to that
>>>>>        address. Is there a way to not bind h/w rings to a specific
>>>>>        MAC address, so that the bandwidth could be used by
>>>>>        any mac client depending on the traffic?
>>>>>
>>>>
>>>> See Q1.3.
>>>>
>>>
>>> Not sure what you mean. Are you suggesting that some mac addresses
>>> will have SW rings and others will be associated to HW rings ?
>>>
>>
>> Between different MAC clients, that's possible. But within the same
>> MAC client, all unicast addresses of that client will share the
>> same group of hardware rings or SRS.
>>
>
> So when packets for a specific address arrive, will they be processed
> by a different ring each time ? So each ring is synonymous with a CPU
> resource and handles whichever packet arrives. It has no affinity to
> specific mac addresses ?

For the rings assigned to a MAC client, yes. Note that the hardware  
is required to use the same RX ring of a group for a connection, in  
order to maintain locality and prevent reordering.

However each MAC client will get its own group of rings, and traffic
for one client will not spill over to the set of rings of another
client.

See also Q1.6 above.

<snip>

>>>>>   Q4.2) How can a client get a separate callback for a defined
>>>>> type of
>>>>>        traffic, such as different SAP numbers etc. This will
>>>>>        be useful to provide out of the band type packet processing
>>>>>              or related services.
>>>>>
>>>>
>>>> This will be supported by a MAC flow API built on top of the MAC
>>>> client API. The flow API will be described by a separate document.
>>>>
>>>
>>> So if a client wants to use the flow API, will it need to layer
>>> itself on the flow API and not the mac client API directly ? Can
>>> you give me more information on what this layering will look like ?
>>> Also, when do you expect the flow API doc to be available ?
>>>
>>
>> The flow API will be an addition to the MAC client API. A MAC
>> client will be able to use that flow API. Such a flow operation
>> would be of the form mac_flow_xxx(mac_client_handle_t mch, <flow
>> description>, <bandwidth properties>, etc). Kais is working on
>> defining that API, I'll let him comment on expected availability.
>>
>
> Thanks -- some of the requirements / comments above are tied to the
> flow API. So clarification on the flow API will help better define the
> requirements.

I would think that this should be the other way around :-) you
provide the functional requirements, and then we can discuss whether
the APIs satisfy these requirements.
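
That said, to help frame that discussion, the rough shape Kais is
looking at is along the lines below; all of the names are placeholders
until his doc is out:

    /*
     * Hypothetical shape of the flow API only -- Kais owns the real
     * definition. A flow is created on top of an existing MAC client;
     * the descriptor carries the classification criteria (SAP,
     * transport, ports, ...) and the properties carry bandwidth
     * limits, CPU bindings, etc.
     */
    typedef struct flow_desc_s   flow_desc_t;        /* placeholder */
    typedef struct flow_props_s  flow_props_t;       /* placeholder */
    typedef void                 *mac_flow_handle_t; /* placeholder */

    extern int mac_flow_add(mac_client_handle_t mch,
        const flow_desc_t *fd, const flow_props_t *fp,
        mac_flow_handle_t *flowp);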

>>>>>   Q5.2) If NULL specified as a 'hint', how is the tx ring
>>>>>         selected?
>>>>>
>>>>
>>>> In this case mac_tx() will parse the packet headers and hash on
>>>> the header information to select a transmit ring.
>>>>
>>>>
>>>
>>> Is the goal here to somehow bifurcate traffic being sent by a client
>>> via the interface ?
>>>
>>
>> The goal is to spread the traffic among the transmit rings assigned
>> to the client while maintaining packet ordering for individual
>> connections, without exposing the details of assignment of transmit
>> rings to MAC clients.
>>
>
> Another related question ..
>
> Are Tx and Rx rings assigned as a pair to a mac client ? Can a client
> have more Tx rings than Rx rings ? What controls this ? Does the ncpus
> parameter control how many Tx rings a client is assigned ?

First we try to allocate ncpus hardware TX rings. If there are fewer
than ncpus available, it's the same algorithm as for receive rings,
i.e. we assign one TX ring to the client. If we run out of TX rings,
then we fall back to a default TX ring. We'll add flags to
mac_client_open() similar to the ones we already have for receive
ring allocation. Note that since the hardware cannot guarantee that
the number of TX rings is always the same as the number of RX rings,
it is not possible to always guarantee that each RX ring will map to
a TX ring.
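
To illustrate only the hint handling described in Q5.2 (this is not
the actual mac_tx() code; the hash helper is a hypothetical
placeholder for the header parsing/hashing that mac_tx() performs):

    /*
     * Sketch of TX ring selection when no hint is given --
     * illustrative only.
     */
    extern uint64_t hash_pkt_headers(const mblk_t *);  /* hypothetical */

    static uint_t
    tx_ring_select(uint64_t hint, const mblk_t *mp, uint_t ntxrings)
    {
            /*
             * No hint from the caller: hash the packet headers so that
             * a given connection always maps to the same TX ring and
             * ordering is preserved.
             */
            if (hint == 0)
                    hint = hash_pkt_headers(mp);

            return ((uint_t)(hint % ntxrings));
    }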

>>>>>   Q7.2) From the explanation of mac_promisc_add(), it seems like
>>>>>        the mac_promisc_add() could be called without setting
>>>>>        MAC address via mac_unicast_add(). Is this correct?
>>>>>        If so, what is the expected behaviour?
>>>>>
>>>>
>>>> Currently we provide the same semantics as a switched
>>>> environment, i.e. a MAC client will see the same traffic that
>>>> would be seen by a NIC connected to a switch.
>>>>
>>>>
>>>
>>> Is there a way to see only the multicast traffic associated with
>>> all mac clients - the union of all mac_client multicast_add
>>> addresses ? The MULTI promisc option seems more a way to weed out
>>> unicast and broadcast traffic on the wire and pass all wire
>>> multicast traffic up - including multicast the system may not be
>>> interested in ? Is this the case ?
>>>
>>
>> These promiscuous flags apply not only to the incoming received
>> traffic but also to the traffic sent by MAC clients of the same
>> underlying MAC. I.e. a MAC client PROMISC_MULTI callback will also
>> see all multicast traffic sent by the other MAC clients defined on
>> top of the same MAC. In order to preserve the semantics that are
>> implemented by a real physical switch, this applies to *all*
>> multicast traffic, not just the multicast groups that were "joined"
>> by the individual MAC clients.
>>
>
> Ok - thanks for the clarification .. Can you add some text to the doc
> also
> to this effect ..

Will do.

<snip>

Thanks,
Nicolas.

-- 
Nicolas Droux - Solaris Core OS - Sun Microsystems, Inc.
droux at sun.com - http://blogs.sun.com/droux



