On 02/22/2010 09:32 AM, Or Gerlitz wrote:
Mike Christie wrote:
2. I wasn't sure if there is and if yes what is the transport role in
detecting session failure.

It varies from transport to transport.
For iscsi_tcp we do not really have a nice way to figure out if
someone just tripped over a cable, so that is where the nop comes from.
We can tell if the tcp state changes and so you can see
iscsi_tcp_state_change notify the upper layers of a problem for that.

Understood. Still, the noop-out based watchdog serves all transports, correct?


I'd like to narrow down things and understand if/what is the transport role:

For the nop out path, the transport just has to send/recv the nop pdu/response.
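
To make the nop-out watchdog idea concrete, here is a toy model in Python. It is only a sketch and not the libiscsi code: the class name NopWatchdog, its methods, and the injected clock are all made up for illustration. It assumes a ping is sent once the connection has been idle for `interval` seconds, and that the connection is declared failed if nothing comes back within `timeout` seconds of that ping.

```python
class NopWatchdog:
    """Toy model of a nop-out keepalive (hypothetical, not libiscsi)."""

    def __init__(self, interval, timeout, on_failure):
        self.interval = interval      # idle seconds before a nop-out is sent
        self.timeout = timeout        # seconds to wait for the nop-in reply
        self.on_failure = on_failure  # stands in for a conn failure report
        self.last_recv = 0            # time the last PDU was received
        self.ping_sent_at = None      # time of the outstanding nop-out, if any
        self.failed = False

    def pdu_received(self, now):
        # Any received PDU proves the path is alive and clears the ping.
        self.last_recv = now
        self.ping_sent_at = None

    def tick(self, now, send_nop):
        # Called periodically; `send_nop` is the transport's only job here.
        if self.failed:
            return
        if self.ping_sent_at is not None:
            if now - self.ping_sent_at >= self.timeout:
                self.failed = True
                self.on_failure(now)
            return
        if now - self.last_recv >= self.interval:
            send_nop()
            self.ping_sent_at = now
```

With interval=5 and timeout=5 and a peer that never answers, calling tick() once per second sends the ping at t=5 and reports the failure at t=10. The point of the model is that the transport only supplies send_nop and the receive notification; the timing decision sits above it.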

Some iscsi drivers will run iscsi_conn/session_failure when they
discover a link down event or someone doing ifdown. I thought this is
sort of what you are able to do with iser_cma_handler->iser_disconnected_handler
or with the call to iscsi_conn_failure in iser_handle_comp_error.

Yes, we call iscsi_conn/session_failure, but I wasn't really sure whether
multipathing works for non-tcp transports that never make these calls, or
whether they have to make them.

They do not have to make those calls for multipath to work. Multipath will work better if the transport can signal when there is a problem, because we can stop using a bad path and get IO going to a working path faster. If the transport does nothing then we have to rely on the scsi error handler/timeout to detect the problem and that is very slow.

If there are other places you can detect a link failure type of problem
you would want to call iscsi_conn_failure, so the iscsi layer can begin
trying to recover the connection and let dm-multipath know there is a problem.

I understand that once there's timeout on the noop out watch-dog, the iscsi 
will call ep_disconnect, correct? currently our ep_disconnect is sometime too 

Yes. You should also change your ep_disconnect because it is not supposed to block (did we talk about this, or was that just bnx2i?). A blocking ep_disconnect will stop iscsid from processing other events.

and I can change that. But I still wasn't sure whether, for iscsi to let
dm-multipath know that there is a failure, something is needed on the
transport side or not...

I do not think there is anything special. The transport should handle an error like it would if multipath was not used. The user will set the iscsi timers like replacement_timeout and nop timeout differently if they are using multipath.

I do see that there's an shost param to ep_connect, is there a way it
can give me a hint on the source IP?

I do not think it can help iser as it is today. Remember when we talked
about a shost per some physical/virtual resource vs a shost per session.
This is another place where that came in. bnx2i, cxgb3i and be2iscsi
allocate a host per port/netdev, so that is how they know the src they should 
be using.
I will have to think about how to do it for iser as it is today with the host 
per session

how about extending the ep_connect user/netlink/kernel/iscsi_transport 
framework to support
the functionality provided by the user space code of bind_conn_to_iface or 
basically, since the connection establishment framework is IP based, I would
like to just get some source ip in the kernel when ep_connect is called. I saw the
comment on why bind_src_by_address is problematic, but this doesn't apply to 

Which comment are you talking about? Are you talking about bind() not doing what you would want for iscsi_tcp (target sometimes sends data to the wrong port) or are you talking about if you were to use DHCP and so the IPs could change over boots?

A question for you. Some people do not like using the netdev name
for the binding since it can change between boots. The default method is
to use iface.hwaddress instead of iface.net_ifacename. For iscsi it is
just the MAC. For iser how big is the RNIC's equivalent of the MAC?

iser is working now over IB and at some point we'll make it work also over 

With IB, the RNIC is an IPoIB NIC whose HW address (equiv of MAC) is 20 bytes long.
It turns out that some of these 20 bytes may change... the part which is burned

So is there anything in there that is static and can be used to identify the port?

is called GUID and is 8 bytes long, here you see two IPoIB NICs, ib0 and ib1 
and the
port GUIDs they are using are 00:02:c9:03:00:02:6b:df and 

7: ib0:<BROADCAST,MULTICAST,UP,LOWER_UP>  mtu 2044 qdisc pfifo_fast qlen 256
     link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:df
8: ib1:<BROADCAST,MULTICAST>  mtu 2044 qdisc pfifo_fast qlen 256
     link/infiniband 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e0

If you are really interested to learn how these 20 bytes are composed, it's in
the form of flags:QPN:GID (1:3:16 bytes), where GID is of the form PREFIX:GUID
(8:8 bytes). Do wget http://ietf.org/rfc/rfc4391.txt and see section
"9.1.1. Link-Layer Address/Hardware Address".
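
As an illustration of the flags:QPN:GID layout described above, here is a small Python sketch that splits a 20-byte IPoIB hardware address into its fields. The function name parse_ipoib_hwaddr and the dictionary keys are made up for this example; only the 1:3:16 and 8:8 byte split comes from the description above.

```python
def _fmt(data):
    # Render bytes as colon-separated lowercase hex, like `ip address show`.
    return ":".join(f"{b:02x}" for b in data)


def parse_ipoib_hwaddr(text):
    """Split a 20-byte IPoIB link-layer address into flags, QPN, prefix, GUID."""
    octets = bytes(int(part, 16) for part in text.strip().split(":"))
    if len(octets) != 20:
        raise ValueError(f"expected 20 bytes, got {len(octets)}")
    gid = octets[4:20]                               # 16-byte GID
    return {
        "flags": octets[0],                          # 1 byte
        "qpn": int.from_bytes(octets[1:4], "big"),   # 3 bytes
        "prefix": _fmt(gid[:8]),                     # first 8 bytes of the GID
        "guid": _fmt(gid[8:]),                       # last 8 bytes: port GUID
    }


ib0 = parse_ipoib_hwaddr(
    "80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:df")
print(ib0["guid"])  # 00:02:c9:03:00:02:6b:df
```

For the ib0 address shown above this yields QPN 0x48 and the same port GUID quoted earlier, which is the part one would want to match on if the rest of the address can change.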

Note that the ifconfig output is buggy so you should use $ ip address show
anyway, I wasn't really sure if/how the iface binding by hw address is working
in open iscsi, specifically, I wasn't able to track which library exports
net_get_netdev_from_hwaddress ... but I am quite sure this (binding iface to hw

That iscsi code actually uses the same sys/lib calls as ifconfig.

address and not netdev) works well for iscsi-tcp and offloads, correct?

Bind by hw address or netdev works with bnx2i and cxgb3i, because they are tied to the netdev and export both values.

be2iscsi and qla4xxx use bind by hwaddress, because they have no interaction with the network subsystem, so they only have the hw address.
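
For readers who want to see what a hwaddress-to-netdev lookup amounts to, here is a rough Python sketch. Note that it is a different technique from what open-iscsi does: it reads the per-device `address` files under /sys/class/net rather than using the calls ifconfig uses. The function name netdev_from_hwaddress and the sysfs_root parameter are invented for this example.

```python
import os


def netdev_from_hwaddress(hwaddress, sysfs_root="/sys/class/net"):
    """Return the name of the netdev whose hardware address matches, or None."""
    wanted = hwaddress.strip().lower()
    for name in sorted(os.listdir(sysfs_root)):
        path = os.path.join(sysfs_root, name, "address")
        try:
            with open(path) as f:
                if f.read().strip().lower() == wanted:
                    return name
        except OSError:
            # Entry without a readable address file; skip it.
            continue
    return None
```

Something like netdev_from_hwaddress("00:11:22:33:44:55") would return a name such as "eth0" if that address is present and None otherwise. This only helps drivers that are tied to a netdev; as said above, be2iscsi and qla4xxx have nothing in the network subsystem to look up.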
