I've been doing some digging into a reported scenario where a node
update fails with an "iSCSI database failure" reported.

The short of it is that it's possible to create node records for a
target portal in both the old-style portal configuration file format
and the new-style configuration directory with tpgt.  If both exist
when an update command runs, a database failure may occur, depending
on the unspecified order in which readdir returns entries.

If the old-style record is created first, then the update/rewrite logic
will remove it when a request to create a node with a specified tpgt
comes in.  But if the new-style record exists first, you can still
create a static old-style record without tpgt.

On an update, when walking through the files in the node database, if
readdir returns the new-style record first, this can occur: the
new-style entry is read and passed to idbm_node_set_param and then
idbm_rec_write.  In idbm_rec_write the existence of the old-style file
is detected, and it's removed in order to update to the new-style.  We
have now gone from two node records to one.

But because readdir can cache directory contents from the last
opendir/rewinddir call (in idbm_for_each_portal), it will still return
the path to the now-deleted old-style file.  Attempting to read that in
(in idbm_for_each_iface) will result in a failed stat call and the
"iSCSI database failure."
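
The failure mode can be shown outside of open-iscsi.  The following is
only a minimal sketch, not the idbm code: walk_tolerant, entry_fn,
demo_state and demo_walk are hypothetical names made up for
illustration.  A callback removes a sibling file in the middle of a
readdir loop, and the loop treats ENOENT from stat as "entry vanished"
rather than as an error.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical callback type; a non-zero return aborts the walk. */
typedef int (*entry_fn)(const char *path, void *data);

/*
 * Call fn on every entry of dir_path that still exists.  An entry that
 * disappears between readdir() and stat() is counted in *skipped
 * instead of being treated as a failure.  Returns 0 on success, -1 on
 * any other error, or the first non-zero value returned by fn.
 */
int walk_tolerant(const char *dir_path, entry_fn fn, void *data, int *skipped)
{
	char path[4096];
	struct dirent *de;
	struct stat st;
	DIR *dir;
	int n, rc = 0;

	assert(dir_path && fn && skipped);
	*skipped = 0;
	dir = opendir(dir_path);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		n = snprintf(path, sizeof(path), "%s/%s", dir_path, de->d_name);
		if (n < 0 || (size_t)n >= sizeof(path)) {
			rc = -1;
			break;
		}
		if (stat(path, &st) < 0) {
			if (errno == ENOENT) {	/* removed during the walk */
				(*skipped)++;
				continue;
			}
			rc = -1;
			break;
		}
		rc = fn(path, data);
		if (rc)
			break;
	}
	closedir(dir);
	return rc;
}

/* Demo state: two sibling files standing in for the two records. */
struct demo_state {
	char a[64], b[64];
	int visited;
};

/* The first entry visited removes its sibling, much as rewriting the
 * new-style record removes the old-style file. */
static int visit_and_remove_sibling(const char *path, void *data)
{
	struct demo_state *s = data;

	if (++s->visited == 1)
		unlink(strcmp(path, s->a) ? s->a : s->b);
	return 0;
}

/* Build a scratch directory, run the walk, and report the counts. */
int demo_walk(int *visited, int *skipped)
{
	char dir[] = "/tmp/idbm-walk-XXXXXX";
	struct demo_state s = { .visited = 0 };
	FILE *f;
	int rc;

	if (!mkdtemp(dir))
		return -1;
	snprintf(s.a, sizeof(s.a), "%s/old", dir);
	snprintf(s.b, sizeof(s.b), "%s/new", dir);
	if ((f = fopen(s.a, "w")))
		fclose(f);
	if ((f = fopen(s.b, "w")))
		fclose(f);

	rc = walk_tolerant(dir, visit_and_remove_sibling, &s, skipped);
	*visited = s.visited;
	unlink(s.a);
	unlink(s.b);
	rmdir(dir);
	return rc;
}
```

Exactly one entry is visited either way.  Whether the removed sibling
is still handed back by readdir (and so counted as skipped) depends on
the C library and filesystem, which is the same unspecified behavior
that makes the reported failure intermittent.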

The sequence of calls in the report I received seems to be: SendTargets
discovery on a portal, then for some reason additional node --op=new
commands specifying the target name and portal address (without tpgt)
of what was discovered, and finally an update to change node.startup.
To reproduce this reliably, I've been using LD_PRELOAD with Ted Ts'o's
spd_readdir.c code to cache and sort the directory stream contents.

OK, arguably that may be a case of "if it hurts when you do that, then
stop doing that."  But it's easy, subtle, and weird enough that it might
be worth addressing.  It's just not clear how to best address it.

The idbm code could be more forgiving of stat failures in these loops,
either ignoring them or restarting the command if a filesystem change
like this happens.  Alternatively, when asked to create a static record
without a tpgt, it could check for the existence of a matching new-style
entry and convert that to a static record, preserving the tpgt, instead
of adding a new static record without one.
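
The existence check might look roughly like the sketch below.  It is
not taken from idbm.c: find_tpgt_record is a hypothetical helper, and
the layout it assumes (new-style records stored as directories named
"<address>,<port>,<tpgt>" inside the per-target directory) should be
checked against the real code before relying on it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Look in target_dir for a directory named "<portal>,<tpgt>", where
 * portal is "<address>,<port>" and <tpgt> is a decimal number.
 * Returns 1 and stores the tag in *tpgt if one is found, 0 if none
 * exists, and -1 if target_dir cannot be opened.
 * ASSUMPTION: this naming scheme is illustrative, not confirmed.
 */
int find_tpgt_record(const char *target_dir, const char *portal, long *tpgt)
{
	char path[4096];
	struct dirent *de;
	struct stat st;
	const char *tail;
	char *end;
	size_t plen;
	long val;
	DIR *dir;
	int n, found = 0;

	assert(target_dir && portal && tpgt);
	plen = strlen(portal);
	dir = opendir(target_dir);
	if (!dir)
		return -1;

	while (!found && (de = readdir(dir)) != NULL) {
		/* Name must be the portal followed by ",<digits>". */
		if (strncmp(de->d_name, portal, plen) != 0 ||
		    de->d_name[plen] != ',')
			continue;
		tail = de->d_name + plen + 1;
		if (!isdigit((unsigned char)*tail))
			continue;
		val = strtol(tail, &end, 10);
		if (*end != '\0')
			continue;

		/* Only a directory counts as a new-style record. */
		n = snprintf(path, sizeof(path), "%s/%s", target_dir,
			     de->d_name);
		if (n < 0 || (size_t)n >= sizeof(path))
			continue;
		if (stat(path, &st) == 0 && S_ISDIR(st.st_mode)) {
			*tpgt = val;
			found = 1;
		}
	}
	closedir(dir);
	return found;
}
```

A caller handling node --op=new without a tpgt could run this first
and, on a hit, update the existing record instead of writing an
old-style file next to it.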

I'm slightly confused by the entire point of having the tpgt be in the
pathname (and therefore part of a unique record key), given that it
allows the same portal to have multiple tags under a single target,
which seems wrong.

Anyway, end of long description.  I'm looking at possible improvements,
and I'd love any other opinions on this one.

- Chris
