I have been investigating bug 8130 to determine what the problem is
and whether it should be considered a stopper for the 2009.06 release.
I would like to share my thoughts and observations, since the problem
appears to be partly rooted in the chosen implementation, and
addressing it as a whole at this point would be too risky.

Problem:
--------
In the current implementation, the AI client boot process consists of
several steps. Those relevant to 8130 are:

[1] locating and downloading boot_archive
[2] locating and downloading additional compressed archives (solaris.zlib,
    solarismisc.zlib).

The current implementation requires that [1] and [2] come from the
same AI image. The issue is that in one specific configuration
affecting Sparc clients, a mismatch between [1] and [2] can occur
(boot_archive is taken from a different AI image than the compressed
archives).

For x86, this mismatch doesn't occur, since both locations are specified
in one place (the GRUB menu.lst file) and are always updated together.

For Sparc, the locations are kept separately, and there are scenarios in
which they can currently get out of sync. The location of the boot archive
is specified in the wanboot.conf file (as the 'root_file' option), and the
location of the compressed archives is provided by the DHCP server as the
'RootPath' option.

The mismatch doesn't occur if the Sparc AI client is explicitly associated
with a given install service and AI image by using the 'create-client'
installadm(1M) subcommand. In that case, both the DHCP server and the
wanboot.conf file are configured appropriately:

* a client-specific DHCP macro containing the location of the compressed
  archives is (re)created. It takes precedence over the service-specific
  DHCP macro, which ensures that the client is always provided with the
  correct 'RootPath' information.

* a client-specific wanboot.conf file containing the location of
  boot_archive is (re)created in the
  /etc/netboot/<network_address>/<client_id> directory. Again, it takes
  precedence over wanboot.conf files stored in other locations within
  the /etc/netboot directory.
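
The wanboot.conf precedence rule described above can be sketched in Python
as follows. This is only an illustration, not the actual wanboot lookup
code: the function name find_wanboot_conf and its parameters are
hypothetical, and only the two locations mentioned here (the client-specific
directory and the top-level /etc/netboot) are modelled.

```python
from pathlib import Path
from typing import Optional


def find_wanboot_conf(netboot_root: Path, network: str,
                      client_id: str) -> Optional[Path]:
    """Return the most specific wanboot.conf that exists, or None.

    Hypothetical helper: only the locations named in the text are checked.
    """
    candidates = [
        # client-specific file, written by 'create-client'
        netboot_root / network / client_id / "wanboot.conf",
        # default file, written by 'create-service'
        netboot_root / "wanboot.conf",
    ]
    for path in candidates:
        if path.is_file():
            return path
    return None
```

A client without its own directory falls through to the default file, which
is the case the rest of this mail is concerned with.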

The problematic scenario is a Sparc AI client that is not explicitly
configured with the 'create-client' command. In that case, it is provided
with the boot_archive from the location specified in the
/etc/netboot/wanboot.conf file and with a 'RootPath' option, taken from
the service-specific DHCP macro, pointing to the location of the
compressed archives. Both are configured when the 'create-service'
command is used to create an install service.

The problem is that the /etc/netboot/wanboot.conf file is rewritten each
time a new install service is created, but the service-specific DHCP macro
is assigned to a pool of IP addresses (by calling pntadm(1M)) only when
creation of a new pool is requested (by providing the -i and -c options).

For example, the problem occurs in the following sequence:

[1] first install service is created along with pool of IP addresses
# installadm create-service -n service_1 -i <start_IP> -c <IP_pool_size> \
  -s <ai_iso_image_1> <ai_image_1>

* /etc/netboot/wanboot.conf is created and points to boot_archive in
  <ai_image_1>

* service specific DHCP macro dhcp_macro_service_1 is created with
  'RootPath' pointing to <ai_image_1>

* created IP addresses are assigned to dhcp_macro_service_1 macro
  using pntadm(1M) command

[2] second service is created
# installadm create-service -n service_2 -s <ai_iso_image_2> <ai_image_2>

* /etc/netboot/wanboot.conf is (re)created and points to boot_archive in
  <ai_image_2>

* service specific DHCP macro dhcp_macro_service_2 is created with
  'RootPath' pointing to <ai_image_2>, but not associated with IP
  addresses

Now when a Sparc AI client is booted, it picks up the boot archive from
<ai_image_2> and the compressed archives from <ai_image_1>.

[3] second service is deleted along with AI image
# installadm delete-service -x service_2

* /etc/netboot/wanboot.conf is left untouched and still points to
  boot_archive in the already deleted <ai_image_2>

Now when a Sparc AI client is asked to boot, it fails while trying to
obtain boot_archive.
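
The sequence above can be replayed with a small Python model. It is a toy
sketch of the state described in this mail, not installadm code: the class
InstallServer and all of its attribute and method names are hypothetical,
and DHCP macro cleanup on delete is not modelled because it is not described
here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class InstallServer:
    """Toy model of the server state relevant to bug 8130 (hypothetical)."""
    images: Set[str] = field(default_factory=set)
    services: Dict[str, str] = field(default_factory=dict)  # service -> image
    macros: Dict[str, str] = field(default_factory=dict)    # macro -> RootPath image
    wanboot_image: Optional[str] = None  # image named in /etc/netboot/wanboot.conf
    pool_macro: Optional[str] = None     # macro assigned to the IP pool (pntadm)

    def create_service(self, name: str, image: str,
                       with_ip_pool: bool = False) -> None:
        self.images.add(image)
        self.services[name] = image
        self.macros["dhcp_macro_" + name] = image
        self.wanboot_image = image       # rewritten on every create-service
        if with_ip_pool:                 # only when -i and -c are given
            self.pool_macro = "dhcp_macro_" + name

    def delete_service(self, name: str, remove_image: bool = False) -> None:
        image = self.services.pop(name)
        if remove_image:                 # corresponds to the -x option
            self.images.discard(image)
        # wanboot.conf is deliberately left untouched, as in the current code.

    def boot_default_client(self) -> Tuple[str, str]:
        """Return (image of boot_archive, image of compressed archives)."""
        if self.wanboot_image not in self.images:
            raise FileNotFoundError(
                "boot_archive not found in " + str(self.wanboot_image))
        return self.wanboot_image, self.macros[self.pool_macro]


server = InstallServer()
server.create_service("service_1", "ai_image_1", with_ip_pool=True)  # step [1]
server.create_service("service_2", "ai_image_2")                     # step [2]
print(server.boot_default_client())  # ('ai_image_2', 'ai_image_1'): mismatch
```

Calling server.delete_service("service_2", remove_image=True) afterwards
corresponds to step [3]; the next boot_default_client() call then raises
FileNotFoundError, mirroring the failed boot.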

Proposed final solution:
------------------------
I think the final solution here is to work out the set of requirements
we would like to address and to reconsider the existing design and
implementation with respect to:

* which install service scopes should be available
  - currently for Sparc we can only explicitly associate an install
    service with a particular client (identified by its MAC address)
    and use one other service for the rest of the Sparc clients.
    More than one service serving a broader scope can't be created,
    since only one /etc/netboot/wanboot.conf file can exist.

* how a Sparc client obtains the location of AI images
  - currently it is spread across two places: one for boot_archive,
    one for the compressed archives. It should be consolidated so
    that it is less error-prone and easier to maintain.

Proposed fix for now:
---------------------
For now, significant design changes are not appropriate, since they
would be too risky. Based on this, I am considering the following
temporary solution until the final approach can be taken:

* when a new service is created, don't touch /etc/netboot/wanboot.conf
  if it contains a pointer to an existing boot archive. This ensures
  that once /etc/netboot/wanboot.conf is created for one service,
  it is not accidentally overwritten by another service. Clients would
  then continue to use the first service as the default (in cases where
  'create-client' is not called), and the mismatch would be avoided.

* when a service is deleted along with its associated AI image (by
  passing the '-x' option) and /etc/netboot/wanboot.conf contains a
  pointer to the boot archive in that image, /etc/netboot/wanboot.conf
  is deleted along with that AI image. This prevents
  /etc/netboot/wanboot.conf from pointing to a non-existent AI image.
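
The two checks above could look roughly like the following Python sketch.
It is not the code from the webrev. The function names are hypothetical,
and the sketch assumes that wanboot.conf consists of key=value lines with
'#' comments and that the 'root_file' value can be treated as a filesystem
path on the install server.

```python
from pathlib import Path
from typing import Optional


def read_root_file(wanboot_conf: Path) -> Optional[str]:
    """Return the 'root_file' value, or None if the file or option is absent."""
    if not wanboot_conf.is_file():
        return None
    for line in wanboot_conf.read_text().splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "root_file":
            return value.strip()
    return None


def may_write_default_conf(wanboot_conf: Path) -> bool:
    """create-service: write the file only if it does not already point
    to an existing boot archive."""
    root_file = read_root_file(wanboot_conf)
    return root_file is None or not Path(root_file).is_file()


def must_remove_default_conf(wanboot_conf: Path, image_dir: Path) -> bool:
    """delete-service -x: remove the file if its boot archive lives inside
    the AI image that is being deleted."""
    root_file = read_root_file(wanboot_conf)
    if root_file is None:
        return False
    return image_dir.resolve() in Path(root_file).resolve().parents
```

With these checks, the first service created keeps ownership of the default
wanboot.conf until its image is removed, after which the next
'create-service' call is free to write a new one.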

Once those changes are applied, the behavior for Sparc clients would be
similar to that for x86 clients.

I have prepared a preliminary fix with those changes and tested it with
both Sparc and x86 clients.

The preliminary webrev is available at the following location:
http://cr.opensolaris.org/~dambi/bug-8130/

Please let me know whether you think this problem qualifies as a stopper
for 2009.06, whether there are other related issues I have not noticed,
and whether the solution described above is acceptable or a different
approach should be taken. Any comments are highly appreciated.

Thank you very much in advance,
Jan

