*Synopsis*: mail delivery hangs due to ksh93, via AST, using an old and slow 
getcwd() algorithm

CR 6946515 changed on Apr 23 2010 by <User 1-5Q-101>

=== Field ============ === New Value ============= === Old Value =============

Public Comments        New Note                                               
====================== =========================== ===========================

     
*Change Request ID*: 6946515

*Synopsis*: mail delivery hangs due to ksh93, via AST, using an old and slow 
getcwd() algorithm

  Product: solaris
  Category: shell
  Subcategory: korn93
  Type: Defect
  Subtype: Performance
  Status: 6-Fix Understood
  Substatus: 
  Priority: 3-Medium
  Introduced In Release: solaris_nevada
  Introduced In Build: 
  Responsible Engineer: 
  Keywords: 

=== *Description* ============================================================
John Beck noticed lots of processes started by sendmail
on jurassic that were hanging on stat(), statting ../<user>
where the user's home directory server was not responding.

It turns out that this happens because ksh93 uses the libast
getcwd(), which implements the traditional getcwd() algorithm:
stat ., then stat every entry in .. until one matches .'s st_dev
and st_ino, and repeat one level up until / is reached.  ksh93
has some heuristics to find $PWD faster, based on testing
whether any of $PWD, $HOME, /usr/spool/cron/atjobs, or / is the
same directory as ., but these fail in the case of sendmail
because sendmail starts .forward jobs with an empty environment.
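
The traditional algorithm described above can be sketched in C roughly
as follows.  This is an illustration only, not the libast source:
slow_getcwd() is a made-up name, and the sketch uses lstat() rather than
stat() so that a symlink pointing at the current directory is not
mistaken for its real name.  The point to notice is the per-sibling
lstat() call, which is where a process can block when a sibling
directory lives on an unresponsive server.

```c
#define _XOPEN_SOURCE 700
#include <dirent.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical name; illustrates the traditional walk, not libast's code. */
char *slow_getcwd(char *buf, size_t len)
{
    char up[PATH_MAX] = ".";    /* grows to "./..", "./../..", ... */
    char probe[PATH_MAX];       /* one candidate entry under "up" */
    char path[PATH_MAX] = "";   /* result so far, built right to left */
    char tmp[PATH_MAX];
    struct stat cur, parent, st;
    struct dirent *e;
    DIR *d;
    int n, found;

    if (buf == NULL || stat(".", &cur) != 0)
        return NULL;

    for (;;) {
        if (strlen(up) + sizeof "/.." > sizeof up) {
            errno = ENAMETOOLONG;
            return NULL;
        }
        strcat(up, "/..");
        if (stat(up, &parent) != 0)
            return NULL;

        /* At the root, ".." is the same directory as ".". */
        if (parent.st_dev == cur.st_dev && parent.st_ino == cur.st_ino)
            break;

        if ((d = opendir(up)) == NULL)
            return NULL;
        found = 0;
        while (!found && (e = readdir(d)) != NULL) {
            if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
                continue;
            n = snprintf(probe, sizeof probe, "%s/%s", up, e->d_name);
            if (n < 0 || (size_t)n >= sizeof probe)
                continue;
            /* One lstat() per sibling: this is the call that can block
             * when a sibling is served by an unresponsive NFS server. */
            if (lstat(probe, &st) != 0)
                continue;
            if (st.st_dev == cur.st_dev && st.st_ino == cur.st_ino) {
                /* Found our own name in the parent; prepend it. */
                n = snprintf(tmp, sizeof tmp, "/%s%s", e->d_name, path);
                if (n < 0 || (size_t)n >= sizeof tmp) {
                    closedir(d);
                    errno = ENAMETOOLONG;
                    return NULL;
                }
                strcpy(path, tmp);
                found = 1;
            }
        }
        closedir(d);
        if (!found) {
            errno = ENOENT;
            return NULL;
        }
        cur = parent;           /* repeat one level up */
    }

    if (path[0] == '\0')
        strcpy(path, "/");
    if (strlen(path) + 1 > len) {
        errno = ERANGE;
        return NULL;
    }
    strcpy(buf, path);
    return buf;
}
```

The cost grows with the number of entries in every ancestor directory,
and any one of those entries can stall the whole walk.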

In practice the old, slow, dangerous (dangerous because of
the risk of straying into paths whose NFS servers are not
responding) getcwd() is rarely tried in ksh93.  It is tried
only when $PWD and $HOME are unset or do not name the actual
current directory, and the current directory is also neither
/ nor /usr/spool/cron/atjobs (!).
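
A minimal sketch of that kind of shortcut, assuming the candidates are
tried in the order listed in this report.  The names same_dir() and
guess_pwd() are hypothetical and this is not the code in ksh93's
path_pwd(); it only shows the st_dev/st_ino comparison against "." and
why an empty environment leaves nothing useful to compare.

```c
#define _XOPEN_SOURCE 700
#include <stddef.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Hypothetical helper: does the absolute path name the same directory as "."? */
static int same_dir(const char *path)
{
    struct stat a, b;

    if (path == NULL || path[0] != '/')
        return 0;
    if (stat(path, &a) != 0 || stat(".", &b) != 0)
        return 0;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/*
 * Hypothetical shortcut: return a known name for the current directory,
 * or NULL when none of the candidates match and the caller would have
 * to fall back to a real getcwd().
 */
const char *guess_pwd(void)
{
    const char *candidates[4];
    size_t i;

    candidates[0] = getenv("PWD");   /* NULL in an empty environment */
    candidates[1] = getenv("HOME");  /* NULL in an empty environment */
    candidates[2] = "/usr/spool/cron/atjobs";
    candidates[3] = "/";

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++)
        if (same_dir(candidates[i]))
            return candidates[i];
    return NULL;
}
```

With both variables unset, as under `env -i`, only the two fixed paths
remain, so any other working directory falls through to the slow path.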

Solaris has a much faster getcwd() that uses a single system
call.  AST should use Solaris' getcwd() on Solaris.

To reproduce:

% cd $HOME
% truss -o /tmp/tf -f env -i ksh93
% exit
% grep stat.*home /tmp/tf

The code path to hit this is

sh_main()->sh_init()->env_init()->path_pwd()->_ast_getcwd()

(_ast_getcwd() is actually named getcwd() in the source, but the
name is mapped to _ast_getcwd via the C preprocessor.)
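
The renaming works along these lines.  This is a sketch of the general
preprocessor technique, not a copy of the libast headers; the marker
string and caller_cwd() are made up for the illustration.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the header line that renames the symbol. */
#define getcwd _ast_getcwd

/* Because of the macro, this actually defines a function named _ast_getcwd. */
char *getcwd(char *buf, size_t len)
{
    static const char marker[] = "/from/replacement";  /* hypothetical result */

    if (buf == NULL || len < sizeof marker)
        return NULL;
    memcpy(buf, marker, sizeof marker);
    return buf;
}

/* A caller written against the usual name is compiled as a call to
 * _ast_getcwd, so the libc getcwd() is never reached from this code. */
char *caller_cwd(char *buf, size_t len)
{
    return getcwd(buf, len);
}
```

This is why a stack trace or truss output shows _ast_getcwd even though
the source only ever mentions getcwd().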

*** (#1 of 1): 2010-04-22 22:33:32 GMT+00:00 <User 1-5Q-10962>


=== *Public Comments* ========================================================
Note, there are likely other libc functions that AST replaces or
interposes on with equivalents that are, on Solaris, inferior.
For example: resolvepath().

The principle should be that one needs a good reason to replace
or interpose on any libc function.  That a given libc function
was, once upon a time, slow is not sufficient.  Improvements over
time can render invalid the rationales for earlier decisions to
replace, which means that revisiting replacement/interposition
decisions should also be a principle here.

An audit of libast replacements/interpositions on libc and other
system functions is in order.

*** (#1 of 2): 2010-04-22 22:33:32 GMT+00:00 <User 1-5Q-10962>

[ jbeck, 2010-Apr-22 ]

Olga asked me to add the following on her behalf
---
Replacing getcwd() from libast with libc will not be possible because libc uses 
libc malloc() which is not async signal safe (libast malloc and libast stdio 
are async signal safe, the libc versions of these function sets are not).

It may be possible to use the getcwd syscall, but I do not know what will 
happen if cwd is longer than PATH_MAX.
---

Though on an internal thread it was pointed out that libc`getcwd() only calls 
malloc() if the first argument passed to getcwd() is NULL, so libast could 
simply make sure not to call it that way.

*** (#2 of 2): 2010-04-23 01:10:26 GMT+00:00 <User 1-5Q-101>


=== *Workaround* =============================================================

=== *Additional Details* =====================================================
        Targeted Release: solaris_nevada
        Commit To Fix In Build: 
        Fixed In Build: 
        Integrated In Build: 
        Verified In Build: 
  See Also: 
  Duplicate of: 
  Hooks:
        Hook1: 
        Hook2: 
        Hook3: 
        Hook4: 
        Hook5: 
        Hook6: 
  Program Management: 
  Root Cause: 
  Fix Affects Documentation: No
  Fix Affects Localization: No

=== *History* ================================================================
        Date Submitted: 2010-04-22 22:33:32 GMT+00:00
        Submitted By: <User 1-5Q-10962>

        Status Changed    Date Updated                  Updated By
        6-Fix Understood  2010-04-22 22:38:06 GMT+00:00 <User 1-5Q-10962>


=== *Service Request* ========================================================
        Impact: Significant
        Functionality: Secondary
        Severity: 3
        Product Name: solaris
        Product Release: solaris_nevada
        Product Build: 
        Operating System: snv
        Hardware: generic
        Submitted Date: 2010-04-22 22:33:32 GMT+00:00


=== *Multiple Release (MR) Cluster* - 0 ======================================

_______________________________________________
ksh93-integration-discuss mailing list
ksh93-integration-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/ksh93-integration-discuss
