*Synopsis*: mail delivery hangs due to ksh93, via AST, using an old and slow getcwd() algorithm
CR 6946515 changed on Apr 23 2010 by <User 1-5Q-101>

=== Field ============ === New Value ============= === Old Value =============

Public Comments        New Note
====================== =========================== ===========================

*Change Request ID*: 6946515

*Synopsis*: mail delivery hangs due to ksh93, via AST, using an old and slow getcwd() algorithm

  Product: solaris
  Category: shell
  Subcategory: korn93
  Type: Defect
  Subtype: Performance
  Status: 6-Fix Understood
  Substatus:
  Priority: 3-Medium
  Introduced In Release: solaris_nevada
  Introduced In Build:
  Responsible Engineer:
  Keywords:

=== *Description* ============================================================

John Beck noticed lots of processes started by sendmail on jurassic that were hanging on stat(), statting ../<user> where the user's home directory server was not responding.

It turns out that this happens because ksh93 uses the libast getcwd(), which implements the traditional getcwd() algorithm (stat ., stat every entry in .. until we find one that matches .'s st_dev and st_ino, repeat until we find /).

Now, ksh93 has some heuristics to try to find $PWD faster, based on testing whether any of $PWD, $HOME, /usr/spool/cron/atjobs, or / are the same as ., but this fails in the case of sendmail because sendmail starts .forward jobs with an empty environment.

Thus in practice the old, slow, dangerous (dangerous because of the risk of straying into paths whose NFS servers are not responding) getcwd() is rarely tried in ksh93. The old getcwd() is tried only when $PWD and $HOME are not set or are not the same as the actual current directory, and the current directory is also not / and not /usr/spool/cron/atjobs (!).

Solaris has a much faster getcwd() that uses a single system call. AST should use Solaris' getcwd() on Solaris.

To reproduce:

    % cd $HOME
    % truss -o /tmp/tf -f env -i ksh93
    % exit
    % grep stat.*home /tmp/tf

The code path to hit this is sh_main()->sh_init()->env_init()->path_pwd()->_ast_getcwd() (_ast_getcwd() is actually named getcwd(), but the name is mapped to _ast_getcwd via the C preprocessor.)

*** (#1 of 1): 2010-04-22 22:33:32 GMT+00:00 <User 1-5Q-10962>

=== *Public Comments* ========================================================

Note, there are likely other libc functions that AST replaces or interposes on with equivalents that are, on Solaris, inferior. For example: resolvepath().

The principle should be that one needs a good reason to replace or interpose on any libc function. That a given libc function was, once upon a time, slow is not sufficient. Improvements over time can invalidate the rationale for an earlier decision to replace, which means that revisiting replacement/interposition decisions should also be a principle here.

An audit of libast replacements/interpositions on libc and other system functions is in order.

*** (#1 of 2): 2010-04-22 22:33:32 GMT+00:00 <User 1-5Q-10962>

[ jbeck, 2010-Apr-22 ]
Olga asked me to add the following on her behalf
---
Replacing getcwd() from libast with libc will not be possible because libc uses libc malloc() which is not async signal safe (libast malloc and libast stdio are async signal safe, the libc versions of these function sets are not). It may be able to use the getcwd syscall but I do not know what will happen if cwd is longer than PATH_MAX.
---
Though on an internal thread it was pointed out that libc`getcwd() only calls malloc() if the first argument passed to getcwd() is NULL, so libast could simply make sure not to call it that way.
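
As an illustration of that last point (not code from the thread or from libast), a wrapper that always hands libc a caller-supplied buffer might look like the sketch below. The name cwd_into() is hypothetical. It relies on the behavior described in the comment above, that getcwd() allocates only when its first argument is NULL, and on POSIX reporting ERANGE when the buffer is too small.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical wrapper: never uses the allocating getcwd(NULL, ...) form. */
char *cwd_into(char *buf, size_t size)
{
    assert(buf != NULL);      /* NULL would select the allocating form */
    if (size == 0) {
        errno = EINVAL;
        return NULL;
    }
    /* Returns buf on success; NULL with errno set (for example ERANGE
       when the path does not fit in size bytes) on failure. */
    return getcwd(buf, size);
}
```

A caller can start with a PATH_MAX-sized buffer and treat ERANGE as a signal to retry with a larger one obtained from whatever allocator it considers safe. What the underlying system call does for paths longer than PATH_MAX is left open here, as it was in the comment.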

*** (#2 of 2): 2010-04-23 01:10:26 GMT+00:00 <User 1-5Q-101>

=== *Workaround* =============================================================

=== *Additional Details* =====================================================

  Targeted Release: solaris_nevada
  Commit To Fix In Build:
  Fixed In Build:
  Integrated In Build:
  Verified In Build:
  See Also:
  Duplicate of:
  Hooks:
  Hook1:
  Hook2:
  Hook3:
  Hook4:
  Hook5:
  Hook6:
  Program Management:
  Root Cause:
  Fix Affects Documentation: No
  Fix Affects Localization: No

=== *History* ================================================================

  Date Submitted: 2010-04-22 22:33:32 GMT+00:00
  Submitted By: <User 1-5Q-10962>

  Status Changed    Date Updated                   Updated By
  6-Fix Understood  2010-04-22 22:38:06 GMT+00:00  <User 1-5Q-10962>

=== *Service Request* ========================================================

  Impact: Significant
  Functionality: Secondary
  Severity: 3
  Product Name: solaris
  Product Release: solaris_nevada
  Product Build:
  Operating System: snv
  Hardware: generic
  Submitted Date: 2010-04-22 22:33:32 GMT+00:00

=== *Multiple Release (MR) Cluster* - 0 ======================================

_______________________________________________
ksh93-integration-discuss mailing list
ksh93-integration-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/ksh93-integration-discuss