*Synopsis*: ZFS Send/ZFS Receive: Receive Side of Pipe is not created - Causes process hang - ksh93
CR 6876768/solaris_nevada changed on Oct 13 2009 by <User 1-90DA97>

=== Field ============ === New Value ============= === Old Value =============
Public Comments        New Note
====================== =========================== ===========================

*Change Request ID*: 6876768/solaris_nevada
*Synopsis*: ZFS Send/ZFS Receive: Receive Side of Pipe is not created - Causes process hang - ksh93
Product: solaris
Category: shell
Subcategory: korn93
Type: Defect
Subtype:
Status: 2-Incomplete
Substatus: Need More Info
Priority: 2-High
Introduced In Release:
Introduced In Build:
Responsible Engineer: <User 1-5Q-9460>
Keywords:

=== *Description* ============================================================

zfs send/zfs receive operations hang under ksh. The problem is that the right-hand side of the pipe, the "zfs receive", is not being created. The included truss output shows that the "zfs receive" is never started: on a failing run the truss output file for the receive side is never even created. There MAY also be ioctl involvement, as indicated by the last line of the included truss:

    1955:   ioctl(3, ZFS_IOC_SEND, 0x08044A00)      (sleeping...)

It may also be an issue of the receive side of an anonymous pipe dying without the corresponding send side of the pipe being killed off.

Potentially related CRs: 6859444, 6782948, 4657448. The customer reports that the test included in CR 4657448 does not fail on his host, and he believes the failure depends on having multiple cores and on the number of processes in the pipeline.

In addition, the customer's scripts are included as an attachment. You will find the following:

1) a diff showing the change that eliminated the hang (without changing the shell)
2) r1.2 of the script that includes the hang
3) the current version of the script, which includes various changes and does not (yet) cause the hang (but multiple copies, run in parallel, can induce the ZFS 3-way deadlock described in SR#71405530)
4) the script used to create the test zpool
5) the script used to create 100 test zfs [operations]

*** (#1 of 1): 2009-08-28 00:23:49 GMT+00:00 <User 1-90DA97>
*** Last Edit: 2009-09-23 15:13:21 GMT+00:00 <User 1-90DA97>

=== *Public Comments* ========================================================

I don't see any indication that this is a ZFS bug. Please provide a crash dump from when the system is "hung". This bug will be closed or recategorized as a ksh bug on September 10 if more information is not provided.

*** (#1 of 8): 2009-08-28 04:38:45 GMT+00:00 <User 1-5Q-9707>

In examining the different truss outputs provided by the customer, we see that the action the customer wants to perform is:

    zfs send | zfs receive

We trussed both sides of the pipe with:

    truss -o sendside zfs send | truss -o recvside zfs receive

This was run in a loop to transfer multiple snapshots, with truss file names unique across the entire loop. At some point in the process the left-hand side hangs, and no truss output is ever created for the receive side. This strongly suggests that the receive side of the pipe was never created at all, and that the "zfs receive" itself is not part of the issue.

The issue was reproducible on the customer's system. It might take an hour or so, but it would happen. Since the customer uses these scripts to make backups, they run at regular intervals and the chances of hitting the failure are very high.

In an attempt to isolate the issue, we changed the send/receive pair from "zfs" to "dd". While moving the same amount of data, over 300 iterations did not produce a single hang. The analysis suggests that this MIGHT be an interaction between ksh and the kernel/ZFS.
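For illustration only, the following is a minimal sketch of the kind of trussed reproduction loop described above, not the customer's actual script (that script is in the CR attachment); the dataset names ("tank/src", "tank/dst"), the snapshot naming scheme, the iteration count, and the truss output paths are all hypothetical:

    #!/usr/bin/ksh93
    # Sketch of the reproduction loop only -- not the customer's script.
    # Dataset names, snapshot scheme and output paths are hypothetical.
    typeset -i i
    for (( i = 1; i <= 100; i++ )); do
        zfs snapshot "tank/src@snap$i"
        # truss both sides of the pipe with per-iteration output files; on a
        # failing run sendside.$i ends in the sleeping ZFS_IOC_SEND ioctl
        # while recvside.$i is never created at all.
        truss -o "/var/tmp/sendside.$i" zfs send "tank/src@snap$i" |
            truss -o "/var/tmp/recvside.$i" zfs receive "tank/dst$i"
    done

The unique per-iteration file names are what make the observation possible: a hung iteration leaves a sendside.$i ending in the sleeping ZFS_IOC_SEND ioctl, while the corresponding recvside.$i never appears. The dd control described above used the same loop structure with a dd reader/writer pair in place of zfs send/zfs receive and never hung.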
The "zfs send" makes a single ioctl() call which generates large amounts of data. That data is moved directly from zfs to a file, by passing user space. Additional point: This lock up is NOT seen if the user pipes across the network. I.e. zfs send | ssh remote.system.com zfs receive *** (#2 of 8): 2009-08-28 12:45:35 GMT+00:00 <User 1-92VH66> Based on the additional information that Chris provided, is a crash dump still required? *** (#3 of 8): 2009-08-28 15:03:01 GMT+00:00 <User 1-90DA97> >From the latest update it looks like it is Solaris kh93 bug. *** (#4 of 8): 2009-09-23 15:43:34 GMT+00:00 <User 1-1SURPB> Transfering to shell/korn93 based on last comment for further investigation. *** (#5 of 8): 2009-10-01 17:13:43 GMT+00:00 <User 1-3GMVGZ> Roland Mainz, the OpenSolaris ksh93 integration project lead asks: Does the problem go away if you apply the ksh93-integration update2 tarballs from http://www.opensolaris.org/os/project/ksh93-integration/downloads/2009-09-22/ You can find his e-mail address and the project mailing list/web forum on the ksh93 project pages at: http://opensolaris.org/os/project/ksh93-integration/ *** (#6 of 8): 2009-10-01 20:24:07 GMT+00:00 <User 1-5Q-1267> Queried customer and asked that he download and apply the eval/test tarballs to determine if they resolve the issue. Will update with results. *** (#7 of 8): 2009-10-09 17:04:42 GMT+00:00 <User 1-90DA97> Please see attachment 6876768emailupdate.txt for update queries related to customer response... Thanks. *** (#8 of 8): 2009-10-13 15:52:13 GMT+00:00 <User 1-90DA97> === *Workaround* ============================================================= === *Additional Details* ===================================================== Targeted Release: solaris_nevada Commit To Fix In Build: Fixed In Build: Integrated In Build: Verified In Build: See Also: 4657448, 6782948, 6859444 Duplicate of: Hooks: Hook1: Hook2: Hook3: Hook4: Hook5: Hook6: Program Management: Root Cause: Fix Affects Documentation: No Fix Affects Localization: No === *History* ================================================================ Date Submitted: 2009-08-28 00:23:46 GMT+00:00 Submitted By: <User 1-90DA97> Status Changed Date Updated Updated By 2-Incomplete 2009-08-28 04:38:45 GMT+00:00 <User 1-5Q-9707> === *Service Request* ======================================================== Impact: Significant Functionality: Secondary Severity: 3 Product Name: solaris Product Release: osol_2009.06 Product Build: osol_2009.06 Operating System: osol_2009.06 Hardware: Submitted Date: 2009-08-28 00:23:49 GMT+00:00 === *Multiple Release (MR) Cluster* - 6876768 ================================ ID: +6876768/solaris_nevada SubCR Number: 6876768 Targeted Release: solaris_nevada === *SubCR* ================================================================== ID: 6876768/osol_2009.06u6 Status: 1-Dispatched Substatus: Priority: 3-Medium Responsible Engineer: <User 1-5Q-9460> SubCR Number: 2182414 Targeted Release: osol_2009.06u6 Commit To Fix In Build: Fixed In Build: Integrated In Build: Verified In Build: Hook1: Hook2: Program Management: