*Synopsis*: ZFS Send/ZFS Receive: Receive Side of Pipe is not created - Causes 
process hang - ksh93

CR 6876768/solaris_nevada changed on Oct 13 2009 by <User 1-90DA97>

=== Field ============ === New Value ============= === Old Value =============

Public Comments        New Note                                               
====================== =========================== ===========================

     
*Change Request ID*: 6876768/solaris_nevada

*Synopsis*: ZFS Send/ZFS Receive: Receive Side of Pipe is not created - Causes 
process hang - ksh93

  Product: solaris
  Category: shell
  Subcategory: korn93
  Type: Defect
  Subtype: 
  Status: 2-Incomplete
  Substatus: Need More Info
  Priority: 2-High
  Introduced In Release: 
  Introduced In Build: 
  Responsible Engineer: <User 1-5Q-9460>
  Keywords: 

=== *Description* ============================================================
zfs send/zfs receive operations hang when run under ksh:

The problem is that the right-hand side of the pipe (the "zfs receive" process) 
is not being created.  The included truss data shows this: in the failure case, 
the truss output file for the receive side is never even created.

There MAY also be ioctl involvement as indicated by the last line of the 
included truss:

1955:   ioctl(3, ZFS_IOC_SEND, 0x08044A00) (sleeping...)

It may also be that when the receive side of an anonymous pipe dies, the 
corresponding send side of the pipe is not killed off.
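
For comparison, a minimal sketch of what normally happens to an ordinary 
userland writer when the reader of an anonymous pipe goes away.  This is an 
illustration only: "yes" stands in for the send side and "head" for a receive 
side that exits early; neither is part of the customer's scripts, and this 
says nothing about how the ZFS_IOC_SEND ioctl itself behaves.

```shell
#!/bin/sh
# Hypothetical stand-ins: 'yes' plays the sender, 'head' a receiver that
# exits after one line. Not taken from the customer's scripts.
status_file=$(mktemp)

# The subshell records the writer's exit status once the writer has stopped.
( yes; echo $? > "$status_file" ) | head -n 1 > /dev/null

writer_status=$(cat "$status_file")
rm -f "$status_file"

# A non-zero status means the writer terminated (typically via SIGPIPE or a
# failed write) rather than hanging after the read end was closed.
echo "writer exit status: $writer_status"
```

If the writer were left behind after the reader exited, the pipeline above 
would not return.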

Potentially related CRs:

6859444
6782948
4657448 - Cu reports that the test included in CR4657448 does not fail on his 
host and he feels that the failure apparently depends on having multiple cores 
and the number of processes in the pipeline.

In addition, included as an attachment are the cu scripts. You will find the 
following:

1) diff showing change that eliminated the hang (w/o changing shell)
2) r1.2 of the script that includes the hang
3) current version of script that includes various changes and does not (yet) 
cause hang (but multiple copies, run in parallel, can induce the zfs 3-way 
deadlock described in SR#71405530)
4) script used to create test zpool
5) script to create 100 test zfs [operations]

*** (#1 of 1): 2009-08-28 00:23:49 GMT+00:00 <User 1-90DA97>
*** Last Edit: 2009-09-23 15:13:21 GMT+00:00 <User 1-90DA97>


=== *Public Comments* ========================================================
I don't see any indication that this is a ZFS bug.  Please provide a crash dump 
from when the system is "hung".

This bug will be closed or recategorized as a ksh bug on September 10 if more 
information is not provided.

*** (#1 of 8): 2009-08-28 04:38:45 GMT+00:00 <User 1-5Q-9707>

In examining the different trusses provided by the customer, we see that the 
action the customer wants to perform is

zfs send | zfs receive

We trussed both sides of the pipe with:
truss -o sendside zfs send | truss -o recvside zfs receive

This process was run in a loop to transfer multiple snapshots.  The actual 
truss file names are unique across the entire loop.  At some point in the 
process the left hand side hangs.   There was no truss created for the receive 
side.  

This strongly suggests that the receive side of the pipe was not created at 
all, and that "zfs receive" itself was not part of the issue.
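
The check described above (a send-side truss file with no matching 
receive-side file) could be scripted roughly as follows.  This is a sketch 
under assumptions: the naming scheme sendside.<n> / recvside.<n> and the 
function name missing_recv_traces are hypothetical, since the actual file 
names used in the customer's loop are not shown here.

```shell
#!/bin/sh
# Assumed naming: one sendside.<n> and one recvside.<n> per loop iteration.
# Prints the iteration suffix of every send trace lacking a receive trace.
missing_recv_traces() {
    dir=$1
    for send in "$dir"/sendside.*; do
        [ -e "$send" ] || continue      # glob matched nothing
        n=${send##*.}                   # iteration suffix after the last dot
        [ -e "$dir/recvside.$n" ] || echo "$n"
    done
}
```

Example use, with a hypothetical trace directory: 
missing_recv_traces /var/tmp/traces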

This issue was reproducible on the CU system.  It might take an hour or 
so, but the issue would happen.  Since the CU is using these scripts to make 
backups, they are run at regular intervals and the chances of failure are very 
high.

In an attempt to isolate the issue, we changed the send/receive pair from 
"zfs" to "dd".  While we moved the same amount of data, in over 300 iterations 
we did not get a single hang.
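
A minimal sketch of a dd-based substitution loop of this kind.  The block 
size, block count, and variable names are illustrative assumptions; the 
actual test matched the data volume of the real snapshots, which is not 
stated here.

```shell
#!/bin/sh
# Illustrative values only; override via the environment if desired.
iterations=${ITERATIONS:-300}
blocks=${BLOCKS:-16}            # 16 x 64 KiB = 1 MiB per iteration (assumed)
completed=0
i=0
while [ "$i" -lt "$iterations" ]; do
    # dd on both sides of an anonymous pipe, in place of zfs send/receive.
    if dd if=/dev/zero bs=64k count="$blocks" 2>/dev/null |
       dd of=/dev/null bs=64k 2>/dev/null; then
        completed=$((completed + 1))
    fi
    i=$((i + 1))
done
echo "completed $completed of $iterations iterations"
```

A hang of the kind reported would show up as the loop never reaching the 
final echo.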

The analysis suggests that this MIGHT be an interaction between KSH and 
Kernel/ZFS.  The "zfs send" makes a single ioctl() call which generates large 
amounts of data.  That data is moved directly from zfs to a file, bypassing 
user space.

Additional point: This lock up is NOT seen if the user pipes across the 
network.  I.e. zfs send | ssh remote.system.com zfs receive

*** (#2 of 8): 2009-08-28 12:45:35 GMT+00:00 <User 1-92VH66>

Based on the additional information that Chris provided, is a crash dump still 
required?

*** (#3 of 8): 2009-08-28 15:03:01 GMT+00:00 <User 1-90DA97>

From the latest update, it looks like this is a Solaris ksh93 bug.

*** (#4 of 8): 2009-09-23 15:43:34 GMT+00:00 <User 1-1SURPB>

Transferring to shell/korn93 based on last comment for further investigation.

*** (#5 of 8): 2009-10-01 17:13:43 GMT+00:00 <User 1-3GMVGZ>

Roland Mainz, the OpenSolaris ksh93 integration project lead, asks:
 Does the problem go away if you apply the ksh93-integration update2 tarballs 
from
 http://www.opensolaris.org/os/project/ksh93-integration/downloads/2009-09-22/

You can find his e-mail address and the project mailing list/web forum on 
the ksh93 project pages at:
http://opensolaris.org/os/project/ksh93-integration/

*** (#6 of 8): 2009-10-01 20:24:07 GMT+00:00 <User 1-5Q-1267>

Queried customer and asked that he download and apply the eval/test tarballs to 
determine if they resolve the issue.  Will update with results.

*** (#7 of 8): 2009-10-09 17:04:42 GMT+00:00 <User 1-90DA97>

Please see attachment 6876768emailupdate.txt for update queries related to 
customer response... Thanks.

*** (#8 of 8): 2009-10-13 15:52:13 GMT+00:00 <User 1-90DA97>


=== *Workaround* =============================================================

=== *Additional Details* =====================================================
        Targeted Release: solaris_nevada
        Commit To Fix In Build: 
        Fixed In Build: 
        Integrated In Build: 
        Verified In Build: 
  See Also: 4657448, 6782948, 6859444
  Duplicate of: 
  Hooks:
        Hook1: 
        Hook2: 
        Hook3: 
        Hook4: 
        Hook5: 
        Hook6: 
  Program Management: 
  Root Cause: 
  Fix Affects Documentation: No
  Fix Affects Localization: No

=== *History* ================================================================
        Date Submitted: 2009-08-28 00:23:46 GMT+00:00
        Submitted By: <User 1-90DA97>

        Status Changed    Date Updated                  Updated By
        2-Incomplete      2009-08-28 04:38:45 GMT+00:00 <User 1-5Q-9707>


=== *Service Request* ========================================================
        Impact: Significant
        Functionality: Secondary
        Severity: 3
        Product Name: solaris
        Product Release: osol_2009.06
        Product Build: osol_2009.06
        Operating System: osol_2009.06
        Hardware: 
        Submitted Date: 2009-08-28 00:23:49 GMT+00:00


=== *Multiple Release (MR) Cluster* - 6876768 ================================
        ID: +6876768/solaris_nevada
        SubCR Number: 6876768
        Targeted Release: solaris_nevada

=== *SubCR* ==================================================================
        ID:  6876768/osol_2009.06u6
        Status: 1-Dispatched
        Substatus: 
        Priority: 3-Medium
        Responsible Engineer: <User 1-5Q-9460>
        SubCR Number: 2182414
        Targeted Release: osol_2009.06u6
        Commit To Fix In Build: 
        Fixed In Build: 
        Integrated In Build: 
        Verified In Build: 
        Hook1: 
        Hook2: 
        Program Management: 
