I configured fossil to use openssl (for https) and built it for Linux (kernel 3.11.0-12-generic, Ubuntu 13.10). Fossil crashes during the 'pull' portion of a 'fossil update', or just while running 'fossil pull'. The pull involves the transfer of a few large artifacts (in the ~60 MB range) as well as lots of small ones.
The crash happens when the client speaks SSL to the fossil server via stunnel, and also when the client speaks in the clear directly to the fossil server. I believe that the server is running out of memory during the transaction, and the resulting server response causes the client to fail.

$ uname -a
Linux <snip> 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:12:00 UTC 2013 i686 i686 i686 GNU/Linux
$ ~/fossil-src-20140612172556/fossil version
This is fossil version 1.29 [3e5ebe2b90] 2014-06-12 17:25:56 UTC

I am unfortunately not able to get much useful information out of the core file, apart from the fact that we seem to have been in the openssl lib when we crashed:

$ ~/fossil-src-20140612172556/fossil pull
Pull from https://e...@dev.packetup.net:10444/
Segmentation fault (core dumped)
0 received: 0
$ gdb ~/fossil-src-20140612172556/fossil ./core
GNU gdb (GDB) 7.6.1-ubuntu
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "i686-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/eric/fossil-src-20140612172556/fossil...done.
warning: exec file is newer than core file.
[New LWP 7065]
warning: Can't read pathname for load map: Input/output error.
Core was generated by `/home/eric/fossil-src-20140612172556/fossil update'.
Program terminated with signal 11, Segmentation fault.
#0  0xb75bffb0 in ?? () from /lib/i386-linux-gnu/libcrypto.so.1.0.0
(gdb) where
#0  0xb75bffb0 in ?? () from /lib/i386-linux-gnu/libcrypto.so.1.0.0
#1  0xefcdab89 in ?? ()
#2  0x98badcfe in ?? ()
#3  0x10325476 in ?? ()
#4  0xc3d2e1f0 in ?? ()
#5  0x402a84b0 in ?? ()
#6  0x00000000 in ?? ()

(As an aside, the "return addresses" in frames #1 through #4 happen to be the SHA-1 initialization constants, which suggests the stack was overwritten with in-progress hash state rather than real frame pointers.)

This happens every time given my current clone pairs.
I was not able to reproduce it under valgrind, because the slow-down imposed by valgrind caused the initial network transaction to time out. I have reproduced it under strace, but I'm not sure how helpful that output will be.

I then recompiled fossil as a static executable and re-ran my test. Fossil still usually crashes, but one time it did not. Instead, it produced the following output:

$ ../fossil-src-20140612172556/fossil pull
Pull from https://e...@dev.packetup.net:10444/
Round-trips: 1 Artifacts sent: 0 received: 0
server replies with HTML instead of fossil sync protocol:
file 010dfdc0db6ff3cf63e0da8f90681fef5e80ee56 5032
/*
 * Copyright (C) 1995, 1996, 1997, 1998, and 1999 WIDE Project.
{snip a bunch of file data}
file 25fc6f1d7ddb3d0c6f812a397752b83472776e56 8d0ef3a2aee8895877014a1b273cc4dfd20e5f34 272
AWP 1d:C {snipped my check-in comment}
D 2014-08-18T16:46:56.6919j2@22,q:nvram_read.c 83e2de86fe92201e7662fc9db14387f4d4d79a0fjd@9kq ,1I:8d0ef3a2aee8895877014a1b273cc4dfd20e5f34
U eas
Z a2e9f38c4965a46f1e4f925f9e9a8a07
3GbccP;file 267f9bd5f5b347d26ede6b0c20c332d8af1b3bc0 5212
Round-trips: 2 Artifacts sent: 0 received: 51
server replies with HTML instead of fossil sync protocol:
<p class="generalError">out of memory</p>
Round-trips: 2 Artifacts sent: 0 received: 51
Pull finished with 4250 bytes sent, 99232 bytes received
$

So my hypothesis is that the server is running out of memory and is sending back some response that the client cannot handle, and the client is crashing on it. I have taken a wire capture of the cleartext client<->server interaction and can share that with the devs privately if they would like. There is not a lot of data transferred in this scenario -- only 12 packets. The server works for 80 seconds, sends back a small response, and the connection is terminated.

I ran fossil again and stepped it through the debugger starting from blob_uncompress.
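To illustrate the failure mode, here is a minimal sketch (my own illustration, not fossil's actual code, and the helper name is hypothetical) of how a sync client could sniff whether a reply body is an HTML error page rather than sync-protocol cards before trying to parse it:

```c
#include <ctype.h>

/* Hypothetical helper, not fossil's actual code: report whether a
** server reply looks like an HTML error page (such as the server's
** "out of memory" message) instead of fossil sync-protocol cards,
** which begin with a lowercase keyword like "file", "igot", or
** "pull".  Skips leading whitespace, then checks for a '<'. */
static int looks_like_html(const char *zReply){
  while( *zReply && isspace((unsigned char)*zReply) ) zReply++;
  return *zReply=='<';
}
```

A client that ran such a check on each reply could report the server's error text cleanly instead of feeding it into the card parser.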
Here's the contents of the uncompressed blob:

(gdb) where
#0  blob_uncompress (pIn=pIn@entry=0xbffff434, pOut=pOut@entry=0xbffff434) at ./src/blob.c:933
#1  0x0807bf60 in http_exchange (pSend=pSend@entry=0xbffff420, pReply=pReply@entry=0xbffff434, useLogin=1, maxRedirect=maxRedirect@entry=20) at ./src/http.c:390
#2  0x080ca4b5 in client_sync (syncFlags=<optimized out>, configRcvMask=<optimized out>, configSendMask=configSendMask@entry=0) at ./src/xfer.c:1561
#3  0x080acd1c in pull_cmd () at ./src/sync.c:172
#4  0x0804d402 in main (argc=4, argv=0xbffff6c4) at ./src/main.c:674
(gdb) print pOut->aData
$8 = 0x844aa08 "file 59cdb30316cbb0d1d35f248ac3bcabcde7a419c2 988f3554814eeae25a87235c9c8e3a4f84b29a75 64937242\n<p class=\"generalError\">out of memory</p>"
(gdb)

My guess is that we are going to crash on that record: the file card declares a size of ~64 MB, but the real contents are much smaller (just the OOM error message):

(gdb) n
932         return 0;
(gdb)
933     }
(gdb)
http_exchange (pSend=pSend@entry=0xbffff420, pReply=pReply@entry=0xbffff434, useLogin=1, maxRedirect=maxRedirect@entry=20) at ./src/http.c:401
401       if( !g.url.isSsh ) closeConnection = 1; /* FIX ME */
(gdb)
403         transport_close(&g.url);
(gdb)
415     }
(gdb)
client_sync (syncFlags=<optimized out>, configRcvMask=<optimized out>, configSendMask=configSendMask@entry=0) at ./src/xfer.c:1569
1569        if( syncFlags & SYNC_VERBOSE ){
(gdb)
1575        nArtifactSent += xfer.nFileSent + xfer.nDeltaSent;
(gdb)
1576        fossil_print(zBriefFormat, nRoundtrip, nArtifactSent, nArtifactRcvd);
(gdb)
1574        nRoundtrip++;
(gdb)
1576        fossil_print(zBriefFormat, nRoundtrip, nArtifactSent, nArtifactRcvd);
(gdb)
1586        blob_reset(&send);
(gdb)
1580        xfer.nFileSent = 0;
(gdb)
1581        xfer.nDeltaSent = 0;
(gdb)
1582        xfer.nGimmeSent = 0;
(gdb)
1583        xfer.nIGotSent = 0;
(gdb)
1586        blob_reset(&send);
(gdb)
1587        rArrivalTime = db_double(0.0, "SELECT julianday('now')");
(gdb)
1590        if( syncFlags & SYNC_PRIVATE ){
(gdb)
1587        rArrivalTime = db_double(0.0, "SELECT julianday('now')");
(gdb)
1590        if( syncFlags & SYNC_PRIVATE ){
(gdb)
1597        if( syncFlags & SYNC_PULL ){
(gdb)
1578        nCardSent = 0;
(gdb)
1597        if( syncFlags & SYNC_PULL ){
(gdb)
1598          blob_appendf(&send, "pull %s %s\n", zSCode, zPCode);
(gdb)
1599          nCardSent++;
(gdb)
1601        if( syncFlags & SYNC_PUSH ){
(gdb)
1866        blob_reset(&xfer.line);
(gdb)
1608        while( blob_line(&recv, &xfer.line) ){
(gdb)
1609          if( blob_buffer(&xfer.line)[0]=='#' ){
(gdb)
1624          xfer.nToken = blob_tokenize(&xfer.line, xfer.aToken, count(xfer.aToken));
(gdb)
1626          if( (syncFlags & SYNC_VERBOSE)!=0 && recv.nUsed>0 ){
(gdb)
1624          xfer.nToken = blob_tokenize(&xfer.line, xfer.aToken, count(xfer.aToken));
(gdb)
1625          nCardRcvd++;
(gdb)
1626          if( (syncFlags & SYNC_VERBOSE)!=0 && recv.nUsed>0 ){
(gdb)
1640          if( blob_eq(&xfer.aToken[0],"file") ){
(gdb)
0x08048330 in ?? ()
(gdb)
Cannot find bounds of current function
(gdb)

So at this point I guess we have corrupted our stack:

(gdb) where
#0  0x08048330 in ?? ()
#1  0x00000060 in ?? ()
#2  0x00000060 in ?? ()
#3  0x0844aa08 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
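If the diagnosis above is right, the underlying problem is that the client trusts the size declared on the "file" card even when the reply doesn't actually contain that many bytes. A defensive check could look something like this (a standalone sketch under my own naming, not fossil's actual code):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch, not fossil's actual code.  A "file" card looks
** like "file HASH SIZE" or "file HASH DELTA-BASE SIZE"; in both forms
** the payload size is the last whitespace-separated token.  Return 1
** only if the declared payload fits within the nAvail bytes actually
** remaining in the reply buffer, so a lying card (e.g. one claiming
** 64937242 bytes followed only by a short HTML error message) is
** rejected instead of causing a read past the end of the buffer. */
static int file_card_size_ok(const char *zCard, long nAvail){
  const char *zLast = strrchr(zCard, ' ');
  long nDeclared;
  if( zLast==0 ) return 0;            /* malformed card: no size token */
  nDeclared = atol(zLast+1);
  return nDeclared>=0 && nDeclared<=nAvail;
}
```

Applied to the captured reply, the card declaring 64937242 bytes followed by only the ~40-byte OOM message would fail this check, and the client could bail out with a protocol error instead of walking off the end of its buffer.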
Let me know if you need anything further.

Eric
_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users