Re: [fossil-users] Fossil process hanging on sync to remote server?
On 2014-05-10 17:46, Andy Bradford wrote:
> Thus said Gerald Gutierrez on Sat, 10 May 2014 01:53:56 -0700:
>
>> frame #8: 0x000105719ba2 fossil`ssl_receive(NotUsed=<unavailable>, pContent=<unavailable>, N=<unavailable>) + 50 at http_ssl.c:399
>>    396    size_t got;
>>    397    size_t total = 0;
>>    398    while( N>0 ){
>> -> 399      got = BIO_read(iBio, pContent, N);
>>    400      if( got<=0 ) break;
>>    401      total += got;
>>    402      N -= got;

I'm not sure if it is the cause of the problem, but got is unsigned. So when BIO_read returns -1, got becomes a big number, because by definition it cannot go below 0. On my machine, if I declare

    int l = -1;
    size_t r = l;
    printf("l = %d r = l = %zu\n", l, r);

the output is

    l = -1 r = l = 18446744073709551615

N cannot go below 0 either, so subtracting got in this case will only yield 0 by coincidence.

The 3rd argument to BIO_read is listed as an int. On my machine int is 4 bytes and size_t is 8 bytes, so depending on the input this can go wrong when N is too large to fit in an int, e.g.:

    this fits in an int:         0x1000
    this does not fit in an int: 0x100000000 (the int will be 0)

Changing ssl_receive to

    /*
    ** Receive content back from the SSL connection.
    */
    size_t ssl_receive(void *NotUsed, void *pContent, size_t N){
      ssize_t got;
      size_t total = 0;
      while( N>0 ){
        got = BIO_read(iBio, pContent, N <= INT_MAX ? N : INT_MAX);
        if( got<=0 ) break;
        total += got;
        N -= got;
        pContent = (void*)&((char*)pContent)[got];
      }
      return total;
    }

will yield better results (I hope). Because I cannot test it, I attached a unified patch. I patched http_socket.c and http_ssl.c against the latest trunk. I wonder if it solves your problem?

-- 
Rene

--- http_socket.c
+++ http_socket.c
@@ -182,14 +182,14 @@

 /*
 ** Send content out over the open socket connection.
 */
 size_t socket_send(void *NotUsed, void *pContent, size_t N){
-  size_t sent;
+  ssize_t sent;
   size_t total = 0;
   while( N>0 ){
-    sent = send(iSocket, pContent, N, 0);
+    sent = send(iSocket, pContent, N>SSIZE_MAX ? SSIZE_MAX : N, 0);
     if( sent<=0 ) break;
     total += sent;
     N -= sent;
     pContent = (void*)&((char*)pContent)[sent];
   }
--- http_ssl.c
+++ http_ssl.c
@@ -444,19 +444,19 @@
   cert = PEM_read_bio_X509(mem, NULL, 0, NULL);
   free(zCert);
   BIO_free(mem);
   return cert;
 }
-
+#include <limits.h>
 /*
 ** Send content out over the SSL connection.
 */
 size_t ssl_send(void *NotUsed, void *pContent, size_t N){
-  size_t sent;
+  ssize_t sent;
   size_t total = 0;
   while( N>0 ){
-    sent = BIO_write(iBio, pContent, N);
+    sent = BIO_write(iBio, pContent, N <= INT_MAX ? N : INT_MAX);
     if( sent<=0 ) break;
     total += sent;
     N -= sent;
     pContent = (void*)&((char*)pContent)[sent];
   }
@@ -465,18 +465,18 @@

 /*
 ** Receive content back from the SSL connection.
 */
 size_t ssl_receive(void *NotUsed, void *pContent, size_t N){
-  size_t got;
+  ssize_t got;
   size_t total = 0;
   while( N>0 ){
-    got = BIO_read(iBio, pContent, N);
+    got = BIO_read(iBio, pContent, N <= INT_MAX ? N : INT_MAX);
     if( got<=0 ) break;
     total += got;
     N -= got;
     pContent = (void*)&((char*)pContent)[got];
   }
   return total;
 }

 #endif /* FOSSIL_ENABLE_SSL */

_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
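The signed/unsigned pitfall described in this message, and the clamped read loop it proposes, can be tried in isolation. What follows is a minimal sketch and not Fossil's code: the names read_fn, fake_src, fake_read, clamp_to_int and receive_all are hypothetical, and fake_read is only a stand-in for BIO_read (it takes an int length and returns an int, -1 on failure) so that the loop can run without OpenSSL.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical reader type: same shape as BIO_read (int length, int result). */
typedef int (*read_fn)(void *ctx, void *buf, int n);

/* Returns 1 if an error code stored in a size_t still passes "<=0". */
int unsigned_check_catches_error(int rc){
  size_t got = (size_t)rc;      /* -1 wraps around to SIZE_MAX */
  return got <= 0;
}

/* Returns 1 if the same error code stored in a ssize_t passes "<=0". */
int signed_check_catches_error(int rc){
  ssize_t got = rc;             /* -1 stays -1 */
  return got <= 0;
}

/* Limit a size_t request to what fits in the int length argument. */
size_t clamp_to_int(size_t n){
  return n <= (size_t)INT_MAX ? n : (size_t)INT_MAX;
}

/* Read up to N bytes with a signed result and a clamped request size. */
size_t receive_all(read_fn rd, void *ctx, void *pContent, size_t N){
  size_t total = 0;
  assert( rd!=NULL );
  while( N>0 ){
    ssize_t got = rd(ctx, pContent, (int)clamp_to_int(N));
    if( got<=0 ) break;         /* error or end of data ends the loop */
    total += (size_t)got;
    N -= (size_t)got;
    pContent = (char*)pContent + got;
  }
  return total;
}

/* Hypothetical in-memory source used in place of a real connection. */
struct fake_src {
  const char *data;   /* bytes to hand out */
  size_t len;         /* total number of bytes */
  size_t pos;         /* bytes handed out so far */
  int max_chunk;      /* most bytes returned per call */
};

/* Hands out at most max_chunk bytes per call; returns -1 once drained. */
int fake_read(void *ctx, void *buf, int n){
  struct fake_src *s = ctx;
  size_t take = s->len - s->pos;
  if( take==0 ) return -1;
  if( take>(size_t)n ) take = (size_t)n;
  if( take>(size_t)s->max_chunk ) take = (size_t)s->max_chunk;
  memcpy(buf, s->data + s->pos, take);
  s->pos += take;
  return (int)take;
}
```

With a size_t result the -1 from the reader would be added to total and subtracted from N instead of ending the loop; with ssize_t the loop stops at the first failed read and returns only the bytes actually received. Whether this explains the hang reported in the thread is a separate question, since the backtrace shows the process waiting inside read.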
Re: [fossil-users] Fossil process hanging on sync to remote server?
On Sat, May 10, 2014 at 1:03 AM, Stephan Beal <sgb...@googlemail.com> wrote:
> For me fossil builds with -g (debug) flags by default, so you shouldn't have to rebuild.

I ended up recompiling fossil just to be sure. On Mac OSX Mavericks, gdb has been replaced by lldb.

I managed to get the process to hang again (it hadn't completed in over 10 minutes during a sync). I attached to it and went looking up the frames one by one. Here are the frames:

frame #0: 0x7fff8aef59f0 libsystem_kernel.dylib`read + 8
frame #1: 0x7fff8fbfd84c libcrypto.0.9.8.dylib`conn_read + 76
frame #2: 0x7fff8fbf7264 libcrypto.0.9.8.dylib`BIO_read + 100
frame #3: 0x7fff86b6513d libssl.0.9.8.dylib`ssl3_read_n + 365
frame #4: 0x7fff86b65b9f libssl.0.9.8.dylib`ssl3_read_bytes + 735
frame #5: 0x7fff86b63dec libssl.0.9.8.dylib`ssl3_read + 156
frame #6: 0x7fff86b4e4c9 libssl.0.9.8.dylib`ssl_read + 73
frame #7: 0x7fff8fbf7264 libcrypto.0.9.8.dylib`BIO_read + 100
frame #8: 0x000105719ba2 fossil`ssl_receive(NotUsed=<unavailable>, pContent=<unavailable>, N=<unavailable>) + 50 at http_ssl.c:399
   396    size_t got;
   397    size_t total = 0;
   398    while( N>0 ){
-> 399      got = BIO_read(iBio, pContent, N);
   400      if( got<=0 ) break;
   401      total += got;
   402      N -= got;

So, it's hanging in the BIO_read function.

Global/static variables (command: ta v) are:

(SSL_CTX *) sslCtx = 0x7fa1c2e039b0
(char *) sslErrMsg = 0x
(SSL *) ssl = 0x7fa1c2e04430
(BIO *) iBio = 0x7fa1c2e043c0

Frame variables (fr v) are unfortunately:

(void *) NotUsed = <variable not available>
(void *) pContent = <variable not available>
(size_t) N = <variable not available>
(size_t) total = 0
(size_t) got = <variable not available>

Going up a couple more frames gives context to where in the code it is stalling, but I'm not sure whether it gives any insight into why. Perhaps if someone could give some guidance on how I can investigate further, I can help diagnose.

Here are the remaining frames all the way up to main:

frame #9: 0x00010571a3a6 fossil`transport_fetch(pUrlData=<unavailable>, zBuf=0x7fa1c2e078b0, N=1000) + 102 at http_transport.c:311
   308      }
   309    }else if( pUrlData->isHttps ){
   310    #ifdef FOSSIL_ENABLE_SSL
-> 311      got = ssl_receive(0, zBuf, N);
   312    #else
   313      got = 0;
   314    #endif
(lldb) up
frame #10: 0x00010571a548 fossil`transport_receive_line [inlined] transport_load_buffer(pUrlData=0x00010589f430) + 236 at http_transport.c:393
   390      transport.pBuf = pNew;
   391    }
   392    if( N>0 ){
-> 393      i = transport_fetch(pUrlData, &transport.pBuf[transport.nUsed], N);
   394      if( i>0 ){
   395        transport.nRcvd += i;
   396        transport.nUsed += i;
(lldb) up
frame #11: 0x00010571a45c fossil`transport_receive_line(pUrlData=0x00010589f430) + 76 at http_transport.c:416
   413    i = iStart = transport.iCursor;
   414    while(1){
   415      if( i >= transport.nUsed ){
-> 416        transport_load_buffer(pUrlData, pUrlData->isSsh ? 2 : 1000);
   417        i -= iStart;
   418        iStart = 0;
   419        if( i >= transport.nUsed ){
(lldb) up
frame #12: 0x000105718815 fossil`http_exchange(pSend=0x7fff5a518640, pReply=0x7fff5a518620, useLogin=1, maxRedirect=20) + 1301 at http.c:206
   203    */
   204    closeConnection = 1;
   205    iLength = -1;
-> 206    while( (zLine = transport_receive_line(GLOBAL_URL()))!=0 && zLine[0]!=0 ){
   207      /* printf("[%s]\n", zLine); fflush(stdout); */
   208      if( fossil_strnicmp(zLine, "http/1.", 7)==0 ){
   209        if( sscanf(zLine, "HTTP/1.%d %d", &iHttpVersion, &rc)!=2 ) goto write_err;
(lldb) up
frame #13: 0x00010576b1e3 fossil`client_sync(syncFlags=<unavailable>, configRcvMask=<unavailable>, configSendMask=<unavailable>) + 1955 at xfer.c:1560
   1557     }
   1558     fflush(stdout);
   1559     /* Exchange messages with the server */
-> 1560     if( http_exchange(&send, &recv, (syncFlags & SYNC_CLONE)==0 || nCycle>0,
   1561         MAX_REDIRECTS) ){
   1562       nErr++;
   1563       break;
(lldb) up
frame #14: 0x00010574d2a9 fossil`autosync(flags=<unavailable>) + 313 at sync.c:75
   72     if( find_option("verbose","v",0)!=0 ) flags |= SYNC_VERBOSE;
   73     fossil_print("Autosync: %s\n", g.urlCanonical);
   74     url_enable_proxy("via proxy: ");
-> 75     rc = client_sync(flags, configSync, 0);
   76     if( rc ) fossil_warning("Autosync failed");
   77     return rc;
   78   }
(lldb) up
frame #15: 0x0001056f9f85 fossil`commit_cmd + 5365 at checkin.c:1927
   1924   db_end_transaction(0);
   1925
   1926   if( !g.markPrivate ){
-> 1927     autosync(SYNC_PUSH|SYNC_PULL);
   1928   }
   1929   if( count_nonbranch_children(vid)>1 ){
   1930     fossil_print("**** warning: a fork has occurred *****\n");
(lldb) up
frame #16: 0x000105726195 fossil`main(argc=<unavailable>, argv=<unavailable>) + 2325 at main.c:701
   698      fossil_exit(1);
   699    }
   700    atexit( fossil_atexit );
-> 701    aCommand[idx].xFunc();
   702    fossil_exit(0);
Re: [fossil-users] Fossil process hanging on sync to remote server?
On 10/05/14 10:53, Gerald Gutierrez wrote:
> [---]
> pContent=<unavailable>, N=<unavailable>) + 50 at http_ssl.c:399
>    396    size_t got;
>    397    size_t total = 0;
>    398    while( N>0 ){
> -> 399      got = BIO_read(iBio, pContent, N);
>    400      if( got<=0 ) break;
>    401      total += got;
>    402      N -= got;
>
> So, it's hanging in the BIO_read function.

...which is probably blocking and waiting for data. Either it's supposed to wait for data which the other side isn't sending (a problem at the other side?), or it has gotten the idea that it needs more data even though it doesn't (a local problem?).

I'd start by taking a look at what the other side is doing. Is it possible for you to test without SSL?

-- 
Kind Regards,
Jan
Re: [fossil-users] Fossil process hanging on sync to remote server?
Thus said Gerald Gutierrez on Sat, 10 May 2014 01:53:56 -0700:

> frame #8: 0x000105719ba2 fossil`ssl_receive(NotUsed=<unavailable>, pContent=<unavailable>, N=<unavailable>) + 50 at http_ssl.c:399
>    396    size_t got;
>    397    size_t total = 0;
>    398    while( N>0 ){
> -> 399      got = BIO_read(iBio, pContent, N);
>    400      if( got<=0 ) break;
>    401      total += got;
>    402      N -= got;

So, it's blocking in the BIO_read function. Hard to say why without more data about both ends. netstat -na (on both sides) will probably provide some interesting information (e.g. are there blocks of data queued in either Recv-Q or Send-Q on either end of the ESTABLISHED connection). gdb on the remote fossil would provide some other details.

I did find this particular comment about using BIO_read and non-blocking I/O in OpenBSD's manpage (specifically the last sentence):

    One technique sometimes used with blocking sockets is to use a
    system call (such as select(), poll() or equivalent) to determine
    when data is available and then call read() to read the data. The
    equivalent with BIOs (that is call select() on the underlying I/O
    structure and then call BIO_read() to read the data) should not be
    used because a single call to BIO_read() can cause several reads
    (and writes in the case of SSL BIOs) on the underlying I/O
    structure and may block as a result. Instead select() (or
    equivalent) should be combined with non blocking I/O so successive
    reads will request a retry instead of blocking.

I'm not sure if using this technique would fare any better without understanding why it's blocking. Is there a firewall reacting to something it doesn't like? Some other network problem, like lost packets? Does it only happen when using cron? If so, why?

Andy
-- 
TAI64 timestamp: 4000536e49ee
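The last sentence of the manpage excerpt (select() combined with non-blocking I/O, so that a read reports "retry" instead of blocking) can be illustrated with plain POSIX file descriptors. This is only a sketch of the general technique under stated assumptions: it does not use OpenSSL or Fossil's transport layer, and set_nonblocking and read_with_timeout are hypothetical helper names. With a BIO the retry condition would be checked with BIO_should_retry() rather than errno, which this sketch does not attempt.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Put fd into non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd){
  int flags = fcntl(fd, F_GETFL, 0);
  if( flags<0 ) return -1;
  return fcntl(fd, F_SETFL, flags | O_NONBLOCK)<0 ? -1 : 0;
}

/*
** Read up to n bytes from a non-blocking fd. Each wait for data is
** bounded by timeout_sec seconds.
**
** Returns the number of bytes read (>0), 0 on end of file,
** -1 on error, or -2 if no data arrived within the timeout.
*/
ssize_t read_with_timeout(int fd, void *buf, size_t n, int timeout_sec){
  for(;;){
    fd_set rfds;
    struct timeval tv;
    int rc;
    ssize_t got = read(fd, buf, n);
    if( got>=0 ) return got;              /* data or end of file */
    if( errno==EINTR ) continue;          /* interrupted: just retry */
    if( errno!=EAGAIN && errno!=EWOULDBLOCK ) return -1;

    /* Nothing available yet: wait for readability, but not forever. */
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;
    rc = select(fd+1, &rfds, NULL, NULL, &tv);
    if( rc==0 ) return -2;                /* timed out */
    if( rc<0 && errno!=EINTR ) return -1; /* select itself failed */
  }
}
```

A caller that gets -2 back can decide to give up and report a sync failure, which is the difference from a blocking read that waits indefinitely when the peer never sends anything. As the message above notes, this would bound the wait but would not by itself explain why the data never arrives.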
[fossil-users] Fossil process hanging on sync to remote server?
On Fri, May 9, 2014 at 2:25 AM, Stephan Beal <sgb...@googlemail.com> wrote:
> Other than that, i can't comment: i've only seen such behaviour in 'ping' on Solaris, where it can cause a backlog of cronjobs, which causes all other jobs to queue up until you kill the pings, at which point _all_ queued jobs, since the queue limit was reached (several days in my case), run in rapid succession!

(Changed subject line to reflect new topic)

Well, it's happened again. This time I have the cron logging on. I get a mail message every time a cronjob completes, and it's distinctly missing the one for the fossil sync session that is hung.

If I look at processes, I get this (notice it executed at 9:05am, about 5 minutes after the cronjob started, and the entire cronjob, if successful, only takes about 30 seconds):

$ uname -a
Darwin mycomp.local 13.1.0 Darwin Kernel Version 13.1.0: Thu Jan 16 19:40:37 PST 2014; root:xnu-2422.90.20~2/RELEASE_X86_64 x86_64

$ ps auxww | grep fossil
USER  PID   %CPU  %MEM  VSZ      RSS    TT  STAT  STARTED  TIME     COMMAND
xxx   7619  0.0   0.2   2490036  29836  ??  S     9:05AM   0:13.80  /usr/local/bin/fossil commit --no-warnings -m Fri May 9 09:05:41 PDT 2014

The cronjob log for the cronjob executing AFTER the one that hung says this:

added 3 files, deleted 0 files
/usr/local/bin/fossil: database is locked: {REPLACE INTO config(name,value,mtime) VALUES('last-sync-url','my repo url',now())}

If you have recently updated your fossil executable, you might
need to run "fossil all rebuild" to bring the repository
schemas up to date.

So, there is definitely a problem here. It doesn't happen all the time, but enough that it occurs at least once a day.
Re: [fossil-users] Fossil process hanging on sync to remote server?
On Sat, May 10, 2014 at 12:40 AM, Gerald Gutierrez <gerald.gutier...@gmail.com> wrote:
> So, there is definitely a problem here. It doesn't happen all the time, but enough that it occurs at least once a day.

i suspect it's OS specific. Richard syncs many repositories via cron on a regular basis (hourly for the sqlite mirrors, IIRC) and has never had/reported any problem with this, nor can i remember it coming up before on the list.

:/

-- 
- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
Re: [fossil-users] Fossil process hanging on sync to remote server?
Thus said Gerald Gutierrez on Fri, 09 May 2014 15:40:13 -0700:

> $ ps auxww | grep fossil
> USER  PID   %CPU  %MEM  VSZ      RSS    TT  STAT  STARTED  TIME     COMMAND
> xxx   7619  0.0   0.2   2490036  29836  ??  S     9:05AM   0:13.80  /usr/local/bin/fossil commit --no-warnings -m Fri May 9 09:05:41 PDT 2014

Any chance you could attach gdb to this process and see what it's doing? Something like:

    gdb fossil -p 7619
    ...
    (gdb) bt

Thanks,

Andy
-- 
TAI64 timestamp: 4000536d6a2c