Akara and Shanti, I am not sure whether this is better suited to the user list or the development list, so I have cc'd the dev list as well.
I think I have finally figured this one out. After a long search on the web I came across the following links:

http://old.nabble.com/-jira--Created:-%28HTTPCLIENT-796%29-IOException-when-server-closes-connection-at-end-of-chunk-td19434908.html
https://issues.apache.org/jira/browse/HTTPCORE-173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Both are about the same JIRA issue, which says that commons-httpclient-3.1 can behave weirdly and kill the connection when the last packet of a response is empty (at least that is how I interpreted it). I am not sure this is the fix I am after, but the stack trace I got from Olio and the one described in the links above do match. The Olio stack trace I get with the "chunked stream ended unexpectedly" error is below.

I have patched commons-httpclient-3.1 myself (there is a patch for 3.1 in the above links, "c.patch"). However, to get faban to use the patched commons-httpclient-3.1 I will need to recompile faban (right?). Can I just drop "commons-httpclient-3.1-patched" inside faban/lib and expect it to run? Or should I rename commons-httpclient-3.1-patched to commons-httpclient-3.1 for it to work?

Anyway, this has been a source of great headaches. I have to run all my experiments twice, so that when one run shows the "cave" behavior (as Shanti called it) I can consider that experiment hosed. Of course this doubles the time, and it is annoying. Let me know.
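On the jar question: when two jars on a JVM classpath contain the same class, the one that comes first on the classpath is the one that gets loaded, so it matters whether both the original and the patched jar end up in faban/lib. I do not know how faban assembles its classpath, so the following is only a minimal sketch for checking which jars in a directory bundle a given class. The function name jars_containing is made up for this example; only the faban/lib directory and the ChunkedInputStream class name come from the message above.

```python
import zipfile
from pathlib import Path


def jars_containing(lib_dir, class_name):
    """Return the sorted names of the .jar files in lib_dir that bundle class_name.

    class_name is a fully qualified Java class name, for example
    "org.apache.commons.httpclient.ChunkedInputStream".
    """
    # A jar is a zip archive; classes are stored as path/to/Class.class entries.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as archive:
            if entry in archive.namelist():
                hits.append(jar.name)
    return hits
```

Calling jars_containing("faban/lib", "org.apache.commons.httpclient.ChunkedInputStream") and getting back more than one jar would suggest removing or replacing the unpatched one rather than just adding the patched copy next to it.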
Message: UIDriverAgent[0].127.doEventDetail: chunked stream ended unexpectedly
------------------------------
Exception:
Message: java.io.IOException: chunked stream ended unexpectedly
Stack Trace:
Class                                                    Method                        Line
org.apache.commons.httpclient.ChunkedInputStream         getChunkSizeFromInputStream   252
org.apache.commons.httpclient.ChunkedInputStream         nextChunk                     221
org.apache.commons.httpclient.ChunkedInputStream         read                          176
java.io.FilterInputStream                                read                          116
org.apache.commons.httpclient.AutoCloseInputStream       read                          108
sun.nio.cs.StreamDecoder                                 readBytes                     264
sun.nio.cs.StreamDecoder                                 implRead                      306
sun.nio.cs.StreamDecoder                                 read                          158
java.io.InputStreamReader                                read                          167
com.sun.faban.driver.transport.hc3.ApacheHC3Transport    fetchResponseData             912
com.sun.faban.driver.transport.hc3.ApacheHC3Transport    fetchResponse                 838
com.sun.faban.driver.transport.hc3.ApacheHC3Transport    fetchURL                      543
com.sun.faban.driver.transport.hc3.ApacheHC3Transport    fetchURL                      564
org.apache.olio.workload.driver.UIDriver                 doEventDetail                 611
sun.reflect.GeneratedMethodAccessor6                     invoke
sun.reflect.DelegatingMethodAccessorImpl                 invoke                        25
java.lang.reflect.Method                                 invoke                        597
com.sun.faban.driver.engine.TimeThread                   doRun                         169
com.sun.faban.driver.engine.AgentThread                  run                           202

--Cheers
-------------------------------------------------------------------
Kontorinis Vasileios
Phd student, University of California San Diego
http://cseweb.ucsd.edu/~vkontori/
[email protected]
-------------------------------------------------------------------

2010/6/3 Shanti Subramanyam <[email protected]>

> On Wed, Jun 2, 2010 at 3:08 PM, Vasileios Kontorinis <[email protected]> wrote:
>
>> 4. I am curious about this one.
>> I have increased the number of StartServers in apache.conf to 1536...
>>
>> <IfModule mpm_prefork_module>
>>     ListenBacklog        32768
>>     StartServers         1536
>>     MinSpareServers      20
>>     MaxSpareServers      128
>>     ServerLimit          16384
>>     MaxClients           16384
>>     MaxRequestsPerChild  0
>> </IfModule>
>>
>> This should take care of it. Nope?
>
> No it won't. Having a worker process already in place helps but remember
> that apache uses a single process to handle all incoming connections after
> which it will hand off to the worker process.
>
>> I went over the code in sun/faban/driver/engine/AgentImpl.java and if I
>> interpret it correctly faban needs to start #_of_users = #_of_threads. If
>> there is one new thread every 500ms it will take quite some time (400secs).
>> In the code it shows that it will sleep for less time if we are falling
>> short -- 1/3 of the actual interval, so it should be around 100secs. Which
>> means 2 mins should be ok.
>
> Right. But I suggest a rampup of 3-4 mins at least to get transactions
> going before you start measuring.
>
>> Now regarding the "cave" behavior. Is there a chance that a bunch of
>> users dropped their connections at the same time and apache could not handle
>> it? Is there a way to verify that this is the case? What if I increase the
>> MinSpareServers/MaxSpareServers settings?
>
> Easiest way to verify is to try the changes I suggested.
>
>> Thanks again.
>
> Shanti
>
>> 2010/6/1 Shanti Subramanyam <[email protected]>
>>
>>> Sorry for the late reply. I just took a look at your data.
>>> Some comments:
>>>
>>> 1. Your rampup time of 30 secs is too small.
>>>    You may want to increase it to a few minutes, considering the load
>>>    you're targeting.
>>> 2. I see that you're specifying a variable load - yet the thruput graph
>>>    shows a constant thruput (until it caves in of course). Did you try doing a
>>>    run with a constant 800 scale, i.e. without specifying the variable load?
>>> 3. The apache error log shows timestamps that are different from the
>>>    driver. Can you please sync the time on the systems? If running as root,
>>>    faban does this automatically. Otherwise, once in a while, you should check
>>>    what 'date' shows on all your systems.
>>> 4. Your 'threadstart/delay' is very small - For this load, I would
>>>    recommend 300-500 ms. Apache is just not capable of handling lots of
>>>    connections in a very short time. This may be the reason for the 'connect
>>>    timed out' messages in the run log.
>>>
>>> Shanti
>>>
>>> On Tue, May 25, 2010 at 12:12 PM, Vasileios Kontorinis <[email protected]> wrote:
>>>
>>>> Just to clarify, with "package up" you mean tar/gzip the output dir of
>>>> the specific run with the errors, right?
>>>> If yes, I have updated the earlier webpage.
>>>> http://cseweb.ucsd.edu/~vkontori/olio/olio.html
>>>>
>>>> There you can find a small description of the issue and the run in
>>>> http://cseweb.ucsd.edu/~vkontori/olio/run.tgz
>>>>
>>>> I am not sure if the apache.conf and php.ini are copied to the folder by
>>>> default so I just copied them into the .tgz file myself.
>>>> Let me know if there are any issues.
>>>>
>>>> Thanks!!
>>>>
>>>> 2010/5/25 Shanti Subramanyam <[email protected]>
>>>>
>>>>> If you're running things locally, we can't blame nfs any longer :-)
>>>>> If you don't mind, package up the entire run directory and I'll take a
>>>>> look at it (I assume you can put it somewhere where I can download it).
>>>>>
>>>>> Shanti
>>>>>
>>>>> On Mon, May 24, 2010 at 10:43 PM, Vasileios Kontorinis <[email protected]> wrote:
>>>>>
>>>>>> Well,
>>>>>> It feels like there is more than one thing wrong in my configuration.
>>>>>> I stopped using the thumper and went back to using the local fs of the
>>>>>> web-server.
>>>>>>
>>>>>> Some of my issues went away but not all of them.
>>>>>> Now it fails less often, every 5-10 runs.
>>>>>>
>>>>>> The error message in the faban logs:
>>>>>>
>>>>>> UIDriverAgent[0].752.do <op> : chunked stream ended unexpectedly
>>>>>> Note: Error not counted in result.
>>>>>> Either transaction start or end time is not within steady state.
>>>>>>
>>>>>> (if the error happens in the steady state it is counted...)
>>>>>>
>>>>>> I looked into it a little online; it seems like the server is
>>>>>> getting a response that it is not expecting. I have no clue how to debug
>>>>>> that.
>>>>>> Any ideas?
>>>>>>
>>>>>> Just a sanity check. I am using:
>>>>>> php v.5.2.4 (without Suhosin patch, I removed it)
>>>>>> Server version: Apache/2.2.11 (Ubuntu)
>>>>>> memcached 1.2.2 (from phpinfo() I read memcache 2.2.5)
>>>>>> mysql Ver 14.12 Distrib 5.0.75, for debian-linux-gnu (x86_64)
>>>>>> apc 3.0.19
>>>>>>
>>>>>> Is this old??
>>>>>> I am considering upgrading my Ubuntu version from Hardy
>>>>>> to Lucid with all the related package updates.
>>>>>> As far as php is concerned I know the current version is 5.2.12 so mine
>>>>>> is pretty old.
>>>>>>
>>>>>> Let me know
>>>>>> Thanks again.
>>>>>>
>>>>>> 2010/5/24 Vasileios Kontorinis <[email protected]>
>>>>>>
>>>>>>> I am using v3.
>>>>>>>
>>>>>>> What made me suspicious of nfs is a bunch of messages in syslog:
>>>>>>> May 22 00:22:37 olio-web -- MARK --
>>>>>>> May 22 00:30:29 olio-web kernel: [2167206.713881] nfs: server 67.58.51.149 not responding, still trying
>>>>>>> May 22 00:31:03 olio-web kernel: [2167240.993893] nfs: server 67.58.51.149 not responding, still trying
>>>>>>> May 22 00:31:06 olio-web kernel: [2167243.193896] nfs: server 67.58.51.149 not responding, still trying
>>>>>>> May 22 00:31:07 olio-web kernel: [2167244.349889] nfs: server 67.58.51.149 not responding, still trying
>>>>>>> May 22 00:31:07 olio-web kernel: [2167244.357893] nfs: server 67.58.51.149 not responding, still trying
>>>>>>> May 22 00:31:07 olio-web kernel: [2167244.669885] nfs: server 67.58.51.149 not responding, still trying
>>>>>>> May 22 00:31:10 olio-web kernel: [2167247.781891] nfs: server 67.58.51.149 not responding, still trying
>>>>>>> May 22 00:31:10 olio-web kernel: [2167247.785889] nfs: server 67.58.51.149 not responding, still trying
>>>>>>> May 22 00:31:11 olio-web kernel: [2167248.725885] nfs: server 67.58.51.149 not responding, still trying
>>>>>>> May 22 00:31:13 olio-web kernel: [2167250.153892] nfs: server 67.58.51.149 not responding, still trying
>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.173886] nfs: server 67.58.51.149 not responding, still trying
>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.410700] nfs: server 67.58.51.149 OK
>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411158] nfs: server 67.58.51.149 OK
>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411236] nfs: server 67.58.51.149 OK
>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411249] nfs: server 67.58.51.149 OK
>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411354] nfs: server 67.58.51.149 OK
>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411678] nfs: server 67.58.51.149 OK
>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411691] nfs: server 67.58.51.149 OK
>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411712] nfs: server 67.58.51.149 OK
>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411723] nfs: server 67.58.51.149 OK
>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.412047] nfs: server 67.58.51.149 OK
>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.414051] nfs: server 67.58.51.149 OK
>>>>>>> May 22 00:42:37 olio-web -- MARK --
>>>>>>> May 22 01:02:37 olio-web -- MARK --
>>>>>>>
>>>>>>> Regarding the olio run log, it's weird.
>>>>>>> Usually it says nothing. Sometimes it gives a bunch of
>>>>>>>
>>>>>>> UIDriverAgent[0].752.do <op> : chunked stream ended unexpectedly
>>>>>>> Note: Error not counted in result.
>>>>>>> Either transaction start or end time is not within steady state.
>>>>>>>
>>>>>>> and also
>>>>>>>
>>>>>>> UIDriverAgent[0].635.do <op> : connect timed out
>>>>>>> Note: Error not counted in result.
>>>>>>> Either transaction start or end time is not within steady state.
>>>>>>>
>>>>>>> where <op> stands for different operations....
>>>>>>>
>>>>>>> But it does not log this every time...
>>>>>>>
>>>>>>> Is there a way to monitor nfs internals (buffering, throughput etc)
>>>>>>> on the fly?
>>>>>>> I am using sar but it does not give much helpful info.
>>>>>>>
>>>>>>> 2010/5/24 Shanti Subramanyam <[email protected]>
>>>>>>>
>>>>>>>> Which version of NFS are you using? I suggest you try v3 - it is
>>>>>>>> more efficient. We've run into issues with v4 causing unacceptable
>>>>>>>> response times.
>>>>>>>> For the kind of drop you are seeing, I am surprised that you don't
>>>>>>>> find any errors anywhere. In the first 400 user run for example, it
>>>>>>>> looks like all the processes either exited or are stalled. In either
>>>>>>>> case, I would expect to see errors in the faban run log (either that
>>>>>>>> the driver got an error or that it timed out). Are you sure you
>>>>>>>> checked the faban log?
>>>>>>>>
>>>>>>>> Have you tried running nfsstat to see if you can spot anything?
>>>>>>>>
>>>>>>>> Shanti
>>>>>>>>
>>>>>>>> On Sun, May 23, 2010 at 7:07 PM, Vasileios Kontorinis <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Shanti hi again,
>>>>>>>>> I sort of managed to fix that. I tried upgrading my php version
>>>>>>>>> to 5.2.6 and the alert went away. My problems though are not fixed.
>>>>>>>>> I even tried completely removing the suhosin patch (it was a huge pain
>>>>>>>>> in ubuntu, since you need to recompile the php module by yourself).
>>>>>>>>> Still though my problems are there.
>>>>>>>>>
>>>>>>>>> Now I get no warnings and the logs are clean, but I get weird behavior.
>>>>>>>>> I needed to send you guys some pics so I created a related page at:
>>>>>>>>> http://cseweb.ucsd.edu/~vkontori/olio/olio.html
>>>>>>>>> I have comments describing the problem at the end.
>>>>>>>>>
>>>>>>>>> Any help would be most appreciated. I've spent so much time on it
>>>>>>>>> without figuring it out.
>>>>>>>>> My configuration is:
>>>>>>>>>   1 web server on a vm with 6GB of mem, 4 cpus
>>>>>>>>>   1 db server on a vm with 5GB of mem, 4 cpus
>>>>>>>>>   1 fs server on a vm with 4GB of mem, 4 cpus (this one just exposes the filestore over NFS)
>>>>>>>>> All on the same physical machine, a nehalem based server, sitting on
>>>>>>>>> a Sun's Black box.
>>>>>>>>> I got similar behavior when I exposed the filestore on the Sun's
>>>>>>>>> thumper.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> 2010/5/19 Shanti Subramanyam <[email protected]>
>>>>>>>>>
>>>>>>>>>> It's strange that multiple files seem to be complaining about it.
>>>>>>>>>> Did you try disabling Suhosin? Are you seeing a perceptible drop in
>>>>>>>>>> memory after reaching steady state?
>>>>>>>>>>
>>>>>>>>>> shanti
>>>>>>>>>>
>>>>>>>>>> On Wed, May 19, 2010 at 4:28 PM, Vasileios Kontorinis <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Lately I get a bunch of these errors in my logs:
>>>>>>>>>>>
>>>>>>>>>>> [Wed May 19 22:26:37 2010] [error] [client 10.17.255.250] ALERT - canary mismatch on efree() - heap overflow detected (attacker '10.17.255.250', file '/var/www/oliophp/public_html/taggedEvents.php')
>>>>>>>>>>> [Wed May 19 22:26:37 2010] [error] [client 10.17.255.250] ALERT - canary mismatch on efree() - heap overflow detected (attacker '10.17.255.250', file '/var/www/oliophp/public_html/taggedEvents.php')
>>>>>>>>>>> [Wed May 19 22:26:37 2010] [error] [client 10.17.255.250] ALERT - canary mismatch on efree() - heap overflow detected (attacker '10.17.255.250', file '/var/www/oliophp/public_html/users.php')
>>>>>>>>>>> [Wed May 19 22:26:37 2010] [error] [client 10.17.255.250] ALERT - canary mismatch on efree() - heap overflow detected (attacker '10.17.255.250', file '/var/www/oliophp/public_html/events.php')
>>>>>>>>>>> [Wed May 19 22:26:37 2010] [error] [client 10.17.255.250] ALERT - canary mismatch on efree() - heap overflow detected (attacker '10.17.255.250', file '/var/www/oliophp/public_html/taggedEvents.php')
>>>>>>>>>>>
>>>>>>>>>>> According to blogs it is a php related issue. The Suhosin patch
>>>>>>>>>>> detects a memory overflow and complains.
>>>>>>>>>>> I was just wondering if the Olio php code has any known
>>>>>>>>>>> mem. leaks.
>>>>>>>>>>>
>>>>>>>>>>> My php version on ubuntu:
>>>>>>>>>>> PHP 5.2.4-2ubuntu5 with Suhosin-Patch 0.9.6.2 (cli) (built: Feb 27 2008 20:46:51)
>>>>>>>>>>> Copyright (c) 1997-2007 The PHP Group
>>>>>>>>>>> Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
>>>>>>>>>>>
>>>>>>>>>>> It's too bad that I do not get a line on the php files that cause
>>>>>>>>>>> this.
>>>>>>>>>>>
>>>>>>>>>>> Has anyone come across this one before?
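
A note on the ramp-up estimate quoted earlier in this thread (one new thread every 500 ms, 400 secs in total, shorter sleeps when falling behind): the arithmetic can be sketched as below. This takes the reading of AgentImpl.java given in the thread at face value and assumes 800 driver threads, which is what the 400 secs figure implies at 500 ms per thread. The function name rampup_seconds and the speedup parameter are made up for this example and are not part of faban.

```python
def rampup_seconds(threads, interval_ms, speedup=1.0):
    """Seconds needed to start `threads` threads, one every interval_ms / speedup ms."""
    return threads * (interval_ms / speedup) / 1000.0


# Nominal case: one thread every 500 ms.
nominal = rampup_seconds(800, 500)

# Best case if every sleep were cut to 1/3 of the interval (an assumption).
catch_up = rampup_seconds(800, 500, speedup=3.0)

print(f"nominal:  {nominal:.0f} s")
print(f"catch-up: {catch_up:.0f} s")
```

Under these assumptions the shortened case comes out at roughly 133 secs rather than 100, so a 2 minute ramp-up would be slightly short even in the best case, which is consistent with the suggestion of 3-4 minutes.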
