Anyway, dropping the .jar file into the faban/lib folder seems to work. I renamed the patched version to commons-httpclient-3.1 and gave it a shot. I'll let you know whether it solves the problem after I do a bunch of runs to make sure the "cave" behavior is gone....

-------------------------------------------------------------------
Kontorinis Vasileios
PhD student, University of California San Diego
http://cseweb.ucsd.edu/~vkontori/
[email protected]
-------------------------------------------------------------------
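To confirm that the driver JVM really picked up the renamed, patched jar (and not another copy of commons-httpclient somewhere else on its classpath), a tiny check like the one below prints where a class was loaded from. This is only a sketch: the class name `WhichJar` is hypothetical, and the answer is meaningful only when it runs with the same classpath as the Faban driver agents (for example with the jars from faban/lib on it).

```java
import java.security.CodeSource;

/**
 * Hypothetical helper: reports which jar (or directory) a class was loaded from,
 * e.g. to check that the patched commons-httpclient is the copy actually in use.
 */
class WhichJar {

    /** Returns the code-source location of the named class, or null if unknown. */
    static String locationOf(String className) {
        try {
            Class<?> c = Class.forName(className);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes from the bootstrap loader (java.lang.*, etc.) have no CodeSource.
            if (src == null || src.getLocation() == null) {
                return null;
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return null; // not on this JVM's classpath at all
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0
                ? args[0]
                : "org.apache.commons.httpclient.ChunkedInputStream";
        String where = locationOf(name);
        System.out.println(where == null
                ? name + ": not found (or loaded by the bootstrap loader)"
                : name + " loaded from " + where);
    }
}
```

If the printed location is the renamed commons-httpclient-3.1 jar in faban/lib, the patched classes are the ones being loaded.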

2010/6/22 Vasileios Kontorinis <[email protected]>

> Akara and Shanti,
> I am not sure whether this was better suited for the user or the development list.
> Anyway, I cc'ed the dev list.
>
> I have the feeling that I have finally figured this one out. After long
> searches on the web I came across the following links:
> http://old.nabble.com/-jira--Created:-(HTTPCLIENT-796)-IOException-when-server-closes-connection-at-end-of-chunk-td19434908.html
>
> https://issues.apache.org/jira/browse/HTTPCORE-173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>
> Both are about the same JIRA issue, which says that commons-httpclient-3.1 can behave
> weirdly and kill the connection when the last packet of a response is
> empty (at least that is how I interpreted it). I am not sure if this is the
> fix I am after, but the stack trace I got from Olio and the one described in
> the links above do match. The Olio stack trace from a "chunked stream ended
> unexpectedly" error is below.
>
> I have patched commons-httpclient-3.1 myself (there is a patch for 3.1 in
> the above links, "c.patch"). However, in order to get faban to use the
> patched commons-httpclient-3.1, I'll need to recompile faban (right?). Can I
> just drop "commons-httpclient-3.1-patched" inside faban/lib and expect it to
> run? Should I rename commons-httpclient-3.1-patched to
> commons-httpclient-3.1 for it to work?
>
> Anyway, this thing has been the source of great headaches. I have to run
> all my experiments twice, so that when I get the "cave" behavior (as Shanti
> called it) I can consider the experiment hosed. Of course this doubles the time,
> and it is annoying...
>
> Let me know....
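For context on the exception in the stack trace quoted below: an HTTP/1.1 chunked body is a sequence of `<hex size>CRLF<data>CRLF` chunks, terminated by a zero-size chunk and an empty line. commons-httpclient 3.1's `ChunkedInputStream` throws `IOException("chunked stream ended unexpectedly")` when the connection reaches end-of-stream while it is still waiting for a chunk-size line, for instance when the server closes the connection before sending that final `0` chunk. The minimal decoder below (the hypothetical `ChunkedDemo`, not the library's code) reproduces that behavior.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/**
 * Minimal HTTP/1.1 chunked-body decoder (RFC 7230, section 4.1), written only to
 * illustrate where "chunked stream ended unexpectedly" comes from. Not production
 * code: chunk extensions and trailers are skipped without validation.
 */
class ChunkedDemo {

    /** Reads one LF-terminated line (dropping a trailing CR); null if already at EOF. */
    static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                int len = sb.length();
                if (len > 0 && sb.charAt(len - 1) == '\r') {
                    sb.setLength(len - 1);
                }
                return sb.toString();
            }
            sb.append((char) b);
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    /** Decodes a complete chunked body; throws if the stream ends before the last chunk. */
    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLine(in);
            if (sizeLine == null) {
                // EOF where a chunk-size line (possibly the final "0") was expected,
                // e.g. the server closed the connection after the last data chunk.
                throw new IOException("chunked stream ended unexpectedly");
            }
            int semi = sizeLine.indexOf(';'); // strip chunk extensions
            String hex = (semi >= 0 ? sizeLine.substring(0, semi) : sizeLine).trim();
            int size = Integer.parseInt(hex, 16);
            if (size == 0) {
                // Last chunk: skip optional trailer lines up to the empty line.
                String trailer = readLine(in);
                while (trailer != null && !trailer.isEmpty()) {
                    trailer = readLine(in);
                }
                return body.toByteArray();
            }
            byte[] data = new byte[size];
            int off = 0;
            while (off < size) {
                int n = in.read(data, off, size - off);
                if (n == -1) {
                    throw new IOException("truncated chunk");
                }
                off += n;
            }
            body.write(data, 0, size);
            readLine(in); // CRLF that follows the chunk data (not validated here)
        }
    }

    static InputStream stream(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        String complete = "5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n";
        String noLastChunk = "5\r\nhello\r\n6\r\n world\r\n"; // closed before "0\r\n\r\n"
        System.out.println(new String(decode(stream(complete)), StandardCharsets.US_ASCII));
        try {
            decode(stream(noLastChunk));
        } catch (IOException e) {
            System.out.println("IOException: " + e.getMessage());
        }
    }
}
```

Running `main` prints `hello world` for the well-formed body and then `IOException: chunked stream ended unexpectedly` for the body that stops before the terminating chunk.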
>
> Message: UIDriverAgent[0].127.doEventDetail: chunked stream ended
> unexpectedly
> ------------------------------
>
> Exception: Message: java.io.IOException: chunked stream ended unexpectedly
> Stack Trace:
> Class                                                   Method                        Line
> org.apache.commons.httpclient.ChunkedInputStream        getChunkSizeFromInputStream   252
> org.apache.commons.httpclient.ChunkedInputStream        nextChunk                     221
> org.apache.commons.httpclient.ChunkedInputStream        read                          176
> java.io.FilterInputStream                               read                          116
> org.apache.commons.httpclient.AutoCloseInputStream      read                          108
> sun.nio.cs.StreamDecoder                                readBytes                     264
> sun.nio.cs.StreamDecoder                                implRead                      306
> sun.nio.cs.StreamDecoder                                read                          158
> java.io.InputStreamReader                               read                          167
> com.sun.faban.driver.transport.hc3.ApacheHC3Transport   fetchResponseData             912
> com.sun.faban.driver.transport.hc3.ApacheHC3Transport   fetchResponse                 838
> com.sun.faban.driver.transport.hc3.ApacheHC3Transport   fetchURL                      543
> com.sun.faban.driver.transport.hc3.ApacheHC3Transport   fetchURL                      564
> org.apache.olio.workload.driver.UIDriver                doEventDetail                 611
> sun.reflect.GeneratedMethodAccessor6                    invoke
> sun.reflect.DelegatingMethodAccessorImpl                invoke                        25
> java.lang.reflect.Method                                invoke                        597
> com.sun.faban.driver.engine.TimeThread                  doRun                         169
> com.sun.faban.driver.engine.AgentThread                 run                           202
>
> --Cheers
>
>
> 2010/6/3 Shanti Subramanyam <[email protected]>
>
>>
>> On Wed, Jun 2, 2010 at 3:08 PM, Vasileios Kontorinis <[email protected]> wrote:
>>
>>> 4. I am curious about this one. I have increased the number of
>>> StartServers in apache.conf to 1536...
>>>
>>> <IfModule mpm_prefork_module>
>>> ListenBacklog 32768
>>> StartServers 1536
>>> MinSpareServers 20
>>> MaxSpareServers 128
>>> ServerLimit 16384
>>> MaxClients 16384
>>> MaxRequestsPerChild 0
>>> </IfModule>
>>>
>>> This should take care of it. Nope?
>>>
>>
>> No, it won't. Having a worker process already in place helps, but remember
>> that apache uses a single process to handle all incoming connections, after
>> which it hands off to the worker process.
>>
>>> I went over the code in sun/faban/driver/engine/AgentImpl.java and, if I
>>> interpret it correctly, faban needs to start #_of_users = #_of_threads. If
>>> there is one new thread every 500ms, it will take quite some time (400 secs).
>>> The code shows that it will sleep for less time if we are falling
>>> short (1/3 of the actual interval), so it should be around 100 secs, which
>>> means 2 mins should be ok.
>>>
>>
>> Right. But I suggest a rampup of at least 3-4 mins to get transactions
>> going before you start measuring.
>>
>>
>>> Now, regarding the "cave" behavior: is there a chance that a bunch
>>> of users dropped their connections at the same time and apache could not
>>> handle it? Is there a way to verify whether that is the case? What if I increase the
>>> MinSpareServers/MaxSpareServers settings?
>>>
>>>
>> The easiest way to verify is to try the changes I suggested.
>>
>>
>>> Thanks again.
>>>
>>>
>> Shanti
>>
>>
>>>
>>> 2010/6/1 Shanti Subramanyam <[email protected]>
>>>
>>> Sorry for the late reply. I just took a look at your data.
>>>> Some comments:
>>>>
>>>> 1. Your rampup time of 30 secs is too small. You may want to increase it
>>>> to a few minutes, considering the load you're targeting.
>>>> 2. I see that you're specifying a variable load, yet the thruput graph
>>>> shows a constant thruput (until it caves in, of course). Did you try doing a
>>>> run with a constant 800 scale, i.e. without specifying the variable load?
>>>> 3. The apache error log shows timestamps that are different from the
>>>> driver's. Can you please sync the time on the systems? If running as root,
>>>> faban does this automatically. Otherwise, once in a while, you should check
>>>> what 'date' shows on all your systems.
>>>> 4. Your 'threadstart/delay' is very small. For this load, I would
>>>> recommend 300-500 ms. Apache is just not capable of handling lots of
>>>> connections in a very short time. This may be the reason for the 'connect
>>>> timed out' messages in the run log.
>>>>
>>>> Shanti
>>>>
>>>>
>>>> On Tue, May 25, 2010 at 12:12 PM, Vasileios Kontorinis <[email protected]> wrote:
>>>>
>>>>> Just to clarify: by "package up" you mean tar/gzip the output dir of
>>>>> the specific run with the errors, right?
>>>>> If so, I have updated the earlier webpage:
>>>>> http://cseweb.ucsd.edu/~vkontori/olio/olio.html
>>>>>
>>>>>
>>>>> There you can find a short description of the issue, and the run is at
>>>>> http://cseweb.ucsd.edu/~vkontori/olio/run.tgz
>>>>>
>>>>> I am not sure whether apache.conf and php.ini are copied to the folder by
>>>>> default, so I copied them into the .tgz file myself.
>>>>> Let me know if there are any issues.
>>>>>
>>>>> Thanks!!
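The thread-start numbers discussed above can be checked with a few lines of arithmetic. The sketch assumes, following Vasileios's reading of AgentImpl.java (not verified against Faban's source here), that one driver thread starts per `threadstart/delay` interval and that a driver falling behind sleeps only a third of that interval; `RampUpEstimate` is a hypothetical name, and 800 threads is the constant scale mentioned in the thread.

```java
/**
 * Back-of-the-envelope ramp-up estimate for starting driver threads one at a time.
 * Assumes (hypothetically, per the reading of AgentImpl.java in this thread) one
 * thread start per delay interval, and a catch-up mode that sleeps 1/3 of it.
 */
class RampUpEstimate {

    /** Seconds needed to start {@code threads} threads if each start waits {@code delayMs}. */
    static double secondsToStart(int threads, double delayMs) {
        return threads * delayMs / 1000.0;
    }

    public static void main(String[] args) {
        int threads = 800; // the constant scale discussed in the thread
        for (double delayMs : new double[] {300, 500}) {
            double full = secondsToStart(threads, delayMs);
            double catchUp = secondsToStart(threads, delayMs / 3); // assumed 1/3-interval sleep
            System.out.printf("delay %.0f ms: %.0f s at full interval, ~%.0f s at 1/3 interval%n",
                    delayMs, full, catchUp);
        }
    }
}
```

It prints 240 s and about 80 s for a 300 ms delay, and 400 s and about 133 s for 500 ms. Even the catch-up figure at 500 ms is a bit over 2 minutes, which fits Shanti's advice to allow a rampup of at least 3-4 minutes.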
>>>>>
>>>>>
>>>>> 2010/5/25 Shanti Subramanyam <[email protected]>
>>>>>
>>>>>> If you're running things locally, we can't blame nfs any longer :-)
>>>>>> If you don't mind, package up the entire run directory and I'll take a
>>>>>> look at it (I assume you can put it somewhere where I can download it).
>>>>>>
>>>>>> Shanti
>>>>>>
>>>>>>
>>>>>> On Mon, May 24, 2010 at 10:43 PM, Vasileios Kontorinis <[email protected]> wrote:
>>>>>>
>>>>>>> Well,
>>>>>>> it feels like there is more than one thing wrong in my
>>>>>>> configuration.
>>>>>>> I stopped using the thumper and went back to using the local fs of
>>>>>>> the web server.
>>>>>>>
>>>>>>> Some of my issues went away, but not all of them.
>>>>>>> Now it fails less often, every 5-10 runs.
>>>>>>>
>>>>>>> The error message in the faban logs:
>>>>>>>
>>>>>>> UIDriverAgent[0].752.do <op> : chunked stream ended unexpectedly
>>>>>>> Note: Error not counted in result.
>>>>>>> Either transaction start or end time is not within steady state.
>>>>>>>
>>>>>>> (if the error happens in the steady state, it is counted...)
>>>>>>>
>>>>>>> I looked into it a little online; it seems like the client is
>>>>>>> getting a response that it is not expecting. I have no clue how to debug
>>>>>>> that.
>>>>>>> Any ideas?
>>>>>>>
>>>>>>> Just a sanity check. I am using:
>>>>>>> *php* v.5.2.4 (without the Suhosin patch, I removed it)
>>>>>>> *Server* version: Apache/2.2.11 (Ubuntu)
>>>>>>> *memcached* 1.2.2 (from phpinfo() I read *memcache* 2.2.5)
>>>>>>> *mysql* Ver 14.12 Distrib 5.0.75, for debian-linux-gnu (x86_64)
>>>>>>> *apc* 3.0.19
>>>>>>>
>>>>>>> Is this old?? I am considering upgrading my Ubuntu version from Hardy
>>>>>>> to Lucid with all the related package updates.
>>>>>>> As far as php is concerned, I know the current version is 5.2.12, so mine
>>>>>>> is pretty old.
>>>>>>>
>>>>>>> Let me know.
>>>>>>> Thanks again.
>>>>>>>
>>>>>>>
>>>>>>> 2010/5/24 Vasileios Kontorinis <[email protected]>
>>>>>>>
>>>>>>>> I am using v3.
>>>>>>>>
>>>>>>>> What I observed that made me suspicious of nfs is a bunch of
>>>>>>>> messages in syslog:
>>>>>>>> May 22 00:22:37 olio-web -- MARK --
>>>>>>>> May 22 00:30:29 olio-web kernel: [2167206.713881] nfs: server 67.58.51.149 not responding, still trying
>>>>>>>> May 22 00:31:03 olio-web kernel: [2167240.993893] nfs: server 67.58.51.149 not responding, still trying
>>>>>>>> May 22 00:31:06 olio-web kernel: [2167243.193896] nfs: server 67.58.51.149 not responding, still trying
>>>>>>>> May 22 00:31:07 olio-web kernel: [2167244.349889] nfs: server 67.58.51.149 not responding, still trying
>>>>>>>> May 22 00:31:07 olio-web kernel: [2167244.357893] nfs: server 67.58.51.149 not responding, still trying
>>>>>>>> May 22 00:31:07 olio-web kernel: [2167244.669885] nfs: server 67.58.51.149 not responding, still trying
>>>>>>>> May 22 00:31:10 olio-web kernel: [2167247.781891] nfs: server 67.58.51.149 not responding, still trying
>>>>>>>> May 22 00:31:10 olio-web kernel: [2167247.785889] nfs: server 67.58.51.149 not responding, still trying
>>>>>>>> May 22 00:31:11 olio-web kernel: [2167248.725885] nfs: server 67.58.51.149 not responding, still trying
>>>>>>>> May 22 00:31:13 olio-web kernel: [2167250.153892] nfs: server 67.58.51.149 not responding, still trying
>>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.173886] nfs: server 67.58.51.149 not responding, still trying
>>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.410700] nfs: server 67.58.51.149 OK
>>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411158] nfs: server 67.58.51.149 OK
>>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411236] nfs: server 67.58.51.149 OK
>>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411249] nfs: server 67.58.51.149 OK
>>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411354] nfs: server 67.58.51.149 OK
>>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411678] nfs: server 67.58.51.149 OK
>>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411691] nfs: server 67.58.51.149 OK
>>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411712] nfs: server 67.58.51.149 OK
>>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.411723] nfs: server 67.58.51.149 OK
>>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.412047] nfs: server 67.58.51.149 OK
>>>>>>>> May 22 00:31:14 olio-web kernel: [2167251.414051] nfs: server 67.58.51.149 OK
>>>>>>>> May 22 00:42:37 olio-web -- MARK --
>>>>>>>> May 22 01:02:37 olio-web -- MARK --
>>>>>>>>
>>>>>>>> The olio run log is weird.
>>>>>>>> Usually it says nothing. Sometimes it gives a bunch of
>>>>>>>>
>>>>>>>> UIDriverAgent[0].752.do <op> : chunked stream ended unexpectedly
>>>>>>>> Note: Error not counted in result.
>>>>>>>> Either transaction start or end time is not within steady state.
>>>>>>>>
>>>>>>>> and also
>>>>>>>>
>>>>>>>> UIDriverAgent[0].635.do <op> : connect timed out
>>>>>>>> Note: Error not counted in result.
>>>>>>>> Either transaction start or end time is not within steady state.
>>>>>>>>
>>>>>>>> where <op> stands for various operations....
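To put a number on the NFS outage in the syslog excerpt above, the bracketed kernel timestamps (seconds since boot) can be compared directly. The hypothetical `NfsStall` helper below takes the first `not responding` line and the first `OK` after it. On the quoted lines that gives about 44.7 s (2167251.410700 - 2167206.713881), consistent with the 00:30:29 to 00:31:14 wall-clock stamps. The kernel only logs `not responding` after an RPC has already timed out, so this figure is likely an underestimate of the real stall.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Estimates how long an NFS server was unreachable from kernel syslog lines such as
 * "... kernel: [2167206.713881] nfs: server 67.58.51.149 not responding, still trying".
 * Uses the bracketed kernel timestamp (seconds since boot), not the syslog wall clock,
 * and only measures the first outage in the input.
 */
class NfsStall {

    private static final Pattern NFS_LINE =
            Pattern.compile("\\[\\s*(\\d+\\.\\d+)\\] nfs: server \\S+ (not responding|OK)");

    /** Seconds from the first "not responding" to the next "OK", or -1 if there is no such pair. */
    static double stallSeconds(List<String> lines) {
        double firstDown = -1;
        for (String line : lines) {
            Matcher m = NFS_LINE.matcher(line);
            if (!m.find()) {
                continue; // e.g. "-- MARK --" lines
            }
            double t = Double.parseDouble(m.group(1));
            if (m.group(2).equals("not responding")) {
                if (firstDown < 0) {
                    firstDown = t;
                }
            } else if (firstDown >= 0) {
                return t - firstDown;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> excerpt = Arrays.asList(
                "May 22 00:22:37 olio-web -- MARK --",
                "May 22 00:30:29 olio-web kernel: [2167206.713881] nfs: server 67.58.51.149 not responding, still trying",
                "May 22 00:31:14 olio-web kernel: [2167251.173886] nfs: server 67.58.51.149 not responding, still trying",
                "May 22 00:31:14 olio-web kernel: [2167251.410700] nfs: server 67.58.51.149 OK");
        System.out.printf(Locale.ROOT, "NFS server unreachable for about %.1f s%n", stallSeconds(excerpt));
    }
}
```

On this excerpt it prints `NFS server unreachable for about 44.7 s`.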
>>>>>>>>
>>>>>>>> But it does not log this every time...
>>>>>>>>
>>>>>>>> Is there a way to monitor nfs internals (buffering, throughput, etc.)
>>>>>>>> on the fly?
>>>>>>>> I am using sar, but it does not give much helpful info.
>>>>>>>>
>>>>>>>>
>>>>>>>> 2010/5/24 Shanti Subramanyam <[email protected]>
>>>>>>>>
>>>>>>>>> Which version of NFS are you using? I suggest you try v3; it is
>>>>>>>>> more efficient. We've run into issues with v4 causing unacceptable
>>>>>>>>> response times.
>>>>>>>>> For the kind of drop you are seeing, I am surprised that you don't
>>>>>>>>> find any errors anywhere. In the first 400-user run, for example, it
>>>>>>>>> looks like all the processes either exited or stalled. In either case,
>>>>>>>>> I would expect to see errors in the faban run log (either that the driver
>>>>>>>>> got an error or that it timed out). Are you sure you checked the faban log?
>>>>>>>>>
>>>>>>>>> Have you tried running nfsstat to see if you can spot anything?
>>>>>>>>>
>>>>>>>>> Shanti
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, May 23, 2010 at 7:07 PM, Vasileios Kontorinis <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Shanti, hi again,
>>>>>>>>>> I sort of managed to fix that. I tried upgrading my php
>>>>>>>>>> version to 5.2.6 and the alert went away. My problems, though, are not
>>>>>>>>>> fixed.
>>>>>>>>>> I even tried completely removing the Suhosin patch (it was a huge pain
>>>>>>>>>> in Ubuntu, since you need to recompile the php module yourself).
>>>>>>>>>> Still, my problems are there.
>>>>>>>>>>
>>>>>>>>>> Now I get no warnings and the logs are clean, but I get weird behavior.
>>>>>>>>>> I needed to send you guys some pics, so I created a related page at:
>>>>>>>>>> http://cseweb.ucsd.edu/~vkontori/olio/olio.html
>>>>>>>>>> I have comments describing the problem at the end.
>>>>>>>>>>
>>>>>>>>>> Any help would be most appreciated. I've spent so much time on it without
>>>>>>>>>> figuring it out.
>>>>>>>>>> My configuration is:
>>>>>>>>>> 1 web server on a VM with 6GB of memory, 4 CPUs
>>>>>>>>>> 1 db server on a VM with 5GB of memory, 4 CPUs
>>>>>>>>>> 1 fs server on a VM with 4GB of memory, 4 CPUs (this one just exposes the filestore over NFS)
>>>>>>>>>> All on the same physical machine, a Nehalem-based server sitting in
>>>>>>>>>> a Sun Blackbox.
>>>>>>>>>> I got similar behavior when I exposed the filestore on the Sun
>>>>>>>>>> Thumper.
>>>>>>>>>>
>>>>>>>>>> Any help would be most appreciated.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2010/5/19 Shanti Subramanyam <[email protected]>
>>>>>>>>>>
>>>>>>>>>>> It's strange that multiple files seem to be complaining about it.
>>>>>>>>>>> Did you try disabling Suhosin? Are you seeing a perceptible drop
>>>>>>>>>>> in memory after reaching steady state?
>>>>>>>>>>>
>>>>>>>>>>> shanti
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 19, 2010 at 4:28 PM, Vasileios Kontorinis <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Lately I get a bunch of these errors in my logs:
>>>>>>>>>>>>
>>>>>>>>>>>> [Wed May 19 22:26:37 2010] [error] [client 10.17.255.250] ALERT - canary mismatch on efree() - heap overflow detected (attacker '10.17.255.250', file '/var/www/oliophp/public_html/taggedEvents.php')
>>>>>>>>>>>> [Wed May 19 22:26:37 2010] [error] [client 10.17.255.250] ALERT - canary mismatch on efree() - heap overflow detected (attacker '10.17.255.250', file '/var/www/oliophp/public_html/taggedEvents.php')
>>>>>>>>>>>> [Wed May 19 22:26:37 2010] [error] [client 10.17.255.250] ALERT - canary mismatch on efree() - heap overflow detected (attacker '10.17.255.250', file '/var/www/oliophp/public_html/users.php')
>>>>>>>>>>>> [Wed May 19 22:26:37 2010] [error] [client 10.17.255.250] ALERT - canary mismatch on efree() - heap overflow detected (attacker '10.17.255.250', file '/var/www/oliophp/public_html/events.php')
>>>>>>>>>>>> [Wed May 19 22:26:37 2010] [error] [client 10.17.255.250] ALERT - canary mismatch on efree() - heap overflow detected (attacker '10.17.255.250', file '/var/www/oliophp/public_html/taggedEvents.php')
>>>>>>>>>>>>
>>>>>>>>>>>> According to blogs, it is a php-related issue: the Suhosin patch
>>>>>>>>>>>> detects a heap overflow and complains.
>>>>>>>>>>>> I was just wondering whether the Olio php code has any known
>>>>>>>>>>>> memory leaks.
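Since the alerts name the script but not a line, a per-file tally at least shows which Olio pages trigger them most often. A minimal sketch, assuming one alert per line in Apache's error log (e.g. /var/log/apache2/error.log on a stock Ubuntu install) in the format quoted above; `CanaryTally` and `sampleLine` are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Tallies Suhosin "canary mismatch" alerts per PHP script in an Apache error log,
 * assuming one alert per log line in the format quoted in this thread.
 */
class CanaryTally {

    private static final Pattern ALERT =
            Pattern.compile("canary mismatch on \\w+\\(\\).*?file '([^']+)'");

    /** Maps script path to number of alerts, sorted by path. */
    static Map<String, Integer> countByFile(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = ALERT.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Builds a log line in the quoted format for the given Olio script (demo data only). */
    static String sampleLine(String script) {
        return "[Wed May 19 22:26:37 2010] [error] [client 10.17.255.250] ALERT - canary mismatch"
                + " on efree() - heap overflow detected (attacker '10.17.255.250', file"
                + " '/var/www/oliophp/public_html/" + script + "')";
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = args.length > 0
                // ISO-8859-1 maps every byte, so stray non-UTF-8 bytes cannot break decoding
                ? Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1)
                : Arrays.asList(sampleLine("taggedEvents.php"), sampleLine("taggedEvents.php"),
                        sampleLine("users.php"), sampleLine("events.php"), sampleLine("taggedEvents.php"));
        countByFile(lines).forEach((file, n) -> System.out.println(n + "  " + file));
    }
}
```

Without arguments it runs on the five alerts quoted above and reports 3 for taggedEvents.php and 1 each for events.php and users.php; pass a log path to tally a real error log instead.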
>>>>>>>>>>>>
>>>>>>>>>>>> My php version on Ubuntu:
>>>>>>>>>>>> PHP 5.2.4-2ubuntu5 with Suhosin-Patch 0.9.6.2 (cli) (built: Feb 27 2008 20:46:51)
>>>>>>>>>>>> Copyright (c) 1997-2007 The PHP Group
>>>>>>>>>>>> Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
>>>>>>>>>>>>
>>>>>>>>>>>> It's too bad that I do not get a line number in the php files that
>>>>>>>>>>>> cause this.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Has anyone come across this one before?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>> Kontorinis Vasileios
>>>>>>>>>>>> PhD student, University of California San Diego
>>>>>>>>>>>> San Diego, CA 92122
>>>>>>>>>>>> Cell. phone: (858) 717 6899
>>>>>>>>>>>> [email protected], [email protected]
>>>>>>>>>>>> -------------------------------------------------------------------
