Reading more on JVM GC led me to investigate the java -server flag ( http://stackoverflow.com/questions/198577/real-differences-between-java-server-and-java-client )
From what I can see, Cassandra's startup scripts don't invoke this mode, or did I miss it?

Chris.

On Mon, Nov 16, 2009 at 10:33 AM, Freeman, Tim <[email protected]> wrote:

> You'll have to stop the swapping somehow. Maybe you can install more
> memory, maybe you can run Cassandra smaller, maybe you can get some other
> process on the machine to be smaller or on some other machine, maybe you can
> move Cassandra to some other machine with more available physical memory.
>
> I don't have experience with running Cassandra smaller than the recommended
> size, so one of those options might not work.
>
> Caching database information in swapped-out pages usually isn't a win. To
> a first approximation, you need an I/O to fetch the swapped-out page, but
> you'd need an I/O anyway to get the information from the database. Swapping
> on modern machines usually isn't a win in general -- memory got bigger and
> CPUs got faster in the last decade, but disks didn't get much faster.
>
> Tim Freeman
> Email: [email protected]
> Desk in Palo Alto: (650) 857-2581
> Home: (408) 774-1298
> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and
> Thursday; call my desk instead.)
>
> *From:* Chris Were [mailto:[email protected]]
> *Sent:* Monday, November 16, 2009 10:13 AM
> *To:* [email protected]
> *Subject:* Re: Timeout Exception
>
> Hi Tim,
>
> Thanks for the great pointers.
>
> si and so are regularly in the 100-2000 range. I'll need to Google more about
> what these mean etc, but are you effectively saying to tell Cassandra to use
> less memory? Cassandra is the only Java app running on the server.
>
> Cheers,
>
> Chris
>
> On Mon, Nov 16, 2009 at 9:59 AM, Freeman, Tim <[email protected]> wrote:
>
> I'm running 0.4.1. I used to get timeouts, then I changed my timeout from
> 5 seconds to 30 seconds and I get no more timeouts. The relevant line from
> storage-conf.xml is:
>
> <RpcTimeoutInMillis>30000</RpcTimeoutInMillis>
>
> The maximum latency is often just over 5 seconds in the worst case when I
> fetch thousands of records, so the default timeout of 5 seconds happens to be a
> little bit too low for me. My records are ~100Kbytes each. You may get
> different results if your records are much larger or much smaller.
>
> The other issue I was having a few days ago was that the machine was page
> faulting so garbage collections were taking forever. Some GCs took 20
> minutes in another Java process. I didn't have verbose:gc turned on in
> Cassandra so I'm not sure what the score was there, but there's little
> reason to expect it to be qualitatively better, since it's pretty random
> which process gets some of its pages swapped out. On a Linux machine, run
> "vmstat 5" when your machine is loaded and if you see numbers greater than 0
> in the "si" and "so" columns in rows after the first, tell one of your Java
> processes to take less memory.
>
> Tim Freeman
> Email: [email protected]
> Desk in Palo Alto: (650) 857-2581
> Home: (408) 774-1298
> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and
> Thursday; call my desk instead.)
>
> *From:* Chris Were [mailto:[email protected]]
> *Sent:* Monday, November 16, 2009 9:47 AM
> *To:* Jonathan Ellis
> *Cc:* [email protected]
> *Subject:* Re: Timeout Exception
>
> I turned on debug logging for a few days and timeouts happened across
> pretty much all requests. I couldn't see any particular request that was
> consistently the problem.
>
> After some experimenting it seems that shutting down Cassandra and
> restarting resolves the problem. Once it hits the JVM memory limit, however,
> the timeouts start again. I have read the page on MemTable thresholds and
> have tried thresholds of 32MB, 64MB and 128MB with no noticeable difference.
> Cassandra is set to use 7GB of memory. I have 12 CFs, however only 6 of
> those have lots of data.
>
> Cheers,
>
> Chris
>
> On Tue, Nov 10, 2009 at 11:55 AM, Jonathan Ellis <[email protected]> wrote:
>
> if you're timing out doing a slice on 10 columns w/ 10% cpu used,
> something is broken
>
> is it consistent as to which keys this happens on? try turning on
> debug logging and seeing where the latency is coming from.
>
> On Tue, Nov 10, 2009 at 1:53 PM, Chris Were <[email protected]> wrote:
> >
> > On Tue, Nov 10, 2009 at 11:50 AM, Jonathan Ellis <[email protected]> wrote:
> >>
> >> On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <[email protected]> wrote:
> >> > Maybe... but it's not just multigets, it also happens when retrieving
> >> > one row with get_slice.
> >>
> >> how many of the 3M columns are you trying to slice at once?
> >
> > Sorry, I must have mixed up the terminology.
> > There are ~3M keys, but fewer than 10 columns in each. The get_slice calls
> > are to retrieve all the columns (10) for a given key.
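
For anyone who wants to automate the "vmstat 5" check Tim describes above, here is a rough sketch in Python. It is not part of Cassandra or of anyone's setup in this thread; the function name `swapping_rows` and the `SAMPLE` text are made up for illustration, and it assumes the usual Linux vmstat layout with a header row naming the "si" and "so" columns.

```python
from typing import List, Tuple

# Illustrative vmstat-style output; the numbers are invented, not measured.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0 204800  10240   2048  51200   12   30   100   200  500  900 10  5 80  5
 2  1 204800   9000   2048  51000  150 1800   400   900  700 1200 12  8 50 30
 0  0 204800   9500   2048  51100    0    0    10    20  300  400  2  1 97  0
"""


def swapping_rows(vmstat_output: str) -> List[Tuple[int, int]]:
    """Return (si, so) for each row after the first where either value is > 0."""
    rows = [line.split() for line in vmstat_output.strip().splitlines()]

    # Locate the header row that names the columns.
    header_idx = next(
        i for i, cols in enumerate(rows) if "si" in cols and "so" in cols
    )
    si_col = rows[header_idx].index("si")
    so_col = rows[header_idx].index("so")

    hits = []
    # Skip the first data row: it reports averages since boot, not current load.
    for cols in rows[header_idx + 2:]:
        if len(cols) <= max(si_col, so_col):
            continue  # banner line or truncated row
        try:
            si, so = int(cols[si_col]), int(cols[so_col])
        except ValueError:
            continue  # header repeated in the middle of a long run
        if si > 0 or so > 0:
            hits.append((si, so))
    return hits


if __name__ == "__main__":
    for si, so in swapping_rows(SAMPLE):
        print(f"swapping: si={si} so={so}")
```

To try it against a live machine, the text passed in could come from capturing the output of something like `vmstat 5 3` under load; any rows it returns suggest the box is swapping and a Java process should be given less memory, as described above.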
