Re: cassandra freezes
I'm still in the experimentation stage, so please forgive this hypothetical question/idea. I am planning to load balance by putting haproxy in front of the Cassandra cluster. First of all, is that a bad idea? Secondly, if I have high enough replication and number of nodes, is it possible, and a good idea, to proactively cause GCing to happen? (I.e. take a node out of the haproxy LB pool, somehow cause it to GC, and then put the node back in... repeat at intervals for each node?) Simon Smith
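[Editor's note] A sketch of what that rotation loop might look like, assuming haproxy's admin socket is enabled ("stats socket" in the haproxy config) and that the GC itself is triggered out-of-band (e.g. by invoking the java.lang:type=Memory gc() operation over JMX). The backend and server names here are hypothetical, and the "force-gc" step is just a placeholder:

```python
import socket

def haproxy_cmd(sock_path, cmd):
    """Send one command (e.g. 'disable server <backend>/<name>') to the
    haproxy admin socket and return the raw response bytes."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(sock_path)
        s.sendall(cmd.encode() + b"\n")
        return s.recv(4096)
    finally:
        s.close()

def rotation_plan(backend, servers):
    """Build the drain -> collect -> re-enable step list for each node.
    The 'force-gc' step is a placeholder for whatever actually triggers
    a collection on that node (e.g. a JMX call); haproxy itself only
    handles the disable/enable part."""
    plan = []
    for name in servers:
        plan.append(("disable", "disable server %s/%s" % (backend, name)))
        plan.append(("force-gc", name))
        plan.append(("enable", "enable server %s/%s" % (backend, name)))
    return plan
```

Running the plan one node at a time (with a health check before re-enabling) keeps at most one replica out of the pool, which is the property that makes the idea plausible at RF >= 2.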
Re: Cassandra with static IP address on EC2 instance: org.apache.thrift.transport.TTransportException
Do you have your Amazon security policy set to allow that port? If you were accessing internally before, the internal security policy may have allowed that traffic, but the default external one doesn't (at least that is how it worked for my account). On Tue, Dec 8, 2009 at 10:09 AM, Sunil Khedar wrote: > Hi All, > I tried using public IP address of my EC2 instance for ThriftAddress, but > getting following error: > org.apache.thrift.transport.TTransportException: Could not create > ServerSocket on address /75.101.152.226:9160.
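[Editor's note] A quick way to check whether the security policy is actually letting traffic through is to test the TCP connection from an external host; a minimal stdlib check, nothing Cassandra-specific:

```python
import socket

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout,
    False on refusal or timeout (e.g. a firewall silently dropping packets)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return True
    except (socket.timeout, socket.error):
        return False
    finally:
        s.close()
```

Run it against port 9160 from outside EC2: a fast False usually means refused (nothing listening), while a slow False suggests the security policy is dropping the traffic.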
Re: Cassandra users survey
The company I'm with is still small and in the early stages, but we're planning on using Cassandra for user profile information (in development right now), and possibly other uses later on. We evaluated CouchDB and Voldemort, and both of those were great as well - for CouchDB, I really liked Futon but had some stability issues and didn't like the manual replication. Voldemort may be great, but I couldn't figure out the API (which probably says more about me than Voldemort). One of the reasons we chose Cassandra is because we feel like it is being used in other situations which required scaling. I'm looking forward to v0.5 because of load-balancing and for better support for the situation where a node is lost permanently. I'm very pleased with the high level of support for Cassandra, both on this mailing list and on IRC. Simon On Fri, Nov 20, 2009 at 4:17 PM, Jonathan Ellis wrote: > Hi all, > > I'd love to get a better feel for who is using Cassandra and what kind > of applications it is seeing. If you are using Cassandra, could you > share what you're using it for and what stage you are at with it > (evaluation / testing / production)? Also, what alternatives you > evaluated/are evaluating would be useful. Finally, feel free to throw > in "I'd love to use Cassandra if only it did X" wishes. :) > > I can start: Rackspace is using Cassandra for stats collection > (testing, almost production) and as a backend for the Mail & Apps > division (early testing). We evaluated HBase, Hypertable, dynomite, > and Voldemort as well. > > Thanks, > > -Jonathan > > (If you're in stealth mode or don't want to say anything in public, > feel free to reply to me privately and I will keep it off the record.) >
Re: Cassandra backup and restore procedures
I'm sorry if this was covered before, but if you lose a node and cannot bring it (or a replacement) back with the same IP address or DNS name, is your only option to restart the entire cluster? E.g. if I have nodes 1, 2, and 3 with replication factor 3, and then I lose node 3, is it possible to bring up a new node 3 with a new IP (and a Seed of either node 1 or node 2) and bootstrap it? Thanks, Simon On Wed, Nov 18, 2009 at 2:03 PM, Jonathan Ellis wrote: > Tokens can change, so IP is used for node identification, e.g. for > hinted handoff. > > On Wed, Nov 18, 2009 at 1:00 PM, Ramzi Rabah wrote: >> Hey Jonathan, why should a replacement node keep the same IP >> address/DNS name as the original node? Wouldn't having the same token >> as the node that went down be sufficient (provided that you did the >> steps above of copying the data from the 2 neighboring nodes)? >>
Re: Thrift Perl API Timeout Issues
I don't have an opinion on the default timeout. But in my experience with other applications, you want to consciously make a choice about what your timeout should be, based on your architecture and performance requirements. You're much better off explicitly setting a timeout that will cause your transaction to finish in a time a little longer than you'd like, and then either re-try or error out the transaction. An alternate approach is to set a quick timeout, one that is just over the 99.?th percentile of transaction times, and then retry. (But whatever you do, don't just retry endlessly, or you may end up with a terrible growing mess of transactions retrying.) In either case, it's a good idea to monitor the frequency of timeouts, so if they increase over the baseline you can track down the cause and fix it. Just my $0.02. Simon On Thu, Oct 15, 2009 at 11:33 PM, Eric Lubow wrote: > So I ran the tests again twice with a huge timeout and it managed to run in > just under 3 hours both times. So this issue is definitely related to the > timeouts. It might be worth changing the default timeouts for Perl to match > the infinite timeouts for Python. Thanks for the quick responses. > -e
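[Editor's note] The "retry, but never endlessly" pattern above can be sketched as a small wrapper; `fn` stands in for the actual Thrift call, and the names are hypothetical:

```python
import time

def call_with_retry(fn, attempts=3, retry_on=(Exception,), backoff=0.1):
    """Call fn(), retrying a bounded number of times on the given exception
    types with exponential backoff, then re-raise. A transient timeout is
    absorbed, but a persistent one surfaces instead of piling up retries."""
    for i in range(attempts):
        try:
            return fn()
        except retry_on:
            if i == attempts - 1:
                raise  # bounded: give up and let the caller see the error
            time.sleep(backoff * (2 ** i))
```

Counting how often the `except` branch fires gives you exactly the timeout-frequency metric the paragraph recommends monitoring.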
Re: Thrift Perl API Timeout Issues
While on the topic, I'm using the python Thrift interface - if I wanted to, how would I change the timeout? I currently do: socket = TSocket.TSocket(host,port) If I wanted to change the timeout would I do something like: socket.setTimeout(timeout) or...? Sorry if I should be able to see this by looking at the code - I'm new to python. Thanks, Simon On Thu, Oct 15, 2009 at 11:42 AM, Jake Luciani wrote: > I think it's 100ms. I need to increase it to match python I guess. > > Sent from my iPhone > > On Oct 15, 2009, at 11:40 AM, Jonathan Ellis wrote: > >> What is the default? >> >> On Thu, Oct 15, 2009 at 10:37 AM, Jake Luciani wrote: >>> >>> You need to call >>> $socket->setRecvTimeout() >>> With a higher number in ms. >>> >>>
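[Editor's note] For what it's worth, the Python Thrift `TSocket` does expose a `setTimeout` method, and it takes milliseconds (passing None blocks indefinitely), so the guess above is on the right track. A sketch, guarded with a try/except so it reads even where the thrift package isn't installed:

```python
# Assumes the Apache Thrift Python bindings are available; the guard lets
# this sketch run (as a no-op) without them.
try:
    from thrift.transport import TSocket

    sock = TSocket.TSocket("localhost", 9160)
    sock.setTimeout(5000)  # milliseconds; None means block indefinitely
except ImportError:
    sock = None  # thrift package not installed
```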
Re: use of nodeprobe snapshot
Thanks, that should get me started. On Wed, Sep 23, 2009 at 3:50 PM, Sammy Yu wrote: > Hi Simon, > The sstables are immutable so the snapshot command creates hard > links for each sstable. Right now it is more for archival purposes. > You should be able to take the sstables in the snapshot directory and > put them in the regular data directory. You can also copy the > sstables onto another machine. Currently, there is no support of > telling cassandra to use a particular snapshot if that's what you are > looking for. > > Cheers, > Sammy > > > On Wed, Sep 23, 2009 at 11:39 AM, Simon Smith wrote: >> Is there any documentation on the snapshot functionality - I'm able to >> successfully create one using nodeprobe, but I don't know how I can >> use it. >> >> Thanks! >> >> Simon >> >
use of nodeprobe snapshot
Is there any documentation on the snapshot functionality - I'm able to successfully create one using nodeprobe, but I don't know how I can use it. Thanks! Simon
Re: get_key_range (CASSANDRA-169)
Jonathan: I tried out the patch you attached to CASSANDRA-440; I applied it to 0.4, and it works for me. Now, as soon as I take the node down, there may be one or two seconds of the thrift-internal error (timeout) but as soon as the host doing the querying can see the node is down, the error stops, and valid output is given by the get_key_range query again. And there isn't any disruption when the node comes back up. Thanks! (I put this same note in the bug report). Simon Smith On Fri, Sep 11, 2009 at 9:38 AM, Simon Smith wrote: > https://issues.apache.org/jira/browse/CASSANDRA-440 > > Thanks again, of course I'm happy to give any additional information > and will gladly do any testing of the fix. > > Simon > > > On Thu, Sep 10, 2009 at 7:32 PM, Jonathan Ellis wrote: >> That confirms what I suspected, thanks. >> >> Can you file a ticket on Jira and I'll work on a fix for you to test? >> >> thanks, >> >> -Jonathan >> >> On Thu, Sep 10, 2009 at 4:42 PM, Simon Smith wrote: >>> I sent get_key_range to node #1 (174.143.182.178), and here are the >>> resulting log lines from 174.143.182.178's log (Do you want the other >>> nodes' log lines? Let me know if so.) >>> >>> DEBUG - get_key_range >>> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash, >>> startWith='', stopAt='', maxResults=100) from 6...@174.143.182.178:7000 >>> DEBUG - collecting :false:3...@1252535119 >>> [ ... chop the repeated & identical collecting messages ... 
] >>> DEBUG - collecting :false:3...@1252535119 >>> DEBUG - Sending RangeReply(keys=[java, java1, java2, java3, java4, >>> java5, match, match1, match2, match3, match4, match5, newegg, newegg1, >>> newegg2, newegg3, newegg4, newegg5, now, now1, now2, now3, now4, now5, >>> sgs, sgs1, sgs2, sgs3, sgs4, sgs5, test, test1, test2, test3, test4, >>> test5, xmind, xmind1, xmind2, xmind3, xmind4, xmind5], >>> completed=false) to 6...@174.143.182.178:7000 >>> DEBUG - Processing response on an async result from >>> 6...@174.143.182.178:7000 >>> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash, >>> startWith='', stopAt='', maxResults=58) from 6...@174.143.182.182:7000 >>> DEBUG - Processing response on an async result from >>> 6...@174.143.182.182:7000 >>> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash, >>> startWith='', stopAt='', maxResults=58) from 6...@174.143.182.179:7000 >>> DEBUG - Processing response on an async result from >>> 6...@174.143.182.179:7000 >>> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash, >>> startWith='', stopAt='', maxResults=22) from 6...@174.143.182.185:7000 >>> DEBUG - Processing response on an async result from >>> 6...@174.143.182.185:7000 >>> DEBUG - Disseminating load info ... >>> >>> >>> Thanks, >>> >>> Simon >>> >>> On Thu, Sep 10, 2009 at 5:25 PM, Jonathan Ellis wrote: >>>> I think I see the problem. >>>> >>>> Can you check if your range query is spanning multiple nodes in the >>>> cluster? You can tell by setting the log level to DEBUG, and looking >>>> for after it logs get_key_range, it will say "reading >>>> RangeCommand(...) from ... @machine" more than once. >>>> >>>> The bug is that when picking the node to start the range query it >>>> consults the failure detector to avoid dead nodes, but if the query >>>> spans nodes it does not do that on subsequent nodes. >>>> >>>> But if you are only generating one RangeCommand per get_key_range then >>>> we have two bugs. 
:) >>>> >>>> -Jonathan >>>> >>> >> >
Re: get_key_range (CASSANDRA-169)
https://issues.apache.org/jira/browse/CASSANDRA-440 Thanks again, of course I'm happy to give any additional information and will gladly do any testing of the fix. Simon On Thu, Sep 10, 2009 at 7:32 PM, Jonathan Ellis wrote: > That confirms what I suspected, thanks. > > Can you file a ticket on Jira and I'll work on a fix for you to test? > > thanks, > > -Jonathan > > On Thu, Sep 10, 2009 at 4:42 PM, Simon Smith wrote: >> I sent get_key_range to node #1 (174.143.182.178), and here are the >> resulting log lines from 174.143.182.178's log (Do you want the other >> nodes' log lines? Let me know if so.) >> >> DEBUG - get_key_range >> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash, >> startWith='', stopAt='', maxResults=100) from 6...@174.143.182.178:7000 >> DEBUG - collecting :false:3...@1252535119 >> [ ... chop the repeated & identical collecting messages ... ] >> DEBUG - collecting :false:3...@1252535119 >> DEBUG - Sending RangeReply(keys=[java, java1, java2, java3, java4, >> java5, match, match1, match2, match3, match4, match5, newegg, newegg1, >> newegg2, newegg3, newegg4, newegg5, now, now1, now2, now3, now4, now5, >> sgs, sgs1, sgs2, sgs3, sgs4, sgs5, test, test1, test2, test3, test4, >> test5, xmind, xmind1, xmind2, xmind3, xmind4, xmind5], >> completed=false) to 6...@174.143.182.178:7000 >> DEBUG - Processing response on an async result from 6...@174.143.182.178:7000 >> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash, >> startWith='', stopAt='', maxResults=58) from 6...@174.143.182.182:7000 >> DEBUG - Processing response on an async result from 6...@174.143.182.182:7000 >> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash, >> startWith='', stopAt='', maxResults=58) from 6...@174.143.182.179:7000 >> DEBUG - Processing response on an async result from 6...@174.143.182.179:7000 >> DEBUG - reading RangeCommand(table='users', columnFamily=pwhash, >> startWith='', stopAt='', maxResults=22) from 6...@174.143.182.185:7000 
>> DEBUG - Processing response on an async result from 6...@174.143.182.185:7000 >> DEBUG - Disseminating load info ... >> >> >> Thanks, >> >> Simon >> >> On Thu, Sep 10, 2009 at 5:25 PM, Jonathan Ellis wrote: >>> I think I see the problem. >>> >>> Can you check if your range query is spanning multiple nodes in the >>> cluster? You can tell by setting the log level to DEBUG, and looking >>> for after it logs get_key_range, it will say "reading >>> RangeCommand(...) from ... @machine" more than once. >>> >>> The bug is that when picking the node to start the range query it >>> consults the failure detector to avoid dead nodes, but if the query >>> spans nodes it does not do that on subsequent nodes. >>> >>> But if you are only generating one RangeCommand per get_key_range then >>> we have two bugs. :) >>> >>> -Jonathan >>> >> >
Re: get_key_range (CASSANDRA-169)
I sent get_key_range to node #1 (174.143.182.178), and here are the resulting log lines from 174.143.182.178's log (Do you want the other nodes' log lines? Let me know if so.) DEBUG - get_key_range DEBUG - reading RangeCommand(table='users', columnFamily=pwhash, startWith='', stopAt='', maxResults=100) from 6...@174.143.182.178:7000 DEBUG - collecting :false:3...@1252535119 [ ... chop the repeated & identical collecting messages ... ] DEBUG - collecting :false:3...@1252535119 DEBUG - Sending RangeReply(keys=[java, java1, java2, java3, java4, java5, match, match1, match2, match3, match4, match5, newegg, newegg1, newegg2, newegg3, newegg4, newegg5, now, now1, now2, now3, now4, now5, sgs, sgs1, sgs2, sgs3, sgs4, sgs5, test, test1, test2, test3, test4, test5, xmind, xmind1, xmind2, xmind3, xmind4, xmind5], completed=false) to 6...@174.143.182.178:7000 DEBUG - Processing response on an async result from 6...@174.143.182.178:7000 DEBUG - reading RangeCommand(table='users', columnFamily=pwhash, startWith='', stopAt='', maxResults=58) from 6...@174.143.182.182:7000 DEBUG - Processing response on an async result from 6...@174.143.182.182:7000 DEBUG - reading RangeCommand(table='users', columnFamily=pwhash, startWith='', stopAt='', maxResults=58) from 6...@174.143.182.179:7000 DEBUG - Processing response on an async result from 6...@174.143.182.179:7000 DEBUG - reading RangeCommand(table='users', columnFamily=pwhash, startWith='', stopAt='', maxResults=22) from 6...@174.143.182.185:7000 DEBUG - Processing response on an async result from 6...@174.143.182.185:7000 DEBUG - Disseminating load info ... Thanks, Simon On Thu, Sep 10, 2009 at 5:25 PM, Jonathan Ellis wrote: > I think I see the problem. > > Can you check if your range query is spanning multiple nodes in the > cluster? You can tell by setting the log level to DEBUG, and looking > for after it logs get_key_range, it will say "reading > RangeCommand(...) from ... @machine" more than once. 
> > The bug is that when picking the node to start the range query it > consults the failure detector to avoid dead nodes, but if the query > spans nodes it does not do that on subsequent nodes. > > But if you are only generating one RangeCommand per get_key_range then > we have two bugs. :) > > -Jonathan >
Re: get_key_range (CASSANDRA-169)
I think it might take quite a bit of effort for me to figure out how to use a Java debugger - it will be a lot quicker if you can give me a patch; then I can certainly re-build using ant against either latest trunk or latest 0.4 and re-run my test. Thanks, Simon On Wed, Sep 9, 2009 at 6:52 PM, Jonathan Ellis wrote: > Okay, so when #5 comes back up, #1 eventually stops erroring out and > you don't have to restart #1? That is good, that would have been a > bigger problem. :) > > If you are comfortable using a Java debugger (by default Cassandra > listens for one on ) you can look at what is going on inside > StorageProxy.getKeyRange on node #1 at the call to > > EndPoint endPoint = > StorageService.instance().findSuitableEndPoint(command.startWith); > > findSuitableEndpoint is supposed to pick a live node, not a dead one. :) > > If not I can write a patch to log extra information for this bug so we > can track it down. > > -Jonathan > > On Wed, Sep 9, 2009 at 5:43 PM, Simon Smith wrote: >> The error starts as soon as the downed node #5 goes down and lasts >> until I restart the downed node #5. >> >> bin/nodeprobe cluster is accurate (it knows quickly when #5 is down, >> and when it is up again) >> >> Since I set the replication set to 3, I'm confused as to why (after >> the first few seconds or so) there is an error just because one host >> is down temporarily. >> >> The way I have the test setup is that I have a script running on each >> of the nodes that is running the get_key_range over and over to >> "localhost". Depending on which node I take down, the behavior >> varies: if I take done one host, it is the only one giving errors (the >> other 4 nodes still work). For the other 4 situations, either 2 or 3 >> nodes continue to work (i.e. the downed node and either one or two >> other nodes are the ones giving errors). Note: the nodes that keep >> working, never fail at all, not even for a few seconds. 
>> >> I am running this on 4GB "cloud server" boxes in Rackspace, I can set >> up just about any test needed to help debug this and capture output or >> logs, and can give a Cassandra developer access if it would help. Of >> course I can include whatever config files or log files would be >> helpful, I just don't want to spam the list unless it is relevant. >> >> Thanks again, >> >> Simon >> >> >> On Tue, Sep 8, 2009 at 6:26 PM, Jonathan Ellis wrote: >>> getting temporary errors when a node goes down, until the other nodes' >>> failure detectors realize it's down, is normal. (this should only >>> take a dozen seconds, or so.) >>> >>> but after that it should route requests to other nodes, and it should >>> also realize when you restart #5 that it is alive again. those are >>> two separate issues. >>> >>> can you verify that "bin/nodeprobe cluster" shows that node 1 >>> eventually does/does not see #5 dead, and alive again? >>> >>> -Jonathan >>> >>> On Tue, Sep 8, 2009 at 5:05 PM, Simon Smith wrote: >>>> I'm seeing an issue similar to: >>>> >>>> http://issues.apache.org/jira/browse/CASSANDRA-169 >>>> >>>> Here is when I see it. I'm running Cassandra on 5 nodes using the >>>> OrderPreservingPartitioner, and have populated Cassandra with 78 >>>> records, and I can use get_key_range via Thrift just fine. 
Then, if I >>>> manually kill one of the nodes (if I kill off node #5), the node (node >>>> #1) which I've been using to call get_key_range will timeout and the >>>> error: >>>> >>>> Thrift: Internal error processing get_key_range >>>> >>>> And the Cassandra output shows the same trace as in 169: >>>> >>>> ERROR - Encountered IOException on connection: >>>> java.nio.channels.SocketChannel[closed] >>>> java.net.ConnectException: Connection refused >>>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) >>>> at >>>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592) >>>> at >>>> org.apache.cassandra.net.TcpConnection.connect(TcpConnection.java:349) >>>> at >>>> org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:131) >>>> at >>>> org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:98) &g
Re: get_key_range (CASSANDRA-169)
The error starts as soon as the downed node #5 goes down and lasts until I restart the downed node #5. bin/nodeprobe cluster is accurate (it knows quickly when #5 is down, and when it is up again). Since I set the replication factor to 3, I'm confused as to why (after the first few seconds or so) there is an error just because one host is down temporarily. The way I have the test set up is that I have a script running on each of the nodes that is running the get_key_range over and over to "localhost". Depending on which node I take down, the behavior varies: if I take down one host, it is the only one giving errors (the other 4 nodes still work). For the other 4 situations, either 2 or 3 nodes continue to work (i.e. the downed node and either one or two other nodes are the ones giving errors). Note: the nodes that keep working never fail at all, not even for a few seconds. I am running this on 4GB "cloud server" boxes in Rackspace, I can set up just about any test needed to help debug this and capture output or logs, and can give a Cassandra developer access if it would help. Of course I can include whatever config files or log files would be helpful, I just don't want to spam the list unless it is relevant. Thanks again, Simon On Tue, Sep 8, 2009 at 6:26 PM, Jonathan Ellis wrote: > getting temporary errors when a node goes down, until the other nodes' > failure detectors realize it's down, is normal. (this should only > take a dozen seconds, or so.) > > but after that it should route requests to other nodes, and it should > also realize when you restart #5 that it is alive again. those are > two separate issues. > > can you verify that "bin/nodeprobe cluster" shows that node 1 > eventually does/does not see #5 dead, and alive again? > > -Jonathan > > On Tue, Sep 8, 2009 at 5:05 PM, Simon Smith wrote: >> I'm seeing an issue similar to: >> >> http://issues.apache.org/jira/browse/CASSANDRA-169 >> >> Here is when I see it. 
I'm running Cassandra on 5 nodes using the >> OrderPreservingPartitioner, and have populated Cassandra with 78 >> records, and I can use get_key_range via Thrift just fine. Then, if I >> manually kill one of the nodes (if I kill off node #5), the node (node >> #1) which I've been using to call get_key_range will timeout and the >> error: >> >> Thrift: Internal error processing get_key_range >> >> And the Cassandra output shows the same trace as in 169: >> >> ERROR - Encountered IOException on connection: >> java.nio.channels.SocketChannel[closed] >> java.net.ConnectException: Connection refused >> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) >> at >> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592) >> at >> org.apache.cassandra.net.TcpConnection.connect(TcpConnection.java:349) >> at >> org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:131) >> at >> org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:98) >> WARN - Closing down connection java.nio.channels.SocketChannel[closed] >> ERROR - Internal error processing get_key_range >> java.lang.RuntimeException: java.util.concurrent.TimeoutException: >> Operation timed out. >> at >> org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:573) >> at >> org.apache.cassandra.service.CassandraServer.get_key_range(CassandraServer.java:595) >> at >> org.apache.cassandra.service.Cassandra$Processor$get_key_range.process(Cassandra.java:853) >> at >> org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:606) >> at >> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) >> at java.lang.Thread.run(Thread.java:675) >> Caused by: java.util.concurrent.TimeoutException: Operation timed out. 
>> at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97) >> at >> org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:569) >> ... 7 more >> >> >> >> If it was giving an error just one time, I could just rely on catching >> the error and trying again. But a get_key_range call to that node I >> was already making get_key_range queries against (node #1) never works >> again (it is still up and it responds fine to multiget Thrift calls), >> so
get_key_range (CASSANDRA-169)
I'm seeing an issue similar to: http://issues.apache.org/jira/browse/CASSANDRA-169 Here is when I see it. I'm running Cassandra on 5 nodes using the OrderPreservingPartitioner, and have populated Cassandra with 78 records, and I can use get_key_range via Thrift just fine. Then, if I manually kill one of the nodes (if I kill off node #5), the node (node #1) which I've been using to call get_key_range will time out with the error: Thrift: Internal error processing get_key_range And the Cassandra output shows the same trace as in 169: ERROR - Encountered IOException on connection: java.nio.channels.SocketChannel[closed] java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592) at org.apache.cassandra.net.TcpConnection.connect(TcpConnection.java:349) at org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:131) at org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:98) WARN - Closing down connection java.nio.channels.SocketChannel[closed] ERROR - Internal error processing get_key_range java.lang.RuntimeException: java.util.concurrent.TimeoutException: Operation timed out. at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:573) at org.apache.cassandra.service.CassandraServer.get_key_range(CassandraServer.java:595) at org.apache.cassandra.service.Cassandra$Processor$get_key_range.process(Cassandra.java:853) at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:606) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:675) Caused by: java.util.concurrent.TimeoutException: Operation timed out. 
at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97) at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:569) ... 7 more If it was giving an error just one time, I could just rely on catching the error and trying again. But a get_key_range call to that node I was already making get_key_range queries against (node #1) never works again (it is still up and it responds fine to multiget Thrift calls), sometimes not even after I restart the downed node (node #5). I end up having to restart node #1 in addition to node #5. The behavior for the other 3 nodes varies - some of them are also unable to respond to get_key_range calls, but some of them do respond to get_key_range calls. My question is, what path should I go down in terms of reproducing this problem? I'm using Aug 27 trunk code - should I update my Cassandra install prior to gathering more information for this issue, and if so, to which version (0.4 or trunk)? If there is anyone who is familiar with this issue, could you let me know what I might be doing wrong, or what the next info-gathering step should be for me? Thank you, Simon Smith Arcode Corporation
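[Editor's note] Until the routing bug was fixed (see the CASSANDRA-440 thread above), one client-side mitigation was to spread get_key_range attempts across the cluster rather than pinning to one node. A sketch, with `do_query` standing in for the real Thrift call:

```python
import random

def query_any_node(nodes, do_query):
    """Try each node in random order until one answers; re-raise the last
    error only if every node fails. `do_query(node)` is a stand-in for
    the actual Thrift get_key_range call against that node."""
    candidates = list(nodes)
    random.shuffle(candidates)
    last_err = None
    for node in candidates:
        try:
            return do_query(node)
        except Exception as err:
            last_err = err  # remember the failure, move on to the next node
    raise last_err
```

This works around a dead coordinator, though it cannot help when the surviving coordinator itself fails internally, which is the bug this thread tracked down.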
Re: when using nodeprobe: java.lang.OutOfMemoryError: Java heap space
Damn, how embarrassing! User error. Thank you so much for the help. On Fri, Aug 28, 2009 at 1:00 PM, Jonathan Ellis wrote: > Oh, I see the problem: nodeprobe uses the jmx port (specified in > cassandra.in.sh -- default 8080), not the thrift port. >
Re: when using nodeprobe: java.lang.OutOfMemoryError: Java heap space
I went and grabbed apache-cassandra-incubating-2009-08-20_13-02-45-src and I get the same symptoms when using that version. Thanks again - Simon On Fri, Aug 28, 2009 at 12:34 PM, Jonathan Ellis wrote: > On Fri, Aug 28, 2009 at 11:25 AM, Simon Smith wrote: >> I'm getting a traceback when using nodeprobe against Cassandra. > > That looks like a Thrift bug. :( > > Can you try an older version of Cassandra, e.g. trunk from a week ago, > or the beta1 release, to see if the Thrift library upgrade from > yesterday is responsible? >
when using nodeprobe: java.lang.OutOfMemoryError: Java heap space
I'm getting a traceback when using nodeprobe against Cassandra. Immediately below is the traceback on the screen running cassandra -f that I get when I do a nodeprobe command (e.g. ./nodeprobe -host myhostname.localdomain -port 9160 info). The config and the traceback on the nodeprobe screen follow below that. (Basic system info: it is an Amazon FC8 instance with just under 2GB of RAM, and the code is Cassandra trunk code from August 27.) The cassandra.in.sh is unchanged and has -Xms128M and -Xmx1G, but I changed that to -Xmx1800M, and then the nodeprobe command gives the same traceback but at least it doesn't crash Cassandra, and after the nodeprobe it continues to let me run multiget via Thrift. I only have about 80 items in the users keyspace, and inserting and running multiget work fine; it is only nodeprobe which causes problems (same symptom if I do "nodeprobe ring"). I have previously worked successfully with Cassandra with the default JVM options in cassandra.in.sh on CentOS 5, but that was a while ago using older trunk code. Any hints as to what is going on? Do I need to be on a machine with more memory and crank the JVM -Xmx up? And just to confirm, are there any non-recommended Linux systems, and are there any recommended ones? Thanks, Simon java.lang.OutOfMemoryError: Java heap space Dumping heap to java_pid21095.hprof ... 
Heap dump file created [3821447 bytes in 0.133 secs] ERROR - Fatal exception in thread Thread[pool-1-thread-2,5,main] java.lang.OutOfMemoryError: Java heap space at org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:296) at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:203) at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:594) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:675) ++ MY CONFIG (storage-conf.xml) (basically unchanged except for the Keyspaces stanza) Test Cluster 0.01 org.apache.cassandra.dht.RandomPartitioner org.apache.cassandra.locator.EndPointSnitch org.apache.cassandra.locator.RackUnawareStrategy 1 /var/lib/cassandra/commitlog /var/lib/cassandra/data /var/lib/cassandra/callouts /var/lib/cassandra/bootstrap /var/lib/cassandra/staging 127.0.0.1 5000 128 7000 7001 0.0.0.0 9160 false 64 32 8 64 64 0.1 8 32 periodic 1000 864000 ++ OUTPUT ON THE SCREEN RUNNING nodeprobe: ./nodeprobe -host `hostname -f` -port 9160 info Error connecting to remote JMX agent! 
java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is: java.io.EOFException] at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:342) at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:267) at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:151) at org.apache.cassandra.tools.NodeProbe.(NodeProbe.java:113) at org.apache.cassandra.tools.NodeProbe.main(NodeProbe.java:533) Caused by: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is: java.io.EOFException] at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:119) at com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:203) at javax.naming.InitialContext.lookup(InitialContext.java:410) at javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1902) at javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1871) at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:276) ... 4 more Caused by: java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is: java.io.EOFException at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:304) at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202) at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:340) at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source) at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:115) ... 9 more Caused by: java.io.EOFException at java.io.DataInputStream.readByte(DataInputStream.java:268) at sun.rmi.transport