Visiting Auckland
So long as the volcanic ash stays away, I'll be visiting Auckland next week on the 23rd and 24th. Drop me an email if you would like to meet to talk about things Cassandra.

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
Re: Cassandra JVM GC settings
It would help if you could provide some log messages from the GCInspector so people can see how much GC is going on.

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 17 Jun 2011, at 02:46, Sebastien Coutu wrote:

Hi Everyone, I'm seeing Cassandra GC a lot and I would like to tune the young space and the tenured space. Would anyone have recommendations on the NewRatio or NewSize/MaxNewSize to use for an environment where Cassandra has several column families and a mixed load of reads and writes? The JVM has 8G of heap space assigned to it and there are 9 nodes in this cluster. Thanks for the comments! Sébastien Coutu
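For reference, heap and young-generation sizing in 0.7/0.8 is normally set through cassandra-env.sh rather than raw JVM flags; the values below are illustrative only, not a tuning recommendation for the 8G cluster described above:

```shell
# cassandra-env.sh (illustrative values, not a recommendation)
MAX_HEAP_SIZE="8G"     # total JVM heap for Cassandra
HEAP_NEWSIZE="800M"    # young generation; roughly 100 MB per physical core
                       # is the starting point suggested in cassandra-env.sh
```

The startup script passes these to the JVM as -Xms/-Xmx and -Xmn.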
Re: client API
The Thrift Java compiler creates code that is not compliant with Java 5: https://issues.apache.org/jira/browse/THRIFT-1170 So you may have trouble getting the Thrift API to run under JDK 1.5.

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 17 Jun 2011, at 03:14, karim abbouh wrote:

I use JDK 1.6 to install and launch Cassandra on a Linux platform, but can I use JDK 1.5 for my Cassandra client?
Re: Docs: Token Selection
But, I'm thinking about using OldNetworkTopStrat.

NetworkTopologyStrategy is where it's at.

A
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 17 Jun 2011, at 01:39, AJ wrote:

Thanks Eric! I've finally got it! I feel like I've just been initiated or something by discovering this secret. I kid! But, I'm thinking about using OldNetworkTopStrat. Do you, or anyone else, know if the same rules for token assignment apply to ONTS?

On 6/16/2011 7:21 AM, Eric tamme wrote:

AJ, sorry I seemed to miss the original email on this thread. As Aaron said, when computing tokens for multiple data centers, you should compute them independently for each data center, as if it were its own Cassandra cluster. You can have overlapping token ranges between multiple data centers, but no two nodes can have the same token, so for subsequent data centers I just increment the tokens. For two data centers with two nodes each using RandomPartitioner, calculate the tokens for the first DC normally, but in the second data center, increment the tokens by one.

In DC 1
node 1 = 0
node 2 = 85070591730234615865843651857942052864

In DC 2
node 1 = 1
node 2 = 85070591730234615865843651857942052865

For RowMutations this will give each data center a local set of nodes that it can write to for complete coverage of the entire token space. If you are using NetworkTopologyStrategy for replication, it will give an offset mirror replication between the two data centers so that your replicas will not get pinned to a node in the remote DC. There are other ways to select the tokens, but the increment method is the simplest to manage and continue to grow with. Hope that helps.

-Eric
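Eric's increment scheme is easy to sketch. Assuming RandomPartitioner's 2**127 token space, a small helper (hypothetical, not part of any Cassandra tool) computes each data center's tokens:

```python
def dc_tokens(num_nodes, dc_index, ring_size=2**127):
    """Evenly spaced RandomPartitioner tokens for one data center,
    offset by dc_index so no two nodes in any DC share a token."""
    return [i * ring_size // num_nodes + dc_index for i in range(num_nodes)]

dc1 = dc_tokens(2, 0)  # [0, 85070591730234615865843651857942052864]
dc2 = dc_tokens(2, 1)  # [1, 85070591730234615865843651857942052865]
```

The outputs match the two-DC example in the email above.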
Re: Easy way to overload a single node on purpose?
The short answer to the problem you saw is: monitor the disk space. Also monitor client-side logs for errors. Running out of commit log space does not stop the node from doing reads, so it can still be considered up.

One node's view of its own UP-ness is not as important as the other nodes' (or clients') view of it. For example: a node will appear UP in the ring view of another node if it is participating in gossip messages and its application state is normal. But a node will appear UP in its own view of the ring most of the time (assuming it is not bootstrapping, leaving, etc. and it has joined the ring). This applies even if its gossip service has been disabled. To a client, a node will appear down if it is not responding to RPC requests. But it could still be part of the cluster, appear UP to other nodes, and be responding to reads and/or writes.

So to monitor that a node is running in some form you can:

- Monitor the TP stats (you should be doing this anyway), so you know the node is in some running state.
- Check that you can connect as a client to each node and do some simple call: either a read/write, or describe_ring() which will execute locally, or describe_schema_versions() which will call all live nodes. A read/write will only verify that the node can act as a coordinator, not that it can read/write itself.
- Monitor the other nodes' view of each node using nodetool ring.

Now that I've written that I'm not 100% sold on it, but it will do for now :)

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 17 Jun 2011, at 10:25, Suan Aik Yeo wrote:

Having a ping column can work if every key is replicated to every node. It would tell you the cluster is working, sort of. Once the number of nodes is greater than the RF, it tells you a subset of the nodes works.

The way our check works is that each node checks itself, so in this context we're not concerned about whether the cluster is up, but that each individual node is up.
So the symptoms I saw, the node actually going down etc., were probably due to many different events happening at the time, and will be very hard to recreate?

On Thu, Jun 16, 2011 at 6:16 AM, aaron morton aa...@thelastpickle.com wrote:

DEBUG 14:36:55,546 ... timed out

is logged when the coordinator times out waiting for the replicas to respond; the timeout setting is rpc_timeout in the yaml file. This results in the client getting a TimedOutException. AFAIK there are no global "everything is good / bad" flags to check, and AFAIK a node will not mark itself down if it runs out of disk space. So you need to monitor the free disk space and alert on that. Having a ping column can work if every key is replicated to every node. It would tell you the cluster is working, sort of. Once the number of nodes is greater than the RF, it tells you a subset of the nodes works. If you google around you'll find discussions about monitoring with Munin, Ganglia, Cloudkick and OpsCenter. If you install mx4j you can access the JMX metrics via HTTP.

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 16 Jun 2011, at 10:38, Suan Aik Yeo wrote:

Here's a weird one... what's the best way to get a Cassandra node into a half-crashed state? We have a 3-node cluster running 0.7.5. A few days ago this happened organically to node1 - the partition the commitlog was on was 100% full and there was a "No space left on device" error, and after a while, although the cluster and node1 were still up, to the other nodes it was down, and messages like:

DEBUG 14:36:55,546 ... timed out

started to show up in its debug logs. We have a tool to indicate to the load balancer that a Cassandra node is down, but it didn't detect it that time. Now I'm having trouble purposefully getting the node back to that state, so that I can try other monitoring methods.
I've tried to fill up the commitlog partition with other files, and although I get the "No space left on device" error, the node still doesn't go down and show the other symptoms it showed before. Also, if anyone could recommend a good way for a node itself to detect that it's in such a state I'd be interested in that too. Currently what we're doing is making a describe_cluster_name() Thrift call, but that still worked when the node was down. I'm thinking of something like reading/writing a fixed value in a keyspace as a check... Unfortunately Java-based solutions are out of the question.

Thanks,
Suan
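A minimal node-local liveness probe along the lines discussed in this thread can be sketched at the TCP level. This only shows that something is accepting connections on the RPC port (9160 by default), which is weaker than the describe_schema_versions() call suggested earlier; the local listener below is just a stand-in for a real node:

```python
import socket

def port_alive(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds; a crude check
    that the node is at least listening on its RPC port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo: a throwaway local listener stands in for a node's RPC port.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))       # OS-assigned free port
listener.listen(1)
port = listener.getsockname()[1]

up = port_alive("127.0.0.1", port)    # something is accepting connections
listener.close()
down = port_alive("127.0.0.1", port)  # nothing listening any more
```

As the thread notes, a node can accept connections yet still be half-crashed, so this belongs alongside disk-space and TP-stats monitoring, not instead of it.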
Re: cassandra crash
What do you mean by crash?

If there was some sort of error in Cassandra (including Java running out of heap space) it will appear in the logs. Are there any error messages in the log?

If there was some sort of JVM error it will be written to stderr and probably end up on stdout / the console. If you are using a packaged distribution it will probably be in /var/log/cassandra/output.log

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 17 Jun 2011, at 19:18, Donna Li wrote:

All: Can you find some exception in the last sentence? Would Cassandra crash when memory is not enough? There are some other applications running with Cassandra; the other applications may use a lot of memory.

From: Donna Li
Sent: 17 June 2011, 9:58
To: user@cassandra.apache.org
Subject: cassandra crash

All: Why did Cassandra crash after printing the following log?

INFO [SSTABLE-CLEANUP-TIMER] 2011-06-16 14:19:01,020 SSTableDeletingReference.java (line 104) Deleted /usr/local/rss/DDB/data/data/PSCluster/CsiStatusTab-206-Data.db
INFO [SSTABLE-CLEANUP-TIMER] 2011-06-16 14:19:01,020 SSTableDeletingReference.java (line 104) Deleted /usr/local/rss/DDB/data/data/PSCluster/CsiStatusTab-207-Data.db
INFO [SSTABLE-CLEANUP-TIMER] 2011-06-16 14:19:01,020 SSTableDeletingReference.java (line 104) Deleted /usr/local/rss/DDB/data/data/PSCluster/VCCCurScheduleTable-137-Data.db
INFO [SSTABLE-CLEANUP-TIMER] 2011-06-16 14:19:01,021 SSTableDeletingReference.java (line 104) Deleted /usr/local/rss/DDB/data/data/PSCluster/CsiStatusTab-205-Data.db
INFO [SSTABLE-CLEANUP-TIMER] 2011-06-16 14:19:01,021 SSTableDeletingReference.java (line 104) Deleted /usr/local/rss/DDB/data/data/PSCluster/VCCCurScheduleTable-139-Data.db
INFO [SSTABLE-CLEANUP-TIMER] 2011-06-16 14:19:01,021 SSTableDeletingReference.java (line 104) Deleted /usr/local/rss/DDB/data/data/PSCluster/VCCCurScheduleTable-138-Data.db
INFO [SSTABLE-CLEANUP-TIMER] 2011-06-16 14:19:01,021 SSTableDeletingReference.java (line 104) Deleted /usr/local/rss/DDB/data/data/PSCluster/CsiStatusTab-208-Data.db
INFO [GC inspection] 2011-06-16 14:22:59,562 GCInspector.java (line 110) GC for ParNew: 385 ms, 26859800 reclaimed leaving 117789112 used; max is 118784

Best Regards
Donna li
Re: Cassandra.yaml
The change to remove the calls to DatabaseDescriptor was in this commit on the 0.8 branch:
https://github.com/apache/cassandra/commit/fe122c8c7d9ca0f002d5f394b4414dc91f278d1f

It looks like it did not make it over to the 0.8.0 branch:
https://github.com/apache/cassandra/blob/cassandra-0.8.0/src/java/org/apache/cassandra/config/CFMetaData.java#L642

It is in trunk and the current trunk builds. Can you try the nightly here?
https://builds.apache.org/job/Cassandra-0.8/

Hope that helps.
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 17 Jun 2011, at 20:52, Vivek Mishra wrote:

Thanks Aaron. But I tried it with the 0.8.0 release only!

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Friday, June 17, 2011 1:55 PM
To: user@cassandra.apache.org
Subject: Re: Cassandra.yaml

Sounds like https://issues.apache.org/jira/browse/CASSANDRA-2694

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 17 Jun 2011, at 20:10, Vivek Mishra wrote:

Hi Sasha, This is what I am trying. I can sense this is happening with the JDBC driver stuff.

public static void main(String[] args) {
    try {
        java.sql.Connection con = null;
        Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver");
        con = DriverManager.getConnection("jdbc:cassandra:root/root@localhost:9160/Key1");
        // con.
        System.out.println(con != null);
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    } catch (SQLException e) {
        e.printStackTrace();
    }
}

Getting the following error:

org.apache.cassandra.config.ConfigurationException: Cannot locate cassandra.yaml
at org.apache.cassandra.config.DatabaseDescriptor.getStorageConfigURL(DatabaseDescriptor.java:111)
at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:121)
at org.apache.cassandra.config.CFMetaData.fromThrift(CFMetaData.java:642)
at org.apache.cassandra.cql.jdbc.ColumnDecoder.<init>(ColumnDecoder.java:61)
at org.apache.cassandra.cql.jdbc.Connection.execute(Connection.java:142)
at org.apache.cassandra.cql.jdbc.Connection.execute(Connection.java:124)
at org.apache.cassandra.cql.jdbc.CassandraConnection.<init>(CassandraConnection.java:83)
at org.apache.cassandra.cql.jdbc.CassandraDriver.connect(CassandraDriver.java:86)
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)

Ideally it should find it. Not sure what the issue is.
-Vivek

-----Original Message-----
From: Sasha Dolgy [mailto:sdo...@gmail.com]
Sent: Friday, June 17, 2011 1:31 PM
To: user@cassandra.apache.org
Subject: Re: Cassandra.yaml

Hi Vivek, When I write client code in Java, using Hector, I don't specify a cassandra.yaml ... I specify the host(s) and keyspace I want to connect to. Alternately, I specify the host(s) and create the keyspace if the one I would like to use doesn't exist (new cluster, for example). At no point do I use a yaml file with my client code. The conf/cassandra.yaml is there to tell the Cassandra server how to behave / operate when it starts ...

-sd

On Fri, Jun 17, 2011 at 9:55 AM, Vivek Mishra vivek.mis...@impetus.co.in wrote:

I have a query: I have my Cassandra server running on my local machine and it has loaded Cassandra-specific settings from apache-cassandra-0.8.0-src/apache-cassandra-0.8.0-src/conf/cassandra.yaml Now if I am writing a Java program to connect to this server, why do I need to provide a new cassandra.yaml file again? Even if the server is already up and running? Even if I can create keyspaces and column families programmatically? Isn't it some type of redundancy? Might be my query is a bit irrelevant.
-Vivek

Write to us for a Free Gold Pass to the Cloud Computing Expo, NYC to attend a live session by Head of Impetus Labs on 'Secrets of Building a Cloud Vendor Agnostic PetaByte Scale Real-time Secure Web Application on the Cloud'. Looking to leverage the Cloud for your Big Data Strategy? Attend Impetus webinar on May 27 by registering at http://www.impetus.com/webinar?eventid=42 . NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity
Re: MemoryMeter uninitialized (jamm not specified as java agent)
What do you get for

$ java -version
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07-334-10M3326)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02-334, mixed mode)

Also you can check whether the wrapper has correctly detected things with

ps aux | grep javaagent

The args to the java process should include -javaagent:bin/../lib/jamm-0.2.2.jar

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 17 Jun 2011, at 22:18, Rene Kochen wrote:

Since using Cassandra 0.8, I see the following warning:

WARN 12:05:59,807 MemoryMeter uninitialized (jamm not specified as java agent); assuming liveRatio of 10.0. Usually this means cassandra-env.sh disabled jamm because you are using a buggy JRE; upgrade to the Sun JRE instead

I'm using the Sun JRE. What can I do to resolve this? What are the consequences of this warning?

Thanx, Rene
Re: Re : last record rowId
The get_range_slice() API call allows you to iterate over the keys in the DB.

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 18 Jun 2011, at 05:00, karim abbouh wrote:

Is there any way to remember the keys (rowIds) inserted in the Cassandra database? B.R

From: Jonathan Ellis jbel...@gmail.com
To: user@cassandra.apache.org
Cc: karim abbouh karim_...@yahoo.fr
Sent: Wednesday, 15 June 2011, 18:05
Subject: Re: last record rowId

You're better served using UUIDs than numeric row IDs for surrogate keys. (Of course natural keys work fine too.)

On Wed, Jun 15, 2011 at 9:16 AM, Utku Can Topçu u...@topcu.gen.tr wrote:

As far as I can tell, this functionality doesn't exist. However, you could insert the rowId as a column within a separate row, and request the latest column. I think this would work for you. However, every insert would need a get request, which I think would be a performance issue somehow. Regards, Utku

On Wed, Jun 15, 2011 at 11:14 AM, karim abbouh karim_...@yahoo.fr wrote:

In my Java application, when we try to insert we need to know the last rowId at all times in order to insert the new record at rowId+1, so we have to save this rowId in a file. Is there another way to know the last record's rowId? thanks B.R

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
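The usual get_range_slice pattern is to page through keys in batches, re-issuing the call with the last key of the previous batch as the new start, and skipping that first, duplicated row. A sketch of the paging logic in Python, with a stub in place of the real Thrift call (names are illustrative):

```python
def page_keys(fetch, batch=512):
    """Yield every row key using the get_range_slice paging pattern:
    request `batch` rows from `start`, then restart at the last key seen.
    The range is inclusive, so each batch after the first drops its
    first row (already yielded)."""
    start = ""
    while True:
        rows = fetch(start, batch)
        if not rows:
            return
        for key in (rows if not start else rows[1:]):
            yield key
        if len(rows) < batch:
            return
        start = rows[-1]

# Stub standing in for the real get_range_slice call; note that a live
# cluster with RandomPartitioner returns rows in token order, not key order.
data = sorted("key%02d" % i for i in range(10))

def fake_fetch(start, count):
    return [k for k in data if k >= start][:count]

assert list(page_keys(fake_fetch, batch=4)) == data
```

This is also the paging scheme pycassa's buffer_size implements, as described later in this digest.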
Re: Error trying to move a node - 0.7
I *think* someone had a similar problem once before, moving a node that was the only node in a DC. What version are you using?

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 17 Jun 2011, at 07:42, Ben Frank wrote:

Hi All, I'm getting the following error when trying to move a node's token:

nodetool -h 145.6.92.82 -p 18080 move 56713727820156410577229101238628035242
cassandra.in.sh executing for environment DEV1
Exception in thread "main" java.lang.AssertionError
at org.apache.cassandra.locator.TokenMetadata.firstTokenIndex(TokenMetadata.java:393)
at org.apache.cassandra.locator.TokenMetadata.ringIterator(TokenMetadata.java:418)
at org.apache.cassandra.locator.NetworkTopologyStrategy.calculateNaturalEndpoints(NetworkTopologyStrategy.java:94)
at org.apache.cassandra.service.StorageService.calculatePendingRanges(StorageService.java:807)
at org.apache.cassandra.service.StorageService.calculatePendingRanges(StorageService.java:773)
at org.apache.cassandra.service.StorageService.startLeaving(StorageService.java:1468)
at org.apache.cassandra.service.StorageService.move(StorageService.java:1605)
at org.apache.cassandra.service.StorageService.move(StorageService.java:1580)
.
.
.

My ring looks like this:

Address      Status State  Load     Owns    Token
                                            113427455640312821154458202477256070484
145.6.99.80  Up     Normal 1.63 GB  36.05%  4629135223504085509237477504287125589
145.6.92.82  Up     Normal 2.86 GB  1.09%   6479163079760931522618457053473150444
145.6.99.81  Up     Normal 2.01 GB  62.86%  113427455640312821154458202477256070484

'80' and '81' are configured to be in the East coast data center and '82' is in the West. Can anyone shed any light on what might be going on here?

-Ben
Re: framed transport and buffered transport
From CHANGES.txt: https://github.com/apache/cassandra/blob/cassandra-0.8.0/CHANGES.txt#L687

"make framed transport the default so malformed requests can't OOM the server (CASSANDRA-475)"

btw, you *really* should upgrade.

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 20 Jun 2011, at 15:07, Donna Li wrote:

My Cassandra version is 0.6.3; what is the advantage of framed transport?

-----Original Message-----
From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: 20 June 2011, 10:56
To: user@cassandra.apache.org
Subject: Re: framed transport and buffered transport

The most important difference is that only framed is supported in 0.8+

On Sun, Jun 19, 2011 at 9:27 PM, Donna Li donna...@utstar.com wrote:

All: What is the difference between framed transport and buffered transport? And what are the advantages and disadvantages of the two? Thanks Donna li

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Re: Secondary indexes performance
Can you provide some more information on the query you are running? How many terms are you selecting with? How long does it take to return 1024 rows? IMHO that's a reasonably big slice to get. The server will pick the most selective equality predicate, and then filter the results from that using the other predicates.

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 21 Jun 2011, at 09:04, Wojciech Pietrzok wrote:

Hello, I've noticed that queries using secondary indexes seem to be getting rather slow. Right now I've got a column family with 4 indexed columns (plus 5-6 non-indexed columns; column values are small), and around 1.5 to 2 million rows. I'm using the pycassa client, and a query using the get_indexed_slices method that returns over 10k rows (in batches of 1024 rows) can take up to 30 seconds. Is this normal? It seems too long to me. Maybe there's a way to tune the Cassandra config for better secondary index performance? Using Cassandra 0.7.6

--
KosciaK
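The query-planning behaviour described above (seed from the most selective indexed equality predicate, then filter on the remaining predicates) can be sketched in a few lines of Python. This is an illustration of the idea only, not Cassandra's actual code, and the data structures are hypothetical:

```python
def indexed_query(rows, index, predicates):
    """Pick the indexed '=' predicate matching the fewest keys, then filter
    its candidate keys against every predicate. `rows` maps key -> row dict;
    `index` maps column -> value -> list of keys; predicates are
    (column, op, value) triples. Assumes every row has every queried column."""
    eq = [(col, val) for col, op, val in predicates if op == "=" and col in index]
    seed_col, seed_val = min(eq, key=lambda cv: len(index[cv[0]].get(cv[1], [])))
    candidates = index[seed_col].get(seed_val, [])

    ops = {"=": lambda a, b: a == b,
           ">=": lambda a, b: a >= b,
           "<=": lambda a, b: a <= b}

    def matches(row):
        return all(ops[op](row[col], val) for col, op, val in predicates)

    return [k for k in candidates if matches(rows[k])]

rows = {"a": {"state": "NY", "age": 30},
        "b": {"state": "NY", "age": 40},
        "c": {"state": "CA", "age": 30}}
index = {"state": {"NY": ["a", "b"], "CA": ["c"]},
         "age": {30: ["a", "c"], 40: ["b"]}}
```

The cost model follows directly: a weakly selective seed predicate forces a long filtering pass, which is one reason large get_indexed_slices scans can be slow.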
Re: OOM during restart
AFAIK the node will not announce itself in the ring until the log replay is complete, so it will not get the schema update until after log replay. If possible I'd avoid making the schema change until you have solved this problem.

My theory on OOM during log replay is that the high-speed inserts are a good way of finding out if the maximum memory required by the schema is too big to fit in the JVM. How big is the max JVM heap size, and do you have a lot of CFs?

The simple solution is to either (temporarily) increase the JVM heap size or move the log files so that the server can process only one at a time. The JVM option -Dcassandra.join_ring=false will stop the node from joining the cluster and stop other nodes sending requests to it until you have sorted it out.

Hope that helps.
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 21 Jun 2011, at 10:24, Gabriel Ki wrote:

Hi, Cassandra: 7.6-2. I was restarting a node and ran into OOM while replaying the commit log. I am not able to bring the node up again.

DEBUG 15:11:43,501 forceFlush requested but everything is clean

For this I don't know what to do.
java.lang.OutOfMemoryError: Java heap space
at org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:123)
at org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.<init>(SSTableWriter.java:395)
at org.apache.cassandra.io.sstable.SSTableWriter.<init>(SSTableWriter.java:76)
at org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2238)
at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:166)
at org.apache.cassandra.db.Memtable.access$000(Memtable.java:49)
at org.apache.cassandra.db.Memtable$1.runMayThrow(Memtable.java:189)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Any help will be appreciated. If I update the schema while a node is down, the new schema is loaded before the flushing when the node is brought up again, correct?

Thanks,
-gabe
Re: Create columnFamily
You've set a comparator for the super column names, but not the sub columns. e.g.

[default@dev] set data['31']['address']['city']='noida';
org.apache.cassandra.db.marshal.MarshalException: cannot parse 'city' as hex bytes
[default@dev] set data['31']['address'][utf8('city')]='noida';
Value inserted.

Hope that helps.
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 21 Jun 2011, at 19:06, Vivek Mishra wrote:

I understand that I might be missing something on my end. But somehow I cannot get this working using cassandra-cli:

[default@key1] create column family supusers with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type and column_type=Super;
59e2e950-9bd4-11e0--242d50cf1fbf
Waiting for schema agreement...
... schemas agree across the cluster

The super column family got created. I issued:

[default@key1] assume supusers keys as ascii;

But it still fails for:

[default@key1] set supusers['31']['address']['city']='noida';
org.apache.cassandra.db.marshal.MarshalException: cannot parse 'city' as hex bytes

Please suggest, what am I doing incorrect here?
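For completeness, a super column family that accepts UTF8 sub column names needs a subcomparator as well as a comparator. Something along these lines (an untested sketch of the 0.8 cassandra-cli syntax) avoids the utf8() wrapping on every set:

```
create column family supusers
    with column_type = Super
    and comparator = UTF8Type
    and subcomparator = UTF8Type
    and default_validation_class = UTF8Type
    and key_validation_class = UTF8Type;

set supusers['31']['address']['city'] = 'noida';
```

Without subcomparator, the sub column names default to BytesType, which is why 'city' cannot be parsed as hex bytes.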
Re: Flushing behavior in Cassandra 0.8
The new memtable_total_space_in_mb option is kicking in:
https://github.com/apache/cassandra/blob/cassandra-0.8.0/NEWS.txt#L34
http://thelastpickle.com/2011/05/04/How-are-Memtables-measured/

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 21 Jun 2011, at 22:12, Rene Kochen wrote:

I am trying to understand the flushing behavior in Cassandra 0.8. When I create rows, after a few seconds I see the following lines in the log:

INFO 11:18:46,470 flushing high-traffic column family ColumnFamilyStore(table='Traxis', columnFamily='Customers')
INFO 11:18:46,471 Enqueuing flush of Memtable-Customers@14306556(697958/50059836 serialized/live bytes, 30346 ops)
INFO 11:18:46,472 Writing Memtable-Customers@14306556(697958/50059836 serialized/live bytes, 30346 ops)
INFO 11:18:47,415 Completed flushing C:\Cassandra\Storage\data\Traxis\Customers-g-1-Data.db (4157370 bytes)

The super column family is configured as follows:

Memtable thresholds: 0.2953125/63/1440 (millions of ops/MB/minutes)

I don't think any of the three thresholds should have triggered the flush?

Thanks, Rene
Re: CommitLog replay
Use nodetool cfstats or "show keyspaces;" in cassandra-cli to see the flush settings. The default is (I think) 60 minutes, 0.1 million ops, or 1/16th of the heap size when the CF was created. But under 0.8 there is an automagical global memory manager, see:
https://github.com/apache/cassandra/blob/cassandra-0.8.0/NEWS.txt#L34
http://thelastpickle.com/2011/05/04/How-are-Memtables-measured/

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 22 Jun 2011, at 01:51, Stephen Pope wrote:

I've only got one CF, and haven't changed the default flush expiry period. I'm not sure whether the node had fully started or not. I had to restart my data insertion (for other reasons), so I can check the system log upon restart when the data is finished inserting. Do you know off-hand how long the default flush expiry period is?

Cheers, Steve

-----Original Message-----
From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller
Sent: Tuesday, June 21, 2011 9:13 AM
To: user@cassandra.apache.org
Subject: Re: CommitLog replay

I've got a single node deployment of 0.8 set up on my windows box. When I insert a bunch of data into it, the commitlogs directory doesn't clear upon completion (should it?).

It is expected that commit logs are retained for a while, and that there is replay going on when restarting a node. The main way to ensure that a smaller amount of commit log is active at any given moment is to ensure that all column families are flushed sufficiently often. This is because once column families are flushed, they no longer necessitate the retention of the commit logs that contain the writes that were just flushed. Pay attention to whether you maybe have some CFs that are written very rarely and won't flush until the flush expiry period.

As a result, when I stop and restart Cassandra it replays all the commitlogs, then starts compacting (which seems like it's taking a long time). While it's compacting it won't talk to my test client.
That it starts compacting is expected if the data flushed as a result of the commit log replay triggers compactions. However, compaction does not imply that the node refuses to talk to clients. Are you sure the node has fully started? It should log when it starts up the Thrift interface - check system.log.

--
/ Peter Schuller
Re: Compressing data types
Also https://issues.apache.org/jira/browse/HADOOP-7206 which is now part of Brisk: http://www.datastax.com/dev/blog/brisk-1-0-beta-2-released

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 22 Jun 2011, at 04:04, Vijay wrote:

You might want to watch https://issues.apache.org/jira/browse/CASSANDRA-47

Regards, /VJ

On Tue, Jun 21, 2011 at 5:14 AM, Timo Nentwig timo.nent...@toptarif.de wrote:

Hi! Just wondering why this doesn't already exist: wouldn't it make sense to have decorating data types that compress (gzip, snappy) other data types (esp. UTF8Type, AsciiType) transparently?

-tcn
Re: Storing files in blob into Cassandra
If the Cassandra JVM is down, Tomcat and Httpd will continue to handle requests. And Pelops will redirect these requests to another Cassandra node on another server (maybe I am wrong with this assertion).

I was thinking of the server being turned off / broken / rebooting / disconnected from the network / taken out of rotation for maintenance. There are lots of reasons for a server to not be doing what it should be.

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 22 Jun 2011, at 23:10, Damien Picard wrote:

2011/6/22 aaron morton aa...@thelastpickle.com

I think I have to detail my configuration. On every server of my cluster, I deploy:
- a Cassandra node
- a Tomcat instance
- the webapp, deployed on Tomcat
- Apache httpd, in front of Tomcat with mod_jakarta

You will have a bunch of services on the machine competing with each other for resources (CPU, memory and network IO). It's not an approach I would take. You will also tightly couple the front-end HTTP capacity to the DB capacity, e.g. consider what happens when a Cassandra node is down for a while; what does this mean for your ability to accept HTTP connections?

If the Cassandra JVM is down, Tomcat and Httpd will continue to handle requests. And Pelops will redirect these requests to another Cassandra node on another server (maybe I am wrong with this assertion).

Requests from your web app may go to the local Cassandra node, but that's just the coordinator. They will be forwarded on to the replicas that contain the data.

Yes, but as you noticed before, this node can be down, so I will configure Pelops to redistribute requests to another node. So there is no strong coupling between Cassandra and Tomcat; it will work as if they were on different servers.

Data are stored with RandomPartitioner, replication factor is 2.

RF 3 is the minimum RF you need for QUORUM to be less than the RF.
Thank you for this advice; I will reconsider the RF, but for now I use only CL.ONE, not QUORUM. That could change in the near future.

In such a case, do you advise me to store files in Cassandra?

Depends on your scale, workload and performance requirements. I would do some tests on how much data you expect to hold and what sort of workloads you need to support. Personally I think files are best kept in a file system, until a compelling reason is found to do otherwise.

Thank you. I think that distributing files across the cluster without something like a distributed file system is a compelling reason to store files in Cassandra. I don't want to add another complex component to my architecture.

Hope that helps.

It does! A lot! Thank you.

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 22 Jun 2011, at 20:23, Damien Picard wrote:

Store your images / documents / etc. somewhere and reference them in Cassandra. That's the consensus that's been bandied about on this list quite frequently.

Thank you for your answers. I think I have to detail my configuration. On every server of my cluster, I deploy:
- a Cassandra node
- a Tomcat instance
- the webapp, deployed on Tomcat
- Apache httpd, in front of Tomcat with mod_jakarta

In front of these, I use a round-robin DNS load balancer which balances requests over every httpd. Every Tomcat instance can access every Cassandra node, allowing them to deal with every request. Data are stored with RandomPartitioner, replication factor is 2.

In my case, it would be very easy to store images in Cassandra because these images will be accessible everywhere in my cluster. If I store images in the file system, I have to replicate them manually (probably with a distributed file system) on every server (quite complicated). This is why I prefer to store files in Cassandra. According to Sylvain, the main thing to know is the max size of a file.
Since this is for the web, I can cap the max file size at 10 MB (the HTTP POST max size) without disappointing my users. Furthermore, most of these files will not exceed 2 or 3 MB. In such a case, do you advise me to store files in Cassandra? Thank you. 2011/6/22 Sylvain Lebresne sylv...@datastax.com Let's be more precise in saying that this all depends on the expected size of the documents. If you know that the documents will be in the few-hundred-kilobyte range on average and no more than a few megabytes (say 5MB, even though there is no magic number), then storing them as blobs will work perfectly fine (which is not to say storing them externally with metadata in Cassandra won't, but using blobs can be simpler in some cases). I've very successfully stored tons of images as blobs in Cassandra. I just knew they couldn't get super big because the system wasn't allowing it. The point with the size being that each time
Re: Secondary indexes performance
it will probably be better to denormalize and store some precomputed data Yes, if you know there are queries you need to serve it is better to support those directly in the data model. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jun 2011, at 23:52, Wojciech Pietrzok wrote: OK, got some results (below). 2 nodes, one on localhost, second on LAN, reading with ConsistencyLevel.ONE, buffer_size=512 rows (that's how many rows pycassa will get on one connection, then it will use the last row_id as the start row for the next query) Query types: 1) get_range - just added limit of 1024 rows 2) get_indexed_slices ASCII - one term: on indexed column with ASCII type 3) get_indexed_slices INT - one term: on indexed column with INT type 4) get_indexed_slices ASCII + GTE, LTE on indexed INT - three terms: on indexed column with INT type + LTE, GTE on indexed column with INT type 5) get_indexed_slices 2 terms, ASCII - two terms, both columns indexed, with ASCII type 6) get_indexed_slices ASCII + GTE, LTE on non indexed INT - like 4) but LTE, GTE are on a non-indexed column 3 runs for each set of queries; on successive runs times were better. 
Times are in seconds. But if you say that 1024 rows is a reasonably big slice (not mentioning over 10k rows) it will probably be better to denormalize and store some precomputed data

Results:

# Run 1
PERF: [a] get_range: 0.58[s]
PERF: [a] get_indexed_slices ASCII: 3.96[s]
PERF: [a] get_indexed_slices INT: 1.82[s]
PERF: [a] get_indexed_slices INT + GTE, LTE on indexed INT: 1.31[s] # 314 returned
PERF: [cr] get_indexed_slices ASCII: 1.13[s]
PERF: [cr] get_indexed_slices 2 terms, ASCII: 8.69[s]

# Run 2, same queries
PERF: [a] get_range: 0.33[s]
PERF: [a] get_indexed_slices ASCII: 0.36[s]
PERF: [a] get_indexed_slices INT: 5.39[s]
PERF: [a] get_indexed_slices INT + GTE, LTE on indexed INT: 5.42[s] # 314 returned
PERF: [cr] get_indexed_slices ASCII: 0.55[s]
PERF: [cr] get_indexed_slices 2 terms, ASCII: 3.57[s]

# Run 3, same queries
PERF: [a] get_range: 0.18[s]
PERF: [a] get_indexed_slices ASCII: 0.39[s]
PERF: [a] get_indexed_slices INT: 0.83[s]
PERF: [a] get_indexed_slices INT + GTE, LTE on indexed INT: 0.85[s] # 314 returned
PERF: [cr] get_indexed_slices ASCII: 0.39[s]
PERF: [cr] get_indexed_slices 2 terms, ASCII: 3.36[s]

# changed some terms, so 1024 rows are always returned
# Run 1
PERF: [a] get_range: 0.31[s]
PERF: [a] get_indexed_slices ASCII: 3.14[s]
PERF: [a] get_indexed_slices INT: 0.70[s]
PERF: [a] get_indexed_slices INT + GTE, LTE on indexed INT: 4.72[s]
PERF: [cr] get_indexed_slices ASCII: 0.73[s]
PERF: [cr] get_indexed_slices 2 terms, ASCII: 0.85[s]
PERF: [cr] get_indexed_slices ASCII + GTE, LTE on non indexed INT: 2.17[s]

# Run 2, same queries
PERF: [a] get_range: 0.20[s]
PERF: [a] get_indexed_slices ASCII: 0.60[s]
PERF: [a] get_indexed_slices INT: 1.22[s]
PERF: [a] get_indexed_slices INT + GTE, LTE on indexed INT: 1.27[s]
PERF: [cr] get_indexed_slices ASCII: 0.48[s]
PERF: [cr] get_indexed_slices 2 terms, ASCII: 0.50[s]
PERF: [cr] get_indexed_slices ASCII + GTE, LTE on non indexed INT: 2.22[s]

# Run 3, same queries
PERF: [a] get_range: 0.25[s]
PERF: [a] get_indexed_slices ASCII: 0.44[s]
PERF: [a] get_indexed_slices INT: 0.89[s]
PERF: [a] get_indexed_slices INT + GTE, LTE on indexed INT: 6.58[s]
PERF: [cr] get_indexed_slices ASCII: 1.18[s]
PERF: [cr] get_indexed_slices 2 terms, ASCII: 0.50[s]
PERF: [cr] get_indexed_slices ASCII + GTE, LTE on non indexed INT: 2.09[s]

2011/6/21 aaron morton aa...@thelastpickle.com: Can you provide some more information on the query you are running? How many terms are you selecting with? How long does it take to return 1024 rows? IMHO that's a reasonably big slice to get. The server will pick the most selective equality predicate, and then filter the results from that using the other predicates. Cheers -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- KosciaK mail: kosci...@gmail.com www : http://kosciak.net/ -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Re: insufficient space to compact even the two smallest files, aborting
Setting them to 2 and 2 means compaction can only ever compact 2 files at a time, so it will be worse off. Let's try the following:
- restore the compaction settings to the defaults, 4 and 32
- run `ls -lah` in the data dir and grab the output
- run `nodetool flush`; this will trigger minor compaction once the memtables have been flushed
- check the logs for messages from 'CompactionManager'
- when done, grab the output from `ls -lah` again.
Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 23 Jun 2011, at 02:04, Héctor Izquierdo Seliva wrote: Hi All. I set the compaction threshold at minimum 2, maximum 2 and tried to run compact, but it's not doing anything. There are over 69 sstables now, read performance is horrible, and it's taking an insane amount of space. Maybe I don't quite get how the new per-bucket stuff works, but I think this is not normal behaviour. On Mon, 13-06-2011 at 10:32 -0500, Jonathan Ellis wrote: As Terje already said in this thread, the threshold is per bucket (group of similarly sized sstables), not per CF. 2011/6/13 Héctor Izquierdo Seliva izquie...@strands.com: I was already way over the minimum. There were 12 sstables. Also, is there any reason why scrub got stuck? I did not see anything in the logs. Via JMX I saw that the scrubbed bytes were equal to one of the sstables' size, and it stuck there for a couple of hours. On Mon, 13-06-2011 at 22:55 +0900, Terje Marthinussen wrote: That most likely happened just because after scrub you had new files and got over the 4 file minimum limit. https://issues.apache.org/jira/browse/CASSANDRA-2697 is the bug report.
Re: Strange Connection error of nodetool
Check the list here http://wiki.apache.org/cassandra/JmxGotchas I *think* the jmx server tells the client to connect back on another host/port. Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jun 2011, at 21:02, 박상길 wrote: Hi. I'm running 5 cassandra nodes. Say the addresses are 112.234.123.111 ~ 112.234.123.115; the real addresses are different. When I run nodetool, the one node at address 112.234.123.112 fails to connect, showing an error message like this. iPark:~ hayarobi$ nodetool --host 112.234.123.112 ring Error connection to remote JMX agent! java.rmi.ConnectException: Connection refused to host: 122.234.123.112; nested exception is: The host address to connect to differs! I queried 112.* but nodetool tried to connect to 122.*. It happens on just this one machine; all the other machines work fine. And I can connect to 112.234.123.112 with cassandra-cli or other tools using other ports (such as 22 for ssh, 80 for http). It has trouble only with nodetool. Does anyone have an idea? I'll paste the full stack trace below. 
iPark:~ hayarobi$ nodetool --host 112.234.123.111 ring
Address          Status State   Load       Owns    Token
                                                   136112946768375
112.234.123.111  Up     Normal  725.01 KB  20.00%  0
112.234.123.112  Up     Normal  725.93 KB  20.00%  340282366920938000
112.234.123.113  Up     Normal  728.2 KB   20.00%  680564733841877000
112.234.123.114  Up     Normal  713.1 KB   20.00%  102084710076282
112.234.123.115  Up     Normal  722.67 KB  20.00%  136112946768375

iPark:~ hayarobi$ nodetool --host 112.234.123.115 ring
Address          Status State   Load       Owns    Token
                                                   136112946768375
112.234.123.111  Up     Normal  725.01 KB  20.00%  0
112.234.123.112  Up     Normal  725.93 KB  20.00%  340282366920938000
112.234.123.113  Up     Normal  728.2 KB   20.00%  680564733841877000
112.234.123.114  Up     Normal  713.1 KB   20.00%  102084710076282
112.234.123.115  Up     Normal  722.67 KB  20.00%  136112946768375

iPark:~ hayarobi$ nodetool --host 112.234.123.114 ring
Address          Status State   Load       Owns    Token
                                                   136112946768375
112.234.123.111  Up     Normal  725.01 KB  20.00%  0
112.234.123.112  Up     Normal  725.93 KB  20.00%  340282366920938000
112.234.123.113  Up     Normal  728.2 KB   20.00%  680564733841877000
112.234.123.114  Up     Normal  713.1 KB   20.00%  102084710076282
112.234.123.115  Up     Normal  722.67 KB  20.00%  136112946768375

iPark:~ hayarobi$ nodetool --host 112.234.123.112 ring
Error connection to remote JMX agent! 
java.rmi.ConnectException: Connection refused to host: 122.234.123.112; nested exception is:
    java.net.ConnectException: Connection refused
    at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:601)
    at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:198)
    at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)
    at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:110)
    at javax.management.remote.rmi.RMIServerImpl_Stub.newClient(Unknown Source)
    at javax.management.remote.rmi.RMIConnector.getConnection(RMIConnector.java:2327)
    at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:279)
    at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
    at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:137)
    at org.apache.cassandra.tools.NodeProbe.init(NodeProbe.java:107)
    at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:511)
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200
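For reference, a common fix for this gotcha (my assumption based on the JmxGotchas behaviour described above; the thread itself does not confirm the cause on this node) is to pin the address the RMI stub advertises, in conf/cassandra-env.sh on the affected node:

```shell
# conf/cassandra-env.sh on the affected node (address here is the one from
# the thread, used as a hypothetical example): make the JMX/RMI stub
# advertise the address clients actually connect to, instead of whatever
# the node resolves its own hostname to.
JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=112.234.123.112"
```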
Re: unsubscribe
http://wiki.apache.org/cassandra/FAQ#unsubscribe - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 23 Jun 2011, at 06:02, Carey Hollenbeck wrote: unsubscribe From: William Oberman [mailto:ober...@civicscience.com] Sent: Wednesday, June 22, 2011 1:46 PM To: user@cassandra.apache.org Subject: Re: rpm from 0.7.x - 0.8? Thanks Jonathan. I'm sure it's been true for everyone else as well, but the rolling upgrade seems to have worked like a charm for me (other than the initial confusion over the JMX port # changing). One minor thing that is probably particular to my case: when I removed the old package, it unlinked my symlink /var/lib/cassandra/data (rather than edit the cassandra config, I symlinked my amazon disk to where cassandra expected it). At first I thought I had lost all of my data, but after restoring the link, everything was happy. will On Wed, Jun 22, 2011 at 12:34 PM, Jonathan Ellis jbel...@gmail.com wrote: Doesn't matter. auto_bootstrap only applies to the first start ever. On Wed, Jun 22, 2011 at 10:48 AM, William Oberman ober...@civicscience.com wrote: I have a question about auto_bootstrap. When I originally brought up the cluster, I did: - seed with auto_boot = false - 1,2,3 with auto_boot = true Now that I'm doing a rolling upgrade, do I set them all to auto_boot = true? Or does the seed stay false? Or should I mark them all false? I have manually set tokens on all of them. The doc confused me: Set to 'true' to make new [non-seed] nodes automatically migrate the right data to themselves. (If no InitialToken is specified, they will pick one such that they will get half the range of the most-loaded node.) If a node starts up without bootstrapping, it will mark itself bootstrapped so that you can't subsequently accidentally bootstrap a node with data on it. (You can reset this by wiping your data and commitlog directories.) Default is: 'false', so that new clusters don't bootstrap immediately. 
You should turn this on when you start adding new nodes to a cluster that already has data on it. I'm not adding new nodes, but the cluster does have data on it... will On Wed, Jun 22, 2011 at 11:39 AM, William Oberman ober...@civicscience.com wrote: I just did a remove then install, and it seems to work. For those of you out there with JMX issues, the default port moved from 8080 to 7199 (which includes the internal default to nodetool). I was confused why nodetool ring would fail on some boxes and not others. I had to add -p depending on the version of nodetool will On Wed, Jun 22, 2011 at 10:15 AM, William Oberman ober...@civicscience.com wrote: I'm running 0.7.4 from rpm (riptano). If I do a yum upgrade, it's trying to do 0.7.6. To get 0.8.x I have to do install apache-cassandra08. But that is going to install two copies. Is there a semi-official way of properly upgrading to 0.8 via rpm? -- Will Oberman Civic Science, Inc. 3030 Penn Avenue, First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Atomicity Strategies
Atomic on a single machine, yes. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 23 Jun 2011, at 09:42, AJ wrote: On 4/9/2011 7:52 PM, aaron morton wrote: My understanding of what they did with locking (based on the examples) was to achieve a level of transaction isolation http://en.wikipedia.org/wiki/Isolation_(database_systems) I think the issue here is more about atomicity http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic We cannot guarantee that all or none of the mutations in your batch are completed. There is some work in this area though https://issues.apache.org/jira/browse/CASSANDRA-1684 Just to be clear, you are speaking in the general sense, right? The batch mutate link you provide says that in the case that ALL the mutations of the batch are for the SAME key (row), the whole batch is atomic: As a special case, mutations against a single key are atomic but not isolated. So, is it true that if I want to update multiple columns for one key, then it will be an all-or-nothing update for the whole batch if using batch update? But if your batch mutate contains mutations for more than one key, then all the updates for one key will be atomic, followed by all the updates for the next key being atomic, and so on. Correct? Thanks!
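AJ's reading matches the FAQ: the unit of atomicity is the row-key entry in the batch. A toy sketch of the payload shape (hypothetical keys and column family name, values simplified to strings; not real thrift bindings):

```python
# Sketch of the thrift batch_mutate payload shape:
# {row_key: {column_family: [mutations]}}. Cassandra applies each row key's
# entry atomically (all or nothing), but gives no guarantee across row keys
# in the same batch.
mutation_map = {
    "user:1": {"Users": ["name=alice", "email=alice@example.com"]},  # one atomic unit
    "user:2": {"Users": ["name=bob"]},                               # a separate atomic unit
}

# The number of independent all-or-nothing units equals the number of row keys.
atomic_units = sorted(mutation_map)
assert atomic_units == ["user:1", "user:2"]
```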
Re: Backup/Restore: Coordinating Cassandra Nodetool Snapshots with Amazon EBS Snapshots?
1. Is it feasible to run directly against a Cassandra data directory restored from an EBS snapshot? (as opposed to nodetool snapshots restored from an EBS snapshot). I don't have experience with EBS snapshots, but I've never been a fan of OS-level snapshots that are not coordinated with the DB layer. 2. Noting the wiki's consistent Cassandra backups advice: if I schedule nodetool snapshots across the cluster, should the relative age of the 'sibling' snapshots be a concern? How far apart can they be before it's a problem? (seconds? minutes? hours?) Consider the snapshot to be from the time of the first one. Previous discussion on AWS backup http://www.mail-archive.com/user@cassandra.apache.org/msg12831.html Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 23 Jun 2011, at 10:48, Thoku Hansen wrote: I have a couple of questions regarding the coordination of Cassandra nodetool snapshots with Amazon EBS snapshots as part of a Cassandra backup/restore strategy. Background: I have a cluster running in EC2. Its nodes are configured like so: * Instance type: m1.xlarge * Cassandra commit log writing to RAID-0 ephemeral storage * Cassandra data writing to an EBS volume. Note: there is a lot of conflicting information/advice about using Cassandra in EC2 w.r.t. ephemeral vs. EBS. The above configuration seems to work well for my application. I only described this to provide context for my EBS snapshotting question. With respect, I hope not to debate Cassandra performance for ephemeral vs. EBS in this thread! I am setting up a process that performs regular EBS (-S3) snapshots for the purpose of backing up Cassandra plus other data. I presume this will need to be coordinated with regular Cassandra (nodetool) snapshots also. My questions: 1. Is it feasible to run directly against a Cassandra data directory restored from an EBS snapshot? (as opposed to nodetool snapshots restored from an EBS snapshot). 2. 
Noting the wiki's consistent Cassandra backups advice: if I schedule nodetool snapshots across the cluster, should the relative age of the 'sibling' snapshots be a concern? How far apart can they be before it's a problem? (seconds? minutes? hours?) My motivation for these two questions: I'm trying to figure out how much effort needs to be put into: * Time-coordinated scheduling of nodetool snapshots across the cluster * Automation of the process of determining the most appropriate set of nodetool snapshots to use when restoring a cluster. Thanks!
Re: Decorator Algorithm
Various places in the code call IPartitioner.decorateKey(), which returns a DecoratedKey<T> that contains both the original key and the Token<T>. The RandomPartitioner uses MD5 to hash the key ByteBuffer and create a BigInteger; the OPP converts the key into a UTF-8 encoded String. Using the token to find which endpoints contain replicas is done by the AbstractReplicationStrategy.calculateNaturalEndpoints() implementations. Does that help? - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 23 Jun 2011, at 19:58, Jonathan Colby wrote: Hi - I'd like to understand more how the token is hashed with the key to determine on which node the data is stored - called decorating in cassandra speak. Can anyone share any documentation on this or describe this more in detail? Yes, I could look at the code, but I was hoping to be able to read more about how it works first. thanks.
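The RandomPartitioner path can be sketched in Python (a rough model of the behaviour, not the actual Java implementation): MD5 the raw key bytes, interpret the digest as a signed big integer, and take the absolute value to get the token.

```python
import hashlib

def random_partitioner_token(key: bytes) -> int:
    # Rough model of RandomPartitioner: token = abs(new BigInteger(md5(key))),
    # i.e. the 16-byte MD5 digest read as a signed integer, absolute value
    # taken, giving a token in [0, 2**127].
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))

token = random_partitioner_token(b"some_row_key")  # hypothetical key
assert 0 <= token <= 2**127
```

Replica placement then walks the ring from this token, which is what calculateNaturalEndpoints() does on top of the partitioner.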
Re: insufficient space to compact even the two smallest files, aborting
Missed that in the history, cheers. A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 23 Jun 2011, at 20:26, Sylvain Lebresne wrote: As Jonathan said earlier, you are hitting https://issues.apache.org/jira/browse/CASSANDRA-2765 This will be fixed in 0.8.1, which is currently under a vote and should be released soon (let's say beginning of next week, maybe sooner). -- Sylvain 2011/6/23 Héctor Izquierdo Seliva izquie...@strands.com: Hi Aaron. Reverted back to 4-32. Did the flush but it did not trigger any minor compaction. Ran compact by hand, and it picked only two sstables. Here's the ls before: http://pastebin.com/xDtvVZvA And this is the ls after: http://pastebin.com/DcpbGvK6 Any suggestions? On Thu, 23-06-2011 at 10:55 +1200, aaron morton wrote: Setting them to 2 and 2 means compaction can only ever compact 2 files at a time, so it will be worse off. Let's try the following: - restore the compaction settings to the defaults, 4 and 32 - run `ls -lah` in the data dir and grab the output - run `nodetool flush`; this will trigger minor compaction once the memtables have been flushed - check the logs for messages from 'CompactionManager' - when done, grab the output from `ls -lah` again. Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 23 Jun 2011, at 02:04, Héctor Izquierdo Seliva wrote: Hi All. I set the compaction threshold at minimum 2, maximum 2 and tried to run compact, but it's not doing anything. There are over 69 sstables now, read performance is horrible, and it's taking an insane amount of space. Maybe I don't quite get how the new per-bucket stuff works, but I think this is not normal behaviour. On Mon, 13-06-2011 at 10:32 -0500, Jonathan Ellis wrote: As Terje already said in this thread, the threshold is per bucket (group of similarly sized sstables), not per CF. 2011/6/13 Héctor Izquierdo Seliva izquie...@strands.com: I was already way over the minimum. There were 12 sstables. Also, is there any reason why scrub got stuck? I did not see anything in the logs. Via JMX I saw that the scrubbed bytes were equal to one of the sstables' size, and it stuck there for a couple of hours. On Mon, 13-06-2011 at 22:55 +0900, Terje Marthinussen wrote: That most likely happened just because after scrub you had new files and got over the 4 file minimum limit. https://issues.apache.org/jira/browse/CASSANDRA-2697 is the bug report.
Re: get_range_slices result
Not sure what your question is. Does this help? http://wiki.apache.org/cassandra/FAQ#range_rp Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 23 Jun 2011, at 21:59, karim abbouh wrote: how can the get_range_slices() function return keys in sorted order? BR
Re: RAID or no RAID
RAID0, so you have one big volume. For performance (cassandra does not stripe sstables across the data dirs), and because otherwise you'll have fragmentation and won't be able to utilise all your space. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 28 Jun 2011, at 11:46, mcasandra wrote: Which one is preferred: RAID0 or spreading data files across various disks on the same node? I like RAID0, but what would be the most convincing argument to put an additional RAID controller card in the machine? -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/RAID-or-no-RAID-tp6522904p6522904.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Clock skew
Without exception the timestamp is set by the client, not the server. The one exception to the without-exception rule is CounterColumnType operations. If you are in a situation where you need better timing than you can get with ntp, you should try to design around it. Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 28 Jun 2011, at 10:03, A J wrote: During writes, the timestamp field in the column is the system time of that node (correct me if that is not the case and the system time of the coordinator is what gets applied to all the replicas). During reads, the latest write wins. What if there is clock skew? It could lead to a stale write overriding the actual latest write, just because the clock of that node is ahead of the other node's. Right?
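Most client libraries derive the column timestamp as microseconds since the epoch on the client machine, which is exactly why skew between client clocks matters. A sketch of the convention (assuming pycassa-style microsecond timestamps):

```python
import time

def column_timestamp() -> int:
    # Conventional Cassandra column timestamp: microseconds since the Unix
    # epoch, read from the *client's* clock. Two clients with skewed clocks
    # can therefore disagree about which write is "latest" -- the scenario
    # described in the thread above.
    return int(time.time() * 1_000_000)

ts = column_timestamp()
assert ts > 10**15  # sanity check: microsecond scale, not millis or seconds
```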
Re: RAID or no RAID
Not sure what the intended purpose is, but we've mostly used it as an emergency disk-capacity-increase option That's what I've used it for. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 28 Jun 2011, at 15:55, Dan Kuebrich wrote: Not sure what the intended purpose is, but we've mostly used it as an emergency disk-capacity-increase option. It's not as good as RAID because each disk size is counted individually (a compacted sstable can only be on one disk), so compaction size limits aren't expanded as one might expect. On Mon, Jun 27, 2011 at 11:30 PM, mcasandra mohitanch...@gmail.com wrote: I thought there is an option to give multiple data dirs in cassandra.yaml. What's the purpose of that? -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/RAID-or-no-RAID-tp6522904p6523523.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: remove all the columns of a key in a column family
That error is thrown if you send a Deletion with a predicate that has neither columns nor a SliceRange. Send a Deletion that does not have a predicate. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 28 Jun 2011, at 18:11, Donna Li wrote: To delete all the columns for a row, send a Mutation where the Deletion has neither a super_column nor a predicate I tested, but it throws the exception "A SlicePredicate must be given a list of Columns, a SliceRange, or both" Best Regards Donna li From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 28 Jun 2011, 12:30 To: user@cassandra.apache.org Subject: Re: remove all the columns of a key in a column family AFAIK that is still not supported. To delete all the columns for a row, send a Mutation where the Deletion has neither a super_column nor a predicate Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 28 Jun 2011, at 15:50, Donna Li wrote: Cassandra version is 0.7.2. When I use batch_mutate, the following exception is thrown: "TException: Deletion does not yet support SliceRange predicates". Which version supports deleting the whole row for a key? Best Regards Donna li From: Donna Li Sent: 28 Jun 2011, 10:59 To: user@cassandra.apache.org Subject: remove all the columns of a key in a column family All: Can I remove all the columns of a key in a column family under the condition that I do not know what columns the column family has? Best Regards Donna li
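The rule Aaron describes can be sketched as validation logic (a toy model of the server-side check, using plain Python stand-ins rather than the real thrift types):

```python
def validate_deletion(predicate):
    # A Deletion with no SlicePredicate at all deletes the entire row.
    if predicate is None:
        return "delete entire row"
    # A SlicePredicate that *is* present must name columns or give a SliceRange.
    if not predicate.get("column_names") and not predicate.get("slice_range"):
        raise ValueError(
            "A SlicePredicate must be given a list of Columns, a SliceRange, or both")
    return "delete matching columns"

assert validate_deletion(None) == "delete entire row"
assert validate_deletion({"column_names": ["a"]}) == "delete matching columns"
try:
    validate_deletion({})  # neither columns nor a range -> the error in the thread
    assert False, "expected ValueError"
except ValueError:
    pass
```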
Re: Truncate introspection
Drop CF takes a snapshot of the CF first, and then marks SSTables on disk as compacted so they will be safely deleted later. Finally it removes the CF from the meta data. If you see the SSTables on disk, you should see 0 length .compacted files for every one of them. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 28 Jun 2011, at 20:00, David Boxenhorn wrote: Does drop work in a similar way? When I drop a CF and add it back with a different schema, it seems to work. But I notice that in between the drop and adding it back, when the CLI tells me the CF doesn't exist, the old data is still there. I've been assuming that this works, but just wanted to make sure... On Tue, Jun 28, 2011 at 12:56 AM, Jonathan Ellis jbel...@gmail.com wrote: Each node (independently) has logic that guarantees that any writes processed before the truncate, will be wiped out. This does not mean that each node will wipe out the same data, or even that each node will process the truncate (which would result in a timedoutexception). It also does not mean you can't have writes immediately after the truncate that would race w/ a truncate, check for zero sstables procedure. On Mon, Jun 27, 2011 at 3:35 PM, Ethan Rowe et...@the-rowes.com wrote: If those went to zero, it would certainly tell me something happened. :) I guess watching that would be a way of seeing something was going on. Is the truncate itself propagating a ring-wide marker or anything so the CF is logically empty before being physically removed? That's the impression I got from the docs but it wasn't totally clear to me. On Mon, Jun 27, 2011 at 3:33 PM, Jonathan Ellis jbel...@gmail.com wrote: There's a JMX method to get the number of sstables in a CF, is that what you're looking for? On Mon, Jun 27, 2011 at 1:04 PM, Ethan Rowe et...@the-rowes.com wrote: Is there any straightforward means of seeing what's going on after issuing a truncate (on 0.7.5)? 
I'm not seeing evidence that anything actually happened. I've disabled read repair on the column family in question and don't have anything actively reading/writing at present, apart from my one-off tests to see if rows have disappeared. Thanks in advance. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Re : Re : get_range_slices result
First thing is you really should upgrade from 0.6; the current release is 0.8. Info on time UUIDs: http://wiki.apache.org/cassandra/FAQ#working_with_timeuuid_in_java If you are using a higher-level client like Hector or Pelops it will take care of the encoding for you. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 28 Jun 2011, at 22:20, karim abbouh wrote: Can I have an example of using TimeUUIDType as a comparator in Java client code? From: karim abbouh karim_...@yahoo.fr To: user@cassandra.apache.org Sent: Monday, 27 June 2011, 17:59 Subject: Re: Re: get_range_slices result I used TimeUUIDType as the type in the storage-conf.xml file <ColumnFamily Name="table" CompareWith="TimeUUIDType" /> and I used it as the comparator in my Java code, but at execution I get an exception: Erreur -- java.io.UnsupportedEncodingException: TimeUUIDType How can I write it? BR From: David Boxenhorn da...@citypath.com To: user@cassandra.apache.org Cc: karim abbouh karim_...@yahoo.fr Sent: Friday, 24 June 2011, 11:25 Subject: Re: Re: get_range_slices result You can get the best of both worlds by repeating the key in a column, and creating a secondary index on that column. On Fri, Jun 24, 2011 at 1:16 PM, Sylvain Lebresne sylv...@datastax.com wrote: On Fri, Jun 24, 2011 at 10:21 AM, karim abbouh karim_...@yahoo.fr wrote: I want the get_range_slices() function to return records sorted (ordered) by the key (rowId) used during insertion. Is it possible? You will have to use the OrderPreservingPartitioner. This is not without inconvenience, however. 
See for instance http://wiki.apache.org/cassandra/StorageConfiguration#line-100 or http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/ which give more details on the pros and cons (the short version being that the main advantage of OrderPreservingPartitioner is what you're asking for, but its main drawback is that load-balancing the cluster will likely be very, very hard). In general the advice is to stick with RandomPartitioner and design a data model that avoids needing range slices (or at least needing the result to be sorted). This is very often not too hard, more efficient, and much simpler than dealing with the load balancing problems of OrderPreservingPartitioner. -- Sylvain From: aaron morton aa...@thelastpickle.com To: user@cassandra.apache.org Sent: Thursday, 23 June 2011, 20:30 Subject: Re: get_range_slices result Not sure what your question is. Does this help? http://wiki.apache.org/cassandra/FAQ#range_rp Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 23 Jun 2011, at 21:59, karim abbouh wrote: how can the get_range_slices() function return keys in sorted order? BR
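Since the question asks for a client-side example: a TimeUUIDType comparator just needs version-1 (time-based) UUIDs as column names, which Python's standard uuid module generates (a sketch; a Java client would typically use a time-UUID helper as described in the FAQ link above):

```python
import uuid

# TimeUUIDType compares column names by the timestamp embedded in a
# version-1 UUID, so columns created later sort after columns created
# earlier. uuid.uuid1() produces exactly such UUIDs; a client library like
# pycassa would handle the on-the-wire encoding.
first = uuid.uuid1()
second = uuid.uuid1()

assert first.version == 1 and second.version == 1
# .time is the 60-bit timestamp field (100-ns intervals); CPython guarantees
# it never decreases across calls in one process.
assert second.time >= first.time
```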
Re: Query indexed column with key filter
Currently these are two different types of query: using a key range is equivalent to the get_range_slices() API function, and column clauses are a get_indexed_slices() call. So you would be asking for a potentially painful join between the two. Creating a column with the same value as the key sounds reasonable. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 29 Jun 2011, at 05:31, Daning wrote: I found this code // Start and finish keys, *and* column relations (KEY > foo AND KEY < bar and name1 = value1). if (select.isKeyRange() && (select.getKeyFinish() != null) && (select.getColumnRelations().size() > 0)) throw new InvalidRequestException("You cannot combine key range and by-column clauses in a SELECT"); in http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/cql/QueryProcessor.java This operation is exactly what I want - query by column then filter by key. I want to know why this query is not supported, and what's a good workaround for it? At this moment my workaround is to create a column which is exactly the same as the key. Thanks, Daning
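The key-duplication workaround can be sketched with a toy in-memory model (hypothetical column names and data; a real client would put a secondary index on the duplicated column and call get_indexed_slices with EQ expressions):

```python
# Toy model: each row duplicates its key into a "key" column, so a
# secondary-index query can stand in for the unsupported
# "indexed column AND key filter" combination.
rows = {
    "k1": {"key": "k1", "status": "active"},
    "k2": {"key": "k2", "status": "active"},
    "k3": {"key": "k3", "status": "idle"},
}

def indexed_query(rows, clauses):
    # All clauses are equality expressions, like get_indexed_slices with EQ.
    return sorted(k for k, cols in rows.items()
                  if all(cols.get(c) == v for c, v in clauses.items()))

# "status = active AND key = k2", expressed purely as column clauses:
assert indexed_query(rows, {"status": "active", "key": "k2"}) == ["k2"]
```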
Re: Server-side CQL parameters substitution
see https://issues.apache.org/jira/browse/CASSANDRA-2475 - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 29 Jun 2011, at 08:45, Michal Augustýn wrote: Hi all, in most SQL implementations it's possible to declare parameters in the SQL command text (i.e. SELECT * FROM T WHERE Id=@myId). The client application then sends this SQL command and the parameter values separately - the server is responsible for the parameter substitution. In the CQL API (~the execute_cql_query method), we must compose the command (~substitute the parameters) in the client application, and the same code must be re-implemented in all drivers (Java, Python, Node.js, .NET, ...). And that's IMHO tedious and error prone. So do you/we plan to improve the CQL API in this way? Thanks! Augi P.S.: Yes, I'm working on a .NET driver and I'm too lazy to implement client-side parameter substitution ;-)
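A minimal sketch of the client-side parameter substitution Michal says every driver has to reimplement (server-side prepared statements are what CASSANDRA-2475 tracks). The escaping here handles only single quotes; a real driver must cover every type and encoding the server accepts:

```python
# Toy client-side substitution: quote string values, double any embedded
# single quote, and leave numbers bare. Illustrative only.

def substitute(cql, params):
    def quote(value):
        if isinstance(value, (int, float)):
            return str(value)
        # double any embedded single quote, then wrap
        return "'" + str(value).replace("'", "''") + "'"
    return cql % {name: quote(value) for name, value in params.items()}

query = substitute("SELECT * FROM T WHERE Id=%(myId)s", {"myId": "o'brien"})
print(query)   # SELECT * FROM T WHERE Id='o''brien'
```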
Re: custom reconciling columns?
Can you provide some more info: - how big are the rows, e.g. number of columns and column size? - how much data are you asking for? - what sort of read query are you using? - what sort of numbers are you seeing? - are you deleting columns or using TTL? I would consider issues with the data churn, data model and query before looking at serialisation. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 29 Jun 2011, at 10:37, Yang wrote: I can see that as my user history grows, the read time grows proportionally (or faster than linearly). If my business requirements ask me to keep a month's history for each user, it could become too slow. I was suspecting that it's actually the serializing and deserializing that's taking the time (I can definitely see it's CPU bound). On Tue, Jun 28, 2011 at 3:04 PM, aaron morton aa...@thelastpickle.com wrote: There is no facility to do custom reconciliation for a column. An append style operation would run into many of the same problems as the Counter type, e.g. not every node may get an append and there is a chance for lost appends unless you go to all the trouble Counters do. I would go with using a row for the user and columns for each item. Then you can have fast no-look writes. What problems are you seeing with the reads? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 29 Jun 2011, at 04:20, Yang wrote: for example, if I have an application that needs to read off a user browsing history, and I model the user ID as the key, and the history data within the row.
with current approach, I could model each visit as a column, the possible issue is that *possibly* (I'm still doing a lot of profiling on this to verify) that a lot of time is spent on serialization into the message and out of the message, plus I do not need the full features provided by the column : for example I do not need a timestamp on each visit, etc, so it might be faster to put the entire history in a blob, and each visit only takes up a few bytes in the blob, and my code manipulates the blob. problem is, I still need to avoid the read-before-write, so I send only the latest visit, and let cassandra do the reconcile, which appends the visit to the blob, so this needs custom reconcile behavior. is there a way to incorporate such custom reconcile under current code framework? (I see custom sorting, but no custom reconcile) thanks yang
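The reason Yang's append needs custom reconciliation can be modelled in a few lines of plain Python. Cassandra's standard column reconcile is last-write-wins on timestamp, so two independent "append one visit" writes to the same blob column cannot be merged server-side: one simply replaces the other. The `reconcile_append` below is the hypothetical behaviour being asked for, not anything Cassandra does:

```python
# Last-write-wins vs a hypothetical append reconcile, over (timestamp, value)
# pairs standing in for column versions on different replicas.

def reconcile_lww(a, b):
    """Standard reconcile: (timestamp, value), higher timestamp wins."""
    return a if a[0] >= b[0] else b

def reconcile_append(a, b):
    """Hypothetical custom reconcile: merge both values, ordered by timestamp."""
    first, second = sorted([a, b])
    return (second[0], first[1] + second[1])

base = (1, b"visit1;")
w1 = (2, b"visit2;")
w2 = (3, b"visit3;")

lww = reconcile_lww(reconcile_lww(base, w1), w2)
app = reconcile_append(reconcile_append(base, w1), w2)
print(lww)   # (3, b'visit3;') -- the earlier appends are lost
print(app)   # (3, b'visit1;visit2;visit3;')
```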
Re: Cannot set column value to zero
The extra () in the describe keyspace output is only there if the column comparator is BytesType; the client tries to format the data as UTF8. Don't forget truncate takes snapshots, so check the snapshots dir and delete things if you are using it a lot for testing. The 0 == 1 thing does not ring any bells. Let us know if it happens again. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 30 Jun 2011, at 02:13, dnalls...@taz.qinetiq.com wrote: I had a strange problem recently where I was unable to set the value of a column to '0' (it always returned '1') but setting it to other values worked fine: [default@Test] set Urls['rowkey']['status']='1'; Value inserted. [default@Test] get Urls['rowkey']; = (column=status, value=1, timestamp=1309189541891000) Returned 1 results. [default@Test] set Urls['rowkey']['status']='0'; Value inserted. [default@Test] get Urls['rowkey']; = (column=status, value=1, timestamp=1309189551407616) Returned 1 results. This was on a one-node test cluster (v0.7.6) with no other clients; setting other values (e.g. '9') worked fine. However, attempting to set the value back to '0' always resulted in a value of '1'. I noticed this shortly after truncating the CF. The column family definition is shown below. One thing that looks odd is that on other test clusters the Column Name is followed by a reference to the index, e.g. Column Name: status (737461747573) - but here it isn't. I was wondering if there was some interaction between truncating the CF and the use of a KEYS index? (Presumably it would be safer to delete all data directories in order to wipe the cluster during experimentation, rather than truncating?) Unfortunately I'm not sure how to recreate the situation as this was a test machine on which I played around with various configurations - but maybe someone has seen a similar problem elsewhere?
In the end I had to wipe the data and start again, and all seemed fine, although the index reference is still absent as mentioned above. [default@Test] describe keyspace; Keyspace: Test: ... ColumnFamily: Foo default_validation_class: org.apache.cassandra.db.marshal.BytesType Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period in seconds: 0.0/0 Key cache size / save period in seconds: 0.0/14400 Memtable thresholds: 0.5/128/60 (millions of ops/minutes/MB) GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Built indexes: [Foo.737461747573] Column Metadata: Column Name: status Validation Class: org.apache.cassandra.db.marshal.UTF8Type Index Type: KEYS ...
Re: hadoop results
How about get_slice() with reversed == true and count = 1 to get the highest time UUID? Or you can also store a column with a magic name that has the value of the timeuuid that is the current metric to use. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 30 Jun 2011, at 06:35, William Oberman wrote: I'll start with my question: given a CF with comparator TimeUUIDType, what is the most efficient way to get the greatest column's value? Context: I've been running cassandra for a couple of months now, so obviously it's time to start layering more on top :-) In my test environment, I managed to get pig/hadoop running, and developed a few scripts to collect metrics I've been missing since I switched from MySQL to cassandra (including the ever useful select count(*) from table equivalent). I was hoping to dump the results of this processing back into cassandra for use in other tools/processes. My initial thought was: a new CF called stats with comparator TimeUUIDType. The basic idea being I'd store: stat_name -> time stat was computed (as UUID) -> value That way I can also see a historical perspective of any given stat for auditing (and for cumulative stats to see trends). The stat_name itself is a URI that is composed of the what and any constraints on the what (including an optional time range, if the stat supports it). E.g. ClassOfSomething/ID/MetricName/OptionalTimeRange (or something, still deciding on the format of the URI). But, right now, the only way I know to get the current stat value would be to iterate over all columns (the TimeUUIDs) and then return the last one. Thanks for any tips, will
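Aaron's suggestion can be sketched in plain Python: with a TimeUUIDType comparator, columns sort by the timestamp embedded in the version-1 UUID, so a get_slice with reversed=True and count=1 returns the newest stat without walking the whole row:

```python
# The TimeUUIDType ordering is simulated by sorting on the 60-bit time field
# of a v1 UUID; the [:1] on the reversed sort stands in for count=1.

import uuid

def timeuuid_sort_key(u):
    # TimeUUIDType compares by the time component of a v1 UUID
    return u.time

cols = [uuid.uuid1() for _ in range(5)]       # columns written over time

# the reversed slice with count=1:
newest = sorted(cols, key=timeuuid_sort_key, reverse=True)[:1]
assert newest[0] == max(cols, key=timeuuid_sort_key)
```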
Re: Chunking if size 64MB
AFAIK there is no server-side chunking of column values. This link http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage is just suggesting that in the app you do not store more than 64MB per column. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 30 Jun 2011, at 07:25, A J wrote: From what I read, Cassandra allows a single column value to be up to 2GB but would chunk the data if greater than 64MB. Is the chunking transparent to the application, or does the app need to know if/how/when the chunking happened for a specific column value that happened to be over 64MB? Thank you.
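Since the server does no chunking, an application storing blobs near the 64MB guideline has to split them itself. A minimal plain-Python sketch: store the blob as numbered chunk columns in one row and reassemble on read. Names are illustrative, and the chunk size is tiny so the example is visible (in practice it would be a few MB):

```python
# App-side chunking of a large value into ordered chunk columns.

CHUNK = 8   # illustrative; in practice something well under 64MB

def write_blob(row, blob, chunk=CHUNK):
    for i in range(0, len(blob), chunk):
        row["chunk:%08d" % (i // chunk)] = blob[i:i + chunk]

def read_blob(row):
    # zero-padded column names keep the chunks in comparator order
    return b"".join(row[name] for name in sorted(row) if name.startswith("chunk:"))

row = {}
write_blob(row, b"a-blob-larger-than-one-chunk")
assert read_blob(row) == b"a-blob-larger-than-one-chunk"
print(sorted(row))    # ['chunk:00000000', 'chunk:00000001', ...]
```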
Re: SimpleAuthenticator
cassandra.in.sh is old skool 0.6 series, 0.7 series uses cassandra-env.sh. The packages put it in /etc/cassandra. This works for me at the end of cassandra-env.sh: JVM_OPTS="$JVM_OPTS -Dpasswd.properties=/etc/cassandra/passwd.properties" JVM_OPTS="$JVM_OPTS -Daccess.properties=/etc/cassandra/access.properties" btw at a minimum you should upgrade from 0.7.2 to 0.7.6-2, see https://github.com/apache/cassandra/blob/cassandra-0.7.6-2/NEWS.txt#L61 Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 1 Jul 2011, at 02:20, Earl Barnes wrote: Hi, I am encountering an error while trying to set up simple authentication in a test environment. BACKGROUND Cassandra Version: ReleaseVersion: 0.7.2-0ubuntu4~lucid1 OS Level: Linux cassandra1 2.6.32-32-server #62-Ubuntu SMP Wed Apr 20 22:07:43 UTC 2011 x86_64 GNU/Linux 2 node cluster Properties files exist in the following directory: /etc/cassandra/access.properties /etc/cassandra/passwd.properties The authenticator element in the /etc/cassandra/cassandra.yaml file is set to: authenticator: org.apache.cassandra.auth.SimpleAuthenticator The authority element in the /etc/cassandra/cassandra.yaml file is set to: authority: org.apache.cassandra.auth.SimpleAuthority The cassandra.in.sh file located in /usr/share/cassandra has been updated to show the location of the properties files in the following manner: # Location of access.properties and passwd.properties JVM_OPTS="-Dpasswd.properties=/etc/cassandra/passwd.properties -Daccess.properties=/etc/cassandra/access.properties" Also, the destination of the configuration directory: CASSANDRA_CONF=/etc/cassandra ERROR After setting DEBUG mode, I get the following error message in the system.log: INFO [main] 2011-06-30 10:12:01,365 AbstractCassandraDaemon.java (line 249) Cassandra shutting down...
INFO [main] 2011-06-30 10:12:01,366 CassandraDaemon.java (line 159) Stop listening to thrift clients INFO [main] 2011-06-30 10:13:14,186 AbstractCassandraDaemon.java (line 77) Logging initialized INFO [main] 2011-06-30 10:13:14,196 AbstractCassandraDaemon.java (line 97) Heap size: 510263296/511311872 WARN [main] 2011-06-30 10:13:14,227 CLibrary.java (line 93) Obsolete version of JNA present; unable to read errno. Upgrade to JNA 3.2.7 or later WARN [main] 2011-06-30 10:13:14,227 CLibrary.java (line 93) Obsolete version of JNA present; unable to read errno. Upgrade to JNA 3.2.7 or later WARN [main] 2011-06-30 10:13:14,228 CLibrary.java (line 125) Unknown mlockall error 0 INFO [main] 2011-06-30 10:13:14,234 DatabaseDescriptor.java (line 121) Loading settings from file:/etc/cassandra/cassandra.yaml INFO [main] 2011-06-30 10:13:14,337 DatabaseDescriptor.java (line 181) DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap ERROR [main] 2011-06-30 10:13:14,342 DatabaseDescriptor.java (line 405) Fatal configuration error org.apache.cassandra.config.ConfigurationException: When using org.apache.cassandra.auth.SimpleAuthenticator passwd.properties properties must be defined. 
at org.apache.cassandra.auth.SimpleAuthenticator.validateConfiguration(SimpleAuthenticator.java:148) at org.apache.cassandra.config.DatabaseDescriptor.clinit(DatabaseDescriptor.java:200) at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:100) at org.apache.cassandra.service.AbstractCassandraDaemon.init(AbstractCassandraDaemon.java:217) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.commons.daemon.support.DaemonLoader.load(DaemonLoader.java:160) Data from the output.log: INFO 10:12:01,365 Cassandra shutting down... INFO 10:12:01,366 Stop listening to thrift clients INFO 10:13:14,186 Logging initialized INFO 10:13:14,196 Heap size: 510263296/511311872 WARN 10:13:14,227 Obsolete version of JNA present; unable to read errno. Upgrade to JNA 3.2.7 or later WARN 10:13:14,227 Obsolete version of JNA present; unable to read errno. Upgrade to JNA 3.2.7 or later WARN 10:13:14,228 Unknown mlockall error 0 INFO 10:13:14,234 Loading settings from file:/etc/cassandra/cassandra.yaml INFO 10:13:14,337 DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap ERROR 10:13:14,342 Fatal configuration error org.apache.cassandra.config.ConfigurationException: When using org.apache.cassandra.auth.SimpleAuthenticator passwd.properties properties must be defined. at org.apache.cassandra.auth.SimpleAuthenticator.validateConfiguration(SimpleAuthenticator.java:148
Re: Repair doesn't work after upgrading to 0.8.1
This seems to be a known issue related to https://issues.apache.org/jira/browse/CASSANDRA-2818 e.g. https://issues.apache.org/jira/browse/CASSANDRA-2768 There was some discussion on IRC today; driftx said the simple fix was a full cluster restart. Or perhaps a rolling restart with the 2818 patch applied may work. Starting with -Dcassandra.load_ring_state=false causes the node to rediscover the ring, which may help (just a guess really). But if there is bad node state being passed around in gossip it will just get the bad state again. Anyone else? - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 1 Jul 2011, at 09:11, Héctor Izquierdo Seliva wrote: Hi all, I have upgraded all my cluster to 0.8.1. Today one of the disks in one of the nodes died. After replacing the disk I tried running repair, but this message appears: INFO [manual-repair-bdb4055a-d370-4d2a-a1dd-70a7e4fa60cf] 2011-06-30 20:36:25,085 AntiEntropyService.java (line 179) Excluding /10.20.13.80 from repair because it is on version 0.7 or sooner. You should consider updating this node before running repair again. INFO [manual-repair-26f5a7dd-cf12-44de-9f8f-6b6335bdd098] 2011-06-30 20:36:25,085 AntiEntropyService.java (line 179) Excluding /10.20.13.76 from repair because it is on version 0.7 or sooner. You should consider updating this node before running repair again. INFO [manual-repair-2a11d01c-e1e4-4f1e-b8cd-00a9a3fd2f4a] 2011-06-30 20:36:25,085 AntiEntropyService.java (line 179) Excluding /10.20.13.80 from repair because it is on version 0.7 or sooner. You should consider updating this node before running repair again. INFO [manual-repair-26f5a7dd-cf12-44de-9f8f-6b6335bdd098] 2011-06-30 20:36:25,086 AntiEntropyService.java (line 179) Excluding /10.20.13.77 from repair because it is on version 0.7 or sooner. You should consider updating this node before running repair again.
INFO [manual-repair-bdb4055a-d370-4d2a-a1dd-70a7e4fa60cf] 2011-06-30 20:36:25,085 AntiEntropyService.java (line 179) Excluding /10.20.13.76 from repair because it is on version 0.7 or sooner. You should consider updating this node before running repair again. INFO [manual-repair-26f5a7dd-cf12-44de-9f8f-6b6335bdd098] 2011-06-30 20:36:25,086 AntiEntropyService.java (line 782) No neighbors to repair with for sbs on (170141183460469231731687303715884105727,28356863910078205288614550619314017621]: manual-repair-26f5a7dd-cf12-44de-9f8f-6b6335bdd098 completed. INFO [manual-repair-2a11d01c-e1e4-4f1e-b8cd-00a9a3fd2f4a] 2011-06-30 20:36:25,086 AntiEntropyService.java (line 179) Excluding /10.20.13.79 from repair because it is on version 0.7 or sooner. You should consider updating this node before running repair again. INFO [manual-repair-bdb4055a-d370-4d2a-a1dd-70a7e4fa60cf] 2011-06-30 20:36:25,086 AntiEntropyService.java (line 782) No neighbors to repair with for sbs on (141784319550391026443072753096570088105,170141183460469231731687303715884105727]: manual-repair-bdb4055a-d370-4d2a-a1dd-70a7e4fa60cf completed. INFO [manual-repair-2a11d01c-e1e4-4f1e-b8cd-00a9a3fd2f4a] 2011-06-30 20:36:25,087 AntiEntropyService.java (line 782) No neighbors to repair with for sbs on (113427455640312821154458202477256070484,141784319550391026443072753096570088105]: manual-repair-2a11d01c-e1e4-4f1e-b8cd-00a9a3fd2f4a completed. What can I do?
Re: incomplete schema sync for new node
First, move off 0.7.2 if you can. While you may not get hit with this https://github.com/apache/cassandra/blob/cassandra-0.7.6-2/NEWS.txt#L61 you may have trouble with this https://issues.apache.org/jira/browse/CASSANDRA-2554 For background read the sections on Starting Up and on Concurrency here http://wiki.apache.org/cassandra/LiveSchemaUpdates java.lang.RuntimeException: java.lang.RuntimeException: Could not reach schema agreement with /50.0.0.3 in 6ms Means you have split-brain schemas in your cluster. Use describe cluster in the cli to see how many versions of the schema you have out there. The exception is thrown when the placement strategy (Simple or OldNTS) is trying to calculate the Natural Endpoints for a Token (AbstractReplicationStrategy.calculateNaturalEndpoints()). This can happen reading/writing a key, or in your case when the node is bootstrapping and trying to work out which endpoints are responsible for the token ranges. AFAIK adding the migrations is an online process; the server is up and running while they are being added. So if anything that requires a valid schema happens while the schema is invalid, you will get the error. All the "Previous version mismatch. cannot apply." errors in the log for 50.0.0.4 mean it got a migration from someone but the migration was received out of order. The current version on the node is not the version that was present when this migration was applied. The simple answer is to stop doing what you're doing; it sounds dangerous and inefficient to me. AFAIK it's not what the schema migrations were designed to do, and moving from CL 1 to 3 will increase the repair workload. Aside from the risks of changing RF up and down a lot. The long answer may be to always apply the schema changes on the same node; check there is a single version of the schema before adding a new one; and take a look at monkeying around with the Schema and Migrations CF's in the System KS to delete migrations you want skipped.
Am frowning and tutting and stroking my beard. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 2 Jul 2011, at 12:58, Jeremy Stribling wrote: Oops, forgot to mention that we're using Cassandra 0.7.2. On 07/01/2011 05:46 PM, Jeremy Stribling wrote: Hi all, I'm running into a problem with Cassandra, where a new node coming up seems to only get an incomplete set of schema mutations when bootstrapping, and as a result hits an IllegalStateException: replication factor (3) exceeds number of endpoints (2) error. I will describe the sequence of events below as I see them, but first I need to warn you that I run Cassandra in a very non-standard way. I embed it in a JVM, along with Zookeeper, and other classes for a product we are working on. We need to bring nodes up and down dynamically in our product, including going from one node to three nodes, and back down to one, at any time. If we ever drop below three nodes, we have code that sets the replication factor of our keyspaces to 1; similarly, whenever we have three or more nodes, we change the replication factor to 3. I know this is frowned upon by the community, but we're stuck with doing it this way for now. Ok, here's the scenario: 1) Node 50.0.0.4 bootstraps into a cluster consisting of nodes 50.0.0.2 and 50.0.0.3. 2) Once 50.0.0.4 is fully bootstrapped, we change the replication factor for our two keyspaces to 3. 3) Then node 50.0.0.2 is taken down permanently, and we change the replication factor back down to 1. 4) We then remove node 50.0.0.2's tokens using the removeToken call on node 50.0.0.3. 5) Then we start node 50.0.0.5, and have it join the cluster using 50.0.0.3 and 50.0.0.4 as seeds. 6) 50.0.0.5 starts receiving schema mutations to get it up to speed; the last one it receives (7d51e757-a40b-11e0-a98d-65ed1eced995) has the replication factor at 3.
However, there should be more schema updates after this that never arrive (you can see them arrive at 50.0.0.4 while it is bootstrapping). 7) Minutes after receiving this last mutation, node 50.0.0.5 hits the IllegalStateException I've listed above, and I think for that reason never successfully joins the cluster. My question is why doesn't node 50.0.0.5 receive the schema updates that follow 7d51e757-a40b-11e0-a98d-65ed1eced995? (For example, 8fc8820d-a40c-11e0-9eaf-6720e49624c2 is present in 50.0.0.4's log and sets the replication factor back down to 1.) I've put logs for nodes 50.0.0.3/4/5 at http://pdos.csail.mit.edu/~strib/cassandra_logs.tgz . The logs are pretty messy because they include log messages from both Zookeeper and our product code -- sorry about that. Also, I think the clock on node 50.0.0.4 is a few minutes ahead of the other nodes' clocks. I also noticed in 50.0.0.4's log the following exceptions: 2011-07-01 18:00:49,832
Re: flushing issue
When you say using CassandraServer do you mean an embedded cassandra server? What process did you use to add the Keyspaces? Adding a KS via the thrift API should take care of everything. The simple test is stop the server and the clients, start the server again and see if the KS is defined by using nodetool cfstats. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 4 Jul 2011, at 22:28, Vivek Mishra wrote: Hi, I know, I might be missing something here. I am currently facing 1 issue. I have 2 cassandra clients (1. using CassandraServer 2. using Cassandra.Client) running and connecting to the same host. I have created Keyspaces K1, K2 using client1 (e.g. CassandraServer), but somehow those keyspaces are not available with Client2 (e.g. Cassandra.Client). I have also tried flushing via StorageService.instance.ForceFlush on the tables. But that also didn't work. Any help/suggestion?
Re: copy data from multi-node cluster to single node
How do you change the name of a cluster? The FAQ instructions do not seem to work for me - are they still valid for 0.7.5? Is the backup/restore mechanism going to work, or is there a better/simpler way to copy data from multi-node to single-node? Bug fixed in 0.7.6 https://github.com/apache/cassandra/blob/cassandra-0.7.6-2/CHANGES.txt#L21 Also you should move to 0.7.6 to get the Gossip fix https://github.com/apache/cassandra/blob/cassandra-0.7.6-2/CHANGES.txt#L6 When it comes to moving the data back to a single node I would: - run repair - snapshot the prod node - clear all data including the system KS data from the dev node - copy the snapshot data for only your KS to the dev node into the correct directory, e.g. data/my-keyspace - start the dev node - add your KS, the node will now load the data Ignoring the system data means the dev node can sort its cluster name and token out using the yaml file. Even with 3 nodes and RF 3 it's impossible to ever say that one node has a complete copy of the data. Running repair will make it more likely, but the node could drop a mutation message during the repair or drop off gossip for a few seconds. If you really want to have *everything* from the prod cluster then copy the data from all 3 nodes onto the dev node and compact it down. Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 5 Jul 2011, at 03:05, Ross Black wrote: Hi, I am using Cassandra 0.7.5 on Linux machines. I am trying to backup data from a multi-node cluster (3 nodes) and restore it into a single node cluster that has a different name (for development testing). The multi-node cluster is backed up using clustertool global_snapshot, and then I copy the snapshot from a single node and replace the data directory in the single node. The multi-node cluster has a replication factor of 3, so I assume that restoring from any node of the multi-node cluster will be the same.
When started up this fails with a cluster name mismatch. I have tried removing all the Location* files in the data directory (as per http://wiki.apache.org/cassandra/FAQ#clustername_mismatch) but the single node then fails with an error message: org.apache.cassandra.config.ConfigurationException: Found system table files, but they couldn't be loaded. Did you change the partitioner? How do you change the name of a cluster? The FAQ instructions do not seem to work for me - are they still valid for 0.7.5? Is the backup/restore mechanism going to work, or is there a better/simpler way to copy data from multi-node to single-node? Thanks, Ross
Re: copy data from multi-node cluster to single node
Is it possible the snapshots from different nodes have the same name? The directory name will be made up of the current timestamp on the machine and the optional name passed via the command line. The SSTables from different nodes may have name collisions. If you are aggregating data from multiple nodes onto one you will need to manually update them. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 5 Jul 2011, at 14:59, Zhu Han wrote, quoting Aaron's reply above in full and asking: Is it possible the snapshots from different nodes have the same name?
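The manual renaming Aaron mentions can be sketched in plain Python: when aggregating SSTables from several nodes onto one box, files like "Users-f-1-Data.db" from different nodes collide on the generation number, so fresh generations have to be assigned, with all components of one sstable (same node, CF and old generation) sharing the new generation. The "<cf>-<version>-<gen>-<component>.db" pattern below is an assumption based on the 0.7-era on-disk layout:

```python
# Reassign unique generation numbers across sstables gathered from
# multiple source nodes. Illustrative only; the filename pattern is an
# assumption, and nothing is actually renamed on disk here.

import re

PAT = re.compile(r"^(?P<cf>.+?)-(?P<ver>[a-z]+)-(?P<gen>\d+)-(?P<comp>\w+)\.db$")

def reassign_generations(files):
    """files: iterable of (source_node, filename) -> {(node, old_name): new_name}"""
    new_gen = {}     # (node, cf, old_gen) -> fresh generation
    counters = {}    # cf -> next free generation
    renames = {}
    for node, name in files:
        m = PAT.match(name)
        cf, ver, gen, comp = m.group("cf", "ver", "gen", "comp")
        key = (node, cf, gen)
        if key not in new_gen:
            counters[cf] = counters.get(cf, 0) + 1
            new_gen[key] = counters[cf]
        renames[(node, name)] = "%s-%s-%d-%s.db" % (cf, ver, new_gen[key], comp)
    return renames

files = [("node1", "Users-f-1-Data.db"), ("node1", "Users-f-1-Index.db"),
         ("node2", "Users-f-1-Data.db"), ("node2", "Users-f-1-Index.db")]
print(reassign_generations(files)[("node2", "Users-f-1-Data.db")])  # Users-f-2-Data.db
```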
Re: Problems Iterating over tokens in 0.7.5
If you still have problems send through some details of where you get incorrect results. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 6/07/2011, at 3:23 AM, Anand Somani meatfor...@gmail.com wrote: Hi, Using thrift and the get_range_slices call with a token range, using RandomPartitioner. Have only tried this on 0.7.5. It used to work in 0.6.4 or earlier versions for me, but I notice that it does not work for me anymore. The need is to iterate over a token range to do some bookkeeping. The logic is: use TokenRange from describe_ring, and then for each range set the start and end token and get a batch of rows using get_range_slices. Then use the last token from the batch to set the start_token and repeat (get the next batch). Iterate until there are no more to get (or the last row from the new batch is the same as the last from the previous batch). Now this works when in a test I insert n records and then for iterating use a batch size m such that m > n. As soon as I use m < n, I get an incorrect count or an infinite loop where the range seems to repeat. Has anybody seen this issue, or am I using it incorrectly for newer versions of cassandra? I will also look up how this is done in Hector, but in the meantime if somebody has seen this behavior, please do respond. Thanks Anand
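The paging logic Anand describes can be sketched in plain Python. The "ring" here is just a sorted map of token to row; a KeyRange with a start_token is exclusive at the start and inclusive at the end, so each batch restarts from the token of the last row seen, and the loop stops on a short batch. Wrap-around ranges from describe_ring (where end < start) are not handled here, to keep the sketch small:

```python
# Toy token-range paging: exclusive start token, inclusive end token,
# next page starts at the last row's token, terminate on a short batch.

def get_range_slices(ring, start_token, end_token, count):
    toks = sorted(t for t in ring if start_token < t <= end_token)
    return [(t, ring[t]) for t in toks[:count]]

def iterate_range(ring, start_token, end_token, batch):
    rows, start = [], start_token
    while True:
        page = get_range_slices(ring, start, end_token, batch)
        if not page:
            break
        rows.extend(page)
        start = page[-1][0]    # last row's token becomes the next exclusive start
        if len(page) < batch:
            break
    return rows

ring = {t: "row%d" % t for t in [3, 10, 25, 40, 77]}
got = iterate_range(ring, 0, 100, batch=2)
print([t for t, _ in got])     # [3, 10, 25, 40, 77] -- no repeats, no infinite loop
```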
Re: deleting keys
See http://wiki.apache.org/cassandra/FAQ#range_ghosts Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 6/07/2011, at 3:46 AM, karim abbouh karim_...@yahoo.fr wrote: i use get_range_slice to get the list of keys, then i call client.remove(keyspace, key, columnFamily, timestamp, ConsistencyLevel.ALL); to delete the record but i still have the keys. why? can i do it otherwise?
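What the range_ghosts FAQ entry means in practice: a deleted row's key keeps coming back from get_range_slice with zero columns until the tombstone is compacted away after GCGraceSeconds, so client code has to skip empty rows. A plain-Python sketch:

```python
# Filtering "range ghosts" out of a get_range_slice style result set.

raw_result = [
    ("key1", [("name", "a")]),    # live row
    ("key2", []),                 # deleted row: the key returns, with no columns
    ("key3", [("name", "c")]),
]

live_rows = [(key, cols) for key, cols in raw_result if cols]
print([key for key, _ in live_rows])    # ['key1', 'key3']
```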
Re: Details of 'nodetool move'
Use move when you need to change the token distribution, e.g. to re-balance the ring. During decommission, writes that would go to the old node will also (I think, maybe instead of) go to the node that will later be responsible for the old node's range. Same thing when a node enters the ring, it will also be sent writes while it is bootstrapping. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 6/07/2011, at 10:35 AM, A J s5a...@gmail.com wrote: Hello, Where can I find details of nodetool move? Most places just mention that it 'moves the target node to a given Token. Moving is essentially a convenience over decommission + bootstrap.' Stuff like: when do I need to do it, and on what nodes? What is the value of the 'new token' to be provided? What happens if there is a mismatch between the 'new token' in the nodetool move command and initial_token in the cassandra.yaml file? What happens when nodetool move is not successful - does Cassandra know where to look for data (some data might still be on the old node and some on the new)? Repercussions of not running nodetool move or running it incorrectly? Does a Read Repair take care of the move for that specific key in question? Does anti-entropy somehow take care of the move? Thanks.
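Re-balancing with move usually means recomputing evenly spaced tokens and moving each node to its new one. For RandomPartitioner the usual formula is token_i = i * 2**127 / N:

```python
# Evenly spaced RandomPartitioner tokens for an N-node ring.

def balanced_tokens(node_count):
    return [i * (2 ** 127) // node_count for i in range(node_count)]

print(balanced_tokens(2))
# [0, 85070591730234615865843651857942052864] -- the same pair used in the
# two-node-per-DC example earlier in this digest
```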
Re: memory
That advice is a little out of date, especially in the future world of 0.8 memory management; see http://thelastpickle.com/2011/05/04/How-are-Memtables-measured/ Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 6/07/2011, at 5:51 PM, Donna Li donna...@utstar.com wrote: All: For a rough rule of thumb, Cassandra's internal datastructures will require about memtable_throughput_in_mb * 3 * number of hot CFs + 1G + internal caches. Why does cassandra need so much memory? What is the 1G of memory used for? Best Regards Donna li
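As a rough sketch, the rule of thumb quoted in the question is just arithmetic (the figures below are illustrative, and as the reply notes the accounting changed in 0.8):

```python
# The rough heap rule of thumb quoted above, as arithmetic. The 1G term
# covers internal overhead and caches; all figures here are illustrative.
def rough_heap_mb(memtable_throughput_in_mb, hot_cfs, internal_caches_mb=0):
    # Memtables can hold roughly 3x their throughput setting in memory.
    return memtable_throughput_in_mb * 3 * hot_cfs + 1024 + internal_caches_mb

# e.g. 64 MB memtables and 10 hot column families:
print(rough_heap_mb(64, 10))  # 64*3*10 + 1024 = 2944 MB
```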
Re: result sorted by keys in reversed
It's not currently supported via the api. But I *think* it's technically possible, the code could page backwards using the index sampling the same way it does for columns. Best advice is to raise a ticket on https://issues.apache.org/jira/browse/CASSANDRA (maybe do a search first, someone else may have requested it) Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 7/07/2011, at 1:39 AM, Monnom Monprenom accountfor...@yahoo.fr wrote: Hi, I am using get_range_slice and I get the results sorted by keys, Is it possible to have the results also sorted by keys but in reverse (from the biggest to the smallest)?
Re: commitlogs not draining
When you run drain the node will log something like 'node drained' when it is done. The commit log should be empty; any data in the log may be due to changes in the system tables after the drain. Can you raise a ticket and include the commit logs left behind and any relevant log messages? The non-draining logs may be this https://issues.apache.org/jira/browse/CASSANDRA-2829 . If a nodetool flush does not clear them a restart will. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 7/07/2011, at 12:26 PM, Scott Dworkis s...@mylife.com wrote: couple questions about commitlogs and the nodetool drain operator: * in 0.6, after invoking a drain, the commitlog directory would be empty. in 0.8, it seems to contain 2 files, a 44 byte .header file and a 270 byte .log file. do these indicate a fully drained commitlog? * i have a couple nodes for which the commitlogs do not seem to be draining at all... they remain several hundred k or meg in size. are they corrupt? if the data is not precious, is there some way to clear and reset them to work around this? also, i see this in system.log: /data/var/log/cassandra/system.log.1:DEBUG [COMMIT-LOG-WRITER] 2011-07-06 11:04:10,076 CommitLog.java (line 473) Not safe to delete commit log CommitLogSegment(/data/var/lib/cassandra/commitlog/CommitLog-1309288064262.log); dirty is LocationInfo (0), ; hasNext: true -scott
Re: Running hadoop jobs against data in remote data center
See http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers and http://www.datastax.com/docs/0.8/brisk/about_brisk#about-the-brisk-architecture It's possible to run multi DC and use the LOCAL_QUORUM consistency level in your production centre to allow the prod code to get on with its life without worrying about the other DC. Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 7/07/2011, at 1:29 PM, Jason Baker ja...@apture.com wrote: I'm just setting up a Cassandra cluster for my company. For a variety of reasons, we have the servers that run our hadoop jobs in our local office and our production machines in a colocated data center. We don't want to run hadoop jobs against cassandra servers on the other side of the US from us, not to mention that we don't want them impacting performance in production. What's the best way to handle this? My first instinct is to add some servers locally to the node and use NetworkTopologyStrategy. This way, the servers automatically get updated with the latest changes, and we get a bit of extra redundancy for our production machines. Of course, the glaring weakness of this strategy is that our stats servers aren't in a datacenter with any kind of production guarantees. The network connection is relatively slow and unreliable, the servers may go out at any time, and I generally don't want to tie our production performance or reliability to these servers. Is this as dumb an idea as I suspect it is, or can this be made to work? :-) Are there any better ways to accomplish what I'm trying to accomplish?
Re: Pig pulling an older value from cassandra
Jeremy did you get anywhere with this ? If you are reading at CL ONE Read Repair will run in the background, so it may only be visible to subsequent reads. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 6 Jul 2011, at 20:52, Jeremy Hanna wrote: I'm seeing some strange behavior and not sure how it is possible. We updated some data using a pig script and that wrote back to cassandra. We get the value and list the value on the Cassandra CLI and it's the updated value - from MARKET to market. However, when doing a pig script to filter by the known good values, we are left with about 42k rows that still have MARKET. If we list a subset of them, get the key, and get/list them on the CLI, they are lowercase market. Anyone have any suggestions as to how this might be possible? Our read repair chance is set to 1.0. Jeremy
Re: Re : result sorted by keys in reversed
Is it possible to have same results sorting in reversed by another method without get_range_slice in JAVA ? Sorry I don't understand your question. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 7 Jul 2011, at 01:56, Monnom Monprenom wrote: Thanks, Is it possible to have same results sorting in reversed by another method without get_range_slice in JAVA ? From: Aaron Morton aa...@thelastpickle.com To: user@cassandra.apache.org Sent: Thursday 7 July 2011 2:52 Subject: Re: result sorted by keys in reversed It's not currently supported via the api. But I *think* it's technically possible, the code could page backwards using the index sampling the same way it does for columns. Best advice is to raise a ticket on https://issues.apache.org/jira/browse/CASSANDRA (maybe do a search first, someone else may have requested it) Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 7/07/2011, at 1:39 AM, Monnom Monprenom accountfor...@yahoo.fr wrote: Hi, I am using get_range_slice and I get the results sorted by keys, Is it possible to have the results also sorted by keys but in reverse (from the biggest to the smallest)?
Re: List nodes where write was applied to
The logs will give you some idea, but it's not information that is available as part of a request. Turn the logging up to DEBUG and watch what happens. You will see the coordinator log where it is sending messages together with some unique identifiers that you will also see logged on the replicas. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 7 Jul 2011, at 10:01, A J wrote: Is there a way to find what all nodes was a write applied to ? It could be a successful write (i.e. w was met) or unsuccessful write (i.e. less than w nodes were met). In either case, I am interested in finding: Number of nodes written to (before timeout or on success) Name of nodes written to (before timeout or on success) Thanks.
Re: how large cassandra could scale when it need to do manual operation?
AFAIK Facebook Cassandra and Apache Cassandra diverged paths a long time ago. Twitter is a vocal supporter with a large Apache Cassandra install, e.g. Twitter currently runs a couple hundred Cassandra nodes across a half dozen clusters. http://www.datastax.com/2011/06/chris-goffinet-of-twitter-to-speak-at-cassandra-sf-2011 If you are working with a 3 node cluster, removing/rebuilding/whatever one node will affect 33% of your capacity. When you scale up the contribution from each individual node goes down, and the impact of one node going down is less. Problems that happen with a few nodes will go away at scale, to be replaced by a whole set of new ones. 1): the load balance need to manually performed on every node, according to: Yes 2): when adding new nodes, need to perform node repair and cleanup on every node You only need to run cleanup, see http://wiki.apache.org/cassandra/Operations#Bootstrap 3) when decommission a node, there is a chance that slow down the entire cluster. (not sure why but I saw people ask around about it.) and the only way to do is shutdown the entire the cluster, rsync the data, and start all nodes without the decommission one. I cannot remember any specific cases where decommission requires a full cluster stop, do you have a link? With regard to slowing down, the decommission process will stream data from the node you are removing onto the other nodes; this can slow down the target node (I think it's more intelligent now about what is moved). This will be exaggerated in a 3 node cluster as you are removing 33% of the processing and adding some (temporary) extra load to the remaining nodes. after all, I think there is a lot of human work to do to maintain the cluster which makes it impossible to scale to thousands of nodes, Automation, Automation, Automation is the only way to go. Chef, Puppet, CF Engine for general config and deployment; Cloud Kick, munin, ganglia etc for monitoring. 
And Ops Centre (http://www.datastax.com/products/opscenter) for cassandra specific management. but I hope I am totally wrong about all of this, currently I am serving 1 million pv every day with Cassandra and it makes me feel unsafe, I am afraid one day one node crash will cause broken data and the whole cluster to go wrong With RF 3 and a 3 node cluster you have room to lose one node and the cluster will be up for 100% of the keys. While better than having to worry about *the* database server, it's still entry level fault tolerance. With RF 3 in a 6 node cluster you can lose up to 2 nodes and still be up for 100% of the keys. Is there something you are specifically concerned about with your current installation ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 8 Jul 2011, at 08:50, Yan Chunlu wrote: hi, all: I am curious about how large Cassandra can scale? from the information I can get, the largest usage is at facebook, which is about 150 nodes. in the meantime they are using 2000+ nodes with Hadoop, and yahoo is even using 4000 nodes of Hadoop. I do not understand why this is the situation; I only have a little knowledge of Cassandra and no knowledge of Hadoop. currently I am using cassandra with 3 nodes and having problems bringing one back after it got out of sync, and the problems I encountered make me worry about how cassandra could scale out: 1): the load balancing needs to be manually performed on every node, according to: def tokens(nodes): for x in xrange(nodes): print 2 ** 127 / nodes * x 2): when adding new nodes, need to perform node repair and cleanup on every node 3) when decommissioning a node, there is a chance that it slows down the entire cluster. (not sure why but I saw people ask around about it.) and the only way to deal with it is to shut down the entire cluster, rsync the data, and start all nodes without the decommissioned one. 
after all, I think there is a lot of human work to do to maintain the cluster, which makes it impossible to scale to thousands of nodes, but I hope I am totally wrong about all of this. currently I am serving 1 million pv every day with Cassandra and it makes me feel unsafe; I am afraid one day one node crash will cause broken data and the whole cluster to go wrong. on the contrary, a relational database makes me feel safe, but it does not scale well. thanks for any guidance here.
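For reference, the tokens() snippet quoted in this thread, cleaned up as runnable Python 3: it simply spaces initial_token values evenly around the 2**127 RandomPartitioner ring.

```python
# The balanced-token snippet quoted in the thread above, as Python 3.
# Evenly spaces initial_token values around the 2**127 RandomPartitioner ring.
def tokens(nodes):
    return [2 ** 127 // nodes * x for x in range(nodes)]

for t in tokens(3):
    print(t)
```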
Re: Corrupted data
You may not lose data. - What version and what's the upgrade history? - What RF / node count / CL ? - Have you been running repair consistently ? - Is this on a single node or all nodes ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 8 Jul 2011, at 09:38, Héctor Izquierdo Seliva wrote: Hi everyone, I'm having thousands of these errors: WARN [CompactionExecutor:1] 2011-07-08 16:36:45,705 CompactionManager.java (line 737) Non-fatal error reading row (stacktrace follows) java.io.IOError: java.io.IOException: Impossible row size 6292724931198053 at org.apache.cassandra.db.compaction.CompactionManager.scrubOne(CompactionManager.java:719) at org.apache.cassandra.db.compaction.CompactionManager.doScrub(CompactionManager.java:633) at org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:65) at org.apache.cassandra.db.compaction.CompactionManager$3.call(CompactionManager.java:250) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.IOException: Impossible row size 6292724931198053 ... 9 more INFO [CompactionExecutor:1] 2011-07-08 16:36:45,705 CompactionManager.java (line 743) Retrying from row index; data is -8 bytes starting at 4735525245 WARN [CompactionExecutor:1] 2011-07-08 16:36:45,705 CompactionManager.java (line 767) Retry failed too. Skipping to next row (retry's stacktrace follows) java.io.IOError: java.io.EOFException: bloom filter claims to be 863794556 bytes, longer than entire row size -8 This is during scrub; I saw similar errors while in normal operation. Is there anything I can do? It looks like I'm going to lose a ton of data
Re: how large cassandra could scale when it need to do manual operation?
about the decommission problem, here is the link: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/how-to-decommission-two-slow-nodes-td5078455.html The key part of that post is "and since the second node was under heavy load, and not enough ram, it was busy GCing and worked horribly slow". maybe I was misunderstanding the replication factor, doesn't RF=3 mean I could lose two nodes and still have one available (with 100% of the keys), once Nodes=3? When you start losing replicas the CL you use dictates if the cluster is still up for 100% of the keys. See http://thelastpickle.com/2011/06/13/Down-For-Me/ I have the strong willing to set RF to a very high value... As Chris said 3 is about normal, it means the QUORUM CL is only 2 nodes. I am also trying to deploy cassandra across two datacenters (with 20ms latency). Look up LOCAL_QUORUM in the wiki. Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 9 Jul 2011, at 02:01, Chris Goffinet wrote: As mentioned by Aaron, yes we run hundreds of Cassandra nodes across multiple clusters. We run with an RF of 2 and 3 (most common). We use commodity hardware and see failure all the time at this scale. We've never had 3 nodes that were in the same replica set fail all at once. We mitigate risk by being rack diverse, using different vendors for our hard drives, designing workflows to make sure machines get serviced in certain time windows, and having an extensive automated burn-in process for (disk, memory, drives) so as not to roll out nodes/clusters that could fail right away. On Sat, Jul 9, 2011 at 12:17 AM, Yan Chunlu springri...@gmail.com wrote: thank you very much for the reply. which brings me more confidence on cassandra. I will try the automation tools, the examples you've listed seem quite promising! 
about the decommission problem, here is the link: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/how-to-decommission-two-slow-nodes-td5078455.html I am also trying to deploy cassandra across two datacenters(with 20ms latency). so I am worrying about the network latency will even make it worse. maybe I was misunderstanding the replication factor, doesn't it RF=3 means I could lose two nodes and still have one available(with 100% of the keys), once Nodes=3? besides I am not sure what's twitters setting on RF, but it is possible to lose 3 nodes in the same time(facebook once encountered photo loss because there RAID broken, rarely happen though). I have the strong willing to set RF to a very high value... Thanks! On Sat, Jul 9, 2011 at 5:22 AM, aaron morton aa...@thelastpickle.com wrote: AFAIK Facebook Cassandra and Apache Cassandra diverged paths a long time ago. Twitter is a vocal supporter with a large Apache Cassandra install, e.g. Twitter currently runs a couple hundred Cassandra nodes across a half dozen clusters. http://www.datastax.com/2011/06/chris-goffinet-of-twitter-to-speak-at-cassandra-sf-2011 If you are working with a 3 node cluster removing/rebuilding/what ever one node will effect 33% of your capacity. When you scale up the contribution from each individual node goes down, and the impact of one node going down is less. Problems that happen with a few nodes will go away at scale, to be replaced by a whole set of new ones. 1): the load balance need to manually performed on every node, according to: Yes 2): when adding new nodes, need to perform node repair and cleanup on every node You only need to run cleanup, see http://wiki.apache.org/cassandra/Operations#Bootstrap 3) when decommission a node, there is a chance that slow down the entire cluster. (not sure why but I saw people ask around about it.) and the only way to do is shutdown the entire the cluster, rsync the data, and start all nodes without the decommission one. 
I cannot remember any specific cases where decommission requires a full cluster stop, do you have a link? With regard to slowing down, the decommission process will stream data from the node you are removing onto the other nodes this can slow down the target node (I think it's more intelligent now about what is moved). This will be exaggerated in a 3 node cluster as you are removing 33% of the processing and adding some (temporary) extra load to the remaining nodes. after all, I think there is alot of human work to do to maintain the cluster which make it impossible to scale to thousands of nodes, Automation, Automation, Automation is the only way to go. Chef, Puppet, CF Engine for general config and deployment; Cloud Kick, munin, ganglia etc for monitoring. And Ops Centre (http://www.datastax.com/products/opscenter) for cassandra specific management. I am totally wrong about all of this, currently
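The RF/CL reasoning in this thread can be sketched as a toy availability check (illustrative only; real replica placement depends on the replication strategy): a key stays available at a given ConsistencyLevel while at least that many of its RF replicas are alive.

```python
# Toy check of the availability reasoning above: a key is readable/writable
# at a ConsistencyLevel while at least that many of its RF replicas are up.
# Illustrative only; real placement depends on the replication strategy.
def quorum(rf):
    return rf // 2 + 1

def key_available(rf, live_replicas, required):
    return min(rf, live_replicas) >= required

rf = 3
# RF=3, QUORUM=2: losing one replica is fine, losing two is not.
print(key_available(rf, live_replicas=2, required=quorum(rf)))  # True
print(key_available(rf, live_replicas=1, required=quorum(rf)))  # False
```

This is why the reply notes that RF=3 does not mean "up with one survivor": at QUORUM you need two of the three replicas, though at CL ONE a single survivor is still enough.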
Re: Cassandra Secondary index/Twissandra
Is there a limit on the number of columns in a single column family that serve as secondary indexes? AFAIK there is no coded limit, however every index is implemented as another (hidden) Column Family that inherits the settings of the parent CF. So under 0.7 you may run out of memory, under 0.8 you may flush a lot. Also, when an indexed column is updated there are potentially 3 operations that have to happen: read the old value, delete the old value, write the new value. More indexes == more index updating, just like any other database. Does performance decrease (significantly) if the uniqueness of the column’s values is high? Low cardinality is recommended http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Secondary-indices-Why-low-cardinality-td6160509.html The CF for Userline/Timeline - have comparator of LONG_TYPE and not TimeUUID? Probably just to make the demo easier. It's used to order tweets in the user and public timelines by the current time https://github.com/twissandra/twissandra/blob/master/cass.py#L204 Does performance decrease (significantly) if the uniqueness of the column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has lots of columns? Depends on what sort of operations you are doing. Some read operations have to pay a constant cost to decode the row level column index, this can be tuned though. AFAIK the comparator type has very little to do with the performance. Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 9 Jul 2011, at 12:15, Eldad Yamin wrote: Hi, I have a few questions: Secondary index Is there a limit on the number of columns in a single column family that serve as secondary indexes? Does performance decrease (significantly) if the uniqueness of the column’s values is high? Twissandra Why in the source (or any tutorial I've read): The CF for Userline/Timeline - have comparator of LONG_TYPE and not TimeUUID? 
https://github.com/twissandra/twissandra/blob/master/tweets/management/commands/sync_cassandra.py Does performance decrease (significantly) if the uniqueness of the column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has lots of columns? Thanks! Eldad
Re: Corrupted data
Nop, only when something breaks Unless you've been working at QUORUM life is about to get trickier. Repair is an essential part of running a cassandra cluster; without it you risk data loss and dead data coming back to life. If you have been writing at QUORUM, so have a reasonable expectation of data replication, the normal approach is to happily let scrub skip the rows; after scrub has completed, a repair will see the data repaired using one of the other replicas. That's probably already happened, as the scrub process skipped the rows when writing them out to the new files. Try to run repair. Try running it on a single CF to start with. Good luck - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 9 Jul 2011, at 16:45, Héctor Izquierdo Seliva wrote: Hi Peter. I have a problem with repair, and it's that it always brings the node doing the repairs down. I've tried setting index_interval to 5000, and it still dies with OutOfMemory errors, or even worse, it generates thousands of tiny sstables before dying. I've tried like 20 repairs during this week. None of them finished. This is on a 16GB machine using a 12GB heap so it doesn't crash (too early). On Sat, 09-07-2011 at 16:16 +0200, Peter Schuller wrote: - Have you been running repair consistently ? Nop, only when something breaks This is unrelated to the problem you were asking about, but if you never run delete, make sure you are aware of: http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair http://wiki.apache.org/cassandra/DistributedDeletes
Re: Cassandra Secondary index/Twissandra
Can you recommend on a better way of doing that or a way to tune Cassandra to support those 2 CF? A select with no start or finish column name, a column count and not in reversed order is about the fastest read query. You will need to do a reversed query, which will be a little slower. But may still be plenty fast enough, depending on scale and throughput and all those other things. see http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 10 Jul 2011, at 00:14, Eldad Yamin wrote: Aaron - Thank you for the fast response! Does performance decrease (significantly) if the uniqueness of the column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has lots of columns? Depends on what sort of operations you are doing. Some read operations have to pay a constant cost to decode the row level column index, this can be tuned though. AFAIK the comparator type has very little to do with the performance. In Twissandra, the columns are used as alternative index for the Userline/Timeline. therefore the operation I'm going to do is slice_range. I'm going to get (for example) the first 50 columns (using comparator of TimeUUID/LONG). Can you recommend on a better way of doing that or a way to tune Cassandra to support those 2 CF? Thanks! On Sun, Jul 10, 2011 at 3:26 AM, aaron morton aa...@thelastpickle.com wrote: Is there a limit on the number of columns in a single column family that serve as secondary indexes? AFAIK there is no coded limit, however every index is implemented as another (hidden) Column Family that inherits the settings of the parent CF. So under 0.7 you may run out of memory, under 0.8 you may flush a lot. Also, when an indexed column is updated there are potentially 3 operations that have to happen: read the old value, delete the old value, write the new value. More indexes == more index updating, just like any other database. 
Does performance decrease (significantly) if the uniqueness of the column’s values is high? Low cardinality is recommended http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Secondary-indices-Why-low-cardinality-td6160509.html The CF for Userline/Uimeline - have comparator of LONG_TYPE and not TimeUUID? Probably just to make the demo easier. It's used to order tweets in the user and public timelines by the current time https://github.com/twissandra/twissandra/blob/master/cass.py#L204 Does performance decrease (significantly) if the uniqueness of the column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has lots of columns? Depends on what sort of operations you are doing. Some read operations have to pay a constant cost to decode the row level column index, this can be tuned though. AFAIK the comparator type has very little to do with the performance. Hope that helps. - - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 9 Jul 2011, at 12:15, Eldad Yamin wrote: Hi, I have few questions: Secondary index Is there a limit on the number of columns in a single column family that serve as secondary indexes? Does performance decrease (significantly) if the uniqueness of the column’s values is high? Twissandra Why in the source (or any tutorial I've read): The CF for Userline/Uimeline - have comparator of LONG_TYPE and not TimeUUID? https://github.com/twissandra/twissandra/blob/master/tweets/management/commands/sync_cassandra.py Does performance decrease (significantly) if the uniqueness of the column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has lots of columns? Thanks! Eldad
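The timeline read discussed above can be modelled in plain Python (a sketch, not a client API): the columns in a row sort by a LONG timestamp comparator, and a newest-first page is a reversed slice limited by a column count.

```python
# The Twissandra-style timeline read discussed above, modelled with plain
# Python. Columns sort by a LONG (timestamp) comparator; the newest-first
# page is a reversed slice with a column count. Illustrative only.
timeline_row = {1309200000 + i: f"tweet-{i}" for i in range(200)}

def reversed_slice(row, count):
    # Equivalent of a slice query with reversed=True and column_count=count.
    names = sorted(row, reverse=True)[:count]
    return [(n, row[n]) for n in names]

latest = reversed_slice(timeline_row, 50)
print(latest[0])   # newest column first
print(len(latest))
```

The server does this against the on-disk column order, which is why it is slightly slower than a forward slice but still cheap: it never has to materialise the whole row.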
Re: Corrupted data
1) do I need to treat every node as failure and do a rolling replacement? since there might be some inconsistent in the cluster even I have no way to find out. see http://wiki.apache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds 2) is that the reason that caused the node repair hung? the log message says: Jul 10, 2011 4:40:35 AM ClientCommunicatorAdmin Checker-run WARNING: Failed to check the connection: java.net.SocketTimeoutException: Read timed out I cannot find that anywhere in the code base, can you provide some more information ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 10 Jul 2011, at 03:26, Yan Chunlu wrote: I am running RF=2(I have changed it from 2-3 and back to 2) and 3 nodes and didn't running node repair more than 10 days, did not aware of this is critical. I run node repair recently and one of the node always hung... from log it seems doing nothing related to the repair. so I got two problems: 1) do I need to treat every node as failure and do a rolling replacement? since there might be some inconsistent in the cluster even I have no way to find out. 2) is that the reason that caused the node repair hung? the log message says: Jul 10, 2011 4:40:35 AM ClientCommunicatorAdmin Checker-run WARNING: Failed to check the connection: java.net.SocketTimeoutException: Read timed out then nothing. thanks! On Sat, Jul 9, 2011 at 10:16 PM, Peter Schuller peter.schul...@infidyne.com wrote: - Have you been running repair consistently ? Nop, only when something breaks This is unrelated to the problem you were asking about, but if you never run delete, make sure you are aware of: http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair http://wiki.apache.org/cassandra/DistributedDeletes -- / Peter Schuller -- 闫春路
Re: node stuck leaving
That's the correct way to use removetoken; it's there for when the node you are removing from the ring cannot be started http://wiki.apache.org/cassandra/Operations#Removing_nodes_entirely Dead nodes popping up and an inconsistent view of the ring is a bit nasty. You can *try* restarting the nodes which think the missing node is up, using the -Dcassandra.load_ring_state=false JVM property. But you may have to take more drastic action. http://www.datastax.com/docs/0.8/troubleshooting/index#view-of-ring-differs-between-some-nodes Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 10 Jul 2011, at 03:52, Héctor Izquierdo Seliva wrote: I'm also having problems with removetoken. Maybe I'm doing it wrong, but I was under the impression that I just had to call removetoken once. When I take a look at the nodes' ring, the dead node keeps popping up. What's even more incredible is that in some of them it says UP
Re: R: Re: Re: AntiEntropy?
Running nodetool repair causes Cassandra to execute a major compaction This is not what I would call factually accurate. Repair does not run a major compaction. Major compaction is when all SSTables for a CF are compacted down to one SSTable. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 12 Jul 2011, at 10:09, cbert...@libero.it wrote: The book is wrong, at least by current versions of Cassandra (I'm basing that on the quote you pasted, I don't know the context). To be sure that I didn't misunderstand (English is not my mother tongue) here is what the entire repair paragraph says ... Basic Maintenance There are a few tasks that you’ll need to perform before or after more impactful tasks. For example, it makes sense to take a snapshot only after you’ve performed a flush. So in this section we look at some of these basic maintenance tasks: repair, snapshot, and cleanup. Repair Running nodetool repair causes Cassandra to execute a major compaction. A Merkle tree of the data on the target node is computed, and the Merkle tree is compared with those of other replicas. This step makes sure that any data that might be out of sync with other nodes isn’t forgotten. During a major compaction (see “Compaction” in the Glossary), the server initiates a TreeRequest/TreeResponse conversation to exchange Merkle trees with neighboring nodes. The Merkle tree is a hash representing the data in that column family. If the trees from the different nodes don’t match, they have to be reconciled (or “repaired”) in order to determine the latest data values they should all be set to. This tree comparison validation is the responsibility of the org.apache.cassandra.service.AntiEntropyService class. AntiEntropyService implements the Singleton pattern and defines the static Differencer class as well, which is used to compare two trees. If it finds any differences, it launches a repair for the ranges that don’t agree. 
So although Cassandra takes care of such matters automatically on occasion, you can run it yourself as well. nodetool repair must be scheduled by the operator to run regularly. The name repair is a bit unfortunate; it is not meant to imply that it only needs to run when something is wrong. -- / Peter Schuller
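The TreeRequest/TreeResponse exchange described in the excerpt boils down to: hash the data per range, swap the hashes, and repair only the ranges whose hashes disagree. A toy version of that idea (not Cassandra's code; the MD5 bucketing here is a stand-in for token ranges):

```python
# Toy version of the Merkle-tree comparison described above: each node
# hashes its data per range, trees are exchanged, and only ranges whose
# hashes differ need repair. Not Cassandra's implementation.
import hashlib

def bucket_of(key, num_ranges):
    # Deterministic stand-in for token-range assignment.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_ranges

def range_hashes(data, num_ranges):
    buckets = [[] for _ in range(num_ranges)]
    for key, value in data.items():
        buckets[bucket_of(key, num_ranges)].append((key, value))
    # One hash per range, over that range's sorted contents.
    return [hashlib.md5(repr(sorted(b)).encode()).hexdigest() for b in buckets]

def differing_ranges(tree_a, tree_b):
    return [i for i, (a, b) in enumerate(zip(tree_a, tree_b)) if a != b]

node1 = {"k%d" % i: "v" for i in range(100)}
node2 = dict(node1)
node2["k42"] = "stale"  # one out-of-sync value

diff = differing_ranges(range_hashes(node1, 8), range_hashes(node2, 8))
print(diff)  # only the range containing k42 disagrees
```

The win is in the exchange cost: the nodes compare a handful of range hashes instead of streaming 100 rows, and only the disagreeing range is re-synced.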
Re: commitlog replay missing data
Have you verified that the data you expect to see is not in the server after shutdown? WRT the difference between the Memtable data size and the SSTable live size, don't believe everything you read :) Memtable live size is increased by the serialised byte size of every column inserted, and is never decremented. Deletes and overwrites will inflate this value. What was your workload like? As of 0.8 we now have global memory management for CFs that tracks the actual JVM bytes used by a CF. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 12/07/2011, at 3:28 PM, Jeffrey Wang jw...@palantir.com wrote: Hey all, Recently upgraded to 0.8.1 and noticed what seems to be missing data after a commitlog replay on a single-node cluster. I start the node, insert a bunch of stuff (~600MB), stop it, and restart it. There are log messages pertaining to the commitlog replay and no errors, but some of the data is missing. If I flush before stopping the node, everything is fine, and running cfstats in the two cases shows different amounts of data in the SSTables. Moreover, the amount of data that is missing is nondeterministic. Has anyone run into this? Thanks. Here is the output of a side-by-side diff between cfstats outputs for a single CF before restarting (left) and after (right). Somehow a 37MB memtable became a 2.9MB SSTable (note the difference in write count as well)?

Column Family: Blocks                Column Family: Blocks
SSTable count: 0                   | SSTable count: 1
Space used (live): 0               | Space used (live): 2907637
Space used (total): 0              | Space used (total): 2907637
Memtable Columns Count: 8198       | Memtable Columns Count: 0
Memtable Data Size: 37550510       | Memtable Data Size: 0
Memtable Switch Count: 0           | Memtable Switch Count: 1
Read Count: 0                        Read Count: 0
Read Latency: NaN ms.                Read Latency: NaN ms.
Write Count: 8198                  | Write Count: 1526
Write Latency: 0.018 ms.           | Write Latency: 0.011 ms.
Pending Tasks: 0                     Pending Tasks: 0
Key cache capacity: 20               Key cache capacity: 20
Key cache size: 0                    Key cache size: 0
Key cache hit rate: NaN              Key cache hit rate: NaN
Row cache: disabled                  Row cache: disabled
Compacted row minimum size: 0      | Compacted row minimum size: 1110
Compacted row maximum size: 0      | Compacted row maximum size: 2299
Compacted row mean size: 0         | Compacted row mean size: 1960

Note that I patched https://issues.apache.org/jira/browse/CASSANDRA-2317 in my version, but there are no deletions involved so I don't think it's relevant unless I messed something up while patching. -Jeffrey
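Aaron's point about the pre-0.8 memtable accounting can be sketched in a few lines of Python. This is a toy model, not Cassandra's code: the live-size counter grows by the serialised size of every column written and is never decremented, so overwriting one column repeatedly inflates it far beyond what a flush would actually write.

```python
# Toy model of the pre-0.8 memtable "live size" accounting: the counter
# grows on every insert (including overwrites) and never shrinks, so it
# can greatly exceed the bytes that actually reach the SSTable on flush.

def serialized_size(name: bytes, value: bytes) -> int:
    return len(name) + len(value)

class ToyMemtable:
    def __init__(self):
        self.columns = {}       # latest value per column name
        self.live_size = 0      # only ever increases

    def insert(self, name: bytes, value: bytes):
        self.live_size += serialized_size(name, value)
        self.columns[name] = value

    def flushed_size(self) -> int:
        # What actually survives to disk: one value per column.
        return sum(serialized_size(n, v) for n, v in self.columns.items())

mt = ToyMemtable()
for _ in range(10):                     # overwrite the same column ten times
    mt.insert(b"col", b"value-eight")
print(mt.live_size, mt.flushed_size())  # reported size vs. real flush size
```

Ten overwrites of a 14-byte column report a 140-byte live size while the flush would write only 14 bytes, which is the same shape as Jeffrey's 37MB memtable becoming a 2.9MB SSTable.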
Re: Storing counters in the standard column families along with non-counter columns ?
If you can provide some more details on the use case we may be able to provide some data model help. You can always use a dedicated CF for the counters and use the same row key. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 12/07/2011, at 6:36 AM, Aditya Narayan ady...@gmail.com wrote: Oops, that's really very disheartening and it could seriously impact our plans for going live in the near future. Without this facility I guess counters currently have very little usefulness. On Mon, Jul 11, 2011 at 8:16 PM, Chris Burroughs chris.burrou...@gmail.com wrote: On 07/10/2011 01:09 PM, Aditya Narayan wrote: Is there any target version in the near future for which this has been promised? The ticket is problematic in that it would -- unless someone has a clever new idea -- require breaking thrift compatibility to add it to the api. Which is unfortunate, since it would be so useful. If it's in the 0.8.x series it will only be through CQL.
Re: Range query ordering with CQL JDBC
You are probably seeing this http://wiki.apache.org/cassandra/FAQ#range_rp Rows are not ordered by their key; they are ordered by the token created by the partitioner. If you still think there is a problem, provide an example of the data you are seeing and what you expected to see. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 16 Jul 2011, at 06:09, Matthieu Nahoum wrote: Hi Eric, I am using the default partitioner, which is the RandomPartitioner I guess. The key type is String. Are Strings ordered by lexicographic rules? Thanks On Fri, Jul 15, 2011 at 12:04 PM, Eric Evans eev...@rackspace.com wrote: On Thu, 2011-07-14 at 11:07 -0500, Matthieu Nahoum wrote: I am trying to range-query a column family on which the keys are epochs (similar to the output of System.currentTimeMillis() in Java). In CQL (Cassandra 0.8.1 with JDBC driver): SELECT * FROM columnFamily WHERE KEY '130920500'; I can't get a result that makes sense; it always returns the wrong timestamps. So I must be making an error somewhere in the way I input the query value. I tried in clear (like above), in hexadecimal, etc. What is the correct way of doing this? Is it possible that my key is too long? What partitioner are you using? What is the key type? -- Eric Evans eev...@rackspace.com -- --- Engineer at NAVTEQ Berkeley Systems Engineer '10 ENAC Engineer '09 151 N. Michigan Ave. Appt. 3716 Chicago, IL, 60601 USA Cell: +1 (510) 423-1835 http://www.linkedin.com/in/matthieunahoum
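The FAQ entry comes down to this: under the RandomPartitioner, range queries walk rows in MD5-token order, not key order. A small Python sketch (simplified; the real partitioner takes the absolute value of the 128-bit MD5 as a BigInteger, but the effect on ordering is the same):

```python
# Under RandomPartitioner, rows are ordered by the MD5 token of the key,
# not by the key itself, so a range query over string keys comes back in
# what looks like random order.
import hashlib

def token(key: bytes) -> int:
    # Simplified token: the MD5 digest read as an unsigned 128-bit integer.
    return int.from_bytes(hashlib.md5(key).digest(), "big")

keys = [b"1309200000", b"1309200001", b"1309200002"]
print(sorted(keys))             # lexicographic order, what Matthieu expected
print(sorted(keys, key=token))  # token order, the order range queries use
```

That is why epoch-style keys cannot be range-queried in timestamp order with RandomPartitioner; you would need an order-preserving partitioner or a different data model (e.g. timestamps as column names within a row).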
Re: Data overhead discussion in Cassandra
What RF are you using? On disk each column has 15 bytes of overhead, plus the column name and the column value. So for an 8-byte long and an 8-byte double there will be 16 bytes of data and 15 bytes of overhead. The index file also contains the row key, the MD5 token (for RP) and the row offset for the data file. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 15 Jul 2011, at 07:09, Sameer Farooqui wrote: We just set up a demo cluster with Cassandra 0.8.1 with 12 nodes and loaded 1.5 TB of data into it. However, the actual space on disk being used by data files in Cassandra is 3 TB. We're using a standard column family with a million rows (key=string) and 35,040 columns per key. The column name is a long and the column value is a double. I was just hoping to understand more about why the data overhead is so large. We're not using expiring columns. Even considering indexing and bloom filters, it shouldn't have bloated up the data size to 2x the original amount. Or should it have? How can we better anticipate the actual data usage on disk in the future? - Sameer
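As a sanity check on those figures, here is the back-of-envelope arithmetic in Python for Sameer's numbers. It covers per-column overhead only; row headers, the index file, bloom filters, and the replication factor Aaron asks about all add more on top, so treat it as a lower bound, not an exact accounting.

```python
# Rough on-disk estimate from Aaron's figure: ~15 bytes of per-column
# overhead, plus the column name and value.
COLUMN_OVERHEAD = 15  # bytes per column on disk

def column_disk_bytes(name_len: int, value_len: int) -> int:
    return COLUMN_OVERHEAD + name_len + value_len

rows = 1_000_000
cols_per_row = 35_040
raw = rows * cols_per_row * (8 + 8)                        # long name + double value
on_disk = rows * cols_per_row * column_disk_bytes(8, 8)    # with overhead
print(raw, on_disk)  # raw payload bytes vs. estimated bytes on disk
```

For 8-byte names and values the overhead nearly doubles the payload (31 bytes stored per 16 bytes of data) before replication is even counted, which is why Aaron's first question is about the RF.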
Re: Thrift Java Client - Get a column family from a Keyspace
Currently the only way for that would be iterating through the list of column families returned by the getCf_defs() method. Yes. BTW most people access Cassandra via a higher-level client; the Java peeps tend to use either Hector or Pelops. Aside from not having to code against thrift, they also provide connection management and retry features that are dead handy. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 14 Jul 2011, at 23:59, Chandrasekhar M wrote: Hi I have been playing around with Cassandra and its Java Thrift client. From my understanding, one can retrieve a keyspace definition, a KsDef object, using the describe_keyspace(String name) method on the Cassandra.Client object. Subsequently, one can get a list of all the ColumnFamily definitions in a keyspace using the getCf_defs() method on the KsDef object. Is there a way to get a single ColumnFamily if I know the name of the column family (just a convenience function)? Currently the only way for that would be iterating through the list of column families returned by the getCf_defs() method. Thanks in Advance Chandra
Re: What available Cassandra schema documentation is available?
Indexes are not supported on sub columns. Also, your definition seems to mix standard and sub columns together in the CF. For a super CF all top level columns contain sub columns. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 14 Jul 2011, at 19:39, Andreas Markauskas wrote: I couldn't find any schema example for a supercolumn column family that is strongly typed. For example, create column family Super1 with comparator=UTF8Type and column_type=Super and key_validation_class=UTF8Type and column_metadata = [ {column_name: username, validation_class:UTF8Type}, {column_name: email, validation_class:UTF8Type, index_type: KEYS}, {column_name: address, validation_class:UTF8Type, subcolumn_metadata = [ {column_name: street, validation_class:UTF8Type}, {column_name: state, validation_class:UTF8Type, index_type: KEYS} ] } ]; Or does someone know a better method? I'd like to make it as painless as possible for developers with a strongly typed schema so as to avoid orphan data.
Re: thrift install
Why are you installing Thrift? The Cassandra binary packages contain all the dependencies. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 19 Jul 2011, at 07:51, Sal Lopez wrote: Does anyone have documentation/tips for installing thrift on a server that does not have access to the internet? See error below: Buildfile: build.xml setup.init: [mkdir] Created dir: /tmp/thrift-0.6.1/lib/java/build [mkdir] Created dir: /tmp/thrift-0.6.1/lib/java/build/lib [mkdir] Created dir: /tmp/thrift-0.6.1/lib/java/build/tools [mkdir] Created dir: /tmp/thrift-0.6.1/lib/java/build/test mvn.ant.tasks.download: [get] Getting: http://repo1.maven.org/maven2/org/apache/maven/maven-ant-tasks/2.1.3/maven-ant-tasks-2.1.3.jar [get] To: /tmp/thrift-0.6.1/lib/java/build/tools/maven-ant-tasks-2.1.3.jar [get] Error getting http://repo1.maven.org/maven2/org/apache/maven/maven-ant-tasks/2.1.3/maven-ant-tasks-2.1.3.jar to /tmp/thrift-0.6.1/lib/java/build/tools/maven-ant-tasks-2.1.3.jar BUILD FAILED java.net.ConnectException: Connection timed out Thanks. Sal
Re: How to keep only exactly column of key
There is no support for a feature like that, and I doubt it would ever be supported. For one, there are no locks during a write, so it's not possible to definitively say there are 100 columns at a particular instant in time. You would need to read all the columns and delete the ones you no longer need. You could also try Redis. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 19 Jul 2011, at 03:22, JKnight JKnight wrote: Dear all, I want to keep only 100 columns for a key: when I add a column for a key, if the number of columns for the key is 100, another column (by order) will be deleted. Does Cassandra have a setting for that? -- Best regards, JKnight
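The read-then-delete approach Aaron describes has to live in the client. A Python sketch of just the trimming logic (the column layout here is hypothetical, and as Aaron notes there are no locks, so two concurrent writers can briefly push a row past the limit):

```python
# Client-side trimming: after inserting, read the row's columns, keep the
# newest 100, and delete the rest. Column names map to timestamps here;
# "oldest" means smallest timestamp.
MAX_COLUMNS = 100

def trim_row(columns, limit=MAX_COLUMNS):
    """columns: dict of column name -> timestamp. Returns names to delete."""
    if len(columns) <= limit:
        return []
    by_age = sorted(columns, key=columns.get)   # oldest first
    return by_age[: len(columns) - limit]

cols = {f"c{i:03d}": i for i in range(105)}     # 105 columns, timestamp == i
doomed = trim_row(cols)
print(doomed)                                    # the five oldest column names
```

The caller would then issue deletes for the returned names. Because the check and the deletes are separate operations, the cap is eventually consistent rather than exact, which is exactly the "no locks" caveat in the reply.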
Re: b-tree
Just throwing out a (half baked) idea, perhaps the Nested Set Model of trees would work http://en.wikipedia.org/wiki/Nested_set_model * Every row would represent a set, with a left and right encoded into the key * Members are inserted as columns into *every* set / row they are a member of. So we are de-normalising and trading space for time. * May need to maintain a custom secondary index of the materialised sets, e.g. slice a row to get the first column = the left value you are interested in; that is the key for the set. I've not thought it through much further than that; a lot would depend on your data. The top sets may get very big. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 08:33, Jeffrey Kesselman wrote: I'm not sure if I have an answer for you, anyway, but I'm curious. A b-tree and a binary tree are not the same thing. A binary tree is a basic fundamental data structure; a b-tree is an approach to storing and indexing data on disc for a database. Which do you mean? On Wed, Jul 20, 2011 at 4:30 PM, Eldad Yamin elda...@gmail.com wrote: Hello, Is there any good way of storing a binary tree in Cassandra? I wonder if someone has already implemented something like that, and how they accomplished it without transaction support (while the tree keeps evolving)? I'm asking because I want to save geospatial data, and SimpleGeo did it using a b-tree: http://www.readwriteweb.com/cloud/2011/02/video-simplegeo-cassandra.php Thanks! -- It's always darkest just before you are eaten by a grue.
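For anyone unfamiliar with the nested set model Aaron links to, here is a minimal Python sketch of the numbering itself (the model, not a Cassandra schema): each node gets a (left, right) pair from a depth-first walk, and a node's descendants are exactly those whose left value falls inside its own interval.

```python
# Nested set numbering: a depth-first walk assigns each node a (left, right)
# pair; node X is a descendant of node Y iff Y.left < X.left < Y.right.

def number_tree(tree, root, counter=None):
    """tree: dict of node -> list of children. Returns node -> (left, right)."""
    if counter is None:
        counter = [1]
    spans = {}
    left = counter[0]; counter[0] += 1
    for child in tree.get(root, []):
        spans.update(number_tree(tree, child, counter))
    right = counter[0]; counter[0] += 1
    spans[root] = (left, right)
    return spans

tree = {"root": ["a", "b"], "a": ["a1", "a2"]}
spans = number_tree(tree, "root")
l, r = spans["a"]
descendants = [n for n, (nl, _) in spans.items() if l < nl < r]
print(spans, descendants)
```

This containment-by-interval property is what makes the "materialise every member into every ancestor set's row" trick in Aaron's sketch possible; the hard part, as the follow-up thread notes, is renumbering on insert without transactions.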
Re: Data Visualization Best Practices
This project may provide some inspiration https://github.com/driftx/chiton Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 06:36, Selcuk Bozdag wrote: Hi, Cassandra provides a flexible scheme-less data storage facility which is a perfect match for one of our projects. However, regarding the requirements it is also necessary to list the CFs in a tabular fashion. I searched on the Internet for some guidelines but could not get a handy practice for viewing such scheme-less data. Have you experienced such a case where you required to show CFs (which obviously may not have the same columns) inside tables? What would be the most relevant way of showing such data? Regards, Selcuk
Re: Repair taking a long, long time
The first thing to do is understand what the server is doing. As Edward said, there are two phases to the repair: first the differences are calculated, and then they are shared between the neighbours. Let's add a third step: once the neighbour gets the data it has to rebuild the indexes and bloom filter; not huge, but let's include it for completeness. So... 0. Check for ERRORs in the log. 1. Check nodetool compactionstats; if the Merkle tree build is going on it will say Validation Compaction. Run it twice and check for progress. 2. Check nodetool netstats; this will show which segments of the data are being streamed. Run it twice and check for progress. 3. Check nodetool compactionstats; if the data has completed streaming and indexes are being built it will say SSTable build. Once we know what stage of the repair your server is at it's possible to reason about what is going on. If you want to dive deeper, look for log messages on the machine you started the repair on from the AntiEntropyService. Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 02:31, David Boxenhorn wrote: As I indicated below (but didn't say specifically) another option is to set read repair chance to 1.0 for all your CFs and loop over all your data, since a read triggers a read repair. On Wed, Jul 20, 2011 at 4:58 PM, Maxim Potekhin potek...@bnl.gov wrote: I can re-load all the data that I have in the cluster, from a flat-file cache I have on NFS, many times faster than nodetool repair takes. And that's not even accurate, because as others noted nodetool repair eats up disk space for breakfast and takes more than 24hrs on a 200GB data load, at which point I have to cancel. That's not acceptable. I simply don't know what to do now. On 7/20/2011 8:47 AM, David Boxenhorn wrote: I have this problem too, and I don't understand why.
I can repair my nodes very quickly by looping through all my data (when you read your data it does read-repair), but nodetool repair takes forever. I understand that nodetool repair builds Merkle trees, etc. etc., so it's a different algorithm, but why can't nodetool repair be smart enough to choose the best algorithm? Also, I don't understand what's special about my data that makes nodetool repair so much slower than looping through all my data. On Wed, Jul 20, 2011 at 12:18 AM, Maxim Potekhin potek...@bnl.gov wrote: Thanks Edward. I'm told by our IT that the switch connecting the nodes is pretty fast. Seriously, in my house I copy complete DVD images from my bedroom to the living room downstairs via WiFi, and a dozen GB does not seem like a problem, on dirt cheap hardware (Patriot Box Office). I also have just _one_ major column family, but caveat emptor -- 8 indexes attached to it (and there will be more). There is one accounting CF which is small and can't possibly make a difference. By contrast, compaction (as in nodetool) performs quite well on this cluster. I'm starting to suspect some sort of malfunction. Looking at the system log during the repair, there is some compaction agent doing work that I'm not sure makes sense (and I didn't call for it). Disk utilization all of a sudden goes up to 40% per Ganglia, and stays there; this is pretty silly considering the cluster is IDLE and we have SSDs. No external writes, no reads. There are occasional GC stoppages, but these I can live with. This repair debacle has happened twice in a row. Cr@p. I need to go to production soon and that doesn't look good at all. If I can't manage a system that simple (and/or get help on this list) I may have to cut losses, i.e. stay with Oracle. Regards, Maxim On 7/19/2011 12:16 PM, Edward Capriolo wrote: Well most SSDs are pretty fast. There is one more thing to consider. If Cassandra determines nodes are out of sync it has to transfer data across the network.
If that is the case you have to look at nodetool netstats and determine how much data is being transferred between nodes. There are some open tickets where, with larger tables, repair streams more than it needs to. But even if the transfers are only 10% of your 200GB, transferring 20 GB is not trivial. If you have multiple keyspaces and column families, repairing them one at a time might make the process more manageable.
Re: node repair eat up all disk io and slow down entire cluster(3 nodes)
If you have never run repair, also check the section on repair on this page http://wiki.apache.org/cassandra/Operations about how frequently it should be run. There is an issue where repair can stream too much data, and this can lead to excessive disk use. My non-scientific approach to the never-run-repair-before problem is to repair a single CF at a time, starting with the small ones that are less likely to have differences, as they will stream the smallest amount of data. If you really want to conserve disk IO during the repair, consider disabling minor compaction by setting the min and max thresholds to 0 via nodetool. Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 20/07/2011, at 11:46 PM, Yan Chunlu springri...@gmail.com wrote: just found this: https://issues.apache.org/jira/browse/CASSANDRA-2156 but it seems only available for 0.8, and people submitted a patch for 0.6. I am using 0.7.4; do I need to dig into the code and make my own patch? Does adding compaction throttling solve the IO problem? thanks! On Wed, Jul 20, 2011 at 4:44 PM, Yan Chunlu springri...@gmail.com wrote: at the beginning of using cassandra, I had no idea that I should run node repair frequently, so basically I have 3 nodes with RF=3 and have not run node repair for months; the data size is 20G. the problem is that when I start running node repair now, it eats up all disk IO and the server load becomes 20+ and increasing. the worst thing is, the entire cluster has slowed down and cannot handle requests, so I have to stop it immediately because it makes my web service unavailable. the server has an Intel Xeon-Lynnfield 3470-Quadcore [2.93GHz] and 8G memory, with a Western Digital WD RE3 WD1002FBYS SATA disk. I really have no idea what to do now, as I have already found some data loss; any suggestions would be appreciated. -- 闫春路
Re: PHPCassa get number of rows
Cassandra does not provide a way to count the number of rows; the best you can do is a series of range calls and count them on the client side http://thobbs.github.com/phpcassa/tutorial.html If this is something you need in your app, consider creating a custom secondary index to store the row keys and counting the columns. NOTE: counting columns just reads all the columns, so for a big row it can result in an OOM. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 20/07/2011, at 8:29 AM, Jean-Nicolas Boulay Desjardins jnbdzjn...@gmail.com wrote: Hi, How can I get the number of rows with PHPCassa? Thanks in advance.
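The series-of-range-calls approach looks roughly like this in Python, with fetch_after as a stand-in for the real phpcassa get_range / thrift get_range_slices call (the real API returns the start key inclusively, so real code also has to drop the duplicate; that detail is simplified away here):

```python
# Count rows client-side by paging through range calls: fetch a batch of
# keys, restart the next call from the last key seen, stop on a short page.

def count_rows(fetch_after, batch=100):
    """fetch_after(start_key, n) -> up to n row keys strictly after start_key
    (start_key=None means from the beginning of the range)."""
    count, start = 0, None
    while True:
        page = fetch_after(start, batch)
        count += len(page)
        if len(page) < batch:       # short page: we've walked the whole range
            return count
        start = page[-1]

# Stand-in for the real range call, backed by an in-memory key list:
all_keys = [f"k{i:02d}" for i in range(7)]
def fake_fetch(start, n):
    keys = all_keys if start is None else [k for k in all_keys if k > start]
    return keys[:n]

print(count_rows(fake_fetch, batch=3))
```

As the reply warns, this is O(rows) work on every count, which is why a maintained counter or custom index is the better answer if the app needs the number often.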
Re: with proof Re: cassandra goes infinite loop and data lost.....
Personally I would do a repair first if you need to do one, just so you are confident everything is where it should be. Then do the move as described in the wiki. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 15:14, Yan Chunlu wrote: sorry for the misunderstanding. I saw many N of 2147483647 where N=0 and thought it was not doing anything. my node was very unbalanced and I intended to rebalance it with nodetool move after a node repair; does that cause the slices to be much larger?

Address     Status  State   Load      Owns    Token
                                              84944475733633104818662955375549269696
10.28.53.2  Down    Normal  71.41 GB  81.09%  52773518586096316348543097376923124102
10.28.53.3  Up      Normal  14.72 GB  10.48%  70597222385644499881390884416714081360
10.28.53.4  Up      Normal  13.5 GB   8.43%   84944475733633104818662955375549269696

should I do nodetool move according to http://wiki.apache.org/cassandra/Operations#Load_balancing before doing repair? thank you for your help! On Thu, Jul 21, 2011 at 10:47 AM, Jonathan Ellis jbel...@gmail.com wrote: This is not an infinite loop; you can see the column objects being iterated over are different. Like I said last time, I do see that it's saying N of 2147483647, which looks like you're doing slices with a much larger limit than is advisable.
On Wed, Jul 20, 2011 at 9:00 PM, Yan Chunlu springri...@gmail.com wrote: this time it is another node, the node goes down during repair, and come back but never up, I change log level to DEBUG and found out it print out the following message infinitely DEBUG [main] 2011-07-20 20:58:16,286 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: 76616c7565:false:6@1311207851757243 DEBUG [main] 2011-07-20 20:58:16,319 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: 76616c7565:false:98@1306722716288857 DEBUG [main] 2011-07-20 20:58:16,424 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: 76616c7565:false:95@1311089980134545 DEBUG [main] 2011-07-20 20:58:16,611 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: 76616c7565:false:85@1311154048866767 DEBUG [main] 2011-07-20 20:58:16,754 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: 76616c7565:false:366@1311207176880564 DEBUG [main] 2011-07-20 20:58:16,770 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: 76616c7565:false:80@1310443605930900 DEBUG [main] 2011-07-20 20:58:16,816 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: 76616c7565:false:486@1311173929610402 DEBUG [main] 2011-07-20 20:58:16,870 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: 76616c7565:false:101@1310818289021118 DEBUG [main] 2011-07-20 20:58:17,041 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: 76616c7565:false:677@1311202595772170 DEBUG [main] 2011-07-20 20:58:17,047 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: 76616c7565:false:374@1311147641237918 On Thu, Jul 14, 2011 at 1:36 PM, Jonathan Ellis jbel...@gmail.com wrote: That says I'm collecting data to answer requests. I don't see anything here that indicates an infinite loop. I do see that it's saying N of 2147483647 which looks like you're doing slices with a much larger limit than is advisable (good way to OOM the way you already did). 
On Wed, Jul 13, 2011 at 8:27 PM, Yan Chunlu springri...@gmail.com wrote: I gave cassandra 8GB heap size and somehow it run out of memory and crashed. after I start it, it just runs in to the following infinite loop, the last line: DEBUG [main] 2011-07-13 22:19:00,586 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: 100zs:false:14@1310168625866434 goes for ever I have 3 nodes and RF=2, so I am losing data. is that means I am screwed and can't get it back? DEBUG [main] 2011-07-13 22:19:00,585 SliceQueryFilter.java (line 123) collecting 20 of 2147483647: q74k:false:14@1308886095008943 DEBUG [main] 2011-07-13 22:19:00,585 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: 10fbu:false:1@1310223075340297 DEBUG [main] 2011-07-13 22:19:00,586 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: apbg:false:13@1305641597957086 DEBUG [main] 2011-07-13 22:19:00,586 SliceQueryFilter.java (line 123) collecting 1 of 2147483647: auje:false:13@1305641597957075 DEBUG [main] 2011-07-13 22:19:00,586 SliceQueryFilter.java (line 123) collecting 2 of 2147483647: ayj8:false:13@1305641597957060 DEBUG [main] 2011-07-13 22:19:00,586 SliceQueryFilter.java (line 123) collecting 3 of 2147483647: b4fz:false:13@1305641597957096 DEBUG [main] 2011-07-13 22:19:00,586 SliceQueryFilter.java (line 123
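On the rebalancing question raised earlier in this thread: the usual way to pick balanced RandomPartitioner tokens for nodetool move is the standard formula, token_i = i * 2**127 / n. A quick Python sketch:

```python
# Balanced RandomPartitioner tokens for an n-node ring: node i gets
# i * 2**127 // n, splitting the 0..2**127 token space into equal arcs.

def balanced_tokens(n):
    return [i * 2**127 // n for i in range(n)]

for t in balanced_tokens(3):
    print(t)
```

Comparing these with the 81% / 10% / 8% ownership in the ring output above shows how far the current token assignment is from an even split.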
Re: reset keys_cached
To clear the key cache use the invalidateKeyCache() operation on the column family in JConsole / JMX Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 18:15, 魏金仙 wrote: Can any one tell how to reset keys_cached? Thanks.
Re: Memtables stored in which location
Try the project wiki here http://wiki.apache.org/cassandra/ArchitectureOverview or my own blog here http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/ There is also a list of articles on the wiki here http://wiki.apache.org/cassandra/ArticlesAndPresentations In short, writes go to the commit log first, then the memtable in memory, which is later flushed to disk. A read is served from potentially multiple sstables and memtables. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 21:17, CASSANDRA learner wrote: Hi, You are right but I too have some concerns... Anyway, somewhere the memtable has to be stored, right? Like we say memtable data is flushed to create an sstable on disk. Exactly which location or memory does it come from? Is it like an object stream, or is it storing the values in the commitlog? My next question is: data is written to the commit log, so all the data is available there, and the sstables are getting created on disk, so where and when do these memtables come into the picture? On Thu, Jul 21, 2011 at 1:44 PM, samal sa...@wakya.in wrote: The SSTable is stored on disk, not the memtable. A memtable is an in-memory representation of data, which is flushed to create an SSTable on disk. This is the location where SSTables are stored https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L71 whereas the commitlog, which is a backup (log) for memtable replaying, is stored in the https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L75 location. Once all memtables are flushed to disk, a new commit log segment is created. On Thu, Jul 21, 2011 at 1:12 PM, Abdul Haq Shaik abdulsk.cassan...@gmail.com wrote: Hi, Can you please let me know where exactly the memtables are getting stored? I wanted to know the physical location
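The write and read paths summarised in the reply above can be modelled in a few lines of Python. This is a toy, not Cassandra's implementation: writes append to a commit log and then update the in-memory memtable; a flush turns the memtable into an immutable sstable; reads consult the memtable first, then sstables newest-first.

```python
# Toy model of the write path: commit log -> memtable -> (flush) -> sstable,
# with reads merging the memtable and sstables, newest data winning.

class ToyNode:
    def __init__(self, flush_threshold=2):
        self.commitlog = []
        self.memtable = {}
        self.sstables = []              # immutable snapshots, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commitlog.append((key, value))   # 1. durable log first
        self.memtable[key] = value            # 2. then the memtable
        if len(self.memtable) >= self.flush_threshold:
            self.sstables.append(dict(self.memtable))   # 3. flush to "disk"
            self.memtable = {}
            self.commitlog.clear()            # old segment no longer needed

    def read(self, key):
        if key in self.memtable:              # newest data first
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None

node = ToyNode()
node.write("a", 1); node.write("b", 2)        # second write triggers a flush
node.write("a", 3)                            # newer value lives in memtable
print(node.read("a"), node.read("b"), len(node.sstables))
```

This also answers the "where are memtables stored" question in the thread: nowhere on disk; only the commit log (for crash replay) and the flushed sstables ever hit the filesystem.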
Re: cassandra massive write perfomance problem
background http://wiki.apache.org/cassandra/FAQ#slows_down_after_lotso_inserts Without more info my initial guess is some GC pressure and/or IO pressure from compaction. Check the logs for messages from the GCInspector, or connect JConsole to the instance and take a look at the heap. Here is some info on looking at the IO stats http://spyced.blogspot.com/2010/01/linux-performance-basics.html With regard to the 25+GB on disk, that all depends on how much data you are writing. Be aware that compacted files are not immediately deleted http://wiki.apache.org/cassandra/FAQ#cleaning_compacted_tables You may also want to track things by looking at nodetool tpstats and cfstats (for latency). Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 21:49, lebron james wrote: Please help me solve one problem. I have a server with 4 GB RAM and 2x 4-core CPUs. When I start doing massive writes in Cassandra all works fine, but after a couple of hours at 10K inserts per second the database grows to 25+ GB and performance drops to 500 inserts per second. I think this is because compaction is very slow and I don't understand why: I set 8 concurrent compaction threads but Cassandra doesn't use 8 threads; only 2 cores are loaded.
Re: node repair eat up all disk io and slow down entire cluster(3 nodes)
What are you seeing in compaction stats? You may be seeing some of https://issues.apache.org/jira/browse/CASSANDRA-2280 Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 23:17, Yan Chunlu wrote: after trying nodetool -h reagon repair key cf, I found that even repairing a single CF involves rebuilding all sstables (per nodetool compactionstats); is that normal? On Thu, Jul 21, 2011 at 7:56 AM, Aaron Morton aa...@thelastpickle.com wrote: If you have never run repair, also check the section on repair on this page http://wiki.apache.org/cassandra/Operations about how frequently it should be run. There is an issue where repair can stream too much data, and this can lead to excessive disk use. My non-scientific approach to the never-run-repair-before problem is to repair a single CF at a time, starting with the small ones that are less likely to have differences, as they will stream the smallest amount of data. If you really want to conserve disk IO during the repair, consider disabling minor compaction by setting the min and max thresholds to 0 via nodetool. Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 20/07/2011, at 11:46 PM, Yan Chunlu springri...@gmail.com wrote: just found this: https://issues.apache.org/jira/browse/CASSANDRA-2156 but it seems only available for 0.8, and people submitted a patch for 0.6. I am using 0.7.4; do I need to dig into the code and make my own patch? Does adding compaction throttling solve the IO problem? thanks! On Wed, Jul 20, 2011 at 4:44 PM, Yan Chunlu springri...@gmail.com wrote: at the beginning of using cassandra, I had no idea that I should run node repair frequently, so basically I have 3 nodes with RF=3 and have not run node repair for months; the data size is 20G.
the problem is when I start running node repair now, it eat up all disk io and the server load became 20+ and increasing, the worst thing is, the entire cluster has slowed down and can not handle request. so I have to stop it immediately because it make my web service unavailable. the server has Intel Xeon-Lynnfield 3470-Quadcore [2.93GHz] and 8G memory, with Western Digital WD RE3 WD1002FBYS SATA disk. I really have no idea what to do now, as currently I have already found some data loss, any suggestions would be appreciated. -- 闫春路 -- 闫春路
Re: Memtables stored in which location
The data file with rows and columns, the bloom filter for the rows in the data file, the index for rows in the data file, and the statistics. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 23:26, Nilabja Banerjee wrote: One more thing I want to ask here... in the data folder of cassandra, for each column family four types of .db files are generated, for example: CFname-f-1-Data.db, CFname-f-1-Filter.db, CFname-f-1-Index.db, CFname-f-1-Statistic.db. What are these extensions? Thank you On 21 July 2011 16:11, samal sa...@wakya.in wrote: Any ways, somewhere the memtable has to be stored, right? Like we say memtable data is flushed to create an sstable on disk. Exactly which location or memory does it come from? Is it like an object stream, or is it storing the values in the commitlog? A Memtable is Cassandra's in-memory representation of key/value pairs. My next question is: data is written to the commit log, all the data is available there, and the sstables are getting created on disk, so where and when do these memtables come into the picture? The commitlog is an append-only file which records writes sequentially (more at [2]); it can be thought of as a replay log used to recalculate the data for memtables in case of a crash. A write first hits the CommitLog, then Cassandra stores/writes values to in-memory data structures called Memtables. The Memtables are flushed to disk whenever one of the configurable thresholds is met.[3] For each column family there is a corresponding memtable. There is generally one commitlog file for all CFs. SSTables are immutable; once written to disk they cannot be modified. An SSTable will only be replaced by a new SSTable after compaction. [1]http://wiki.apache.org/cassandra/ArchitectureOverview [2]http://wiki.apache.org/cassandra/ArchitectureCommitLog [3]http://wiki.apache.org/cassandra/MemtableThresholds
Re: Need help json2sstable
mmm, there is no -f option for sstable2json / SSTableExport. Datastax guys/girls?? This works for me: bin/sstable2json /var/lib/cassandra/data/dev/data-g-1-Data.db -k 666f6f output.txt NOTE: the key is binary, so that's the ASCII encoding for foo. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 23:19, Nilabja Banerjee wrote: This is the full path of the SSTables: /Users/nilabja/Development/Cassandra/apache-cassandra-0.7.5/data/cctest/BTP-f-1-Data.db cctest = keyspace, BTP = column family name, json file = /Users/nilabja/Development/Cassandra/testjson.txt The commands are: bin/sstable2json -f output.txt /Users/nilabja/Development/Cassandra/apache-cassandra-0.7.5/data/cctest1/BTP-f-1-Data.db -k keyname bin/json2sstable -k cctest -c BTP /Users/nilabja/Desktop/testjson.txt /Users/nilabja/Development/Cassandra/apache-cassandra-0.7.5/data/json2sstable/Fetch_CCDetails-f-1-Data.db Thank You On 21 July 2011 16:07, aaron morton aa...@thelastpickle.com wrote: What is the command line you are executing? That error is only returned by sstable2json when an sstable path is not passed on the command line. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 18:50, Nilabja Banerjee wrote: Thank you... but I have already gone through that... but it's still not working... I am getting: You must supply exactly one sstable Can you tell me why I am getting this? On 21 July 2011 02:41, Tyler Hobbs ty...@datastax.com wrote: The sstable2json/json2sstable format is detailed here: http://www.datastax.com/docs/0.7/utilities/sstable2json On Wed, Jul 20, 2011 at 4:58 AM, Nilabja Banerjee nilabja.baner...@gmail.com wrote: On 20 July 2011 11:33, Nilabja Banerjee nilabja.baner...@gmail.com wrote: Hi All, Here is my JSON structure: {Fetch_CC :{ cc:{ :1000, :ICICI, :, city:{ name:banglore }; }; } If the structure is incorrect, please give me one small structure to use with the below utility.
I am using the 0.7.5 version. Now how can I use the json2sstable utilities? Please provide me the steps. What are the things I have to configure? Thank You -- Tyler Hobbs Software Engineer, DataStax Maintainer of the pycassa Cassandra Python client library
Re: b-tree
But how will you be able to maintain it while it evolves and new data is added without transactions? What is the situation you think you need transactions for ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 00:06, Eldad Yamin wrote: Aaron, Nested set is exactly what I had in mind. But how will you be able to maintain it while it evolves and new data is added without transactions? Thanks! On Thu, Jul 21, 2011 at 1:44 AM, aaron morton aa...@thelastpickle.com wrote: Just throwing out a (half baked) idea, perhaps the Nested Set Model of trees would work http://en.wikipedia.org/wiki/Nested_set_model * Every row would represent a set with a left and right encoded into the key * Members are inserted as columns into *every* set / row they are a member of. So we are de-normalising and trading space for time. * May need to maintain a custom secondary index of the materialised sets. e.g. slice a row to get the first column >= the left value you are interested in, that is the key for the set. I've not thought it through much further than that, a lot would depend on your data. The top sets may get very big. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 08:33, Jeffrey Kesselman wrote: I'm not sure if I have an answer for you, anyway, but I'm curious. A b-tree and a binary tree are not the same thing. A binary tree is a basic fundamental data structure; a b-tree is an approach to storing and indexing data on disc for a database. Which do you mean? On Wed, Jul 20, 2011 at 4:30 PM, Eldad Yamin elda...@gmail.com wrote: Hello, Is there any good way of storing a binary-tree in Cassandra? I wonder if someone has already implemented something like that, and how they accomplished it without transaction support (while the tree keeps evolving)? 
I'm asking that because I want to save geospatial-data, and SimpleGeo did it using a b-tree: http://www.readwriteweb.com/cloud/2011/02/video-simplegeo-cassandra.php Thanks! -- It's always darkest just before you are eaten by a grue.
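For anyone following the Nested Set Model link, the core trick is small enough to sketch. The following is an illustrative Python sketch (names are made up, nothing here is a Cassandra API): a depth-first walk assigns each node a (left, right) pair, and set membership then reduces to range comparisons, which is what makes the read side transaction-free.

```python
def assign_bounds(tree, root):
    """Label each node with (left, right) bounds via a depth-first walk.
    `tree` maps a node name to a list of child names."""
    bounds = {}
    counter = [1]

    def walk(node):
        left = counter[0]
        counter[0] += 1
        for child in tree.get(node, []):
            walk(child)
        bounds[node] = (left, counter[0])  # right bound closes after children
        counter[0] += 1

    walk(root)
    return bounds

def is_ancestor(bounds, a, b):
    """True if set `a` contains node `b`, i.e. a's range encloses b's."""
    al, ar = bounds[a]
    bl, br = bounds[b]
    return al < bl and br < ar

tree = {"root": ["left", "right"], "left": ["leaf"]}
b = assign_bounds(tree, "root")
# root spans the whole range (1, 8), so it encloses every other node
```

In the Cassandra mapping Aaron describes, the (left, right) pair would be encoded into the row key and members denormalised into every enclosing set's row.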
Re: Compacting manual managing and optimization
See the online help in cassandra-cli on CREATE / UPDATE COLUMN FAMILY for min_compaction_threshold and max_compaction_threshold. Also look in the cassandra.yaml file for information on configuring compaction. If compaction is really hurting your system it may be a sign that you need to scale up or make some other changes. What does your cluster look like ? # nodes, load per node, throughput, # clients etc Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 00:30, lebron james wrote: Hi! Please tell me how I can manage the compaction process, turn it off and start it manually when I need to. How can I improve the performance of the compaction process? Thanks!
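For reference, the cli statement Aaron points at looks roughly like this in the 0.7/0.8 era (a sketch; verify the exact syntax against `help update column family;` in your version, and note that `Standard1` is a placeholder column family name). Setting both thresholds to 0 disables minor compaction so it can then be triggered manually:

```
update column family Standard1 with min_compaction_threshold = 0 and max_compaction_threshold = 0;
```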
Re: Need help json2sstable
In my DB the keys added by the client were ascii strings like foo, but these are stored as binary arrays in cassandra. So I cannot use the string foo with sstable2json; I have to use the ASCII hex encoding 666f6f . This will *probably* be what you see in the output from cassandra-cli list (unless you have either set a key_validation_class for the CF or used the assume statement). If one way does not work try the other. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 01:15, Nilabja Banerjee wrote: Thank You... But truly speaking I didn't get what you mean by key is binary, so that's the ascii encoding for foo and another thing... this is the output of the list BTP command RowKey: 0902 = (super_column=0902, (column=30, value=303039303030303032, timestamp=1310471032735000) (column=31, value=303139303030303032, timestamp=1310471032737000) (column=3130, value=30313039303030303032, timestamp=131047103275) (column=3131, value=30313139303030303032, timestamp=1310471032752000) (column=3132, value=30313239303030303032, timestamp=1310471032753000) (column=3133, value=30313339303030303032, timestamp=1310471032755000) (column=3134, value=30313439303030303032, timestamp=1310471032757000) (column=3135, value=30313539303030303032, timestamp=1310471032758000) (column=3136, value=30313639303030303032, timestamp=131047103276) (column=3137, value=30313739303030303032, timestamp=1310471032761000) (column=3138, value=30313839303030303032, timestamp=1310471032763000) (column=3139, value=30313939303030303032, timestamp=1310471032764000) (column=32, value=303239303030303032, timestamp=1310471032738000) (column=3230, value=30323039303030303032, timestamp=1310471032766000) (column=3231, value=30323139303030303032, timestamp=1310471032767000) (column=3232, value=30323239303030303032, timestamp=1310471032769000) (column=3233, value=30323339303030303032, timestamp=1310471032771000) (column=3234, value=30323439303030303032, 
timestamp=1310471032772000) (column=3235, value=30323539303030303032, timestamp=1310471032774000) (column=3236, value=30323639303030303032, timestamp=1310471032775000) (column=3237, value=30323739303030303032, timestamp=1310471032776000) (column=3238, value=30323839303030303032, timestamp=1310471032778000) (column=3239, value=30323939303030303032, timestamp=131047103278) (column=33, value=303339303030303032, timestamp=131047103274) How can I Use this facility sstable2json ? Thank you for keeping your patience.. ;) On 21 July 2011 17:33, aaron morton aa...@thelastpickle.com wrote: mmm, there is no -f option for sstable2json / SSTableExport. Datastax guys/girls ?? this works for me bin/sstable2json /var/lib/cassandra/data/dev/data-g-1-Data.db -k 666f6f output.txt NOTE: key is binary, so thats the ascii encoding for foo Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 23:19, Nilabja Banerjee wrote: This is the full path of SSTables: /Users/nilabja/Development/Cassandra/apache-cassandra-0.7.5/data/cctest/BTP-f-1-Data.db cctest= keyspace BTP= Columnfamily name json file= /Users/nilabja/Development/Cassandra/testjson.txt commands are: bin/sstable2json -f output.txt /Users/nilabja/Development/Cassandra/apache-cassandra-0.7.5/data/cctest1/BTP-f-1-Data.db -k keyname bin/json2sstable -k cctest -c BTP /Users/nilabja/Desktop/testjson.txt /Users/nilabja/Development/Cassandra/apache-cassandra-0.7.5/data/json2sstable/Fetch_CCDetails-f-1-Data.db Thank You On 21 July 2011 16:07, aaron morton aa...@thelastpickle.com wrote: What is the command line you are executing ? That error is only returned by sstable2json when an sstable path is not passed on the command line. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 18:50, Nilabja Banerjee wrote: Thank you... but I have already gone through that.. but still not working... I am getting .. 
You must supply exactly one sstable Can you tell me why I am getting this? On 21 July 2011 02:41, Tyler Hobbs ty...@datastax.com wrote: The sstable2json/json2sstable format is detailed here: http://www.datastax.com/docs/0.7/utilities/sstable2json On Wed, Jul 20, 2011 at 4:58 AM, Nilabja Banerjee nilabja.baner...@gmail.com wrote: On 20 July 2011 11:33, Nilabja Banerjee nilabja.baner...@gmail.com wrote: Hi All, Here Is my Json structure. {Fetch_CC :{ cc:{ :1000, :ICICI, :, city
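Aaron's hex-key point is easy to verify locally. Below is a generic Python sketch (not part of any Cassandra tool; the function names are made up) showing why the string foo becomes 666f6f on the sstable2json command line:

```python
# Row keys are stored as raw bytes, and command-line tools like sstable2json
# take them hex-encoded: "foo" -> hex of its ASCII bytes -> "666f6f".
import binascii

def key_to_hex(key: str) -> str:
    """Encode an ASCII row key as the hex string the tools expect."""
    return binascii.hexlify(key.encode("ascii")).decode("ascii")

def hex_to_key(hexkey: str) -> str:
    """Decode a hex row key back to its ASCII form."""
    return binascii.unhexlify(hexkey).decode("ascii")

print(key_to_hex("foo"))  # prints 666f6f
```

The same conversion explains the `RowKey: 0902` style output from cassandra-cli list when no key_validation_class is set.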
Re: Modeling troubles
I've no idea about the game or how long you will have to live to compute all the combinations but how about: - row key is a byte array describing the position of white/black pieces and the move indicator. You would need to have both rows keyed from black's perspective and rows keyed from white's perspective. - each column name is the byte array for the possible positions of the other colour Good luck. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 01:18, Stephen Pope wrote: For a side project I’m working on I want to store the entire set of possible Reversi boards. There are an estimated 10^28 possible boards. Each board (from the best way I could think of to implement it) is made up of two 64-bit numbers (black pieces, white pieces…pieces in neither of those are empty spaces) and a bit to indicate whose turn it is. I’ve thought of a few possible ways to do it: - Entire board as row key, in an array of bytes. I’m not sure how well Cassandra can handle 10^28 rows. I could also break this up into separate cfs for each depth of move (initially there are 4 pieces on the board in total. I could make a cf for 5 pieces, 6, etc. to 64). I’m not sure if there’s any advantage to doing that. - 64-bit number for the black pieces as row key, with 65-bit column names (white pieces + turn). I’ve read somewhere that there’s a rough limit of 2-billion columns, so this will be problematic for certain. This can also be broken into separate cfs, but I’m still going to hit the column limit. Is there a better way to achieve what I’m trying to do, or will either of these approaches surprise me and work properly?
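For a sense of scale, the row key Stephen describes fits in 17 bytes. A hypothetical Python sketch of the packing (illustrative only, not tied to any client library):

```python
# Pack two 64-bit bitboards plus a one-byte turn flag into a 17-byte row key.
import struct

def board_key(black: int, white: int, black_to_move: bool) -> bytes:
    """Big-endian: 8 bytes black, 8 bytes white, 1 byte turn indicator."""
    return struct.pack(">QQB", black, white, 1 if black_to_move else 0)

def parse_key(key: bytes):
    """Recover (black, white, turn) from a packed row key."""
    return struct.unpack(">QQB", key)

# Starting Reversi position: the 4 centre pieces (bit index = square number).
black = (1 << 28) | (1 << 35)
white = (1 << 27) | (1 << 36)
key = board_key(black, white, True)
```

Big-endian packing also means keys sort bytewise in the same order as the black bitboard, which could matter if an order-preserving partitioner were used.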
Re: Is it safe to stop a read repair and any suggestion on speeding up repairs
nit pick: nodetool repair is just called repair (or the Anti Entropy Service). Read Repair is something that happens during a read request. Short answer, yes it's safe to kill cassandra during a repair. It's one of the nice things about never mutating data. Longer answer: If nodetool compactionstats says there are no Validation compactions running (and the compaction queue is empty) and netstats says there is nothing streaming there is a good chance the repair is finished or dead. If a neighbour dies during a repair the node it was started on will wait for 48 hours(?) until it times out. Check the logs on the machines for errors, particularly from the AntiEntropyService. And see what compactionstats is saying on all the nodes involved in the repair. Even Longer: um, 3 TB of data is *way* too much data per node, generally happy people have up to about 200 to 300GB per node. The reason for this recommendation is so that things like repair, compaction, node moves, etc are manageable and because the loss of a single node has less of an impact. I would not recommend running a live system with that much data per node. Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 03:51, Adi wrote: We have a 4 node 0.7.6 cluster. RF=2 , 3 TB data per node. A read repair was kicked off on node 4 last week and is still in progress. Later I kicked off a repair on node 2 a few days back. We were writing (read/write/updates/NO deletes) data while the repair was in progress but no data has been written for the past 3-4 days. I was hoping the repair should get done in that time-frame before proceeding with further writes/deletes. Would it be safe to stop it and kick it off per column family or do a full scan of all keys as suggested in an earlier discussion? Any other suggestion on hastening this repair. 
On both nodes the repair Thread is waiting at this stage for a long time(~60+ hours) java.lang.Thread.State: WAITING at java.lang.Object.wait(Native Method) - waiting on 580857f3 (a org.apache.cassandra.utils.SimpleCondition) at java.lang.Object.wait(Object.java:485) at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:38) at org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:791) Locked ownable synchronizers: - None A CPU sampling for few minutes shows these methods as hot spots(mostly the top two) org.apache.cassandra.db.ColumnFamilyStore.isKeyInRemainingSSTables( ) org.apache.cassandra.utils.BloomFilter.getHashBuckets( ) org.apache.cassandra.io.sstable.SSTableIdentityIterator.echoData() netstats does not show anything streaming to/from any of the nodes. -Adi Pandit
Re: cassandra fatal error when compaction
Looks like nodetool drain has been run. Anything else in the logs ? - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 05:48, lebron james wrote: Why does cassandra fall over when I start compaction with nodetool on a 35+GB database? All parameters are default. ERROR [pool-2-thread-1] 2011-07-21 15:25:36,622 Cassandra.java (line 3294) Internal error processing insert java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
Re: Cassandra 0.8.1: request for a sub-column still deserializes all sub-columns for that super column?
Yes - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:06, Oleg Tsvinev wrote: Hi All, Cassandra documentation here: http://www.datastax.com/docs/0.8/data_model/supercolumns states that: Any request for a sub-column deserializes all sub-columns for that super column, so you should avoid data models that rely on large numbers of sub-columns. Is this still true? Thank you, Oleg
Re: Repair fails with java.io.IOError: java.io.EOFException
Check /var/log/cassandra/output.log (assuming the default init scripts) A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:13, Sameer Farooqui wrote: Hmm. Just looked at the log more closely. So, what actually happened is while Repair was running on this specific node, the Cassandra java process terminated itself automatically. The last entries in the log are: INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 When we came in this morning, nodetool ring from another node showed the 1st node as down and OpsCenter also reported it as down. Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the status of the Cassandra PID and it was not running. 
We then started Cassandra: INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialized INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line 96) Heap size: 3894411264/3894411264 INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall successful INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121) Loading settings from file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml It was during this start process that the java.io.EOFException was seen, but yes, like you said Jonathan, the Cassandra process started back up and joined the ring. We're now wondering why the Repair failed and why Cassandra crashed in the first place. We only had default level logging enabled. Is there something else I can check or that you suspect? Should we turn the logging up to debug and retry the Repair? - Sameer On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis jbel...@gmail.com wrote: Looks harmless to me. On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui cassandral...@gmail.com wrote: While running Repair on a 0.8.1 node, we got this error in the system.log: ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-23,5,main] java.io.IOError: java.io.EOFException at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66) There's just a bunch of informational messages about Gossip before this. Looks like the file or stream unexpectedly ended? http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html Is this a bug or something wrong in our environment? - Sameer -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Repair fails with java.io.IOError: java.io.EOFException
The default init.d script will direct std out/err to that file, how are you starting brisk / cassandra ? Check the syslog and other logs in /var/log to see if the OS killed cassandra. Also, what was the last thing in the cassandra log before INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialised ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:50, Sameer Farooqui wrote: Hey Aaron, I don't have any output.log files in that folder: ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls system.log system.log.11 system.log.4 system.log.7 system.log.1 system.log.2 system.log.5 system.log.8 system.log.10 system.log.3 system.log.6 system.log.9 On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.com wrote: Check /var/log/cassandra/output.log (assuming the default init scripts) A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:13, Sameer Farooqui wrote: Hmm. Just looked at the log more closely. So, what actually happened is while Repair was running on this specific node, the Cassandra java process terminated itself automatically. 
The last entries in the log are: INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 When we came in this morning, nodetool ring from another node showed the 1st node as down and OpsCenter also reported it as down. Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the status of the Cassandra PID and it was not running. We then started Cassandra: INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialized INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line 96) Heap size: 3894411264/3894411264 INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall successful INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121) Loading settings from file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml It was during this start process that the java.io.EOFException was seen, but yes, like you said Jonathan, the Cassandra process started back up and joined the ring. We're now wondering why the Repair failed and why Cassandra crashed in the first place. 
We only had default level logging enabled. Is there something else I can check or that you suspect? Should we turn the logging up to debug and retry the Repair? - Sameer On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis jbel...@gmail.com wrote: Looks harmless to me. On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui cassandral...@gmail.com wrote: While running Repair on a 0.8.1 node, we got this error in the system.log: ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-23,5,main] java.io.IOError: java.io.EOFException at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66) There's just a bunch of informational messages about Gossip before this. Looks like the file or stream unexpectedly ended? http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html Is this a bug or something wrong in our environment? - Sameer -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Stress test using Java-based stress utility
UnavailableException is raised server side when there are fewer than CL nodes UP when the request starts. It seems odd to get it in this case because the default replication factor used by stress test is 1. How many nodes do you have and have you made any changes to the RF ? Also check the server side logs as Kirk says. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 18:37, Kirk True wrote: Have you checked the logs on the nodes to see if there are any errors? On 7/21/11 10:43 PM, Nilabja Banerjee wrote: Hi All, I am following this link http://www.datastax.com/docs/0.7/utilities/stress_java for a stress test. I am getting this notification after running this command xxx.xxx.xxx.xx= my ip contrib/stress/bin/stress -d xxx.xxx.xxx.xx Created keyspaces. Sleeping 1s for propagation. total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time Operation [44] retried 10 times - error inserting key 044 ((UnavailableException)) Operation [49] retried 10 times - error inserting key 049 ((UnavailableException)) Operation [7] retried 10 times - error inserting key 007 ((UnavailableException)) Operation [6] retried 10 times - error inserting key 006 ((UnavailableException)) Any idea why I am getting these things? Thank You -- Kirk True Founder, Principal Engineer Expert Engineering Firepower
Re: b-tree
You can use something like ZooKeeper to coordinate processes doing page splits. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 19:05, Eldad Yamin wrote: In order to split the nodes: SimpleGeo have a max of 1,000 records (i.e. places) on each node in the tree; if the number reaches 1,000 they split the node. In order to avoid more than one process editing/splitting the node, a transaction is needed. On Jul 22, 2011 1:01 AM, aaron morton aa...@thelastpickle.com wrote: But how will you be able to maintain it while it evolves and new data is added without transactions? What is the situation you think you need transactions for ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 00:06, Eldad Yamin wrote: Aaron, Nested set is exactly what I had in mind. But how will you be able to maintain it while it evolves and new data is added without transactions? Thanks! On Thu, Jul 21, 2011 at 1:44 AM, aaron morton aa...@thelastpickle.com wrote: Just throwing out a (half baked) idea, perhaps the Nested Set Model of trees would work http://en.wikipedia.org/wiki/Nested_set_model * Every row would represent a set with a left and right encoded into the key * Members are inserted as columns into *every* set / row they are a member of. So we are de-normalising and trading space for time. * May need to maintain a custom secondary index of the materialised sets. e.g. slice a row to get the first column >= the left value you are interested in, that is the key for the set. I've not thought it through much further than that, a lot would depend on your data. The top sets may get very big. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 08:33, Jeffrey Kesselman wrote: I'm not sure if I have an answer for you, anyway, but I'm curious. A b-tree and a binary tree are not the same thing. 
A binary tree is a basic fundamental data structure, A b-tree is an approach to storing and indexing data on disc for a database. Which do you mean? On Wed, Jul 20, 2011 at 4:30 PM, Eldad Yamin elda...@gmail.com wrote: Hello, Is there any good way of storing a binary-tree in Cassandra? I wonder if someone already implement something like that and how accomplished that without transaction supports (while the tree keep evolving)? I'm asking that becouse I want to save geospatial-data, and SimpleGeo did it using b-tree: http://www.readwriteweb.com/cloud/2011/02/video-simplegeo-cassandra.php Thanks! -- It's always darkest just before you are eaten by a grue.
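To make the coordination concrete, here is a single-process Python sketch in which a plain threading.Lock stands in for the ZooKeeper distributed lock; the guarded check-then-split is exactly the step ZooKeeper would make safe across processes. All names and the in-memory node structure are illustrative.

```python
# Several workers race to split a full tree node; the lock (a stand-in for
# a ZooKeeper distributed lock) guarantees the split happens exactly once.
import threading

MAX_RECORDS = 1000  # the SimpleGeo-style split threshold from the thread
split_count = 0

def maybe_split(node, lock):
    """Split the node if it is over the threshold and not already split."""
    global split_count
    with lock:  # in production: acquire the distributed lock for this node
        if len(node["records"]) >= MAX_RECORDS and not node["split"]:
            node["split"] = True  # a real split would redistribute records
            split_count += 1

node = {"records": list(range(MAX_RECORDS)), "split": False}
lock = threading.Lock()
workers = [threading.Thread(target=maybe_split, args=(node, lock))
           for _ in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()
# split_count is 1: exactly one worker performed the split
```

Without the lock, two processes could both observe the 1,000-record state and split concurrently, which is the scenario the thread is worried about.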
Re: cassandra fatal error when compaction
Something has shutdown the mutation stage thread pool. This happens during drain or decommission / move. Restart the service and it should be ok. if it happens again without anyone running something like drain, decommission or move let us know. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 19:41, lebron james wrote: ERROR [pool-2-thread-3] 2011-07-22 10:34:59,102 Cassandra.java (line 3294) Internal error processing insert java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73) at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source) at org.apache.cassandra.service.StorageProxy.insertLocal(StorageProxy.java:360) at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:241) at org.apache.cassandra.service.StorageProxy.access$000(StorageProxy.java:62) at org.apache.cassandra.service.StorageProxy$1.apply(StorageProxy.java:99) at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:210) at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:154) at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:560) at org.apache.cassandra.thrift.CassandraServer.internal_insert(CassandraServer.java:436) at org.apache.cassandra.thrift.CassandraServer.insert(CassandraServer.java:444) at org.apache.cassandra.thrift.Cassandra$Processor$insert.process(Cassandra.java:3286) at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889) at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at 
java.lang.Thread.run(Unknown Source) ERROR [pool-2-thread-6] 2011-07-22 10:34:59,102 Cassandra.java (line 3294) Internal error processing insert java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73) at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source) at org.apache.cassandra.service.StorageProxy.insertLocal(StorageProxy.java:360) at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:241) at org.apache.cassandra.service.StorageProxy.access$000(StorageProxy.java:62) at org.apache.cassandra.service.StorageProxy$1.apply(StorageProxy.java:99) at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:210) at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:154) at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:560) at org.apache.cassandra.thrift.CassandraServer.internal_insert(CassandraServer.java:436) at org.apache.cassandra.thrift.CassandraServer.insert(CassandraServer.java:444) at org.apache.cassandra.thrift.Cassandra$Processor$insert.process(Cassandra.java:3286) at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889) at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) ERROR [pool-2-thread-3] 2011-07-22 10:34:59,102 Cassandra.java (line 3294) Internal error processing insert java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73) at 
java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source) at org.apache.cassandra.service.StorageProxy.insertLocal(StorageProxy.java:360) at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:241) at org.apache.cassandra.service.StorageProxy.access$000(StorageProxy.java:62) at org.apache.cassandra.service.StorageProxy$1.apply(StorageProxy.java:99) at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:210) at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:154) at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:560) at org.apache.cassandra.thrift.CassandraServer.internal_insert(CassandraServer.java:436) at org.apache.cassandra.thrift.CassandraServer.insert(CassandraServer.java:444
Re: eliminate need to repair by using column TTL??
Read repair will only repair data that is read on the nodes that are up at that time, and does not guarantee that any changes it detects will be written back to the nodes. The diff mutations are async fire and forget messages which may go missing or be dropped or ignored by the recipient just like any other message. Also getting hit with a bunch of read repair operations is pretty painful. The normal read runs, the coordinator detects the digest mismatch, the read runs again from all nodes and they all have to return their full data (no digests this time), the coordinator detects the diffs, mutations are sent back to each node that needs them. All this happens sync to the read request when the CL is ONE. That's two reads with more network IO and up to RF mutations. The delete thing is important but repair also reduces the chance of reads getting hit with RR and gives me confidence when it's necessary to nuke a bad node. Your plan may work but it feels risky to me. You may end up with worse read performance and unpleasant emotions if you ever have to nuke a node. Others may disagree. Not ignoring the fact that repair can take a long time, fail, hurt performance etc. There are plans to improve it though. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 19:55, jonathan.co...@gmail.com wrote: One of the main reasons for regularly running repair is to make sure deletes are propagated in the cluster, i.e., data is not resurrected if a node never received the delete call. And repair-on-read takes care of repairing inconsistencies on-the-fly. So if I were to set a universal TTL on all columns - so everything would only live for a certain age, would I be able to get away without having to do regular repairs with nodetool? I realize this scenario would not be applicable for everyone, but our data model would allow us to do this. 
So could this be an alternative to running the (resource-intensive, long-running) repairs with nodetool? Thanks.
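The intuition behind the TTL idea can be sketched in a few lines. This is a toy model (the `Column` class and its fields are hypothetical, not a Cassandra API): an expired column becomes unreadable on every replica independently, so unlike a delete, no tombstone message has to reach a node for the data to go away.

```python
import time

class Column:
    """Toy model of a column with a write timestamp and an optional TTL (seconds)."""
    def __init__(self, value, ttl=None, written_at=None):
        self.value = value
        self.ttl = ttl
        self.written_at = time.time() if written_at is None else written_at

    def is_live(self, now=None):
        # A column with no TTL lives forever; otherwise it expires locally,
        # with no cross-node message required.
        now = time.time() if now is None else now
        return self.ttl is None or now < self.written_at + self.ttl

# A replica that never received a delete still expires the column on its own,
# so the data cannot be resurrected once the TTL has passed.
col = Column("v", ttl=10, written_at=0.0)
assert col.is_live(now=5.0)       # within TTL: still readable
assert not col.is_live(now=11.0)  # past TTL: gone on every replica
```

The caveat from the reply above still applies: TTLs address the delete-propagation reason for repair, but not the consistency divergence that repair (and read repair) also handle.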
Re: Predictable low RW latency, SLABS and STW GC
Restarting the service will drop all the memory-mapped caches; Cassandra's caches are saved/persistent, and you can also use memcached if you want. Are you experiencing stop-the-world pauses? There are some things that can be done to reduce the chance of them happening. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 23 Jul 2011, at 05:34, Milind Parikh wrote: In order to be predictable @ big data scale, the intensity and periodicity of STW Garbage Collection has to be brought down. Assume that SLABS (CASSANDRA-2252) will be available in the main line at some time and assume that this will have the impact that other projects (hbase etc) are reporting. I wonder whether avoiding GC by restarting the servers before GC will be a feasible approach (of course while knowing the workload) Regards Milind
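For the young-generation tuning asked about in the JVM GC thread above, the knobs live in cassandra-env.sh (or wherever JVM_OPTS is assembled). The flags below are standard HotSpot options; the sizes are illustrative examples only, not recommendations — they should be tuned against your own GCInspector output.

```shell
# Illustrative settings for an 8G heap (example values, not recommendations).
# Fixing the young gen with -Xmn (NewSize == MaxNewSize) is often preferred
# over -XX:NewRatio so the young gen does not resize under load.
JVM_OPTS="$JVM_OPTS -Xms8G -Xmx8G"
JVM_OPTS="$JVM_OPTS -Xmn800M"                                   # fixed young generation
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"   # ParNew + CMS, the usual pairing
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" # GC log to correlate with GCInspector
```

A larger young gen means fewer but longer ParNew pauses; under a mixed read/write load with many column families, measure before and after rather than guessing.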
Re: question on setup for writes into 2 datacenters
Quick reminder, with RF == 2 the QUORUM is 2 as well. So when using LOCAL_QUORUM with RF 2+2 you will effectively be using LOCAL_ALL which may not be what you want. As De La Soul sang, 3 is the magic number for minimum fault tolerance (QUORUM is then 2). Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 23 Jul 2011, at 10:04, Sameer Farooqui wrote: It sounds like what you're looking for is write consistency of local_quorum: http://www.datastax.com/docs/0.8/consistency/index#write-consistency local_quorum would mean the write has to be successful on a majority of nodes in DC1 (so 2) before it is considered successful. If you use just quorum write, it'll have to be committed to 3 replicas out of the 4 before it's considered successful. On Fri, Jul 22, 2011 at 1:57 PM, Dean Hiller d...@alvazan.com wrote: Ideally, we would want to have a replication factor of 4, and a minimum write consistency of 2 (which looking at the default in cassandra.yaml is to memory first with asynch to disk...perfect so far!!!) Now, obviously, I can get the partitioner setup to make sure I get 2 replicas in each data center. The next thing I would want to guarantee however is that if a write came into datacenter 1, it would write to the two nodes in datacenter 1 and asynchronously replicate to datacenter 2. Is this possible? Does cassandra already handle that or is there something I could do to get cassandra to do that? In this mode, I believe I can have both datacenters be live as well as be backup for the other not wasting resources. thanks, Dean
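The "RF == 2 means QUORUM == 2" reminder is just majority arithmetic, and it is worth making explicit because it drives the RF 3 recommendation. A minimal sketch:

```python
def quorum(rf):
    """Quorum for a replication factor: a strict majority of replicas."""
    return rf // 2 + 1

# With RF = 2 per data center, LOCAL_QUORUM needs 2 of 2 local replicas --
# effectively "local all", so a single local node down blocks the write.
assert quorum(2) == 2
# With RF = 3 per DC, quorum is 2, tolerating one local node failure.
assert quorum(3) == 2
# Plain QUORUM over RF 2+2 = 4 total replicas needs 3 acks,
# which always crosses the data-center boundary.
assert quorum(4) == 3
```

This is why RF 3 is the minimum for fault-tolerant quorum operations: it is the smallest RF where quorum is less than all replicas.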
Re: select * from A join B using(common_id) where A.id == a and B.id == b
my fall-back approach is, since A and B do not change a lot, I'll pre-generate the join of A and B (not very large) keyed on A.id + B.id, then do the get(a+b) +1 materialise views / joins you know you want ahead of time. Trade space for time. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 23 Jul 2011, at 10:41, Yang wrote: this is a common pattern used in an RDBMS, is there some existing idiom to do it in cassandra ? if the size of select * from A where id == a is very large, and similarly for B, while the join of A.id == a and B.id == b is small, then doing a get() for both and then merging seems excessively slow. my fall-back approach is, since A and B do not change a lot, I'll pre-generate the join of A and B (not very large) keyed on A.id + B.id, then do the get(a+b) thanks Yang
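The pre-generated join keyed on A.id + B.id can be sketched with plain dicts standing in for column families. The row data and ids here are invented for illustration, and the join is modelled as a simple cross product; a real materialisation would match rows on common_id before merging.

```python
# Two "tables" keyed by row id, standing in for column families A and B.
a_rows = {"a1": {"name": "alpha"}, "a2": {"name": "beta"}}
b_rows = {"b1": {"size": 10}, "b2": {"size": 20}}

# Pre-generate the join once, keyed on the composite (A.id, B.id),
# and rebuild it whenever A or B changes (they rarely do here).
joined = {
    (a_id, b_id): {**a_cols, **b_cols}
    for a_id, a_cols in a_rows.items()
    for b_id, b_cols in b_rows.items()
}

# Read time: the join collapses to a single point lookup -- trade space for time.
row = joined[("a1", "b2")]
assert row == {"name": "alpha", "size": 20}
```

This is the "materialise the views you know you want ahead of time" idea from the reply: the expensive merge happens once at write/build time instead of on every read.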