Re: Is Cassandra a document based DB?
What are the advantages/disadvantages of Cassandra over HBase? Thanks, Ran. -- View this message in context: http://n2.nabble.com/Is-Cassandra-a-document-based-DB-tp4653418p4653644.html Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
Re: Is Cassandra a document based DB?
On Mon, Mar 1, 2010 at 5:34 AM, HHB hubaghd...@yahoo.ca wrote: What are the advantages/disadvantages of Cassandra over HBase? Ease of setup: all nodes are the same. No single point of failure: all nodes are the same. Speed: http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf Richer model: supercolumns. Multi-datacenter awareness. There are likely other things I'm forgetting, but those stand out for me. -Brandon
Use cases for Cassandra
Hey, What are the typical use cases for Cassandra? How do I know if I should use Cassandra or a document-based database like CouchDB? I'm working for an ISP (Internet Service Provider); do you think we can employ Cassandra, and for what? Thanks all for your help and time.
compaction threshold
We recommend adjusting the compaction threshold to 0 while the import is running. After the import, you need to run `nodeprobe -host IP flush_binary Keyspace` on every node, as this will flush the remaining data still left in memory to disk. Then it's recommended to adjust the compaction threshold to its original value. The bulk loader mentions the above. What property is the compaction threshold?
Re: compaction threshold
I get: Min threshold must be at least 2 On Mon, Mar 1, 2010 at 8:55 AM, Brandon Williams dri...@gmail.com wrote: On Mon, Mar 1, 2010 at 10:53 AM, Sonny Heer sonnyh...@gmail.com wrote: We recommend adjusting the compaction threshold to 0 while the import is running. After the import, you need to run `nodeprobe -host IP flush_binary Keyspace` on every node, as this will flush the remaining data still left in memory to disk. Then it's recommended to adjust the compaction threshold to its original value. The bulk loader mentions the above. What property is the compaction threshold? setcompactionthreshold in bin/nodeprobe (or nodetool in 0.6) -Brandon
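For readers wondering what the threshold actually governs, here is a toy model (illustrative only, not Cassandra's compaction code): minor compaction of a bucket of similarly sized SSTables kicks in once at least the minimum threshold of SSTables has accumulated, and a threshold of 0 disables compaction entirely, which is why the bulk-import instructions suggest it during the load.

```python
# Toy model of the minimum compaction threshold discussed above.
# Not Cassandra source; just the triggering logic the setting controls.
def should_compact(sstable_count, min_threshold):
    """Return True when a bucket of SSTables is due for minor compaction."""
    if min_threshold == 0:        # 0 = compaction disabled (bulk-load mode)
        return False
    return sstable_count >= min_threshold

print(should_compact(5, 4))   # enough SSTables accumulated -> compaction runs
print(should_compact(5, 0))   # threshold 0 -> compaction disabled
```

This also explains the "Min threshold must be at least 2" error: once compaction is enabled, compacting fewer than two SSTables together would be meaningless.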
Re: Adjusting Token Spaces and Rebalancing Data
Hello Everyone, Jonathan, Thanks for your advice :-) I have started a loadbalance operation on a busy cassandra node. The http://wiki.apache.org/cassandra/Operations web page indicates that nodetool streams can be used to monitor the status of the load balancing operation. I can't seem to find the nodetool command for cassandra 0.5.0. Is this a separate package/tool? Thanks, Jon On Wed, Feb 24, 2010 at 8:17 PM, Jonathan Ellis jbel...@gmail.com wrote: nodeprobe loadbalance and/or nodeprobe move http://wiki.apache.org/cassandra/Operations On Wed, Feb 24, 2010 at 6:17 PM, Jon Graham sjclou...@gmail.com wrote: Hello, I have a 6-node Cassandra 0.5.0 cluster using org.apache.cassandra.dht.OrderPreservingPartitioner with replication factor 3. I mistakenly set my tokens to the wrong values, and have all the data being stored on the first node (with replicas on the second and third nodes). Does Cassandra have any tools to reset the token values and re-distribute the data? Thanks for your help, Jon
Re: Adjusting Token Spaces and Rebalancing Data
nodetool is the 0.6 replacement for nodeprobe. The stream info is new in that version. (0.6 beta release is linked from http://wiki.apache.org/cassandra/GettingStarted) -Jonathan On Mon, Mar 1, 2010 at 12:40 PM, Jon Graham sjclou...@gmail.com wrote: Hello Everyone, Jonathan, Thanks for your advice :-) I have started a loadbalance operation on a busy cassandra node. The http://wiki.apache.org/cassandra/Operations web page indicates that nodetool streams can be used to monitor the status of the load balancing operation. I can't seem to find the nodetool command for cassandra 0.5.0. Is this a separate package/tool? Thanks, Jon On Wed, Feb 24, 2010 at 8:17 PM, Jonathan Ellis jbel...@gmail.com wrote: nodeprobe loadbalance and/or nodeprobe move http://wiki.apache.org/cassandra/Operations On Wed, Feb 24, 2010 at 6:17 PM, Jon Graham sjclou...@gmail.com wrote: Hello, I have a 6-node Cassandra 0.5.0 cluster using org.apache.cassandra.dht.OrderPreservingPartitioner with replication factor 3. I mistakenly set my tokens to the wrong values, and have all the data being stored on the first node (with replicas on the second and third nodes). Does Cassandra have any tools to reset the token values and re-distribute the data? Thanks for your help, Jon
Storage format
I've been looking at the source, but can't quite find the things I'm looking for, so I have a few questions. Are columns for a row stored in a serialized data structure on disk, or stored individually and put into a data structure when the call is being made? Because of the slice query, does that mean that all columns have to be read in before any are sent back? If that is the case, could it be more efficient to use rows instead of columns for storing, for example, indexes where you just want to get a few at a time? -- Regards Erik
Re: Adjusting Token Spaces and Rebalancing Data
Jonathan, Thanks for the quick reply. After starting a loadbalance operation for about 30 minutes, I can see 3 ColumnFamily-tmp-Data, Filter and Index files on a lightly loaded node. The Data file has a size of 2,147,483,647 (max signed int) on the node being loaded. I hope I didn't run out of int space during the load balance operation. The file sizes don't seem to be changing in the column family folder of the newly loaded node. The system/LocationInfo files haven't changed in 20 minutes. The 3 ColumnFamily files still have the -tmp- name. Can I tell if the load balancing operation is still running ok or if it has terminated? Is there a rough computation to determine how long the process should take? If the load balancing is successful, will the cluster ring information reflect the load balancing changes? Thanks, Jon On Mon, Mar 1, 2010 at 10:47 AM, Jonathan Ellis jbel...@gmail.com wrote: nodetool is the 0.6 replacement for nodeprobe. The stream info is new in that version. (0.6 beta release is linked from http://wiki.apache.org/cassandra/GettingStarted) -Jonathan On Mon, Mar 1, 2010 at 12:40 PM, Jon Graham sjclou...@gmail.com wrote: Hello Everyone, Jonathan, Thanks for your advice :-) I have started a loadbalance operation on a busy cassandra node. The http://wiki.apache.org/cassandra/Operations web page indicates that nodetool streams can be used to monitor the status of the load balancing operation. I can't seem to find the nodetool command for cassandra 0.5.0. Is this a separate package/tool? Thanks, Jon On Wed, Feb 24, 2010 at 8:17 PM, Jonathan Ellis jbel...@gmail.com wrote: nodeprobe loadbalance and/or nodeprobe move http://wiki.apache.org/cassandra/Operations On Wed, Feb 24, 2010 at 6:17 PM, Jon Graham sjclou...@gmail.com wrote: Hello, I have a 6-node Cassandra 0.5.0 cluster using org.apache.cassandra.dht.OrderPreservingPartitioner with replication factor 3.
I mistakenly set my tokens to the wrong values, and have all the data being stored on the first node (with replicas on the second and third nodes). Does Cassandra have any tools to reset the token values and re-distribute the data? Thanks for your help, Jon
Re: Adjusting Token Spaces and Rebalancing Data
Thanks Jonathan. It seems like the load balance operation isn't moving. I haven't seen any data file time changes in 2 hours and no location file time changes in over an hour. I can see tcp port 7000 opened on the node where I ran the loadbalance command. It is connected to port 39033 on the node receiving the data. The CPU usage on both systems is very low. There are about 10 million records on the node where the load balance command was issued. My six node Cassandra ring consists of tokens for nodes 1-6 of: 0 (ascii 0x30) 6 B H O (the letter O) T The load balance target node initially had a token of 'H' (using ordered partitioning). The source node has a key of 0 (ascii 0x30). Most of the data on the source node has keys starting with '/'. Slash falls between tokens T and 0 in my ring, so most of the data landed on the node with token 0, with replicas on the next 2 nodes. My token space is badly divided for the data I have already inserted. Does the initial token value of the load balance target node selected by Cassandra need to be cleared or set to a specific value beforehand to accommodate the load balance data transfer? Would I have better luck decommissioning nodes 4, 5, 6 and trying to bootstrap these nodes one at a time with better initial token values? Would the existing data on nodes 1, 2, 3 be moved to the bootstrapped nodes? I am looking for a good way to move/split/re-balance data from nodes 1, 2, 3 to nodes 4, 5, 6 while achieving a better token space distribution. Thanks for your help, Jon On Mon, Mar 1, 2010 at 11:55 AM, Jonathan Ellis jbel...@gmail.com wrote: On Mon, Mar 1, 2010 at 1:44 PM, Jon Graham sjclou...@gmail.com wrote: Can I tell if the load balancing operation is still running ok or if it has terminated? Is there a rough computation to determine how long the process should take? Not really, although you can guess from cpu/io usage. This is much improved in 0.6.
If the load balancing is successful, will the cluster ring information reflect the load balancing changes? Yes.
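The token arithmetic in the message above can be sketched in a few lines. This is a toy model of order-preserving placement, not Cassandra's routing code: a key belongs to the node with the first token that sorts greater than or equal to the key, wrapping around past the last token, which is exactly why keys starting with '/' pile up on the token-'0' node.

```python
# Toy model of key placement under OrderPreservingPartitioner, using the
# six tokens from the ring described above. Illustrative only.
RING = ["0", "6", "B", "H", "O", "T"]  # tokens for nodes 1-6

def primary_node(key, tokens=RING):
    """Return the 1-based node number owning `key`."""
    for i, token in enumerate(sorted(tokens)):
        if key <= token:
            return i + 1
    return 1  # keys past the last token wrap around to the first node

print(primary_node("/some/key"))  # '/' (0x2F) sorts before '0' (0x30) -> node 1
print(primary_node("Z"))          # past 'T' -> wraps to node 1
```

With this ring, both keys below '0' and keys above 'T' land on node 1, so a workload whose keys all start with '/' leaves nodes 2-6 nearly empty, matching the imbalance reported in the thread.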
Re: Storage format
Sorry about that! Continuing: And in that case, when using rows as indexes instead of columns, we only need to read that specific row, which might be more efficient than reading a big row every time? -- Regards Erik
Re: Is Cassandra a document based DB?
In HBase you have table:row:family:key:val:version, which some people might consider richer. Cassandra is actually table:family:row:key:val[:subval], where subvals are the columns stored in a supercolumn (which can be easily arranged by timestamp to give the versioned approach). -Original Message- From: Erik Holstad erikhols...@gmail.com Sent: Monday, March 1, 2010 3:49pm To: cassandra-user@incubator.apache.org Subject: Re: Is Cassandra a document based DB? On Mon, Mar 1, 2010 at 4:41 AM, Brandon Williams dri...@gmail.com wrote: On Mon, Mar 1, 2010 at 5:34 AM, HHB hubaghd...@yahoo.ca wrote: What are the advantages/disadvantages of Cassandra over HBase? Ease of setup: all nodes are the same. No single point of failure: all nodes are the same. Speed: http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf Richer model: supercolumns. I think that there are people that would be of a different opinion here. Cassandra has, as I've understood it, table:key:name:val, and in some cases the val is a serialized data structure. In HBase you have table:row:family:key:val:version, which some people might consider richer. Multi-datacenter awareness. There are likely other things I'm forgetting, but those stand out for me. -Brandon -- Regards Erik
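The two nestings being compared can be sketched as nested maps. This is purely illustrative; the keyspace, family, row, and column names below are made up, and real storage is nothing like Python dicts:

```python
# Illustrative nesting only (hypothetical names and values):
# HBase:     table -> row -> family -> qualifier -> version -> value
hbase = {"users": {"row1": {"info": {"name": {1267488000: "erik"}}}}}

# Cassandra: keyspace -> column family -> row -> supercolumn -> column -> value
cassandra = {"Keyspace1": {"Super1": {"row1": {"1267488000": {"name": "erik"}}}}}

# Naming supercolumns by timestamp approximates HBase-style versioning,
# which is the point made in the reply above:
versions = sorted(cassandra["Keyspace1"]["Super1"]["row1"], reverse=True)
print(versions[0])  # newest supercolumn name
```

Both models are five to six levels deep; the disagreement in the thread is really about which levels are fixed by the schema and which are free-form.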
In-Memory Storage (no disk)
Hi there - Is there a setting in the storage config, or some other *user-level* programmatic means, that would cause Cassandra not to write to disk? - m.
Re: Storage format
On Mon, Mar 1, 2010 at 4:06 PM, Erik Holstad erikhols...@gmail.com wrote: So that is kinda of what I want to do, but I want to go from a row with multiple columns to multiple rows with one column Right, and I'm trying to tell you that this is a bad idea unless you are worried about exhausting your "row must fit in RAM at compaction time" limit.
Process for removing an old CF in 0.5.0
Hi, I was just wondering what the process might be for removing an old column family in 0.5.0. Can I just update the config and restart the server? Does it require stopping the entire cluster at once or can it be done in a rolling fashion? Once I update the config can I just delete all the files with that column family name in the filename? Thanks, -Anthony -- Anthony Molinaro antho...@alumni.caltech.edu
Re: Process for removing an old CF in 0.5.0
On Mon, Mar 1, 2010 at 4:41 PM, Anthony Molinaro antho...@alumni.caltech.edu wrote: Hi, I was just wondering what the process might be for removing an old column family in 0.5.0. Can I just update the config and restart the server? Yes, but make sure your commitlog is flushed first (and that it stays empty). Does it require stopping the entire cluster at once or can it be done in a rolling fashion? Rolling is fine. Once I update the config can I just delete all the files with that column family name in the filename? Yes. -Jonathan
Re: Storage format
On Mon, Mar 1, 2010 at 4:49 PM, Erik Holstad erikhols...@gmail.com wrote: Haha! Thanks. Well I'm a little bit worried about this, but since the indexes are pretty small I don't think it is going to be too bad. But I was mostly thinking about performance and having the index row as a bottleneck for writing, since the partition is per row. Writing N columns to 1 row is faster than writing 1 column to N rows, even when all N are coming from different clients. Our concurrency story there is excellent.
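The write-path point can be illustrated with a toy grouping of mutations (a sketch, not Cassandra's actual write path): writes are applied per row, so N columns aimed at one row collapse into a single row-level operation, while one column to each of N rows stays N separate operations.

```python
# Toy sketch: mutations are grouped by row key, the unit writes are
# applied on. Hypothetical keys/values; not Cassandra internals.
from collections import defaultdict

def group_mutations(mutations):
    """Group (row_key, column, value) triples by row key."""
    grouped = defaultdict(dict)
    for row_key, column, value in mutations:
        grouped[row_key][column] = value
    return dict(grouped)

wide = [("index", "col%d" % i, i) for i in range(5)]    # N columns, 1 row
narrow = [("row%d" % i, "col", i) for i in range(5)]    # 1 column, N rows

print(len(group_mutations(wide)))    # 1 row-level write
print(len(group_mutations(narrow)))  # 5 row-level writes
```

The wide layout turns five logical writes into one row's worth of work, which is the concurrency advantage the reply describes.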
Re: Adjusting Token Spaces and Rebalancing Data
On Mon, Mar 1, 2010 at 3:18 PM, Jon Graham sjclou...@gmail.com wrote: Thanks Jonathan. It seems like the load balance operation isn't moving. I haven't seen any data file time changes in 2 hours and no location file time changes in over an hour. I can see tcp port 7000 opened on the node where I ran the loadbalance command. It is connected to port 39033 on the node receiving the data. The CPU usage on both systems is very low. There are about 10 million records on the node where the load balance command was issued. Did you check logs for exceptions? My six node Cassandra ring consists of tokens for nodes 1-6 of: 0 (ascii 0x30) 6 B H O (the letter O) T The load balance target node initially had a token of 'H' (using ordered partitioning). The source node has a key of 0 (ascii 0x30). Most of the data on the source node has keys starting with '/'. Slash falls between tokens T and 0 in my ring so most of the data landed on the node with token 0 with replicas on the next 2 nodes. My token space is badly divided for the data I have already inserted. Does the initial token value of the load balance target node selected by Cassandra need to be cleared or set to a specific value beforehand to accommodate the load balance data transfer? No. Would I have better luck decommissioning nodes 4, 5, 6 and trying to bootstrap these nodes one at a time with better initial token values? LoadBalance is basically sugar for decommission + bootstrap, so no. I am looking for a good way to move/split/re-balance data from nodes 1, 2, 3 to nodes 4, 5, 6 while achieving a better token space distribution. I would upgrade to the 0.6 beta and try loadbalance again. -Jonathan
Re: Storage format
On Mon, Mar 1, 2010 at 2:51 PM, Jonathan Ellis jbel...@gmail.com wrote: On Mon, Mar 1, 2010 at 4:49 PM, Erik Holstad erikhols...@gmail.com wrote: Haha! Thanks. Well I'm a little bit worried about this, but since the indexes are pretty small I don't think it is going to be too bad. But I was mostly thinking about performance and having the index row as a bottleneck for writing, since the partition is per row. Writing N columns to 1 row is faster than writing 1 column to N rows, even when all N are coming from different clients. Our concurrency story there is excellent. That sounds good, and the same thing goes for reading, cause that is basically what I'm looking for: faster reads, not too worried about the writes. Thanks a lot!
Re: Storage format
Then you definitely want one row; range queries are slower than we'd like right now. (Ticket to fix that: https://issues.apache.org/jira/browse/CASSANDRA-821) On Mon, Mar 1, 2010 at 5:00 PM, Erik Holstad erikhols...@gmail.com wrote: On Mon, Mar 1, 2010 at 2:51 PM, Jonathan Ellis jbel...@gmail.com wrote: On Mon, Mar 1, 2010 at 4:49 PM, Erik Holstad erikhols...@gmail.com wrote: Haha! Thanks. Well I'm a little bit worried about this, but since the indexes are pretty small I don't think it is going to be too bad. But I was mostly thinking about performance and having the index row as a bottleneck for writing, since the partition is per row. Writing N columns to 1 row is faster than writing 1 column to N rows, even when all N are coming from different clients. Our concurrency story there is excellent. That sounds good, and the same thing goes for reading, cause that is basically what I'm looking for: faster reads, not too worried about the writes. Thanks a lot!
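A toy contrast of the two index layouts under discussion (illustrative data structures with made-up keys, not Cassandra's storage format): the wide-row index is served by a column slice within one row, while the row-per-entry index forces a range query across row keys, which is the slower path the reply refers to.

```python
# Layout A: one wide row, column name = indexed value (hypothetical data).
index_as_row = {
    "user-index": {"alice": "uid1", "bob": "uid2", "carol": "uid3"}
}

# Layout B: one single-column row per indexed value.
index_as_rows = {
    "user-index:alice": {"uid": "uid1"},
    "user-index:bob": {"uid": "uid2"},
    "user-index:carol": {"uid": "uid3"},
}

def slice_columns(store, row_key, start, count):
    """Layout A read: one row lookup plus an in-row column scan."""
    cols = sorted(store[row_key].items())
    return [c for c in cols if c[0] >= start][:count]

def range_slice(store, start_key, count):
    """Layout B read: a range scan across row keys instead."""
    keys = sorted(k for k in store if k >= start_key)
    return [(k, store[k]) for k in keys[:count]]

print(slice_columns(index_as_row, "user-index", "alice", 2))
print(range_slice(index_as_rows, "user-index:alice", 2))
```

Both return the same logical entries, but in a real cluster layout B spreads the index over many rows and must use the range-query machinery that CASSANDRA-821 aims to speed up.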
Re: Adjusting Token Spaces and Rebalancing Data
Hello, I did find these exceptions. I issued the loadbalance command on node 192.168.2.10. INFO [MESSAGING-SERVICE-POOL:3] 2010-03-01 10:34:40,764 TcpConnection.java (line 315) Closing errored connection java.nio.channels.SocketChannel[connected local=/192.168.2.10:55973 remote=/192.168.2.13:7000] WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-03-01 10:34:40,964 MessagingService.java (line 555) Running on default stage - beware WARN [MESSAGING-SERVICE-POOL:1] 2010-03-01 10:34:40,964 TcpConnection.java (line 484) Problem reading from socket connected to : java.nio.channels.SocketChannel[connected local=/192.168.2.10:40758 remote=/192.168.2.13:7000] WARN [MESSAGING-SERVICE-POOL:1] 2010-03-01 10:34:40,964 TcpConnection.java (line 485) Exception was generated at : 03/01/2010 10:34:40 on thread MESSAGING-SERVICE-POOL:1 Reached an EOL or something bizzare occured. Reading from: /192.168.2.13 BufferSizeRemaining: 16 java.io.IOException: Reached an EOL or something bizzare occured. Reading from: /192.168.2.13 BufferSizeRemaining: 16 at org.apache.cassandra.net.io.StartState.doRead(StartState.java:44) at org.apache.cassandra.net.io.ProtocolState.read(ProtocolState.java:39) at org.apache.cassandra.net.io.TcpReader.read(TcpReader.java:95) at org.apache.cassandra.net.TcpConnection$ReadWorkItem.run(TcpConnection.java:445) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) INFO [MESSAGING-SERVICE-POOL:1] 2010-03-01 10:34:40,964 TcpConnection.java (line 315) Closing errored connection java.nio.channels.SocketChannel[connected local=/192.168.2.10:40758 remote=/192.168.2.13:7000] INFO [MESSAGE-STREAMING-POOL:1] 2010-03-01 10:35:23,171 TcpConnection.java (line 315) Closing errored connection java.nio.channels.SocketChannel[connected local=/192.168.2.10:56728 remote=/192.168.2.13:7000] INFO [MESSAGE-STREAMING-POOL:1] 2010-03-01 10:35:23,221
FileStreamTask.java (line 79) Exception was generated at : 03/01/2010 10:35:23 on thread MESSAGE-STREAMING-POOL:1 Value too large for defined data type java.io.IOException: Value too large for defined data type at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) at sun.nio.ch.FileChannelImpl.transferToDirectly(Unknown Source) at sun.nio.ch.FileChannelImpl.transferTo(Unknown Source) at org.apache.cassandra.net.TcpConnection.stream(TcpConnection.java:226) at org.apache.cassandra.net.FileStreamTask.run(FileStreamTask.java:55) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) I can certainly upgrade to 0.6 and try a loadbalance there, do you still think it is advisable? All of my key/value entries are well under 1024 bytes but I have millions of them. Do you think I have a data corruption problem? Thanks, Jon On Mon, Mar 1, 2010 at 2:54 PM, Jonathan Ellis jbel...@gmail.com wrote: On Mon, Mar 1, 2010 at 3:18 PM, Jon Graham sjclou...@gmail.com wrote: Thanks Jonathan. It seems like the load balance operation isn't moving. I haven't seen any data file time changes in 2 hours and no location file time changes in over an hour. I can see a tcp port # 7000 opened on the node where I ran the loadbalance command. It is connected to port 39033 on the node receiving the data. The CPU usage on both systems is very low. There are about 10 million records on the node where the load balance command was issued. Did you check logs for exceptions? My six node Cassandra ring consists of tokens for nodes 1-6 of: 0 (ascii 0x30) 6 B H O (the letter O) T The load balance target node initially had a token of 'H' (using ordered partitioning). The source node has a key of 0 (ascii 0x30). Most of the data on the source node has keys starting with '/'. 
Slash falls between tokens T and 0 in my ring so most of the data landed on the node with token 0 with replicas on the next 2 nodes. My token space is badly divided for the data I have already inserted. Does the initial token value of the load balance target node selected by Cassandra need to be cleared or set to a specific value beforehand to accommodate the load balance data transfer? No. Would I have better luck decommissioning nodes 4, 5, 6 and trying to bootstrap these nodes one at a time with better initial token values? LoadBalance is basically sugar for decommission + bootstrap, so no. I am looking for a good way to move/split/re-balance data from nodes 1, 2, 3 to nodes 4, 5, 6 while achieving a better token space distribution. I would upgrade to the 0.6 beta and try loadbalance again. -Jonathan
Re: Adjusting Token Spaces and Rebalancing Data
On Mon, Mar 1, 2010 at 5:39 PM, Jon Graham sjclou...@gmail.com wrote: Reached an EOL or something bizzare occured. Reading from: /192.168.2.13 BufferSizeRemaining: 16 This one is harmless java.io.IOException: Value too large for defined data type at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) at sun.nio.ch.FileChannelImpl.transferToDirectly(Unknown Source) at sun.nio.ch.FileChannelImpl.transferTo(Unknown Source) at org.apache.cassandra.net.TcpConnection.stream(TcpConnection.java:226) at org.apache.cassandra.net.FileStreamTask.run(FileStreamTask.java:55) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) This one is killing you. Are you on windows? If so https://issues.apache.org/jira/browse/CASSANDRA-795 should fix it. That's in both 0.5.1 and 0.6 beta. -Jonathan
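For context on the failure mode: the "Value too large" exception comes from handing an entire large file to a single transfer call. CASSANDRA-795 addresses this on the Java side; as a hedged analogue of the general technique (not the actual patch), streaming in bounded chunks ensures no single call exceeds a platform limit:

```python
import io

# Sketch of chunked file streaming: copy in bounded chunks instead of
# one giant transfer. Chunk size is arbitrary for illustration.
def stream_file(src, dst, chunk=64 * 1024):
    """Copy src to dst in `chunk`-sized pieces; return total bytes sent."""
    sent = 0
    while True:
        buf = src.read(chunk)
        if not buf:
            return sent
        dst.write(buf)
        sent += len(buf)

src, dst = io.BytesIO(b"x" * 200000), io.BytesIO()
print(stream_file(src, dst))  # total bytes streamed
```

The same loop shape applies to `FileChannel.transferTo` in Java, which may transfer fewer bytes than requested and must be called repeatedly with an advancing offset.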
Error with Cassandra Only Example in contrib/client_only
Dear all, I tried to run ClientOnlyExample.java in contrib/client_only, but the code did not run. The error is: Exception in thread main java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:571) at java.util.ArrayList.get(ArrayList.java:349) at org.apache.cassandra.locator.RackUnawareStrategy.getNaturalEndpoints(RackUnawareStrategy.java:58) at org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:66) Could you give me some advice? Thanks a lot for your support. -- Best regards, JKnight
Re: Error with Cassandra Only Example in contrib/client_only
Could you give me the config file? Thanks On Mon, Mar 1, 2010 at 11:34 PM, Jonathan Ellis jbel...@gmail.com wrote: That means it doesn't know any of your other nodes. Probably you don't have it configured with a seed. On Mon, Mar 1, 2010 at 9:31 PM, JKnight JKnight beukni...@gmail.com wrote: Dear all, I tried to run ClientOnlyExample.java in contrib/client_only, but the code did not run. The error is: Exception in thread main java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:571) at java.util.ArrayList.get(ArrayList.java:349) at org.apache.cassandra.locator.RackUnawareStrategy.getNaturalEndpoints(RackUnawareStrategy.java:58) at org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:66) Could you give me some advice? Thanks a lot for your support. -- Best regards, JKnight -- Best regards, JKnight