Re: Consolidating records and TTL
As Tyler says, with atomic batches (which are enabled by default) the cluster will keep trying to replay the inserts / deletes. Nodes check their local batch log for failed batches, ones where the coordinator did not acknowledge successful completion, every 60 seconds. So there is a window in which it's possible for not all mutations in the batch to be completed.

This could happen when a write timeout occurs while processing a batch of 2 rows; the request CL will not have been achieved on one or more of the rows. The coordinator will leave it up to the batch log to replay the request, and the client driver will (with the default config) not retry.

You can use a model like this:

    create table ledger (
        account     int,
        tx_id       timeuuid,
        sub_total   int,
        primary key (account, tx_id)
    );

    create table account (
        account     int,
        total       int,
        last_tx_id  timeuuid,
        primary key (account)
    );

To get the total:

    select * from account where account = X;

Then get the ledger entries you need:

    select * from ledger where account = X and tx_id > last_tx_id;

This query will degrade as the partition size in the ledger table gets bigger, since it will need to read the column index (see column_index_size_in_kb in the yaml). It will use that to find the first page that contains the rows we are interested in and then read forwards to the end of the row. It's not the most efficient type of read, but if you are going to delete ledger entries this *should* be able to skip over the tombstones without reading them.

When you want to update the total in the account, write to the account table and update both the total and the last_tx_id. You can then delete ledger entries if needed. Don't forget to ensure that only one client thread is doing this at a time.

Hope that helps.

Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 5/06/2014, at 10:37 am, Tyler Hobbs ty...@datastax.com wrote:

Just use an atomic batch that holds both the insert and deletes: http://www.datastax.com/dev/blog/atomic-batches-in-cassandra-1-2

On Tue, Jun 3, 2014 at 2:13 PM, Charlie Mason charlie@gmail.com wrote:

Hi All,

I have a system that's going to make possibly several concurrent changes to a running total. I know I could use a counter for this. However I have extra metadata I can store with the changes which would allow me to replay the changes. If I use a counter and it loses some writes, I can't recover it, as I will only have its current total, not the extra metadata to know where to replay from.

What I was planning to do was write each change of the value to a CQL table with a TimeUUID as a row-level primary key as well as a partition key. Then when I need to read the running total back, I will do a query for all the changes and add them up to get the total. As there could be tens of thousands of these, I want to have a period after which they are consolidated. Most won't be anywhere near that, but a few will be, which I need to be able to support.

So I was also going to have a consolidated total table which holds the UUID of the values consolidated up to. Since I can bound the query for the recent updates by the UUID, I should be able to avoid all the tombstones. If the read encounters any changes that can be consolidated, it inserts a new consolidated value and deletes the newly consolidated changes.

What I am slightly worried about is what happens if the consolidated value insert fails but the deletes of the change records succeed. I would be left with an inconsistent total indefinitely.
I have come up with a couple of ideas:

1. I could make it require all nodes to acknowledge it before deleting the difference records.
2. Maybe I could have another period after it's consolidated but before it's deleted?
3. Is there any way I could use a TTL to allow it to be deleted after a period of time? Chances are another read would come in and fix the value.

Anyone got any other suggestions on how I could implement this?

Thanks,

Charlie M

--
Tyler Hobbs
DataStax
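For reference, a minimal sketch of the logged (atomic) batch Tyler suggests, applied to the ledger / account model above. The key values and UUID literals are placeholders, and note that in CQL of this era a DELETE needs the full primary key (no range deletes on the clustering column), so each consolidated ledger entry is deleted individually:

    BEGIN BATCH
        UPDATE account
           SET total = 1250, last_tx_id = 50554d6e-29bb-11e5-b345-feff819cdc9f
         WHERE account = 1;
        DELETE FROM ledger
         WHERE account = 1 AND tx_id = 3fa2cd80-29bb-11e5-b345-feff819cdc9f;
        DELETE FROM ledger
         WHERE account = 1 AND tx_id = 44bc5300-29bb-11e5-b345-feff819cdc9f;
    APPLY BATCH;

Because the batch is logged, either all of these mutations eventually apply or none do, which closes the "insert fails but deletes succeed" window Charlie is worried about, at the cost of the replay delay Aaron describes.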
Re: Increased Cassandra connection latency
You'll need to provide some more information, such as:

* Do you have monitoring on the Cassandra cluster that shows the request latency? DataStax OpsCenter is a good starting point.
* Is compaction keeping up? Check with nodetool compactionstats.
* Is the GCInspector logging about long running ParNew? (It only logs when it's longer than 200ms.)

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 23/05/2014, at 10:35 pm, Alexey Sverdelov alexey.sverde...@googlemail.com wrote:

Hi all,

I've noticed increased latency on our Tomcat REST service (average 30ms, max 2sec). We are using Cassandra 1.2.16 with the official DataStax Java driver v1.0.3.

Our setup:

* 2 DCs
* each DC: 7 nodes
* RF=5
* Leveled compaction

After a Cassandra restart on all nodes, the latencies are alright again (average 5ms, max 50ms).

Any thoughts are greatly appreciated.

Thanks, Alexey
Re: What % of cassandra developers are employed by Datastax?
The Cassandra Summit Bootcamp, Sep 12-13, immediately following the Summit, might be interesting for potential contributors.

I'll be there to help people get started. Looking forward to it.

While DataStax is the biggest contributor in time and patches, there are several other well known people and companies contributing and committing. IMHO the level of community activity and support over the last 5-ish years has been, and will continue to be, critical to the success of Cassandra, both Apache and DSE. Which is a polite way of saying there is *always* something an individual can do to contribute to the health of the project.

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 24/05/2014, at 7:28 am, Michael Shuler mich...@pbandjelly.org wrote:

On 05/23/2014 01:23 PM, Peter Lin wrote:

A separate but important consideration is the long term health of a project. Many Apache projects face this issue. When a project doesn't continually grow the contributors and committers, the project runs into issues in the long term. All open source projects see this; contributors and committers eventually leave, so it's important to continue to invite worthy contributors to become committers.

The Cassandra Summit Bootcamp, Sep 12-13, immediately following the Summit, might be interesting for potential contributors.

--
Michael
Re: Memory issue
As soon as it starts, the JVM gets killed because of a memory issue.

What is the memory issue that kills the JVM? The log message below is simply a warning:

    WARN [main] 2011-06-15 09:58:56,861 CLibrary.java (line 118) Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out, especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.

Is there anything in the system logs?

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 24/05/2014, at 9:17 am, Robert Coli rc...@eventbrite.com wrote:

On Fri, May 23, 2014 at 2:08 PM, opensaf dev opensaf...@gmail.com wrote:

I have a different service which controls the cassandra service for high availability.

IMO, starting or stopping a Cassandra node should never be a side effect of another system's properties. YMMV.

https://issues.apache.org/jira/browse/CASSANDRA-2356 has some related comments.

=Rob
Re: Can SSTables overlap with SizeTieredCompactionStrategy?
cold_reads_to_omit defaults to 0.0, which disables the feature, so it may not have been responsible in this case.

There are a couple of things that could explain the difference:

* After nodetool compact there was one SSTable, so one -Filter.db file rather than 8 that each had 700 entries. However 700 entries is not very many, so this would have been a small size on disk.
* Same story with the -Index.db files: they would all have had the same values, but that would not have been very big with 700 entries. However, with the wide rows, column indexes would also have been present in the -Index.db file.
* Compression may have been better. When you have one SSTable, all the columns for the row are stored sequentially, and it may simply have compressed better.

If most of the difference was in the -Data.db files I would guess compression; nodetool cfstats will tell you the compression ratio.

Hope that helps.

Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 23/05/2014, at 9:46 am, Phil Luckhurst phil.luckhu...@powerassure.com wrote:

Hi Andreas,

So does that mean it can compact the 'hottest' partitions into a new sstable, but the old sstables may not immediately be removed, so the same data could be in more than one sstable? That would certainly explain the difference we see when we manually run nodetool compact.

Thanks
Phil

Andreas Finke wrote:

Hi Phil,

I found an interesting blog entry that may address your problem: http://www.datastax.com/dev/blog/optimizations-around-cold-sstables

It seems that compaction is skipped for sstables which do not satisfy a certain read rate. Please check.

Kind regards

Andreas Finke
Java Developer
Solvians IT-Solutions GmbH

Phil Luckhurst wrote:

Definitely no TTL, and records are only written once with no deletions.

Phil

DuyHai Doan wrote:

Are you sure there is no TTL set on your data? It might explain the shrink in sstable size after compaction.
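As an aside, if the cold-SSTable optimisation ever is in play, the threshold is a per-table SizeTieredCompactionStrategy option (added around 2.0.3, if memory serves). A sketch with a hypothetical table, treating the 0.05 value as illustrative only:

    -- skip compacting sstables that serve less than 5% of reads
    ALTER TABLE ks.mytable WITH compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'cold_reads_to_omit': 0.05
    };

Setting it back to 0.0 restores the always-compact behaviour described above.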
Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)
"between 1.2.6 and 2.0.6 the setInputRange(startToken, endToken) is not working" Can you confirm or disprove?

My reading of the code is that it will consider the part of a token range (from vnodes or initial tokens) that overlaps with the provided token range.

I've already got one confirmation that in the C* version I use (1.2.15) setting limits with setInputRange(startToken, endToken) doesn't work.

Can you be more specific?

works only for ordered partitioners (in 1.2.15).

It will work with ordered and unordered partitioners equally. The difference is probably what you consider "working" to mean. The token ranges are handled the same; it's the rows in them that change.

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 20/05/2014, at 11:37 am, Anton Brazhnyk anton.brazh...@genesys.com wrote:

Hi Aaron,

I've seen the code which you describe (working with splits and intersections), but that range is derived from keys and works only for ordered partitioners (in 1.2.15). I've already got one confirmation that in the C* version I use (1.2.15) setting limits with setInputRange(startToken, endToken) doesn't work.

"between 1.2.6 and 2.0.6 the setInputRange(startToken, endToken) is not working" Can you confirm or disprove?

WBR,
Anton

From: Aaron Morton [mailto:aa...@thelastpickle.com]
Sent: Monday, May 19, 2014 1:58 AM
To: Cassandra User
Subject: Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

The limit is just ignored and the entire column family is scanned.

Which limit?

1. Am I right that there is no way to get some data limited by token range with ColumnFamilyInputFormat?

From what I understand, setting the input range is used when calculating the splits. The token ranges in the cluster are iterated, and if they intersect with the supplied range, the overlapping range is used to calculate the split, rather than the full token range.

2. Is there another way to limit the amount of data read from Cassandra with Spark and ColumnFamilyInputFormat, so that this amount is predictable (like 5% of the entire dataset)?

If you supply a token range that is 5% of the possible range of values for the token, that should be close to a random 5% sample.

Hope that helps.

Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 14/05/2014, at 10:46 am, Anton Brazhnyk anton.brazh...@genesys.com wrote:

Greetings,

I'm reading data from C* with Spark (via ColumnFamilyInputFormat) and I'd like to read just part of it, something like Spark's sample() function. Cassandra's API seems to allow it with its ConfigHelper.setInputRange(jobConfiguration, startToken, endToken) method, but it doesn't work. The limit is just ignored and the entire column family is scanned. It seems this kind of feature is just not supported, and the sources of AbstractColumnFamilyInputFormat.getSplits confirm that (IMO).

Questions:
1. Am I right that there is no way to get some data limited by token range with ColumnFamilyInputFormat?
2. Is there another way to limit the amount of data read from Cassandra with Spark and ColumnFamilyInputFormat, so that this amount is predictable (like 5% of the entire dataset)?

WBR,
Anton
Re: CQL 3 and wide rows
In a CQL 3 table the only **column** names are the ones defined in the table; in the example below there are three column names.

    CREATE TABLE keyspace.widerow (
        row_key text,
        wide_row_column text,
        data_column text,
        PRIMARY KEY (row_key, wide_row_column));

Check out, for example, http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.

Internally there may be more **cells** (as we now call the internal columns). In the example above, each value for row_key will create a single partition (as we now call internal storage engine rows). In each of those partitions there will be cells for each CQL 3 row that has the same row_key; those cells will use a Composite for the name. The first part of the composite will be the value of the wide_row_column, and the second will be the literal name of the non primary key columns.

IMHO wide partitions (storage engine rows) are more prevalent in CQL 3 than in thrift models.

But still - I do not see Iteration, so it looks to me that CQL 3 is limited when compared to CLI/Hector.

Nowadays you can do pretty much everything you can in the cli. Provide an example and we may be able to help.

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 20/05/2014, at 8:18 am, Maciej Miklas mac.mik...@gmail.com wrote:

Hi James,

Clustering is based on rows. I think that you meant not clustering columns, but compound columns. Still, all columns belong to a single table and are stored within a single folder on one computer. And it looks to me (but I'm not sure) that the CQL 3 driver loads all column names into memory, which is confusing to me. From one side we have a wide row, but we load the whole thing into RAM...

My understanding of a wide row is a row that supports millions of columns, or similar things like a map or set. In the CLI you would generate column names (or use compound columns) to simulate a set or map; in CQL 3 you would use some static names plus Map or Set structures, or you could still alter the table and have a large number of columns. But still - I do not see Iteration, so it looks to me that CQL 3 is limited when compared to CLI/Hector.

Regards,
Maciej

On 19 May 2014, at 17:30, James Campbell ja...@breachintelligence.com wrote:

Maciej,

In CQL3 wide rows are expected to be created using clustering columns. So while the schema will have a relatively small number of named columns, the effect is a wide row. For example:

    CREATE TABLE keyspace.widerow (
        row_key text,
        wide_row_column text,
        data_column text,
        PRIMARY KEY (row_key, wide_row_column));

Check out, for example, http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.

James

From: Maciej Miklas mac.mik...@gmail.com
Sent: Monday, May 19, 2014 11:20 AM
To: user@cassandra.apache.org
Subject: CQL 3 and wide rows

Hi *,

I've checked the DataStax driver code for CQL 3, and it looks like the column names for a particular table are fully loaded into memory. Is this true? Cassandra should support wide rows, meaning tables with millions of columns. Knowing that, I would expect some kind of iterator for column names. Am I missing something here?

Regards,
Maciej Miklas
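To make the iteration point concrete, here is a sketch of paging through one wide partition by slicing on the clustering column (schema from the example above; the literal values are placeholders). Newer drivers automate exactly this pattern:

    -- first page
    SELECT wide_row_column, data_column FROM widerow
     WHERE row_key = 'key1'
     LIMIT 1000;

    -- next page: continue after the last clustering value seen
    SELECT wide_row_column, data_column FROM widerow
     WHERE row_key = 'key1' AND wide_row_column > 'last_value_seen'
     LIMIT 1000;

The client never needs every cell name in memory at once; it only holds one page at a time.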
Re: idempotent counters
Does anybody else use another technique for achieving this idempotency with counters?

The idempotency problem with counters has to do with what happens when you get a timeout. If you replay the write, there is a chance of the increment being applied twice. This is inherent in the current design.

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 9/05/2014, at 1:07 am, Jabbar Azam aja...@gmail.com wrote:

Hello,

Do people use counters when they want to have idempotent operations in cassandra?

I have a use case for using a counter to check the count of objects in a partition. If the counter is more than some value, then the data in the partition is moved into two different partitions. I can't work out how to do this splitting and recover if a problem happens during modification of the counter.

http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2 explains that counters shouldn't be used if you want idempotency. I would agree, but the alternative is not very elegant. I would have to manually count the objects in a partition, then move the data, and repeat the operation if something went wrong. It is less resource intensive to read a counter value to see if a partition needs splitting than to read all the objects in a partition. The counter value can be stored in its own table, sorted in descending order of the counter value.

Does anybody else use another technique for achieving this idempotency with counters?

I'm using cassandra 2.0.7.

Thanks

Jabbar Azam
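A minimal sketch of the failure mode Aaron describes, with a hypothetical counter table:

    CREATE TABLE partition_counts (
        part text PRIMARY KEY,
        objects counter
    );

    -- Not idempotent: if the client sees a timeout it cannot tell whether
    -- the increment was applied, so retrying may count the object twice.
    UPDATE partition_counts SET objects = objects + 1 WHERE part = 'p1';

A normal (non-counter) write does not have this problem, because retrying it just rewrites the same cell with the same value, which is why the eBay article steers idempotency-sensitive designs away from counters.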
Re: Effect of number of keyspaces on write-throughput....
Each client is writing to a separate keyspace simultaneously. Hence, is there a lot of switching of keyspaces?

I would think not. If the client app is using one keyspace per connection, there should be no reason for the driver to change keyspaces.

But, I observed that when using a single keyspace, the write throughput reduced slightly to 1800 pkts/sec while I actually expected it to increase since there is no switching of contexts now. Why is this so?

That's a 5% change, which is close enough to be ignored. I would guess that the clients are not doing anything that requires the driver to change the keyspace for the connection.

Can you also kindly explain how factors like using a single v/s multiple keyspaces, distributing write requests to a single cassandra node v/s multiple cassandra nodes, etc. affect the write throughput?

Normally you have one keyspace per application. And the best data models are ones where the throughput improves as the number of nodes increases. This happens when there are no "hot spots" where every / most web requests need to read or write to a particular row.

In general you can improve throughput by having more client threads hitting more machines. You can expect 3,000 to 4,000 non-counter writes per core per node.

Hope that helps.

Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 13/05/2014, at 1:02 am, Krishna Chaitanya bnsk1990r...@gmail.com wrote:

Hello,

Thanks for the reply. Currently, each client is writing about 470 packets per second, where each packet is 1500 bytes. I have four clients writing simultaneously to the cluster. Each client is writing to a separate keyspace simultaneously. Hence, is there a lot of switching of keyspaces? The total throughput is coming to around 1900 packets per second when using multiple keyspaces. This is because there are 4 clients and each one is writing around 470 pkts/sec. But, I observed that when using a single keyspace, the write throughput reduced slightly to 1800 pkts/sec while I actually expected it to increase since there is no switching of contexts now. Why is this so? 470 packets is the maximum I can write from each client currently, since it is the limitation of my client program.

I should also mention that these tests are being run on single and double node clusters with all the write requests going only to a single cassandra server. Can you also kindly explain how factors like using a single v/s multiple keyspaces, distributing write requests to a single cassandra node v/s multiple cassandra nodes, etc. affect the write throughput? Are there any other factors that affect write throughput other than these? Because a single cassandra node seems to be able to handle all these write requests, as I am not able to see any significant improvement by distributing write requests among multiple nodes.

Thanking you.

On May 12, 2014 2:39 PM, Aaron Morton aa...@thelastpickle.com wrote:

On the homepage of libQtCassandra, it's mentioned that switching between keyspaces is costly when storing into Cassandra, thereby affecting the write throughput. Is this necessarily true for other libraries like pycassa and hector as well?

When using the thrift connection, the keyspace is a part of the connection state, so changing keyspaces requires a round trip to the server. Not hugely expensive, but it adds up if you do it a lot.
Can I increase the write throughput by configuring all the clients to store in a single keyspace instead of multiple keyspaces?

You should expect to get 3,000 to 4,000 writes per core per node. What are you getting now?

Cheers
A

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 11/05/2014, at 4:06 pm, Krishna Chaitanya bnsk1990r...@gmail.com wrote:

Hello,

I have an application that writes network packets to a Cassandra cluster from a number of client nodes. It uses the libQtCassandra library to access Cassandra. On the homepage of libQtCassandra, it's mentioned that switching between keyspaces is costly when storing into Cassandra, thereby affecting the write throughput. Is this necessarily true for other libraries like pycassa and hector as well? Can I increase the write throughput by configuring all the clients to store in a single keyspace instead of multiple keyspaces?

Thank you.
Re: Schema errors when bootstrapping / restarting node
I am able to fix this error by clearing out the schema_columns system table on disk. After that, a node can boot successfully. Does anyone have a clue what's going on here?

Something has become corrupted in the system tables, as you say. A less aggressive way to reset the local schema is to use nodetool resetlocalschema on the nodes that you suspect have problems.

    ERROR [InternalResponseStage:5] 2014-05-05 23:56:03,786 CassandraDaemon.java (line 191) Exception in thread Thread[InternalResponseStage:5,5,main]
    org.apache.cassandra.db.marshal.MarshalException: cannot parse 'column1' as hex bytes
        at org.apache.cassandra.db.marshal.BytesType.fromString(BytesType.java:69)
        at org.apache.cassandra.config.ColumnDefinition.fromSchema(ColumnDefinition.java:231)
        at org.apache.cassandra.config.CFMetaData.addColumnDefinitionSchema(CFMetaData.java:1524)
        at org.apache.cassandra.config.CFMetaData.fromSchema(CFMetaData.java:1456)

This looks like a secondary index has been incorrectly defined via thrift. I would guess the comparator for the CF is BytesType and you have defined an index on a column and specified the column name as "column1", which is not a valid hex value. You should be able to fix this by dropping the index or dropping the CF.

Hope that helps.

Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 14/05/2014, at 2:18 am, Adam Cramer a...@bn.co wrote:

Hi All,

I'm having some major issues bootstrapping a new node to my cluster. We are running 1.2.16, with vnodes enabled.

When a new node starts up (with auto_bootstrap), it selects a host ID and finds the ring successfully:

    INFO 18:42:29,559 JOINING: waiting for ring information

It successfully selects a set of tokens. Then the weird stuff begins. I get this error once, while the node is reading the system keyspace:

    ERROR 18:42:32,921 Exception in thread Thread[InternalResponseStage:1,5,main]
    java.lang.NullPointerException
        at org.apache.cassandra.utils.ByteBufferUtil.toLong(ByteBufferUtil.java:421)
        at org.apache.cassandra.cql.jdbc.JdbcLong.compose(JdbcLong.java:94)
        at org.apache.cassandra.db.marshal.LongType.compose(LongType.java:34)
        at org.apache.cassandra.cql3.UntypedResultSet$Row.getLong(UntypedResultSet.java:138)
        at org.apache.cassandra.db.SystemTable.migrateKeyAlias(SystemTable.java:199)
        at org.apache.cassandra.db.DefsTable.mergeSchema(DefsTable.java:346)
        at org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:66)
        at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:47)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

But it doesn't stop the bootstrap process.
The node successfully handshakes versions, and pauses before bootstrapping:

    INFO 18:42:59,564 JOINING: schema complete, ready to bootstrap
    INFO 18:42:59,565 JOINING: waiting for pending range calculation
    INFO 18:42:59,565 JOINING: calculation complete, ready to bootstrap
    INFO 18:42:59,565 JOINING: getting bootstrap token
    INFO 18:42:59,705 JOINING: sleeping 30000 ms for pending range setup

After 30 seconds, I get a flood of endless org.apache.cassandra.db.UnknownColumnFamilyException errors, and all other nodes in the cluster log the following endlessly:

    INFO [HANDSHAKE-/x.x.x.x] 2014-05-09 18:44:36,289 OutboundTcpConnection.java (line 418) Handshaking version with /x.x.x.x

I suspect there may be something wrong with my schemas. Sometimes while restarting an existing node, the node will fail to restart with the following error, again while reading the system keyspace:

    ERROR [InternalResponseStage:5] 2014-05-05 23:56:03,786 CassandraDaemon.java (line 191) Exception in thread Thread[InternalResponseStage:5,5,main]
    org.apache.cassandra.db.marshal.MarshalException: cannot parse 'column1' as hex bytes
        at org.apache.cassandra.db.marshal.BytesType.fromString(BytesType.java:69)
        at org.apache.cassandra.config.ColumnDefinition.fromSchema(ColumnDefinition.java:231)
        at org.apache.cassandra.config.CFMetaData.addColumnDefinitionSchema(CFMetaData.java:1524)
        at org.apache.cassandra.config.CFMetaData.fromSchema(CFMetaData.java:1456)
        at org.apache.cassandra.config.KSMetaData.deserializeColumnFamilies(KSMetaData.java:306)
        at org.apache.cassandra.db.DefsTable.mergeColumnFamilies(DefsTable.java:444)
        at org.apache.cassandra.db.DefsTable.mergeSchema(DefsTable.java:356
Re: Query returns incomplete result
Calling execute a second time runs the query a second time, and it looks like the query mutates instance state during pagination. What happens if you only call execute() once?

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 8/05/2014, at 8:03 pm, Lu, Boying boying...@emc.com wrote:

Hi, All,

I use astyanax 1.56.48 + Cassandra 2.0.6 in my test code and do a query like this:

    query = keyspace.prepareQuery(..).getKey(…)
        .autoPaginate(true)
        .withColumnRange(new RangeBuilder().setLimit(pageSize).build());

    ColumnList<IndexColumnName> result;
    result = query.execute().getResult();
    while (!result.isEmpty()) {
        // handle result here
        result = query.execute().getResult();
    }

There are 2003 records in the DB. If the pageSize is set to 1100, I get only 2002 records back, and if the pageSize is set to 3000, I can get all 2003 records back.

Does anyone know why? Is it a bug?

Thanks

Boying
Re: Datacenter understanding question
Depends on how you have set up the replication.

If you are using SimpleStrategy with RF 1, there will be a single copy of each row in the cluster.

If you are using NetworkTopologyStrategy with RF 1 in each DC, there will be two copies of each row in the cluster, one in each DC.

Hope that helps.

Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 15/05/2014, at 3:55 am, Mark Farnan devm...@petrolink.com wrote:

Yes they will.

From: ng [mailto:pipeli...@gmail.com]
Sent: Tuesday, May 13, 2014 11:07 PM
To: user@cassandra.apache.org
Subject: Datacenter understanding question

If I have a configuration of two data centers with one node each, and the replication factor is also 1, will these 2 nodes be mirrored/replicated?
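A sketch of the two cases in CQL, with hypothetical keyspace and DC names (the DC names must match the snitch configuration):

    -- single copy of each row in the whole cluster
    CREATE KEYSPACE ks_simple
      WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

    -- one copy per DC, i.e. two copies in the cluster
    CREATE KEYSPACE ks_nts
      WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'DC1': 1, 'DC2': 1};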
Re: Cassandra counter column family performance
I get a lot of TExceptions

What are the exceptions? In general counter writes are slower than normal writes, but that does not lead them to fail like that. Check the logs for errors and/or messages from the GCInspector saying that garbage collection is going on.

Cheers
A

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 13/05/2014, at 9:51 pm, Batranut Bogdan batra...@yahoo.com wrote:

Hello all,

I have a counter CF defined as:

    pk text PRIMARY KEY,
    a counter,
    b counter,
    c counter,
    d counter

After inserting a few million keys... 55 mil, the performance goes down the drain. 2-3 nodes in the cluster are on medium load, and when inserting batches of the same length, writes take longer and longer until the whole cluster becomes loaded and I get a lot of TExceptions... and the cluster becomes unresponsive.

Did anyone have the same problem? Feel free to comment and share experiences about counter CF performance.
Re: Question about READS in a multi DC environment.
In this case I was not thinking about what was happening synchronously to the client request, only that the request was hitting all nodes. You are right: when reading at LOCAL_ONE the coordinator will only be blocking for one response (the data response).

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 14/05/2014, at 11:36 am, graham sanderson gra...@vast.com wrote:

Yeah, but all the requests for data/digest are sent at the same time… responses that aren't "needed" to complete the request are dealt with asynchronously (possibly causing repair). In the original trace (which is confusing because I don't think the clocks are in sync)… I don't see anything that makes me believe it is blocking for all 3 responses - it actually does reads on all 3 nodes even if only digests are required.

On May 12, 2014, at 12:37 AM, DuyHai Doan doanduy...@gmail.com wrote:

Isn't read repair supposed to be done asynchronously in the background?

On Mon, May 12, 2014 at 2:07 AM, graham sanderson gra...@vast.com wrote:

You have a read_repair_chance of 1.0, which is probably why your query is hitting all data centers.

On May 11, 2014, at 3:44 PM, Mark Farnan devm...@petrolink.com wrote:

I'm trying to understand READ load in Cassandra across a multi-datacenter cluster (specifically why it seems to be hitting more than one DC) and hope someone can help.

From what I'm seeing here, a READ with consistency LOCAL_ONE seems to be hitting all 3 datacenters, rather than just the one I'm connected to. I see 'Read 101 live and 0 tombstoned cells' from EACH of the 3 DCs in the trace, which seems wrong. I have tried every consistency level, same result. This is also the same from my C# code via the DataStax driver (where I first noticed the issue).

Can someone please shed some light on what is occurring? Specifically, I don't want a query on one DC going anywhere near the other 2 as a rule, as in production these DCs will be across slower links.

Query: (NOTE: Whilst this uses a kairosdb table, I'm just playing with queries against it as it has 100k columns in this key for testing.)

    cqlsh:kairosdb> consistency local_one
    Consistency level set to LOCAL_ONE.
    cqlsh:kairosdb> select * from data_points where key = 0x6d61726c796e2e746573742e74656d7034000145b514a400726f6f6d3d6f6963653a limit 1000;

... Some returned data rows listed here, which I've removed (CassandraQuery.txt) ...

Query Response Trace:

    activity | timestamp | source | source_elapsed
    ---------+-----------+--------+---------------
    execute_cql3_query | 07:18:12,692 | 192.168.25.111 | 0
    Message received from /192.168.25.111 | 07:18:00,706 | 192.168.25.131 | 50
    Executing single-partition query on data_points | 07:18:00,707 | 192.168.25.131 | 760
    Acquiring sstable references | 07:18:00,707 | 192.168.25.131 | 814
    Merging memtable tombstones | 07:18:00,707 | 192.168.25.131 | 924
    Bloom filter allows skipping sstable 191 | 07:18:00,707 | 192.168.25.131 | 1050
    Bloom filter allows skipping sstable 190 | 07:18:00,707 | 192.168.25.131 | 1166
    Key cache hit for sstable 189 | 07:18:00,707 | 192.168.25.131 | 1275
    Seeking to partition beginning in data file | 07:18:00,707 | 192.168.25.131 | 1293
    Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 07:18:00,708 | 192.168.25.131 | 2173
Re: Disable reads during node rebuild
As of 2.0.7, driftx has added this long-requested feature.

Thanks
A

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 13/05/2014, at 9:36 am, Robert Coli rc...@eventbrite.com wrote:

On Mon, May 12, 2014 at 10:18 AM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote:

Is there a way to disable reads from a node while performing rebuild from another datacenter? I tried starting the node in write survey mode, but the nodetool rebuild command does not work in this mode.

As of 2.0.7, driftx has added this long-requested feature. https://issues.apache.org/jira/browse/CASSANDRA-6961

Note that it is impossible to completely close the race window here as long as writes are incoming; this functionality just dramatically shortens it.

=Rob
Re: How long are expired values actually returned?
Is this normal or am I doing something wrong?

Probably the latter. The TTL is set based on the system clock on the server, so the first thing to check is that the server times are correct. If that fails, send over the schema and the insert.

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 9/05/2014, at 2:44 am, Sebastian Schmidt isib...@gmail.com wrote:

Hi,

I'm using the TTL feature for my application. In my tests, when using a TTL of 5, the inserted rows are still returned after 7 seconds, and after 70 seconds. Is this normal or am I doing something wrong?

Kind Regards,
Sebastian
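A quick way to see what the server thinks, sketched against a hypothetical table t with partition key k and column v:

    INSERT INTO t (k, v) VALUES ('a', 'b') USING TTL 5;

    -- TTL(v) reports the remaining seconds according to the server's clock;
    -- if rows keep coming back long after 5 seconds, suspect clock skew
    SELECT k, v, TTL(v) FROM t WHERE k = 'a';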
Re: How to balance this cluster out ?
This is not a problem with the token assignments. Here are the ideal assignments from the tools/bin/token-generator script:

    DC #1:
      Node #1: 0
      Node #2: 56713727820156410577229101238628035242
      Node #3: 113427455640312821154458202477256070484

You are pretty close, but the order of the nodes in the output is a little odd; I would normally expect node 2 to be first.

First step would be to check the logs on node 1 to see if it's failing at compaction, and to check if it's holding a lot of hints. Then make sure repair is running so the data is distributed.

Hope that helps.

Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 12/05/2014, at 11:58 pm, Oleg Dulin oleg.du...@gmail.com wrote:

I have a cluster that looks like this:

    Datacenter: us-east
    ==========
    Replicas: 2

    Address  Rack  Status  State   Load       Owns    Token
                                                      113427455640312821154458202477256070484
    *.*.*.1  1b    Up      Normal  141.88 GB  66.67%  56713727820156410577229101238628035242
    *.*.*.2  1a    Up      Normal  113.2 GB   66.67%  210
    *.*.*.3  1d    Up      Normal  102.37 GB  66.67%  113427455640312821154458202477256070484

Obviously, the first node in 1b has 40% more data than the others. If I wanted to rebalance this cluster, how would I go about that? Would shifting the tokens accomplish what I need, and which tokens?

Regards,
Oleg
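For the record, the ideal RandomPartitioner tokens above are just the ring divided evenly: token_i = i * floor(2^127 / 3) for i = 0, 1, 2, which gives 0, 56713727820156410577229101238628035242 and 113427455640312821154458202477256070484.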
Re: Disable reads during node rebuild
I'm not able to replace a dead node using the ordinary procedure (bootstrap+join), and would like to rebuild the replacement node from another DC.

Normally when you want to add a new DC to the cluster, the command to use is nodetool rebuild $DC_NAME (with auto_bootstrap: false). That will get the node to stream data from $DC_NAME.

The problem is that if I start a node with auto_bootstrap=false to perform the rebuild, it automatically starts serving empty reads (CL=LOCAL_ONE).

When adding a new DC the nodes won't be processing reads, but that is not the case for you.

You should disable the client APIs to prevent the clients from calling the new node: use -Dcassandra.start_rpc=false and -Dcassandra.start_native_transport=false in cassandra-env.sh, or the appropriate settings in cassandra.yaml.

Disabling reads from other nodes will be harder. IIRC during bootstrap a different timeout (based on ring_delay) is used to detect if the bootstrapping node is down. However, if the node is running and you use nodetool rebuild, I'm pretty sure the normal gossip failure detectors will kick in. Which means you cannot disable gossip to prevent reads. Also, we would want the node to be up for writes.

What you can do is artificially set the severity of the node high so the dynamic snitch will route around it. See https://github.com/apache/cassandra/blob/cassandra-2.0/src/java/org/apache/cassandra/locator/DynamicEndpointSnitchMBean.java#L37

* Set the value to something high on the node you will be rebuilding; the number of cores on the system should do. (jmxterm is handy for this: http://wiki.cyclopsgroup.org/jmxterm)
* Check nodetool gossipinfo on the other nodes to see that the SEVERITY app state has propagated.
* Watch completed ReadStage tasks on the node you want to rebuild. If you have read repair enabled it will still get some traffic.
* Do the rebuild.
* Reset severity to 0.

Hope that helps.

Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 13/05/2014, at 5:18 am, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote:

Hello,

I'm not able to replace a dead node using the ordinary procedure (bootstrap+join), and would like to rebuild the replacement node from another DC. The problem is that if I start a node with auto_bootstrap=false to perform the rebuild, it automatically starts serving empty reads (CL=LOCAL_ONE).

Is there a way to disable reads from a node while performing rebuild from another datacenter? I tried starting the node in write survey mode, but the nodetool rebuild command does not work in this mode.

Thanks,

--
Paulo Motta

Chaordic | Platform
www.chaordic.com.br
+55 48 3232.3200
Re: Question about READS in a multi DC environment.
read_repair_chance=1.00 AND

There's your problem. When read repair is active for a read request, the coordinator will over-read, sending the request to all UP replicas. Your client request will only block waiting for the one response (the data request); the rest of the repair will happen in the background. Setting this to 1.0 means it's active across the entire cluster for every read.

Change read_repair_chance to 0 and set dclocal_read_repair_chance to 0.1, so that read repair will only happen local to the DC you are connected to.

Hope that helps.

A

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 12/05/2014, at 5:37 pm, DuyHai Doan doanduy...@gmail.com wrote:

Isn't read repair supposed to be done asynchronously in the background?

On Mon, May 12, 2014 at 2:07 AM, graham sanderson gra...@vast.com wrote:

You have a read_repair_chance of 1.0, which is probably why your query is hitting all data centers.

On May 11, 2014, at 3:44 PM, Mark Farnan devm...@petrolink.com wrote:

I'm trying to understand READ load in Cassandra across a multi-datacenter cluster (specifically why it seems to be hitting more than one DC) and hope someone can help.

From what I'm seeing here, a READ with consistency LOCAL_ONE seems to be hitting all 3 datacenters, rather than just the one I'm connected to. I see 'Read 101 live and 0 tombstoned cells' from EACH of the 3 DCs in the trace, which seems wrong. I have tried every consistency level, same result. This is also the same from my C# code via the DataStax driver (where I first noticed the issue).

Can someone please shed some light on what is occurring? Specifically, I don't want a query on one DC going anywhere near the other 2 as a rule, as in production these DCs will be across slower links.

Query: (NOTE: Whilst this uses a kairosdb table, I'm just playing with queries against it as it has 100k columns in this key for testing.)

    cqlsh:kairosdb> consistency local_one
    Consistency level set to LOCAL_ONE.
    cqlsh:kairosdb> select * from data_points where key = 0x6d61726c796e2e746573742e74656d7034000145b514a400726f6f6d3d6f6963653a limit 1000;

... Some returned data rows listed here, which I've removed (CassandraQuery.txt) ...

Query Response Trace:

    activity | timestamp | source | source_elapsed
    ---------+-----------+--------+---------------
    execute_cql3_query | 07:18:12,692 | 192.168.25.111 | 0
    Message received from /192.168.25.111 | 07:18:00,706 | 192.168.25.131 | 50
    Executing single-partition query on data_points | 07:18:00,707 | 192.168.25.131 | 760
    Acquiring sstable references | 07:18:00,707 | 192.168.25.131 | 814
    Merging memtable tombstones | 07:18:00,707 | 192.168.25.131 | 924
    Bloom filter allows skipping sstable 191 | 07:18:00,707 | 192.168.25.131 | 1050
    Bloom filter allows skipping sstable 190 | 07:18:00,707 | 192.168.25.131 | 1166
    Key cache hit for sstable 189 | 07:18:00,707 | 192.168.25.131 | 1275
    Seeking to partition beginning in data file | 07:18:00,707 | 192.168.25.131 | 1293
    Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 07:18:00,708 | 192.168.25.131 | 2173
    Merging data from memtables and 1 sstables | 07:18:00,708 | 192.168.25.131 | 2195
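The change Aaron suggests is a table-level setting; a sketch against the kairosdb table from the trace (the 0.1 value is the suggestion from the reply, adjust to taste):

    ALTER TABLE kairosdb.data_points
        WITH read_repair_chance = 0
        AND dclocal_read_repair_chance = 0.1;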
Re: Cassandra MapReduce/Storm/ etc
Is there a good blog/article that describes how to use MapReduce on a Cassandra table?

The best way to get into Cassandra and Hadoop is to play with Cassandra DSE. It's free for development, costs for production, and is an easy way to learn about the Hadoop integration without having to worry about the installation process. http://www.datastax.com/docs/datastax_enterprise3.1/solutions/about_hadoop

If a database table is the input source for MapReduce or Storm, for me, in the simple case, this translates to a full table scan of the input table, which can time out and is generally not a recommended access pattern in Cassandra.

The Hadoop integration is token aware: it splits the tasks to run locally on each node, and the tasks then scan over the token range local to the node.

Hope that helps.

A

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 9/05/2014, at 9:43 am, Manoj Khangaonkar khangaon...@gmail.com wrote:

Hi,

Searching for Cassandra with MapReduce, I am finding that the search results are really dated -- from version 0.7, 2010/2011. Is there a good blog/article that describes how to use MapReduce on a Cassandra table?

From my naive understanding, Cassandra is all about partitioning. Querying is based on partition key + clustered column(s). Input to MapReduce is a sequence of key/value pairs. For Storm it is a stream of tuples.

If a database table is the input source for MapReduce or Storm, for me, in the simple case, this translates to a full table scan of the input table, which can time out and is generally not a recommended access pattern in Cassandra.

My initial reaction is that if I need to process data with MapReduce or Storm, reading it from Cassandra might not be the optimal way. Storing the output to Cassandra, however, does make sense.

If anyone has links to blogs or personal experience in this area, I would appreciate it if you can share.

regards
Re: Really need some advices on large data considerations
We've learned that compaction strategy is an important point, because we've run into 'no space' trouble with the 'size tiered' compaction strategy.

If you want to get the most out of the raw disk space, LCS is the way to go; remember it uses approximately twice the disk IO.

From our experience changing any settings/schema while a large cluster is online and has been running for some time is really, really a pain.

Which parts in particular? Updating the schema or the config? OpsCenter has a rolling restart feature, which can be handy when chef / puppet is deploying the config changes. Schema / gossip can take a little while to propagate with a high number of nodes.

On a modern version you should be able to run 2 to 3 TB per node, maybe higher. The biggest concerns are going to be repair (the changes in 2.1 will help) and bootstrapping. I'd recommend testing a smaller cluster, say 12 nodes, with a high load per node, 3TB.

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 9/05/2014, at 12:09 pm, Yatong Zhang bluefl...@gmail.com wrote:

Hi,

We're going to deploy a large Cassandra cluster at the PB level. Our scenario would be:

1. Lots of writes, about 150 writes/second on average, and about 300K size per write.
2. Relatively very few reads.
3. Our data will never be updated.
4. But we will delete old data periodically to free space for new data.

We've learned that compaction strategy is an important point, because we've run into 'no space' trouble with the 'size tiered' compaction strategy. We've read http://wiki.apache.org/cassandra/LargeDataSetConsiderations - is this enough, and is it up to date? From our experience changing any settings/schema while a large cluster is online and has been running for some time is really, really a pain. So we're gathering more info and expecting some more practical suggestions before we set up the cassandra cluster.

Thanks, and any help is greatly appreciated.
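For what it's worth, the strategy switch is a per-table change; a sketch with a hypothetical table (the sstable_size_in_mb value is an assumption to test; larger sstables generally mean fewer files at this data density):

    ALTER TABLE ks.events WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'sstable_size_in_mb': 160
    };

Be aware that on an existing table this triggers recompaction of all data into levels, which on multi-TB nodes is a significant chunk of the extra IO Aaron mentions.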
Re: Understanding about Cassandra read repair with QUORUM
I have following understanding about Cassandra read repair:

Read repair is an automatic process that reads from more nodes than necessary during a normal read, and checks and repairs differences in the background. It's different to "repair", or anti-entropy, which you run with nodetool repair.

• If we write with QUORUM and read with QUORUM then we do not need to externally (nodetool) trigger read repair.

You normally still want to run repair, because it's the way to ensure tombstones are distributed.

• Since we are reading + writing with QUORUM then it is safe to set read_repair_chance=0 and dclocal_read_repair_chance=0 in the column family definition.

It's safe; read repair does not affect consistency. It's designed to reduce the chance that the server will need to repair an inconsistency during a read for a client.

Cheers

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 12/01/2014, at 11:31 am, chovatia jaydeep chovatia_jayd...@yahoo.co.in wrote:

Hi,

I have the following understanding about Cassandra read repair:

• If we write with QUORUM and read with QUORUM then we do not need to externally (nodetool) trigger read repair.
• Since we are reading + writing with QUORUM then it is safe to set read_repair_chance=0 and dclocal_read_repair_chance=0 in the column family definition.

Can someone please clarify?

-jaydeep
Re: Problem in running cassandra-2.0.4 trigger example
But i am getting error: Bad Request: Key may not be empty

My guess is the trigger is trying to create a row with an empty key. Add some logging to the trigger to see what it's doing.

Cheers

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 12/01/2014, at 11:50 am, Thunder Stumpges thunder.stump...@gmail.com wrote:

I'm not sure if this is your issue, as I have not used these triggers before, but shouldn't the invertedindex table have a different primary key than the primary table (either f2 or f3)?

-Thunder

On Jan 11, 2014, at 12:03 PM, Vidit Asthana vidit.astha...@gmail.com wrote:

I am new to cassandra and am trying to run the trigger example provided by cassandra on a pseudo cluster, using the instructions provided on https://github.com/apache/cassandra/tree/cassandra-2.0/examples/triggers

But I am getting the error: Bad Request: Key may not be empty

Can someone tell me if my CREATE TABLE is proper? What else could be wrong? I am doing the following using cqlsh.

• CREATE KEYSPACE keyspace1 WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
• use keyspace1;
• CREATE TABLE invertedindex ( f1 varchar , f2 varchar, f3 varchar, PRIMARY KEY(f1));
• CREATE TABLE table1 ( f1 varchar , f2 varchar, f3 varchar, PRIMARY KEY(f1));
• CREATE TRIGGER mytrigger ON table1 USING 'org.apache.cassandra.triggers.InvertedIndex';
• insert into table1 (f1,f2,f3) values ('aaa','bbb','ccc');

This is what I get in system.log:

    INFO [Thrift:1] 2014-01-11 14:48:09,875 InvertedIndex.java:67 - loaded property file, InvertedIndex.properties

This is the content of the conf/InvertedIndex.properties file:

    keyspace=keyspace1
    columnfamily=invertedindex

Thanks in advance.
Vidit
Re: Need ur expertise on Cassandra issue!!
Look at the logs for the Cassandra servers. Are nodes going down? Are there any other errors? Check for log messages from the GCInspector; if there is a lot of GC, nodes will start to flap up and down.

It sounds like there is a stability issue with Cassandra; look there first to make sure it is always available.

If you want to load 150GB of data a day from Hadoop into Cassandra, I would suggest creating SSTables in Hadoop and bulk loading them into Cassandra. This article is old but it's still relevant: http://www.datastax.com/dev/blog/bulk-loading

Hope that helps.

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 12/01/2014, at 3:53 pm, Arun toarun...@gmail.com wrote:

Hi,

I need your help and suggestions for our production issue.

Details:
--------

We have 40 CFs in the Cassandra cluster, one for each datasource, like below:

    MusicData - keyspace
    spotify_1 - column family - active
    spotify_2 - column family - standby

Daily we load data into this cluster using the following process:

1. Astyanax library to delete the inactive version of the CF (here spotify_2)
2. Hadoop bulkload JAR pushes data from Hadoop to Cassandra into spotify_2

Data inflow rate is 150GB per day.

DataStax Community version 1.1.9, with 9 nodes of 4TB, built on OpenStack with a high-end config.

Problem:
--------

We're encountering the problem every week: the Hadoop bulkload program is failing with

    java.io.IOException: Too many hosts failed: [/10.240.171.80, /10.240.171.76, /10.240.171.74, /10.240.171.73]
        at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:243)

I can provide more details about the error if you need. From our initial analysis we understood that when we delete, the space for tombstoned blocks is reclaimed in the compaction process, so we increased storage capacity by adding new nodes, but the problem still persists.

We need your expertise to comment on this production issue. Please let me know if you need any information!

I will wait for your response!

-Arun
Re: upgrade from cassandra 1.2.3 - 1.2.13 + start using SSL
Can you confirm that, cause we'll add a new DC with version 1.2.13 (read-only) and we'll upgrade other DCs to 1.2.13 weeks later. We made some tests and didn't notice anything. But we didn't test a node failure.

Depending on the other version, you may not be able to run repair. All nodes have to use the same file version; the file versions are here: https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/io/sstable/Descriptor.java#L52

Cheers

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 14/01/2014, at 7:30 am, Robert Coli rc...@eventbrite.com wrote:

On Mon, Jan 13, 2014 at 3:38 AM, Cyril Scetbon cyril.scet...@free.fr wrote:

Can you confirm that, cause we'll add a new DC with version 1.2.13 (read-only) and we'll upgrade other DCs to 1.2.13 weeks later. We made some tests and didn't notice anything. But we didn't test a node failure.

In general, adding nodes at a new version is not supported, whether a single node or an entire DC of nodes.

=Rob
Re: various Cassandra performance problems when CQL3 is really used
I don't know. How do I find out? The only mention of query plans in Cassandra I found is the article on your site, from 2011, considering version 0.8.

See the help for TRACE in cqlsh.

My general approach is to solve problems with the read path by making changes to the write path. So I would normally say: make a new table to store the data you want to read, or change the layout of a table to be more flexible.

Can you provide the table definition and the query you are using?

Cheers

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 15/01/2014, at 9:48 am, Ondřej Černoš cern...@gmail.com wrote:

Hi,

thanks for the answer and sorry for the delay. Let me answer inline.

On Wed, Dec 18, 2013 at 4:53 AM, Aaron Morton aa...@thelastpickle.com wrote:

* select id from table where token(id) > token(some_value) and secondary_index = other_val limit 2 allow filtering;

Filtering absolutely kills the performance. On a table populated with 130,000 records, a single node Cassandra server (on my i7 notebook, 2GB of JVM heap) and a secondary index built on a column with low cardinality of its value set, this query takes 156 seconds to finish.

Yes, this is why you have to add allow_filtering. You are asking the nodes to read all the data that matches and filter in memory; that's a SQL type operation. Your example query is somewhat complex and I doubt it could get decent performance; what does the query plan look like?

I don't know. How do I find out? The only mention of query plans in Cassandra I found is the article on your site, from 2011, considering version 0.8.

The example query gets computed in a fraction of the time if I perform just the fetch of all rows matching the token function and perform the filtering client side.

IMHO you need to do further de-normalisation; you will get the best performance when you select rows by their full or partial primary key.

I denormalise all the way I can. The problem is I need to support paging and filtering at the same time. The API I must support allows filtering by example and paging - so how should I denormalise? Should I somehow manage pages of primary row keys manually? Or should I have a manual secondary index and page somehow in the denormalised wide row?

The trouble goes even further; even this doesn't perform well:

    select id from table where token(id) > token(some_value) and pk_cluster = 'val' limit N;

where id and pk_cluster are the primary key (CQL3 table). I guess this should be an ordered row query and ordered column slice query, so where is the problem with performance?

By the way, the performance is an order of magnitude better if this patch is applied:

That looks like it's tuned to your specific need; it would ignore the max results included in the query.

It is tuned, it only demonstrates that the heuristics don't work well.

* select id from table;

As we saw in the trace log, the query - although it queries just row ids - scans all columns of all the rows and (probably) compares TTL with current time (?) (we saw hundreds of thousands of gettimeofday(2) calls). This means that if the table somehow mixes wide and narrow rows, the performance suffers horribly.

Selecting all rows from a table requires a range scan, which reads all rows from all nodes. It should never be used in production.

The trouble is I just need to perform it, sometimes.
I know what the problem with the query is, but I have just a couple of hundred thousand records - 150.000 - the datasets can all be stored in memory, SSTables can be fully mmapped. There is no reason for this query to be slow in this case. Not sure what you mean by “scans all columns from all rows” a select by column name will use a SliceByNamesReadCommand which will only read the required columns from each SSTable (it normally short circuits though and reads from less). The query should fetch only IDs, it checks TTLs of columns though. That is the point. Why does it do it? If there is a TTL the ExpiringColumn.localExpirationTime must be checked, if there is no TTL it will not be checked. It is a standard CQL3 table with ID, a couple of columns and a CQL3 collection. I didn't do anything with TTL on the table and its columns. As Cassandra checks all the columns in selects, performance suffers badly if the collection is of any interesting size. This is not true, could you provide an example where you think this is happening ? We saw it in the trace log. It happened in the select ID from table query. The table had a collection column. Additionally, we saw various random irreproducible freezes, high CPU consumption when nothing happens (even with trace log level set no activity was reported) and highly unpredictable performance characteristics after nodetool flush and/or major compaction.
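For reference, tracing is how you get the closest thing to a query plan in 1.2+. A minimal cqlsh session (table and values are hypothetical):
cqlsh> TRACING ON;
cqlsh> SELECT id FROM mytable WHERE token(id) > token(5) LIMIT 10;
The trace printed after each query shows which nodes and sstables were touched, how many rows and tombstones were scanned, and where the time went.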
Re: Cassandra mad GC
c3.4xlarge Long ParNew pauses on a machine like this are not normal. Do you have a custom comparator or are you using triggers ? Do you have a data model that creates a lot of tombstones ? Try to return the settings to default and then tune from there, that includes returning to the default JVM GC settings. If for no other reason than other people will be able to offer advice. Have you changed the compaction_throughput ? Put it back if you have. If you have enabled multi_threaded compaction disable it. Consider setting concurrent_compactors to 4 or 8 to reduce compaction churn. If you have increased in_memory_compaction_limit put it back. Cassandra logs Can you provide some of the log messages from GCInspector ? How long are the pauses ? Is there a lot of CMS or ParNew ? Do you have monitoring in place ? Is CMS able to return the heap to a low value e.g. 3GB ? cpu load 1000% Is this all from cassandra ? try jvmtop (https://code.google.com/p/jvmtop/) to see what cassandra threads are doing. It’s a lot easier to tune a system with fewer non default settings. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 16/01/2014, at 8:22 am, Arya Goudarzi gouda...@gmail.com wrote: It is not a good idea to change settings without identifying the root cause. Chances are what you did masked the problem a bit for you, but the problem is still there, isn't it? On Wed, Jan 15, 2014 at 1:11 AM, Dimetrio dimet...@flysoft.ru wrote: I set G1 because GC started to work wrong (dropped messages) with the standard GC settings. In my opinion, Cassandra started to work more stably with G1 (it's getting fewer timeouts now) but it's not ideal yet. I just want cassandra to work fine. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-mad-GC-tp7592248p7592257.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com. -- Cheers, -Arya
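For reference, the stock 1.2-era GC settings in cassandra-env.sh look roughly like this (check your own copy, exact values vary by version):
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
Starting from these defaults makes it much easier for other people to reason about your pauses.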
Re: Nodetool ring
Owns is how much of the entire, cluster wide, data set the node has. In both your examples every node has a full copy of the data. If you have 6 nodes and RF 3 they would each show 50% (in general each node holds roughly RF / number-of-nodes of the data). Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 3/01/2014, at 6:00 pm, Vivek Mishra mishra.v...@gmail.com wrote: Yes. On Fri, Jan 3, 2014 at 12:57 AM, Robert Coli rc...@eventbrite.com wrote: On Thu, Jan 2, 2014 at 10:48 AM, Vivek Mishra mishra.v...@gmail.com wrote: Thanks for your quick reply. Even with 2 data centers with 3 data nodes each I am seeing 100% on both data center nodes. Do you have RF=3 in both? =Rob
Re: Cassandra consuming too much memory in ubuntu as compared to within windows, same machine.
When Xms and Xmx are the same like this the JVM allocates all the memory, and then on Linux cassandra will ask the OS to lock that memory so it cannot be paged out. On Windows it’s probably getting paged out. If you only have 4GB on the box, you probably do not want to run cassandra with 4GB. Try 2GB so there is room for other things. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 7/01/2014, at 9:03 am, Erik Forkalsud eforkals...@cj.com wrote: On 01/04/2014 08:04 AM, Ertio Lew wrote: ... my dual boot 4GB(RAM) machine. ... -Xms4G -Xmx4G - You are allocating all your ram to the java heap. Are you using the same JVM parameters on the Windows side? You can try to lower the heap size or add ram to your machine. - Erik -
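For reference, the heap is normally set in conf/cassandra-env.sh rather than by editing JVM flags directly; a sketch for a 4GB box (values are illustrative, not a sizing recommendation):
MAX_HEAP_SIZE="2G"
HEAP_NEWSIZE="200M"
cassandra-env.sh passes MAX_HEAP_SIZE as both -Xms and -Xmx, which is why the full amount is allocated at startup.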
Re: massive spikes in read latency
The spikes in latency don’t seem to be correlated to an increase in reads. The cluster is usually handling a maximum workload of 4200 reads/sec per node, with writes being significantly less, at ~200/sec per node. Usually it will be fine with this, with read latencies at around 3.5-10 ms/read, but once or twice an hour the latencies on the 3 nodes will shoot through the roof. Could there be errant requests coming in from the app ? e.g. something asking for 1’000s of columns ? Or something that hits a row that has a lot of tombstones ? Take a look at nodetool cfhistograms to see if you have any outlier wide rows. Also the second column, sstables, will tell you how many sstables were touched by reads. High numbers, above 4, let you know there are some wide rows out there. In 2.0 and later 1.2 releases nodetool cfstats will also include information about the number of tombstones touched in a read. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 8/01/2014, at 2:15 am, Jason Wee peich...@gmail.com wrote: /** * Verbs it's okay to drop if the request has been queued longer than the request timeout. These * all correspond to client requests or something triggered by them; we don't want to * drop internal messages like bootstrap or repair notifications. */ public static final EnumSet<Verb> DROPPABLE_VERBS = EnumSet.of(Verb.BINARY, Verb._TRACE, Verb.MUTATION, Verb.READ_REPAIR, Verb.READ, Verb.RANGE_SLICE, Verb.PAGED_RANGE, Verb.REQUEST_RESPONSE); The short term solution would probably be to increase the timeout in your yaml file, but I suggest you get the monitoring graphs (ping internode, block io) ready so it will give a better indication of what the exact problem might be. Jason On Tue, Jan 7, 2014 at 2:30 AM, Blake Eggleston bl...@shift.com wrote: That’s a good point. CPU steal time is very low, but I haven’t observed internode ping times during one of the peaks, I’ll have to check that out. Another thing I’ve noticed is that cassandra starts dropping read messages during the spikes, as reported by tpstats. This indicates that there are too many queries for cassandra to handle. However, as I mentioned earlier, the spikes aren’t correlated to an increase in reads. On Jan 5, 2014, at 3:28 PM, Blake Eggleston bl...@shift.com wrote: Hi, I’ve been having a problem with 3 neighboring nodes in our cluster having their read latencies jump up to 9000ms - 18000ms for a few minutes (as reported by opscenter), then come back down. We’re running a 6 node cluster, on AWS hi1.4xlarge instances, with cassandra reading and writing to 2 raided ssds. I’ve added 2 nodes to the struggling part of the cluster, and aside from the latency spikes shifting onto the new nodes, it has had no effect. I suspect that a single key that lives on the first stressed node may be being read from heavily. The spikes in latency don’t seem to be correlated to an increase in reads. The cluster is usually handling a maximum workload of 4200 reads/sec per node, with writes being significantly less, at ~200/sec per node. Usually it will be fine with this, with read latencies at around 3.5-10 ms/read, but once or twice an hour the latencies on the 3 nodes will shoot through the roof. The disks aren’t showing serious use, with read and write rates on the ssd volume at around 1350 kBps and 3218 kBps, respectively. Each cassandra process is maintaining 1000-1100 open connections. GC logs aren’t showing any serious gc pauses.
Any ideas on what might be causing this? Thanks, Blake
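For reference, the checks suggested above are run per node, e.g. nodetool cfhistograms <keyspace> <column_family> for the sstables-per-read and row size distributions, and nodetool cfstats for the per-CF tombstone information on later 1.2 releases and 2.0.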
Re: nodetool cleanup / TTL
Is there some other mechanism for forcing expired data to be removed without also compacting? (major compaction having obvious problematic side effects, and user defined compaction being significant work to script up). Tombstone compactions may help here https://issues.apache.org/jira/browse/CASSANDRA-3442 They cannot be forced, but if there is nothing else to compact they will look for single sstables to compact. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 8/01/2014, at 11:18 pm, Sylvain Lebresne sylv...@datastax.com wrote: Is there some other mechanism for forcing expired data to be removed without also compacting? (major compaction having obvious problematic side effects, and user defined compaction being significant work to script up). Online scrubs will, as a side effect, purge expired tombstones *when possible* (even expired data cannot be removed if it possibly overwrites some older data in some other sstable than the one scrubbed). Please don't take that as me saying that this is a guarantee of scrub: it is just one of its current implementation side effects and it might very well change tomorrow. -- Sylvain
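For reference, the single-sstable tombstone compactions from CASSANDRA-3442 are tuned through compaction subproperties; a hedged example, shown with what I believe are the default values:
ALTER TABLE mykeyspace.mytable WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'tombstone_threshold': 0.2, 'tombstone_compaction_interval': 86400};
tombstone_threshold is the estimated droppable-tombstone ratio above which a lone sstable becomes a candidate, and the interval stops the same sstable being recompacted continuously.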
Re: upgrade from cassandra 1.2.3 -> 1.2.13 + start using SSL
We avoid mixing versions for a long time, but we always upgrade one node and check the application is happy before proceeding, e.g. wait for 30 minutes before upgrading the others. If you snapshot before upgrading, and have to roll back after 30 minutes, you can roll back to the snapshot and use repair to fix the data on disk. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 9/01/2014, at 7:24 am, Robert Coli rc...@eventbrite.com wrote: On Wed, Jan 8, 2014 at 1:17 AM, Jiri Horky ho...@avast.com wrote: I am specifically interested in whether it is possible to upgrade just one node and keep it running like that for some time, i.e. if the gossip protocol is compatible in both directions. We are a bit afraid to upgrade all nodes to 1.2.13 at once in case we would need to roll back. This is not officially supported. It will probably work for these particular versions, but it is not recommended. The most serious potential issue is an inability to replace the new node if it fails. There's also the problem of not being able to repair until you're back on the same versions. And other, similar, undocumented edge cases... =Rob
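For reference, the per-node snapshot mentioned above is just (tag name hypothetical): nodetool snapshot -t pre-1.2.13 - snapshots are hard links inside each column family's snapshots directory, so they are cheap to take and only grow as compaction replaces the original files.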
Re: MUTATION messages dropped
I ended up changing memtable_flush_queue_size to be large enough to contain the biggest flood I saw. As part of the flush process the “Switch Lock” is taken to synchronise around the commit log. This is a reentrant Read Write lock, the flush path takes the write lock and the write path takes the read part. When flushing a CF the write lock is taken, the commit log is updated, and the memtable is added to the flush queue. If the queue is full then the write lock will be held, blocking the write threads from taking the read lock. There are a few reasons why the queue may be full, the simple one is the disk IO is not fast enough. Others are that the commit log segments are too small, there are lots of CF’s and/or lots of secondary indexes, or nodetool flush is called frequently. Increasing the size of the queue is a good work around, and the correct approach if you have a lot of CF’s and/or secondary indexes. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 21/12/2013, at 6:03 am, Ken Hancock ken.hanc...@schange.com wrote: I ended up changing memtable_flush_queue_size to be large enough to contain the biggest flood I saw. I monitored tpstats over time using a collection script and an analysis script that I wrote to figure out what my largest peaks were. In my case, all my mutation drops correlated with hitting the maximum memtable_flush_queue_size and then mutation drops stopped as soon as the queue size dropped below the max. I threw the scripts up on github in case they're useful... https://github.com/hancockks/tpstats On Fri, Dec 20, 2013 at 1:08 AM, Alexander Shutyaev shuty...@gmail.com wrote: Thanks for your answers. srmore, We are using v2.0.0. As for GC I guess it does not correlate in our case, because we had cassandra running 9 days under production load with no dropped messages, and I guess that during this time there were a lot of GCs. Ken, I've checked the values you indicated. Here they are: node1 6498 node2 6476 node3 6642 I guess this is not good :) What can we do to fix this problem? 2013/12/19 Ken Hancock ken.hanc...@schange.com We had issues where the number of CF families that were being flushed would align and then block writes for a very brief period. If that happened when a bunch of writes came in, we'd see a spike in Mutation drops. Check nodetool tpstats for FlushWriter all time blocked. On Thu, Dec 19, 2013 at 7:12 AM, Alexander Shutyaev shuty...@gmail.com wrote: Hi all! We've had a problem with cassandra recently. We had 2 one-minute periods when we got a lot of timeouts on the client side (the only timeouts during the 9 days we have been using cassandra in production). In the logs we've found corresponding messages saying something about MUTATION messages dropped. Now, the official faq [1] says that this is an indicator that the load is too high. We've checked our monitoring and found out that the 1-minute average cpu load had a local peak at the time of the problem, but it was like 0.8 against the usual 0.2, which I guess is nothing for a 2 core virtual machine. We've also checked java threads - there was no peak there and their count was reasonable ~240-250. Can anyone give us a hint - what should we monitor to see this high load and what should we tune to make it acceptable?
Thanks in advance, Alexander [1] http://wiki.apache.org/cassandra/FAQ#dropped_messages -- Ken Hancock | System Architect, Advanced Advertising SeaChange International 50 Nagog Park Acton, Massachusetts 01720 ken.hanc...@schange.com | www.schange.com | NASDAQ:SEAC Office: +1 (978) 889-3329
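For reference, the setting discussed in this thread lives in cassandra.yaml; a hedged example (the right value is workload specific, 4 is the usual default):
memtable_flush_queue_size: 8
The yaml comments suggest it be at least the maximum number of secondary indexes on any single CF, which matches the advice above.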
Re: Astyanax - multiple key search with pagination
You will need to paginate the list of keys to read in your app. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 21/12/2013, at 12:58 pm, Parag Patel parag.pa...@fusionts.com wrote: Hi, I’m using Astyanax and trying to do a search for multiple keys with pagination. I tried “.getKeySlice” with a list of primary keys, but it doesn’t allow pagination. Does anyone know how to tackle this issue with Astyanax? Parag
Re: Broken pipe with Thrift
One question, which is confusing: is it a server side issue or client side? Check the server log for errors to make sure it’s not a server side issue. Also check if there could be something in the network that is killing long lived connections. Check the thrift lib the client is using is the same as the one in the cassandra lib on the server. Can you do some simple tests using cqlsh from the client machine? That would eliminate the client driver. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 25/12/2013, at 4:35 am, Steven A Robenalt srobe...@stanford.edu wrote: In our case, the issue was on the server side, but since you're on the 1.2.x branch, it's not likely to be the same issue. Hopefully, someone else who is using the 1.2.x branch will have more insight than I do. On Mon, Dec 23, 2013 at 11:52 PM, Vivek Mishra mishra.v...@gmail.com wrote: Hi Steven, One question, which is confusing: is it a server side issue or client side? -Vivek On Tue, Dec 24, 2013 at 12:30 PM, Vivek Mishra mishra.v...@gmail.com wrote: Hi Steven, Thanks for your reply. We are using version 1.2.9. -Vivek On Tue, Dec 24, 2013 at 12:27 PM, Steven A Robenalt srobe...@stanford.edu wrote: Hi Vivek, Which release are you using? We had an issue with 2.0.2 that was solved by a fix in 2.0.3. On Mon, Dec 23, 2013 at 10:47 PM, Vivek Mishra mishra.v...@gmail.com wrote: Also to add. It works absolutely fine on a single node. -Vivek On Tue, Dec 24, 2013 at 12:15 PM, Vivek Mishra mishra.v...@gmail.com wrote: Hi, I have a 6 node, 2DC cluster setup. I have configured consistency level to QUORUM. But very often I am getting Broken pipe com.impetus.client.cassandra.CassandraClientBase (CassandraClientBase.java:1926) - Error while executing native CQL query Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147) at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156) at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:65) at org.apache.cassandra.thrift.Cassandra$Client.send_execute_cql3_query(Cassandra.java:1556) at org.apache.cassandra.thrift.Cassandra$Client.execute_cql3_query(Cassandra.java:1546) I am simply reading a few records from a column family (not a huge amount of data). Connection pooling and socket time out are properly configured. I have even modified read_request_timeout_in_ms request_timeout_in_ms write_request_timeout_in_ms in cassandra.yaml to higher values. Any idea? Is it an issue at the server side or with the client API? -Vivek -- Steve Robenalt Software Architect HighWire | Stanford University 425 Broadway St, Redwood City, CA 94063 srobe...@stanford.edu http://highwire.stanford.edu -- Steve Robenalt Software Architect HighWire | Stanford University 425 Broadway St, Redwood City, CA 94063 srobe...@stanford.edu http://highwire.stanford.edu
Re: querying time series from hadoop
So now I will try to patch my cassandra 1.2.11 installation but I just wanted to ask you guys first, if there is any other solution that does not involve a release. That patch in CASSANDRA-6311 is for 2.0, you cannot apply it to 1.2. but when I am using the java driver, the driver already uses the row key for token statements and I cannot execute the query above, therefore it does a full scan of rows. The ColumnFamilyRecordReader is designed to read lots of rows, not a single row. You should be able to use the java driver from a hadoop task though to read a single row. Can you provide some more info on what you are doing ? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 26/12/2013, at 9:56 pm, mete efk...@gmail.com wrote: Hello folks, I have come up with a basic time series cql schema based on the articles here: http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra so simply put it's something like: rowkey, timestamp, col3, col4 etc... where rowkey and timestamp are compound keys. Where I am having issues is how to efficiently query this data structure. When I use cqlsh and query it is perfectly fine: select * from table where rowkey='row key' and date > xxx and date <= yyy but when I am using the java driver, the driver already uses the row key for token statements and I cannot execute the query above, therefore it does a full scan of rows. The issue that I am having is discussed here: http://stackoverflow.com/questions/19189649/composite-key-in-cassandra-with-pig I have gone through the relevant jira issues 6151 and 6311. This behaviour is supposed to be fixed in 2.0.x but so far it is not there. So now I will try to patch my cassandra 1.2.11 installation but I just wanted to ask you guys first, if there is any other solution that does not involve a release. I assume that this is somewhat of a common use case, the articles I referred to seem to be old enough and unless I am missing something obvious I cannot query this schema efficiently with the current version (1.2.x or 2.0.x). Does anyone have a similar issue? Any pointers are welcome. Regards Mete
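For reference, a sketch of the kind of schema being discussed, with hypothetical column names (rowkey partitions the data, the time column is the clustering key):
CREATE TABLE events (
rowkey text,
date timestamp,
col3 text,
col4 text,
PRIMARY KEY (rowkey, date)
);
With this layout the cqlsh query above is an ordered slice within a single partition; it is the Hadoop ColumnFamilyRecordReader path that turns it into a full range scan.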
Re: Offline migration: Random -> Murmur
I wrote a small (yet untested) utility, which should be able to read SSTable files from disk and write them into a cassandra cluster using Hector. Consider using the SSTableSimpleUnsortedWriter (see http://www.datastax.com/dev/blog/bulk-loading) to create the SSTables, you can then bulk load them into the destination system. This will be much faster. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 29/12/2013, at 6:26 am, Edward Capriolo edlinuxg...@gmail.com wrote: Internally we have a tool that does get range slice on the source cluster and replicates to destination. Remember that writes are idempotent. Our tool can optionally only replicate data between two timestamps, allowing incremental transfers. So if you get your application writing new data to both clusters you can run a range scanning program to copy all the data. On Monday, December 23, 2013, horschi hors...@gmail.com wrote: Interesting you even dare to do a live migration :-) Do you do all Murmur-writes with the timestamp from the Random-data? So that all migrated data is written with timestamps from the past. On Mon, Dec 23, 2013 at 3:59 PM, Rahul Menon ra...@apigee.com wrote: Christian, I have been planning to migrate my cluster from random to murmur3 in a similar manner. I intend to use pycassa to read and then write to the newer cluster. My only concern would be ensuring the consistency of already migrated data as the cluster ( with random ) would be constantly serving the production traffic. I was able to do this on a non prod cluster, but production is a different game. I would also like to hear more about this, especially if someone was able to successfully do this. Thanks Rahul On Mon, Dec 23, 2013 at 6:45 PM, horschi hors...@gmail.com wrote: Hi list, has anyone ever tried to migrate a cluster from Random to Murmur? We would like to do so, to have a more standardized setup. I wrote a small (yet untested) utility, which should be able to read SSTable files from disk and write them into a cassandra cluster using Hector. This migration would be offline of course and would only work for smaller clusters. Any thoughts on the topic? kind regards, Christian PS: The reason for doing so is not performance. It is to simplify operational stuff for the years to come. :-) -- Sorry this was sent from mobile. Will do less grammar and spell check than usual.
Re: cassandra monitoring
JMX is doing its thing on the cassandra node and is running on port 8081 Have you set the JMX port for the cluster in Ops Centre ? The default JMX port has been 7199 for a while. Off the top of my head it’s in the same area where you specify the initial nodes in the cluster, maybe behind an “Advanced” button. The Ops Centre agent talks to the server to find out what JMX port it should use to talk to the local Cassandra install. Also check the logs in /var/log/datastax Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 30/12/2013, at 2:21 am, Tim Dunphy bluethu...@gmail.com wrote: Hi all, I'm attempting to configure the datastax agent so that opscenter can monitor cassandra. I am running cassandra 2.0.3 with opscenter-4.0.1-2.noarch. Cassandra is running on a centos 5.9 host and the opscenter host is running on centos 6.5 A ps shows the agent running [root@beta:~] #ps -ef | grep datastax | grep -v grep root 2166 1 0 03:31 ? 00:00:00 /bin/bash /usr/share/datastax-agent/bin/datastax_agent_monitor 106 2187 1 0 03:31 ? 00:01:37 /etc/alternatives/javahome/bin/java -Xmx40M -Xms40M -Djavax.net.ssl.trustStore=/var/lib/datastax-agent/ssl/agentKeyStore -Djavax.net.ssl.keyStore=/var/lib/datastax-agent/ssl/agentKeyStore -Djavax.net.ssl.keyStorePassword=opscenter -Dagent-pidfile=/var/run/datastax-agent/datastax-agent.pid -Dlog4j.configuration=/etc/datastax-agent/log4j.properties -jar datastax-agent-4.0.2-standalone.jar /var/lib/datastax-agent/conf/address.yaml And the service itself claims that it is running: [root@beta:~] #service datastax-agent status datastax-agent (pid 2187) is running... On the cassandra node I have ports 61620 and 61621 open on the firewall. But if I do an lsof and look for those ports I see no activity there. [root@beta:~] #lsof -i :61620 [root@beta:~] #lsof -i :61621 And a netstat turns up nothing either: [root@beta:~] #netstat -tapn | egrep '(datastax|ops)' So I guess it should come as no surprise that the opscenter interface reports the node as down. And trying to reinstall the agent remotely by clicking the 'fix' link errors out: g is null If you need to make changes, you can press Retry and the installations will be retried. And also I got on another attempt: Cannot call method 'getRequstStatus' of null. I'm really wondering what I'm doing wrong here, and how I can work my way out of this quagmire. It would be beyond awesome to actually get this working! I've also attempted to get Cassandra Cluster Admin working. JMX is doing its thing on the cassandra node and is running on port 8081. CCA is running on the same host as the opscenter.
But cca gives me this error once I log in: Cassandra Cluster Admin Logout Fatal error: Uncaught exception 'TTransportException' with message 'TSocket: timed out reading 4 bytes from beta.jokefire.com:9160' in /var/www/Cassandra-Cluster-Admin/include/thrift/transport/TSocket.php:268 Stack trace: #0 /var/www/Cassandra-Cluster-Admin/include/thrift/transport/TTransport.php(87): TSocket->read(4) #1 /var/www/Cassandra-Cluster-Admin/include/thrift/transport/TFramedTransport.php(135): TTransport->readAll(4) #2 /var/www/Cassandra-Cluster-Admin/include/thrift/transport/TFramedTransport.php(102): TFramedTransport->readFrame() #3 /var/www/Cassandra-Cluster-Admin/include/thrift/transport/TTransport.php(87): TFramedTransport->read(4) #4 /var/www/Cassandra-Cluster-Admin/include/thrift/protocol/TBinaryProtocol.php(300): TTransport->readAll(4) #5 /var/www/Cassandra-Cluster-Admin/include/thrift/protocol/TBinaryProtocol.php(192): TBinaryProtocol->readI32(NULL) #6 /var/www/Cassandra-Cluster-Admin/include/thrift/packages/cassandra/cassandra.Cassandra.client.php(1017): TBinaryProtocol->readMessageBegin(NULL, 0, 0) # in /var/www/Cassandra-Cluster-Admin/include/thrift/transport/TSocket.php on line 268 Any advice I could get on my CCA problem and/or my OpsCenter problem would be great and appreciated. Thanks Tim -- GPG me!! gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B
Re: Cleanup and old files
Check the SSTable is actually in use by cassandra, if it’s missing a component or otherwise corrupt it will not be opened at run time and so not included in all the fun games the other SSTables get to play. If you have the last startup in the logs check for an “Opening…” message or an ERROR about the file. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 30/12/2013, at 1:28 pm, David McNelis dmcne...@gmail.com wrote: I am currently running a cluster with 1.2.8. One of my larger column families on one of my nodes has keyspace-tablename-ic--Data.db with a modify date in August. Since August we have added several nodes (with vnodes), with the same number of vnodes as all the existing nodes. As a result (we've since gone from 15 to 21 nodes), ~32% of the data on the original 15 nodes should have been essentially balanced out to the 6 new nodes. (1/15 + 1/16 + 1/21). When I run a cleanup, however, the old data files never get updated, and I can't believe that they all should have remained the same. The only recently updated files in that data directory are secondary index sstable files. Am I doing something wrong here? Am I thinking about this wrong? David
Re: Commitlog replay makes dropped and recreated keyspace and column family rows reappear
mmm, my bad there. First, schema changes are always flushed to disk, so the commit log is not really an issue. Second, when the commit log replays it just processes the mutations; the “Drop keyspace” message comes from MigrationManager.announceKeyspaceDrop() and is not called. If you can reproduce this in a simple way please create a ticket at https://issues.apache.org/jira/browse/CASSANDRA Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 19/12/2013, at 2:42 am, Desimpel, Ignace ignace.desim...@nuance.com wrote: I did the test again to get the log information. There is a “Drop keyspace” message at the time I drop the keyspace. That actually must be working since after the drop, I do not get any records back. But starting from the time of restart, I do not get any “Drop keyspace” message in the log. I get the following lines (only part of the log here): ………. 2013-12-18 14:30:19.385 Initializing system_traces.sessions 2013-12-18 14:30:19.387 Initializing system_traces.events 2013-12-18 14:30:19.394 Replaying ../../../../data/cac.cassandra.cac/dbcommitlog/CommitLog-3-1387372026304.log, ../../../../data/cac.cassandra.cac/dbcommitlog/CommitLog-3-1387372026305.log 2013-12-18 14:30:19.414 Replaying ../../../../data/cac.cassandra.cac/dbcommitlog/CommitLog-3-1387372026304.log 2013-12-18 14:30:20.291 CFS(Keyspace='CodeStructure', ColumnFamily='Labels') liveRatio is 10.79257274718398 (just-counted was 10.79257274718398). calculation took 720ms for 6128 columns 2013-12-18 14:30:20.331 CFS(Keyspace='CodeStructure', ColumnFamily='Class') liveRatio is 9.787147977470557 (just-counted was 9.574295954941116). calculation took 39ms for 1236 columns 2013-12-18 14:30:20.454 CFS(Keyspace='CodeStructure', ColumnFamily='ClassMethod') liveRatio is 10.415524860171194 (just-counted was 10.415524860171194). calculation took 122ms for 6630 columns 2013-12-18 14:30:21.294 Finished reading ../../../../data/cac.cassandra.cac/dbcommitlog/CommitLog-3-1387372026304.log 2013-12-18 14:30:21.294 Replaying ../../../../data/cac.cassandra.cac/dbcommitlog/CommitLog-3-1387372026305.log 2013-12-18 14:30:21.294 Finished reading ../../../../data/cac.cassandra.cac/dbcommitlog/CommitLog-3-1387372026305.log 2013-12-18 14:30:21.298 Enqueuing flush of Memtable-ReverseIntegerFunction@663725448(270/2700 serialized/live bytes, 10 ops) 2013-12-18 14:30:21.298 Writing Memtable-ReverseIntegerFunction@663725448(270/2700 serialized/live bytes, 10 ops) ……more flushing of my memtables ……... Log replay complete, 42237 replayed mutations 2013-12-18 14:30:25.428 Cassandra version: 2.0.2-SNAPSHOT 2013-12-18 14:30:25.428 Thrift API version: 19.38.0 …… Regards, Ignace Desimpel Do you have the logs from after the restart ? Did it include a “Drop keyspace…” INFO level message ? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com From: Desimpel, Ignace Sent: Tuesday, 3 December 2013 14:45 To: user@cassandra.apache.org Subject: Commitlog replay makes dropped and recreated keyspace and column family rows reappear Hi, I have the impression that there is an issue with dropping a keyspace and then recreating the keyspace (and column families), combined with a restart of the database. My test goes as follows: Create keyspace K and column families C. Insert rows X0 column family C0 Query for X0 : found rows : OK Drop keyspace K Query for X0 : found no rows : OK Create keyspace K and column families C.
Insert rows X1 column family C1 Query for X0 : not found : OK Query for X1 : found : OK Stop the Cassandra database Start the Cassandra database Query for X1 : found : OK Query for X0 : found : NOT OK ! Has anyone tested this scenario? Using : CASSANDRA VERSION 2.0.2, thrift, java 1.7.x, centos Ignace Desimpel
Re: Writes during schema migration
It depends a little on the nature of the change, but you need some coordination between the schema change and your code. e.g. add the new column, then change the code to write to it; or add the new column, change the code to use the new column and not the old one, then remove the old column. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 19/12/2013, at 3:02 am, Ben Hood 0x6e6...@gmail.com wrote: Hi, I was wondering if anybody knows any best practices for how to apply a schema migration across a cluster. I've been reading this article: http://www.datastax.com/dev/blog/the-schema-management-renaissance to see what is happening under the covers. However the article doesn't seem to talk about concurrent write access during the migration process. I'm naively assuming that you'd need to block all writes to the cluster before the migration is started. This would be firstly because of potential consistency issues amongst the cluster nodes. But this would also be because you'd need two versions of your app running at the same time. Does anybody have any experience with doing this kind of thing? Cheers, Ben
Re: How to tune cassandra to avoid OOM
Cassandra version is : apache-cassandra-1.2.4 The latest 1.2 version is 1.2.13, you really should be on that. commitlog_total_space_in_mb: 16 commitlog_segment_size_in_mb: 16 Reducing the total commit log size to 16 MB is a very bad idea, you should return it to 4096 and the segment size to 32. The commit log is kept on disk and has no impact on the memory footprint. Reducing the size will cause much more disk IO. It’s kind of unusual to go OOM in 1.2+, but I’ve seen it happen with a large number of SSTables (30k+) and LCS. Also wide rows, or lots of tombstones, and bad queries can result in a lot of premature tenuring. Finally custom comparators can create a lot of garbage, or a low powered CPU may not be able to keep up. How many cores do you have ? You may want to make these changes to reduce how quickly objects are tenured, also pay attention to how low the total heap use gets to after CMS. JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=4" JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=2" JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50" Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 19/12/2013, at 4:47 pm, Lee Mighdoll l...@underneath.ca wrote: I'd suggest setting some cassandra jvm parameters so that you can analyze a heap dump and peek through the gc logs. That'll give you some clues e.g. whether the memory problem is growing steadily or suddenly, and clues from a peek at which objects are using the memory. -XX:+HeapDumpOnOutOfMemoryError And if you don't want to wait six days for another failure, you can collect a heap sooner with jmap -F. -Xloggc:/path/to/where/to/put/the/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure Cheers, Lee On Wed, Dec 18, 2013 at 6:52 PM, Shammi Jayasinghe sha...@wso2.com wrote: Hi, We are facing a problem with Cassandra tuning. We have hit the following OOM scenario [1] after running the system for 6 days. We have tuned cassandra with the following values, obtained by going through a huge number of testing cycles, but it has still gone OOM. I would like to know if someone can help with identifying tuning parameters. In this server, we have given 6GB for the Xmx value and the total memory in the server is 8GB. Cassandra version is : apache-cassandra-1.2.4 Tuning parameters: flush_largest_memtables_at: 0.5 reduce_cache_sizes_at: 0.85 reduce_cache_capacity_to: 0.6 commitlog_total_space_in_mb: 16 commitlog_segment_size_in_mb: 16 As I mentioned in the above parameters (flush_largest_memtables_at = 0.5), I feel that it has not taken effect on the server. Is there any way we can check whether it is taking effect as expected? [1] WARN 19:16:50,355 Heap is 0.9971737408184552 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory.
Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically WARN 19:18:19,784 Flushing CFS(Keyspace='QpidKeySpace', ColumnFamily='DestinationSubscriptionsCountRow') to relieve memory pressure ERROR 19:20:50,316 Exception in thread Thread[ReadStage:63,5,main] java.lang.OutOfMemoryError: Java heap space at java.nio.ByteBuffer.wrap(ByteBuffer.java:350) at java.nio.ByteBuffer.wrap(ByteBuffer.java:373) at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:391) at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:392) at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:371) at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:84) at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:73) at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:370) at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.fetchMoreData(IndexedSliceReader.java:325) at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:151) at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:48) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:90) at org.apache.cassandra.db.filter.QueryFilter$2
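For reference, the defaults being recommended above are cassandra.yaml settings:
commitlog_total_space_in_mb: 4096
commitlog_segment_size_in_mb: 32
With only 16 MB of total commit log space the node has to flush memtables almost continuously to free segments, which adds disk IO and GC pressure.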
Re: Best way to measure write throughput...
nodetool proxyhistograms shows request latency for the node as a whole, nodetool cfhistograms shows it for a single column family. If you want to get an overview install something like Ops Centre http://www.datastax.com/what-we-offer/products-services/datastax-opscenter Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 19/12/2013, at 8:46 pm, Jason Wee peich...@gmail.com wrote: Hello, you could also probably do it in your application? Just sample with an interval of time and that should give some indication of throughput. HTH /Jason On Thu, Dec 19, 2013 at 12:11 AM, Krishna Chaitanya bnsk1990r...@gmail.com wrote: Hello, Could you please suggest to me the best way to measure write throughput in Cassandra. I basically have an application that stores network packets to a Cassandra cluster. Which is the best way to measure write performance, especially write throughput, in terms of the number of packets stored into Cassandra per second or something similar? Can I measure this using nodetool? Thanks. -- Regards, BNSK.
Re: Cassandra python pagination
First approach: Sounds good. Second approach ( I used in production ): If the row gets big enough this will have bad performance. A - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 19/12/2013, at 10:28 am, Kumar Ranjan winnerd...@gmail.com wrote: I am using pycassa. So, here is how I solved this issue. Will discuss 2 approaches. First approach didn't work out for me. Thanks Aaron for your attention. First approach: - Say if column_count = 10 - collect first 11 rows, sort first 10, send it to user (front end) as JSON object and last=11th_column - User then calls for page 2, with prev = 1st_column_id, column_start = 11th_column and column_count = 10 - This way, I can traverse, next page and previous page. - Only issue with this approach is, I don't have all columns in super column sorted. So this did not work. Second approach ( I used in production ): - fetch all super columns for a row key - Sort this in python using sorted and a lambda function based on column values. - Once sorted, I prepare buckets and each bucket size is page size / column count. Also filter out any rogue data if needed - Store page by page results in Redis with keys such as 'row_key|page_1|super_column' and keep refreshing redis periodically. I am sure, there must be a better and brighter approach but for now, the 2nd approach is working. Thoughts ?? On Tue, Dec 17, 2013 at 9:19 PM, Aaron Morton aa...@thelastpickle.com wrote: CQL3 and thrift do not support an offset clause, so you can only really support next / prev page calls to the database. I am trying to use xget with column_count and buffer_size parameters. Can someone explain to me how it works? From the doc, my understanding is that I can do something like, What client are you using ? xget is not a standard cassandra function. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 13/12/2013, at 4:56 am, Kumar Ranjan winnerd...@gmail.com wrote: Hey Folks, I need some ideas about implementing pagination on the browser, from the backend. So python code (backend) gets a request from the frontend with page=1,2,3,4 and so on and count_per_page=50. I am trying to use xget with column_count and buffer_size parameters. Can someone explain to me how it works? From the doc, my understanding is that I can do something like, total_cols is total columns for that key. count is what the user sends me. .xget('Twitter_search', hh, column_count=total_cols, buffer_size=count): Is my understanding correct? Because it's not working for page 2 and so on? Please enlighten me with suggestions. Thanks.
Re: Cassandra python pagination
Is there something wrong with it? Here 1234555665_53323232 and 2344555665_53323232 are super columns. Also, if I have to represent this data with the new composite comparator, how will I accomplish that? Composite types via pycassa http://pycassa.github.io/pycassa/assorted/composite_types.html?highlight=composite Create a composite where the super column name is the first part and the column name is the second part; this is basically what CQL 3 does. You will have to make all columns the same type though. Or use CQL 3, it works well for these sorts of models. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 20/12/2013, at 7:22 am, Kumar Ranjan winnerd...@gmail.com wrote: Rob - I got a question following your advice. This is how I define my column family validators = { 'approved':'UTF8Type', 'tid': 'UTF8Type', 'iid': 'UTF8Type', 'score': 'IntegerType', 'likes': 'IntegerType', 'retweet': 'IntegerType', 'favorite':'IntegerType', 'screen_name': 'UTF8Type', 'created_date':'UTF8Type', 'expanded_url':'UTF8Type', 'embedly_data':'BytesType', } SYSTEM_MANAGER.create_column_family('KeySpaceNNN', 'Twitter_Instagram', default_validation_class='UTF8Type', super=True, comparator='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) Actual data representation: 'row_key': {'1234555665_53323232': {'approved': 'false', 'tid': 123, 'iid': 34, 'score': 2, likes: 50, retweets: 45, favorite: 34, screen_name:'goodname'}, '2344555665_53323232': {'approved': 'false', 'tid': 134, 'iid': 34, 'score': 2, likes: 50, retweets: 45, favorite: 34, screen_name:'newname'}. . } Is there something wrong with it? Here 1234555665_53323232 and 2344555665_53323232 are super columns. Also, if I have to represent this data with the new composite comparator, how will I accomplish that? Please let me know. Regards. On Wed, Dec 18, 2013 at 5:32 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Dec 18, 2013 at 1:28 PM, Kumar Ranjan winnerd...@gmail.com wrote: Second approach ( I used in production ): - fetch all super columns for a row key Stock response mentioning that super columns are anti-advised for use, especially in brand new code. =Rob
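For reference, a sketch of the same model as a CQL 3 table (names taken from the schema above, types illustrative):
CREATE TABLE twitter_instagram (
row_key text,
entry_id text, -- the old super column name, e.g. '1234555665_53323232'
approved text,
tid text,
iid text,
score int,
likes int,
retweet int,
favorite int,
screen_name text,
PRIMARY KEY (row_key, entry_id)
);
Each old super column becomes one clustered row keyed by entry_id, and each sub-column becomes a regular column, which is the composite layout Aaron describes.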
Re: WriteTimeoutException instead of UnavailableException
But in some cases, from one certain node, I get a WriteTimeoutException for a few minutes until an UnavailableException. It's like the coordinator doesn't know the status of the cluster. Any clue why this is happening? Depending on how the node goes down there can be a delay in other nodes knowing it is down. If you stop gossip (nodetool disablegossip) the node will cancel the gossip thread (without interrupting), wait two seconds, then inform other nodes it’s leaving gossip. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 18/12/2013, at 8:56 am, Demian Berjman dberj...@despegar.com wrote: Question. I have a 5 node cluster (local with ccm). A keyspace with rf: 3. Three nodes are down. I run nodetool ring on the two living nodes and both see the other three nodes down. Then I do an insert with CL QUORUM and get an UnavailableException. It's ok. I am using Datastax java driver v 2.0.0-rc2. But in some cases, from one certain node, I get a WriteTimeoutException for a few minutes until an UnavailableException. It's like the coordinator doesn't know the status of the cluster. Any clue why this is happening? Thanks,
Re: WriteTimeoutException on Lightweight transactions
Some background…. http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0 You can also get a timeout during the prepare phase, well any time you are waiting on other nodes really. The WriteTimeoutException returned from the server includes a writeType (https://github.com/apache/cassandra/blob/cassandra-2.0.0-beta1/src/java/org/apache/cassandra/exceptions/WriteTimeoutException.java#L27) that will say CAS if it timed out during the prepare and propose phases, and SIMPLE when trying to commit. It’s also on the WriteTimeoutException in the driver. If it says CAS then we did not get to start the write. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 20/12/2013, at 10:05 am, Demian Berjman dberj...@despegar.com wrote: Hi. I am using Cassandra 2.0.3 with the Datastax Java client. I execute an insert query: Insert insert = QueryBuilder.insertInto("demo_cl", "demo_table").value("id", id).value("col1", transactions).ifNotExists(); session.execute(insert.setConsistencyLevel(ConsistencyLevel.QUORUM)); Then, I force a shutdown on one node and get: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency SERIAL (2 replica were required but only 1 acknowledged the write) Then I read the row and got no results. It seems that it was not inserted. What happened to the "1 acknowledged the write"? Is it lost? Is it like a rollback? Thanks,
Re: Issue upgrading from 1.2 to 2.0.3
If this is still a concern can you post the output from nodetool gossipinfo ? It will give the details of what the nodes think of the other ones. A - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 20/12/2013, at 11:38 am, Parag Patel parag.pa...@fusionts.com wrote: Thanks for that link. Our 1.2 version is 1.2.12. Our 2.0.3 nodes were restarted once. Before the restart it was the 1.2.12 binary, after it was the 2.0.3. Immediately after the node was back in the cluster, we ran nodetool upgradesstables. We haven’t restarted since. Is a restart required for each node? From: Robert Coli [mailto:rc...@eventbrite.com] Sent: Thursday, December 19, 2013 4:17 PM To: user@cassandra.apache.org Subject: Re: Issue upgrading from 1.2 to 2.0.3 On Thu, Dec 19, 2013 at 1:03 PM, Parag Patel parag.pa...@fusionts.com wrote: We are in the process of upgrading 1.2 to 2.0.3. ... Please help as this will prevent us from pushing into production. (as a general commentary : https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/ ) specific feedback on your question : Did the 2.0.3 nodes see the 1.2.x (which 1.2.x?) nodes after the first restart? =Rob
Re: Improving write performance in Cassandra and a few related issues...
Thanks for the reply. By packet drops I mean, the client is not able to read the shared memory as fast as the software switch is writing into it.. What is the error you are getting on the client ? Also, I would like to know if in general, distribution of write requests to different Cassandra nodes instead of only to one, leads to increased write performance in Cassandra. In general yes, clients should distribute their writes. Is there any particular way in which write performance can be measured, preferably from the client? Logging at the client level ? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 18/12/2013, at 5:02 pm, Krishna Chaitanya bnsk1990r...@gmail.com wrote: Thanks for the reply. By packet drops I mean, the client is not able to read the shared memory as fast as the software switch is writing into it.. I doubt it's an issue with the client, but can you point to particular issues that could cause this type of scenario? Also, I would like to know if in general, distribution of write requests to different Cassandra nodes instead of only to one, leads to increased write performance in Cassandra. Is there any particular way in which write performance can be measured, preferably from the client? On Dec 18, 2013 8:30 AM, Aaron Morton aa...@thelastpickle.com wrote: write throughput is remaining at around 460 pkts/sec or sometimes even falling below that rate as against the expected rate of around 920 pkts/sec. Is it some kind of limitation of Cassandra or am I doing something wrong??? There is nothing in cassandra that would make that happen. Double check your client. I also see an increase in packet drops when I try to store the packets from both the hosts into the same keyspace. The packets are getting collected properly followed by intervals in which they are being dropped in both the systems, at the same time. Could this be some kind of a buffer issue??? What do you mean by packet drops ? Do you mean dropped messages in cassandra ? Also, can write throughput be increased by distributing the write requests between the 2 Cassandra nodes instead of sending the requests to a single node? Currently, I dont see any improvement even if I distribute the write requests to different hosts. How can I improve the write performance overall? Normally we expect 3k to 4k non counter writes per core per node, if you are not seeing that it may be configuration or the client. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 15/12/2013, at 7:51 pm, Krishna Chaitanya bnsk1990r...@gmail.com wrote: Hello, I am a newbie to the Cassandra world and have a few doubts which I wanted to clarify. I am having a software switch that stores netflow packets into a shared memory segment and a daemon that reads that memory segment and stores them into a 2-node Cassandra cluster. Currently, I am storing the packets from 2 hosts into 2 different keyspaces, hence only writes and no reads. The write throughput is coming to around 460 pkts/sec in each of the keyspaces. But, when I try to store the packets into the same keyspace, I observe that the write throughput is remaining at around 460 pkts/sec or sometimes even falling below that rate as against the expected rate of around 920 pkts/sec. Is it some kind of limitation of Cassandra or am I doing something wrong???
I also see an increase in packet drops when I try to store the packets from both the hosts into the same keyspace. The packets are getting collected properly followed by intervals in which they are being dropped in both the systems, at the same time. Could this be some kind of a buffer issue??? The write requests from both the systems are sent to the same node which is also the seed node. I am mostly using the default Cassandra configuration with replication_factor set to 1 and without durable_writes. The systems are i5s with 4 gb RAM. The data model is: each second is the row key with all the packets collected in that second as the columns. Also, can write throughput be increased by distributing the write requests between the 2 Cassandra nodes instead of sending the requests to a single node? Currently, I dont see any improvement even if I distribute the write requests to different hosts. How can I improve the write performance overall? Thanks. -- Regards, BNSK.
Re: Unable to create collection inside collection
Could anybody suggest how I can achieve this in Cassandra? It’s not supported. You may want to model the feeschedule as a table. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 11:09 pm, Santosh Shet santosh.s...@vista-one-solutions.com wrote: Hi, I am not able to create a collection inside another collection in Cassandra. I am trying to create a column named feeschedule with type Map, where the Map values are of type List. Could anybody suggest how I can achieve this in Cassandra? My Cassandra version details are given below: cqlsh version - cqlsh 4.1.0 Cassandra version - 2.0.2 Thanks in advance, Regards Santosh Shet Software Engineer | VistaOne Solutions Direct India : +91 80 30273829 | Mobile India : +91 8105720582 Skype : santushet
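For reference, a sketch of modelling the feeschedule as its own table instead of a nested collection (all names hypothetical):
CREATE TABLE feeschedule (
account_id text,
fee_name text,
amounts list<double>,
PRIMARY KEY (account_id, fee_name)
);
What would have been the outer Map key becomes the fee_name clustering column, and the inner List becomes an ordinary list column, which is supported.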
Re: Cassandra python pagination
CQL3 and thrift do not support an offset clause, so you can only really support next / prev page calls to the database. I am trying to use xget with column_count and buffer_size parameters. Can someone explain to me how it works? From the doc, my understanding is that I can do something like, What client are you using ? xget is not a standard cassandra function. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 13/12/2013, at 4:56 am, Kumar Ranjan winnerd...@gmail.com wrote: Hey Folks, I need some ideas about implementing pagination on the browser, from the backend. So python code (backend) gets a request from the frontend with page=1,2,3,4 and so on and count_per_page=50. I am trying to use xget with column_count and buffer_size parameters. Can someone explain to me how it works? From the doc, my understanding is that I can do something like, total_cols is total columns for that key. count is what the user sends me. .xget('Twitter_search', hh, column_count=total_cols, buffer_size=count): Is my understanding correct? Because it's not working for page 2 and so on? Please enlighten me with suggestions. Thanks.
Re: Cassandra data update for a row
'twitter_row_key': OrderedDict([('411186035495010304', u'{score: 0, tid: 411186035495010304, created_at: Thu Dec 12 17:29:24 + 2013, favorite: 0, retweet: 0, approved: true}'),]) How can I set approved to 'false' ?? It looks like the value of the 411186035495010304 column is a string; to cassandra that’s an opaque type we do not make partial updates to. If you need to update the values individually they need to be stored in columns. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 13/12/2013, at 8:18 am, Kumar Ranjan winnerd...@gmail.com wrote: Hey Folks, I have a row like this. 'twitter_row_key' is the row key and 411186035495010304 is a column. The rest is the value for the 411186035495010304 column. See below. 'twitter_row_key': OrderedDict([('411186035495010304', u'{score: 0, tid: 411186035495010304, created_at: Thu Dec 12 17:29:24 + 2013, favorite: 0, retweet: 0, approved: true}'),]) How can I set approved to 'false' ?? When I try an insert for row key 'twitter_row_key' and column 411186035495010304, it overwrites the whole data and the new row becomes like this 'twitter_row_key': OrderedDict([('411186035495010304', u'{approved: true}'),]) Any thoughts guys?
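For reference, a sketch of the column-per-field layout being suggested (table and types hypothetical):
CREATE TABLE tweets (
row_key text,
tweet_id text,
approved boolean,
score int,
favorite int,
retweet int,
created_at text,
PRIMARY KEY (row_key, tweet_id)
);
UPDATE tweets SET approved = false WHERE row_key = 'twitter_row_key' AND tweet_id = '411186035495010304';
Only the approved column is rewritten; the other columns keep their existing values and timestamps.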
Re: Get all the data for x number of seconds from CQL?
Is it possible to get all the data for the last 5 seconds or 10 seconds or 30 seconds by using the id column? Not using the current table. Try this CREATE TABLE test1 ( day int, timestamp int, count int, record_name text, record_value blob, PRIMARY KEY (day, timestamp, record_name) ) Store the day as YYYYMMDD and the timestamp as before; you can then do queries like select * from test1 where day = 20131218 and timestamp > X and timestamp < Y; Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 13/12/2013, at 11:28 am, Techy Teck comptechge...@gmail.com wrote: Below is my CQL table - CREATE TABLE test1 ( id text, record_name text, record_value blob, PRIMARY KEY (id, record_name) ) here the id column will have data like this - timestamp.count And here timestamp is in milliseconds but rounded up to the nearest second. So as an example, data in the `id column` will be like this - 138688293.1 And a single row in the above table will be like this - 138688293.1 | event_name | hello-world Now my question is - Is it possible to get all the data for the last 5 seconds or 10 seconds or 30 seconds by using the id column? I am running Cassandra 1.2.9
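With that layout, "the last 30 seconds" becomes a slice on the timestamp clustering column inside the day partition, with the client computing the lower bound. For example (the values reuse the numbers from the thread and are only illustrative):

SELECT record_name, record_value FROM test1
WHERE day = 20131218
AND timestamp > 138688263 AND timestamp <= 138688293;

One caveat: a window that spans midnight falls into two day partitions, so it has to be issued as two queries, one per day value.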
Re: Write performance with 1.2.12
With a single node I get 3K for cassandra 1.0.12 and 1.2.12. So I suspect there is some network chatter. I have started looking at the sources, hoping to find something. 1.2 is pretty stable, I doubt there is anything in there that makes it run slower than 1.0. It’s probably something in your configuration or network. Compare the local write time from nodetool cfhistograms and the request latency from nodetool proxyhistograms. Write request latency should be a bit below 1 ms and local write latency should be around 0.5 ms or better. If there is a wider difference between the two it’s wait time + network time. As a general rule you should get around 3k to 4k writes per second per core. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 13/12/2013, at 8:06 pm, Rahul Menon ra...@apigee.com wrote: Quote from http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 Murmur3Partitioner is NOT compatible with RandomPartitioner, so if you’re upgrading and using the new cassandra.yaml file, be sure to change the partitioner back to RandomPartitioner On Thu, Dec 12, 2013 at 10:57 PM, srmore comom...@gmail.com wrote: On Thu, Dec 12, 2013 at 11:15 AM, J. Ryan Earl o...@jryanearl.us wrote: Why did you switch to RandomPartitioner away from Murmur3Partitioner? Have you tried with Murmur3? # partitioner: org.apache.cassandra.dht.Murmur3Partitioner partitioner: org.apache.cassandra.dht.RandomPartitioner Since I am comparing between the two versions I am keeping all the settings the same. I see Murmur3Partitioner has some performance improvements, but then switching back to RandomPartitioner should not cause performance to tank, right? Or am I missing something? Also, is there an easier way to update the data from RandomPartitioner to Murmur3 ? (upgradesstable ?) On Fri, Dec 6, 2013 at 10:36 AM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:59 AM, Vicky Kak vicky@gmail.com wrote: You have passed the JVM configurations and not the cassandra configurations which are in cassandra.yaml. Apologies, I was tuning the JVM and that's what was on my mind. Here are the cassandra settings http://pastebin.com/uN42GgYT The spikes are not that significant in our case and we are running the cluster with a 1.7 gb heap. Are these spikes causing any issue at your end? There are no big spikes, the overall performance seems to be about 40% lower. On Fri, Dec 6, 2013 at 9:10 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing the cassandra configurations. The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes, compactions/GCs could spike the CPU, I had similar behavior with my setup. Were you able to get around it ? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12; they are pretty big machines, 64G RAM with 16 cores, cassandra heap is 8G. The interesting observation is that when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both cases we were writing with LOCAL_QUORUM. Changing CL to ONE makes a slight improvement but not much.
The read_repair_chance is 0.1. We see some compactions running. Following is my iostat -x output; sda is the SSD (for commit log) and sdb is the spinner.

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          66.46   0.00     8.95     0.01    0.00  24.58

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00   27.60  0.00  4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sda1       0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sda2       0.00   27.60  0.00  4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sdb        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdb1       0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-0       0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-1       0.00    0.00  0.00  0.60    0.00    4.80      8.00      0.00   5.33   2.67   0.16
dm-2       0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-3       0.00
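For reference, the two commands Aaron compares above use 1.2-era syntax like the following (substitute your own host, keyspace and column family):

nodetool -h <host> proxyhistograms
nodetool -h <host> cfhistograms <keyspace> <column_family>

proxyhistograms shows coordinator-level request latency, cfhistograms the local read/write latency of a single column family; a large gap between the two points at queueing or network time rather than the storage engine.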
Re: Bulkoutputformat
Request did not complete within rpc_timeout. The node is overloaded and did not return in time. Check the logs for errors or excessive JVM GC and try selecting less data. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 14/12/2013, at 10:06 am, varun allampalli vshoori.off...@gmail.com wrote: Thanks Rahul, the article was insightful. On Fri, Dec 13, 2013 at 12:25 AM, Rahul Menon ra...@apigee.com wrote: Here you go http://thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html On Fri, Dec 13, 2013 at 7:19 AM, varun allampalli vshoori.off...@gmail.com wrote: Hi Aaron, It seems like you answered the question here. https://groups.google.com/forum/#!topic/nosql-databases/vjZA5vdycWA Can you give me the link to the blog which you mentioned http://thelastpickle.com/2013/01/11/primary-keys-in-cql/ Thanks in advance Varun On Thu, Dec 12, 2013 at 3:36 PM, varun allampalli vshoori.off...@gmail.com wrote: Thanks Aaron, I was able to generate sstables and load them using sstableloader. But after loading the tables, when I do a select query I get this, and the table has only one record. Is there anything I am missing or any logs I can look at? Request did not complete within rpc_timeout. On Wed, Dec 11, 2013 at 7:58 PM, Aaron Morton aa...@thelastpickle.com wrote: If you don’t need to use Hadoop then try the SSTableSimpleWriter and sstableloader; this post is a little old but still relevant http://www.datastax.com/dev/blog/bulk-loading Otherwise AFAIK BulkOutputFormat is what you want from hadoop http://www.datastax.com/docs/1.1/cluster_architecture/hadoop_integration Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 11:27 am, varun allampalli vshoori.off...@gmail.com wrote: Hi All, I want to bulk insert data into cassandra. I was wondering about using BulkOutputFormat in Hadoop. Is it the best way, or is using the driver and doing batch inserts better? Are there any disadvantages to using BulkOutputFormat? Thanks for helping Varun
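For completeness, the SSTableSimpleWriter + sstableloader flow Aaron points at ends with a command roughly like this (the host and paths are placeholders; the last two path components are expected to be the keyspace and column family so the loader knows where to stream):

sstableloader -d 10.0.0.1 /path/to/MyKeyspace/MyColumnFamily/

If a select immediately afterwards times out, as Varun saw, checking the node's system.log for GC pressure or compaction backlog before retrying is a reasonable first step.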
Re: Improving write performance in Cassandra and a few related issues...
Write throughput is remaining at around 460 pkts/sec or sometimes even falling below that rate, as against the expected rate of around 920 pkts/sec. Is it some kind of limitation of Cassandra or am I doing something wrong??? There is nothing in cassandra that would make that happen. Double check your client. I also see an increase in packet drops when I try to store the packets from both the hosts into the same keyspace. The packets are getting collected properly followed by intervals in which they are being dropped in both the systems, at the same time. Could this be some kind of a buffer issue??? What do you mean by packet drops? Do you mean dropped messages in cassandra? Also, can write throughput be increased by distributing the write requests between the 2 Cassandra nodes instead of sending the requests to a single node? Currently, I don't see any improvement even if I distribute the write requests to different hosts. How can I improve the write performance overall? Normally we expect 3k to 4k non-counter writes per core per node; if you are not seeing that it may be configuration or the client. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 15/12/2013, at 7:51 pm, Krishna Chaitanya bnsk1990r...@gmail.com wrote: Hello, I am a newbie to the Cassandra world and have a few doubts which I wanted to clarify. I am having a software switch that stores netflow packets into a shared memory segment and a daemon that reads that memory segment and stores them into a 2-node Cassandra cluster. Currently, I am storing the packets from 2 hosts into 2 different keyspaces, hence only writes and no reads. The write throughput is coming to around 460 pkts/sec in each of the keyspaces. But, when I try to store the packets into the same keyspace, I observe that the write throughput is remaining at around 460 pkts/sec or sometimes even falling below that rate, as against the expected rate of around 920 pkts/sec. Is it some kind of limitation of Cassandra or am I doing something wrong??? I also see an increase in packet drops when I try to store the packets from both the hosts into the same keyspace. The packets are getting collected properly followed by intervals in which they are being dropped in both the systems, at the same time. Could this be some kind of a buffer issue??? The write requests from both the systems are sent to the same node, which is also the seed node. I am mostly using the default Cassandra configuration, with replication_factor set to 1 and without durable_writes. The systems are i5s with 4 GB RAM. The data model is: each second is the row key, with all the packets collected in that second as the columns. Also, can write throughput be increased by distributing the write requests between the 2 Cassandra nodes instead of sending the requests to a single node? Currently, I don't see any improvement even if I distribute the write requests to different hosts. How can I improve the write performance overall? Thanks. -- Regards, BNSK.
Re: Cassandra 1.2 : OutOfMemoryError: unable to create new native thread
Try using jstack to see if there are a lot of threads there. Are you using vnodes and Hadoop? https://issues.apache.org/jira/browse/CASSANDRA-6169 Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 17/12/2013, at 2:48 am, Maciej Miklas mac.mik...@gmail.com wrote: cassandra-env.sh has the option JVM_OPTS="$JVM_OPTS -Xss180k" It will give this error if you start cassandra with Java 7. So increase the value, or remove the option. Regards, Maciej On Mon, Dec 16, 2013 at 2:37 PM, srmore comom...@gmail.com wrote: What is your thread stack size (Xss)? Try increasing that, that could help. Sometimes the limitation is imposed by the host provider (e.g. amazon ec2 etc.) Thanks, Sandeep On Mon, Dec 16, 2013 at 6:53 AM, Oleg Dulin oleg.du...@gmail.com wrote: Hi guys! I believe my limits settings are correct. Here is the output of ulimit -a:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1547135
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 10
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 32768
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

However, I just had a couple of cassandra nodes go down over the weekend for no apparent reason with the following error:

java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:691)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1017)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1163)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

Any input is greatly appreciated. -- Regards, Oleg Dulin http://www.olegdulin.com
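For anyone hitting this on Java 7: the stack size lives in cassandra-env.sh, and the fix is to raise it, e.g. (256k is the value later Cassandra releases moved to; treat it as a starting point, not gospel):

JVM_OPTS="$JVM_OPTS -Xss256k"

Each Java thread reserves that much native stack, so with thousands of threads it is the per-thread stacks, not the heap, that exhaust native memory and produce "unable to create new native thread".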
Re: Cassandra 1.1.6 - Disk usage and Load displayed in ring doesn't match
-tmp- files will sit in the data dir, if there was an error creating them during compaction or flushing to disk they will sit around until a restart. Check the logs for errors to see if compaction was failing on something. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 17/12/2013, at 12:28 pm, Narendra Sharma narendra.sha...@gmail.com wrote: No snapshots. I restarted the node and now the Load in ring is in sync with the disk usage. Not sure what caused it to go out of sync. However, the Live SStable count doesn't match exactly with the number of data files on disk. I am going through the Cassandra code to understand what could be the reason for the mismatch in the sstable count and also why there is no reference of some of the data files in system.log. On Mon, Dec 16, 2013 at 2:45 PM, Arindam Barua aba...@247-inc.com wrote: Do you have any snapshots on the nodes where you are seeing this issue? Snapshots will link to sstables which will cause them not be deleted. -Arindam From: Narendra Sharma [mailto:narendra.sha...@gmail.com] Sent: Sunday, December 15, 2013 1:15 PM To: user@cassandra.apache.org Subject: Cassandra 1.1.6 - Disk usage and Load displayed in ring doesn't match We have 8 node cluster. Replication factor is 3. For some of the nodes the Disk usage (du -ksh .) in the data directory for CF doesn't match the Load reported in nodetool ring command. When we expanded the cluster from 4 node to 8 nodes (4 weeks back), everything was okay. Over period of last 2-3 weeks the disk usage has gone up. We increased the RF from 2 to 3 2 weeks ago. I am not sure if increasing the RF is causing this issue. For one of the nodes that I analyzed: 1. nodetool ring reported load as 575.38 GB 2. nodetool cfstats for the CF reported: SSTable count: 28 Space used (live): 572671381955 Space used (total): 572671381955 3. 'ls -1 *Data* | wc -l' in the data folder for CF returned 46 4. 'du -ksh .' in the data folder for CF returned 720G The above numbers indicate that there are some sstables that are obsolete and are still occupying space on disk. What could be wrong? Will restarting the node help? The cassandra process is running for last 45 days with no downtime. However, because the disk usage is high, we are not able to run full compaction. Also, I can't find reference to each of the sstables on disk in the system.log file. For eg I have one data file on disk as (ls -lth): 86G Nov 20 06:14 I have system.log file with first line: INFO [main] 2013-11-18 09:41:56,120 AbstractCassandraDaemon.java (line 101) Logging initialized The 86G file must be a result of some compaction. I see no reference of data file in system.log file between 11/18 to 11/25. What could be the reason for that? The only reference is dated 11/29 when the file was being streamed to another node (new node). How can I identify the obsolete files and remove them? I am thinking about following. Let me know if it make sense. 1. Restart the node and check the state. 2. Move the oldest data files to another location (to another mount point) 3. Restart the node again 4. Run repair on the node so that it can get the missing data from its peers. I compared the numbers of a healthy node for the same CF: 1. nodetool ring reported load as 662.95 GB 2. nodetool cfstats for the CF reported: SSTable count: 16 Space used (live): 670524321067 Space used (total): 670524321067 3. 'ls -1 *Data* | wc -l' in the data folder for CF returned 16 4. 'du -ksh .' 
in the data folder for CF returned 625G -Naren -- Narendra Sharma Software Engineer http://www.aeris.com http://narendrasharma.blogspot.com/ -- Narendra Sharma Software Engineer http://www.aeris.com http://narendrasharma.blogspot.com/
Re: various Cassandra performance problems when CQL3 is really used
* select id from table where token(id) > token(some_value) and secondary_index = other_val limit 2 allow filtering; Filtering absolutely kills the performance. On a table populated with 130,000 records, a single node Cassandra server (on my i7 notebook, 2 GB of JVM heap) and a secondary index built on a column with low cardinality of its value set, this query takes 156 seconds to finish. Yes, this is why you have to add allow filtering. You are asking the nodes to read all the data that matches and filter in memory; that’s a SQL type operation. Your example query is somewhat complex and I doubt it could get decent performance; what does the query plan look like? IMHO you need to do further de-normalisation; you will get the best performance when you select rows by their full or partial primary key. By the way, the performance is an order of magnitude better if this patch is applied: That looks like it’s tuned to your specific need, it would ignore the max results included in the query. * select id from table; As we saw in the trace log, the query - although it queries just row ids - scans all columns of all the rows and (probably) compares the TTL with the current time (?) (we saw hundreds of thousands of gettimeofday(2)). This means that if the table somehow mixes wide and narrow rows, the performance suffers horribly. Selecting all rows from a table requires a range scan, which reads all rows from all nodes. It should never be used in production. Not sure what you mean by “scans all columns from all rows”; a select by column name will use a SliceByNamesReadCommand, which will only read the required columns from each SSTable (it normally short circuits though and reads from fewer). If there is a TTL the ExpiringColumn.localExpirationTime must be checked; if there is no TTL it will not be checked. As Cassandra checks all the columns in selects, performance suffers badly if the collection is of any interesting size. This is not true, could you provide an example of where you think this is happening? Additionally, we saw various random irreproducible freezes, high CPU consumption when nothing happens (even with trace log level set no activity was reported) and highly unpredictable performance characteristics after nodetool flush and/or major compaction. What was the HW platform and what was the load? Typically freezes in the server correlate to JVM GC; the JVM GC can also be using the CPU. If you have wide rows or make large reads you may run into more JVM GC issues. nodetool flush will (as it says) flush all the tables to disk; if you have a lot of tables and/or a lot of secondary indexes this can cause the switch lock to be held, preventing write threads from progressing. Once flush threads stop waiting on the flush queue the lock will be released. See the help for memtable_flush_queue_size in the yaml file. Major compaction is not recommended to be used in production. If you are seeing it cause performance problems I would guess it is related to JVM GC and/or the disk IO not being able to keep up. When used it creates a single SSTable for each table, which will not be compacted again until (by default) 3 other large SSTables are created or you run major compaction again. For this reason it is not recommended. Conclusions: - do not use collections - do not use secondary indexes - do not use filtering - have your rows as narrow as possible if you need any kind of all-row-keys traversal These features all have a use, but it looks like you leaned on them heavily while creating a relational model.
Especially the filtering: you have to explicitly enable it to prevent the client sending queries that will take a long time. The only time row key traversal is used normally is reading data through hadoop. You should always strive to read row(s) from a table by the full or partial primary key. With these conclusions in mind, CQL seems redundant, plain old thrift may be used, joins should be done client side and/or all indexes need to be handled manually. Correct? No. CQL provides a set of functionality not present in the thrift API. Joins and indexes should generally be handled by denormalising the data during writes. It sounds like your data model was too relational; you need to denormalise and read rows by primary key. Secondary indexes are useful when you have a query pattern that is used infrequently. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 18/12/2013, at 3:47 am, Ondřej Černoš cern...@gmail.com wrote: Hi all, we are reimplementing a legacy interface of an inventory-like service (currently built on top of mysql) on Cassandra and I thought I would share some findings with the list. The interface semantics is given and cannot be changed. We chose Cassandra due to its multiple datacenter capabilities
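As a concrete illustration of the denormalisation Aaron describes: instead of a low-cardinality secondary index plus filtering, maintain a second table keyed by the value you filter on (a sketch with made-up names):

CREATE TABLE items_by_status (
    status text,
    id timeuuid,
    PRIMARY KEY (status, id)
);

-- written alongside every insert/update of the main table
INSERT INTO items_by_status (status, id) VALUES ('active', now());

-- replaces the 156-second filtering query with a partition slice
SELECT id FROM items_by_status WHERE status = 'active' LIMIT 2;

The read is now a single-partition lookup by primary key, the access path Cassandra is optimised for. The costs are doing the index maintenance yourself at write time and, for genuinely low-cardinality values, adding a bucket component to the partition key so those partitions don't grow too wide.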
Re: AddContactPoint /VIP
What is the good practice to put in the code as addContactPoint, i.e. how many servers? I use the same nodes as the seed list nodes for that DC. The idea of the seed list is that it’s a list of well known nodes, and it’s easier operationally to say we have one list of well known nodes that is used by the servers and the clients. 1) I am also thinking of doing it this way (I am not sure if this is good or bad): if I configure 4 servers into one VIP (virtual IP / virtual DNS) and specify that DNS name in the code as the ContactPoint, that VIP is smart enough to route to different nodes. Too complicated. 2) Is it a problem if I use multiple data centers in the future? You only need to give the client the local seeds, it will discover all the nodes. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 7/12/2013, at 7:12 am, chandra Varahala hadoopandcassan...@gmail.com wrote: Greetings, I have a 4 node cassandra cluster that will grow up to 10 nodes; we are using the CQL Java client to access the data. What is the good practice to put in the code as addContactPoint, i.e. how many servers? 1) I am also thinking of doing it this way (I am not sure if this is good or bad): if I configure 4 servers into one VIP (virtual IP / virtual DNS) and specify that DNS name in the code as the ContactPoint, that VIP is smart enough to route to different nodes. 2) Is it a problem if I use multiple data centers in the future? thanks Chandra
Re: Write performance with 1.2.12
Changed memtable_total_space_in_mb to 1024, still no luck. Reducing memtable_total_space_in_mb will increase the frequency of flushing to disk, which will create more work for compaction to do and result in increased IO. You should return it to the default. when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. What are you measuring, request latency or local read/write latency? If it’s write latency it’s probably GC; if it’s read latency it’s probably IO or the data model. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 7/12/2013, at 8:05 am, srmore comom...@gmail.com wrote: Changed memtable_total_space_in_mb to 1024, still no luck. On Fri, Dec 6, 2013 at 11:05 AM, Vicky Kak vicky@gmail.com wrote: Can you set the memtable_total_space_in_mb value? It is defaulting to 1/3, which is 8/3 ~ 2.6 gb in capacity http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management The flushing of 2.6 gb to the disk might slow the performance if frequently called; maybe you have lots of write operations going on. On Fri, Dec 6, 2013 at 10:06 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:59 AM, Vicky Kak vicky@gmail.com wrote: You have passed the JVM configurations and not the cassandra configurations which are in cassandra.yaml. Apologies, I was tuning the JVM and that's what was on my mind. Here are the cassandra settings http://pastebin.com/uN42GgYT The spikes are not that significant in our case and we are running the cluster with a 1.7 gb heap. Are these spikes causing any issue at your end? There are no big spikes, the overall performance seems to be about 40% lower. On Fri, Dec 6, 2013 at 9:10 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing about the cassandra configurations. The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes, compactions/GCs could spike the CPU, I had similar behavior with my setup. Were you able to get around it ? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12; they are pretty big machines, 64G RAM with 16 cores, cassandra heap is 8G. The interesting observation is that, when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both cases we were writing with LOCAL_QUORUM. Changing CL to ONE makes a slight improvement but not much. The read_repair_chance is 0.1. We see some compactions running. Following is my iostat -x output; sda is the SSD (for commit log) and sdb is the spinner.
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          66.46   0.00     8.95     0.01    0.00  24.58

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00   27.60  0.00   4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sda1       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sda2       0.00   27.60  0.00   4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sdb        0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdb1       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-0       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-1       0.00    0.00  0.00   0.60    0.00    4.80      8.00      0.00   5.33   2.67   0.16
dm-2       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-3       0.00    0.00  0.00  24.80    0.00  198.40      8.00      0.24   9.80   0.13   0.32
dm-4       0.00    0.00  0.00   6.60    0.00   52.80      8.00      0.01   1.36   0.55   0.36
dm-5       0.00    0.00  0.00   0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
dm-6       0.00    0.00  0.00  24.80    0.00  198.40      8.00      0.29  11.60   0.13   0.32

I can see I am cpu bound here but couldn't figure out exactly what is causing it, is this caused by GC or Compaction ? I am thinking it is compaction, I see a lot of context switches and interrupts in my vmstat output. I don't see GC activity in the logs but see some compaction activity. Has anyone seen
Re: OOMs during high (read?) load in Cassandra 1.2.11
Do you have the back trace from the heap dump, so we can see what the array was and what was using it? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 10/12/2013, at 4:41 am, Klaus Brunner klaus.brun...@gmail.com wrote: 2013/12/9 Nate McCall n...@thelastpickle.com: Do you have any secondary indexes defined in the schema? That could lead to a 'mega row' pretty easily depending on the cardinality of the value. That's an interesting point - but no, we don't have any secondary indexes anywhere. From the heap dump, it's fairly evident that it's not a single huge row but actually many rows. I'll keep watching to see if this occurs again, or if the compaction fixed it for good. Thanks, Klaus
Re: Data Modelling Information
create table messages( body text, username text, tags set<text> PRIMARY keys(username,tags) ) This statement is syntactically invalid, and you also cannot use a collection type in the primary key. 1) I should be able to query by username and get all the messages for a particular username yes. 2) I should be able to query by tags and username (like: select * from messages where username='xya' and tags in ('awesome','phone')) No. 3) I should be able to query all messages by day and order by desc and limit to some value No. Could you guys please let me know if creating a secondary index on the tags field would work? No, it’s not supported. Or what would be the best way to model this data. You need to describe the problem and how you want to read the data. I suggest taking a look at the data modelling videos from Patrick here http://planetcassandra.org/Learn/CassandraCommunityWebinars Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 10/12/2013, at 8:57 am, Shrikar archak shrika...@gmail.com wrote: Hi Data Model Experts, I have a few questions with data modelling for a particular application. example create table messages( body text, username text, tags set<text> PRIMARY keys(username,tags) ) Requirements 1) I should be able to query by username and get all the messages for a particular username 2) I should be able to query by tags and username (like: select * from messages where username='xya' and tags in ('awesome','phone')) 3) I should be able to query all messages by day and order by desc and limit to some value Could you guys please let me know if creating a secondary index on the tags field would work? Or what would be the best way to model this data. Thanks, Shrikar
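A sketch of what modelling the tags as part of the table can look like for requirement 2 (the names come from the thread, the table shape is assumed; one insert per tag on the message):

CREATE TABLE messages_by_tag (
    username text,
    tag text,
    ts timeuuid,
    body text,
    PRIMARY KEY (username, tag, ts)
);

INSERT INTO messages_by_tag (username, tag, ts, body)
VALUES ('xya', 'awesome', now(), 'my message');

SELECT body FROM messages_by_tag
WHERE username = 'xya' AND tag = 'awesome';

Requirement 3 would be a separate table partitioned by day with ts as a DESC clustering column, written at the same time as this one.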
Re: Nodetool repair exceptions in Cassandra 2.0.2
[2013-12-08 11:04:02,047] Repair session ff16c510-5ff7-11e3-97c0-5973cc397f8f for range (1246984843639507027,1266616572749926276] failed with error org.apache.cassandra.exceptions.RepairException: [repair #ff16c510-5ff7-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (1246984843639507027,1266616572749926276]] Validation failed in /10.x.x.48
The 10.x.x.48 node sent a tree response (merkle tree) to this node that did not contain the tree. This node then killed the repair session. Look for log messages on 10.x.x.48 that correlate with the repair session ID above. They may look like logger.error("Failed creating a merkle tree for " + desc + ", " + initiator + " (see log for details)"); or logger.info(String.format("[repair #%s] Sending completed merkle tree to %s for %s/%s", desc.sessionId, initiator, desc.keyspace, desc.columnFamily)); Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 10/12/2013, at 12:57 pm, Laing, Michael michael.la...@nytimes.com wrote: My experience is that you must upgrade to 2.0.3 ASAP to fix this. Michael On Mon, Dec 9, 2013 at 6:39 PM, David Laube d...@stormpath.com wrote: Hi All, We are running Cassandra 2.0.2 and have recently stumbled upon an issue with nodetool repair. Upon running nodetool repair on each of the 5 nodes in the ring (one at a time) we observe the following exceptions returned to standard out;
[2013-12-08 11:04:02,047] Repair session ff16c510-5ff7-11e3-97c0-5973cc397f8f for range (1246984843639507027,1266616572749926276] failed with error org.apache.cassandra.exceptions.RepairException: [repair #ff16c510-5ff7-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (1246984843639507027,1266616572749926276]] Validation failed in /10.x.x.48
[2013-12-08 11:04:02,063] Repair session 284c8b40-5ff8-11e3-97c0-5973cc397f8f for range (-109256956528331396,-89316884701275697] failed with error org.apache.cassandra.exceptions.RepairException: [repair #284c8b40-5ff8-11e3-97c0-5973cc397f8f on keyspace_name/col_family2, (-109256956528331396,-89316884701275697]] Validation failed in /10.x.x.103
[2013-12-08 11:04:02,070] Repair session 399e7160-5ff8-11e3-97c0-5973cc397f8f for range (8901153810410866970,8915879751739915956] failed with error org.apache.cassandra.exceptions.RepairException: [repair #399e7160-5ff8-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (8901153810410866970,8915879751739915956]] Validation failed in /10.x.x.103
[2013-12-08 11:04:02,072] Repair session 3ea73340-5ff8-11e3-97c0-5973cc397f8f for range (1149084504576970235,1190026362216198862] failed with error org.apache.cassandra.exceptions.RepairException: [repair #3ea73340-5ff8-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (1149084504576970235,1190026362216198862]] Validation failed in /10.x.x.103
[2013-12-08 11:04:02,091] Repair session 6f0da460-5ff8-11e3-97c0-5973cc397f8f for range (-5407189524618266750,-5389231566389960750] failed with error org.apache.cassandra.exceptions.RepairException: [repair #6f0da460-5ff8-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (-5407189524618266750,-5389231566389960750]] Validation failed in /10.x.x.48
[2013-12-09 23:16:36,962] Repair session 7efc2740-6127-11e3-97c0-5973cc397f8f for range (1246984843639507027,1266616572749926276] failed with error org.apache.cassandra.exceptions.RepairException: [repair #7efc2740-6127-11e3-97c0-5973cc397f8f on keyspace_name/col_family1, (1246984843639507027,1266616572749926276]] Validation failed in /10.x.x.48
[2013-12-09 23:16:36,986] Repair session a8c44260-6127-11e3-97c0-5973cc397f8f for range (-109256956528331396,-89316884701275697] failed with error org.apache.cassandra.exceptions.RepairException: [repair #a8c44260-6127-11e3-97c0-5973cc397f8f on keyspace_name/col_family2, (-109256956528331396,-89316884701275697]] Validation failed in /10.x.x.210 The /var/log/cassandra/system.log shows similar info as above with no real explanation as to the root cause behind the exception(s). There also does not appear to be any additional info in /var/log/cassandra/cassandra.log. We have tried restoring a recent snapshot of the keyspace in question to a separate staging ring and the repair runs successfully and without exception there. This is even after we tried insert/delete on the keyspace in the separate staging ring. Has anyone seen this behavior before, and what can we do to resolve this? Any assistance would be greatly appreciated. Best regards, -Dave
Re: setting PIG_INPUT_INITIAL_ADDRESS environment variable in Oozie for cassandra ...¿?
Caused by: java.io.IOException: PIG_INPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS environment variable not set at org.apache.cassandra.hadoop.pig.CassandraStorage.setLocation(CassandraStorage.java:314) at org.apache.cassandra.hadoop.pig.CassandraStorage.getSchema(CassandraStorage.java:358) at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:151) ... 35 more Have you checked these are set ? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 11/12/2013, at 4:00 am, Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com wrote: Hi, I have an error with pig action in oozie 4.0.0 using cassandraStorage. (cassandra 1.2.10) I can run pig scripts right with cassandra. but whe I try to use cassandraStorage to load data I have this error: Run pig script using PigRunner.run() for Pig version 0.8+ Apache Pig version 0.10.0 (r1328203) compiled Apr 20 2012, 00:33:25 Run pig script using PigRunner.run() for Pig version 0.8+ 2013-12-10 12:24:39,084 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 20 2012, 00:33:25 2013-12-10 12:24:39,084 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 20 2012, 00:33:25 2013-12-10 12:24:39,095 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/hadoop-ec2-user/mapred/local/taskTracker/ec2-user/jobcache/job_201312100858_0007/attempt_201312100858_0007_m_00_0/work/pig-job_201312100858_0007.log 2013-12-10 12:24:39,095 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/hadoop-ec2-user/mapred/local/taskTracker/ec2-user/jobcache/job_201312100858_0007/attempt_201312100858_0007_m_00_0/work/pig-job_201312100858_0007.log 2013-12-10 12:24:39,501 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.228.243.18:9000 2013-12-10 12:24:39,501 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.228.243.18:9000 2013-12-10 12:24:39,510 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.228.243.18:9001 2013-12-10 12:24:39,510 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.228.243.18:9001 2013-12-10 12:24:40,505 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: file testCassandra.pig, line 7, column 7 Cannot get schema from loadFunc org.apache.cassandra.hadoop.pig.CassandraStorage 2013-12-10 12:24:40,505 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: file testCassandra.pig, line 7, column 7 Cannot get schema from loadFunc org.apache.cassandra.hadoop.pig.CassandraStorage 2013-12-10 12:24:40,505 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2245: file testCassandra.pig, line 7, column 7 Cannot get schema from loadFunc org.apache.cassandra.hadoop.pig.CassandraStorage at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:155) at org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:110) at org.apache.pig.newplan.logical.relational.LOStore.getSchema(LOStore.java:68) at org.apache.pig.newplan.logical.visitor.SchemaAliasVisitor.validate(SchemaAliasVisitor.java:60) at org.apache.pig.newplan.logical.visitor.SchemaAliasVisitor.visit(SchemaAliasVisitor.java:84) at 
org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:77) at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75) at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50) at org.apache.pig.PigServer$Graph.compile(PigServer.java:1617) at org.apache.pig.PigServer$Graph.compile(PigServer.java:1611) at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1334) at org.apache.pig.PigServer.execute(PigServer.java:1239) at org.apache.pig.PigServer.executeBatch(PigServer.java:362) at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:132) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:193) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84) at org.apache.pig.Main.run(Main.java:430) at org.apache.pig.PigRunner.run(PigRunner.java:49) at org.apache.oozie.action.hadoop.PigMain.runPigJob(PigMain.java:283) at org.apache.oozie.action.hadoop.PigMain.run(PigMain.java:223) at org.apache.oozie.action.hadoop.LauncherMain.run
Re: Exactly one wide row per node for a given CF?
Querying the table was fast. What I didn’t do was test the table under load, nor did I try this in a multi-node cluster. As the number of columns in a row increases so does the size of the column index, which is read as part of the read path. For background and comparisons of latency see http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html or my talk on performance at the SF summit last year http://thelastpickle.com/speaking/2012/08/08/Cassandra-Summit-SF.html While the column index has been lifted to the -Index.db component, AFAIK it must still be fully loaded. Larger rows take longer to go through compaction, tend to cause more JVM GC and have issues during repair. See the in_memory_compaction_limit_in_mb comments in the yaml file. During repair we detect differences in ranges of rows and stream them between the nodes. If you have wide rows and a single column is out of sync we will create a new copy of that row on the node, which must then be compacted. I’ve seen the load on nodes with very wide rows go down by 150GB just by reducing the compaction settings. IMHO, all things being equal, rows in the few 10’s of MB work better. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 11/12/2013, at 2:41 am, Robert Wille rwi...@fold3.com wrote: I have a question about this statement: When rows get above a few 10’s of MB things can slow down, when they get above 50 MB they can be a pain, when they get above 100MB it’s a warning sign. And when they get above 1GB, well, you don’t want to know what happens then. I tested a data model that I created. Here’s the schema for the table in question: CREATE TABLE bdn_index_pub ( tree INT, pord INT, hpath VARCHAR, PRIMARY KEY (tree, pord) ); As a test, I inserted 100 million records. tree had the same value for every record, and I had 100 million values for pord. hpath averaged about 50 characters in length. My understanding is that all 100 million strings would have been stored in a single row, since they all had the same value for the first component of the primary key. I didn’t look at the size of the table, but it had to be several gigs (uncompressed). Contrary to what Aaron says, I do want to know what happens, because I didn’t experience any issues with this table during my test. Inserting was fast. The last batch of records inserted in approximately the same amount of time as the first batch. Querying the table was fast. What I didn’t do was test the table under load, nor did I try this in a multi-node cluster. If this is bad, can somebody suggest a better pattern? This table was designed to support a query like this: select hpath from bdn_index_pub where tree = :tree and pord >= :start and pord <= :end. In my application, most trees will have less than a million records. A handful will have 10’s of millions, and one of them will have 100 million. If I need to break up my rows, my first instinct would be to divide each tree into blocks of say 10,000 and change tree to a string that contains the tree and the block number. Something like this: 17:0, 0, ‘/’ … 17:0, , ’/a/b/c’ 17:1,1, ‘/a/b/d’ … I’d then need to issue an extra query for ranges that crossed block boundaries. Any suggestions on a better pattern? Thanks Robert From: Aaron Morton aa...@thelastpickle.com Reply-To: user@cassandra.apache.org Date: Tuesday, December 10, 2013 at 12:33 AM To: Cassandra User user@cassandra.apache.org Subject: Re: Exactly one wide row per node for a given CF?
But this becomes troublesome if I add or remove nodes. What effectively I want is to partition on the unique id of the record modulus N (id % N; where N is the number of nodes). This is exactly the problem consistent hashing (used by cassandra) is designed to solve. If you hash the key and modulo the number of nodes, adding and removing nodes requires a lot of data to move. I want to be able to randomly distribute a large set of records but keep them clustered in one wide row per node. Sounds like you should revisit your data modelling; this is a pretty well known anti pattern. When rows get above a few 10’s of MB things can slow down, when they get above 50 MB they can be a pain, when they get above 100MB it’s a warning sign. And when they get above 1GB, well, you don’t want to know what happens then. It’s a bad idea and you should take another look at the data model. If you have to do it, you can try the ByteOrderedPartitioner, which uses the row key as a token, giving you total control of the row placement. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 4/12/2013, at 8:32 pm, Vivek Mishra
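Robert's block idea from the previous thread maps naturally onto a CQL3 composite partition key, which keeps the tree id typed instead of string-concatenating it (a sketch; the 10,000 block size is his figure, with block computed client side as pord / 10000):

CREATE TABLE bdn_index_pub (
    tree int,
    block int,
    pord int,
    hpath varchar,
    PRIMARY KEY ((tree, block), pord)
);

SELECT hpath FROM bdn_index_pub
WHERE tree = 17 AND block = 0
AND pord >= 0 AND pord <= 9999;

A range that crosses block boundaries becomes one query per block; since each query hits exactly one partition they can be issued in parallel.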
Re:
SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type='CompositeType', default_validation_class='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) CompositeType is a type composed of other types, see http://pycassa.github.io/pycassa/assorted/composite_types.html?highlight=compositetype Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 6:15 am, Kumar Ranjan winnerd...@gmail.com wrote: Hey Folks, So I am creating, column family using pycassaShell. See below: validators = { 'approved': 'BooleanType', 'text': 'UTF8Type', 'favorite_count':'IntegerType', 'retweet_count': 'IntegerType', 'expanded_url': 'UTF8Type', 'tuid': 'LongType', 'screen_name': 'UTF8Type', 'profile_image': 'UTF8Type', 'embedly_data': 'CompositeType', 'created_at':'UTF8Type', } SYSTEM_MANAGER.create_column_family('Narrative','Twitter_search_test', comparator_type='CompositeType', default_validation_class='UTF8Type', key_validation_class='UTF8Type', column_validation_classes=validators) I am getting this error: InvalidRequestException: InvalidRequestException(why='Invalid definition for comparator org.apache.cassandra.db.marshal.CompositeType.' My data will look like this: 'row_key' : { 'tid' : { 'expanded_url': u'http://instagram.com/p/hwDj2BJeBy/', 'text': '#snowinginNYC Makes me so happy\xe2\x9d\x840brittles0 \xe2\x9b\x84 @ Grumman Studios http://t.co/rlOvaYSfKa', 'profile_image': u'https://pbs.twimg.com/profile_images/3262070059/1e82f895559b904945d28cd3ab3947e5_normal.jpeg', 'tuid': 339322611, 'approved': 'true', 'favorite_count': 0, 'screen_name': u'LonaVigi', 'created_at': u'Wed Dec 11 01:10:05 + 2013', 'embedly_data': {u'provider_url': u'http://instagram.com/', u'description': ulonavigi's photo on Instagram, u'title': u'#snwinginNYC Makes me so happy\u2744@0brittles0 \u26c4', u'url': u'http://distilleryimage7.ak.instagram.com/5b880dec61c711e3a50b129314edd3b_8.jpg', u'thumbnail_width': 640, u'height': 640, u'width': 640, u'thumbnail_url': u'http://distilleryimage7.ak.instagram.com/b880dec61c711e3a50b1293d14edd3b_8.jpg', u'author_name': u'lonavigi', u'version': u'1.0', u'provider_name': u'Instagram', u'type': u'poto', u'thumbnail_height': 640, u'author_url': u'http://instagram.com/lonavigi'}, 'tid': 410577192746500096, 'retweet_count': 0 } }
Re: Cyclop - CQL3 web based editor
thanks, looks handy. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 6:16 am, Parth Patil parthpa...@gmail.com wrote: Hi Maciej, This looks great! Thanks for building this. On Wed, Dec 11, 2013 at 12:45 AM, Murali muralidharan@gmail.com wrote: Hi Maciej, Thanks for sharing it. On Wed, Dec 11, 2013 at 2:09 PM, Maciej Miklas mac.mik...@gmail.com wrote: Hi all, This is the Cassandra mailing list, but I've developed something that is strictly related to Cassandra, and some of you might find it useful, so I've decided to send email to this group. This is a web based CQL3 editor. The idea is to deploy it once and have a simple and comfortable CQL3 interface over the web - without needing to install anything. The editor itself supports code completion, not only based on CQL syntax, but also on database content - so for example the select statement will suggest tables from the active keyspace, or, in the where clause, only columns from the table provided after select from. The results are displayed in a reversed table - rows horizontally and columns vertically. It seems to be more natural for a column oriented database. You can also export query results to CSV, or add a query as a browser bookmark. The whole application is based on wicket + bootstrap + spring and can be deployed in any web 3.0 container. Here is the project (open source): https://github.com/maciejmiklas/cyclop Have fun! Maciej -- Thanks, Murali 99025-5 -- Best, Parth
Re: CLUSTERING ORDER CQL3
You need to specify all the clustering key components in the CLUSTERING ORDER BY clause create table demo(oid int,cid int,ts timeuuid,PRIMARY KEY (oid,cid,ts)) WITH CLUSTERING ORDER BY (cid ASC, ts DESC); cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 10:44 am, Shrikar archak shrika...@gmail.com wrote: Hi All, My use case: I want query results ordered by timestamp DESC. But I don't want timestamp to be the second column in the primary key, as that will take away my querying capability. For example create table demo(oid int,cid int,ts timeuuid,PRIMARY KEY (oid,cid,ts)) WITH CLUSTERING ORDER BY (ts DESC); Queries required: I want the result for all the below queries to be in DESC order of timestamp select * from demo where oid = 100; select * from demo where oid = 100 and cid = 10; select * from demo where oid = 100 and cid = 100 and ts > minTimeuuid('something'); I am trying to create this table with CLUSTERING ORDER in CQL and getting this error cqlsh:viralheat> create table demo(oid int,cid int,ts timeuuid,PRIMARY KEY (oid,cid,ts)) WITH CLUSTERING ORDER BY (ts desc); Bad Request: Missing CLUSTERING ORDER for column cid In this document it mentions that we can have multiple keys for cluster ordering. Anyone know how to do that? Go here: Datastax doc If I make the timestamp the second column then I can't have queries like select * from demo where oid = 100 and cid = 100 and ts > minTimeuuid('something'); Thanks, Shrikar
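Worth noting about the corrected table: rows inside a partition sort by cid ASC first and by ts DESC only within each cid. So a query such as the following (the timestamp is illustrative)

SELECT * FROM demo
WHERE oid = 100 AND cid = 10
AND ts > minTimeuuid('2013-12-01 00:00+0000');

comes back newest-first, while select * from demo where oid = 100 returns groups ordered by cid, each group internally time-descending, not one globally time-sorted list.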
Re: Bulkoutputformat
If you don’t need to use Hadoop then try the SSTableSimpleWriter and sstableloader; this post is a little old but still relevant http://www.datastax.com/dev/blog/bulk-loading Otherwise AFAIK BulkOutputFormat is what you want from hadoop http://www.datastax.com/docs/1.1/cluster_architecture/hadoop_integration Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 11:27 am, varun allampalli vshoori.off...@gmail.com wrote: Hi All, I want to bulk insert data into cassandra. I was wondering about using BulkOutputFormat in Hadoop. Is it the best way, or is using the driver and doing batch inserts better? Are there any disadvantages to using BulkOutputFormat? Thanks for helping Varun
Re: efficient way to store 8-bit or 16-bit value?
What do people recommend I do to store a small binary value in a column? I’d rather not simply use a 32-bit int for a single byte value. blob is a byte array or you could use the varint, a variable length integer, but you probably want the blob. cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 1:33 pm, Andrey Ilinykh ailin...@gmail.com wrote: Column metadata is about 20 bytes. So, there is no big difference if you save 1 or 4 bytes. Thank you, Andrey On Wed, Dec 11, 2013 at 2:42 PM, onlinespending onlinespend...@gmail.com wrote: What do people recommend I do to store a small binary value in a column? I’d rather not simply use a 32-bit int for a single byte value. Can I have a one byte blob? Or should I store it as a single character ASCII string? I imagine each is going to have the overhead of storing the length (or null termination in the case of a string). That overhead may be worse than simply using a 32-bit int. Also is it possible to partition on a single character or substring of characters from a string (or a portion of a blob)? Something like: CREATE TABLE test ( id text, value blob, PRIMARY KEY (string[0:1]) )
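A single-byte blob is straightforward in CQL3, since blob literals are written as hex (a minimal sketch; the table and key names are made up):

CREATE TABLE small_values (
    id text PRIMARY KEY,
    value blob
);

-- stores exactly one byte, 0x2a = 42
INSERT INTO small_values (id, value) VALUES ('k1', 0x2a);

As Andrey notes, the per-column overhead dwarfs the payload either way, so the choice between blob, varint and int is mostly about type safety rather than space.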
Re: Write performance with 1.2.12
It is the write latency; read latency is ok. Interestingly, the latency is low when there is one node. When I join the other nodes the latency drops about 1/3. To be specific, when I start sending traffic to the other nodes the latency for all the nodes increases; if I stop traffic to the other nodes the latency drops again. I checked, this is not node specific, it happens to any node. Is this the local write latency or the cluster wide write request latency ? What sort of numbers are you seeing ? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 3:39 pm, srmore comom...@gmail.com wrote: Thanks Aaron On Wed, Dec 11, 2013 at 8:15 PM, Aaron Morton aa...@thelastpickle.com wrote: Changed memtable_total_space_in_mb to 1024, still no luck. Reducing memtable_total_space_in_mb will increase the frequency of flushing to disk, which will create more work for compaction to do and result in increased IO. You should return it to the default. You are right, I had to revert it back to the default. when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. What are you measuring, request latency or local read/write latency ? If it’s write latency it’s probably GC; if it’s read latency it’s probably IO or the data model. It is the write latency; read latency is ok. Interestingly, the latency is low when there is one node. When I join the other nodes the latency drops about 1/3. To be specific, when I start sending traffic to the other nodes the latency for all the nodes increases; if I stop traffic to the other nodes the latency drops again. I checked, this is not node specific, it happens to any node. I don't see any GC activity in the logs. Tried to control the compaction by reducing the number of threads, did not help much. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 7/12/2013, at 8:05 am, srmore comom...@gmail.com wrote: Changed memtable_total_space_in_mb to 1024, still no luck. On Fri, Dec 6, 2013 at 11:05 AM, Vicky Kak vicky@gmail.com wrote: Can you set the memtable_total_space_in_mb value? It is defaulting to 1/3, which is 8/3 ~ 2.6 gb in capacity http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management The flushing of 2.6 gb to the disk might slow the performance if frequently called; maybe you have lots of write operations going on. On Fri, Dec 6, 2013 at 10:06 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:59 AM, Vicky Kak vicky@gmail.com wrote: You have passed the JVM configurations and not the cassandra configurations which are in cassandra.yaml. Apologies, I was tuning the JVM and that's what was on my mind. Here are the cassandra settings http://pastebin.com/uN42GgYT The spikes are not that significant in our case and we are running the cluster with a 1.7 gb heap. Are these spikes causing any issue at your end? There are no big spikes, the overall performance seems to be about 40% lower. On Fri, Dec 6, 2013 at 9:10 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing about the cassandra configurations.
The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes, compactions/GCs could spike the CPU, I had similar behavior with my setup. Were you able to get around it ? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12; they are pretty big machines, 64G RAM with 16 cores, cassandra heap is 8G. The interesting observation is that, when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both cases we were writing with LOCAL_QUORUM. Changing CL to ONE makes a slight improvement but not much. The read_repair_chance is 0.1. We see some compactions running. Following is my iostat -x output; sda is the SSD (for commit log) and sdb is the spinner.

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          66.46   0.00     8.95     0.01    0.00  24.58

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00   27.60  0.00  4.40    0.00  256.00     58.18      0.01   2.55   1.32   0.58
sda1       0.00    0.00  0.00  0.00    0.00    0.00
Re: user / password authentication advice
Not sure if you are asking about authentication / authorisation in cassandra or how to implement the same using cassandra. Info on cassandra authentication and authorisation is here http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/security/securityTOC.html Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 4:31 pm, onlinespending onlinespend...@gmail.com wrote: Hi, I’m using Cassandra in an environment where many users can log in to use an application I’m developing. I’m curious if anyone has any advice or links to documentation / blogs where it discusses common implementations or best practices for user and password authentication. My cursory search online didn’t bring much up on the subject. I suppose the information needn’t even be specific to Cassandra. I imagine a few basic steps will be as follows:
- user types in username (e.g. email address) and password
- this is verified against a table storing usernames and passwords (encrypted in some way)
- a token is returned to the app / web browser to allow further transactions using a secure token (e.g. cookie)
Obviously I’m only scratching the surface and it’s the details and best practices of implementing this user / password authentication that I’m curious about. Thank you, Ben
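For the Cassandra-internal half, enabling the built-in password authentication is a yaml change plus ordinary CQL3 statements (2.0-era syntax; the user and keyspace names below are made up):

# cassandra.yaml
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer

CREATE USER app_user WITH PASSWORD 'changeme' NOSUPERUSER;
GRANT SELECT ON KEYSPACE myapp TO app_user;

End-user accounts for the application itself are a separate layer: store a salted, strongly hashed password (e.g. bcrypt) in a users table keyed by email, compare hashes at login, and hand back a random session token; never store or compare the raw password.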
Re: Repair hangs - Cassandra 1.2.10
I changed logging to debug level, but still nothing is logged. Again - any help will be appreciated. There is nothing at the ERROR level on any machine ? check nodetool compactionstats to see if a validation compaction is running, the repair may be waiting on this. check nodetool netstats to see if streams are being exchanged, then check the logs on those machines. cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 4/12/2013, at 10:24 pm, Tamar Rosen ta...@correlor.com wrote: Update - I am still experiencing the above issues, but not all the time. I was able to run repair (on this keyspace) from node 2 and from node 4, but now a different keyspace hangs on these nodes, and I am still not able to run repair on node 1. It seems random. I changed logging to debug level, but still nothing is logged. Again - any help will be appreciated. Tamar On Mon, Dec 2, 2013 at 11:53 AM, Tamar Rosen ta...@correlor.com wrote: Hi, On AWS, we had a 2 node cluster with RF 2. We added 2 more nodes, then changed RF to 3 on all our keyspaces. Next step was to run nodetool repair, node by node. (In the meantime, we found that we must use CL quorum, which is affecting our application's performance). Started with node 1, which is one of the old nodes. Ran: nodetool repair -pr It seemed to be progressing fine, running keyspace by keyspace, for about an hour, but then it hung. The last messages in the output are: [2013-12-01 11:18:24,577] Repair command #4 finished [2013-12-01 11:18:24,594] Starting repair command #5, repairing 230 ranges for keyspace correlor_customer_766 It stayed like this for almost 24 hours. Then we read about the possibility of this being related to not upgrading sstables, so we killed the process. We were not sure whether we had run upgrade sstables (we upgraded from 1.2.4 a couple of months ago) So: Ran upgradesstables on a specific table in the keyspace that repair got stuck on. (this was fast) nodetool upgradesstables correlor_customer_766 users Ran repair on that same table. nodetool repair correlor_customer_766 users -pr This is again hanging. The first and only output from this process is: [2013-12-02 08:22:41,221] Starting repair command #6, repairing 230 ranges for keyspace correlor_customer_766 Nothing else happened for more than an hour. Any help and advice will be greatly appreciated. Tamar Rosen correlor.com
Re: Murmur Long.MIN_VALUE token allowed?
AFAIK any value that is a valid output from murmur3 is a valid token. The Murmur3Partitioner sets min and max to Long.MIN_VALUE and Long.MAX_VALUE: public static final LongToken MINIMUM = new LongToken(Long.MIN_VALUE); public static final long MAXIMUM = Long.MAX_VALUE; Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 5/12/2013, at 12:38 am, horschi hors...@gmail.com wrote: Hi, I just realized that I can move a node to Long.MIN_VALUE: 127.0.0.1 rack1 Up Normal 1011.58 KB 100.00% -9223372036854775808 Is that really a valid token for Murmur3Partitioner ? I thought that Long.MIN_VALUE (like -1 for Random) is not a regular token. Shouldn't it only be used for token-range scans? kind regards, Christian
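As a hedged illustration of why a token at the very bottom of the range is addressable like any other, CQL3 lets you scan an explicit token range; the table t and partition key k below are hypothetical:

SELECT k, token(k) FROM t WHERE token(k) >= -9223372036854775808 AND token(k) < 0;

Any row whose key hashes to Long.MIN_VALUE under Murmur3 is returned by this scan like any other row.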
Re: Exactly one wide row per node for a given CF?
But this becomes troublesome if I add or remove nodes. What effectively I want is to partition on the unique id of the record modulus N (id % N; where N is the number of nodes). This is exactly the problem consistent hashing (used by cassandra) is designed to solve. If you hash the key and modulo the number of nodes, adding and removing nodes requires a lot of data to move. I want to be able to randomly distribute a large set of records but keep them clustered in one wide row per node. Sounds like you should revisit your data modelling, this is a pretty well known anti-pattern. When rows get above a few tens of MB things can slow down, when they get above 50 MB they can be a pain, when they get above 100 MB it's a warning sign. And when they get above 1 GB, well, you don't want to know what happens then. It's a bad idea and you should take another look at the data model. If you have to do it, you can try the ByteOrderedPartitioner, which uses the row key as a token, giving you total control of the row placement. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 4/12/2013, at 8:32 pm, Vivek Mishra mishra.v...@gmail.com wrote: So basically you want to create a cluster of multiple unique keys, but data which belongs to one unique key should be colocated. Correct? -Vivek On Tue, Dec 3, 2013 at 10:39 AM, onlinespending onlinespend...@gmail.com wrote: Subject says it all. I want to be able to randomly distribute a large set of records but keep them clustered in one wide row per node. As an example, let's say I've got a collection of about 1 million records each with a unique id. If I just go ahead and set the primary key (and therefore the partition key) as the unique id, I'll get very good random distribution across my server cluster. However, each record will be its own row. I'd like to have each record belong to one large wide row (per server node) so I can have them sorted or clustered on some other column. If, say, I have 5 nodes in my cluster, I could randomly assign a value of 1 - 5 at the time of creation and have the partition key set to this value. But this becomes troublesome if I add or remove nodes. What effectively I want is to partition on the unique id of the record modulus N (id % N; where N is the number of nodes). I have to imagine there's a mechanism in Cassandra to simply randomize the partitioning without even using a key (and then clustering on some column). Thanks for any help.
Re: Exactly one wide row per node for a given CF?
Basically this desire all stems from wanting efficient use of memory. Do you have any real latency numbers you are trying to tune ? Otherwise this sounds a little like premature optimisation. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 5/12/2013, at 6:16 am, onlinespending onlinespend...@gmail.com wrote: Pretty much yes. Although I think it'd be nice if Cassandra handled such a case, I've resigned myself to the fact that it cannot at the moment. The workaround will be to partition on the LSB portion of the id (giving 256 rows spread amongst my nodes), which allows room for scaling, and then cluster each row on geohash or something else. Basically this desire all stems from wanting efficient use of memory. Frequently accessed keys' values are kept in RAM through the OS page cache. But the page size is 4KB. This is a problem if you are accessing several small records of data (say 200 bytes), since each record only occupies a small % of a page. This is why it's important to increase the probability that neighboring data on the disk is relevant. Worst case would be to read a full 4KB page into RAM, of which you're only accessing one record that's a couple hundred bytes. All of the other unused data of the page is wastefully occupying RAM. Now project this problem onto a collection of millions of small records all indiscriminately and randomly scattered on the disk, and you can easily see how inefficient your memory usage will become. That's why it's best to cluster data in some meaningful way, all in an effort to increase the probability that when one record is accessed in that 4KB block its neighboring records will also be accessed. This brings me back to the question of this thread. I want to randomly distribute the data amongst the nodes to avoid hot spotting, but within each node I want to cluster the data meaningfully such that the probability that neighboring data is relevant is increased. An example of this would be having a huge collection of small records that store basic user information. If you partition on the unique user id, then you'll get nice random distribution but with no ability to cluster (each record would occupy its own row). You could partition on say geographical region, but then you'll end up with hot spotting when one region is more active than another. So ideally you'd like to randomly assign a node to each record to increase parallelism, but then cluster all records on a node by say geohash, since it is more likely (however small that may be) that when one user from a geographical region is accessed other users from the same region will also need to be accessed. It's certainly better than having some random user record next to the one you are accessing at the moment. On Dec 3, 2013, at 11:32 PM, Vivek Mishra mishra.v...@gmail.com wrote: So basically you want to create a cluster of multiple unique keys, but data which belongs to one unique key should be colocated. Correct? -Vivek On Tue, Dec 3, 2013 at 10:39 AM, onlinespending onlinespend...@gmail.com wrote: Subject says it all. I want to be able to randomly distribute a large set of records but keep them clustered in one wide row per node. As an example, let's say I've got a collection of about 1 million records each with a unique id. If I just go ahead and set the primary key (and therefore the partition key) as the unique id, I'll get very good random distribution across my server cluster.
However, each record will be its own row. I'd like to have each record belong to one large wide row (per server node) so I can have them sorted or clustered on some other column. If, say, I have 5 nodes in my cluster, I could randomly assign a value of 1 - 5 at the time of creation and have the partition key set to this value. But this becomes troublesome if I add or remove nodes. What effectively I want is to partition on the unique id of the record modulus N (id % N; where N is the number of nodes). I have to imagine there's a mechanism in Cassandra to simply randomize the partitioning without even using a key (and then clustering on some column). Thanks for any help.
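For reference, a minimal CQL sketch of the LSB bucketing workaround described in this thread; every name here is hypothetical, and the bucket is computed client side (for example id % 256, or the low byte of the id):

CREATE TABLE users_by_bucket (
    bucket int,
    geohash text,
    user_id bigint,
    name text,
    PRIMARY KEY (bucket, geohash, user_id)
);

SELECT * FROM users_by_bucket WHERE bucket = 42 AND geohash >= 'gbsu' AND geohash < 'gbsv';

Each bucket is a single partition (wide row), the 256 buckets distribute randomly across the cluster regardless of node count, and within a bucket the rows stay clustered by geohash.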
Re: Raid Issue on EC2 Datastax ami, 1.2.11
Thanks for the update Philip, other people have reported high await on a single volume previously but I don't think it's been blamed on noisy neighbours. It's interesting that you can have noisy neighbours for IO only. Out of interest, was there much steal reported in top or iostat ? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 6/12/2013, at 4:42 am, Philippe Dupont pdup...@teads.tv wrote: Hi again, I have much more information on this case: We did further investigations on the affected nodes and did find some await problems on one of the 4 disks in the raid: http://imageshack.com/a/img824/2391/s7q3.jpg Here was the iostat of the node: http://imageshack.us/a/img7/7282/qq3w.png You can see that the write and read throughput are exactly the same on the 4 disks of the instance, so the raid0 looks good enough. Yet the global await, r_await and w_await are 3 to 5 times bigger on the xvde disk than on the other disks. We reported this to amazon support, and here is their answer: Hello, I deeply apologize for any inconvenience this has been causing you and thank you for the additional information and screenshots. Using the instance you based your iostat on (i-), I have looked into the underlying hardware it is currently using and I can see it appears to have a noisy neighbor leading to the higher await time on that particular device. Since most AWS services are multi-tenant, situations can arise where one customer's resource has the potential to impact the performance of a different customer's resource that resides on the same underlying hardware (a noisy neighbor). While these occurrences are rare, they are nonetheless inconvenient and I am very sorry for any impact it has created. I have also looked into the initial instance referred to when the case was created (i-xxx) and cannot see any existing issues (neighboring or otherwise) as to any I/O performance impacts; however, at the time the case was created, evidence on our end suggests there was a noisy neighbor then as well. Can you verify if you are still experiencing above average await times on this instance? If you would like to mitigate the impact of encountering noisy neighbors, you can look into our Dedicated Instance option; Dedicated Instances launch on hardware dedicated to only a single customer (though this can feasibly lead to a situation where a customer is their own noisy neighbor). However, this is an option available only to instances that are being launched into a VPC and may require modification of the architecture of your use-case. I understand the instances belonging to your cluster in question have been launched into EC2-Classic, I just wanted to bring this to your attention as a possible solution. You can read more about Dedicated Instances here: http://aws.amazon.com/dedicated-instances/ Again, I am very sorry for the performance impact you have been experiencing due to having noisy neighbors. We understand the frustration and are always actively working to increase capacity so the effects of noisy neighbors are lessened. I hope this information has been useful and if you have any additional questions whatsoever, please do not hesitate to ask! To conclude, the only other solution, short of a VPC with Dedicated Instances, is to replace this instance with a new one, hoping not to get another noisy neighbor... I hope that will help someone. Philippe 2013/11/28 Philippe DUPONT pdup...@teads.tv Hi, We have a Cassandra cluster of 28 nodes.
Each one is an EC2 m1.xlarge based on the datastax AMI, with 4 instance-store volumes in raid0 mode. Here is the ticket we opened with amazon support: This raid is created using the datastax public AMI : ami-b2212dc6. Sources are also available here : https://github.com/riptano/ComboAMI As you can see in the screenshot attached (http://imageshack.com/a/img854/4592/xbqc.jpg), randomly but frequently one of the volumes gets fully used (100%) while the 3 others are standing in low use. Because of this, the node becomes slow and the whole cassandra cluster is impacted. We are losing data due to write failures and availability for our customers. It was in this state for one hour, and we decided to restart it. We already removed 3 other instances because of this same issue (see other screenshots): http://imageshack.com/a/img824/2391/s7q3.jpg http://imageshack.com/a/img10/556/zzk8.jpg Amazon support took a close look at the instance as well as its underlying hardware for any potential health issues and both seem to be healthy. Has someone already experienced something like this ? Or should I contact the AMI author instead? Thanks a lot, Philippe.
Re: Unable to run hadoop_cql3_word_count examples
InvalidRequestException(why:consistency level LOCAL_ONE not compatible with replication strategy (org.apache.cassandra.locator.SimpleStrategy)) at The LOCAL_ONE consistency level can only be used with the NetworkTopologyStrategy. I had a quick look and the code does not use LOCAL_ONE, did you make a change? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 3/12/2013, at 10:03 pm, Parth Patil parthpa...@gmail.com wrote: Hi, I am new to Cassandra and I am exploring the Hadoop integration (MapReduce) provided by Cassandra. I am trying to run the hadoop examples provided in the cassandra repo under examples/hadoop_cql3_word_count. I am using the cassandra-2.0 branch. I have a single node cassandra running locally. I was able to run the ./bin/word_count_setup step successfully, but when I run the ./bin/word_count step I get the following error : java.lang.RuntimeException at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:661) at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.<init>(CqlPagingRecordReader.java:297) at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader.initialize(CqlPagingRecordReader.java:163) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:522) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) Caused by: InvalidRequestException(why:consistency level LOCAL_ONE not compatible with replication strategy (org.apache.cassandra.locator.SimpleStrategy)) at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result$execute_prepared_cql3_query_resultStandardScheme.read(Cassandra.java:52627) at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result$execute_prepared_cql3_query_resultStandardScheme.read(Cassandra.java:52604) at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result.read(Cassandra.java:52519) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_prepared_cql3_query(Cassandra.java:1785) at org.apache.cassandra.thrift.Cassandra$Client.execute_prepared_cql3_query(Cassandra.java:1770) at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:631) ... 6 more Has anyone seen this before ? Am I missing something ? -- Best, Parth
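A hedged sketch of the usual fix when a client insists on LOCAL_ONE is to switch the keyspace to NetworkTopologyStrategy; the keyspace and data-centre names below are assumptions (with SimpleSnitch the single data centre is reported as datacenter1):

ALTER KEYSPACE wordcount WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'datacenter1': 1};

If the keyspace already holds data, follow the change with a repair so replica placement matches the new strategy.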
Re: Commitlog replay makes dropped and recreated keyspace and column family rows reappear
Do you have the logs from after the restart ? Did it include a "Drop Keyspace …" INFO level message ? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 4/12/2013, at 2:44 am, Desimpel, Ignace ignace.desim...@nuance.com wrote: Hi, I have the impression that there is an issue with dropping a keyspace and then recreating the keyspace (and column families), combined with a restart of the database. My test goes as follows:
Create keyspace K and column families C.
Insert rows X0 into column family C0.
Query for X0 : found rows : OK
Drop keyspace K.
Query for X0 : found no rows : OK
Create keyspace K and column families C.
Insert rows X1 into column family C1.
Query for X0 : not found : OK
Query for X1 : found : OK
Stop the Cassandra database.
Start the Cassandra database.
Query for X1 : found : OK
Query for X0 : found : NOT OK !
Did someone test this scenario? Using : Cassandra version 2.0.2, thrift, java 1.7.x, centos Ignace Desimpel
Re: CQL workaround for modifying a primary key
I just tested this with 1.2.9 and DROP TABLE took a snapshot and moved the existing files out of the dir. Do you have some more steps to reproduce ? Cheers A - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 4/12/2013, at 11:23 am, Ike Walker ike.wal...@flite.com wrote: What is the best practice for modifying the primary key definition of a table in Cassandra 1.2.9? Say I have this table: CREATE TABLE temperature ( weatherstation_id text, event_time timestamp, temperature text, PRIMARY KEY (weatherstation_id,event_time) ); I want to add a new column named version and include that column in the primary key. CQL will let me add the column, but you can't change the primary key for an existing table. So I drop the table and recreate it: DROP TABLE temperature; CREATE TABLE temperature ( weatherstation_id text, version int, event_time timestamp, temperature text, PRIMARY KEY (weatherstation_id,version,event_time) ); But then I start getting errors like this: java.io.FileNotFoundException: /var/lib/cassandra/data/test/temperature/test-temperature-ic-8316-Data.db (No such file or directory) So I guess the drop table doesn't actually delete the data, and I end up with a problem like this: https://issues.apache.org/jira/browse/CASSANDRA-4857 What's a good workaround for this, assuming I don't want to change the name of my table? Should I just truncate the table, then drop it and recreate it? Thanks. -Ike Walker
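For what it's worth, a sketch of the truncate-first workaround Ike suggests; TRUNCATE flushes and snapshots before clearing the live sstables, which should avoid dangling file references when the drop follows (untested against the bug above, offered only as a hedge):

TRUNCATE temperature;
DROP TABLE temperature;
CREATE TABLE temperature ( weatherstation_id text, version int, event_time timestamp, temperature text, PRIMARY KEY (weatherstation_id, version, event_time) );

The old rows are gone either way; any data to keep would have to be exported first (for example with cqlsh COPY) and re-inserted with a value for the new version column.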
Re: While inserting data into Cassandra using Hector client
Hector is designed to use Column Families created via the thrift interface, e.g. using cassandra-cli. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 25/11/2013, at 8:51 pm, Santosh Shet santosh.s...@vista-one-solutions.com wrote: Hi, I am getting the below error while inserting data into Cassandra using the Hector client: me.prettyprint.hector.api.exceptions.HInvalidRequestException: InvalidRequestException(why:Not enough bytes to read value of component 0) I am facing this problem after upgrading Cassandra from 1.2.3 to version 2.0.2. Earlier I was able to insert data using the same code. Below are the scripts used to create the keyspace and table: CREATE KEYSPACE demo_one WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 1}; CREATE TABLE investmentvehicle(key text PRIMARY KEY); Could you provide some inputs to troubleshoot this issue? Thanks, Santosh Shet Software Engineer | VistaOne Solutions Direct India : +91 80 30273829 | Mobile India : +91 8105720582 Skype : santushet
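If the Hector code expects a classic dynamic (thrift-style) column family, one hedged workaround is to create the table from CQL3 WITH COMPACT STORAGE, which maps onto a thrift-visible CF; the column names below are illustrative, not from the original post:

CREATE TABLE investmentvehicle ( key text, column1 text, value blob, PRIMARY KEY (key, column1) ) WITH COMPACT STORAGE;

Tables created without COMPACT STORAGE store cells under composite names internally, and accessing them through thrift clients is one known way to hit "Not enough bytes to read value of component 0".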
Re: Multiple writers writing to a cassandra node...
I am a newbie to the Cassandra world. I would like to know if it's possible for two different nodes to write to a single Cassandra node Yes. Currently, I am getting an IllegalRequestException, what(): Default TException on the first system, What is the full error stack ? Occasionally, also hitting a "frame size has negative value" thrift exception when the traffic is high and packets are getting stored very fast. On the client or the server ? Can you post the full error stack ? Currently using Cassandra 2.0.0 with the libQtCassandra library. Please upgrade to 2.0.3. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 26/11/2013, at 4:42 am, Krishna Chaitanya bnsk1990r...@gmail.com wrote: Hello, I am a newbie to the Cassandra world. I would like to know if it's possible for two different nodes to write to a single Cassandra node. I have a packet collector software which runs on two different systems. I would like both of them to write the packets to a single node (same keyspace and columnfamily). Currently using Cassandra 2.0.0 with the libQtCassandra library. Currently, I am getting an IllegalRequestException, what(): Default TException on the first system the moment I try to store from the second system, but the second system works fine. When I restart the program on the first system, the second system gets the exception and the first one works fine. Occasionally, also hitting a "frame size has negative value" thrift exception when the traffic is high and packets are getting stored very fast. Can someone please point out what I am doing wrong? Thanks in advance..
Re: Nodetool cleanup
I hope I get this right :) Thanks for contributing :) a repair will trigger a major compaction on your node which will take up a lot of CPU and IO performance. It needs to do this to build up the data structure that is used for the repair. After the compaction this is streamed to the different nodes in order to repair them. It does not trigger a major compaction; that's what we call running compaction from the command line and compacting all SSTables into one big one. it will flush all the data to disk that will create some additional compaction. The major concern is that it's a disk IO intensive operation: it reads all the data and writes data to new SSTables (a one to one mapping). If you have all nodes doing this at the same time there may be some degraded performance. And as it's all nodes, it's not possible for the Dynamic Snitch to avoid nodes if they are overloaded. Cleanup is less intensive than repair, but it's still a good idea to stagger it. If you need to run it on all machines (or you have very powerful machines) it's probably going to be OK. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 26/11/2013, at 5:14 am, Artur Kronenberg artur.kronenb...@openmarket.com wrote: Hi Julien, I hope I get this right :) a repair will trigger a major compaction on your node which will take up a lot of CPU and IO performance. It needs to do this to build up the data structure that is used for the repair. After the compaction this is streamed to the different nodes in order to repair them. If you trigger this on every node simultaneously you basically take the performance away from your cluster. I would expect cassandra still to function, just way slower than before. Triggering it node after node will leave your cluster with more resources to handle incoming requests. Cheers, Artur On 25/11/13 15:12, Julien Campan wrote: Hi, I'm working with Cassandra 1.2.2 and I have a question about nodetool cleanup. In the documentation it's written "Wait for cleanup to complete on one node before doing the next". I would like to know why we can't perform many cleanups at the same time? Thanks
Re: Intermittent connection error
The inability to truncate is actually my bigger problem. If I could truncate tables, then I wouldn't have to create so many sessions, and the frequency of this error would be at tolerable levels. Can you truncate through cqlsh ? Running this program occasionally produces the following output: Looks like a node is getting evicted from the pool, try turning the logging level up to DEBUG and see if it says anything. For DS driver specific questions you may have better luck using the mail list here https://github.com/datastax/java-driver Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 22/11/2013, at 9:16 am, Robert Wille rwi...@fold3.com wrote: Sure:
package com.footnote.tools.cassandra;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Cluster.Builder;
import com.datastax.driver.core.Session;
public class Test {
    public static void main(String[] args) {
        try {
            Builder builder = Cluster.builder();
            Cluster c = builder.addContactPoint("cas121.devf3.com").withPort(9042).build();
            Session s = c.connect("rwille");
            s.execute("select rhpath from browse_document_tree");
            s.shutdown();
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(0);
    }
}
Running this program occasionally produces the following output: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/Users/rwille/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/Users/rwille/workspace_fold3/dev-backend/extern/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] - Cannot find LZ4 class, you should make sure the LZ4 library is in the classpath if you intend to use it. LZ4 compression will not be available for the protocol. - [Control connection] Cannot connect to any host, scheduling retry com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried) at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:64) at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:271) at com.datastax.driver.core.Session$Manager.setKeyspace(Session.java:461) at com.datastax.driver.core.Cluster.connect(Cluster.java:178) at com.footnote.tools.cassandra.Test.main(Test.java:17) Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried) at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:96) at com.datastax.driver.core.Session$Manager.execute(Session.java:513) at com.datastax.driver.core.Session$Manager.executeQuery(Session.java:549) at com.datastax.driver.core.Session$Manager.setKeyspace(Session.java:455) ... 2 more It isn't very often that it fails. I had to run it about 20 times before it got an error. However, because I cannot truncate, I have resorted to dropping and recreating my schema for every unit test. I often have a random test case fail with this same error. The inability to truncate is actually my bigger problem. If I could truncate tables, then I wouldn't have to create so many sessions, and the frequency of this error would be at tolerable levels. Thanks in advance.
Robert From: Turi, Ferenc (GE Power Water, Non-GE) ferenc.t...@ge.com Reply-To: user@cassandra.apache.org Date: Thursday, November 21, 2013 12:26 PM To: user@cassandra.apache.org user@cassandra.apache.org Subject: RE: Intermittent connection error Hi, Please attach the source to have deeper look at it. Ferenc From: Robert Wille [mailto:rwi...@fold3.com] Sent: Thursday, November 21, 2013 7:11 PM To: user@cassandra.apache.org Subject: Intermittent connection error I intermittently get the following error when I try to execute my first query after connecting: Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried) at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:64) at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException
Re: 1.1.11: system keyspace is filling up
What happens if they are not being successfully delivered ? Will they eventually TTL out ? They have a TTL set to the gc_grace_seconds of the CF at the time of the write. I've also seen hints build up in multi DC systems due to timeouts on the coordinator, i.e. the remote nodes are up, the coordinator starts the writes, the remote nodes process the request (no dropped messages), but the response is lost. These are tracked as timeouts on the MessagingServiceMBean. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 22/11/2013, at 6:00 pm, Rahul Menon ra...@apigee.com wrote: Oleg, The system keyspace is not replicated, it is local to the node. You should check your logs to see if there are timeouts from streaming hints; I believe the default timeout to stream hints is 10 seconds. When I ran into this problem I truncated hints to clear out the space and then ran a repair to ensure that all the data was consistent across all nodes, even if there was a failure. -rm On Tue, Nov 5, 2013 at 6:29 PM, Oleg Dulin oleg.du...@gmail.com wrote: What happens if they are not being successfully delivered ? Will they eventually TTL out ? Also, do I need to truncate hints on every node or is it replicated ? Oleg On 2013-11-04 21:34:55 +0000, Robert Coli said: On Mon, Nov 4, 2013 at 11:34 AM, Oleg Dulin oleg.du...@gmail.com wrote: I have a dual DC setup, 4 nodes, RF=4 in each. The one that is used as primary has its system keyspace fill up with 200 gigs of data, the majority of which is hints. Why does this happen ? How can I clean it up ? If you have this many hints, you probably have flapping / frequent network partition, or very overloaded nodes. If you compare the number of hints to the number of dropped messages, that would be informative. If you're hinting because you're dropping, increase capacity. If you're hinting because of partition, figure out why there's so much partition. WRT cleaning up hints, they will automatically be cleaned up eventually, as long as they are successfully being delivered. If you need to manually clean them up you can truncate the system.hints keyspace. =Rob -- Regards, Oleg Dulin http://www.olegdulin.com
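On 1.2 and later, where hints live in the system.hints CQL table, a hedged way to inspect and clear them from cqlsh (column names taken from the 1.2 schema; verify against your version first):

SELECT target_id, dateOf(hint_id) FROM system.hints LIMIT 20;
TRUNCATE system.hints;

On the 1.1 line in this thread the hints storage differs, so inspect the system keyspace column families there rather than assuming this table exists.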
Re: Large system.Migration CF after upgrade to 1.1
We have noticed that a cluster we upgraded to 1.1.6 (from 1.0.*) still has a single large (~4GB) row in system.Migrations on each cluster node. There is some code in there to drop that CF at startup, but I'm not sure of the requirements for it to run. If the file timestamps have not been updated in a while, copy the files out of the way and restart. We are also seeing heap pressure / Full GC issues when we do schema updates to this cluster How much memory does the machine have and how is the JVM configured ? On pre 1.1 that is often a result of memory pressure from the bloom filters and compression metadata being on the JVM heap. Do you have a lot (i.e. 500 million) of rows per node ? Check how small CMS can get the heap; it may be the case that it just cannot reduce it further. As a workaround you can increase the heap, increase bloom_filter_fp_chance (per cf) and index_interval (yaml). My talk called "In case of emergency break glass" at the summit in SF this year talks about this: http://thelastpickle.com/speaking/2013/06/11/Speaking-Cassandra-Summit-SF-2013.html Long term, moving to 1.2 will help. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 23/11/2013, at 10:35 am, Andrew Cooper andrew.coo...@nisc.coop wrote: We have noticed that a cluster we upgraded to 1.1.6 (from 1.0.*) still has a single large (~4GB) row in system.Migrations on each cluster node. We are also seeing heap pressure / Full GC issues when we do schema updates to this cluster. If the two are related, is it possible to somehow remove/truncate the system.Migrations CF? If I understand correctly, version 1.1 no longer uses this CF, instead using the system.schema_* CF's. We have multiple clusters, and clusters which were built from scratch at version 1.1 or 1.2 do not have data in system.Migrations. I would appreciate any advice and I can provide more details if needed. -Andrew Andrew Cooper National Information Solutions Cooperative® 3201 Nygren Drive NW Mandan, ND 58554 e-mail: andrew.coo...@nisc.coop phone: 866.999.6472 ext 6824 direct: 701-667-6824
Re: How to set Cassandra config directory path
I noticed when I gave the path directly to cassandra.yaml, it works fine. Can't I give the directory path here, as mentioned in the doc? The documentation is wrong; the -Dcassandra.config param is used for the path of the yaml file, not the config directory. I've emailed d...@datastax.com to let them know. What I really want to do is to give the cassandra-topology.properties path to Cassandra. Set the CASSANDRA_CONF env var in cassandra.in.sh Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 21/11/2013, at 6:15 am, Bhathiya Jayasekara tobhathi...@gmail.com wrote: Hi all, I'm trying to set the conf directory path for Cassandra. According to [1], I can set it using a system variable as cassandra.config=directory But it doesn't seem to work for me when I give the conf directory path. I get the following exception. [2013-11-20 22:24:38,273] ERROR {org.apache.cassandra.config.DatabaseDescriptor} - Fatal configuration error org.apache.cassandra.exceptions.ConfigurationException: Cannot locate /home/bhathiya/cassandra/conf/etc at org.apache.cassandra.config.DatabaseDescriptor.getStorageConfigURL(DatabaseDescriptor.java:117) at org.apache.cassandra.config.DatabaseDescriptor.loadYaml(DatabaseDescriptor.java:134) at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:126) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:216) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:446) at org.wso2.carbon.cassandra.server.CassandraServerController$1.run(CassandraServerController.java:48) at java.lang.Thread.run(Thread.java:662) Cannot locate /home/bhathiya/cassandra/conf/etc Fatal configuration error; unable to start server. See log for stacktrace. I noticed when I gave the path directly to cassandra.yaml, it works fine. Can't I give the directory path here, as mentioned in the doc? What I really want to do is to give the cassandra-topology.properties path to Cassandra. [1] http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/tools/toolsCUtility_t.html Thanks, Bhathiya
Re: Cannot TRUNCATE
If it’s just a test system nuke it and try again :) Was there more than one node at any time ? Does nodetool status show only one node ? Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 21/11/2013, at 7:45 am, Robert Wille rwi...@fold3.com wrote: I've got a single node with all empty tables, and truncate fails with the following error: Unable to complete request: one or more nodes were unavailable. Everything else seems fine. I can insert, update, delete, etc. The only thing in the logs that looks relevant is this: INFO [HANDSHAKE-/192.168.98.121] 2013-11-20 11:36:59,064 OutboundTcpConnection.java (line 386) Handshaking version with /192.168.98.121 INFO [HANDSHAKE-/192.168.98.121] 2013-11-20 11:37:04,064 OutboundTcpConnection.java (line 395) Cannot handshake version with /192.168.98.121 I'm running Cassandra 2.0.2. I get the same error in cqlsh as I do with the java driver. Thanks Robert
Re: Config changes to leverage new hardware
However, for both writes and reads there was virtually no difference in the latencies. What sort of latency were you getting ? I'm still not very sure where the current *write* bottleneck is though. What numbers are you getting ? Could the bottleneck be the client ? Can it send writes fast enough to saturate the nodes ? As a rule of thumb you should get 3,000 to 4,000 (non counter) writes per second per core. Sample iostat data (captured every 10s) for the dedicated disk where commit logs are written is below. Does this seem like a bottleneck? Does not look too bad. Another interesting thing is that the linux disk cache doesn't seem to be growing in spite of a lot of free memory available. Things will only get paged in when they are accessed. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 21/11/2013, at 12:42 pm, Arindam Barua aba...@247-inc.com wrote: Thanks for the suggestions Aaron. As a follow up, we ran a bunch of tests with different combinations of these changes on a 2-node ring. The load was generated using cassandra-stress, run with default values to write 30 million rows, and read them back. However, for both writes and reads there was virtually no difference in the latencies. The different combinations attempted:
1. Baseline test with none of the below changes.
2. Grabbing the TLAB setting from 1.2.
3. Moving the commit logs too to the 7 disk RAID 0.
4. Increasing concurrent_read to 32, and concurrent_write to 64.
5. (3) + (4), i.e. moving commit logs to the RAID + increasing the concurrent_read and concurrent_write config to 32 and 64.
The write latencies were very similar, except for being ~3x worse at the 99.9th percentile and above for scenario (5) above. The read latencies were also similar, with (3) and (5) being a little worse at the 99.99th percentile. Overall, not making any changes, i.e. (1), performed as well or slightly better than any of the other changes. Running cassandra-stress on both the old and new hardware without making any config changes, the write performance was very similar, but the new hardware did show ~10x improvement in reads at the 99.9th percentile and higher. After thinking about this, the reason we were not seeing any difference with our test framework was perhaps the nature of the test, where we write the rows and then immediately do a bunch of reads for the rows that were just written. The data is read back from the memtables, and never from the disk/sstables, hence the new hardware's increased RAM, larger disk cache, and higher number of disks never help. I'm still not very sure where the current *write* bottleneck is though. The new hardware has 32 cores vs 8 cores of the old hardware. Moving the commit log from a dedicated disk to a 7 disk RAID-0 system (where it would be shared by other data though) didn't make a difference either (unless the extra contention on the RAID nullified the positive effects of the RAID). Sample iostat data (captured every 10s) for the dedicated disk where commit logs are written is below. Does this seem like a bottleneck? When the commit logs are written the await/svctm ratio is high.
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
0.00 8.09 0.04 8.85 0.00 0.07 15.74 0.00 0.12 0.03 0.02
0.00 768.03 0.00 9.49 0.00 3.04 655.41 0.04 4.52 0.33 0.31
0.00 8.10 0.04 8.85 0.00 0.07 15.75 0.00 0.12 0.03 0.02
0.00 752.65 0.00 10.09 0.00 2.98 604.75 0.03 3.00 0.26 0.26
Another interesting thing is that the linux disk cache doesn't seem to be growing in spite of a lot of free memory available. The total disk cache used reported by 'free' is less than the size of the sstables written, with over 100 GB unused RAM. Even in production, where we have the older hardware running with 32 GB RAM for a long time now, looking at 5 hosts in 1 DC, only 2.5 GB to 8 GB was used for the disk cache. The Cassandra java process uses the 8 GB allocated to it, and at least 10-15 GB on all the hosts is not used at all. Thanks, Arindam From: Aaron Morton [mailto:aa...@thelastpickle.com] Sent: Wednesday, November 06, 2013 8:34 PM To: Cassandra User Subject: Re: Config changes to leverage new hardware Running Cassandra 1.1.5 currently, but evaluating to upgrade to 1.2.11 soon. You will make more use of the extra memory moving to 1.2 as it moves bloom filters and compression data off heap. Also grab the TLAB setting from cassandra-env.sh in v1.2 As of now, our performance tests (our application specific as well
Re: Is there any open source software for automatized deploy C* in PRD?
Thanks, But I suppose it's just for Debian? Am I right? There are debian and rpm packages, and people deploy them or the binary packages with chef and similar tools. It may be easier to answer your question if you describe the specific platform / needs. cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 21/11/2013, at 10:35 pm, Boole.Z.Guo (mis.cnsh04.Newegg) 41442 boole.z@newegg.com wrote: Thanks, But I suppose it's just for Debian? Am I right? Any others? Best Regards, Boole Guo Software Engineer, NESC-SH.MIS +86-021-51530666*41442 Floor 19, KaiKai Plaza, 888, Wanhangdu Rd, Shanghai (200042) From: Mike Adamson [mailto:mikeat...@gmail.com] Sent: 21 November 2013 17:16 To: user@cassandra.apache.org Subject: Re: Is there any open source software for automatized deploy C* in PRD? Hi Boole, Have you tried chef? There is this cookbook for deploying cassandra: http://community.opscode.com/cookbooks/cassandra MikeA On 21 November 2013 01:33, Boole.Z.Guo (mis.cnsh04.Newegg) 41442 boole.z@newegg.com wrote: Hi all, Is there any open source software for automatized deploy C* in PRD? Best Regards, Boole Guo Software Engineer, NESC-SH.MIS +86-021-51530666*41442 Floor 19, KaiKai Plaza, 888, Wanhangdu Rd, Shanghai (200042) ONCE YOU KNOW, YOU NEWEGG. CONFIDENTIALITY NOTICE: This email and any files transmitted with it may contain privileged or otherwise confidential information. It is intended only for the person or persons to whom it is addressed. If you received this message in error, you are not authorized to read, print, retain, copy, disclose, disseminate, distribute, or use this message, any part thereof, or any information contained therein. Please notify the sender immediately and delete all copies of this message. Thank you in advance for your cooperation.
Re: Migration Cassandra 2.0 to Cassandra 2.0.2
Mr Coli: What's the difference between deploy binaries and the binary package ? I downloaded the binary package from the Apache Cassandra homepage, am I wrong ? Yes, you can use the instructions here for the binary package http://wiki.apache.org/cassandra/DebianPackaging When you use the binary package it creates the directory locations, installs the init scripts and makes it a lot easier to start and stop cassandra. I recommend using them. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 21/11/2013, at 11:06 pm, Bonnet Jonathan. jonathan.bon...@externe.bnpparibas.com wrote: Thanks Mr Coli and Mr Wee for your answers. Mr Coli: What's the difference between deploy binaries and the binary package ? I downloaded the binary package from the Apache Cassandra homepage, am I wrong ? Mr Wee: I think you hit on the right answer, because the lib directories in my Cassandra homes are different between the two versions. In the home for the old version, /produits/cassandra/install_cassandra/apache-cassandra-2.0.0/lib, I have:
[cassandra@s00vl9925761 lib]$ ls -ltr
total 14564
-rw-r- 1 cassandra cassandra 123898 Aug 28 15:07 thrift-server-0.3.0.jar
-rw-r- 1 cassandra cassandra 42854 Aug 28 15:07 thrift-python-internal-only-0.7.0.zip
-rw-r- 1 cassandra cassandra 55066 Aug 28 15:07 snaptree-0.1.jar
-rw-r- 1 cassandra cassandra 1251514 Aug 28 15:07 snappy-java-1.0.5.jar
-rw-r- 1 cassandra cassandra 270552 Aug 28 15:07 snakeyaml-1.11.jar
-rw-r- 1 cassandra cassandra 8819 Aug 28 15:07 slf4j-log4j12-1.7.2.jar
-rw-r- 1 cassandra cassandra 26083 Aug 28 15:07 slf4j-api-1.7.2.jar
-rw-r- 1 cassandra cassandra 134133 Aug 28 15:07 servlet-api-2.5-20081211.jar
-rw-r- 1 cassandra cassandra 1128961 Aug 28 15:07 netty-3.5.9.Final.jar
-rw-r- 1 cassandra cassandra 80800 Aug 28 15:07 metrics-core-2.0.3.jar
-rw-r- 1 cassandra cassandra 134748 Aug 28 15:07 lz4-1.1.0.jar
-rw-r- 1 cassandra cassandra 481534 Aug 28 15:07 log4j-1.2.16.jar
-rw-r- 1 cassandra cassandra 347531 Aug 28 15:07 libthrift-0.9.0.jar
-rw-r- 1 cassandra cassandra 16046 Aug 28 15:07 json-simple-1.1.jar
-rw-r- 1 cassandra cassandra 91183 Aug 28 15:07 jline-1.0.jar
-rw-r- 1 cassandra cassandra 17750 Aug 28 15:07 jbcrypt-0.3m.jar
-rw-r- 1 cassandra cassandra 5792 Aug 28 15:07 jamm-0.2.5.jar
-rw-r- 1 cassandra cassandra 765648 Aug 28 15:07 jackson-mapper-asl-1.9.2.jar
-rw-r- 1 cassandra cassandra 228286 Aug 28 15:07 jackson-core-asl-1.9.2.jar
-rw-r- 1 cassandra cassandra 96046 Aug 28 15:07 high-scale-lib-1.1.2.jar
-rw-r- 1 cassandra cassandra 1891110 Aug 28 15:07 guava-13.0.1.jar
-rw-r- 1 cassandra cassandra 66843 Aug 28 15:07 disruptor-3.0.1.jar
-rw-r- 1 cassandra cassandra 91982 Aug 28 15:07 cql-internal-only-1.4.0.zip
-rw-r- 1 cassandra cassandra 54345 Aug 28 15:07 concurrentlinkedhashmap-lru-1.3.jar
-rw-r- 1 cassandra cassandra 25490 Aug 28 15:07 compress-lzf-0.8.4.jar
-rw-r- 1 cassandra cassandra 284220 Aug 28 15:07 commons-lang-2.6.jar
-rw-r- 1 cassandra cassandra 30085 Aug 28 15:07 commons-codec-1.2.jar
-rw-r- 1 cassandra cassandra 36174 Aug 28 15:07 commons-cli-1.1.jar
-rw-r- 1 cassandra cassandra 1695790 Aug 28 15:07 apache-cassandra-thrift-2.0.0.jar
-rw-r- 1 cassandra cassandra 71117 Aug 28 15:07 apache-cassandra-clientutil-2.0.0.jar
-rw-r- 1 cassandra cassandra 3265185 Aug 28 15:07 apache-cassandra-2.0.0.jar
-rw-r- 1 cassandra cassandra 1928009 Aug 28 15:07 antlr-3.2.jar
drwxr-x--- 2 cassandra cassandra 4096 Oct 1 14:16 licenses
In my new home I have
/produits/cassandra/install_cassandra/apache-cassandra-2.0.2/lib:
[cassandra@s00vl9925761 lib]$ ls -ltr
total 9956
-rw-r- 1 cassandra cassandra 123920 Oct 24 09:21 thrift-server-0.3.2.jar
-rw-r- 1 cassandra cassandra 52477 Oct 24 09:21 thrift-python-internal-only-0.9.1.zip
-rw-r- 1 cassandra cassandra 55066 Oct 24 09:21 snaptree-0.1.jar
-rw-r- 1 cassandra cassandra 1251514 Oct 24 09:21 snappy-java-1.0.5.jar
-rw-r- 1 cassandra cassandra 270552 Oct 24 09:21 snakeyaml-1.11.jar
-rw-r- 1 cassandra cassandra 26083 Oct 24 09:21 slf4j-api-1.7.2.jar
-rw-r- 1 cassandra cassandra 22291 Oct 24 09:21 reporter-config-2.1.0.jar
-rw-r- 1 cassandra cassandra 1206119 Oct 24 09:21 netty-3.6.6.Final.jar
-rw-r- 1 cassandra cassandra 82123 Oct 24 09:21 metrics-core-2.2.0.jar
-rw-r- 1 cassandra cassandra 165505 Oct 24 09:21 lz4-1.2.0.jar
-rw-r- 1 cassandra cassandra 217054 Oct 24 09:21 libthrift-0.9.1.jar
-rw-r- 1 cassandra cassandra 16046 Oct 24 09:21 json-simple-1.1.jar
-rw-r- 1 cassandra cassandra 91183 Oct 24 09:21 jline-1.0.jar
-rw-r- 1 cassandra cassandra 17750 Oct 24 09:21 jbcrypt-0.3m.jar
-rwxrwxrwx 1 cassandra
Re: Error: Unable to search across multiple secondary index types
java.lang.RuntimeException: java.lang.RuntimeException: Unable to search across multiple secondary index types A query that uses two secondary indexed columns would require a query planner to determine the most efficient approach. We don't support features like that. I would expect an empty response, but instead I get a "Request did not complete within rpc_timeout." info on the cqlsh interface and there is an error in the cassandra logs: That sounds like a bug, you should have gotten an error. Could you raise a bug on https://issues.apache.org/jira/browse/CASSANDRA Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 15/11/2013, at 10:22 pm, sielski siel...@man.poznan.pl wrote: Hello, I've installed Cassandra 2.0.2 and I'm trying to query a cassandra table using a SELECT statement with two WHERE clauses on columns with secondary indexes, but Cassandra throws an error as in the subject. It's easy to reproduce this problem. My table structure is as follows: CREATE TABLE test (c1 VARCHAR, c2 VARCHAR, c3 VARCHAR, PRIMARY KEY (c1, c2)); CREATE INDEX test_i1 ON test (c2); CREATE INDEX test_i2 ON test (c3); Then I execute a simple query on an empty table: SELECT * FROM test WHERE c2='whatever' AND c3 ='whatever' ALLOW FILTERING; I would expect an empty response, but instead I get a "Request did not complete within rpc_timeout." info on the cqlsh interface and there is an error in the cassandra logs: ERROR 09:57:36,394 Exception in thread Thread[ReadStage:35,5,main] java.lang.RuntimeException: java.lang.RuntimeException: Unable to search across multiple secondary index types at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1931) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.RuntimeException: Unable to search across multiple secondary index types at org.apache.cassandra.db.index.SecondaryIndexManager.search(SecondaryIndexManager.java:535) at org.apache.cassandra.db.ColumnFamilyStore.search(ColumnFamilyStore.java:1649) at org.apache.cassandra.db.RangeSliceCommand.executeLocally(RangeSliceCommand.java:135) at org.apache.cassandra.service.StorageProxy$LocalRangeSliceRunnable.runMayThrow(StorageProxy.java:1414) at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1927) Is it a bug, or is there a reason why I cannot execute such a query on this model? I saw issue https://issues.apache.org/jira/browse/CASSANDRA-5851 which is similar to mine, but it's marked as resolved in 2.0.0 and I'm using the most recent version. — Regards, Krzysztof Sielski
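A hedged alternative to intersecting two secondary indexes is to denormalize into a lookup table keyed by both values; the table name below is made up:

CREATE TABLE test_by_c2_c3 ( c2 varchar, c3 varchar, c1 varchar, PRIMARY KEY ((c2, c3), c1) );

SELECT * FROM test_by_c2_c3 WHERE c2 = 'whatever' AND c3 = 'whatever';

This turns the query into a single-partition read instead of asking Cassandra to plan across multiple indexes, at the cost of writing each row twice.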
Re: DESIGN QUESTION: Need to update only older data in cassandra
The problem occurs during the day, where updates can be sent that possibly contain older data than the nightly batch update. If you have an application-level sequence for updates (I used that term to avoid saying timestamp) you could use it as the cassandra timestamp. As long as you know it increases, it's fine. You can specify the timestamp for a column via either thrift or cql3. When the updates come in during the day, if they have an older timestamp just send the write and it will be ignored. Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 17/11/2013, at 8:45 am, Lawrence Turcotte lawrence.turco...@gmail.com wrote: That is, data consists of an account id with a timestamp column that indicates when the account was updated. This is not to be confused with the row insertion/update timestamp maintained by Cassandra for conflict resolution within the Cassandra nodes. Furthermore, the account has about 200 columns and updates occur nightly in batch mode where roughly 300-400 million updates are sent. The problem occurs during the day, where updates can be sent that possibly contain older data than the nightly batch update. Hence the requirement to first look at the account update timestamp in the database and compare it to the proposed update timestamp to determine whether to update or not. The idea here is that a read before update in Cassandra is generally not a good idea. To alleviate this problem I was thinking of either maintaining a separate Cassandra db with only two columns, account id and update timestamp, and using this as a lookup before updating, or setting a stored procedure within the main database to do the read and update if the data within the database is older: UPDATE Account SET some columns WHERE lastUpdateTimeStamp < proposedUpdateTimeStamp. I am kind of leaning towards the separate database or keyspace as a simple lookup to determine whether to update the data in the main Cassandra database, that is, the database that contains the 200 columns of account data. If this is the best choice then I would like to explore the pros and cons of creating a separate Cassandra node cluster for lookup of account update timestamps vs just adding another keyspace within the main Cassandra database, in terms of performance implications. In this account and timestamp only database I would need to also update the timestamp when the main database would be updated. Any thoughts are welcome Lawrence
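A minimal CQL3 sketch of the approach Aaron describes, with hypothetical table and column names; the application's own sequence number is supplied as the write timestamp (conventionally microseconds):

UPDATE account USING TIMESTAMP 1384640000000000 SET balance = 42 WHERE account_id = 123;

If a request carrying an older TIMESTAMP arrives later, its columns simply lose the last-write-wins comparison and the newer values remain, with no read before write needed.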
Re: Disaster recovery question
The first particular test we tried What was the disk_failure_policy setting ? 1) There were NO errors in the log on the node where we removed the commit log SSD drive - this surprised us (of course our ops monitoring would detect the downed disk too, but we hope to be able to look for ERROR level logging in system.log to cause alerts also) Can you reproduce this without needing to physically pull the drive ? Obviously there should be an error or warning there. Even if the disk_failure_policy says to ignore it, it should still log. 2) The node with no commit log disk just kept writing to memtables, but: 3) This was causing major CMS GC issues which eventually caused the node to appear down (nodetool status) to all other nodes, and indeed it itself saw all other nodes as down. That said, dynamic snitch and latency detection in clients seemed to prevent this being much of a problem, other than it seems potentially undesirable from a server side standpoint. The commit log has a queue that is 1024 * num processors long. If the write thread can get into this queue it will proceed (when using the periodic commit log), so if there was no error I would expect writes to work for a little while. But eventually this queue will get full and the write threads will not be able to proceed. The queue for the Mutation stage is essentially unbounded, so while the other nodes are sending writes it will continue to fill up, leading to the CMS issues. Seeing nodes as down is a side effect of JVM GC preventing the Gossip threads from running frequently enough. that said maybe someone knows off the top of their head if there is a config setting that would start failing writes (due to memtable size) before GC became an issue, and we just have this misconfigured. Nope. Cassandra does not have an explicit back pressure mechanism. The best we have is the dynamic snitch and the gossip to eventually mark a node as down. 5) I guess the question is what is the best way to bring up a failed node: a) delete all data first? b) clear data but restore from a previous sstable backup to minimise subsequent data transfer? c) other suggestions? It depends on the failure. In your example I would have brought it back either with or without the commit log, or with the commit log except the most recently modified file. There is protection in the commit log replay to only replay mutations that match the crc check. When it was back online I would run a repair (without -pr) to repair all the data on the node. I'm not sure the level DB error has to do with the commit log replay. 6) Our experience is that taking nodes down that have problems, then deleting data (subsets if we can see partial corruption) and re-adding is much safer (but our cluster is VERY fast). You should not need to do this, what sort of corruptions ? Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 17/11/2013, at 3:56 pm, graham sanderson gra...@vast.com wrote: agreed; that was a parallel issue from our ops (I apologize and will try to avoid duplicates) - I was asking the question from the architecture side as to what should happen rather than describing it as a bug. Nonetheless, I/We are still curious if anyone has an answer.
On Nov 16, 2013, at 6:13 PM, Mikhail Stepura mikhail.step...@outlook.com wrote: Looks like someone has the same (1-4) questions: https://issues.apache.org/jira/browse/CASSANDRA-6364 -M

graham sanderson wrote in message news:7161e7e0-cf24-4b30-b9ca-2faafb0c4...@vast.com... We are currently looking to deploy on the 2.0 line of Cassandra, but obviously are watching for bugs (we are currently on 2.0.2) - we are aware of a couple of interesting known bugs to be fixed in 2.0.3 and one in 2.1, but none have been observed (in production use cases) or are likely to affect our current proposed deployment. I have a few general questions. The first particular test we tried was to physically remove the SSD commit drive for one of the nodes whilst under HEAVY write load (maybe a few hundred MB/s of data to be replicated 3 times - 6 node single local data center) and also while running read performance tests. We currently have both node (CQL3) and Astyanax (Thrift) clients. Frankly everything was pretty good (no read/write failures or indeed (observed) latency issues) except, and maybe people can comment on any of these:
1) There were NO errors in the log on the node where we removed the commit log SSD drive - this surprised us (of course our ops monitoring would detect the downed disk too, but we hope to be able to look for ERROR level logging in system.log to cause alerts also)
2) The node with no commit log disk just kept writing to memtables, but:
3) This was causing major CMS GC issues which eventually caused the node to appear down (nodetool status) to all other nodes, and indeed it itself saw all other nodes as down.
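Since disk_failure_policy comes up twice in this thread, here is a minimal cassandra.yaml sketch of the setting for the 1.2/2.0 line (the value descriptions are paraphrased from the shipped yaml comments; treat the exact list as something to verify against your version):

# policy for data disk failures:
# ignore      - log the error and carry on using the failed disk
# stop        - shut down gossip and client transports, leaving the node
#               effectively dead but still inspectable via JMX
# best_effort - stop using the failed disk and serve whatever data remains
disk_failure_policy: stop

Note this policy covers the data volumes; CASSANDRA-6364, linked above, raises the same questions 1-4 for the commit log volume specifically (and, if memory serves, led to a separate commit_failure_policy setting in a later release).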
Re: Nodes not added to existing cluster
- broadcast_address is set to the instance's public address
You only need this if you have a multi-region setup.

I've gisted the results here: https://gist.github.com/skyebook/be5ee75a000a1e6d65d0
This error:
TRACE [HANDSHAKE-/NODE_1_PUBLIC_IP] 2013-11-18 06:57:13,984 OutboundTcpConnection.java (line 393) Cannot handshake version with /NODE_1_PUBLIC_IP
java.nio.channels.AsynchronousCloseException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:402)
    at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
    at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
    at java.io.InputStream.read(InputStream.java:101)
    at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:81)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.cassandra.net.OutboundTcpConnection$1.run(OutboundTcpConnection.java:387)
is preventing the node from reading the version, and results in this line being printed (-2147483648 is the "no version" flag):
OutboundTcpConnection.java (line 333) Target max version is -2147483648; no version information yet, will retry
Not really sure why that exception is being thrown; the Javadoc does not make it clear: http://docs.oracle.com/javase/7/docs/api/java/nio/channels/AsynchronousCloseException.html
Check the networking.
Hope that helps.
- Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com

On 18/11/2013, at 8:36 pm, Skye Book skye.b...@gmail.com wrote: Hi there, I'm bringing this thread back as it's something that I thought was solved but is apparently not fixed on my end. To recap, I'm having trouble getting a node to join a cluster. Configuration seems all right using the EC2MultiRegionSnitch, but new nodes are unable to handshake with the seeds.
- Security Group has ports 22 and 1024-65535 open
- Nodes are configured with password authentication using CassandraAuthorizer
- internode_authenticator is commented out in configuration
- rpc_address is set to the instance's private address
- listen_address is set to the instance's private address
- broadcast_address is set to the instance's public address
As was suggested earlier, I've enabled TRACE logging for OutboundTcpConnection and get the following dumped into system.log when the new node is started up without itself in the seed list (if its own IP is in the list it just creates a new single node cluster). I've gisted the results here: https://gist.github.com/skyebook/be5ee75a000a1e6d65d0 It looks like the handshake process completely and utterly fails, as it seems unable to get any information from the other nodes, as evidenced by:
OutboundTcpConnection.java (line 386) Handshaking version with /NODE_1_PUBLIC_IP
OutboundTcpConnection.java (line 386) Handshaking version with /NODE_2_PUBLIC_IP
OutboundTcpConnection.java (line 333) Target max version is -2147483648; no version information yet, will retry
Thanks in advance for any light you all might be able to shed on what's going on.

On Sep 26, 2013, at 9:03 PM, Aaron Morton aa...@thelastpickle.com wrote:
INFO 05:03:49,015 Cannot handshake version with /aa.bb.cc.dd
INFO 05:03:49,017 Handshaking version with /aa.bb.cc.dd
If you can turn up logging to TRACE for org.apache.cassandra.net.OutboundTcpConnection it will include the full error.
The two addresses that it is unable to handshake with are the other two addresses of nodes in the cluster I'm unable to join.
Are you mixing versions?
Cheers
- Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com

On 26/09/2013, at 5:13 PM, Skye Book skye.b...@gmail.com wrote: Hi Aaron, thanks for the clarification. As might be expected, having the broadcast_address fixed hasn't fixed anything. What I did find after writing my last email is that output.log is littered with these:
INFO 05:03:49,015 Cannot handshake version with /aa.bb.cc.dd
INFO 05:03:49,017 Handshaking version with /aa.bb.cc.dd
INFO 05:03:49,803 Cannot handshake version with /ww.xx.yy.zz
INFO 05:03:49,805 Handshaking version with /ww.xx.yy.zz
The two addresses that it is unable to handshake with are the other two addresses of nodes in the cluster I'm unable to join. I started thinking that maybe EC2 was having an un-advertised problem communicating between AZs, but bringing up nodes in both of the other availability zones resulted in the same wrong behavior. I've gisted my cassandra.yaml; it's pretty standard and hasn't caused an issue in the past for me. https://gist.github.com/skyebook/ec9364cdcec02e803ffc
Skye Book http://skyebook.net -- @sbook
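To make the addressing rules in this thread concrete, here is a minimal cassandra.yaml sketch for a node behind the EC2 multi-region snitch (all IPs are placeholders; the snitch spelling and seed_provider layout should be checked against your version's shipped yaml):

endpoint_snitch: Ec2MultiRegionSnitch

listen_address: 10.0.0.5          # instance's private address
rpc_address: 10.0.0.5             # instance's private address
broadcast_address: 54.200.10.5    # instance's public address

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # public addresses of the seed nodes; as Skye notes above, a node
          # that lists itself here will start its own single node cluster
          # rather than join the existing one
          - seeds: "54.200.10.6,54.200.10.7"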
Re: Read inconsistency after backup and restore to different cluster
we then take the snapshot archive generated FROM cluster-A_node1 and copy/extract/restore TO cluster-B_node1, then we
Sounds correct.

Depending on what additional comments/recommendations you or another member of the list may have (if any) based on the clarification I've made above,
Also, if you back up the system data it will bring along the tokens. This can be a pain if you want to change the cluster name.
cheers
- Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com

On 15/11/2013, at 10:44 am, David Laube d...@stormpath.com wrote: Thank you for the detailed reply Rob! I have replied to your comments in-line below;

On Nov 14, 2013, at 1:15 PM, Robert Coli rc...@eventbrite.com wrote:
On Thu, Nov 14, 2013 at 12:37 PM, David Laube d...@stormpath.com wrote: It is almost as if the data only exists on some of the nodes, or perhaps the token ranges are dramatically different --again, we are using vnodes so I am not exactly sure how this plays into the equation.
The token ranges are dramatically different, due to vnode random token selection from not setting initial_token and setting num_tokens. You can verify this by listing the tokens per physical node in nodetool gossipinfo or (iirc) nodetool status.
5. Copy 1 of the 5 snapshot archives from cluster-A to each of the five nodes in the new cluster-B ring.
I don't understand this at all. Do you mean that you are using one source node's data to load each of the target nodes? Or are you just saying there's a 1:1 relationship between source snapshots and target nodes to load into? Unless you have RF=N, using one source for 5 target nodes won't work.
We have configured RF=3 for the keyspace in question. Also, from a client perspective, we read with CL=1 and write with CL=QUORUM. Since we have 5 nodes total in cluster-A, we snapshot keyspace_name on each of the five nodes, which results in a snapshot directory on each of the five nodes that we archive and ship off to S3. We then take the snapshot archive generated FROM cluster-A_node1 and copy/extract/restore TO cluster-B_node1, then we take the snapshot archive FROM cluster-A_node2 and copy/extract/restore TO cluster-B_node2, and so on and so forth.
To do what I think you're attempting to do, you have basically two options.
1) don't use vnodes and do a 1:1 copy of snapshots
2) use vnodes and
a) get a list of tokens per node from the source cluster
b) put a comma delimited list of these in initial_token in cassandra.yaml on target nodes
c) probably have to un-set num_tokens (this part is unclear to me, you will have to test..)
d) set auto_bootstrap:false in cassandra.yaml
e) start target nodes; they will take the same ranges as the source cluster without bootstrapping
f) load schema / copy data into datadir (being careful of https://issues.apache.org/jira/browse/CASSANDRA-6245)
g) restart node or use nodetool refresh (I'd probably restart the node to avoid the bulk rename that refresh does) to pick up sstables
h) remove auto_bootstrap:false from cassandra.yaml
I *believe* this *should* work, but have never tried it as I do not currently run with vnodes. It should work because it basically makes implicit vnode tokens explicit in the conf file. If it *does* work, I'd greatly appreciate you sharing details of your experience with the list.
I'll start with parsing out the token ranges that our vnode config ends up assigning in cluster-A, and doing some creative config work on the target cluster-B we are trying to restore to as you have suggested. Depending on what additional comments/recommendation you or another member of the list may have (if any) based on the clarification I've made above, I will absolutely report back my findings here. General reference on tasks of this nature (does not consider vnodes, but treat vnodes as just a lot of physical nodes and it is mostly relevant) : http://www.palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra =Rob
Re: making sense of output from Eclipse Memory Analyzer tool taken from .hprof file
What version of Cassandra are you using? What are the JVM settings? (check with ps aux | grep cassandra) OOM in Cassandra 1.2+ is rare, but there is also https://issues.apache.org/jira/browse/CASSANDRA-5706 and https://issues.apache.org/jira/browse/CASSANDRA-6087

One instance of org.apache.cassandra.db.ColumnFamilyStore loaded by sun.misc.Launcher$AppClassLoader @ 0x613e1bdc8 occupies 984,094,664 (11.64%) bytes.
938MB is a bit of memory; the CFS and data tracker are dealing with the memtable. This may indicate things are not being flushed from memory correctly.

• java.lang.Thread @ 0x73e1f74c8 CompactionExecutor:158 - 839,225,000 (9.92%) bytes.
• java.lang.Thread @ 0x717f08178 MutationStage:31 - 809,909,192 (9.58%) bytes.
• java.lang.Thread @ 0x717f082c8 MutationStage:5 - 649,667,472 (7.68%) bytes.
• java.lang.Thread @ 0x717f083a8 MutationStage:21 - 498,081,544 (5.89%) bytes.
• java.lang.Thread @ 0x71b357e70 MutationStage:11 - 444,931,288 (5.26%) bytes.
Maybe very big rows and/or very big mutations.

hope that helps.
- Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com

On 15/11/2013, at 12:34 pm, Mike Koh defmike...@gmail.com wrote: I am investigating Java out-of-memory heap errors, so I created an .hprof file and loaded it into the Eclipse Memory Analyzer Tool, which gave some Problem Suspects. The first one looks like:
One instance of org.apache.cassandra.db.ColumnFamilyStore loaded by sun.misc.Launcher$AppClassLoader @ 0x613e1bdc8 occupies 984,094,664 (11.64%) bytes. The memory is accumulated in one instance of org.apache.cassandra.db.DataTracker$View loaded by sun.misc.Launcher$AppClassLoader @ 0x613e1bdc8.
If I click around into the verbiage, I believe I can pick out the name of a column family, but that is about it. Can someone explain what the above means in more detail and whether it is indicative of a problem? The next one looks like:
• java.lang.Thread @ 0x73e1f74c8 CompactionExecutor:158 - 839,225,000 (9.92%) bytes.
• java.lang.Thread @ 0x717f08178 MutationStage:31 - 809,909,192 (9.58%) bytes.
• java.lang.Thread @ 0x717f082c8 MutationStage:5 - 649,667,472 (7.68%) bytes.
• java.lang.Thread @ 0x717f083a8 MutationStage:21 - 498,081,544 (5.89%) bytes.
• java.lang.Thread @ 0x71b357e70 MutationStage:11 - 444,931,288 (5.26%) bytes.
If I click into the verbiage, the above compactions and mutations all seem to be referencing the same column family. Are the above related? Is there a way I can tell more exactly what is being compacted and/or mutated, more specifically than which column family?
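Since Aaron's hypothesis is very big rows and/or very big mutations, one hedged way to follow up is with nodetool (commands from the 1.2/2.0-era toolset; keyspace and column family names are placeholders):

# per-CF statistics, including the maximum compacted row size seen so far
nodetool cfstats

# row size and column count histograms for a single column family
nodetool cfhistograms my_keyspace my_columnfamily

A column family whose maximum compacted row size is hundreds of MB would line up with the heavy CompactionExecutor and MutationStage threads in the suspect list above.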