Re: Performance problem with large wide row inserts using CQL
On Fri, Feb 21, 2014 at 11:51 AM, Sylvain Lebresne sylv...@datastax.com wrote: On Thu, Feb 20, 2014 at 10:49 PM, Rüdiger Klaehn rkla...@gmail.com wrote: Hi Sylvain, I applied the patch to the cassandra-2.0 branch (this required some manual work, since I could not figure out which commit it was supposed to apply to, and it did not apply to the head of cassandra-2.0). Yeah, some commit yesterday made the patch not apply cleanly anymore. In any case, it's now committed to the cassandra-2.0 branch and will be part of 2.0.6. The benchmark now runs in pretty much identical time to the thrift-based benchmark: ~30s for 1000 inserts of 1 key/value pair each. Great work! Glad that it helped. Thanks for the quick fix. I was really starting to get irritated when the people at SO basically told me that there was something wrong in my code. I still have some questions regarding the mapping. Please bear with me if these are stupid questions; I am quite new to Cassandra. The basic Cassandra data model for a keyspace is something like this, right?

SortedMap<byte[], SortedMap<byte[], Pair<Long, byte[]>>>

where the outer key is the row key (it determines which server(s) the rest is stored on), the inner key is the column key, the Long is the timestamp (latest one wins), and the final byte[] is the value (which can be of size 0).

It's a reasonable way to think of how things are stored internally, yes. Though as DuyHai mentioned, the first map really sorts by token, and in general that means you mostly use the sorting of the second map concretely. Yes, understood. So the first SortedMap is sorted on some kind of hash of the actual key, to make sure the data gets evenly distributed across the nodes? What if my key is already a good hash: is there a way to use an identity function as the hash function (in CQL)? I am thinking about some kind of content-addressed storage, where the key is a 20-byte SHA-1 hash of the data (as in git). Obviously that is already a pretty good hash.
So if I have a table like the one in my benchmark (using blobs):

CREATE TABLE IF NOT EXISTS test.wide (
  time blob,
  name blob,
  value blob,
  PRIMARY KEY (time, name)
) WITH COMPACT STORAGE

From reading http://www.datastax.com/dev/blog/thrift-to-cql3 it seems that time maps to the row key and name maps to the column key without any overhead, and that value directly maps to the value in the model above without any prefix. Is that correct, or is there some overhead in CQL over the raw model as described above? If so, where exactly? That's correct. For completeness' sake: if you were to remove the COMPACT STORAGE, there would be some overhead in how it maps to the underlying column key, but that overhead would buy you much more flexibility in how you could evolve this table's schema (you could add more CQL columns later if need be, have collections, or have static columns following CASSANDRA-6561, which comes in 2.0.6; none of which you can have with COMPACT STORAGE). Note that it's perfectly fine to use COMPACT STORAGE if you know you don't and won't need the additional flexibility, but I generally advise people to first check that using COMPACT STORAGE makes a concrete and meaningful difference for their use case (be careful with premature optimization, really). In this case I am confident that the schema will not change. But there will be other tables built from the same data where I am not going to use compact storage. cheers, Rüdiger
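The two-level sorted-map model discussed above can be sketched in a few lines. This is a toy model for intuition only (all names are illustrative, and real Cassandra orders partitions by partitioner token, not by the raw row key):

```python
# Toy model of the layout described above: row key -> (column key ->
# (timestamp, value)), where on a write conflict the highest timestamp wins.
# Purely illustrative; not how Cassandra is actually implemented.

store = {}

def write(row_key, column_key, timestamp, value):
    row = store.setdefault(row_key, {})
    current = row.get(column_key)
    if current is None or timestamp > current[0]:
        row[column_key] = (timestamp, value)  # latest timestamp wins

write(b"row1", b"col1", 1, b"old")
write(b"row1", b"col1", 2, b"new")
write(b"row1", b"col1", 1, b"stale")  # ignored: older timestamp
```

After these three writes, `store[b"row1"][b"col1"]` holds `(2, b"new")`: the stale write with the older timestamp is discarded, which is exactly the "latest one wins" rule from the model above.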
Queuing System
Hi, I need to decouple some of the work being processed from the user thread to provide a better user experience. For that I need a queuing system with the following needs:
- High Availability
- No Data Loss
- Better Performance.
Following are some libraries that were considered, along with the limitations I see:
- Redis - Data Loss
- ZooKeeper - Not advised for a queue system.
- TokyoCabinet/SQLite/LevelDB - of these, LevelDB seems to perform better. With the replication requirement, I probably have to look at Apache ActiveMQ+LevelDB.
After checking on the third option above, I kind of wonder if Cassandra with Leveled Compaction offers a similar system. Do you see any issues in such a usage, or are there other better solutions available? Will be great to get insights on this. Regards, Jagan
Re: Disabling opscenter data collection in Datastax community 2.0
On 02/22/2014 06:12 AM, user 01 wrote: I'm using dsc20 (DataStax Community edition for Cassandra 2.0) in a production environment, but I am not authorized to use OpsCenter for production use. So how do I disable the data recording that is being done for OpsCenter consumption? It is unusable for me and will put unnecessary load on my machine. The agent is lightweight. How are you planning to monitor your production env? You didn't hint at how you installed DSC/OpsCenter, but: stop the agents, uninstall the agents, and drop the keyspace. The details of those steps depend on how you installed (rpm, deb, tar). http://www.datastax.com/documentation/opscenter/4.0/opsc/reference/opscInstallLocations_g.html http://www.datastax.com/documentation/opscenter/4.0/opsc/online_help/opscRemovingPackages_t.html Those docs might be helpful. Let us know how you installed DSC if you need some better details. -- Kind regards, Michael
Re: Queuing System
Jagan, queue-like data structures are known to be one of the worst anti-patterns for Cassandra: http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
Re: Queuing System
We use RabbitMQ for queuing and Cassandra for persistence. RabbitMQ with clustering and/or federation should meet your high availability needs. Michael
Re: Queuing System
While, historically, it has been true that queuing in Cassandra has been an anti-pattern, it is also true that Leveled Compaction addresses the worst aspect of frequent deletes in Cassandra, and that overall, queuing in Cassandra is nowhere near the anti-pattern that it used to be. This is something that I've been meaning to write about more extensively. If your requirements are more around availability (particularly multi-DC) and reliability with moderate (not extreme) performance, it is quite possible to build a pretty decent system on top of Cassandra. You don't mention your throughput requirements, nor additional semantics that might be necessary (e.g. deliver at-least-once vs. deliver exactly-once), but Cassandra 2.0's lightweight transactions provide a CAS primitive that can be used to ensure deliver-once if that is a requirement. I'd be happy to continue discussing appropriate data models and access patterns if you decide to go down this path. -Tupshin
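The deliver-once idea mentioned above can be sketched with an in-memory stand-in for the CAS primitive. This is not DataStax driver code; it only illustrates the semantics of a conditional claim ("set the owner only if it is still unset"), which is what a lightweight-transaction write like `UPDATE ... IF owner = null` would provide. All names are illustrative:

```python
# Sketch of deliver-once semantics via a compare-and-set claim.
# CasQueue.claim mimics a conditional write: it succeeds only if the
# message is still unclaimed, so exactly one consumer wins the race.

class CasQueue:
    def __init__(self):
        self._owner = {}  # message_id -> consumer_id

    def claim(self, message_id, consumer_id):
        """Compare-and-set: claim the message only if no one owns it yet."""
        if self._owner.get(message_id) is None:
            self._owner[message_id] = consumer_id
            return True   # analogous to [applied] = True
        return False      # another consumer already claimed it

q = CasQueue()
first = q.claim("msg-1", "consumer-A")   # True: A wins the claim
second = q.claim("msg-1", "consumer-B")  # False: already owned by A
```

In Cassandra the atomicity would come from Paxos underneath the conditional update, at a latency cost per claim; the application logic stays the same shape.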
Re: Queuing System
Hi Michael, yes, I am planning to use RabbitMQ for my messaging system. But I wonder which will give better performance: writing directly into Rabbit with ack support, vs. a temporary queue in Cassandra first, then dequeueing and publishing to Rabbit. The complexities involved - handling scenarios like Rabbit connection failures, etc., vs. Cassandra write performance and replication with hinted handoff support, etc. - make me wonder if this is the better path. Regards, Jagan
Re: Queuing System
If performance and availability for messaging is a requirement, then use Apache Kafka http://kafka.apache.org/ You can pass the same thrift/avro objects through the Kafka commit log, or strings, or whatever you want. /*** Joe Stein, Founder, Principal Consultant, Big Data Open Source Security LLC, http://www.stealth.ly, Twitter: @allthingshadoop ***/
Re: Queuing System
Hi Joe, if my understanding is right, Kafka does not satisfy the high availability/replication part well, because of the need for a leader and in-sync replicas. Regards, Jagan
Re: Queuing System
Hi, thanks for the pointer. Following are some options given there:
- If you know where your live data begins, hint Cassandra with a start column, to reduce the scan times and the amount of tombstones to collect.
- A broker will usually have some notion of what's next in the sequence, and thus be able to do much more targeted queries, down to a single record if the storage strategy were to choose monotonic sequence numbers.
So what we need to do is have some intelligence in using the system and avoid tombstones: either use the pointed column name, or use a proper start column if a slice query is used. Is that right, or am I missing something here? Regards, Jagan
Re: Queuing System
Thanks Tupshin for your assistance. As I mentioned in the other mail, yes, I am planning to use RabbitMQ for my messaging system. But I wonder which will give better performance: writing directly into Rabbit with ack support, vs. a temporary queue in Cassandra first, then dequeueing and publishing to Rabbit. I use Rabbit for messaging because of the routing and push-model communication, etc. So I am thinking of using Cassandra as a temporary queue, which would give fast write performance with no data loss, vs. waiting for the Rabbit ack at the application level or handling Rabbit reconnection, vs. Cassandra's hinted-handoff writes. So Cassandra might aggregate all my queued messages temporarily before I publish them to Rabbit. Is this fine? If so, please share your insight on which model & access pattern will be a better fit for this usage. Throughput requirements may be around, say, 100 ops/sec. Regards, Jagan
Re: List support in Net::Async::CassandraCQL ?
(resending for the list now I'm subscribed) On Sat, 22 Feb 2014 14:03:06 +1100 Jacob Rhoden jacob.rho...@me.com wrote: This perl library has been extremely useful for scripting up data migrations. I wonder if anyone knows the easiest way to use lists with this driver? Throwing a perl array in as a parameter doesn't work as is:

my $q = $cass->prepare("update contact set name=?, address=? where uuid=?")->get;
push @f, $q->execute([$name, @address, $uuid]);
Future->needs_all( @f )->get;

Returns the following: Cannot encode address: not an ARRAY at /usr/local/share/perl/5.14.2/Net/Async/CassandraCQL/Query.pm line 182

It needs to arrive as an ARRAY ref:

$q->execute([$name, \@address, $uuid]);

Or if you'd prefer, you can use named bindings:

$q->execute({name => $name, address => \@address, uuid => $uuid});

-- Paul "LeoNerd" Evans leon...@leonerd.org.uk ICQ# 4135350 | Registered Linux# 179460 http://www.leonerd.org.uk/
Re: Queuing System
Jagan, some time ago I dealt with a similar queuing design for a customer. *If you never delete messages in the queue*, then it is possible to use wide rows with bucketing and an increasing monotonic column name to store messages:

CREATE TABLE read_only_queue (
  bucket_number int,
  insertion_time timeuuid,
  message text,
  PRIMARY KEY (bucket_number, insertion_time)
);

Let's say that you allow only 100 000 messages per partition (physical row) to avoid too-wide rows. Then inserting into and reading from the table read_only_queue is easy.

For the message producer:
1) Start at bucket_number = 1
2) Insert messages with column name = a generated timeuuid with micro-second precision (depending on whether the insertion rate is high or not)
3) If the message count reaches 100 000, increment bucket_number by one and go to 2)

For the message reader:
1) Start at bucket_number = 1
2) Read messages in slices of N; save the insertion_time of the last read message
3) Use the saved insertion_time to perform the next slice query
4) If the read message count reaches 100 000, increment bucket_number and go to 2). Keep the insertion_time, do not reset it, since its value is increasing monotonically.

For multiple and concurrent producers and consumers, there is a trick. Let's assume you have *P* concurrent producers and *C* concurrent consumers. Assign a numerical ID to each producer and consumer: first producer ID = 1 ... last producer ID = *P*, and the same for consumers. Then:
- re-use the above algorithm
- each producer/consumer starts at bucket_number = its ID
- at the end of a row:
  - next bucket_number = current bucket_number + *P* for producers
  - next bucket_number = current bucket_number + *C* for consumers

The last thing to take care of is compaction configuration, to reduce the number of SSTables on disk. If you manage to avoid accumulation effects, i.e. the reading rate is faster than the writing rate, the messages are likely to be consumed while still in memory (in the memtable) on the server side.
In this particular case, you can optimize further by deactivating compaction for the table. Regards, Duy Hai
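The bucket-assignment trick above can be sketched directly: with P producers (or C consumers), worker i owns buckets i, i+P, i+2P, ..., so workers never collide on a partition. A minimal sketch, with illustrative names:

```python
# Sketch of the bucketed wide-row queue design described above.
# Each worker (producer or consumer) starts at its own ID and then
# strides by the total worker count, so bucket ownership never overlaps.

BUCKET_SIZE = 100_000  # max messages per partition, per the design above

def bucket_sequence(worker_id, num_workers):
    """Yield the bucket numbers owned by one producer or consumer."""
    bucket = worker_id           # each worker starts at its own ID
    while True:
        yield bucket
        bucket += num_workers    # skip buckets owned by the other workers

# With P = 3 producers, producer 1 writes buckets 1, 4, 7, ...
gen = bucket_sequence(1, 3)
first_three = [next(gen) for _ in range(3)]

# Producer 2 writes 2, 5, 8, ... -- disjoint from producer 1's buckets.
gen2 = bucket_sequence(2, 3)
producer2_three = [next(gen2) for _ in range(3)]
```

A worker moves to its next bucket after BUCKET_SIZE messages, exactly as in steps 3) and 4) above; since bucket ranges are disjoint per worker, no coordination between producers is needed beyond the initial ID assignment.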
Re: Queuing System
We use this same setup also, and it works great. Thunder - Reply message - From: Laing, Michael michael.la...@nytimes.com To: user@cassandra.apache.org Subject: Queuing System Date: Sat, Feb 22, 2014 7:31 AM We use RabbitMQ for queuing and Cassandra for persistence. RabbitMQ with clustering and/or federation should meet your high availability needs. Michael
Re: Queuing System
This seems a bit overkill. We run far more than 100 mps (closer to 600) in Rabbit, with very good latency, on a 3-node cluster. It has been very reliable as well. Thunder - Reply message - From: Jagan Ranganathan ja...@zohocorp.com To: user@cassandra.apache.org Subject: Queuing System Date: Sat, Feb 22, 2014 9:06 AM
[OT]: Can I have a non-delivering subscription?
A question about the mailing list itself, rather than Cassandra. I've re-subscribed simply because I have to be subscribed in order to send to the list, as I sometimes try to when people Cc questions about my Net::Async::CassandraCQL perl module to me. However, if I want to read the list, I usually do so on the online archives and not by mail. Is it possible to have a non-delivering subscription, which would let me send messages, but doesn't deliver anything back to me? -- Paul "LeoNerd" Evans leon...@leonerd.org.uk ICQ# 4135350 | Registered Linux# 179460 http://www.leonerd.org.uk/
Re: Queuing System
Without them you have no durability. With them you have guarantees... More than any other system with messaging features. It is a durable CP commit log. Works very well for data pipelines with AP systems like Cassandra which is a different system solving different problems. When a Kafka leader fails you right might block and wait for 10ms while a new leader is elected but writes can be guaranteed. The consumers then read and process data and write to Cassandra. And then have your app read from Cassandra for what what was processed. These are very typical type architectures at scale https://cwiki.apache.org/confluence/display/KAFKA/Kafka+papers+and+presentations /*** Joe Stein Founder, Principal Consultant Big Data Open Source Security LLC http://www.stealth.ly Twitter: @allthingshadoop / On Feb 22, 2014, at 11:49 AM, Jagan Ranganathan ja...@zohocorp.com wrote: Hi Joe, If my understanding is right, Kafka does not satisfy the high availability/replication part well because of the need for leader and In-Sync replicas. Regards, Jagan On Sat, 22 Feb 2014 22:02:27 +0530 Joe Steincrypt...@gmail.com wrote If performance and availability for messaging is a requirement then use Apache Kafka http://kafka.apache.org/ You can pass the same thrift/avro objects through the Kafka commit log or strings or whatever you want. /*** Joe Stein Founder, Principal Consultant Big Data Open Source Security LLC http://www.stealth.ly Twitter: @allthingshadoop / On Feb 22, 2014, at 11:13 AM, Jagan Ranganathan ja...@zohocorp.com wrote: Hi Michael, Yes I am planning to use RabbitMQ for my messaging system. But I wonder which will give better performance if writing directly into Rabbit with Ack support Vs a temporary Queue in Cassandra first and then dequeue and publish in Rabbit. Complexities involving - Handling scenarios like Rabbit Connection failure etc Vs Cassandra write performance and replication with hinted handoff support etc, makes me wonder if this is a better path. 
Regards, Jagan

On Sat, 22 Feb 2014 21:01:14 +0530 Michael Laing michael.la...@nytimes.com wrote: We use RabbitMQ for queuing and Cassandra for persistence. RabbitMQ with clustering and/or federation should meet your high availability needs. Michael

On Sat, Feb 22, 2014 at 10:25 AM, DuyHai Doan doanduy...@gmail.com wrote: Jagan, queue-like data structures are known to be one of the worst anti-patterns for Cassandra: http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets

On Sat, Feb 22, 2014 at 4:03 PM, Jagan Ranganathan ja...@zohocorp.com wrote: Hi, I need to decouple some of the work being processed from the user thread to provide a better user experience. For that I need a queuing system with the following needs:

- High availability
- No data loss
- Good performance

Following are some libraries that were considered, along with the limitations I see:

- Redis - data loss
- ZooKeeper - not advised for queue systems
- TokyoCabinet/SQLite/LevelDB - of these, LevelDB seems to perform best; given the replication requirement, I would probably have to look at Apache ActiveMQ+LevelDB

After checking the third option above, I wonder whether Cassandra with leveled compaction offers a similar system. Do you see any issues with such a usage, or are there better solutions available? It would be great to get insights on this. Regards, Jagan
Loading CQL PDO for CentOS PHP
I'm trying to get CQL going on my CentOS 5 Cassandra PHP platform. I've installed Thrift, but when I try to make cassandra-pdo (or YACassandraPDO, for that matter), none of the tests pass. And when I install it with PHP, phpinfo still doesn't show it loading, and it doesn't work. Any ideas would be appreciated. There are pretty good instructions here - https://code.google.com/a/apache-extras.org/p/cassandra-pdo/ - for other platforms, but I can't find anything devoted to CentOS. Spencer
Re: Update multiple rows in a CQL lightweight transaction
#5633 was actually closed because of the static columns feature (https://issues.apache.org/jira/browse/CASSANDRA-6561), which has been checked in to the 2.0 branch but is not yet part of a release (it will be in 2.0.6). That feature will let you update multiple rows within a single partition by doing a CAS write based on a static column shared by all rows within the partition. Example extracted from the ticket:

CREATE TABLE foo (
  x text,
  y bigint,
  t bigint static,
  z bigint,
  PRIMARY KEY (x, y)
);

insert into foo (x, y, t, z) values ('a', 1, 1, 10);
insert into foo (x, y, t, z) values ('a', 2, 2, 20);

select * from foo;

 x | y | t | z
---+---+---+----
 a | 1 | 2 | 10
 a | 2 | 2 | 20

(Note that both values of t are 2 because it is static.)

begin batch
  update foo set z = 1 where x = 'a' and y = 1;
  update foo set z = 2 where x = 'a' and y = 2 if t = 4;
apply batch;

 [applied] | x | y    | t
-----------+---+------+---
     False | a | null | 2

(Both updates failed to apply because there was an unmet condition on one of them.)

select * from foo;

 x | y | t | z
---+---+---+----
 a | 1 | 2 | 10
 a | 2 | 2 | 20

begin batch
  update foo set z = 1 where x = 'a' and y = 1;
  update foo set z = 2 where x = 'a' and y = 2 if t = 2;
apply batch;

 [applied]
-----------
      True

(Both updates succeeded because the check on t succeeded.)

select * from foo;

 x | y | t | z
---+---+---+---
 a | 1 | 2 | 1
 a | 2 | 2 | 2

Hope this helps. -Tupshin

On Fri, Feb 21, 2014 at 6:05 PM, DuyHai Doan doanduy...@gmail.com wrote: Hello Clint. The Resolution status of the JIRA is set to Later; probably the implementation is not done yet. The JIRA was opened to discuss implementation strategy, but nothing has been coded so far, I guess.

On Sat, Feb 22, 2014 at 12:02 AM, Clint Kelly clint.ke...@gmail.com wrote: Folks, does anyone know how I can modify multiple rows at once in a lightweight transaction in CQL3? I saw the following ticket: https://issues.apache.org/jira/browse/CASSANDRA-5633 but it was not obvious to me from the comments how (or whether) this got resolved.
I also couldn't find anything in the DataStax documentation about how to perform these operations. I'm particularly interested in how to perform a compare-and-set operation that modifies multiple rows (with the same partition key) using the DataStax Java driver. Thanks! Best regards, Clint
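The all-or-nothing behavior of such a conditional batch can be illustrated with a toy Python model (purely illustrative; the function and data layout here are invented for the sketch and are not driver API):

```python
# Toy model of Cassandra's static-column CAS batch semantics.
# A partition holds one shared static column plus per-row values;
# a conditional batch applies either entirely or not at all.

def apply_conditional_batch(partition, updates, condition=None):
    """Apply all updates atomically, but only if the condition holds.

    partition: {"static": {...}, "rows": {clustering_key: value}}
    updates:   list of (clustering_key, new_value)
    condition: optional (static_column_name, expected_value)
    Returns True if the batch applied, False otherwise.
    """
    if condition is not None:
        name, expected = condition
        if partition["static"].get(name) != expected:
            return False  # one unmet condition rejects the whole batch
    for key, value in updates:
        partition["rows"][key] = value
    return True

p = {"static": {"t": 2}, "rows": {1: 10, 2: 20}}

# Condition "if t = 4" fails, so neither row changes.
assert apply_conditional_batch(p, [(1, 1), (2, 2)], ("t", 4)) is False
assert p["rows"] == {1: 10, 2: 20}

# Condition "if t = 2" holds, so both rows update together.
assert apply_conditional_batch(p, [(1, 1), (2, 2)], ("t", 2)) is True
assert p["rows"] == {1: 1, 2: 2}
```

Because the static column is shared by every row in the partition, one check gates updates to any number of those rows, which is exactly what makes the multi-row CAS possible.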
abusing cassandra's multi DC abilities
Upfront TL;DR: We want to do stuff (reindex documents, bust cache) when changed data from DC1 shows up in DC2.

Full story: We're planning on adding data centers throughout the US. Our platform is used for business communications. Each DC currently utilizes Elasticsearch and Redis. A message can be sent from one user to another, and the intent is that it would be seen in near-real-time. This means that two people may be using different data centers, and the messages need to propagate from one to the other. On the plus side, we know we get this with Cassandra (fist pump), but the other pieces, not so much. Even if they did work, there are all sorts of race conditions that could pop up from having different pieces of our architecture communicating over different channels.

From this, we've arrived at the idea that since Cassandra is the authoritative data source, we might be able to trigger events in DC2 based on activity coming through the commit log or some other means. One idea was to use a CF with a low gc time as a means of transporting messages between DCs, and watching the commit logs for deletes to that CF in order to know when we need to do things like reindex a document (or a new document), bust cache, etc. Facebook did something similar with their modifications to MySQL to include cache keys in the replication log.

Assuming this is sane, I'd want to avoid having the same event register on 3 servers, thus registering 3 items in the queue when only one should be there. So, for any piece of data replicated from the other DC, I'd need a way to determine whether it was supposed to actually trigger the event or not. (Maybe it looks at the token and determines if the current server falls in the token range?) Or is there a better way?

So, my questions to all ye Cassandra users: 1. Is this even sane? 2. Is anyone doing it? -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
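On the "check the token range" idea, one hedged sketch of primary-owner election on a token ring, so that only one replica fires the event. This is toy code with an invented ring layout and hash; it is not Cassandra's actual partitioner API, just the shape of the dedup logic:

```python
import bisect
from hashlib import md5

def token_for(key: bytes) -> int:
    """Toy stand-in for a partitioner: map a key to a 128-bit ring position."""
    return int(md5(key).hexdigest(), 16)

def primary_owner(ring, key: bytes) -> str:
    """ring: list of (token, node) sorted by token, tokens covering the
    full 128-bit space. The primary owner is the first node whose token
    is >= the key's token (wrapping to the first node past the end)."""
    t = token_for(key)
    i = bisect.bisect_left([tok for tok, _ in ring], t)
    return ring[i % len(ring)][1]

def should_handle(me: str, ring, key: bytes) -> bool:
    # Every replica sees the replicated delete, but only the primary
    # owner enqueues the reindex/cache-bust job: one job, not three.
    return primary_owner(ring, key) == me

ring = [(2**128 // 3, "node-a"), (2 * 2**128 // 3, "node-b"), (2**128 - 1, "node-c")]
handlers = [n for _, n in ring if should_handle(n, ring, b"doc-42")]
assert len(handlers) == 1  # exactly one node acts on the event
```

The caveat in practice is ring changes (bootstrap, decommission): two nodes with stale views could briefly disagree about ownership, so the consumer of the queue still needs to tolerate occasional duplicates.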
Re: List support in Net::Async::CassandraCQL ?
Hi Paul,

On 23 Feb 2014, at 4:15 am, Paul LeoNerd Evans leon...@leonerd.org.uk wrote:

On Sat, 22 Feb 2014 14:03:06 +1100 Jacob Rhoden jacob.rho...@me.com wrote:

my $q = $cass->prepare("update contact set name=?, address=? where uuid=?")->get;
push @f, $q->execute([$name, @address, $uuid]);
Future->needs_all( @f )->get;

Returns the following: Cannot encode address: not an ARRAY at /usr/local/share/perl/5.14.2/Net/Async/CassandraCQL/Query.pm line 182

It needs to arrive as an ARRAYref:

$q->execute([$name, \@address, $uuid]);

Thanks! I did try this without success. Perhaps I am just making a simple Perl mistake then? I've been doing Java so long, my Perl is a little rusty:

my @address = ();
if(defined $a1 && $a1 ne '') { push @address, $a1; }
if(defined $a2 && $a2 ne '') { push @address, $a2; }
if(defined $a3 && $a3 ne '') { push @address, $a3; }
my @f;
my $q = $cass->prepare("update contact set name=?, address=? where uuid=?")->get;
push @f, $q->execute([$name, \@address, $uuid]);
Future->needs_all( @f )->get;

But this also returns an error: Cannot encode address: not an ARRAY at /usr/local/share/perl/5.14.2/Net/Async/CassandraCQL/Query.pm line 182
Re: Queuing System
Thanks Duy Hai for sharing the details. I have a doubt: if for some reason there is a network partition, or more than two nodes serving the same partition/load fail, and you end up writing hinted handoffs, is there a possibility of data loss? If yes, how do we avoid that? Regards, Jagan

On Sat, 22 Feb 2014 22:48:19 +0530 DuyHai Doan doanduy...@gmail.com wrote:

Jagan, some time ago I dealt with a similar queuing design for a customer. If you never delete messages from the queue, then it is possible to use wide rows with bucketing and a monotonically increasing column name to store messages.

CREATE TABLE read_only_queue (
  bucket_number int,
  insertion_time timeuuid,
  message text,
  PRIMARY KEY(bucket_number, insertion_time)
);

Let's say that you allow only 100,000 messages per partition (physical row) to avoid rows that are too wide; then inserting into and reading from the table read_only_queue is easy.

For the message producer:
1) Start at bucket_number = 1
2) Insert messages with column name = generated timeUUID with micro-second precision (depending on whether the insertion rate is high or not)
3) If message count = 100,000, increment bucket_number by one and go to 2)

For the message reader:
1) Start at bucket_number = 1
2) Read messages in slices of N, saving the insertion_time of the last read message
3) Use the saved insertion_time to perform the next slice query
4) If the read message count = 100,000, increment bucket_number and go to 2). Keep the insertion_time; do not reset it, since its value increases monotonically.

For multiple and concurrent producers & consumers, there is a trick. Let's assume you have P concurrent producers and C concurrent consumers. Assign a numerical ID to each producer and consumer: first producer ID = 1 ... last producer ID = P. Same for consumers.
- re-use the above algorithm
- each producer/consumer starts at bucket_number = its ID
- at the end of the row:
  - next bucket_number = current bucket_number + P for producers
  - next bucket_number = current bucket_number + C for consumers

The last thing to take care of is compaction configuration, to reduce the number of SSTables on disk. If you manage to avoid accumulation effects, i.e. the reading rate is faster than the writing rate, messages are likely to be consumed while still in memory (in the memtable) on the server side. In this particular case, you can optimize further by deactivating compaction for the table. Regards, Duy Hai

On Sat, Feb 22, 2014 at 5:56 PM, Jagan Ranganathan ja...@zohocorp.com wrote: Hi, thanks for the pointer. Following are some options given there: "If you know where your live data begins, hint Cassandra with a start column, to reduce the scan times and the amount of tombstones to collect. A broker will usually have some notion of what's next in the sequence and thus be able to do much more targeted queries, down to a single record if the storage strategy were to choose monotonic sequence numbers." So what we need to do is have some intelligence in using the system and avoid tombstones: either use the suggested column name scheme, or use a proper start column if a slice query is used. Is that right, or am I missing something here? Regards, Jagan

On Sat, 22 Feb 2014 20:55:39 +0530 DuyHai Doan doanduy...@gmail.com wrote: Jagan, queue-like data structures are known to be one of the worst anti-patterns for Cassandra: http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets

On Sat, Feb 22, 2014 at 4:03 PM, Jagan Ranganathan ja...@zohocorp.com wrote: Hi, I need to decouple some of the work being processed from the user thread to provide a better user experience. For that I need a queuing system with the following needs: high availability, no data loss, good performance.
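The bucket-hopping scheme above can be sketched in a few lines of Python (a toy simulation, not driver code; the bucket capacity is shrunk from 100,000 to 3 for readability):

```python
# Toy simulation of the bucketed queue scheme: producer i (1-based,
# P producers total) writes buckets i, i+P, i+2P, ... so no two
# producers ever share a partition.

BUCKET_CAPACITY = 3  # stand-in for the real 100,000 messages/partition

def producer_buckets(producer_id, total_producers, n_messages):
    """Yield (bucket_number, message_index) for each message a producer writes."""
    bucket = producer_id
    in_bucket = 0
    for msg in range(n_messages):
        if in_bucket == BUCKET_CAPACITY:
            bucket += total_producers  # hop to this producer's next bucket
            in_bucket = 0
        in_bucket += 1
        yield bucket, msg

# Two producers, 7 messages each: their bucket sequences never overlap.
p1 = {b for b, _ in producer_buckets(1, 2, 7)}
p2 = {b for b, _ in producer_buckets(2, 2, 7)}
assert p1 == {1, 3, 5}
assert p2 == {2, 4, 6}
assert p1.isdisjoint(p2)
```

Consumers follow the same stride with C instead of P, which is why each consumer can track a single monotonically increasing insertion_time without coordination.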
Following are some libraries that were considered, along with the limitations I see:

- Redis - data loss
- ZooKeeper - not advised for queue systems
- TokyoCabinet/SQLite/LevelDB - of these, LevelDB seems to perform best; given the replication requirement, I would probably have to look at Apache ActiveMQ+LevelDB

After checking the third option above, I wonder whether Cassandra with leveled compaction offers a similar system. Do you see any issues with such a usage, or are there better solutions available? It would be great to get insights on this. Regards, Jagan
Re: [OT]: Can I have a non-delivering subscription?
Yeah, it's called a rule. Set one up to delete everything from user@cassandra.apache.org. On 2/22/14, 10:32 AM, Paul LeoNerd Evans leon...@leonerd.org.uk wrote: A question about the mailing list itself, rather than Cassandra. I've re-subscribed simply because I have to be subscribed in order to send to the list, as I sometimes try to when people Cc questions about my Net::Async::CassandraCQL perl module to me. However, if I want to read the list, I usually do so on the online archives and not by mail. Is it possible to have a non-delivering subscription, which would let me send messages, but doesn't deliver anything back to me? -- Paul LeoNerd Evans leon...@leonerd.org.uk ICQ# 4135350 | Registered Linux# 179460 http://www.leonerd.org.uk/
Re: Disabling opscenter data collection in Datastax community 2.0
I would be using nodetool and JConsole for monitoring. It would be less informative, but I think it will do. In any case, I cannot use OpsCenter, as I am using DSC, not DSE, in production, so I am not allowed to use it for production use, isn't it? Not everyone here is using DSE either, hence OpsCenter is not used in every Cassandra production installation. I installed DSC20 using apt-get after adding the DataStax repository, as suggested in DataStax's Cassandra 2.0 docs. I found that the OpsCenter keyspace was created by default when I installed DSC20; it makes no sense for it to write data which goes unused if I don't use OpsCenter. On 2/22/14, Michael Shuler mich...@pbandjelly.org wrote: On 02/22/2014 06:12 AM, user 01 wrote: I'm using dsc20 (DataStax Community edition for Cassandra 2.0) in a production environment, but I am not authorized to use OpsCenter for production use. So how do I disable the data recording that is being done for OpsCenter's consumption, as this is just unusable for me and will put unnecessary load on my machine? The agent is lightweight. How are you planning to monitor your production env? You didn't hint at how you installed DSC/OpsCenter, but stop the agents, uninstall the agents, and drop the keyspace. The details of those steps depend on how you installed (rpm, deb, tar). http://www.datastax.com/documentation/opscenter/4.0/opsc/reference/opscInstallLocations_g.html http://www.datastax.com/documentation/opscenter/4.0/opsc/online_help/opscRemovingPackages_t.html Those docs might be helpful. Let us know how you installed DSC, if you need some better details. -- Kind regards, Michael
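If the goal is simply to stop the stored rollup data from accumulating after the agents are gone, a hedged sketch of the final cleanup step (this is a config/ops fragment assuming the default keyspace name "OpsCenter"; verify the actual name in your cluster before dropping anything):

```sql
-- Run in cqlsh only after the agents have been stopped and removed
-- (per the docs linked above), or they may recreate the keyspace.
-- Keyspace names are case-sensitive, hence the double quotes; confirm
-- the exact name first with: DESCRIBE KEYSPACES;
DROP KEYSPACE "OpsCenter";
```

This removes only the stored metrics; it does not affect your own keyspaces or the node's normal operation.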
Re: Disabling opscenter data collection in Datastax community 2.0
You can use OpsCenter in production with DSC/Apache Cassandra clusters. Some features are only enabled with DSE, but the rest work fine with DSC. -Tupshin On Feb 22, 2014 11:20 PM, user 01 user...@gmail.com wrote: I would be using nodetool and JConsole for monitoring. It would be less informative, but I think it will do. In any case, I cannot use OpsCenter, as I am using DSC, not DSE, in production, so I am not allowed to use it for production use, isn't it? Not everyone here is using DSE either, hence OpsCenter is not used in every Cassandra production installation. I installed DSC20 using apt-get after adding the DataStax repository, as suggested in DataStax's Cassandra 2.0 docs. I found that the OpsCenter keyspace was created by default when I installed DSC20; it makes no sense for it to write data which goes unused if I don't use OpsCenter. On 2/22/14, Michael Shuler mich...@pbandjelly.org wrote: On 02/22/2014 06:12 AM, user 01 wrote: I'm using dsc20 (DataStax Community edition for Cassandra 2.0) in a production environment, but I am not authorized to use OpsCenter for production use. So how do I disable the data recording that is being done for OpsCenter's consumption, as this is just unusable for me and will put unnecessary load on my machine? The agent is lightweight. How are you planning to monitor your production env? You didn't hint at how you installed DSC/OpsCenter, but stop the agents, uninstall the agents, and drop the keyspace. The details of those steps depend on how you installed (rpm, deb, tar). http://www.datastax.com/documentation/opscenter/4.0/opsc/reference/opscInstallLocations_g.html http://www.datastax.com/documentation/opscenter/4.0/opsc/online_help/opscRemovingPackages_t.html Those docs might be helpful. Let us know how you installed DSC, if you need some better details. -- Kind regards, Michael