Re: are there any free Cassandra -> ElasticSearch connector / plugin ?
I haven't used it yet, but have a look at https://github.com/vroyer/elassandra

--
Brian O'Neill
Principal Architect @ Monetate
m: 215.588.6024
bone...@monetate.com

> On Oct 13, 2016, at 6:02 PM, Eric Ho <e...@analyticsmd.com> wrote:
>
> I don't want to change my code to write into C* and then to ES.
> So, I'm looking for some sort of a sync tool that will sync my C* table into ES, and it should be smart enough to avoid duplicates or gaps.
> Is there such a tool / plugin?
> I'm using stock Apache Cassandra 3.7.
> I know that some premium Cassandra distributions have ES built in or integrated, but I can't afford premium right now...
> Thanks.
>
> -eric ho
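A common way to get the "no duplicates" property Eric asks about, regardless of which sync tool ends up doing the work, is to derive the Elasticsearch document `_id` deterministically from the Cassandra primary key, so re-indexing the same row overwrites instead of appending. A minimal sketch of that idea (the table layout, `sync_to_es` name, and the in-memory dicts standing in for Cassandra and ES are all illustrative, not part of any real connector):

```python
# Idempotent C* -> ES sync sketch: the ES _id is derived from the row's
# primary-key columns, so syncing the same row twice overwrites rather
# than duplicates. Plain dicts stand in for the C* table and the ES index.

def doc_id(row, pk_columns):
    """Build a stable document id from the primary-key columns."""
    return "|".join(str(row[c]) for c in pk_columns)

def sync_to_es(cassandra_rows, es_index, pk_columns):
    for row in cassandra_rows:
        es_index[doc_id(row, pk_columns)] = dict(row)  # upsert, never append
    return es_index

rows = [
    {"user": "alice", "ts": 1, "clicks": 3},
    {"user": "alice", "ts": 1, "clicks": 5},  # same key seen again, newer data
    {"user": "bob", "ts": 2, "clicks": 1},
]
index = sync_to_es(rows, {}, ["user", "ts"])
print(len(index))                   # 2 -- the re-synced row did not duplicate
print(index["alice|1"]["clicks"])   # 5 -- last write wins
```

Because the operation is an upsert keyed on the primary key, re-running a sync over an overlapping window closes gaps without creating duplicates.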
Re: Support for ad-hoc query
Cassandra isn't great at ad hoc queries. Many of us have paired it with an indexing engine like Solr or Elasticsearch (built into the DSE solution). As of late, I think there are a few of us exploring Spark SQL (which you can then use via JDBC or REST).

-brian

---
Brian O'Neill
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile | @boneill42 | http://www.twitter.com/boneill42

The information transmitted in this email message is for the intended recipient only and may contain confidential and/or privileged material. If you received this email in error and are not the intended recipient, or the person responsible to deliver it to the intended recipient, please contact the sender at the email above and delete this email and any attachments and destroy any copies thereof. Any review, retransmission, dissemination, copying or other use of, or taking any action in reliance upon, this information by persons or entities other than the intended recipient is strictly prohibited.

From: Srinivasa T N <seen...@gmail.com>
Reply-To: user@cassandra.apache.org
Date: Tuesday, June 9, 2015 at 2:38 AM
Subject: Support for ad-hoc query

Hi All,
I have a web application running with my backend data stored in Cassandra. Now I want to do some analysis on the stored data, which requires some ad-hoc queries fired on Cassandra. How can I do the same?

Regards,
Seenu.
Re: Spark SQL JDBC Server + DSE
Kudos, Ben. We've been tracking Zeppelin, and considered doing the same thing. You beat us to it. Well done.

-brian

From: Ben Bromhead <b...@instaclustr.com>
Date: Tuesday, June 2, 2015 at 5:05 PM
Subject: Re: Spark SQL JDBC Server + DSE

If you want a web-based notebook-style approach (similar to IPython), check out https://github.com/apache/incubator-zeppelin and https://github.com/apache/incubator-zeppelin/pull/86. Bonus: free pretty graphs!

On 1 June 2015 at 11:41, Sebastian Estevez <sebastian.este...@datastax.com> wrote:
Have you looked at job server?
https://github.com/spark-jobserver/spark-jobserver
https://www.youtube.com/watch?v=8k9ToZ4m6os
http://planetcassandra.org/blog/post/fast-spark-queries-on-in-memory-datasets/

All the best,
Sebastián Estévez
Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com

On Mon, Jun 1, 2015 at 8:13 AM, Mohammed Guller <moham...@glassbeam.com> wrote:
Brian,
We haven't open sourced the REST server, but we are not opposed to doing it. We just need to carve out some time to clean up the code and separate it from all the other stuff that we do in that REST server. I will try to do it in the next few weeks. If you need it sooner, let me know.
I did consider the option of writing our own Spark SQL JDBC driver for C*, but it is lower on the priority list right now.
Mohammed

From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill
Sent: Saturday, May 30, 2015 3:12 AM
To: user@cassandra.apache.org
Subject: Re: Spark SQL JDBC Server + DSE

Any chance you open-sourced, or could open-source, the REST server?
;)

In thinking about it... it doesn't feel like it would be that hard to write a Spark SQL JDBC driver against Cassandra, akin to what they have for Hive:
https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
I wouldn't mind collaborating on that, if you are headed in that direction. (And then I could write the REST server on top of that.)

LMK,
-brian

From: Mohammed Guller <moham...@glassbeam.com>
Date: Friday, May 29, 2015 at 2:15 PM
Subject: RE: Spark SQL JDBC Server + DSE

Brian,
I implemented a similar REST server last year and it works great. Now we have a requirement to support JDBC connectivity in addition to the REST API. We want to allow users to use tools like Tableau to connect to C* through the Spark SQL JDBC/Thrift server.
Mohammed
Re: Spark SQL JDBC Server + DSE
Any chance you open-sourced, or could open-source, the REST server? ;)

In thinking about it... it doesn't feel like it would be that hard to write a Spark SQL JDBC driver against Cassandra, akin to what they have for Hive:
https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
I wouldn't mind collaborating on that, if you are headed in that direction. (And then I could write the REST server on top of that.)

LMK,
-brian

From: Mohammed Guller <moham...@glassbeam.com>
Date: Friday, May 29, 2015 at 2:15 PM
Subject: RE: Spark SQL JDBC Server + DSE

Brian,
I implemented a similar REST server last year and it works great. Now we have a requirement to support JDBC connectivity in addition to the REST API. We want to allow users to use tools like Tableau to connect to C* through the Spark SQL JDBC/Thrift server.
Mohammed

From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill
Sent: Thursday, May 28, 2015 6:16 PM
To: user@cassandra.apache.org
Subject: Re: Spark SQL JDBC Server + DSE

Mohammed,
This doesn't really answer your question, but I'm working on a new REST server that allows people to submit SQL queries over REST, which get executed via Spark SQL. Based on what I started here:
http://brianoneill.blogspot.com/2015/05/spark-sql-against-cassandra-example.html
I assume you need JDBC connectivity specifically?

-brian

From: Mohammed Guller <moham...@glassbeam.com>
Date: Thursday, May 28, 2015 at 8:26 PM
Subject: RE: Spark SQL JDBC Server + DSE

Anybody out there using DSE + Spark SQL JDBC server?
Mohammed

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Tuesday, May 26, 2015 6:17 PM
To: user@cassandra.apache.org
Subject: Spark SQL JDBC Server + DSE

Hi,
As I understand, the Spark SQL Thrift/JDBC server cannot be used with open-source C*; only DSE supports the Spark SQL JDBC server. We would like to find out how many organizations are using this combination.
If you do use DSE + Spark SQL JDBC server, it would be great if you could share your experience. For example: What kinds of issues have you run into? How is the performance? What reporting tools are you using?

Thank you!
Mohammed
Re: Spark SQL JDBC Server + DSE
Mohammed,
This doesn't really answer your question, but I'm working on a new REST server that allows people to submit SQL queries over REST, which get executed via Spark SQL. Based on what I started here:
http://brianoneill.blogspot.com/2015/05/spark-sql-against-cassandra-example.html
I assume you need JDBC connectivity specifically?

-brian

From: Mohammed Guller <moham...@glassbeam.com>
Date: Thursday, May 28, 2015 at 8:26 PM
Subject: RE: Spark SQL JDBC Server + DSE

Anybody out there using DSE + Spark SQL JDBC server?
Mohammed

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Tuesday, May 26, 2015 6:17 PM
To: user@cassandra.apache.org
Subject: Spark SQL JDBC Server + DSE

Hi,
As I understand, the Spark SQL Thrift/JDBC server cannot be used with open-source C*; only DSE supports the Spark SQL JDBC server. We would like to find out how many organizations are using this combination. If you do use DSE + Spark SQL JDBC server, it would be great if you could share your experience. For example: What kinds of issues have you run into? How is the performance? What reporting tools are you using?
Thank you!
Mohammed
Re: cassandra and spark from cloudera distribution
It depends on which version of Spark you are running on Cloudera. Once you know that, have a look at the compatibility chart here:
https://github.com/datastax/spark-cassandra-connector

-brian

From: Serega Sheypak <serega.shey...@gmail.com>
Date: Wednesday, April 22, 2015 at 1:48 PM
Subject: Re: cassandra and spark from cloudera distribution

We already use it. We would like to use Spark from the Cloudera distribution. Should it work?

2015-04-22 19:43 GMT+02:00 Jay Ken <jaytechg...@gmail.com>:
There is an Enterprise Edition from DataStax, where they have Spark and Cassandra integration.
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
Thanks,
Jay

On Wed, Apr 22, 2015 at 6:41 AM, Serega Sheypak <serega.shey...@gmail.com> wrote:
Hi, are Cassandra and Spark from Cloudera compatible? Where can I find these compatibility notes?
Re: Adhoc querying in Cassandra?
+1, I think many organizations (including ours) pair Elasticsearch with Cassandra. Use Cassandra as your system of record, then index the data with ES.

-brian

From: Ali Akhtar <ali.rac...@gmail.com>
Date: Wednesday, April 22, 2015 at 7:52 AM
Subject: Re: Adhoc querying in Cassandra?

You might find it better to use Elasticsearch for your aggregate queries and analytics. Cassandra is more of just a data store.

On Apr 22, 2015 4:42 PM, Matthew Johnson <matt.john...@algomi.com> wrote:
Hi all,
Currently we are setting up a "big" data cluster. We are only going to have a couple of servers to start with, but we need to be able to scale out quickly when usage ramps up. Previously we have used Hadoop/HBase for our big data cluster, but since we are starting this one on only two nodes, I think Cassandra will be a much better fit, as Hadoop and HBase really need at least 3 nodes to achieve any sort of resilience (ZooKeeper quorum etc.).
My question is this: I have used Apache Phoenix as a JDBC layer on top of HBase, which allows me to issue ad-hoc SQL-style queries.
(E.g., count the number of times users have clicked on a certain button after clicking a different button in the last 3 weeks, etc.) My understanding is that CQL does not support this style of ad-hoc aggregate querying out of the box. Is there a recommended way to do count, sum, average, etc. without writing client code (in my case Java) every time I want to run one?
I have been looking at projects like Drill and Spark that could potentially sit on top of Cassandra, but without actually setting everything up and testing them, it is difficult to figure out what they would give us. Does anyone else interactively issue ad-hoc aggregate queries against Cassandra, and if so, what stack do you use?

Thanks!
Matt
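To make concrete what "writing client code every time" means, here is the kind of fold Matt would otherwise have to hand-roll for each question: page through the rows client-side and accumulate count/sum/avg. This is a hedged sketch; a list of dicts stands in for rows returned by a driver, and the column names are illustrative. Engines like Spark SQL or Drill essentially automate this scan for you.

```python
# Client-side ad-hoc aggregation: with no server-side aggregates, each
# count/sum/avg question becomes a full scan folded in the client.
# A list of dicts stands in for the paged result set of a CQL driver.

def aggregate(rows, column):
    count, total = 0, 0.0
    for row in rows:  # with a real driver this loop would consume result pages
        count += 1
        total += row[column]
    return {"count": count, "sum": total,
            "avg": total / count if count else None}

clicks = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5},
    {"user": "c", "clicks": 10},
]
stats = aggregate(clicks, "clicks")
print(stats)  # {'count': 3, 'sum': 18.0, 'avg': 6.0}
```

The pain point is that every new ad-hoc question means a new scan-and-fold like this, which is exactly what a SQL layer on top of the data store removes.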
Re: Adhoc querying in Cassandra?
Again, agreed. They have different usage patterns (C* is write-heavy, ES is read-heavy), so I would separate them. Solr should be sufficient; I believe DSE is a tight integration between Solr and C*.

-brian

From: Ali Akhtar <ali.rac...@gmail.com>
Date: Wednesday, April 22, 2015 at 8:10 AM
Subject: Re: Adhoc querying in Cassandra?

I believe Elasticsearch has better support for scaling horizontally (by adding nodes) than Solr does. Some benchmarks that I've looked at also show it performing better under high load. I probably wouldn't run them both on the same node, or you might see low performance as they compete for resources. What type of usage do you expect: mostly read, or mostly write?

On Wed, Apr 22, 2015 at 5:06 PM, Matthew Johnson <matt.john...@algomi.com> wrote:
Hi Ali, Brian,
Thanks for the suggestion. We have previously used Solr (SolrCloud for distribution) for a lot of other products; presumably this will do the same job as Elasticsearch? Or does Elasticsearch have specifically better integration with Cassandra, or better support for aggregate queries?
Would it be an OK architecture to have a Cassandra node and a Solr/ES instance on each box, so they scale together? Or is it better to have separate servers for storage and search?

Cheers,
Matt
Re: Cassandra - Storm
I'd recommend using Storm's State abstraction. Check out:
https://github.com/hmsonline/storm-cassandra-cql

-brian

From: Vanessa Gligor <vanessagli...@gmail.com>
Date: Friday, April 3, 2015 at 1:13 AM
Subject: Cassandra - Storm

Hi all,
Did anybody use Cassandra for tuple storage in Storm? I have this scenario: I have a spout (getting messages from RabbitMQ), and I want to save all these messages in Cassandra using a bolt. What is the best choice regarding the connection to the DB? I have read about the Hector API. I used it, but so far I wasn't able to add a new row to a column family. Any help would be appreciated.

Regards,
Vanessa.
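The main win of a State-style writer over writing one row per tuple is batching: buffer incoming tuples and flush them in groups, one round trip per group instead of per message. A minimal sketch of that pattern, with a plain list standing in for a Cassandra batch execution (the `BatchingWriter` name and sizes are illustrative, not the storm-cassandra-cql API):

```python
# Sketch of the batching pattern behind Trident/State-style writers:
# buffer tuples, flush in groups. `sink` is a list standing in for a
# Cassandra session executing batch statements.

class BatchingWriter:
    def __init__(self, sink, batch_size=3):
        self.sink = sink
        self.batch_size = batch_size
        self.buffer = []

    def write(self, tuple_):
        self.buffer.append(tuple_)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.append(list(self.buffer))  # one "batch" round trip
            self.buffer.clear()

batches = []
writer = BatchingWriter(batches, batch_size=3)
for msg in ["m1", "m2", "m3", "m4"]:   # messages arriving from the spout
    writer.write(msg)
writer.flush()  # flush the partial batch at the end of the stream/window
print(batches)  # [['m1', 'm2', 'm3'], ['m4']]
```

In Storm, the State abstraction ties the flush to tuple-tree acking, so a failed flush causes the tuples to be replayed rather than lost.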
Re: Frequent timeout issues
Are you using the storm-cassandra-cql driver (https://github.com/hmsonline/storm-cassandra-cql)? If so, what version? Batching or no batching?

-brian

From: Amlan Roy <amlan@cleartrip.com>
Date: Wednesday, April 1, 2015 at 11:37 AM
Subject: Re: Frequent timeout issues

Replication factor is 2.

CREATE KEYSPACE ct_keyspace WITH replication = { 'class': 'NetworkTopologyStrategy', 'DC1': '2' };

Inserts are happening from Storm using the Java driver, with prepared statements and no batching.

On 01-Apr-2015, at 8:42 pm, Brice Dutheil <brice.duth...@gmail.com> wrote:
And the keyspace? What is the replication factor? Also, how are the inserts done?

On Wednesday, April 1, 2015, Amlan Roy <amlan@cleartrip.com> wrote:
Write consistency level is ONE. This is the describe output for one of the tables.
CREATE TABLE event_data (
    event text,
    week text,
    bucket int,
    date timestamp,
    unique text,
    adt int,
    age list<int>,
    arrival list<timestamp>,
    bank text,
    bf double,
    cabin text,
    card text,
    carrier list<text>,
    cb double,
    channel text,
    chd int,
    company text,
    cookie text,
    coupon list<text>,
    depart list<timestamp>,
    dest list<text>,
    device text,
    dis double,
    domain text,
    duration bigint,
    emi int,
    expressway boolean,
    flight list<text>,
    freq_flyer list<text>,
    host text,
    host_ip text,
    inf int,
    instance text,
    insurance text,
    intl boolean,
    itinerary text,
    journey text,
    meal_pref list<text>,
    mkp double,
    name list<text>,
    origin list<text>,
    pax_type list<text>,
    payment text,
    pref_carrier list<text>,
    referrer text,
    result_cnt int,
    search text,
    src text,
    src_ip text,
    stops int,
    supplier list<text>,
    tags list<text>,
    total double,
    trip text,
    user text,
    user_agent text,
    PRIMARY KEY ((event, week, bucket), date, unique)
) WITH CLUSTERING ORDER BY (date DESC, unique ASC)
    AND bloom_filter_fp_chance=0.01
    AND caching='KEYS_ONLY'
    AND comment=''
    AND dclocal_read_repair_chance=0.10
    AND gc_grace_seconds=864000
    AND index_interval=128
    AND read_repair_chance=0.00
    AND replicate_on_write='true'
    AND populate_io_cache_on_flush='false'
    AND default_time_to_live=0
    AND speculative_retry='99.0PERCENTILE'
    AND memtable_flush_period_in_ms=0
    AND compaction={'class': 'SizeTieredCompactionStrategy'}
    AND compression={'sstable_compression': 'LZ4Compressor'};

On 01-Apr-2015, at 8:00 pm, Eric R Medley <emed...@xylocore.com> wrote:
Also, can you provide the table details and the consistency level you are using?
Regards,
Eric R Medley

On Apr 1, 2015, at 9:13 AM, Eric R Medley <emed...@xylocore.com> wrote:
Amlan,
Can you provide information on how much data is being written? Are any of the columns really large? Are any writes succeeding, or are all timing out?
Regards,
Eric R Medley

On Apr 1, 2015, at 9:03 AM, Amlan Roy <amlan@cleartrip.com> wrote:
Hi,
I am new to Cassandra. I have set up a cluster with Cassandra 2.0.13. I am writing the same data into HBase and Cassandra, and I find that the writes are extremely slow in Cassandra; I frequently see the exception "Cassandra timeout during write query at consistency ONE." The cluster sizes for HBase and Cassandra are the same. It looks like something is wrong with my cluster setup. What can be the possible issue? Data and commit logs are written to two separate disks.
Regards,
Amlan

-- Brice
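While the root cause of write timeouts is being investigated, a common client-side mitigation is to retry timed-out writes with exponential backoff, which is safe for plain Cassandra upserts because they are idempotent. This is a hedged sketch of the pattern only: the `WriteTimeout` class and `flaky_write` function here are stand-ins, not the real Java/Python driver API (which has its own retry-policy hooks).

```python
# Retry-with-backoff sketch for timed-out writes. Safe only because
# C* upserts are idempotent: replaying the same write is harmless.

import time

class WriteTimeout(Exception):
    """Stand-in for a driver's write-timeout exception."""

def write_with_retry(do_write, retries=3, base_delay=0.01):
    for attempt in range(retries + 1):
        try:
            return do_write()
        except WriteTimeout:
            if attempt == retries:
                raise                          # give up, surface the error
            time.sleep(base_delay * (2 ** attempt))  # 10ms, 20ms, 40ms...

attempts = []
def flaky_write():
    attempts.append(1)
    if len(attempts) < 3:
        raise WriteTimeout()  # first two attempts time out
    return "applied"

result = write_with_retry(flaky_write)
print(result)  # applied (on the third attempt)
```

Note that retries mask the symptom rather than fix it; a cluster that times out at consistency ONE under modest load usually has an underlying configuration or data-model problem, as the thread goes on to probe.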
Re: cassandra source code
FWIW, I just went through this, and posted the process I used to get up and running:
http://brianoneill.blogspot.com/2015/03/getting-started-with-cassandra.html

-brian

From: Divya Divs <divya.divi2...@gmail.com>
Date: Tuesday, March 24, 2015 at 1:29 AM
To: user@cassandra.apache.org, Jason Wee <peich...@gmail.com>, Eric Stevens <migh...@gmail.com>
Subject: cassandra source code

Hi, I'm Divya. I'm trying to run the source code of Cassandra in Eclipse, taking the source code from GitHub. I'm using Windows 64-bit and following the instructions from this website: http://runningcassandraineclipse.blogspot.in/. In the GitHub cassandra-trunk, the conf/log4j-server.properties file and the org.apache.cassandra.thrift.CassandraDaemon main class are not there. Please give me a document to run the source code of Cassandra, and kindly help me to proceed. Please reply as soon as possible. Thank you.
Re: IF NOT EXISTS on UPDATE statements?
FWIW, we have the exact same need, and we have been struggling with the differences in CQL between UPDATE and INSERT.

Our use case: we do in-memory dimensional aggregations that we want to write to C* using LWT (so it's a low volume of writes, because we are doing aggregations across time windows). On "commit", we:
1) Read the current value for the time window (which returns null if no value exists for the window, or current_value if one exists).
2) Then we need to UPSERT new_value for the window, where new_value = current_value + agg_value, but only if no other node has updated the value.

For (2), we would love to see: UPSERT value=new_value where (not exists || value=read_value) (ignoring some intricacies).

-brian

---
Brian O'Neill
Chief Technology Officer
Health Market Science
The Science of Better Results
2700 Horizon Drive, King of Prussia, PA 19406
M: 215.588.6024 | @boneill42 | healthmarketscience.com

From: Robert Stupp <sn...@snazy.de>
Date: Tuesday, November 18, 2014 at 12:35 PM
Subject: Re: IF NOT EXISTS on UPDATE statements?

There is no way to mimic IF NOT EXISTS on UPDATE, and it's not a bug.
INSERT and UPDATE are not totally orthogonal in CQL, and you should use INSERT for actual insertion and UPDATE for updates (granted, the database will not reject your query if you break this rule, but it's nonetheless the way it's intended to be used).

OK... (and not trying to be difficult here). We can't have it both ways. One of these use cases is a bug. You're essentially saying "don't do that, but yeah, you can do it..." Either UPDATE should support IF NOT EXISTS, or UPDATE should not perform INSERTs.

UPDATE performs like INSERT in the sense of an UPSERT: INSERT allows you to write the same primary key again, and UPDATE allows you to write data to a non-existing primary key (effectively inserting data). (That's what NoSQL databases do.) Take that as an advantage / feature not present in other DBs. "UPDATE IF EXISTS" and "INSERT IF NOT EXISTS" are *expensive* operations (they require serial consistency / LWT, which requires some more network round trips). "IF [NOT] EXISTS" is basically some kind of convenience. And please take into account that UPDATE also has an "IF column = value" condition (using LWT).
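Brian's two-step commit is a classic optimistic-concurrency loop: read, compute, then apply only if nothing changed underneath you. A hedged simulation of that loop, with an in-memory dict and a compare-and-set function standing in for Cassandra's LWT conditions ("UPDATE ... IF value = ?" / "INSERT ... IF NOT EXISTS"); the names and window keys are illustrative:

```python
# Simulation of the read -> compute -> conditional-upsert loop, using a
# dict with a compare-and-set (CAS) operation standing in for LWT.

def cas(table, key, expected, new):
    """Apply `new` only if the stored value still equals `expected`."""
    if table.get(key) != expected:
        return False  # another writer got there first; caller must retry
    table[key] = new
    return True

def add_to_window(table, key, delta):
    while True:
        current = table.get(key)           # step 1: read current value
        new = (current or 0) + delta       # step 2: compute new aggregate
        if cas(table, key, current, new):  # step 3: conditional upsert
            return new                     # else: lost the race, loop

windows = {}
add_to_window(windows, "2014-11-18T12:00", 5)  # acts like IF NOT EXISTS
add_to_window(windows, "2014-11-18T12:00", 3)  # acts like IF value = 5
print(windows["2014-11-18T12:00"])  # 8
```

The awkwardness Brian describes is that in CQL the first branch needs INSERT ... IF NOT EXISTS while the second needs UPDATE ... IF value = ?, which is exactly what the single "UPSERT ... where (not exists || value=read_value)" form would unify.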
Re: IF NOT EXISTS on UPDATE statements?
Exactly. Perfect. Will do. Thanks Robert. -brian --- Brian O'Neill Chief Technology Officer Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Robert Stupp sn...@snazy.de Reply-To: user@cassandra.apache.org Date: Tuesday, November 18, 2014 at 2:26 PM To: user@cassandra.apache.org Subject: Re: IF NOT EXISTS on UPDATE statements? For (2), we would love to see: UPSERT value=new_value where (not exists || value=read_value) That would be something like "UPDATE IF column=value OR NOT EXISTS". Took a look at the C* source and that feels like a LHF (for 3.0), so I opened https://issues.apache.org/jira/browse/CASSANDRA-8335 for that. Feel free to comment on that :)
Re: [ANN] SparkSQL support for Cassandra with Calliope
Well done Rohit. (and crew) -brian --- Brian O'Neill Chief Technology Officer Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Rohit Rai ro...@tuplejump.com Reply-To: user@cassandra.apache.org Date: Friday, October 3, 2014 at 2:16 PM To: user@cassandra.apache.org Subject: [ANN] SparkSQL support for Cassandra with Calliope Hi All, A year ago we started this journey and laid the path for the Spark + Cassandra stack. We established the groundwork and direction for Spark Cassandra connectors and we have been happy seeing the results. With the Spark 1.1.0 and SparkSQL release, it's time to take Calliope http://tuplejump.github.io/calliope/ to the next logical level, also paving the way for much more advanced functionality to come. Yesterday we released Calliope 1.1.0 Community Tech Preview https://twitter.com/tuplejump/status/517739186124627968 , which brings native SparkSQL support for Cassandra. Further details are available here http://tuplejump.github.io/calliope/tech-preview.html . This release showcases core spark-sql http://tuplejump.github.io/calliope/start-with-sql.html , hiveql http://tuplejump.github.io/calliope/start-with-hive.html and HiveThriftServer http://tuplejump.github.io/calliope/calliope-server.html support.
I differentiate it as native spark-sql integration because it doesn't rely on Cassandra's hive connectors (like Cash or DSE) and saves a level of indirection through Hive. It also allows us to harness Spark's analyzer and optimizer in the future to work out the best execution plan, targeting a balance between Cassandra's querying restrictions and Spark's in-memory processing. As far as we know, this is the first and only third-party data store connector for SparkSQL. This is a CTP release as it relies on Spark internals that don't yet have a stable developer API, and we will work with the Spark community on documenting the requirements and working towards a standard and stable API for third-party data store integration. On another note, we no longer require you to sign up to access the early access code repository. Inviting all of you to try it and give us your valuable feedback. Regards, Rohit Founder & CEO, Tuplejump, Inc. www.tuplejump.com http://www.tuplejump.com The Data Engineering Platform
Re: Cassandra blob storage
You may want to look at: https://github.com/Netflix/astyanax/wiki/Chunked-Object-Store -brian --- Brian O'Neill Chief Technology Officer Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: prem yadav ipremya...@gmail.com Reply-To: user@cassandra.apache.org Date: Tuesday, March 18, 2014 at 1:41 PM To: user@cassandra.apache.org Subject: Cassandra blob storage Hi, I have been spending some time looking into whether large files (100mb) can be stored in Cassandra. As per the Cassandra FAQ: Currently Cassandra isn't optimized specifically for large file or BLOB storage. However, files of around 64Mb and smaller can be easily stored in the database without splitting them into smaller chunks. This is primarily due to the fact that Cassandra's public API is based on Thrift, which offers no streaming abilities; any value written or fetched has to fit in memory. Does the above statement still hold? Thrift supports framed data transport; does that change the above statement? If not, why does cassandra not adopt the Thrift framed data transfer support? Thanks
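For reference, the chunked approach described on the Astyanax wiki can be sketched in CQL along these lines (the schema is a hypothetical illustration, not Astyanax's actual layout):

```sql
-- Split each file into pieces small enough to fit comfortably in memory
-- (e.g. ~1 MB) and store one piece per row within a single partition.
CREATE TABLE file_chunks (
  file_id   text,
  chunk_idx int,
  data      blob,
  PRIMARY KEY (file_id, chunk_idx)
);

-- Reassemble by reading the chunks back in clustering order:
SELECT data FROM file_chunks WHERE file_id = 'report.pdf' ORDER BY chunk_idx;
```

Chunking keeps any single read or write bounded in size, which sidesteps the fit-in-memory limitation the FAQ describes.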
Re: Proposal: freeze Thrift starting with 2.1.0
just when you thought the thread died First, let me say we are *WAY* off topic. But that is a good thing. I love this community because there are a ton of passionate, smart people. (often with differing perspectives ;) RE: Reporting against C* (@Peter Lin) We've had the same experience. Pig + Hadoop is painful. We are experimenting with Spark/Shark, operating directly against the data. http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html The Shark layer gives you SQL and caching capabilities that make it easy to use and fast (for smaller data sets). In front of this, we are going to add dimensional aggregations so we can operate at larger scales. (then the Hive reports will run against the aggregations) RE: REST Server (@Russel Bradbury) We had moderate success with Virgil, which was a REST server built directly on Thrift. We built it directly on top of Thrift so that one day it could be easily embedded in the C* server itself. It could be deployed separately, or run an embedded C*. More often than not, we ended up running it separately to separate the layers. (just like Titan and Rexster) I've started on a rewrite of Virgil called Memnon that rides on top of CQL. (I'd love some help) https://github.com/boneill42/memnon RE: CQL vs. Thrift We've hitched our wagons to CQL. CQL != Relational. We've had success translating our "native" schemas into CQL, including all the NoSQL goodness of wide rows, etc. You just need a good understanding of how things translate into storage and underlying CFs. If anything, I think we could add some DESCRIBE information, which would help users with this, along the lines of: (https://issues.apache.org/jira/browse/CASSANDRA-6676) CQL does open up the *opportunity* for users to articulate more complex queries using more familiar syntax. (including future things such as joins, grouping, etc.) To me, that is exciting, and again one of the reasons we are leaning on it.
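As a hedged illustration of the "translate into storage" point, a wide row in CQL terms looks like this (table and column names are made up for the example):

```sql
-- One partition key plus a clustering column gives a classic wide row:
-- all events for a given day land in one storage row, cells ordered by ts.
CREATE TABLE events_by_day (
  day   text,
  ts    timeuuid,
  event text,
  PRIMARY KEY (day, ts)   -- day = partition key, ts = clustering column
);
```

Under the hood each `day` partition holds one storage-engine row whose cell names are composed from `ts`, which is the same wide-row layout Thrift users built by hand.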
my two cents, brian --- Brian O'Neill Chief Technology Officer Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Peter Lin wool...@gmail.com Reply-To: user@cassandra.apache.org Date: Wednesday, March 12, 2014 at 8:44 AM To: user@cassandra.apache.org Subject: Re: Proposal: freeze Thrift starting with 2.1.0 yes, I was looking at intravert last nite. For the kinds of reports my customers ask us to do, joins and subqueries are important. Having tried to do a simple join in PIG, the level of pain is high. I'm a masochist, so I don't mind breaking a simple join into multiple MR tasks, though I do find myself asking why the hell does it need to be so painful in PIG? Many of my friends say "what is this crap!" or "this is better than writing sql queries to run reports?" Plus, using ETL techniques to extract summaries only works for cases where the data is small enough. Once it gets beyond a certain size, it's not practical, which means we're back to crappy reporting languages that make life painful. Lots of big healthcare companies have thousands of MOLAP cubes on dozens of mainframes. The old OLTP -> DW/OLAP model creates its own set of management headaches.
being able to report directly on the raw data avoids many of the issues, but that's my biased perspective. On Wed, Mar 12, 2014 at 8:15 AM, DuyHai Doan doanduy...@gmail.com wrote: "I would love to see Cassandra get to the point where users can define complex queries with subqueries, like, group by and joins" -- Did you have a look at Intravert? I think it does union/intersection on the server side for you. Not sure about join though.. On Wed, Mar 12, 2014 at 12:44 PM, Peter Lin wool...@gmail.com wrote: Hi Ed, I agree Solr is deeply integrated into DSE. I've looked at Solandra in the past and studied the code. My understanding is DSE uses Cassandra for storage and the user has both APIs available. I do think it can be integrated further to make moderate to complex queries easier and probably faster. That's why we built our own JPA-like object query API. I would love to see Cassandra get to the point where users can define complex queries with subqueries, like, group by and joins. Clearly lots of people want these features and even
[Blog] : Storm and Cassandra : A Three Year Retrospective
A community member asked for a blog post on Storm + Cassandra. FWIW, here was our journey. http://brianoneill.blogspot.com/2014/02/storm-and-cassandra-three-year.html -brian --- Brian O'Neill Chief Technology Officer Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com
Re: CQL list command
+1, agreed. I do the same thing. If the cli is going away, we'll need this ability in cqlsh. I created a JIRA issue for it: https://issues.apache.org/jira/browse/CASSANDRA-6676 We'll see what the crew comes back with. -brian --- Brian O'Neill Chief Technology Officer Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com On 2/7/14, 2:33 AM, Ben Hood 0x6e6...@gmail.com wrote: On Thu, Feb 6, 2014 at 9:01 PM, Andrew Cobley a.e.cob...@dundee.ac.uk wrote: I often use the CLI command LIST for debugging or when teaching students, showing them what's going on under the hood of CQL. I see that the CLI will be removed in Cassandra 3 and we will lose this ability. It would be nice if CQL retained it, or something like it, for debugging and teaching purposes. I agree. I use LIST every now and then to verify the storage layout of partitioning and clustering columns. What would be cool is to do something like: cqlsh:y> CREATE TABLE x ( ... a int, ... b int, ... c int, ... PRIMARY KEY (a,b) ...
); cqlsh:y> insert into x (a,b,c) values (1,1,1); cqlsh:y> insert into x (a,b,c) values (2,1,1); cqlsh:y> insert into x (a,b,c) values (2,2,1); cqlsh:y> select * from x; a | b | c ---+---+--- 1 | 1 | 1 2 | 1 | 1 2 | 2 | 1 (3 rows) cqlsh:y> select * from x show storage; -- hypothetical output (originally an ASCII diagram, one box per storage row): partition a:1 -> cells [ b:1 -> c:1 ] partition a:2 -> cells [ b:1 -> c:1 | b:2 -> c:1 ] (2 rows)
Re: Dimensional SUM, COUNT, DISTINCT in C* (replacing Acunu)
Thanks for the pointer Alain. At a quick glance, it looks like people are looking for query-time filtering/aggregation, which will suffice for small data sets. Hopefully we might be able to extend that to perform pre-computations as well. (which would support much larger data sets / volumes) I'll continue the discussion on the issue. thanks again, brian --- Brian O'Neill Chief Architect Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Alain RODRIGUEZ arodr...@gmail.com Reply-To: user@cassandra.apache.org Date: Wednesday, December 18, 2013 at 5:13 AM To: user@cassandra.apache.org Cc: d...@cassandra.apache.org Subject: Re: Dimensional SUM, COUNT, DISTINCT in C* (replacing Acunu) Hi, this would indeed be much appreciated by a lot of people. There is an existing issue about this subject: https://issues.apache.org/jira/browse/CASSANDRA-4914 Maybe you could help committers there. Hope this will be useful to you. Please let us know when you find a way to do these operations. Cheers. 2013/12/18 Brian O'Neill b...@alumni.brown.edu We are seeking to replace Acunu in our technology stack / platform. It is the only component in our stack that is not open source.
In preparation, over the last few weeks I've migrated Virgil to CQL. The vision is that Virgil could receive a REST request to upsert/delete data (hierarchical JSON to support collections). Virgil would look up the dimensions/aggregations for that table, add the key to the pertinent dimensional tables (e.g. DISTINCT), incorporate values into aggregations (e.g. SUMs) and increment/decrement relevant counters (COUNT). (using additional CFs) This seems straight-forward, but appears to require a read-before-write. (e.g. read the current value of a SUM, incorporate the new value, then use the lightweight transactions of C* 2.0 to conditionally update the value.) Before I go down this path, I was wondering if anyone is designing/working on the same, perhaps at a lower level? (CQL?) Is there any intent to support aggregations/filters (COUNT, SUM, DISTINCT, etc) at the CQL level? If so, is there a preliminary design? I can see a lower-level approach, which would leverage the commit logs (and mem/sstables) and perform the aggregation during read operations (and flush/compaction). thoughts? i'm open to all ideas. -brian -- Brian ONeill Chief Architect, Health Market Science (http://healthmarketscience.com) mobile: 215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Dimensional SUM, COUNT, DISTINCT in C* (replacing Acunu)
We are seeking to replace Acunu in our technology stack / platform. It is the only component in our stack that is not open source. In preparation, over the last few weeks I’ve migrated Virgil to CQL. The vision is that Virgil could receive a REST request to upsert/delete data (hierarchical JSON to support collections). Virgil would lookup the dimensions/aggregations for that table, add the key to the pertinent dimensional tables (e.g. DISTINCT), incorporate values into aggregations (e.g. SUMs) and increment/decrement relevant counters (COUNT). (using additional CF’s) This seems straight-forward, but appears to require a read-before-write. (e.g. read the current value of a SUM, incorporate the new value, then use the lightweight transactions of C* 2.0 to conditionally update the value.) Before I go down this path, I was wondering if anyone is designing/working on the same, perhaps at a lower level? (CQL?) Is there any intent to support aggregations/filters (COUNT, SUM, DISTINCT, etc) at the CQL level? If so, is there a preliminary design? I can see a lower-level approach, which would leverage the commit logs (and mem/sstables) and perform the aggregation during read-operations (and flush/compaction). thoughts? i'm open to all ideas. -brian -- Brian ONeill Chief Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
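The read-before-write step described above might look like this in CQL (a sketch under assumed table and column names):

```sql
-- 1) Read the current aggregate; suppose it returns value = 100.
SELECT value FROM dim_sums WHERE dim = 'region:east';

-- 2) Fold in the new delta (100 + 42) and write it back conditionally;
--    if [applied] comes back false, another writer won, so re-read and retry.
UPDATE dim_sums SET value = 142 WHERE dim = 'region:east' IF value = 100;
```

The retry loop makes the SUM update safe against concurrent writers, at the cost of the extra LWT round-trips per commit.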
Re: Drop keyspace via CQL hanging on master/trunk.
Great. Thanks Aaron. FWIW, I am/was porting Virgil over to CQL. I should be able to release a new REST API for C* (using CQL) shortly. -brian --- Brian O'Neill Chief Architect Health Market Science The Science of Better Results 2700 Horizon Drive • King of Prussia, PA • 19406 M: 215.588.6024 • @boneill42 • healthmarketscience.com On Dec 10, 2013, at 1:51 PM, Aaron Morton aa...@thelastpickle.com wrote: Looks like a bug, will try to fix today https://issues.apache.org/jira/browse/CASSANDRA-6472 Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder & Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 6/12/2013, at 10:25 am, Brian O'Neill b...@alumni.brown.edu wrote: I removed the data directory just to make sure I had a clean environment. (eliminating the possibility of corrupt keyspaces/files causing problems) -brian From: Jason Wee peich...@gmail.com Reply-To: user@cassandra.apache.org Date: Thursday, December 5, 2013 at 4:03 PM To: user@cassandra.apache.org Subject: Re: Drop keyspace via CQL hanging on master/trunk. Hey Brian, just out of curiosity, why would you remove the cassandra data directory entirely? /Jason On Fri, Dec 6, 2013 at 2:38 AM, Brian O'Neill b...@alumni.brown.edu wrote: When running Cassandra from trunk/master, I see a drop keyspace command hang at the CQL prompt. To reproduce: 1) Removed my cassandra data directory entirely 2) Fired up cqlsh, and executed the following CQL commands in succession: bone@zen:~/git/boneill42/cassandra> bin/cqlsh Connected to Test Cluster at localhost:9160. [cqlsh 4.1.0 | Cassandra 2.1-SNAPSHOT | CQL spec 3.1.1 | Thrift protocol 19.38.0] Use HELP for help. cqlsh> describe keyspaces; system system_traces cqlsh> create keyspace test_keyspace with replication = {'class':'SimpleStrategy', 'replication_factor':'1'}; cqlsh> describe keyspaces; system test_keyspace system_traces cqlsh> drop keyspace test_keyspace; THIS HANGS INDEFINITELY thoughts? user error? worth filing an issue? One other note: this happens using the CQL java driver as well. -brian --- Brian O'Neill Chief Architect Health Market Science The Science of Better Results 2700 Horizon Drive • King of Prussia, PA • 19406 M: 215.588.6024 • @boneill42 • healthmarketscience.com
Drop keyspace via CQL hanging on master/trunk.
When running Cassandra from trunk/master, I see a drop keyspace command hang at the CQL prompt. To reproduce: 1) Removed my cassandra data directory entirely 2) Fired up cqlsh, and executed the following CQL commands in succession: bone@zen:~/git/boneill42/cassandra> bin/cqlsh Connected to Test Cluster at localhost:9160. [cqlsh 4.1.0 | Cassandra 2.1-SNAPSHOT | CQL spec 3.1.1 | Thrift protocol 19.38.0] Use HELP for help. cqlsh> describe keyspaces; system system_traces cqlsh> create keyspace test_keyspace with replication = {'class':'SimpleStrategy', 'replication_factor':'1'}; cqlsh> describe keyspaces; system test_keyspace system_traces cqlsh> drop keyspace test_keyspace; THIS HANGS INDEFINITELY thoughts? user error? worth filing an issue? One other note: this happens using the CQL java driver as well. -brian --- Brian O'Neill Chief Architect Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com
Re: Drop keyspace via CQL hanging on master/trunk.
I removed the data directory just to make sure I had a clean environment. (eliminating the possibility of corrupt keyspaces/files causing problems) -brian --- Brian O'Neill Chief Architect Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Jason Wee peich...@gmail.com Reply-To: user@cassandra.apache.org Date: Thursday, December 5, 2013 at 4:03 PM To: user@cassandra.apache.org Subject: Re: Drop keyspace via CQL hanging on master/trunk. Hey Brian, just out of curiosity, why would you remove the cassandra data directory entirely? /Jason On Fri, Dec 6, 2013 at 2:38 AM, Brian O'Neill b...@alumni.brown.edu wrote: When running Cassandra from trunk/master, I see a drop keyspace command hang at the CQL prompt. To reproduce: 1) Removed my cassandra data directory entirely 2) Fired up cqlsh, and executed the following CQL commands in succession: bone@zen:~/git/boneill42/cassandra> bin/cqlsh Connected to Test Cluster at localhost:9160. [cqlsh 4.1.0 | Cassandra 2.1-SNAPSHOT | CQL spec 3.1.1 | Thrift protocol 19.38.0] Use HELP for help. cqlsh> describe keyspaces; system system_traces cqlsh> create keyspace test_keyspace with replication = {'class':'SimpleStrategy', 'replication_factor':'1'}; cqlsh> describe keyspaces; system test_keyspace system_traces cqlsh> drop keyspace test_keyspace; THIS HANGS INDEFINITELY thoughts? user error? worth filing an issue? One other note: this happens using the CQL java driver as well. -brian
Re: Main method not found in class org.apache.cassandra.service.CassandraDaemon
Vivek, The location of CassandraDaemon changed between versions. (from org.apache.cassandra.thrift to org.apache.cassandra.service) It is likely that the start scripts are picking up the old version on the classpath, which results in the main method not being found. Do you have CASSANDRA_HOME set? I believe the start scripts will use that. Perhaps you have that set and pointed to the older 1.1.X version? -brian On Wed, Jul 17, 2013 at 8:31 AM, Vivek Mishra mishra.v...@gmail.com wrote: Finally, i have to delete all rpm installed files to get this working, folders are: /usr/share/cassandra /etc/alternatives/cassandra /usr/bin/cassandra /usr/bin/cassandra.in.sh /usr/bin/cassandra-cli Still don't understand why it's giving me such weird error: Error: Main method not found in class org.apache.cassandra.service.CassandraDaemon, please define the main method as: public static void main(String[] args) *** This is not informative at all and does not even Help! -Vivek On Wed, Jul 17, 2013 at 3:49 PM, Vivek Mishra mishra.v...@gmail.comwrote: @aaron Thanks for your reply. I did have a look rpm installed files 1. /etc/alternatives/cassandra, it contains configuration files only. and .sh files are installed within /usr/bin folder. Even if i try to run from extracted tar ball folder as /home/impadmin/apache-cassandra-1.2.4/bin/cassandra -f same error. /home/impadmin/apache-cassandra-1.2.4/bin/cassandra -v gives me 1.1.12 though it should give me 1.2.4 -Vivek it gives me same error. On Wed, Jul 17, 2013 at 3:37 PM, aaron morton aa...@thelastpickle.comwrote: Something is messed up in your install. Can you try scrubbing the install and restarting ? 
Cheers - Aaron Morton Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 17/07/2013, at 6:47 PM, Vivek Mishra mishra.v...@gmail.com wrote: Error: Main method not found in class org.apache.cassandra.service.CassandraDaemon, please define the main method as: public static void main(String[] args) Hi, I am getting this error. Earlier it was working fine for me, when i simply downloaded the tarball installation and ran cassandra server. Recently i did rpm package installation of Cassandra and which is working fine. But somehow when i try to run it via originally extracted tar package. i am getting: * xss = -ea -javaagent:/home/impadmin/software/apache-cassandra-1.2.4//lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1024M -Xmx1024M -Xmn256M -XX:+HeapDumpOnOutOfMemoryError -Xss180k Error: Main method not found in class org.apache.cassandra.service.CassandraDaemon, please define the main method as: public static void main(String[] args) * I tried setting CASSANDRA_HOME directory, but no luck. Error is bit confusing, Any suggestions??? -Vivek -- Brian ONeill Chief Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: Main method not found in class org.apache.cassandra.service.CassandraDaemon
Vivek, You could try echoing the CLASSPATH to double check. Drop an echo into the launch_service function in the cassandra shell script. (~line 121) Let us know the output. -brian --- Brian O'Neill Chief Architect Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Vivek Mishra mishra.v...@gmail.com Reply-To: user@cassandra.apache.org Date: Wednesday, July 17, 2013 10:24 AM To: user@cassandra.apache.org Subject: Re: Main method not found in class org.apache.cassandra.service.CassandraDaemon Hi Brian, Thanks for your response. I think i did change CASSANDRA_HOME to point to new directory. -Vivek On Wed, Jul 17, 2013 at 7:03 PM, Brian O'Neill b...@alumni.brown.edu wrote: Vivek, The location of CassandraDaemon changed between versions. (from org.apache.cassandra.thrift to org.apache.cassandra.service) It is likely that the start scripts are picking up the old version on the classpath, which results in the main method not being found. Do you have CASSANDRA_HOME set? I believe the start scripts will use that. Perhaps you have that set and pointed to the older 1.1.X version?
-brian On Wed, Jul 17, 2013 at 8:31 AM, Vivek Mishra mishra.v...@gmail.com wrote: Finally, I had to delete all the RPM-installed files to get this working. The folders are: /usr/share/cassandra /etc/alternatives/cassandra /usr/bin/cassandra /usr/bin/cassandra.in.sh /usr/bin/cassandra-cli Still don't understand why it's giving me such a weird error: Error: Main method not found in class org.apache.cassandra.service.CassandraDaemon, please define the main method as: public static void main(String[] args) *** This is not informative at all. -Vivek On Wed, Jul 17, 2013 at 3:49 PM, Vivek Mishra mishra.v...@gmail.com wrote: @aaron Thanks for your reply. I did have a look at the RPM-installed files: /etc/alternatives/cassandra contains configuration files only, and the .sh files are installed within the /usr/bin folder. Even if I try to run from the extracted tarball folder as /home/impadmin/apache-cassandra-1.2.4/bin/cassandra -f, I get the same error. And /home/impadmin/apache-cassandra-1.2.4/bin/cassandra -v gives me 1.1.12, though it should give 1.2.4. -Vivek On Wed, Jul 17, 2013 at 3:37 PM, aaron morton aa...@thelastpickle.com wrote: Something is messed up in your install. Can you try scrubbing the install and restarting? Cheers - Aaron Morton Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com
SQL Injection C* (via CQL Thrift)
Mostly for fun, I wanted to throw this out there... We are undergoing a security audit for our platform (C* + Elastic Search + Storm). One component of that audit is susceptibility to SQL injection. I was wondering if anyone has attempted to construct a SQL injection attack against Cassandra? Is it even possible? I know the code paths fairly well, but... Does there exist a path in the code whereby user data gets interpreted, which could be exploited to perform user operations? From the Thrift side of things, I've always felt safe. Data is opaque. Serializers are used to convert it to bytes, and C* doesn't ever really do anything with the data. In examining the CQL java-driver, it looks like there might be a bit more exposure to injection. (or even CQL over Thrift) I haven't dug into the code yet, but depending on which flavor of the API you are using, you may be including user data in your statements. Does anyone know if the CQL java-driver does anything to protect against injection? Or is it possible to say that the syntax is strict enough that any embedded operations in data would not parse? Just some food for thought... I'll be digging into this over the next couple weeks. If people are interested, I can throw a blog post out there with the findings. -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: SQL Injection C* (via CQL Thrift)
Perfect. Thanks Sylvain. That is exactly the input I was looking for, and I agree completely. (it's easy enough to protect against) As for the Thrift side (i.e. using Hector or Astyanax), anyone have a crafty way to inject something? At first glance, it doesn't appear possible, but I'm not 100% confident making that assertion. -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Sylvain Lebresne sylv...@datastax.com Reply-To: user@cassandra.apache.org Date: Tuesday, June 18, 2013 8:51 AM To: user@cassandra.apache.org Subject: Re: SQL Injection C* (via CQL Thrift) If you're not careful, then CQL injection is possible. Say you naively build your query with UPDATE foo SET col=' + user_input + ' WHERE key = 'k'; then if user_input is foo' AND col2='bar, your user will have overwritten a column they shouldn't have been able to. And something equivalent in a BATCH statement could allow overwriting/deleting some random row in some random table. Now, CQL being much more restricted than SQL (no subqueries, no generic transactions, ...), the extent of what you can do with a CQL injection is much smaller than in SQL. But you do have to be careful.
As far as the DataStax java driver is concerned, you can fairly easily protect yourself by using either: 1) prepared statements: if the user input is a prepared variable, there is nothing the user can do (it's equivalent to the Thrift situation). 2) the query builder: it will escape quotes in the strings you provide, thus avoiding injection. So I would say that injections are definitely possible if you concatenate strings too naively, but I don't think preventing them is very hard. -- Sylvain
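The injection Sylvain describes, and the quote-escaping that neutralizes it, can be illustrated with plain string handling, no driver or cluster required. This is a minimal sketch; the class and method names are hypothetical, not the driver's API:

```java
// Illustration of the CQL injection discussed above, using only plain
// string handling. Helper names are hypothetical.
public class CqlEscapeDemo {

    // Naive concatenation: user input flows straight into the statement.
    static String naiveUpdate(String userInput) {
        return "UPDATE foo SET col='" + userInput + "' WHERE key='k'";
    }

    // Doubling single quotes, as CQL string literals require, keeps the
    // whole input inside one literal (the query builder does the equivalent).
    static String escape(String userInput) {
        return userInput.replace("'", "''");
    }

    public static void main(String[] args) {
        String malicious = "foo' AND col2='bar";
        // Unescaped, the statement now also sets col2 -- the injection.
        System.out.println(naiveUpdate(malicious));
        // Escaped, the input stays a single (harmless) string literal.
        System.out.println(naiveUpdate(escape(malicious)));
    }
}
```

Prepared statements sidestep the problem entirely, since the bound value never passes through the CQL parser as text.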
[BLOG] : Cassandra as a Deep Storage Mechanism for Druid Real-Time Analytics Engine
FWIW, we were able to integrate Druid and Cassandra. It's only a PoC right now, but it seems like a powerful combination: http://brianoneill.blogspot.com/2013/05/cassandra-as-deep-storage-mechanism-for.html -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: multitenant support with key spaces
You may want to look at using virtual keyspaces: http://hector-client.github.io/hector/build/html/content/virtual_keyspaces.html And follow these tickets: http://wiki.apache.org/cassandra/MultiTenant -brian On May 6, 2013, at 2:37 AM, Darren Smythe wrote: How many keyspaces can you reasonably have? We have around 500 customers and expect that to double by the end of the year. We're looking into C* and wondering if it makes sense to have a separate KS per customer. If we have 1000 customers, one KS per customer means 1000 keyspaces. Is that something C* can handle efficiently? Each customer has about 10 GB of data (not taking replication into account). Or is this symptomatic of a bad design? I guess the same question applies to our notion of breaking up the column families into time ranges. We're naively trying to avoid having a few large CFs/KSs. Is/should that be a concern? What are the tradeoffs of a smaller number of heavyweight KSs/CFs vs. manually sharding the data into more granular KSs/CFs? Thanks for any info. -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
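The idea behind the virtual keyspaces linked above is that many tenants share one physical keyspace, with every row key transparently prefixed by a tenant id, so the keyspace count stays constant as customers grow. A minimal sketch of the key mapping (class and method names are hypothetical, not Hector's actual API):

```java
// Minimal sketch of the virtual-keyspace idea: tenants share one physical
// keyspace/CF, and row keys carry a tenant prefix. Names are hypothetical.
public class VirtualKeyspace {
    private static final char SEPARATOR = ':';
    private final String tenantId;

    public VirtualKeyspace(String tenantId) {
        this.tenantId = tenantId;
    }

    // Key as stored in the shared physical column family.
    public String physicalKey(String logicalKey) {
        return tenantId + SEPARATOR + logicalKey;
    }

    // Recover the tenant-local key from a stored key.
    public String logicalKey(String physicalKey) {
        return physicalKey.substring(tenantId.length() + 1);
    }
}
```

With this scheme, 1000 customers means 1000 key prefixes rather than 1000 keyspaces, which avoids the per-keyspace memory overhead being asked about.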
Re: Exporting all data within a keyspace
You could always do something like this as well: http://brianoneill.blogspot.com/2012/05/dumping-data-from-cassandra-like.html -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Kumar Ranjan winnerd...@gmail.com Reply-To: user@cassandra.apache.org Date: Tuesday, April 30, 2013 9:11 AM To: user@cassandra.apache.org Subject: Re: Exporting all data within a keyspace Try sstable2json and json2sstable. They work on a column family, so you can fetch all column families, iterate over the list of CFs, and use the sstable2json tool to extract the data. Remember this will only fetch on-disk data; anything in a memtable/cache that has yet to be flushed will be missed. So flush first (nodetool flush), then run the script. On Tuesday, April 30, 2013, Chidambaran Subramanian wrote: Is there any easy way of exporting all data for a keyspace (and conversely, importing it)? Regards Chiddu
Re: Blobs in CQL?
Great! Thanks Gabriel. Do you have an example? (Are you using QueryBuilder?) I couldn't find the part of the API that allowed you to pass in the byte array. -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com On 4/11/13 8:25 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote: Hi Brian, I'm using blobs to store images in Cassandra (1.2.3) using the java-driver version 1.0.0-beta1. There is no need to convert a byte array into hex. Br, Gabi On 4/11/13 3:21 PM, Brian O'Neill wrote: I started playing around with the CQL driver. Has anyone used blobs with it yet? Are you forced to convert a byte[] to hex? (e.g. I have a photo that I want to store in C* using the java-driver API) -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: Blobs in CQL?
Cool. That might be it. I'll take a look at PreparedStatement. For query building, I took a look under the covers, and even when I was passing in a ByteBuffer, it runs through the following code in the java-driver: Utils.java: if (value instanceof ByteBuffer) { sb.append("0x"); sb.append(ByteBufferUtil.bytesToHex((ByteBuffer) value)); } Hopefully, the prepared statement doesn't do the conversion. (I'm not sure if it is a limitation of the CQL protocol itself) thanks again, -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com On 4/11/13 8:34 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote: I'm not using the query builder but the PreparedStatement. Here is the sample code: https://gist.github.com/devsprint/5363023 Gabi On 4/11/13 3:27 PM, Brian O'Neill wrote: Great! Thanks Gabriel. Do you have an example? (Are you using QueryBuilder?) I couldn't find the part of the API that allowed you to pass in the byte array.
-brian
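The hex conversion discussed above (inlining a blob into a CQL string as a 0x... literal) is straightforward to sketch in plain Java. This is a rough equivalent for illustration, not the driver's actual Utils code; the helper name is hypothetical:

```java
// Rough sketch of turning a ByteBuffer into a CQL blob literal ("0x...."),
// the conversion the query builder performs when it inlines a blob.
// Helper name is hypothetical.
import java.nio.ByteBuffer;

public class BlobLiteral {
    static String toCqlLiteral(ByteBuffer value) {
        StringBuilder sb = new StringBuilder("0x");
        ByteBuffer bb = value.duplicate(); // don't disturb the caller's position
        while (bb.hasRemaining()) {
            sb.append(String.format("%02x", bb.get() & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] photoBytes = {(byte) 0xca, (byte) 0xfe};
        System.out.println(toCqlLiteral(ByteBuffer.wrap(photoBytes))); // 0xcafe
    }
}
```

This doubles the payload size on the wire, which is why binding the raw bytes through a prepared statement is preferable for large blobs.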
Re: Blobs in CQL?
Yep, it worked like a charm. (PreparedStatement avoided the hex conversion) But now, I'm seeing a few extra bytes come back in the select… (I'll keep digging, but maybe you have some insight?) I see this: ERROR [2013-04-11 13:05:03,461] com.skookle.dao.RepositoryDao: repository.add() byte.length()=[259804] ERROR [2013-04-11 13:08:08,487] com.skookle.dao.RepositoryDao: repository.get() [foo.jpeg] byte.length()=[259861] (Notice the lengths don't match up.) Using this code: public void addContent(String key, byte[] data) throws NoHostAvailableException { LOG.error("repository.add() byte.length()=[" + data.length + "]"); String statement = "INSERT INTO " + KEYSPACE + "." + TABLE + " (key, data) VALUES (?, ?)"; PreparedStatement ps = session.prepare(statement); BoundStatement bs = ps.bind(key, ByteBuffer.wrap(data)); session.execute(bs); } public byte[] getContent(String key) throws NoHostAvailableException { Query select = select("data").from(KEYSPACE, TABLE).where(eq("key", key)); ResultSet resultSet = session.execute(select); byte[] data = resultSet.one().getBytes("data").array(); LOG.error("repository.get() [" + key + "] byte.length()=[" + data.length + "]"); return data; } --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive • King of Prussia, PA • 19406 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com From: Sylvain Lebresne sylv...@datastax.com Reply-To: user@cassandra.apache.org Date: Thursday, April 11, 2013 8:48 AM To: user@cassandra.apache.org Cc: Gabriel Ciuloaica gciuloa...@gmail.com Subject: Re: Blobs in CQL? > Hopefully, the prepared statement doesn't do the conversion. It does not.
Re: Blobs in CQL?
Sylvain, Interesting, when I look at the actual bytes returned, I see the byte array is prefixed with the keyspace and table name. I assume I'm doing something wrong in the select. Am I incorrectly using the ResultSet? -brian On Thu, Apr 11, 2013 at 9:09 AM, Brian O'Neill b...@alumni.brown.edu wrote: Yep, it worked like a charm. (PreparedStatement avoided the hex conversion) But now, I'm seeing a few extra bytes come back in the select… (I'll keep digging, but maybe you have some insight?)
Re: Blobs in CQL?
Bingo! Thanks to both of you. (the C* community rocks) A few hours' worth of work, and I've got a working REST-based photo repository backed by C* using the CQL java driver. =) rock on, thanks again, -brian On Thu, Apr 11, 2013 at 9:33 AM, Sylvain Lebresne sylv...@datastax.com wrote: I assume I'm doing something wrong in the select. Am I incorrectly using the ResultSet? You're incorrectly using the returned ByteBuffer. But you should not feel bad, that API kinda sucks. The short version is that .array() returns the backing array of the ByteBuffer. But there is no guarantee that you'll have a one-to-one correspondence between the valid content of the ByteBuffer and the backing array; the backing array can be bigger in particular (long story short, this allows multiple ByteBuffers to share the same backing array, which can avoid doing copies). I also note that there is no guarantee that .array() will work unless you've called .hasArray(). Anyway, what you could do is: ByteBuffer bb = resultSet.one().getBytes("data"); byte[] data = new byte[bb.remaining()]; bb.get(data); Alternatively, you can use the result of .array(), but you should only consider the bb.remaining() bytes starting at bb.arrayOffset() + bb.position() (where bb is the returned ByteBuffer). -- Sylvain
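The backing-array pitfall Sylvain describes can be reproduced without any driver at all, using only java.nio. This sketch (class and method names are illustrative) shows why .array() returned the extra bytes, and the remaining()/get() fix from the thread:

```java
// Demonstrates the ByteBuffer pitfall from the thread: .array() returns the
// whole backing array, which can be larger than the buffer's valid contents.
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ByteBufferExtract {

    // Wrong: may include bytes outside the buffer's position..limit window.
    static byte[] viaArray(ByteBuffer bb) {
        return bb.array();
    }

    // Right: copy exactly the remaining() bytes, as suggested on the list.
    static byte[] viaGet(ByteBuffer bb) {
        byte[] data = new byte[bb.remaining()];
        bb.duplicate().get(data); // duplicate() leaves the caller's position untouched
        return data;
    }

    public static void main(String[] args) {
        byte[] backing = {1, 2, 3, 4, 5, 6};
        // A slice sharing the backing array: only {3, 4, 5} is valid content.
        ByteBuffer bb = ByteBuffer.wrap(backing, 2, 3).slice();
        System.out.println(Arrays.toString(viaArray(bb))); // [1, 2, 3, 4, 5, 6] -- too much
        System.out.println(Arrays.toString(viaGet(bb)));   // [3, 4, 5]
    }
}
```

In the driver's case the shared backing array held the whole response frame, which is why the keyspace and table names showed up as the "extra bytes" before the blob.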
Re: Bitmap indexes - reviving CASSANDRA-1472
changing to user@ (at least until we can determine if this can/should be proposed under 1472) For those interested in analytics and set-based queries, see below... -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com On 4/10/13 10:43 PM, Matt Stump mrevilgn...@gmail.com wrote: Druid was our inspiration to layer bitmap indexes on top of Cassandra. Druid doesn't work for us because our data set is too large. We would need many hundreds of nodes just for the pre-processed data. What I envisioned was the ability to perform Druid-style queries (no aggregation) without the limitations imposed by having the entire dataset in memory. I primarily need to query whether a user performed some event, but I also intend to add trigram indexes for LIKE, ILIKE, or possibly regex-style matching. I wasn't aware of CONCISE, thanks for the pointer. We are currently evaluating FastBit, which is a very similar project: https://sdm.lbl.gov/fastbit/ On Wed, Apr 10, 2013 at 5:49 PM, Brian O'Neill b...@alumni.brown.edu wrote: How does this compare with Druid? https://github.com/metamx/druid We're currently evaluating Acunu, Vertica and Druid...
http://brianoneill.blogspot.com/2013/04/bianalytics-on-big-datacassandra.html With its bitmapped indexes, Druid appears to have the most potential. They boast some pretty impressive stats, especially WRT handling real-time updates and adding new dimensions. They also use a compression algorithm, CONCISE, to cut down on the space requirements. http://ricerca.mat.uniroma3.it/users/colanton/concise.html I haven't looked too deep into the Druid code, but I've been meaning to see if it could be backed by C*. We'd be game to join the hunt if you pursue such a beast. (with your code, or with portions of Druid) -brian On Apr 10, 2013, at 5:40 PM, mrevilgnome wrote: What do you think about set manipulation via indexes in Cassandra? I'm interested in answering queries such as "give me all users that performed event 1, 2, and 3, but not 4". If the answer is yes then I can make a case for spending my time on C*. The only downside for us would be that our current prototype is in C++, so we would lose some performance and the ability to dedicate an entire machine to caching/performing queries. On Wed, Apr 10, 2013 at 11:57 AM, Jonathan Ellis jbel...@gmail.com wrote: If you mean, "Can someone help me figure out how to get started updating these old patches to trunk and cleaning out the Avro?" then yes, I've been knee-deep in indexing code recently. On Wed, Apr 10, 2013 at 11:34 AM, mrevilgnome mrevilgn...@gmail.com wrote: I'm currently building a distributed cluster on top of cassandra to perform fast set manipulation via bitmap indexes. This gives me the ability to perform unions, intersections, and set subtraction across sub-queries. Currently I'm storing index information for thousands of dimensions as cassandra rows, and my cluster keeps this information cached, distributed and replicated in order to answer queries. Every couple of days I think to myself this should really exist in C*. Given all the benefits, would there be any interest in reviving CASSANDRA-1472? 
Some downsides are that this is very memory intensive, even for sparse bitmaps. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder, http://www.datastax.com @spyced -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
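The event-set queries discussed in this thread ("all users that performed events 1, 2, and 3, but not 4") reduce to bitwise operations over per-event bitmaps, one bit per user. A minimal, self-contained sketch using java.util.BitSet (the user and event IDs are made up for illustration; a real index would use a compressed representation like CONCISE rather than uncompressed bitsets):

```java
import java.util.BitSet;

public class EventBitmapDemo {

    // Build a bitmap with the given (hypothetical) user ids set.
    static BitSet bitmap(int... userIds) {
        BitSet b = new BitSet();
        for (int id : userIds) b.set(id);
        return b;
    }

    public static void main(String[] args) {
        // One bitmap per event: bit i set means user i performed the event.
        BitSet event1 = bitmap(0, 1, 2, 5);
        BitSet event2 = bitmap(0, 2, 3, 5);
        BitSet event3 = bitmap(0, 2, 5, 7);
        BitSet event4 = bitmap(5);

        // Users who performed events 1, 2, AND 3...
        BitSet result = (BitSet) event1.clone();
        result.and(event2);    // intersection
        result.and(event3);    // intersection
        result.andNot(event4); // ...but NOT event 4 (set subtraction)

        System.out.println(result); // prints {0, 2}
    }
}
```

The same three operators (and, or/union, andNot) cover the sub-query algebra described above; the memory-intensity downside Jonathan mentions is exactly why compressed schemes like CONCISE matter.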
BI/Analytics/Warehousing for data in C*
We are trudging through an options analysis for BI/DW solutions for data stored in C*. I'd love to hear people's experiences. Here is what we've found so far: http://brianoneill.blogspot.com/2013/04/bianalytics-on-big-datacassandra.html Maybe we just use Intravert with a custom handler to handle the dimensional cubes? https://github.com/zznate/intravert-ug Then, we could slap a javascript charting framework on it and call it cubert. =) http://www.classicgamesarcade.com/game/21652/q*bert.html -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: any other NYC* attendees find your usb stick of the proceedings empty?
I think the recorded sessions will be posted to the PlanetCassandra Youtube channel: http://www.planetcassandra.org/blog/post/nyc-big-data-tech-day-update Some of the slides have been posted up to slideshare: http://www.slideshare.net/boneill42/hms-nyc-talk http://www.slideshare.net/edwardcapriolo/intravert -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Brian Tarbox tar...@cabotresearch.com Reply-To: user@cassandra.apache.org Date: Monday, March 25, 2013 11:43 AM To: user@cassandra.apache.org Subject: any other NYC* attendees find your usb stick of the proceedings empty? Last week I attended DataStax's NYC* conference and one of the give-aways was a wooden USB stick. Finally getting around to loading it I find it empty. Anyone else have this problem? Are the conference presentations available somewhere else? Brian Tarbox
Re: Netflix/Astyanax Client for Cassandra
Incidentally, we run Astyanax against 1.2.1. We haven't had any issues. When running against 1.2.0, we ran into this: https://github.com/Netflix/astyanax/issues/191 -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com On 2/7/13 6:58 AM, Peter Lin wool...@gmail.com wrote: if i'm not mistaken, isn't this due to limitations of thrift versus binary protocol? That's my understanding from datastax blogs. unless someone really needs all the features of 1.2 like asynchronous queries, astyanax and hector should work fine. On Thu, Feb 7, 2013 at 1:20 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote: Astyanax is not working with Cassandra 1.2.1. Only java-driver is working very well with both Cassandra 1.2 and 1.2.1. Cheers, Gabi On 2/7/13 8:16 AM, Michael Kjellman wrote: It's a really great library and definitely recommended by me and many who are reading this. And if you are just starting out on 1.2.1 with C* you might also want to evaluate https://github.com/datastax/java-driver and the new binary protocol. 
Best, michael From: Cassa L lcas...@gmail.com Reply-To: user@cassandra.apache.org user@cassandra.apache.org Date: Wednesday, February 6, 2013 10:13 PM To: user@cassandra.apache.org user@cassandra.apache.org Subject: Netflix/Astyanax Client for Cassandra Hi, Has anyone used the Netflix/Astyanax java client library for Cassandra? I have used Hector before and would like to evaluate Astyanax. Not sure how it is accepted in the Cassandra community. Any issues with it, or advantages? The API looks very clean and simple compared to Hector. Has anyone used it in production except Netflix themselves? Thanks LCassa
Re: Accessing Metadata of Column Families
Through CQL, you see the logical schema. Through the CLI, you see the physical schema. This may help: http://www.datastax.com/dev/blog/cql3-for-cassandra-experts -brian On Mon, Jan 28, 2013 at 7:26 AM, Rishabh Agrawal rishabh.agra...@impetus.co.in wrote: I found the following issues while working on Cassandra version 1.2, CQL 3 and Thrift protocol 19.35.0. Case 1: Using CQL I created a table t1 with columns col1 and col2, with col1 being my primary key. When I access the same data using the CLI, I see col1 gets adopted as the row key and col2 as another column. Now I have inserted a value in another column (col3) in the same row using the CLI. When I query the same table again from CQL, I am unable to find col3. Case 2: Using the CLI, I have created table t2. Now I added a row key row1 and two columns (keys) col1 and col2 with some values in each. When I access t2 from CQL I find the following result set with three columns:
key  | column1 | value
row1 | col1    | val1
row1 | col2    | val2
This behavior raises certain questions: · What is the reason for such a schema anomaly, or is this a problem? · Which schema should be deemed correct or consistent? · How do I access metadata on the same? Thanks and Regards Rishabh Agrawal From: Harshvardhan Ojha [mailto:harshvardhan.o...@makemytrip.com] Sent: Monday, January 28, 2013 12:57 PM To: user@cassandra.apache.org Subject: RE: Accessing Metadata of Column Families You can get storage attributes from the /data/system/ keyspace. From: Rishabh Agrawal [mailto:rishabh.agra...@impetus.co.in] Sent: Monday, January 28, 2013 12:42 PM To: user@cassandra.apache.org Subject: RE: Accessing Metadata of Column Families Thanks for the reply. I do not want to go the API route. I wish to access the files and column families which store the metadata information. From: Harshvardhan Ojha [mailto:harshvardhan.o...@makemytrip.com] Sent: Monday, January 28, 2013 12:25 PM To: user@cassandra.apache.org Subject: RE: Accessing Metadata of Column Families Which API are you using? 
If you are using Hector, use ColumnFamilyDefinition. Regards Harshvardhan Ojha From: Rishabh Agrawal [mailto:rishabh.agra...@impetus.co.in] Sent: Monday, January 28, 2013 12:16 PM To: user@cassandra.apache.org Subject: Accessing Metadata of Column Families Hello, I wish to access metadata information on column families. How can I do it? Any ideas? Thanks and Regards Rishabh Agrawal NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference. The contents of this email, including the attachments, are PRIVILEGED AND CONFIDENTIAL to the intended recipient at the email address to which it has been addressed. If you receive it in error, please notify the sender immediately by return email and then permanently delete it from your system. The unauthorized use, distribution, copying or alteration of this email, including the attachments, is strictly forbidden. Please note that neither MakeMyTrip nor the sender accepts any responsibility for viruses and it is your responsibility to scan the email and attachments (if any). No contracts may be concluded on behalf of MakeMyTrip by means of email communications.
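The schema anomaly in the thread above follows directly from that logical/physical split: CQL3 only surfaces columns declared in its schema, while the CLI shows the raw physical rows. A rough illustration using the poster's tables (the types are hypothetical; the original post doesn't give them):

```sql
-- CQL3 (logical view): only declared columns exist.
CREATE TABLE t1 (
    col1 text PRIMARY KEY,
    col2 text
);
-- Physically, col1 becomes the row key and col2 a column under it.
-- A col3 inserted into the same row via the CLI is not part of the
-- declared CQL3 schema, so SELECT * FROM t1 will not show it (Case 1).
--
-- Conversely, a family created via the CLI has no CQL3 schema, so CQL
-- presents it as the generic (key, column1, value) transposition seen
-- in Case 2.
```

Neither view is "wrong"; they are two projections of the same storage, which is the point of the linked cql3-for-cassandra-experts post.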
Re: cql: show tables in a keyspace
cqlsh> use cirrus; cqlsh:cirrus> describe tables; For more info: cqlsh> help describe -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com On 1/28/13 2:27 PM, Paul van Hoven paul.van.ho...@googlemail.com wrote: Is there some way in cql to get a list of all tables or column families that belong to a keyspace, like show tables in sql?
Webinar: Using Storm for Distributed Processing on Cassandra
Just an FYI -- We will be hosting a webinar tomorrow demonstrating the use of Storm as a distributed processing layer on top of Cassandra. I'll be tag teaming with Taylor Goetz, the original author of storm-cassandra. http://www.datastax.com/resources/webinars/collegecredit It is part of the C*ollege Credit Webinar Series from Datastax. All are welcome. -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: Cassandra 1.2 Thrift and CQL 3 issue
I reported the issue here. You may be missing a component in your column name. https://issues.apache.org/jira/browse/CASSANDRA-5138 -brian On Jan 12, 2013, at 12:48 PM, Shahryar Sedghi wrote: Hi I am trying to test my application that runs with JDBC, CQL 3 with Cassandra 1.2. After getting many weird errors and downgrading from JDBC to thrift, I realized that thrift on Cassandra 1.2 has issues with wide rows. If I define the table as: CREATE TABLE test(interval int, id text, body text, PRIMARY KEY (interval, id)); then select interval, id, body from test; fails with:
ERROR [Thrift:16] 2013-01-11 18:23:35,997 CustomTThreadPoolServer.java (line 217) Error occurred during processing of message.
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 1
    at org.apache.cassandra.config.CFMetaData.getColumnDefinitionFromColumnName(CFMetaData.java:923)
    at org.apache.cassandra.cql.QueryProcessor.processStatement(QueryProcessor.java:502)
    at org.apache.cassandra.cql.QueryProcessor.process(QueryProcessor.java:789)
    at org.apache.cassandra.thrift.CassandraServer.execute_cql_query(CassandraServer.java:1652)
    at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql_query.getResult(Cassandra.java:4048)
    at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql_query.getResult(Cassandra.java:4036)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
    at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1121)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:614)
    at java.lang.Thread.run(Thread.java:780)
The same code works well with Cassandra 1.1. At the same time, if I define the table as: CREATE TABLE test1(interval int, id text, body text, PRIMARY KEY (interval)); everything works fine. 
I am using DataStax Community 1.2 apache-cassandra-clientutil-1.2.0.jar apache-cassandra-thrift-1.2.0.jar libthrift-0.7.0.jar Apparently client.set_cql_version("3.0.0"); has no effect either. Is there a setting that I'm missing on the client side to dictate CQL 3, or is it a bug? Thanks in advance Shahryar -- Life is what happens while you are making other plans. ~ John Lennon -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
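One detail worth noting on the symptom above: the stack trace goes through `execute_cql_query`, which is the CQL2 path. Cassandra 1.2's Thrift interface added a separate `execute_cql3_query` call for CQL3 tables with compound primary keys. A hedged sketch of using it (host, port, keyspace, and query are illustrative, and this requires a running 1.2 node plus the Cassandra Thrift jars, so treat it as an untested example):

```java
import java.nio.ByteBuffer;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Compression;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.CqlResult;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class Cql3ThriftSketch {
    public static void main(String[] args) throws Exception {
        TTransport tr = new TFramedTransport(new TSocket("localhost", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
        tr.open();
        client.set_keyspace("myks"); // illustrative keyspace name

        // execute_cql3_query (new in 1.2) takes a consistency level and
        // understands CQL3 schemas, including compound primary keys.
        String cql = "SELECT interval, id, body FROM test";
        CqlResult result = client.execute_cql3_query(
                ByteBuffer.wrap(cql.getBytes("UTF-8")),
                Compression.NONE,
                ConsistencyLevel.ONE);
        System.out.println(result.getRowsSize() + " rows");
        tr.close();
    }
}
```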
Re: Astyanax
Not sure where you are on the learning curve, but I've put a couple getting-started projects out on github: https://github.com/boneill42/astyanax-quickstart And the latest from the webinar is here: https://github.com/boneill42/naughty-or-nice http://brianoneill.blogspot.com/2013/01/creating-your-frist-java-application-w.html -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Radek Gruchalski radek.gruchal...@portico.io Reply-To: user@cassandra.apache.org Date: Tuesday, January 8, 2013 10:17 AM To: user@cassandra.apache.org user@cassandra.apache.org Cc: user@cassandra.apache.org user@cassandra.apache.org Subject: Re: Astyanax Hi, We are using astyanax and we found that the github wiki together with stackoverflow is the most comprehensive set of documentation. Do you have any specific questions? Kind regards, Radek Gruchalski On 8 Jan 2013, at 15:46, Everton Lima peitin.inu...@gmail.com wrote: I was studying from there, but I would like to know if anyone knows other sources. 2013/1/8 Markus Klems markuskl...@gmail.com The wiki? 
https://github.com/Netflix/astyanax/wiki On Tue, Jan 8, 2013 at 2:44 PM, Everton Lima peitin.inu...@gmail.com wrote: Hi, Someone has or could indicate some good tutorial or book to learn Astyanax? Thanks -- Everton Lima Aleixo Mestrando em Ciência da Computação pela UFG Programador no LUPA -- Everton Lima Aleixo Bacharel em Ciência da Computação pela UFG Mestrando em Ciência da Computação pela UFG Programador no LUPA
Re: Best Java Driver for Cassandra?
Well, we'll talk a bit about this in my webinar later today: http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html I put together a quick decision matrix for all of the options based on production-readiness, potential and momentum. I think the slides will be made available afterwards. I also have a laundry list here (written before I knew about Firebrand): http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com On 12/13/12 9:03 AM, stephen.m.thomp...@wellsfargo.com wrote: There seem to be a number of good options listed ... FireBrand and Hector seem to have the most attractive sites, but that doesn't necessarily mean anything. :) Can anybody make a case for one of the drivers over another, especially in terms of which ones seem to be most used in major implementations? Thanks Steve
Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra
FWIW -- I'm presenting tomorrow for the Datastax C*ollege Credit Webinar Series: http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html I hope to make CQL part of the presentation and show how it integrates with the Java APIs. If you are interested, drop in. -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: Datatype Conversion in CQL-Client?
I don't think Michael and/or Jonathan have published the CQL java driver yet. (CCing them) Hopefully they'll find a public home for it soon; I hope to include it in the Webinar in December. (http://www.datastax.com/resources/webinars/collegecredit) -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Tommi Laukkanen tlaukka...@gmail.com Reply-To: user@cassandra.apache.org Date: Monday, November 19, 2012 2:36 AM To: user@cassandra.apache.org Subject: Re: Datatype Conversion in CQL-Client? I think Timmy might be referring to the upcoming native CQL Java driver that might be coming with 1.2 - It was mentioned here: http://www.datastax.com/wp-content/uploads/2012/08/7_Datastax_Upcoming_Changes_in_Drivers.pdf I would also be interested in testing that, but I can't find it in the repositories. Any hints? Regards, Tommi L. From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill Sent: 18 November 2012 17:47 To: user@cassandra.apache.org Subject: Re: Datatype Conversion in CQL-Client? 
Importance: Low If you are talking about the CQL-client that comes with Cassandra (cqlsh), it is actually written in Python: https://github.com/apache/cassandra/blob/trunk/bin/cqlsh For information on datatypes (and conversion) take a look at the CQL definition: http://www.datastax.com/docs/1.0/references/cql/index (Look at the CQL Data Types section) If that's not the client you are referencing, let us know which one you mean: http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html -brian On Nov 17, 2012, at 9:54 PM, Timmy Turner wrote: Thanks for the links, however I'm interested in the functionality that the official Cassandra client/API (which is in Java) offers. 2012/11/17 aaron morton aa...@thelastpickle.com Does the official/built-in Cassandra CQL client (in 1.2) What language ? Check the Java http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/ and python http://code.google.com/a/apache-extras.org/p/cassandra-dbapi2/ drivers. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com http://www.thelastpickle.com/ On 16/11/2012, at 11:21 AM, Timmy Turner timm.t...@gmail.com wrote: Does the official/built-in Cassandra CQL client (in 1.2) offer any built-in option to get direct values/objects when reading a field, instead of just a byte array? -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com http://healthmarketscience.com/ ) mobile:215.588.6024 tel:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: Datatype Conversion in CQL-Client?
Gotcha Timmy. That is the Thrift API. You are operating at a pretty low level. I'm not sure that is considered the official CQL client. IMHO, you might be better off moving up a level. I'd probably either wait for the official CQL Java Driver, or access CQL via a higher-level client like Hector. If you stick with Thrift, I think you can access the schema metadata: https://github.com/apache/cassandra/blob/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/CqlMetadata.java (Those are the generated classes for the Thrift interface) But I'm not sure where the code is to apply that metadata to the result set in: https://github.com/apache/cassandra/blob/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/CqlResult.java -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Timmy Turner timm.t...@gmail.com Reply-To: user@cassandra.apache.org Date: Monday, November 19, 2012 9:48 AM To: user@cassandra.apache.org Subject: Re: Datatype Conversion in CQL-Client? 
What I meant was the API that the Cassandra jars give you when you include them in your project:
TTransport tr = new TFramedTransport(new TSocket("localhost", 9160));
TProtocol proto = new TBinaryProtocol(tr);
Cassandra.Client client = new Cassandra.Client(proto);
tr.open();
client.execute_cql_query(ByteBuffer.wrap(cql.getBytes()), Compression.NONE);
Re: Datastax Java Driver
Woohoo! Thanks for making this available. --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Sylvain Lebresne sylv...@datastax.com Reply-To: user@cassandra.apache.org Date: Monday, November 19, 2012 1:50 PM To: user@cassandra.apache.org user@cassandra.apache.org Subject: Datastax Java Driver Everyone, We've just open-sourced a new Java driver we have been working on here at DataStax. This driver is CQL3 only and is built to use the new binary protocol that will be introduced with Cassandra 1.2. It will thus only work with Cassandra 1.2 onwards. Currently, that means testing it requires 1.2.0-beta2. This is also alpha software at this point. You are welcome to try and play with it and we would very much welcome feedback, but be sure that break, it will. The driver is accessible at: http://github.com/datastax/java-driver Today we're open-sourcing the core part of this driver. The main goal of this core module is to handle connections to the Cassandra cluster with all the features that one would expect. The currently supported features are: - Asynchronous: the driver uses the new CQL binary protocol's asynchronous capabilities. - Node discovery. - Configurable load balancing/routing. 
- Transparent fail-over. - C* tracing handling. - Convenient schema access. - Configurable retry policy. This core module provides a simple low-level API (that works directly with query strings). We plan to release a higher-level, thin object mapping API based on top of this core shortly. Please refer to the project README for more information. -- The DataStax Team
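For orientation, the core module's low-level, query-string API announced above looks roughly like this. A hedged sketch only: the contact point, keyspace, and query are illustrative, the driver was alpha at the time and the API subject to change, and running it requires the driver jar plus a live 1.2 cluster:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class DriverSketch {
    public static void main(String[] args) {
        // Connect to one node; node discovery finds the rest of the cluster.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .build();
        Session session = cluster.connect("myks"); // illustrative keyspace

        // The core API works directly with query strings.
        ResultSet rs = session.execute("SELECT id, body FROM test");
        for (Row row : rs) {
            System.out.println(row.getString("id"));
        }
        cluster.shutdown();
    }
}
```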
Re: Datatype Conversion in CQL-Client?
Hector does, but the newer clients/drivers no longer use Thrift. (Thrift is the legacy protocol) If you are still in the early stages and you know you want your primary interface to be CQL, you may want to look at the java driver that DataStax just released: http://github.com/datastax/java-driver -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Timmy Turner timm.t...@gmail.com Reply-To: user@cassandra.apache.org Date: Monday, November 19, 2012 3:37 PM To: user@cassandra.apache.org Subject: Re: Datatype Conversion in CQL-Client? Do these other clients use the thrift API internally? 2012/11/19 John Sanda john.sa...@gmail.com You might want to take a look at org.apache.cassandra.transport.SimpleClient and org.apache.cassandra.transport.messages.ResultMessage. 
On Mon, Nov 19, 2012 at 9:48 AM, Timmy Turner timm.t...@gmail.com wrote: What I meant was the method that the Cassandra-jars give you when you include them in your project: TTransport tr = new TFramedTransport(new TSocket("localhost", 9160)); TProtocol proto = new TBinaryProtocol(tr); Cassandra.Client client = new Cassandra.Client(proto); tr.open(); client.execute_cql_query(ByteBuffer.wrap(cql.getBytes()), Compression.NONE); 2012/11/19 Brian O'Neill b...@alumni.brown.edu I don't think Michael and/or Jonathan have published the CQL java driver yet. (CCing them) Hopefully they'll find a public home for it soon, I hope to include it in the Webinar in December. (http://www.datastax.com/resources/webinars/collegecredit) -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com From: Tommi Laukkanen tlaukka...@gmail.com Reply-To: user@cassandra.apache.org Date: Monday, November 19, 2012 2:36 AM To: user@cassandra.apache.org Subject: Re: Datatype Conversion in CQL-Client? 
I think Timmy might be referring to the upcoming native CQL Java driver that might be coming with 1.2 - It was mentioned here: http://www.datastax.com/wp-content/uploads/2012/08/7_Datastax_Upcoming_Changes_in_Drivers.pdf I would also be interested in testing that but I can't find it in the repositories. Any hints? Regards, Tommi L. From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill Sent: 18 November 2012 17:47 To: user@cassandra.apache.org Subject: Re: Datatype Conversion in CQL-Client? Importance: Low If you are talking about the CQL-client that comes with Cassandra (cqlsh), it is actually written in Python: https://github.com/apache/cassandra/blob/trunk/bin/cqlsh For information on datatypes (and conversion) take a look at the CQL definition: http://www.datastax.com/docs/1.0/references/cql/index (Look at the CQL Data Types section) If that's not the client you are referencing, let us know which one you mean: http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html -brian On Nov 17, 2012, at 9:54 PM, Timmy Turner wrote: Thanks for the links, however I'm interested in the functionality that the official Cassandra client/API (which is in Java) offers. 2012/11/17 aaron morton aa...@thelastpickle.com Does the official/built-in Cassandra CQL client (in 1.2) What language? Check the Java http://code.google.com/a/apache-extras.org/p/cassandra-jdbc
Re: Datatype Conversion in CQL-Client?
If you are talking about the CQL-client that comes with Cassandra (cqlsh), it is actually written in Python: https://github.com/apache/cassandra/blob/trunk/bin/cqlsh For information on datatypes (and conversion) take a look at the CQL definition: http://www.datastax.com/docs/1.0/references/cql/index (Look at the CQL Data Types section) If that's not the client you are referencing, let us know which one you mean: http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html -brian On Nov 17, 2012, at 9:54 PM, Timmy Turner wrote: Thanks for the links, however I'm interested in the functionality that the official Cassandra client/API (which is in Java) offers. 2012/11/17 aaron morton aa...@thelastpickle.com Does the official/built-in Cassandra CQL client (in 1.2) What language ? Check the Java http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/ and python http://code.google.com/a/apache-extras.org/p/cassandra-dbapi2/ drivers. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 16/11/2012, at 11:21 AM, Timmy Turner timm.t...@gmail.com wrote: Does the official/built-in Cassandra CQL client (in 1.2) offer any built-in option to get direct values/objects when reading a field, instead of just a byte array? -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
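Timmy's underlying question (getting typed values instead of byte arrays) comes down to how CQL types are wire-encoded. As a hedged illustration only (not code from any of the drivers above), here is how a client could decode a few common Cassandra value encodings, assuming the standard big-endian layouts: `int` is a 4-byte signed big-endian integer, `bigint` is 8 bytes, and `varchar`/`text` is UTF-8:

```python
import struct

# Minimal sketch of CQL value decoding; the real drivers
# (cassandra-jdbc, cassandra-dbapi2) do this conversion for you.
def decode_int(raw: bytes) -> int:
    return struct.unpack('>i', raw)[0]   # 4-byte signed big-endian

def decode_bigint(raw: bytes) -> int:
    return struct.unpack('>q', raw)[0]   # 8-byte signed big-endian

def decode_text(raw: bytes) -> str:
    return raw.decode('utf-8')           # varchar/text is UTF-8

print(decode_int(b'\x00\x00\x00\x2a'))   # 42
print(decode_text(b'cassandra'))         # cassandra
```

A driver's "conversion" layer is essentially a table mapping each column's declared CQL type to one of these decode functions.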
Re: [BETA RELEASE] Apache Cassandra 1.2.0-beta2 released
Wow...good catch. We had puppet scripts which automatically assigned the proper tokens given the cluster size. What is the range now? Got a link? -brian On Nov 10, 2012, at 9:27 PM, Edward Capriolo wrote: just a note for all. The default partitioner is no longer RandomPartitioner. It is now Murmur3, and the token range starts in negative numbers. So you don't choose tokens like your father taught you anymore. On Friday, November 9, 2012, Sylvain Lebresne sylv...@datastax.com wrote: The Cassandra team is pleased to announce the release of the second beta for the future Apache Cassandra 1.2.0. Let me first stress that this is beta software and as such is *not* ready for production use. This release is still beta so is likely not bug free. However, lots have been fixed since beta1 and if everything goes right, we are hopeful that a first release candidate may follow shortly. Please do help testing this beta to help make that happen. If you encounter any problem during your testing, please report[3,4] them. And be sure to take a look at the change log[1] and the release notes[2] to see where Cassandra 1.2 differs from the previous series. Apache Cassandra 1.2.0-beta2[5] is available as usual from the cassandra website (http://cassandra.apache.org/download/) and a debian package is available using the 12x branch (see http://wiki.apache.org/cassandra/DebianPackaging). Thank you for your help in testing and have fun with it. [1]: http://goo.gl/wnDAV (CHANGES.txt) [2]: http://goo.gl/CBsqs (NEWS.txt) [3]: https://issues.apache.org/jira/browse/CASSANDRA [4]: user@cassandra.apache.org [5]: http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/cassandra-1.2.0-beta2 -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
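To answer the range question inline: Murmur3Partitioner tokens span the full signed 64-bit range, -2^63 to 2^63 - 1 (versus RandomPartitioner's 0 to 2^127 - 1). The arithmetic a token-assignment script like those puppet scripts would need is simple; this is an illustrative sketch, not an official tool:

```python
# Evenly spaced initial tokens for Murmur3Partitioner, whose token
# space is the signed 64-bit range [-2**63, 2**63 - 1].
def murmur3_initial_tokens(node_count: int) -> list[int]:
    return [(2**64 // node_count) * i - 2**63 for i in range(node_count)]

print(murmur3_initial_tokens(4))
# [-9223372036854775808, -4611686018427387904, 0, 4611686018427387904]
```

With RandomPartitioner the same formula applied `i * (2**127 // node_count)` over the unsigned 127-bit space; only the range changed, not the idea.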
Indexing Data in Cassandra with Elastic Search
For those looking to index data in Cassandra with Elastic Search, here is what we decided to do: http://brianoneill.blogspot.com/2012/11/big-data-quadfecta-cassandra-storm.html -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: logging servers? any interesting in one for cassandra?
Thanks Dean. We'll definitely take a look. (probably in January) -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive • King of Prussia, PA • 19406 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com On 11/6/12 11:19 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Sure, in our playing around, we have an awesome logback configuration for development time only that shows warning, severe in red in eclipse and lets you click on every single log taking you right to the code that logged it…(thought you might enjoy it)... https://github.com/deanhiller/playorm/blob/master/input/javasrc/logback.xml The java appender is here (called CassandraAppender) https://github.com/deanhiller/playorm/tree/master/input/javasrc/com/alvazan/play/logging The AsyncAppender there is different than logback's in that it allows bursting, but once it reaches the limit it essentially becomes synchronous again, which allows us to not drop logs like logback does while still allowing bursts of performance. The CircularBufferAppender is an in-memory buffer that flushes all logs of level X and above to the child appender when a warning or severe happens, where X is configurable. We have only tested out the CassandraAppender at this point. 
Right now you have to call CassandraAppender.setFactory to set the NoSqlEntityManager factory. It creates LogEvent rows as well as an index on the session, and partitions by the first two characters of the web session id so there is an index per partition. This allows us to look at a single web session of a user. The only thing I don't like is we have to do a read when updating the index to be able to delete old values in the index (ick), but I couldn't figure any other way around that. Also, if you have high event rates, there is an MDCLevelFilter so you can tag the MDC with something like user=__program__ and ignore all logs for him unless they are warning logs, which we use to keep the logs from just being huge. Later, Dean On 11/6/12 6:32 AM, Brian O'Neill b...@alumni.brown.edu wrote: Nice Dean… I'm not so sure we would run the server, but we'd definitely be interested in the logback adaptor. (We would then just access the data via Virgil (over REST), with a thin javascript UI) Let me/us know if you end up putting it out there. We intend to centralize logging sometime over the next few months. -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive • King of Prussia, PA • 19406 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com 
On 11/1/12 10:33 AM, Hiller, Dean dean.hil...@nrel.gov wrote: 2 questions 1. What are people using for logging servers for their web tier logging? 2. Would anyone be interested in a new logging server(any programming language) for web tier to log to your existing cassandra(it uses up disk space in proportion to number of web servers and just has a rolling window of logs along with a window of threshold dumps)? Context for second question: I like less systems since it is less maintenance/operations cost and so yesterday I quickly wrote up some log back appenders which support (SLF4J/log4j/jdk/commons libraries) and send the logs from our client tier into cassandra. It is simply a rolling window of logs so the space used in cassandra is proportional to the amount of web servers I have(currently, I have 4 web servers). I am also thinking about adding warning type logging such that on warning, the last N logs info and above are flushed along with the warning so basically two rolling windows. Then in the GUI, it simply shows
Re: logging servers? any interesting in one for cassandra?
Nice Dean, I'm not so sure we would run the server, but we'd definitely be interested in the logback adaptor. (We would then just access the data via Virgil (over REST), with a thin javascript UI) Let me/us know if you end up putting it out there. We intend to centralize logging sometime over the next few months. -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com On 11/1/12 10:33 AM, Hiller, Dean dean.hil...@nrel.gov wrote: 2 questions 1. What are people using for logging servers for their web tier logging? 2. Would anyone be interested in a new logging server (any programming language) for the web tier to log to your existing cassandra (it uses up disk space in proportion to the number of web servers and just has a rolling window of logs along with a window of threshold dumps)? Context for second question: I like fewer systems since it is less maintenance/operations cost, so yesterday I quickly wrote up some logback appenders which support (SLF4J/log4j/jdk/commons libraries) and send the logs from our client tier into cassandra. It is simply a rolling window of logs, so the space used in cassandra is proportional to the number of web servers I have (currently, I have 4 web servers). 
I am also thinking about adding warning-type logging such that on warning, the last N logs of info and above are flushed along with the warning, so basically two rolling windows. Then in the GUI, it simply shows the logs and if you click on a session, it switches to a view with all the logs for that session (no matter which server, since in our cluster the session switches servers on every request since we are stateless; our session id is in the cookie). Well, let me know if anyone is interested and would actually use such a thing and if so, we might create a server around it. Thanks, Dean
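Dean's two-rolling-windows idea (buffer recent INFO records in memory, flush them only when a WARN or worse arrives so the warning carries its context) can be sketched with the stdlib `logging` module. This handler and its sizes are hypothetical illustrations, not the PlayOrm CassandraAppender:

```python
import logging
from collections import deque

class WarnFlushHandler(logging.Handler):
    """Keep the last N records in a ring buffer; on WARNING or above,
    flush the buffered context plus the triggering record to a target
    handler (in Dean's setup, one that writes to Cassandra)."""
    def __init__(self, target: logging.Handler, capacity: int = 100):
        super().__init__()
        self.target = target
        self.buffer = deque(maxlen=capacity)   # rolling window

    def emit(self, record: logging.LogRecord) -> None:
        if record.levelno >= logging.WARNING:
            for buffered in self.buffer:       # flush the context first
                self.target.emit(buffered)
            self.buffer.clear()
            self.target.emit(record)           # then the warning itself
        else:
            self.buffer.append(record)

log = logging.getLogger("demo")
log.setLevel(logging.INFO)
log.addHandler(WarnFlushHandler(logging.StreamHandler()))
log.info("only buffered")   # nothing written until a warning arrives
log.warning("flushes the buffered info line plus itself")
```

The `deque(maxlen=...)` gives the bounded-space property Dean describes: disk (or memory) use stays proportional to the window size, not the log volume.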
Keeping the record straight for Cassandra Benchmarks...
People probably saw... http://www.networkworld.com/cgi-bin/mailto/x.cgi?pagetosend=/news/tech/2012/102212-nosql-263595.html To clarify things take a look at... http://brianoneill.blogspot.com/2012/10/solid-nosql-benchmarks-from-ycsb-w-side.html -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: Using compound primary key
Hey Vivek, The same thing happened to me the other day. You may be missing a component in your compound key. See this thread: http://mail-archives.apache.org/mod_mbox/cassandra-dev/201210.mbox/%3ccajhhpg20rrcajqjdnf8sf7wnhblo6j+aofksgbxyxwcoocg...@mail.gmail.com%3E I also wrote a couple blogs on it: http://brianoneill.blogspot.com/2012/09/composite-keys-connecting-dots-between.html http://brianoneill.blogspot.com/2012/10/cql-astyanax-and-compoundcomposite-keys.html They've fixed this in the 1.2 beta, whereby it checks (at the thrift layer) to ensure you have the requisite number of components in the compound/composite key. -brian On Oct 8, 2012, at 10:32 PM, Vivek Mishra wrote: Certainly. As these are available with cql3 only! Example mentioned on datastax website is working fine, only difference is i tried with a compound primary key with 3 composite columns in place of 2 -Vivek On Tue, Oct 9, 2012 at 7:57 AM, Arindam Barua aba...@247-inc.com wrote: Did you use the “--cql3” option with the cqlsh command? From: Vivek Mishra [mailto:mishra.v...@gmail.com] Sent: Monday, October 08, 2012 7:22 PM To: user@cassandra.apache.org Subject: Using compound primary key Hi, I am trying to use compound primary key column name and i am referring to: http://www.datastax.com/dev/blog/whats-new-in-cql-3-0 As mentioned on this example, i tried to create a column family containing compound primary key (one or more) as: CREATE TABLE altercations ( instigator text, started_at timestamp, ships_destroyed int, energy_used float, alliance_involvement boolean, PRIMARY KEY (instigator,started_at,ships_destroyed) ); And i am getting: ** TSocket read 0 bytes cqlsh:testcomp ** Then followed by insert and select statements giving me following errors: cqlsh:testcompINSERT INTO altercations (instigator, started_at, ships_destroyed, ... energy_used, alliance_involvement) ... 
VALUES ('Jayne Cobb', '2012-07-23', 2, 4.6, 'false'); TSocket read 0 bytes cqlsh:testcomp select * from altercations; Traceback (most recent call last): File bin/cqlsh, line 1008, in perform_statement self.cursor.execute(statement, decoder=decoder) File bin/../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cursor.py, line 117, in execute response = self.handle_cql_execution_errors(doquery, prepared_q, compress) File bin/../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cursor.py, line 132, in handle_cql_execution_errors return executor(*args, **kwargs) File bin/../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cassandra/Cassandra.py, line 1583, in execute_cql_query self.send_execute_cql_query(query, compression) File bin/../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cassandra/Cassandra.py, line 1593, in send_execute_cql_query self._oprot.trans.flush() File bin/../lib/thrift-python-internal-only-0.7.0.zip/thrift/transport/TTransport.py, line 293, in flush self.__trans.write(buf) File bin/../lib/thrift-python-internal-only-0.7.0.zip/thrift/transport/TSocket.py, line 117, in write plus = self.handle.send(buff) error: [Errno 32] Broken pipe cqlsh:testcomp Any idea? Is it a problem with CQL3 or with cassandra? P.S: I did post same query on dev group as well to get a quick response. -Vivek -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
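The "TSocket read 0 bytes" above is the pre-1.2 server dropping the connection on a malformed composite instead of returning an error; as Brian notes, 1.2 validates the component count at the Thrift layer. For intuition about what the server is parsing, each component of a CompositeType value is encoded as a 2-byte big-endian length, the raw component bytes, and an end-of-component byte; a sketch of that documented layout (illustrative helper, not driver code):

```python
import struct

# Encode a CompositeType value: for each component,
#   2-byte big-endian length + component bytes + end-of-component byte.
# A missing component leaves the server expecting bytes that never
# arrive, which is why pre-1.2 failed so opaquely.
def encode_composite(*components: bytes, eoc: int = 0) -> bytes:
    out = bytearray()
    for c in components:
        out += struct.pack('>H', len(c))   # 2-byte length
        out += c                           # raw component bytes
        out.append(eoc)                    # 0 = exact/equality match
    return bytes(out)

key = encode_composite(b'Jayne Cobb', struct.pack('>q', 1343001600000))
print(key.hex())
```

The end-of-component byte is 0 for an exact match; slice queries use -1/+1 there to make range bounds sort before or after all keys sharing the prefix.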
Re: 1000's of column families
Without putting too much thought into it... Given the underlying architecture, I think you could/would have to write your own partitioner, which would partition based on the prefix/virtual keyspace. -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com On 10/2/12 9:00 AM, Ben Hood 0x6e6...@gmail.com wrote: Dean, On Tue, Oct 2, 2012 at 1:37 PM, Hiller, Dean dean.hil...@nrel.gov wrote: Ben, to address your question, read my last post but to summarize, yes, there is less overhead in memory to prefix keys than manage multiple Cfs EXCEPT when doing map/reduce. Doing map/reduce, you will now have HUGE overhead in reading a whole slew of rows you don't care about as you can't map/reduce a single virtual CF but must map/reduce the whole CF wasting TONS of resources. That's a good point that I hadn't considered beforehand, especially as I'd like to run MR jobs against these CFs. Is this limitation inherent in the way that Cassandra is modelled as input for Hadoop or could you write a custom slice query to only feed in one particular prefix into Hadoop? Cheers, Ben
Re: 1000's of CF's. virtual CFs do NOT work…map/reduce
Dean, Great point. I hadn't considered that either. Per my other email, think we would need a custom partitioner for this? (a mix of OrderPreservingPartitioner and RandomPartitioner, OPP for the prefix) -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive • King of Prussia, PA • 19406 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com On 10/2/12 8:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote: So basically, with moving towards the 1000's of CF all being put in one CF, our performance is going to tank on map/reduce, correct? I mean, from what I remember we could do map/reduce on a single CF, but by stuffing 1000's of virtual CF's into one CF, our map/reduce will have to read in all 999 virtual CF's rows that we don't want just to map/reduce the ONE CF. Map/reduce is VERY VERY SLOW when reading in 1000 times more rows :( :(. Is this correct? This really sounds like highly undesirable behavior. There needs to be a way for people with 1000's of CF's to also run map/reduce on any one CF. Doing map/reduce on 1000 times the number of rows will be 1000 times slower…and of course, we will most likely get up to 20,000 tables from my most recent projections…our last test load, we ended up with 8k+ CF's. 
Since I kept two other keyspaces, cassandra started getting really REALLY slow when we got up to 15k+ CF's in the system, though I didn't look into why. I don't mind having 1000's of virtual CF's in ONE CF, BUT I need to map/reduce just the virtual CF! Ugh. Thanks, Dean On 10/1/12 3:38 PM, Ben Hood 0x6e6...@gmail.com wrote: On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill b...@alumni.brown.edu wrote: Its just a convenient way of prefixing: http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html So given that it is possible to use a CF per tenant, should we assume that at sufficient scale there is less overhead to prefix keys than there is to manage multiple CFs? Ben
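The "virtual CF" pattern under discussion is just key prefixing, which is why a map/reduce job over the physical CF sees every tenant's rows: under RandomPartitioner the prefixed keys are scattered across the token ring, so selecting one virtual CF degenerates into a full scan plus a client-side filter. A minimal sketch of the idea (hypothetical key layout, not Hector's exact virtual-keyspace encoding):

```python
# Virtual column families via key prefixing: every row of virtual CF
# "sensor_17" lives in the single physical CF under a composite key.
def physical_key(virtual_cf: str, row_key: str) -> str:
    return f"{virtual_cf}:{row_key}"

def rows_for_virtual_cf(all_rows: dict, virtual_cf: str) -> dict:
    # Under RandomPartitioner no token range isolates a prefix, so a
    # map/reduce job must read every row and filter - Dean's complaint.
    prefix = virtual_cf + ":"
    return {k: v for k, v in all_rows.items() if k.startswith(prefix)}

store = {
    physical_key("sensor_17", "2012-10-01"): {"temp": 21.5},
    physical_key("sensor_17", "2012-10-02"): {"temp": 22.1},
    physical_key("sensor_99", "2012-10-01"): {"co2": 412},
}
print(rows_for_virtual_cf(store, "sensor_17"))  # scans all 3 rows, keeps 2
```

With an order-preserving partitioner (or Brian's proposed prefix-aware hybrid partitioner), all keys sharing a prefix would be contiguous on the ring and the filter could become a key-range scan instead of a full scan.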
Re: 1000's of column families
Agreed. Do we know yet what the overhead is for each column family? What is the limit? If you have a SINGLE keyspace w/ 2+ CF's, what happens? Anyone know? -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive • King of Prussia, PA • 19406 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com On 10/2/12 9:28 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Thanks for the idea but…(but please keep thinking on it)... 100% what we don't want since partitioned data resides on the same node. I want to map/reduce the column families and leverage the parallel disks :( :( I am sure others would want to do the same…..We almost need a feature of virtual Column Families and column family should really not be column family but should be called ReplicationGroup or something where replication is configured for all CF's in that group. ANYONE have any other ideas??? Dean On 10/2/12 7:20 AM, Brian O'Neill boneil...@gmail.com wrote: Without putting too much thought into it... Given the underlying architecture, I think you could/would have to write your own partitioner, which would partition based on the prefix/virtual keyspace. 
-brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive • King of Prussia, PA • 19406 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com On 10/2/12 9:00 AM, Ben Hood 0x6e6...@gmail.com wrote: Dean, On Tue, Oct 2, 2012 at 1:37 PM, Hiller, Dean dean.hil...@nrel.gov wrote: Ben, to address your question, read my last post but to summarize, yes, there is less overhead in memory to prefix keys than manage multiple Cfs EXCEPT when doing map/reduce. Doing map/reduce, you will now have HUGE overhead in reading a whole slew of rows you don't care about as you can't map/reduce a single virtual CF but must map/reduce the whole CF wasting TONS of resources. That's a good point that I hadn't considered beforehand, especially as I'd like to run MR jobs against these CFs. Is this limitation inherent in the way that Cassandra is modelled as input for Hadoop or could you write a custom slice query to only feed in one particular prefix into Hadoop? Cheers, Ben
Re: 1000's of CF's. virtual CFs possible Map/Reduce SOLUTION...
Dean, We moved away from Hadoop and M/R, and instead we are using Storm as our compute grid. We queue keys in Kafka, then Storm distributes the work to the grid. It's working well so far, but we haven't taken it to prod yet. Data is read from Cassandra using a Cassandra-bolt. If you end up using Storm, let me know. We have an unreleased version of the bolt that you probably want to use. (we're waiting on Nathan/Storm to fix some classpath loading issues) RE: a custom virtual keyspace Partitioner, point well taken -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive • King of Prussia, PA • 19406 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com On 10/2/12 9:33 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Well, I think I know the direction we may follow so we can 1. Have Virtual CF's 2. Be able to map/reduce ONE Virtual CF Well, not map/reduce exactly but really really close. We use PlayOrm with its partitioning, so I am now thinking what we will do is have a compute grid where we can have each node doing a findAll query into the partitions it is responsible for. In this way, I think we can have 1000's of virtual CF's inside ONE CF and then PlayOrm does its query and retrieves the rows for that partition of one virtual CF. 
Anyone know of a compute grid we can dish out work to? That would be my only missing piece (well, that and the PlayOrm virtual CF feature, but I can add that within a week probably, though I am on vacation this Thursday to Monday). Later, Dean On 10/2/12 6:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote: So basically, with moving towards the 1000's of CF all being put in one CF, our performance is going to tank on map/reduce, correct? I mean, from what I remember we could do map/reduce on a single CF, but by stuffing 1000's of virtual CF's into one CF, our map/reduce will have to read in all 999 virtual CF's rows that we don't want just to map/reduce the ONE CF. Map/reduce is VERY VERY SLOW when reading in 1000 times more rows :( :(. Is this correct? This really sounds like highly undesirable behavior. There needs to be a way for people with 1000's of CF's to also run map/reduce on any one CF. Doing map/reduce on 1000 times the number of rows will be 1000 times slower…and of course, we will most likely get up to 20,000 tables from my most recent projections…our last test load, we ended up with 8k+ CF's. Since I kept two other keyspaces, cassandra started getting really REALLY slow when we got up to 15k+ CF's in the system, though I didn't look into why. I don't mind having 1000's of virtual CF's in ONE CF, BUT I need to map/reduce just the virtual CF! Ugh. Thanks, Dean On 10/1/12 3:38 PM, Ben Hood 0x6e6...@gmail.com wrote: On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill b...@alumni.brown.edu wrote: Its just a convenient way of prefixing: http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html So given that it is possible to use a CF per tenant, should we assume that at sufficient scale there is less overhead to prefix keys than there is to manage multiple CFs? Ben
Re: 1000's of column families
Exactly. --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com On 10/2/12 9:55 AM, Ben Hood 0x6e6...@gmail.com wrote: Brian, On Tue, Oct 2, 2012 at 2:20 PM, Brian O'Neill boneil...@gmail.com wrote: Without putting too much thought into it... Given the underlying architecture, I think you could/would have to write your own partitioner, which would partition based on the prefix/virtual keyspace. I might be barking up the wrong tree here, but looking at the source of ColumnFamilyInputFormat, it seems that you can specify a KeyRange for the input, but only when you use an order preserving partitioner. So I presume that if you are using the RandomPartitioner, you are effectively doing a full CF scan (i.e. including all tenants in your system). Ben
Re: 1000's of column families
Dean, We have the same question... We have thousands of separate feeds of data as well (20,000+). To date, we've been using a CF per feed strategy, but as we scale this thing out to accommodate all of those feeds, we're trying to figure out if we're going to blow out the memory. The initial documentation for heap sizing had column families in the equation: http://www.datastax.com/docs/0.7/operations/tuning#heap-sizing But in the more recent documentation, it looks like they removed the column family variable with the introduction of the universal key_cache_size. http://www.datastax.com/docs/1.0/operations/tuning#tuning-java-heap-size We haven't committed either way yet, but given Ed Anuff's presentation on virtual keyspaces, we were leaning towards a single column family approach: http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassandra_-_apigee_under_the_hood/? Definitely let us know what you decide. -brian On Fri, Sep 28, 2012 at 11:48 AM, Flavio Baronti f.baro...@list-group.com wrote: We had some serious trouble with dynamically adding CFs, although last time we tried we were using version 0.7, so maybe that's not an issue any more. We had two problems: - You are (were?) not supposed to add CFs concurrently. Since we had multiple servers talking to the same Cassandra cluster, we had to use distributed locks (Hazelcast) to avoid concurrency. - You must be very careful to add new CFs to different Cassandra nodes. If you do that fast enough, and the clocks of the two servers are skewed, you will severely compromise your schema (Cassandra will not understand in which order the updates must be applied). As I said, this applied to version 0.7; maybe current versions solved these problems. Flavio On 2012/09/27 16:11, Hiller, Dean wrote: We have 1000's of different building devices and we stream data from these devices. 
The format and data from each one varies so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain and some devices add some new variable into the equation. NOT only that but researchers can register new datasets and upload them as well, and each dataset they have they do NOT want to share with other researchers necessarily, so we have security groups and each CF belongs to security groups. We dynamically create CF's on the fly as people register new datasets. On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions. We could create one CF and put all the data and have a partition per device, but then a time partition will contain multiple devices of data, meaning we need to shrink our time partition size, whereas if we have a CF per device, the time partition can be larger as it is only for that one device. THEN, on top of that, we have a meta CF for these devices so some people want to query for streams that match criteria AND which returns a CF name and they query that CF name, so we almost need a query with variables like select cfName from Meta where x = y and then select * from cfName where x. Which we can do today. Dean From: Marcelo Elias Del Valle mvall...@gmail.com Reply-To: user@cassandra.apache.org Date: Thursday, September 27, 2012 8:01 AM To: user@cassandra.apache.org Subject: Re: 1000's of column families Out of curiosity, is it really necessary to have that amount of CFs? 
I am probably still used to relational databases, where you would use a new table just in case you need to store different kinds of data. As Cassandra stores anything in each CF, it might probably make sense to have a lot of CFs to store your data... But why wouldn't you use a single CF with partitions in this case? Wouldn't it be the same thing? I am asking because I might learn a new modeling technique with the answer. []s 2012/9/26 Hiller, Dean dean.hil...@nrel.gov We are streaming data with 1 stream per 1 CF and we have 1000's of CF. When using the tools they are all geared to analyzing ONE column family at a time :(. If I remember correctly, Cassandra supports as many CF's as you want, correct? Even though I am going to have tons of fun with limitations on the tools, correct? (I may end up wrapping the node tool with my own aggregate calls if needed to sum up multiple column families and such). Thanks, Dean -- Marcelo Elias Del Valle http://mvalle.com -
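Dean's time-partitioning idea (a CF per device, with each CF further split into time buckets once it grows too large) can be sketched with a bucketed row key. The bucket-per-UTC-month scheme and the names below are illustrative assumptions, not what his system actually uses:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimeBucketKey {
    // Derive a time-bucketed row key: one bucket (row) per UTC month,
    // so a device's readings for a month share one wide row.
    static String rowKey(String deviceId, long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return deviceId + ":" + fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        // all readings from September 2012 land in the same row
        System.out.println(rowKey("device42", 1348756272000L)); // device42:2012-09
    }
}
```

Per-device CFs let the bucket stay coarse (one device's data per bucket); a single shared CF would force smaller buckets, which is the trade-off Dean describes.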
Re: 1000's of column families
It's just a convenient way of prefixing: http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html -brian On Mon, Oct 1, 2012 at 4:22 PM, Ben Hood 0x6e6...@gmail.com wrote: Brian, On Mon, Oct 1, 2012 at 4:22 PM, Brian O'Neill b...@alumni.brown.edu wrote: We haven't committed either way yet, but given Ed Anuff's presentation on virtual keyspaces, we were leaning towards a single column family approach: http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassandra_-_apigee_under_the_hood/? Is this doing something special or is this just a convenient way of prefixing keys to make the storage space multi-tenanted? Cheers, Ben -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
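The key-prefixing trick behind "virtual keyspaces" can be sketched in a few lines. This is only an illustrative stand-in for what Hector does (real implementations prefix at the byte level, and the class and separator here are hypothetical):

```java
// Hypothetical sketch of virtual-CF key prefixing: every tenant's (or
// virtual CF's) rows live in ONE physical column family, disambiguated
// by a key prefix rather than by a separate CF definition.
public class VirtualKeyspace {
    private final String prefix;

    VirtualKeyspace(String tenant) { this.prefix = tenant + "::"; }

    // wrap a logical row key into the physical, tenant-scoped key
    String physicalKey(String rowKey) { return prefix + rowKey; }

    public static void main(String[] args) {
        VirtualKeyspace vk = new VirtualKeyspace("feed_1234");
        System.out.println(vk.physicalKey("2012-10-02")); // feed_1234::2012-10-02
    }
}
```

This is why map/reduce over one virtual CF is awkward with the RandomPartitioner, as the thread notes: the prefixed keys are scattered across the token range, so a scan cannot be restricted to one prefix.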
Re: Using the commit log for external synchronization
IMHO it's a better design to multiplex the data stream at the application level. +1, agreed. That is where we ended up. (and Storm is proving to be a solid framework for that) -brian On Fri, Sep 21, 2012 at 4:56 AM, aaron morton aa...@thelastpickle.com wrote: The commit log is essentially internal implementation. The total size of the commit log is restricted, and the multiple files used to represent segments are recycled. So once all the memtables have been flushed for a segment it may be overwritten. To archive the segments see the conf/commitlog_archiving.properties file. Large rows will bypass the commit log. A write committed to the commit log may still be considered a failure if CL nodes do not succeed. IMHO it's a better design to multiplex the data stream at the application level. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 21/09/2012, at 11:51 AM, Brian O'Neill b...@alumni.brown.edu wrote: Along those lines... We sought to use triggers for external synchronization. If you read through this issue: https://issues.apache.org/jira/browse/CASSANDRA-1311 You'll see the idea of leveraging a commit log for synchronization, via triggers. We went ahead and implemented this concept in: https://github.com/hmsonline/cassandra-triggers With that, via AOP, you get handed the mutation as things change. We used it for synchronizing SOLR. fwiw, -brian On Sep 20, 2012, at 7:18 PM, Michael Kjellman wrote: +1. Would be a pretty cool feature Right now I write once to cassandra and once to kafka. On 9/20/12 4:13 PM, Data Craftsman 木匠 database.crafts...@gmail.com wrote: This will be a good new feature. I guess the development team don't have time on this yet. 
;) On Thu, Sep 20, 2012 at 1:29 PM, Ben Hood 0x6e6...@gmail.com wrote: Hi, I'd like to incrementally synchronize data written to Cassandra into an external store without having to maintain an index to do this, so I was wondering whether anybody is using the commit log to establish what updates have taken place since a given point in time? Cheers, Ben -- Thanks, Charlie (@mujiang) 木匠 === Data Architect Developer 汉唐 田园牧歌DBA http://mujiang.blogspot.com 'Like' us on Facebook for exclusive content and other resources on all Barracuda Networks solutions. Visit http://barracudanetworks.com/facebook -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/ -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) Apache Cassandra MVP mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
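The application-level multiplexing that Aaron and Brian favor (and that Michael does by hand, writing once to Cassandra and once to Kafka) amounts to fanning one logical write out to every configured sink. A minimal sketch, assuming a hypothetical Sink interface in place of real Cassandra/Kafka clients:

```java
import java.util.Arrays;
import java.util.List;

public class WriteMultiplexer {
    // Hypothetical sink abstraction; real code would wrap a Cassandra
    // client, a Kafka producer, a SOLR/ES client, etc.
    interface Sink { void write(String key, String value); }

    private final List<Sink> sinks;

    WriteMultiplexer(Sink... sinks) { this.sinks = Arrays.asList(sinks); }

    // one logical write fans out to every configured store
    void write(String key, String value) {
        for (Sink s : sinks) s.write(key, value);
    }

    public static void main(String[] args) {
        StringBuilder trace = new StringBuilder();
        WriteMultiplexer mux = new WriteMultiplexer(
            (k, v) -> trace.append("cassandra<" + k + ">"),
            (k, v) -> trace.append("kafka<" + k + ">"));
        mux.write("row1", "payload");
        System.out.println(trace); // cassandra<row1>kafka<row1>
    }
}
```

The appeal over tailing the commit log is that the application controls ordering and failure handling per sink, rather than depending on Cassandra internals that may recycle or bypass segments.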
Re: Kundera 2.1 released
Well done, Vivek and team!! This release was much anticipated. I'll give this a test with Spring Data JPA when I return from vacation. thanks, -brian On Sep 21, 2012, at 9:15 PM, Vivek Mishra wrote: Hi All, We are happy to announce the release of Kundera 2.0.7. Kundera is a JPA 2.0 based, object-datastore mapping library for NoSQL datastores. The idea behind Kundera is to make working with NoSQL databases drop-dead simple and fun. It currently supports Cassandra, HBase, MongoDB and relational databases. Major changes in this release: --- * Allow user to set specific CQL versioning. * Batch insert/update for Cassandra/MongoDB/HBase. * Extended JPA Metamodel/TypedQuery/ProviderUtil implementation. * Another Thrift client implementation for Cassandra. * Deprecated support for properties with XML based Column family/Table/server specific property configuration for Cassandra, MongoDB and HBase. * Stronger query support: a) JPQL support over all data types and associations. b) JPQL support to query using primary key along with other columns. 
* Fixed github issues: https://github.com/impetus-opensource/Kundera/issues/90 https://github.com/impetus-opensource/Kundera/issues/91 https://github.com/impetus-opensource/Kundera/issues/92 https://github.com/impetus-opensource/Kundera/issues/93 https://github.com/impetus-opensource/Kundera/issues/94 https://github.com/impetus-opensource/Kundera/issues/96 https://github.com/impetus-opensource/Kundera/issues/98 https://github.com/impetus-opensource/Kundera/issues/99 https://github.com/impetus-opensource/Kundera/issues/100 https://github.com/impetus-opensource/Kundera/issues/101 https://github.com/impetus-opensource/Kundera/issues/102 https://github.com/impetus-opensource/Kundera/issues/104 https://github.com/impetus-opensource/Kundera/issues/106 https://github.com/impetus-opensource/Kundera/issues/107 https://github.com/impetus-opensource/Kundera/issues/108 https://github.com/impetus-opensource/Kundera/issues/109 https://github.com/impetus-opensource/Kundera/issues/111 https://github.com/impetus-opensource/Kundera/issues/112 https://github.com/impetus-opensource/Kundera/issues/116 To download, use or contribute to Kundera, visit: http://github.com/impetus-opensource/Kundera Latest released tag version is 2.1. Kundera maven libraries are now available at: https://oss.sonatype.org/content/repositories/releases/com/impetus and http://kundera.googlecode.com/svn/maven2/maven-missing-resources. Sample codes and examples for using Kundera can be found here: http://github.com/impetus-opensource/Kundera-Examples and https://github.com/impetus-opensource/Kundera/tree/trunk/kundera-tests Thank you all for your contributions! Regards, Kundera Team. -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: Using the commit log for external synchronization
Along those lines... We sought to use triggers for external synchronization. If you read through this issue: https://issues.apache.org/jira/browse/CASSANDRA-1311 You'll see the idea of leveraging a commit log for synchronization, via triggers. We went ahead and implemented this concept in: https://github.com/hmsonline/cassandra-triggers With that, via AOP, you get handed the mutation as things change. We used it for synchronizing SOLR. fwiw, -brian On Sep 20, 2012, at 7:18 PM, Michael Kjellman wrote: +1. Would be a pretty cool feature Right now I write once to cassandra and once to kafka. On 9/20/12 4:13 PM, Data Craftsman 木匠 database.crafts...@gmail.com wrote: This will be a good new feature. I guess the development team don't have time on this yet. ;) On Thu, Sep 20, 2012 at 1:29 PM, Ben Hood 0x6e6...@gmail.com wrote: Hi, I'd like to incrementally synchronize data written to Cassandra into an external store without having to maintain an index to do this, so I was wondering whether anybody is using the commit log to establish what updates have taken place since a given point in time? Cheers, Ben -- Thanks, Charlie (@mujiang) 木匠 === Data Architect Developer 汉唐 田园牧歌DBA http://mujiang.blogspot.com -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: Data Modeling - JSON vs Composite columns
Roshni, We're going through the same debate right now. I believe native support for JSON (or collections) is on the docket for Cassandra. Here is a discussion we had a few months ago on the topic: http://comments.gmane.org/gmane.comp.db.cassandra.devel/5233 We presently store JSON, but we're considering a change to composite keys. Presently, each client has to parse the JSON value. If you are retrieving lots of values, that's a lot of parsing. Also, storing the raw values allows for better integration with other tools, such as reporting engines (e.g. JasperSoft). Also, if you do want to update a single value inside the JSON, you get into real trouble, because you first need to read the value, update the field, then write the column again. The read before write is a problem, especially if you have a lot of concurrency in your system. (Two clients could read the old value, then update different fields, and the second would overwrite the first's change) One final note: JSON values also complicated our wide-row indexing mechanism (https://github.com/hmsonline/cassandra-indexing). For those reasons, we're considering a data model shift away from JSON. That said, I'm keeping a close watch on: https://issues.apache.org/jira/browse/CASSANDRA-3647 But if this is CQL only, I'm not sure how much use it will be for us since we're coming in from different clients. Anyone know how/if collections will be available from other clients? 
-brian On Wed, Sep 19, 2012 at 8:00 AM, Roshni Rajagopal roshni_rajago...@hotmail.com wrote: Hi, There was a conversation on this some time earlier, and to continue it Suppose I want to associate a user to an item, and I want to also store 3 commonly used attributes without needing to go to an entity item column family , I have 2 options :- A) use composite columns UserId1 : { itemid1:Name = Betty Crocker, itemid1:Descr = Cake itemid1:Qty = 5 itemid2:Name = Nutella, itemid2:Descr = Choc spread itemid2:Qty = 15 } B) use a json with the data UserId1 : { itemid1 = {name: Betty Crocker,descr: Cake, Qty: 5}, itemid2 ={name: Nutella,descr: Choc spread, Qty: 15} } Essentially A is better if one wants to update individual fields , while B is better if one wants easier paging, reading multiple items at once in one read. etc. The details are in this discussion thread http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Data-Modeling-another-question-td7581967.html I had an additional question, as its being said, that CQL is the direction in which cassandra is moving, and there's a lot of effort in making CQL the standard, How does approach B work in CQL. Can we read/write a JSON easily in CQL? Can we extract a field from a JSON in CQL or would that need to be done via the client code? Regards, Roshni -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) Apache Cassandra MVP mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
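The lost-update race Brian describes with whole-JSON columns is easy to demonstrate. The sketch below models the stored blob as a plain map (no Cassandra client involved); the field names are taken from Roshni's example:

```java
import java.util.HashMap;
import java.util.Map;

public class LostUpdate {
    // Simulate two clients doing read-modify-write on the same JSON blob.
    static Map<String, String> race(Map<String, String> stored) {
        Map<String, String> clientA = new HashMap<>(stored); // A reads
        Map<String, String> clientB = new HashMap<>(stored); // B reads
        clientA.put("qty", "6");          // A bumps the quantity
        clientB.put("descr", "Cake mix"); // B edits the description
        stored = clientA;                 // A writes the whole blob back
        stored = clientB;                 // ...then B overwrites it
        return stored;
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();
        row.put("name", "Betty Crocker");
        row.put("qty", "5");
        // B's write silently discarded A's quantity update
        System.out.println(race(row).get("qty")); // prints 5, not 6
    }
}
```

With composite columns (option A), each client writes only its own column, so A's qty write and B's descr write land independently and neither is lost.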
Re: Solr Use Cases
Roshni, We're using SOLR to support ad hoc queries and fuzzy searches against unstructured data stored in Cassandra. Cassandra is great for storage, and you can create data models and indexes that support your queries, provided you can anticipate those queries. When you can't anticipate the queries, or if you need to support a large permutation of multi-dimensional queries, you're probably better off using an index like SOLR. Since SOLR only supports a flat document structure, you may need to perform transformation before inserting into SOLR. We chose not to use DSE, so we used cassandra-triggers as our mechanism to integrate SOLR. (https://github.com/hmsonline/cassandra-triggers) We intercept the mutation, transform the data into a document (w/ multi-value fields) and POST it to SOLR. More recently though, we're looking to roll out ElasticSearch. As our query demand increases, we expect SOLR to quickly become a PITA to administer (master-slave relationships). IMHO, ElasticSearch's architecture is a better match for Cassandra. We are also looking to substitute cassandra-triggers with Storm, allowing us to build a data processing flow using Cassandra and ElasticSearch bolts. (we've open sourced the Cassandra bolt and we'll be open sourcing the ElasticSearch bolt shortly) -brian On Wed, Sep 19, 2012 at 8:27 AM, Roshni Rajagopal roshni_rajago...@hotmail.com wrote: Hi, I'm new to Solr, and I hear that Solr is a great tool for improving search performance. I'm unsure whether Solr or DSE Search is a must for all Cassandra deployments. 1. For performance - I thought Cassandra had great read/write performance. When should Solr be used? Taking the following use cases for Cassandra from the DataStax FAQ page, in which cases would Solr be useful, and whether for all? 
Time series data management; high-velocity device data ingestion and analysis; media streaming (e.g., music, movies); social media input and analysis; online web retail (e.g., shopping carts, user transactions); web log management / analysis; web click-stream analysis; real-time data analytics; online gaming (e.g., real-time messaging); write-intensive transaction systems; buyer event analytics; risk analysis and management. 2. What changes to Cassandra data modeling does Solr bring? We have some guidelines and best practices around Cassandra data modeling. Is Solr so powerful that it does not matter how data is modelled in Cassandra? Are there different best practices for Cassandra data modeling when Solr is in the picture? Is this something we should keep in mind while modeling for Cassandra today - that it should be good to be used via Solr in future? 3. Does Solr come with any drawbacks, like it's not real time? I can/should read the manual, but it will be great if someone can explain at a high level. Thank you! Regards, Roshni -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) Apache Cassandra MVP mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
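The flattening step Brian mentions (SOLR documents are flat, so nested data must be transformed before the POST) can be sketched as a recursive collapse into dotted field names. This is an illustrative assumption about one way to do it, not how the hmsonline trigger actually transforms mutations:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Flatten {
    // Collapse nested maps into a flat map with dotted keys, e.g.
    // {attrs: {qty: 5}} becomes {attrs.qty: 5}. Multi-valued fields
    // (lists) are omitted here for brevity.
    static void flatten(String prefix, Map<String, Object> in,
                        Map<String, Object> out) {
        for (Map.Entry<String, Object> e : in.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey()
                                          : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>) e.getValue();
                flatten(key, nested, out);
            } else {
                out.put(key, e.getValue());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> item = new LinkedHashMap<>();
        Map<String, Object> attrs = new LinkedHashMap<>();
        attrs.put("qty", 5);
        item.put("name", "Betty Crocker");
        item.put("attrs", attrs);
        Map<String, Object> doc = new LinkedHashMap<>();
        flatten("", item, doc);
        System.out.println(doc); // {name=Betty Crocker, attrs.qty=5}
    }
}
```

Each flat key then becomes a SOLR field (declared or dynamic) in the document that gets POSTed to the index.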
Compound Keys: Connecting the dots between CQL3 and Java APIs
Our data architects (ex-Oracle DBA types) are jumping on the CQL3 bandwagon and creating schemas for us. That triggered me to write a quick article mapping the CQL3 schemas to how they are accessed via Java APIs (for our dev team). I hope others find this useful as well: http://brianoneill.blogspot.com/2012/09/composite-keys-connecting-dots-between.html -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) Apache Cassandra MVP mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
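The mapping the article connects can be sketched roughly: with a CQL3 compound primary key, the first key component becomes the storage row key, and each non-key column is stored under a composite column name built from the clustering value plus the CQL column name. The sketch below is illustrative only; real composite names are length-prefixed binary, not colon-joined strings, and the table is hypothetical:

```java
// Rough illustration of how CQL3 compound keys map onto the
// wide-row storage model that Thrift-era Java APIs see.
public class CompositeColumnName {
    // For PRIMARY KEY (rowKey, clusterKey): non-key column "cqlColumn"
    // is stored as a column named <clusteringValue>:<cqlColumn>.
    static String storageColumn(String clusteringValue, String cqlColumn) {
        return clusteringValue + ":" + cqlColumn;
    }

    public static void main(String[] args) {
        // Hypothetical CQL: CREATE TABLE trades (
        //   acct text, ts text, price double, PRIMARY KEY (acct, ts));
        // Storage: one row per acct; each (ts, price) becomes a column:
        System.out.println(storageColumn("2012-09-07T12:00", "price"));
    }
}
```

This is why a "row" in CQL3 terms is a slice of a much wider storage row, which is the main conceptual jump for developers coming from Hector or Astyanax.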
Re: Cassandra API Library.
You got it. (done) -brian On Tue, Sep 4, 2012 at 7:08 AM, Filipe Gonçalves the.wa.syndr...@gmail.com wrote: @Brian: you can add the Cassandra::Simple Perl client http://fmgoncalves.github.com/p5-cassandra-simple/ 2012/8/27 Paolo Bernardi berna...@gmail.com On 08/23/2012 01:40 PM, Thomas Spengler wrote: 4) pelops (Thrift,Java) I've been using Pelops for quite some time with pretty good results; it felt much cleaner than Hector. Paolo -- @bernarpa http://paolobernardi.wordpress.com -- Filipe Gonçalves -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) Apache Cassandra MVP mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: Spring - cassandra
Yes. I'm in contact with Oliver Gierke and Erez Mazor of Spring Data. We are working on two fronts: 1) Spring Data support via JPA (using Kundera underneath) - Initial attempt here: http://brianoneill.blogspot.com/2012/07/spring-data-w-cassandra-using-jpa.html - Most recently (an hour ago): The issues w/ MetaModel are fixed, now waiting on an enhancement to the EntityManager to fully support typed queries. For this one, we're in a holding pattern until Kundera is fully JPA compliant. 2) Spring Data support via Astyanax - The project I'm working on below should mimic Spring Data MongoDB's approach and capabilities, allowing people to use Spring Data with Cassandra without the constraints of JPA. I'd love some help working on the project. Once we have it functional we should be able to push it to Spring. (with Oliver's help) Go ahead and fork. Feel free to email me directly so we don't spam this list. (or set up a Google Group just in case others want to contribute) -brian --- Brian O'Neill Lead Architect, Software Development Apache Cassandra MVP Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com On 8/30/12 9:01 AM, Radim Kolar h...@filez.com wrote: You looking for the author of Spring Data Cassandra? 
https://github.com/boneill42/spring-data-cassandra If so, I guess that is me. =) Did you get in touch with the Spring guys? They have Cassandra support on their Spring Data todo list. They might have some todo or feature list they want to implement for Cassandra; I am willing to code something to make official Spring Cassandra support happen faster.
Re: Spring - cassandra
You looking for the author of Spring Data Cassandra? https://github.com/boneill42/spring-data-cassandra If so, I guess that is me. =) -brian --- Brian O'Neill Lead Architect, Software Development Apache Cassandra MVP Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com On 8/29/12 10:38 AM, Radim Kolar h...@filez.com wrote: is author of Spring - Cassandra here? I am interested in getting this merged into upstream spring. They have cassandra support on their todo list.
Re: Cassandra API Library.
We've used 'em all and (IMHO) 1) I would avoid Thrift directly. 2) Hector is a sure bet. 3) Astyanax is the up and comer. 4) Kundera is good, but works like an ORM -- so not so good if your columns aren't defined ahead of time. -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive King of Prussia, PA 19406 M: 215.588.6024 @boneill42 http://www.twitter.com/boneill42 healthmarketscience.com On 8/23/12 7:40 AM, Thomas Spengler thomas.speng...@toptarif.de wrote: 4) pelops (Thrift,Java) On 08/23/2012 01:28 PM, Baskar Sikkayan wrote: I would vote for Hector :) On Thu, Aug 23, 2012 at 4:55 PM, Amit Handa amithand...@gmail.com wrote: hi, kindly let me know which java client api is more matured, and easy to use with all features(Super Columns, caching, pooling, etc) of Cassandra 1.X. Right now i come to know that following client exists: 1) Hector(Java) 2) Thrift (Java) 3) Kundera (Java) With Regards, Amit -- Thomas Spengler Chief Technology Officer TopTarif Internet GmbH, Pappelallee 78-79, D-10437 Berlin Tel.: (030) 2000912 0 | Fax: (030) 2000912 100 thomas.speng...@toptarif.de | www.toptarif.de Amtsgericht Charlottenburg, HRB 113287 B Geschäftsführer: Dr. Rainer Brosch, Dr. Carolin Gabor -
Re: Cassandra API Library.
Thanks Dean… I hadn't played with that one. I wonder if that would better fit the bill for the Spring Data Cassandra module I'm hacking on. https://github.com/boneill42/spring-data-cassandra I'll poke around. -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive • King of Prussia, PA • 19406 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com On 8/23/12 9:19 AM, Hiller, Dean dean.hil...@nrel.gov wrote: playOrm has a raw layer for when your columns are not defined ahead of time, and SQL with no limitations on , =, =, etc. etc. as well as joins being added shortly BUT joins are for joining partitions so that your system can still scale to infinity. Also has an in-memory database as well for unit testing that you can do TDD with built in. So if you like JQL but want infinite scale JQL, try playOrm. All 45 tests are passing. We expect 100 unit tests to be in place by the end of the year. Dean On 8/23/12 6:46 AM, Brian O'Neill boneil...@gmail.com wrote: We've used 'em all and… (IMHO) 1) I would avoid Thrift directly. 2) Hector is a sure bet. 3) Astyanax is the up and comer. 4) Kundera is good, but works like an ORM -- so not so good if your columns aren't defined ahead of time. 
-brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive • King of Prussia, PA • 19406 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com On 8/23/12 7:40 AM, Thomas Spengler thomas.speng...@toptarif.de wrote: 4) pelops (Thrift,Java) On 08/23/2012 01:28 PM, Baskar Sikkayan wrote: I would vote for Hector :) On Thu, Aug 23, 2012 at 4:55 PM, Amit Handa amithand...@gmail.com wrote: hi, kindly let me know which java client api is more matured, and easy to use with all features(Super Columns, caching, pooling, etc) of Cassandra 1.X. Right now i come to know that following client exists: 1) Hector(Java) 2) Thrift (Java) 3) Kundera (Java) With Regards, Amit -- Thomas Spengler Chief Technology Officer TopTarif Internet GmbH, Pappelallee 78-79, D-10437 Berlin Tel.: (030) 2000912 0 | Fax: (030) 2000912 100 thomas.speng...@toptarif.de | www.toptarif.de Amtsgericht Charlottenburg, HRB 113287 B Geschäftsführer: Dr. Rainer Brosch, Dr. Carolin Gabor -
Re: Cassandra API Library.
FWIW.. I just threw this together... http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html Let me know if I missed any others. (I didn't have playorm on there) -brian On Thu, Aug 23, 2012 at 9:51 AM, Brian O'Neill boneil...@gmail.com wrote: Thanks Dean… I hadn't played with that one. I wonder if that would better fit the bill for the Spring Data Cassandra module I'm hacking on. https://github.com/boneill42/spring-data-cassandra I'll poke around. -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive • King of Prussia, PA • 19406 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com On 8/23/12 9:19 AM, Hiller, Dean dean.hil...@nrel.gov wrote: playOrm has a raw layer that if your columns are not defined ahead of time and SQL with no limitations on , =, =, etc. etc. as well as joins being added shortly BUT joins are for joining partitions so that your system can still scale to infinity. Also has an in-memory database as well for unit testing that you can do TDD with built in. So if you like JQL but want infinite scale JQL, try playOrm. All 45 tests are passing. We expect 100 unit tests to be in place by the end of the year. 
Dean On 8/23/12 6:46 AM, Brian O'Neill boneil...@gmail.com wrote: We've used 'em all and… (IMHO) 1) I would avoid Thrift directly. 2) Hector is a sure bet. 3) Astyanax is the up and comer. 4) Kundera is good, but works like an ORM -- so not so good if your columns aren't defined ahead of time. -brian --- Brian O'Neill Lead Architect, Software Development Health Market Science The Science of Better Results 2700 Horizon Drive • King of Prussia, PA • 19406 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com On 8/23/12 7:40 AM, Thomas Spengler thomas.speng...@toptarif.de wrote: 4) pelops (Thrift,Java) On 08/23/2012 01:28 PM, Baskar Sikkayan wrote: I would vote for Hector :) On Thu, Aug 23, 2012 at 4:55 PM, Amit Handa amithand...@gmail.com wrote: hi, kindly let me know which java client api is more matured, and easy to use with all features(Super Columns, caching, pooling, etc) of Cassandra 1.X. Right now i come to know that following client exists: 1) Hector(Java) 2) Thrift (Java) 3) Kundera (Java) With Regards, Amit -- Thomas Spengler Chief Technology Officer TopTarif Internet GmbH, Pappelallee 78-79, D-10437 Berlin Tel.: (030) 2000912 0 | Fax: (030) 2000912 100 thomas.speng...@toptarif.de | www.toptarif.de Amtsgericht Charlottenburg, HRB 113287 B Geschäftsführer: Dr. Rainer Brosch, Dr. 
Carolin Gabor - -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: Cassandra API Library.
Ha… how could I forget? =) Adding it now. -brian From: Robin Verlangen ro...@us2.nl Reply-To: user@cassandra.apache.org Date: Thursday, August 23, 2012 9:56 AM To: user@cassandra.apache.org Subject: Re: Cassandra API Library. @Brian: You're missing PhpCassa (PHP library) With kind regards, Robin Verlangen Software engineer W http://www.robinverlangen.nl E ro...@us2.nl Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.
2012/8/23 Hiller, Dean dean.hil...@nrel.gov No problem, if you like SQL at all and don't mind adding a PARTITIONS clause, we have a raw ad-hoc layer (if you have properly added the metadata, which the ORM objects do for you, but which can also be done manually). You get a query like this: PARTITIONS p('account56') SELECT tr FROM Trades as tr WHERE tr.price > 70; So it queries just the partition of the Trades table. We are still investigating how large partitions can be, but we know it is quite large from previous nosql projects. Dean
A Big Data Trifecta: Storm, Kafka and Cassandra
Philip, I figured I would reply via blog post. =) http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html That blog post shows how we pieced together Kafka and Cassandra (via Storm). With LinkedIn behind Kafka, it is well supported. They use it in production. (and most likely we will too =) Let me know if you end up using it. Thus far, I think it pairs nicely with Cassandra, but we don't have it in production yet. -brian On Fri, Aug 3, 2012 at 3:41 PM, Milind Parikh milindpar...@gmail.com wrote: Kafka is relatively stable and has an active, well-supported newsgroup as well. As discussed by Brian, you would be inverting the store-then-process paradigm. Essentially, in your original approach you are storing the messages first and then processing them after the fact. In the Kafka model, you would process the messages as they come in. Since you are thinking about parallelism anyway, I trust that your processing paradigm is inherently parallelizable. Regards Milind On Fri, Aug 3, 2012 at 12:22 PM, Philip Nelson philipomailbox-c...@yahoo.com wrote: Brian -- thanks. We were looking to do the same thing, but in the end decided to go with Kafka. Given your throughput requirements, Kafka might be a good option for you as well. This might be off-topic, so I'll keep it short. Kafka is reasonably stable? Mature (I see it's in the Incubator)? Relative to Cassandra? Philip -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: How to process new rows in parallel?
If you are deleting the messages after processing, it sounds like you are using Cassandra as a work queue. Here are some links for implementing a distributed queue in Cassandra: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Distributed-work-queues-td5226248.html http://comments.gmane.org/gmane.comp.db.cassandra.user/16633 There is a placeholder on the use cases wiki for this, but no info: http://wiki.apache.org/cassandra/UseCases#A_distributed_Priority_Job_Queue We were looking to do the same thing, but in the end decided to go with Kafka. Given your throughput requirements, Kafka might be a good option for you as well. -brian On Fri, Aug 3, 2012 at 2:18 PM, Philip Nelson philipomailbox-c...@yahoo.com wrote: Hello, I am using a Column Family in Cassandra to store incoming messages, which arrive at a high rate (100s of thousands per second). I then have a process wake up periodically to work on those messages, and then delete them. I'd like to understand how I could have multiple processes running, each pulling off a bunch of messages in parallel. It would be nice to be able to add processes dynamically, and not have to explicitly assign message ranges to various processes. Any suggestions on how to ensure that each process pulls off a different bunch of messages? Any recommended design patterns? I was going to look at qsandra too, for inspiration. Would this be worthwhile? If this was a relational database, I would have the processes lock the table (or perhaps a row), set flags on a row indicating that it's being processed, and then unlock. Processes would choose messages by SELECTing on unflagged messages. I'm not sure how this might map to Cassandra. I realise it may not. Even if I configure the cluster such that setting a flag on a row requires all nodes to be written, two processes could still race setting that flag, right? I am open to the idea that it might help to store the messages in wide rows, if that helps.
Thanks, Philip -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
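The lock-free alternative discussed in those distributed-queue links can be sketched as follows: shard the queue into a fixed number of wide rows ("buckets") and have each worker deterministically claim a disjoint subset, so no two workers ever contend for the same row. This is a minimal Python illustration; the bucket count and assignment scheme are assumptions for the sketch, not a tested production design:

```python
# Hypothetical sketch: shard a Cassandra work queue into row "buckets" so
# that N workers can drain disjoint buckets without locking or coordination.
NUM_BUCKETS = 32  # number of wide rows acting as queue shards (assumed)

def bucket_for(message_id: str) -> int:
    """Writers place each incoming message into a bucket derived from its id."""
    return hash(message_id) % NUM_BUCKETS

def buckets_for_worker(worker_index: int, worker_count: int) -> list:
    """Worker w of W claims every W-th bucket; together the workers cover
    all buckets exactly once, so no two workers read the same row."""
    return list(range(worker_index, NUM_BUCKETS, worker_count))
```

Adding a worker means redistributing bucket ownership, but at no point do two workers need to race on a flag the way the relational lock-the-table approach would.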
Re: How to manually build and maintain secondary indexes
Alon, We came to the same conclusion regarding secondary indexes, and instead of using them we implemented our own wide-row indexing capability and open-sourced it. It's available here: https://github.com/hmsonline/cassandra-indexing We still have challenges rebuilding indexes, etc. It doesn't address all of your concerns, but I tried to capture the motivation behind our implementation here: http://brianoneill.blogspot.com/2012/03/cassandra-indexing-good-bad-and-ugly.html -brian -- Brian O'Neill Lead Architect, Software Development Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406 p: 215.588.6024 www.healthmarketscience.com On 7/26/12 2:05 PM, Alon Pilberg alo...@taboola.com wrote: Hello, My company is working on a transition of our relational data model to Cassandra. Naturally, one of the basic demands is to have secondary indexes to answer queries quickly according to the application's needs. After looking at Cassandra's native support for secondary indexes, we decided not to use them due to the poor performance for high-cardinality values. Instead, we decided to implement secondary indexes manually. Some searching led us to http://www.anuff.com/2010/07/secondary-indexes-in-cassandra.html which details a schema for such indexes. However, the method employed there specifically adds an index-entries column family, whereas it seems like only 2 CFs are needed - one for the items and one for the indexes (assuming one has access to both old and new values when updating an item). The article actually mentions that the obvious solution, "for a number of reasons related to Cassandra's model of eventual consistency ... will not reliably work" and that "it's a really good idea to make sure you understand why this CF is necessary." However, no additional information is provided on what might be a critical issue, as dealing with corrupt indexes in a large production environment is sure to be a nightmare.
What are the community's thoughts on this matter? Given the writer's credentials in the Cassandra realm, specifically regarding indexes, I'm inclined not to ignore his remarks. References to documents / systems that implement similar indexes would be greatly appreciated as well. - alon
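For what it's worth, the bookkeeping in the two-CF scheme Alon describes (items plus one index CF, relying on access to the old value during an update) can be sketched abstractly. The in-memory dict below stands in for the index column family; this illustrates why the updater needs the old value, and deliberately does not address the eventual-consistency concern the Anuff article raises:

```python
# Hypothetical sketch of a manual wide-row index: the index "CF" maps an
# indexed value (row key) to the item row keys that carry it (columns).
# Updating an item must delete the stale index entry and add the new one,
# which is why the writer needs both the old and the new value.
class WideRowIndex:
    def __init__(self):
        self.index = {}  # indexed value -> set of item row keys

    def on_item_write(self, item_key, old_value, new_value):
        if old_value is not None:            # remove the stale entry
            keys = self.index.get(old_value)
            if keys is not None:
                keys.discard(item_key)
                if not keys:
                    del self.index[old_value]
        self.index.setdefault(new_value, set()).add(item_key)

    def lookup(self, value):
        """All item keys currently indexed under the given value."""
        return self.index.get(value, set())
```

In a real cluster the delete and the insert are two separate mutations with no transaction around them, which is exactly where the "will not reliably work" caveat bites.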
An experiment using Spring Data w/ Cassandra (initially via JPA/Kundera)
This is just an FYI. I experimented w/ Spring Data JPA w/ Cassandra leveraging Kundera. It sort of worked: https://github.com/boneill42/spring-data-jpa-cassandra http://brianoneill.blogspot.com/2012/07/spring-data-w-cassandra-using-jpa.html I'm now working on a pure Spring Data adapter using Astyanax: https://github.com/boneill42/spring-data-cassandra I'll keep you posted. (Thanks to all those that helped out w/ advice) -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: Trigger and customized filter
While Jonathan and crew work on the infrastructure to support triggers: https://issues.apache.org/jira/browse/CASSANDRA-4285 We have a project going over here that provides a trigger-like capability: https://github.com/hmsonline/cassandra-triggers/ https://github.com/hmsonline/cassandra-triggers/wiki/GettingStarted We are working on enhancements that would support synchronous triggers w/ javascript. For now, triggers are processed asynchronously, and you implement a Java interface. -brian On Tue, Jul 10, 2012 at 9:24 AM, Felipe Schmidt felipef...@gmail.com wrote: Does anyone know something about the following questions? 1. Does Cassandra support customized filters? A customized filter means the programmer can define his desired filter to select the data. 2. Does Cassandra support triggers? A trigger has the same meaning as in an RDBMS. Thanks in advance. Regards, Felipe Mathias Schmidt (Computer Science UFRGS, RS, Brazil) -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
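The asynchronous, implement-an-interface model Brian describes might look roughly like this. This is an illustrative Python sketch of the general shape only, not the cassandra-triggers API; all class and method names here are invented:

```python
import queue
import threading

# Hypothetical shape of an asynchronous trigger hook: each mutation is
# queued, and registered trigger callbacks run on a worker thread, off the
# write path, so a slow or failing trigger cannot block the insert itself.
class TriggerDispatcher:
    def __init__(self):
        self.triggers = []
        self.events = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def register(self, trigger):
        """trigger: callable(column_family, row_key), invoked per mutation."""
        self.triggers.append(trigger)

    def fire(self, column_family, row_key):
        # Called after a successful write; returns immediately.
        self.events.put((column_family, row_key))

    def _drain(self):
        while True:
            cf, key = self.events.get()
            for trigger in self.triggers:
                trigger(cf, key)
            self.events.task_done()
```

A synchronous variant would invoke the callbacks inline on the write path instead of queueing, which is why scriptable synchronous triggers are the harder enhancement.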
Re: Cassandra and Tableau
Robin, We have the same issue right now. We use Tableau for all of our reporting needs, but we couldn't find any acceptable bridge between it and Cassandra. We ended up using cassandra-triggers to replicate the data to Oracle. https://github.com/hmsonline/cassandra-triggers/ Let us know if you get things set up with a direct connection. We'd be *very* interested in helping out if you find a way to do it. -brian On Fri, Jul 6, 2012 at 5:31 AM, Robin Verlangen ro...@us2.nl wrote: Hi there, Is there anyone out there who's using Tableau in combination with a Cassandra cluster? There seems to be no standard solution to connect, at least I couldn't find one. Does anyone know how to tackle this problem? With kind regards, Robin Verlangen Software engineer W http://www.robinverlangen.nl E ro...@us2.nl -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: which high level Java client
FWIW, We keep most of our system level integrations behind REST using Virgil: https://github.com/hmsonline/virgil When a lower-level integration is necessary we use Hector, but recently we've started using Astyanax and plan to port our Hector dependencies over to Astyanax when given a chance. I've also been looking to implement a Spring Data JPA adaptor like what is available for MongoDB. https://github.com/boneill42/spring-data-mongodb I've forked the SpringSource Cassandra repo here if anyone wants to help out: https://github.com/boneill42/spring-data-cassandra -brian On Thu, Jun 28, 2012 at 9:02 AM, Vivek Mishra mishra.v...@gmail.com wrote: Would like to add one more https://github.com/impetus-opensource/Kundera . Next release is planned with many distinguishing features. -Vivek On Thu, Jun 28, 2012 at 6:23 PM, Sasha Dolgy sdo...@gmail.com wrote: Not following this thread too much, but there is also https://github.com/Netflix/astyanax/ Astyanax is currently in use at Netflix. Issues generally are fixed as quickly as possible and releases done frequently. -sd On Thu, Jun 28, 2012 at 2:39 PM, Poziombka, Wade L wade.l.poziom...@intel.com wrote: I use Pelops and have been very happy. In my opinion the interface is cleaner than that of Hector. I personally do like the serializer business. -Original Message- From: Radim Kolar [mailto:h...@filez.com] Sent: Thursday, June 28, 2012 5:06 AM To: user@cassandra.apache.org Subject: Re: which high level Java client i do not have experience with other clients, only hector. But timeout management in hector is really broken. If you expect your nodes to timeout often (for example, if you are using WAN) better to try something else first. -- Sasha Dolgy sasha.do...@gmail.com -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: Ball is rolling on High Performance Cassandra Cookbook second edition
RE: API method signatures changing That triggers another thought... What terminology will you use in the book to describe the data model? CQL? When we wrote the RefCard on DZone (http://refcardz.dzone.com/refcardz/apache-cassandra), we intentionally favored/used CQL terminology. On advisement from Jonathan and Kris Hahn, we wanted to start the process of sunsetting the legacy terms (keyspace, column family, etc.) in favor of the more familiar CQL terms (schema, table, etc.). I've gone on record (http://css.dzone.com/articles/new-refcard-apache-cassandra) in favor of the switch, but it is probably something worth noting in the book, since that terminology does not yet align with all the client APIs (e.g. Hector, Astyanax, etc.). I'm not sure when the client APIs will catch up to the new terminology, but we may want to inquire so as to future-proof the recipes as much as possible. -brian On Wed, Jun 27, 2012 at 4:18 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Wed, Jun 27, 2012 at 3:08 PM, Courtney Robinson court...@crlog.info wrote: Sounds good. One thing I'd like to see is more coverage of Cassandra internals. Out of the box Cassandra's great, but having a little inside knowledge can be very useful because it helps you design your applications to work with Cassandra; rather than having to later make endless optimizations that could probably have been avoided had you done your implementation slightly differently. Another thing that may be worth adding would be a recipe that showed an approach to evaluating Cassandra for your organization/use case. I realize that's going to vary on a case by case basis, but one thing I've noticed is that some people dive in without really thinking through whether Cassandra is actually the right fit for what they're doing. It sort of becomes a hammer for anything that looks like a nail.
On Tue, Jun 26, 2012 at 10:25 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Hello all, It has not been very long since the first book was published, but several things have been added to Cassandra and a few things have changed. I am putting together a list of changed content, for example features like the old per-column-family memtable flush settings versus the new system with the global variable. My editors have given me the green light to grow the second edition from ~200 pages currently up to 300 pages! This gives us the ability to add more items/sections to the text. Some things were missing from the first edition, such as Hector support. Nate has offered to help me in this area. Please feel free to contact me with any ideas and suggestions of recipes you would like to see in the book. Also get in touch if you want to write a recipe. Several people added content to the first edition and it would be great to see that type of participation again. Thank you, Edward -- Courtney Robinson court...@crlog.info http://crlog.info 07535691628 (No private #s) Thanks for the comments. Yes, the INTERNALS chapter was a bit tricky. The challenge of writing about internals is that it goes stale fairly quickly. I was considering writing a partitioner for the internals chapter, but then I thought about it more: 1) It's hard 2) The APIs can change. (They work the same way across versions but they may have a different signature etc.) 3) 99.99% of people should be using the random partitioner :) But I agree the internals chapter can be made much stronger than it is. The recipe format is strict. It naturally conflicts with the typical use-case style, in which you write a good amount of text talking about the problem domain, previous solutions, and bragging about company X. We can not do that with the recipe style, but we can do our best to make the recipes as real world as possible.
I tried to do that throughout the text; you do not find many examples like 'writing foo records to bar column families'. However, the format does not allow the extensive text blocks mentioned above, so it is difficult to set the stage for a complex and detailed real-world problem. Still, I think for some examples we can take the next step and make the recipe more real-world practical and more use-case-like. -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Indexing JSON in Cassandra
I know we had this conversation over on the dev list a while back: http://www.mail-archive.com/dev@cassandra.apache.org/msg03914.html I just wanted to let people know that we added the capability to our cassandra-indexing extension. http://brianoneill.blogspot.com/2012/06/indexing-json-in-cassandra.html Let us know if you have any trouble with it. -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
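As a rough illustration of what indexing a JSON document involves, a nested document can be flattened into dotted-path/value pairs, each of which could then become a wide-row index entry. This sketch is hypothetical and is not taken from the cassandra-indexing code:

```python
import json

# Hypothetical sketch: flatten a nested JSON document into
# (dotted-path, value) leaves, so each leaf can be indexed individually.
def flatten_json(document: str) -> dict:
    """Return {"a.b": value, ...} for every leaf in the JSON document."""
    def walk(node, prefix, out):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, prefix + key + ".", out)
        elif isinstance(node, list):
            for i, value in enumerate(node):
                walk(value, prefix + str(i) + ".", out)
        else:
            out[prefix[:-1]] = node  # strip the trailing dot
    out = {}
    walk(json.loads(document), "", out)
    return out
```

Each resulting ("address.city", "kop")-style pair maps naturally onto the wide-row index layout discussed in the earlier indexing threads.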
Re: Server Side Logic/Script - Triggers / StoreProc
Praveen, We are certainly interested. To get things moving we implemented an add-on for Cassandra to demonstrate the viability (using AOP): https://github.com/hmsonline/cassandra-triggers Right now the implementation executes triggers asynchronously, allowing you to implement a Java interface and plug in your own Java class that will get called for every insert. Per the discussion on 1311, we intend to extend our proof of concept to be able to invoke scripts as well. (minimally we'll enable javascript, but we'll probably allow for ruby and groovy as well) -brian On Apr 22, 2012, at 12:23 PM, Praveen Baratam wrote: I found that Triggers are coming in Cassandra 1.2 (https://issues.apache.org/jira/browse/CASSANDRA-1311) but no mention of any StoreProc-like pattern. I know this has been discussed so many times but never met with any initiative. Even Groovy was staged out of the trunk. Cassandra is great for logging and as such will be infinitely more useful if some logic can be pushed into the Cassandra cluster, nearer to the location of the data, to generate a materialized view useful for applications. Server-side scripts/routines in distributed databases could soon prove to be the differentiating factor. Let me reiterate things with a use case. In our application we store time series data in wide rows with a TTL set on each point to prevent data from growing beyond acceptable limits. Still, the data size can be a limiting factor in moving all of it from the cluster node to the querying node and then to the application via thrift for processing and presentation. Ideally we should process the data on the residing node and pass only the materialized view of the data upstream. This should be trivial if Cassandra implements some sort of server-side scripting and CQL semantics to call it. Is anybody else interested in a similar feature? Is it being worked on? Are there any alternative strategies to this problem?
Praveen -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: cassandra gui
If you give Virgil a try, let me know how it goes. The REST layer is pretty solid, but the gui is just a PoC which makes it easy to see what's in the CFs during development/testing. (It's only a couple hundred lines of ExtJS code built on the REST layer) We had plans to add CQL to the gui for CRUD, but never got around to it. -brian On Fri, Mar 30, 2012 at 5:20 PM, Ben McCann b...@benmccann.com wrote: If you want a REST interface and a GUI then Virgil may be interesting. I just came across it and haven't tried it myself yet. http://brianoneill.blogspot.com/2011/10/virgil-gui-and-rest-layer-for-cassandra.html On Fri, Mar 30, 2012 at 2:15 PM, John Liberty libjac...@gmail.com wrote: I made some updates to a cassandra-gui project I found, which seemed to be stuck at version 0.7, and posted it to github: https://github.com/libjack/cassandra-gui Besides updating it to work with version 1.0+, the main improvements I added were to obey validation types, including column metadata, when displaying or accepting data. This includes support for Composite types, both keys and columns. I often create CFs with non-string keys, columns, and values, and especially Composite types... And I need a tool to browse/verify and then add/edit test data, and this works quite well for me. -- John Liberty libjac...@gmail.com (585) 466-4249 -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Cassandra Triggers Capability published out to GitHub
FYI -- http://brianoneill.blogspot.com/2012/03/cassandra-triggers-for-indexing-and.html https://github.com/hmsonline/cassandra-triggers Feedback welcome. Contribution and involvement is even better. ;) -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Virgil Moved (and Cassandra-Triggers coming soon)
FYI -- we moved Virgil to Github to make it easier for people to contribute. https://github.com/hmsonline/virgil Also, we created an organization profile (hmsonline) to house all of our storm/cassandra related work. https://github.com/hmsonline Under that profile, we'll be releasing cassandra-triggers. It is an AOP-based trigger solution that provides a simple trigger/event-log that can be used for data replication and indexing, reacting to column-family mutations. https://github.com/hmsonline/cassandra-triggers -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Remote Hadoop Job Deployment
FYI... we finally got around to releasing a version of Virgil that includes the ability to deploy jobs to remote Hadoop clusters running against Cassandra Column Families. http://brianoneill.blogspot.com/2012/01/virgil-remote-hadoop-job-deployment-via.html This has enabled an army of people to write and deploy Hadoop jobs against our Cassandra cluster. (Literally, we'll probably have 100 M/R jobs by the end of the month) Yes, we still plan to implement a javascript engine as well, but first we intend to tackle Triggers for indexing, data replication and materialized views. -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: Cassandra to Oracle?
On 1/20/2012 7:55 PM, Mohit Anchlia wrote: I think the problem arises when you have data in a column that you need to run an ad-hoc query on and which is not denormalized. In most cases it's difficult to predict the type of query that would be required. Another way of solving this could be to index the fields in a search engine. On Fri, Jan 20, 2012 at 7:37 PM, Maxim Potekhin potek...@bnl.gov wrote: What makes you think that an RDBMS will give you acceptable performance? I guess you will try to index it to death (because otherwise the ad hoc queries won't work well, if at all), and at this point you may be hit with a performance penalty. It may be a good idea to interview users and build denormalized views in Cassandra, maybe on a separate look-up cluster. A few percent of users will be unhappy, but you'll find it hard to do better. I'm talking from my experience with an industrial-strength RDBMS which doesn't scale very well for what you call ad-hoc queries. Regards, Maxim On 1/20/2012 9:28 AM, Brian O'Neill wrote: I can't remember if I asked this question before, but We're using Cassandra as our transactional system, and building up quite a library of map/reduce jobs that perform data quality analysis, statistics, etc. (>100 jobs now) But... we are still struggling to provide an ad-hoc query mechanism for our users. To fill that gap, I believe we still need to materialize our data in an RDBMS. Anyone have any ideas? Better ways to support ad-hoc queries? Effectively, our users want to be able to select count(distinct Y) from X group by Z, where Y and Z are arbitrary columns of rows in X. We believe we can create column families with different key structures (using Y and Z as row keys), but some column names we don't know / can't predict ahead of time. Are people doing bulk exports? Anyone trying to keep an RDBMS in synch in real-time?
-brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: Cassandra to Oracle?
Eric, Thinking even a little bit more about this... We could go the distributed counter approach with additional column families to support the ad hoc queries, but use triggers to implement it. That would allow us to keep the client-side code thin, but achieve the same result... without necessarily replicating to Oracle for the attributes we can predict. Maybe we'll take a look at that this week as well. thanks again, brian On Jan 21, 2012, at 8:35 AM, Eric Czech wrote: Hi Brian, We're trying to do the exact same thing and I find myself asking very similar questions. Our solution though has been to find what kind of queries we need to satisfy on a preemptive basis and leverage cassandra's built-in indexing features to build those result sets beforehand. The whole point here then is that our gain in cost efficiency comes from the fact that disk space is really cheap and serving up result sets from disk is fast provided that those result sets are pre-calculated and reasonable in size (even if we don't know all the values upfront). For example, when you're writing to your CF X, you could also make writes to column family A like this: - write A[Z][Y] = 1 where A = CF, Z = key, Y = column Answering the question select count(distinct Y) from X group by Z then is as simple as getting a list of rows for CF A and counting the distinct values of Y and grouping them by Z on the client side. Alternatively, there are much better ways to do this with composite keys/columns and distributed counters but it's hard for me to tell what makes the most sense without knowing more about your data / product requirements. 
Either way, I feel your pain in getting things like this to work with Cassandra when the domain of values for a particular key or column is unknown and secondary indexing doesn't apply, but I'm positive there's a much cheaper way to make it work than paying for Oracle if you have at least a decent idea about what kinds of queries you need to satisfy (which it sounds like you do). To Maxim's death by index point, you could certainly go overboard with this concept and cross a pricing threshold with some other database technology, but I can't imagine you're even close to being in that boat given how concise your query needs seem to be. If you're interested, I'd be happy to share how we do these things to save lots of money over commercial databases and try to relate that to your use case, but if not, then I hope at least some of that this useful for you. Good luck either way! On Fri, Jan 20, 2012 at 9:27 PM, Maxim Potekhin potek...@bnl.gov wrote: I certainly agree with difficult to predict. There is a Danish proverb, which goes it's difficult to make predictions, especially about the future. My point was that it's equally difficult with noSQL and RDBMS. The latter requires indexing to operate well, and that's a potential performance problem. On 1/20/2012 7:55 PM, Mohit Anchlia wrote: I think the problem stems when you have data in a column that you need to run adhoc query on which is not denormalized. In most cases it's difficult to predict the type of query that would be required. Another way of solving this could be to index the fields in search engine. On Fri, Jan 20, 2012 at 7:37 PM, Maxim Potekhinpotek...@bnl.gov wrote: What makes you think that RDBMS will give you acceptable performance? I guess you will try to index it to death (because otherwise the ad hoc queries won't work well if at all), and at this point you may be hit with a performance penalty. 
It may be a good idea to interview users and build denormalized views in Cassandra, maybe on a separate look-up cluster. A few percent of users will be unhappy, but you'll find it hard to do better. I'm talking from my experience with an industrial-strength RDBMS which doesn't scale very well for what you call ad hoc queries.

Regards,
Maxim

On 1/20/2012 9:28 AM, Brian O'Neill wrote:

I can't remember if I asked this question before, but...

We're using Cassandra as our transactional system, and building up quite a library of map/reduce jobs that perform data quality analysis, statistics, etc. (100 jobs now)

But... we are still struggling to provide an ad hoc query mechanism for our users. To fill that gap, I believe we still need to materialize our data in an RDBMS.

Anyone have any ideas? Better ways to support ad hoc queries?

Effectively, our users want to be able to "select count(distinct Y) from X group by Z", where Y and Z are arbitrary columns of rows in X. We believe we can create column families with different key structures (using Y and Z as row keys), but some column names we don't know / can't predict ahead of time.

Are people doing bulk exports? Anyone trying to keep an RDBMS in sync in real time?

-brian

--
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
Re: Cassandra to Oracle?
Good point, Milind. (RE: client-side AOP)

I was thinking server-side to stay with the trigger concept, but we could just as easily intercept on the client side. We'd just need to make sure that all clients got the AOP code injected (including all of our map/reduce jobs). If we get the point-cut right (using the Cassandra.Iface), we could probably make it portable. People could drop it in client-side or server-side.

-brian

On Jan 22, 2012, at 9:45 AM, Milind Parikh wrote:

My bad: ~s/X:X-Value/Y:Y-Value/ after rereading the SELECT.

/*** sent from my android...please pardon occasional typos as I respond @ the speed of thought ***/

On Jan 22, 2012 6:40 AM, Milind Parikh <milindpar...@gmail.com> wrote:

The composite-key approach with counters would work very well in this case. It will also obviate the concern of not knowing the exact column names a priori... although for efficiency, you might want to look at maintaining a secondary cache-like CF for lookups.

Depending on your data patterns (so as not to hit the 2B-column limit) and actual queries, you could store each Z as one row, composite-key on Z-value + X:X-value, and then use a counter column. Other optimizations may be possible.

If you're using AOP, as I read it, there's really no need to intercept your own writes at the C* level; instead do it (use AOP) at the client level. Your migration also needs to be attended to, and might need a MR job first plus AOP-intercepted writes.

Hth,
Milind

On Jan 22, 2012 4:42 AM, Brian O'Neill <boneil...@gmail.com> wrote:

Thanks for all the ideas... Since we can't predict all the values, we actually cut over to Oracle...

On Jan 21, 2012, at 8:35 AM, Eric Czech wrote: Hi Brian, We're trying to do the exact same...

--
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/
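The client-side interception idea above can be sketched without any AOP framework at all: wrap the client object so every write to CF X also maintains the index CF, and application code never knows. This is a toy illustration, not Cassandra API: `FakeCassandraClient` is a hypothetical stand-in for a Thrift `Cassandra.Iface` client, and the CF names "X" and "A" follow Eric's earlier example.

```python
class FakeCassandraClient:
    """Hypothetical stand-in for a Cassandra client; records writes."""
    def __init__(self):
        self.writes = []

    def insert(self, cf, key, column, value):
        self.writes.append((cf, key, column, value))

class InterceptingClient:
    """Client-side interception (the AOP idea from the thread): every
    insert into CF X transparently mirrors an index write A[Z][Y] = 1,
    so the calling code stays thin and unaware of the extra write."""
    def __init__(self, client):
        self._client = client

    def insert(self, cf, key, column, value):
        self._client.insert(cf, key, column, value)
        if cf == "X":
            # Mirror the write: A[Z][Y] = 1, where Z = key, Y = column.
            self._client.insert("A", key, column, 1)

raw = FakeCassandraClient()
client = InterceptingClient(raw)
client.insert("X", "z1", "y1", "payload")
print(raw.writes)  # [('X', 'z1', 'y1', 'payload'), ('A', 'z1', 'y1', 1)]
```

With real AOP (or a dynamic proxy over the interface) the same effect is achieved without hand-writing the wrapper, which is what makes the "inject it into every client" concern the main cost of this approach.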
Cassandra to Oracle?
I can't remember if I asked this question before, but...

We're using Cassandra as our transactional system, and building up quite a library of map/reduce jobs that perform data quality analysis, statistics, etc. (100 jobs now)

But... we are still struggling to provide an ad hoc query mechanism for our users. To fill that gap, I believe we still need to materialize our data in an RDBMS.

Anyone have any ideas? Better ways to support ad hoc queries?

Effectively, our users want to be able to "select count(distinct Y) from X group by Z", where Y and Z are arbitrary columns of rows in X. We believe we can create column families with different key structures (using Y and Z as row keys), but some column names we don't know / can't predict ahead of time.

Are people doing bulk exports? Anyone trying to keep an RDBMS in sync in real time?

-brian

--
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/
Re: Cassandra to Oracle?
Not terribly large: ~50 million rows, each row with ~100-300 columns. But big enough that a map/reduce job takes longer than users would like.

Actually, maybe that is another question... Does anyone have any benchmarks running map/reduce against Cassandra? (Even a simple count or copy-CF benchmark would be helpful.)

-brian

On Fri, Jan 20, 2012 at 12:41 PM, Zach Richardson <j.zach.richard...@gmail.com> wrote:

How much data do you think you will need ad hoc query ability for?

On Fri, Jan 20, 2012 at 11:28 AM, Brian O'Neill <b...@alumni.brown.edu> wrote:

I can't remember if I asked this question before, but...

--
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/
Ad Hoc Queries
Interesting articles... (changing the subject line to broaden the scope)

http://codemonkeyism.com/dark-side-nosql/
http://www.reportsanywhere.com/pebble/2010/04/16/127143774.html

These articulate the exact challenge we're trying to overcome.

-brian

On Fri, Jan 20, 2012 at 12:57 PM, Brian O'Neill <b...@alumni.brown.edu> wrote:

Not terribly large: ~50 million rows, each row with ~100-300 columns. But big enough that a map/reduce job takes longer than users would like...
-brian

--
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/
Triggers?
Anyone know if there is any activity to deliver triggers?

I saw this quote: http://www.readwriteweb.com/cloud/2011/10/cassandra-reaches-10-whats-nex.php

"Ellis says that he's just starting to think about the post-1.0 world for Cassandra. Two features do come to mind, though, that missed the boat for 1.0 and that were on a lot of wishlists. The first is triggers. Database triggers let you define rules in the database, such as updating table X when table Y is updated. Ellis says that triggers will be necessary for Cassandra as it grows in popularity. As more tools use it, that's something more users are going to be asking for."

But grepping the trunk code, I don't see any work on triggers.

-brian

--
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/
Copy a column family?
What is the fastest way to copy a column family? We were headed down the map/reduce path, but that seems silly. Any file-level mechanisms for this?

-brian

--
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/
Re: Copy a column family?
Excellent. We'll give it a try. Thanks, Brandon.

-brian

Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/

On 1/9/12 10:31 AM, Brandon Williams <dri...@gmail.com> wrote:

On Mon, Jan 9, 2012 at 9:14 AM, Brian O'Neill <b...@alumni.brown.edu> wrote:

What is the fastest way to copy a column family? We were headed down the map/reduce path, but that seems silly. Any file-level mechanisms for this?

Copy all the sstables 1:1, renaming them to the new CF name. Then create the schema for the CF.

-Brandon
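Brandon's recipe amounts to a file rename over the data directory. A rough sketch, with several assumptions baked in: it assumes the pre-CQL on-disk layout of that era, where sstable components live in one keyspace directory and are named with a `<CF>-...` prefix (e.g. `X-hc-1-Data.db`); on a live node you would flush or snapshot first, and create the schema for the new CF afterwards.

```python
import os
import shutil

def copy_column_family(data_dir, src_cf, dst_cf):
    """Copy every sstable component of src_cf, renaming the CF-name
    prefix to dst_cf. Illustrative only: assumes files named like
    '<CF>-<version>-<generation>-<component>' in one keyspace directory."""
    copied = []
    for name in os.listdir(data_dir):
        # Match 'X-hc-1-Data.db' but not 'XY-hc-1-Data.db'.
        if name.startswith(src_cf + "-"):
            new_name = dst_cf + name[len(src_cf):]
            shutil.copy2(os.path.join(data_dir, name),
                         os.path.join(data_dir, new_name))
            copied.append(new_name)
    return copied
```

The same effect could be had with `cp` and a shell loop; the point is simply that no data is rewritten, so this is as fast as the filesystem copy itself.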