Re: are there any free Cassandra -> ElasticSearch connector / plugin ?

2016-10-13 Thread Brian O'Neill
I haven't used it yet, but
https://github.com/vroyer/elassandra

-- 
Brian O'Neill
Principal Architect @ Monetate
m: 215.588.6024
bone...@monetate.com


> On Oct 13, 2016, at 6:02 PM, Eric Ho <e...@analyticsmd.com> wrote:
> 
> I don't want to change my code to write into C* and then to ES.
> So, I'm looking for some sort of a sync tool that will sync my C* table into 
> ES and it should be smart enough to avoid duplicates or gaps.
> Is there such a tool / plugin ?
> I'm using stock apache Cassandra 3.7.
> I know that some premium Cassandra distributions have ES built in or integrated,
> but I can't afford premium right now...
> Thanks.
> 
> -eric ho
> 
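A sync tool aside, the duplicate problem described above is usually solved by deriving the Elasticsearch document `_id` deterministically from the Cassandra primary key, so re-indexing the same row overwrites a document instead of duplicating it. A minimal sketch of that idea (the row shape, key columns, and the dict standing in for an ES index are all illustrative, not from any real connector):

```python
# Sketch: idempotent Cassandra -> Elasticsearch sync via deterministic doc IDs.
# Re-running the sync over the same rows overwrites documents rather than
# creating duplicates, because the _id is a stable function of the primary key.

def doc_id(row, key_columns):
    """Build a stable Elasticsearch _id from the Cassandra primary key columns."""
    return "|".join(str(row[c]) for c in key_columns)

def sync_rows(rows, key_columns, index):
    """Upsert each row into `index` (a dict standing in for an ES index)."""
    for row in rows:
        # Real code would use the ES bulk API; keying by doc_id keeps it idempotent.
        index[doc_id(row, key_columns)] = row
    return index

rows = [
    {"user_id": 1, "ts": "2016-10-13", "event": "login"},
    {"user_id": 1, "ts": "2016-10-13", "event": "login"},  # duplicate delivery
]
index = sync_rows(rows, ["user_id", "ts"], {})
print(len(index))  # -> 1: the duplicate collapsed into one document
```

A real sync would page through the table with the DataStax driver and push documents with the ES bulk API, but the duplicate avoidance comes entirely from the stable `_id`.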



Re: Support for ad-hoc query

2015-06-09 Thread Brian O'Neill
Cassandra isn't great at ad-hoc queries. Many of us have paired it with an
indexing engine like Solr or Elasticsearch (Solr is built into the DSE
solution).

As of late, I think there are a few of us exploring Spark SQL (which you
can then use via JDBC or REST).

-brian

---
Brian O'Neill 
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile • @boneill42 http://www.twitter.com/boneill42


This information transmitted in this email message is for the intended
recipient only and may contain confidential and/or privileged material. If
you received this email in error and are not the intended recipient, or the
person responsible to deliver it to the intended recipient, please contact
the sender at the email above and delete this email and any attachments and
destroy any copies thereof. Any review, retransmission, dissemination,
copying or other use of, or taking any action in reliance upon, this
information by persons or entities other than the intended recipient is
strictly prohibited.
 


From:  Srinivasa T N seen...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Tuesday, June 9, 2015 at 2:38 AM
To:  user@cassandra.apache.org user@cassandra.apache.org
Subject:  Support for ad-hoc query

Hi All,
   I have a web application running with my backend data stored in
Cassandra. Now I want to do some analysis on the stored data, which requires
some ad-hoc queries against Cassandra. How can I do that?

Regards,
Seenu.




Re: Spark SQL JDBC Server + DSE

2015-06-03 Thread Brian O'Neill

Kudos Ben. We've been tracking Zeppelin, and considered doing the same
thing.
You beat us to it. Well done.

-brian

---
Brian O'Neill 
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile • @boneill42 http://www.twitter.com/boneill42




From:  Ben Bromhead b...@instaclustr.com
Reply-To:  user@cassandra.apache.org
Date:  Tuesday, June 2, 2015 at 5:05 PM
To:  user@cassandra.apache.org
Subject:  Re: Spark SQL JDBC Server + DSE

If you want a web based notebook style approach (similar to ipython) check
out https://github.com/apache/incubator-zeppelin

And https://github.com/apache/incubator-zeppelin/pull/86

Bonus free pretty graphs!

On 1 June 2015 at 11:41, Sebastian Estevez sebastian.este...@datastax.com
wrote:
 Have you looked at job server?
 
 https://github.com/spark-jobserver/spark-jobserver
 https://www.youtube.com/watch?v=8k9ToZ4m6os
 http://planetcassandra.org/blog/post/fast-spark-queries-on-in-memory-datasets/
 
 All the best,
 
 
 Sebastián Estévez
 Solutions Architect | 954 905 8615 |
 sebastian.este...@datastax.com
 DataStax is the fastest, most scalable distributed database technology,
 delivering Apache Cassandra to the world's most innovative enterprises.
 DataStax is built to be agile, always-on, and predictably scalable to any
 size. With more than 500 customers in 45 countries, DataStax is the database
 technology and transactional backbone of choice for the world's most innovative
 companies such as Netflix, Adobe, Intuit, and eBay.
 
 On Mon, Jun 1, 2015 at 8:13 AM, Mohammed Guller moham...@glassbeam.com
 wrote:
 Brian,
 We haven't open-sourced the REST server, but we're not opposed to doing it. We
 just need to carve out some time to clean up the code and separate it from all
 the other stuff that we do in that REST server. We will try to do it in the
 next few weeks. If you need it sooner, let me know.
  
 I did consider the option of writing our own Spark SQL JDBC driver for C*,
 but it is lower on the priority list right now.
  
 
 Mohammed
  
 
 From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill
 Sent: Saturday, May 30, 2015 3:12 AM
 
 
 To: user@cassandra.apache.org
 Subject: Re: Spark SQL JDBC Server + DSE
  
 
  
 
 Any chance you open-sourced, or could open-source the REST server? ;)
 
  
 
 In thinking about it…
 
 It doesn't feel like it would be that hard to write a Spark SQL JDBC driver
 against Cassandra, akin to what they have for Hive:
 
 https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
 
 I wouldn't mind collaborating on that, if you are headed in that direction.
 
 (and then I could write the REST server on top of that)
 
  
 
 LMK,
 
  
 
 -brian
 
  
 
 ---
 Brian O'Neill 
 Chief Technology Officer
 Health Market Science, a LexisNexis Company
 215.588.6024 Mobile • @boneill42
 http://www.twitter.com/boneill42
  
 
  
 
 From: Mohammed Guller moham...@glassbeam.com
 Reply-To: user@cassandra.apache.org
 Date: Friday, May 29, 2015 at 2:15 PM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Subject: RE: Spark SQL JDBC Server + DSE
 
  
 
 Brian,
 I implemented a similar REST server last year and it works great. Now we have
 a requirement to support JDBC connectivity in addition to the REST API. We
 want to allow users to use tools like Tableau to connect to C* through the
 Spark SQL JDBC/Thrift server.
  
 
 Mohammed
  
 
 From: Brian O'Neill [mailto:boneil

Re: Spark SQL JDBC Server + DSE

2015-05-30 Thread Brian O'Neill

Any chance you open-sourced, or could open-source the REST server? ;)

In thinking about it…
It doesn't feel like it would be that hard to write a Spark SQL JDBC driver
against Cassandra, akin to what they have for Hive:
https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server

I wouldn't mind collaborating on that, if you are headed in that direction.
(and then I could write the REST server on top of that)

LMK,

-brian

---
Brian O'Neill 
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile • @boneill42 http://www.twitter.com/boneill42




From:  Mohammed Guller moham...@glassbeam.com
Reply-To:  user@cassandra.apache.org
Date:  Friday, May 29, 2015 at 2:15 PM
To:  user@cassandra.apache.org user@cassandra.apache.org
Subject:  RE: Spark SQL JDBC Server + DSE

Brian,
I implemented a similar REST server last year and it works great. Now we
have a requirement to support JDBC connectivity in addition to the REST API.
We want to allow users to use tools like Tableau to connect to C* through
the Spark SQL JDBC/Thrift server.
 

Mohammed
 

From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill
Sent: Thursday, May 28, 2015 6:16 PM
To: user@cassandra.apache.org
Subject: Re: Spark SQL JDBC Server + DSE
 

Mohammed,

 

This doesn't really answer your question, but I'm working on a new REST
server that allows people to submit SQL queries over REST, which get
executed via Spark SQL. Based on what I started here:

http://brianoneill.blogspot.com/2015/05/spark-sql-against-cassandra-example.html

 

I assume you need JDBC connectivity specifically?

 

-brian

 

---
Brian O'Neill 
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile • @boneill42 http://www.twitter.com/boneill42
 

 

From: Mohammed Guller moham...@glassbeam.com
Reply-To: user@cassandra.apache.org
Date: Thursday, May 28, 2015 at 8:26 PM
To: user@cassandra.apache.org user@cassandra.apache.org
Subject: RE: Spark SQL JDBC Server + DSE

 

Anybody out there using DSE + Spark SQL JDBC server?
 

Mohammed
 

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Tuesday, May 26, 2015 6:17 PM
To: user@cassandra.apache.org
Subject: Spark SQL JDBC Server + DSE
 
Hi,
As I understand, the Spark SQL Thrift/JDBC server cannot be used with
open-source C*. Only DSE supports the Spark SQL JDBC server.
 
We would like to find out how many organizations are using this
combination. If you do use DSE + Spark SQL JDBC server, it would be great if
you could share your experience. For example, what kinds of issues have you
run into? How is the performance? What reporting tools are you using?
 
Thank you!
 
Mohammed 
 




Re: Spark SQL JDBC Server + DSE

2015-05-28 Thread Brian O'Neill
Mohammed,

This doesn't really answer your question, but I'm working on a new REST
server that allows people to submit SQL queries over REST, which get
executed via Spark SQL. Based on what I started here:
http://brianoneill.blogspot.com/2015/05/spark-sql-against-cassandra-example.html

I assume you need JDBC connectivity specifically?

-brian
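The REST idea above reduces to: accept a SQL string in a JSON request body, hand it to Spark SQL, and serialize the rows back out as JSON. A small sketch of that request/response shape, with the Spark SQL call stubbed out (the payload fields and error handling are invented for illustration, not from the actual server):

```python
import json

def handle_sql_request(body, execute_sql):
    """Parse a JSON request like {"sql": "..."}, run it via `execute_sql`
    (in a real server this would call into Spark SQL, e.g. collect the rows
    of sqlContext.sql(sql)), and return the rows as a JSON string."""
    try:
        sql = json.loads(body)["sql"]
    except (ValueError, KeyError, TypeError):
        return json.dumps({"error": "body must be JSON with a 'sql' field"})
    rows = execute_sql(sql)  # stand-in for the Spark SQL execution
    return json.dumps({"rows": rows})

# Stubbed engine for the sketch: pretend every query returns one row.
fake_engine = lambda sql: [{"query": sql, "count": 42}]
print(handle_sql_request('{"sql": "SELECT count(*) FROM events"}', fake_engine))
```

The transport (Flask, servlet, etc.) and the real Spark SQL wiring are orthogonal to this shape; the point is that JDBC connectivity needs a very different, stateful protocol, which is why the Thrift JDBC server question keeps coming up.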

---
Brian O'Neill 
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile • @boneill42 http://www.twitter.com/boneill42




From:  Mohammed Guller moham...@glassbeam.com
Reply-To:  user@cassandra.apache.org
Date:  Thursday, May 28, 2015 at 8:26 PM
To:  user@cassandra.apache.org user@cassandra.apache.org
Subject:  RE: Spark SQL JDBC Server + DSE

Anybody out there using DSE + Spark SQL JDBC server?
 

Mohammed
 

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Tuesday, May 26, 2015 6:17 PM
To: user@cassandra.apache.org
Subject: Spark SQL JDBC Server + DSE
 
Hi,
As I understand, the Spark SQL Thrift/JDBC server cannot be used with
open-source C*. Only DSE supports the Spark SQL JDBC server.
 
We would like to find out how many organizations are using this
combination. If you do use DSE + Spark SQL JDBC server, it would be great if
you could share your experience. For example, what kinds of issues have you
run into? How is the performance? What reporting tools are you using?
 
Thank you!
 
Mohammed 
 




Re: cassandra and spark from cloudera distribution

2015-04-22 Thread Brian O'Neill
It depends on which version of Spark you are running on Cloudera.

Once you know that, have a look at the compatibility chart here:
https://github.com/datastax/spark-cassandra-connector

-brian

---
Brian O'Neill 
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile • @boneill42 http://www.twitter.com/boneill42




From:  Serega Sheypak serega.shey...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Wednesday, April 22, 2015 at 1:48 PM
To:  user user@cassandra.apache.org
Subject:  Re: cassandra and spark from cloudera distribution

We already use it. We would like to use Spark from the Cloudera distribution.
Should it work?

2015-04-22 19:43 GMT+02:00 Jay Ken jaytechg...@gmail.com:
 There is an Enterprise Edition from DataStax, where they have Spark and
 Cassandra integration.
 
 http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
 
 Thanks,
 Jay
 
 On Wed, Apr 22, 2015 at 6:41 AM, Serega Sheypak serega.shey...@gmail.com
 wrote:
 Hi, are Cassandra and Spark from Cloudera compatible?
 Where can I find the compatibility notes?
 





Re: Adhoc querying in Cassandra?

2015-04-22 Thread Brian O'Neill

+1, I think many organizations (including ours) pair Elastic Search with
Cassandra.
Use Cassandra as your system of record, then index the data with ES.

-brian

---
Brian O'Neill 
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile • @boneill42 http://www.twitter.com/boneill42




From:  Ali Akhtar ali.rac...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Wednesday, April 22, 2015 at 7:52 AM
To:  user@cassandra.apache.org
Subject:  Re: Adhoc querying in Cassandra?


You might find it better to use elasticsearch for your aggregate queries and
analytics. Cassandra is more of just a data store.

On Apr 22, 2015 4:42 PM, Matthew Johnson matt.john...@algomi.com wrote:
 Hi all,
  
 Currently we are setting up a "big" data cluster. We are only going to
 have a couple of servers to start with, but we need to be able to scale out
 quickly when usage ramps up. Previously we have used Hadoop/HBase for our big
 data cluster, but since we are starting this one on only two nodes I think
 Cassandra will be a much better fit, as Hadoop and HBase really need at least
 3 nodes to achieve any sort of resilience (ZooKeeper quorum etc.).
  
 My question is this:
  
 I have used Apache Phoenix as a JDBC layer on top of HBase, which allows me to
 issue ad-hoc SQL-style queries. (eg count the number of times users have
 clicked on a certain button after clicking a different button in the last 3
 weeks etc). My understanding is that CQL does not support this style of adhoc
 aggregate querying out of the box. Is there a recommended way to do count,
 sum, average etc without writing client code (in my case Java) every time I
 want to run one? I have been looking at projects like Drill, Spark etc that
 could potentially sit on top of Cassandra but without actually setting
 everything up and testing them it is difficult to figure out what they would
 give us.
  
 Does anyone else interactively issue adhoc aggregate queries to Cassandra, and
 if so, what stack do you use?
  
 Thanks!
 Matt
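Without a layer like Spark or Drill, aggregates end up as client code over rows already fetched from Cassandra; CQL at the time offered little beyond COUNT over a selection. A tiny sketch of what that per-query client-side work looks like, which is exactly what a generic SQL layer would save you from rewriting each time (the rows below are hypothetical):

```python
def aggregate(rows, column):
    """Count/sum/avg over one column of rows already fetched from Cassandra.
    A layer like Spark SQL or Drill would distribute and generalize this work
    instead of it being re-coded per query."""
    values = [r[column] for r in rows if r.get(column) is not None]
    count = len(values)
    total = sum(values)
    return {"count": count, "sum": total, "avg": total / count if count else None}

# Hypothetical click counts pulled from a Cassandra table.
clicks = [{"user": "a", "n": 3}, {"user": "b", "n": 5}, {"user": "c", "n": 4}]
print(aggregate(clicks, "n"))  # -> {'count': 3, 'sum': 12, 'avg': 4.0}
```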
  




Re: Adhoc querying in Cassandra?

2015-04-22 Thread Brian O'Neill
Again, agreed.

They have different usage patterns (C* heavy writes, ES heavy reads), so I would
separate them.
Solr should be sufficient. I believe DSE is a tight integration between
Solr and C*.

-brian

---
Brian O'Neill 
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile • @boneill42 http://www.twitter.com/boneill42




From:  Ali Akhtar ali.rac...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Wednesday, April 22, 2015 at 8:10 AM
To:  user@cassandra.apache.org
Subject:  Re: Adhoc querying in Cassandra?

I believe Elasticsearch has better support for scaling horizontally (by
adding nodes) than Solr does. Some benchmarks that I've looked at also show
it performing better under high load.

I probably wouldn't run them both on the same node, or you might see low
performance as they compete for resources.

What type of usage do you expect - mostly read, or mostly write?

On Wed, Apr 22, 2015 at 5:06 PM, Matthew Johnson matt.john...@algomi.com
wrote:
 Hi Ali, Brian,
  
 Thanks for the suggestion. We have previously used Solr (SolrCloud for
 distribution) for a lot of other products, presumably this will do the same
 job as ElasticSearch? Or does ElasticSearch have specifically better
 integration with Cassandra or better support for aggregate queries?
  
 Would it be an ok architecture to have a Cassandra node and a Solr/ES instance
 on each box, so they scale together? Or is it better to have separate servers
 for storage and search?
  
 Cheers,
 Matt
  
 
 From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill
 Sent: 22 April 2015 12:56
 To: user@cassandra.apache.org
 Subject: Re: Adhoc querying in Cassandra?
  
 
  
 
 +1, I think many organizations (including ours) pair Elastic Search with
 Cassandra.
 
 Use Cassandra as your system of record, then index the data with ES.
 
  
 
 -brian
 
  
 
 ---
 Brian O'Neill 
 Chief Technology Officer
 Health Market Science, a LexisNexis Company
 215.588.6024 Mobile • @boneill42 http://www.twitter.com/boneill42
  
 
  
 
 From: Ali Akhtar ali.rac...@gmail.com
 Reply-To: user@cassandra.apache.org
 Date: Wednesday, April 22, 2015 at 7:52 AM
 To: user@cassandra.apache.org
 Subject: Re: Adhoc querying in Cassandra?
 
  
 You might find it better to use elasticsearch for your aggregate queries and
 analytics. Cassandra is more of just a data store.
 
 On Apr 22, 2015 4:42 PM, Matthew Johnson matt.john...@algomi.com wrote:
 
 Hi all,
  
 Currently we are setting up a "big" data cluster. We are only going to
 have a couple of servers to start with, but we need to be able to scale out
 quickly when usage ramps up. Previously we have used Hadoop/HBase for our big
 data cluster, but since we are starting this one on only two nodes I think
 Cassandra will be a much better fit, as Hadoop and HBase really need at least
 3 nodes to achieve any sort of resilience (ZooKeeper quorum etc.).
  
 My question is this:
  
 I have used Apache Phoenix as a JDBC layer on top of HBase, which allows me to
 issue ad-hoc SQL-style queries. (eg count the number of times users have
 clicked on a certain button after clicking a different button in the last 3
 weeks etc). My understanding is that CQL does not support this style of adhoc
 aggregate querying out of the box. Is there a recommended way to do count,
 sum, average etc without writing client code (in my case Java) every time I
 want to run one? I have been looking at projects like Drill, Spark etc that
 could potentially sit on top of Cassandra but without actually setting
 everything up and testing them it is difficult to figure out what they would
 give us.
  
 Does anyone else interactively issue adhoc aggregate queries to Cassandra, and
 if so, what stack do you use

Re: Cassandra - Storm

2015-04-03 Thread Brian O'Neill

I'd recommend using Storm's State abstraction.

Check out:
https://github.com/hmsonline/storm-cassandra-cql

-brian
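For context, the value of Storm's State abstraction here is exactly-once application of batched writes even though Storm replays tuples under at-least-once delivery. A sketch of the underlying idea, with a dict standing in for a Cassandra-backed state table (the API below is illustrative, not storm-cassandra-cql's):

```python
# Sketch of the idea behind Storm/Trident "State": each batch of writes
# carries a transaction id, so a replayed batch is applied exactly once.
# The dict stands in for a Cassandra table storing (value, last_txid) per key.

def apply_batch(state, txid, updates):
    """Apply {key: increment} updates unless this txid was already applied."""
    for key, inc in updates.items():
        value, last_txid = state.get(key, (0, None))
        if last_txid == txid:
            continue  # replayed batch: already applied to this key, skip
        state[key] = (value + inc, txid)
    return state

state = {}
apply_batch(state, txid=1, updates={"clicks": 10})
apply_batch(state, txid=1, updates={"clicks": 10})  # replay of the same batch
print(state)  # -> {'clicks': (10, 1)}
```

This is why writing to Cassandra from a raw bolt (as in the question) is harder to get right than using the State abstraction: the bolt must handle replayed tuples itself.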

---
Brian O'Neill 
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile • @boneill42 http://www.twitter.com/boneill42




From:  Vanessa Gligor vanessagli...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Friday, April 3, 2015 at 1:13 AM
To:  user@cassandra.apache.org
Subject:  Cassandra - Storm

Hi all,

Did anybody use Cassandra for the tuple storage in Storm? I have this
scenario: I have a spout (getting messages from RabbitMQ) and I want to save
all these messages in Cassandra using a bolt. What is the best choice
regarding the connection to the DB? I have read about the Hector API. I used it,
but so far I haven't been able to add a new row in a column family.

Any help would be appreciated.

Regards,
Vanessa.




Re: Frequent timeout issues

2015-04-01 Thread Brian O'Neill

Are you using the storm-cassandra-cql driver?
(https://github.com/hmsonline/storm-cassandra-cql)

If so, what version?
Batching or no batching?

-brian

---
Brian O'Neill 
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile • @boneill42 http://www.twitter.com/boneill42




From:  Amlan Roy amlan@cleartrip.com
Reply-To:  user@cassandra.apache.org
Date:  Wednesday, April 1, 2015 at 11:37 AM
To:  user@cassandra.apache.org
Subject:  Re: Frequent timeout issues

Replication factor is 2.
CREATE KEYSPACE ct_keyspace WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC1': '2'
};

Inserts are happening from Storm using the Java driver, with prepared statements
and without batching.


On 01-Apr-2015, at 8:42 pm, Brice Dutheil brice.duth...@gmail.com wrote:

 And the keyspace? What is the replication factor?
 
 Also how are the inserts done?
 
 On Wednesday, April 1, 2015, Amlan Roy amlan@cleartrip.com wrote:
 Write consistency level is ONE.
 
 This is the describe output for one of the tables.
 
 CREATE TABLE event_data (
   event text,
   week text,
   bucket int,
   date timestamp,
   unique text,
   adt int,
   age list<int>,
   arrival list<timestamp>,
   bank text,
   bf double,
   cabin text,
   card text,
   carrier list<text>,
   cb double,
   channel text,
   chd int,
   company text,
   cookie text,
   coupon list<text>,
   depart list<timestamp>,
   dest list<text>,
   device text,
   dis double,
   domain text,
   duration bigint,
   emi int,
   expressway boolean,
   flight list<text>,
   freq_flyer list<text>,
   host text,
   host_ip text,
   inf int,
   instance text,
   insurance text,
   intl boolean,
   itinerary text,
   journey text,
   meal_pref list<text>,
   mkp double,
   name list<text>,
   origin list<text>,
   pax_type list<text>,
   payment text,
   pref_carrier list<text>,
   referrer text,
   result_cnt int,
   search text,
   src text,
   src_ip text,
   stops int,
   supplier list<text>,
   tags list<text>,
   total double,
   trip text,
   user text,
   user_agent text,
   PRIMARY KEY ((event, week, bucket), date, unique)
 ) WITH CLUSTERING ORDER BY (date DESC, unique ASC) AND
   bloom_filter_fp_chance=0.01 AND
   caching='KEYS_ONLY' AND
   comment='' AND
   dclocal_read_repair_chance=0.10 AND
   gc_grace_seconds=864000 AND
   index_interval=128 AND
   read_repair_chance=0.00 AND
   replicate_on_write='true' AND
   populate_io_cache_on_flush='false' AND
   default_time_to_live=0 AND
   speculative_retry='99.0PERCENTILE' AND
   memtable_flush_period_in_ms=0 AND
   compaction={'class': 'SizeTieredCompactionStrategy'} AND
   compression={'sstable_compression': 'LZ4Compressor'};
 
 
 On 01-Apr-2015, at 8:00 pm, Eric R Medley emed...@xylocore.com wrote:
 
 Also, can you provide the table details and the consistency level you are
 using?
 
 Regards,
 
 Eric R Medley
 
 On Apr 1, 2015, at 9:13 AM, Eric R Medley emed...@xylocore.com wrote:
 
 Amlan,
 
 Can you provide information on how much data is being written? Are any of
 the columns really large? Are any writes succeeding or are all timing out?
 
 Regards,
 
 Eric R Medley
 
 On Apr 1, 2015, at 9:03 AM, Amlan Roy amlan@cleartrip.com wrote:
 
 Hi,
 
 I am new to Cassandra. I have set up a cluster with Cassandra 2.0.13. I am
 writing the same data to both HBase and Cassandra and find that the writes are
 extremely slow in Cassandra, frequently failing with the exception "Cassandra
 timeout during write query at consistency ONE". The cluster sizes for both
 HBase and Cassandra are the same.
 
 Looks like something is wrong with my cluster setup. What can be the
 possible issue? Data and commit logs are written into two separate disks.
 
 Regards,
 Amlan
 
 
 
 
 
 -- 
 Brice
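The event_data schema quoted above composites (event, week, bucket) into its partition key, a common pattern for keeping one event type's week of data from growing into a single unbounded, hot partition. A sketch of how a writer might derive those three values (the week format and bucket count are my assumptions, not stated in the thread):

```python
import datetime
import zlib

NUM_BUCKETS = 16  # assumption: split each (event, week) across 16 partitions

def partition_key(event, ts, unique):
    """Derive the (event, week, bucket) composite partition key used by the
    event_data table, so one ISO week of one event type is spread over
    NUM_BUCKETS partitions instead of a single ever-growing one."""
    year, week_no, _ = ts.isocalendar()
    week = "%d-W%02d" % (year, week_no)
    # A stable hash of the row's unique id picks the bucket. Python's built-in
    # hash() is salted per process, so crc32 stands in for a stable hash here.
    bucket = zlib.crc32(unique.encode("utf-8")) % NUM_BUCKETS
    return (event, week, bucket)

print(partition_key("search", datetime.datetime(2015, 4, 1, 11, 37), "abc123"))
```

Readers must then fan out over all NUM_BUCKETS partitions for a given event and week, which is the usual trade-off this pattern makes against unbounded partition growth.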





Re: cassandra source code

2015-03-24 Thread Brian O'Neill
FWIW, I just went through this, and posted the process I used to get up and
running:
http://brianoneill.blogspot.com/2015/03/getting-started-with-cassandra.html

-brian

---
Brian O'Neill 
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile • @boneill42 http://www.twitter.com/boneill42




From:  Divya Divs divya.divi2...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Tuesday, March 24, 2015 at 1:29 AM
To:  user@cassandra.apache.org, Jason Wee peich...@gmail.com, Eric
Stevens migh...@gmail.com
Subject:  cassandra source code

Hi
I'm Divya, and I'm trying to run the Cassandra source code in Eclipse. I'm
taking the source code from GitHub, on 64-bit Windows, following
the instructions from this website:
http://runningcassandraineclipse.blogspot.in/. In the GitHub
cassandra-trunk, the conf/log4j-server.properties file and the
org.apache.cassandra.thrift.CassandraDaemon main class are not there. Please
point me to a document for running the Cassandra source code, and please help
me to proceed. Please reply as soon as possible.
   Thanking you







Re: IF NOT EXISTS on UPDATE statements?

2014-11-18 Thread Brian O'Neill

FWIW, we have the exact same need,
and we have been struggling with the differences in CQL between UPDATE and
INSERT.

Our use case:

We do in-memory dimensional aggregations that we want to write to C* using
LWT.
(So it's a low volume of writes, because we are doing aggregations across
time windows.)

On "commit", we:
1) Read the current value for the time window
(which returns null if none exists for the time window, or current_value if
one exists)
2) Then UPSERT new_value for the window,
where new_value = current_value + agg_value,
but only if no other node has updated the value in between.

For (2), we would love to see:
UPSERT value=new_value where (not exists || value=read_value)

(ignoring some intricacies)
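The commit sequence described here is a compare-and-set loop: in CQL terms, `INSERT … IF NOT EXISTS` for the first write of a window and `UPDATE … IF value = ?` afterwards, retrying whenever the conditional write reports `[applied] = false`. A sketch of that retry logic, with the LWT modeled as an in-memory conditional write (the store and its API are invented for illustration):

```python
def cas(store, key, expected, new):
    """Model of an LWT: write `new` only if the current value equals `expected`
    (None means 'row does not exist', i.e. IF NOT EXISTS). Returns applied?"""
    if store.get(key) != expected:
        return False  # another writer got there first ([applied] = false)
    store[key] = new
    return True

def commit_aggregate(store, key, agg_value, max_retries=10):
    """Read current, conditionally write current + agg_value, retry on contention."""
    for _ in range(max_retries):
        current = store.get(key)           # step 1: read current value
        new = (current or 0) + agg_value   # step 2: compute the new total
        if cas(store, key, current, new):  # step 3: conditional upsert
            return new
    raise RuntimeError("too much contention")

store = {}
commit_aggregate(store, ("clicks", "2014-W47"), 5)
commit_aggregate(store, ("clicks", "2014-W47"), 3)
print(store)  # -> {('clicks', '2014-W47'): 8}
```

The single "UPSERT … where (not exists || value=read_value)" statement wished for above would collapse the INSERT/UPDATE branch into one conditional write, but the retry loop around it would stay the same.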

-brian

---
Brian O'Neill
Chief Technology Officer


Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com




From:  Robert Stupp sn...@snazy.de
Reply-To:  user@cassandra.apache.org
Date:  Tuesday, November 18, 2014 at 12:35 PM
To:  user@cassandra.apache.org
Subject:  Re: IF NOT EXISTS on UPDATE statements?


  There is no way to mimic IF NOT EXISTS on UPDATE and it's not a bug. INSERT
 and UPDATE are not totally orthogonal
 in CQL and you should use INSERT for actual insertion and UPDATE for updates
 (granted, the database will not reject
 your query if you break this rule but it's nonetheless the way it's intended to
 be used).
 
 OK.. (and not trying to be difficult here).  We can't have it both ways. One
 of these use cases is a bug…
 
 You're essentially saying "don't do that, but yeah, you can do it.."
 
 Either UPDATE should support IF NOT EXISTS or UPDATE should not perform
 INSERTs.
 

UPDATE performs like INSERT in the meaning of an UPSERT - means: INSERT
allows to write the same primary key again and UPDATE allows to write data
to a non-existing primary key (effectively inserting data).
(That's what NoSQL databases do.)
Take that as an advantage / feature not present on other DBs.

"UPDATE … IF EXISTS" and "INSERT … IF NOT EXISTS" are *expensive* operations
(require serial-consistency/LWT which requires some more network
roundtrips).
"IF [NOT] EXISTS" is basically some kind of "convenience".
And please take into account that UPDATE also has an "IF column = value"
condition (using LWT).
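To make the upsert-vs-LWT distinction concrete, a small sketch against an assumed table t (k int PRIMARY KEY, v int):

```sql
-- Plain UPDATE is an upsert: this writes a row even if k = 1 never existed
UPDATE t SET v = 1 WHERE k = 1;

-- The LWT variants are conditional (and more expensive):
INSERT INTO t (k, v) VALUES (1, 1) IF NOT EXISTS;  -- only if the row is absent
UPDATE t SET v = 2 WHERE k = 1 IF EXISTS;          -- only if the row is present
UPDATE t SET v = 2 WHERE k = 1 IF v = 1;           -- only if the current v is 1
```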





Re: IF NOT EXISTS on UPDATE statements?

2014-11-18 Thread Brian O'Neill
Exactly.  Perfect.  Will do.
Thanks Robert.

-brian

---
Brian O'Neill
Chief Technology Officer


Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com




From:  Robert Stupp sn...@snazy.de
Reply-To:  user@cassandra.apache.org
Date:  Tuesday, November 18, 2014 at 2:26 PM
To:  user@cassandra.apache.org
Subject:  Re: IF NOT EXISTS on UPDATE statements?

 
 For (2), we would love to see:
 UPSERT value=new_value where (not exists || value=read_value)
 

That would be something like "UPDATE … IF column=value OR NOT EXISTS".

Took a look at the C* source and that feels like a LHF (for 3.0) so I opened
https://issues.apache.org/jira/browse/CASSANDRA-8335 for that.
Feel free to comment on that :)





Re: [ANN] SparkSQL support for Cassandra with Calliope

2014-10-03 Thread Brian O'Neill
Well done Rohit. (and crew)

-brian

---
Brian O'Neill
Chief Technology Officer


Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com




From:  Rohit Rai ro...@tuplejump.com
Reply-To:  user@cassandra.apache.org
Date:  Friday, October 3, 2014 at 2:16 PM
To:  user@cassandra.apache.org
Subject:  [ANN] SparkSQL support for Cassandra with Calliope

Hi All,

An year ago we started this journey and laid the path for Spark + Cassandra
stack. We established the ground work and direction for Spark Cassandra
connectors and we have been happy seeing the results.

With the Spark 1.1.0 and SparkSQL release, it's time to take Calliope
http://tuplejump.github.io/calliope/  to the logical next level, also
paving the way for much more advanced functionality to come.

Yesterday we released Calliope 1.1.0 Community Tech Preview
https://twitter.com/tuplejump/status/517739186124627968 , which brings
Native SparkSQL support for Cassandra. The further details are available
here http://tuplejump.github.io/calliope/tech-preview.html .

This release showcases in core spark-sql
http://tuplejump.github.io/calliope/start-with-sql.html , hiveql
http://tuplejump.github.io/calliope/start-with-hive.html  and
HiveThriftServer http://tuplejump.github.io/calliope/calliope-server.html
support. 

I differentiate it as native spark-sql integration as it doesn't rely on
Cassandra's hive connectors (like Cash or DSE) and saves a level of
indirection through Hive.

It also allows us to harness Spark's analyzer and optimizer in future to
work out the best execution plan targeting a balance between Cassandra's
querying restrictions and Sparks in memory processing.

As far as we know this is the first and only third party data store
connector for SparkSQL. This is a CTP release as it relies on Spark
internals that don't yet have a stabilized developer API, and we will work
with the Spark Community in documenting the requirements and working towards
a standard and stable API for third party data store integration.

On another note, we no longer require you to signup to access the early
access code repository.

Inviting all of you to try it and give us your valuable feedback.

Regards,

Rohit
Founder & CEO, Tuplejump, Inc.

www.tuplejump.com http://www.tuplejump.com
The Data Engineering Platform




Re: Cassandra blob storage

2014-03-18 Thread Brian O'Neill
You may want to look at:
https://github.com/Netflix/astyanax/wiki/Chunked-Object-Store
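For context, the chunked-object idea that store implements can be sketched in CQL along these lines (illustrative schema only; the Astyanax store predates CQL and uses its own Thrift-based layout):

```sql
-- Each large object is split into fixed-size chunks stored as rows
-- of one wide partition, keyed by object id and chunk ordinal.
CREATE TABLE chunked_objects (
    object_id uuid,
    chunk_id  int,
    data      blob,
    PRIMARY KEY (object_id, chunk_id)
);

-- The application writes each chunk (e.g. a few MB) as its own row...
INSERT INTO chunked_objects (object_id, chunk_id, data)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 0, 0xCAFEBABE);

-- ...and reassembles by reading the chunks back in clustering order.
SELECT data FROM chunked_objects
WHERE object_id = 62c36092-82a1-3a00-93d1-46196ee77204;
```

This keeps each individual read/write small enough to fit in memory, which is the constraint the FAQ quoted below is describing.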

-brian

---
Brian O'Neill
Chief Technology Officer


Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com




From:  prem yadav ipremya...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Tuesday, March 18, 2014 at 1:41 PM
To:  user@cassandra.apache.org
Subject:  Cassandra blob storage

Hi,
I have been spending some time looking into whether large files(100mb) can
be stores in Cassandra. As per Cassandra faq:

Currently Cassandra isn't optimized specifically for large file or BLOB
storage. However, files of around 64Mb and smaller can be easily stored in
the database without splitting them into smaller chunks. This is primarily
due to the fact that Cassandra's public API is based on Thrift, which offers
no streaming abilities; any value written or fetched has to fit in to
memory.

Does the above statement still hold? Thrift supports framed data transport,
does that change the above statement. If not, why does casssandra not adopt
the Thrift framed data transfer support?

Thanks





Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Brian O'Neill

just when you thought the thread died…


First, let me say we are *WAY* off topic.  But that is a good thing.
I love this community because there are a ton of passionate, smart people.
(often with differing perspectives ;)

RE: Reporting against C* (@Peter Lin)
We've had the same experience.  Pig + Hadoop is painful.  We are
experimenting with Spark/Shark, operating directly against the data.
http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html

The Shark layer gives you SQL and caching capabilities that make it easy to
use and fast (for smaller data sets).  In front of this, we are going to add
dimensional aggregations so we can operate at larger scales.  (then the Hive
reports will run against the aggregations)

RE: REST Server (@Russel Bradbury)
We had moderate success with Virgil, which was a REST server built directly
on Thrift.  We built it directly on top of Thrift, so one day it could be
easily embedded in the C* server itself.   It could be deployed separately,
or run an embedded C*.  More often than not, we ended up running it
separately to separate the layers.  (just like Titan and Rexster)  I've
started on a rewrite of Virgil called Memnon that rides on top of CQL. (I'd
love some help)
https://github.com/boneill42/memnon

RE: CQL vs. Thrift
We've hitched our wagons to CQL.  CQL != Relational.
We've had success translating our "native" schemas into CQL, including all
the NoSQL goodness of wide-rows, etc.  You just need a good understanding of
how things translate into storage and underlying CFs.  If anything, I think
we could add some DESCRIBE information, which would help users with this,
along the lines of:
(https://issues.apache.org/jira/browse/CASSANDRA-6676)

CQL does open up the *opportunity* for users to articulate more complex
queries using more familiar syntax.  (including future things such as joins,
grouping, etc.)   To me, that is exciting, and again — one of the reasons we
are leaning on it.

my two cents,
brian

---
Brian O'Neill
Chief Technology Officer


Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com




From:  Peter Lin wool...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Wednesday, March 12, 2014 at 8:44 AM
To:  user@cassandra.apache.org user@cassandra.apache.org
Subject:  Re: Proposal: freeze Thrift starting with 2.1.0


yes, I was looking at intravert last nite.

For the kinds of reports my customers ask us to do, joins and subqueries are
important. Having tried to do a simple join in PIG, the level of pain is
high. I'm a masochist, so I don't mind breaking a simple join into multiple
MR tasks, though I do find myself asking why the hell does it need to be so
painful in PIG? Many of my friends say "what is this crap!" or "this is
better than writing sql queries to run reports?"

Plus, using ETL techniques to extract summaries only works for cases where
the data is small enough. Once it gets beyond a certain size, it's not
practical, which means we're back to crappy reporting languages that make
life painful. Lots of big healthcare companies have thousands of MOLAP cubes
on dozens of mainframes. The old OLTP - DW/OLAP creates its own set of
management headaches.

being able to report directly on the raw data avoids many of the issues, but
that's my bias perspective.




On Wed, Mar 12, 2014 at 8:15 AM, DuyHai Doan doanduy...@gmail.com wrote:
 I would love to see Cassandra get to the point where users can define complex
 queries with subqueries, like, group by and joins -- Did you have a look at
 Intravert? I think it does union & intersection on server side for you. Not
 sure about join though..
 
 
 On Wed, Mar 12, 2014 at 12:44 PM, Peter Lin wool...@gmail.com wrote:
 
 Hi Ed,
 
 I agree Solr is deeply integrated into DSE. I've looked at Solandra in the
 past and studied the code.
 
 My understanding is DSE uses Cassandra for storage and the user has both API
 available. I do think it can be integrated further to make moderate to
 complex queries easier and probably faster. That's why we built our own
 JPA-like object query API. I would love to see Cassandra get to the point
 where users can define complex queries with subqueries, like, group by and
 joins. Clearly lots of people want these features and even

[Blog] : Storm and Cassandra : A Three Year Retrospective

2014-02-13 Thread Brian O'Neill

A community member asked for a blog post on Storm + Cassandra.

FWIW, here was our journey.
http://brianoneill.blogspot.com/2014/02/storm-and-cassandra-three-year.html

-brian

---
Brian O'Neill
Chief Technology Officer


Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com






Re: CQL list command

2014-02-07 Thread Brian O'Neill

+1, agreed.  I do the same thing.

If cli is going away, we'll need this ability in cqlsh.

I created a JIRA issue for it:
https://issues.apache.org/jira/browse/CASSANDRA-6676


We'll see what the crew comes back with.

-brian

---
Brian O'Neill
Chief Technology Officer

Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com







On 2/7/14, 2:33 AM, Ben Hood 0x6e6...@gmail.com wrote:

On Thu, Feb 6, 2014 at 9:01 PM, Andrew Cobley a.e.cob...@dundee.ac.uk
wrote:
 I often use the CLI command LIST for debugging or when teaching
students, showing them what's going on under the hood of CQL.  I see that
the CLI will be removed in Cassandra 3 and we will lose this ability.  It
would be nice if CQL retained it, or something like it, for debugging and
teaching purposes.

I agree. I use LIST every now and then to verify the storage layout of
partitioning and cluster columns. What would be cool is to do
something like:

cqlsh:y CREATE TABLE x (
  ... a int,
  ... b int,
  ... c int,
  ... PRIMARY KEY (a,b)
  ... );
cqlsh:y insert into x (a,b,c) values (1,1,1);
cqlsh:y insert into x (a,b,c) values (2,1,1);
cqlsh:y insert into x (a,b,c) values (2,2,1);
cqlsh:y select * from x;
 a | b | c
---+---+---
 1 | 1 | 1
 2 | 1 | 1
 2 | 2 | 1

(3 rows)

cqlsh:y select * from x show storage; // requires monospace font

   +---+
+---+  |b:1|
|a:1| +-- |---|
+---+  |c:1|
   +---+

   +---+---+
+---+  |b:1|b:2|
|a:2| +-- |---|---|
+---+  |c:1|c:2|
   +---+---+

(2 rows)




Re: Dimensional SUM, COUNT, DISTINCT in C* (replacing Acunu)

2013-12-18 Thread Brian O'Neill

Thanks for the pointer Alain.

At a quick glance, it looks like people are looking for query time
filtering/aggregation, which will suffice for small data sets.  Hopefully we
might be able to extend that to perform pre-computations as well. (which
would support much larger data sets / volumes)

I'll continue the discussion on the issue.

thanks again,
brian


---
Brian O'Neill
Chief Architect
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com




From:  Alain RODRIGUEZ arodr...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Wednesday, December 18, 2013 at 5:13 AM
To:  user@cassandra.apache.org
Cc:  d...@cassandra.apache.org d...@cassandra.apache.org
Subject:  Re: Dimensional SUM, COUNT,  DISTINCT in C* (replacing Acunu)

Hi, this would indeed be much appreciated by a lot of people.

There is this issue, existing about this subject

 https://issues.apache.org/jira/browse/CASSANDRA-4914

Maybe you could help the committers there.

Hope this will be useful to you.

Please let us know when you find a way to do these operations.

Cheers.


2013/12/18 Brian O'Neill b...@alumni.brown.edu
 We are seeking to replace Acunu in our technology stack / platform.  It is the
 only component in our stack that is not open source.
 
 In preparation, over the last few weeks I've migrated Virgil to CQL.   The
 vision is that Virgil could receive a REST request to upsert/delete data
 (hierarchical JSON to support collections).  Virgil would lookup the
 dimensions/aggregations for that table, add the key to the pertinent
 dimensional tables (e.g. DISTINCT), incorporate values into aggregations (e.g.
 SUMs) and increment/decrement relevant counters (COUNT).  (using additional
 CF's)
 
 This seems straight-forward, but appears to require a read-before-write.
 (e.g. read the current value of a SUM, incorporate the new value, then use the
 lightweight transactions of C* 2.0 to conditionally update the value.)
 
 Before I go down this path, I was wondering if anyone is designing/working on
 the same, perhaps at a lower level?  (CQL?)
 
 Is there any intent to support aggregations/filters (COUNT, SUM, DISTINCT,
 etc) at the CQL level?  If so, is there a preliminary design?
 
 I can see a lower-level approach, which would leverage the commit logs (and
 mem/sstables) and perform the aggregation during read-operations (and
 flush/compaction).
 
 thoughts?  i'm open to all ideas.
 
 -brian
 -- 
 Brian ONeill
 Chief Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024 tel:215.588.6024
 blog: http://brianoneill.blogspot.com/
 twitter: @boneill42





Dimensional SUM, COUNT, DISTINCT in C* (replacing Acunu)

2013-12-17 Thread Brian O'Neill
We are seeking to replace Acunu in our technology stack / platform.  It is
the only component in our stack that is not open source.

In preparation, over the last few weeks I’ve migrated Virgil to CQL.   The
vision is that Virgil could receive a REST request to upsert/delete data
(hierarchical JSON to support collections).  Virgil would lookup the
dimensions/aggregations for that table, add the key to the pertinent
dimensional tables (e.g. DISTINCT), incorporate values into aggregations
(e.g. SUMs) and increment/decrement relevant counters (COUNT).  (using
additional CF’s)
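For the COUNT piece specifically, distributed counter columns avoid the read-before-write entirely (at the cost of not being idempotent); a minimal sketch with assumed table and column names:

```sql
-- One counter per (dimension, bucket) pair; all non-key columns
-- in a counter table must be counters.
CREATE TABLE dim_counts (
    dimension text,
    bucket    text,
    cnt       counter,
    PRIMARY KEY (dimension, bucket)
);

-- Increment on ingest; no prior read is needed
UPDATE dim_counts SET cnt = cnt + 1
WHERE dimension = 'state' AND bucket = 'PA';
```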

This seems straight-forward, but appears to require a read-before-write.
 (e.g. read the current value of a SUM, incorporate the new value, then use
the lightweight transactions of C* 2.0 to conditionally update the value.)

Before I go down this path, I was wondering if anyone is designing/working
on the same, perhaps at a lower level?  (CQL?)

Is there any intent to support aggregations/filters (COUNT, SUM, DISTINCT,
etc) at the CQL level?  If so, is there a preliminary design?

I can see a lower-level approach, which would leverage the commit logs (and
mem/sstables) and perform the aggregation during read-operations (and
flush/compaction).

thoughts?  i'm open to all ideas.

-brian
-- 
Brian ONeill
Chief Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: Drop keyspace via CQL hanging on master/trunk.

2013-12-10 Thread Brian O'Neill

Great.  Thanks Aaron.

FWIW, I am/was porting Virgil over CQL. 

I should be able to release a new REST API for C* (using CQL) shortly.

-brian

---
Brian O'Neill
Chief Architect
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42  •  healthmarketscience.com


On Dec 10, 2013, at 1:51 PM, Aaron Morton aa...@thelastpickle.com wrote:

 Looks like a bug, will try to fix today 
 https://issues.apache.org/jira/browse/CASSANDRA-6472
 
 Cheers
 
 -
 Aaron Morton
 New Zealand
 @aaronmorton
 
 Co-Founder & Principal Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com
 
 On 6/12/2013, at 10:25 am, Brian O'Neill b...@alumni.brown.edu wrote:
 
 
 I removed the data directory just to make sure I had a clean environment. 
 (eliminating the possibility of corrupt keyspaces/files causing problems)
 
 -brian
 
 ---
 Brian O'Neill
 Chief Architect
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 • @boneill42  •  
 healthmarketscience.com
 
 
 
 From: Jason Wee peich...@gmail.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, December 5, 2013 at 4:03 PM
 To: user@cassandra.apache.org
 Subject: Re: Drop keyspace via CQL hanging on master/trunk.
 
 Hey Brian, just out of curiosity, why would you remove cassandra data 
 directory entirely?
 
 /Jason
 
 
 On Fri, Dec 6, 2013 at 2:38 AM, Brian O'Neill b...@alumni.brown.edu wrote:
 When running Cassandra from trunk/master, I see a drop keyspace command 
 hang at the CQL prompt.
 
 To reproduce:
 1) Removed my cassandra data directory entirely
 2) Fired up cqlsh, and executed the following CQL commands in succession:
 
 bone@zen:~/git/boneill42/cassandra- bin/cqlsh
 Connected to Test Cluster at localhost:9160.
 [cqlsh 4.1.0 | Cassandra 2.1-SNAPSHOT | CQL spec 3.1.1 | Thrift protocol 
 19.38.0]
 Use HELP for help.
 cqlsh describe keyspaces;
 
 system  system_traces
 
 cqlsh create keyspace test_keyspace with replication = {'class':'SimpleStrategy', 'replication_factor':'1'};
 cqlsh describe keyspaces;
 
 system  test_keyspace  system_traces
 
 cqlsh drop keyspace test_keyspace;
 
 THIS HANGS INDEFINITELY
 
 thoughts?  user error? worth filing an issue?
 One other note — this happens using the CQL java driver as well.
 
 -brian
 
 ---
 Brian O'Neill
 Chief Architect
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 • @boneill42  •  
 healthmarketscience.com
 
 
 
 



Drop keyspace via CQL hanging on master/trunk.

2013-12-05 Thread Brian O'Neill
When running Cassandra from trunk/master, I see a drop keyspace command hang
at the CQL prompt.

To reproduce:
1) Removed my cassandra data directory entirely
2) Fired up cqlsh, and executed the following CQL commands in succession:

bone@zen:~/git/boneill42/cassandra- bin/cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.0 | Cassandra 2.1-SNAPSHOT | CQL spec 3.1.1 | Thrift protocol
19.38.0]
Use HELP for help.
cqlsh describe keyspaces;

system  system_traces

cqlsh create keyspace test_keyspace with replication = {'class':'SimpleStrategy', 'replication_factor':'1'};
cqlsh describe keyspaces;

system  test_keyspace  system_traces

cqlsh drop keyspace test_keyspace;

THIS HANGS INDEFINITELY

thoughts?  user error? worth filing an issue?
One other note — this happens using the CQL java driver as well.

-brian

---
Brian O'Neill
Chief Architect
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com






Re: Drop keyspace via CQL hanging on master/trunk.

2013-12-05 Thread Brian O'Neill

I removed the data directory just to make sure I had a clean environment.
(eliminating the possibility of corrupt keyspaces/files causing problems)

-brian

---
Brian O'Neill
Chief Architect
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com




From:  Jason Wee peich...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Thursday, December 5, 2013 at 4:03 PM
To:  user@cassandra.apache.org
Subject:  Re: Drop keyspace via CQL hanging on master/trunk.

Hey Brian, just out of curiosity, why would you remove cassandra data
directory entirely?

/Jason


On Fri, Dec 6, 2013 at 2:38 AM, Brian O'Neill b...@alumni.brown.edu wrote:
 When running Cassandra from trunk/master, I see a drop keyspace command hang
 at the CQL prompt.
 
 To reproduce:
 1) Removed my cassandra data directory entirely
 2) Fired up cqlsh, and executed the following CQL commands in succession:
 
 bone@zen:~/git/boneill42/cassandra- bin/cqlsh
 Connected to Test Cluster at localhost:9160.
 [cqlsh 4.1.0 | Cassandra 2.1-SNAPSHOT | CQL spec 3.1.1 | Thrift protocol
 19.38.0]
 Use HELP for help.
 cqlsh describe keyspaces;
 
 system  system_traces
 
 cqlsh create keyspace test_keyspace with replication = {'class':'SimpleStrategy', 'replication_factor':'1'};
 cqlsh describe keyspaces;
 
 system  test_keyspace  system_traces
 
 cqlsh drop keyspace test_keyspace;
 
 THIS HANGS INDEFINITELY
 
 thoughts?  user error? worth filing an issue?
 One other note — this happens using the CQL java driver as well.
 
 -brian
 
 ---
 Brian O'Neill
 Chief Architect
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
 healthmarketscience.com
 
 





Re: Main method not found in class org.apache.cassandra.service.CassandraDaemon

2013-07-17 Thread Brian O'Neill
Vivek,

The location of CassandraDaemon changed between versions.  (from
org.apache.cassandra.thrift to org.apache.cassandra.service)

It is likely that the start scripts are picking up the old version on the
classpath, which results in the main method not being found.

Do you have CASSANDRA_HOME set?  I believe the start scripts will use that.
 Perhaps you have that set and pointed to the older 1.1.X version?

-brian


On Wed, Jul 17, 2013 at 8:31 AM, Vivek Mishra mishra.v...@gmail.com wrote:

 Finally,
 i have to delete all rpm installed files to get this working, folders are:
 /usr/share/cassandra
 /etc/alternatives/cassandra
 /usr/bin/cassandra
 /usr/bin/cassandra.in.sh
 /usr/bin/cassandra-cli

 Still don't understand why it's giving me such weird error:
 
 Error: Main method not found in class
 org.apache.cassandra.service.CassandraDaemon, please define the main method
 as:
public static void main(String[] args)
 ***

 This is not informative at all and does not even Help!

 -Vivek


 On Wed, Jul 17, 2013 at 3:49 PM, Vivek Mishra mishra.v...@gmail.comwrote:

 @aaron
 Thanks for your reply. I did have a look rpm installed files
 1.  /etc/alternatives/cassandra, it contains configuration files only.
 and .sh files are installed within /usr/bin folder.

 Even if i try to run from extracted tar ball folder as

 /home/impadmin/apache-cassandra-1.2.4/bin/cassandra -f

 same error.

 /home/impadmin/apache-cassandra-1.2.4/bin/cassandra -v

 gives me 1.1.12 though it should give me 1.2.4


 -Vivek
 it gives me same error.


 On Wed, Jul 17, 2013 at 3:37 PM, aaron morton aa...@thelastpickle.comwrote:

 Something is messed up in your install.  Can you try scrubbing the
 install and restarting ?

 Cheers

-
 Aaron Morton
 Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 17/07/2013, at 6:47 PM, Vivek Mishra mishra.v...@gmail.com wrote:

 Error: Main method not found in class
 org.apache.cassandra.service.CassandraDaemon, please define the main method
 as:
public static void main(String[] args)
 

 Hi,
 I am getting this error. Earlier it was working fine for me, when i
 simply downloaded the tarball installation and ran cassandra server.
 Recently i did rpm package installation of Cassandra and which is working
 fine. But somehow when i try to run it via originally extracted tar
 package. i am getting:

 *
 xss =  -ea
 -javaagent:/home/impadmin/software/apache-cassandra-1.2.4//lib/jamm-0.2.5.jar
 -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1024M -Xmx1024M
 -Xmn256M -XX:+HeapDumpOnOutOfMemoryError -Xss180k
 Error: Main method not found in class
 org.apache.cassandra.service.CassandraDaemon, please define the main method
 as:
public static void main(String[] args)
 *

 I tried setting CASSANDRA_HOME directory, but no luck.

 Error is bit confusing, Any suggestions???

 -Vivek







-- 
Brian ONeill
Chief Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: Main method not found in class org.apache.cassandra.service.CassandraDaemon

2013-07-17 Thread Brian O'Neill
Vivek,

You could try echoing the CLASSPATH to double check.  Drop an echo into the
launch_service function in the cassandra shell script.  (~line 121)

Let us know the output.

-brian

---
Brian O'Neill
Chief Architect
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42   •
healthmarketscience.com




From:  Vivek Mishra mishra.v...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Wednesday, July 17, 2013 10:24 AM
To:  user@cassandra.apache.org
Subject:  Re: Main method not found in class
org.apache.cassandra.service.CassandraDaemon

Hi Brian,
Thanks for your response.
I think i did change CASSANDRA_HOME to point to new directory.

-Vivek


On Wed, Jul 17, 2013 at 7:03 PM, Brian O'Neill b...@alumni.brown.edu
wrote:
 Vivek,
 
 The location of CassandraDaemon changed between versions.  (from
 org.apache.cassandra.thrift to org.apache.cassandra.service)
 
 It is likely that the start scripts are picking up the old version on the
 classpath, which results in the main method not being found.
 
 Do you have CASSANDRA_HOME set?  I believe the start scripts will use that.
 Perhaps you have that set and pointed to the older 1.1.X version?
 
 -brian
 
 
 On Wed, Jul 17, 2013 at 8:31 AM, Vivek Mishra mishra.v...@gmail.com wrote:
 Finally,
 i have to delete all rpm installed files to get this working, folders are:
 /usr/share/cassandra
 /etc/alternatives/cassandra
 /usr/bin/cassandra
 /usr/bin/cassandra.in.sh http://cassandra.in.sh
 /usr/bin/cassandra-cli
 
 Still don't understand why it's giving me such weird error:
 
 Error: Main method not found in class
 org.apache.cassandra.service.CassandraDaemon, please define the main method
 as:
public static void main(String[] args)
 ***
 
 This is not informative at all and does not even Help!
 
 -Vivek
 
 
 On Wed, Jul 17, 2013 at 3:49 PM, Vivek Mishra mishra.v...@gmail.com wrote:
 @aaron
 Thanks for your reply. I did have a look rpm installed files
 1.  /etc/alternatives/cassandra, it contains configuration files only.
 and .sh files are installed within /usr/bin folder.
 
 Even if i try to run from extracted tar ball folder as
 
 /home/impadmin/apache-cassandra-1.2.4/bin/cassandra -f
 
 same error.  
 
 /home/impadmin/apache-cassandra-1.2.4/bin/cassandra -v
 
 gives me 1.1.12 though it should give me 1.2.4
 
 
 -Vivek
 it gives me same error.
 
 
 On Wed, Jul 17, 2013 at 3:37 PM, aaron morton aa...@thelastpickle.com
 wrote:
 Something is messed up in your install.  Can you try scrubbing the install
 and restarting ?
 
 Cheers
 
 -
 Aaron Morton
 Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 17/07/2013, at 6:47 PM, Vivek Mishra mishra.v...@gmail.com wrote:
 
 Error: Main method not found in class
 org.apache.cassandra.service.CassandraDaemon, please define the main
 method as:
public static void main(String[] args)
 
 
 Hi,
 I am getting this error. Earlier it was working fine for me, when i simply
 downloaded the tarball installation and ran cassandra server. Recently i
 did rpm package installation of Cassandra and which is working fine. But
 somehow when i try to run it via originally extracted tar package. i am
 getting:
 
 *
 xss =  -ea 
 -javaagent:/home/impadmin/software/apache-cassandra-1.2.4//lib/jamm-0.2.5.
 jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1024M
 -Xmx1024M -Xmn256M -XX:+HeapDumpOnOutOfMemoryError -Xss180k
 Error: Main method not found in class
 org.apache.cassandra.service.CassandraDaemon, please define the main
 method as:
public static void main(String[] args)
 *
 
 I tried setting CASSANDRA_HOME directory, but no luck.
 
 Error is bit confusing, Any suggestions???
 
 -Vivek
 
 
 
 
 
 
 -- 
 Brian ONeill
 Chief Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://brianoneill.blogspot.com/
 twitter: @boneill42





SQL Injection & C* (via CQL & Thrift)

2013-06-18 Thread Brian O'Neill
Mostly for fun, I wanted to throw this out there...

We are undergoing a security audit for our platform (C* + Elastic Search +
Storm).  One component of that audit is susceptibility to SQL injection.  I
was wondering if anyone has attempted to construct a SQL injection attack
against Cassandra?  Is it even possible?

I know the code paths fairly well, but...
Does there exist a path in the code whereby user data gets interpreted,
which could be exploited to perform unintended operations?

From the Thrift side of things, I've always felt safe.  Data is opaque.
 Serializers are used to convert it to Bytes, and C* doesn't ever really do
anything with the data.

In examining the CQL java-driver, it looks like there might be a bit more
exposure to injection.  (or even CQL over Thrift)  I haven't dug into the
code yet, but dependent on which flavor of the API you are using, you may
be including user data in your statements.

Does anyone know if the CQL java-driver does anything to protect against
injection?  Or is it possible to say that the syntax is strict enough that
any embedded operations in data would not parse?

just some food for thought...
I'll be digging into this over the next couple weeks.  If people are
interested, I can throw a blog post out there with the findings.

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: SQL Injection & C* (via CQL & Thrift)

2013-06-18 Thread Brian O'Neill

Perfect.  Thanks Sylvain.  That is exactly the input I was looking for, and
I agree completely.
(t's easy enough to protect against)

As for the thrift side (i.e. using Hector or Astyanax), anyone have a crafty
way to inject something?

At first glance, it doesn't appear possible, but I'm not 100% confident
making that assertion.

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42   •
healthmarketscience.com




From:  Sylvain Lebresne sylv...@datastax.com
Reply-To:  user@cassandra.apache.org
Date:  Tuesday, June 18, 2013 8:51 AM
To:  user@cassandra.apache.org user@cassandra.apache.org
Subject:  Re: SQL Injection & C* (via CQL & Thrift)

If you're not careful, then CQL injection is possible.

Say you naively build your query with
  "UPDATE foo SET col='" + user_input + "' WHERE key = 'k'"
then if user_input is "foo' AND col2='bar", your user will have overwritten
a column it shouldn't have been able to. And something equivalent in a BATCH
statement could allow to overwrite/delete some random row in some random
table.

Now CQL being much more restricted than SQL (no subqueries, no generic
transaction, ...), the extent of what you can do with a CQL injection is way
smaller than in SQL. But you do have to be careful.

As far as the Datastax java driver is concerned, you can fairly easily
protect yourself by using either:
1) prepared statements: if the user input is a prepared variable, there is
nothing the user can do (it's equivalent to the thrift situation).
2) using the query builder: it will escape quotes in the strings you
provide, thus avoiding injection.

So I would say that injections are definitively possible if you concatenate
strings too naively, but I don't think preventing them is very hard.

--
Sylvain


On Tue, Jun 18, 2013 at 2:02 PM, Brian O'Neill b...@alumni.brown.edu
wrote:
 
 Mostly for fun, I wanted to throw this out there...
 
 We are undergoing a security audit for our platform (C* + Elastic Search +
 Storm).  One component of that audit is susceptibility to SQL injection.  I
 was wondering if anyone has attempted to construct a SQL injection attack
 against Cassandra?  Is it even possible?
 
 I know the code paths fairly well, but...
 Does there exists a path in the code whereby user data gets interpreted, which
 could be exploited to perform user operations?
 
 From the Thrift side of things, I've always felt safe.  Data is opaque.
 Serializers are used to convert it to Bytes, and C* doesn't ever really do
 anything with the data.
 
 In examining the CQL java-driver, it looks like there might be a bit more
 exposure to injection.  (or even CQL over Thrift)  I haven't dug into the code
 yet, but dependent on which flavor of the API you are using, you may be
 including user data in your statements.
 
 Does anyone know if the CQL java-driver does anything to protect against
 injection?  Or is it possible to say that the syntax is strict enough that any
 embedded operations in data would not parse?
 
 just some food for thought...
 I'll be digging into this over the next couple weeks.  If people are
 interested, I can throw a blog post out there with the findings.
 
 -brian
 
 -- 
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://brianoneill.blogspot.com/
 twitter: @boneill42





[BLOG] : Cassandra as a Deep Storage Mechanism for Druid Real-Time Analytics Engine

2013-05-17 Thread Brian O'Neill
FWIW, we were able to integrate Druid and Cassandra.

Its only in PoC right now, but it seems like a powerful combination:
http://brianoneill.blogspot.com/2013/05/cassandra-as-deep-storage-mechanism-for.html

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: multitenant support with key spaces

2013-05-06 Thread Brian O'Neill

You may want to look at using virtual keyspaces:
http://hector-client.github.io/hector/build/html/content/virtual_keyspaces.html

And follow these tickets:
http://wiki.apache.org/cassandra/MultiTenant

-brian


On May 6, 2013, at 2:37 AM, Darren Smythe wrote:

 How many keyspaces can you reasonably have? We have around 500 customers and 
 expect that to double end of year. We're looking into C* and wondering if it 
 makes sense for a separate KS per customer?
 
 If we have 1000 customers, so one KS per customer is 1000 keyspaces. Is that 
 something C* can handle efficiently? Each customer has about 10 GB of data 
 (not taking replication into account).
 
 Or is this symptomatic of a bad design?
 
 I guess the same question applies to our notion of breaking up the column 
 families into time ranges. We're naively trying to avoid having few large 
 CFs/KSs. Is/should that be a concern?
 
 What are the tradeoffs of a smaller number of heavyweight KS/CFs vs. manually 
 sharding the data into more granular KSs/CFs?
 
 Thanks for any info.

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/



Re: Exporting all data within a keyspace

2013-04-30 Thread Brian O'Neill

You could always do something like this as well:
http://brianoneill.blogspot.com/2012/05/dumping-data-from-cassandra-like.html

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42   •
healthmarketscience.com




From:  Kumar Ranjan winnerd...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Tuesday, April 30, 2013 9:11 AM
To:  user@cassandra.apache.org user@cassandra.apache.org
Subject:  Re: Exporting all data within a keyspace

Try sstable2json and json2sstable. They work per column family, so you can
list all the column families, iterate over them, and use sstable2json to
extract the data. Remember this will only fetch on-disk data; anything in a
memtable/cache that has yet to be flushed will be missed. So run a flush and
compaction first, and then run the script.

On Tuesday, April 30, 2013, Chidambaran Subramanian  wrote:
 Is there any easy way of exporting all data for a keyspace (and conversely)
 importing it.
 
 Regards
 Chiddu




Re: Blobs in CQL?

2013-04-11 Thread Brian O'Neill
Great!

Thanks Gabriel.  Do you have an example? (are using QueryBuilder?)
I couldn't find the part of the API that allowed you to pass in the byte
array.

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com







On 4/11/13 8:25 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:

Hi Brian,

I'm using the blobs to store images in cassandra(1.2.3) using the
java-driver version 1.0.0-beta1.
There is no need to convert a byte array into hex.

Br,
Gabi

On 4/11/13 3:21 PM, Brian O'Neill wrote:

 I started playing around with the CQL driver.
 Has anyone used blobs with it yet?

 Are you forced to convert a byte[] to hex?
 (e.g. I have a photo that I want to store in C* using the java-driver
API)

 -brian

 -- 
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://brianoneill.blogspot.com/
 twitter: @boneill42





Re: Blobs in CQL?

2013-04-11 Thread Brian O'Neill
Cool.  That might be it.  I'll take a look at PreparedStatement.

For query building, I took a look under the covers, and even when I was
passing in a ByteBuffer, it runs through the following code in the
java-driver:

Utils.java:
   if (value instanceof ByteBuffer) {
      sb.append("0x");
      sb.append(ByteBufferUtil.bytesToHex((ByteBuffer) value));
   }
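For illustration, here is a self-contained sketch (my own helper, not the driver's code) of what that string path amounts to: each byte becomes two hex characters after a "0x" prefix, which is why inlining a blob in statement text doubles its size.

```java
public class HexBlob {

    // Render a byte[] as a CQL blob literal ("0x" + two hex chars per byte),
    // mimicking the query-builder string path quoted above.
    static String toHexLiteral(byte[] bytes) {
        StringBuilder sb = new StringBuilder("0x");
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 2 bytes in, 6 characters out.
        System.out.println(toHexLiteral(new byte[] { (byte) 0xCA, (byte) 0xFE })); // prints 0xcafe
    }
}
```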

Hopefully, the prepared statement doesn't do the conversion.
(I'm not sure if it is a limitation of the CQL protocol itself)

thanks again,
-brian



---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com







On 4/11/13 8:34 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:

I'm not using the query builder but the PreparedStatement.

Here is the sample code: https://gist.github.com/devsprint/5363023

Gabi
On 4/11/13 3:27 PM, Brian O'Neill wrote:
 Great!

 Thanks Gabriel.  Do you have an example? (are using QueryBuilder?)
 I couldn't find the part of  the API that allowed you to pass in the
byte
 array.

 -brian

 ---
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science
 The Science of Better Results
  2700 Horizon Drive • King of Prussia, PA • 19406
  M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
 healthmarketscience.com







 On 4/11/13 8:25 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:

 Hi Brian,

 I'm using the blobs to store images in cassandra(1.2.3) using the
 java-driver version 1.0.0-beta1.
 There is no need to convert a byte array into hex.

 Br,
 Gabi

 On 4/11/13 3:21 PM, Brian O'Neill wrote:
 I started playing around with the CQL driver.
 Has anyone used blobs with it yet?

 Are you forced to convert a byte[] to hex?
 (e.g. I have a photo that I want to store in C* using the java-driver
 API)

 -brian

 -- 
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://brianoneill.blogspot.com/
 twitter: @boneill42






Re: Blobs in CQL?

2013-04-11 Thread Brian O'Neill
Yep, it worked like a charm.  (PreparedStatement avoided the hex conversion)

But now, I'm seeing a few extra bytes come back in the select….
(I'll keep digging, but maybe you have some insight?)

I see this:
ERROR [2013-04-11 13:05:03,461] com.skookle.dao.RepositoryDao:
repository.add() byte.length()=[259804]

ERROR [2013-04-11 13:08:08,487] com.skookle.dao.RepositoryDao:
repository.get() [foo.jpeg] byte.length()=[259861]


(Notice the length's don't match up)

Using this code:
public void addContent(String key, byte[] data)
        throws NoHostAvailableException {

    LOG.error("repository.add() byte.length()=[" + data.length + "]");

    String statement = "INSERT INTO " + KEYSPACE + "." + TABLE
            + " (key, data) VALUES (?, ?)";

    PreparedStatement ps = session.prepare(statement);

    BoundStatement bs = ps.bind(key, ByteBuffer.wrap(data));

    session.execute(bs);
}


public byte[] getContent(String key) throws NoHostAvailableException {

    Query select = select("data").from(KEYSPACE, TABLE).where(eq("key", key));

    ResultSet resultSet = session.execute(select);

    byte[] data = resultSet.one().getBytes("data").array();

    LOG.error("repository.get() [" + key + "] byte.length()=[" + data.length + "]");

    return data;
}


---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42   •
healthmarketscience.com




From:  Sylvain Lebresne sylv...@datastax.com
Reply-To:  user@cassandra.apache.org
Date:  Thursday, April 11, 2013 8:48 AM
To:  user@cassandra.apache.org user@cassandra.apache.org
Cc:  Gabriel Ciuloaica gciuloa...@gmail.com
Subject:  Re: Blobs in CQL?


 Hopefully, the prepared statement doesn't do the conversion.

It does not.
 
 (I'm not sure if it is a limitation of the CQL protocol itself)
 
 thanks again,
 -brian
 
 
 
 ---
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 tel:215.588.6024  • @boneill42
 http://www.twitter.com/boneill42  •
 healthmarketscience.com http://healthmarketscience.com
 
 
 
 
 
 
 
 On 4/11/13 8:34 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:
 
 I'm not using the query builder but the PreparedStatement.
 
 Here is the sample code: https://gist.github.com/devsprint/5363023
 
 Gabi
 On 4/11/13 3:27 PM, Brian O'Neill wrote:
  Great!
 
  Thanks Gabriel.  Do you have an example? (are using QueryBuilder?)
  I couldn't find the part of  the API that allowed you to pass in the
 byte
  array.
 
  -brian
 
  ---
  Brian O'Neill
  Lead Architect, Software Development
  Health Market Science
  The Science of Better Results
  2700 Horizon Drive • King of Prussia, PA • 19406
  M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
  healthmarketscience.com http://healthmarketscience.com
 
 
 
 
 
 
 
 
  On 4/11/13 8:25 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:
 
  Hi Brian,
 
  I'm using

Re: Blobs in CQL?

2013-04-11 Thread Brian O'Neill
Sylvain,

Interesting, when I look at the actual bytes returned, I see the byte array
is prefixed with the keyspace and table name.

I assume I'm doing something wrong in the select.  Am I incorrectly using
the ResultSet?

-brian

On Thu, Apr 11, 2013 at 9:09 AM, Brian O'Neill b...@alumni.brown.eduwrote:

 Yep, it worked like a charm.  (PreparedStatement avoided the hex
 conversion)

 But now, I'm seeing a few extra bytes come back in the select….
 (I'll keep digging, but maybe you have some insight?)

 I see this:

 ERROR [2013-04-11 13:05:03,461] com.skookle.dao.RepositoryDao:
 repository.add() byte.length()=[259804]

 ERROR [2013-04-11 13:08:08,487] com.skookle.dao.RepositoryDao:
 repository.get() [foo.jpeg] byte.length()=[259861]

 (Notice the length's don't match up)

 Using this code:

 public void addContent(String key, byte[] data)

 throws NoHostAvailableException {

 LOG.error(repository.add() byte.length()=[ + data.length + ]);

 String statement = INSERT INTO  + KEYSPACE + . + TABLE + (key,
 data) VALUES (?, ?);

 PreparedStatement ps = session.prepare(statement);

 BoundStatement bs = ps.bind(key, ByteBuffer.wrap(data));

 session.execute(bs);

 }


 public byte[] getContent(String key) throws NoHostAvailableException {

 Query select = select(data).from(KEYSPACE, TABLE).where(eq(key,
 key));

 ResultSet resultSet = session.execute(select);

 byte[] data = resultSet.one().getBytes(data).array();

 LOG.error(repository.get() [ + key + ] byte.length()=[ + data.
 length + ]);

 return data;

 }

 ---

 Brian O'Neill

 Lead Architect, Software Development

 *Health Market Science*

 *The Science of Better Results*

 2700 Horizon Drive • King of Prussia, PA • 19406

 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •

 healthmarketscience.com



 ** **


 From: Sylvain Lebresne sylv...@datastax.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, April 11, 2013 8:48 AM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Cc: Gabriel Ciuloaica gciuloa...@gmail.com
 Subject: Re: Blobs in CQL?


 Hopefully, the prepared statement doesn't do the conversion.


 It does not.


 (I'm not sure if it is a limitation of the CQL protocol itself)

 thanks again,
 -brian



 ---
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
 healthmarketscience.com








 On 4/11/13 8:34 AM, Gabriel Ciuloaica gciuloa...@gmail.com wrote:

 I'm not using the query builder but the PreparedStatement.
 
 Here is the sample code: https://gist.github.com/devsprint/5363023
 
 Gabi
 On 4/11/13 3:27 PM, Brian O'Neill wrote:
  Great!
 
  Thanks Gabriel.  Do you have an example? (are using QueryBuilder?)
  I couldn't find the part of  the API that allowed you to pass in the
 byte
  array.
 
  -brian
 
  ---
  Brian O'Neill
  Lead Architect, Software Development
  Health Market Science
  The Science of Better Results
  2700 Horizon Drive • King of Prussia, PA • 19406
  M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
 

Re: Blobs in CQL?

2013-04-11 Thread Brian O'Neill
Bingo! Thanks to both of you.  (the C* community rocks)

A few hours worth of work, and I've got a working REST-based photo
repository backed by  C* using the CQL java driver. =)

rock on, thanks again,
-brian


On Thu, Apr 11, 2013 at 9:33 AM, Sylvain Lebresne sylv...@datastax.comwrote:


 I assume I'm doing something wrong in the select.  Am I incorrectly using
 the ResultSet?


 You're incorrectly using the returned ByteBuffer. But you should not feel
 bad, that API kinda
 sucks.

 The short version is that .array() returns the backing array of the
 ByteBuffer. But there is no
 guarantee that you'll have a one-to-one correspondence between the valid
 content of the
 ByteBuffer and the backing array, the backing array can be bigger in
 particular (long story short,
 this allows multiple ByteBuffer to share the same backing array, which can
 avoid doing copies).

 I also note that there is no guarantee that .array() will work unless
 you've called .hasArray().

 Anyway, what you could do is:
 ByteBuffer bb = resultSet.one().getBytes("data");
 byte[] data = new byte[bb.remaining()];
 bb.get(data);

 Alternatively, you can use the result of .array(), but you should only
 consider the bb.remaining()
 bytes starting at bb.arrayOffset() + bb.position() (where bb is the
 returned ByteBuffer).

 --
 Sylvain
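The pitfall Sylvain describes can be demonstrated with a plain ByteBuffer, no Cassandra required. This is a sketch: the wrapped backing array below is a stand-in for whatever buffer the driver actually returns.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ByteBufferExtract {
    public static void main(String[] args) {
        // Backing array shared with other data; only {1, 2, 3} belongs to our value.
        byte[] backing = new byte[] {9, 9, 1, 2, 3, 9};
        ByteBuffer bb = ByteBuffer.wrap(backing, 2, 3).slice();

        // Wrong: .array() returns the whole backing array (6 bytes, not 3).
        byte[] wrong = bb.array();
        System.out.println(wrong.length); // 6

        // Alternative: take bb.remaining() bytes starting at arrayOffset() + position().
        int start = bb.arrayOffset() + bb.position();
        byte[] alt = Arrays.copyOfRange(bb.array(), start, start + bb.remaining());
        System.out.println(Arrays.toString(alt)); // [1, 2, 3]

        // Recommended: copy exactly the valid bytes out of the buffer.
        byte[] data = new byte[bb.remaining()];
        bb.get(data);
        System.out.println(Arrays.toString(data)); // [1, 2, 3]
    }
}
```

This is exactly the length mismatch reported later in the thread (259861 bytes back for a 259804-byte value): the extra bytes are neighbors in the shared backing array.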




 -brian

 On Thu, Apr 11, 2013 at 9:09 AM, Brian O'Neill b...@alumni.brown.edu wrote:

 Yep, it worked like a charm.  (PreparedStatement avoided the hex
 conversion)

 But now, I'm seeing a few extra bytes come back in the select….
 (I'll keep digging, but maybe you have some insight?)

 I see this:

 ERROR [2013-04-11 13:05:03,461] com.skookle.dao.RepositoryDao:
 repository.add() byte.length()=[259804]

 ERROR [2013-04-11 13:08:08,487] com.skookle.dao.RepositoryDao:
 repository.get() [foo.jpeg] byte.length()=[259861]

 (Notice the length's don't match up)

 Using this code:

 public void addContent(String key, byte[] data)
         throws NoHostAvailableException {
     LOG.error("repository.add() byte.length()=[" + data.length + "]");
     String statement = "INSERT INTO " + KEYSPACE + "." + TABLE
             + " (key, data) VALUES (?, ?)";
     PreparedStatement ps = session.prepare(statement);
     BoundStatement bs = ps.bind(key, ByteBuffer.wrap(data));
     session.execute(bs);
 }


 public byte[] getContent(String key) throws NoHostAvailableException {
     Query select = select("data").from(KEYSPACE, TABLE).where(eq("key", key));
     ResultSet resultSet = session.execute(select);
     byte[] data = resultSet.one().getBytes("data").array();
     LOG.error("repository.get() [" + key + "] byte.length()=[" + data.length + "]");
     return data;
 }

 ---

 Brian O'Neill

 Lead Architect, Software Development

 *Health Market Science*

 *The Science of Better Results*

 2700 Horizon Drive • King of Prussia, PA • 19406

 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •

 healthmarketscience.com




 From: Sylvain Lebresne sylv...@datastax.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, April 11, 2013 8:48 AM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Cc: Gabriel Ciuloaica gciuloa...@gmail.com
 Subject: Re: Blobs in CQL?


 Hopefully, the prepared statement doesn't do the conversion.


 It does not.


 (I'm not sure if it is a limitation of the CQL protocol itself)

 thanks again,
 -brian



 ---
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
 healthmarketscience.com








 On 4/11/13 8:34 AM, Gabriel Ciuloaica gciuloa

Re: Bitmap indexes - reviving CASSANDRA-1472

2013-04-10 Thread Brian O'Neill

changing to user@
(at least until we can determine if this can/should be proposed under 1472)

For those interested in analytics and set-based queries, see below...

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com







On 4/10/13 10:43 PM, Matt Stump mrevilgn...@gmail.com wrote:

Druid was our inspiration to layer bitmap indexes on top of Cassandra.
Druid doesn't work for us because our data set is too large. We would need
many hundreds of nodes just for the pre-processed data. What I envisioned
was the ability to perform druid style queries (no aggregation) without
the
limitations imposed by having the entire dataset in memory. I primarily
need to query whether a user performed some event, but I also intend to
add
trigram indexes for LIKE, ILIKE or possibly regex style matching.

I wasn't aware of CONCISE, thanks for the pointer. We are currently
evaluating fastbit, which is a very similar project:
https://sdm.lbl.gov/fastbit/


On Wed, Apr 10, 2013 at 5:49 PM, Brian O'Neill
b...@alumni.brown.edu wrote:


 How does this compare with Druid?
 https://github.com/metamx/druid

 We're currently evaluating Acunu, Vertica and Druid...

 
http://brianoneill.blogspot.com/2013/04/bianalytics-on-big-datacassandra.html

 With its bitmapped indexes, Druid appears to have the most potential.
 They boast some pretty impressive stats, especially WRT handling
 real-time updates and adding new dimensions.

 They also use a compression algorithm, CONCISE, to cut down on the space
 requirements.
 http://ricerca.mat.uniroma3.it/users/colanton/concise.html

 I haven't looked too deep into the Druid code, but I've been meaning to
 see if it could be backed by C*.

 We'd be game to join the hunt if you pursue such a beast. (with your
code,
 or with portions of Druid)

 -brian


 On Apr 10, 2013, at 5:40 PM, mrevilgnome wrote:

  What do you think about set manipulation via indexes in Cassandra? I'm
  interested in answering queries such as give me all users that
performed
  event 1, 2, and 3, but not 4. If the answer is yes then I can make a
case
  for spending my time on C*. The only downside for us would be our
current
  prototype is in C++ so we would lose some performance and the
ability to
  dedicate an entire machine to caching/performing queries.
 
 
  On Wed, Apr 10, 2013 at 11:57 AM, Jonathan Ellis jbel...@gmail.com
 wrote:
 
  If you mean, Can someone help me figure out how to get started
updating
  these old patches to trunk and cleaning out the Avro? then yes, I've
 been
  knee-deep in indexing code recently.
 
 
  On Wed, Apr 10, 2013 at 11:34 AM, mrevilgnome mrevilgn...@gmail.com
  wrote:
 
  I'm currently building a distributed cluster on top of cassandra to
  perform
  fast set manipulation via bitmap indexes. This gives me the ability
to
  perform unions, intersections, and set subtraction across
sub-queries.
  Currently I'm storing index information for thousands of dimensions
as
  cassandra rows, and my cluster keeps this information cached,
 distributed
  and replicated in order to answer queries.
 
  Every couple of days I think to myself this should really exist in
C*.
  Given all the benefits, would there be any interest in
  reviving CASSANDRA-1472?
 
  Some downsides are that this is very memory intensive, even for
sparse
  bitmaps.
 
 
 
 
  --
  Jonathan Ellis
  Project Chair, Apache Cassandra
  co-founder, http://www.datastax.com
  @spyced
 

 --
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/
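The set-manipulation queries described in this thread — all users that performed events 1, 2, and 3, but not 4 — reduce to bitwise AND / AND-NOT over per-event bitmaps. A minimal uncompressed sketch with java.util.BitSet (CONCISE and fastbit, mentioned above, add compression on top of the same idea; the event data here is made up for illustration):

```java
import java.util.BitSet;

public class BitmapSetQuery {
    public static void main(String[] args) {
        // One bitmap per event: bit i is set when user i performed that event.
        BitSet event1 = new BitSet();
        BitSet event2 = new BitSet();
        BitSet event3 = new BitSet();
        BitSet event4 = new BitSet();

        event1.set(0); event1.set(1); event1.set(2);
        event2.set(0); event2.set(1);
        event3.set(0); event3.set(1); event3.set(3);
        event4.set(1);

        // Intersection: users who performed events 1 AND 2 AND 3 ...
        BitSet result = (BitSet) event1.clone();
        result.and(event2);
        result.and(event3);

        // ... minus anyone who performed event 4 (set subtraction).
        result.andNot(event4);

        System.out.println(result); // {0}
    }
}
```

Storing one such bitmap per dimension as a Cassandra row, as described above, lets each node answer sub-queries locally and combine them with these operations.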






BI/Analytics/Warehousing for data in C*

2013-04-01 Thread Brian O'Neill
We are trudging through an options analysis for BI/DW solutions for data
stored in C*.

I'd love to hear people's experiences.  Here is what we've found so far:
http://brianoneill.blogspot.com/2013/04/bianalytics-on-big-datacassandra.html

Maybe we just use Intravert with a custom handler to handle the dimensional
cubes?
https://github.com/zznate/intravert-ug

Then, we could slap a javascript charting framework on it and call it
cubert. =)
http://www.classicgamesarcade.com/game/21652/q*bert.html

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: any other NYC* attendees find your usb stick of the proceedings empty?

2013-03-25 Thread Brian O'Neill
I think the recorded sessions will be posted to the PlanetCassandra Youtube
channel:
http://www.planetcassandra.org/blog/post/nyc-big-data-tech-day-update

Some of the slides have been posted up to slideshare:
http://www.slideshare.net/boneill42/hms-nyc-talk
http://www.slideshare.net/edwardcapriolo/intravert

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com




From:  Brian Tarbox tar...@cabotresearch.com
Reply-To:  user@cassandra.apache.org
Date:  Monday, March 25, 2013 11:43 AM
To:  user@cassandra.apache.org
Subject:  any other NYC* attendees find your usb stick of the proceedings
empty?

Last week I attended DataStax's NYC* conference and one of the give-aways
was a wooden USB stick.  Finally getting around to loading it I find it
empty.

Anyone else have this problem?  Are the conference presentations available
somewhere else?

Brian Tarbox




Re: Netflix/Astyanax Client for Cassandra

2013-02-07 Thread Brian O'Neill

Incidentally, we run Astyanax against 1.2.1. We haven't had any issues.

When running against 1.2.0, we ran into this:
https://github.com/Netflix/astyanax/issues/191


-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com







On 2/7/13 6:58 AM, Peter Lin wool...@gmail.com wrote:

if i'm not mistaken, isn't this due to limitations of thrift versus
binary protocol? That's my understanding from datastax blogs.

unless someone really needs all the features of 1.2 like asynchronous
queries, astyanax and hector should work fine.

On Thu, Feb 7, 2013 at 1:20 AM, Gabriel Ciuloaica gciuloa...@gmail.com
wrote:
 Astyanax is not working with Cassandra 1.2.1. Only java-driver is
working
 very well with both Cassandra 1.2 and 1.2.1.

 Cheers,
 Gabi

 On 2/7/13 8:16 AM, Michael Kjellman wrote:

 It's a really great library and definitely recommended by me and many
who
 are reading this.

 And if you are just starting out on 1.2.1 with C* you might also want to
 evaluate https://github.com/datastax/java-driver and the new binary
 protocol.

 Best,
 michael

 From: Cassa L lcas...@gmail.com
 Reply-To: user@cassandra.apache.org user@cassandra.apache.org
 Date: Wednesday, February 6, 2013 10:13 PM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Subject: Netflix/Astyanax Client for Cassandra

 Hi,
  Has anyone used the Netflix/Astyanax java client library for Cassandra? I have
 used Hector before and would like to evaluate Astyanax. Not sure how it is
 accepted in the Cassandra community. Any issues with it, or advantages? The API
 looks very clean and simple compared to Hector. Has anyone used it in
 production except Netflix themselves?

 Thanks
 LCassa






Re: Accessing Metadata of Column Familes

2013-01-28 Thread Brian O'Neill
Through CQL, you see the logical schema.
Through CLI, you see the physical schema.

This may help:
http://www.datastax.com/dev/blog/cql3-for-cassandra-experts

-brian

On Mon, Jan 28, 2013 at 7:26 AM, Rishabh Agrawal
rishabh.agra...@impetus.co.in wrote:
 I found the following issues while working on Cassandra version 1.2, CQL 3, and
 Thrift protocol 19.35.0.



 Case 1:

 Using CQL I created a table t1 with columns col1 and col2 with col1 being my
 primary key.



 When I access the same data using the CLI, I see col1 gets adopted as the row key and
 col2 as another column. Now I have inserted a value in another column
 (col3) in the same row using the CLI. When I query the same table again from CQL, I
 am unable to find col3.



 Case 2:



 Using the CLI, I have created table t2. Now I added a row key row1 and two
 columns (keys) col1 and col2 with some values in each. When I access t2
 from CQL, I find the following result set with three columns:



  key  | column1 | value
 ------+---------+-------
  row1 | col1    | val1
  row1 | col2    | val2





 This behavior raises certain questions:



  · What is the reason for such a schema anomaly, or is this a problem?

 · Which schema should be deemed as correct or consistent?

 · How to access meta data on the same?





 Thanks and Regards

 Rishabh Agrawal





 From: Harshvardhan Ojha [mailto:harshvardhan.o...@makemytrip.com]
 Sent: Monday, January 28, 2013 12:57 PM


 To: user@cassandra.apache.org
 Subject: RE: Accessing Metadata of Column Familes



 You can get storage attributes from /data/system/ keyspace.



 From: Rishabh Agrawal [mailto:rishabh.agra...@impetus.co.in]
 Sent: Monday, January 28, 2013 12:42 PM
 To: user@cassandra.apache.org
 Subject: RE: Accessing Metadata of Column Familes



 Thanks for the reply.



 I do not want to go the API route. I wish to access the files and column families
 which store the metadata information.



 From: Harshvardhan Ojha [mailto:harshvardhan.o...@makemytrip.com]
 Sent: Monday, January 28, 2013 12:25 PM
 To: user@cassandra.apache.org
 Subject: RE: Accessing Metadata of Column Familes



 Which API are you using?

 If you are using Hector use ColumnFamilyDefinition.



 Regards

 Harshvardhan OJha



 From: Rishabh Agrawal [mailto:rishabh.agra...@impetus.co.in]
 Sent: Monday, January 28, 2013 12:16 PM
 To: user@cassandra.apache.org
 Subject: Accessing Metadata of Column Familes



 Hello,



 I wish to access metadata information on column families. How can I do it?
 Any ideas?



 Thanks and Regards

 Rishabh Agrawal





 







 NOTE: This message may contain information that is confidential,
 proprietary, privileged or otherwise protected by law. The message is
 intended solely for the named addressee. If received in error, please
 destroy and notify the sender. Any use of this email is prohibited when
 received in error. Impetus does not represent, warrant and/or guarantee,
 that the integrity of this communication has been maintained nor that the
 communication is free of errors, virus, interception or interference.

 The contents of this email, including the attachments, are PRIVILEGED AND
 CONFIDENTIAL to the intended recipient at the email address to which it has
 been addressed. If you receive it in error, please notify the sender
 immediately by return email and then permanently delete it from your system.
 The unauthorized use, distribution, copying or alteration of this email,
 including the attachments, is strictly forbidden. Please note that neither
 MakeMyTrip nor the sender accepts any responsibility for viruses and it is
 your responsibility to scan the email and attachments (if any). No contracts
 may be concluded on behalf of MakeMyTrip by means of email communications.



 









 

Re: cql: show tables in a keyspace

2013-01-28 Thread Brian O'Neill

cqlsh use keyspace;
cqlsh:cirrus describe tables;

For more info:
cqlsh help describe

-brian


---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com







On 1/28/13 2:27 PM, Paul van Hoven paul.van.ho...@googlemail.com wrote:

Is there some way in cql to get a list of all tables or column
families that belong to a keyspace, like show tables in sql?




Webinar: Using Storm for Distributed Processing on Cassandra

2013-01-16 Thread Brian O'Neill
Just an FYI --

We will be hosting a webinar tomorrow demonstrating the use of Storm
as a distributed processing layer on top of Cassandra.

I'll be tag teaming with Taylor Goetz, the original author of storm-cassandra.
http://www.datastax.com/resources/webinars/collegecredit

It is part of the C*ollege Credit Webinar Series from Datastax.

All are welcome.

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: Cassandra 1.2 Thrift and CQL 3 issue

2013-01-12 Thread Brian O'Neill

I reported the issue here.  You may be missing a component in your column name.

https://issues.apache.org/jira/browse/CASSANDRA-5138

-brian


On Jan 12, 2013, at 12:48 PM, Shahryar Sedghi wrote:

 Hi
 
 I am trying to test my application that runs with JDBC, CQL 3 with Cassandra 
 1.2. After getting many weird errors and downgrading from JDBC to thrift, I 
 realized the thrift on Cassandra 1.2 has issues with wide rows. If I define 
 the table as:
 
 CREATE TABLE  test(interval int,id text, body text, primary key (interval, 
 id));
 
 select interval, id, body from test;
 
  fails with:
 
 ERROR [Thrift:16] 2013-01-11 18:23:35,997 CustomTThreadPoolServer.java (line 
 217) Error occurred during processing of message.
 java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 1
 at 
 org.apache.cassandra.config.CFMetaData.getColumnDefinitionFromColumnName(CFMetaData.java:923)
 at 
 org.apache.cassandra.cql.QueryProcessor.processStatement(QueryProcessor.java:502)
 at 
 org.apache.cassandra.cql.QueryProcessor.process(QueryProcessor.java:789)
 at 
 org.apache.cassandra.thrift.CassandraServer.execute_cql_query(CassandraServer.java:1652)
 at 
 org.apache.cassandra.thrift.Cassandra$Processor$execute_cql_query.getResult(Cassandra.java:4048)
 at 
 org.apache.cassandra.thrift.Cassandra$Processor$execute_cql_query.getResult(Cassandra.java:4036)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
 at 
 org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1121)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:614)
 at java.lang.Thread.run(Thread.java:780)
 
 Same code works well with Cassandra 1.1. 
 
 At the same time, if I define the table as:
 CREATE TABLE  test1(interval int,id text, body text, primary key (interval));
 
 everything works fine. I am using 
 
 DataStax Community 1.2
 
 apache-cassandra-clientutil-1.2.0.jar
 apache-cassandra-thrift-1.2.0.jar
 libthrift-0.7.0.jar
 
 Apparently client.set_cql_version(3.0.0); has no effect either. Is there a
 setting that I am missing on the client side to dictate cql3, or is it a bug?
 
 Thanks in advance
 
 Shahryar
 
 -- 
 Life is what happens while you are making other plans. ~ John Lennon

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/



Re: Astyanax

2013-01-08 Thread Brian O'Neill
Not sure where you are on the learning curve, but I've put a couple getting
started projects out on github:
https://github.com/boneill42/astyanax-quickstart

And the latest from the webinar is here:
https://github.com/boneill42/naughty-or-nice
http://brianoneill.blogspot.com/2013/01/creating-your-frist-java-application-w.html

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com




From:  Radek Gruchalski radek.gruchal...@portico.io
Reply-To:  user@cassandra.apache.org
Date:  Tuesday, January 8, 2013 10:17 AM
To:  user@cassandra.apache.org user@cassandra.apache.org
Cc:  user@cassandra.apache.org user@cassandra.apache.org
Subject:  Re: Astyanax

Hi,

We are using Astyanax, and we found that the GitHub wiki together with Stack Overflow
is the most comprehensive set of documentation.

Do you have any specific questions?

Kind regards,
Radek Gruchalski

On 8 Jan 2013, at 15:46, Everton Lima peitin.inu...@gmail.com wrote:

 I was studying from there, but I would like to know if anyone knows other sources.
 
 2013/1/8 Markus Klems markuskl...@gmail.com
 The wiki? https://github.com/Netflix/astyanax/wiki
 
 
 On Tue, Jan 8, 2013 at 2:44 PM, Everton Lima peitin.inu...@gmail.com wrote:
 Hi,
 Could someone indicate a good tutorial or book to learn Astyanax?
 
 Thanks
 
 -- 
 Everton Lima Aleixo
 Master's student in Computer Science at UFG
 Programmer at LUPA
 
 
 
 
 
 -- 
 Everton Lima Aleixo
 Bachelor in Computer Science from UFG
 Master's student in Computer Science at UFG
 Programmer at LUPA
 




Re: Best Java Driver for Cassandra?

2012-12-13 Thread Brian O'Neill

Well, we'll talk a bit about this in my webinar later today…
http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html

I put together a quick decision matrix for all of the options based on
production-readiness, potential and momentum.  I think the slides will be
made available afterwards.

I also have a laundry list here: (written before I knew about Firebrand)
http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com







On 12/13/12 9:03 AM, stephen.m.thomp...@wellsfargo.com wrote:

There seem to be a number of good options listed ... FireBrand and Hector
seem to have the most attractive sites, but that doesn't necessarily mean
anything.  :)  Can anybody make a case for one of the drivers over
another, especially in terms of which ones seem to be most used in major
implementations?

Thanks
Steve




Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra

2012-12-12 Thread Brian O'Neill
FWIW --
I'm presenting tomorrow for the Datastax C*ollege Credit Webinar Series:
http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html

I hope to make CQL part of the presentation and show how it integrates
with the Java APIs.
If you are interested, drop in.

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: Datatype Conversion in CQL-Client?

2012-11-19 Thread Brian O'Neill
I don't think Michael and/or Jonathan have published the CQL java driver
yet.  (CCing them)

Hopefully they'll find a public home for it soon, I hope to include it in
the Webinar in December.
(http://www.datastax.com/resources/webinars/collegecredit)

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com




From:  Tommi Laukkanen tlaukka...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Monday, November 19, 2012 2:36 AM
To:  user@cassandra.apache.org
Subject:  Re: Datatype Conversion in CQL-Client?

I think Timmy might be referring to the upcoming native CQL Java driver that
might be coming with 1.2 - It was mentioned here:
http://www.datastax.com/wp-content/uploads/2012/08/7_Datastax_Upcoming_Changes_in_Drivers.pdf

I would also be interested on testing that but I can't find it from
repositories. Any hints?

Regards,
Tommi L.

From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill
 Sent: 18 November 2012 17:47
 To: user@cassandra.apache.org
 Subject: Re: Datatype Conversion in CQL-Client?
 Importance: Low
  
 
  
 If you are talking about the CQL-client that comes with Cassandra (cqlsh), it
 is actually written in Python:
 
 https://github.com/apache/cassandra/blob/trunk/bin/cqlsh
 
  
 
 For information on datatypes (and conversion) take a look at the CQL
 definition:
 
 http://www.datastax.com/docs/1.0/references/cql/index
 
 (Look at the CQL Data Types section)
 
  
 
 If that's not the client you are referencing, let us know which one you mean:
 
 http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html
 
  
 
 -brian
 
  
 
 On Nov 17, 2012, at 9:54 PM, Timmy Turner wrote:
 
 
 
 Thanks for the links, however I'm interested in the functionality that the
 official Cassandra client/API (which is in Java) offers.
 
  
 
 2012/11/17 aaron morton aa...@thelastpickle.com
 
 Does the official/built-in Cassandra CQL client (in 1.2)
 What language ?
 
  
 
 Check the Java http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/
 and python http://code.google.com/a/apache-extras.org/p/cassandra-dbapi2/
 drivers.
 
  
 
 Cheers
 
  
 
  
 
 -
 
 Aaron Morton
 
 Freelance Cassandra Developer
 
 New Zealand
 
  
 
 @aaronmorton
 
 http://www.thelastpickle.com/
 
  
 
 On 16/11/2012, at 11:21 AM, Timmy Turner timm.t...@gmail.com wrote:
 
 
 
 Does the official/built-in Cassandra CQL client (in 1.2) offer any built-in
 option to get direct values/objects when reading a field, instead of just a
 byte array? 
  
  
  
 
 -- 
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile: 215.588.6024
 blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/
  





Re: Datatype Conversion in CQL-Client?

2012-11-19 Thread Brian O'Neill

Gotcha Timmy.  That is the Thrift API.  You are operating at a pretty
low-level.   I'm not sure that is considered the official CQL client.
IMHO, you might be better off moving up a level.  I'd probably either wait
for the official CQL Java Driver, or access CQL via a higher-level client
like Hector.

If you stick with Thrift, I think you can access the Schema metadata:
https://github.com/apache/cassandra/blob/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/CqlMetadata.java
(Those are the generated classes for the Thrift interface)

But I'm not sure where the code is to apply that metadata to the result set
in:
https://github.com/apache/cassandra/blob/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/CqlResult.java

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com


This information transmitted in this email message is for the intended
recipient only and may contain confidential and/or privileged material. If
you received this email in error and are not the intended recipient, or the
person responsible to deliver it to the intended recipient, please contact
the sender at the email above and delete this email and any attachments and
destroy any copies thereof. Any review, retransmission, dissemination,
copying or other use of, or taking any action in reliance upon, this
information by persons or entities other than the intended recipient is
strictly prohibited.
 


From:  Timmy Turner timm.t...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Monday, November 19, 2012 9:48 AM
To:  user@cassandra.apache.org
Subject:  Re: Datatype Conversion in CQL-Client?

What I meant was the method that the Cassandra-jars give you when you
include them in your project:

  TTransport tr = new TFramedTransport(new TSocket("localhost", 9160));
  TProtocol proto = new TBinaryProtocol(tr);
  Cassandra.Client client = new Cassandra.Client(proto);
  tr.open();
  client.execute_cql_query(ByteBuffer.wrap(cql.getBytes()), Compression.NONE);



2012/11/19 Brian O'Neill b...@alumni.brown.edu
 I don't think Michael and/or Jonathan have published the CQL java driver yet.
 (CCing them)
 
 Hopefully they'll find a public home for it soon, I hope to include it in the
 Webinar in December.
 (http://www.datastax.com/resources/webinars/collegecredit)
 
 -brian
 
 ---
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com
 
 
 
 
 From:  Tommi Laukkanen tlaukka...@gmail.com
 Reply-To:  user@cassandra.apache.org
 Date:  Monday, November 19, 2012 2:36 AM
 
 To:  user@cassandra.apache.org
 Subject:  Re: Datatype Conversion in CQL-Client?
 
 I think Timmy might be referring to the upcoming native CQL Java driver that
 might be coming with 1.2 - It was mentioned here:
 http://www.datastax.com/wp-content/uploads/2012/08/7_Datastax_Upcoming_Changes_in_Drivers.pdf
 
 I would also be interested on testing that but I can't find it from
 repositories. Any hints?
 
 Regards,
 Tommi L.
 
 From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill
 Sent: 18. marraskuuta 2012 17:47
 To: user@cassandra.apache.org
 Subject: Re: Datatype Conversion in CQL-Client?
 Importance: Low
  
 
  
 If you are talking about the CQL-client that comes with Cassandra (cqlsh), it
 is actually written in Python:
 
 https://github.com/apache/cassandra/blob/trunk/bin/cqlsh
 
  
 
 For information on datatypes (and conversion) take a look at the CQL
 definition:
 
 http://www.datastax.com/docs/1.0/references/cql/index
 
 (Look at the CQL Data Types section)
 
  
 
 If that's not the client you are referencing, let us know which one you mean:
 
 http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html
 
  
 
 -brian
 
  
 
 On Nov 17, 2012, at 9:54 PM, Timmy Turner wrote:
 
 
 
 Thanks for the links, however I'm interested in the functionality that the
 official Cassandra client/API (which is in Java) offers.
 
  
 
 2012/11/17 aaron morton aa...@thelastpickle.com
 
 Does the official/built-in Cassandra

Re: Datastax Java Driver

2012-11-19 Thread Brian O'Neill
Woohoo!

Thanks for making this available.

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com




From:  Sylvain Lebresne sylv...@datastax.com
Reply-To:  user@cassandra.apache.org
Date:  Monday, November 19, 2012 1:50 PM
To:  user@cassandra.apache.org user@cassandra.apache.org
Subject:  Datastax Java Driver

Everyone,

We've just open-sourced a new Java driver we have been working on here at
DataStax. This driver is CQL3 only and is built to use the new binary
protocol
that will be introduced with Cassandra 1.2. It will thus only work with
Cassandra 1.2 onwards. Currently, it means that testing it requires
1.2.0-beta2. This is also alpha software at this point. You are welcome to
try
and play with it and we would very much welcome feedback, but be sure that
break, it will. The driver is accessible at:
  http://github.com/datastax/java-driver

Today we're open-sourcing the core part of this driver. This main goal of
this
core module is to handle connections to the Cassandra cluster with all the
features that one would expect. The currently supported features are:
  - Asynchronous: the driver uses the new CQL binary protocol asynchronous
capabilities.
  - Nodes discovery.
  - Configurable load balancing/routing.
  - Transparent fail-over.
  - C* tracing handling.
  - Convenient schema access.
  - Configurable retry policy.
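To illustrate what a "configurable retry policy" means in practice, here is a generic sketch in plain Python. This is not the driver's actual API; the names (execute_with_retry, should_retry) are invented for illustration, but the shape of the idea is the same: the policy decides, per error and attempt, whether to try again.

```python
def execute_with_retry(operation, should_retry, max_attempts=3):
    """Generic retry loop: the pluggable policy (should_retry) decides,
    given the exception and the attempt number, whether to retry."""
    attempt = 0
    while True:
        try:
            return operation()
        except Exception as exc:
            attempt += 1
            # Give up if we're out of attempts or the policy says no.
            if attempt >= max_attempts or not should_retry(exc, attempt):
                raise

# A policy that retries only timeouts, against a query that fails twice:
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("node did not answer")
    return "row"

result = execute_with_retry(flaky_query,
                            lambda exc, n: isinstance(exc, TimeoutError))
```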

This core module provides a simple low-level API (that works directly with
query strings). We plan to release a higher-level, thin object mapping API
based on top of this core shortly.

Please refer to the project README for more information.

--
The DataStax Team




Re: Datatype Conversion in CQL-Client?

2012-11-19 Thread Brian O'Neill

Hector does, but the newer clients/drivers no longer use Thrift.  (Thrift is
the legacy protocol)

If you are still in early stages and you know you want your primary
interface to be CQL, you may want to look at the java driver that Datastax
just released.
  http://github.com/datastax/java-driver

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com




From:  Timmy Turner timm.t...@gmail.com
Reply-To:  user@cassandra.apache.org
Date:  Monday, November 19, 2012 3:37 PM
To:  user@cassandra.apache.org
Subject:  Re: Datatype Conversion in CQL-Client?

Do these other clients use the thrift API internaly?


2012/11/19 John Sanda john.sa...@gmail.com
 You might want to take  look a org.apache.cassandra.transport.SimpleClient and
 org.apache.cassandra.transport.messages.ResultMessage.
 
 
 On Mon, Nov 19, 2012 at 9:48 AM, Timmy Turner timm.t...@gmail.com wrote:
 What I meant was the method that the Cassandra-jars give you when you include
 them in your project:
 
   TTransport tr = new TFramedTransport(new TSocket("localhost", 9160));
   TProtocol proto = new TBinaryProtocol(tr);
   Cassandra.Client client = new Cassandra.Client(proto);
   tr.open();
   client.execute_cql_query(ByteBuffer.wrap(cql.getBytes()), Compression.NONE);
 
 
 
 2012/11/19 Brian O'Neill b...@alumni.brown.edu
 I don't think Michael and/or Jonathan have published the CQL java driver
 yet.  (CCing them)
 
 Hopefully they'll find a public home for it soon, I hope to include it in
 the Webinar in December.
 (http://www.datastax.com/resources/webinars/collegecredit)
 
 -brian
 
 ---
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com
 
 
 
 
 From:  Tommi Laukkanen tlaukka...@gmail.com
 Reply-To:  user@cassandra.apache.org
 Date:  Monday, November 19, 2012 2:36 AM
 
 To:  user@cassandra.apache.org
 Subject:  Re: Datatype Conversion in CQL-Client?
 
 I think Timmy might be referring to the upcoming native CQL Java driver that
 might be coming with 1.2 - It was mentioned here:
 http://www.datastax.com/wp-content/uploads/2012/08/7_Datastax_Upcoming_Changes_in_Drivers.pdf
 
 I would also be interested on testing that but I can't find it from
 repositories. Any hints?
 
 Regards,
 Tommi L.
 
 From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill
 Sent: 18. marraskuuta 2012 17:47
 To: user@cassandra.apache.org
 Subject: Re: Datatype Conversion in CQL-Client?
 Importance: Low
  
 
  
 If you are talking about the CQL-client that comes with Cassandra (cqlsh),
 it is actually written in Python:
 
 https://github.com/apache/cassandra/blob/trunk/bin/cqlsh
 
  
 
 For information on datatypes (and conversion) take a look at the CQL
 definition:
 
 http://www.datastax.com/docs/1.0/references/cql/index
 
 (Look at the CQL Data Types section)
 
  
 
 If that's not the client you are referencing, let us know which one you
 mean:
 
 http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html
 
  
 
 -brian
 
  
 
 On Nov 17, 2012, at 9:54 PM, Timmy Turner wrote:
 
 
 
 Thanks for the links, however I'm interested in the functionality that the
 official Cassandra client/API (which is in Java) offers.
 
  
 
 2012/11/17 aaron morton aa...@thelastpickle.com
 
 Does the official/built-in Cassandra CQL client (in 1.2)
 What language ?
 
  
 
 Check the Java
 http://code.google.com/a/apache-extras.org/p/cassandra-jdbc

Re: Datatype Conversion in CQL-Client?

2012-11-18 Thread Brian O'Neill

If you are talking about the CQL-client that comes with Cassandra (cqlsh), it 
is actually written in Python:
https://github.com/apache/cassandra/blob/trunk/bin/cqlsh

For information on datatypes (and conversion) take a look at the CQL definition:
http://www.datastax.com/docs/1.0/references/cql/index
(Look at the CQL Data Types section)

If that's not the client you are referencing, let us know which one you mean:
http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html

-brian

On Nov 17, 2012, at 9:54 PM, Timmy Turner wrote:

 Thanks for the links, however I'm interested in the functionality that the 
 official Cassandra client/API (which is in Java) offers.
 
 
 2012/11/17 aaron morton aa...@thelastpickle.com
 Does the official/built-in Cassandra CQL client (in 1.2) 
 What language ? 
 
 Check the Java http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/ 
 and python http://code.google.com/a/apache-extras.org/p/cassandra-dbapi2/ 
 drivers.
 
 Cheers
 
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 16/11/2012, at 11:21 AM, Timmy Turner timm.t...@gmail.com wrote:
 
 Does the official/built-in Cassandra CQL client (in 1.2) offer any built-in 
 option to get direct values/objects when reading a field, instead of just a 
 byte array?
 
 

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/



Re: [BETA RELEASE] Apache Cassandra 1.2.0-beta2 released

2012-11-10 Thread Brian O'Neill

Wow...good catch.

We had puppet scripts which automatically assigned the proper tokens given the 
cluster size.
What is the range now?  Got a link?

-brian

On Nov 10, 2012, at 9:27 PM, Edward Capriolo wrote:

 just a note for all. The default partitioner is no longer RandomPartitioner.
 It is now Murmur3Partitioner, and the token range starts in negative numbers. So you
 don't choose tokens like your father taught you anymore.
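For reference, Murmur3Partitioner tokens range over -2^63 to 2^63 - 1, unlike RandomPartitioner's 0 to 2^127 - 1. A quick sketch (not an official tool) for computing evenly spaced initial tokens in the new range:

```python
def murmur3_initial_tokens(node_count):
    """Evenly spaced initial tokens for Murmur3Partitioner.

    The token range is -2**63 .. 2**63 - 1, so old token-assignment
    scripts written for RandomPartitioner need updating.
    """
    return [(2**64 // node_count) * i - 2**63 for i in range(node_count)]

# Example: initial tokens for a 4-node cluster.
for token in murmur3_initial_tokens(4):
    print(token)
```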
 
 On Friday, November 9, 2012, Sylvain Lebresne sylv...@datastax.com wrote:
  The Cassandra team is pleased to announce the release of the second beta for
  the future Apache Cassandra 1.2.0.
  Let me first stress that this is beta software and as such is *not* ready for
  production use.
  This release is still beta so is likely not bug free. However, lots have been
  fixed since beta1 and if everything goes right, we are hopeful that a first
  release candidate may follow shortly. Please do help testing this beta to help
  make that happen. If you encounter any problem during your testing, please
  report[3,4] them. And be sure to take a look at the change log[1] and the release
  notes[2] to see where Cassandra 1.2 differs from the previous series.
  Apache Cassandra 1.2.0-beta2[5] is available as usual from the cassandra
  website (http://cassandra.apache.org/download/) and a debian package is
  available using the 12x branch (see 
  http://wiki.apache.org/cassandra/DebianPackaging).
  Thank you for your help in testing and have fun with it.
  [1]: http://goo.gl/wnDAV (CHANGES.txt)
  [2]: http://goo.gl/CBsqs (NEWS.txt)
  [3]: https://issues.apache.org/jira/browse/CASSANDRA
  [4]: user@cassandra.apache.org
  [5]: 
  http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/cassandra-1.2.0-beta2
 

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/



Indexing Data in Cassandra with Elastic Search

2012-11-08 Thread Brian O'Neill
For those looking to index data in Cassandra with Elastic Search, here
is what we decided to do:
http://brianoneill.blogspot.com/2012/11/big-data-quadfecta-cassandra-storm.html

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: logging servers? any interesting in one for cassandra?

2012-11-07 Thread Brian O'Neill

Thanks Dean.  We'll definitely take a look.  (probably in January)

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com







On 11/6/12 11:19 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

Sure, in our playing around, we have an awesome log back configuration for
development time only that shows warning, severe in red in eclipse and
let's you click on every single log taking you right to the code that
logged it…(thought you might enjoy it)...

https://github.com/deanhiller/playorm/blob/master/input/javasrc/logback.xml


The java appender is here(called CassandraAppender)
https://github.com/deanhiller/playorm/tree/master/input/javasrc/com/alvazan/play/logging


The AsyncAppender there is different from logback's in that it allows
bursting, but once it reaches the limit it essentially becomes synchronous
again, which lets us avoid dropping logs (unlike logback's) while still
allowing bursts of performance.

The CircularBufferAppender is an in-memory buffer that flushes all logs of
level X and above to the child appender when a warning or severe happens,
where X is configurable.
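A rough sketch of that flush-on-warning idea (the names here are hypothetical; the real appender is the Java code linked above):

```python
from collections import deque

class BufferingAppender:
    """Hold recent low-level events in a bounded buffer, and flush the
    whole buffer to the child appender when a WARN/ERROR arrives, so the
    warning ships with its surrounding context."""

    def __init__(self, sink, capacity=100, flush_levels=("WARN", "ERROR")):
        self.buffer = deque(maxlen=capacity)  # oldest events fall off
        self.sink = sink                      # child appender (here: a list)
        self.flush_levels = flush_levels

    def append(self, level, message):
        self.buffer.append((level, message))
        if level in self.flush_levels:
            while self.buffer:                # flush buffered context + event
                self.sink.append(self.buffer.popleft())

sink = []
app = BufferingAppender(sink, capacity=10)
app.append("INFO", "starting request")
app.append("DEBUG", "loaded row")
app.append("WARN", "slow query")  # triggers flush of all three events
```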

We have only tested out the CassandraAppender at this point.  Right now
you have to call CassandraAppender.setFactory to supply the
NoSqlEntityManager factory.  It creates LogEvent rows as well as an index
on the session, partitioned by the first two characters of the web session
id so there is an index per partition.  This allows us to look at a single
web session of a user.  The only thing I don't like is that we have to do
a read when updating the index to be able to delete old values in the
index (ick), but I couldn't figure any other way around that.
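That partition-by-prefix index scheme can be sketched as follows (function and key names are hypothetical, for illustration only):

```python
def session_index_row_key(session_id, prefix_len=2):
    """One index row per partition: all sessions whose ids share the
    first two characters land in the same index row, spreading the
    index across the cluster instead of hot-spotting one row."""
    return "sessionIdx#" + session_id[:prefix_len]

# Sessions "ab12..." and "ab99..." share an index row; "cd77..." does not.
```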

Also, if you have high event rates, there is an MDCLevelFilter, so you can
tag the MDC with something like user=__program__ and ignore all logs for
that user unless they are warning logs; we use this to keep the log volume
from getting huge.

Later,
Dean


On 11/6/12 6:32 AM, Brian O'Neill b...@alumni.brown.edu wrote:

Nice Dean…

I'm not so sure we would run the server, but we'd definitely be
interested
in the logback adaptor.
(We would then just access the data via Virgil (over REST), with a thin
javascript UI)

Let me/us know if you end up putting it out there.  We intend centralize
logging sometime over the next few months.

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com







On 11/1/12 10:33 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

2 questions

 1.  What are people using for logging servers for their web tier
logging?
 2.  Would anyone be interested in a new logging server(any programming
language) for web tier to log to your existing cassandra(it uses up disk
space in proportion to number of web servers and just has a rolling
window of logs along with a window of threshold dumps)?

Context for second question: I like less systems since it is less
maintenance/operations cost and so yesterday I quickly wrote up some log
back appenders which support (SLF4J/log4j/jdk/commons libraries) and
send
the logs from our client tier into cassandra.  It is simply a rolling
window of logs so the space used in cassandra is proportional to the
amount of web  servers I have(currently, I have 4 web servers).  I am
also thinking about adding warning type logging such that on warning,
the
last N logs info and above are flushed along with the warning so
basically two rolling windows.  Then in the GUI, it simply shows

Re: logging servers? any interesting in one for cassandra?

2012-11-06 Thread Brian O'Neill
Nice Dean…

I'm not so sure we would run the server, but we'd definitely be interested
in the logback adaptor.
(We would then just access the data via Virgil (over REST), with a thin
javascript UI)

Let me/us know if you end up putting it out there.  We intend centralize
logging sometime over the next few months.

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com







On 11/1/12 10:33 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

2 questions

 1.  What are people using for logging servers for their web tier logging?
 2.  Would anyone be interested in a new logging server(any programming
language) for web tier to log to your existing cassandra(it uses up disk
space in proportion to number of web servers and just has a rolling
window of logs along with a window of threshold dumps)?

Context for second question: I like less systems since it is less
maintenance/operations cost and so yesterday I quickly wrote up some log
back appenders which support (SLF4J/log4j/jdk/commons libraries) and send
the logs from our client tier into cassandra.  It is simply a rolling
window of logs so the space used in cassandra is proportional to the
amount of web  servers I have(currently, I have 4 web servers).  I am
also thinking about adding warning type logging such that on warning, the
last N logs info and above are flushed along with the warning so
basically two rolling windows.  Then in the GUI, it simply shows the logs
and if you click on a session, it switches to a view with all the logs
for that session(no matter which server since in our cluster the session
switches servers on every request since we are stateless… our session id
is in the cookie).
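The rolling-window storage idea described above can be sketched in a few lines (a toy in-memory model, not the actual Cassandra-backed implementation):

```python
from collections import deque

class RollingLogWindow:
    """Keep only the last N log events per web server, so total storage
    stays proportional to the number of servers rather than growing
    without bound."""

    def __init__(self, max_events=10000):
        self.events = deque(maxlen=max_events)  # oldest entries fall off

    def append(self, event):
        self.events.append(event)

w = RollingLogWindow(max_events=3)
for i in range(5):
    w.append("log line %d" % i)
# Only the newest three lines survive in the window.
```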

Well, let me know if anyone is interested and would actually use such a
thing and if so, we might create a server around it.

Thanks,
Dean




Keeping the record straight for Cassandra Benchmarks...

2012-10-25 Thread Brian O'Neill
People probably saw...
http://www.networkworld.com/cgi-bin/mailto/x.cgi?pagetosend=/news/tech/2012/102212-nosql-263595.html

To clarify things take a look at...
http://brianoneill.blogspot.com/2012/10/solid-nosql-benchmarks-from-ycsb-w-side.html

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: Using compound primary key

2012-10-08 Thread Brian O'Neill
Hey Vivek,

The same thing happened to me the other day.  You may be missing a component in 
your compound key.

See this thread:
http://mail-archives.apache.org/mod_mbox/cassandra-dev/201210.mbox/%3ccajhhpg20rrcajqjdnf8sf7wnhblo6j+aofksgbxyxwcoocg...@mail.gmail.com%3E

I also wrote a couple blogs on it:
http://brianoneill.blogspot.com/2012/09/composite-keys-connecting-dots-between.html
http://brianoneill.blogspot.com/2012/10/cql-astyanax-and-compoundcomposite-keys.html

They've fixed this in the 1.2 beta, whereby it checks (at the thrift layer) to 
ensure you have the requisite number of components in the compound/composite 
key.

-brian


On Oct 8, 2012, at 10:32 PM, Vivek Mishra wrote:

 Certainly. As these are available with cql3 only! 
 Example mentioned on datastax website is working fine, only difference is i 
 tried with a compound primary key with 3 composite columns in place of 2
 
 -Vivek
 
 On Tue, Oct 9, 2012 at 7:57 AM, Arindam Barua aba...@247-inc.com wrote:
  
 
 Did you use the “--cql3” option with the cqlsh command?
 
  
 
 From: Vivek Mishra [mailto:mishra.v...@gmail.com] 
 Sent: Monday, October 08, 2012 7:22 PM
 To: user@cassandra.apache.org
 
 
 Subject: Using compound primary key
 
  
 
 Hi,
 
  
 
 I am trying to use compound primary key column name and i am referring to:
 
 http://www.datastax.com/dev/blog/whats-new-in-cql-3-0
 
  
 
 As mentioned on this example, i tried to create a column family containing 
 compound primary key (one or more) as:
 
  
 
 CREATE TABLE altercations (
     instigator text,
     started_at timestamp,
     ships_destroyed int,
     energy_used float,
     alliance_involvement boolean,
     PRIMARY KEY (instigator, started_at, ships_destroyed)
 );
 
  
 
And I am getting:

**
TSocket read 0 bytes
cqlsh:testcomp
**
 
Then followed by insert and select statements giving me the following errors:

cqlsh:testcomp INSERT INTO altercations (instigator, started_at, ships_destroyed,
           ...                           energy_used, alliance_involvement)
           ...                           VALUES ('Jayne Cobb', '2012-07-23', 2, 4.6, 'false');
TSocket read 0 bytes
 
cqlsh:testcomp select * from altercations;
Traceback (most recent call last):
  File "bin/cqlsh", line 1008, in perform_statement
    self.cursor.execute(statement, decoder=decoder)
  File "bin/../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cursor.py", line 117, in execute
    response = self.handle_cql_execution_errors(doquery, prepared_q, compress)
  File "bin/../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cursor.py", line 132, in handle_cql_execution_errors
    return executor(*args, **kwargs)
  File "bin/../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cassandra/Cassandra.py", line 1583, in execute_cql_query
    self.send_execute_cql_query(query, compression)
  File "bin/../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cassandra/Cassandra.py", line 1593, in send_execute_cql_query
    self._oprot.trans.flush()
  File "bin/../lib/thrift-python-internal-only-0.7.0.zip/thrift/transport/TTransport.py", line 293, in flush
    self.__trans.write(buf)
  File "bin/../lib/thrift-python-internal-only-0.7.0.zip/thrift/transport/TSocket.py", line 117, in write
    plus = self.handle.send(buff)
error: [Errno 32] Broken pipe
 
  
 
cqlsh:testcomp
 Any idea?  Is it a problem with CQL3 or with cassandra?
 
  
 
 P.S: I did post same query on dev group as well to get a quick response.
 
  
 
  
 
 -Vivek
 
 

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/



Re: 1000's of column families

2012-10-02 Thread Brian O'Neill

Without putting too much thought into it...

Given the underlying architecture, I think you could/would have to write
your own partitioner, which would partition based on the prefix/virtual
keyspace.  

-brian
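The prefixing approach itself is tiny. A minimal sketch (hypothetical helper names, assuming a tenant id and a separator character, in the spirit of Hector's virtual keyspaces):

```python
def virtual_key(tenant, row_key, sep=":"):
    """Prefix a row key with a tenant id to emulate a 'virtual keyspace'
    inside one physical column family."""
    if sep in tenant:
        raise ValueError("tenant id must not contain the separator")
    return tenant + sep + row_key

def split_virtual_key(key, sep=":"):
    """Recover (tenant, row_key) from a prefixed key."""
    tenant, _, row = key.partition(sep)
    return tenant, row

# "acme" and "globex" rows share a CF but never collide:
k = virtual_key("acme", "user42")  # "acme:user42"
```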

---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com







On 10/2/12 9:00 AM, Ben Hood 0x6e6...@gmail.com wrote:

Dean,

On Tue, Oct 2, 2012 at 1:37 PM, Hiller, Dean dean.hil...@nrel.gov wrote:
 Ben,
   to address your question, read my last post but to summarize, yes,
there
 is less overhead in memory to prefix keys than manage multiple Cfs
EXCEPT
 when doing map/reduce.  Doing map/reduce, you will now have HUGE
overhead
 in reading a whole slew of rows you don't care about as you can't
 map/reduce a single virtual CF but must map/reduce the whole CF wasting
 TONS of resources.

That's a good point that I hadn't considered beforehand, especially as
I'd like to run MR jobs against these CFs.

Is this limitation inherent in the way that Cassandra is modelled as
input for Hadoop or could you write a custom slice query to only feed
in one particular prefix into Hadoop?

Cheers,

Ben
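One way to approximate what Ben describes is a prefix filter in the mapper itself. Purely a sketch: the stock Cassandra Hadoop input format still reads every row, so this only saves downstream work, not I/O, which is exactly the overhead Dean is pointing out.

```python
def tenant_mapper(row_key, columns, tenant="acme", sep=":"):
    """Hypothetical Hadoop-style mapper that keeps only rows belonging
    to one 'virtual CF' identified by a key prefix. Rows for other
    tenants are still read from Cassandra, then skipped here."""
    if not row_key.startswith(tenant + sep):
        return  # some other tenant's row: contribute nothing
    yield row_key, len(columns)

rows = [("acme:user1", ["a", "b"]), ("other:user9", ["c"])]
output = [kv for key, cols in rows for kv in tenant_mapper(key, cols)]
```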




Re: 1000's of CF's. virtual CFs do NOT work… map/reduce

2012-10-02 Thread Brian O'Neill
Dean,

Great point.  I hadn't considered that either.  Per my other email, I think
we would need a custom partitioner for this? (a mix of
OrderPreservingPartitioner and RandomPartitioner, OPP for the prefix)

-brian

---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 • healthmarketscience.com







On 10/2/12 8:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

So basically, with moving towards the 1000's of CF all being put in one
CF, our performance is going to tank on map/reduce, correct?  I mean, from
what I remember we could do map/reduce on a single CF, but by stuffing
1000's of virtual Cf's into one CF, our map/reduce will have to read in
all 999 virtual CF's rows that we don't want just to map/reduce the ONE
CF.

Map/reduce VERY VERY SLOW when reading in 1000 times more rows :( :(.

Is this correct?  This really sounds like highly undesirable behavior.
There needs to be a way for people with 1000's of CF's to also run
map/reduce on any one CF.  Doing Map/reduce on 1000 times the number of
rows will be 1000 times slower… and of course, we will most likely get up
to 20,000 tables from my most recent projections… our last test load, we
ended up with 8k+ CF's.  Since I kept two other keyspaces, cassandra
started getting really REALLY slow when we got up to 15k+ CF's in the
system though I didn't look into why.

I don't mind having 1000's of virtual CF's in ONE CF, BUT I need to
map/reduce just the virtual CF!  Ugh.

Thanks,
Dean

On 10/1/12 3:38 PM, Ben Hood 0x6e6...@gmail.com wrote:

On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill b...@alumni.brown.edu
wrote:
 Its just a convenient way of prefixing:
 
http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html

So given that it is possible to use a CF per tenant, should we assume
that, at sufficient scale, there is less overhead in prefixing keys
than in managing multiple CFs?

Ben
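
For the archive: the key-prefixing approach being discussed can be sketched in a few lines. This is an illustrative sketch only, not Hector's actual implementation; the separator and function names are assumptions.

```python
# Sketch of virtual-keyspace key prefixing: all tenants share one physical
# CF, and each logical row key is namespaced by a tenant prefix.
SEPARATOR = ":"  # assumed separator; Hector's prefix scheme is configurable

def to_physical_key(tenant, row_key):
    """Map a (tenant, logical key) pair into the shared CF's key space."""
    return tenant + SEPARATOR + row_key

def from_physical_key(physical_key):
    """Recover (tenant, logical key) from a stored row key."""
    tenant, _, row_key = physical_key.partition(SEPARATOR)
    return tenant, row_key
```

For example, `to_physical_key("tenantA", "user42")` yields `"tenantA:user42"`, and `from_physical_key` inverts it.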





Re: 1000's of column families

2012-10-02 Thread Brian O'Neill

Agreed. 

Do we know yet what the overhead is for each column family?  What is the
limit?
If you have a SINGLE keyspace w/ 2+ CF's, what happens?  Anyone know?

-brian


---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com







On 10/2/12 9:28 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

Thanks for the idea… (but please keep thinking on it)...

100% what we don't want since partitioned data resides on the same node.
I want to map/reduce the column families and leverage the parallel disks

:( :(

I am sure others would want to do the same… We almost need a feature of
virtual Column Families, and column family should really not be column
family but should be called ReplicationGroup or something, where
replication is configured for all CF's in that group.

ANYONE have any other ideas???

Dean

On 10/2/12 7:20 AM, Brian O'Neill boneil...@gmail.com wrote:


Without putting too much thought into it...

Given the underlying architecture, I think you could/would have to write
your own partitioner, which would partition based on the prefix/virtual
keyspace.  

-brian

---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com







On 10/2/12 9:00 AM, Ben Hood 0x6e6...@gmail.com wrote:

Dean,

On Tue, Oct 2, 2012 at 1:37 PM, Hiller, Dean dean.hil...@nrel.gov
wrote:
 Ben,
   to address your question, read my last post but to summarize, yes,
there
 is less overhead in memory to prefix keys than manage multiple Cfs
EXCEPT
 when doing map/reduce.  Doing map/reduce, you will now have HUGE
overhead
 in reading a whole slew of rows you don't care about as you can't
 map/reduce a single virtual CF but must map/reduce the whole CF
wasting
 TONS of resources.

That's a good point that I hadn't considered beforehand, especially as
I'd like to run MR jobs against these CFs.

Is this limitation inherent in the way that Cassandra is modelled as
input for Hadoop or could you write a custom slice query to only feed
in one particular prefix into Hadoop?

Cheers,

Ben







Re: 1000's of CF's. virtual CFs possible Map/Reduce SOLUTION...

2012-10-02 Thread Brian O'Neill

Dean,

We moved away from Hadoop and M/R, and instead we are using Storm as our
compute grid.  We queue keys in Kafka, then Storm distributes the work to
the grid.  It's working well so far, but we haven't taken it to prod yet.
Data is read from Cassandra using a Cassandra-bolt.

If you end up using Storm, let me know.  We have an unreleased version of
the bolt that you probably want to use.  (we're waiting on Nathan/Storm to
fix some classpath loading issues)

RE: a custom virtual keyspace Partitioner, point well taken

-brian

---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com







On 10/2/12 9:33 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

Well, I think I know the direction we may follow so we can
1. Have Virtual CF's
2. Be able to map/reduce ONE Virtual CF

Well, not map/reduce exactly but really really close.  We use PlayOrm with
it's partitioning so I am now thinking what we will do is have a compute
grid  where we can have each node doing a findAll query into the
partitions it is responsible for.  In this way, I think we can 1000's of
virtual CF's inside ONE CF and then PlayOrm does it's query and retrieves
the rows for that partition of one virtual CF.

Anyone know of a compute grid we can dish out work to?  That would be my
only missing piece (well, that and the PlayOrm virtual CF feature but I
can add that within a week probably though I am on vacation this Thursday
to monday).

Later,
Dean


On 10/2/12 6:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

So basically, with moving towards the 1000's of CF all being put in one
CF, our performance is going to tank on map/reduce, correct?  I mean,
from
what I remember we could do map/reduce on a single CF, but by stuffing
1000's of virtual Cf's into one CF, our map/reduce will have to read in
all 999 virtual CF's rows that we don't want just to map/reduce the ONE
CF.

Map/reduce VERY VERY SLOW when reading in 1000 times more rows :( :(.

Is this correct?  This really sounds like highly undesirable behavior.
There needs to be a way for people with 1000's of CF's to also run
map/reduce on any one CF.  Doing Map/reduce on 1000 times the number of
rows will be 1000 times slower… and of course, we will most likely get up
to 20,000 tables from my most recent projections… our last test load, we
ended up with 8k+ CF's.  Since I kept two other keyspaces, cassandra
started getting really REALLY slow when we got up to 15k+ CF's in the
system though I didn't look into why.

I don't mind having 1000's of virtual CF's in ONE CF, BUT I need to
map/reduce just the virtual CF!  Ugh.

Thanks,
Dean

On 10/1/12 3:38 PM, Ben Hood 0x6e6...@gmail.com wrote:

On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill b...@alumni.brown.edu
wrote:
 Its just a convenient way of prefixing:
 
http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html

So given that it is possible to use a CF per tenant, should we assume
that, at sufficient scale, there is less overhead in prefixing keys
than in managing multiple CFs?

Ben






Re: 1000's of column families

2012-10-02 Thread Brian O'Neill
Exactly.

---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com







On 10/2/12 9:55 AM, Ben Hood 0x6e6...@gmail.com wrote:

Brian,

On Tue, Oct 2, 2012 at 2:20 PM, Brian O'Neill boneil...@gmail.com wrote:

 Without putting too much thought into it...

 Given the underlying architecture, I think you could/would have to write
 your own partitioner, which would partition based on the prefix/virtual
 keyspace.

I might be barking up the wrong tree here, but looking at source of
ColumnFamilyInputFormat, it seems that you can specify a KeyRange for
the input, but only when you use an order preserving partitioner. So I
presume that if you are using the RandomPartitioner, you are
effectively doing a full CF scan (i.e. including all tenants in your
system).

Ben
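
The hybrid-partitioner idea from earlier in the thread (order-preserving on the virtual-CF prefix, random within it) can be illustrated conceptually. This is not a working Cassandra IPartitioner, just a sketch of the token ordering such a partitioner would produce; the separator is an assumption.

```python
import hashlib

def hybrid_token(physical_key, separator=":"):
    """Sort by virtual-CF prefix first, then by a hash of the remainder:
    one tenant's rows stay contiguous (so a KeyRange can cover exactly
    one tenant) while rows still spread out within the prefix."""
    prefix, _, rest = physical_key.partition(separator)
    return (prefix, hashlib.md5(rest.encode()).hexdigest())

keys = ["b:x", "a:z", "a:y", "b:w"]
ordered = sorted(keys, key=hybrid_token)
# Every "a:" key sorts before every "b:" key, whatever the hashes are.
```

The usual order-preserving caveat still applies: ordering by prefix concentrates each tenant on a few nodes, so a hot tenant becomes a hot node.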




Re: 1000's of column families

2012-10-01 Thread Brian O'Neill
Dean,

We have the same question...

We have thousands of separate feeds of data as well (20,000+).  To
date, we've been using a CF per feed strategy, but as we scale this
thing out to accommodate all of those feeds, we're trying to figure
out if we're going to blow out the memory.

The initial documentation for heap sizing had column families in the equation:
http://www.datastax.com/docs/0.7/operations/tuning#heap-sizing

But in the more recent documentation, it looks like they removed the
column family variable with the introduction of the universal
key_cache_size.
http://www.datastax.com/docs/1.0/operations/tuning#tuning-java-heap-size

We haven't committed either way yet, but given Ed Anuff's presentation
on virtual keyspaces, we were leaning towards a single column family
approach:
http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassandra_-_apigee_under_the_hood/?

Definitely let us know what you decide.

-brian

On Fri, Sep 28, 2012 at 11:48 AM, Flavio Baronti
f.baro...@list-group.com wrote:
 We had some serious trouble with dynamically adding CFs, although last time
 we tried we were using version 0.7, so maybe
 that's not an issue any more.
 Our problems were two:
 - You are (were?) not supposed to add CFs concurrently. Since we had more
 servers talking to the same Cassandra cluster,
 we had to use distributed locks (Hazelcast) to avoid concurrency.
 - You must be very careful to add new CFs to different Cassandra nodes. If
 you do that fast enough, and the clocks of
 the two servers are skewed, you will severely compromise your schema
 (Cassandra will not understand in which order the
 updates must be applied).

 As I said, this applied to version 0.7, maybe current versions solved these
 problems.

 Flavio


 On 2012/09/27 16:11, Hiller, Dean wrote:
 We have 1000's of different building devices and we stream data from these
 devices.  The format and data from each one varies so one device has 
 temperature
 at timeX with some other variables, another device has CO2 percentage and 
 other
 variables.  Every device is unique and streams its own data.  We dynamically
 discover devices and register them.  Basically, one CF or table per thing 
 really
 makes sense in this environment.  While we could try to find out which devices
 are similar, this would really be a pain and some devices add some new
 variable into the equation.  NOT only that but researchers can register new
 datasets and upload them as well and each dataset they have they do NOT want 
 to
 share with other researches necessarily so we have security groups and each CF
 belongs to security groups.  We dynamically create CF's on the fly as people
 register new datasets.

 On top of that, when the data sets get too large, we probably want to
 partition a single CF into time partitions.  We could create one CF and put 
 all
 the data and have a partition per device, but then a time partition will 
 contain
 multiple devices of data meaning we need to shrink our time partition size
 where if we have CF per device, the time partition can be larger as it is only
 for that one device.

 THEN, on top of that, we have a meta CF for these devices so some people want
 to query for streams that match criteria AND which returns a CF name and they
 query that CF name so we almost need a query with variables like select cfName
 from Meta where x = y and then select * from cfName where x. Which we can 
 do
 today.

 Dean

 From: Marcelo Elias Del Valle mvall...@gmail.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, September 27, 2012 8:01 AM
 To: user@cassandra.apache.org
 Subject: Re: 1000's of column families

 Out of curiosity, is it really necessary to have that amount of CFs?
 I am probably still used to relational databases, where you would use a new
 table just in case you need to store different kinds of data. As Cassandra
 stores anything in each CF, it might probably make sense to have a lot of CFs 
 to
 store your data...
 But why wouldn't you use a single CF with partitions in these case? Wouldn't
 it be the same thing? I am asking because I might learn a new modeling 
 technique
 with the answer.

 []s

 2012/9/26 Hiller, Dean dean.hil...@nrel.gov
 We are streaming data with 1 stream per 1 CF and we have 1000's of CF.  When
 using the tools they are all geared to analyzing ONE column family at a time 
 :(.
 If I remember correctly, Cassandra supports as many CF's as you want, correct?
 Even though I am going to have tons of funs with limitations on the tools,
 correct?

 (I may end up wrapping the node tool with my own aggregate calls if needed to
 sum up multiple column families and such).

 Thanks,
 Dean



 --
 Marcelo Elias Del Valle
 http://mvalle.com - 

Re: 1000's of column families

2012-10-01 Thread Brian O'Neill
Its just a convenient way of prefixing:
http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html

-brian

On Mon, Oct 1, 2012 at 4:22 PM, Ben Hood 0x6e6...@gmail.com wrote:
 Brian,

 On Mon, Oct 1, 2012 at 4:22 PM, Brian O'Neill b...@alumni.brown.edu wrote:
 We haven't committed either way yet, but given Ed Anuff's presentation
 on virtual keyspaces, we were leaning towards a single column family
 approach:
 http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassandra_-_apigee_under_the_hood/?

 Is this doing something special or is this just a convenience way of
 prefixing keys to make the storage space multi-tenanted?

 Cheers,

 Ben



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)

mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: Using the commit log for external synchronization

2012-09-21 Thread Brian O'Neill
 IMHO it's a better design to multiplex the data stream at the application
 level.
+1, agreed.

That is where we ended up. (and Storm is proving to be a solid
framework for that)
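
A sketch of what multiplexing at the application level looks like: one write fans out to the primary store and to a stream for downstream consumers. Here a Queue stands in for Kafka/Storm and a dict for the Cassandra write; all names are illustrative.

```python
from queue import Queue

primary_store = {}    # stands in for the Cassandra write path
downstream = Queue()  # stands in for Kafka / a Storm spout

def write(key, value):
    """One application-level write fans out to the primary store and to
    a stream for external consumers (search index, etc.), instead of
    trying to tail Cassandra's internal commit log."""
    primary_store[key] = value
    downstream.put((key, value))

write("row1", {"temp": 21.5})
```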

-brian

On Fri, Sep 21, 2012 at 4:56 AM, aaron morton aa...@thelastpickle.com wrote:
 The commit log is essentially internal implementation. The total size of the
 commit log is restricted, and the multiple files used to represent segments
 are recycled. So once all the memtables have been flushed for segment it may
 be overwritten.

 To archive the segments see the conf/commitlog_archiving.properties file.

 Large rows will bypass the commit log.

 A write commited to the commit log may still be considered a failure if CL
 nodes do not succeed.

 IMHO it's a better design to multiplex the data stream at the application
 level.

 Hope that helps.

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 21/09/2012, at 11:51 AM, Brian O'Neill b...@alumni.brown.edu wrote:


 Along those lines...

 We sought to use triggers for external synchronization.   If you read
 through this issue:
 https://issues.apache.org/jira/browse/CASSANDRA-1311

 You'll see the idea of leveraging a commit log for synchronization, via
 triggers.

 We went ahead and implemented this concept in:
 https://github.com/hmsonline/cassandra-triggers

 With that, via AOP, you get handed the mutation as things change.  We used
 it for synchronizing SOLR.

 fwiw,
 -brian



 On Sep 20, 2012, at 7:18 PM, Michael Kjellman wrote:

 +1. Would be a pretty cool feature

 Right now I write once to cassandra and once to kafka.

 On 9/20/12 4:13 PM, Data Craftsman 木匠 database.crafts...@gmail.com
 wrote:

 This will be a good new feature. I guess the development team don't

 have time on this yet.  ;)



 On Thu, Sep 20, 2012 at 1:29 PM, Ben Hood 0x6e6...@gmail.com wrote:

 Hi,


 I'd like to incrementally synchronize data written to Cassandra into

 an external store without having to maintain an index to do this, so I

 was wondering whether anybody is using the commit log to establish

 what updates have taken place since a given point in time?


 Cheers,


 Ben




 --

 Thanks,


 Charlie (@mujiang) 木匠

 ===

 Data Architect Developer 汉唐 田园牧歌DBA

 http://mujiang.blogspot.com



 'Like' us on Facebook for exclusive content and other resources on all
 Barracuda Networks solutions.
 Visit http://barracudanetworks.com/facebook



 --
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/





-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
Apache Cassandra MVP
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: Kundera 2.1 released

2012-09-21 Thread Brian O'Neill

Well done, Vivek and team!!  This release was much anticipated.

I'll give this a test with Spring Data JPA when I return from vacation.

thanks,
-brian


On Sep 21, 2012, at 9:15 PM, Vivek Mishra wrote:

 Hi All,
 
 We are happy to announce release of Kundera 2.0.7.
 
 Kundera is a JPA 2.0 based, object-datastore mapping library for NoSQL 
 datastores. The idea behind Kundera is to make working with NoSQL Databases
 drop-dead simple and fun. It currently supports Cassandra, HBase, MongoDB and 
 relational databases.
 
 Major Changes in this release:
 ---
 * Allow user to set specific CQL versioning.
 
 * Batch insert/update for Cassandra/MongoDB/HBase.
 
 * Extended JPA Metamodel/TypedQuery/ProviderUtil implementation.
 
 * Another Thrift client implementation for Cassandra.
 
 * Deprecated support for properties with XML based Column family/Table/server 
 specific property configuration for Cassandra, MongoDB and HBase.
 
 * Stronger query support:
  a) JPQL support over all data types and associations.
  b) JPQL support to query using primary key alongwith other columns.
 
  * Fixed github issues:
 
https://github.com/impetus-opensource/Kundera/issues/90
https://github.com/impetus-opensource/Kundera/issues/91
https://github.com/impetus-opensource/Kundera/issues/92
https://github.com/impetus-opensource/Kundera/issues/93
https://github.com/impetus-opensource/Kundera/issues/94
https://github.com/impetus-opensource/Kundera/issues/96
https://github.com/impetus-opensource/Kundera/issues/98
https://github.com/impetus-opensource/Kundera/issues/99
https://github.com/impetus-opensource/Kundera/issues/100
https://github.com/impetus-opensource/Kundera/issues/101
https://github.com/impetus-opensource/Kundera/issues/102
https://github.com/impetus-opensource/Kundera/issues/104
https://github.com/impetus-opensource/Kundera/issues/106
https://github.com/impetus-opensource/Kundera/issues/107 
https://github.com/impetus-opensource/Kundera/issues/108
https://github.com/impetus-opensource/Kundera/issues/109
https://github.com/impetus-opensource/Kundera/issues/111
https://github.com/impetus-opensource/Kundera/issues/112   
https://github.com/impetus-opensource/Kundera/issues/116
 
 
 To download, use or contribute to Kundera, visit:
 http://github.com/impetus-opensource/Kundera
 
 Latest released tag version is 2.1. Kundera maven libraries are now available 
 at: https://oss.sonatype.org/content/repositories/releases/com/impetus and 
 http://kundera.googlecode.com/svn/maven2/maven-missing-resources.
 
 Sample codes and examples for using Kundera can be found here:
 http://github.com/impetus-opensource/Kundera-Examples
 and 
 https://github.com/impetus-opensource/Kundera/tree/trunk/kundera-tests
 
 Thank you all for your contributions!
 
 Regards,
 Kundera Team.

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/



Re: Using the commit log for external synchronization

2012-09-20 Thread Brian O'Neill

Along those lines...

We sought to use triggers for external synchronization.   If you read through 
this issue:
https://issues.apache.org/jira/browse/CASSANDRA-1311

You'll see the idea of leveraging a commit log for synchronization, via 
triggers.

We went ahead and implemented this concept in:
https://github.com/hmsonline/cassandra-triggers

With that, via AOP, you get handed the mutation as things change.  We used it 
for synchronizing SOLR.  

fwiw,
-brian



On Sep 20, 2012, at 7:18 PM, Michael Kjellman wrote:

 +1. Would be a pretty cool feature
 
 Right now I write once to cassandra and once to kafka.
 
 On 9/20/12 4:13 PM, Data Craftsman 木匠 database.crafts...@gmail.com
 wrote:
 
 This will be a good new feature. I guess the development team don't
 have time on this yet.  ;)
 
 
 On Thu, Sep 20, 2012 at 1:29 PM, Ben Hood 0x6e6...@gmail.com wrote:
 Hi,
 
 I'd like to incrementally synchronize data written to Cassandra into
 an external store without having to maintain an index to do this, so I
 was wondering whether anybody is using the commit log to establish
 what updates have taken place since a given point in time?
 
 Cheers,
 
 Ben
 
 
 
 -- 
 Thanks,
 
 Charlie (@mujiang) 木匠
 ===
 Data Architect Developer 汉唐 田园牧歌DBA
 http://mujiang.blogspot.com
 
 
 'Like' us on Facebook for exclusive content and other resources on all 
 Barracuda Networks solutions.
 Visit http://barracudanetworks.com/facebook
 
 

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/



Re: Data Modeling - JSON vs Composite columns

2012-09-19 Thread Brian O'Neill
Roshni,

We're going through the same debate right now.

I believe native support for JSON (or collections) is on the docket
for Cassandra.
Here is a discussion we had a few months ago on the topic:
http://comments.gmane.org/gmane.comp.db.cassandra.devel/5233

We presently store JSON, but we're considering a change to composite keys.

Presently, each client has to parse the JSON value.  If you are
retrieving lots of values, that's a lot of parsing.  Also, storing the
raw values allows for better integration with other tools, such as
reporting engines (e.g. JasperSoft).  Also, if you do want to update a
single value inside the json, you get into real trouble, because you
first need to read the value, update the field, then write the column
again.  The read before write is a problem, especially if you have a
lot of concurrency in your system.  (Two clients could read the old
value, then update different fields, and the second would overwrite
the first's change)

One final note...
(As a side note, JSON values also complicated our wide-row indexing
mechanism: (https://github.com/hmsonline/cassandra-indexing))

For those reasons, we're considering a data model shift away from JSON.
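
The read-before-write hazard is easy to demonstrate with plain dicts and json. In this sketch two clients read the same JSON blob, each updates a different field, and the second write silently discards the first client's change; per-field columns avoid the race because each update is a blind write. The field names are illustrative.

```python
import json

row = {"profile": json.dumps({"name": "Betty", "qty": 5})}

# Two clients read the same JSON column...
a = json.loads(row["profile"])
b = json.loads(row["profile"])

# ...each updates a different field...
a["name"] = "Betty Crocker"
b["qty"] = 6

# ...and the later write clobbers the earlier one.
row["profile"] = json.dumps(a)
row["profile"] = json.dumps(b)
final = json.loads(row["profile"])  # client A's name update is lost

# With one column per field there is no read-modify-write cycle,
# so both concurrent updates survive.
columns = {"name": "Betty", "qty": 5}
columns["name"] = "Betty Crocker"  # client A's blind write
columns["qty"] = 6                 # client B's blind write
```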

That said, I'm keeping a close watch on:
https://issues.apache.org/jira/browse/CASSANDRA-3647

But if this is CQL only, I'm not sure how much use it will be for us
since we're coming in from different clients.
Anyone know how/if collections will be available from other clients?

-brian


On Wed, Sep 19, 2012 at 8:00 AM, Roshni Rajagopal
roshni_rajago...@hotmail.com wrote:
 Hi,

 There was a conversation on this some time earlier, and to continue it

 Suppose I want to associate a user to an item, and I want to also store 3
 commonly used attributes without needing to go to an entity item column
 family, I have 2 options:

 A) use composite columns
 UserId1 : {
  itemid1:Name = Betty Crocker,
  itemid1:Descr = Cake
 itemid1:Qty = 5
  itemid2:Name = Nutella,
  itemid2:Descr = Choc spread
 itemid2:Qty = 15
 }

 B) use a json with the data
 UserId1 : {
  itemid1 = {name: Betty Crocker,descr: Cake, Qty: 5},
  itemid2 ={name: Nutella,descr: Choc spread, Qty: 15}
 }

 Essentially A is better if one wants to update individual fields , while B
 is better if one wants easier paging, reading multiple items at once in one
 read. etc. The details are in this discussion thread
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Data-Modeling-another-question-td7581967.html

 I had an additional question,
 as its being said, that CQL is the direction in which cassandra is moving,
 and there's a lot of effort in making CQL the standard,

 How does approach B work in CQL. Can we read/write a JSON easily in CQL? Can
 we extract a field from a JSON in CQL or would that need to be done via the
 client code?


 Regards,
 Roshni



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
Apache Cassandra MVP
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: Solr Use Cases

2012-09-19 Thread Brian O'Neill
Roshni,

We're using SOLR to support ad hoc queries and fuzzy searches against
unstructured data stored in Cassandra.  Cassandra is great for storage
and you can create data models and indexes that support your queries,
provided you can anticipate those queries.  When you can't anticipate
the queries, or if you need to support a large permutation of
multi-dimensional queries, you're probably better off using an index
like SOLR.

Since SOLR only supports a flat document structure, you may need to
perform transformation before inserting into SOLR.  We chose not to
use DSE, so we used a cassandra-triggers as our mechanism to integrate
SOLR. (https://github.com/hmsonline/cassandra-triggers)  We intercept
the mutation, transform the data into a document (w/ multi-value
fields) and POST it to SOLR.

More recently though, we're looking to roll out ElasticSearch.  As our
query demand increases, we expect SOLR to quickly become a PITA to
administer.  (master-slave relationships)  IMHO, ElasticSearch's
architecture is a better match for Cassandra.  We are also looking to
substitute cassandra-triggers for Storm, allowing us to build a data
processing flow using Cassandra and ElasticSearch bolts.  (we've open
sourced the Cassandra bolt and we'll be open sourcing the elastic
search bolt shortly)

-brian


On Wed, Sep 19, 2012 at 8:27 AM, Roshni Rajagopal
roshni_rajago...@hotmail.com wrote:
 Hi,

 I'm new to Solr, and I hear that Solr is a great tool for improving search
 performance.
 I'm unsure whether Solr or DSE Search is a must for all Cassandra deployments

 1. For performance - I thought cassandra had great read  write performance.
 When should solr be used ?
 Taking the following use cases for cassandra from the datastax FAQ page, in
 which cases would Solr be useful, and whether for all?

 Time series data management
 High-velocity device data ingestion and analysis
 Media streaming (e.g., music, movies)
 Social media input and analysis
 Online web retail (e.g., shopping carts, user transactions)
 Web log management / analysis
 Web click-stream analysis
 Real-time data analytics
 Online gaming (e.g., real-time messaging)
 Write-intensive transaction systems
 Buyer event analytics
 Risk analysis and management


 2. what changes to cassandra data modeling does Solr bring? We have some
 guidelines  best practices around cassandra data modeling.
 Is Solr so powerful, that it does not matter how data is modelled in
 cassandra? Are there different best practices for cassandra data modeling
 when Solr is in the picture?
 Is this something we should keep in mind while modeling for Cassandra today,
 so that it is well positioned to be used via Solr in future?

 3. Does Solr come with any drawbacks, like not being real-time?

 I can  should read the manual, but it will be great if someone can explain
 at a high level.

 Thank you!


 Regards,
 Roshni



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
Apache Cassandra MVP
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Compound Keys: Connecting the dots between CQL3 and Java APIs

2012-09-11 Thread Brian O'Neill
Our data architects (ex-Oracle DBA types) are jumping on the CQL3
bandwagon and creating schemas for us.  That triggered me to write a
quick article mapping the CQL3 schemas to how they are accessed via
Java APIs (for our dev team).

I hope others find this useful as well:
http://brianoneill.blogspot.com/2012/09/composite-keys-connecting-dots-between.html

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
Apache Cassandra MVP
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: Cassandra API Library.

2012-09-04 Thread Brian O'Neill
You got it.  (done)

-brian

On Tue, Sep 4, 2012 at 7:08 AM, Filipe Gonçalves
the.wa.syndr...@gmail.com wrote:
 @Brian: you can add the Cassandra::Simple Perl client
 http://fmgoncalves.github.com/p5-cassandra-simple/


 2012/8/27 Paolo Bernardi berna...@gmail.com

 On 08/23/2012 01:40 PM, Thomas Spengler wrote:

 4) pelops (Thrift,Java)


 I've been using Pelops for quite some time with pretty good results; it
 felt much cleaner than Hector.

 Paolo

 --
 @bernarpa
 http://paolobernardi.wordpress.com




 --
 Filipe Gonçalves



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
Apache Cassandra MVP
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: Spring - cassandra

2012-08-30 Thread Brian O'Neill

Yes.  I'm in contact with Oliver Gierke and Erez Mazor of Spring Data.

We are working on two fronts:
1) Spring Data support via JPA (using Kundera underneath)
- Initial attempt here:
http://brianoneill.blogspot.com/2012/07/spring-data-w-cassandra-using-jpa.html
- Most recently (an hour ago): The issues w/ MetaModel are fixed, now
waiting on an enhancement to the EntityManager to fully support type
queries.

For this one, we're in a holding pattern until Kundera is fully JPA
compliant.

2) Spring Data support via Astyanax
- The project I'm working below should mimic Spring Data MongoDB's
approach and capabilities, allowing people to use Spring Data with
Cassandra without the constraints of JPA.  I'd love some help working on
the project.  Once we have it functional we should be able to push it to
Spring. (with Oliver's help)

Go ahead and fork.  Feel free to email me directly so we don't spam this
list.
(or setup a googlegroup just in case others want to contribute)

-brian


---
Brian O'Neill
Lead Architect, Software Development
Apache Cassandra MVP
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com

This information transmitted in this email message is for the intended
recipient only and may contain confidential and/or privileged material. If
you received this email in error and are not the intended recipient, or
the person responsible to deliver it to the intended recipient, please
contact the sender at the email above and delete this email and any
attachments and destroy any copies thereof. Any review, retransmission,
dissemination, copying or other use of, or taking any action in reliance
upon, this information by persons or entities other than the intended
recipient is strictly prohibited.
 






On 8/30/12 9:01 AM, Radim Kolar h...@filez.com wrote:



 You looking for the author of Spring Data Cassandra?
 https://github.com/boneill42/spring-data-cassandra

 If so, I guess that is me. =)
Did you get in touch with spring guys? They have cassandra support on
their spring data todo list. They might have some todo or feature list
they want to implement for cassandra, i am willing to code something to
make official spring cassandra support happen faster.




Re: Spring - cassandra

2012-08-29 Thread Brian O'Neill

You looking for the author of Spring Data Cassandra?
https://github.com/boneill42/spring-data-cassandra

If so, I guess that is me. =)

-brian

---
Brian O'Neill
Lead Architect, Software Development
Apache Cassandra MVP
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com







On 8/29/12 10:38 AM, Radim Kolar h...@filez.com wrote:

is author of Spring - Cassandra here? I am interested in getting this
merged into upstream spring. They have cassandra support on their todo
list.




Re: Cassandra API Library.

2012-08-23 Thread Brian O'Neill


We've used 'em all and… (IMHO)

1) I would avoid Thrift directly.
2) Hector is a sure bet.
3) Astyanax is the up and comer.
4) Kundera is good, but works like an ORM -- so not so good if your
columns aren't defined ahead of time.

-brian

---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com







On 8/23/12 7:40 AM, Thomas Spengler thomas.speng...@toptarif.de wrote:

4) pelops (Thrift,Java)

On 08/23/2012 01:28 PM, Baskar Sikkayan wrote:
 I would vote for Hector :)
 
 On Thu, Aug 23, 2012 at 4:55 PM, Amit Handa amithand...@gmail.com
wrote:
 
 hi,

 kindly let me know which java client api is more matured, and easy to
use
 with all features(Super Columns, caching, pooling, etc) of Cassandra
1.X.
 Right now i come to know that following client exists:

 1) Hector(Java)
 2) Thrift (Java)
 3) Kundera (Java)


 With Regards,
 Amit

 


-- 
Thomas Spengler
Chief Technology Officer


TopTarif Internet GmbH, Pappelallee 78-79, D-10437 Berlin
Tel.: (030) 2000912 0 | Fax: (030) 2000912 100
thomas.speng...@toptarif.de | www.toptarif.de

Amtsgericht Charlottenburg, HRB 113287 B
Geschäftsführer: Dr. Rainer Brosch, Dr. Carolin Gabor
-




Re: Cassandra API Library.

2012-08-23 Thread Brian O'Neill

Thanks Dean… I hadn't played with that one.  I wonder if that would better
fit the bill for the Spring Data Cassandra module I'm hacking on.
https://github.com/boneill42/spring-data-cassandra

I'll poke around.

-brian

---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com







On 8/23/12 9:19 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

playOrm has a raw layer that if your columns are not defined ahead of time
and SQL with no limitations on <, <=, >=, etc. etc. as well as joins being
added shortly BUT joins are for joining partitions so that your system can
still scale to infinity.  Also has an in-memory database as well for unit
testing that you can do TDD with built in.

So if you like JQL but want infinite scale JQL, try playOrm.

All 45 tests are passing.  We expect 100 unit tests to be in place by the
end of the year.

Dean

On 8/23/12 6:46 AM, Brian O'Neill boneil...@gmail.com wrote:



We've used 'em all and… (IMHO)

1) I would avoid Thrift directly.
2) Hector is a sure bet.
3) Astyanax is the up and comer.
4) Kundera is good, but works like an ORM -- so not so good if your
columns aren't defined ahead of time.

-brian

---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com







On 8/23/12 7:40 AM, Thomas Spengler thomas.speng...@toptarif.de
wrote:

4) pelops (Thrift,Java)

On 08/23/2012 01:28 PM, Baskar Sikkayan wrote:
 I would vote for Hector :)
 
 On Thu, Aug 23, 2012 at 4:55 PM, Amit Handa amithand...@gmail.com
wrote:
 
 hi,

 kindly let me know which java client api is more matured, and easy to
use
 with all features(Super Columns, caching, pooling, etc) of Cassandra
1.X.
 Right now i come to know that following client exists:

 1) Hector(Java)
 2) Thrift (Java)
 3) Kundera (Java)


 With Regards,
 Amit

 


-- 
Thomas Spengler
Chief Technology Officer


TopTarif Internet GmbH, Pappelallee 78-79, D-10437 Berlin
Tel.: (030) 2000912 0 | Fax: (030) 2000912 100
thomas.speng...@toptarif.de | www.toptarif.de

Amtsgericht Charlottenburg, HRB 113287 B
Geschäftsführer: Dr. Rainer Brosch, Dr. Carolin Gabor

-







Re: Cassandra API Library.

2012-08-23 Thread Brian O'Neill
FWIW.. I just threw this together...
http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html

Let me know if I missed any others. (I didn't have playorm on there)

-brian

On Thu, Aug 23, 2012 at 9:51 AM, Brian O'Neill boneil...@gmail.com wrote:

 Thanks Dean… I hadn't played with that one.  I wonder if that would better
 fit the bill for the Spring Data Cassandra module I'm hacking on.
 https://github.com/boneill42/spring-data-cassandra

 I'll poke around.

 -brian

 ---
 Brian O'Neill
 Lead Architect, Software Development

 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
 healthmarketscience.com








 On 8/23/12 9:19 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

playOrm has a raw layer that if your columns are not defined ahead of time
and SQL with no limitations on <, <=, >=, etc. etc. as well as joins being
added shortly BUT joins are for joining partitions so that your system can
still scale to infinity.  Also has an in-memory database as well for unit
testing that you can do TDD with built in.

So if you like JQL but want infinite scale JQL, try playOrm.

All 45 tests are passing.  We expect 100 unit tests to be in place by the
end of the year.

Dean

On 8/23/12 6:46 AM, Brian O'Neill boneil...@gmail.com wrote:



We've used 'em all and… (IMHO)

1) I would avoid Thrift directly.
2) Hector is a sure bet.
3) Astyanax is the up and comer.
4) Kundera is good, but works like an ORM -- so not so good if your
columns aren't defined ahead of time.

-brian

---
Brian O'Neill
Lead Architect, Software Development

Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
healthmarketscience.com








On 8/23/12 7:40 AM, Thomas Spengler thomas.speng...@toptarif.de
wrote:

4) pelops (Thrift,Java)

On 08/23/2012 01:28 PM, Baskar Sikkayan wrote:
 I would vote for Hector :)

 On Thu, Aug 23, 2012 at 4:55 PM, Amit Handa amithand...@gmail.com
wrote:

 hi,

 kindly let me know which java client api is more matured, and easy to
use
 with all features(Super Columns, caching, pooling, etc) of Cassandra
1.X.
 Right now i come to know that following client exists:

 1) Hector(Java)
 2) Thrift (Java)
 3) Kundera (Java)


 With Regards,
 Amit




--
Thomas Spengler
Chief Technology Officer


TopTarif Internet GmbH, Pappelallee 78-79, D-10437 Berlin
Tel.: (030) 2000912 0 | Fax: (030) 2000912 100
thomas.speng...@toptarif.de | www.toptarif.de

Amtsgericht Charlottenburg, HRB 113287 B
Geschäftsführer: Dr. Rainer Brosch, Dr. Carolin Gabor

-








-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: Cassandra API Library.

2012-08-23 Thread Brian O'Neill
Ha… how could I forget? =)
Adding it now.

---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42   •
healthmarketscience.com




From:  Robin Verlangen ro...@us2.nl
Reply-To:  user@cassandra.apache.org
Date:  Thursday, August 23, 2012 9:56 AM
To:  user@cassandra.apache.org
Subject:  Re: Cassandra API Library.

@Brian: You're missing PhpCassa (PHP library)

With kind regards,

Robin Verlangen
Software engineer

W http://www.robinverlangen.nl
E ro...@us2.nl

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.



2012/8/23 Hiller, Dean dean.hil...@nrel.gov
 No problem, if you like SQL at all and don't mind adding a PARTITIONS
 clause, we have a raw ad-hoc layer(if you have properly added meta data
 which the ORM objects do for you but can be done manually) you get a query
 like this
 
 PARTITIONS p('account56') SELECT tr FROM Trades as tr WHERE tr.price > 70;
 
 So it queries just the partition of the Trades table.  We are still
 investigating how large partitions can be but we know it is quite large
 from previous nosql projects.
 
 Dean
 
 
 On 8/23/12 7:51 AM, Brian O'Neill boneil...@gmail.com wrote:
 
 
 Thanks Dean… I hadn't played with that one.  I wonder if that would better
 fit the bill for the Spring Data Cassandra module I'm hacking on.
 https://github.com/boneill42/spring-data-cassandra
 
 I'll poke around.
 
 -brian
 
 ---
 Brian O'Neill
 Lead Architect, Software Development
 
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 tel:215.588.6024  • @boneill42
 http://www.twitter.com/boneill42  •
 healthmarketscience.com http://healthmarketscience.com
 
 
 
 
 
 
 
 On 8/23/12 9:19 AM, Hiller, Dean dean.hil...@nrel.gov wrote:
 
 playOrm has a raw layer that if your columns are not defined ahead of
 time
 and SQL with no limitations on <, <=, >=, etc. etc. as well as joins being
 added shortly BUT joins are for joining partitions so that your system
 can
 still scale to infinity.  Also has an in-memory database as well for unit
 testing that you can do TDD with built in.
 
 So if you like JQL but want infinite scale JQL, try playOrm.
 
 All 45 tests are passing.  We expect 100 unit tests to be in place by the
 end of the year.
 
 Dean
 
 On 8/23/12 6:46 AM, Brian O'Neill boneil...@gmail.com wrote:
 
 
 
 We've used 'em all and… (IMHO)
 
 1) I would avoid Thrift directly.
 2) Hector is a sure bet.
 3) Astyanax is the up and comer.
 4) Kundera is good, but works like an ORM -- so not so good if your
 columns aren't defined ahead of time.
 
 -brian
 
 ---
 Brian O'Neill
 Lead Architect, Software Development
 
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 tel:215.588.6024  • @boneill42
 http://www.twitter.com/boneill42  •
 healthmarketscience.com http://healthmarketscience.com
 

A Big Data Trifecta: Storm, Kafka and Cassandra

2012-08-04 Thread Brian O'Neill
Philip,

I figured I would reply via blog post. =)
http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html

That blog post shows how we pieced together Kafka and Cassandra (via Storm).
With LinkedIn behind Kafka, it is well supported.  They use it in
production. (and most likely we will too =)

Let me know if you end up using it.  Thus far, I think it pairs nicely
with Cassandra, but we don't have it in production yet.

-brian

On Fri, Aug 3, 2012 at 3:41 PM, Milind Parikh milindpar...@gmail.com wrote:
 Kafka is relatively stable and has a active well-supported news-group as
 well.

 As discussed by Brian, you would be inverting the paradigm of store-process.
 Essentially in your original approach, you are storing the messages first
 and then processing them after the fact. In the Kafka model, you would
 process the messages as they come in.

 Since you are thinking about parallelism anyways, I trust that your
 processing paradigm is inherently parallelizable.

 Regards
 Milind





 On Fri, Aug 3, 2012 at 12:22 PM, Philip Nelson
 philipomailbox-c...@yahoo.com wrote:

 Brian -- thanks.

  We were looking to do the same thing, but in the end decided
  to go with Kafka.
  Given your throughput requirements, Kafka might be a good
  option for you as well.

 This might be off-topic, so I'll keep it short. Kafka is reasonably
 stable? Mature (I see it's in the Incubator)? Relative to Cassandra?

 Philip






-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: How to process new rows in parallel?

2012-08-03 Thread Brian O'Neill
If you are deleting the messages after processing, it sounds like you
are using Cassandra as a work queue.

Here are some links for implementing a distributed queue in Cassandra:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Distributed-work-queues-td5226248.html
http://comments.gmane.org/gmane.comp.db.cassandra.user/16633

There is a placeholder on the use cases wiki for this, but no info:
http://wiki.apache.org/cassandra/UseCases#A_distributed_Priority_Job_Queue

We were looking to do the same thing, but in the end decided to go with Kafka.
Given your throughput requirements, Kafka might be a good option for
you as well.
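
One pattern worth considering for the question below (workers pulling
disjoint batches without explicit range assignment) is to hash each message
into a fixed set of bucket rows and have each worker own a disjoint subset
of buckets. A minimal sketch of the idea, with an invented bucket count:

```python
import zlib

NUM_BUCKETS = 16  # fixed set of wide "bucket" rows (invented for illustration)

def bucket_for(message_id: str) -> int:
    # Stable hash so every process agrees on a message's bucket.
    return zlib.crc32(message_id.encode()) % NUM_BUCKETS

def buckets_for_worker(worker_index: int, num_workers: int) -> list:
    # Worker i owns every bucket congruent to i mod num_workers.
    return [b for b in range(NUM_BUCKETS) if b % num_workers == worker_index]

# Two workers cover all buckets between them, with no overlap:
w0 = buckets_for_worker(0, 2)
w1 = buckets_for_worker(1, 2)
```

Adding a worker only changes num_workers, so no per-message assignment is
needed -- though rebalancing buckets mid-run still requires coordination,
which is part of why a purpose-built queue like Kafka is attractive.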

-brian


On Fri, Aug 3, 2012 at 2:18 PM, Philip Nelson
philipomailbox-c...@yahoo.com wrote:
 Hello,

 I am using a Column Family in Cassandra to store incoming messages, which 
 arrive at a high rate (100s of thousands per second). I then have a process 
 wake up periodically to work on those messages, and then delete them. I'd 
 like to understand how I could have multiple processes running, each pulling 
 off a bunch of messages in parallel. It would be nice to be able to add 
 processes dynamically, and not have to explicitly assign message ranges to 
 various processes.

 Any suggestions on how to ensure that each process pulls off a different 
 bunch of messages? Any recommended design patterns? I was going to look at 
 qsandra too, for inspiration. Would this be worthwhile?

 If this was a relational database, I would have the processes lock the table 
 (or perhaps a row), set flags on a row indicating that it's being 
 processed, and then unlock. Processes would choose messages by SELECTing on 
 unflagged messages. I'm not sure how this might map to Cassandra. I realise 
 it may not. Even if I configure the cluster such that setting a flag on a row 
 requires all nodes to be written, two processes could still race setting that 
 flag, right?

 I am open to the idea that it might help to store the messages in wide rows, 
 if that helps.

 Thanks,

 Philip



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: How to manually build and maintain secondary indexes

2012-07-26 Thread Brian O'Neill
Alon,

We came to the same conclusion regarding secondary indexes, and instead of
using them we implemented our own wide-row indexing capability and
open-sourced it.  

Its available here:
https://github.com/hmsonline/cassandra-indexing

We still have challenges rebuilding indexes, etc.  It doesn't address all
of your concerns, but I tried to capture the motivation behind our
implementation here:
http://brianoneill.blogspot.com/2012/03/cassandra-indexing-good-bad-and-ugly.html
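
As a sketch of what the wide-row approach boils down to (in-memory
stand-ins for the two column families; this is not the cassandra-indexing
API itself):

```python
# Two "column families": items (item_id -> value) and an inverted
# index (value -> row of item_ids). Updating an item must also move its
# index entry from the old value's row to the new value's row.
items, index = {}, {}

def update_item(item_id, new_value):
    old_value = items.get(item_id)
    if old_value is not None:
        index[old_value].discard(item_id)  # remove the stale entry first
    items[item_id] = new_value
    index.setdefault(new_value, set()).add(item_id)

update_item("i1", "blue")
update_item("i1", "red")  # i1 moves from the 'blue' index row to 'red'
```

The catch: two concurrent updaters can each read a different "old" value
and leave an orphaned index entry behind, which is exactly the eventual
consistency hazard that motivates extra bookkeeping for index rebuilds.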

-brian

-- 
Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024
www.healthmarketscience.com





On 7/26/12 2:05 PM, Alon Pilberg alo...@taboola.com wrote:

Hello,
My company is working on transition of our relational data model to
Cassandra. Naturally, one of the basic demands is to have secondary
indexes to answer queries quickly according to the application's
needs.
After looking at Cassandra's native support for secondary indexes, we
decided not to use them due to the poor performance for
high-cardinality values. Instead, we decide to implement secondary
indexes manually.
Some search led us to
http://www.anuff.com/2010/07/secondary-indexes-in-cassandra.html which
details a schema for such indexes. However, the method employed there
specifically adds an index entries column family, whereas it seems
like only 2 CFs are needed - one for the items and one for the indexes
(assuming one has access to both old and new values when updating an
item). The article actually mentioned that this is indeed not the
obvious solution, for a number of reasons related to Cassandra's
model of eventual consistency ("... will not reliably work" and "it's a
really good idea to make sure you understand why this CF is
necessary"). However, no additional information is provided on what
might be a critical issue, as dealing with corrupt indexes in a large
production environment is surely to be a nightmare.
What are the community's thoughts on this matter? Given the writer's
credentials in the Cassandra realm, specifically regarding indexes,
I'm inclined not to ignore his remarks.
References to a document / system that implement similar indexes would
be greatly appreciated as well.

- alon




An experiment using Spring Data w/ Cassandra (initially via JPA/Kundera)

2012-07-18 Thread Brian O'Neill
This is just an FYI.

I experimented w/ Spring Data JPA w/ Cassandra leveraging Kundera.

It sort of worked:
https://github.com/boneill42/spring-data-jpa-cassandra
http://brianoneill.blogspot.com/2012/07/spring-data-w-cassandra-using-jpa.html

I'm now working on a pure Spring Data adapter using Astyanax:
https://github.com/boneill42/spring-data-cassandra

I'll keep you posted.

(Thanks to all those that helped out w/ advice)

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: Trigger and customized filter

2012-07-10 Thread Brian O'Neill
While Jonathan and crew work on the infrastructure to support triggers:
https://issues.apache.org/jira/browse/CASSANDRA-4285

We have a project going over here that provides a trigger-like capability:
https://github.com/hmsonline/cassandra-triggers/
https://github.com/hmsonline/cassandra-triggers/wiki/GettingStarted

We are working enhancements that would support synchronous triggers w/
javascript.
For now, they are processed asynchronously, and you implement a Java interface.
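
To give a feel for the asynchronous model (a hypothetical shape only; the
real interface is on the project wiki linked above):

```python
# Hypothetical trigger shape: mutations are logged as they happen, then a
# background worker drains the log and calls each trigger after the write.
fired = []

class LoggingTrigger:
    def process(self, keyspace, column_family, row_key):
        fired.append((keyspace, column_family, row_key))

mutation_log = [("ks", "users", "row1"), ("ks", "users", "row2")]
trigger = LoggingTrigger()
for mutation in mutation_log:  # the async worker's drain loop
    trigger.process(*mutation)
```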

-brian

On Tue, Jul 10, 2012 at 9:24 AM, Felipe Schmidt felipef...@gmail.com wrote:
 Does anyone know something about the following questions?

 1. Does Cassandra support customized filters? By customized filter I mean
 that the programmer can define his own filter to select the data.
 2. Does Cassandra support triggers? Trigger here has the same meaning as in an
 RDBMS.

 Thanks in advance.

 Regards,
 Felipe Mathias Schmidt
 (Computer Science UFRGS, RS, Brazil)






-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: Cassandra and Tableau

2012-07-06 Thread Brian O'Neill
Robin,

We have the same issue right now.  We use Tableau for all of our
reporting needs, but we couldn't find any acceptable bridge between it
and Cassandra.

We ended up using cassandra-triggers to replicate the data to Oracle.
https://github.com/hmsonline/cassandra-triggers/

Let us know if you get things setup with a direct connection.
We'd be *very* interested in helping out if you find a way to do it.

-brian


On Fri, Jul 6, 2012 at 5:31 AM, Robin Verlangen ro...@us2.nl wrote:
 Hi there,

 Is there anyone out there who's using Tableau in combination with a
 Cassandra cluster? There seems to be no standard solution to connect, at
 least I couldn't find one. Does anyone know how to tackle this problem?


 With kind regards,

 Robin Verlangen
 Software engineer

 W http://www.robinverlangen.nl
 E ro...@us2.nl





-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: which high level Java client

2012-06-28 Thread Brian O'Neill
FWIW,

We keep most of our system level integrations behind REST using Virgil:
https://github.com/hmsonline/virgil

When a lower-level integration is necessary we use Hector, but
recently we've started using Astyanax and plan to port our Hector
dependencies over to Astyanax when given a chance.

I've also been looking to implement a Spring Data JPA adaptor like
what is available for MongoDB.
https://github.com/boneill42/spring-data-mongodb

I've forked the SpringSource Cassandra repo here if anyone wants to help out:
https://github.com/boneill42/spring-data-cassandra

-brian


On Thu, Jun 28, 2012 at 9:02 AM, Vivek Mishra mishra.v...@gmail.com wrote:

 Would like to add one more https://github.com/impetus-opensource/Kundera . 
 Next release is planned with many distinguishing features.

 -Vivek


 On Thu, Jun 28, 2012 at 6:23 PM, Sasha Dolgy sdo...@gmail.com wrote:

 Not following this thread too much, but there is also 
 https://github.com/Netflix/astyanax/

 Astyanax is currently in use at Netflix. Issues generally are fixed as 
 quickly as possbile and releases done frequently.

 -sd

 On Thu, Jun 28, 2012 at 2:39 PM, Poziombka, Wade L 
 wade.l.poziom...@intel.com wrote:

 I use Pelops and have been very happy.  In my opinion the interface is 
 cleaner than that with Hector.  I personally do like the serializer 
 business.

 -Original Message-
 From: Radim Kolar [mailto:h...@filez.com]
 Sent: Thursday, June 28, 2012 5:06 AM
 To: user@cassandra.apache.org
 Subject: Re: which high level Java client

 i do not have experience with other clients, only hector. But timeout 
 management in hector is really broken. If you expect your nodes to timeout 
 often (for example, if you are using WAN) better to try something else 
 first.




 --
 Sasha Dolgy
 sasha.do...@gmail.com





--
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: Ball is rolling on High Performance Cassandra Cookbook second edition

2012-06-27 Thread Brian O'Neill
RE: API method signatures changing

That triggers another thought...

What terminology will you use in the book to describe the data model?  CQL?

When we wrote the RefCard on DZone
(http://refcardz.dzone.com/refcardz/apache-cassandra),
we intentionally favored/used CQL terminology.  On advisement from Jonathan
and Kris Hahn, we wanted to start the process of sunsetting the legacy
terms (keyspace, column family, etc.) in favor of the more familiar CQL
terms (schema, table, etc.). I've gone on record
(http://css.dzone.com/articles/new-refcard-apache-cassandra)
in favor of the switch, but it is probably something worth noting in the
book since that terminology does not yet align with all the client APIs.
(e.g. Hector, Astyanax, etc.)

I'm not sure when the client APIs will catch up to the new terminology, but
we may want to inquire as to future proof the recipes as much as possible.

-brian




On Wed, Jun 27, 2012 at 4:18 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 On Wed, Jun 27, 2012 at 3:08 PM, Courtney Robinson court...@crlog.info
 wrote:
  Sounds good.
  One thing I'd like to see is more coverage on Cassandra Internals. Out of
  the box Cassandra's great but having a little inside knowledge can be
 very
  useful because it helps you design your applications to work with
 Cassandra;
  rather than having to later make endless optimizations that could
 probably
  have been avoided had you done your implementation slightly differently.
 
  Another thing that may be worth adding would be a recipe that showed an
  approach to evaluating Cassandra for your organization/use case. I
 realize
  that's going to vary on a case by case basis but one thing I've noticed
 is
  that some people dive in without really thinking through whether
 Cassandra
  is actually the right fit for what they're doing. It sort of becomes a
  hammer for anything that looks like a nail.
 
  On Tue, Jun 26, 2012 at 10:25 PM, Edward Capriolo edlinuxg...@gmail.com
 
  wrote:
 
  Hello all,
 
  It has not been very long since the first book was published but
  several things have been added to Cassandra and a few things have
  changed. I am putting together a list of changed content, for example
  features like the old per Column family memtable flush settings versus
  the new system with the global variable.
 
  My editors have given me the green light to grow the second edition
  from ~200 pages currently up to 300 pages! This gives us the ability
  to add more items/sections to the text.
 
  Some things were missing from the first edition such as Hector
  support. Nate has offered to help me in this area. Please feel contact
  me with any ideas and suggestions of recipes you would like to see in
  the book. Also get in touch if you want to write a recipe. Several
  people added content to the first edition and it would be great to see
  that type of participation again.
 
  Thank you,
  Edward
 
 
 
 
  --
  Courtney Robinson
  court...@crlog.info
  http://crlog.info
  07535691628 (No private #s)
 

 Thanks for the comments. Yes the INTERNALS chapter was a bit tricky.
 The challenge of writing about internals is they go stale fairly
 quickly. I was considering writing a partitioner for the internals
 chapter but then I thought about it more:
 1) Its hard
 2) The APIs can change. (They work the same way across versions but
 they may have a different signature etc)
 3) 99.99% of people should be using the random partitioner :)

 But I agree the internals chapter can be made much stronger than it is.

 The recipe format is strict. It naturally conflicts with the typical
 use-case style, in which you write a good amount of text talking about
 the problem domain, previous solutions, and bragging about company X. We
 cannot do that with the recipe style, but we can do our best to make the
 recipes as real-world as possible. I tried to do that throughout the
 text; you do not find many examples like 'writing foo records to bar
 column families'. However, the format does not allow the extensive text
 blocks mentioned above, so it is difficult to set the stage for a complex
 and detailed real-world problem. Still, I think for some examples we can
 take the next step and make the recipes more practical and use-case like.




-- 
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Indexing JSON in Cassandra

2012-06-21 Thread Brian O'Neill
I know we had this conversation over on the dev list a while back:
http://www.mail-archive.com/dev@cassandra.apache.org/msg03914.html

I just wanted to let people know that we added the capability to our
cassandra-indexing extension.
http://brianoneill.blogspot.com/2012/06/indexing-json-in-cassandra.html

Let us know if you have any trouble with it.

-brian

-- 
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: Server Side Logic/Script - Triggers / StoreProc

2012-04-22 Thread Brian O'Neill
Praveen,

We are certainly interested. To get things moving we implemented an add-on for 
Cassandra to demonstrate the viability (using AOP):
https://github.com/hmsonline/cassandra-triggers

Right now the implementation executes triggers asynchronously, allowing you to 
implement a java interface and plugin your own java class that will get called 
for every insert.

Per the discussion on 1311, we intend to extend our proof of concept to be able 
to invoke scripts as well.  (minimally we'll enable javascript, but we'll 
probably allow for ruby and groovy as well)
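A conceptual sketch of that pattern (illustrative Python only; the actual cassandra-triggers project is Java and AOP-based, and every name below is hypothetical): writes enqueue an event, and a background worker invokes the trigger asynchronously so the write path is not blocked.

```python
# Illustrative sketch only -- the real cassandra-triggers project is Java and
# uses AOP; these class and function names are hypothetical.
import queue
import threading

class Trigger:
    """Interface a user implements; called for every mutation."""
    def process(self, keyspace, column_family, row_key, mutation):
        raise NotImplementedError

class LoggingTrigger(Trigger):
    def __init__(self):
        self.seen = []
    def process(self, keyspace, column_family, row_key, mutation):
        self.seen.append((keyspace, column_family, row_key))

# Asynchronous execution: the write path only enqueues an event; a worker
# thread fires the trigger later, so inserts are not blocked by trigger code.
events = queue.Queue()
trigger = LoggingTrigger()

def worker():
    while True:
        event = events.get()
        if event is None:  # shutdown sentinel
            break
        trigger.process(*event)

t = threading.Thread(target=worker)
t.start()
events.put(("ks", "users", "row1", {"name": "x"}))  # simulated insert
events.put(None)
t.join()
print(trigger.seen)  # [('ks', 'users', 'row1')]
```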

-brian

On Apr 22, 2012, at 12:23 PM, Praveen Baratam wrote:

 I found that Triggers are coming in Cassandra 1.2 
 (https://issues.apache.org/jira/browse/CASSANDRA-1311) but no mention of any 
 StoreProc like pattern.
 
 I know this has been discussed so many times but never met with any 
 initiative. Even Groovy was staged out of the trunk.
 
 Cassandra is great for logging and as such will be infinitely more useful if 
 some logic can be pushed into the Cassandra cluster nearer to the location of 
 Data to generate a materialized view useful for applications.
 
 Server Side Scripts/Routines in Distributed Databases could soon prove to be 
 the differentiating factor.
 
 Let me reiterate things with a use case.
 
 In our application we store time series data in wide rows with TTL set on 
 each point to prevent data from growing beyond acceptable limits. Still the 
 data size can be a limiting factor to move all of it from the cluster node to 
 the querying node and then to the application via thrift for processing and 
 presentation.
 
 Ideally we should process the data on the residing node and pass only the 
 materialized view of the data upstream. This should be trivial if Cassandra 
 implements some sort of server side scripting and CQL semantics to call it.
 
 Is anybody else interested in a similar feature? Is it being worked on? Are 
 there any alternative strategies to this problem?
 
 Praveen
 
 

-- 
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/



Re: cassandra gui

2012-04-01 Thread Brian O'Neill
If you give Virgil a try, let me know how it goes.
The REST layer is pretty solid, but the gui is just a PoC which makes it
easy to see what's in the CFs during development/testing.
(It's only a couple hundred lines of ExtJS code built on the REST layer)

We had plans to add CQL to the gui for CRUD, but never got around to it.

-brian

On Fri, Mar 30, 2012 at 5:20 PM, Ben McCann b...@benmccann.com wrote:

 If you want a REST interface and a GUI then Virgil may be interesting.  I
 just came across it and haven't tried it myself yet.

 http://brianoneill.blogspot.com/2011/10/virgil-gui-and-rest-layer-for-cassandra.html




 On Fri, Mar 30, 2012 at 2:15 PM, John Liberty libjac...@gmail.com wrote:

 I made some updates to a cassandra-gui project I found, which seemed to
 be stuck at version 0.7, and posted to github:
 https://github.com/libjack/cassandra-gui

 Besides updating to work with version 1.0+, main improvements I added
 were to obey validation types, including column metadata, when displaying
 or accepting data. This includes support for Composite types, both keys and
 columns.

 I often create CF with non string keys, columns, values, and especially
 Composite types... And I need a tool to browse/verify and then add/edit
 test data, and this works quite well for me.

 --
 John Liberty
 libjac...@gmail.com
 (585) 466-4249





-- 
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Cassandra Triggers Capability published out to GitHub

2012-03-02 Thread Brian O'Neill
FYI --
http://brianoneill.blogspot.com/2012/03/cassandra-triggers-for-indexing-and.html

https://github.com/hmsonline/cassandra-triggers

Feedback welcome. Contribution and involvement is even better. ;)

-brian

-- 
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Virgil Moved (and Cassandra-Triggers coming soon)

2012-02-07 Thread Brian O'Neill
FYI -- we moved Virgil to Github to make it easier for people to contribute.
https://github.com/hmsonline/virgil

Also, we created an organization profile (hmsonline) to house all of our
storm/cassandra related work.
https://github.com/hmsonline

Under that profile, we'll be releasing cassandra-triggers.
It is an AOP-based trigger solution that provides a simple
trigger/event-log that can be used for data replication and indexing
reacting to column family mutations.
https://github.com/hmsonline/cassandra-triggers

-brian

-- 
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Remote Hadoop Job Deployment

2012-01-24 Thread Brian O'Neill
FYI... we finally got around to releasing a version of Virgil that includes
the ability to deploy jobs to remote Hadoop clusters running against
Cassandra Column Families.

http://brianoneill.blogspot.com/2012/01/virgil-remote-hadoop-job-deployment-via.html

This has enabled an army of people to write and deploy Hadoop jobs against
our Cassandra cluster.
(Literally, we'll probably have 100 M/R jobs by the end of the month)

Yes, we still plan to implement a javascript engine as well, but first we
intend to tackle Triggers for indexing, data replication and materialized
views.

-brian

-- 
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: Cassandra to Oracle?

2012-01-22 Thread Brian O'Neill
 a potential
 performance problem.
 
 
 On 1/20/2012 7:55 PM, Mohit Anchlia wrote:
 I think the problem stems when you have data in a column that you need
 to run adhoc query on which is not denormalized. In most cases it's
 difficult to predict the type of query that would be required.
 
 Another way of solving this could be to index the fields in search engine.
 
 On Fri, Jan 20, 2012 at 7:37 PM, Maxim Potekhin potek...@bnl.gov wrote:
 What makes you think that RDBMS will give you acceptable performance?
 
 I guess you will try to index it to death (because otherwise the ad hoc
 queries won't work well if at all), and at this point you may be hit with a
 performance penalty.
 
 It may be a good idea to interview users and build denormalized views in
 Cassandra, maybe on a separate look-up cluster. A few percent of users
 will be unhappy, but you'll find it hard to do better. I'm talking from my
 experience with an industrial strength RDBMS which doesn't scale very well
 for what you call ad-hoc queries.
 
 Regards,
 Maxim
 
 
 
 
 
 On 1/20/2012 9:28 AM, Brian O'Neill wrote:
 
 I can't remember if I asked this question before, but
 
 We're using Cassandra as our transactional system, and building up quite a
 library of map/reduce jobs that perform data quality analysis, statistics,
 etc.
 (> 100 jobs now)
 
 But... we are still struggling to provide an ad-hoc query mechanism for
 our users.
 
 To fill that gap, I believe we still need to materialize our data in an
 RDBMS.
 
 Anyone have any ideas?  Better ways to support ad-hoc queries?
 
 Effectively, our users want to be able to select count(distinct Y) from X
 group by Z.
 Where Y and Z are arbitrary columns of rows in X.
 
 We believe we can create column families with different key structures
 (using Y and Z as row keys), but some column names we don't know / can't
 predict ahead of time.
 
 Are people doing bulk exports?
 Anyone trying to keep an RDBMS in synch in real-time?
 
 -brian
 
 --
 Brian O'Neill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/
 
 
 

-- 
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/



Re: Cassandra to Oracle?

2012-01-22 Thread Brian O'Neill

Eric,

Thinking even a little bit more about this...

We could go the distributed counter approach with additional column families to 
support the ad hoc queries, but use triggers to implement it.  That would allow 
us to keep the client-side code thin, but achieve the same result... without 
necessarily replicating to Oracle for the attributes we can predict.

Maybe we'll take a look at that this week as well.

thanks again,
brian


On Jan 21, 2012, at 8:35 AM, Eric Czech wrote:

 Hi Brian,
 
 We're trying to do the exact same thing and I find myself asking very similar 
 questions.
 
 Our solution though has been to find what kind of queries we need to satisfy 
 on a preemptive basis and leverage cassandra's built-in indexing features to 
 build those result sets beforehand.  The whole point here then is that our 
 gain in cost efficiency comes from the fact that disk space is really cheap 
 and serving up result sets from disk is fast provided that those result sets 
 are pre-calculated and reasonable in size (even if we don't know all the 
 values upfront).  For example, when you're writing to your CF X, you could 
 also make writes to column family A like this:
 
 - write A[Z][Y] = 1
 where A = CF, Z = key, Y = column
 
 Answering the question select count(distinct Y) from X group by Z then is 
 as simple as getting a list of rows for CF A and counting the distinct values 
 of Y and grouping them by Z on the client side.
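 A minimal sketch of that index-ahead-of-time idea (plain Python dicts stand in for the column families; all names are illustrative):

```python
# Sketch of the pre-computed index described above: alongside every write to
# CF X, also write A[Z][Y] = 1. The ad-hoc query "select count(distinct Y)
# from X group by Z" then reduces to counting columns per row of A.
# Plain dicts stand in for Cassandra column families here.
from collections import defaultdict

index_a = defaultdict(dict)  # row key Z -> {column name Y: 1}

def on_write_to_x(z, y):
    # Performed alongside the write to X (e.g. in the same batch mutation)
    index_a[z][y] = 1

# Simulated writes to X
for z, y in [("2012-01", "userA"), ("2012-01", "userB"),
             ("2012-01", "userA"), ("2012-02", "userC")]:
    on_write_to_x(z, y)

# Client side: "group by Z, count distinct Y" is one column count per row
counts = {z: len(cols) for z, cols in index_a.items()}
print(counts)  # {'2012-01': 2, '2012-02': 1}
```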
 
 Alternatively, there are much better ways to do this with composite 
 keys/columns and distributed counters but it's hard for me to tell what makes 
 the most sense without knowing more about your data / product requirements.
 
 Either way, I feel your pain in getting things like this to work with 
 Cassandra when the domain of values for a particular key or column is unknown 
 and secondary indexing doesn't apply, but I'm positive there's a much cheaper 
 way to make it work than paying for Oracle if you have at least a decent idea 
 about what kinds of queries you need to satisfy (which it sounds like you 
 do).  To Maxim's death by index point, you could certainly go overboard 
 with this concept and cross a pricing threshold with some other database 
 technology, but I can't imagine you're even close to being in that boat given 
 how concise your query needs seem to be.
 
 If you're interested, I'd be happy to share how we do these things to save 
 lots of money over commercial databases and try to relate that to your use 
 case, but if not, then I hope at least some of that this useful for you.
 
 Good luck either way!
 
 On Fri, Jan 20, 2012 at 9:27 PM, Maxim Potekhin potek...@bnl.gov wrote:
 I certainly agree with "difficult to predict". There is a Danish
 proverb which goes: "it's difficult to make predictions, especially
 about the future."
 
 My point was that it's equally difficult with noSQL and RDBMS.
 The latter requires indexing to operate well, and that's a potential
 performance problem.
 
 
 On 1/20/2012 7:55 PM, Mohit Anchlia wrote:
 I think the problem stems when you have data in a column that you need
 to run adhoc query on which is not denormalized. In most cases it's
 difficult to predict the type of query that would be required.
 
 Another way of solving this could be to index the fields in search engine.
 
 On Fri, Jan 20, 2012 at 7:37 PM, Maxim Potekhin potek...@bnl.gov wrote:
 What makes you think that RDBMS will give you acceptable performance?
 
 I guess you will try to index it to death (because otherwise the ad hoc
 queries won't work well if at all), and at this point you may be hit with a
 performance penalty.
 
 It may be a good idea to interview users and build denormalized views in
 Cassandra, maybe on a separate look-up cluster. A few percent of users
 will be unhappy, but you'll find it hard to do better. I'm talking from my
 experience with an industrial strength RDBMS which doesn't scale very well
 for what you call ad-hoc queries.
 
 Regards,
 Maxim
 
 
 
 
 
 On 1/20/2012 9:28 AM, Brian O'Neill wrote:
 
 I can't remember if I asked this question before, but
 
 We're using Cassandra as our transactional system, and building up quite a
 library of map/reduce jobs that perform data quality analysis, statistics,
 etc.
 (> 100 jobs now)
 
 But... we are still struggling to provide an ad-hoc query mechanism for
 our users.
 
 To fill that gap, I believe we still need to materialize our data in an
 RDBMS.
 
 Anyone have any ideas?  Better ways to support ad-hoc queries?
 
 Effectively, our users want to be able to select count(distinct Y) from X
 group by Z.
 Where Y and Z are arbitrary columns of rows in X.
 
 We believe we can create column families with different key structures
 (using Y and Z as row keys), but some column names we don't know / can't
 predict ahead of time.
 
 Are people doing bulk exports?
 Anyone trying to keep an RDBMS in synch in real-time?
 
 -brian
 
 --
 Brian O'Neill
 Lead Architect, Health Market Science (http

Re: Cassandra to Oracle?

2012-01-22 Thread Brian O'Neill

Good point Milind. (RE: Client-side AOP)

I was thinking server-side to stay with the trigger concept, but we could just 
as easily intercept on the client-side. 
We'd just need to make sure that all clients got the AOP code injected. 
(including all of our map/reduce jobs)

If we get the point-cut right (using the Cassandra.Iface), we could probably 
make it portable.  People could drop it in client-side or server-side.

-brian



On Jan 22, 2012, at 9:45 AM, Milind Parikh wrote:

 My bad ~s/X:X-Value/Y:Y-Value/ after rereading the SELECT.
 
 /***
 sent from my android...please pardon occasional typos as I respond @ the 
 speed of thought
 /
 
 On Jan 22, 2012 6:40 AM, Milind Parikh milindpar...@gmail.com wrote:
 
 
 The composite-key approach with counters would work very well in this case.
 It will also obviate the concern of not knowing the exact column names a
 priori... although for efficiency, you might want to look at maintaining a
 secondary cache-like CF for lookup.
 
 Depending on your data patterns (so as not to hit the 2-billion-column limit)
 and your actual queries, you could store each Z as one row, composite-key on
 Z-value + X:X-value, and then use a counter column. Other optimizations may
 be possible.
 
 If you're using AOP, as I read it, there's really no need to intercept your
 own writes at the C* level; instead do it (use AOP) at the client level.
 
 Your migration also needs to be attended to and might need an MR job first
 and AOP-intercepted writes.
 
 Hth
 Milind
 
 
 /***
 sent from my android...please pardon occasional typos as I respond @ the 
 speed of thought
 /
 
 
 
 
  On Jan 22, 2012 4:42 AM, Brian O'Neill boneil...@gmail.com wrote:
 
 
 
  Thanks for all the ideas...
 
  Since we can't predict all the values, we actually cut to Oracle...
 
 
 
 
 
 
 
  On Jan 21, 2012, at 8:35 AM, Eric Czech wrote:
 
   Hi Brian,
  
 
  We're trying to do the exact same...
 
 
 

-- 
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/



Cassandra to Oracle?

2012-01-20 Thread Brian O'Neill
I can't remember if I asked this question before, but

We're using Cassandra as our transactional system, and building up quite a
library of map/reduce jobs that perform data quality analysis, statistics,
etc.
(> 100 jobs now)

But... we are still struggling to provide an ad-hoc query mechanism for
our users.

To fill that gap, I believe we still need to materialize our data in an
RDBMS.

Anyone have any ideas?  Better ways to support ad-hoc queries?

Effectively, our users want to be able to select count(distinct Y) from X
group by Z.
Where Y and Z are arbitrary columns of rows in X.

We believe we can create column families with different key structures
(using Y and Z as row keys), but some column names we don't know / can't
predict ahead of time.

Are people doing bulk exports?
Anyone trying to keep an RDBMS in synch in real-time?

-brian

-- 
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: Cassandra to Oracle?

2012-01-20 Thread Brian O'Neill
Not terribly large
~50 million rows, each row has ~100-300 columns.

But big enough that a map/reduce job takes longer than users would like.

Actually maybe that is another question...
Does anyone have any benchmarks running map/reduce against Cassandra?
(even a simple count / or copy CF benchmark would be helpful)

-brian

On Fri, Jan 20, 2012 at 12:41 PM, Zach Richardson 
j.zach.richard...@gmail.com wrote:

 How much data do you think you will need ad hoc query ability for?


 On Fri, Jan 20, 2012 at 11:28 AM, Brian O'Neill b...@alumni.brown.edu wrote:


 I can't remember if I asked this question before, but

 We're using Cassandra as our transactional system, and building up quite
 a library of map/reduce jobs that perform data quality analysis,
 statistics, etc.
 (> 100 jobs now)

 But... we are still struggling to provide an ad-hoc query mechanism for
 our users.

 To fill that gap, I believe we still need to materialize our data in an
 RDBMS.

 Anyone have any ideas?  Better ways to support ad-hoc queries?

 Effectively, our users want to be able to select count(distinct Y) from X
 group by Z.
 Where Y and Z are arbitrary columns of rows in X.

 We believe we can create column families with different key structures
 (using Y and Z as row keys), but some column names we don't know / can't
 predict ahead of time.

 Are people doing bulk exports?
 Anyone trying to keep an RDBMS in synch in real-time?

 -brian

 --
 Brian O'Neill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/





-- 
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Ad Hoc Queries

2012-01-20 Thread Brian O'Neill
Interesting articles... (changing the subject line to broaden the scope)
http://codemonkeyism.com/dark-side-nosql/
http://www.reportsanywhere.com/pebble/2010/04/16/127143774.html

These articulate the exact challenge we're trying to overcome.

-brian



On Fri, Jan 20, 2012 at 12:57 PM, Brian O'Neill b...@alumni.brown.edu wrote:

 Not terribly large
 ~50 million rows, each row has ~100-300 columns.

 But big enough that a map/reduce job takes longer than users would like.

 Actually maybe that is another question...
 Does anyone have any benchmarks running map/reduce against Cassandra?
 (even a simple count / or copy CF benchmark would be helpful)

 -brian

 On Fri, Jan 20, 2012 at 12:41 PM, Zach Richardson 
 j.zach.richard...@gmail.com wrote:

 How much data do you think you will need ad hoc query ability for?


 On Fri, Jan 20, 2012 at 11:28 AM, Brian O'Neill b...@alumni.brown.edu wrote:


 I can't remember if I asked this question before, but

 We're using Cassandra as our transactional system, and building up quite
 a library of map/reduce jobs that perform data quality analysis,
 statistics, etc.
 (> 100 jobs now)

 But... we are still struggling to provide an ad-hoc query mechanism
 for our users.

 To fill that gap, I believe we still need to materialize our data in an
 RDBMS.

 Anyone have any ideas?  Better ways to support ad-hoc queries?

 Effectively, our users want to be able to select count(distinct Y) from
 X group by Z.
 Where Y and Z are arbitrary columns of rows in X.

 We believe we can create column families with different key structures
 (using Y and Z as row keys), but some column names we don't know / can't
 predict ahead of time.

 Are people doing bulk exports?
 Anyone trying to keep an RDBMS in synch in real-time?

 -brian

 --
 Brian O'Neill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/





 --
 Brian O'Neill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://weblogs.java.net/blog/boneill42/
 blog: http://brianoneill.blogspot.com/




-- 
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Triggers?

2012-01-20 Thread Brian O'Neill
Anyone know if there is any activity to deliver triggers?

I saw this quote:

http://www.readwriteweb.com/cloud/2011/10/cassandra-reaches-10-whats-nex.php

Ellis says that he's just starting to think about the post-1.0 world for
Cassandra. Two features do come to mind, though, that missed the boat for
1.0 and that were on a lot of wishlists. The first is triggers.

Database triggers let you define rules in the database, such as updating
table X when table Y is updated. Ellis says that triggers will be necessary
for Cassandra as it grows in popularity. As more tools use it, that's
something more users are going to be asking for.

But grepping the trunk code, I don't see any work on triggers.

-brian

-- 
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Copy a column family?

2012-01-09 Thread Brian O'Neill
What is the fastest way to copy a column family?
We were headed down the map/reduce path, but that seems silly.
Any file level mechanisms for this?

-brian

-- 
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: Copy a column family?

2012-01-09 Thread Brian O'Neill
Excellent.  We'll give it a try.
Thanks Brandon.

-brian

 
Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/







On 1/9/12 10:31 AM, Brandon Williams dri...@gmail.com wrote:

On Mon, Jan 9, 2012 at 9:14 AM, Brian O'Neill b...@alumni.brown.edu
wrote:

 What is the fastest way to copy a column family?
 We were headed down the map/reduce path, but that seems silly.
 Any file level mechanisms for this?

Copy all the sstables 1:1 renaming them to the new CF name.  Then
create the schema for the CF.

-Brandon
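
Brandon's copy-and-rename step can be sketched as follows (the sstable component names and temp directory here are stand-ins, not real paths; a Cassandra 1.x data directory lives under /var/lib/cassandra/data/<Keyspace>):

```python
# Sketch of the sstable copy-and-rename step above. File names and the temp
# directory are hypothetical stand-ins for a real Cassandra data directory.
# After copying, the schema for the new CF still has to be created.
import shutil
import tempfile
from pathlib import Path

data_dir = Path(tempfile.mkdtemp())
# Stand-in sstable components for a CF named "OldCF"
for name in ("OldCF-hc-1-Data.db", "OldCF-hc-1-Index.db", "OldCF-hc-1-Filter.db"):
    (data_dir / name).touch()

# Copy each component 1:1, renaming the CF prefix
for src in data_dir.glob("OldCF-*"):
    dst = data_dir / src.name.replace("OldCF-", "NewCF-", 1)
    shutil.copy2(src, dst)

print(sorted(p.name for p in data_dir.iterdir() if p.name.startswith("NewCF-")))
# ['NewCF-hc-1-Data.db', 'NewCF-hc-1-Filter.db', 'NewCF-hc-1-Index.db']
```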



