Re: Timeuuid inserted with now(), how to get the value back in Java client?

2014-04-01 Thread Theo Hultberg
no, there's no way. you should generate the TIMEUUID on the client side so
that you have it.
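
for example, with the DataStax Java driver it's only a couple of lines (just
a sketch, the keyspace and table names are made up, UUIDs.timeBased() is the
relevant part):

import java.util.UUID;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;

public class ClientSideTimeuuid {
  public static void main(String[] args) {
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect("my_keyspace");
    // generate the TIMEUUID on the client instead of using now() in the query
    UUID id = UUIDs.timeBased();
    PreparedStatement insert = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)");
    session.execute(insert.bind(id, "hello"));
    // id is the primary key, no need to query anything back
    System.out.println("inserted " + id);
    cluster.close();
  }
}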

T#


On Sat, Mar 29, 2014 at 1:01 AM, Andy Atj2 andya...@gmail.com wrote:

 I'm writing a Java client to a Cassandra db.

 One of the main primary keys is a timeuuid.

 I plan to do INSERTs using now() and have Cassandra generate the value of
 the timeuuid.

 After the INSERT, I need the Cassandra-generated timeuuid value. Is there
 an easy way to get it, without having to re-query for the record I just
 inserted, hoping to get only one record back? Remember, I don't have the PK.

 Eg, in every other db there's a way to get the generated PK back. In sql
 it's @@identity, in Oracle it's... etc etc.

 I know Cassandra is not an RDBMS. All I want is the value Cassandra just
 generated.

 Thanks,
 Andy




Re: Meaning of token column in system.peers and system.local

2014-03-31 Thread Theo Hultberg
your assumption about 256 tokens per node is correct.

as for your second question, it seems to me like most of your assumptions
are correct, but I'm not sure I understand them correctly. hopefully
someone else can answer this better. tokens are a property of the cluster
and not the keyspace. the first replica of any token will be the same for
all keyspaces, but with different replication factors the other replicas
will differ.

when you query the system.local and system.peers tables you must make sure
that you don't connect to other nodes. I think the inconsistency you think
you found is because the first and second queries went to different nodes.
the java driver will connect to all nodes and load balance requests by
default.
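
if you just want to see all tokens as one node sees them, you can combine
system.local and system.peers, something like this sketch (note that with the
default load balancing the two queries below can still be coordinated by
different nodes, so for a consistent view you'd want to pin the driver to a
single host):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TokensByHost {
  public static void main(String[] args) {
    Cluster cluster = Cluster.builder().addContactPoint("192.168.200.11").build();
    Session session = cluster.connect();
    Map<String, Set<String>> tokensByHost = new HashMap<String, Set<String>>();
    // the tokens of whichever node coordinated this particular query
    Row local = session.execute("SELECT tokens FROM system.local").one();
    tokensByHost.put("local", local.getSet("tokens", String.class));
    // the tokens of every other node, as seen by that coordinator
    for (Row peer : session.execute("SELECT peer, tokens FROM system.peers")) {
      tokensByHost.put(peer.getInet("peer").getHostAddress(), peer.getSet("tokens", String.class));
    }
    System.out.println(tokensByHost);
    cluster.close();
  }
}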

T#


On Mon, Mar 31, 2014 at 4:06 AM, Clint Kelly clint.ke...@gmail.com wrote:

 BTW one other thing that I have not been able to debug today that maybe
 someone can help me with:

 I am using a three-node Cassandra cluster with Vagrant.  The nodes in my
 cluster are 192.168.200.11, 192.168.200.12, and 192.168.200.13.

 If I use cqlsh to connect to 192.168.200.11, I see unique sets of tokens
 when I run the following three commands:

 select tokens from system.local
 select tokens from system.peers where peer=192.168.200.12
 select tokens from system.peers where peer=192.168.200.13

 This is what I expect.  However, when I tried making an application with
 the Java driver that does the following:


- Create a Session by connecting to 192.168.200.11
- From that session, select tokens from system.local
- From that session, select tokens, peer from system.peers

 Now I get the exact-same set of tokens from system.local and from the row
 in system.peers in which peer=192.168.200.13.

 Anyone have any idea why this would happen?  I'm not sure how to debug
 this.  I see the following log from the Java driver:

 14/03/30 19:05:24 DEBUG com.datastax.driver.core.Cluster: Starting new
 cluster with contact points [/192.168.200.11]
 14/03/30 19:05:24 INFO com.datastax.driver.core.Cluster: New Cassandra
 host /192.168.200.13 added
 14/03/30 19:05:24 INFO com.datastax.driver.core.Cluster: New Cassandra
 host /192.168.200.12 added

 I'm running Cassandra 2.0.6 in the virtual machine and I built my
 application with version 2.0.1 of the driver.

 Best regards,
 Clint







 On Sun, Mar 30, 2014 at 4:51 PM, Clint Kelly clint.ke...@gmail.comwrote:

 Hi all,


 I am working on a Hadoop InputFormat implementation that uses only the
 native protocol Java driver and not the Thrift API.  I am currently trying
 to replicate some of the behavior of
 *Cassandra.client.describe_ring(myKeyspace)* from the Thrift API.  I
 would like to do the following:

- Get a list of all of the token ranges for a cluster
- For every token range, determine the replica nodes on which the
data in the token range resides
- Estimate the number of rows for every range of tokens
- Group ranges of tokens on common replica nodes such that we can
create a set of input splits for Hadoop with total estimated line counts
that are reasonably close to the requested split size

 Last week I received some much-appreciated help on this list that pointed
 me to using the system.peers table to get the list of token ranges for the
 cluster and the corresponding hosts.  Today I created a three-node C*
 cluster in Vagrant (https://github.com/dholbrook/vagrant-cassandra) and
 tried inspecting some of the system tables.  I have a couple of questions
 now:

 1. *How many total unique tokens should I expect to see in my cluster?*
 If I have three nodes, and each node has a cassandra.yaml with num_tokens =
 256, then should I expect a total of 256*3 = 768 distinct vnodes?

 2. *How does the creation of vnodes and their assignment to nodes relate
 to the replication factor for a given keyspace?*  I never thought about
 this until today, and I tried to reread the documentation on virtual nodes,
 replication in Cassandra, etc., and now I am sadly still confused.  Here is
 what I think I understand.  :)

- Given a row with a partition key, any client request for an
operation on that row will go to a coordinator node in the cluster.
- The coordinator node will compute the token value for the row and
from that determine a set of replica nodes for that token.
   - One of the replica nodes I assume is the node that owns the
   vnode with the token range that encompasses the token
   - The identity of the owner of this virtual node is a
   cross-keyspace property
   - And the other replicas were originally chosen based on the
   replica-placement strategy
   - And therefore the other replicas will be different for each
    keyspace (because replication factors and replica-placement strategy are
    properties of a keyspace)

 3. What do the values in the token column in system.peers and
 system.local refer to then?

- Since these tables appear to be global, and 

Re: Production Quality Ruby Driver?

2014-03-19 Thread Theo Hultberg
I'm the author of cql-rb, the first one on your list. It runs in production
in systems doing tens of thousands of operations per second. cequel is an
ORM and its latest version runs on top of cql-rb.

If you decide on using cql-rb I'm happy to help you out with any problems
you might have, just open an issue on the GitHub project page.

yours
Theo


On Mon, Mar 17, 2014 at 6:55 PM, NORD SC jan.algermis...@nordsc.com wrote:

 Hi,

 I am looking for a Ruby driver that is production ready and truly supports
 CQL 3. Can anyone strongly recommend one in particular?


 I found

 - https://github.com/iconara/cql-rb
 - https://github.com/kreynolds/cassandra-cql
 - https://github.com/cequel/cequel


 Jan




Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-12 Thread Theo Hultberg
Speaking as a CQL driver maintainer (Ruby) I'm +1 for end-of-lining Thrift.

I agree with Edward that it's unfortunate that there are no official
drivers being maintained by the Cassandra maintainers -- even though the
current state with the Datastax drivers is in practice very close (it is
not the same thing though).

However, I don't agree that not having drivers in the same repo/project is
a problem. Whether or not there's a Java driver in the Cassandra source or
not doesn't matter at all to us non-Java developers, and I don't see any
difference between the situation where there's no driver in the source or
just a Java driver. I might have misunderstood Edwards point about this,
though.

The CQL protocol is the key, as others have mentioned. As long as that is
maintained, and respected I think it's absolutely fine not having any
drivers shipped as part of Cassandra. However, I feel as this has not been
the case lately. I'm thinking particularly about the UDT feature of 2.1,
which is not a part of the CQL spec. There is no documentation on how
drivers should handle them and what a user should be able to expect from a
driver, they're completely implemented as custom types.

I hope this will be fixed before 2.1 is released (and there's been good
discussions on the mailing lists about how a driver should handle UDTs),
but it shows a problem with the the-spec-is-the-truth argument. I think
we'll be fine as long as the spec is the truth, but that requires the spec
to be the truth and new features to not be bolted on outside of the spec.

T#


On Wed, Mar 12, 2014 at 3:23 PM, Peter Lin wool...@gmail.com wrote:

 I'm enjoying the discussion also.

 @Brian
 I've been looking at spark/shark along with other recent developments the
 last few years. Berkeley has been doing some interesting stuff. One reason
 I like Thrift is for type safety and the benefits for query validation and
 query optimization. One could do similar things with CQL, but it's just
 more work, especially with dynamic columns. I know others are mixing static
 with dynamic columns, so I'm not alone. I have no clue how long it will
 take to get there, but having tools like query explanation is a big time
 saver. Writing business reports is hard enough, so every bit of help the
 tool can provide makes it less painful.


 On Wed, Mar 12, 2014 at 10:12 AM, Brian O'Neill b...@alumni.brown.eduwrote:


 just when you thought the thread died...


 First, let me say we are *WAY* off topic.  But that is a good thing.
 I love this community because there are a ton of passionate, smart
 people. (often with differing perspectives ;)

 RE: Reporting against C* (@Peter Lin)
 We've had the same experience.  Pig + Hadoop is painful.  We are
 experimenting with Spark/Shark, operating directly against the data.
 http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html

 The Shark layer gives you SQL and caching capabilities that make it easy
 to use and fast (for smaller data sets).  In front of this, we are going to
 add dimensional aggregations so we can operate at larger scales.  (then the
 Hive reports will run against the aggregations)

 RE: REST Server (@Russel Bradbury)
 We had moderate success with Virgil, which was a REST server built
 directly on Thrift.  We built it directly on top of Thrift, so one day it
 could be easily embedded in the C* server itself.   It could be deployed
 separately, or run an embedded C*.  More often than not, we ended up
 running it separately to separate the layers.  (just like Titan and
 Rexster)  I've started on a rewrite of Virgil called Memnon that rides on
 top of CQL. (I'd love some help)
 https://github.com/boneill42/memnon

 RE: CQL vs. Thrift
 We've hitched our wagons to CQL.  CQL != Relational.
 We've had success translating our native schemas into CQL, including
 all the NoSQL goodness of wide-rows, etc.  You just need a good
 understanding of how things translate into storage and underlying CFs.  If
 anything, I think we could add some DESCRIBE information, which would help
 users with this, along the lines of:
 (https://issues.apache.org/jira/browse/CASSANDRA-6676)

 CQL does open up the *opportunity* for users to articulate more complex
 queries using more familiar syntax.  (including future things such as
 joins, grouping, etc.)   To me, that is exciting, and again -- one of the
 reasons we are leaning on it.

 my two cents,
 brian

 ---

 Brian O'Neill

 Chief Technology Officer


 *Health Market Science*

 *The Science of Better Results*

 2700 Horizon Drive * King of Prussia, PA * 19406

 M: 215.588.6024 * @boneill42 http://www.twitter.com/boneill42  *

 healthmarketscience.com


 This information transmitted in this email message is for the intended
 recipient only and may contain confidential and/or privileged material. If
 you received this email in error and are not the intended recipient, or the
 person responsible to deliver it to the intended recipient, please contact
 the sender 

Re: How to paginate through all columns in a row?

2014-02-27 Thread Theo Hultberg
You can page yourself using the withColumnRange method (see the slice query
example on the page you linked to). What you do is that you save the last
column you got from the previous query, and you set that as the start of
the range you pass to withColumnRange. You don't need to set an end of a
range, but you want to set a max size.

This code is just a quick rewrite from the page you linked to and I haven't
checked that it worked, but it should give you an idea of where to start

ColumnList<String> result;
int pageSize = 100;
String offset = Character.toString('\0');
do {
  result = keyspace.prepareQuery(CF_STANDARD1)
    .getKey(rowKey)
    .withColumnRange(new RangeBuilder().setStart(offset).setMaxSize(pageSize).build())
    .execute().getResult();
  for (Column<String> col : result) {
    // do something with your column here, then save
    // the last column name to use as the offset when loading the next page
    offset = col.getName();
  }
} while (result.size() == pageSize);

I'm using a string with a null byte as the first offset because that should
sort before all strings, but there might be a better way of doing it. If you
have non-string columns or composite columns the exact way to do this is a
bit different but I hope this shows you the general idea.

T#



On Thu, Feb 27, 2014 at 11:36 AM, Lu, Boying boying...@emc.com wrote:

 Hi, All,



 I'm using Netflix/Astyanax as a java cassandra client to access Cassandra
 DB.



 I need to paginate through all columns in a row and I found the document
 at https://github.com/Netflix/astyanax/wiki/Reading-Data

 about how to do that.



 But my requirement is a little different.  I don't want to do paginate in
 'one querying session',

 i.e. I don't want to hold the returned 'RowQuery' object to get next page.



 Is there any way that I can keep a 'marker' for next page, so by using the
 marker,

 I can tell the Cassandra DB that where to start query.

 e.g.  the query result has three 'pages',

 Can I build the query by giving a marker pointed to the 'page 2' and
 Cassandra will return the second page of the query?



 Thanks a lot.



 Boying





Re: How should clients handle the user defined types in 2.1?

2014-02-25 Thread Theo Hultberg
thanks for the high level description of the format, I'll see if I can make
a stab at implementing support for custom types now.

and maybe I should take all of the reverse engineering I've done of the
type encoding and decoding and send a pull request for the protocol spec,
or write an appendix.
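
in case it's useful to other driver authors, here's a rough sketch in Java of
parsing the description based on Sylvain's explanation quoted below. it only
handles the format as he describes it, and the split has to ignore commas
inside nested parentheses (like the SetType in the example):

import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class UserTypeDescriptionParser {
  private static final String PREFIX = "org.apache.cassandra.db.marshal.UserType(";

  public static void parse(String description) {
    // strip the surrounding "UserType(" and ")"
    String args = description.substring(PREFIX.length(), description.length() - 1);
    List<String> parts = splitTopLevel(args);
    System.out.println("keyspace:  " + parts.get(0));
    System.out.println("type name: " + hexToUtf8(parts.get(1)));
    for (String field : parts.subList(2, parts.size())) {
      int colon = field.indexOf(':');
      System.out.println("field " + hexToUtf8(field.substring(0, colon)) + " of type " + field.substring(colon + 1));
    }
  }

  // split on commas that are not inside parentheses (field types can be nested)
  private static List<String> splitTopLevel(String s) {
    List<String> parts = new ArrayList<String>();
    int depth = 0, start = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '(') depth++;
      else if (c == ')') depth--;
      else if (c == ',' && depth == 0) { parts.add(s.substring(start, i)); start = i + 1; }
    }
    parts.add(s.substring(start));
    return parts;
  }

  // type and field names are hex encoded UTF-8
  private static String hexToUtf8(String hex) {
    byte[] bytes = new byte[hex.length() / 2];
    for (int i = 0; i < bytes.length; i++) {
      bytes[i] = (byte) Integer.parseInt(hex.substring(i * 2, i * 2 + 2), 16);
    }
    return new String(bytes, Charset.forName("UTF-8"));
  }
}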

T#


On Tue, Feb 25, 2014 at 12:10 PM, Sylvain Lebresne sylv...@datastax.comwrote:


 Is there any documentation on how CQL clients should handle the new user
 defined types coming in 2.1? There's nothing in the protocol specification
 on how to handle custom types as far as I can see.


 Can't say there is much documentation so far for that. As for the spec, it
 was written in a time where user defined types didn't existed and so as far
 as the protocol is concerned so far, user defined types are handled by the
 protocol as a custom type, i.e the full internal class is returned. And
 so ...



 For example, I tried creating the address type from the description of
 CASSANDRA-5590, and this is how its metadata looks (the metadata for a
 query contains a column with a custom type and this is the description of
 it):


 org.apache.cassandra.db.marshal.UserType(user_defined_types,61646472657373,737472656574:org.apache.cassandra.db.marshal.UTF8Type,63697479:org.apache.cassandra.db.marshal.UTF8Type,7a69705f636f6465:org.apache.cassandra.db.marshal.Int32Type,70686f6e6573:org.apache.cassandra.db.marshal.SetType(org.apache.cassandra.db.marshal.UTF8Type))

 Is the client supposed to parse that description, and in that case how?


 ... yes, for now you're supposed to parse that description. Which is not
 really much documented outside of looking up the Cassandra code, but I can
 tell you that the first parameter of the UserType is the keyspace name the
 type has been defined in, the second is the type name hex encoded, and the
 rest is list of fields and their type. Each field name is hex encoded and
 separated from it's type by ':'. And that's about it.

 We will introduce much shorted definitions in the next iteration of the
 native protocol, but it's yet unclear when that will happen.

 --
 Sylvain





Re: CQL decimal encoding

2014-02-24 Thread Theo Hultberg
I don't know if it's by design or if it's by oversight that the data types
aren't part of the binary protocol specification. I had to reverse engineer
how to encode and decode all of them for the Ruby driver. There were
definitely a few bugs in the first few versions that could have been
avoided if there was a specification available.
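
for what it's worth, Paul's explanation below translates into something like
this minimal sketch (in Java rather than Ruby): the first four bytes are the
scale as a big-endian int, the rest is the unscaled value as a two's
complement varint.

import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;

public class CqlDecimalCodec {
  public static BigDecimal decode(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    int scale = buffer.getInt();                 // the "decimal shift"
    byte[] unscaled = new byte[buffer.remaining()];
    buffer.get(unscaled);
    return new BigDecimal(new BigInteger(unscaled), scale); // unscaled * 10^-scale
  }

  public static void main(String[] args) {
    // the encoding of zero discussed below: \x00\x00\x00\x00\x00
    System.out.println(decode(new byte[] {0, 0, 0, 0, 0})); // prints 0
  }
}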

T#


On Mon, Feb 24, 2014 at 8:43 PM, Paul LeoNerd Evans 
leon...@leonerd.org.uk wrote:

 On Mon, 24 Feb 2014 19:14:48 +
 Ben Hood 0x6e6...@gmail.com wrote:

  So I have a question about the encoding of 0: \x00\x00\x00\x00\x00.

 The first four octets are the decimal shift (0), and the remaining ones
 (one in this case) encode a varint - 0 in this case. So it's

   0 * 10**0

 literally zero.

 Technically the decimal shift matters not for zero - any four bytes
 could be given as the shift, ending in \x00, but 0 is the simplest.

 --
 Paul LeoNerd Evans

 leon...@leonerd.org.uk
 ICQ# 4135350   |  Registered Linux# 179460
 http://www.leonerd.org.uk/



How should clients handle the user defined types in 2.1?

2014-02-24 Thread Theo Hultberg
(I posted this on the client-dev list the other day, but that list seems
dead so I'm cross posting, sorry if it's the wrong thing to do)

Hi,

Is there any documentation on how CQL clients should handle the new user
defined types coming in 2.1? There's nothing in the protocol specification
on how to handle custom types as far as I can see.

For example, I tried creating the address type from the description of
CASSANDRA-5590, and this is how its metadata looks (the metadata for a
query contains a column with a custom type and this is the description of
it):

org.apache.cassandra.db.marshal.UserType(user_defined_types,61646472657373,737472656574:org.apache.cassandra.db.marshal.UTF8Type,63697479:org.apache.cassandra.db.marshal.UTF8Type,7a69705f636f6465:org.apache.cassandra.db.marshal.Int32Type,70686f6e6573:org.apache.cassandra.db.marshal.SetType(org.apache.cassandra.db.marshal.UTF8Type))

Is the client supposed to parse that description, and in that case how? I
could probably figure it out but it would be great if someone could point
me to the right docs.

yours,
Theo (author of cql-rb, the Ruby driver)


Re: How should clients handle the user defined types in 2.1?

2014-02-24 Thread Theo Hultberg
There hasn't been any activity (apart from my question) since december, and
only sporadic activity before that, so I think it's essentially dead.

http://www.mail-archive.com/client-dev@cassandra.apache.org/

T#


On Mon, Feb 24, 2014 at 10:34 PM, Ben Hood 0x6e6...@gmail.com wrote:

 On Mon, Feb 24, 2014 at 7:52 PM, Theo Hultberg t...@iconara.net wrote:
  (I posted this on the client-dev list the other day, but that list seems
  dead so I'm cross posting, sorry if it's the wrong thing to do)

 I didn't even realize there was a list for driver implementors - is
 this used at all? Is it worth being on this list?



Re: manually removing sstable

2013-07-12 Thread Theo Hultberg
thanks aaron, the second point I had not considered, and it could explain
why the sstables don't always disapear completely, sometimes a small file
(but megabytes instead of gigabytes) is left behind.

T#


On Fri, Jul 12, 2013 at 10:25 AM, aaron morton aa...@thelastpickle.comwrote:

 That sounds sane to me. Couple of caveats:

 * Remember that Expiring Columns turn into Tombstones and can only be
 purged after TTL and gc_grace.
 * Tombstones will only be purged if all fragments of a row are in the
 SStable(s) being compacted.

 Cheers

 -
 Aaron Morton
 Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 11/07/2013, at 10:17 PM, Theo Hultberg t...@iconara.net wrote:

 a colleague of mine came up with an alternative solution that also seems
 to work, and I'd just like your opinion on if it's sound.

 we run find to list all old sstables, and then use cmdline-jmxclient to
 run the forceUserDefinedCompaction function on each of them, this is
 roughly what we do (but with find and xargs to orchestrate it)

   java -jar cmdline-jmxclient-0.10.3.jar - localhost:7199
 org.apache.cassandra.db:type=CompactionManager 
 forceUserDefinedCompaction=the_keyspace,db_file_name

 the downside is that c* needs to read the file and do disk io, but the
 upside is that it doesn't require a restart. c* does a little more work,
 but we can schedule that during off-peak hours. another upside is that it
 feels like we're pretty safe from screwups, we won't accidentally remove an
 sstable with live data, the worst case is that we ask c* to compact an
 sstable with live data and end up with an identical sstable.

 if anyone else wants to do the same thing, this is the full cron command:

 0 4 * * * find /path/to/cassandra/data/the_keyspace_name -maxdepth 1 -type
 f -name '*-Data.db' -mtime +8 -printf
 forceUserDefinedCompaction=the_keyspace_name,\%P\n | xargs -t
 --no-run-if-empty java -jar
 /usr/local/share/java/cmdline-jmxclient-0.10.3.jar - localhost:7199
 org.apache.cassandra.db:type=CompactionManager

 just change the keyspace name and the path to the data directory.

 T#


 On Thu, Jul 11, 2013 at 7:09 AM, Theo Hultberg t...@iconara.net wrote:

 thanks a lot. I can confirm that it solved our problem too.

 looks like the C* 2.0 feature is perfect for us.

 T#


 On Wed, Jul 10, 2013 at 7:28 PM, Marcus Eriksson krum...@gmail.comwrote:

 yep that works, you need to remove all components of the sstable though,
 not just -Data.db

 and, in 2.0 there is this:
 https://issues.apache.org/jira/browse/CASSANDRA-5228

 /Marcus


 On Wed, Jul 10, 2013 at 2:09 PM, Theo Hultberg t...@iconara.net wrote:

 Hi,

 I think I remember reading that if you have sstables that you know
 contain only data whose ttl has expired, it's safe to remove them
 manually by stopping c*, removing the *-Data.db files and then starting up
 c* again. is this correct?

 we have a cluster where everything is written with a ttl, and sometimes
 c* needs to compact over 100 gb of sstables where we know everything has
 expired, and we'd rather just manually get rid of those.

 T#








Re: Extract meta-data using cql 3

2013-07-12 Thread Theo Hultberg
there's a keyspace called system which has a few tables that contain the
metadata. for example schema_keyspaces that contain keyspace metadata, and
schema_columnfamilies that contain table metadata. there are more, just
fire up cqlsh and do a describe keyspace in the system keyspace to find
them.
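
and if you're doing it from an application rather than cqlsh it's just a
regular query, for example with the DataStax Java driver (a sketch, the
contact point and keyspace name are made up):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class ListTables {
  public static void main(String[] args) {
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect();
    // table metadata for one keyspace, straight from the system keyspace
    for (Row row : session.execute(
        "SELECT columnfamily_name, comment FROM system.schema_columnfamilies WHERE keyspace_name = 'test'")) {
      System.out.println(row.getString("columnfamily_name") + " -- " + row.getString("comment"));
    }
    cluster.close();
  }
}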

T#


On Fri, Jul 12, 2013 at 10:52 AM, Murali muralidharan@gmail.com wrote:

 Hi experts,

 How to extract meta-data of a table or a keyspace using CQL 3.0?

 --
 Thanks,
 Murali




manually removing sstable

2013-07-10 Thread Theo Hultberg
Hi,

I think I remember reading that if you have sstables that you know contain
only data whose ttl has expired, it's safe to remove them manually by
stopping c*, removing the *-Data.db files and then starting up c* again. is
this correct?

we have a cluster where everything is written with a ttl, and sometimes c*
needs to compact over 100 gb of sstables where we know everything has expired,
and we'd rather just manually get rid of those.

T#


Re: manually removing sstable

2013-07-10 Thread Theo Hultberg
thanks a lot. I can confirm that it solved our problem too.

looks like the C* 2.0 feature is perfect for us.

T#


On Wed, Jul 10, 2013 at 7:28 PM, Marcus Eriksson krum...@gmail.com wrote:

 yep that works, you need to remove all components of the sstable though,
 not just -Data.db

 and, in 2.0 there is this:
 https://issues.apache.org/jira/browse/CASSANDRA-5228

 /Marcus


 On Wed, Jul 10, 2013 at 2:09 PM, Theo Hultberg t...@iconara.net wrote:

 Hi,

 I think I remember reading that if you have sstables that you know
 contain only data whose ttl has expired, it's safe to remove them
 manually by stopping c*, removing the *-Data.db files and then starting up
 c* again. is this correct?

 we have a cluster where everything is written with a ttl, and sometimes
 c* needs to compact over 100 gb of sstables where we know everything has
 expired, and we'd rather just manually get rid of those.

 T#





Re: does anyone store large values in cassandra e.g. 100kb?

2013-07-09 Thread Theo Hultberg
We store objects that are a couple of tens of K, sometimes 100K, and we
store quite a few of these per row, sometimes hundreds of thousands.

One problem we encountered early was that these rows would become so big
that C* couldn't compact the rows in-memory and had to revert to slow
two-pass compactions where it spills partially compacted rows to disk. we
solved that in two ways, first by
increasing in_memory_compaction_limit_in_mb from 64 to 128, and although it
helped a little bit we quickly realized it didn't have much effect because
most of the time was taken up by really huge rows many times larger than
that.

We ended up implementing a simple sharding scheme where each row is
actually 36 rows that each contain 1/36 of the range (we take the first
letter in the column key and stick that on the row key on writes, and on
reads we read all 36 rows -- 36 because there are 36 letters and numbers in
the ascii alphabet and our column keys happen to distribute over that quite
nicely).
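
to illustrate, the key handling looks roughly like this (just a sketch, the
separator and exact key layout are made up, but it shows the idea):

import java.util.ArrayList;
import java.util.List;

public class RowSharding {
  private static final String ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz";

  // on writes: the column goes into the shard named after the first character
  // of the column key
  public static String shardKey(String rowKey, String columnKey) {
    char first = Character.toLowerCase(columnKey.charAt(0));
    return rowKey + ":" + first;
  }

  // on reads: query all 36 shard row keys and merge the results
  public static List<String> allShardKeys(String rowKey) {
    List<String> keys = new ArrayList<String>(ALPHABET.length());
    for (char c : ALPHABET.toCharArray()) {
      keys.add(rowKey + ":" + c);
    }
    return keys;
  }
}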

Cassandra works well with semi-large objects, and it works well with wide
rows, but you have to be careful about the combination where rows get
larger than 64 Mb.

T#


On Mon, Jul 8, 2013 at 8:13 PM, S Ahmed sahmed1...@gmail.com wrote:

 Hi Peter,

 Can you describe your environment, # of documents and what kind of usage
 pattern you have?




 On Mon, Jul 8, 2013 at 2:06 PM, Peter Lin wool...@gmail.com wrote:

 I regularly store word and pdf docs in cassandra without any issues.




 On Mon, Jul 8, 2013 at 1:46 PM, S Ahmed sahmed1...@gmail.com wrote:

 I'm guessing that most people use cassandra to store relatively smaller
 payloads like 1-5kb in size.

 Is there anyone using it to store say 100kb (1/10 of a megabyte) and if
 so, was there any tweaking or gotchas that you ran into?






Re: does anyone store large values in cassandra e.g. 100kb?

2013-07-09 Thread Theo Hultberg
yes, by splitting the rows into 36 parts it's very rare that any part gets
big enough to impact the clusters performance. there are still rows that
are bigger than the in memory compaction limit, but when it's only some it
doesn't matter as much.

T#


On Tue, Jul 9, 2013 at 5:43 PM, S Ahmed sahmed1...@gmail.com wrote:

 So was the point of breaking into 36 parts to bring each row to the 64 or
 128mb threshold?


 On Tue, Jul 9, 2013 at 3:18 AM, Theo Hultberg t...@iconara.net wrote:

 We store objects that are a couple of tens of K, sometimes 100K, and we
 store quite a few of these per row, sometimes hundreds of thousands.

 One problem we encountered early was that these rows would become so big
 that C* couldn't compact the rows in-memory and had to revert to slow
 two-pass compactions where it spills partially compacted rows to disk. we
 solved that in two ways, first by
 increasing in_memory_compaction_limit_in_mb from 64 to 128, and although it
 helped a little bit we quickly realized it didn't have much effect because
 most of the time was taken up by really huge rows many times larger than
 that.

 We ended up implementing a simple sharding scheme where each row is
 actually 36 rows that each contain 1/36 of the range (we take the first
 letter in the column key and stick that on the row key on writes, and on
 reads we read all 36 rows -- 36 because there are 36 letters and numbers in
 the ascii alphabet and our column keys happen to distribute over that quite
 nicely).

 Cassandra works well with semi-large objects, and it works well with wide
 rows, but you have to be careful about the combination where rows get
 larger than 64 Mb.

 T#


 On Mon, Jul 8, 2013 at 8:13 PM, S Ahmed sahmed1...@gmail.com wrote:

 Hi Peter,

 Can you describe your environment, # of documents and what kind of usage
 pattern you have?




 On Mon, Jul 8, 2013 at 2:06 PM, Peter Lin wool...@gmail.com wrote:

 I regularly store word and pdf docs in cassandra without any issues.




 On Mon, Jul 8, 2013 at 1:46 PM, S Ahmed sahmed1...@gmail.com wrote:

 I'm guessing that most people use cassandra to store relatively
 smaller payloads like 1-5kb in size.

 Is there anyone using it to store say 100kb (1/10 of a megabyte) and
 if so, was there any tweaking or gotchas that you ran into?








Re: What is best Cassandra client?

2013-07-04 Thread Theo Hultberg
Datastax Java driver: https://github.com/datastax/java-driver

T#


On Thu, Jul 4, 2013 at 10:25 AM, Tony Anecito adanec...@yahoo.com wrote:

 Hi All,
 What is the best client to use? I want to use CQL 3.0.3 and have support
 for preparedStatmements. I tried JDBC and the thrift client so far.

 Thanks!



Re: Performance issues with CQL3 collections?

2013-06-27 Thread Theo Hultberg
the thing I was doing was definitely triggering the range tombstone issue,
this is what I was doing:

UPDATE clocks SET clock = ? WHERE shard = ?

in this table:

CREATE TABLE clocks (shard INT PRIMARY KEY, clock MAP<TEXT, TIMESTAMP>)

however, from the stack overflow posts it sounds like they aren't
necessarily overwriting their collections. I've tried to replicate their
problem with these two statements

INSERT INTO clocks (shard, clock) VALUES (?, ?)
UPDATE clocks SET clock = clock + ? WHERE shard = ?

the first one should create range tombstones because it overwrites the
map on every insert, and the second should not because it adds to the map.
neither of those seems to have any performance issues, at least not on
inserts.

and it's the slowdown on inserts that confuses me, both the stack overflow
questioners say that they saw a drop in insert performance. I never saw
that in my application, I just got slow reads (and Fabien's explanation
makes complete sense for that). I don't understand how insert performance
could be affected at all, and I know that for non-counter columns cassandra
doesn't read before it writes, but is it the same for collections too? they
are a bit special, but how special are they?

T#


On Fri, Jun 28, 2013 at 7:04 AM, aaron morton aa...@thelastpickle.comwrote:

 Can you provide details of the mutation statements you are running ? The
 Stack Overflow posts don't seem to include them.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 27/06/2013, at 5:58 AM, Theo Hultberg t...@iconara.net wrote:

 do I understand it correctly if I think that collection modifications are
 done by reading the collection, writing a range tombstone that would cover
 the collection and then re-writing the whole collection again? or is it
 just the modified parts of the collection that are covered by the range
 tombstones, but you still get massive amounts of them and its just their
 number that is the problem.

 would this explain the slowdown of writes too? I guess it would if
 cassandra needed to read the collection before it wrote the new values,
 otherwise I don't understand how this affects writes, but that only says
 how much I know about how this works.

 T#


 On Wed, Jun 26, 2013 at 10:48 AM, Fabien Rousseau fab...@yakaz.comwrote:

 Hi,

 I'm pretty sure that it's related to this ticket :
 https://issues.apache.org/jira/browse/CASSANDRA-5677

 I'd be happy if someone tests this patch.
 It should apply easily on 1.2.5  1.2.6

 After applying the patch, by default, the current implementation is still
 used, but modify your cassandra.yaml to add the following one :
 interval_tree_provider: IntervalTreeAvlProvider

 (Note that implementations should be interchangeable, because they share
 the same serializers and deserializers)

 Also, please note that this patch has not been reviewed nor intensively
 tested... So, it may not be production ready

 Fabien







 2013/6/26 Theo Hultberg t...@iconara.net

 Hi,

 I've seen a couple of people on Stack Overflow having problems with
 performance when they have maps that they continuously update, and in
 hindsight I think I might have run into the same problem myself (but I
 didn't suspect it as the reason and designed differently and by accident
 didn't use maps anymore).

 Is there any reason that maps (or lists or sets) in particular would
 become a performance issue when they're heavily modified? As I've
 understood them they're not special, and shouldn't be any different
 performance wise than overwriting regular columns. Is there something
 different going on that I'm missing?

 Here are the Stack Overflow questions:


 http://stackoverflow.com/questions/17282837/cassandra-insert-perfomance-issue-into-a-table-with-a-map-type/17290981


 http://stackoverflow.com/questions/17082963/bad-performance-when-writing-log-data-to-cassandra-with-timeuuid-as-a-column-nam/17123236

 yours,
 Theo




 --
 Fabien Rousseau
  aur...@yakaz.com
  www.yakaz.com






Performance issues with CQL3 collections?

2013-06-26 Thread Theo Hultberg
Hi,

I've seen a couple of people on Stack Overflow having problems with
performance when they have maps that they continuously update, and in
hindsight I think I might have run into the same problem myself (but I
didn't suspect it as the reason and designed differently and by accident
didn't use maps anymore).

Is there any reason that maps (or lists or sets) in particular would become
a performance issue when they're heavily modified? As I've understood them
they're not special, and shouldn't be any different performance wise than
overwriting regular columns. Is there something different going on that I'm
missing?

Here are the Stack Overflow questions:

http://stackoverflow.com/questions/17282837/cassandra-insert-perfomance-issue-into-a-table-with-a-map-type/17290981

http://stackoverflow.com/questions/17082963/bad-performance-when-writing-log-data-to-cassandra-with-timeuuid-as-a-column-nam/17123236

yours,
Theo


Re: Performance issues with CQL3 collections?

2013-06-26 Thread Theo Hultberg
do I understand it correctly if I think that collection modifications are
done by reading the collection, writing a range tombstone that would cover
the collection and then re-writing the whole collection again? or is it
just the modified parts of the collection that are covered by the range
tombstones, but you still get massive amounts of them and its just their
number that is the problem.

would this explain the slowdown of writes too? I guess it would if
cassandra needed to read the collection before it wrote the new values,
otherwise I don't understand how this affects writes, but that only says
how much I know about how this works.

T#


On Wed, Jun 26, 2013 at 10:48 AM, Fabien Rousseau fab...@yakaz.com wrote:

 Hi,

 I'm pretty sure that it's related to this ticket :
 https://issues.apache.org/jira/browse/CASSANDRA-5677

 I'd be happy if someone tests this patch.
 It should apply easily on 1.2.5  1.2.6

 After applying the patch, by default, the current implementation is still
 used, but modify your cassandra.yaml to add the following one :
 interval_tree_provider: IntervalTreeAvlProvider

 (Note that implementations should be interchangeable, because they share
 the same serializers and deserializers)

 Also, please note that this patch has not been reviewed nor intensively
 tested... So, it may not be production ready

 Fabien







 2013/6/26 Theo Hultberg t...@iconara.net

 Hi,

 I've seen a couple of people on Stack Overflow having problems with
 performance when they have maps that they continuously update, and in
 hindsight I think I might have run into the same problem myself (but I
 didn't suspect it as the reason and designed differently and by accident
 didn't use maps anymore).

 Is there any reason that maps (or lists or sets) in particular would
 become a performance issue when they're heavily modified? As I've
 understood them they're not special, and shouldn't be any different
 performance wise than overwriting regular columns. Is there something
 different going on that I'm missing?

 Here are the Stack Overflow questions:


 http://stackoverflow.com/questions/17282837/cassandra-insert-perfomance-issue-into-a-table-with-a-map-type/17290981


 http://stackoverflow.com/questions/17082963/bad-performance-when-writing-log-data-to-cassandra-with-timeuuid-as-a-column-nam/17123236

 yours,
 Theo




 --
 Fabien Rousseau
  aur...@yakaz.com
  www.yakaz.com



cql-rb, the CQL3 driver for Ruby has reached v1.0

2013-06-13 Thread Theo Hultberg
After a few months of development and many preview releases cql-rb, the
pure Ruby CQL3 driver has finally reached v1.0.

You can find the code and examples on GitHub:
https://github.com/iconara/cql-rb

T#


Re: Why so many vnodes?

2013-06-11 Thread Theo Hultberg
But in the paragraph just before Richard said that finding the node that
owns a token becomes slower on large clusters with lots of token ranges, so
increasing it further seems contradictory.

Is this a correct interpretation: finding the node that owns a particular
token becomes slower as the number of nodes (and therefore total token
ranges) increases, but for large clusters you also need to take the time
for bootstraps into account, which will become slower if each node has
fewer token ranges. The speeds referred to in the two cases are for different
operations, so there is a trade off to be made, and 256 initial tokens is a
balance that works for most cases.

T#


On Tue, Jun 11, 2013 at 8:37 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 I think he actually meant *increase*, for this reason For small T, a
 random choice of initial tokens will in most cases give a poor distribution
 of data.  The larger T is, the closer to uniform the distribution will be,
 with increasing probability.

 Alain


 2013/6/11 Theo Hultberg t...@iconara.net

 thanks, that makes sense, but I assume in your last sentence you mean
 decrease it for large clusters, not increase it?

 T#


 On Mon, Jun 10, 2013 at 11:02 PM, Richard Low rich...@wentnet.comwrote:

 Hi Theo,

 The number (let's call it T and the number of nodes N) 256 was chosen to
 give good load balancing for random token assignments for most cluster
 sizes.  For small T, a random choice of initial tokens will in most cases
 give a poor distribution of data.  The larger T is, the closer to uniform
 the distribution will be, with increasing probability.

 Also, for small T, when a new node is added, it won't have many ranges
 to split so won't be able to take an even slice of the data.

 For this reason T should be large.  But if it is too large, there are
 too many slices to keep track of as you say.  The function to find which
 keys live where becomes more expensive and operations that deal with
 individual vnodes e.g. repair become slow.  (An extreme example is SELECT *
 LIMIT 1, which when there is no data has to scan each vnode in turn in
 search of a single row.  This is O(NT) and for even quite small T takes
 seconds to complete.)

 So 256 was chosen to be a reasonable balance.  I don't think most users
 will find it too slow; users with extremely large clusters may need to
 increase it.

 Richard.


 On 10 June 2013 18:55, Theo Hultberg t...@iconara.net wrote:

 I'm not sure I follow what you mean, or if I've misunderstood what
 Cassandra is telling me. Each node has 256 vnodes (or tokens, as the
 preferred name seems to be). When I run `nodetool status` each node is
 reported as having 256 vnodes, regardless of how many nodes are in the
 cluster. A single node cluster has 256 vnodes on the single node, a six
 node cluster has 256 vnodes on each machine, making 1536 vnodes in total.
 When I run `SELECT tokens FROM system.peers` or `nodetool ring` each node
 lists 256 tokens.

 This is different from how it works in Riak and Voldemort, if I'm not
 mistaken, and that is the source of my confusion.

 T#


 On Mon, Jun 10, 2013 at 4:54 PM, Milind Parikh 
 milindpar...@gmail.comwrote:

 There are n vnodes regardless of the size of the physical cluster.
 Regards
 Milind
 On Jun 10, 2013 7:48 AM, Theo Hultberg t...@iconara.net wrote:

 Hi,

 The default number of vnodes is 256, is there any significance in
 this number? Since Cassandra's vnodes don't work like for example Riak's,
  where there is a fixed number of vnodes distributed evenly over the nodes,
  why so many? Even with a moderately sized cluster you get thousands of
  slices. Does this matter? If your cluster grows to over thirty machines and
  you start looking at ten thousand slices, would that be a problem? I guess
  that traversing a list of a thousand or ten thousand slices to find where a
  token lives isn't a huge problem, but are there any other up or downsides
  to having a small or large number of vnodes per node?

 I understand the benefits for splitting up the ring into pieces, for
  example to be able to stream data from more nodes when bootstrapping a new
  one, but that works even if each node only has say 32 vnodes (unless your
 cluster is truly huge).

 yours,
 Theo








Why so many vnodes?

2013-06-10 Thread Theo Hultberg
Hi,

The default number of vnodes is 256, is there any significance in this
number? Since Cassandra's vnodes don't work like for example Riak's, where
there is a fixed number of vnodes distributed evenly over the nodes, why so
many? Even with a moderately sized cluster you get thousands of slices.
Does this matter? If your cluster grows to over thirty machines and you
start looking at ten thousand slices, would that be a problem? I guess that
traversing a list of a thousand or ten thousand slices to find where a
token lives isn't a huge problem, but are there any other up or downsides
to having a small or large number of vnodes per node?

I understand the benefits for splitting up the ring into pieces, for
example to be able to stream data from more nodes when bootstrapping a new
one, but that works even if each node only has say 32 vnodes (unless your
cluster is truly huge).

yours,
Theo


Re: Why so many vnodes?

2013-06-10 Thread Theo Hultberg
I'm not sure I follow what you mean, or if I've misunderstood what
Cassandra is telling me. Each node has 256 vnodes (or tokens, as the
prefered name seems to be). When I run `nodetool status` each node is
reported as having 256 vnodes, regardless of how many nodes are in the
cluster. A single node cluster has 256 vnodes on the single node, a six
node cluster has 256 vnodes on each machine, making 1536 vnodes in total.
When I run `SELECT tokens FROM system.peers` or `nodetool ring` each node
lists 256 tokens.

This is different from how it works in Riak and Voldemort, if I'm not
mistaken, and that is the source of my confusion.

T#


On Mon, Jun 10, 2013 at 4:54 PM, Milind Parikh milindpar...@gmail.comwrote:

 There are n vnodes regardless of the size of the physical cluster.
 Regards
 Milind
 On Jun 10, 2013 7:48 AM, Theo Hultberg t...@iconara.net wrote:

 Hi,

 The default number of vnodes is 256, is there any significance in this
 number? Since Cassandra's vnodes don't work like for example Riak's, where
 there is a fixed number of vnodes distributed evenly over the nodes, why so
 many? Even with a moderately sized cluster you get thousands of slices.
 Does this matter? If your cluster grows to over thirty machines and you
 start looking at ten thousand slices, would that be a problem? I guess that
 traversing a list of a thousand or ten thousand slices to find where a
 token lives isn't a huge problem, but are there any other up or downsides
 to having a small or large number of vnodes per node?

 I understand the benefits for splitting up the ring into pieces, for
 example to be able to stream data from more nodes when bootstrapping a new
 one, but that works even if each node only has say 32 vnodes (unless your
 cluster is truly huge).

 yours,
 Theo




Re: Why so many vnodes?

2013-06-10 Thread Theo Hultberg
thanks, that makes sense, but I assume in your last sentence you mean
decrease it for large clusters, not increase it?

T#


On Mon, Jun 10, 2013 at 11:02 PM, Richard Low rich...@wentnet.com wrote:

 Hi Theo,

 The number (let's call it T and the number of nodes N) 256 was chosen to
 give good load balancing for random token assignments for most cluster
 sizes.  For small T, a random choice of initial tokens will in most cases
 give a poor distribution of data.  The larger T is, the closer to uniform
 the distribution will be, with increasing probability.

 Also, for small T, when a new node is added, it won't have many ranges to
 split so won't be able to take an even slice of the data.

 For this reason T should be large.  But if it is too large, there are too
 many slices to keep track of as you say.  The function to find which keys
 live where becomes more expensive and operations that deal with individual
 vnodes e.g. repair become slow.  (An extreme example is SELECT * LIMIT 1,
 which when there is no data has to scan each vnode in turn in search of a
 single row.  This is O(NT) and for even quite small T takes seconds to
 complete.)

 So 256 was chosen to be a reasonable balance.  I don't think most users
 will find it too slow; users with extremely large clusters may need to
 increase it.

 Richard.


 On 10 June 2013 18:55, Theo Hultberg t...@iconara.net wrote:

 I'm not sure I follow what you mean, or if I've misunderstood what
 Cassandra is telling me. Each node has 256 vnodes (or tokens, as the
 preferred name seems to be). When I run `nodetool status` each node is
 reported as having 256 vnodes, regardless of how many nodes are in the
 cluster. A single node cluster has 256 vnodes on the single node, a six
 node cluster has 256 vnodes on each machine, making 1536 vnodes in total.
 When I run `SELECT tokens FROM system.peers` or `nodetool ring` each node
 lists 256 tokens.

 This is different from how it works in Riak and Voldemort, if I'm not
 mistaken, and that is the source of my confusion.

 T#


 On Mon, Jun 10, 2013 at 4:54 PM, Milind Parikh milindpar...@gmail.comwrote:

 There are n vnodes regardless of the size of the physical cluster.
 Regards
 Milind
 On Jun 10, 2013 7:48 AM, Theo Hultberg t...@iconara.net wrote:

 Hi,

 The default number of vnodes is 256, is there any significance in this
 number? Since Cassandra's vnodes don't work like for example Riak's, where
 there is a fixed number of vnodes distributed evenly over the nodes, why so
 many? Even with a moderately sized cluster you get thousands of slices.
 Does this matter? If your cluster grows to over thirty machines and you
 start looking at ten thousand slices, would that be a problem? I guess that
 traversing a list of a thousand or ten thousand slices to find where a
 token lives isn't a huge problem, but are there any other up or downsides
 to having a small or large number of vnodes per node?

 I understand the benefits for splitting up the ring into pieces, for
 example to be able to stream data from more nodes when bootstrapping a new
 one, but that works even if each node only has say 32 vnodes (unless your
 cluster is truly huge).

 yours,
 Theo






Re: [Cassandra] Conflict resolution in Cassandra

2013-06-07 Thread Theo Hultberg
Like Edward says Cassandra's conflict resolution strategy is LWW (last
write wins). This may seem simplistic, but Cassandra's Big Query-esque data
model makes it less of an issue than in a pure key/value-store like Riak,
for example. When all you have is an opaque value for a key you want to be
able to do things like keeping conflicting writes so that you can resolve
them later. Since Cassandra's rows aren't opaque, but more like a sorted
map LWW is almost always enough. With Cassandra you can add new
columns/cells to a row from multiple clients without having to worry about
conflicts. It's only when multiple clients write to the same column/cell
that there is an issue, but in that case you usually can (and you probably
should) model your way around that.
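
as a concrete illustration (a sketch with the DataStax Java driver and a
made-up users table, using explicit timestamps to make the last-write-wins
rule visible):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class LastWriteWins {
  public static void main(String[] args) {
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect("my_keyspace");
    // writes to different cells of the same row never conflict, the row becomes the union
    session.execute("UPDATE users USING TIMESTAMP 1000 SET email = 'a@example.com' WHERE id = 1");
    session.execute("UPDATE users USING TIMESTAMP 1000 SET name = 'Alice' WHERE id = 1");
    // writes to the same cell: the highest timestamp wins, regardless of arrival order
    session.execute("UPDATE users USING TIMESTAMP 2000 SET email = 'b@example.com' WHERE id = 1");
    session.execute("UPDATE users USING TIMESTAMP 1500 SET email = 'c@example.com' WHERE id = 1");
    // reading the row back now returns name = 'Alice' and email = 'b@example.com'
    cluster.close();
  }
}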

T#


On Fri, Jun 7, 2013 at 4:51 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 Conflicts are managed at the column level.
 1) If two columns have the same name the column with the highest timestamp
 wins.
 2) If two columns have the same column name and the same timestamp the
 value of the column is compared and the highest* wins.

 Someone correct me if I am wrong about the *. I know the algorithm is
 deterministic, I do not remember if it is highest or lowest.


 On Thu, Jun 6, 2013 at 6:25 PM, Emalayan Vairavanathan 
 svemala...@yahoo.com wrote:

 I tried google and found conflicting answers. Thats why wanted to double
 check with user forum.

 Thanks

   --
  *From:* Bryan Talbot btal...@aeriagames.com
 *To:* user@cassandra.apache.org; Emalayan Vairavanathan 
 svemala...@yahoo.com
 *Sent:* Thursday, 6 June 2013 3:19 PM
 *Subject:* Re: [Cassandra] Conflict resolution in Cassandra

 For generic questions like this, google is your friend:
 http://lmgtfy.com/?q=cassandra+conflict+resolution

 -Bryan


 On Thu, Jun 6, 2013 at 11:23 AM, Emalayan Vairavanathan 
 svemala...@yahoo.com wrote:

 Hi All,

 Can someone tell me about the conflict resolution mechanisms provided by
 Cassandra?

 More specifically does Cassandra provides a way to define application
 specific conflict resolution mechanisms (per row basis  / column basis)?
or
 Does it automatically manage the conflicts based on some synchronization
 algorithms ?


 Thank you
 Emalayan









Re: Getting error Too many in flight hints

2013-05-31 Thread Theo Hultberg
thanks a lot for the explanation. if I understand it correctly it's basically
back pressure from C*, it's telling me that it's overloaded and that I need
to back off.

I better start a few more nodes, I guess.

T#


On Thu, May 30, 2013 at 10:47 PM, Robert Coli rc...@eventbrite.com wrote:

 On Thu, May 30, 2013 at 8:24 AM, Theo Hultberg t...@iconara.net wrote:
  I'm using Cassandra 1.2.4 on EC2 (3 x m1.large, this is a test cluster),
 and
  my application is talking to it over the binary protocol (I'm using JRuby
  and the cql-rb driver). I get this error quite frequently: Too many in
  flight hints: 2411 (the exact number varies)
 
  Has anyone any idea of what's causing it? I'm pushing the cluster quite
 hard
  with writes (but no reads at all).

 The code that produces this message (below) sets the bound based on
 the number of available processors. It is a bound on the number of in
 progress hints. An in progress hint (for some reason redundantly
 referred to as in flight) is a hint which has been submitted to the
 executor which will ultimately write it to local disk. If you get
 OverloadedException, this means that you were trying to write hints to
 this executor so fast that you risked OOM, so Cassandra refused to
 submit your hint to the hint executor and therefore (partially) failed
 your write.

 
 private static volatile int maxHintsInProgress = 1024 *
 FBUtilities.getAvailableProcessors();
 [... snip ...]
 for (InetAddress destination : targets)
 {
 // avoid OOMing due to excess hints.  we need to do this
 check even for live nodes, since we can
 // still generate hints for those if it's overloaded or
 simply dead but not yet known-to-be-dead.
 // The idea is that if we have over maxHintsInProgress
 hints in flight, this is probably due to
 // a small number of nodes causing problems, so we should
 avoid shutting down writes completely to
 // healthy nodes.  Any node with no hintsInProgress is
 considered healthy.
 if (totalHintsInProgress.get() > maxHintsInProgress
  && (hintsInProgress.get(destination).get() > 0 && shouldHint(destination)))
 {
 throw new OverloadedException("Too many in flight hints: "
  + totalHintsInProgress.get());
 }
 

 If Cassandra didn't return this exception, it might OOM while
 enqueueing your hints to be stored. Giving up on trying to enqueue a
 hint for the failed write is chosen instead. The solution is to reduce
 your write rate, ideally by enough that you don't even queue hints in
 the first place.

 =Rob



Getting error Too many in flight hints

2013-05-30 Thread Theo Hultberg
Hi,

I'm using Cassandra 1.2.4 on EC2 (3 x m1.large, this is a test cluster),
and my application is talking to it over the binary protocol (I'm using
JRuby and the cql-rb driver). I get this error quite frequently: Too many
in flight hints: 2411 (the exact number varies)

Has anyone any idea of what's causing it? I'm pushing the cluster quite
hard with writes (but no reads at all).

T#


Re: Limit on the size of a list

2013-05-13 Thread Theo Hultberg
In the CQL3 protocol the sizes of collections are unsigned shorts, so the
maximum number of elements in a LIST<...> is 65,536. There's no check,
afaik, that stops you from creating lists that are bigger than that, but the
protocol doesn't handle returning them (you get the first N % 65536 items).

On the other hand the JDBC driver talks Thrift rather than the binary
protocol, doesn't it? In that case there may be other limits.
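
to show where the limit comes from, this is roughly how a LIST<TEXT> value is
framed on the wire as far as I've been able to tell from reverse engineering
it (a sketch, not taken from any driver): an unsigned 16-bit element count,
then each element as an unsigned 16-bit length plus its bytes.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.List;

public class ListValueEncoder {
  public static byte[] encode(List<String> items) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream data = new DataOutputStream(out);
    data.writeShort(items.size()); // only the low 16 bits survive, hence the wrap-around
    for (String item : items) {
      byte[] bytes = item.getBytes(Charset.forName("UTF-8"));
      data.writeShort(bytes.length);
      data.write(bytes);
    }
    return out.toByteArray();
  }
}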

T#


On Mon, May 13, 2013 at 3:26 AM, Edward Capriolo edlinuxg...@gmail.comwrote:

 2 billion is the maximum theoretically limit of columns under a row. It is
 NOT the maximum limit of a CQL collection. The design of CQL collections
 currently require retrieving the entire collection on read.


 On Sun, May 12, 2013 at 11:13 AM, Robert Wille rwi...@footnote.comwrote:

 I designed a data model for my data that uses a list of UUID's in a
 column. When I designed my data model, my expectation was that most of the
 lists would have fewer than a hundred elements, with a few having several
 thousand. I discovered in my data a list that has nearly 400,000 items in
 it. When I try to retrieve it, I get the following exception:

 java.lang.IllegalArgumentException: Illegal Capacity: -14594
 at java.util.ArrayList.<init>(ArrayList.java:110)
 at org.apache.cassandra.cql.jdbc.ListMaker.compose(ListMaker.java:54)
 at org.apache.cassandra.cql.jdbc.TypedColumn.<init>(TypedColumn.java:68)
 at org.apache.cassandra.cql.jdbc.CassandraResultSet.createColumn(CassandraResultSet.java:1086)
 at org.apache.cassandra.cql.jdbc.CassandraResultSet.populateColumns(CassandraResultSet.java:161)
 at org.apache.cassandra.cql.jdbc.CassandraResultSet.<init>(CassandraResultSet.java:134)
 at org.apache.cassandra.cql.jdbc.CassandraStatement.doExecute(CassandraStatement.java:166)
 at org.apache.cassandra.cql.jdbc.CassandraStatement.executeQuery(CassandraStatement.java:226)


 I get this with Cassandra 1.2.4 and the latest snapshot of the JDBC
 driver. Admittedly, several hundred thousand is quite a lot of items, but
 odd that I'm getting some kind of wraparound, since 400,000 is a long ways
 from 2 billion.

 What are the physical and practical limits on the size of a list? Is it
 possible to retrieve a range of items from a list?

 Thanks in advance

 Robert






New CQL3 driver for Ruby

2013-02-24 Thread Theo Hultberg
Hi,

For the last few weeks I've been working on a CQL3 driver for Ruby. If
you're using Ruby and Cassandra I would very much like your help getting it
production ready.

You can find the code and documentation here:

https://github.com/iconara/cql-rb

The driver supports the full CQL3 protocol except for authentication. It's
implemented purely in Ruby and has no dependencies.

If you try it out and find a bug (which I'm sure you will), please email me
directy (t...@iconara.net) or open an issue in the GitHub project.

yours,
Theo


Re: cql: show tables in a keyspace

2013-01-28 Thread Theo Hultberg
the DESCRIBE family of commands in cqlsh are wrappers around queries to the
system keyspace, so if you want to inspect what keyspaces and tables exist
from your application you can do something like:

SELECT columnfamily_name, comment
FROM system.schema_columnfamilies
WHERE keyspace_name = 'test';

or

SELECT * FROM system.schema_keyspaces;

T#


On Mon, Jan 28, 2013 at 8:35 PM, Brian O'Neill b...@alumni.brown.eduwrote:


 cqlsh use keyspace;
 cqlsh:cirrus describe tables;

 For more info:
 cqlsh help describe

 -brian


 ---
 Brian O'Neill
 Lead Architect, Software Development
 Health Market Science
 The Science of Better Results
 2700 Horizon Drive • King of Prussia, PA • 19406
 M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42  •
 healthmarketscience.com

 This information transmitted in this email message is for the intended
 recipient only and may contain confidential and/or privileged material. If
 you received this email in error and are not the intended recipient, or
 the person responsible to deliver it to the intended recipient, please
 contact the sender at the email above and delete this email and any
 attachments and destroy any copies thereof. Any review, retransmission,
 dissemination, copying or other use of, or taking any action in reliance
 upon, this information by persons or entities other than the intended
 recipient is strictly prohibited.







 On 1/28/13 2:27 PM, Paul van Hoven paul.van.ho...@googlemail.com
 wrote:

 Is there some way in cql to get a list of all tables or column
 families that belong to a keyspace, like show tables in sql?





Re: CQL3 Frame Length

2013-01-19 Thread Theo Hultberg
Hi,

Another reason for keeping the frame length in the header is that newer
versions can add fields to frames without older clients breaking. For
example, a minor release can add some more content to an existing frame, and
an older client that knows the full frame length can simply skip the bytes it
doesn't understand. If clients didn't know the full frame length there would
be trailing garbage on the connection, which would most likely crash the
client.
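
concretely, the v1/v2 frame header is version (1 byte), flags (1 byte),
stream id (1 byte), opcode (1 byte) and body length (4 bytes), so a reader
can always pull a whole frame off the socket before decoding it, something
like this sketch:

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FrameReader {
  public static byte[] readFrameBody(InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    int version = data.readUnsignedByte(); // protocol version plus direction bit
    int flags = data.readUnsignedByte();   // compression, tracing, ...
    int streamId = data.readByte();        // request/response correlation
    int opcode = data.readUnsignedByte();  // RESULT, ERROR, EVENT, ...
    int length = data.readInt();           // total body length in bytes
    byte[] body = new byte[length];
    data.readFully(body);
    // decode only the fields this client knows about; any extra bytes a newer
    // server appended are part of body and get discarded, not left on the socket
    return body;
  }
}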

T#

 Hey Sylvain,

 Thanks for explaining the rationale. When you look at from the perspective
 of the use cases you mention, it makes sense to be able to supply the
 reader with the frame size up front.

 I've opted to go for serializing the frame into a buffer. Although this
 could materialize an arbitrarily large amount of memory, ultimately the
 driving application has control of the degree to which this can occur, so
 in the grander scheme of things, you can still maintain streaming
semantics.

 Thanks for the heads up.

 Cheers,

 Ben


 On Tue, Jan 8, 2013 at 4:08 PM, Sylvain Lebresne sylv...@datastax.com
wrote:

 Mostly this is because having the frame length is convenient to have in
 practice.

 Without pretending that there is only one way to write a server, it is
 common
 to separate the phase read a frame from the network from the phase
 decode
 the frame which is often simpler if you can read the frame upfront.
Also,
 if
 you don't have the frame size, it means you need to decode the whole
frame
 before being able to decode the next one, and so you can't parallelize
the
 decoding.

 It is true however that it means for the write side that you need to
 either be
 able to either pre-compute the frame body size or to serialize it in
memory
 first. That's a trade of for making it easier on the read side. But if
you
 want
 my opinion, on the write side too it's probably worth parallelizing the
 message
 encoding (which require you encode it in memory first) since it's an
 asynchronous protocol and so there will likely be multiple writer
 simultaneously.

 --
 Sylvain



 On Tue, Jan 8, 2013 at 12:48 PM, Ben Hood 0x6e6...@gmail.com wrote:

 Hi,

 I've read the CQL wire specification and naively, I can't see how the
 frame length length header is used.

 To me, it looks like on the read side, you know which type of structures
 to expect based on the opcode and each structure is TLV encoded.

 On the write side, you need to encode TLV structures as well, but you
 don't know the overall frame length until you've encoded it. So it would
 seem that you either need to pre-calculate the cumulative TLV size
before
 you serialize the frame body, or you serialize the frame body to a
buffer
 which you can then get the size of and then write to the socket, after
 having first written the count out.

 Is there potentially an implicit assumption that the reader will want to
 pre-buffer the entire frame before decoding it?

 Cheers,

 Ben