Question regarding tombstone removal and compaction

2012-08-10 Thread Fredrik
We've had a bug that caused one of our column families to grow very large: 280 GB on a 500 GB disk. We're using size-tiered compaction. Since it's append-only data, I've now issued deletes of 260 GB of superfluous data. 1. There are some quite large SSTables (80 GB, 40 GB, etc.). If I run a major

Problem with building Cascading tap for Cassandra

2012-08-10 Thread Gijs Stuurman
Hi all, I'm trying to build a Cascading tap for Cassandra. Cascading is a layer on top of Hadoop. For this purpose I use ColumnFamilyInputFormat and ColumnFamilyRecordReader from Cassandra. I ran into a problem that the record reader would create an endless iterator because something goes wrong

Commit log + Data directory on same partition (software raid)

2012-08-10 Thread Thibaut Britz
Hi, Has any of you had experience with software RAID (RAID 1, mirroring 2 disks)? Our workload is rather read-based at the moment (the commit log directory only grows by 128MB every 2-3 minutes), while the second hd is under high load due to the read requests to our Cassandra cluster. I

Re: Commit log + Data directory on same partition (software raid)

2012-08-10 Thread Radim Kolar
I was thinking about putting both the commit log and the data directory on a software raid partition spanning over the two disks. Would this increase the general read performance? In theory I could get twice the read performance, but I don't know how the commit log will influence the read

CQL connections

2012-08-10 Thread David McNelis
In using CQL (the python library, at least), I didn't see a way to pass in multiple nodes as hosts. With other libraries (like Hector and Pycassa) I can set multiple hosts and my app will work with any node on that list. Is there something similar going on in the background with CQL? If not, then

Re: Decision Making- YCSB

2012-08-10 Thread Edward Capriolo
There are many YCSB forks on GitHub that get optimized for specific databases but the default one is decent across the defaults. Cassandra has its own internal stress tool that we like better. The shortcomings are that generic tools and generic workloads are generic and thus not real-world. But

Re: Decision Making- YCSB

2012-08-10 Thread Mohit Anchlia
I agree with Edward. We always develop our own stress tool that tests each use case of interest. Every use case is different in certain ways that can only be tested using a custom stress tool. On Fri, Aug 10, 2012 at 7:25 AM, Edward Capriolo edlinuxg...@gmail.com wrote: There are many YCSB forks

Re: Cassandra data model help

2012-08-10 Thread Aaron Turner
You need to track node membership separately. I do that in a SQL database, but you can use Cassandra for that. For example: rowkey = cluster name; column name = Composite[epoch_time:node_name]; column value = [join|leave]. Then every time a node joins or leaves a cluster, write an entry. Then you can just
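The layout described above amounts to an append-only event log per cluster. As a minimal sketch of one way such a log could be read back (an assumption, since the message is cut off before it says how the entries are used), the Python below replays join/leave events in composite-name order to find the members at a given time. The `Row` type, the `members_at` helper, and the node names are hypothetical and stand in for a real Cassandra row; no client library is involved.

```python
from typing import Dict, Set, Tuple

# Hypothetical in-memory stand-in for one row (rowkey = cluster name).
# Column name is the composite (epoch_time, node_name); value is "join" or "leave".
Row = Dict[Tuple[int, str], str]


def members_at(row: Row, as_of: int) -> Set[str]:
    """Replay events in composite-name order and return the nodes present at `as_of`."""
    members: Set[str] = set()
    # Sorting tuples orders by epoch_time first, then node_name,
    # which mirrors how composite column names are ordered within a row.
    for (epoch_time, node_name), event in sorted(row.items()):
        if epoch_time > as_of:
            break  # later events do not affect membership at `as_of`
        if event == "join":
            members.add(node_name)
        elif event == "leave":
            members.discard(node_name)
    return members


# Illustrative data only; these node names are made up.
example_row: Row = {
    (100, "node-a"): "join",
    (110, "node-b"): "join",
    (150, "node-a"): "leave",
}

print(sorted(members_at(example_row, 120)))
print(sorted(members_at(example_row, 200)))
```

Because entries are only ever appended, the membership at any past time can be recomputed from the same row without updating existing columns.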

Re: How to create a COLUMNFAMILY with Leveled Compaction?

2012-08-10 Thread Andy Ballingall TF
On 3 August 2012 21:31, Data Craftsman 木匠 database.crafts...@gmail.com wrote: Nobody uses Leveled Compaction with CQL 3.0? I tried this, and I can't get it to work either. I'm using: [cqlsh 2.2.0 | Cassandra 1.1.2 | CQL spec 3.0.0 | Thrift protocol 19.32.0] Here's what my create table

Problem with version 1.1.3

2012-08-10 Thread Dwight Smith
Hi all, I just replaced (clean install) version 1.0.9 with 1.1.3 on a two-node Amazon cluster. After modifying the yaml and starting both nodes, they do not see each other: Note: Ownership information does not include topology, please specify a keyspace. Address DC Rack

Re: Problem with version 1.1.3

2012-08-10 Thread Derek Barnes
Do both nodes refer to one another as seeds in cassandra.yaml? On Fri, Aug 10, 2012 at 1:46 PM, Dwight Smith dwight.sm...@genesyslab.com wrote: Hi all Just replaced ( clean install ) version 1.0.9 with 1.1.3 – two node amazon cluster. After yaml modification and starting both

RE: Problem with version 1.1.3

2012-08-10 Thread Dwight Smith
Yes, but they are the node hostnames and not the IP addresses. From: Derek Barnes [mailto:sj.clim...@gmail.com] Sent: Friday, August 10, 2012 2:00 PM To: user@cassandra.apache.org Subject: Re: Problem with version 1.1.3 Do both nodes refer to one another as seeds in cassandra.yaml? On

RE: Problem with version 1.1.3

2012-08-10 Thread Dwight Smith
Derek, I added both node hostnames to the seeds and it now has the correct nodetool ring: Address DC Rack Status State Load Owns Token 85070591730234615865843651857942052863 10.168.87.107 datacenter1 rack1 Up Normal 13.5 KB 50.00%

Re: CQL connections

2012-08-10 Thread Data Craftsman 木匠
I want to know this too. http://www.datastax.com/support-forums/topic/when-will-pycassa-support-cql Connection pooling and load balancing are necessary features for a multi-user production application. Thanks, Charlie | DBA On Fri, Aug 10, 2012 at 6:47 AM, David McNelis dmcne...@gmail.com wrote: In

RE: Problem with version 1.1.3

2012-08-10 Thread Dwight Smith
Further info - it seems I had the seeds list backwards - it did not need both nodes - I have corrected that with each pointing to the other as a single seed entry - and it works fine. Thanks again for the quick response. From: Dwight Smith [mailto:dwight.sm...@genesyslab.com] Sent:

anyone have any performance numbers? and here are some perf numbers of my own...

2012-08-10 Thread Hiller, Dean
3. In my test below, I see there is now 8Gig of data and 9,000,000 rows. Does that sound right? Nearly 1MB of space is used per row for a 50-column row. That sounds like a huge amount of overhead (my values are long on every column, but that is still not much). I was expecting

Re: anyone have any performance numbers? and here are some perf numbers of my own...

2012-08-10 Thread Hiller, Dean
Ignore the third one, my math was bad; it worked out to 733 bytes / row and it ended up being 6.6 gig, as it compacted it some after it was done when the load was light (noticed that a bit later). But what about the other two? Is that approximately the time that is expected? Thanks, Dean On 8/10/12 3:50
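The corrected figure can be checked with a short calculation. This is a sketch assuming decimal gigabytes (1 gig = 10^9 bytes), which is what reproduces the 733 bytes / row quoted above; the function and constant names are illustrative only.

```python
def bytes_per_row(total_bytes: float, rows: int) -> float:
    """Average on-disk bytes per row."""
    return total_bytes / rows


GB = 10**9            # assumption: decimal gigabytes
ROWS = 9_000_000      # row count reported in the thread
COLUMNS_PER_ROW = 50  # column count reported in the thread

after_compaction = bytes_per_row(6.6 * GB, ROWS)   # size after compaction
before_compaction = bytes_per_row(8 * GB, ROWS)    # size reported initially
per_column = after_compaction / COLUMNS_PER_ROW

print(f"after compaction:  {after_compaction:.0f} bytes/row")   # 733
print(f"before compaction: {before_compaction:.0f} bytes/row")  # 889
print(f"per column:        {per_column:.1f} bytes")
```

Either way the per-row cost is on the order of hundreds of bytes, not the "nearly 1MB" of the earlier message.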

quick question about data layout on disk

2012-08-10 Thread Aaron Turner
Curious, but does Cassandra store the rowkey along with every column/value pair on disk (pre-compaction) like HBase does? If so (which makes the most sense), I assume that's something that is optimized during compaction? -- Aaron Turner http://synfin.net/ Twitter: @synfinatic

Re: quick question about data layout on disk

2012-08-10 Thread Terje Marthinussen
Rowkey is stored only once in any sstable file. That is, in the special case where you get one sstable file per column/value, you are correct, but normally, I guess most of us are storing more per key. Regards, Terje On 11 Aug 2012, at 10:34, Aaron Turner synfina...@gmail.com wrote: Curious, but

Re: Decision Making- YCSB

2012-08-10 Thread Roshni Rajagopal
Thanks Edward and Mohit. We do have an in-house tool, but that tests pretty much the same thing as YCSB: read and write performance, given a number of threads and types of operations as input. The good thing here is that we own the code and we can modify it easily. YCSB does not seem to be