Re: Reasonable range for the max number of tables?

2014-08-18 Thread Cheng Ren
We use Cassandra for a multi-tenant application. Each tenant has its own
set of tables, and we have 1592 tables in total in our production Cassandra
cluster. It's running well and doesn't have any memory consumption issues,
but the challenge confronting us is schema changes. We have such a large
number of tables, and a significant fraction of them have as many as 50
columns, which totals 33099 rows in the schema_columns table in the system
keyspace. Every time we do a schema change for all of our tenants, the
whole cluster gets very busy and applications running against it need to be
shut down for several hours to accommodate the change.

The way we solved this is with a new feature we developed called
templates. There is a detailed description in the JIRA issue we opened:
https://issues.apache.org/jira/browse/CASSANDRA-7643

We have some performance results from our 15-node test cluster. Normally,
creating 400 tables takes hours for all the migration-stage tasks to
complete, but if we create 400 tables with templates, *it takes just 1
to 2 seconds*. It also works great for ALTER TABLE.

[image: graphs of migration-stage task completion time, with and without
templates]
table # in the graphs means the number of existing tables in user keyspaces.
We created 400 more tables and measured the time for all tasks in the
migration stage to complete. We also measured the migration task
completion time for adding one column to a template, which also adds
the column to all the column families using that template.

We believe what we propose here can be very useful for other people in the
Cassandra community as well. We have attached the patch to the JIRA. You
can also read the community feedback there.

Thanks,
Cheng


On Tue, Aug 5, 2014 at 5:43 AM, Michal Michalski 
michal.michal...@boxever.com wrote:

  - Use a keyspace per customer
  These effectively amount to the same thing and they both fall foul of the
  limit on the number of column families, so they do not scale.

 But then you can scale by moving some of the customers to a new cluster
 easily. If you keep everything in a single keyspace or - worse - if you do
 your multitenancy by prefixing row keys with customer ids of some kind, it
 won't be that easy, as you wrote later in your e-mail.

 M.



 Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com


 On 5 August 2014 12:36, Phil Luckhurst phil.luckhu...@powerassure.com
 wrote:

 Hi Mark,

 Mark Reddy wrote
  To segregate customer data, you could:
  - Use customer specific column families under a single keyspace
  - Use a keyspace per customer

 These effectively amount to the same thing and they both fall foul of the
 limit on the number of column families, so they do not scale.


 Mark Reddy wrote
  - Use the same column families and have a column that identifies the
  customer. On the application layer ensure that there are sufficient checks
  so one customer can't read another customer's data

 And while this gets around the column family limit, it does not allow the
 same level of data segregation. For example, with a separate keyspace or
 column families it is trivial to remove a single customer's data or move
 that data to another system. With one set of column families for all
 customers, these types of actions become much more difficult, as any change
 impacts all customers, but perhaps that's the price we have to pay to
 scale.

 And I still think this needs to be made more prominent in the
 documentation.

 Thanks
 Phil



 --
 View this message in context:
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Reasonable-range-for-the-max-number-of-tables-tp7596094p7596119.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive
 at Nabble.com.





Re: Reasonable range for the max number of tables?

2014-08-05 Thread Phil Luckhurst
Is there any mention of this limitation anywhere in the Cassandra
documentation? I don't see it mentioned in the 'Anti-patterns in Cassandra'
section of the DataStax 2.0 documentation or anywhere else.

When starting out with Cassandra as a store for a multi-tenant application
it seems very attractive to segregate data for each tenant using a
tenant-specific keyspace, each with its own set of tables. It's not until
you start browsing through forums such as this that you find out that it
isn't going to scale beyond a few tenants.

If you want to be able to segregate customer data in Cassandra, is it the
accepted practice to have multiple Cassandra installations?



--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Reasonable-range-for-the-max-number-of-tables-tp7596094p7596106.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Reasonable range for the max number of tables?

2014-08-05 Thread Mark Reddy
Hi Phil,

In theory, the max number of column families would be in the low hundreds.
In practice the limit is related to the amount of heap you have, as each
column family will consume roughly 1 MB of heap due to arena allocation.
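As a back-of-the-envelope sketch of that heap math (treating the ~1 MB-per-table arena figure above as a rough constant; real overhead varies by Cassandra version and memtable settings):

```python
# Rough fixed heap cost attributable to table count alone,
# using the ~1 MB-per-table arena figure quoted above.
ARENA_MB_PER_TABLE = 1.0

def schema_heap_mb(tables: int, mb_per_table: float = ARENA_MB_PER_TABLE) -> float:
    """Estimated heap (MB) consumed just by having `tables` defined."""
    return tables * mb_per_table

# A few hundred tables is already a meaningful slice of a typical 8 GB heap:
for n in (200, 720, 1592):
    print(f"{n} tables -> ~{schema_heap_mb(n):.0f} MB of heap")
```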

To segregate customer data, you could:
- Use customer specific column families under a single keyspace
- Use a keyspace per customer
- Use the same column families and have a column that identifies the
customer. On the application layer ensure that there are sufficient checks
so one customer can't read another customer's data
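In CQL terms the first two options reduce to DDL like the following; this sketch only builds the statement strings, and every keyspace, table, and column name in it is made up for illustration:

```python
def per_tenant_keyspace_ddl(tenant: str, rf: int = 3) -> str:
    """Option: a keyspace per customer (tenant name and RF are illustrative)."""
    return (
        f"CREATE KEYSPACE IF NOT EXISTS tenant_{tenant} WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {rf}}};"
    )

def shared_table_ddl() -> str:
    """Option: shared column families, keyed by a customer-identifying column."""
    return (
        "CREATE TABLE IF NOT EXISTS shared.events ("
        " tenant_id text, event_id timeuuid, payload text,"
        " PRIMARY KEY (tenant_id, event_id));"
    )

print(per_tenant_keyspace_ddl("acme"))
print(shared_table_ddl())
```

In the shared-table layout, `tenant_id` as the partition key is what makes the application-layer access checks straightforward: every query must supply it.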


Mark


On Tue, Aug 5, 2014 at 9:09 AM, Phil Luckhurst 
phil.luckhu...@powerassure.com wrote:

 Is there any mention of this limitation anywhere in the Cassandra
 documentation? I don't see it mentioned in the 'Anti-patterns in Cassandra'
 section of the DataStax 2.0 documentation or anywhere else.

 When starting out with Cassandra as a store for a multi-tenant application
 it seems very attractive to segregate data for each tenant using a
 tenant-specific keyspace, each with its own set of tables. It's not until
 you start browsing through forums such as this that you find out that it
 isn't going to scale beyond a few tenants.

 If you want to be able to segregate customer data in Cassandra, is it the
 accepted practice to have multiple Cassandra installations?



 --
 View this message in context:
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Reasonable-range-for-the-max-number-of-tables-tp7596094p7596106.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at
 Nabble.com.



Re: Reasonable range for the max number of tables?

2014-08-05 Thread Phil Luckhurst
Hi Mark,

Mark Reddy wrote
 To segregate customer data, you could:
 - Use customer specific column families under a single keyspace
 - Use a keyspace per customer

These effectively amount to the same thing and they both fall foul of the
limit on the number of column families, so they do not scale.


Mark Reddy wrote
 - Use the same column families and have a column that identifies the
 customer. On the application layer ensure that there are sufficient checks
 so one customer can't read another customer's data

And while this gets around the column family limit, it does not allow the
same level of data segregation. For example, with a separate keyspace or
column families it is trivial to remove a single customer's data or move
that data to another system. With one set of column families for all
customers, these types of actions become much more difficult, as any change
impacts all customers, but perhaps that's the price we have to pay to scale.

And I still think this needs to be made more prominent in the documentation.

Thanks
Phil



--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Reasonable-range-for-the-max-number-of-tables-tp7596094p7596119.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Reasonable range for the max number of tables?

2014-08-05 Thread Jack Krupansky
Multi-tenancy remains a challenge - for most technologies. Yes, you can do 
what you suggest, but you need to exercise great care and test and 
provision your cluster carefully. It's not a free resource that 
scales wildly in all directions with no forethought or care.


It is something that does work, sort of, but it wasn't one of the design 
goals or core strengths of Cassandra. IOW, it was/is more of a side effect 
than a core pattern. Anti-pattern simply means that it is not 
guaranteed to be a full-fledged, first-class feature. It means you can do 
it, and if it works well for your particular use case, great, but 
don't complain too loudly here if it doesn't.


That said, anybody who has great success - or great failure - with 
multi-tenant for Cassandra, or any other technology, should definitely share 
their experiences here.


And the bottom line is that dozens or low hundreds remains the 
recommended limit for tables in a single Cassandra cluster. Not a hard 
limit, but just a recommendation.


Multi-tenant is an area of great interest, so I suspect Cassandra - and all 
other technologies - will see a lot of evolution in the coming years in this 
area.


-- Jack Krupansky

-Original Message- 
From: Phil Luckhurst

Sent: Tuesday, August 5, 2014 4:09 AM
To: cassandra-u...@incubator.apache.org
Subject: Re: Reasonable range for the max number of tables?

Is there any mention of this limitation anywhere in the Cassandra
documentation? I don't see it mentioned in the 'Anti-patterns in Cassandra'
section of the DataStax 2.0 documentation or anywhere else.

When starting out with Cassandra as a store for a multi-tenant application
it seems very attractive to segregate data for each tenant using a
tenant-specific keyspace, each with its own set of tables. It's not until
you start browsing through forums such as this that you find out that it
isn't going to scale beyond a few tenants.

If you want to be able to segregate customer data in Cassandra, is it the
accepted practice to have multiple Cassandra installations?



--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Reasonable-range-for-the-max-number-of-tables-tp7596094p7596106.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com. 



Re: Reasonable range for the max number of tables?

2014-08-05 Thread Michal Michalski
 - Use a keyspace per customer
 These effectively amount to the same thing and they both fall foul of the
 limit on the number of column families, so they do not scale.

But then you can scale by moving some of the customers to a new cluster
easily. If you keep everything in a single keyspace or - worse - if you do
your multitenancy by prefixing row keys with customer ids of some kind, it
won't be that easy, as you wrote later in your e-mail.

M.



Kind regards,
Michał Michalski,
michal.michal...@boxever.com


On 5 August 2014 12:36, Phil Luckhurst phil.luckhu...@powerassure.com
wrote:

 Hi Mark,

 Mark Reddy wrote
  To segregate customer data, you could:
  - Use customer specific column families under a single keyspace
  - Use a keyspace per customer

 These effectively amount to the same thing and they both fall foul of the
 limit on the number of column families, so they do not scale.


 Mark Reddy wrote
  - Use the same column families and have a column that identifies the
  customer. On the application layer ensure that there are sufficient checks
  so one customer can't read another customer's data

 And while this gets around the column family limit, it does not allow the
 same level of data segregation. For example, with a separate keyspace or
 column families it is trivial to remove a single customer's data or move
 that data to another system. With one set of column families for all
 customers, these types of actions become much more difficult, as any change
 impacts all customers, but perhaps that's the price we have to pay to scale.

 And I still think this needs to be made more prominent in the
 documentation.

 Thanks
 Phil



 --
 View this message in context:
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Reasonable-range-for-the-max-number-of-tables-tp7596094p7596119.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at
 Nabble.com.



Reasonable range for the max number of tables?

2014-08-04 Thread Kevin Burton
What's a reasonable range for the max number of tables?  We have an
append-only table system and I've been thinking of moving them to using
hourly / partitioned tables.

This means I can do things like easily drop older tables if I run out of
disk space.  It also means that I can fadvise the most recent tables and
force them into table cache.

But what's the max number of tables?  If I'm doing hourly tables and have
30 days of them this would be 720 individual tables.
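The bucket arithmetic above can be sketched like this (the events_YYYYMMDD_HH naming scheme is illustrative only, not something from the thread):

```python
from datetime import datetime, timedelta

def hourly_table_names(start: datetime, days: int) -> list[str]:
    """One table per hour over the retention window; the oldest
    buckets can simply be DROPped when disk runs low."""
    return [
        (start + timedelta(hours=h)).strftime("events_%Y%m%d_%H")
        for h in range(days * 24)
    ]

names = hourly_table_names(datetime(2014, 8, 1), days=30)
print(len(names))  # 30 days x 24 hours = 720 tables
```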

That sounds reasonable… but what's the upper bound here?

Kevin

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com


Re: Reasonable range for the max number of tables?

2014-08-04 Thread Robert Coli
On Mon, Aug 4, 2014 at 12:42 PM, Kevin Burton bur...@spinn3r.com wrote:

 What's a reasonable range for the max number of tables?  We have an
 append-only table system and I've been thinking of moving them to using
 hourly / partitioned tables.


Low Numbers of Hundreds


 This means I can do things like easily drop older tables if I run out of
 disk space.  It also means that I can fadvise the most recent tables and
 force them into table cache.


If you have to use fadvise to do this, you are almost certainly Doing It
Wrong.


 But what's the max number of tables?  If I'm doing hourly tables and have
 30 days of them this would be 720 individual tables.


Probably pushing it.

=Rob


Re: Reasonable range for the max number of tables?

2014-08-04 Thread Kevin Burton
 This means I can do things like easily drop older tables if I run out of
 disk space.  It also means that I can fadvise the most recent tables and
 force them into table cache.


 If you have to use fadvise to do this, you are almost certainly Doing It
 Wrong.


Possibly.. fincore is also useful.  This way you can see the cache % of
each file… if I'm doing something funky, the older files should not be in
cache at all.

And fadvise DONTNEED is your friend too.
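Both hints are reachable from Python's standard library on Linux; a minimal sketch of the DONTNEED side (the SSTable path in the comment is hypothetical):

```python
import os

def evict_from_page_cache(path: str) -> None:
    """Ask the kernel to drop this file's cached pages
    (POSIX_FADV_DONTNEED). It's advisory - a hint, not a guarantee."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, length=0 means "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

# e.g. evict_from_page_cache("/var/lib/cassandra/data/ks/old_table/old-Data.db")
```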




 But what's the max number of tables?  If I'm doing hourly tables and have
 30 days of them this would be 720 individual tables.


 Probably pushing it.


What are the bottlenecks here?  Is this an HDD issue? Because we're running
them on SSD.  I agree that that would be significant on HDD ...



Re: Reasonable range for the max number of tables?

2014-08-04 Thread Robert Coli
On Mon, Aug 4, 2014 at 1:35 PM, Kevin Burton bur...@spinn3r.com wrote:

 What are the bottlenecks here?  Is this an HDD issue? Because we're
 running them on SSD.  I agree that that would be significant on HDD ...


Heap consumption. There are probably some JIRA tickets touching on the
memory consumption of large numbers of tables; I suggest searching there...
it has come up about 15 times on this list in the archives too.. :D

=Rob


Re: Reasonable range for the max number of tables?

2014-08-04 Thread Jack Krupansky
Read https://issues.apache.org/jira/browse/CASSANDRA-5935, especially the part 
about “having more than dozens or hundreds of tables defined is almost 
certainly a Bad Idea”.

Either way, try a POC that simply creates your 720 tables and see what happens. 
And let us know how well it works, and whether SSD is somehow different from 
HDD.

It’s not so much that there is a hard “limit” as that you can’t sleep 
soundly at night knowing that you’re “pushing it” so far that suddenly, 
out of nowhere, after things have seemed fine for an extended period, 
everything falls apart, with no warning.

-- Jack Krupansky

From: Robert Coli 
Sent: Monday, August 4, 2014 4:54 PM
To: user@cassandra.apache.org 
Subject: Re: Reasonable range for the max number of tables?

On Mon, Aug 4, 2014 at 1:35 PM, Kevin Burton bur...@spinn3r.com wrote:

  What are the bottlenecks here?  Is this an HDD issue? Because we're running 
them on SSD.  I agree that that would be significant on HDD ...

Heap consumption. There are probably some JIRA tickets touching on the memory 
consumption of large numbers of tables; I suggest searching there... it has 
come up about 15 times on this list in the archives too.. :D

=Rob