Re: Practical limit on number of column families

2016-03-01 Thread Jack Krupansky
It is the total table count, across all key spaces. Memory is memory.
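
For reference, a quick way to see that total (a sketch against the Cassandra
3.x system_schema tables; on Cassandra 2.x the equivalent table is
system.schema_columnfamilies):

    -- Total table count across all keyspaces:
    SELECT COUNT(*) FROM system_schema.tables;

    -- Or list the tables per keyspace:
    SELECT keyspace_name, table_name FROM system_schema.tables;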

-- Jack Krupansky

On Tue, Mar 1, 2016 at 6:26 PM, Brian Sam-Bodden wrote:

> Eric,
>   Is using a keyspace per tenant for multitenancy as bad as the many-tables
> pattern? Is the memory overhead of keyspaces as heavy as that of tables?
>
> Cheers,
> Brian
>
>
> On Tuesday, March 1, 2016, Eric Stevens wrote:
>
>> It's definitely not true for every use case of a large number of tables,
>> but for many uses where you'd be tempted to do that, adding whatever would
>> have driven your table naming instead as a column in your partition key on
>> a smaller number of tables will meet your needs.  This is especially true
>> if you're looking to solve multi-tenancy, unless you let your tenants
>> dynamically drive your schema (which is a separate can of worms).
>>
>> On Tue, Mar 1, 2016 at 9:08 AM Jack Krupansky wrote:
>>
>>> I don't think Cassandra was "purposefully developed" for some target
>>> number of tables - there is no evidence of any such explicit intent.
>>> Instead, it would be fair to say that Cassandra was "not purposefully
>>> developed" with a goal of supporting "large numbers of tables." Sometimes
>>> features and capabilities come for free or as a side effect of the
>>> technologies used, but usually specific features and specific capabilities
>>> (such as large numbers of tables) require explicit intent and explicit
>>> effort.
>>>
>>> One could indeed endeavor to design a data store (I'm not even sure it
>>> would still be considered a database per se) that supported either large
>>> numbers of tables or an additional level of storage model in between table
>>> and row (call it "group" maybe or "sub-table".) But obviously Cassandra was
>>> not designed with that goal in mind.
>>>
>>> Traditionally, a "table" is a defined relation over a set of data.
>>> Relation and data are distinct concepts. And a relation name is not simply
>>> a Java-style "object". A relation (table) name is supposed to represent an
>>> abstraction or entity type, while essentially all of the cases I have heard
>>> of for wanting thousands (or even hundreds) of tables are trying to use
>>> table as more of a container for a group of rows for a specific entity
>>> instance rather than a distinct entity type. Granted, Cassandra is not
>>> obligated to be limited to the relational model, but Cassandra, especially
>>> CQL, is intentionally modeled reasonably closely with the relational model
>>> in terms of the data modeling abstractions even though the storage engine
>>> is designed to scale across nodes.
>>>
>>> You could file a Jira requesting such a feature improvement. And then we
>>> would see if sentiment has shifted over the years.
>>>
>>> The key thing is to offer up a use case that warrants support for large
>>> numbers of tables. So far, it has usually been the case that the perceived
>>> need for separate tables could easily be met using clustering columns of a
>>> single table.
>>>
>>> Seriously, if you guys can define a legitimate use case that can't
>>> easily be handled by a single table, that could get the discussion started.
>>>
>>> -- Jack Krupansky
>>>
>>> On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <
>>> fernando.jime...@wealth-port.com> wrote:
>>>
 Hi Jack

 Being purposefully developed to only handle up to “a few hundred”
 tables is reason enough. I accept that, and likely a use case with many
 tables was never really considered. But I would still like to understand
 the design choices made so perhaps we gain some confidence level in this
 upper limit in the number of tables. The best estimate we have so far is “a
 few hundred” which is a bit vague.

 Regarding scaling, I’m not talking about scaling in terms of data
 volume, but on how the data is structured. One thousand tables with one row
 each is the same data volume as one table with one thousand rows, excluding
 any data structures required to maintain the extra tables. But whereas the
 first seems likely to bring a Cassandra cluster to its knees, the second
 will run happily on a single-node cluster on a low-end machine.

 We will design our code to use a single table to avoid having
 nightmares with this issue. But if there is any authoritative documentation
 on this characteristic of Cassandra, I would love to know more.

 FJ


 On 01 Mar 2016, at 14:23, Jack Krupansky wrote:

 I don't think there are any "reasons behind it." It is simply empirical
 experience - as reported here.

 Cassandra scales in two dimensions - number of rows per node and number
 of nodes. If some source of information led you to believe otherwise,
 please point out the source so that we can endeavor to correct it.

 The exact number of rows per node and tables per node will always have to
 be evaluated empirically - a proof of concept implementation, since it all
 depends on the mix of capabilities of your hardware combined with your
 specific data model, your specific data values, your specific access
 patterns, and your specific load.

Re: Practical limit on number of column families

2016-03-01 Thread Brian Sam-Bodden
Eric,
  Is using a keyspace per tenant for multitenancy as bad as the many-tables
pattern? Is the memory overhead of keyspaces as heavy as that of tables?

Cheers,
Brian

On Tuesday, March 1, 2016, Eric Stevens wrote:

> It's definitely not true for every use case of a large number of tables,
> but for many uses where you'd be tempted to do that, adding whatever would
> have driven your table naming instead as a column in your partition key on
> a smaller number of tables will meet your needs.  This is especially true
> if you're looking to solve multi-tenancy, unless you let your tenants
> dynamically drive your schema (which is a separate can of worms).
>
> On Tue, Mar 1, 2016 at 9:08 AM Jack Krupansky wrote:
>
>> I don't think Cassandra was "purposefully developed" for some target
>> number of tables - there is no evidence of any such explicit intent.
>> Instead, it would be fair to say that Cassandra was "not purposefully
>> developed" with a goal of supporting "large numbers of tables." Sometimes
>> features and capabilities come for free or as a side effect of the
>> technologies used, but usually specific features and specific capabilities
>> (such as large numbers of tables) require explicit intent and explicit
>> effort.
>>
>> One could indeed endeavor to design a data store (I'm not even sure it
>> would still be considered a database per se) that supported either large
>> numbers of tables or an additional level of storage model in between table
>> and row (call it "group" maybe or "sub-table".) But obviously Cassandra was
>> not designed with that goal in mind.
>>
>> Traditionally, a "table" is a defined relation over a set of data.
>> Relation and data are distinct concepts. And a relation name is not simply
>> a Java-style "object". A relation (table) name is supposed to represent an
>> abstraction or entity type, while essentially all of the cases I have heard
>> of for wanting thousands (or even hundreds) of tables are trying to use
>> table as more of a container for a group of rows for a specific entity
>> instance rather than a distinct entity type. Granted, Cassandra is not
>> obligated to be limited to the relational model, but Cassandra, especially
>> CQL, is intentionally modeled reasonably closely with the relational model
>> in terms of the data modeling abstractions even though the storage engine
>> is designed to scale across nodes.
>>
>> You could file a Jira requesting such a feature improvement. And then we
>> would see if sentiment has shifted over the years.
>>
>> The key thing is to offer up a use case that warrants support for large
>> numbers of tables. So far, it has usually been the case that the perceived
>> need for separate tables could easily be met using clustering columns of a
>> single table.
>>
>> Seriously, if you guys can define a legitimate use case that can't easily
>> be handled by a single table, that could get the discussion started.
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <
>> fernando.jime...@wealth-port.com> wrote:
>>
>>> Hi Jack
>>>
>>> Being purposefully developed to only handle up to “a few hundred” tables
>>> is reason enough. I accept that, and likely a use case with many tables was
>>> never really considered. But I would still like to understand the design
>>> choices made so perhaps we gain some confidence level in this upper limit
>>> in the number of tables. The best estimate we have so far is “a few
>>> hundred” which is a bit vague.
>>>
>>> Regarding scaling, I’m not talking about scaling in terms of data
>>> volume, but on how the data is structured. One thousand tables with one row
>>> each is the same data volume as one table with one thousand rows, excluding
>>> any data structures required to maintain the extra tables. But whereas the
>>> first seems likely to bring a Cassandra cluster to its knees, the second
 will run happily on a single-node cluster on a low-end machine.
>>>
>>> We will design our code to use a single table to avoid having nightmares
>>> with this issue. But if there is any authoritative documentation on this
>>> characteristic of Cassandra, I would love to know more.
>>>
>>> FJ
>>>
>>>
>>> On 01 Mar 2016, at 14:23, Jack Krupansky wrote:
>>>
>>> I don't think there are any "reasons behind it." It is simply empirical
>>> experience - as reported here.
>>>
>>> Cassandra scales in two dimensions - number of rows per node and number
>>> of nodes. If some source of information led you to believe otherwise,
>>> please point out the source so that we can endeavor to correct it.
>>>
>>> The exact number of rows per node and tables per node will always have
>>> to be evaluated empirically - a proof of concept 

Re: Practical limit on number of column families

2016-03-01 Thread Eric Stevens
It's definitely not true for every use case of a large number of tables,
but for many uses where you'd be tempted to do that, adding whatever would
have driven your table naming instead as a column in your partition key on
a smaller number of tables will meet your needs.  This is especially true
if you're looking to solve multi-tenancy, unless you let your tenants
dynamically drive your schema (which is a separate can of worms).
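
A minimal CQL sketch of that pattern (the table and column names here are
illustrative, not from this thread):

    -- Instead of one table per tenant (tenant_a_events, tenant_b_events, ...),
    -- fold what would have been the table name into the partition key:
    CREATE TABLE events (
        tenant_id text,      -- the value that would have driven the table name
        event_id  timeuuid,
        payload   text,
        PRIMARY KEY ((tenant_id), event_id)
    );

    -- Each tenant's rows live in their own partitions of the one table:
    SELECT * FROM events WHERE tenant_id = 'tenant_a' LIMIT 10;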

On Tue, Mar 1, 2016 at 9:08 AM Jack Krupansky wrote:

> I don't think Cassandra was "purposefully developed" for some target
> number of tables - there is no evidence of any such explicit intent.
> Instead, it would be fair to say that Cassandra was "not purposefully
> developed" with a goal of supporting "large numbers of tables." Sometimes
> features and capabilities come for free or as a side effect of the
> technologies used, but usually specific features and specific capabilities
> (such as large numbers of tables) require explicit intent and explicit
> effort.
>
> One could indeed endeavor to design a data store (I'm not even sure it
> would still be considered a database per se) that supported either large
> numbers of tables or an additional level of storage model in between table
> and row (call it "group" maybe or "sub-table".) But obviously Cassandra was
> not designed with that goal in mind.
>
> Traditionally, a "table" is a defined relation over a set of data.
> Relation and data are distinct concepts. And a relation name is not simply
> a Java-style "object". A relation (table) name is supposed to represent an
> abstraction or entity type, while essentially all of the cases I have heard
> of for wanting thousands (or even hundreds) of tables are trying to use
> table as more of a container for a group of rows for a specific entity
> instance rather than a distinct entity type. Granted, Cassandra is not
> obligated to be limited to the relational model, but Cassandra, especially
> CQL, is intentionally modeled reasonably closely with the relational model
> in terms of the data modeling abstractions even though the storage engine
> is designed to scale across nodes.
>
> You could file a Jira requesting such a feature improvement. And then we
> would see if sentiment has shifted over the years.
>
> The key thing is to offer up a use case that warrants support for large
> numbers of tables. So far, it has usually been the case that the perceived
> need for separate tables could easily be met using clustering columns of a
> single table.
>
> Seriously, if you guys can define a legitimate use case that can't easily
> be handled by a single table, that could get the discussion started.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <
> fernando.jime...@wealth-port.com> wrote:
>
>> Hi Jack
>>
>> Being purposefully developed to only handle up to “a few hundred” tables
>> is reason enough. I accept that, and likely a use case with many tables was
>> never really considered. But I would still like to understand the design
>> choices made so perhaps we gain some confidence level in this upper limit
>> in the number of tables. The best estimate we have so far is “a few
>> hundred” which is a bit vague.
>>
>> Regarding scaling, I’m not talking about scaling in terms of data volume,
>> but on how the data is structured. One thousand tables with one row each is
>> the same data volume as one table with one thousand rows, excluding any
>> data structures required to maintain the extra tables. But whereas the
>> first seems likely to bring a Cassandra cluster to its knees, the second
>> will run happily on a single-node cluster on a low-end machine.
>>
>> We will design our code to use a single table to avoid having nightmares
>> with this issue. But if there is any authoritative documentation on this
>> characteristic of Cassandra, I would love to know more.
>>
>> FJ
>>
>>
>> On 01 Mar 2016, at 14:23, Jack Krupansky wrote:
>>
>> I don't think there are any "reasons behind it." It is simply empirical
>> experience - as reported here.
>>
>> Cassandra scales in two dimensions - number of rows per node and number of
>> nodes. If some source of information led you to believe otherwise, please
>> point out the source so that we can endeavor to correct it.
>>
>> The exact number of rows per node and tables per node will always have to
>> be evaluated empirically - a proof of concept implementation, since it all
>> depends on the mix of capabilities of your hardware combined with your
>> specific data model, your specific data values, your specific access
>> patterns, and your specific load. And it also depends on your own personal
>> tolerance for degradation of latency and throughput - some people might
>> find a given set of performance metrics acceptable while others might not.
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <
>> fernando.jime...@wealth-port.com> wrote:

Re: Practical limit on number of column families

2016-03-01 Thread Jack Krupansky
I don't think Cassandra was "purposefully developed" for some target number
of tables - there is no evidence of any such explicit intent. Instead,
it would be fair to say that Cassandra was "not purposefully developed"
with a goal of supporting "large numbers of tables." Sometimes features and
capabilities come for free or as a side effect of the technologies used,
but usually specific features and specific capabilities (such as large
numbers of tables) require explicit intent and explicit effort.

One could indeed endeavor to design a data store (I'm not even sure it
would still be considered a database per se) that supported either large
numbers of tables or an additional level of storage model in between table
and row (call it "group" maybe or "sub-table".) But obviously Cassandra was
not designed with that goal in mind.

Traditionally, a "table" is a defined relation over a set of data. Relation
and data are distinct concepts. And a relation name is not simply a
Java-style "object". A relation (table) name is supposed to represent an
abstraction or entity type, while essentially all of the cases I have heard
of for wanting thousands (or even hundreds) of tables are trying to use
table as more of a container for a group of rows for a specific entity
instance rather than a distinct entity type. Granted, Cassandra is not
obligated to be limited to the relational model, but Cassandra, especially
CQL, is intentionally modeled reasonably closely with the relational model
in terms of the data modeling abstractions even though the storage engine
is designed to scale across nodes.

You could file a Jira requesting such a feature improvement. And then we
would see if sentiment has shifted over the years.

The key thing is to offer up a use case that warrants support for large
numbers of tables. So far, it has usually been the case that the perceived
need for separate tables could easily be met using clustering columns of a
single table.
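
As a concrete sketch of that suggestion (illustrative names, not from the
thread): thousands of per-entity tables collapse into one table whose
partition key carries the former table name, with a clustering column
ordering the rows within each entity:

    CREATE TABLE readings (
        sensor_id text,      -- formerly one table per sensor
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC);

    -- What used to be a per-table scan becomes a per-partition query:
    SELECT * FROM readings WHERE sensor_id = 'sensor-0001' LIMIT 100;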

Seriously, if you guys can define a legitimate use case that can't easily
be handled by a single table, that could get the discussion started.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <
fernando.jime...@wealth-port.com> wrote:

> Hi Jack
>
> Being purposefully developed to only handle up to “a few hundred” tables
> is reason enough. I accept that, and likely a use case with many tables was
> never really considered. But I would still like to understand the design
> choices made so perhaps we gain some confidence level in this upper limit
> in the number of tables. The best estimate we have so far is “a few
> hundred” which is a bit vague.
>
> Regarding scaling, I’m not talking about scaling in terms of data volume,
> but on how the data is structured. One thousand tables with one row each is
> the same data volume as one table with one thousand rows, excluding any
> data structures required to maintain the extra tables. But whereas the
> first seems likely to bring a Cassandra cluster to its knees, the second
> will run happily on a single-node cluster on a low-end machine.
>
> We will design our code to use a single table to avoid having nightmares
> with this issue. But if there is any authoritative documentation on this
> characteristic of Cassandra, I would love to know more.
>
> FJ
>
>
> On 01 Mar 2016, at 14:23, Jack Krupansky wrote:
>
> I don't think there are any "reasons behind it." It is simply empirical
> experience - as reported here.
>
> Cassandra scales in two dimensions - number of rows per node and number of
> nodes. If some source of information led you to believe otherwise, please
> point out the source so that we can endeavor to correct it.
>
> The exact number of rows per node and tables per node will always have to
> be evaluated empirically - a proof of concept implementation, since it all
> depends on the mix of capabilities of your hardware combined with your
> specific data model, your specific data values, your specific access
> patterns, and your specific load. And it also depends on your own personal
> tolerance for degradation of latency and throughput - some people might
> find a given set of performance metrics acceptable while others might not.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <
> fernando.jime...@wealth-port.com> wrote:
>
>> Hi Tommaso
>>
>> It’s not that I _need_ a large number of tables. This approach maps
>> easily to the problem we are trying to solve, but it’s becoming clear it’s
>> not the right approach.
>>
>> At the moment I’m trying to understand the limitations in Cassandra
>> regarding number of Tables and the reasons behind it. I’ve come to the
>> email list as my Google-fu is not giving me what I’m looking for :(
>>
>> FJ
>>
>>
>>
>> On 01 Mar 2016, at 09:36, tommaso barbugli wrote:
>>
>> Hi Fernando,
>>
>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it was
>> a real pain in terms of operations. Repairs were terribly slow, boot of C*
>> slowed down and in general tracking table metrics becomes a bit more work.
>> Why do you need this high number of tables?

Re: Practical limit on number of column families

2016-03-01 Thread Vlad
> If your Jira search fu is strong enough
And it is! )

> you should be able to find it yourself
And I did! )

I see that this issue originates from a problem with the design of Java's GC,
but judging by the date, that was the Java 6 era. Now we have Java 8 with a
new GC mechanism. Does this problem still exist with Java 8? Any chance of
using the original method to reduce overhead and "be happy with the results"?

Regards, Vlad

On Tuesday, March 1, 2016 4:07 PM, Jack Krupansky wrote:

I'll defer to one of the senior committers as to whether they want that
information disseminated any further than it already is. It was intentionally 
not documented since it is not recommended. If your Jira search fu is strong 
enough you should be able to find it yourself, but again, its use is strongly 
not recommended.
As the Jira notes, "having more than dozens or hundreds of tables defined is 
almost certainly a Bad Idea."
"Bad Idea" means not good. As in don't go there. And if you do, don't expect 
such a mis-adventure to be supported by the community.
-- Jack Krupansky
On Tue, Mar 1, 2016 at 8:39 AM, Vlad wrote:

Hi Jack,
> you can reduce the overhead per table ... an undocumented Jira
Can you please point to this Jira number?

> it is strongly not recommended
What are the consequences of this (besides performance degradation, if any)?
Thanks.


On Tuesday, March 1, 2016 7:23 AM, Jack Krupansky wrote:

3,000 entries? What's an "entry"? Do you mean row, column, or... what?

You are using the obsolete terminology of CQL2 and Thrift - column family. With
CQL3 you should be creating "tables". The practical recommendation of an upper
limit of a few hundred tables across all key spaces remains.

Technically you can go higher and technically you can reduce the overhead per
table (an undocumented Jira - intentionally undocumented since it is strongly
not recommended), but... it is unlikely that you will be happy with the results.

What is the nature of the use case?

You basically have two choices: an additional clustering column to distinguish
categories of table, or separate clusters for each few hundred tables.

-- Jack Krupansky
On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez wrote:

Hi all

I have a use case for Cassandra that would require creating a large number of
column families. I have found references to early versions of Cassandra where
each column family would require a fixed amount of memory on all nodes,
effectively imposing an upper limit on the total number of CFs. I have also
seen rumblings that this may have been fixed in later versions.

To put the question to rest, I have set up a DSE sandbox and created some code
to generate column families populated with 3,000 entries each.

Unfortunately I have now hit this issue:
https://issues.apache.org/jira/browse/CASSANDRA-9291

So I will have to retest against Cassandra 3.0 instead.

However, I would like to understand the limitations regarding creation of
column families.

* Is there a practical upper limit?
* Is this a fixed limit, or does it scale as more nodes are added into the
cluster?
* Is there a difference between one keyspace with thousands of column
families, vs thousands of keyspaces with only a few column families each?

I haven’t found any hard evidence/documentation to help me here, but if you
can point me in the right direction, I will oblige and RTFM away.

Many thanks for your help!

Cheers
FJ

Re: Practical limit on number of column families

2016-03-01 Thread Fernando Jimenez
Hi Jack

Being purposefully developed to only handle up to “a few hundred” tables is 
reason enough. I accept that, and likely a use case with many tables was never 
really considered. But I would still like to understand the design choices made 
so perhaps we gain some confidence level in this upper limit in the number of 
tables. The best estimate we have so far is “a few hundred” which is a bit 
vague. 

Regarding scaling, I’m not talking about scaling in terms of data volume, but 
on how the data is structured. One thousand tables with one row each is the 
same data volume as one table with one thousand rows, excluding any data 
structures required to maintain the extra tables. But whereas the first seems 
likely to bring a Cassandra cluster to its knees, the second will run happily 
on a single-node cluster on a low-end machine.

We will design our code to use a single table to avoid having nightmares with 
this issue. But if there is any authoritative documentation on this 
characteristic of Cassandra, I would love to know more.

FJ


> On 01 Mar 2016, at 14:23, Jack Krupansky wrote:
> 
> I don't think there are any "reasons behind it." It is simply empirical 
> experience - as reported here.
> 
> Cassandra scales in two dimensions - number of rows per node and number of
> nodes. If some source of information led you to believe otherwise, please
> point out the source so that we can endeavor to correct it.
> 
> The exact number of rows per node and tables per node will always have to be 
> evaluated empirically - a proof of concept implementation, since it all 
> depends on the mix of capabilities of your hardware combined with your 
> specific data model, your specific data values, your specific access 
> patterns, and your specific load. And it also depends on your own personal 
> tolerance for degradation of latency and throughput - some people might find 
> a given set of performance metrics acceptable while others might not.
> 
> -- Jack Krupansky
> 
> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez wrote:
> Hi Tommaso
> 
> It’s not that I _need_ a large number of tables. This approach maps easily to 
> the problem we are trying to solve, but it’s becoming clear it’s not the 
> right approach.
> 
> At the moment I’m trying to understand the limitations in Cassandra regarding 
> number of Tables and the reasons behind it. I’ve come to the email list as my 
> Google-fu is not giving me what I’m looking for :(
> 
> FJ
> 
> 
> 
>> On 01 Mar 2016, at 09:36, tommaso barbugli wrote:
>> 
>> Hi Fernando,
>> 
>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it was a 
>> real pain in terms of operations. Repairs were terribly slow, boot of C* 
>> slowed down and in general tracking table metrics becomes a bit more work. Why
>> do you need this high number of tables?
>> 
>> Tommaso
>> 
>> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez wrote:
>> Hi Jack
>> 
>> By entry I mean row
>> 
>> Apologies for the “obsolete terminology”. When I first looked at Cassandra 
>> it was still on CQL2, and now that I’m looking at it again I’ve defaulted to 
>> the terms I already knew. I will bear it in mind and call them tables from 
>> now on.
>> 
>> Is there any documentation about this limit? For example, I’d be keen to
>> know how much memory is consumed per table, and I’m also curious about the 
>> reasons for keeping this in memory. I’m trying to understand the limitations 
>> here, rather than challenge them.
>> 
>> So far I found nothing in my search, hence why I had to resort to some “load 
>> testing” to see what happens when you push the table count high
>> 
>> Thanks
>> FJ
>> 
>> 
>>> On 01 Mar 2016, at 06:23, Jack Krupansky wrote:
>>> 
>>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>> 
>>> You are using the obsolete terminology of CQL2 and Thrift - column family. 
>>> With CQL3 you should be creating "tables". The practical recommendation of 
>>> an upper limit of a few hundred tables across all key spaces remains.
>>> 
>>> Technically you can go higher and technically you can reduce the overhead 
>>> per table (an undocumented Jira - intentionally undocumented since it is 
>>> strongly not recommended), but... it is unlikely that you will be happy 
>>> with the results.
>>> 
>>> What is the nature of the use case?
>>> 
>>> You basically have two choices: an additional clustering column to distinguish
>>> categories of table, or separate clusters for each few hundred tables.
>>> 
>>> 
>>> -- Jack Krupansky
>>> 
>>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez wrote:

Re: Practical limit on number of column families

2016-03-01 Thread Jack Krupansky
I'll defer to one of the senior committers as to whether they want that
information disseminated any further than it already is. It was
intentionally not documented since it is not recommended. If your Jira
search fu is strong enough you should be able to find it yourself, but
again, its use is strongly not recommended.

As the Jira notes, "having more than dozens or hundreds of tables defined
is almost certainly a Bad Idea."

"Bad Idea" means not good. As in don't go there. And if you do, don't
expect such a mis-adventure to be supported by the community.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 8:39 AM, Vlad wrote:

> Hi Jack,
>
> > you can reduce the overhead per table ... an undocumented Jira
> Can you please point to this Jira number?
>
> >it is strongly not recommended
> What are the consequences of this (besides performance degradation, if any)?
>
> Thanks.
>
>
> On Tuesday, March 1, 2016 7:23 AM, Jack Krupansky <
> jack.krupan...@gmail.com> wrote:
>
>
> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>
> You are using the obsolete terminology of CQL2 and Thrift - column family.
> With CQL3 you should be creating "tables". The practical recommendation of
> an upper limit of a few hundred tables across all key spaces remains.
>
> Technically you can go higher and technically you can reduce the overhead
> per table (an undocumented Jira - intentionally undocumented since it is
> strongly not recommended), but... it is unlikely that you will be happy
> with the results.
>
> What is the nature of the use case?
>
> You basically have two choices: an additional clustering column to
> distinguish categories of table, or separate clusters for each few hundred
> tables.
>
>
> -- Jack Krupansky
>
> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
> fernando.jime...@wealth-port.com> wrote:
>
> Hi all
>
> I have a use case for Cassandra that would require creating a large number
> of column families. I have found references to early versions of Cassandra
> where each column family would require a fixed amount of memory on all
> nodes, effectively imposing an upper limit on the total number of CFs. I
> have also seen rumblings that this may have been fixed in later versions.
>
> To put the question to rest, I have set up a DSE sandbox and created some
> code to generate column families populated with 3,000 entries each.
>
> Unfortunately I have now hit this issue:
> https://issues.apache.org/jira/browse/CASSANDRA-9291
>
> So I will have to retest against Cassandra 3.0 instead
>
> However, I would like to understand the limitations regarding creation of
> column families.
>
> * Is there a practical upper limit?
> * Is this a fixed limit, or does it scale as more nodes are added into the
> cluster?
> * Is there a difference between one keyspace with thousands of column
> families, vs thousands of keyspaces with only a few column families each?
>
> I haven’t found any hard evidence/documentation to help me here, but if
> you can point me in the right direction, I will oblige and RTFM away.
>
> Many thanks for your help!
>
> Cheers
> FJ
>
>
>
>
>
>


Re: Practical limit on number of column families

2016-03-01 Thread Vlad
Hi Jack,
> you can reduce the overhead per table ... an undocumented Jira
Can you please point to this Jira number?

> it is strongly not recommended
What are the consequences of this (besides performance degradation, if any)?
Thanks.


On Tuesday, March 1, 2016 7:23 AM, Jack Krupansky wrote:

3,000 entries? What's an "entry"? Do you mean row, column, or... what?

You are using the obsolete terminology of CQL2 and Thrift - column family. With
CQL3 you should be creating "tables". The practical recommendation of an upper
limit of a few hundred tables across all key spaces remains.

Technically you can go higher and technically you can reduce the overhead per
table (an undocumented Jira - intentionally undocumented since it is strongly
not recommended), but... it is unlikely that you will be happy with the results.

What is the nature of the use case?

You basically have two choices: an additional clustering column to distinguish
categories of table, or separate clusters for each few hundred tables.

-- Jack Krupansky
On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez wrote:

Hi all

I have a use case for Cassandra that would require creating a large number of
column families. I have found references to early versions of Cassandra where
each column family would require a fixed amount of memory on all nodes,
effectively imposing an upper limit on the total number of CFs. I have also
seen rumblings that this may have been fixed in later versions.

To put the question to rest, I have set up a DSE sandbox and created some code
to generate column families populated with 3,000 entries each.

Unfortunately I have now hit this issue:
https://issues.apache.org/jira/browse/CASSANDRA-9291

So I will have to retest against Cassandra 3.0 instead.

However, I would like to understand the limitations regarding creation of
column families.

* Is there a practical upper limit?
* Is this a fixed limit, or does it scale as more nodes are added into the
cluster?
* Is there a difference between one keyspace with thousands of column
families, vs thousands of keyspaces with only a few column families each?

I haven’t found any hard evidence/documentation to help me here, but if you
can point me in the right direction, I will oblige and RTFM away.

Many thanks for your help!

Cheers
FJ

Re: Practical limit on number of column families

2016-03-01 Thread Jack Krupansky
I don't think there are any "reasons behind it." It is simply empirical
experience - as reported here.

Cassandra scales in two dimensions - number of rows per node and number of
nodes. If some source of information led you to believe otherwise, please
point out the source so that we can endeavor to correct it.

The exact number of rows per node and tables per node will always have to
be evaluated empirically - a proof of concept implementation, since it all
depends on the mix of capabilities of your hardware combined with your
specific data model, your specific data values, your specific access
patterns, and your specific load. And it also depends on your own personal
tolerance for degradation of latency and throughput - some people might
find a given set of performance metrics acceptable while others might not.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <
fernando.jime...@wealth-port.com> wrote:

> Hi Tommaso
>
> It’s not that I _need_ a large number of tables. This approach maps easily
> to the problem we are trying to solve, but it’s becoming clear it’s not the
> right approach.
>
> At the moment I’m trying to understand the limitations in Cassandra
> regarding number of Tables and the reasons behind it. I’ve come to the
> email list as my Google-fu is not giving me what I’m looking for :(
>
> FJ
>
>
>
> On 01 Mar 2016, at 09:36, tommaso barbugli wrote:
>
> Hi Fernando,
>
> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it was a
> real pain in terms of operations. Repairs were terribly slow, boot of C*
> slowed down and in general tracking table metrics becomes a bit more work.
> Why do you need this high number of tables?
>
> Tommaso
>
> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <
> fernando.jime...@wealth-port.com> wrote:
>
>> Hi Jack
>>
>> By entry I mean row
>>
>> Apologies for the “obsolete terminology”. When I first looked at
>> Cassandra it was still on CQL2, and now that I’m looking at it again I’ve
>> defaulted to the terms I already knew. I will bear it in mind and call them
>> tables from now on.
>>
>> Is there any documentation about this limit? For example, I’d be keen to
>> know how much memory is consumed per table, and I’m also curious about the
>> reasons for keeping this in memory. I’m trying to understand the
>> limitations here, rather than challenge them.
>>
>> So far I found nothing in my search, hence why I had to resort to some
>> “load testing” to see what happens when you push the table count high
>>
>> Thanks
>> FJ
>>
>>
>> On 01 Mar 2016, at 06:23, Jack Krupansky wrote:
>>
>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>
>> You are using the obsolete terminology of CQL2 and Thrift - column
>> family. With CQL3 you should be creating "tables". The practical
>> recommendation of an upper limit of a few hundred tables across all key
>> spaces remains.
>>
>> Technically you can go higher and technically you can reduce the overhead
>> per table (an undocumented Jira - intentionally undocumented since it is
>> strongly not recommended), but... it is unlikely that you will be happy
>> with the results.
>>
>> What is the nature of the use case?
>>
>> You basically have two choices: an additional clustering column to
>> distinguish categories of table, or separate clusters for each few hundred
>> tables.
>>
>>
>> -- Jack Krupansky
>>
>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
>> fernando.jime...@wealth-port.com> wrote:
>>
>>> Hi all
>>>
>>> I have a use case for Cassandra that would require creating a large
>>> number of column families. I have found references to early versions of
>>> Cassandra where each column family would require a fixed amount of memory
>>> on all nodes, effectively imposing an upper limit on the total number of
>>> CFs. I have also seen rumblings that this may have been fixed in later
>>> versions.
>>>
>>> To put the question to rest, I have set up a DSE sandbox and created some
>>> code to generate column families populated with 3,000 entries each.
>>>
>>> Unfortunately I have now hit this issue:
>>> https://issues.apache.org/jira/browse/CASSANDRA-9291
>>>
>>> So I will have to retest against Cassandra 3.0 instead
>>>
>>> However, I would like to understand the limitations regarding creation
>>> of column families.
>>>
>>> * Is there a practical upper limit?
>>> * Is this a fixed limit, or does it scale as more nodes are added into
>>> the cluster?
>>> * Is there a difference between one keyspace with thousands of column
>>> families, vs thousands of keyspaces with only a few column families each?
>>>
>>> I haven’t found any hard evidence/documentation to help me here, but if
>>> you can point me in the right direction, I will oblige and RTFM away.
>>>
>>> Many thanks for your help!
>>>
>>> Cheers
>>> FJ
>>>
>>>
>>>
>>
>>
>
>


Re: Practical limit on number of column families

2016-03-01 Thread Fernando Jimenez
Hi Tommaso

It’s not that I _need_ a large number of tables. This approach maps easily to 
the problem we are trying to solve, but it’s becoming clear it’s not the right 
approach.

At the moment I’m trying to understand the limitations in Cassandra regarding 
number of Tables and the reasons behind it. I’ve come to the email list as my 
Google-fu is not giving me what I’m looking for :(

FJ



> On 01 Mar 2016, at 09:36, tommaso barbugli wrote:
> 
> Hi Fernando,
> 
> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it was a 
> real pain in terms of operations. Repairs were terribly slow, boot of C* 
> slowed down and in general tracking table metrics becomes a bit more work. Why
> do you need this high number of tables?
> 
> Tommaso
> 
> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez wrote:
> Hi Jack
> 
> By entry I mean row
> 
> Apologies for the “obsolete terminology”. When I first looked at Cassandra it 
> was still on CQL2, and now that I’m looking at it again I’ve defaulted to the 
> terms I already knew. I will bear it in mind and call them tables from now on.
> 
> Is there any documentation about this limit? For example, I’d be keen to know
> how much memory is consumed per table, and I’m also curious about the reasons 
> for keeping this in memory. I’m trying to understand the limitations here, 
> rather than challenge them.
> 
> So far I found nothing in my search, hence why I had to resort to some “load 
> testing” to see what happens when you push the table count high
> 
> Thanks
> FJ
> 
> 
>> On 01 Mar 2016, at 06:23, Jack Krupansky wrote:
>> 
>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>> 
>> You are using the obsolete terminology of CQL2 and Thrift - column family. 
>> With CQL3 you should be creating "tables". The practical recommendation of 
>> an upper limit of a few hundred tables across all key spaces remains.
>> 
>> Technically you can go higher and technically you can reduce the overhead 
>> per table (an undocumented Jira - intentionally undocumented since it is 
>> strongly not recommended), but... it is unlikely that you will be happy with 
>> the results.
>> 
>> What is the nature of the use case?
>> 
>> You basically have two choices: an additional clustering column to distinguish
>> categories of table, or separate clusters for each few hundred tables.
>> 
>> 
>> -- Jack Krupansky
>> 
>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez wrote:
>> Hi all
>> 
>> I have a use case for Cassandra that would require creating a large number 
>> of column families. I have found references to early versions of Cassandra 
>> where each column family would require a fixed amount of memory on all 
>> nodes, effectively imposing an upper limit on the total number of CFs. I 
>> have also seen rumblings that this may have been fixed in later versions.
>> 
>> To put the question to rest, I have set up a DSE sandbox and created some
>> code to generate column families populated with 3,000 entries each.
>> 
>> Unfortunately I have now hit this issue: 
>> https://issues.apache.org/jira/browse/CASSANDRA-9291 
>> 
>> 
>> So I will have to retest against Cassandra 3.0 instead
>> 
>> However, I would like to understand the limitations regarding creation of 
>> column families. 
>> 
>>  * Is there a practical upper limit? 
>>  * Is this a fixed limit, or does it scale as more nodes are added into
>> the cluster? 
>>  * Is there a difference between one keyspace with thousands of column 
>> families, vs thousands of keyspaces with only a few column families each?
>> 
>> I haven’t found any hard evidence/documentation to help me here, but if you 
>> can point me in the right direction, I will oblige and RTFM away.
>> 
>> Many thanks for your help!
>> 
>> Cheers
>> FJ
>> 
>> 
>> 
> 
> 



Re: Practical limit on number of column families

2016-03-01 Thread tommaso barbugli
Hi Fernando,

I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it was a
real pain in terms of operations. Repairs were terribly slow, boot of C*
slowed down and in general tracking table metrics becomes a bit more work.
Why do you need this high number of tables?

Tommaso

On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <
fernando.jime...@wealth-port.com> wrote:

> Hi Jack
>
> By entry I mean row
>
> Apologies for the “obsolete terminology”. When I first looked at Cassandra
> it was still on CQL2, and now that I’m looking at it again I’ve defaulted
> to the terms I already knew. I will bear it in mind and call them tables
> from now on.
>
> Is there any documentation about this limit? For example, I’d be keen to
> know how much memory is consumed per table, and I’m also curious about the
> reasons for keeping this in memory. I’m trying to understand the
> limitations here, rather than challenge them.
>
> So far I found nothing in my search, hence why I had to resort to some
> “load testing” to see what happens when you push the table count high
>
> Thanks
> FJ
>
>
> On 01 Mar 2016, at 06:23, Jack Krupansky wrote:
>
> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>
> You are using the obsolete terminology of CQL2 and Thrift - column family.
> With CQL3 you should be creating "tables". The practical recommendation of
> an upper limit of a few hundred tables across all key spaces remains.
>
> Technically you can go higher and technically you can reduce the overhead
> per table (an undocumented Jira - intentionally undocumented since it is
> strongly not recommended), but... it is unlikely that you will be happy
> with the results.
>
> What is the nature of the use case?
>
> You basically have two choices: an additional clustering column to
> distinguish categories of table, or separate clusters for each few hundred
> tables.
>
>
> -- Jack Krupansky
>
> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
> fernando.jime...@wealth-port.com> wrote:
>
>> Hi all
>>
>> I have a use case for Cassandra that would require creating a large
>> number of column families. I have found references to early versions of
>> Cassandra where each column family would require a fixed amount of memory
>> on all nodes, effectively imposing an upper limit on the total number of
>> CFs. I have also seen rumblings that this may have been fixed in later
>> versions.
>>
>> To put the question to rest, I have set up a DSE sandbox and created some
>> code to generate column families populated with 3,000 entries each.
>>
>> Unfortunately I have now hit this issue:
>> https://issues.apache.org/jira/browse/CASSANDRA-9291
>>
>> So I will have to retest against Cassandra 3.0 instead
>>
>> However, I would like to understand the limitations regarding creation of
>> column families.
>>
>> * Is there a practical upper limit?
>> * Is this a fixed limit, or does it scale as more nodes are added into
>> the cluster?
>> * Is there a difference between one keyspace with thousands of column
>> families, vs thousands of keyspaces with only a few column families each?
>>
>> I haven’t found any hard evidence/documentation to help me here, but if
>> you can point me in the right direction, I will oblige and RTFM away.
>>
>> Many thanks for your help!
>>
>> Cheers
>> FJ
>>
>>
>>
>
>


Re: Practical limit on number of column families

2016-03-01 Thread Fernando Jimenez
Hi Jack

By entry I mean row

Apologies for the “obsolete terminology”. When I first looked at Cassandra it 
was still on CQL2, and now that I’m looking at it again I’ve defaulted to the 
terms I already knew. I will bear it in mind and call them tables from now on.

Is there any documentation about this limit? For example, I’d be keen to know
how much memory is consumed per table, and I’m also curious about the reasons 
for keeping this in memory. I’m trying to understand the limitations here, 
rather than challenge them.

So far I found nothing in my search, hence why I had to resort to some “load 
testing” to see what happens when you push the table count high

Thanks
FJ


> On 01 Mar 2016, at 06:23, Jack Krupansky wrote:
> 
> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
> 
> You are using the obsolete terminology of CQL2 and Thrift - column family. 
> With CQL3 you should be creating "tables". The practical recommendation of an 
> upper limit of a few hundred tables across all key spaces remains.
> 
> Technically you can go higher and technically you can reduce the overhead per 
> table (an undocumented Jira - intentionally undocumented since it is strongly 
> not recommended), but... it is unlikely that you will be happy with the 
> results.
> 
> What is the nature of the use case?
> 
> You basically have two choices: an additional clustering column to distinguish
> categories of table, or separate clusters for each few hundred tables.
> 
> 
> -- Jack Krupansky
> 
> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez wrote:
> Hi all
> 
> I have a use case for Cassandra that would require creating a large number of 
> column families. I have found references to early versions of Cassandra where 
> each column family would require a fixed amount of memory on all nodes, 
> effectively imposing an upper limit on the total number of CFs. I have also 
> seen rumblings that this may have been fixed in later versions.
> 
> To put the question to rest, I have set up a DSE sandbox and created some code
> to generate column families populated with 3,000 entries each.
> 
> Unfortunately I have now hit this issue: 
> https://issues.apache.org/jira/browse/CASSANDRA-9291 
> 
> 
> So I will have to retest against Cassandra 3.0 instead
> 
> However, I would like to understand the limitations regarding creation of 
> column families. 
> 
>   * Is there a practical upper limit? 
>   * Is this a fixed limit, or does it scale as more nodes are added into
> the cluster? 
>   * Is there a difference between one keyspace with thousands of column 
> families, vs thousands of keyspaces with only a few column families each?
> 
> I haven’t found any hard evidence/documentation to help me here, but if you 
> can point me in the right direction, I will oblige and RTFM away.
> 
> Many thanks for your help!
> 
> Cheers
> FJ
> 
> 
> 



Re: Practical limit on number of column families

2016-02-29 Thread Jack Krupansky
3,000 entries? What's an "entry"? Do you mean row, column, or... what?

You are using the obsolete terminology of CQL2 and Thrift - column family.
With CQL3 you should be creating "tables". The practical recommendation of
an upper limit of a few hundred tables across all key spaces remains.

Technically you can go higher and technically you can reduce the overhead
per table (an undocumented Jira - intentionally undocumented since it is
strongly not recommended), but... it is unlikely that you will be happy
with the results.

What is the nature of the use case?

You basically have two choices: an additional clustering column to distinguish
categories of table, or separate clusters for each few hundred tables.
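
A rough CQL sketch of the first option (illustrative names only): the category
that would have chosen a table becomes an additional clustering column in one
shared table:

    CREATE TABLE app_data (
        item_id  text,
        category text,       -- distinguishes the would-be tables
        ts       timestamp,
        data     text,
        PRIMARY KEY ((item_id), category, ts)
    );

    SELECT * FROM app_data WHERE item_id = '42' AND category = 'orders';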


-- Jack Krupansky

On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
fernando.jime...@wealth-port.com> wrote:

> Hi all
>
> I have a use case for Cassandra that would require creating a large number
> of column families. I have found references to early versions of Cassandra
> where each column family would require a fixed amount of memory on all
> nodes, effectively imposing an upper limit on the total number of CFs. I
> have also seen rumblings that this may have been fixed in later versions.
>
> To put the question to rest, I have set up a DSE sandbox and created some
> code to generate column families populated with 3,000 entries each.
>
> Unfortunately I have now hit this issue:
> https://issues.apache.org/jira/browse/CASSANDRA-9291
>
> So I will have to retest against Cassandra 3.0 instead
>
> However, I would like to understand the limitations regarding creation of
> column families.
>
> * Is there a practical upper limit?
> * Is this a fixed limit, or does it scale as more nodes are added into the
> cluster?
> * Is there a difference between one keyspace with thousands of column
> families, vs thousands of keyspaces with only a few column families each?
>
> I haven’t found any hard evidence/documentation to help me here, but if
> you can point me in the right direction, I will oblige and RTFM away.
>
> Many thanks for your help!
>
> Cheers
> FJ
>
>
>


Re: Practical limit on number of column families

2016-02-29 Thread Robert Wille
Yes, there is memory overhead for each column family, effectively limiting the 
number of column families. The general wisdom is that you should limit yourself 
to a few hundred.

Robert

On Feb 29, 2016, at 10:30 AM, Fernando Jimenez wrote:

Hi all

I have a use case for Cassandra that would require creating a large number of 
column families. I have found references to early versions of Cassandra where 
each column family would require a fixed amount of memory on all nodes, 
effectively imposing an upper limit on the total number of CFs. I have also 
seen rumblings that this may have been fixed in later versions.

To put the question to rest, I have set up a DSE sandbox and created some code
to generate column families populated with 3,000 entries each.

Unfortunately I have now hit this issue: 
https://issues.apache.org/jira/browse/CASSANDRA-9291

So I will have to retest against Cassandra 3.0 instead

However, I would like to understand the limitations regarding creation of 
column families.

* Is there a practical upper limit?
* Is this a fixed limit, or does it scale as more nodes are added into the
cluster?
* Is there a difference between one keyspace with thousands of column families, 
vs thousands of keyspaces with only a few column families each?

I haven’t found any hard evidence/documentation to help me here, but if you can 
point me in the right direction, I will oblige and RTFM away.

Many thanks for your help!

Cheers
FJ





Practical limit on number of column families

2016-02-29 Thread Fernando Jimenez
Hi all

I have a use case for Cassandra that would require creating a large number of 
column families. I have found references to early versions of Cassandra where 
each column family would require a fixed amount of memory on all nodes, 
effectively imposing an upper limit on the total number of CFs. I have also 
seen rumblings that this may have been fixed in later versions.

To put the question to rest, I have set up a DSE sandbox and created some code
to generate column families populated with 3,000 entries each.

Unfortunately I have now hit this issue: 
https://issues.apache.org/jira/browse/CASSANDRA-9291 


So I will have to retest against Cassandra 3.0 instead

However, I would like to understand the limitations regarding creation of 
column families. 

* Is there a practical upper limit? 
* Is this a fixed limit, or does it scale as more nodes are added into
the cluster? 
* Is there a difference between one keyspace with thousands of column 
families, vs thousands of keyspaces with only a few column families each?

I haven’t found any hard evidence/documentation to help me here, but if you can 
point me in the right direction, I will oblige and RTFM away.

Many thanks for your help!

Cheers
FJ