Re: Using Per-Table Keyspaces for Tunable Replication

2014-12-12 Thread Ryan Svihla
Clarification: 'a keyspace for each' should read 'a keyspace for Cassandra
tables and a keyspace for Solr tables'.

On Fri, Dec 12, 2014 at 11:25 AM, Ryan Svihla rsvi...@datastax.com wrote:

 It would make more sense to just have a keyspace for each: something like
 solr_tables and cassandra_tables. I've done something similar with most
 customers using DSE Search (this isn't a DSE mailing list, but the
 information is interesting background for your question).

 There is a cost to each keyspace, and at some point the overhead of managing
 them all becomes expensive in terms of total heap usage (your mileage may
 vary depending on many factors). Breaking keyspaces up into logical
 replication groups makes the most sense from a maintainability and
 performance standpoint.

 On Fri, Dec 12, 2014 at 11:21 AM, Eric Stevens migh...@gmail.com wrote:

 We're considering moving to a model where we put each of our tables in a
 dedicated keyspace. This would let us tune replication per table and change
 our minds about that replication on a per-table basis without a major
 migration. The biggest driver for this is Solr integration: we want to tune
 RF into our Solr DC so that only the tables we want to search are replicated
 there (using NetworkTopologyStrategy with 'solr': 0 for tables which are not
 searchable).

 Has anyone else tried this? Is there any reason we might not want to do so,
 or any hidden gotchas we should be concerned about? Our total table count is
 small, in the tens range; our searchable tables number maybe 4 or 5.



--
Ryan Svihla
Solution Architect, DataStax
http://www.datastax.com/


Re: Using Per-Table Keyspaces for Tunable Replication

2014-12-12 Thread Ryan Svihla
It would make more sense to just have a keyspace for each: something like
solr_tables and cassandra_tables. I've done something similar with most
customers using DSE Search (this isn't a DSE mailing list, but the
information is interesting background for your question).

There is a cost to each keyspace, and at some point the overhead of managing
them all becomes expensive in terms of total heap usage (your mileage may
vary depending on many factors). Breaking keyspaces up into logical
replication groups makes the most sense from a maintainability and
performance standpoint.
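
For example, something like this (a rough, untested sketch using the Java
driver; the DC names 'cassandra' and 'solr' and the replication factors are
placeholders for whatever your cluster actually uses):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class KeyspaceSplit {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Searchable tables: replicas in both the Cassandra and Solr DCs.
        session.execute("CREATE KEYSPACE IF NOT EXISTS solr_tables WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'cassandra': 3, 'solr': 2}");

        // Non-searchable tables: no replicas in the Solr DC.
        session.execute("CREATE KEYSPACE IF NOT EXISTS cassandra_tables WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'cassandra': 3, 'solr': 0}");

        cluster.close();
    }
}

Note that 'solr': 0 is equivalent to simply leaving the 'solr' DC out of the
replication map.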

On Fri, Dec 12, 2014 at 11:21 AM, Eric Stevens migh...@gmail.com wrote:

 We're considering moving to a model where we put each of our tables in a
 dedicated keyspace. This would let us tune replication per table and change
 our minds about that replication on a per-table basis without a major
 migration. The biggest driver for this is Solr integration: we want to tune
 RF into our Solr DC so that only the tables we want to search are replicated
 there (using NetworkTopologyStrategy with 'solr': 0 for tables which are not
 searchable).

 Has anyone else tried this? Is there any reason we might not want to do so,
 or any hidden gotchas we should be concerned about? Our total table count is
 small, in the tens range; our searchable tables number maybe 4 or 5.



--
Ryan Svihla
Solution Architect, DataStax
http://www.datastax.com/


Re: Using Per-Table Keyspaces for Tunable Replication

2014-12-12 Thread Eric Stevens
Well, we started with the thought that we'd have two keyspaces, one for
searchables and one for non-searchables, like you mentioned. But our concern
is that we may change our minds in the future about which column families are
available for search. A separate keyspace per table gives us greater
flexibility in that regard.
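
For example, making a table searchable later would come down to something
like this (a hypothetical sketch; 'orders_ks' is a made-up keyspace name, and
we'd still need to stream the existing data into the Solr DC afterwards, e.g.
with nodetool rebuild):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class MakeTableSearchable {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Only this table's keyspace changes; every other table keeps its
        // existing replication settings untouched.
        session.execute("ALTER KEYSPACE orders_ks WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'cassandra': 3, 'solr': 2}");

        cluster.close();
    }
}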

I know that Thrift includes the keyspace as part of the connection state, so
if you're reading from or writing to many keyspaces, you end up making a lot
of additional round trips to switch keyspaces, and that hurts throughput. I
may be wrong, but I don't think this is true for the native protocol: if
we're using fully qualified names for all of our queries, I don't think we
incur the same overhead.
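
In other words, something like this should be fine over a single session (a
sketch with made-up keyspace and table names; no default keyspace, no USE
statements, every table name fully qualified):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class QualifiedQueries {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        // No argument to connect(): this session has no default keyspace.
        Session session = cluster.connect();

        // One session spans many keyspaces with no keyspace-switching round
        // trips, because every table name carries its keyspace.
        ResultSet searchable = session.execute(
                "SELECT * FROM orders_ks.orders LIMIT 10");
        ResultSet plain = session.execute(
                "SELECT * FROM audit_ks.audit_log LIMIT 10");

        System.out.println(searchable.one());
        System.out.println(plain.one());
        cluster.close();
    }
}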

I've had a look through the DataStax Java Driver's execution path, and I see
that it attempts to discover the keyspace used by each query, but that's to
help determine the candidate hosts for the token-aware policy. It does that
discovery when the session is initialized (see Metadata.java
http://grepcode.com/file/repo1.maven.org/maven2/com.datastax.cassandra/cassandra-driver-core/2.1.2/com/datastax/driver/core/Metadata.java/#381)
as well as when a topology change is detected. So it seems like it may
slightly slow down connect time, but the per-query cost at execution time
should be roughly constant regardless of the number of keyspaces.
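
For reference, this is the kind of configuration I'm describing (a sketch
against the 2.1-era driver API; the local DC name 'cassandra' is a
placeholder):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareSetup {
    public static void main(String[] args) {
        // The per-keyspace replica maps are built once at connect time (and
        // again on topology or schema changes); TokenAwarePolicy consults
        // them per query, so extra keyspaces add metadata, not per-query work.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withLoadBalancingPolicy(
                        new TokenAwarePolicy(new DCAwareRoundRobinPolicy("cassandra")))
                .build();
        cluster.connect(); // triggers metadata and token map initialization
        cluster.close();
    }
}

TokenAwarePolicy just reorders the child policy's candidates so replicas are
tried first, so queries still stay pinned to the local DC.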

I know there is nontrivial overhead for each column family, but I have not
read or heard that there is nontrivial overhead for each keyspace.  Do you
have more information about that?


On Fri, Dec 12, 2014 at 10:26 AM, Ryan Svihla rsvi...@datastax.com wrote:

 Clarification: 'a keyspace for each' should read 'a keyspace for Cassandra
 tables and a keyspace for Solr tables'.

 On Fri, Dec 12, 2014 at 11:25 AM, Ryan Svihla rsvi...@datastax.com
 wrote:

 It would make more sense to just have a keyspace for each: something like
 solr_tables and cassandra_tables. I've done something similar with most
 customers using DSE Search (this isn't a DSE mailing list, but the
 information is interesting background for your question).

 There is a cost to each keyspace, and at some point the overhead of managing
 them all becomes expensive in terms of total heap usage (your mileage may
 vary depending on many factors). Breaking keyspaces up into logical
 replication groups makes the most sense from a maintainability and
 performance standpoint.

 On Fri, Dec 12, 2014 at 11:21 AM, Eric Stevens migh...@gmail.com wrote:

 We're considering moving to a model where we put each of our tables in a
 dedicated keyspace. This would let us tune replication per table and change
 our minds about that replication on a per-table basis without a major
 migration. The biggest driver for this is Solr integration: we want to tune
 RF into our Solr DC so that only the tables we want to search are replicated
 there (using NetworkTopologyStrategy with 'solr': 0 for tables which are not
 searchable).

 Has anyone else tried this? Is there any reason we might not want to do so,
 or any hidden gotchas we should be concerned about? Our total table count is
 small, in the tens range; our searchable tables number maybe 4 or 5.







Re: Using Per-Table Keyspaces for Tunable Replication

2014-12-12 Thread Tyler Hobbs
On Fri, Dec 12, 2014 at 4:50 PM, Eric Stevens migh...@gmail.com wrote:


 I know that Thrift includes the keyspace as part of the connection state, so
 if you're reading from or writing to many keyspaces, you end up making a lot
 of additional round trips to switch keyspaces, and that hurts throughput. I
 may be wrong, but I don't think this is true for the native protocol: if
 we're using fully qualified names for all of our queries, I don't think we
 incur the same overhead.


That's correct. While you can set a default keyspace for a native protocol
connection, the ability to use fully qualified names means this doesn't
matter in the same way it did for Thrift.



 I've had a look through the DataStax Java Driver's execution path, and I see
 that it attempts to discover the keyspace used by each query, but that's to
 help determine the candidate hosts for the token-aware policy. It does that
 discovery when the session is initialized (see Metadata.java
 http://grepcode.com/file/repo1.maven.org/maven2/com.datastax.cassandra/cassandra-driver-core/2.1.2/com/datastax/driver/core/Metadata.java/#381)
 as well as when a topology change is detected. So it seems like it may
 slightly slow down connect time, but the per-query cost at execution time
 should be roughly constant regardless of the number of keyspaces.


This is also correct. On startup the driver builds a token ring (or replica
map) representation for each keyspace to assist TokenAwarePolicy. There's no
additional per-query overhead for extra keyspaces.



 I know there is nontrivial overhead for each column family, but I have not
 read or heard that there is nontrivial overhead for each keyspace.  Do you
 have more information about that?


The overhead for each keyspace is minor: some additional objects on the heap,
some more entries in the system tables, and a bit more metadata tracked by
the driver, but all of that is pretty lightweight.

The per-column-family overhead primarily comes from the way memory is
allocated for memtables. However, CASSANDRA-7882 should significantly improve
that: https://issues.apache.org/jira/browse/CASSANDRA-7882

-- 
Tyler Hobbs
DataStax http://datastax.com/