And how are you?

My purpose with this email to not to seek out a religious debate on NoSQL vs 
Relational databases, *I'm really looking for someone that has had a great 
experience with successfully sharding relational databases for 
multi-tentant applications operating at very large scale, and overcame the 
associated challenges. *(or someone that's combined a NoSQL database and 
Elastic search, for example, to have relational database like query 
features atop a NoSQL database).

*The reason I seek such a person out is to answer the question:   Can I 
have the querying features of a relational database while at the same time 
have the ease of maintaining a NoSQL database with respect to scale, 
replication, and availability when operating a massive, multi-tenant 
database?*

We know there are no magic architectures, and that such comparisons of NoSQL vs 
Relational database really depend on the need.  The need in this case is 
for: 

a highly scalable and reliable database for a multi-tenant application that 
will serve an estimated 40,000 companies with an average of 1000 users per 
company and about 10,000 entries per user (or 10 million entries per 
company spread across about 50 tables).  


Given this particular use case, which I admit is still a bit vague, here is 
a pro/con analysis on NoSQL vs Relational DB based on my experiences with 
both types of databases. 

*When NoSQL databases -* like Google Datastore, Amazon SimpleDB, Opensource 
Casandra and MongoDB - *are stacked up against this use case, they have the 
following pros (+) and cons ( - )*, IMHO:

   - + Scales easily - sharding on a key-blob entry in a NoSQL datastore is 
   fairly easy compared sharding a relational database
   - + Replicates easily - for the same reason as sharding
   - + Query speed is always fast as each query has to be based on a key 
   index. 
   - + Easy to update data structures, no schema management to deal with
   - + Easy to maintain...in cases of Datastore and SimpleDB, no DBA is 
   needed!
   - -  Ad hoc querying across entities/tables is difficult, typically done 
   in the code rather than by the database
   - -  More upfront data processing work is needed to get things like 
   counts on an entity
   - - Sorting & filtering on an entity requires pre-defined indexes

*When Relational Databases* - like SQL Server, Postgres, Oracle, MySQL - *are 
stacked up against this use case* (i.e., 30,000 customers with 10M entries 
each across about 50 tables) *have the following pros (+) and cons ( - )*, 
IMHO:

   - + Easier to query for ad-hoc reporting needs
   - + Easier to throw aggregate queries (group by/ counts/ max/ min) to 
   the database instead of doing it in code
   - + Pagination (go to page x of Y ) is doable, even on really large 
   datasets
   - + Sorting and filtering on an entity or even between entities is 
   easier; cross entity/table querying is easier
   - - Scaling is difficult - sharding strategy needs to be figured out 
   ahead of time.  How to shard for 100,000 customers, with 10M records for 
   each customers (but those 10M records are stored amoungst 50 different 
   tables)
   - - Replication and Scaling is complicated (??), need serious expertise 
   / experience
   - - Queries can drag because there are so many possibilities -- need to 
   have a represented dataset in a non production instance to do pre-analysis 
   and optimization on new code/queries getting introduced.  Need to monitor 
   and analyze the query stats on existing queries to see what needs tuning.
   - - Data structure changes are harder, need to main sequenced DDL and 
   DML scripts, and coordinate deployments to have the database changes go 
   before the code changes
   - - Need a DBA for replication, backups, performance, scalability, 
   monitoring


It almost comes down to saying...

hey, if you want your product team to be able to do any querying they want, 
then go with a relational database but the technology team will have to 
deal with the technical consequences and challenges.


...or ...

hey, you don't want to deal with the technical challenges for maintaining a 
massively scalable and reliable relational database, so go with NoSQL and 
have your product team deal with the consequences and challenges.


I lean toward NoSQL because I've never seen a database scale so easily and 
be so reliable compared to the very large relational database projects I've 
worked on (and I've seen some MASSIVE Oracle environments and datasets in 
my work history in both government and commercial installations, but they 
weren't multi-tenant). 

However, if I was able to hear the experiences of someone that's been able 
to either:

   - a)  figure out how to shard a relational database easily even for 
   30,000 customers containing 10M records per customer, across 50 tables such 
   that the relational database scales and is as reliable as a NoSQL database, 
   then I would definitely go with a Relational Database.
   - or
   - b) Overcome querying deficiencies of a NoSQL database by writing the 
   data to both a NoSQL datastore and something like Elastic Search (which 
   also scales easily).  This would make up for a number of the 'cons' of a 
   NoSQL database for this use case based on a small spike project we did 
   last summer.  So that's another option I would like to hear from someone 
   about that has implemented this architecture and are fans.

Please let me know if you have any thoughts on any of the above, and I'm 
most interested in speaking with folks about how they've figured out how to 
have the benefits of both Relational and NoSQL databases.  

Rock on,
  Hardwick

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/google-appengine.
For more options, visit https://groups.google.com/d/optout.

Reply via email to