With special reference to deployers, I'd like to add one more strength of SQL: a SQL data store will almost always need to be maintained side-by-side with any real-world deployment of the OAE. In our own little world, Grouper uses SQL, Sakai CLE uses SQL, and pretty much any commercial or open-source product or campus IT system we integrate with will use SQL. More generally, it seems that many, perhaps most, production teams who have taken on NoSQL continue to store to SQL as well. (This was equally true for the niche community of Jackrabbit-centric developers.)
At single-university scale and with reasonably designed tables, I haven't seen a need for SQL sharding.

Best,
Ray

On 6/28/12 12:50 PM, Zach A. Thomas wrote:
> Summary
> =======
>
> To help improve performance of both the technology and our team, we are
> evaluating adoption of a new storage subsystem. A number of solutions
> have been evaluated and we have come to some initial conclusions:
>
> * Eliminated from consideration
>   * Infinispan
>   * Neo4J
>   * SenseiDB
>   * Voldemort
> * Still under investigation
>   * Cassandra
>   * MongoDB
>   * Relational DB / JPA / JDO.
>
> We want to emphasize that any technology we adopt will be a transition
> over time, so that we can maintain stability in the application.
>
> Background
> ==========
>
> Our Ann Arbor meeting last month was about thinking through
> architectural changes that could improve OAE in terms of system
> performance and team performance, as well as laying the groundwork for
> taking measurements (modeling a production-like data set, provisioning
> load testing infrastructure, and the load tests themselves). For team
> performance, we agreed that we should strive to rely more heavily on
> established low-level infrastructure for storage. In other words, find a
> storage subsystem and API that we don't need to maintain ourselves. This
> is a fertile time for storage technology, but that also means there are
> many options to sift through, each with its own quirks and tradeoffs.
>
> We set the following criteria for our search (in no particular order):
>
> * ease of use for developers (APIs, etc.)
> * ease of use for deployers (backups, failover, monitoring, etc.)
> * strength of the community
> * suitable license (ECL2 compatible)
> * proven track record (success stories in applications somewhat like ours)
> * options for queries
> * options for scaling
> * options for integrity (atomicity, consistency, transactions,
>   referential integrity)
>
> In the weeks since our meeting, the server devs have explored various
> options. I'd like to summarize our progress so far. In the interests of
> brevity, we won't include every detail. We invite your questions and
> feedback. Note that we're right in the middle of our investigation, not
> at the end. Hopefully, we'll get some more time with OmniTI to talk
> about what we've learned.
>
> Infinispan -- Infinispan is a successor to JBoss Cache. It is simply a
> caching layer that has the ability to persist via a configurable
> cache-store. Strengths: high level configurability for things like
> transactions, write-behind/write-through persistence, high volume of
> community activity, and storage agnosticism. Weaknesses: work with
> Infinispan showed promise with respect to operating inside an OSGi
> container; however, when trying to persist POJO data, it became obvious
> that another library such as Hibernate OGM [6] would be necessary to
> make persistence of POJOs through Infinispan possible, and there just
> doesn't seem to be anything mature enough, or well documented enough,
> that we could start building on. Current Thinking: Infinispan, while
> being a pretty mature memory grid and caching layer, seems a little
> premature to start thinking of as a full-fledged domain persistence layer.
>
> Voldemort -- a project that comes out of LinkedIn. It is a distributed
> key-value store with sophisticated horizontal scaling using a ring
> topology similar to Cassandra's. Strengths: speed, elastic scaling, runs
> in the JVM, supports various forms of serialization, including JSON and
> Google's protobuf.
> Weaknesses: no support for querying, so we'd have to
> write separate synchronous indexing using Lucene, and all the glue code
> to make them work together. Not an active, diverse community. Current
> Thinking: too much work to get basic store-and-find operations.
>
> MongoDB -- a document-oriented database with a very developer-friendly
> API. Managed and backed commercially by 10gen. Strengths: really easy to
> use. You can store JSON documents, which is just what we want to do, and
> you can create indexes and query like we're used to from the relational
> DB world. Huge community, probably the most active in the NoSQL space.
> Tools, hosting options, the works. Weaknesses: We've seen the same story
> a number of times [1][2][3][4]: everyone loves MongoDB at first, but it
> becomes operationally painful in production. Scaling it is complex.
> Current Thinking: The pain might be worth it, but it certainly gives us
> pause. This is probably a product that is going to be much easier to
> manage when it matures.
>
> Cassandra -- a column family database, borrowing ideas from Google's
> Bigtable and Amazon's Dynamo. Originated at Facebook, but since it moved
> to the Apache Foundation, it has taken on a life of its own.
> Commercially backed by DataStax. Strengths: very good replication and
> scaling technology (a ring topology, like Voldemort). Supports queries,
> but you have to plan for them in your data model. Very strong community.
> Consistency tunable per-request. Weaknesses: steeper learning curve for
> devs. Data modeling in Cassandra is a different paradigm from the
> relational databases we know. Current Thinking: this is attractive for
> its power, but it will take work to get everybody up to speed on it. In
> a sense, it's the opposite of the MongoDB story: harder to get started
> with, but very satisfying in the long term. See [1] for more on this.
>
> JPA/JDO with a relational database -- This is the technology we're
> familiar with from projects past.
> This is the tried and true relational
> model with tables, ORM, and sometimes SQL. Strengths: Everyone knows how
> this works. There are plenty of tools, plenty of commercial support, and
> you can write JOINs! Weaknesses: vertical scaling. When you reach the
> limit of the hardware you can throw at your database server, you can try
> sharding, which is notoriously difficult, or something like memcached,
> but then you're committing to key-value semantics, so why not just go
> there in the first place? [5] JPA via OpenJPA and EclipseLink has been
> surprisingly hard to get working in an OSGi runtime. They have trouble
> with the dynamic nature of bundles. Exploring JDO at the moment, but it
> too feels like swimming upstream. Current Thinking: this is familiar,
> but newer technologies have shown us that one size no longer fits all.
>
> [1] http://www.slideshare.net/eonnen/from-100s-to-100s-of-millions
> [2] http://w3matter.com/blog/from-postgresql-to-mongodb-back-to-postgresql
> [3] http://blog.engineering.kiip.me/post/20988881092/a-year-with-mongodb
> [4]
> http://e1ven.com/2011/11/07/my-experiences-with-mongodb-over-the-last-year-in-production/
> [5] http://www.couchbase.com/
> [6] http://www.hibernate.org/subprojects/ogm.html
>
> _______________________________________________
> oae-dev mailing list
> [email protected]
> http://collab.sakaiproject.org/mailman/listinfo/oae-dev
