Re: [oae-dev] OAE Storage Investigation Update

Ray Davis Mon, 02 Jul 2012 09:14:16 -0700

With special reference to deployers, I'd like to add a strength of SQL: 
a SQL data store will almost always need to be maintained side-by-side 
with any real-world deployment of the OAE. In our own little world, 
Grouper uses SQL, Sakai CLE uses SQL, pretty much any commercial or 
open-source product and pretty much any campus IT system we integrate 
with will use SQL. More generally, it seems that many, if not most,
production teams that have taken on NoSQL continue to store data in SQL
as well. (This was equally true for the niche community of
Jackrabbit-centric developers.)


At single-university scale and with reasonably designed tables, I 
haven't seen a need for SQL sharding.

Best,
Ray

On 6/28/12 12:50 PM, Zach A. Thomas wrote:
> Summary
> =======
>
> To help improve performance of both the technology and our team, we are
> evaluating adoption of a new storage subsystem.  A number of solutions
> have been evaluated and we have come to some initial conclusions:
>
>   * Eliminated from consideration
>       * Infinispan
>       * Neo4J
>       * SenseiDB
>       * Voldemort
>   * Still under investigation
>       * Cassandra
>       * MongoDB
>       * Relational DB / JPA / JDO
>
>
> We want to emphasize that any technology we adopt will be a transition
> over time, so that we can maintain stability in the application.
>
> Background
> ==========
>
> Our Ann Arbor meeting last month was about thinking through
> architectural changes that could improve OAE in terms of system
> performance and team performance, as well as laying the groundwork for
> taking measurements (modeling a production-like data set, provisioning
> load testing infrastructure, and the load tests themselves). For team
> performance, we agreed that we should strive to rely more heavily on
> established low-level infrastructure for storage. In other words, find a
> storage subsystem and API that we don't need to maintain ourselves. This
> is a fertile time for storage technology, but that also means there are
> many options to sift through, each with its own quirks and tradeoffs.
>
> We set the following criteria for our search (in no particular order):
>
> * ease of use for developers (APIs, etc.)
> * ease of use for deployers (backups, failover, monitoring, etc.)
> * strength of the community
> * suitable license (ECL2 compatible)
> * proven track record (success stories in applications somewhat like ours)
> * options for queries
> * options for scaling
> * options for integrity (atomicity, consistency, transactions,
> referential integrity)
>
> In the weeks since our meeting, the server devs have explored various
> options. I'd like to summarize our progress so far. In the interests of
> brevity, we won't include every detail. We invite your questions and
> feedback. Note that we're right in the middle of our investigation, not
> at the end. Hopefully, we'll get some more time with OmniTI to talk
> about what we've learned.
>
> Infinispan -- a successor to JBoss Cache. It is essentially a caching
> layer that can persist via a configurable cache-store. Strengths:
> high-level configurability for things like transactions and
> write-behind/write-through persistence, high volume of community
> activity, and storage agnosticism. Weaknesses: Infinispan showed
> promise with respect to operating inside an OSGi container. However,
> when we tried to persist POJO data, it became obvious that another
> library such as Hibernate OGM [6] would be necessary, and nothing
> there seems mature enough or well documented enough for us to start
> building on. Current Thinking: Infinispan is a pretty mature memory
> grid and caching layer, but it seems a little premature to treat it as
> a full-fledged domain persistence layer.
>
> Voldemort -- a project that comes out of LinkedIn. It is a distributed
> key-value store with sophisticated horizontal scaling using a ring
> topology similar to Cassandra's. Strengths: speed, elastic scaling, runs
> in the JVM, supports various forms of serialization, including JSON and
> Google's protobuf. Weaknesses: no support for querying, so we'd have to
> write separate synchronous indexing using Lucene, and all the glue code
> to make them work together. Not an active, diverse community. Current
> Thinking: too much work to get basic store-and-find operations.
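
To make the "glue code" cost concrete, here is a minimal Python sketch of what a store-and-find layer over a pure key-value store involves. It is not Voldemort's API: the `IndexedStore` class and its methods are hypothetical, a plain dict stands in for the store, and an in-memory inverted index stands in for Lucene.

```python
from collections import defaultdict


class IndexedStore:
    """Hypothetical key-value store with a hand-maintained field index."""

    def __init__(self):
        self._data = {}                 # stand-in for the key-value store
        self._index = defaultdict(set)  # (field, value) -> set of keys

    def put(self, key, doc):
        # Remove stale index entries first, then write the store and the index.
        self.delete(key)
        self._data[key] = dict(doc)
        for field, value in doc.items():
            self._index[(field, value)].add(key)

    def delete(self, key):
        old = self._data.pop(key, None)
        if old is None:
            return
        for field, value in old.items():
            self._index[(field, value)].discard(key)

    def get(self, key):
        return self._data.get(key)

    def find(self, field, value):
        # Exact-match lookup only; field values must be hashable.
        return sorted(self._index.get((field, value), set()))
```

Every write has to update both structures. In a real deployment the store and the index would be separate systems that can fail independently, and keeping them in step is where most of the work lies.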
>
> MongoDB -- a document-oriented database with a very developer-friendly
> API. Managed and backed commercially by 10gen. Strengths: really easy to
> use. You can store JSON documents, which is just what we want to do, and
> you can create indexes and query like we're used to from the relational
> DB world. Huge community, probably the most active in the NoSQL space.
> Tools, hosting options, the works. Weaknesses: We've seen the same story
> a number of times [1][2][3][4]: everyone loves MongoDB at first, but it
> becomes operationally painful in production. Scaling it is complex.
> Current Thinking: The pain might be worth it, but it certainly gives us
> pause. This is probably a product that is going to be much easier to
> manage when it matures.
>
> Cassandra -- a column family database, borrowing ideas from Google
> BigTable and Amazon Dynamo. Originated at Facebook, but since it moved
> to the Apache Software Foundation, it has taken on a life of its own.
> Commercially backed by DataStax. Strengths: very good replication and
> scaling technology (a ring topology, like Voldemort). Supports queries,
> but you have to plan for them in your data model. Very strong community.
> Consistency tunable per-request. Weaknesses: steeper learning curve for
> devs. Data modeling in Cassandra is a different paradigm from the
> relational databases we know. Current Thinking: this is attractive for
> its power, but it will take work to get everybody up to speed on it. In
> a sense, it's the opposite of the MongoDB story: harder to get started
> with, but very satisfying in the long term. See [1] for more on this.
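
For readers unfamiliar with the ring topology mentioned for both Voldemort and Cassandra, the underlying idea is consistent hashing. The sketch below is a simplified illustration in Python, not either project's implementation: the `HashRing` class, the use of MD5, and the virtual-node count are assumptions chosen for brevity.

```python
import bisect
import hashlib


def _token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring; expects at least one node."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes
        self._ring = []    # sorted (token, node) pairs
        self._tokens = []  # tokens only, kept parallel to _ring for bisect
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node owns several tokens so load spreads more evenly.
        for i in range(self._vnodes):
            self._ring.append((_token(f"{node}#{i}"), node))
        self._ring.sort()
        self._tokens = [token for token, _ in self._ring]

    def node_for(self, key):
        # The owner is the first token clockwise from the key's position,
        # wrapping around to the start of the ring.
        idx = bisect.bisect_right(self._tokens, _token(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property that matters for elastic scaling is that adding a node only reassigns the keys that now fall on the new node's tokens; all other keys stay where they were.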
>
> JPA/JDO with a relational database -- This is the technology we're
> familiar with from projects past. This is the tried and true relational
> model with tables, ORM, and sometimes SQL. Strengths: Everyone knows how
> this works. There are plenty of tools, plenty of commercial support, and
> you can write JOINs! Weaknesses: vertical scaling. When you reach the
> limit of the hardware you can throw at your database server, you can try
> sharding, which is notoriously difficult, or something like memcached,
> but then you're committing to key-value semantics, so why not just go
> there in the first place? [5] JPA via OpenJPA and EclipseLink has been
> surprisingly hard to get working in an OSGi runtime. They have trouble
> with the dynamic nature of bundles. Exploring JDO at the moment, but it
> too feels like swimming upstream. Current Thinking: this is familiar,
> but newer technologies have shown us that one size no longer fits all.
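
As a reminder of what "you can write JOINs" buys us, here is a small sketch using Python's built-in sqlite3 module. The `users` and `memberships` tables and the `members_of` helper are hypothetical and are not part of any OAE schema.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE memberships (
        user_id    INTEGER NOT NULL REFERENCES users(id),
        group_name TEXT NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "ada"), (2, "grace")],
)
conn.executemany(
    "INSERT INTO memberships (user_id, group_name) VALUES (?, ?)",
    [(1, "physics"), (2, "physics"), (2, "math")],
)


def members_of(group_name):
    """Return the names of all users in a group, resolved with one JOIN."""
    rows = conn.execute(
        """
        SELECT u.name
        FROM users AS u
        JOIN memberships AS m ON m.user_id = u.id
        WHERE m.group_name = ?
        ORDER BY u.name
        """,
        (group_name,),
    ).fetchall()
    return [name for (name,) in rows]
```

In a key-value or column family store, the same question usually has to be answered by a structure that was written ahead of time for exactly that lookup.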
>
> [1] http://www.slideshare.net/eonnen/from-100s-to-100s-of-millions
> [2] http://w3matter.com/blog/from-postgresql-to-mongodb-back-to-postgresql
> [3] http://blog.engineering.kiip.me/post/20988881092/a-year-with-mongodb
> [4]
> http://e1ven.com/2011/11/07/my-experiences-with-mongodb-over-the-last-year-in-production/
> [5] http://www.couchbase.com/
> [6] http://www.hibernate.org/subprojects/ogm.html
>
>
>
> _______________________________________________
> oae-dev mailing list
> [email protected]
> http://collab.sakaiproject.org/mailman/listinfo/oae-dev
>


