Hi Suresh,

I will try to keep the focus of this mail thread on the object DB selection, but I will also share some comments about the architecture and the API since you mentioned those. Please feel free to spawn separate threads on those if we want to keep this thread focused on the object DB.

Please see the comments in-line.

Thanks,
Eran Chinthaka Withana

On Sun, Feb 23, 2014 at 2:20 PM, Suresh Marru <[email protected]> wrote:

> Hi All,
>
> Airavata is actively migrating to use Thrift API for the RESTless design
> and to facilitate various language bindings from client gateways. The
> programming language support in Thrift has been so far very encouraging.
> The current architecture is looking like Figure 1 at [1].

Quick questions on the architecture. It seems like the API is directly contacting the Orchestrator to schedule workflows. I honestly think this is not a scalable approach because of the impedance mismatch between these two systems. Are we considering decoupling the two with a message queue and going for a worker-based architecture?

Also, the "API Mapping Diagram" hints at a "kind of" stateful service with a sequential set of steps. For example, because there is no method to get all experiments, I assume the client is supposed to remember the experiment IDs and invoke each of these methods in sequence. I'd encourage us to think in terms of stateless invocation, where any client can invoke each of these methods without prior knowledge of the state of the execution. See the sketch right below for the kind of method I mean.
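Something along these lines (just an illustrative sketch; the service and method names here are made up, not taken from the current IDL at [3]):

    service AiravataExperimentQuery {
      // Hypothetical: lets any client discover experiments instead of
      // having to remember experiment IDs from earlier calls.
      list<string> getAllExperimentIds()
        throws (1: airavataErrors.InvalidRequestException ire),

      // Paged variant, so the call stays cheap once there are many experiments.
      list<string> getExperimentIds(1: required i32 offset, 2: required i32 limit)
        throws (1: airavataErrors.InvalidRequestException ire)
    }

(This assumes the airavataErrors module is included, as in the existing IDL.)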
> Language specific clients will be released as Thrift SDKs (similar to
> Evernote SDKs [1]). These clients will be integrated into gateway portals
> which connect to the API Server. The API operations broker the simple calls
> into one or more backend CPI calls (Airavata internal component
> interfaces). An example set of mappings is illustrated in Figure 2 at
> [1]. The current draft of the Thrift API for version 0.12 is at [3]; please
> pay attention to the experiment model at [4].

Comments on the Thrift IDL:

1. The input and output parameters do not have constraint specifiers (required vs. optional) and are left at the default. This will be very challenging when we try to improve the APIs in later versions, and it is standard practice to ALWAYS have either optional or required as the constraint specifier.

2. Consider using typedefs to reduce repetitive names. For example, defining airavataErrors.InvalidRequestException as a type will let you refer to it simply as InvalidRequestException.

3. Introduce a parameter in each method to carry the API key. This will be helpful in the future to identify individual clients, enforce SLAs, log requests, etc.

A rough sketch of (1)-(3) follows below.
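To make these concrete (again just a sketch with made-up names, not a patch against the actual IDL at [3]; here the typedef is shown on a base type):

    typedef string ExperimentId   // one name for a concept that repeats everywhere

    struct ExperimentSummary {
      1: required ExperimentId experimentId,  // constraint always stated explicitly
      2: optional string description          // optional, so it can evolve safely
    }

    service Airavata {
      // apiKey identifies the calling client (SLA enforcement, per-client logging, etc.)
      ExperimentSummary getExperiment(1: required string apiKey,
                                      2: required ExperimentId experimentId)
        throws (1: airavataErrors.InvalidRequestException ire)
    }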
> For the persistent store, we had few iterations of Airavata Registry
> shifting from a legacy XRegistry to JackRabbit to now an OpenJPA based
> registry. To allow the API and the associated data models to evolve, it
> will be useful to explore object databases so we can store the serialized
> version of Thrift objects directly. But it will be nice to have all (or
> most) of the fields queryable.

FYI, we did a storage-space analysis some time back, and for smaller objects the overhead of storing the object in Thrift-serialized form vs. each attribute as a column is the same. Also, enabling compression on each column family shrinks the difference even further. So I'd first start with a field-based object representation. Having said that, making each attribute a column doesn't by itself make it queryable: we have to either create secondary indexes or do column slices, and both of these are a bit expensive. So, as always with NoSQL storage systems, we should know the queries ahead of time before even loosely defining storage schemas.

> This calls for a more column-family design of any NoSQL approaches.
>
> Any recommendations for a registry architecture?

It will be easier to answer this question if you can list the use cases for the registry. I don't think most people on this list know all the use cases. I myself have a very faint memory :)

> Quickly hacking through I find the following approach a viable one:
> ZombieDB [5] over Astyanax [6], which talks to Cassandra.

Not sure why you picked Astyanax (despite it originating from Netflix and claiming better performance than Hector thanks to its token-range awareness). I'd rather pick Hector or Astyanax based on the performance numbers you get. We did some work on this earlier and came up with an abstraction over these two clients so that we can switch easily between them: https://github.com/WizeCommerce/hecuba

In any case, I think it's a bit too early to talk about this. I haven't used ZombieDB before, but before we pick any technology I'd spend a bit more time listing the use cases.

> Airavata can benefit immediately from the replication and reliability of
> Cassandra and from its scalability in the near future. Some of the model
> objects like experiment creation will need to have strong consistency and
> most of the monitoring can live with eventual consistency.

Even though Cassandra is supposed to compromise C for AP (in CAP-theorem terms), there are knobs (like read and write consistency levels) we can use to make it strongly consistent where we need it. So I think we are covered here. A sketch of what that looks like with Hector is below.
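For example (an untested sketch using Hector; the cluster, host, and keyspace names are made up):

    import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
    import me.prettyprint.cassandra.service.CassandraHostConfigurator;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.HConsistencyLevel;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;

    public class RegistryKeyspaceFactory {
        public static Keyspace createStronglyConsistentKeyspace() {
            Cluster cluster = HFactory.getOrCreateCluster("AiravataCluster",
                    new CassandraHostConfigurator("localhost:9160"));

            // QUORUM reads plus QUORUM writes give R + W > N, so every read
            // overlaps the latest successful write. That is strong consistency
            // for the objects (like experiment creation) that need it.
            ConfigurableConsistencyLevel policy = new ConfigurableConsistencyLevel();
            policy.setDefaultReadConsistencyLevel(HConsistencyLevel.QUORUM);
            policy.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);

            return HFactory.createKeyspace("AiravataRegistry", cluster, policy);
        }
    }

Monitoring data that can live with eventual consistency would instead use a separate keyspace (or a per-operation override) at HConsistencyLevel.ONE to keep writes cheap.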
