Hi Suresh,

I will try to keep the focus of this mail thread on the object DB selection,
but I will also share some comments about the architecture and the API since
you mentioned those. Please feel free to spawn separate threads on those
topics if we want to keep this thread focused on the object DB.

Please see the comments in-line.

Thanks,
Eran Chinthaka Withana


On Sun, Feb 23, 2014 at 2:20 PM, Suresh Marru <[email protected]> wrote:

> Hi All,
>
> Airavata is actively migrating to use Thrift API for the RESTless design
> and to facilitate various language bindings from client gateways. The
> programming language support in thrift has been so far very encouraging.
> The current architecture is looking like Figure 1 at [1].
>

Quick questions on the architecture. It seems like the API is directly
contacting the Orchestrator to schedule workflows. I honestly don't think this
is a scalable approach, due to the impedance mismatch between these two
systems. Are we considering decoupling these two with a message queue and
going for a worker-based architecture?

 Also, the "API Mapping Diagram" hints at a somewhat stateful service with a
sequential set of steps. For example, because there is no method to get all
experiments, I assume the client is supposed to remember the experiment IDs
and invoke each of these methods in sequence. I'd encourage thinking in terms
of stateless invocation, where any client can invoke each of these methods
without prior knowledge of the state of the execution.

> Language specific clients will be released as thrift SDK's (similar to
> evernote sdk's [1]). These clients will be integrated into gateway portals
> which connect to the API Server. The API operations broker these simple calls
> into one or more backend CPI calls (Airavata internal component
> interfaces).  An example set of mappings are illustrated in Figure 2 at
> [1]. The current draft of thrift API for version 0.12 is at [3], please pay
> attention to experiment model at [4].
>

Comments on thrift IDL

1. The input and output parameters do not have constraint specifiers
(required vs optional) and are left at the default. This will be very
challenging when we try to evolve the APIs in later versions; it's standard
practice to ALWAYS mark every field as either optional or required.
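A minimal sketch of what this looks like (the struct and field names here are hypothetical, not taken from the actual Airavata IDL):

```thrift
// Every field carries an explicit requiredness specifier.
struct ExperimentSummary {
  1: required string experimentId,  // must always be present on the wire
  2: required string name,
  3: optional string description,   // may be omitted; easy to deprecate later
  4: optional i64 creationTime
}
```

Fields with no specifier get "default" requiredness, whose behavior differs subtly across the generated language bindings, and that is exactly what makes later API evolution painful.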

2. Consider using typedefs to reduce repetitive names. For example, defining
airavataErrors.InvalidRequestException as a typedef will let you refer to it
simply as InvalidRequestException.
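The pattern would look roughly like this (a sketch, assuming the Thrift compiler in use accepts typedefs of included struct/exception types, which the Apache compiler does even though the formal grammar only lists base and container types; the service and method are illustrative):

```thrift
include "airavataErrors.thrift"

// Alias the fully qualified exception name once...
typedef airavataErrors.InvalidRequestException InvalidRequestException

// ...so every method signature can stay short:
service Airavata {
  string getAPIVersion() throws (1: InvalidRequestException ire)
}
```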

3. Introduce a parameter in each method to carry the API key. This will be
helpful in the future to identify individual clients, enforce SLAs, log
requests, etc.
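Concretely, something like the following (method and parameter names are hypothetical, just to show the shape):

```thrift
service Airavata {
  // apiKey identifies the calling gateway, so the server can enforce
  // SLAs and log requests per client.
  string createExperiment(1: required string apiKey,
                          2: required string experimentName)
}
```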


>
> For the persistent store, we had few iterations of Airavata Registry
> shifting from a legacy XRegistry to JackRabbit to now an OpenJPA based
> registry. To allow the API and the associated data models to evolve, it
> will be useful to explore object databases so we can store the serialized
> version of thrift objects directly. But it will be nice to have all (or
> most) of the fields queryable.


FYI, we did a storage-space analysis some time back, and for smaller objects
the overhead of storing the object in Thrift-serialized form vs. storing each
attribute as a column is the same. Also, enabling compression on each column
family shrinks whatever difference remains. So I'd first start with a
field-based object representation.

Having said that, making each attribute a column doesn't by itself make it
queryable. We have to either create secondary indexes or do column slices,
and both of these are somewhat expensive. So, as always with NoSQL storage
systems, we should know the queries ahead of time before even loosely
defining storage schemas.


> This calls for a more column-family design of any NoSQL approaches.
>
> Any recommendations for a registry architecture?
>

It will be easier to answer this question if you can list the use cases for
the registry. I don't think most people on this list know all the use cases;
I myself have only a faint memory :)


> Quickly hacking through I find the following approach a viable one:
> ZombieDB[5] over astyanax[6] which talks to Cassandra.


Not sure why you picked Astyanax (despite its originating at Netflix and
claiming better performance than Hector due to its token-range awareness).
I'd rather pick Hector or Astyanax based on the performance numbers you get.
We did some work on this earlier and came up with an abstraction over these
two clients so that we can switch easily between them:
https://github.com/WizeCommerce/hecuba

In any case, I think it's a bit too early to talk about this.

I haven't used ZombieDB before, but before we pick any technology I'd spend a
bit more time listing the use cases.


> Airavata can benefit immediately from the replication and reliability of
> cassandra and scalability in near future. Some of the model objects like
> experiment creation will need to have strong consistency and most of the
> monitoring can live with eventual consistency.
>

Even though Cassandra is supposed to trade C for A and P (in CAP-theorem
terms), there are knobs (such as read and write consistency levels) we can
use to make it strongly consistent: with QUORUM reads and QUORUM writes,
R + W > N guarantees that a read sees the latest acknowledged write. So I
think we are covered here.
