Hi,

The BAM team has been looking into ways of improving how we handle operations
in the data layer. Here I will explain the issues we have because of the
current BAM architecture, and propose a solution to remedy them.

Issues
=====

* No support for other data stores for storing events

At the moment, we are strictly limited to storing events in Cassandra, but
there has been strong interest in using other types of data stores such as
MongoDB, RDBMSs etc., especially because some users find it easier to keep
using their existing databases. This support is also critical for embedding
BAM functionality into other products; for example, as a light-weight
analytics solution, people should be able to use an RDBMS-based setup.

* Toolboxes being bound to certain types of data sources

This is the case where we assume we always retrieve data from Cassandra and
write to some specific RDBMS. This approach does not scale, especially for
the WSO2 product related toolboxes we have or are going to have, because the
toolboxes become limited to one specific combination of databases. We would
then need to maintain a different version of each toolbox for every database
combination, which is not practical, and a huge effort would be spent on
testing each of these every time.

* Multi-tenancy limitations

At the moment, we use our own MT Cassandra to store events tenant-wise, and
because of this, we cannot implement MT features on any other Cassandra
distribution that is out there. So effectively, anyone who uses their own
Cassandra installation cannot use MT features, which makes the BAM product
inconsistent in its feature set. Ideally, we should support anyone bringing
their own Cassandra, or in fact any supported type of database, without any
special modifications for MT.

* Transports

CEP introduced a new architecture for defining transports/data formats in
the system, and there are many transports such as HTTP/JMS etc., with data
types such as XML/Text/JSON, available for getting events in. But BAM is
limited to the Thrift transport, because we explicitly need authentication
support from the transport; that is how we authenticate to the Cassandra
data store. So we cannot use any other transport, because we cannot
authenticate to our data store. Ideally, what we need is a default system
user for each tenant, so that by only figuring out which tenant a request
belongs to, we can write the events to the data store. For example, we could
consume data from a JMS queue and write it to the super-tenant's space.

Also, in toolboxes, the stream definitions need to contain a
username/password pair to create streams and their respective
representations in the data store. Ideally, we should just identify the
tenant the toolbox is deployed for and do the required data operations
internally.
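To make the distinction concrete, the difference between credential-bound and
tenant-scoped stream creation could be sketched as below. All of the names here
(StreamAdmin, defineStream, the logging implementation) are made up purely for
discussion; they are not actual BAM or toolbox code.

```java
import java.util.*;

// Hypothetical illustration: today's credential-bound stream creation vs. the
// tenant-scoped form proposed above. All names are invented for discussion.
interface StreamAdmin {
    // Current style: each toolbox must carry a username/password pair.
    void defineStream(String username, String password, String streamDef);

    // Proposed style: the runtime resolves the tenant (e.g. from the JMS
    // queue or deployment context) and acts as a default system user for it.
    void defineStream(int tenantId, String streamDef);
}

// Trivial implementation that just records what was requested.
class LoggingStreamAdmin implements StreamAdmin {
    final List<String> log = new ArrayList<>();

    public void defineStream(String username, String password, String streamDef) {
        log.add("user=" + username + " stream=" + streamDef);
    }

    public void defineStream(int tenantId, String streamDef) {
        log.add("tenant=" + tenantId + " stream=" + streamDef);
    }
}
```

With the second form, no credential ever needs to be packaged inside a toolbox.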

Solution
======

The proposed solution is to create a clear data abstraction layer for BAM.
Rather than having just Cassandra plus some other RDBMS for storing events
and analyzed data, we propose a single interface called "AnalyticsDataStore"
to keep all the required data and its metadata. This would be the store for
all the events coming into BAM and also the place to put summarized data.
AnalyticsDataStore will have several implementations, with backing data
stores such as Cassandra, MongoDB and RDBMSs. The data bridge connector for
BAM will be implemented to simply write data to AnalyticsDataStore, and we
will also have a Hive storage handler called
"AnalyticsDataStoreStorageHandler" which reads and writes data through the
same common data store. So users will have no idea where the data is going
or where it is accessed from; it is simply an implementation detail.

Details such as indexing (for activity search and incremental processing),
pagination etc. would be built into the AnalyticsDataStore interface. The
interface will also contain facilities required for aspects such as data
locality in Hadoop; all of these functions will be implemented by the
concrete implementations, or at least given no-op implementations where they
are not supported. Other metadata storage requirements can be served the
same way; for example, a Hive metastore implementation can be done on top of
AnalyticsDataStore, so we can remove the dependency on the current
RDBMS-based metastore and eliminate another configuration point in BAM.
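For illustration, the AnalyticsDataStore contract could look roughly like the
sketch below. Only the interface name comes from this proposal; the method
names, signatures and the in-memory implementation are assumptions made up for
discussion, not a final design.

```java
import java.util.*;

// Hypothetical sketch of the proposed AnalyticsDataStore contract.
// Method names/signatures are illustrative assumptions, not a final design.
interface AnalyticsDataStore {
    void put(int tenantId, String tableName, Map<String, Object> record);
    // Pagination built into the contract, per the proposal.
    List<Map<String, Object>> get(int tenantId, String tableName,
                                  int offset, int limit);
    // Indexing/search (e.g. for activity search) also part of the contract.
    List<Map<String, Object>> search(int tenantId, String tableName,
                                     String query);
}

// Minimal in-memory implementation, standing in for a Cassandra, MongoDB
// or RDBMS backed implementation.
class InMemoryAnalyticsDataStore implements AnalyticsDataStore {
    private final Map<String, List<Map<String, Object>>> tables = new HashMap<>();

    private String key(int tenantId, String tableName) {
        return tenantId + ":" + tableName;
    }

    public void put(int tenantId, String tableName, Map<String, Object> record) {
        tables.computeIfAbsent(key(tenantId, tableName),
                               k -> new ArrayList<>()).add(record);
    }

    public List<Map<String, Object>> get(int tenantId, String tableName,
                                         int offset, int limit) {
        List<Map<String, Object>> all =
            tables.getOrDefault(key(tenantId, tableName), Collections.emptyList());
        int from = Math.min(offset, all.size());
        int to = Math.min(from + limit, all.size());
        return all.subList(from, to);
    }

    public List<Map<String, Object>> search(int tenantId, String tableName,
                                            String query) {
        // A real backend would use its index; here we simply scan values.
        List<Map<String, Object>> hits = new ArrayList<>();
        for (Map<String, Object> rec : get(tenantId, tableName, 0, Integer.MAX_VALUE)) {
            if (rec.values().toString().contains(query)) {
                hits.add(rec);
            }
        }
        return hits;
    }
}
```

A data bridge connector would then only call something like put(...), and a
Hive storage handler would only call get(...), with neither knowing which
backing store is actually in use.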

With the above approach, we can create solid functionality within BAM
without worrying about the complexities that come with swapping out
different data stores. Our toolbox analytic scripts can be written in a data
store agnostic way, which is a big plus: we can implement a toolbox once and
forget about it. Since we will use this same data store for summarized data
as well, summaries won't go to the usual RDBMS-based tables. The earlier
argument for those tables was that many existing tools can visualize data in
RDBMS tables, but this requirement is reduced now that BAM itself is going
to provide rich visualization support with UES. AnalyticsDataStore
functionality will also be exposed through a well defined REST API and a
Java API, so external tools can still access this data if needed.
Functionality such as data archival will also go through this interface,
rather than directly to the back-end data store. And because of this
centralized API-based data access, multi-tenancy can be implemented as an
implementation detail, where we are free to store the data internally in any
structure we want; for example, for Cassandra, we can keep a single admin
user in a configuration file and store all the tenant based data in a single
space.
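The last point, multi-tenancy as an implementation detail, could be sketched
as follows. The class and method names are hypothetical, purely to illustrate
the idea of one admin credential plus tenant-prefixed keys in a single space;
this is not actual BAM code.

```java
import java.util.*;

// Hypothetical sketch: a plain (non-MT) store is accessed with one admin
// user read from configuration, and tenant isolation is done internally by
// key prefixing. All names here are invented for illustration.
class TenantAwareStore {
    // Stands in for one shared Cassandra keyspace / database.
    private final Map<String, String> backend = new HashMap<>();
    private final String adminUser; // single admin credential from a config file

    TenantAwareStore(String adminUser) {
        this.adminUser = adminUser; // a real impl would authenticate here
    }

    // Callers pass only the tenant id; the credential never leaves this layer.
    void write(int tenantId, String key, String value) {
        backend.put(tenantId + "/" + key, value); // tenant-prefixed key
    }

    String read(int tenantId, String key) {
        return backend.get(tenantId + "/" + key);
    }
}
```

Since callers never see the prefixing scheme, the internal layout can change
freely per backend without affecting toolboxes or the REST/Java APIs.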

Also, users will no longer go directly to the back-end data store to browse
data; they will simply use the API with the proper user credentials to
retrieve/update data. We should therefore also remove data store specific
tools such as the Cassandra explorer from BAM, because browsing the raw data
there may not make sense to users, and in any case we should not keep tools
specific to one data store when we will be supporting many. In the end, the
aim is to solve all the issues mentioned earlier with the suggested layered
approach, and ultimately create a much more stable and functional BAM. Any
comments on this idea are appreciated.

Cheers,
Anjana.
-- 
*Anjana Fernando*
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
