Re: [Architecture] Data Storage Architecture Change in BAM

Anjana Fernando Mon, 09 Jun 2014 01:11:32 -0700

Hi Srinath,

On Mon, Jun 9, 2014 at 10:31 AM, Srinath Perera <[email protected]> wrote:


> Hi Anjana,
>
>
> * No support for other data stores for storing events
>
> Yes, we need to support RDBMS and Hive.
>
> * Toolboxes being bound to certain types of data sources
> Need to fix this.
>
> * Transports
> IMO, for this we can depend on ESB to support other transports for the near
> future. Need to make sure the ESB thrift mediator is working smoothly.
>
> Please note the above should go in the release following the toolboxes, not
> the immediate release which will add toolboxes. We should only allocate
> people AFTER the toolboxes are out.
>

Product toolboxes are a matter of coordinating with the product teams,
which we are starting to do, and the product teams will own their
respective toolboxes. But the BAM team itself can be allocated to product
features. In any case, we can start building the product toolboxes with
the Cassandra storage handler etc., and migrate to the new architecture
when it is finalized.

Cheers,
Anjana.


>
> --Srinath
>
>
>
>
>
> On Fri, Jun 6, 2014 at 3:31 PM, Anjana Fernando <[email protected]> wrote:
>
>> Hi,
>>
>> The BAM team has been looking into ways of improving how we currently
>> handle operations in the data layer. Here I will explain the issues caused
>> by the current BAM architecture and propose a solution to remedy them.
>>
>> Issues
>> =====
>>
>> * No support for other data stores for storing events
>>
>> At the moment, we are strictly limited to storing events in Cassandra,
>> but there has been strong interest in using other types of data stores
>> such as MongoDB, RDBMS etc., especially because it is easier for some
>> users to use their existing databases. This support is also critical for
>> making BAM functionality embeddable in other products; for example, as a
>> light-weight analytics solution, people should be able to use an
>> RDBMS-based solution.
>>
>> * Toolboxes being bound to certain types of data sources
>>
>> Here, we assume that we always retrieve data from Cassandra and write to
>> a particular RDBMS. This approach does not scale, especially for the WSO2
>> product-related toolboxes we have or are going to have, because the
>> toolboxes are then limited to a specific combination of databases. We
>> would need to support a different version of each toolbox for every
>> database combination, which is not practical to maintain, and a huge
>> effort would be spent on testing them each time.
>>
>> * Multi-tenancy limitations
>>
>> At the moment, we use our own MT Cassandra to store the events
>> tenant-wise, and because of this, we cannot use any other Cassandra
>> distribution to implement MT features. So effectively, anyone who uses
>> their own Cassandra installation cannot use MT features, which makes the
>> BAM product inconsistent in its features. Ideally, we should support
>> anyone having their own Cassandra, or indeed any supported type of
>> database, without any special modifications for MT.
>>
>> * Transports
>>
>> CEP introduced a new architecture for defining transports/data formats in
>> the system, and there are many transports such as HTTP/JMS etc., with data
>> types such as XML/Text/JSON, available to get events in. But BAM is limited
>> to the Thrift transport, because we explicitly need authentication support
>> from the transport: that is how we authenticate to the Cassandra data
>> store. So we cannot use any other transport, because we could not
>> authenticate to our data store. Ideally, what we need is a default system
>> user for a tenant, so that just by figuring out the tenant a request
>> belongs to, we are able to write the events to the data store. For
>> example, we could use a JMS queue and write the data from it to the
>> super-tenant's space.
>>
>> Also, in toolboxes, the stream definitions need to contain a
>> username/password pair to create streams and their respective
>> representation in the data store. Ideally, it should be enough to identify
>> the tenant the toolbox is deployed to and do the required data operations
>> internally.
>>
>> Solution
>> ======
>>
>> The proposed solution is to create a clear data abstraction layer for
>> BAM. Rather than having just Cassandra and some other RDBMS for storing
>> events and analyzed data, we propose a single interface called
>> "AnalyticsDataStore" to keep all the required data and its metadata. This
>> would be the store for all the events coming into BAM, and also the place
>> to put summarized data. AnalyticsDataStore will have several
>> implementations, with backing data stores such as Cassandra, MongoDB and
>> RDBMS. The data bridge connector for BAM will be implemented to simply
>> write data to AnalyticsDataStore, and we will have a Hive storage handler
>> called "AnalyticsDataStoreStorageHandler" which reads and writes data to
>> our common data store. So users will have no idea where the data is going
>> or where it is accessed from; that is simply an implementation detail.
>> Details such as indexing (for activity search, incremental processing),
>> pagination etc. would be built into the AnalyticsDataStore interface. The
>> interface will also contain the facilities required for aspects such as
>> data locality in Hadoop, and all these functions will be implemented by
>> the concrete implementations, or at least given no-op operations where
>> they are not supported. Other metadata storage requirements, such as a
>> Hive metastore implementation, can also be met using AnalyticsDataStore,
>> so we can remove the dependency on the current RDBMS-based metastore and
>> eliminate another configuration point in BAM.
>>
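To make the proposed interface concrete, here is a minimal Java sketch. The thread only names "AnalyticsDataStore"; every method, the DataRecord type, the integer tenantId and the in-memory backend below are assumptions for illustration, not the actual BAM API. It shows pagination as part of the interface and a no-op default for an optional feature (data locality) that a backend may not support.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record type: one event or one row of summarized data.
final class DataRecord {
    final String id;
    final Map<String, Object> values;

    DataRecord(String id, Map<String, Object> values) {
        this.id = id;
        this.values = values;
    }
}

// Assumed shape of the proposed interface; the thread names it but
// does not define any of these methods.
interface AnalyticsDataStore {
    void put(int tenantId, String table, List<DataRecord> records);

    // Paginated read: skips 'offset' records and returns at most 'count'.
    List<DataRecord> get(int tenantId, String table, int offset, int count);

    // Optional feature (e.g. data locality hints for Hadoop). Backends
    // that cannot support it inherit this no-op default.
    default List<String> preferredLocations(int tenantId, String table) {
        return Collections.emptyList();
    }
}

// Toy in-memory backend standing in for Cassandra, MongoDB or an RDBMS.
final class InMemoryAnalyticsDataStore implements AnalyticsDataStore {
    private final Map<String, List<DataRecord>> tables = new HashMap<>();

    private static String key(int tenantId, String table) {
        return tenantId + "/" + table;
    }

    @Override
    public synchronized void put(int tenantId, String table, List<DataRecord> records) {
        tables.computeIfAbsent(key(tenantId, table), k -> new ArrayList<>()).addAll(records);
    }

    @Override
    public synchronized List<DataRecord> get(int tenantId, String table, int offset, int count) {
        List<DataRecord> all =
                tables.getOrDefault(key(tenantId, table), Collections.emptyList());
        // Clamp the requested window to the records that actually exist.
        int from = Math.min(Math.max(offset, 0), all.size());
        int to = (int) Math.min((long) from + Math.max(count, 0), (long) all.size());
        return new ArrayList<>(all.subList(from, to));
    }
}
```

A data bridge connector or a Hive storage handler written against this interface would call only put and get, so swapping InMemoryAnalyticsDataStore for another backend would not change the calling code.
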
>> With the above approach, we can create solid functionality within BAM
>> without thinking about the complexities that arise when we swap out
>> different data sources. Our toolbox analytic scripts can also be written
>> in a data store agnostic way, which is a big plus point: we can implement
>> a toolbox once and forget about it. Since we will use this same data store
>> for summarized data as well, that data won't go to the usual RDBMS-based
>> tables. The earlier argument for those tables was that many existing tools
>> can already visualize data from RDBMS tables, but this requirement will be
>> reduced, since BAM itself is going to provide rich visualization support
>> with UES. AnalyticsDataStore functionality will also be exposed through a
>> well-defined REST API and a Java API, so external tools can access this
>> data if needed. Functionality such as data archival will use this
>> interface as well, rather than going directly to the back-end data store.
>> And because of this centralized, API-based data access, multi-tenancy can
>> be handled as an implementation detail, where we are free to store the
>> data in any structure we want internally; for example, for Cassandra, we
>> can keep a single admin user in a configuration file and store all the
>> tenant-based data in a single space.
>>
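One way to picture "all the tenant based data in a single space" is a tenant-prefixed row key. The sketch below is an assumption, not something the thread specifies: the SharedSpaceStore class, the "tenantId|table|recordId" key layout and the sorted map standing in for the backing store are all hypothetical. It shows that a per-tenant scan can stay isolated even though every tenant's rows live in the same physical space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: all tenants share one sorted key space, and
// isolation comes from the key layout rather than per-tenant credentials.
final class SharedSpaceStore {
    // Stand-in for a single keyspace/table in the backing store.
    private final TreeMap<String, String> rows = new TreeMap<>();

    // Assumed key layout: "<tenantId>|<table>|<recordId>".
    private static String prefix(int tenantId, String table) {
        if (table.indexOf('|') >= 0) {
            // The separator must not appear in table names, or two
            // different (table, recordId) pairs could produce the same key.
            throw new IllegalArgumentException("table name must not contain '|'");
        }
        return tenantId + "|" + table + "|";
    }

    synchronized void write(int tenantId, String table, String recordId, String payload) {
        rows.put(prefix(tenantId, table) + recordId, payload);
    }

    // Returns only the rows of the given tenant and table, ordered by record id.
    synchronized List<String> scan(int tenantId, String table) {
        String p = prefix(tenantId, table);
        List<String> out = new ArrayList<>();
        // Keys sharing a prefix are contiguous in sorted order, so the scan
        // starts at the prefix and stops at the first key that lacks it.
        for (Map.Entry<String, String> e : rows.tailMap(p).entrySet()) {
            if (!e.getKey().startsWith(p)) {
                break;
            }
            out.add(e.getValue());
        }
        return out;
    }
}
```

With this layout the store itself needs only one set of credentials; which tenant a request belongs to is decided above the store and passed in as tenantId.
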
>> Users will also no longer go directly to the backend data store to browse
>> for data; they will simply use the API with the proper user credentials to
>> retrieve/update data. So we should also remove data store specific tools
>> such as the Cassandra explorer from BAM, because browsing the raw data
>> there may not make sense to the users. In any case, we should not keep any
>> data store specific tools, since we will be supporting many stores. In the
>> end, the aim is to solve all the issues mentioned earlier with the
>> suggested layered approach, and ultimately create a much more stable and
>> functional BAM. Any comments on this idea are appreciated.
>>
>> Cheers,
>> Anjana.
>> --
>> *Anjana Fernando*
>> Senior Technical Lead
>> WSO2 Inc. | http://wso2.com
>> lean . enterprise . middleware
>>
>
>
>
> --
> ============================
> Srinath Perera, Ph.D.
>   Director, Research, WSO2 Inc.
>   Visiting Faculty, University of Moratuwa
>   Member, Apache Software Foundation
>   Research Scientist, Lanka Software Foundation
>   Blog: http://srinathsview.blogspot.com/
>   Photos: http://www.flickr.com/photos/hemapani/
>    Phone: 0772360902
>



-- 
*Anjana Fernando*
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
