Hi,

The BAM team has been looking into ways of improving how we currently handle operations in the data layer. Here I will explain the issues we have because of the current BAM architecture, and propose a solution to remedy them.
Issues
=====

* No support for other data stores for storing events

At the moment, we are strictly limited to storing events in Cassandra, but there has been strong interest in using other types of data stores such as MongoDB, RDBMSs etc., especially because of the ease for some users of reusing their existing databases. Also, for BAM functionality to be embeddable in other products, this support is critical; for example, as a lightweight analytics solution, people should be able to use an RDBMS-based setup.

* Toolboxes being bound to certain types of data sources

This is the case where we assume we always retrieve data from Cassandra and write to some specific RDBMS. This approach does not scale, especially for the WSO2-product-related toolboxes we have or are going to have, because the toolboxes become limited to one specific combination of databases. We would then need to maintain a different version of each toolbox for each database combination, which is not practical, and a huge effort would be spent on testing each of these every time.

* Multi-tenancy limitations

At the moment, we use our own MT Cassandra to store events tenant-wise, and because of this, we cannot use any other Cassandra distribution out there to implement MT features. So effectively, anyone who uses their own Cassandra installation cannot use MT features, which makes the BAM product inconsistent in its features. Ideally, we should support anyone bringing their own Cassandra, or in fact any supported type of database, without any special modifications for MT.

* Transports

CEP introduced a new architecture for defining transports/data formats in the system, and there are many transports such as HTTP/JMS etc. with data types such as XML/Text/JSON available for receiving events.
But BAM is limited to the Thrift transport, because we explicitly need authentication support from the transport; that is how we authenticate to the Cassandra data store. So we cannot use any other transport, because we then cannot authenticate to our data store. Ideally, what we need is a default system user per tenant, so that by only figuring out which tenant a request belongs to, we can write the events to the data store. For example, we could consume events from a JMS queue and write that data to the super tenant's space. Also, in toolboxes, the stream definitions currently need to contain a username/password pair to create streams and their respective representations in the data store; ideally, we should just identify the tenant the toolbox is deployed for and perform the required data operations internally.

Solution
======

The proposed solution is to create a clear data abstraction layer for BAM. Rather than having just Cassandra plus some other RDBMS for storing events and analyzed data, we propose a single interface called "AnalyticsDataStore" to keep all the required data and its metadata. This would be the store for all the events coming into BAM and also the place to put summarized data. So AnalyticsDataStore will have several implementations, with backing data stores such as Cassandra, MongoDB and RDBMSs. The data bridge connector for BAM will be implemented to simply write data to AnalyticsDataStore, and we will also have a Hive storage handler called "AnalyticsDataStoreStorageHandler" which reads and writes data through this common store. Users will have no idea where the data goes or where it is read from; it is simply an implementation detail. Details such as indexing (for activity search and incremental processing), pagination etc. would be built into the AnalyticsDataStore interface.
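To make the idea concrete, here is a minimal sketch of what such an interface and one pluggable implementation could look like. All names apart from "AnalyticsDataStore" itself (Record, put/get, the in-memory store) are hypothetical, chosen only for illustration; the real interface would of course carry many more operations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One event or summary row; field names are illustrative only.
class Record {
    final String id;
    final Map<String, Object> values;
    Record(String id, Map<String, Object> values) {
        this.id = id;
        this.values = values;
    }
}

// Hypothetical shape of the proposed abstraction: every backing store
// (Cassandra, MongoDB, RDBMS, ...) would implement this interface, and
// callers (data bridge, Hive storage handler, REST API) depend only on it.
interface AnalyticsDataStore {
    void put(int tenantId, String tableName, Record record);
    // Pagination is part of the contract, as mentioned above.
    List<Record> get(int tenantId, String tableName, int offset, int limit);
}

// Minimal in-memory implementation, e.g. for tests or embedded use.
class InMemoryAnalyticsDataStore implements AnalyticsDataStore {
    private final Map<String, List<Record>> tables = new HashMap<>();

    // Tenant separation is purely an internal detail of the implementation:
    // here it is just a key prefix, in Cassandra it could be a keyspace.
    private String key(int tenantId, String tableName) {
        return tenantId + ":" + tableName;
    }

    public void put(int tenantId, String tableName, Record record) {
        tables.computeIfAbsent(key(tenantId, tableName),
                k -> new ArrayList<>()).add(record);
    }

    public List<Record> get(int tenantId, String tableName, int offset, int limit) {
        List<Record> all = tables.getOrDefault(key(tenantId, tableName),
                new ArrayList<>());
        int to = Math.min(all.size(), offset + limit);
        if (offset >= to) {
            return new ArrayList<>();
        }
        return new ArrayList<>(all.subList(offset, to));
    }
}
```

The point of the sketch is that the caller never sees which store is behind the interface, and the tenant ID is an argument rather than a set of per-tenant credentials.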
This interface will also contain facilities required for aspects such as data locality in Hadoop, so all these functions will be implemented by the concrete implementations, or at least given no-op operations where they are not supported. Other metadata storage requirements, such as a Hive metastore implementation, can also be done on top of AnalyticsDataStore, so we can remove the dependency on the current RDBMS-based metastore and eliminate another configuration point in BAM.

With the above approach, we can build solid functionality within BAM without worrying about the complexities that would arise when swapping out data stores. Our toolbox analytics scripts can be written in a data-store-agnostic way, which is a big plus point: we can implement a toolbox once and forget about it. Since we will use this same store for summarized data as well, it won't go to the usual RDBMS-based tables; the earlier argument for those was that many existing tools can visualize data in RDBMS tables, but that requirement is reduced because BAM itself is going to provide rich visualization support with UES. AnalyticsDataStore functionality will also be exposed through a well-defined REST API and a Java API, so external tools can access this data if needed. Functionality such as data archival will likewise use this interface rather than going directly to the back-end data store. And because of this centralized, API-based data access, multi-tenancy can be implemented as an implementation detail, where we are free to store the data internally in any structure we want; for example, for Cassandra, we can keep a single admin user in a configuration file and store all tenant-based data in a single space.
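The "no-op when unsupported" idea can be sketched with Java default methods. Everything here is hypothetical and only illustrative: a store that cannot report data locality simply returns an empty hint list, so Hadoop-facing code can call the method unconditionally, while a Cassandra-backed store could override it with real node information.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical optional capabilities of a data store implementation,
// with no-op defaults so back ends only override what they support.
interface AnalyticsDataStoreCapabilities {
    // Data-locality hint for Hadoop splits: hostnames holding this partition.
    default List<String> getDataLocation(String tableName, long partition) {
        return Collections.emptyList(); // no-op: no locality information
    }

    // Whether indexing (activity search, incremental processing) is available.
    default boolean isIndexingSupported() {
        return false; // no-op default
    }
}

// An RDBMS-backed store can keep both defaults as-is.
class RdbmsAnalyticsDataStore implements AnalyticsDataStoreCapabilities { }

// A Cassandra-backed store overrides them; the node-naming scheme here
// is invented purely for the example.
class CassandraAnalyticsDataStore implements AnalyticsDataStoreCapabilities {
    @Override
    public List<String> getDataLocation(String tableName, long partition) {
        return Collections.singletonList("cassandra-node-" + (partition % 3));
    }

    @Override
    public boolean isIndexingSupported() {
        return true;
    }
}
```

This keeps the interface uniform for callers such as the Hive storage handler, while letting each back end expose only what it genuinely can do.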
Also, users will no longer go directly to the back-end data store to browse data; they will simply use the API, with the proper user credentials, to retrieve/update data. We should therefore also remove data-store-specific tools such as the Cassandra explorer from BAM, because browsing the raw data there may not make sense to users; and in any case, we should not keep any data-store-specific tools, since we will be supporting many stores.

So in the end, the aim is to solve all the issues mentioned earlier with the suggested layered approach, and ultimately create a much more stable and functional BAM. Any comments on this idea are appreciated.

Cheers,
Anjana.

--
*Anjana Fernando*
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
