What happens in a situation where records are frequently deleted as well? When old data are purged, what happens to the respective partitions?
Thanks.

*Maninda Edirisooriya*
Senior Software Engineer
*WSO2, Inc.* lean.enterprise.middleware
*Blog* : http://maninda.blogspot.com/
*E-mail* : [email protected]
*Skype* : @manindae
*Twitter* : @maninda

On Mon, Mar 16, 2015 at 11:25 AM, Gokul Balakrishnan <[email protected]> wrote:

> Hi all,
>
> For BAM 3.0, we are introducing a data layer which handles all operations
> intended for the back-end storage mechanism in a transparent manner. The
> data layer consists of two parts: the data source (which handles the actual
> record-level operations) and the data service (which provides a way for
> clients to interact with the data layer). For the initial 3.0 release, we
> are providing RDBMS and HBase/HDFS implementations of this layer.
>
> In the HBase analytics record store, we require a mechanism to partition
> all records in a given table based on a certain partition index, whereby
> the partitioned records will be passed through the analytics data source to
> Spark for batch processing. However, due to the way HBase handles rows,
> there is no way to jump to an arbitrary row and retrieve n subsequent rows
> unless the row ID is known in advance; in other words, there is no way
> to fetch, say, 500 more records starting from record #1000 out of the box,
> unless we already know the row ID of record #1000 (which we can only infer
> through an expensive serial scan of the table).
>
> To overcome this limitation, we propose the following strategy, where we
> keep a virtual histogram of the frequency of incoming records and create
> partitions from it:
>
> - Maintain a meta-information table which stores the record count as
> records are physically written to the data table, based on their timestamps
> at a certain granularity (say 100 ms). This lets us find the number of
> records written in any 100 ms interval.
>
> - When partitioning is requested (say with partition index i), we first
> try to find the partition boundaries at the largest possible granularity
> (say by years). Then, if a pre-set tolerance is exceeded (say 15%), we
> drill down to months instead and repeat the same process. For example, if
> the number of records from 2015 exceeds that from 2014 by more than 15%,
> we drill down to the month level to find partitions of more uniform size.
>
> - This process is repeated until partitions whose sizes differ within the
> threshold are found; it can drill down as far as the stated 100 ms
> granularity.
>
> The following sketch shows the approximate partitions generated for a
> partition index of 20 (i.e. 20 partitions):
>
>
>
> Since the start/end timestamps of each partition are now known, we can
> then look up the pre-existing index table (which has timestamps as the row
> key) with each partition's start/end times and create batches of the
> actual records.
>
> Feedback/comments on the above are highly appreciated.
>
> Thanks,
> Gokul.
>
> --
> *Balakrishnan Gokulakrishnan*
> Software Engineer,
> WSO2, Inc. http://wso2.com
> Mob: +94 77 593 5789 | +1 650 272 9927
>
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
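The drill-down strategy quoted above could be sketched roughly as follows. This is an illustrative Python sketch only, not the BAM 3.0 implementation: all names are hypothetical, the greedy grouping of buckets into partitions is one possible choice, and per-bucket counts are assumed to come from the proposed 100 ms meta table (here they are computed directly from raw timestamps for self-containment).

```python
# Illustrative sketch of histogram-based partitioning: start at a coarse
# time granularity and drill down until partition sizes are uniform
# within a tolerance (or the finest 100 ms granularity is reached).
from collections import Counter

# Coarse-to-fine granularities in milliseconds; 100 ms matches the
# proposed meta-table granularity. (Approximate year/month lengths.)
GRANULARITIES_MS = [
    365 * 24 * 3600 * 1000,  # ~year
    30 * 24 * 3600 * 1000,   # ~month
    24 * 3600 * 1000,        # day
    3600 * 1000,             # hour
    1000,                    # second
    100,                     # 100 ms (finest)
]

def bucket_counts(timestamps_ms, granularity_ms):
    """Count records per time bucket; in the real design these counts
    would be read from the meta-information table instead."""
    counts = Counter(ts // granularity_ms for ts in timestamps_ms)
    return sorted(counts.items())  # [(bucket, count), ...] in time order

def group_buckets(buckets, num_partitions, tolerance):
    """Greedily group consecutive buckets into num_partitions ranges and
    report whether the sizes differ by no more than the tolerance."""
    total = sum(c for _, c in buckets)
    target = total / num_partitions
    parts, current, start = [], 0, buckets[0][0]
    for bucket, count in buckets:
        current += count
        if current >= target and len(parts) < num_partitions - 1:
            parts.append((start, bucket, current))
            start, current = bucket + 1, 0
    parts.append((start, buckets[-1][0], current))
    sizes = [c for _, _, c in parts]
    ok = (max(sizes) - min(sizes)) <= tolerance * max(sizes)
    return parts, ok

def plan_partitions(timestamps_ms, num_partitions, tolerance=0.15):
    """Drill down from the coarsest granularity until partition sizes
    are within tolerance; fall back to 100 ms if they never are.
    Returns [(start_ms, end_ms, record_count), ...]."""
    for granularity in GRANULARITIES_MS:
        buckets = bucket_counts(timestamps_ms, granularity)
        if len(buckets) < num_partitions:
            continue  # too coarse to yield enough partitions
        parts, ok = group_buckets(buckets, num_partitions, tolerance)
        if ok or granularity == GRANULARITIES_MS[-1]:
            # Convert bucket indices back to millisecond time ranges;
            # these ranges would then drive index-table lookups.
            return [(s * granularity, (e + 1) * granularity - 1, c)
                    for s, e, c in parts]
    return []
```

For instance, 10,000 records spread evenly over the first ~17 minutes resolve at second granularity into equal-sized partitions, while a skewed distribution forces the planner down to the 100 ms level before the tolerance is met.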
