What happens in a situation where records are frequently deleted as well? When old data are purged, what happens to the respective partitions?
Thanks.

*Maninda Edirisooriya*
Senior Software Engineer
*WSO2, Inc.* lean.enterprise.middleware
*Blog* : http://maninda.blogspot.com/
*E-mail* : [email protected]
*Skype* : @manindae
*Twitter* : @maninda

On Mon, Mar 16, 2015 at 11:25 AM, Gokul Balakrishnan <[email protected]> wrote:

> Hi all,
>
> For BAM 3.0, we are introducing a data layer which handles all operations
> intended for the back-end storage mechanism in a transparent manner. The
> data layer consists of two parts: the data source (which handles the actual
> record-level operations) and the data service (which provides a way for
> clients to interact with the data layer). For the initial 3.0 release, we
> are providing RDBMS and HBase/HDFS implementations of this layer.
>
> In the HBase analytics record store, we require a mechanism to partition
> all records in a given table based on a certain partition index, whereby
> the partitioned records will be passed through the analytics data source to
> Spark for batch processing. However, due to the way HBase handles rows,
> there is no way to jump to an arbitrary row and retrieve n subsequent rows
> unless the row ID is known in advance; in other words, there is no way
> to fetch, say, 500 more records starting from record #1000 out of the box,
> unless we already know the row ID of record #1000 (which we can only infer
> through an expensive serial scan of the table).
>
> To overcome this limitation, we propose the following strategy, where we
> keep a virtual histogram of the frequency of incoming records and create
> partitions from it:
>
> - Maintain a meta-information table which stores the record count as
> records are physically written to the data table, based on their timestamps
> at a certain granularity (say 100 ms). This lets us find the number of
> records written in any 100 ms interval.
>
> - When partitioning is requested (say with partition index i), we first
> try to find the partition boundaries at the largest possible granularity
> (say by years). Then, if a pre-set tolerance is exceeded (say 15%), we
> drill down to months instead and repeat the same process. For example, if
> the number of records from 2015 exceeds that from 2014 by more than 15%,
> we drill down to the month level to find partitions of more uniform size.
>
> - This process is repeated until partitions whose sizes differ within the
> threshold are found; it can drill down as far as the stated 100 ms
> granularity.
>
> The following sketch shows the approximate partitions generated for a
> partition index of 20 (i.e. 20 partitions):
>
>
>
> Since the start/end timestamps of each partition are now known, we can
> then look up the pre-existing index table (which has timestamps as the row
> key) with each partition's start/end times and create batches of the
> actual records.
>
> Feedback/comments on the above are highly appreciated.
>
> Thanks,
> Gokul.
>
> --
> *Balakrishnan Gokulakrishnan*
> Software Engineer,
> WSO2, Inc. http://wso2.com
> Mob: +94 77 593 5789 | +1 650 272 9927
>
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
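The drill-down strategy quoted above could be sketched roughly as follows. This is an illustrative Python sketch only, not the BAM 3.0 implementation: all names are hypothetical, the greedy grouping of buckets into partitions is one possible choice, and per-bucket counts are assumed to come from the proposed 100 ms meta table (here they are computed directly from raw timestamps for self-containment).

```python
# Illustrative sketch of histogram-based partitioning: start at a coarse
# time granularity and drill down until partition sizes are uniform
# within a tolerance (or the finest 100 ms granularity is reached).
from collections import Counter

# Coarse-to-fine granularities in milliseconds; 100 ms matches the
# proposed meta-table granularity. (Approximate year/month lengths.)
GRANULARITIES_MS = [
    365 * 24 * 3600 * 1000,  # ~year
    30 * 24 * 3600 * 1000,   # ~month
    24 * 3600 * 1000,        # day
    3600 * 1000,             # hour
    1000,                    # second
    100,                     # 100 ms (finest)
]

def bucket_counts(timestamps_ms, granularity_ms):
    """Count records per time bucket; in the real design these counts
    would be read from the meta-information table instead."""
    counts = Counter(ts // granularity_ms for ts in timestamps_ms)
    return sorted(counts.items())  # [(bucket, count), ...] in time order

def group_buckets(buckets, num_partitions, tolerance):
    """Greedily group consecutive buckets into num_partitions ranges and
    report whether the sizes differ by no more than the tolerance."""
    total = sum(c for _, c in buckets)
    target = total / num_partitions
    parts, current, start = [], 0, buckets[0][0]
    for bucket, count in buckets:
        current += count
        if current >= target and len(parts) < num_partitions - 1:
            parts.append((start, bucket, current))
            start, current = bucket + 1, 0
    parts.append((start, buckets[-1][0], current))
    sizes = [c for _, _, c in parts]
    ok = (max(sizes) - min(sizes)) <= tolerance * max(sizes)
    return parts, ok

def plan_partitions(timestamps_ms, num_partitions, tolerance=0.15):
    """Drill down from the coarsest granularity until partition sizes
    are within tolerance; fall back to 100 ms if they never are.
    Returns [(start_ms, end_ms, record_count), ...]."""
    for granularity in GRANULARITIES_MS:
        buckets = bucket_counts(timestamps_ms, granularity)
        if len(buckets) < num_partitions:
            continue  # too coarse to yield enough partitions
        parts, ok = group_buckets(buckets, num_partitions, tolerance)
        if ok or granularity == GRANULARITIES_MS[-1]:
            # Convert bucket indices back to millisecond time ranges;
            # these ranges would then drive index-table lookups.
            return [(s * granularity, (e + 1) * granularity - 1, c)
                    for s, e, c in parts]
    return []
```

For instance, 10,000 records spread evenly over the first ~17 minutes resolve at second granularity into equal-sized partitions, while a skewed distribution forces the planner down to the 100 ms level before the tolerance is met.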
