Hi all,

For BAM 3.0, we are introducing a data layer which handles all operations
intended for the back-end storage mechanism in a transparent manner. The
data layer consists of two parts: the data source, which handles the actual
record-level operations, and the data service, which provides a way for
clients to interact with the data layer. For the initial 3.0 release, we
are providing RDBMS and HBase/HDFS implementations of this layer.
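To make the split concrete, here is a minimal sketch of the two parts. All
names here (AnalyticsDataSource, AnalyticsDataService, the method
signatures) are hypothetical illustrations of the shape of the layer, not
the actual BAM 3.0 API:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class AnalyticsDataSource(ABC):
    """Record-level operations against a back-end store (RDBMS, HBase/HDFS, ...)."""

    @abstractmethod
    def put(self, table: str, records: List[Dict[str, Any]]) -> None: ...

    @abstractmethod
    def get(self, table: str, start: int, count: int) -> List[Dict[str, Any]]: ...

class AnalyticsDataService:
    """Client-facing facade; delegates to whichever data source is configured."""

    def __init__(self, source: AnalyticsDataSource):
        self._source = source

    def insert(self, table: str, records: List[Dict[str, Any]]) -> None:
        self._source.put(table, records)

    def read(self, table: str, start: int, count: int) -> List[Dict[str, Any]]:
        return self._source.get(table, start, count)
```

Clients would talk only to the service, so swapping the RDBMS
implementation for the HBase/HDFS one stays transparent to them.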

In the HBase analytics record store, we require a mechanism to partition
all records in a given table based on a certain partition index, whereby
the partitioned records will be passed through the analytics data source to
Spark for batched processing. However, due to the way HBase orders rows,
there is no way to jump to an arbitrary row and retrieve the n subsequent
rows unless that row's key is known in advance; in other words, there is no
way to get, say, 500 more records starting from record #1000 out of the
box, unless we already know the row key of record #1000 (which we can only
infer through an expensive serial scan of the table).
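To illustrate the cost, here is a toy simulation (plain Python, not the
HBase client API): rows arrive in row-key order from a forward-only scan,
so reaching record #1000 without its row key means consuming and discarding
the first 1000 rows.

```python
def fetch_after_offset(scan, offset, count):
    """scan: an iterator of (row_key, record) pairs in row-key order, as an
    HBase scan yields them. Without knowing the row key at `offset`, the
    only option is a serial skip over the first `offset` rows."""
    it = iter(scan)
    for _ in range(offset):
        next(it)  # O(offset) work, all discarded
    return [next(it) for _ in range(count)]
```

If the row key at the offset were known, the scan could instead start
directly from that key and the skip would disappear.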

In order to overcome this limitation, we propose the following strategy
where we keep a virtual histogram of the frequency of incoming records and
create partitions:

- Maintain a table with meta-information which stores record counts as
records are physically written to the data table, bucketed by timestamp at
a certain granularity (say 100 ms). This lets us find the number of records
written in any 100 ms interval.

- When partitioning is requested (say with partition index i), we first try
to find the partition boundaries at the largest possible granularity (say
by year). If a pre-set tolerance is exceeded (say 15%), we drill down to
months instead and repeat the process. Example: if the number of records
from 2015 exceeds that from 2014 by more than 15%, we drill down to the
month level to find partitions of more uniform size.

- This process is repeated until partitions whose sizes differ within the
threshold are found, drilling down as far as the stated 100 ms granularity.
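The drill-down above can be sketched as follows. This is one possible
reading of the strategy, with hypothetical names and a greedy grouping
step; the real implementation would read the counts from the HBase
meta-table rather than a dict:

```python
from collections import OrderedDict

def aggregate(buckets, granularity_ms):
    """Roll the fine-grained (100 ms) meta-table counts up to a coarser
    granularity; `buckets` maps bucket start timestamp (ms) -> record count."""
    agg = OrderedDict()
    for ts, count in sorted(buckets.items()):
        key = (ts // granularity_ms) * granularity_ms
        agg[key] = agg.get(key, 0) + count
    return agg

def within_tolerance(counts, tolerance):
    """True if bucket sizes differ by no more than `tolerance` (e.g. 0.15)."""
    lo, hi = min(counts), max(counts)
    return lo > 0 and (hi - lo) / lo <= tolerance

def make_partitions(buckets, k, granularities, tolerance=0.15):
    """Drill down from the coarsest granularity until the aggregated bucket
    sizes are uniform within `tolerance` (and there are at least k buckets),
    then greedily group buckets into k partitions of roughly equal record
    counts. Returns a list of (start_ts, end_ts, record_count) tuples."""
    agg, g = buckets, granularities[-1]
    for g in granularities:  # coarse -> fine, ending at the 100 ms floor
        agg = aggregate(buckets, g)
        if len(agg) >= k and within_tolerance(list(agg.values()), tolerance):
            break
    total = sum(agg.values())
    target = total / k
    parts, acc, start = [], 0, None
    items = list(agg.items())
    for i, (ts, count) in enumerate(items):
        if start is None:
            start = ts
        acc += count
        if (acc >= target and len(parts) < k - 1) or i == len(items) - 1:
            parts.append((start, ts + g, acc))
            acc, start = 0, None
    return parts
```

Note the extra `len(agg) >= k` check: a granularity that yields fewer
buckets than requested partitions cannot produce k partitions, so we drill
down in that case as well.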

The following sketch shows the approximate generated partitions for a
partition index 20 (20 partitions):



Since the start/end timestamps of each partition are now known, we can then
look up the pre-existing index table (which has timestamps as the row key)
with each partition's start/end times and create batches of actual records.
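A sketch of that last lookup step, using a hypothetical in-memory stand-in
for the index table (sorted timestamps as row keys, each mapping to the
record IDs written at that instant):

```python
import bisect

def records_in_partition(index_keys, index_table, start_ts, end_ts):
    """index_keys: the index table's row keys (timestamps) in sorted order;
    index_table: timestamp -> record IDs written at that timestamp.
    Returns the batch of record IDs falling in [start_ts, end_ts)."""
    lo = bisect.bisect_left(index_keys, start_ts)
    hi = bisect.bisect_left(index_keys, end_ts)
    batch = []
    for ts in index_keys[lo:hi]:
        batch.extend(index_table[ts])
    return batch
```

Against HBase itself this would simply be a range scan over the index table
bounded by the partition's start/end timestamps.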

Highly appreciate any feedback/comments on the above.

Thanks,
Gokul.

-- 
*Balakrishnan Gokulakrishnan*
Software Engineer,
WSO2, Inc. http://wso2.com
Mob: +94 77 593 5789 | +1 650 272 9927
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
