Hi All,

The following came out of a chat with Sanjiva about a scenario involving a
very large number of events coming into BAM.

Currently we use Cassandra to store the events. The throughput numbers we got
out of it have not been great, and Cassandra needs too much tuning and
attention even to reach those numbers.

With Cassandra (or any DB) we write data as records. We can batch them, but
the amount of data moved in one IO operation is still small. In comparison,
file transfers are much, much faster; a bulk file copy is the fastest way to
get data from A to B.

So I am proposing that the Data Receiver write incoming events to a local
file and periodically append them to a file in HDFS. We can arrange the data
into a folder per stream and a file per time window (e.g. each hour's data
goes to a new file), so we can selectively pull and process data using Hive.
(We can use something like https://github.com/OpenHFT/Chronicle-Queue to
write the data to disk.) A rough sketch is below.
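To make it concrete, here is a minimal sketch in Java. This is only an
illustration: the paths, the tab-separated record layout, and the hourly
rollover are my assumptions, and HDFS append has to be enabled on the
cluster for the fs.append() branch to work.

import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.ExcerptTailer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: buffer events in a local Chronicle Queue,
// then flush them to per-stream, per-hour HDFS files in bulk.
public class EventSpooler {

    private final ChronicleQueue queue =
            ChronicleQueue.singleBuilder("/var/bam/spool").build();
    // Single tailer so each flush continues where the last one stopped.
    private final ExcerptTailer tailer = queue.createTailer();

    // Hot path: called by the Data Receiver for every incoming event.
    public void onEvent(String streamId, String serializedEvent) {
        ExcerptAppender appender = queue.acquireAppender();
        // One text record per event; a real impl would use a binary wire format.
        appender.writeText(streamId + "\t" + serializedEvent);
    }

    // Periodic task (e.g. every few minutes): drain the local queue to HDFS.
    public void flushToHdfs() throws Exception {
        // Group the drained events by stream so each stream is one bulk write.
        Map<String, StringBuilder> byStream = new HashMap<>();
        String record;
        while ((record = tailer.readText()) != null) {
            int tab = record.indexOf('\t');
            byStream.computeIfAbsent(record.substring(0, tab), k -> new StringBuilder())
                    .append(record, tab + 1, record.length()).append('\n');
        }

        FileSystem fs = FileSystem.get(new Configuration());
        String hour = LocalDateTime.now()
                .format(DateTimeFormatter.ofPattern("yyyy-MM-dd-HH"));
        for (Map.Entry<String, StringBuilder> e : byStream.entrySet()) {
            // Folder per stream, file per hour, e.g. /bam/events/stockStream/2014-06-12-13
            Path file = new Path("/bam/events/" + e.getKey() + "/" + hour);
            try (FSDataOutputStream out =
                         fs.exists(file) ? fs.append(file) : fs.create(file)) {
                out.write(e.getValue().toString().getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}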

If a user needs to avoid losing any messages at all in case of a disk
failure, they can either use a SAN or NFS, or run two replicas of the
receivers (we would need to write some code so that only one of the receivers
actually puts data into HDFS; see the sketch below).
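The "only one receiver writes" part could be a simple leader election. Here
is one possible sketch using Apache Curator's LeaderLatch (my choice of
library, not something decided here): both replicas spool events locally, but
only the current leader runs the HDFS flush.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Hypothetical sketch: both receiver replicas spool locally,
// but only the elected leader flushes to HDFS.
public class ReplicatedSpooler {

    private final LeaderLatch latch;
    private final EventSpooler spooler; // the sketch above

    public ReplicatedSpooler(String zkConnect, EventSpooler spooler) throws Exception {
        this.spooler = spooler;
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                zkConnect, new ExponentialBackoffRetry(1000, 3));
        client.start();
        latch = new LeaderLatch(client, "/bam/receiver-leader");
        latch.start();
    }

    // Called from the periodic flush timer on every replica.
    public void maybeFlush() throws Exception {
        if (latch.hasLeadership()) {   // only one replica holds the latch
            spooler.flushToHdfs();
        }
    }
}

If the leader dies, the other replica acquires the latch and takes over the
flushing, so no single receiver is a point of failure for getting data into
HDFS.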

Coding-wise, this should not be too hard. I am fairly sure this will be
faster than Cassandra by a large factor (of course we need to do a PoC to
verify).

WDYT?

--Srinath

-- 
============================
Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
Site: http://people.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902