Hi Srinath,

Wouldn't it be better if we just make the batch size bigger? That is, we could keep a sizable local in-memory store, probably close to 64MB (the default HDFS block size), and flush the buffer only once it fills up, or perhaps when the receiver is idle. I was thinking that writing to the local file system first will itself be expensive: there are the additional steps of writing all the records to the local file system, reading them back, and then finally writing them to HDFS. And of course, a network file system would add its own overhead, not to mention the implementation/configuration complications that would come with it. IMHO, we should keep these scenarios as simple as possible.
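A minimal sketch of that threshold/idle flush policy could look like the following (class and method names here are purely illustrative, not from the BAM code base; the flush body is a stub where the actual HDFS write would go):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: buffer serialized events in memory and flush once
// the buffer reaches a size threshold (e.g. the 64 MB HDFS block size),
// or when an idle timer fires for the receiver.
public class BufferedEventStore {
    private static final long FLUSH_THRESHOLD_BYTES = 64L * 1024 * 1024; // 64 MB

    private final List<byte[]> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private int flushCount = 0;

    public synchronized void receive(byte[] serializedEvent) {
        buffer.add(serializedEvent);
        bufferedBytes += serializedEvent.length;
        if (bufferedBytes >= FLUSH_THRESHOLD_BYTES) {
            flush();
        }
    }

    // Called by a periodic timer when the receiver has been idle for a while.
    public synchronized void flushIfIdle() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        // In a real implementation this would be one large sequential
        // append to HDFS; here we just count the flush and clear the buffer.
        flushCount++;
        buffer.clear();
        bufferedBytes = 0;
    }

    public int getFlushCount() { return flushCount; }
    public long getBufferedBytes() { return bufferedBytes; }
}
```

The point of the sketch is that each I/O operation is then one large sequential write rather than many small per-record writes.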
I'm doing our new BAM data layer implementations here [1]; I'm almost done with an RDBMS implementation, and am doing some refactoring now (mail on this yet to come :)). I can also do an HDFS one after that and check it.

[1] https://github.com/wso2/carbon-analytics/tree/master/components/xanalytics

Cheers,
Anjana.

On Tue, Nov 4, 2014 at 6:56 PM, Srinath Perera <[email protected]> wrote:
> Hi All,
>
> The following came out of a chat with Sanjiva on a scenario involving a
> very large number of events coming into BAM.
>
> Currently we use Cassandra to store the events; the numbers we got out of
> it have not been great, and Cassandra needs too much attention to reach
> even those numbers.
>
> With Cassandra (or any DB) we write data as records. We can batch them,
> but the amount of data in one IO operation is still small. In comparison,
> file transfers are much, much faster, and that is the fastest way to get
> data from A to B.
>
> So I am proposing to write the events that come in to a local file in the
> Data Receiver, and periodically append them to an HDFS file. We can
> arrange data in a folder per stream and in files by timestamp (e.g. 1h of
> data goes to a new file), so we can selectively pull and process data
> using Hive. (We could use something like
> https://github.com/OpenHFT/Chronicle-Queue to write data to disk.)
>
> If a user needs to avoid losing any messages at all in case of a disk
> failure, he can either have a SAN or NFS, or run two replicas of the
> receivers (we should write some code so that only one of the receivers
> will actually put data into HDFS).
>
> Coding wise, this should not be too hard. I am sure this will be many
> times faster than Cassandra (of course, we need to do a PoC and verify).
>
> WDYT?
>
> --Srinath
>
> --
> ============================
> Blog: http://srinathsview.blogspot.com
> Twitter: @srinath_perera
> Site: http://people.apache.org/~hemapani/
> Photos: http://www.flickr.com/photos/hemapani/
> Phone: 0772360902

--
*Anjana Fernando*
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware
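As an illustration, the folder-per-stream, file-per-hour layout Srinath proposes could be sketched as below (the class, root directory, and file suffix are hypothetical, chosen only to show how Hive could later prune by stream and time range):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Illustrative sketch: map a (streamId, event timestamp) pair to a path of
// the form <root>/<streamId>/<yyyy-MM-dd-HH>.events, so all events for one
// stream live in one folder and each hour opens a new file.
public class EventFilePathResolver {
    private final String rootDir;

    public EventFilePathResolver(String rootDir) {
        this.rootDir = rootDir;
    }

    public String resolve(String streamId, long eventTimestampMillis) {
        SimpleDateFormat hourBucket = new SimpleDateFormat("yyyy-MM-dd-HH");
        hourBucket.setTimeZone(TimeZone.getTimeZone("UTC"));
        String bucket = hourBucket.format(new Date(eventTimestampMillis));
        return rootDir + "/" + streamId + "/" + bucket + ".events";
    }
}
```

With such a layout, a Hive query over one stream for one time window only needs to read the matching hour files instead of scanning everything.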
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
