Anjana, I think the idea was for the file system -> HDFS upload to happen
via a simple cron-job-type task.
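That cron-style task could be as simple as periodically picking up spool files the receiver has finished writing and shipping them to HDFS (e.g. by shelling out to `hdfs dfs -put`). A rough sketch of the file-selection part, with all names hypothetical:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a periodic task selects local spool files that are
// "old enough" to be considered closed by the receiver, and would then ship
// each one to HDFS (the upload step itself is omitted here).
public class SpoolUploader {
    // Select files whose last modification is before the cutoff, i.e. the
    // receiver has stopped writing to them.
    public static List<File> filesReadyForUpload(File spoolDir, long cutoffMillis) {
        List<File> ready = new ArrayList<>();
        File[] files = spoolDir.listFiles();
        if (files == null) return ready;
        for (File f : files) {
            if (f.isFile() && f.lastModified() < cutoffMillis) {
                ready.add(f);
            }
        }
        return ready;
    }

    public static void main(String[] args) throws Exception {
        File spool = new File(System.getProperty("java.io.tmpdir"), "bam-spool-demo");
        spool.mkdirs();
        File old = new File(spool, "events-1.dat");
        old.createNewFile();
        old.setLastModified(System.currentTimeMillis() - 3_600_000L); // 1h old
        File fresh = new File(spool, "events-2.dat");
        fresh.createNewFile();
        fresh.setLastModified(System.currentTimeMillis()); // still being written

        long cutoff = System.currentTimeMillis() - 600_000L; // 10 minutes ago
        List<File> ready = filesReadyForUpload(spool, cutoff);
        System.out.println("ready=" + ready.size()); // only the old file qualifies
    }
}
```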

On Wed, Nov 5, 2014 at 9:19 AM, Anjana Fernando <[email protected]> wrote:

> Hi Srinath,
>
> Wouldn't it be better if we just make the batch size bigger? That is,
> let's have a sizable local in-memory store, probably close to 64MB, which
> is the default HDFS block size, and flush the buffer only after it is
> filled, or perhaps when the receiver is idle. I was just thinking that
> writing to the file system first will itself be expensive: there are the
> additional steps of writing all the records to the local file system,
> reading them back, and then finally writing them to HDFS. And of course, a
> network file system would again be an overhead, not to mention the
> implementation/configuration complications that would come with it. IMHO,
> we should keep these scenarios as simple as possible.
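A minimal sketch of that flush policy (class and threshold names are hypothetical; a real receiver would write the flushed batch to HDFS instead of just counting flushes):

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: buffer incoming event records in memory and flush
// when the buffer reaches a size threshold (e.g. 64MB, the default HDFS
// block size) or when the receiver goes idle.
public class BatchBuffer {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final int flushThresholdBytes;
    private int flushCount = 0;

    public BatchBuffer(int flushThresholdBytes) {
        this.flushThresholdBytes = flushThresholdBytes;
    }

    public void append(byte[] record) {
        buffer.write(record, 0, record.length);
        if (buffer.size() >= flushThresholdBytes) {
            flush();
        }
    }

    // In a real receiver this would write one large block to HDFS;
    // here we only count flushes and reset the buffer.
    public void flush() {
        if (buffer.size() == 0) return;
        flushCount++;
        buffer.reset();
    }

    public int getFlushCount() { return flushCount; }

    public static void main(String[] args) {
        BatchBuffer b = new BatchBuffer(1024); // 1 KB threshold for the demo
        byte[] record = new byte[100];
        for (int i = 0; i < 50; i++) b.append(record);
        b.flush(); // flush the remainder, e.g. on idle
        System.out.println("flushes=" + b.getFlushCount()); // → flushes=5
    }
}
```

The point of the large threshold is that each HDFS write then moves one block-sized chunk instead of many small records.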
>
> I'm doing our new BAM data layer implementations here [1]. I'm almost done
> with an RDBMS implementation and am doing some refactoring now (mail on
> this yet to come :)). I can also do an HDFS one after that and check it.
>
> [1]
> https://github.com/wso2/carbon-analytics/tree/master/components/xanalytics
>
> Cheers,
> Anjana.
>
> On Tue, Nov 4, 2014 at 6:56 PM, Srinath Perera <[email protected]> wrote:
>
>> Hi All,
>>
>> The following came out of a chat with Sanjiva about a scenario involving
>> a very large number of events coming into BAM.
>>
>> Currently we use Cassandra to store the events; the numbers we got out of
>> it have not been great, and Cassandra needs too much attention to reach
>> even those numbers.
>>
>> With Cassandra (or any DB) we write data as records. We can batch them,
>> but the amount of data in one IO operation is still small. In comparison,
>> file transfers are much, much faster, and that is the fastest way to get
>> data from A to B.
>>
>> So I am proposing to write the events that come into the Data Receiver to
>> a local file, and periodically append them to an HDFS file. We can arrange
>> the data in a folder per stream, with files by timestamp (e.g. each 1h of
>> data goes to a new file), so we can selectively pull and process data
>> using Hive. (We can use something like
>> https://github.com/OpenHFT/Chronicle-Queue to write data to disk.)
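The stream/timestamp layout could look something like this (paths and names are illustrative only, not actual BAM code):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of the proposed layout: one folder per stream and one
// file per hour of data, so Hive can selectively pull data by time range.
public class HdfsPathLayout {
    private static final DateTimeFormatter HOUR =
        DateTimeFormatter.ofPattern("yyyy-MM-dd-HH").withZone(ZoneOffset.UTC);

    // e.g. /bam/events/<streamId>/<hour>.dat — base path is illustrative
    public static String pathFor(String streamId, Instant eventTime) {
        return "/bam/events/" + streamId + "/" + HOUR.format(eventTime) + ".dat";
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2014-11-04T18:56:00Z");
        System.out.println(pathFor("org.wso2.sample", t));
        // → /bam/events/org.wso2.sample/2014-11-04-18.dat
    }
}
```

A Hive query over a given hour then only has to read the files whose names fall in that range.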
>>
>> If the user needs to avoid losing any messages at all in case of a disk
>> failure, they can either use a SAN or NFS, or run two replicas of the
>> receivers (we should write some code so that only one of the receivers
>> actually puts data into HDFS).
>>
>> Coding-wise, this should not be too hard. I am sure this will be several
>> times faster than Cassandra (of course, we need to do a PoC and verify).
>>
>> WDYT?
>>
>> --Srinath
>>
>>
>>
>>
>>
>>
>> --
>> ============================
>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>> Site: http://people.apache.org/~hemapani/
>> Photos: http://www.flickr.com/photos/hemapani/
>> Phone: 0772360902
>>
>
>
>
> --
> *Anjana Fernando*
> Senior Technical Lead
> WSO2 Inc. | http://wso2.com
> lean . enterprise . middleware
>



-- 
Sanjiva Weerawarana, Ph.D.
Founder, Chairman & CEO; WSO2, Inc.;  http://wso2.com/
email: [email protected]; office: (+1 650 745 4499 | +94  11 214 5345)
x5700; cell: +94 77 787 6880 | +1 408 466 5099; voip: +1 650 265 8311
blog: http://sanjiva.weerawarana.org/; twitter: @sanjiva
Lean . Enterprise . Middleware
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture