Hi Sanjiva,

On Thu, Nov 6, 2014 at 4:01 PM, Sanjiva Weerawarana <[email protected]>
wrote:

> Anjana I think the idea was for the file system -> HDFS upload to happen
> via a simple cron job type thing.
>

Even so, we would just be moving the problem to another area; the overall
work done by the hardware is still the same (writing to disk, reading it
back, writing it to the network). That is, even though we can reach a very
high throughput initially by writing to the local disk first, later on we
have to read it back and write it to HDFS over the network, which is the
slower part of the operation. So if we keep loading the machine at an
extreme throughput, we will eventually run out of space on that disk.
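To make the concern concrete, a back-of-envelope sketch (all the rates and
the disk size below are assumed, illustrative numbers, not measurements):

```python
# If events arrive faster than the receiver can drain its local buffer
# into HDFS, the on-disk backlog grows without bound and the disk
# eventually fills. Illustrative numbers only.
ingest_mb_s = 200     # assumed sustained ingest rate into the receiver
drain_mb_s = 120      # assumed rate of reading back + writing to HDFS
disk_free_gb = 500    # assumed free space on the receiver's disk

backlog_mb_s = ingest_mb_s - drain_mb_s              # net backlog growth
hours_until_full = (disk_free_gb * 1024) / backlog_mb_s / 3600
print(f"disk fills in ~{hours_until_full:.1f} hours")
```

The point is only that local-disk buffering buys burst capacity, not
sustained capacity: the steady-state limit is still the HDFS write path.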

Cheers,
Anjana.


>
> On Wed, Nov 5, 2014 at 9:19 AM, Anjana Fernando <[email protected]> wrote:
>
>> Hi Srinath,
>>
>> Wouldn't it be better if we just make the batch size bigger? That is,
>> let's just have a sizable local in-memory store, probably close to 64MB,
>> which is the default HDFS block size, and flush the buffer only after it
>> is filled, or perhaps when the receiver is idle. I was just thinking,
>> writing to the file system first will itself be expensive, since there
>> are the additional steps of writing all the records to the local file
>> system, reading them back, and then finally writing them to HDFS. And of
>> course, having a network file system would again be an overhead, not to
>> mention the implementation/configuration complications that would come
>> with it. IMHO, we should try to keep these scenarios as simple as
>> possible.
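The buffering policy described above could be sketched roughly like this
(a minimal sketch; the class name and the idle timeout are assumptions, and
a plain local file stands in for the HDFS append):

```python
import os
import time

class BufferedWriter:
    """Sketch: accumulate records in memory and flush once the buffer
    reaches the HDFS block size (64 MB by default), or when the receiver
    has been idle for a while. A local file stands in for HDFS here."""

    def __init__(self, path, block_size=64 * 1024 * 1024, idle_secs=30.0):
        self.path = path
        self.block_size = block_size
        self.idle_secs = idle_secs
        self.buffer = []
        self.buffered_bytes = 0
        self.last_write = time.monotonic()

    def append(self, record: bytes):
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        self.last_write = time.monotonic()
        if self.buffered_bytes >= self.block_size:
            self.flush()

    def maybe_flush_on_idle(self):
        # Called periodically; flushes if no record arrived recently.
        if self.buffer and time.monotonic() - self.last_write >= self.idle_secs:
            self.flush()

    def flush(self):
        with open(self.path, "ab") as f:   # stand-in for an HDFS append
            f.writelines(self.buffer)
        self.buffer.clear()
        self.buffered_bytes = 0
```

With a 64MB threshold, each flush is a single block-sized write, which is
exactly the shape of I/O that HDFS is designed for.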
>>
>> I'm doing our new BAM data layer implementations here [1]. I'm almost
>> done with an RDBMS implementation and am doing some refactoring now (mail
>> on this yet to come :)). I can also do an HDFS one after that and check it.
>>
>> [1]
>> https://github.com/wso2/carbon-analytics/tree/master/components/xanalytics
>>
>> Cheers,
>> Anjana.
>>
>> On Tue, Nov 4, 2014 at 6:56 PM, Srinath Perera <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> Following came out of a chat with Sanjiva on a scenario involving a very
>>> large number of events coming into BAM.
>>>
>>> Currently we use Cassandra to store the events; the numbers we got out
>>> of it have not been great, and Cassandra needs too much attention to get
>>> to those numbers.
>>>
>>> With Cassandra (or any DB) we write data as records. We can batch them,
>>> but the amount of data in one IO operation is still small. In comparison,
>>> file transfers are much, much faster, and that is the fastest way to get
>>> data from A to B.
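As a toy illustration of the batching argument (the record count and batch
size below are made-up numbers):

```python
# Same payload either way, but batching cuts the number of I/O
# operations from one per record to one per batch. Illustrative
# numbers, not measurements.
num_records = 10_000
batch_size = 500                                  # records per batch (assumed)

io_ops_per_record = num_records                   # one write() per record
io_ops_batched = -(-num_records // batch_size)    # ceil(10000 / 500) = 20

print(io_ops_per_record, io_ops_batched)          # 10000 20
```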
>>>
>>> So I am proposing to write the incoming events to a local file in the
>>> Data Receiver, and periodically append them to an HDFS file. We can
>>> arrange the data in a folder per stream and in files by timestamp (e.g.
>>> 1h of data goes to a new file), so we can selectively pull and process
>>> data using Hive. (We can use something like
>>> https://github.com/OpenHFT/Chronicle-Queue to write the data to disk.)
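The folder-per-stream, file-per-time-bucket layout could look roughly like
this (a sketch only: the function names and the 1h bucket are assumptions,
and a local directory stands in for the HDFS root; in production the append
would go through the HDFS client or Chronicle Queue):

```python
import os
import time

def target_path(base_dir, stream, event_time, bucket_secs=3600):
    """One folder per stream, one file per time bucket (1h by default),
    so Hive can selectively read only the relevant directories/files."""
    bucket = int(event_time // bucket_secs) * bucket_secs
    ts = time.strftime("%Y-%m-%d-%H", time.gmtime(bucket))
    return os.path.join(base_dir, stream, f"events-{ts}.log")

def append_event(base_dir, stream, payload: bytes, event_time=None):
    event_time = time.time() if event_time is None else event_time
    path = target_path(base_dir, stream, event_time)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "ab") as f:        # stand-in for an HDFS append
        f.write(payload)
```

Keeping the bucket boundary aligned to the hour means a Hive query for a
time range only has to touch the files whose names fall inside that range.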
>>>
>>> If the user needs to avoid losing any messages at all in case of a disk
>>> failure, they can either use a SAN or NFS, or run two replicas of the
>>> receivers (we should write some code so that only one of the receivers
>>> actually puts data into HDFS).
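The "only one replica uploads" part could be as simple as the following
sketch. It assumes the replicas share a filesystem and uses an exclusive
lock file; a real deployment would more likely use a coordination service
(e.g. ZooKeeper), and the function and file names here are hypothetical:

```python
import os

def try_become_uploader(lock_path):
    """Sketch: receiver replicas race to create a lock file; only the
    winner pushes data to HDFS. An O_EXCL file creation on a shared
    filesystem stands in for a real coordination service."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)
        return True       # this replica won; it should upload
    except FileExistsError:
        return False      # another replica is already uploading
```

Failover (detecting a dead winner and releasing its lock) is the part a
proper coordination service would handle; this sketch leaves it out.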
>>>
>>> Coding-wise, this should not be too hard. I am sure this will be many
>>> times faster than Cassandra (of course, we need to do a PoC and verify).
>>>
>>> WDYT?
>>>
>>> --Srinath
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> ============================
>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>> Site: http://people.apache.org/~hemapani/
>>> Photos: http://www.flickr.com/photos/hemapani/
>>> Phone: 0772360902
>>>
>>
>>
>>
>> --
>> *Anjana Fernando*
>> Senior Technical Lead
>> WSO2 Inc. | http://wso2.com
>> lean . enterprise . middleware
>>
>
>
>
> --
> Sanjiva Weerawarana, Ph.D.
> Founder, Chairman & CEO; WSO2, Inc.;  http://wso2.com/
> email: [email protected]; office: (+1 650 745 4499 | +94  11 214 5345)
> x5700; cell: +94 77 787 6880 | +1 408 466 5099; voip: +1 650 265 8311
> blog: http://sanjiva.weerawarana.org/; twitter: @sanjiva
> Lean . Enterprise . Middleware
>



-- 
*Anjana Fernando*
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
