Hi Srinath,

Wouldn't it be better if we just make the batch size bigger? That is, we could keep a sizable local in-memory store, probably close to 64MB (the default HDFS block size), and flush the buffer only once it fills up, or perhaps when the receiver is idle. I was thinking that writing to the local file system first will itself be expensive: there are the additional steps of writing all the records to the local file system, reading them back, and then finally writing them to HDFS. And of course, a network file system would add its own overhead, not to mention the implementation/configuration complications that would come with it. IMHO, we should keep these scenarios as simple as possible.
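A minimal sketch of that threshold/idle flush policy could look like the following (class and method names here are purely illustrative, not from the BAM code base; the flush body is a stub where the actual HDFS write would go):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: buffer serialized events in memory and flush once
// the buffer reaches a size threshold (e.g. the 64 MB HDFS block size),
// or when an idle timer fires for the receiver.
public class BufferedEventStore {
    private static final long FLUSH_THRESHOLD_BYTES = 64L * 1024 * 1024; // 64 MB

    private final List<byte[]> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private int flushCount = 0;

    public synchronized void receive(byte[] serializedEvent) {
        buffer.add(serializedEvent);
        bufferedBytes += serializedEvent.length;
        if (bufferedBytes >= FLUSH_THRESHOLD_BYTES) {
            flush();
        }
    }

    // Called by a periodic timer when the receiver has been idle for a while.
    public synchronized void flushIfIdle() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        // In a real implementation this would be one large sequential
        // append to HDFS; here we just count the flush and clear the buffer.
        flushCount++;
        buffer.clear();
        bufferedBytes = 0;
    }

    public int getFlushCount() { return flushCount; }
    public long getBufferedBytes() { return bufferedBytes; }
}
```

The point of the sketch is that each I/O operation is then one large sequential write rather than many small per-record writes.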
I'm doing our new BAM data layer implementations here [1]; I'm almost done with an RDBMS implementation, and am doing some refactoring now (mail on this yet to come :)). I can also do an HDFS one after that and check it.

[1] https://github.com/wso2/carbon-analytics/tree/master/components/xanalytics

Cheers,
Anjana.

On Tue, Nov 4, 2014 at 6:56 PM, Srinath Perera <[email protected]> wrote:
> Hi All,
>
> The following came out of a chat with Sanjiva on a scenario involving a
> very large number of events coming into BAM.
>
> Currently we use Cassandra to store the events; the numbers we got out of
> it have not been great, and Cassandra needs too much attention to reach
> even those numbers.
>
> With Cassandra (or any DB) we write data as records. We can batch them,
> but the amount of data in one IO operation is still small. In comparison,
> file transfers are much, much faster, and that is the fastest way to get
> data from A to B.
>
> So I am proposing to write the events that come in to a local file in the
> Data Receiver, and periodically append them to an HDFS file. We can
> arrange data in a folder per stream and in files by timestamp (e.g. 1h of
> data goes to a new file), so we can selectively pull and process data
> using Hive. (We could use something like
> https://github.com/OpenHFT/Chronicle-Queue to write data to disk.)
>
> If a user needs to avoid losing any messages at all in case of a disk
> failure, he can either have a SAN or NFS, or run two replicas of the
> receivers (we should write some code so that only one of the receivers
> will actually put data into HDFS).
>
> Coding wise, this should not be too hard. I am sure this will be many
> times faster than Cassandra (of course, we need to do a PoC and verify).
>
> WDYT?
>
> --Srinath
>
> --
> ============================
> Blog: http://srinathsview.blogspot.com
> Twitter: @srinath_perera
> Site: http://people.apache.org/~hemapani/
> Photos: http://www.flickr.com/photos/hemapani/
> Phone: 0772360902

--
*Anjana Fernando*
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware
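As an illustration, the folder-per-stream, file-per-hour layout Srinath proposes could be sketched as below (the class, root directory, and file suffix are hypothetical, chosen only to show how Hive could later prune by stream and time range):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Illustrative sketch: map a (streamId, event timestamp) pair to a path of
// the form <root>/<streamId>/<yyyy-MM-dd-HH>.events, so all events for one
// stream live in one folder and each hour opens a new file.
public class EventFilePathResolver {
    private final String rootDir;

    public EventFilePathResolver(String rootDir) {
        this.rootDir = rootDir;
    }

    public String resolve(String streamId, long eventTimestampMillis) {
        SimpleDateFormat hourBucket = new SimpleDateFormat("yyyy-MM-dd-HH");
        hourBucket.setTimeZone(TimeZone.getTimeZone("UTC"));
        String bucket = hourBucket.format(new Date(eventTimestampMillis));
        return rootDir + "/" + streamId + "/" + bucket + ".events";
    }
}
```

With such a layout, a Hive query over one stream for one time window only needs to read the matching hour files instead of scanning everything.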
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
