Anjana, I think the idea was for the file system -> HDFS upload to happen via a simple cron-job-type thing.
On Wed, Nov 5, 2014 at 9:19 AM, Anjana Fernando <[email protected]> wrote:

> Hi Srinath,
>
> Wouldn't it be better if we just make the batch size bigger? That is, let's
> just have a sizable local in-memory store, probably close to 64MB, which is
> the default HDFS block size, and flush the buffer only after it is filled,
> or perhaps when the receiver is idle. I was just thinking that writing to
> the file system first will itself be expensive: there are the additional
> steps of writing all the records to the local file system, reading them
> back, and then finally writing them to HDFS. And of course, a network file
> system would again be an overhead, not to mention the
> implementation/configuration complications that would come with it. IMHO,
> we should keep these scenarios as simple as possible.
>
> I'm doing our new BAM data layer implementations here [1]. I'm almost done
> with an RDBMS implementation and am doing some refactoring now (mail on
> this yet to come :)). I can also do an HDFS one after that and check it.
>
> [1] https://github.com/wso2/carbon-analytics/tree/master/components/xanalytics
>
> Cheers,
> Anjana.
>
> On Tue, Nov 4, 2014 at 6:56 PM, Srinath Perera <[email protected]> wrote:
>
>> Hi All,
>>
>> The following came out of a chat with Sanjiva on a scenario involving a
>> very large number of events coming into BAM.
>>
>> Currently we use Cassandra to store the events; the numbers we got out of
>> it have not been great, and Cassandra needs too much attention to get to
>> those numbers.
>>
>> With Cassandra (or any DB) we write data as records. We can batch them,
>> but the amount of data in one IO operation is still small. In comparison,
>> file transfers are much, much faster, and that is the fastest way to get
>> data from A to B.
>>
>> So I am proposing to write the events that come in to a local file in the
>> Data Receiver, and periodically append them to an HDFS file.
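[To make the batching idea being discussed concrete, here is a minimal, hypothetical sketch (class and method names are illustrative, not from the carbon-analytics code) of a size-triggered buffer: events accumulate in memory and the whole batch is handed to a sink, e.g. an HDFS writer, once a threshold such as the 64MB block size is reached. An idle-timeout timer could call `flush()` as well.]

```java
import java.io.ByteArrayOutputStream;
import java.util.function.Consumer;

// Hypothetical sketch: accumulate serialized events in memory and hand the
// whole batch to a sink (e.g. an HDFS block writer) once it reaches a
// configured size threshold.
public class EventBuffer {
    private final int flushThresholdBytes;
    private final Consumer<byte[]> sink; // e.g. appends one batch to HDFS
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    public EventBuffer(int flushThresholdBytes, Consumer<byte[]> sink) {
        this.flushThresholdBytes = flushThresholdBytes;
        this.sink = sink;
    }

    public synchronized void append(byte[] record) {
        buffer.write(record, 0, record.length);
        if (buffer.size() >= flushThresholdBytes) {
            flush(); // buffer is full: push the whole batch out in one IO
        }
    }

    // Also callable from a timer when the receiver goes idle.
    public synchronized void flush() {
        if (buffer.size() == 0) return;
        sink.accept(buffer.toByteArray());
        buffer.reset();
    }
}
```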
>> We can arrange data in folders by stream and files by timestamp (e.g. 1h
>> of data goes to a new file), so we can selectively pull and process data
>> using Hive. (We could use something like
>> https://github.com/OpenHFT/Chronicle-Queue to write data to disk.)
>>
>> If the user needs to avoid losing any messages at all in case of a disk
>> failure, he can either have a SAN or NFS, or run two replicas of the
>> receivers (we should write some code so that only one of the receivers
>> actually puts data into HDFS).
>>
>> Coding-wise, this should not be too hard. I am sure this will be a factor
>> of times faster than Cassandra (of course we need to do a PoC and verify).
>>
>> WDYT?
>>
>> --Srinath
>>
>> --
>> ============================
>> Blog: http://srinathsview.blogspot.com twitter: @srinath_perera
>> Site: http://people.apache.org/~hemapani/
>> Photos: http://www.flickr.com/photos/hemapani/
>> Phone: 0772360902
>
> --
> *Anjana Fernando*
> Senior Technical Lead
> WSO2 Inc. | http://wso2.com
> lean . enterprise . middleware

--
Sanjiva Weerawarana, Ph.D.
Founder, Chairman & CEO; WSO2, Inc.; http://wso2.com/
email: [email protected]; office: (+1 650 745 4499 | +94 11 214 5345) x5700;
cell: +94 77 787 6880 | +1 408 466 5099; voip: +1 650 265 8311
blog: http://sanjiva.weerawarana.org/; twitter: @sanjiva
Lean . Enterprise . Middleware
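[The folder-per-stream, file-per-hour layout proposed in the thread can be sketched as a simple path function; the `/bam/events` prefix and `.events` suffix below are illustrative assumptions, not anything fixed in BAM.]

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical sketch: map a stream name plus an event timestamp to an HDFS
// path, so each stream gets its own folder and each hour gets its own file
// (which Hive can then pull selectively by stream and time range).
public class EventPaths {
    private static final SimpleDateFormat HOUR_FORMAT;
    static {
        HOUR_FORMAT = new SimpleDateFormat("yyyy-MM-dd-HH");
        HOUR_FORMAT.setTimeZone(TimeZone.getTimeZone("UTC"));
    }

    // synchronized because SimpleDateFormat is not thread-safe
    public static synchronized String pathFor(String streamName, long timestampMillis) {
        return "/bam/events/" + streamName + "/"
                + HOUR_FORMAT.format(new Date(timestampMillis)) + ".events";
    }
}
```

All events of one stream that arrive within the same UTC hour land in the same file, so an hourly cron-style job only has to ship the files whose hour has closed.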
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
