Of course we need to try it out and verify; I am just making the case that we should :)
Also, RDBMS should be the default, as most scenarios can be handled with DBs and there is no reason to make everyone's life complicated.

--Srinath

On Fri, Nov 7, 2014 at 7:44 AM, Srinath Perera <[email protected]> wrote:

> 1) Anjana, you are assuming the bandwidth is the bottleneck. Let me give an
> example.
>
> With sequential reads and writes, an HDD can do > 100 MB/sec and a 1G
> network can do > 50 MB/sec. But the best number we have seen from BAM is
> about 40k events/sec (and that was with 4 machines or so; let's assume one
> machine). Let's assume 20-byte events. Then it will be doing < 1 MB/sec.
>
> The problem is that Cassandra breaks the data into lots of small
> operations, losing the OS-level buffer-to-buffer copies that file transfers
> can do. I have tried increasing the batch size for Cassandra, which helps
> with smaller batches. But after about a few thousand operations in the same
> batch, things start to get much slower.
>
> The best numbers will come when we run two receivers instead of NFS.
>
> 2) Frank, this is analytics data. So it is read-only, and in most cases we
> need only time-based queries at low resolution (15 min as the smallest
> resolution is fine for most cases). That is to say, "run this batch query
> on the last hour of data" and so on.
>
> However, we have some scenarios where we do ad hoc queries, for things like
> activity monitoring. Time-based queries would not work for those, and we
> will have to run a batch job to push that data to an RDBMS or Solr etc.
> Anjana, we need to discuss this.
>
> But there are also a lot of use cases where we receive and write the event
> to disk as soon as possible and later run MapReduce on top of them. For
> those, the above will work.
>
> --Srinath
>
> On Fri, Nov 7, 2014 at 7:23 AM, Anjana Fernando <[email protected]> wrote:
>
>> Hi Sanjiva,
>>
>> On Thu, Nov 6, 2014 at 4:01 PM, Sanjiva Weerawarana <[email protected]>
>> wrote:
>>
>>> Anjana, I think the idea was for the file system -> HDFS upload to happen
>>> via a simple cron-job type thing.
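[Editor's sketch] The back-of-the-envelope arithmetic in point 1) above can be checked directly. The inputs (40k events/sec, 20-byte events, 100 MB/sec sequential disk) come from the mail; the class name is illustrative:

```java
// Check of the throughput figures quoted in point 1) above.
public class ThroughputCheck {
    public static void main(String[] args) {
        double eventsPerSec = 40_000;  // best observed BAM rate (from the mail)
        double bytesPerEvent = 20;     // assumed event size (from the mail)
        double diskMBPerSec = 100;     // sequential HDD throughput (from the mail)

        double eventMBPerSec = eventsPerSec * bytesPerEvent / (1024 * 1024);
        System.out.printf("event path: %.2f MB/sec%n", eventMBPerSec);       // ~0.76 MB/sec
        System.out.printf("disk headroom: ~%.0fx%n", diskMBPerSec / eventMBPerSec); // ~131x
    }
}
```

So even at its best, the record-at-a-time path uses under 1% of sequential disk bandwidth, which is the gap Srinath is pointing at.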
>>
>> Even so, we will just be moving the problem to another area; the overall
>> work done by that hardware is still the same (writing to disk, reading it
>> back, writing it to the network). That is, even though we can get to a
>> very high throughput initially by writing to the local disk first, later
>> on we have to read it back and write it to HDFS over the network, which is
>> the slower part of our operation. So if we continue to load the machine at
>> an extreme throughput, we will eventually run out of space on that disk.
>>
>> Cheers,
>> Anjana.
>>
>>> On Wed, Nov 5, 2014 at 9:19 AM, Anjana Fernando <[email protected]> wrote:
>>>
>>>> Hi Srinath,
>>>>
>>>> Wouldn't it be better if we just make the batch size bigger? That is,
>>>> let's just have a sizable local in-memory store, probably close to 64
>>>> MB, which is the default HDFS block size, and only flush the buffer
>>>> after it is filled, or maybe when the receiver is idle. I was just
>>>> thinking that writing to the file system first will itself be expensive,
>>>> since there are the additional steps of writing all the records to the
>>>> local file system, reading them back, and then finally writing them to
>>>> HDFS; and of course a network file system would again be an overhead,
>>>> not to mention the implementation/configuration complications that come
>>>> with it. IMHO, we should try to make these scenarios as simple as
>>>> possible.
>>>>
>>>> I'm doing our new BAM data layer implementations here [1]. I'm almost
>>>> done with an RDBMS implementation and am doing some refactoring now
>>>> (mail on this yet to come :)). I can also do an HDFS one after that and
>>>> check it.
>>>>
>>>> [1] https://github.com/wso2/carbon-analytics/tree/master/components/xanalytics
>>>>
>>>> Cheers,
>>>> Anjana.
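[Editor's sketch] The in-memory store Anjana describes could look roughly like the following. The class and method names are illustrative, the sink would be an HDFS output stream in practice, and the 64 MB threshold is the default HDFS block size mentioned above:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch: accumulate events in memory until the buffer reaches roughly one
// HDFS block, then flush it as a single large write.
public class BlockBuffer {
    public static final int DEFAULT_THRESHOLD = 64 * 1024 * 1024; // default HDFS block size

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final OutputStream sink; // in practice an HDFS output stream
    private final int threshold;

    public BlockBuffer(OutputStream sink, int threshold) {
        this.sink = sink;
        this.threshold = threshold;
    }

    public synchronized void append(byte[] event) throws IOException {
        buffer.write(event);
        if (buffer.size() >= threshold) {
            flush();
        }
    }

    // Also called when the receiver goes idle, per the suggestion above.
    public synchronized void flush() throws IOException {
        buffer.writeTo(sink);
        sink.flush();
        buffer.reset();
    }
}
```

The trade-off against the local-file approach is durability: anything in the buffer is lost on a crash, whereas a local file (or Chronicle Queue) survives a process restart.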
>>>>
>>>> On Tue, Nov 4, 2014 at 6:56 PM, Srinath Perera <[email protected]> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> The following came out of a chat with Sanjiva on a scenario involving a
>>>>> very large number of events coming into BAM.
>>>>>
>>>>> Currently we use Cassandra to store the events; the numbers we have got
>>>>> out of it have not been great, and Cassandra needs too much attention
>>>>> to get to those numbers.
>>>>>
>>>>> With Cassandra (or any DB) we write data as records. We can batch them,
>>>>> but the amount of data in one IO operation is still small. In
>>>>> comparison, file transfers are much, much faster, and they are the
>>>>> fastest way to get data from A to B.
>>>>>
>>>>> So I am proposing to write the events that come in to a local file in
>>>>> the Data Receiver, and periodically append them to an HDFS file. We can
>>>>> arrange data in a folder per stream and in files by timestamp (e.g.
>>>>> each 1h of data goes to a new file), so we can selectively pull and
>>>>> process data using Hive. (We can use something like
>>>>> https://github.com/OpenHFT/Chronicle-Queue to write data to disk.)
>>>>>
>>>>> If the user needs to avoid losing any messages at all in case of a disk
>>>>> failure, they can either have a SAN or NFS, or run two replicas of the
>>>>> receivers (we should write some code so that only one of the receivers
>>>>> actually puts data into HDFS).
>>>>>
>>>>> Coding-wise, this should not be too hard. I am sure this will be many
>>>>> times faster than Cassandra (of course, we need to do a PoC and
>>>>> verify).
>>>>>
>>>>> WDYT?
>>>>>
>>>>> --Srinath
>>>>>
>>>>> --
>>>>> ============================
>>>>> Blog: http://srinathsview.blogspot.com twitter: @srinath_perera
>>>>> Site: http://people.apache.org/~hemapani/
>>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>>> Phone: 0772360902
>>>>
>>>> --
>>>> *Anjana Fernando*
>>>> Senior Technical Lead
>>>> WSO2 Inc. | http://wso2.com
>>>> lean . enterprise . 
middleware
>>>
>>> --
>>> Sanjiva Weerawarana, Ph.D.
>>> Founder, Chairman & CEO; WSO2, Inc.; http://wso2.com/
>>> email: [email protected]; office: (+1 650 745 4499 | +94 11 214 5345)
>>> x5700; cell: +94 77 787 6880 | +1 408 466 5099; voip: +1 650 265 8311
>>> blog: http://sanjiva.weerawarana.org/; twitter: @sanjiva
>>> Lean . Enterprise . Middleware
>>
>> --
>> *Anjana Fernando*
>> Senior Technical Lead
>> WSO2 Inc. | http://wso2.com
>> lean . enterprise . middleware
>
> --
> ============================
> Blog: http://srinathsview.blogspot.com twitter: @srinath_perera
> Site: http://people.apache.org/~hemapani/
> Photos: http://www.flickr.com/photos/hemapani/
> Phone: 0772360902

--
============================
Blog: http://srinathsview.blogspot.com twitter: @srinath_perera
Site: http://people.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902
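[Editor's sketch] Srinath's folder-per-stream, file-per-hour layout could be mapped as below. The `/bam/events` prefix, the `.events` suffix, and the class name are illustrative assumptions; only the partitioning scheme itself comes from the proposal:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch: map an event's stream name and timestamp to an HDFS path, so Hive
// can selectively pull one hour of one stream.
public class EventPaths {
    private static final DateTimeFormatter HOUR_BUCKET =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HH").withZone(ZoneOffset.UTC);

    public static String pathFor(String streamName, long epochMillis) {
        // e.g. /bam/events/org.wso2.sample.stream/2014-11-04-18.events
        return "/bam/events/" + streamName + "/"
                + HOUR_BUCKET.format(Instant.ofEpochMilli(epochMillis)) + ".events";
    }
}
```

Since each hour's data lands in its own file, the time-based queries described earlier (e.g. "run this batch query on the last hour of data") only ever read the files for the hours in range.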
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
