Hi Kasun,

On Tue, Feb 9, 2016 at 10:10 AM, Kasun Indrasiri <[email protected]> wrote:
> I think for the tracing use case we need to publish events one by one from
> each mediator (we can't aggregate all such events as each event also
> contains the message payload)

I think we can still aggregate them with some extra effort. Most of the
mediators in a sequence flow do not alter the message payload. We can store
the payload only for the mediators which alter the message payload, and for
the others, we can put a reference to the previous entry. By doing that we can
save memory to a great extent.

Thanks.
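
As a rough illustration of that idea (the class and field names below are
hypothetical, not the actual ESB tracing data model), an aggregated per-flow
event could store each mediator entry like this, keeping the payload only when
that mediator changed it and otherwise just an index pointing to the earlier
entry that already holds it:

import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** One aggregated trace event for a whole sequence flow (illustrative model only). */
public class FlowTraceEvent {

    /** Trace data for a single mediator invocation. */
    public static class MediatorEntry {
        final String mediatorName;
        final String payload;   // stored only if this mediator altered the payload
        final int payloadRef;   // otherwise, index of the entry holding it; -1 when payload is stored here

        MediatorEntry(String mediatorName, String payload, int payloadRef) {
            this.mediatorName = mediatorName;
            this.payload = payload;
            this.payloadRef = payloadRef;
        }
    }

    private final List<MediatorEntry> entries = new ArrayList<>();
    private int lastPayloadIndex = -1;   // index of the last entry that stored a payload

    /** Record a mediator invocation, storing the payload only when it differs from the previous one. */
    public void addMediator(String mediatorName, String currentPayload) {
        boolean unchanged = lastPayloadIndex >= 0
                && Objects.equals(currentPayload, entries.get(lastPayloadIndex).payload);
        if (unchanged) {
            // Payload unchanged: keep only a reference to the earlier entry.
            entries.add(new MediatorEntry(mediatorName, null, lastPayloadIndex));
        } else {
            // Payload changed (or first mediator in the flow): store it and remember where.
            entries.add(new MediatorEntry(mediatorName, currentPayload, -1));
            lastPayloadIndex = entries.size() - 1;
        }
    }

    /** Resolve the payload that was in effect at the given mediator. */
    public String payloadAt(int index) {
        MediatorEntry e = entries.get(index);
        return e.payload != null ? e.payload : entries.get(e.payloadRef).payload;
    }
}

With this shape, a sequence where most mediators pass the payload through
unchanged costs roughly one stored payload plus an integer per mediator, which
is what makes aggregating the whole flow into a single event feasible.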

> ---------- Forwarded message ----------
> From: Supun Sethunga <[email protected]>
> Date: Mon, Feb 8, 2016 at 2:54 PM
> Subject: Re: ESB Analytics Mediation Event Publishing Mechanism
> To: Anjana Fernando <[email protected]>
> Cc: "[email protected]" <[email protected]>, Srinath Perera <[email protected]>,
> Sanjiva Weerawarana <[email protected]>, Kasun Indrasiri <[email protected]>,
> Isuru Udana <[email protected]>
>
> Hi all,
>
> Ran some simple performance tests against the new relation provider, in
> comparison with the existing one. Following are the results:
>
> Records in Backend DB Table: 1,054,057
>
> Conversion:
>
> Backend DB Table:
>   id | data
>   1  | [{'a':'aaa','b':'bbb','c':'ccc'},{'a':'xxx','b':'yyy','c':'zzz'},{'a':'ppp','b':'qqq','c':'rrr'}]
>   2  | [{'a':'aaa','b':'bbb','c':'ccc'},{'a':'xxx','b':'yyy','c':'zzz'},{'a':'ppp','b':'qqq','c':'rrr'}]
>
> -- To -->
>
> Spark Table:
>   id | a   | b   | c
>   1  | aaa | bbb | ccc
>   1  | xxx | yyy | zzz
>   1  | ppp | qqq | rrr
>   2  | aaa | bbb | ccc
>   2  | xxx | yyy | zzz
>   2  | ppp | qqq | rrr
>
> Avg Time for Query Execution (~ sec):
>
>   Query | Existing Analytics Relation Provider | New (ESB) Analytics Relation Provider*
>   SELECT COUNT(*) FROM <Table>; | 13 | 16
>   SELECT * FROM <Table> ORDER BY id ASC; | 13 | 16
>   SELECT * FROM <Table> WHERE id=98435; | 13 | 16
>   SELECT id,a,first(b),first(c) FROM <Table> GROUP BY id,a ORDER BY id ASC; | 18 | 26
>
> * The new relation provider splits a single row into multiple rows. Hence the
> number of rows in the table is 3 times that of the original table (as each
> row is split into 3 rows).
>
> Regards,
> Supun
>
> On Wed, Feb 3, 2016 at 3:36 PM, Supun Sethunga <[email protected]> wrote:
>
>> Hi all,
>>
>> I have started working on implementing a new "relation" / "relation
>> provider" to serve the above requirement. This is basically a modified
>> version of the existing "Carbon Analytics" relation provider.
>>
>> Here I have assumed that the encapsulated data for a single execution
>> flow are stored in a single row, and that the data about the mediators
>> invoked during the flow are stored in a known column of each row (say
>> "data"), as an array (say a JSON array). When each row is read into Spark,
>> this relation provider creates a separate row for each element of the
>> array stored in the "data" column. I have tested this with some mocked
>> data, and it works as expected.
>>
>> Need to test with the real data/data-formats, and modify the mapping
>> accordingly. Will update the thread with the details.
>>
>> Regards,
>> Supun
>>
>> On Tue, Feb 2, 2016 at 2:36 AM, Anjana Fernando <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> In a meeting I had with Kasun and the ESB team, I got to know that, for
>>> their tracing mechanism, they were instructed to publish one event for
>>> each of the mediator invocations, whereas earlier they had an approach of
>>> publishing one event which encapsulated the data of a whole execution
>>> flow. I would actually like to support the latter approach, mainly due to
>>> performance / resource requirements, and also considering the fact that
>>> this is a feature that could be enabled in production. So, simply put, if
>>> we do one event per mediator, this does not scale that well. For example,
>>> if the ESB is doing 1k TPS, for a sequence that has 20 mediators, that is
>>> 20k TPS of analytics traffic. Combine that with a possible ESB cluster
>>> hitting a DAS cluster with a single backend database, and this may be too
>>> many rows per second written to the database. The main problem here is
>>> that one event is a single row/record in the backend database in DAS, so
>>> it may come to a state where the frequency of row creations by events
>>> coming from ESBs cannot be sustained.
>>>
>>> If we create a single event from the 20 mediators, then it is just 1k
>>> TPS for the DAS event receivers and the database too, even though the
>>> message size is bigger. Publishing lots of small events does not
>>> necessarily give the same performance as publishing bigger events.
>>> Throughput-wise, comparatively bigger events will win (even if small
>>> operations are batched at the transport level etc., it is still one event
>>> = one database row). So I would suggest we try out a single sequence flow
>>> = single event approach, and from the Spark processing side, we consider
>>> one of these big rows as multiple rows in Spark. I was first thinking
>>> whether UDFs could help in splitting a single column into multiple rows,
>>> but that is not possible, and it is also a bit troublesome, considering we
>>> would have to delete the original data table after we converted it using a
>>> script, not forgetting that we would actually have to schedule and run a
>>> separate script to do this post-processing. So a much cleaner way to do
>>> this would be to create a new "relation provider" in Spark (which is like
>>> a data adapter for their DataFrames), and in our relation provider, when
>>> we are reading rows, we convert a single row's column to multiple rows and
>>> return that for processing. So Spark will not know that physically it was
>>> a single row from the data layer, and it can summarize the data and all as
>>> usual and write to the target summary tables. [1] is our existing
>>> implementation of a Spark relation provider, which directly maps to our
>>> DAS analytics tables; we can create the new one extending / based on it.
>>> So I suggest we try out this approach and see if everyone is okay with it.
>>>
>>> [1]
>>> https://github.com/wso2/carbon-analytics/blob/master/components/analytics-processors/org.wso2.carbon.analytics.spark.core/src/main/java/org/wso2/carbon/analytics/spark/core/sources/AnalyticsRelationProvider.java
>>>
>>> Cheers,
>>> Anjana.
>>> --
>>> *Anjana Fernando*
>>> Senior Technical Lead
>>> WSO2 Inc. | http://wso2.com
>>> lean . enterprise . middleware
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "WSO2 Engineering Group" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/a/wso2.com/d/optout.
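
To make the row-splitting idea above concrete, here is a minimal, Spark-free
sketch of the transformation the new relation provider applies when reading a
backend row. The class name is hypothetical, and the 'a'/'b'/'c' fields simply
mirror the mocked conversion example above; the actual provider is built by
extending / basing on the Carbon Analytics relation provider in [1] and would
emit Spark Row objects from its scan rather than plain objects:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

/**
 * Simplified illustration of the row splitting done by the new relation
 * provider: one backend row with a JSON array in its "data" column becomes
 * one logical row per array element.
 */
public class DataColumnSplitter {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** One flattened row, matching the "Spark Table" side of the conversion above. */
    public static class FlatRow {
        final long id;
        final String a, b, c;

        FlatRow(long id, String a, String b, String c) {
            this.id = id; this.a = a; this.b = b; this.c = c;
        }

        @Override
        public String toString() {
            return id + " " + a + " " + b + " " + c;
        }
    }

    /** Split a single backend row (id + "data" JSON array) into one FlatRow per array element. */
    public static List<FlatRow> split(long id, String dataJsonArray) throws IOException {
        JsonNode array = MAPPER.readTree(dataJsonArray);
        List<FlatRow> rows = new ArrayList<>();
        for (JsonNode element : array) {
            rows.add(new FlatRow(id,
                    element.get("a").asText(),
                    element.get("b").asText(),
                    element.get("c").asText()));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Same shape as the conversion example above (double-quoted here for standard JSON).
        String data = "[{\"a\":\"aaa\",\"b\":\"bbb\",\"c\":\"ccc\"},"
                + "{\"a\":\"xxx\",\"b\":\"yyy\",\"c\":\"zzz\"},"
                + "{\"a\":\"ppp\",\"b\":\"qqq\",\"c\":\"rrr\"}]";
        split(1, data).forEach(System.out::println);
        // Prints: 1 aaa bbb ccc / 1 xxx yyy zzz / 1 ppp qqq rrr
    }
}

Running the main method prints the three flattened rows for id 1, matching the
Spark Table side of the conversion shown above, while the physical storage
stays at one row per execution flow.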

>>
>> --
>> *Supun Sethunga*
>> Software Engineer
>> WSO2, Inc.
>> http://wso2.com/
>> lean | enterprise | middleware
>> Mobile : +94 716546324
>
> --
> *Supun Sethunga*
> Software Engineer
> WSO2, Inc.
> http://wso2.com/
> lean | enterprise | middleware
> Mobile : +94 716546324
>
> --
> Kasun Indrasiri
> Software Architect
> WSO2, Inc.; http://wso2.com
> lean.enterprise.middleware
>
> cell: +94 77 556 5206
> Blog : http://kasunpanorama.blogspot.com/

--
*Isuru Udana*
Associate Technical Lead
WSO2 Inc.; http://wso2.com
email: [email protected]
cell: +94 77 3791887
blog: http://mytecheye.blogspot.com/

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
