Hi Kasun,

On Tue, Feb 9, 2016 at 10:10 AM, Kasun Indrasiri <[email protected]> wrote:
> I think for the tracing use case we need to publish events one by one from
> each mediator (we can't aggregate all such events as each event also
> contains the message payload)

I think we can still aggregate them with some extra effort. Most of the
mediators in a sequence flow do not alter the message payload. We can store
the payload only for the mediators which alter the message payload, and for
the others, we can put a reference to the previous entry. By doing that we can
save memory to a great extent.

Thanks.
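
As a rough illustration of that idea (the class and field names below are
hypothetical, not the actual ESB tracing data model), an aggregated per-flow
event could store each mediator entry like this, keeping the payload only when
that mediator changed it and otherwise just an index pointing to the earlier
entry that already holds it:

import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** One aggregated trace event for a whole sequence flow (illustrative model only). */
public class FlowTraceEvent {

    /** Trace data for a single mediator invocation. */
    public static class MediatorEntry {
        final String mediatorName;
        final String payload;   // stored only if this mediator altered the payload
        final int payloadRef;   // otherwise, index of the entry holding it; -1 when payload is stored here

        MediatorEntry(String mediatorName, String payload, int payloadRef) {
            this.mediatorName = mediatorName;
            this.payload = payload;
            this.payloadRef = payloadRef;
        }
    }

    private final List<MediatorEntry> entries = new ArrayList<>();
    private int lastPayloadIndex = -1;   // index of the last entry that stored a payload

    /** Record a mediator invocation, storing the payload only when it differs from the previous one. */
    public void addMediator(String mediatorName, String currentPayload) {
        boolean unchanged = lastPayloadIndex >= 0
                && Objects.equals(currentPayload, entries.get(lastPayloadIndex).payload);
        if (unchanged) {
            // Payload unchanged: keep only a reference to the earlier entry.
            entries.add(new MediatorEntry(mediatorName, null, lastPayloadIndex));
        } else {
            // Payload changed (or first mediator in the flow): store it and remember where.
            entries.add(new MediatorEntry(mediatorName, currentPayload, -1));
            lastPayloadIndex = entries.size() - 1;
        }
    }

    /** Resolve the payload that was in effect at the given mediator. */
    public String payloadAt(int index) {
        MediatorEntry e = entries.get(index);
        return e.payload != null ? e.payload : entries.get(e.payloadRef).payload;
    }
}

With this shape, a sequence where most mediators pass the payload through
unchanged costs roughly one stored payload plus an integer per mediator, which
is what makes aggregating the whole flow into a single event feasible.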

> ---------- Forwarded message ----------
> From: Supun Sethunga <[email protected]>
> Date: Mon, Feb 8, 2016 at 2:54 PM
> Subject: Re: ESB Analytics Mediation Event Publishing Mechanism
> To: Anjana Fernando <[email protected]>
> Cc: "[email protected]" <[email protected]>, Srinath Perera <[email protected]>,
> Sanjiva Weerawarana <[email protected]>, Kasun Indrasiri <[email protected]>,
> Isuru Udana <[email protected]>
>
> Hi all,
>
> Ran some simple performance tests against the new relation provider, in
> comparison with the existing one. Following are the results:
>
> Records in Backend DB Table: 1,054,057
>
> Conversion:
>
> Backend DB Table:
>   id | data
>   1  | [{'a':'aaa','b':'bbb','c':'ccc'},{'a':'xxx','b':'yyy','c':'zzz'},{'a':'ppp','b':'qqq','c':'rrr'}]
>   2  | [{'a':'aaa','b':'bbb','c':'ccc'},{'a':'xxx','b':'yyy','c':'zzz'},{'a':'ppp','b':'qqq','c':'rrr'}]
>
> -- To -->
>
> Spark Table:
>   id | a   | b   | c
>   1  | aaa | bbb | ccc
>   1  | xxx | yyy | zzz
>   1  | ppp | qqq | rrr
>   2  | aaa | bbb | ccc
>   2  | xxx | yyy | zzz
>   2  | ppp | qqq | rrr
>
> Avg Time for Query Execution (~ sec):
>
>   Query | Existing Analytics Relation Provider | New (ESB) Analytics Relation Provider*
>   SELECT COUNT(*) FROM <Table>; | 13 | 16
>   SELECT * FROM <Table> ORDER BY id ASC; | 13 | 16
>   SELECT * FROM <Table> WHERE id=98435; | 13 | 16
>   SELECT id,a,first(b),first(c) FROM <Table> GROUP BY id,a ORDER BY id ASC; | 18 | 26
>
> * The new relation provider splits a single row into multiple rows. Hence the
> number of rows in the table is 3 times that of the original table (as each
> row is split into 3 rows).
>
> Regards,
> Supun
>
> On Wed, Feb 3, 2016 at 3:36 PM, Supun Sethunga <[email protected]> wrote:
>
>> Hi all,
>>
>> I have started working on implementing a new "relation" / "relation
>> provider" to serve the above requirement. This is basically a modified
>> version of the existing "Carbon Analytics" relation provider.
>>
>> Here I have assumed that the encapsulated data for a single execution
>> flow are stored in a single row, and that the data about the mediators
>> invoked during the flow are stored in a known column of each row (say
>> "data"), as an array (say a JSON array). When each row is read into Spark,
>> this relation provider creates a separate row for each element of the
>> array stored in the "data" column. I have tested this with some mocked
>> data, and it works as expected.
>>
>> Need to test with the real data/data-formats, and modify the mapping
>> accordingly. Will update the thread with the details.
>>
>> Regards,
>> Supun
>>
>> On Tue, Feb 2, 2016 at 2:36 AM, Anjana Fernando <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> In a meeting I had with Kasun and the ESB team, I got to know that, for
>>> their tracing mechanism, they were instructed to publish one event for
>>> each of the mediator invocations, whereas earlier they had an approach of
>>> publishing one event which encapsulated the data of a whole execution
>>> flow. I would actually like to support the latter approach, mainly due to
>>> performance / resource requirements, and also considering the fact that
>>> this is a feature that could be enabled in production. So, simply put, if
>>> we do one event per mediator, this does not scale that well. For example,
>>> if the ESB is doing 1k TPS, for a sequence that has 20 mediators, that is
>>> 20k TPS of analytics traffic. Combine that with a possible ESB cluster
>>> hitting a DAS cluster with a single backend database, and this may be too
>>> many rows per second written to the database. The main problem here is
>>> that one event is a single row/record in the backend database in DAS, so
>>> it may come to a state where the frequency of row creations by events
>>> coming from ESBs cannot be sustained.
>>>
>>> If we create a single event from the 20 mediators, then it is just 1k
>>> TPS for the DAS event receivers and the database too, even though the
>>> message size is bigger. Publishing lots of small events does not
>>> necessarily give the same performance as publishing bigger events.
>>> Throughput-wise, comparatively bigger events will win (even if small
>>> operations are batched at the transport level etc., it is still one event
>>> = one database row). So I would suggest we try out a single sequence flow
>>> = single event approach, and from the Spark processing side, we consider
>>> one of these big rows as multiple rows in Spark. I was first thinking
>>> whether UDFs could help in splitting a single column into multiple rows,
>>> but that is not possible, and it is also a bit troublesome, considering we
>>> would have to delete the original data table after we converted it using a
>>> script, not forgetting that we would actually have to schedule and run a
>>> separate script to do this post-processing. So a much cleaner way to do
>>> this would be to create a new "relation provider" in Spark (which is like
>>> a data adapter for their DataFrames), and in our relation provider, when
>>> we are reading rows, we convert a single row's column to multiple rows and
>>> return that for processing. So Spark will not know that physically it was
>>> a single row from the data layer, and it can summarize the data and all as
>>> usual and write to the target summary tables. [1] is our existing
>>> implementation of a Spark relation provider, which directly maps to our
>>> DAS analytics tables; we can create the new one extending / based on it.
>>> So I suggest we try out this approach and see if everyone is okay with it.
>>>
>>> [1]
>>> https://github.com/wso2/carbon-analytics/blob/master/components/analytics-processors/org.wso2.carbon.analytics.spark.core/src/main/java/org/wso2/carbon/analytics/spark/core/sources/AnalyticsRelationProvider.java
>>>
>>> Cheers,
>>> Anjana.
>>> --
>>> *Anjana Fernando*
>>> Senior Technical Lead
>>> WSO2 Inc. | http://wso2.com
>>> lean . enterprise . middleware
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "WSO2 Engineering Group" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/a/wso2.com/d/optout.
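
To make the row-splitting idea above concrete, here is a minimal, Spark-free
sketch of the transformation the new relation provider applies when reading a
backend row. The class name is hypothetical, and the 'a'/'b'/'c' fields simply
mirror the mocked conversion example above; the actual provider is built by
extending / basing on the Carbon Analytics relation provider in [1] and would
emit Spark Row objects from its scan rather than plain objects:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

/**
 * Simplified illustration of the row splitting done by the new relation
 * provider: one backend row with a JSON array in its "data" column becomes
 * one logical row per array element.
 */
public class DataColumnSplitter {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** One flattened row, matching the "Spark Table" side of the conversion above. */
    public static class FlatRow {
        final long id;
        final String a, b, c;

        FlatRow(long id, String a, String b, String c) {
            this.id = id; this.a = a; this.b = b; this.c = c;
        }

        @Override
        public String toString() {
            return id + " " + a + " " + b + " " + c;
        }
    }

    /** Split a single backend row (id + "data" JSON array) into one FlatRow per array element. */
    public static List<FlatRow> split(long id, String dataJsonArray) throws IOException {
        JsonNode array = MAPPER.readTree(dataJsonArray);
        List<FlatRow> rows = new ArrayList<>();
        for (JsonNode element : array) {
            rows.add(new FlatRow(id,
                    element.get("a").asText(),
                    element.get("b").asText(),
                    element.get("c").asText()));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Same shape as the conversion example above (double-quoted here for standard JSON).
        String data = "[{\"a\":\"aaa\",\"b\":\"bbb\",\"c\":\"ccc\"},"
                + "{\"a\":\"xxx\",\"b\":\"yyy\",\"c\":\"zzz\"},"
                + "{\"a\":\"ppp\",\"b\":\"qqq\",\"c\":\"rrr\"}]";
        split(1, data).forEach(System.out::println);
        // Prints: 1 aaa bbb ccc / 1 xxx yyy zzz / 1 ppp qqq rrr
    }
}

Running the main method prints the three flattened rows for id 1, matching the
Spark Table side of the conversion shown above, while the physical storage
stays at one row per execution flow.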

>>
>> --
>> *Supun Sethunga*
>> Software Engineer
>> WSO2, Inc.
>> http://wso2.com/
>> lean | enterprise | middleware
>> Mobile : +94 716546324
>
> --
> *Supun Sethunga*
> Software Engineer
> WSO2, Inc.
> http://wso2.com/
> lean | enterprise | middleware
> Mobile : +94 716546324
>
> --
> Kasun Indrasiri
> Software Architect
> WSO2, Inc.; http://wso2.com
> lean.enterprise.middleware
>
> cell: +94 77 556 5206
> Blog : http://kasunpanorama.blogspot.com/

--
*Isuru Udana*
Associate Technical Lead
WSO2 Inc.; http://wso2.com
email: [email protected]
cell: +94 77 3791887
blog: http://mytecheye.blogspot.com/

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
