Hi Shabir,

Thanks for the reply. Yes it would have been much easier for the analytic
if we could publish both events to a single steam. But currently we cannot
modify existing streams as it can break other stats :( .

Anyway I tried this and looks like its working. This does not require JOINs
in hive or MySQL.

First, created two meta tables that points to the same MySQL table which
has the schema of the table I need to produce [Table 1]. But first meta
table points to api,time and success_count and the second meta table points
to api,time and throttled_count.

CREATE EXTERNAL TABLE IF NOT EXISTS APIRTAllTable (api STRING,time STRING,
success_count INT) STORED BY

'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler' TBLPROPERTIES
( 'wso2.carbon.datasource.name'='WSO2AM_STATS_DB',

'hive.jdbc.update.on.duplicate' = 'true',

'hive.jdbc.primary.key.fields'='api,time',

'hive.jdbc.table.create.query' = 'CREATE TABLE APIRTAllTable( api
VARCHAR(100), time VARCHAR(30),success_count INT, throttled_count
INT,PRIMARY KEY(api,time))' );

CREATE EXTERNAL TABLE IF NOT EXISTS APITTable (api STRING,time STRING,
throttled_count INT) STORED BY

'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler' TBLPROPERTIES
( 'wso2.carbon.datasource.name'='WSO2AM_STATS_DB',

'hive.jdbc.update.on.duplicate' = 'true',

'hive.jdbc.primary.key.fields'='api,time',

'hive.jdbc.table.create.query' = 'CREATE TABLE APIRTAllTable( api
VARCHAR(100), time VARCHAR(30),success_count INT, throttled_count
INT,PRIMARY KEY(api,time))' );


Then gave the following ‘insert overwrite’ for first table. This takes data
from success request cassandra table and inserts values for the api,time
and success_count columns only in the MySQL table. For the unspecified
throttled_count
column it has inserted NULLs.

insert overwrite table APIRTAllTable select
api,concat(substring(from_unixtime(cast(requestTime/1000 as BIGINT),
'yyyy-MM-dd HH:mm:ss'),0,13),':00:00') as time,sum(request) as
success_count from APIRequests group by
api,substring(from_unixtime(cast(requestTime/1000 as BIGINT), 'yyyy-MM-dd
HH:mm:ss'),0,13);

After that following query takes data from throttled request cassandra
table and updates the same previous MySQL table. As success_count column is
not specified in the meta table, it has retained the existing value in the
column.

insert overwrite table APITTable select
api,concat(substring(from_unixtime(cast(throttledTime/1000 as BIGINT),
'yyyy-MM-dd HH:mm:ss'),0,13),':00:00') as time,count(api) as throttle_count
from APIThrottled group by
api,substring(from_unixtime(cast(throttledTime/1000 as BIGINT), 'yyyy-MM-dd
HH:mm:ss'),0,13);


Thanks.

On Thu, Jul 30, 2015 at 2:24 PM, Shabir Mohamed <[email protected]> wrote:

> Hi Malintha,
>
> Can't we have a single stream_definition to which each API call event is
> published to. We can have an entry in the definition to indicate the status
> of the API call to be either "SUCCESS" or "THROTTLED" (or may be a boolean).
>
> So we will be writing our summarising scripts against a single column
> family to get the count of SUCCESS or THROTTLED events.
>
> However, if the two column families that you had mentioned above is
> required for any other statistical measures then we might have to retain
> those two separately also.
>
> Regards
>
>
> -------------------------------------
> *Shabir Mohamed*
> *Software Engineer*
> WSO2 Inc.; http://wso2.com
> Email: [email protected] <[email protected]>
> Mobile: +94 77 3516019 | +94 71 6583393
>
> On Thu, Jul 30, 2015 at 11:02 AM, Malintha Amarasinghe <[email protected]
> > wrote:
>
>> Hi all,
>>
>> For a recent requirement, we need to analyse the number of successful and
>> throttled out requests per API over time to visualize them in APIM
>> statistics graphs.
>>
>> Ex:
>>
>> Time Range (per hour)
>>
>> API
>>
>> Number of successful requests
>>
>> Number of Throttled out requests
>>
>> 1.00 pm - 2.00 pm
>>
>> api1:v1.0.0
>>
>> 10
>>
>> 24
>>
>> 2.00 pm - 3.00 pm
>>
>> api1:v1.0.0
>>
>> 11
>>
>> 18
>>
>> ...
>>
>> ...
>>
>> ...
>>
>> [Table 1]
>>
>> Currently In API-M we have a set of cassandra column families stored in
>> BAM which are used for analytics. Following are 2 of them that are useful
>> for the above requirement.
>>
>>
>>    1.
>>
>>    org_wso2_apimgt_statistics_request
>>    2.
>>
>>    org_wso2_apimgt_statistics_throttle
>>
>> When an API call that comes to APIM is throttled out, an event is
>> published to *org_wso2_apimgt_statistics_throttle *stream per each call.
>> Likewise per each successful API call, an event is published to
>> *org_wso2_apimgt_statistics_request* stream.
>>
>>
>> The problem is, to derive the above table [Table 1] using both above
>> cassandra column families.
>>
>>
>> Currently it is done in the following way.
>>
>>
>>    1.
>>
>>    Using the data in *org_wso2_apimgt_statistics_request* we can
>>    summerize data per hour based and derive the following MySQL table.
>>
>>
>> Time Range (per hour)
>>
>> API
>>
>> Number of successful requests
>>
>> 1.00 pm - 2.00 pm
>>
>> api1:v1.0.0
>>
>> 10
>>
>> 2.00 pm - 3.00 pm
>>
>> api1:v1.0.0
>>
>> 11
>>
>> ...
>>
>> ...
>>
>> [Table 2]
>>
>>
>>    1.
>>
>>    Likewise we can derive the same for the
>>    *org_wso2_apimgt_statistics_request* column family.
>>
>>
>> Time Range (per hour)
>>
>> API
>>
>> Number of Throttled out requests
>>
>> 1.00 pm - 2.00 pm
>>
>> api1:v1.0.0
>>
>> 24
>>
>> 2.00 pm - 3.00 pm
>>
>> api1:v1.0.0
>>
>> 18
>>
>> ...
>>
>> ...
>>
>> [Table 3]
>>
>>
>>    1.
>>
>>    The two tables (Table 2 & 3) are joined in MySQL level to derive the
>>    above [Table 1].
>>
>>
>> But the problem with this is, two (still) large tables (when comes over
>> 20s of APIs and months of data) are joined in MySQL level (using a full
>> join) which might be less efficient.
>>
>> Are there any efficient ways do the above 3 steps in Hive queries and
>> produce the final joined table [Table 1]?
>>
>> Although JOINs are supported in Hive too, I think that would not be a
>> good solution as that will try to join an even more large tables in Hive
>> level.
>>
>> Any suggestions are much appreciated.
>>
>> Thank you.
>>
>> --
>> Malintha Amarasinghe
>> Software Engineer
>> *WSO2, Inc. - lean | enterprise | middleware*
>> http://wso2.com/
>>
>> Mobile : +94 712383306
>>
>> _______________________________________________
>> Dev mailing list
>> [email protected]
>> http://wso2.org/cgi-bin/mailman/listinfo/dev
>>
>>
>


-- 
Malintha Amarasinghe
Software Engineer
*WSO2, Inc. - lean | enterprise | middleware*
http://wso2.com/

Mobile : +94 712383306
_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev

Reply via email to