Hi Sanjeewa,

On Sun, Mar 6, 2016 at 7:25 AM, Sanjeewa Malalgoda <[email protected]> wrote:
> Implementing cache is better than having another table mapping IMO. What
> if we query database and keep IP range and network name in memory.
> Then we may do quick search on network name and then based on that rest
> can load some other way.
> WDYT?

We thought of having an in-memory cache, but we faced several issues along the way. Let me explain the situation as it stands now.

The MaxMind DB stores IP addresses as a network plus a netmask, e.g. 192.168.0.0/20. The IP address range for such a block is calculated as follows:

Address:   192.168.0.1          11000000.10101000.0000 0000.00000001
Netmask:   255.255.240.0 = 20   11111111.11111111.1111 0000.00000000
Wildcard:  0.0.15.255           00000000.00000000.0000 1111.11111111
=>
Network:   192.168.0.0/20       11000000.10101000.0000 0000.00000000 (Class C)
Broadcast: 192.168.15.255       11000000.10101000.0000 1111.11111111
HostMin:   192.168.0.1          11000000.10101000.0000 0000.00000001
HostMax:   192.168.15.254       11000000.10101000.0000 1111.11111110
Hosts/Net: 4094                 (Private Internet <http://www.ietf.org/rfc/rfc1918.txt>)

Therefore, what we currently do is calculate the start and end IP for every entry in the MaxMind DB and populate the tables with those values up front (a one-time operation). When the Spark script executes, we check whether the given IP falls between any of the start and end values in those tables. That range check is why it takes so long to fetch results for a given IP.

As a solution, we discussed what Tharindu has mentioned:

1. Have an in-memory caching mechanism.
2. Have a DB-based caching mechanism.

The one point we have to highlight is that in both mechanisms we need to cache the individual IP address (not the IP/netmask pair as stored in the MaxMind DB) against the geo-location.
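The start/end computation above can be sketched as follows; this is just an illustration of the one-time precalculation step, using Python's ipaddress module (the real implementation runs against the MaxMind tables, and the helper name is made up):

```python
import ipaddress

def host_range(cidr):
    """Hypothetical helper: first/last assignable host and host count for a CIDR block."""
    net = ipaddress.ip_network(cidr)
    host_min = net.network_address + 1    # e.g. 192.168.0.1 for 192.168.0.0/20
    host_max = net.broadcast_address - 1  # e.g. 192.168.15.254
    # Hosts/Net excludes the network and broadcast addresses.
    return str(host_min), str(host_max), net.num_addresses - 2

print(host_range("192.168.0.0/20"))
# ('192.168.0.1', '192.168.15.254', 4094)
```

In practice the start and end would be stored as integers (e.g. `int(net.network_address)`) so the Spark-side lookup can be a plain BETWEEN comparison.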
Ex:

For 192.168.0.1    - Colombo, Sri Lanka
For 192.168.15.254 - Colombo, Sri Lanka

So, per the example above, if there are requests from all 4094 possible addresses, we will end up caching each IP against its geo-location (since introducing range queries in a cache is not good practice).

Please find my comments on both approaches.

1. An in-memory cache would speed things up, but depending on the IPs in the data set there could be many entries for IPs in the same range. One problem with this approach is that after a server restart, the initial script execution would take a lot of time. Also, in some scenarios (a high number of distinct IPs) the cache would not significantly improve script execution performance.

2. A DB-based cache would persist the data across restarts, and the fetch query would search for a specific value (not a range query as against the MaxMind DB). The downside is that a cache miss costs a minimum of three DB queries: one for the cache table lookup, one for the MaxMind DB lookup, and one to persist the result into the cache.

That is why we initiated this thread: to finalize the caching approach we should take.

Thanks,
Janaka

> Thanks,
> sanjeewa.
>
> On Fri, Mar 4, 2016 at 3:12 PM, Tharindu Dharmarathna <[email protected]>
> wrote:
>
>> Hi All,
>>
>> We are going to implement a client-IP-based geo-location graph in API
>> Manager Analytics. Having gone through the options in [1], we selected
>> [2] as the most suitable one.
>>
>> *Overview of MaxMind's DB*
>>
>> As per the structure of the DB (attached as an image), two tables work
>> together to resolve a location: find the geoname_id for the network,
>> then get the country and city from the locations table.
>>
>> *Limitations*
>>
>> With their database dump we cannot process an IP directly from those
>> tables; we need to check whether the given IP falls between a
>> network's min and max IP.
>> That query takes a long time (around 10 seconds, even on indexed
>> data). If we run it directly from the Spark script for every IP in the
>> summary table (even when the same IP appears in several rows), each
>> row triggers a query against the tables, which hurts the performance
>> of this graph.
>>
>> *Solution*
>>
>> 1. Implement an LRU cache of IP address vs. location.
>>
>> This needs to be implemented as a custom UDF in Spark. If the IP
>> queried from Spark is available in the cache, the location is served
>> from the cache; if not, it is retrieved from the DB and put into the
>> cache.
>>
>> 2. Persist in a table.
>>
>> Use the IP as the primary key, with country and city as other columns,
>> and retrieve the data from that table.
>>
>> Please feel free to suggest the most suitable way of doing this.
>>
>> [1] - "Implementing Geographical based Analytics in API Manager" mail
>> thread.
>>
>> [2] - http://dev.maxmind.com/geoip/geoip2/geolite2/
>>
>>
>> *Thanks*
>>
>> *Tharindu Dharmarathna*
>> Associate Software Engineer
>> WSO2 Inc.; http://wso2.com
>> lean.enterprise.middleware
>>
>> mobile: *+94779109091*
>
>
> --
>
> *Sanjeewa Malalgoda*
> WSO2 Inc.
> Mobile : +94713068779
>
> blog: http://sanjeewamalalgoda.blogspot.com/


--

*Janaka Ranabahu*
Associate Technical Lead, WSO2 Inc.
http://wso2.com

*E-mail: [email protected]* *M: +94 718370861*
Lean . Enterprise . Middleware
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
