Hi all,

From what I understand / was told, this happens once a day (or relatively
infrequently), and you want to avoid searching through all the geo data per
IP (since you are grouping the requests by IP).
If that's the case, it would be better to use a separate DB table to cache
this data (IP, geoID, etc.) with the IP as the primary key (which would
improve the lookup time). Even though there will be cache misses, the
cache-miss-to-hit ratio would eventually drop. A DB cache would be better
since you do want to persist this data for use over time. BTW, on a cache
miss, if we can figure out a way to limit the search range on the original
table, or at least stop the search once a match is found, it would greatly
improve the cache-miss time as well.

That's my two cents.

Cheers,
Sachith

On Sun, Mar 6, 2016 at 8:24 AM, Janaka Ranabahu <[email protected]> wrote:

> Hi Sanjeewa,
>
> On Sun, Mar 6, 2016 at 7:25 AM, Sanjeewa Malalgoda <[email protected]>
> wrote:
>
>> Implementing a cache is better than having another table mapping, IMO.
>> What if we query the database and keep the IP range and network name in
>> memory? Then we could do a quick search on the network name and, based
>> on that, load the rest some other way.
>> WDYT?
>
> We thought of having an in-memory cache, but we faced several issues
> along the way. Let me explain the situation as it is now.
>
> The Max-Mind DB has the IP addresses as an IP plus a netmask.
> Ex: 192.168.0.0/20
>
> The calculation of the IP address range would be like the following.
>
> Address:   192.168.0.1          11000000.10101000.0000 0000.00000001
> Netmask:   255.255.240.0 = 20   11111111.11111111.1111 0000.00000000
> Wildcard:  0.0.15.255           00000000.00000000.0000 1111.11111111
> =>
> Network:   192.168.0.0/20       11000000.10101000.0000 0000.00000000 (Class C)
> Broadcast: 192.168.15.255       11000000.10101000.0000 1111.11111111
> HostMin:   192.168.0.1          11000000.10101000.0000 0000.00000001
> HostMax:   192.168.15.254       11000000.10101000.0000 1111.11111110
> Hosts/Net: 4094                 (Private Internet <http://www.ietf.org/rfc/rfc1918.txt>)
>
> Therefore, what we are currently doing is to calculate the start and end
> IP for all the values in the Max-Mind DB and alter the tables with those
> values initially (this is a one-time operation). When the Spark script
> executes, we check whether the given IP is between any of the start and
> end ranges in the tables. That is the reason why it is taking a long
> time to fetch results for a given IP.
>
> As a solution for this, we discussed what Tharindu has mentioned:
> 1. Have an in-memory caching mechanism.
> 2. Have a DB-based caching mechanism.
>
> The only point that we have to highlight is the fact that in both of the
> above mechanisms we need to cache the IP address (not the IP + netmask
> as it is in the Max-Mind DB) against the geo location.
>
> Ex:
> For 192.168.0.1    - Colombo, Sri Lanka
> For 192.168.15.254 - Colombo, Sri Lanka
>
> So as per the above example, if there are requests from all 4094
> possible addresses, we will be caching each IP with the geo location
> (since introducing range queries in a cache is not a good practice).
>
> Please find my comments about both approaches.
>
> 1. Having an in-memory cache would speed things up, but based on the IPs
> in the data set, there could be a number of entries for IPs in the same
> range. One problem with this approach is that, if there is a server
> restart, the initial script execution would take a lot of time.
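The one-time start/end pre-computation described above can be sketched with
Python's standard `ipaddress` module. This is illustrative only (the actual
implementation alters the Max-Mind tables); it reproduces the numbers in the
192.168.0.0/20 example:

```python
import ipaddress

def ip_range(cidr):
    """Return (start_ip, end_ip, usable_host_count) for a CIDR block."""
    net = ipaddress.ip_network(cidr)
    return (str(net.network_address),       # start of range (network address)
            str(net.broadcast_address),     # end of range (broadcast address)
            net.num_addresses - 2)          # hosts, excluding network/broadcast

start, end, hosts = ip_range("192.168.0.0/20")
print(start, end, hosts)  # 192.168.0.0 192.168.15.255 4094

# For the range check in the Spark script, the bounds would be stored as
# integers so the lookup becomes "start_int <= ip_int <= end_int".
net = ipaddress.ip_network("192.168.0.0/20")
start_int, end_int = int(net.network_address), int(net.broadcast_address)
print(start_int, end_int)  # 3232235520 3232239615
```

Storing the bounds as integers is what makes the BETWEEN-style range query
possible at all, but as noted above it still forces a scan over the range
rows rather than an indexed point lookup.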
> Also, in certain scenarios (a high number of distinct IPs), the cache
> would not have a significant effect on script execution performance.
>
> 2. Having a DB-based cache would persist the data even across a restart,
> and the data-fetching query would search for a specific value (not a
> range query as against the Max-Mind DB). But the downside is that on a
> cache miss there would be a minimum of three DB queries (one for the
> cache table lookup, one for the Max-Mind DB lookup, and one to persist
> the result in the cache).
>
> That is why we have initiated this thread: to finalize the caching
> approach we should take.
>
> Thanks,
> Janaka
>
>> Thanks,
>> sanjeewa.
>>
>> On Fri, Mar 4, 2016 at 3:12 PM, Tharindu Dharmarathna
>> <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> We are going to implement a client-IP-based geo-location graph in API
>>> Manager Analytics. Having gone through the options in [1], we selected
>>> [2] as the most suitable approach.
>>>
>>> *Overview of Max-Mind's DB*
>>>
>>> As per the structure of the DB (attached as an image), there are two
>>> tables which combine to give the location:
>>>
>>> Find the geoname_id according to the network, then get the Country and
>>> City from the locations table.
>>>
>>> *Limitations*
>>>
>>> With their database dump, we cannot look up an IP in those tables
>>> directly. We need to check whether the given IP is between the
>>> network's min and max IPs. This query takes a long time (10 seconds
>>> even on indexed data). If we do this directly from the Spark script
>>> for each and every IP in the summary table (regardless of whether the
>>> same IP appears in multiple rows), it will query the tables
>>> repeatedly. Therefore this will have a performance impact on the
>>> graph.
>>>
>>> *Solution*
>>>
>>> 1. Implement an LRU cache of IP address vs. location.
>>>
>>> This will need to be implemented in a custom UDF in Spark.
>>> If the IP queried from Spark is available in the cache, the location
>>> is returned from there; if not, it is retrieved from the DB and put
>>> into the cache.
>>>
>>> 2. Persist in a table
>>>
>>> Use the IP as the primary key, with Country and City as other columns,
>>> and retrieve data from that table.
>>>
>>> Please feel free to suggest the most suitable way of doing this.
>>>
>>> [1] - Implementing Geographical based Analytics in API Manager mail
>>> thread.
>>> [2] - http://dev.maxmind.com/geoip/geoip2/geolite2/
>>>
>>> *Thanks*
>>>
>>> *Tharindu Dharmarathna*
>>> Associate Software Engineer
>>> WSO2 Inc.; http://wso2.com
>>> lean.enterprise.middleware
>>>
>>> mobile: +94779109091
>>
>> --
>> *Sanjeewa Malalgoda*
>> WSO2 Inc.
>> Mobile: +94713068779
>> blog: http://sanjeewamalalgoda.blogspot.com/
>
> --
> *Janaka Ranabahu*
> Associate Technical Lead, WSO2 Inc.
> http://wso2.com
>
> E-mail: [email protected]
> M: +94 718370861
>
> Lean . Enterprise . Middleware

--
Sachith Withana
Software Engineer; WSO2 Inc.; http://wso2.com
E-mail: sachith AT wso2.com
M: +94715518127
Linked-In: https://lk.linkedin.com/in/sachithwithana
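Putting the proposals in this thread together, the lookup path could be
sketched as below. This is a rough, hypothetical sketch, not the production
code: an in-memory SQLite database stands in for the real datastore, and all
table and column names (`geo_blocks`, `ip_cache`) are invented for
illustration. It layers a small in-memory LRU cache (solution 1) in front of
a DB cache table keyed by IP (solution 2), falling back to the slow range
scan on a miss and stopping at the first match, as Sachith suggested:

```python
import sqlite3
from collections import OrderedDict

# In-memory stand-in for the real datastore; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE geo_blocks (start_ip INTEGER, end_ip INTEGER, location TEXT)")
conn.execute("CREATE TABLE ip_cache (ip INTEGER PRIMARY KEY, location TEXT)")

# One pre-computed range row, as produced by the one-time table alteration
# (192.168.0.0/20 -> integers 3232235520..3232239615).
conn.execute("INSERT INTO geo_blocks VALUES (?, ?, ?)",
             (3232235520, 3232239615, "Colombo, Sri Lanka"))

def ip_to_int(ip):
    """Convert a dotted-quad IPv4 address to its integer form."""
    a, b, c, d = (int(x) for x in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

LRU_CAPACITY = 10000            # illustrative size for the in-memory cache
lru = OrderedDict()             # insertion-ordered dict used as an LRU cache

def lookup(ip):
    key = ip_to_int(ip)
    # 1. In-memory LRU hit: cheapest path, lost on restart.
    if key in lru:
        lru.move_to_end(key)
        return lru[key]
    # 2. DB cache table hit: a primary-key point lookup, survives restarts.
    row = conn.execute("SELECT location FROM ip_cache WHERE ip = ?",
                       (key,)).fetchone()
    if row is None:
        # 3. Full miss: the slow range scan, stopped at the first match,
        #    then persisted so the next miss for this IP is a point lookup.
        row = conn.execute(
            "SELECT location FROM geo_blocks "
            "WHERE ? BETWEEN start_ip AND end_ip LIMIT 1", (key,)).fetchone()
        if row is not None:
            conn.execute("INSERT OR IGNORE INTO ip_cache VALUES (?, ?)",
                         (key, row[0]))
    location = row[0] if row else None
    lru[key] = location
    if len(lru) > LRU_CAPACITY:
        lru.popitem(last=False)  # evict the least-recently-used entry
    return location

print(lookup("192.168.0.1"))    # Colombo, Sri Lanka (via range scan)
print(lookup("192.168.0.1"))    # Colombo, Sri Lanka (via LRU cache)
```

The three numbered steps mirror the three-query worst case Janaka describes
for a cache miss; the LRU layer only reduces how often the DB is touched at
all, while the `ip_cache` table is what keeps misses cheap across restarts.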
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
