Hi all,

I think what Sachith suggests also makes sense, but I am also rooting for the
in-memory cache implementation suggested by Sanjeewa, with the IP/netmask
approach.
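To give a concrete picture of what such a cache could look like, here is a rough sketch (IPv4-only; the class and method names are hypothetical, not from any of our code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: an in-memory cache keyed by (network, netmask) instead of one entry
// per IP. A single entry then covers a whole CIDR block, so 192.168.0.0/20
// needs one entry rather than up to 4094 per-host entries.
public class GeoNetworkCache {

    private static class Entry {
        final int network;   // network address as a 32-bit int
        final int mask;      // netmask as a 32-bit int
        final String location;
        Entry(int network, int mask, String location) {
            this.network = network;
            this.mask = mask;
            this.location = location;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Loaded once per row of the MaxMind blocks table,
    // e.g. put("192.168.0.0", 20, "Colombo, Sri Lanka").
    public void put(String networkIp, int prefixLength, String location) {
        int mask = prefixLength == 0 ? 0 : -1 << (32 - prefixLength);
        entries.add(new Entry(toInt(networkIp), mask, location));
    }

    // One bitwise AND per entry; a match means the IP is inside that network.
    public String lookup(String ip) {
        int addr = toInt(ip);
        for (Entry e : entries) {
            if ((addr & e.mask) == e.network) {
                return e.location;
            }
        }
        return null; // cache miss -> fall back to the DB
    }

    private static int toInt(String dotted) {
        int value = 0;
        for (String octet : dotted.split("\\.")) {
            value = (value << 8) | Integer.parseInt(octet);
        }
        return value;
    }

    public static void main(String[] args) {
        GeoNetworkCache cache = new GeoNetworkCache();
        cache.put("192.168.0.0", 20, "Colombo, Sri Lanka");
        System.out.println(cache.lookup("192.168.15.254")); // inside the /20
        System.out.println(cache.lookup("10.0.0.1"));       // outside -> null
    }
}
```

The linear scan is obviously the naive version; if the number of networks grows large we could sort by prefix length or use a trie, but even the scan avoids the per-host entry explosion described later in the thread.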
Please find my comments inline.

On 5 March 2016 at 23:50, Sachith Withana <[email protected]> wrote:

> Hi all,
>
> From what I understand/was told, this happens once a day (or relatively
> infrequently), and you want to avoid searching through all the geo data
> per IP (since you are grouping the requests by IP).
>
> If that's the case, it would be better to use a separate DB table to cache
> these data (IP, geoID, etc.) with the IP as the primary key (which would
> improve the lookup time); even though there will be cache misses, the
> cache-miss ratio (#misses/#hits) would drop over time.
>
> Having a DB cache would be better, since you do want to persist these data
> to be used over time.
>
> BTW, on a cache miss, if we can figure out a way to limit the search range
> on the original table, or at least stop the search once a match is found,
> it would greatly improve the cache-miss time as well.
>
> That's my two cents.
>
> Cheers,
> Sachith
>
> On Sun, Mar 6, 2016 at 8:24 AM, Janaka Ranabahu <[email protected]> wrote:
>
>> Hi Sanjeewa,
>>
>> On Sun, Mar 6, 2016 at 7:25 AM, Sanjeewa Malalgoda <[email protected]> wrote:
>>
>>> Implementing a cache is better than having another table mapping, IMO.
>>> What if we query the database and keep the IP range and network name in
>>> memory? Then we could do a quick search on the network name, and based
>>> on that, the rest could be loaded some other way.
>>> WDYT?
>>>
>> We thought of having an in-memory cache, but we faced several issues
>> along the way. Let me explain the situation as it stands now.
>>
>> The MaxMind DB stores IP addresses as an IP plus a netmask.
>> Ex: 192.168.0.0/20
>>
>> The calculation of the IP address range would be as follows.
>>
>> Address:   192.168.0.1          11000000.10101000.0000 0000.00000001
>> Netmask:   255.255.240.0 = 20   11111111.11111111.1111 0000.00000000
>> Wildcard:  0.0.15.255           00000000.00000000.0000 1111.11111111
>> =>
>> Network:   192.168.0.0/20       11000000.10101000.0000 0000.00000000 (Class C)
>> Broadcast: 192.168.15.255       11000000.10101000.0000 1111.11111111
>> HostMin:   192.168.0.1          11000000.10101000.0000 0000.00000001
>> HostMax:   192.168.15.254       11000000.10101000.0000 1111.11111110
>> Hosts/Net: 4094                 (Private Internet <http://www.ietf.org/rfc/rfc1918.txt>)
>>
>> Therefore, what we are currently doing is calculating the start and end
>> IPs for all the values in the MaxMind DB and altering the tables with
>> those values up front (this is a one-time step). When the Spark script
>> executes, we check whether the given IP is between any of the start and
>> end ranges in the tables. That is why it takes a long time to fetch
>> results for a given IP.
>>
>> As a solution for this, we discussed what Tharindu has mentioned:
>> 1. Have an in-memory caching mechanism.
>> 2. Have a DB-based caching mechanism.
>>
>> The only point we have to highlight is that in both of the above
>> mechanisms we need to cache the plain IP address (not the IP/netmask as
>> in the MaxMind DB) against the geo location.
>>
>> Ex:
>> For 192.168.0.1    - Colombo, Sri Lanka
>> For 192.168.15.254 - Colombo, Sri Lanka
>>
>> So, as per the example above, if there are requests from all 4094
>> possible addresses, we will be caching each IP against the geo location
>> (since introducing range queries in a cache is not good practice).
>>

Since we are implementing a custom cache, won't we be doing a bitwise
operation for the lookup, with the netmask and the network IP? So basically,
we would keep the network IP and the netmask in the cache and simply do a
bitwise AND to determine whether it is a match or not, right?
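Something along these lines is what I have in mind (just a toy sketch of the check, using the /20 block from the example above):

```java
// Toy illustration: does 192.168.3.7 fall inside 192.168.0.0/20?
// The /20 netmask is 255.255.240.0; one bitwise AND answers the question.
public class NetmaskCheck {
    public static void main(String[] args) {
        int ip      = (192 << 24) | (168 << 16) | (3 << 8) | 7; // 192.168.3.7
        int mask    = 0xFFFFF000;                               // 255.255.240.0
        int network = (192 << 24) | (168 << 16);                // 192.168.0.0
        System.out.println((ip & mask) == network);             // prints "true"
    }
}
```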
I am thinking such an operation would not incur much of a performance hit,
and it would not be as prohibitive as a normal range query in a cache. If
that is the case, I think we can go with the approach suggested by Sanjeewa.
WDYT?

>> Please find my comments on both approaches.
>>
>> 1. Having an in-memory cache would speed things up, but based on the IPs
>> in the data set, there could be a number of entries for IPs in the same
>> range. One problem with this approach is that, if there is a server
>> restart, the initial script execution would take a lot of time. Also, in
>> certain scenarios (a high number of distinct IPs) the cache would not
>> have a significant effect on script execution performance.
>>
>> 2. Having a DB-based cache would persist the data even across a restart,
>> and the data-fetching query would search for a specific value (not a
>> range query, as against the MaxMind DB). The downside is that a cache
>> miss would cost a minimum of three DB queries (one for the cache-table
>> lookup, one for the MaxMind DB lookup, and one to persist the new cache
>> entry).
>>
>> That is why we initiated this thread: to finalize the caching approach we
>> should take.
>>
>> Thanks,
>> Janaka
>>
>>> Thanks,
>>> sanjeewa.
>>>

Thanks,
Lasantha

>>> On Fri, Mar 4, 2016 at 3:12 PM, Tharindu Dharmarathna <[email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We are going to implement a client-IP-based geo-location graph in API
>>>> Manager Analytics. Of the approaches discussed in [1], we selected [2]
>>>> as the most suitable one.
>>>>
>>>> *Overview of MaxMind's DB*
>>>>
>>>> As per the structure of the DB (image attached), there are two tables
>>>> that together give the location: find the geoname_id for the network,
>>>> then get the country and city from the locations table.
>>>>
>>>> *Limitations*
>>>>
>>>> Since the data comes as a database dump, we cannot look up an IP
>>>> directly from those tables.
>>>> We need to check whether the given IP is between the network's min and
>>>> max IPs. This query takes a long time (about 10 seconds even on indexed
>>>> data). If we do this directly from the Spark script for each and every
>>>> IP in the summary table (even when the same IP appears in several
>>>> rows), it will query the tables every time. This will therefore have a
>>>> performance impact on this graph.
>>>>
>>>> *Solutions*
>>>>
>>>> 1. Implement an LRU cache of IP address vs. location.
>>>>
>>>> This would need to be implemented in a custom Spark UDF. If the IP
>>>> being queried from Spark is available in the cache, the location is
>>>> served from the cache; if not, it is retrieved from the DB and put into
>>>> the cache.
>>>>
>>>> 2. Persist in a table.
>>>>
>>>> Use the IP as the primary key, with country and city as other columns,
>>>> and retrieve the data from that table.
>>>>
>>>> Please feel free to suggest the most suitable way of doing this.
>>>>
>>>> [1] - "Implementing Geographical based Analytics in API Manager" mail thread
>>>> [2] - http://dev.maxmind.com/geoip/geoip2/geolite2/
>>>>
>>>> Thanks,
>>>>
>>>> Tharindu Dharmarathna
>>>> Associate Software Engineer
>>>> WSO2 Inc.; http://wso2.com
>>>> lean.enterprise.middleware
>>>> mobile: +94779109091
>>>
>>> --
>>> Sanjeewa Malalgoda
>>> WSO2 Inc.
>>> Mobile: +94713068779
>>> blog: http://sanjeewamalalgoda.blogspot.com/
>>
>> --
>> Janaka Ranabahu
>> Associate Technical Lead, WSO2 Inc.
>> http://wso2.com
>> E-mail: [email protected]
>> M: +94 718370861
>>
>> Lean . Enterprise . Middleware
>
> --
> Sachith Withana
> Software Engineer; WSO2 Inc.; http://wso2.com
> E-mail: sachith AT wso2.com
> M: +94715518127
> Linked-In: https://lk.linkedin.com/in/sachithwithana

--
Lasantha Fernando
Senior Software Engineer - Data Technologies Team
WSO2 Inc. http://wso2.com

email: [email protected]
mobile: (+94) 71 5247551
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
