Hi Sanjeewa,

On Sun, Mar 6, 2016 at 7:25 AM, Sanjeewa Malalgoda <[email protected]> wrote:
> Implementing cache is better than having another table mapping IMO. What
> if we query database and keep IP range and network name in memory.
> Then we may do quick search on network name and then based on that rest
> can load some other way.
> WDYT?

We thought of having an in-memory cache, but we faced several issues along the way. Let me explain the situation as it stands now.

The MaxMind DB stores IP addresses as a network plus a netmask, e.g. 192.168.0.0/20. The IP address range for such a block is calculated as follows:

Address:   192.168.0.1          11000000.10101000.0000 0000.00000001
Netmask:   255.255.240.0 = 20   11111111.11111111.1111 0000.00000000
Wildcard:  0.0.15.255           00000000.00000000.0000 1111.11111111
=>
Network:   192.168.0.0/20       11000000.10101000.0000 0000.00000000 (Class C)
Broadcast: 192.168.15.255       11000000.10101000.0000 1111.11111111
HostMin:   192.168.0.1          11000000.10101000.0000 0000.00000001
HostMax:   192.168.15.254       11000000.10101000.0000 1111.11111110
Hosts/Net: 4094                 (Private Internet <http://www.ietf.org/rfc/rfc1918.txt>)

Therefore, what we currently do is calculate the start and end IP for every entry in the MaxMind DB and populate the tables with those values up front (a one-time operation). When the Spark script executes, we check whether the given IP falls between any of the start and end values in those tables. That range check is why it takes so long to fetch results for a given IP.

As a solution, we discussed what Tharindu has mentioned:

1. Have an in-memory caching mechanism.
2. Have a DB-based caching mechanism.

The one point we have to highlight is that in both mechanisms we need to cache the individual IP address (not the IP/netmask pair as stored in the MaxMind DB) against the geo-location.
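The start/end computation above can be sketched as follows; this is just an illustration of the one-time precalculation step, using Python's ipaddress module (the real implementation runs against the MaxMind tables, and the helper name is made up):

```python
import ipaddress

def host_range(cidr):
    """Hypothetical helper: first/last assignable host and host count for a CIDR block."""
    net = ipaddress.ip_network(cidr)
    host_min = net.network_address + 1    # e.g. 192.168.0.1 for 192.168.0.0/20
    host_max = net.broadcast_address - 1  # e.g. 192.168.15.254
    # Hosts/Net excludes the network and broadcast addresses.
    return str(host_min), str(host_max), net.num_addresses - 2

print(host_range("192.168.0.0/20"))
# ('192.168.0.1', '192.168.15.254', 4094)
```

In practice the start and end would be stored as integers (e.g. `int(net.network_address)`) so the Spark-side lookup can be a plain BETWEEN comparison.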
Ex:

For 192.168.0.1    - Colombo, Sri Lanka
For 192.168.15.254 - Colombo, Sri Lanka

So, per the example above, if there are requests from all 4094 possible addresses, we will end up caching each IP against its geo-location (since introducing range queries in a cache is not good practice).

Please find my comments on both approaches.

1. An in-memory cache would speed things up, but depending on the IPs in the data set there could be many entries for IPs in the same range. One problem with this approach is that after a server restart, the initial script execution would take a lot of time. Also, in some scenarios (a high number of distinct IPs) the cache would not significantly improve script execution performance.

2. A DB-based cache would persist the data across restarts, and the fetch query would search for a specific value (not a range query as against the MaxMind DB). The downside is that a cache miss costs a minimum of three DB queries: one for the cache table lookup, one for the MaxMind DB lookup, and one to persist the result into the cache.

That is why we initiated this thread: to finalize the caching approach we should take.

Thanks,
Janaka

> Thanks,
> sanjeewa.
>
> On Fri, Mar 4, 2016 at 3:12 PM, Tharindu Dharmarathna <[email protected]>
> wrote:
>
>> Hi All,
>>
>> We are going to implement a client-IP-based geo-location graph in API
>> Manager Analytics. Having gone through the options in [1], we selected
>> [2] as the most suitable one.
>>
>> *Overview of MaxMind's DB*
>>
>> As per the structure of the DB (attached as an image), two tables work
>> together to resolve a location: find the geoname_id for the network,
>> then get the country and city from the locations table.
>>
>> *Limitations*
>>
>> With their database dump we cannot process an IP directly from those
>> tables; we need to check whether the given IP falls between a
>> network's min and max IP.
>> That query takes a long time (around 10 seconds, even on indexed
>> data). If we run it directly from the Spark script for every IP in the
>> summary table (even when the same IP appears in several rows), each
>> row triggers a query against the tables, which hurts the performance
>> of this graph.
>>
>> *Solution*
>>
>> 1. Implement an LRU cache of IP address vs. location.
>>
>> This needs to be implemented as a custom UDF in Spark. If the IP
>> queried from Spark is available in the cache, the location is served
>> from the cache; if not, it is retrieved from the DB and put into the
>> cache.
>>
>> 2. Persist in a table.
>>
>> Use the IP as the primary key, with country and city as other columns,
>> and retrieve the data from that table.
>>
>> Please feel free to suggest the most suitable way of doing this.
>>
>> [1] - "Implementing Geographical based Analytics in API Manager" mail
>> thread.
>>
>> [2] - http://dev.maxmind.com/geoip/geoip2/geolite2/
>>
>>
>> *Thanks*
>>
>> *Tharindu Dharmarathna*
>> Associate Software Engineer
>> WSO2 Inc.; http://wso2.com
>> lean.enterprise.middleware
>>
>> mobile: *+94779109091*
>
>
> --
>
> *Sanjeewa Malalgoda*
> WSO2 Inc.
> Mobile : +94713068779
>
> blog: http://sanjeewamalalgoda.blogspot.com/


--

*Janaka Ranabahu*
Associate Technical Lead, WSO2 Inc.
http://wso2.com

*E-mail: [email protected]* *M: +94 718370861*
Lean . Enterprise . Middleware
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
