Hi all,

I think what Sachith suggests also makes sense, but I am also rooting for the
in-memory cache implementation suggested by Sanjeewa, with the IP/netmask
approach.
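To give a concrete picture of what such a cache could look like, here is a rough sketch (IPv4-only; the class and method names are hypothetical, not from any of our code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: an in-memory cache keyed by (network, netmask) instead of one entry
// per IP. A single entry then covers a whole CIDR block, so 192.168.0.0/20
// needs one entry rather than up to 4094 per-host entries.
public class GeoNetworkCache {

    private static class Entry {
        final int network;   // network address as a 32-bit int
        final int mask;      // netmask as a 32-bit int
        final String location;
        Entry(int network, int mask, String location) {
            this.network = network;
            this.mask = mask;
            this.location = location;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Loaded once per row of the MaxMind blocks table,
    // e.g. put("192.168.0.0", 20, "Colombo, Sri Lanka").
    public void put(String networkIp, int prefixLength, String location) {
        int mask = prefixLength == 0 ? 0 : -1 << (32 - prefixLength);
        entries.add(new Entry(toInt(networkIp), mask, location));
    }

    // One bitwise AND per entry; a match means the IP is inside that network.
    public String lookup(String ip) {
        int addr = toInt(ip);
        for (Entry e : entries) {
            if ((addr & e.mask) == e.network) {
                return e.location;
            }
        }
        return null; // cache miss -> fall back to the DB
    }

    private static int toInt(String dotted) {
        int value = 0;
        for (String octet : dotted.split("\\.")) {
            value = (value << 8) | Integer.parseInt(octet);
        }
        return value;
    }

    public static void main(String[] args) {
        GeoNetworkCache cache = new GeoNetworkCache();
        cache.put("192.168.0.0", 20, "Colombo, Sri Lanka");
        System.out.println(cache.lookup("192.168.15.254")); // inside the /20
        System.out.println(cache.lookup("10.0.0.1"));       // outside -> null
    }
}
```

The linear scan is obviously the naive version; if the number of networks grows large we could sort by prefix length or use a trie, but even the scan avoids the per-host entry explosion described later in the thread.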
Please find my comments inline.

On 5 March 2016 at 23:50, Sachith Withana <[email protected]> wrote:

> Hi all,
>
> From what I understand/was told, this happens once a day (or relatively
> infrequently), and you want to avoid searching through all the geo data
> per IP (since you are grouping the requests by IP).
>
> If that's the case, it would be better to use a separate DB table to cache
> these data (IP, geoID, etc.) with the IP as the primary key (which would
> improve the lookup time); even though there will be cache misses, the
> cache-miss ratio (#misses/#hits) would drop over time.
>
> Having a DB cache would be better, since you do want to persist these data
> to be used over time.
>
> BTW, on a cache miss, if we can figure out a way to limit the search range
> on the original table, or at least stop the search once a match is found,
> it would greatly improve the cache-miss time as well.
>
> That's my two cents.
>
> Cheers,
> Sachith
>
> On Sun, Mar 6, 2016 at 8:24 AM, Janaka Ranabahu <[email protected]> wrote:
>
>> Hi Sanjeewa,
>>
>> On Sun, Mar 6, 2016 at 7:25 AM, Sanjeewa Malalgoda <[email protected]> wrote:
>>
>>> Implementing a cache is better than having another table mapping, IMO.
>>> What if we query the database and keep the IP range and network name in
>>> memory? Then we could do a quick search on the network name, and based
>>> on that, the rest could be loaded some other way.
>>> WDYT?
>>>
>> We thought of having an in-memory cache, but we faced several issues
>> along the way. Let me explain the situation as it stands now.
>>
>> The MaxMind DB stores IP addresses as an IP plus a netmask.
>> Ex: 192.168.0.0/20
>>
>> The calculation of the IP address range would be as follows.
>>
>> Address:   192.168.0.1          11000000.10101000.0000 0000.00000001
>> Netmask:   255.255.240.0 = 20   11111111.11111111.1111 0000.00000000
>> Wildcard:  0.0.15.255           00000000.00000000.0000 1111.11111111
>> =>
>> Network:   192.168.0.0/20       11000000.10101000.0000 0000.00000000 (Class C)
>> Broadcast: 192.168.15.255       11000000.10101000.0000 1111.11111111
>> HostMin:   192.168.0.1          11000000.10101000.0000 0000.00000001
>> HostMax:   192.168.15.254       11000000.10101000.0000 1111.11111110
>> Hosts/Net: 4094                 (Private Internet <http://www.ietf.org/rfc/rfc1918.txt>)
>>
>> Therefore, what we are currently doing is calculating the start and end
>> IPs for all the values in the MaxMind DB and altering the tables with
>> those values up front (this is a one-time step). When the Spark script
>> executes, we check whether the given IP is between any of the start and
>> end ranges in the tables. That is why it takes a long time to fetch
>> results for a given IP.
>>
>> As a solution for this, we discussed what Tharindu has mentioned:
>> 1. Have an in-memory caching mechanism.
>> 2. Have a DB-based caching mechanism.
>>
>> The only point we have to highlight is that in both of the above
>> mechanisms we need to cache the plain IP address (not the IP/netmask as
>> in the MaxMind DB) against the geo location.
>>
>> Ex:
>> For 192.168.0.1    - Colombo, Sri Lanka
>> For 192.168.15.254 - Colombo, Sri Lanka
>>
>> So, as per the example above, if there are requests from all 4094
>> possible addresses, we will be caching each IP against the geo location
>> (since introducing range queries in a cache is not good practice).
>>

Since we are implementing a custom cache, won't we be doing a bitwise
operation for the lookup, with the netmask and the network IP? So basically,
we would keep the network IP and the netmask in the cache and simply do a
bitwise AND to determine whether it is a match or not, right?
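Something along these lines is what I have in mind (just a toy sketch of the check, using the /20 block from the example above):

```java
// Toy illustration: does 192.168.3.7 fall inside 192.168.0.0/20?
// The /20 netmask is 255.255.240.0; one bitwise AND answers the question.
public class NetmaskCheck {
    public static void main(String[] args) {
        int ip      = (192 << 24) | (168 << 16) | (3 << 8) | 7; // 192.168.3.7
        int mask    = 0xFFFFF000;                               // 255.255.240.0
        int network = (192 << 24) | (168 << 16);                // 192.168.0.0
        System.out.println((ip & mask) == network);             // prints "true"
    }
}
```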
I am thinking such an operation would not incur much of a performance hit,
and it would not be as prohibitive as a normal range query in a cache. If
that is the case, I think we can go with the approach suggested by Sanjeewa.
WDYT?

>> Please find my comments on both approaches.
>>
>> 1. Having an in-memory cache would speed things up, but based on the IPs
>> in the data set, there could be a number of entries for IPs in the same
>> range. One problem with this approach is that, if there is a server
>> restart, the initial script execution would take a lot of time. Also, in
>> certain scenarios (a high number of distinct IPs) the cache would not
>> have a significant effect on script execution performance.
>>
>> 2. Having a DB-based cache would persist the data even across a restart,
>> and the data-fetching query would search for a specific value (not a
>> range query, as against the MaxMind DB). The downside is that a cache
>> miss would cost a minimum of three DB queries (one for the cache-table
>> lookup, one for the MaxMind DB lookup, and one to persist the new cache
>> entry).
>>
>> That is why we initiated this thread: to finalize the caching approach we
>> should take.
>>
>> Thanks,
>> Janaka
>>
>>> Thanks,
>>> sanjeewa.
>>>

Thanks,
Lasantha

>>> On Fri, Mar 4, 2016 at 3:12 PM, Tharindu Dharmarathna <[email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We are going to implement a client-IP-based geo-location graph in API
>>>> Manager Analytics. Of the approaches discussed in [1], we selected [2]
>>>> as the most suitable one.
>>>>
>>>> *Overview of MaxMind's DB*
>>>>
>>>> As per the structure of the DB (image attached), there are two tables
>>>> that together give the location: find the geoname_id for the network,
>>>> then get the country and city from the locations table.
>>>>
>>>> *Limitations*
>>>>
>>>> Since the data comes as a database dump, we cannot look up an IP
>>>> directly from those tables.
>>>> We need to check whether the given IP is between the network's min and
>>>> max IPs. This query takes a long time (about 10 seconds even on indexed
>>>> data). If we do this directly from the Spark script for each and every
>>>> IP in the summary table (even when the same IP appears in several
>>>> rows), it will query the tables every time. This will therefore have a
>>>> performance impact on this graph.
>>>>
>>>> *Solutions*
>>>>
>>>> 1. Implement an LRU cache of IP address vs. location.
>>>>
>>>> This would need to be implemented in a custom Spark UDF. If the IP
>>>> being queried from Spark is available in the cache, the location is
>>>> served from the cache; if not, it is retrieved from the DB and put into
>>>> the cache.
>>>>
>>>> 2. Persist in a table.
>>>>
>>>> Use the IP as the primary key, with country and city as other columns,
>>>> and retrieve the data from that table.
>>>>
>>>> Please feel free to suggest the most suitable way of doing this.
>>>>
>>>> [1] - "Implementing Geographical based Analytics in API Manager" mail thread
>>>> [2] - http://dev.maxmind.com/geoip/geoip2/geolite2/
>>>>
>>>> Thanks,
>>>>
>>>> Tharindu Dharmarathna
>>>> Associate Software Engineer
>>>> WSO2 Inc.; http://wso2.com
>>>> lean.enterprise.middleware
>>>> mobile: +94779109091
>>>
>>> --
>>> Sanjeewa Malalgoda
>>> WSO2 Inc.
>>> Mobile: +94713068779
>>> blog: http://sanjeewamalalgoda.blogspot.com/
>>
>> --
>> Janaka Ranabahu
>> Associate Technical Lead, WSO2 Inc.
>> http://wso2.com
>> E-mail: [email protected]
>> M: +94 718370861
>>
>> Lean . Enterprise . Middleware
>
> --
> Sachith Withana
> Software Engineer; WSO2 Inc.; http://wso2.com
> E-mail: sachith AT wso2.com
> M: +94715518127
> Linked-In: https://lk.linkedin.com/in/sachithwithana

--
Lasantha Fernando
Senior Software Engineer - Data Technologies Team
WSO2 Inc. http://wso2.com

email: [email protected]
mobile: (+94) 71 5247551
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
