Hi Tharindu, On 7 March 2016 at 21:10, Sajith Ravindra <[email protected]> wrote:
> > 2. Having a DB based cache would persist the data even on a restart and >> the data fetching query would be searching for an specific value(not a >> range query as against the max-mind DB). But the downside is that for a >> cache miss there would be minimum 3 DB queries (one for the cache table >> lookup and one for the max-mind db lookup and one for the >> cache persistence). >> > > In order to avoid expensive cache misses we may eagerly populate the DB > table cache. i.e. When there's a cache miss we do the lookup in max-mind db > and then add multiple entries for multiple IPs of that netwokrk_cid to the > Cache DB table instead of only for that particular IP. That way we reduce > the chance of cache miss being very expensive, as we increase the chance of > it being found on the first DB lookup. > > We might need to do some evaluation to determine how much entries that we > are going to add to the DB cache for IP belongs to a particular > netwokrk_cid. For an example if requests from a certain netwokrk_cidr is > frequent we may want to add more entries with compared to a less frequent > netwokrk_cidr. > > The downside is the DB cache tend to be more big. > > Thanks > *,Sajith Ravindra* > Senior Software Engineer > WSO2 Inc.; http://wso2.com > lean.enterprise.middleware > > mobile: +94 77 2273550 > blog: http://sajithr.blogspot.com/ > <http://lk.linkedin.com/pub/shani-ranasinghe/34/111/ab> > > On Mon, Mar 7, 2016 at 4:37 AM, Tharindu Dharmarathna <[email protected]> > wrote: > >> Hi Lasantha, >> >> Upto now we are doing the following way in order to get the geo location >> from the stated dump. >> >> 1. two columns added filled with long value of lower and upper value of >> network ip addresses. Then get the geoname_id with respect to the long >> value for the given ip which between this above long values. Hope you will >> got this idea on our approach. Is there any way to do bit wise operation in >> order to get the network_cidr value ? . >> > Can't we do it by keeping the network IP and the subnet as two columns and the geoname_id as the third. Say for example, if 192.168.0.0/20 is the cidr, for IPv4 routing what is usually done is we get the IP as int, then do a bitwise AND with the subnet mask (e.g. if subnet mask is 20, that would mean 20 bits with value 1 and remaining 12 bits of value 0, i.e. 11111111 11111111 11110000 00000) and check whether that returns the network IP. You might find more info here [1]. I think there should be libraries that wrap this operation. But if performance is a concern and we need to keep the cache search implementation very lean, we can implement it ourselves. WDYT? [1] http://stackoverflow.com/questions/4209760/validate-an-ip-address-with-mask Thanks, Lasantha >> Thanks >> Tharindu >> >> On Mon, Mar 7, 2016 at 12:05 AM, Lasantha Fernando <[email protected]> >> wrote: >> >>> Hi all, >>> >>> I think what Sachith suggests also makes sense. But am also rooting for >>> the in-memory cache implementation suggested by Sanjeewa with ip-netmask >>> approach. >>> >>> Please find my comments inline. >>> >>> On 5 March 2016 at 23:50, Sachith Withana <[email protected]> wrote: >>> >>>> Hi all, >>>> >>>> From what I understand/was told, this happens once a day ( or >>>> relatively infrequently), and you wanna avoid searching through all the geo >>>> data per ip ( since you are grouping the requests by IP). >>>> >>>> IF that's the case, it would be better to use a separate DB table to >>>> cache these data ( IP, geoID ..etc) with the IP being the primary key ( >>>> which would improve the lookup time), and even though there will be cache >>>> misses, it would eventually reduce the (#cacheMisses/ Hits). >>>> >>>> Having a DB cache would be better since you do want to persist these >>>> data to be used over time. >>>> >>>> BTW in a cache miss, if we can figure out a way to limit the search >>>> range on the original table or at least stop the search once a match is >>>> found, it would greatly improve the cache miss time as well. >>>> >>>> That's my two cents. >>>> >>>> Cheers, >>>> Sachith >>>> >>>> On Sun, Mar 6, 2016 at 8:24 AM, Janaka Ranabahu <[email protected]> >>>> wrote: >>>> >>>>> Hi Sanjeewa, >>>>> >>>>> On Sun, Mar 6, 2016 at 7:25 AM, Sanjeewa Malalgoda <[email protected]> >>>>> wrote: >>>>> >>>>>> Implementing cache is better than having another table mapping IMO. >>>>>> What if we query database and keep IP range and network name in memory. >>>>>> Then we may do quick search on network name and then based on that >>>>>> rest can load some other way. >>>>>> WDYT? >>>>>> >>>>> We thought of having an in memory cache but we faced several issues >>>>> along the way. Let me explain the situation as it is per now. >>>>> >>>>> The Max-Mind DB has the IP addresses with the IP and the netmask. >>>>> Ex: 192.168.0.0/20 >>>>> >>>>> The calculation of the IP address range would be like the following. >>>>> >>>>> Address: 192.168.0.1 11000000.10101000.0000 0000.00000001 >>>>> Netmask: 255.255.240.0 = 20 11111111.11111111.1111 0000.00000000 >>>>> Wildcard: 0.0.15.255 00000000.00000000.0000 1111.11111111 >>>>> =>Network: 192.168.0.0/20 11000000.10101000.0000 0000.00000000 >>>>> (Class C) >>>>> Broadcast: 192.168.15.255 11000000.10101000.0000 1111.11111111 >>>>> HostMin: 192.168.0.1 11000000.10101000.0000 0000.00000001 >>>>> HostMax: 192.168.15.254 11000000.10101000.0000 1111.11111110 >>>>> Hosts/Net: 4094 (Private Internet >>>>> <http://www.ietf.org/rfc/rfc1918.txt>) >>>>> >>>>> >>>>> Therefore what we are currently doing is to calculate the start and >>>>> end IP for all the values in the max-mind DB and alter the tables with >>>>> those values initially(this is a one time thing that will happen). When >>>>> the >>>>> Spark script executes, we check whether the given IP is between any of the >>>>> start and end ranges in the tables. That is the reason why it is taking a >>>>> long time to fetch results for a given IP. >>>>> >>>>> As a solution for this, we discussed what Tharindu has mentioned. >>>>> 1. Have a in memory caching mechanism. >>>>> 2. Have a DB based caching mechanism. >>>>> >>>>> The only point that we have to highlight is the fact that in both the >>>>> above mechanisms we need to cache the IP address(not the ip-netmask as it >>>>> was in the max-mind db) against the Geo location. >>>>> >>>>> Ex:- >>>>> For 192.168.0.1 - Colombo, Sri Lanka >>>>> For 192.168.15.254 - Colombo, Sri Lanka >>>>> >>>>> So as per the above example I took, if there are requests form all the >>>>> possible 4094 address we will be caching each IP with the Geo >>>>> location(since introducing range queries in a cache is not a good >>>>> practice). >>>>> >>>> >>> Since we are implementing a custom cache, won't we be doing a bitwise >>> operation for the lookup with netmask and network IP? So basically, we >>> would keep the network IP and the netmask in cache and simply do a bitwise >>> AND to determine whether it is a match or not, right? Am thinking such an >>> operation would not incur much of a performance hit and it won't be as >>> prohibitive as a normal range query in a cache. If that is the case, I >>> think we can go with the approach suggested by Sanjeewa. >>> >>> WDYT? >>> >>> >>>>> Please find my comments about both the approaches. >>>>> >>>>> 1. Having an in-memory cache would speedup things but based on the IPs >>>>> in the data set, there could be number of entries for IPs in the same >>>>> range. One problem with this approach is that, if there is a server >>>>> restart, the initial script execution would take a lots of time. Also >>>>> based >>>>> on certain scenarios(high number of different IPs) the cache would not >>>>> have >>>>> a significant effect on script execution performance. >>>>> >>>>> 2. Having a DB based cache would persist the data even on a restart >>>>> and the data fetching query would be searching for an specific value(not a >>>>> range query as against the max-mind DB). But the downside is that for a >>>>> cache miss there would be minimum 3 DB queries (one for the cache table >>>>> lookup and one for the max-mind db lookup and one for the >>>>> cache persistence). >>>>> >>>>> That is why we have initiated this thread to finalize the caching >>>>> approach we should take. >>>>> >>>>> Thanks, >>>>> Janaka >>>>> >>>>> >>>>> >>>>>> Thanks, >>>>>> sanjeewa. >>>>>> >>>>> >>> Thanks, >>> Lasantha >>> >>> >>>> >>>>>> On Fri, Mar 4, 2016 at 3:12 PM, Tharindu Dharmarathna < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> Hi All, >>>>>>> >>>>>>> We are going to implement Client IP based Geo-location Graph in API >>>>>>> Manager Analytics. When we go through the ways of doing in [1] , we >>>>>>> selected [2] as the most suitable way to do. >>>>>>> >>>>>>> >>>>>>> *Overview of max-mind's DB.* >>>>>>> >>>>>>> As the structure of the db (attached in image), They have two tables >>>>>>> which incorporate to get the location. >>>>>>> >>>>>>> Find geoname_id according to network and get Country,City from >>>>>>> locations table. >>>>>>> >>>>>>> *Limitations* >>>>>>> >>>>>>> As their database dump we couldn't directly process the ip from >>>>>>> those tables. We need to check the given ip is in between the network >>>>>>> min >>>>>>> and max ip. This query get some long time (10 seconds in indexed data). >>>>>>> If >>>>>>> we directly do this from spark script for each and every ip which in >>>>>>> summary table (regardless if ip is same from two row data) will query >>>>>>> from >>>>>>> the tables. Therefore this will incur the performance impact on this >>>>>>> graph. >>>>>>> >>>>>>> *Solution* >>>>>>> >>>>>>> 1. Implement LRU cache against ip address vs location. >>>>>>> >>>>>>> This will need to implement on custom UDF in Spark. If ip querying >>>>>>> from spark available in cache it will give the location from it , IF it >>>>>>> is >>>>>>> not It will retrieve from DB and put into the cache. >>>>>>> >>>>>>> 2. Persist in a Table >>>>>>> >>>>>>> ip as the primary key and Country and city as other columns and >>>>>>> retrieve data from that table. >>>>>>> >>>>>>> >>>>>>> Please feel free to give us the most suitable way of doing this >>>>>>> solution?. >>>>>>> >>>>>>> [1] - Implementing Geographical based Analytics in API Manager mail >>>>>>> thread. >>>>>>> >>>>>>> [2] - http://dev.maxmind.com/geoip/geoip2/geolite2/ >>>>>>> >>>>>>> >>>>>>> *Thanks* >>>>>>> >>>>>>> *Tharindu Dharmarathna* >>>>>>> Associate Software Engineer >>>>>>> WSO2 Inc.; http://wso2.com >>>>>>> lean.enterprise.middleware >>>>>>> >>>>>>> mobile: *+94779109091 <%2B94779109091>* >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> >>>>>> *Sanjeewa Malalgoda* >>>>>> WSO2 Inc. >>>>>> Mobile : +94713068779 >>>>>> >>>>>> <http://sanjeewamalalgoda.blogspot.com/>blog >>>>>> :http://sanjeewamalalgoda.blogspot.com/ >>>>>> <http://sanjeewamalalgoda.blogspot.com/> >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> *Janaka Ranabahu* >>>>> Associate Technical Lead, WSO2 Inc. >>>>> http://wso2.com >>>>> >>>>> >>>>> *E-mail: [email protected] <http://wso2.com>**M: **+94 718370861 >>>>> <%2B94%20718370861>* >>>>> >>>>> Lean . Enterprise . Middleware >>>>> >>>> >>>> >>>> >>>> -- >>>> Sachith Withana >>>> Software Engineer; WSO2 Inc.; http://wso2.com >>>> E-mail: sachith AT wso2.com >>>> M: +94715518127 >>>> Linked-In: <http://goog_416592669> >>>> https://lk.linkedin.com/in/sachithwithana >>>> >>>> _______________________________________________ >>>> Architecture mailing list >>>> [email protected] >>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>> >>>> >>> >>> >>> -- >>> *Lasantha Fernando* >>> Senior Software Engineer - Data Technologies Team >>> WSO2 Inc. http://wso2.com >>> >>> email: [email protected] >>> mobile: (+94) 71 5247551 >>> >>> _______________________________________________ >>> Architecture mailing list >>> [email protected] >>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>> >>> >> >> >> -- >> >> *Tharindu Dharmarathna*Associate Software Engineer >> WSO2 Inc.; http://wso2.com >> lean.enterprise.middleware >> >> mobile: *+94779109091 <%2B94779109091>* >> >> _______________________________________________ >> Architecture mailing list >> [email protected] >> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >> >> > -- *Lasantha Fernando* Senior Software Engineer - Data Technologies Team WSO2 Inc. http://wso2.com email: [email protected] mobile: (+94) 71 5247551
_______________________________________________ Architecture mailing list [email protected] https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
