Hi All, We have come across following ways to do the above task after the Initial POC.
1. Using File type database which given by max-mind (.mmdb) and use there database readers. >From this approach we got lesser value to get the location from the above using JAX-RS service which is used to wrap the above database. This JAX-RS implementation is by default used the max-mind's Cache implementation which can find from [1] . *Limitations* - Hosting of the Jax-RS app in another server. - # of http calls will high. 2. Call query server as above thread and cached the location with ip. Here you can find the execution time for a single query which get for each method. *Method 1 : 4.5 seconds* *Method 2: 4.76 seconds* Thanks Tharindu On Tue, Mar 8, 2016 at 8:29 AM, Lasantha Fernando <[email protected]> wrote: > Hi Tharindu, > > On 7 March 2016 at 21:10, Sajith Ravindra <[email protected]> wrote: > >> >> 2. Having a DB based cache would persist the data even on a restart and >>> the data fetching query would be searching for an specific value(not a >>> range query as against the max-mind DB). But the downside is that for a >>> cache miss there would be minimum 3 DB queries (one for the cache table >>> lookup and one for the max-mind db lookup and one for the >>> cache persistence). >>> >> >> In order to avoid expensive cache misses we may eagerly populate the DB >> table cache. i.e. When there's a cache miss we do the lookup in max-mind db >> and then add multiple entries for multiple IPs of that netwokrk_cid to the >> Cache DB table instead of only for that particular IP. That way we reduce >> the chance of cache miss being very expensive, as we increase the chance of >> it being found on the first DB lookup. >> >> We might need to do some evaluation to determine how much entries that we >> are going to add to the DB cache for IP belongs to a particular >> netwokrk_cid. For an example if requests from a certain netwokrk_cidr is >> frequent we may want to add more entries with compared to a less frequent >> netwokrk_cidr. >> >> The downside is the DB cache tend to be more big. >> >> Thanks >> *,Sajith Ravindra* >> Senior Software Engineer >> WSO2 Inc.; http://wso2.com >> lean.enterprise.middleware >> >> mobile: +94 77 2273550 >> blog: http://sajithr.blogspot.com/ >> <http://lk.linkedin.com/pub/shani-ranasinghe/34/111/ab> >> >> On Mon, Mar 7, 2016 at 4:37 AM, Tharindu Dharmarathna <[email protected] >> > wrote: >> >>> Hi Lasantha, >>> >>> Upto now we are doing the following way in order to get the geo location >>> from the stated dump. >>> >>> 1. two columns added filled with long value of lower and upper value of >>> network ip addresses. Then get the geoname_id with respect to the long >>> value for the given ip which between this above long values. Hope you will >>> got this idea on our approach. Is there any way to do bit wise operation in >>> order to get the network_cidr value ? . >>> >> > Can't we do it by keeping the network IP and the subnet as two columns and > the geoname_id as the third. Say for example, if 192.168.0.0/20 is the > cidr, for IPv4 routing what is usually done is we get the IP as int, then > do a bitwise AND with the subnet mask (e.g. if subnet mask is 20, that > would mean 20 bits with value 1 and remaining 12 bits of value 0, i.e. > 11111111 11111111 11110000 00000) and check whether that returns the > network IP. > > You might find more info here [1]. I think there should be libraries that > wrap this operation. But if performance is a concern and we need to keep > the cache search implementation very lean, we can implement it ourselves. > > WDYT? > > [1] > http://stackoverflow.com/questions/4209760/validate-an-ip-address-with-mask > > Thanks, > Lasantha > > >>> Thanks >>> Tharindu >>> >>> On Mon, Mar 7, 2016 at 12:05 AM, Lasantha Fernando <[email protected]> >>> wrote: >>> >>>> Hi all, >>>> >>>> I think what Sachith suggests also makes sense. But am also rooting for >>>> the in-memory cache implementation suggested by Sanjeewa with ip-netmask >>>> approach. >>>> >>>> Please find my comments inline. >>>> >>>> On 5 March 2016 at 23:50, Sachith Withana <[email protected]> wrote: >>>> >>>>> Hi all, >>>>> >>>>> From what I understand/was told, this happens once a day ( or >>>>> relatively infrequently), and you wanna avoid searching through all the >>>>> geo >>>>> data per ip ( since you are grouping the requests by IP). >>>>> >>>>> IF that's the case, it would be better to use a separate DB table to >>>>> cache these data ( IP, geoID ..etc) with the IP being the primary key ( >>>>> which would improve the lookup time), and even though there will be cache >>>>> misses, it would eventually reduce the (#cacheMisses/ Hits). >>>>> >>>>> Having a DB cache would be better since you do want to persist these >>>>> data to be used over time. >>>>> >>>>> BTW in a cache miss, if we can figure out a way to limit the search >>>>> range on the original table or at least stop the search once a match is >>>>> found, it would greatly improve the cache miss time as well. >>>>> >>>>> That's my two cents. >>>>> >>>>> Cheers, >>>>> Sachith >>>>> >>>>> On Sun, Mar 6, 2016 at 8:24 AM, Janaka Ranabahu <[email protected]> >>>>> wrote: >>>>> >>>>>> Hi Sanjeewa, >>>>>> >>>>>> On Sun, Mar 6, 2016 at 7:25 AM, Sanjeewa Malalgoda <[email protected] >>>>>> > wrote: >>>>>> >>>>>>> Implementing cache is better than having another table mapping IMO. >>>>>>> What if we query database and keep IP range and network name in memory. >>>>>>> Then we may do quick search on network name and then based on that >>>>>>> rest can load some other way. >>>>>>> WDYT? >>>>>>> >>>>>> We thought of having an in memory cache but we faced several issues >>>>>> along the way. Let me explain the situation as it is per now. >>>>>> >>>>>> The Max-Mind DB has the IP addresses with the IP and the netmask. >>>>>> Ex: 192.168.0.0/20 >>>>>> >>>>>> The calculation of the IP address range would be like the following. >>>>>> >>>>>> Address: 192.168.0.1 11000000.10101000.0000 0000.00000001 >>>>>> Netmask: 255.255.240.0 = 20 11111111.11111111.1111 0000.00000000 >>>>>> Wildcard: 0.0.15.255 00000000.00000000.0000 1111.11111111 >>>>>> =>Network: 192.168.0.0/20 11000000.10101000.0000 0000.00000000 >>>>>> (Class C) >>>>>> Broadcast: 192.168.15.255 11000000.10101000.0000 1111.11111111 >>>>>> HostMin: 192.168.0.1 11000000.10101000.0000 0000.00000001 >>>>>> HostMax: 192.168.15.254 11000000.10101000.0000 1111.11111110 >>>>>> Hosts/Net: 4094 (Private Internet >>>>>> <http://www.ietf.org/rfc/rfc1918.txt>) >>>>>> >>>>>> >>>>>> Therefore what we are currently doing is to calculate the start and >>>>>> end IP for all the values in the max-mind DB and alter the tables with >>>>>> those values initially(this is a one time thing that will happen). When >>>>>> the >>>>>> Spark script executes, we check whether the given IP is between any of >>>>>> the >>>>>> start and end ranges in the tables. That is the reason why it is taking a >>>>>> long time to fetch results for a given IP. >>>>>> >>>>>> As a solution for this, we discussed what Tharindu has mentioned. >>>>>> 1. Have a in memory caching mechanism. >>>>>> 2. Have a DB based caching mechanism. >>>>>> >>>>>> The only point that we have to highlight is the fact that in both the >>>>>> above mechanisms we need to cache the IP address(not the ip-netmask as it >>>>>> was in the max-mind db) against the Geo location. >>>>>> >>>>>> Ex:- >>>>>> For 192.168.0.1 - Colombo, Sri Lanka >>>>>> For 192.168.15.254 - Colombo, Sri Lanka >>>>>> >>>>>> So as per the above example I took, if there are requests form all >>>>>> the possible 4094 address we will be caching each IP with the Geo >>>>>> location(since introducing range queries in a cache is not a good >>>>>> practice). >>>>>> >>>>> >>>> Since we are implementing a custom cache, won't we be doing a bitwise >>>> operation for the lookup with netmask and network IP? So basically, we >>>> would keep the network IP and the netmask in cache and simply do a bitwise >>>> AND to determine whether it is a match or not, right? Am thinking such an >>>> operation would not incur much of a performance hit and it won't be as >>>> prohibitive as a normal range query in a cache. If that is the case, I >>>> think we can go with the approach suggested by Sanjeewa. >>>> >>>> WDYT? >>>> >>>> >>>>>> Please find my comments about both the approaches. >>>>>> >>>>>> 1. Having an in-memory cache would speedup things but based on the >>>>>> IPs in the data set, there could be number of entries for IPs in the same >>>>>> range. One problem with this approach is that, if there is a server >>>>>> restart, the initial script execution would take a lots of time. Also >>>>>> based >>>>>> on certain scenarios(high number of different IPs) the cache would not >>>>>> have >>>>>> a significant effect on script execution performance. >>>>>> >>>>>> 2. Having a DB based cache would persist the data even on a restart >>>>>> and the data fetching query would be searching for an specific value(not >>>>>> a >>>>>> range query as against the max-mind DB). But the downside is that for a >>>>>> cache miss there would be minimum 3 DB queries (one for the cache table >>>>>> lookup and one for the max-mind db lookup and one for the >>>>>> cache persistence). >>>>>> >>>>>> That is why we have initiated this thread to finalize the caching >>>>>> approach we should take. >>>>>> >>>>>> Thanks, >>>>>> Janaka >>>>>> >>>>>> >>>>>> >>>>>>> Thanks, >>>>>>> sanjeewa. >>>>>>> >>>>>> >>>> Thanks, >>>> Lasantha >>>> >>>> >>>>> >>>>>>> On Fri, Mar 4, 2016 at 3:12 PM, Tharindu Dharmarathna < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Hi All, >>>>>>>> >>>>>>>> We are going to implement Client IP based Geo-location Graph in API >>>>>>>> Manager Analytics. When we go through the ways of doing in [1] , we >>>>>>>> selected [2] as the most suitable way to do. >>>>>>>> >>>>>>>> >>>>>>>> *Overview of max-mind's DB.* >>>>>>>> >>>>>>>> As the structure of the db (attached in image), They have two >>>>>>>> tables which incorporate to get the location. >>>>>>>> >>>>>>>> Find geoname_id according to network and get Country,City from >>>>>>>> locations table. >>>>>>>> >>>>>>>> *Limitations* >>>>>>>> >>>>>>>> As their database dump we couldn't directly process the ip from >>>>>>>> those tables. We need to check the given ip is in between the network >>>>>>>> min >>>>>>>> and max ip. This query get some long time (10 seconds in indexed >>>>>>>> data). If >>>>>>>> we directly do this from spark script for each and every ip which in >>>>>>>> summary table (regardless if ip is same from two row data) will query >>>>>>>> from >>>>>>>> the tables. Therefore this will incur the performance impact on this >>>>>>>> graph. >>>>>>>> >>>>>>>> *Solution* >>>>>>>> >>>>>>>> 1. Implement LRU cache against ip address vs location. >>>>>>>> >>>>>>>> This will need to implement on custom UDF in Spark. If ip querying >>>>>>>> from spark available in cache it will give the location from it , IF >>>>>>>> it is >>>>>>>> not It will retrieve from DB and put into the cache. >>>>>>>> >>>>>>>> 2. Persist in a Table >>>>>>>> >>>>>>>> ip as the primary key and Country and city as other columns and >>>>>>>> retrieve data from that table. >>>>>>>> >>>>>>>> >>>>>>>> Please feel free to give us the most suitable way of doing this >>>>>>>> solution?. >>>>>>>> >>>>>>>> [1] - Implementing Geographical based Analytics in API Manager mail >>>>>>>> thread. >>>>>>>> >>>>>>>> [2] - http://dev.maxmind.com/geoip/geoip2/geolite2/ >>>>>>>> >>>>>>>> >>>>>>>> *Thanks* >>>>>>>> >>>>>>>> *Tharindu Dharmarathna* >>>>>>>> Associate Software Engineer >>>>>>>> WSO2 Inc.; http://wso2.com >>>>>>>> lean.enterprise.middleware >>>>>>>> >>>>>>>> mobile: *+94779109091 <%2B94779109091>* >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> *Sanjeewa Malalgoda* >>>>>>> WSO2 Inc. >>>>>>> Mobile : +94713068779 >>>>>>> >>>>>>> <http://sanjeewamalalgoda.blogspot.com/>blog >>>>>>> :http://sanjeewamalalgoda.blogspot.com/ >>>>>>> <http://sanjeewamalalgoda.blogspot.com/> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> *Janaka Ranabahu* >>>>>> Associate Technical Lead, WSO2 Inc. >>>>>> http://wso2.com >>>>>> >>>>>> >>>>>> *E-mail: [email protected] <http://wso2.com>**M: **+94 718370861 >>>>>> <%2B94%20718370861>* >>>>>> >>>>>> Lean . Enterprise . Middleware >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Sachith Withana >>>>> Software Engineer; WSO2 Inc.; http://wso2.com >>>>> E-mail: sachith AT wso2.com >>>>> M: +94715518127 >>>>> Linked-In: <http://goog_416592669> >>>>> https://lk.linkedin.com/in/sachithwithana >>>>> >>>>> _______________________________________________ >>>>> Architecture mailing list >>>>> [email protected] >>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>> >>>>> >>>> >>>> >>>> -- >>>> *Lasantha Fernando* >>>> Senior Software Engineer - Data Technologies Team >>>> WSO2 Inc. http://wso2.com >>>> >>>> email: [email protected] >>>> mobile: (+94) 71 5247551 >>>> >>>> _______________________________________________ >>>> Architecture mailing list >>>> [email protected] >>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>> >>>> >>> >>> >>> -- >>> >>> *Tharindu Dharmarathna*Associate Software Engineer >>> WSO2 Inc.; http://wso2.com >>> lean.enterprise.middleware >>> >>> mobile: *+94779109091 <%2B94779109091>* >>> >>> _______________________________________________ >>> Architecture mailing list >>> [email protected] >>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>> >>> >> > > > -- > *Lasantha Fernando* > Senior Software Engineer - Data Technologies Team > WSO2 Inc. http://wso2.com > > email: [email protected] > mobile: (+94) 71 5247551 > > _______________________________________________ > Architecture mailing list > [email protected] > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture > > -- *Tharindu Dharmarathna*Associate Software Engineer WSO2 Inc.; http://wso2.com lean.enterprise.middleware mobile: *+94779109091*
_______________________________________________ Architecture mailing list [email protected] https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
