Hi Tharindu,

Great work. Can we run a performance test on this and share the results? Basically, we need to check how long a script takes to execute.
Thanks,
Janaka

On Thu, Mar 17, 2016 at 2:37 PM, Tharindu Dharmarathna <[email protected]> wrote:

> Hi All,
>
> Following the discussion above, we have implemented the pluggable
> user-defined extension point. With this configuration we can write our own
> implementation to get the country and state for a given IP.
>
> *Caching Implementation*
>
> We defined two levels of caching, as below.
>
> When an IP address is checked from the *UDF*, it first checks the cache for
> the location information. If it is not in the cache, it checks another
> database that contains a direct IP-to-location mapping, as *Sajith*
> mentioned. If it is there, it returns the location and caches it. If the
> location is not in that database, the IP is checked against the *MAXMIND*
> database, and the location is stored in the cache and in the above table.
>
> Thanks
> Tharindu
>
> On Tue, Mar 8, 2016 at 2:34 PM, Tharindu Dharmarathna <[email protected]>
> wrote:
>
>> Hi All,
>>
>> After the initial POC, we have come across the following ways to do the
>> above task.
>>
>> 1. Use the file-type database provided by max-mind (.mmdb) with their
>> database readers.
>>
>> With this approach we got a lower lookup time, using a JAX-RS service
>> that wraps the above database. By default, this JAX-RS implementation
>> uses max-mind's cache implementation, which can be found at [1].
>>
>> *Limitations*
>>
>> - The JAX-RS app has to be hosted on another server.
>> - The number of HTTP calls will be high.
>>
>> 2. Query the server as discussed earlier in this thread, and cache the
>> location against the IP.
>>
>> Here is the execution time for a single query with each method.
>>
>> *Method 1: 4.5 seconds*
>>
>> *Method 2: 4.76 seconds*
>>
>> Thanks
>> Tharindu
>>
>> On Tue, Mar 8, 2016 at 8:29 AM, Lasantha Fernando <[email protected]>
>> wrote:
>>
>>> Hi Tharindu,
>>>
>>> On 7 March 2016 at 21:10, Sajith Ravindra <[email protected]> wrote:
>>>
>>>>> 2. Having a DB based cache would persist the data even on a restart,
>>>>> and the data-fetching query would search for a specific value (not a
>>>>> range query, as against the max-mind DB). But the downside is that a
>>>>> cache miss would need a minimum of 3 DB queries (one for the cache
>>>>> table lookup, one for the max-mind DB lookup, and one for the cache
>>>>> persistence).
>>>>
>>>> To avoid expensive cache misses we may eagerly populate the DB table
>>>> cache, i.e. when there is a cache miss we do the lookup in the max-mind
>>>> DB and then add entries for multiple IPs of that network_cidr to the
>>>> cache DB table, instead of only for that particular IP. That way we
>>>> reduce the chance of a cache miss being very expensive, as we increase
>>>> the chance of the IP being found on the first DB lookup.
>>>>
>>>> We might need to do some evaluation to determine how many entries we
>>>> are going to add to the DB cache for IPs belonging to a particular
>>>> network_cidr. For example, if requests from a certain network_cidr are
>>>> frequent, we may want to add more entries than for a less frequent
>>>> network_cidr.
>>>>
>>>> The downside is that the DB cache tends to grow bigger.
>>>>
>>>> Thanks
>>>> *Sajith Ravindra*
>>>> Senior Software Engineer
>>>> WSO2 Inc.; http://wso2.com
>>>> lean.enterprise.middleware
>>>>
>>>> mobile: +94 77 2273550
>>>> blog: http://sajithr.blogspot.com/
>>>> <http://lk.linkedin.com/pub/shani-ranasinghe/34/111/ab>
>>>>
>>>> On Mon, Mar 7, 2016 at 4:37 AM, Tharindu Dharmarathna
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Lasantha,
>>>>>
>>>>> Up to now we have been doing the following to get the geo location
>>>>> from the stated dump.
>>>>>
>>>>> 1. Two columns were added, filled with the long values of the lower
>>>>> and upper network IP addresses. We then get the geoname_id for the row
>>>>> where the long value of the given IP falls between those two values.
>>>>> Hope this gives you an idea of our approach. Is there any way to do a
>>>>> bitwise operation to get the network_cidr value?
>>>>>
>>>
>>> Can't we do it by keeping the network IP and the subnet as two columns
>>> and the geoname_id as the third? Say, for example, 192.168.0.0/20 is the
>>> CIDR: for IPv4 routing, what is usually done is we get the IP as an int,
>>> then do a bitwise AND with the subnet mask (e.g. if the subnet mask is
>>> 20, that would mean 20 bits with value 1 and the remaining 12 bits with
>>> value 0, i.e. 11111111 11111111 11110000 00000000) and check whether
>>> that returns the network IP.
>>>
>>> You might find more info here [1]. I think there should be libraries
>>> that wrap this operation. But if performance is a concern and we need to
>>> keep the cache search implementation very lean, we can implement it
>>> ourselves.
>>>
>>> WDYT?
>>>
>>> [1]
>>> http://stackoverflow.com/questions/4209760/validate-an-ip-address-with-mask
>>>
>>> Thanks,
>>> Lasantha
>>>
>>>>> Thanks
>>>>> Tharindu
>>>>>
>>>>> On Mon, Mar 7, 2016 at 12:05 AM, Lasantha Fernando <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I think what Sachith suggests also makes sense. But I am also rooting
>>>>>> for the in-memory cache implementation suggested by Sanjeewa with the
>>>>>> ip-netmask approach.
>>>>>>
>>>>>> Please find my comments inline.
>>>>>>
>>>>>> On 5 March 2016 at 23:50, Sachith Withana <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> From what I understand/was told, this happens once a day (or
>>>>>>> relatively infrequently), and you want to avoid searching through
>>>>>>> all the geo data per IP (since you are grouping the requests by IP).
>>>>>>>
>>>>>>> If that's the case, it would be better to use a separate DB table to
>>>>>>> cache these data (IP, geoID, etc.) with the IP being the primary key
>>>>>>> (which would improve the lookup time), and even though there will be
>>>>>>> cache misses, it would eventually reduce the ratio
>>>>>>> (#cacheMisses/#hits).
>>>>>>>
>>>>>>> Having a DB cache would be better since you do want to persist these
>>>>>>> data to be used over time.
>>>>>>>
>>>>>>> BTW in a cache miss, if we can figure out a way to limit the search
>>>>>>> range on the original table, or at least stop the search once a
>>>>>>> match is found, it would greatly improve the cache miss time as well.
>>>>>>>
>>>>>>> That's my two cents.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Sachith
>>>>>>>
>>>>>>> On Sun, Mar 6, 2016 at 8:24 AM, Janaka Ranabahu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Sanjeewa,
>>>>>>>>
>>>>>>>> On Sun, Mar 6, 2016 at 7:25 AM, Sanjeewa Malalgoda
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Implementing a cache is better than having another mapping table,
>>>>>>>>> IMO. What if we query the database and keep the IP range and
>>>>>>>>> network name in memory? Then we can do a quick search on the
>>>>>>>>> network name, and based on that the rest can be loaded some other
>>>>>>>>> way.
>>>>>>>>> WDYT?
>>>>>>>>>
>>>>>>>> We thought of having an in-memory cache, but we faced several
>>>>>>>> issues along the way. Let me explain the situation as it is now.
>>>>>>>>
>>>>>>>> The Max-Mind DB stores the IP addresses as the IP and the netmask.
>>>>>>>> Ex: 192.168.0.0/20
>>>>>>>>
>>>>>>>> The calculation of the IP address range would be like the following.
>>>>>>>>
>>>>>>>> Address:   192.168.0.1         11000000.10101000.0000 0000.00000001
>>>>>>>> Netmask:   255.255.240.0 = 20  11111111.11111111.1111 0000.00000000
>>>>>>>> Wildcard:  0.0.15.255          00000000.00000000.0000 1111.11111111
>>>>>>>> =>
>>>>>>>> Network:   192.168.0.0/20      11000000.10101000.0000 0000.00000000 (Class C)
>>>>>>>> Broadcast: 192.168.15.255      11000000.10101000.0000 1111.11111111
>>>>>>>> HostMin:   192.168.0.1         11000000.10101000.0000 0000.00000001
>>>>>>>> HostMax:   192.168.15.254      11000000.10101000.0000 1111.11111110
>>>>>>>> Hosts/Net: 4094                (Private Internet
>>>>>>>> <http://www.ietf.org/rfc/rfc1918.txt>)
>>>>>>>>
>>>>>>>> Therefore, what we are currently doing is calculating the start and
>>>>>>>> end IP for all the values in the max-mind DB and altering the tables
>>>>>>>> with those values initially (this is a one-time operation). When the
>>>>>>>> Spark script executes, we check whether the given IP is between any
>>>>>>>> of the start and end ranges in the tables. That is why it takes a
>>>>>>>> long time to fetch results for a given IP.
>>>>>>>>
>>>>>>>> As a solution for this, we discussed what Tharindu has mentioned.
>>>>>>>> 1. Have an in-memory caching mechanism.
>>>>>>>> 2. Have a DB based caching mechanism.
>>>>>>>>
>>>>>>>> The only point we have to highlight is that in both of the above
>>>>>>>> mechanisms we need to cache the IP address (not the ip-netmask as it
>>>>>>>> is in the max-mind DB) against the geo location.
>>>>>>>>
>>>>>>>> Ex:
>>>>>>>> For 192.168.0.1 - Colombo, Sri Lanka
>>>>>>>> For 192.168.15.254 - Colombo, Sri Lanka
>>>>>>>>
>>>>>>>> So, as per the example above, if there are requests from all of the
>>>>>>>> possible 4094 addresses, we will be caching each IP with the geo
>>>>>>>> location (since introducing range queries in a cache is not a good
>>>>>>>> practice).
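The range calculation in the table above, and the start/end check that the Spark script performs against the two added columns, can be reproduced with a short sketch. This is only an illustration in Python using the standard `ipaddress` module; the function names `cidr_to_range` and `in_range` are assumptions made for this example and are not part of the implementation discussed in the thread.

```python
import ipaddress


def cidr_to_range(cidr):
    """Return the first and last address of a CIDR block as integers.

    These correspond to the two "long" columns (lower and upper network
    address) that were added to the tables.
    """
    network = ipaddress.ip_network(cidr)
    return int(network.network_address), int(network.broadcast_address)


def in_range(ip, start, end):
    """Range check equivalent to a BETWEEN query on the two long columns."""
    return start <= int(ipaddress.ip_address(ip)) <= end


start, end = cidr_to_range("192.168.0.0/20")
print(start, end)                              # 3232235520 3232239615
print(in_range("192.168.15.254", start, end))  # True
print(in_range("192.168.16.1", start, end))    # False
```

The block spans 4096 addresses; excluding the network and broadcast addresses leaves the 4094 hosts shown in the table.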
>>>>>>>>
>>>>>>
>>>>>> Since we are implementing a custom cache, won't we be doing a bitwise
>>>>>> operation for the lookup with the netmask and network IP? So
>>>>>> basically, we would keep the network IP and the netmask in the cache
>>>>>> and simply do a bitwise AND to determine whether it is a match or not,
>>>>>> right? I am thinking such an operation would not incur much of a
>>>>>> performance hit, and it won't be as prohibitive as a normal range
>>>>>> query in a cache. If that is the case, I think we can go with the
>>>>>> approach suggested by Sanjeewa.
>>>>>>
>>>>>> WDYT?
>>>>>>
>>>>>>>> Please find my comments about both the approaches.
>>>>>>>>
>>>>>>>> 1. Having an in-memory cache would speed things up, but depending on
>>>>>>>> the IPs in the data set, there could be a number of entries for IPs
>>>>>>>> in the same range. One problem with this approach is that, if there
>>>>>>>> is a server restart, the initial script execution would take a lot
>>>>>>>> of time. Also, in certain scenarios (a high number of different IPs)
>>>>>>>> the cache would not have a significant effect on script execution
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> 2. Having a DB based cache would persist the data even on a restart,
>>>>>>>> and the data-fetching query would search for a specific value (not a
>>>>>>>> range query, as against the max-mind DB). But the downside is that a
>>>>>>>> cache miss would need a minimum of 3 DB queries (one for the cache
>>>>>>>> table lookup, one for the max-mind DB lookup, and one for the cache
>>>>>>>> persistence).
>>>>>>>>
>>>>>>>> That is why we have initiated this thread to finalize the caching
>>>>>>>> approach we should take.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Janaka
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> sanjeewa.
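The netmask comparison Lasantha describes above (keep the network IP and the netmask in the cache, then do a bitwise AND on lookup) could look roughly like the following. This is a minimal sketch in Python, not the implementation from the thread; `NetworkCache`, `prefix_mask` and `ip_to_int` are hypothetical names, and it assumes IPv4 only.

```python
import ipaddress


def ip_to_int(ip):
    """Convert a dotted IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


def prefix_mask(prefix_len):
    """Netmask for a prefix length, e.g. 20 -> 0xFFFFF000 (20 one-bits)."""
    return (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF


class NetworkCache:
    """In-memory cache keyed by network address, grouped by prefix length."""

    def __init__(self):
        self._by_prefix = {}  # prefix_len -> {network_int: location}

    def put(self, network_ip, prefix_len, location):
        networks = self._by_prefix.setdefault(prefix_len, {})
        networks[ip_to_int(network_ip)] = location

    def get(self, ip):
        addr = ip_to_int(ip)
        # Try the most specific (longest) prefixes first.
        for prefix_len in sorted(self._by_prefix, reverse=True):
            # Bitwise AND with the mask yields the candidate network address.
            network = addr & prefix_mask(prefix_len)
            location = self._by_prefix[prefix_len].get(network)
            if location is not None:
                return location
        return None


cache = NetworkCache()
cache.put("192.168.0.0", 20, "Colombo, Sri Lanka")
print(cache.get("192.168.15.254"))  # Colombo, Sri Lanka
print(cache.get("192.168.16.1"))    # None
```

With this layout a lookup costs one AND and one dictionary probe per distinct prefix length held in the cache, rather than a range scan, and a single entry covers all 4094 hosts of the /20 in the example.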
>>>>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Lasantha
>>>>>>
>>>>>>>>> On Fri, Mar 4, 2016 at 3:12 PM, Tharindu Dharmarathna
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> We are going to implement a client IP based geo-location graph in
>>>>>>>>>> API Manager Analytics. After going through the options in [1], we
>>>>>>>>>> selected [2] as the most suitable one.
>>>>>>>>>>
>>>>>>>>>> *Overview of max-mind's DB*
>>>>>>>>>>
>>>>>>>>>> As the structure of the DB shows (attached in image), it has two
>>>>>>>>>> tables that are combined to get the location.
>>>>>>>>>>
>>>>>>>>>> Find the geoname_id according to the network, and get the country
>>>>>>>>>> and city from the locations table.
>>>>>>>>>>
>>>>>>>>>> *Limitations*
>>>>>>>>>>
>>>>>>>>>> With their database dump we cannot look up the IP in those tables
>>>>>>>>>> directly. We need to check whether the given IP is between the
>>>>>>>>>> network min and max IP. This query takes a long time (10 seconds
>>>>>>>>>> on indexed data). If we do this directly from the Spark script,
>>>>>>>>>> every IP in the summary table (even when the same IP appears in
>>>>>>>>>> two rows) will be queried against the tables. This will therefore
>>>>>>>>>> have a performance impact on this graph.
>>>>>>>>>>
>>>>>>>>>> *Solution*
>>>>>>>>>>
>>>>>>>>>> 1. Implement an LRU cache of IP address against location.
>>>>>>>>>>
>>>>>>>>>> This will need to be implemented in a custom UDF in Spark. If the
>>>>>>>>>> IP queried from Spark is available in the cache, the location is
>>>>>>>>>> returned from it; if it is not, it is retrieved from the DB and
>>>>>>>>>> put into the cache.
>>>>>>>>>>
>>>>>>>>>> 2. Persist in a table
>>>>>>>>>>
>>>>>>>>>> Use the IP as the primary key, with country and city as the other
>>>>>>>>>> columns, and retrieve data from that table.
>>>>>>>>>>
>>>>>>>>>> Please let us know the most suitable way of implementing this
>>>>>>>>>> solution.
>>>>>>>>>>
>>>>>>>>>> [1] - Implementing Geographical based Analytics in API Manager
>>>>>>>>>> mail thread.
>>>>>>>>>>
>>>>>>>>>> [2] - http://dev.maxmind.com/geoip/geoip2/geolite2/
>>>>>>>>>>
>>>>>>>>>> *Thanks*
>>>>>>>>>>
>>>>>>>>>> *Tharindu Dharmarathna*
>>>>>>>>>> Associate Software Engineer
>>>>>>>>>> WSO2 Inc.; http://wso2.com
>>>>>>>>>> lean.enterprise.middleware
>>>>>>>>>>
>>>>>>>>>> mobile: +94779109091
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Sanjeewa Malalgoda*
>>>>>>>>> WSO2 Inc.
>>>>>>>>> Mobile: +94713068779
>>>>>>>>> blog: http://sanjeewamalalgoda.blogspot.com/
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Janaka Ranabahu*
>>>>>>>> Associate Technical Lead, WSO2 Inc.
>>>>>>>> http://wso2.com
>>>>>>>>
>>>>>>>> E-mail: [email protected]  M: +94 718370861
>>>>>>>>
>>>>>>>> Lean . Enterprise . Middleware
>>>>>>>
>>>>>>> --
>>>>>>> Sachith Withana
>>>>>>> Software Engineer; WSO2 Inc.; http://wso2.com
>>>>>>> E-mail: sachith AT wso2.com
>>>>>>> M: +94715518127
>>>>>>> Linked-In: https://lk.linkedin.com/in/sachithwithana
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Architecture mailing list
>>>>>>> [email protected]
>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>
>>>>>> --
>>>>>> *Lasantha Fernando*
>>>>>> Senior Software Engineer - Data Technologies Team
>>>>>> WSO2 Inc. http://wso2.com
>>>>>>
>>>>>> email: [email protected]
>>>>>> mobile: (+94) 71 5247551
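Solution 1 in the original message of 4 March (an LRU cache of IP address against location, consulted before the slow DB query) can be sketched as below. This is an illustrative Python version only; the actual UDF would live inside Spark, and `LruLocationCache`, its `loader` callback and the `slow_lookup` stand-in are hypothetical names introduced here, not taken from the thread.

```python
from collections import OrderedDict


class LruLocationCache:
    """Least-recently-used cache of IP -> location with a fallback loader."""

    def __init__(self, loader, capacity=10000):
        self._loader = loader          # called on a cache miss, e.g. the DB query
        self._capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def get(self, ip):
        if ip in self._entries:
            # Mark the entry as most recently used.
            self._entries.move_to_end(ip)
            self.hits += 1
            return self._entries[ip]
        self.misses += 1
        location = self._loader(ip)
        self._entries[ip] = location
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return location


def slow_lookup(ip):
    # Stand-in for the range query against the max-mind tables.
    return "Colombo, Sri Lanka"


cache = LruLocationCache(slow_lookup, capacity=2)
cache.get("192.168.0.1")         # miss: falls through to slow_lookup
cache.get("192.168.0.1")         # hit: served from memory
print(cache.hits, cache.misses)  # 1 1
```

The hit and miss counters are included because the ratio between them is what decides whether the cache helps; with a high number of distinct IPs, as Janaka notes, most lookups would still reach the database.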
_______________________________________________ Architecture mailing list [email protected] https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
