Lochana, let's try to reuse this UDF for AS analytics too.

On Thu, Mar 17, 2016 at 2:43 PM, Janaka Ranabahu <[email protected]> wrote:
> Hi Tharindu,
>
> Great work. Can we do a performance test of this and share the results?
> Basically, what we need to check is how much time a script would take
> to execute.
>
> Thanks,
> Janaka
>
> On Thu, Mar 17, 2016 at 2:37 PM, Tharindu Dharmarathna <[email protected]> wrote:
>
>> Hi All,
>>
>> After going through the above discussion, we have implemented the
>> pluggable user-defined extension point. With this configuration we can
>> write our own implementation to get the country and state of a given IP.
>>
>> *Caching Implementation*
>>
>> We define two levels of caching, as below.
>>
>> When an IP address is checked from the *UDF*, it first checks the cache
>> for the location information. If it is not in the cache, it checks
>> another database which contains a direct IP-to-location mapping, as
>> *Sajith* mentioned. If it is there, it returns the location and caches
>> it. If the location is not in that database, the IP is checked against
>> the *MAXMIND* database, and the location is stored in the cache and in
>> the above table.
>>
>> Thanks
>> Tharindu
>>
>> On Tue, Mar 8, 2016 at 2:34 PM, Tharindu Dharmarathna <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> We have come across the following ways to do the above task after the
>>> initial POC.
>>>
>>> 1. Use the file-type database provided by max-mind (.mmdb) and use
>>> their database readers.
>>>
>>> With this approach we got a lower time to get the location, using a
>>> JAX-RS service that wraps the above database. This JAX-RS
>>> implementation uses max-mind's cache implementation by default, which
>>> can be found at [1].
>>>
>>> *Limitations*
>>>
>>> - The JAX-RS app has to be hosted on another server.
>>> - The number of HTTP calls will be high.
>>>
>>> 2. Call the query server as in the above thread and cache the location
>>> against the IP.
>>>
>>> Here you can find the execution time for a single query for each
>>> method.
>>>
>>> *Method 1: 4.5 seconds*
>>>
>>> *Method 2: 4.76 seconds*
>>>
>>> Thanks
>>> Tharindu
>>>
>>> On Tue, Mar 8, 2016 at 8:29 AM, Lasantha Fernando <[email protected]> wrote:
>>>
>>>> Hi Tharindu,
>>>>
>>>> On 7 March 2016 at 21:10, Sajith Ravindra <[email protected]> wrote:
>>>>
>>>>>> 2. Having a DB based cache would persist the data even on a restart
>>>>>> and the data fetching query would be searching for a specific value
>>>>>> (not a range query as against the max-mind DB). But the downside is
>>>>>> that for a cache miss there would be a minimum of 3 DB queries (one
>>>>>> for the cache table lookup, one for the max-mind DB lookup, and one
>>>>>> for the cache persistence).
>>>>>
>>>>> In order to avoid expensive cache misses we may eagerly populate the
>>>>> DB table cache, i.e. when there's a cache miss we do the lookup in
>>>>> the max-mind DB and then add entries for multiple IPs of that
>>>>> network_cidr to the cache DB table instead of only for that
>>>>> particular IP. That way we reduce the chance of a cache miss being
>>>>> very expensive, as we increase the chance of the IP being found on
>>>>> the first DB lookup.
>>>>>
>>>>> We might need to do some evaluation to determine how many entries we
>>>>> are going to add to the DB cache for IPs belonging to a particular
>>>>> network_cidr. For example, if requests from a certain network_cidr
>>>>> are frequent, we may want to add more entries compared to a less
>>>>> frequent network_cidr.
>>>>>
>>>>> The downside is that the DB cache tends to be bigger.
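The eager population idea above could look roughly like the following Python sketch. It is illustrative only: `eager_entries` is a hypothetical helper, the geoname_id value 1234 is made up, and taking the first `limit` host addresses of the network is an assumption, since the thread leaves open which addresses to pre-load and how many.

```python
import ipaddress
from itertools import islice


def eager_entries(network_cidr, geoname_id, limit):
    """Return up to `limit` (ip, geoname_id) rows for one network.

    Assumption: the first `limit` host addresses are pre-loaded; a real
    policy might instead pick addresses near the IP that missed.
    """
    network = ipaddress.ip_network(network_cidr)
    # hosts() iterates lazily, so islice avoids materialising large networks.
    return [(str(host), geoname_id) for host in islice(network.hosts(), limit)]


# Rows that would be written to the cache table after one cache miss.
# 1234 is a placeholder geoname_id, not a real MaxMind value.
rows = eager_entries("192.168.0.0/20", 1234, 3)
```

Each row would then be inserted into the cache table keyed by IP, so that later requests from the same network are more likely to be found on the first DB lookup. The `limit` argument is the knob Sajith suggests tuning per network.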
>>>>>
>>>>> Thanks
>>>>> *Sajith Ravindra*
>>>>> Senior Software Engineer
>>>>> WSO2 Inc.; http://wso2.com
>>>>> lean.enterprise.middleware
>>>>>
>>>>> mobile: +94 77 2273550
>>>>> blog: http://sajithr.blogspot.com/
>>>>> <http://lk.linkedin.com/pub/shani-ranasinghe/34/111/ab>
>>>>>
>>>>> On Mon, Mar 7, 2016 at 4:37 AM, Tharindu Dharmarathna <[email protected]> wrote:
>>>>>
>>>>>> Hi Lasantha,
>>>>>>
>>>>>> Up to now we have been doing the following to get the geo location
>>>>>> from the stated dump.
>>>>>>
>>>>>> 1. Two columns were added, filled with the long values of the lower
>>>>>> and upper network IP addresses. We then get the geoname_id for the
>>>>>> given IP whose long value falls between those two values. Hope you
>>>>>> get the idea of our approach. Is there any way to do a bitwise
>>>>>> operation in order to get the network_cidr value?
>>>>>>
>>>>>
>>>> Can't we do it by keeping the network IP and the subnet as two columns
>>>> and the geoname_id as the third? Say, for example, 192.168.0.0/20 is
>>>> the CIDR. For IPv4 routing, what is usually done is to get the IP as an
>>>> int, then do a bitwise AND with the subnet mask (e.g. if the subnet
>>>> mask is 20, that would mean 20 bits with value 1 and the remaining 12
>>>> bits with value 0, i.e. 11111111 11111111 11110000 00000000) and check
>>>> whether that returns the network IP.
>>>>
>>>> You might find more info here [1]. I think there should be libraries
>>>> that wrap this operation. But if performance is a concern and we need
>>>> to keep the cache search implementation very lean, we can implement it
>>>> ourselves.
>>>>
>>>> WDYT?
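A minimal sketch of the bitwise AND check described above, written in Python for brevity. The thread itself contains no code, so the names `ip_to_int`, `prefix_to_mask`, and `in_network` are hypothetical. It assumes IPv4 only, and that the network IP and the prefix length (the "20" in 192.168.0.0/20) are stored as separate columns.

```python
import ipaddress


def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


def prefix_to_mask(prefix_len):
    """Build a 32-bit mask with `prefix_len` leading 1 bits."""
    return (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF


def in_network(ip, network_ip, prefix_len):
    """True if `ip` AND the mask equals the stored network IP."""
    return ip_to_int(ip) & prefix_to_mask(prefix_len) == ip_to_int(network_ip)


# A /20 mask is 20 one-bits followed by 12 zero-bits.
mask_bits = format(prefix_to_mask(20), "032b")

inside = in_network("192.168.15.254", "192.168.0.0", 20)
outside = in_network("192.168.16.1", "192.168.0.0", 20)
```

The check is one AND and one comparison per candidate network, so its cost depends mainly on how many cached networks have to be tried, not on the size of each network.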
>>>>
>>>> [1] http://stackoverflow.com/questions/4209760/validate-an-ip-address-with-mask
>>>>
>>>> Thanks,
>>>> Lasantha
>>>>
>>>>>> Thanks
>>>>>> Tharindu
>>>>>>
>>>>>> On Mon, Mar 7, 2016 at 12:05 AM, Lasantha Fernando <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I think what Sachith suggests also makes sense. But I am also
>>>>>>> rooting for the in-memory cache implementation suggested by
>>>>>>> Sanjeewa, with the ip-netmask approach.
>>>>>>>
>>>>>>> Please find my comments inline.
>>>>>>>
>>>>>>> On 5 March 2016 at 23:50, Sachith Withana <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> From what I understand/was told, this happens once a day (or
>>>>>>>> relatively infrequently), and you want to avoid searching through
>>>>>>>> all the geo data per IP (since you are grouping the requests by
>>>>>>>> IP).
>>>>>>>>
>>>>>>>> If that's the case, it would be better to use a separate DB table
>>>>>>>> to cache this data (IP, geoID, etc.) with the IP being the primary
>>>>>>>> key (which would improve the lookup time), and even though there
>>>>>>>> will be cache misses, it would eventually reduce the ratio of
>>>>>>>> cache misses to hits.
>>>>>>>>
>>>>>>>> Having a DB cache would be better since you do want to persist
>>>>>>>> this data to be used over time.
>>>>>>>>
>>>>>>>> BTW, on a cache miss, if we can figure out a way to limit the
>>>>>>>> search range on the original table, or at least stop the search
>>>>>>>> once a match is found, it would greatly improve the cache miss
>>>>>>>> time as well.
>>>>>>>>
>>>>>>>> That's my two cents.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Sachith
>>>>>>>>
>>>>>>>> On Sun, Mar 6, 2016 at 8:24 AM, Janaka Ranabahu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Sanjeewa,
>>>>>>>>>
>>>>>>>>> On Sun, Mar 6, 2016 at 7:25 AM, Sanjeewa Malalgoda <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Implementing cache is better than having another table mapping
>>>>>>>>>> IMO.
>>>>>>>>>> What if we query the database and keep the IP range and network
>>>>>>>>>> name in memory? Then we may do a quick search on the network
>>>>>>>>>> name, and based on that the rest can be loaded some other way.
>>>>>>>>>> WDYT?
>>>>>>>>>>
>>>>>>>>> We thought of having an in-memory cache but we faced several
>>>>>>>>> issues along the way. Let me explain the situation as it is now.
>>>>>>>>>
>>>>>>>>> The Max-Mind DB has the IP addresses with the IP and the netmask.
>>>>>>>>> Ex: 192.168.0.0/20
>>>>>>>>>
>>>>>>>>> The calculation of the IP address range would be like the
>>>>>>>>> following.
>>>>>>>>>
>>>>>>>>> Address:   192.168.0.1         11000000.10101000.0000 0000.00000001
>>>>>>>>> Netmask:   255.255.240.0 = 20  11111111.11111111.1111 0000.00000000
>>>>>>>>> Wildcard:  0.0.15.255          00000000.00000000.0000 1111.11111111
>>>>>>>>> =>
>>>>>>>>> Network:   192.168.0.0/20      11000000.10101000.0000 0000.00000000 (Class C)
>>>>>>>>> Broadcast: 192.168.15.255      11000000.10101000.0000 1111.11111111
>>>>>>>>> HostMin:   192.168.0.1         11000000.10101000.0000 0000.00000001
>>>>>>>>> HostMax:   192.168.15.254      11000000.10101000.0000 1111.11111110
>>>>>>>>> Hosts/Net: 4094                (Private Internet <http://www.ietf.org/rfc/rfc1918.txt>)
>>>>>>>>>
>>>>>>>>> Therefore what we are currently doing is calculating the start
>>>>>>>>> and end IP for all the values in the max-mind DB and altering the
>>>>>>>>> tables with those values initially (this is a one-time step).
>>>>>>>>> When the Spark script executes, we check whether the given IP is
>>>>>>>>> between any of the start and end ranges in the tables. That is
>>>>>>>>> the reason why it is taking a long time to fetch results for a
>>>>>>>>> given IP.
>>>>>>>>>
>>>>>>>>> As a solution for this, we discussed what Tharindu has mentioned.
>>>>>>>>> 1. Have an in-memory caching mechanism.
>>>>>>>>> 2. Have a DB-based caching mechanism.
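The start and end values for a network such as 192.168.0.0/20, and the "is the IP between them" check, can be sketched with Python's standard `ipaddress` module. This is only an illustration of the calculation shown above: `cidr_range` and `in_range` are hypothetical helper names, and the actual table columns and Spark script are not shown in this thread.

```python
import ipaddress


def cidr_range(cidr):
    """Return (start, end) of an IPv4 network as integers.

    start is the network address and end is the broadcast address,
    matching the Network and Broadcast rows of the calculation.
    """
    network = ipaddress.ip_network(cidr)
    return int(network.network_address), int(network.broadcast_address)


def in_range(ip, start, end):
    """Range check equivalent to `start <= ip_as_long <= end` in SQL."""
    return start <= int(ipaddress.IPv4Address(ip)) <= end


start, end = cidr_range("192.168.0.0/20")
start_ip = str(ipaddress.IPv4Address(start))
end_ip = str(ipaddress.IPv4Address(end))
```

For the /20 example the range covers 4096 addresses, of which 4094 are usable hosts (HostMin to HostMax), which agrees with the Hosts/Net figure quoted above.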
>>>>>>>>>
>>>>>>>>> The only point that we have to highlight is the fact that in both
>>>>>>>>> the above mechanisms we need to cache the IP address (not the
>>>>>>>>> ip-netmask as it was in the max-mind DB) against the geo location.
>>>>>>>>>
>>>>>>>>> Ex:
>>>>>>>>> For 192.168.0.1 - Colombo, Sri Lanka
>>>>>>>>> For 192.168.15.254 - Colombo, Sri Lanka
>>>>>>>>>
>>>>>>>>> So, as per the above example, if there are requests from all the
>>>>>>>>> possible 4094 addresses, we will be caching each IP with the geo
>>>>>>>>> location (since introducing range queries in a cache is not a
>>>>>>>>> good practice).
>>>>>>>>>
>>>>>>>>
>>>>>>> Since we are implementing a custom cache, won't we be doing a
>>>>>>> bitwise operation for the lookup with the netmask and network IP?
>>>>>>> So basically, we would keep the network IP and the netmask in the
>>>>>>> cache and simply do a bitwise AND to determine whether it is a
>>>>>>> match or not, right? I am thinking such an operation would not
>>>>>>> incur much of a performance hit and it won't be as prohibitive as a
>>>>>>> normal range query in a cache. If that is the case, I think we can
>>>>>>> go with the approach suggested by Sanjeewa.
>>>>>>>
>>>>>>> WDYT?
>>>>>>>
>>>>>>>>> Please find my comments about both the approaches.
>>>>>>>>>
>>>>>>>>> 1. Having an in-memory cache would speed things up, but based on
>>>>>>>>> the IPs in the data set, there could be a number of entries for
>>>>>>>>> IPs in the same range. One problem with this approach is that, if
>>>>>>>>> there is a server restart, the initial script execution would
>>>>>>>>> take a lot of time. Also, in certain scenarios (a high number of
>>>>>>>>> different IPs) the cache would not have a significant effect on
>>>>>>>>> script execution performance.
>>>>>>>>>
>>>>>>>>> 2. Having a DB-based cache would persist the data even on a
>>>>>>>>> restart, and the data fetching query would be searching for a
>>>>>>>>> specific value (not a range query as against the max-mind DB).
>>>>>>>>> But the downside is that for a cache miss there would be a
>>>>>>>>> minimum of 3 DB queries (one for the cache table lookup, one for
>>>>>>>>> the max-mind DB lookup, and one for the cache persistence).
>>>>>>>>>
>>>>>>>>> That is why we have initiated this thread to finalize the caching
>>>>>>>>> approach we should take.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Janaka
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> sanjeewa.
>>>>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lasantha
>>>>>>>
>>>>>>>>>> On Fri, Mar 4, 2016 at 3:12 PM, Tharindu Dharmarathna <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> We are going to implement a client-IP-based geo-location graph
>>>>>>>>>>> in API Manager Analytics. When we went through the ways of
>>>>>>>>>>> doing this in [1], we selected [2] as the most suitable.
>>>>>>>>>>>
>>>>>>>>>>> *Overview of max-mind's DB*
>>>>>>>>>>>
>>>>>>>>>>> As per the structure of the DB (attached in the image), they
>>>>>>>>>>> have two tables which together give the location.
>>>>>>>>>>>
>>>>>>>>>>> Find the geoname_id according to the network and get the
>>>>>>>>>>> country and city from the locations table.
>>>>>>>>>>>
>>>>>>>>>>> *Limitations*
>>>>>>>>>>>
>>>>>>>>>>> With their database dump we couldn't directly look up the IP in
>>>>>>>>>>> those tables. We need to check whether the given IP is between
>>>>>>>>>>> the network min and max IP. This query takes a long time (10
>>>>>>>>>>> seconds on indexed data). If we do this directly from the Spark
>>>>>>>>>>> script, each and every IP in the summary table (regardless of
>>>>>>>>>>> whether the same IP appears in two rows) will be queried
>>>>>>>>>>> against the tables.
>>>>>>>>>>> Therefore this will have a performance impact on this graph.
>>>>>>>>>>>
>>>>>>>>>>> *Solution*
>>>>>>>>>>>
>>>>>>>>>>> 1. Implement an LRU cache of IP address vs location.
>>>>>>>>>>>
>>>>>>>>>>> This will need to be implemented in a custom UDF in Spark. If
>>>>>>>>>>> the IP queried from Spark is available in the cache, it will
>>>>>>>>>>> give the location from there; if it is not, it will retrieve it
>>>>>>>>>>> from the DB and put it into the cache.
>>>>>>>>>>>
>>>>>>>>>>> 2. Persist in a table.
>>>>>>>>>>>
>>>>>>>>>>> Use the IP as the primary key and country and city as the other
>>>>>>>>>>> columns, and retrieve data from that table.
>>>>>>>>>>>
>>>>>>>>>>> Please feel free to suggest the most suitable way of doing this.
>>>>>>>>>>>
>>>>>>>>>>> [1] - Implementing Geographical based Analytics in API Manager
>>>>>>>>>>> mail thread.
>>>>>>>>>>>
>>>>>>>>>>> [2] - http://dev.maxmind.com/geoip/geoip2/geolite2/
>>>>>>>>>>>
>>>>>>>>>>> *Thanks*
>>>>>>>>>>>
>>>>>>>>>>> *Tharindu Dharmarathna*
>>>>>>>>>>> Associate Software Engineer
>>>>>>>>>>> WSO2 Inc.; http://wso2.com
>>>>>>>>>>> lean.enterprise.middleware
>>>>>>>>>>>
>>>>>>>>>>> mobile: +94779109091
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> *Sanjeewa Malalgoda*
>>>>>>>>>> WSO2 Inc.
>>>>>>>>>> Mobile: +94713068779
>>>>>>>>>> blog: http://sanjeewamalalgoda.blogspot.com/
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Janaka Ranabahu*
>>>>>>>>> Associate Technical Lead, WSO2 Inc.
>>>>>>>>> http://wso2.com
>>>>>>>>>
>>>>>>>>> E-mail: [email protected]
>>>>>>>>> M: +94 718370861
>>>>>>>>>
>>>>>>>>> Lean . Enterprise .
>>>>>>>>> Middleware
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sachith Withana
>>>>>>>> Software Engineer; WSO2 Inc.; http://wso2.com
>>>>>>>> E-mail: sachith AT wso2.com
>>>>>>>> M: +94715518127
>>>>>>>> Linked-In: https://lk.linkedin.com/in/sachithwithana
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Architecture mailing list
>>>>>>>> [email protected]
>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>
>>>>>>> --
>>>>>>> *Lasantha Fernando*
>>>>>>> Senior Software Engineer - Data Technologies Team
>>>>>>> WSO2 Inc. http://wso2.com
>>>>>>>
>>>>>>> email: [email protected]
>>>>>>> mobile: (+94) 71 5247551
>

--
*Kishanthan Thangarajah*
Associate Technical Lead,
Platform Technologies Team,
WSO2, Inc.
lean.enterprise.middleware

Mobile - +94773426635
Blog - http://kishanthan.wordpress.com
Twitter - http://twitter.com/kishanthan
