Hi Kishanthan,

I have started work on reusing the same UDF.

Thanks,
Lochana

On Thu, Mar 17, 2016 at 11:04 PM, Kishanthan Thangarajah <[email protected]> wrote:

> Lochana, let's try to reuse this UDF for AS analytics too.
>
> On Thu, Mar 17, 2016 at 2:43 PM, Janaka Ranabahu <[email protected]> wrote:
>
>> Hi Tharindu,
>>
>> Great work. Can we do a performance test of this and share the results?
>> Basically, what we need to check is how much time a script takes to
>> execute.
>>
>> Thanks,
>> Janaka
>>
>> On Thu, Mar 17, 2016 at 2:37 PM, Tharindu Dharmarathna <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> After going through the above discussion, we have implemented a
>>> pluggable user-defined extension point. With this configuration we can
>>> write our own implementation to get the country and state of a given IP.
>>>
>>> *Caching Implementation*
>>>
>>> We define two levels of caching as below.
>>>
>>> When an IP address is checked from the *UDF*, it first checks the cache
>>> for the location information. If it is not in the cache, it checks
>>> another database which contains a direct IP-to-location mapping, as
>>> *Sajith* mentioned. If the location is there, it is returned and cached.
>>> If the location is not in that database, the IP is checked against the
>>> *MAXMIND* database, and the location is stored in both the cache and the
>>> above table.
>>>
>>> Thanks
>>> Tharindu
>>>
>>> On Tue, Mar 8, 2016 at 2:34 PM, Tharindu Dharmarathna <[email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We have come across the following ways to do the above task after the
>>>> initial POC.
>>>>
>>>> 1. Use the file-type database provided by max-mind (.mmdb) together
>>>> with their database readers.
>>>>
>>>> With this approach we got a lower time to get the location, using a
>>>> JAX-RS service that wraps the above database. This JAX-RS
>>>> implementation uses max-mind's cache implementation by default, which
>>>> can be found at [1].
>>>>
>>>> *Limitations*
>>>>
>>>> - The JAX-RS app has to be hosted on another server.
>>>> - The number of HTTP calls will be high.
>>>>
>>>> 2. Call the query server as in the above thread and cache the location
>>>> against the IP.
>>>>
>>>> Here is the execution time of a single query for each method.
>>>>
>>>> *Method 1: 4.5 seconds*
>>>>
>>>> *Method 2: 4.76 seconds*
>>>>
>>>> Thanks
>>>> Tharindu
>>>>
>>>> On Tue, Mar 8, 2016 at 8:29 AM, Lasantha Fernando <[email protected]> wrote:
>>>>
>>>>> Hi Tharindu,
>>>>>
>>>>> On 7 March 2016 at 21:10, Sajith Ravindra <[email protected]> wrote:
>>>>>
>>>>>>> 2. Having a DB based cache would persist the data even on a restart,
>>>>>>> and the data fetching query would be searching for a specific value
>>>>>>> (not a range query as against the max-mind DB). But the downside is
>>>>>>> that for a cache miss there would be a minimum of 3 DB queries (one
>>>>>>> for the cache table lookup, one for the max-mind DB lookup, and one
>>>>>>> for the cache persistence).
>>>>>>
>>>>>> In order to avoid expensive cache misses we may eagerly populate the
>>>>>> DB table cache, i.e. when there's a cache miss we do the lookup in the
>>>>>> max-mind DB and then add entries for multiple IPs of that network_cidr
>>>>>> to the cache DB table, instead of only for that particular IP. That
>>>>>> way we reduce the chance of a cache miss being very expensive, as we
>>>>>> increase the chance of an IP being found on the first DB lookup.
>>>>>>
>>>>>> We might need to do some evaluation to determine how many entries to
>>>>>> add to the DB cache for IPs belonging to a particular network_cidr.
>>>>>> For example, if requests from a certain network_cidr are frequent, we
>>>>>> may want to add more entries compared to a less frequent network_cidr.
>>>>>>
>>>>>> The downside is that the DB cache tends to grow larger.
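For illustration, the eager population idea above could look roughly like the following. This is a minimal Python sketch under stated assumptions, not code from the thread: the function name `addresses_to_precache` and the `limit` parameter (how many entries to add per cache miss) are hypothetical, and writing the returned addresses to the cache table is left out.

```python
import ipaddress


def addresses_to_precache(ip, network_cidr, limit):
    """Return up to `limit` addresses of network_cidr, starting at `ip`.

    The list begins with `ip` itself and continues with the addresses that
    follow it, stopping at the last address of the network.
    `limit` is a hypothetical tuning knob (e.g. larger for busy networks).
    """
    network = ipaddress.ip_network(network_cidr)
    address = ipaddress.ip_address(ip)
    if address not in network:
        raise ValueError(f"{ip} is not in {network_cidr}")
    start = int(address)
    last = int(network.broadcast_address)
    end = min(start + limit - 1, last)
    return [str(ipaddress.ip_address(n)) for n in range(start, end + 1)]
```

All returned addresses would then be stored against the same location that the max-mind lookup returned for the original IP, so that later requests from neighbouring addresses hit the cache table directly.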
>>>>>>
>>>>>> Thanks,
>>>>>> *Sajith Ravindra*
>>>>>> Senior Software Engineer
>>>>>> WSO2 Inc.; http://wso2.com
>>>>>> lean.enterprise.middleware
>>>>>>
>>>>>> mobile: +94 77 2273550
>>>>>> blog: http://sajithr.blogspot.com/
>>>>>> <http://lk.linkedin.com/pub/shani-ranasinghe/34/111/ab>
>>>>>>
>>>>>> On Mon, Mar 7, 2016 at 4:37 AM, Tharindu Dharmarathna <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Lasantha,
>>>>>>>
>>>>>>> Up to now we have been getting the geolocation from the stated dump
>>>>>>> in the following way.
>>>>>>>
>>>>>>> 1. Two columns were added, filled with the long values of the lower
>>>>>>> and upper network IP addresses. We then get the geoname_id of the
>>>>>>> row for which the long value of the given IP falls between those two
>>>>>>> values. Hope you get the idea of our approach. Is there any way to
>>>>>>> do a bitwise operation in order to get the network_cidr value?
>>>>>>
>>>>> Can't we do it by keeping the network IP and the subnet as two columns
>>>>> and the geoname_id as the third? For example, if 192.168.0.0/20 is the
>>>>> CIDR, what is usually done for IPv4 routing is to take the IP as an
>>>>> int, then do a bitwise AND with the subnet mask (e.g. if the subnet
>>>>> mask is 20, that would mean 20 bits with value 1 and the remaining 12
>>>>> bits with value 0, i.e. 11111111 11111111 11110000 00000000) and check
>>>>> whether that returns the network IP.
>>>>>
>>>>> You might find more info here [1]. I think there should be libraries
>>>>> that wrap this operation. But if performance is a concern and we need
>>>>> to keep the cache search implementation very lean, we can implement it
>>>>> ourselves.
>>>>>
>>>>> WDYT?
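For illustration, the netmask check described above can be sketched as follows. This is a minimal Python sketch, not the thread's actual UDF code; the function names (`ip_to_int`, `netmask`, `in_network`, `network_range`) are hypothetical. It shows both the bitwise AND match and the lower/upper integer bounds used in the two-column range approach.

```python
import ipaddress


def ip_to_int(ip):
    """Convert a dotted-quad IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


def netmask(prefix_len):
    """Build the 32-bit mask for a prefix length, e.g. 20 -> 255.255.240.0."""
    return (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF


def in_network(ip, network_ip, prefix_len):
    """True if (ip AND mask) equals the network address."""
    return (ip_to_int(ip) & netmask(prefix_len)) == ip_to_int(network_ip)


def network_range(network_ip, prefix_len):
    """Lower and upper integer bounds of the network (two-column approach)."""
    low = ip_to_int(network_ip)
    wildcard = ~netmask(prefix_len) & 0xFFFFFFFF  # host bits set to 1
    return low, low | wildcard
```

For 192.168.0.0/20, `in_network("192.168.15.254", "192.168.0.0", 20)` holds, while 192.168.16.1 falls outside the network. An address matches the AND check exactly when its integer value lies between the two bounds from `network_range`, so both approaches select the same rows.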
>>>>>
>>>>> [1] http://stackoverflow.com/questions/4209760/validate-an-ip-address-with-mask
>>>>>
>>>>> Thanks,
>>>>> Lasantha
>>>>>
>>>>>>> Thanks
>>>>>>> Tharindu
>>>>>>>
>>>>>>> On Mon, Mar 7, 2016 at 12:05 AM, Lasantha Fernando <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I think what Sachith suggests also makes sense. But I am also
>>>>>>>> rooting for the in-memory cache implementation suggested by
>>>>>>>> Sanjeewa with the ip-netmask approach.
>>>>>>>>
>>>>>>>> Please find my comments inline.
>>>>>>>>
>>>>>>>> On 5 March 2016 at 23:50, Sachith Withana <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> From what I understand/was told, this happens once a day (or
>>>>>>>>> relatively infrequently), and you want to avoid searching through
>>>>>>>>> all the geo data per IP (since you are grouping the requests by
>>>>>>>>> IP).
>>>>>>>>>
>>>>>>>>> If that's the case, it would be better to use a separate DB table
>>>>>>>>> to cache this data (IP, geoID, etc.) with the IP being the primary
>>>>>>>>> key (which would improve the lookup time), and even though there
>>>>>>>>> will be cache misses, it would eventually reduce the ratio
>>>>>>>>> (#cacheMisses / #hits).
>>>>>>>>>
>>>>>>>>> Having a DB cache would be better since you do want to persist
>>>>>>>>> this data to be used over time.
>>>>>>>>>
>>>>>>>>> BTW, on a cache miss, if we can figure out a way to limit the
>>>>>>>>> search range on the original table, or at least stop the search
>>>>>>>>> once a match is found, it would greatly improve the cache miss
>>>>>>>>> time as well.
>>>>>>>>>
>>>>>>>>> That's my two cents.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Sachith
>>>>>>>>>
>>>>>>>>> On Sun, Mar 6, 2016 at 8:24 AM, Janaka Ranabahu <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Sanjeewa,
>>>>>>>>>>
>>>>>>>>>> On Sun, Mar 6, 2016 at 7:25 AM, Sanjeewa Malalgoda <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Implementing a cache is better than having another mapping
>>>>>>>>>>> table, IMO. What if we query the database and keep the IP range
>>>>>>>>>>> and network name in memory? Then we can do a quick search on
>>>>>>>>>>> the network name, and based on that the rest can be loaded some
>>>>>>>>>>> other way.
>>>>>>>>>>> WDYT?
>>>>>>>>>>>
>>>>>>>>>> We thought of having an in-memory cache, but we faced several
>>>>>>>>>> issues along the way. Let me explain the situation as it stands
>>>>>>>>>> now.
>>>>>>>>>>
>>>>>>>>>> The max-mind DB stores IP addresses as an IP and a netmask.
>>>>>>>>>> Ex: 192.168.0.0/20
>>>>>>>>>>
>>>>>>>>>> The IP address range is calculated as follows.
>>>>>>>>>>
>>>>>>>>>> Address:   192.168.0.1          11000000.10101000.0000 0000.00000001
>>>>>>>>>> Netmask:   255.255.240.0 = 20   11111111.11111111.1111 0000.00000000
>>>>>>>>>> Wildcard:  0.0.15.255           00000000.00000000.0000 1111.11111111
>>>>>>>>>> =>
>>>>>>>>>> Network:   192.168.0.0/20       11000000.10101000.0000 0000.00000000 (Class C)
>>>>>>>>>> Broadcast: 192.168.15.255       11000000.10101000.0000 1111.11111111
>>>>>>>>>> HostMin:   192.168.0.1          11000000.10101000.0000 0000.00000001
>>>>>>>>>> HostMax:   192.168.15.254       11000000.10101000.0000 1111.11111110
>>>>>>>>>> Hosts/Net: 4094                 (Private Internet <http://www.ietf.org/rfc/rfc1918.txt>)
>>>>>>>>>>
>>>>>>>>>> Therefore what we are currently doing is calculating the start
>>>>>>>>>> and end IP for all the values in the max-mind DB and altering the
>>>>>>>>>> tables with those values initially (this is a one-time step).
>>>>>>>>>> When the Spark script executes, we check whether the given IP is
>>>>>>>>>> between any of the start and end ranges in the tables. That is
>>>>>>>>>> the reason why it is taking a long time to fetch results for a
>>>>>>>>>> given IP.
>>>>>>>>>>
>>>>>>>>>> As a solution for this, we discussed what Tharindu has mentioned.
>>>>>>>>>> 1. Have an in-memory caching mechanism.
>>>>>>>>>> 2. Have a DB based caching mechanism.
>>>>>>>>>>
>>>>>>>>>> The only point that we have to highlight is that in both of the
>>>>>>>>>> above mechanisms we need to cache the IP address (not the
>>>>>>>>>> ip-netmask as it was in the max-mind DB) against the geo
>>>>>>>>>> location.
>>>>>>>>>>
>>>>>>>>>> Ex:
>>>>>>>>>> For 192.168.0.1 - Colombo, Sri Lanka
>>>>>>>>>> For 192.168.15.254 - Colombo, Sri Lanka
>>>>>>>>>>
>>>>>>>>>> So as per the above example, if there are requests from all the
>>>>>>>>>> possible 4094 addresses, we will be caching each IP with the geo
>>>>>>>>>> location (since introducing range queries in a cache is not a
>>>>>>>>>> good practice).
>>>>>>>>>>
>>>>>>>> Since we are implementing a custom cache, won't we be doing a
>>>>>>>> bitwise operation for the lookup with the netmask and network IP?
>>>>>>>> So basically, we would keep the network IP and the netmask in the
>>>>>>>> cache and simply do a bitwise AND to determine whether it is a
>>>>>>>> match or not, right? I am thinking such an operation would not
>>>>>>>> incur much of a performance hit and it won't be as prohibitive as a
>>>>>>>> normal range query in a cache. If that is the case, I think we can
>>>>>>>> go with the approach suggested by Sanjeewa.
>>>>>>>>
>>>>>>>> WDYT?
>>>>>>>>
>>>>>>>>>> Please find my comments about both the approaches.
>>>>>>>>>>
>>>>>>>>>> 1. Having an in-memory cache would speed things up, but based on
>>>>>>>>>> the IPs in the data set, there could be a number of entries for
>>>>>>>>>> IPs in the same range. One problem with this approach is that, if
>>>>>>>>>> there is a server restart, the initial script execution would
>>>>>>>>>> take a lot of time. Also, in certain scenarios (a high number of
>>>>>>>>>> different IPs) the cache would not have a significant effect on
>>>>>>>>>> script execution performance.
>>>>>>>>>>
>>>>>>>>>> 2. Having a DB based cache would persist the data even on a
>>>>>>>>>> restart, and the data fetching query would be searching for a
>>>>>>>>>> specific value (not a range query as against the max-mind DB).
>>>>>>>>>> But the downside is that for a cache miss there would be a
>>>>>>>>>> minimum of 3 DB queries (one for the cache table lookup, one for
>>>>>>>>>> the max-mind DB lookup, and one for the cache persistence).
>>>>>>>>>>
>>>>>>>>>> That is why we have initiated this thread to finalize the caching
>>>>>>>>>> approach we should take.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Janaka
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> sanjeewa.
>>>>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Lasantha
>>>>>>>>
>>>>>>>>>>> On Fri, Mar 4, 2016 at 3:12 PM, Tharindu Dharmarathna <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> We are going to implement a client-IP-based geolocation graph
>>>>>>>>>>>> in API Manager Analytics. When we went through the ways of
>>>>>>>>>>>> doing this in [1], we selected [2] as the most suitable.
>>>>>>>>>>>>
>>>>>>>>>>>> *Overview of max-mind's DB*
>>>>>>>>>>>>
>>>>>>>>>>>> As shown in the structure of the DB (attached as an image),
>>>>>>>>>>>> they have two tables which are combined to get the location.
>>>>>>>>>>>>
>>>>>>>>>>>> Find the geoname_id according to the network, and get the
>>>>>>>>>>>> country and city from the locations table.
>>>>>>>>>>>>
>>>>>>>>>>>> *Limitations*
>>>>>>>>>>>>
>>>>>>>>>>>> With their database dump we cannot directly look up the IP in
>>>>>>>>>>>> those tables. We need to check whether the given IP is between
>>>>>>>>>>>> the network's min and max IP. This query takes a long time (10
>>>>>>>>>>>> seconds on indexed data). If we do this directly from the Spark
>>>>>>>>>>>> script, each and every IP in the summary table (even if the
>>>>>>>>>>>> same IP appears in two rows) will be queried against the
>>>>>>>>>>>> tables. This will therefore have a performance impact on this
>>>>>>>>>>>> graph.
>>>>>>>>>>>>
>>>>>>>>>>>> *Solution*
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Implement an LRU cache of IP address vs. location.
>>>>>>>>>>>>
>>>>>>>>>>>> This needs to be implemented in a custom UDF in Spark. If the
>>>>>>>>>>>> IP queried from Spark is available in the cache, the location
>>>>>>>>>>>> is returned from it; if it is not, it is retrieved from the DB
>>>>>>>>>>>> and put into the cache.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Persist in a table
>>>>>>>>>>>>
>>>>>>>>>>>> The IP as the primary key, with country and city as the other
>>>>>>>>>>>> columns; data is retrieved from that table.
>>>>>>>>>>>>
>>>>>>>>>>>> Please feel free to suggest the most suitable way of doing
>>>>>>>>>>>> this.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] - Implementing Geographical based Analytics in API Manager
>>>>>>>>>>>> mail thread.
>>>>>>>>>>>>
>>>>>>>>>>>> [2] - http://dev.maxmind.com/geoip/geoip2/geolite2/
>>>>>>>>>>>>
>>>>>>>>>>>> *Thanks*
>>>>>>>>>>>>
>>>>>>>>>>>> *Tharindu Dharmarathna*
>>>>>>>>>>>> Associate Software Engineer
>>>>>>>>>>>> WSO2 Inc.; http://wso2.com
>>>>>>>>>>>> lean.enterprise.middleware
>>>>>>>>>>>>
>>>>>>>>>>>> mobile: +94779109091
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> *Sanjeewa Malalgoda*
>>>>>>>>>>> WSO2 Inc.
>>>>>>>>>>> Mobile: +94713068779
>>>>>>>>>>> blog: http://sanjeewamalalgoda.blogspot.com/
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> *Janaka Ranabahu*
>>>>>>>>>> Associate Technical Lead, WSO2 Inc.
>>>>>>>>>> http://wso2.com
>>>>>>>>>>
>>>>>>>>>> E-mail: [email protected]  M: +94 718370861
>>>>>>>>>>
>>>>>>>>>> Lean . Enterprise . Middleware
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sachith Withana
>>>>>>>>> Software Engineer; WSO2 Inc.; http://wso2.com
>>>>>>>>> E-mail: sachith AT wso2.com
>>>>>>>>> M: +94715518127
>>>>>>>>> Linked-In: https://lk.linkedin.com/in/sachithwithana
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Architecture mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Lasantha Fernando*
>>>>>>>> Senior Software Engineer - Data Technologies Team
>>>>>>>> WSO2 Inc. http://wso2.com
>>>>>>>>
>>>>>>>> email: [email protected]
>>>>>>>> mobile: (+94) 71 5247551
>
> --
> *Kishanthan Thangarajah*
> Associate Technical Lead,
> Platform Technologies Team,
> WSO2, Inc.
> lean.enterprise.middleware
>
> Mobile - +94773426635
> Blog - http://kishanthan.wordpress.com
> Twitter - http://twitter.com/kishanthan

--
Lochana Ranaweera
Intern Software Engineer
WSO2 Inc: http://wso2.com
Blog: https://lochanaranaweera.wordpress.com/
Mobile: +94716487055
_______________________________________________ Architecture mailing list [email protected] https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
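
As an illustration of the two-level lookup described in Tharindu's Mar 17 message (cache first, then the direct IP-to-location table, then the MaxMind database), here is a minimal Python sketch. It is not the UDF implemented in the thread: `LruCache` and `resolve_location` are hypothetical names, a plain dict stands in for the direct-mapping DB table, and `maxmind_lookup` is an assumed callable wrapping the slow range query.

```python
from collections import OrderedDict


class LruCache:
    """Small in-memory LRU cache keyed by IP address string."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def resolve_location(ip, cache, mapping_table, maxmind_lookup):
    """Resolve ip via the cache, then the mapping table, then MaxMind."""
    location = cache.get(ip)
    if location is not None:
        return location
    location = mapping_table.get(ip)      # direct IP-to-location lookup
    if location is None:
        location = maxmind_lookup(ip)     # slow range lookup (assumed)
        if location is not None:
            mapping_table[ip] = location  # persist for later runs
    if location is not None:
        cache.put(ip, location)
    return location
```

With this shape, the expensive MaxMind lookup runs at most once per distinct IP, and after a restart the mapping table still answers lookups while the in-memory cache warms up again.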
