Re: [Architecture] [Analytics][APIM] - Implement Geo location graph in Analytics

Kishanthan Thangarajah Sat, 19 Mar 2016 02:41:33 -0700

Lochana, let's try to reuse this udf for AS analytics too.

On Thu, Mar 17, 2016 at 2:43 PM, Janaka Ranabahu <[email protected]> wrote:


> Hi Tharindu,
>
> Great work. Can we do a performance test of this and share the results.
> Basically what we need to check is to see how much time a script would take
> to execute.
>
> Thanks,
> Janaka
>
> On Thu, Mar 17, 2016 at 2:37 PM, Tharindu Dharmarathna <[email protected]
> > wrote:
>
>> Hi All,
>>
>> After Going though above discussion , We had implemented the Plug-gable
>> User Define Extension point. From this configuration We can write our own
>> implementation which can used to get the Country and State of the Given IP.
>>
>> *Caching Implementation*
>>
>> We define two level of caching as below.
>>
>> When IP address checked from the *UDF* , First It check on Cache to get
>> the Location Information. If it is not in cache  It I'll check on another
>> database which contain IP to Location Direct Mapping as *Sajith*
>> Mentioned. If it is there it will return and cache that location. If
>> location not in that database , IP will check against the *MAXMIND*
>> database. and store the location on cache and the above table.
>>
>> Thanks
>> Tharindu
>>
>>
>> On Tue, Mar 8, 2016 at 2:34 PM, Tharindu Dharmarathna <[email protected]
>> > wrote:
>>
>>> Hi All,
>>>
>>> We have come across following ways to do the above task after the
>>> Initial POC.
>>>
>>> 1. Using File type database which given by max-mind (.mmdb) and use
>>> there database readers.
>>>
>>> From this approach we got lesser value to get the location from the
>>> above using JAX-RS service which is used to wrap the above database. This
>>> JAX-RS implementation is by default used the max-mind's Cache
>>> implementation which can find from [1] .
>>>
>>> *Limitations*
>>>
>>>
>>>    - Hosting of the Jax-RS app in another server.
>>>    - # of http calls will high.
>>>
>>>
>>> 2. Call query server as above thread and cached the location with ip.
>>>
>>> Here you can find the execution time for a single query which get for
>>> each method.
>>>
>>>
>>> *Method 1 : 4.5 seconds*
>>>
>>> *Method 2: 4.76 seconds*
>>>
>>>
>>> Thanks
>>> Tharindu
>>>
>>>
>>> On Tue, Mar 8, 2016 at 8:29 AM, Lasantha Fernando <[email protected]>
>>> wrote:
>>>
>>>> Hi Tharindu,
>>>>
>>>> On 7 March 2016 at 21:10, Sajith Ravindra <[email protected]> wrote:
>>>>
>>>>>
>>>>> 2. Having a DB based cache would persist the data even on a restart
>>>>>> and the data fetching query would be searching for an specific value(not 
>>>>>> a
>>>>>> range query as against the max-mind DB). But the downside is that for a
>>>>>> cache miss there would be minimum 3 DB queries (one for the cache table
>>>>>> lookup and one for the max-mind db lookup and one for the
>>>>>> cache persistence).
>>>>>>
>>>>>
>>>>> In order to avoid expensive cache misses we may eagerly populate the
>>>>> DB table cache. i.e. When there's a cache miss we do the lookup in 
>>>>> max-mind
>>>>> db and then add multiple entries for multiple IPs of that netwokrk_cid to
>>>>> the Cache DB table instead of only for that particular IP. That way we
>>>>> reduce the chance of cache miss being very expensive, as we increase the
>>>>> chance of it being found on the first DB lookup.
>>>>>
>>>>> We might need to do some evaluation to determine how much entries that
>>>>> we are going to add to the DB cache for IP belongs to a  particular
>>>>> netwokrk_cid. For an example if requests from a certain netwokrk_cidr is
>>>>> frequent we may want to add more entries with compared to a less frequent
>>>>> netwokrk_cidr.
>>>>>
>>>>> The downside is the DB cache tend to be more big.
>>>>>
>>>>> Thanks
>>>>> *,Sajith Ravindra*
>>>>> Senior Software Engineer
>>>>> WSO2 Inc.; http://wso2.com
>>>>> lean.enterprise.middleware
>>>>>
>>>>> mobile: +94 77 2273550
>>>>> blog: http://sajithr.blogspot.com/
>>>>> <http://lk.linkedin.com/pub/shani-ranasinghe/34/111/ab>
>>>>>
>>>>> On Mon, Mar 7, 2016 at 4:37 AM, Tharindu Dharmarathna <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Lasantha,
>>>>>>
>>>>>> Upto now we are doing the following way in order to get the geo
>>>>>> location from the stated dump.
>>>>>>
>>>>>> 1.  two columns added filled with long value of lower and upper value
>>>>>> of network ip addresses. Then get the geoname_id with respect to the long
>>>>>> value for the given ip which between this above long values. Hope you 
>>>>>> will
>>>>>> got this idea on our approach. Is there any way to do bit wise operation 
>>>>>> in
>>>>>> order to get the network_cidr value ? .
>>>>>>
>>>>>
>>>> Can't we do it by keeping the network IP and the subnet as two columns
>>>> and the geoname_id as the third. Say for example, if 192.168.0.0/20 is
>>>> the cidr, for IPv4 routing what is usually done is we get the IP as int,
>>>> then do a bitwise AND with the subnet mask (e.g. if subnet mask is 20, that
>>>> would mean 20 bits with value 1 and remaining 12 bits of value 0, i.e.
>>>> 11111111 11111111 11110000 00000) and check whether that returns the
>>>> network IP.
>>>>
>>>> You might find more info here [1]. I think there should be libraries
>>>> that wrap this operation. But if performance is a concern and we need to
>>>> keep the cache search implementation very lean, we can implement it
>>>> ourselves.
>>>>
>>>> WDYT?
>>>>
>>>> [1]
>>>> http://stackoverflow.com/questions/4209760/validate-an-ip-address-with-mask
>>>>
>>>> Thanks,
>>>> Lasantha
>>>>
>>>>
>>>>>> Thanks
>>>>>> Tharindu
>>>>>>
>>>>>> On Mon, Mar 7, 2016 at 12:05 AM, Lasantha Fernando <[email protected]
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I think what Sachith suggests also makes sense. But am also rooting
>>>>>>> for the in-memory cache implementation suggested by Sanjeewa with
>>>>>>> ip-netmask approach.
>>>>>>>
>>>>>>> Please find my comments inline.
>>>>>>>
>>>>>>> On 5 March 2016 at 23:50, Sachith Withana <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> From what I understand/was told, this happens once a day ( or
>>>>>>>> relatively infrequently), and you wanna avoid searching through all 
>>>>>>>> the geo
>>>>>>>> data per ip ( since you are grouping the requests by IP).
>>>>>>>>
>>>>>>>> IF that's the case, it would be better to use a separate DB table
>>>>>>>> to cache these data ( IP, geoID ..etc) with the IP being the primary 
>>>>>>>> key (
>>>>>>>> which would improve the lookup time), and even though there will be 
>>>>>>>> cache
>>>>>>>> misses, it would eventually reduce the (#cacheMisses/ Hits).
>>>>>>>>
>>>>>>>> Having a DB cache would be better since you do want to persist
>>>>>>>> these data to be used over time.
>>>>>>>>
>>>>>>>> BTW in a cache miss, if we can figure out a way to limit the search
>>>>>>>> range on the original table or at least stop the search once a match is
>>>>>>>> found, it would greatly improve the cache miss time as well.
>>>>>>>>
>>>>>>>> That's my two cents.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Sachith
>>>>>>>>
>>>>>>>> On Sun, Mar 6, 2016 at 8:24 AM, Janaka Ranabahu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Sanjeewa,
>>>>>>>>>
>>>>>>>>> On Sun, Mar 6, 2016 at 7:25 AM, Sanjeewa Malalgoda <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Implementing cache is better than having another table mapping
>>>>>>>>>> IMO. What if we query database and keep IP range and network name in 
>>>>>>>>>> memory.
>>>>>>>>>> Then we may do quick search on network name and then based on
>>>>>>>>>> that rest can load some other way.
>>>>>>>>>> WDYT?
>>>>>>>>>>
>>>>>>>>> We thought of having an in memory cache but we faced several
>>>>>>>>> issues along the way. Let me explain the situation as it is per now.
>>>>>>>>>
>>>>>>>>> The Max-Mind DB has the IP addresses with the IP and the netmask.
>>>>>>>>> Ex: 192.168.0.0/20
>>>>>>>>>
>>>>>>>>> The calculation of the IP address range would be like the
>>>>>>>>> following.
>>>>>>>>>
>>>>>>>>> Address:   192.168.0.1           11000000.10101000.0000 0000.00000001
>>>>>>>>> Netmask:   255.255.240.0 = 20    11111111.11111111.1111 0000.00000000
>>>>>>>>> Wildcard:  0.0.15.255            00000000.00000000.0000 1111.11111111
>>>>>>>>> =>Network:   192.168.0.0/20        11000000.10101000.0000 
>>>>>>>>> 0000.00000000 (Class C)
>>>>>>>>> Broadcast: 192.168.15.255        11000000.10101000.0000 1111.11111111
>>>>>>>>> HostMin:   192.168.0.1           11000000.10101000.0000 0000.00000001
>>>>>>>>> HostMax:   192.168.15.254        11000000.10101000.0000 1111.11111110
>>>>>>>>> Hosts/Net: 4094                  (Private Internet 
>>>>>>>>> <http://www.ietf.org/rfc/rfc1918.txt>)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Therefore what we are currently doing is to calculate the start
>>>>>>>>> and end IP for all the values in the max-mind DB and alter the tables 
>>>>>>>>> with
>>>>>>>>> those values initially(this is a one time thing that will happen). 
>>>>>>>>> When the
>>>>>>>>> Spark script executes, we check whether the given IP is between any 
>>>>>>>>> of the
>>>>>>>>> start and end ranges in the tables. That is the reason why it is 
>>>>>>>>> taking a
>>>>>>>>> long time to fetch results for a given IP.
>>>>>>>>>
>>>>>>>>> As a solution for this, we discussed what Tharindu has mentioned.
>>>>>>>>> 1. Have a in memory caching mechanism.
>>>>>>>>> 2. Have a DB based caching mechanism.
>>>>>>>>>
>>>>>>>>> The only point that we have to highlight is the fact that in both
>>>>>>>>> the above mechanisms we need to cache the IP address(not the 
>>>>>>>>> ip-netmask as
>>>>>>>>> it was in the max-mind db) against the Geo location.
>>>>>>>>>
>>>>>>>>> Ex:-
>>>>>>>>> For 192.168.0.1       - Colombo, Sri Lanka
>>>>>>>>> For 192.168.15.254 - Colombo, Sri Lanka
>>>>>>>>>
>>>>>>>>> So as per the above example I took, if there are requests form all
>>>>>>>>> the possible 4094 address we will be caching each IP with the Geo
>>>>>>>>> location(since introducing range queries in a cache is not a good 
>>>>>>>>> practice).
>>>>>>>>>
>>>>>>>>
>>>>>>> Since we are implementing a custom cache, won't we be doing a
>>>>>>> bitwise operation for the lookup with netmask and network IP? So 
>>>>>>> basically,
>>>>>>> we would keep the network IP and the netmask in cache and simply do a
>>>>>>> bitwise AND to determine whether it is a match or not, right? Am 
>>>>>>> thinking
>>>>>>> such an operation would not incur much of a performance hit and it 
>>>>>>> won't be
>>>>>>> as prohibitive as a normal range query in a cache. If that is the case, 
>>>>>>> I
>>>>>>> think we can go with the approach suggested by Sanjeewa.
>>>>>>>
>>>>>>> WDYT?
>>>>>>>
>>>>>>>
>>>>>>>>> Please find my comments about both the approaches.
>>>>>>>>>
>>>>>>>>> 1. Having an in-memory cache would speedup things but based on the
>>>>>>>>> IPs in the data set, there could be number of entries for IPs in the 
>>>>>>>>> same
>>>>>>>>> range. One problem with this approach is that, if there is a server
>>>>>>>>> restart, the initial script execution would take a lots of time. Also 
>>>>>>>>> based
>>>>>>>>> on certain scenarios(high number of different IPs) the cache would 
>>>>>>>>> not have
>>>>>>>>> a significant effect on script execution performance.
>>>>>>>>>
>>>>>>>>> 2. Having a DB based cache would persist the data even on a
>>>>>>>>> restart and the data fetching query would be searching for an specific
>>>>>>>>> value(not a range query as against the max-mind DB). But the downside 
>>>>>>>>> is
>>>>>>>>> that for a cache miss there would be minimum 3 DB queries (one for the
>>>>>>>>> cache table lookup and one for the max-mind db lookup and one for the
>>>>>>>>> cache persistence).
>>>>>>>>>
>>>>>>>>> That is why we have initiated this thread to finalize the caching
>>>>>>>>> approach we should take.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Janaka
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> sanjeewa.
>>>>>>>>>>
>>>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lasantha
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>> On Fri, Mar 4, 2016 at 3:12 PM, Tharindu Dharmarathna <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> We are going to implement Client IP based Geo-location Graph in
>>>>>>>>>>> API Manager Analytics. When we go through the ways of doing in [1] 
>>>>>>>>>>> , we
>>>>>>>>>>> selected [2] as the most suitable way to do.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *Overview of max-mind's DB.*
>>>>>>>>>>>
>>>>>>>>>>> As the structure of the db (attached in image), They have two
>>>>>>>>>>> tables which incorporate to get the location.
>>>>>>>>>>>
>>>>>>>>>>> Find geoname_id according to network and get Country,City from
>>>>>>>>>>> locations table.
>>>>>>>>>>>
>>>>>>>>>>> *Limitations*
>>>>>>>>>>>
>>>>>>>>>>> As their database dump we couldn't directly process the ip from
>>>>>>>>>>> those tables. We need to check the given ip is in between the 
>>>>>>>>>>> network min
>>>>>>>>>>> and max ip. This query get some long time (10 seconds in indexed 
>>>>>>>>>>> data). If
>>>>>>>>>>> we directly do this from spark script for each and every ip which in
>>>>>>>>>>> summary table (regardless if ip is same from two row data) will 
>>>>>>>>>>> query from
>>>>>>>>>>> the tables. Therefore this will incur the performance impact on 
>>>>>>>>>>> this graph.
>>>>>>>>>>>
>>>>>>>>>>> *Solution*
>>>>>>>>>>>
>>>>>>>>>>> 1. Implement LRU cache against ip address vs location.
>>>>>>>>>>>
>>>>>>>>>>> This will need to implement on custom UDF in Spark. If ip
>>>>>>>>>>> querying from spark available in cache it will give the location 
>>>>>>>>>>> from it ,
>>>>>>>>>>> IF it is not It will retrieve from DB and put into the cache.
>>>>>>>>>>>
>>>>>>>>>>> 2. Persist in a Table
>>>>>>>>>>>
>>>>>>>>>>> ip as the primary key and Country and city as other columns and
>>>>>>>>>>> retrieve data from that table.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Please feel free to give us the most suitable way of doing this
>>>>>>>>>>> solution?.
>>>>>>>>>>>
>>>>>>>>>>> [1] - Implementing Geographical based Analytics in API Manager
>>>>>>>>>>> mail thread.
>>>>>>>>>>>
>>>>>>>>>>> [2] - http://dev.maxmind.com/geoip/geoip2/geolite2/
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *Thanks*
>>>>>>>>>>>
>>>>>>>>>>> *Tharindu Dharmarathna*
>>>>>>>>>>> Associate Software Engineer
>>>>>>>>>>> WSO2 Inc.; http://wso2.com
>>>>>>>>>>> lean.enterprise.middleware
>>>>>>>>>>>
>>>>>>>>>>> mobile: *+94779109091 <%2B94779109091>*
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> *Sanjeewa Malalgoda*
>>>>>>>>>> WSO2 Inc.
>>>>>>>>>> Mobile : +94713068779
>>>>>>>>>>
>>>>>>>>>> <http://sanjeewamalalgoda.blogspot.com/>blog
>>>>>>>>>> :http://sanjeewamalalgoda.blogspot.com/
>>>>>>>>>> <http://sanjeewamalalgoda.blogspot.com/>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Janaka Ranabahu*
>>>>>>>>> Associate Technical Lead, WSO2 Inc.
>>>>>>>>> http://wso2.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *E-mail: [email protected] <http://wso2.com>**M: **+94 718370861
>>>>>>>>> <%2B94%20718370861>*
>>>>>>>>>
>>>>>>>>> Lean . Enterprise . Middleware
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sachith Withana
>>>>>>>> Software Engineer; WSO2 Inc.; http://wso2.com
>>>>>>>> E-mail: sachith AT wso2.com
>>>>>>>> M: +94715518127
>>>>>>>> Linked-In: <http://goog_416592669>
>>>>>>>> https://lk.linkedin.com/in/sachithwithana
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Architecture mailing list
>>>>>>>> [email protected]
>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *Lasantha Fernando*
>>>>>>> Senior Software Engineer - Data Technologies Team
>>>>>>> WSO2 Inc. http://wso2.com
>>>>>>>
>>>>>>> email: [email protected]
>>>>>>> mobile: (+94) 71 5247551
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Architecture mailing list
>>>>>>> [email protected]
>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> *Tharindu Dharmarathna*Associate Software Engineer
>>>>>> WSO2 Inc.; http://wso2.com
>>>>>> lean.enterprise.middleware
>>>>>>
>>>>>> mobile: *+94779109091 <%2B94779109091>*
>>>>>>
>>>>>> _______________________________________________
>>>>>> Architecture mailing list
>>>>>> [email protected]
>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Lasantha Fernando*
>>>> Senior Software Engineer - Data Technologies Team
>>>> WSO2 Inc. http://wso2.com
>>>>
>>>> email: [email protected]
>>>> mobile: (+94) 71 5247551
>>>>
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> *Tharindu Dharmarathna*Associate Software Engineer
>>> WSO2 Inc.; http://wso2.com
>>> lean.enterprise.middleware
>>>
>>> mobile: *+94779109091 <%2B94779109091>*
>>>
>>
>>
>>
>> --
>>
>> *Tharindu Dharmarathna*Associate Software Engineer
>> WSO2 Inc.; http://wso2.com
>> lean.enterprise.middleware
>>
>> mobile: *+94779109091 <%2B94779109091>*
>>
>
>
>
> --
> *Janaka Ranabahu*
> Associate Technical Lead, WSO2 Inc.
> http://wso2.com
>
>
> *E-mail: [email protected] <http://wso2.com>**M: **+94 718370861
> <%2B94%20718370861>*
>
> Lean . Enterprise . Middleware
>



-- 
*Kishanthan Thangarajah*
Associate Technical Lead,
Platform Technologies Team,
WSO2, Inc.
lean.enterprise.middleware

Mobile - +94773426635
Blog - *http://kishanthan.wordpress.com <http://kishanthan.wordpress.com>*
Twitter - *http://twitter.com/kishanthan <http://twitter.com/kishanthan>*

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Re: [Architecture] [Analytics][APIM] - Implement Geo location graph in Analytics

Reply via email to