Re: [Architecture] [Analytics][APIM] - Implement Geo location graph in Analytics

Tharindu Dharmarathna Tue, 08 Mar 2016 01:05:24 -0800

Hi All,

We have come across following ways to do the above task after the Initial
POC.


1. Using File type database which given by max-mind (.mmdb) and use there
database readers.

>From this approach we got lesser value to get the location from the above
using JAX-RS service which is used to wrap the above database. This JAX-RS
implementation is by default used the max-mind's Cache implementation which
can find from [1] .

*Limitations*


   - Hosting of the Jax-RS app in another server.
   - # of http calls will high.


2. Call query server as above thread and cached the location with ip.

Here you can find the execution time for a single query which get for each
method.


*Method 1 : 4.5 seconds*

*Method 2: 4.76 seconds*


Thanks
Tharindu


On Tue, Mar 8, 2016 at 8:29 AM, Lasantha Fernando <[email protected]> wrote:

> Hi Tharindu,
>
> On 7 March 2016 at 21:10, Sajith Ravindra <[email protected]> wrote:
>
>>
>> 2. Having a DB based cache would persist the data even on a restart and
>>> the data fetching query would be searching for an specific value(not a
>>> range query as against the max-mind DB). But the downside is that for a
>>> cache miss there would be minimum 3 DB queries (one for the cache table
>>> lookup and one for the max-mind db lookup and one for the
>>> cache persistence).
>>>
>>
>> In order to avoid expensive cache misses we may eagerly populate the DB
>> table cache. i.e. When there's a cache miss we do the lookup in max-mind db
>> and then add multiple entries for multiple IPs of that netwokrk_cid to the
>> Cache DB table instead of only for that particular IP. That way we reduce
>> the chance of cache miss being very expensive, as we increase the chance of
>> it being found on the first DB lookup.
>>
>> We might need to do some evaluation to determine how much entries that we
>> are going to add to the DB cache for IP belongs to a  particular
>> netwokrk_cid. For an example if requests from a certain netwokrk_cidr is
>> frequent we may want to add more entries with compared to a less frequent
>> netwokrk_cidr.
>>
>> The downside is the DB cache tend to be more big.
>>
>> Thanks
>> *,Sajith Ravindra*
>> Senior Software Engineer
>> WSO2 Inc.; http://wso2.com
>> lean.enterprise.middleware
>>
>> mobile: +94 77 2273550
>> blog: http://sajithr.blogspot.com/
>> <http://lk.linkedin.com/pub/shani-ranasinghe/34/111/ab>
>>
>> On Mon, Mar 7, 2016 at 4:37 AM, Tharindu Dharmarathna <[email protected]
>> > wrote:
>>
>>> Hi Lasantha,
>>>
>>> Upto now we are doing the following way in order to get the geo location
>>> from the stated dump.
>>>
>>> 1.  two columns added filled with long value of lower and upper value of
>>> network ip addresses. Then get the geoname_id with respect to the long
>>> value for the given ip which between this above long values. Hope you will
>>> got this idea on our approach. Is there any way to do bit wise operation in
>>> order to get the network_cidr value ? .
>>>
>>
> Can't we do it by keeping the network IP and the subnet as two columns and
> the geoname_id as the third. Say for example, if 192.168.0.0/20 is the
> cidr, for IPv4 routing what is usually done is we get the IP as int, then
> do a bitwise AND with the subnet mask (e.g. if subnet mask is 20, that
> would mean 20 bits with value 1 and remaining 12 bits of value 0, i.e.
> 11111111 11111111 11110000 00000) and check whether that returns the
> network IP.
>
> You might find more info here [1]. I think there should be libraries that
> wrap this operation. But if performance is a concern and we need to keep
> the cache search implementation very lean, we can implement it ourselves.
>
> WDYT?
>
> [1]
> http://stackoverflow.com/questions/4209760/validate-an-ip-address-with-mask
>
> Thanks,
> Lasantha
>
>
>>> Thanks
>>> Tharindu
>>>
>>> On Mon, Mar 7, 2016 at 12:05 AM, Lasantha Fernando <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I think what Sachith suggests also makes sense. But am also rooting for
>>>> the in-memory cache implementation suggested by Sanjeewa with ip-netmask
>>>> approach.
>>>>
>>>> Please find my comments inline.
>>>>
>>>> On 5 March 2016 at 23:50, Sachith Withana <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> From what I understand/was told, this happens once a day ( or
>>>>> relatively infrequently), and you wanna avoid searching through all the 
>>>>> geo
>>>>> data per ip ( since you are grouping the requests by IP).
>>>>>
>>>>> IF that's the case, it would be better to use a separate DB table to
>>>>> cache these data ( IP, geoID ..etc) with the IP being the primary key (
>>>>> which would improve the lookup time), and even though there will be cache
>>>>> misses, it would eventually reduce the (#cacheMisses/ Hits).
>>>>>
>>>>> Having a DB cache would be better since you do want to persist these
>>>>> data to be used over time.
>>>>>
>>>>> BTW in a cache miss, if we can figure out a way to limit the search
>>>>> range on the original table or at least stop the search once a match is
>>>>> found, it would greatly improve the cache miss time as well.
>>>>>
>>>>> That's my two cents.
>>>>>
>>>>> Cheers,
>>>>> Sachith
>>>>>
>>>>> On Sun, Mar 6, 2016 at 8:24 AM, Janaka Ranabahu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Sanjeewa,
>>>>>>
>>>>>> On Sun, Mar 6, 2016 at 7:25 AM, Sanjeewa Malalgoda <[email protected]
>>>>>> > wrote:
>>>>>>
>>>>>>> Implementing cache is better than having another table mapping IMO.
>>>>>>> What if we query database and keep IP range and network name in memory.
>>>>>>> Then we may do quick search on network name and then based on that
>>>>>>> rest can load some other way.
>>>>>>> WDYT?
>>>>>>>
>>>>>> We thought of having an in memory cache but we faced several issues
>>>>>> along the way. Let me explain the situation as it is per now.
>>>>>>
>>>>>> The Max-Mind DB has the IP addresses with the IP and the netmask.
>>>>>> Ex: 192.168.0.0/20
>>>>>>
>>>>>> The calculation of the IP address range would be like the following.
>>>>>>
>>>>>> Address:   192.168.0.1           11000000.10101000.0000 0000.00000001
>>>>>> Netmask:   255.255.240.0 = 20    11111111.11111111.1111 0000.00000000
>>>>>> Wildcard:  0.0.15.255            00000000.00000000.0000 1111.11111111
>>>>>> =>Network:   192.168.0.0/20        11000000.10101000.0000 0000.00000000 
>>>>>> (Class C)
>>>>>> Broadcast: 192.168.15.255        11000000.10101000.0000 1111.11111111
>>>>>> HostMin:   192.168.0.1           11000000.10101000.0000 0000.00000001
>>>>>> HostMax:   192.168.15.254        11000000.10101000.0000 1111.11111110
>>>>>> Hosts/Net: 4094                  (Private Internet 
>>>>>> <http://www.ietf.org/rfc/rfc1918.txt>)
>>>>>>
>>>>>>
>>>>>> Therefore what we are currently doing is to calculate the start and
>>>>>> end IP for all the values in the max-mind DB and alter the tables with
>>>>>> those values initially(this is a one time thing that will happen). When 
>>>>>> the
>>>>>> Spark script executes, we check whether the given IP is between any of 
>>>>>> the
>>>>>> start and end ranges in the tables. That is the reason why it is taking a
>>>>>> long time to fetch results for a given IP.
>>>>>>
>>>>>> As a solution for this, we discussed what Tharindu has mentioned.
>>>>>> 1. Have a in memory caching mechanism.
>>>>>> 2. Have a DB based caching mechanism.
>>>>>>
>>>>>> The only point that we have to highlight is the fact that in both the
>>>>>> above mechanisms we need to cache the IP address(not the ip-netmask as it
>>>>>> was in the max-mind db) against the Geo location.
>>>>>>
>>>>>> Ex:-
>>>>>> For 192.168.0.1       - Colombo, Sri Lanka
>>>>>> For 192.168.15.254 - Colombo, Sri Lanka
>>>>>>
>>>>>> So as per the above example I took, if there are requests form all
>>>>>> the possible 4094 address we will be caching each IP with the Geo
>>>>>> location(since introducing range queries in a cache is not a good 
>>>>>> practice).
>>>>>>
>>>>>
>>>> Since we are implementing a custom cache, won't we be doing a bitwise
>>>> operation for the lookup with netmask and network IP? So basically, we
>>>> would keep the network IP and the netmask in cache and simply do a bitwise
>>>> AND to determine whether it is a match or not, right? Am thinking such an
>>>> operation would not incur much of a performance hit and it won't be as
>>>> prohibitive as a normal range query in a cache. If that is the case, I
>>>> think we can go with the approach suggested by Sanjeewa.
>>>>
>>>> WDYT?
>>>>
>>>>
>>>>>> Please find my comments about both the approaches.
>>>>>>
>>>>>> 1. Having an in-memory cache would speedup things but based on the
>>>>>> IPs in the data set, there could be number of entries for IPs in the same
>>>>>> range. One problem with this approach is that, if there is a server
>>>>>> restart, the initial script execution would take a lots of time. Also 
>>>>>> based
>>>>>> on certain scenarios(high number of different IPs) the cache would not 
>>>>>> have
>>>>>> a significant effect on script execution performance.
>>>>>>
>>>>>> 2. Having a DB based cache would persist the data even on a restart
>>>>>> and the data fetching query would be searching for an specific value(not 
>>>>>> a
>>>>>> range query as against the max-mind DB). But the downside is that for a
>>>>>> cache miss there would be minimum 3 DB queries (one for the cache table
>>>>>> lookup and one for the max-mind db lookup and one for the
>>>>>> cache persistence).
>>>>>>
>>>>>> That is why we have initiated this thread to finalize the caching
>>>>>> approach we should take.
>>>>>> 
>>>>>> Thanks,
>>>>>> Janaka
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> sanjeewa.
>>>>>>>
>>>>>>
>>>> Thanks,
>>>> Lasantha
>>>>
>>>>
>>>>>
>>>>>>> On Fri, Mar 4, 2016 at 3:12 PM, Tharindu Dharmarathna <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> We are going to implement Client IP based Geo-location Graph in API
>>>>>>>> Manager Analytics. When we go through the ways of doing in [1] , we
>>>>>>>> selected [2] as the most suitable way to do.
>>>>>>>>
>>>>>>>>
>>>>>>>> *Overview of max-mind's DB.*
>>>>>>>>
>>>>>>>> As the structure of the db (attached in image), They have two
>>>>>>>> tables which incorporate to get the location.
>>>>>>>>
>>>>>>>> Find geoname_id according to network and get Country,City from
>>>>>>>> locations table.
>>>>>>>>
>>>>>>>> *Limitations*
>>>>>>>>
>>>>>>>> As their database dump we couldn't directly process the ip from
>>>>>>>> those tables. We need to check the given ip is in between the network 
>>>>>>>> min
>>>>>>>> and max ip. This query get some long time (10 seconds in indexed 
>>>>>>>> data). If
>>>>>>>> we directly do this from spark script for each and every ip which in
>>>>>>>> summary table (regardless if ip is same from two row data) will query 
>>>>>>>> from
>>>>>>>> the tables. Therefore this will incur the performance impact on this 
>>>>>>>> graph.
>>>>>>>>
>>>>>>>> *Solution*
>>>>>>>>
>>>>>>>> 1. Implement LRU cache against ip address vs location.
>>>>>>>>
>>>>>>>> This will need to implement on custom UDF in Spark. If ip querying
>>>>>>>> from spark available in cache it will give the location from it , IF 
>>>>>>>> it is
>>>>>>>> not It will retrieve from DB and put into the cache.
>>>>>>>>
>>>>>>>> 2. Persist in a Table
>>>>>>>>
>>>>>>>> ip as the primary key and Country and city as other columns and
>>>>>>>> retrieve data from that table.
>>>>>>>>
>>>>>>>>
>>>>>>>> Please feel free to give us the most suitable way of doing this
>>>>>>>> solution?.
>>>>>>>>
>>>>>>>> [1] - Implementing Geographical based Analytics in API Manager mail
>>>>>>>> thread.
>>>>>>>>
>>>>>>>> [2] - http://dev.maxmind.com/geoip/geoip2/geolite2/
>>>>>>>>
>>>>>>>>
>>>>>>>> *Thanks*
>>>>>>>>
>>>>>>>> *Tharindu Dharmarathna*
>>>>>>>> Associate Software Engineer
>>>>>>>> WSO2 Inc.; http://wso2.com
>>>>>>>> lean.enterprise.middleware
>>>>>>>>
>>>>>>>> mobile: *+94779109091 <%2B94779109091>*
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> *Sanjeewa Malalgoda*
>>>>>>> WSO2 Inc.
>>>>>>> Mobile : +94713068779
>>>>>>>
>>>>>>> <http://sanjeewamalalgoda.blogspot.com/>blog
>>>>>>> :http://sanjeewamalalgoda.blogspot.com/
>>>>>>> <http://sanjeewamalalgoda.blogspot.com/>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Janaka Ranabahu*
>>>>>> Associate Technical Lead, WSO2 Inc.
>>>>>> http://wso2.com
>>>>>>
>>>>>>
>>>>>> *E-mail: [email protected] <http://wso2.com>**M: **+94 718370861
>>>>>> <%2B94%20718370861>*
>>>>>>
>>>>>> Lean . Enterprise . Middleware
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sachith Withana
>>>>> Software Engineer; WSO2 Inc.; http://wso2.com
>>>>> E-mail: sachith AT wso2.com
>>>>> M: +94715518127
>>>>> Linked-In: <http://goog_416592669>
>>>>> https://lk.linkedin.com/in/sachithwithana
>>>>>
>>>>> _______________________________________________
>>>>> Architecture mailing list
>>>>> [email protected]
>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Lasantha Fernando*
>>>> Senior Software Engineer - Data Technologies Team
>>>> WSO2 Inc. http://wso2.com
>>>>
>>>> email: [email protected]
>>>> mobile: (+94) 71 5247551
>>>>
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> *Tharindu Dharmarathna*Associate Software Engineer
>>> WSO2 Inc.; http://wso2.com
>>> lean.enterprise.middleware
>>>
>>> mobile: *+94779109091 <%2B94779109091>*
>>>
>>> _______________________________________________
>>> Architecture mailing list
>>> [email protected]
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>
>>>
>>
>
>
> --
> *Lasantha Fernando*
> Senior Software Engineer - Data Technologies Team
> WSO2 Inc. http://wso2.com
>
> email: [email protected]
> mobile: (+94) 71 5247551
>
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>


-- 

*Tharindu Dharmarathna*Associate Software Engineer
WSO2 Inc.; http://wso2.com
lean.enterprise.middleware

mobile: *+94779109091*

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Re: [Architecture] [Analytics][APIM] - Implement Geo location graph in Analytics

Reply via email to