Re: [Architecture] [Analytics][APIM] - Implement Geo location graph in Analytics

Lasantha Fernando Mon, 07 Mar 2016 19:01:14 -0800

Hi Tharindu,

On 7 March 2016 at 21:10, Sajith Ravindra <[email protected]> wrote:


>
> 2. Having a DB based cache would persist the data even on a restart and
>> the data fetching query would be searching for an specific value(not a
>> range query as against the max-mind DB). But the downside is that for a
>> cache miss there would be minimum 3 DB queries (one for the cache table
>> lookup and one for the max-mind db lookup and one for the
>> cache persistence).
>>
>
> In order to avoid expensive cache misses we may eagerly populate the DB
> table cache. i.e. When there's a cache miss we do the lookup in max-mind db
> and then add multiple entries for multiple IPs of that netwokrk_cid to the
> Cache DB table instead of only for that particular IP. That way we reduce
> the chance of cache miss being very expensive, as we increase the chance of
> it being found on the first DB lookup.
>
> We might need to do some evaluation to determine how much entries that we
> are going to add to the DB cache for IP belongs to a  particular
> netwokrk_cid. For an example if requests from a certain netwokrk_cidr is
> frequent we may want to add more entries with compared to a less frequent
> netwokrk_cidr.
>
> The downside is the DB cache tend to be more big.
>
> Thanks
> *,Sajith Ravindra*
> Senior Software Engineer
> WSO2 Inc.; http://wso2.com
> lean.enterprise.middleware
>
> mobile: +94 77 2273550
> blog: http://sajithr.blogspot.com/
> <http://lk.linkedin.com/pub/shani-ranasinghe/34/111/ab>
>
> On Mon, Mar 7, 2016 at 4:37 AM, Tharindu Dharmarathna <[email protected]>
> wrote:
>
>> Hi Lasantha,
>>
>> Upto now we are doing the following way in order to get the geo location
>> from the stated dump.
>>
>> 1.  two columns added filled with long value of lower and upper value of
>> network ip addresses. Then get the geoname_id with respect to the long
>> value for the given ip which between this above long values. Hope you will
>> got this idea on our approach. Is there any way to do bit wise operation in
>> order to get the network_cidr value ? .
>>
>
Can't we do it by keeping the network IP and the subnet as two columns and
the geoname_id as the third. Say for example, if 192.168.0.0/20 is the
cidr, for IPv4 routing what is usually done is we get the IP as int, then
do a bitwise AND with the subnet mask (e.g. if subnet mask is 20, that
would mean 20 bits with value 1 and remaining 12 bits of value 0, i.e.
11111111 11111111 11110000 00000) and check whether that returns the
network IP.

You might find more info here [1]. I think there should be libraries that
wrap this operation. But if performance is a concern and we need to keep
the cache search implementation very lean, we can implement it ourselves.

WDYT?

[1]
http://stackoverflow.com/questions/4209760/validate-an-ip-address-with-mask

Thanks,
Lasantha


>> Thanks
>> Tharindu
>>
>> On Mon, Mar 7, 2016 at 12:05 AM, Lasantha Fernando <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> I think what Sachith suggests also makes sense. But am also rooting for
>>> the in-memory cache implementation suggested by Sanjeewa with ip-netmask
>>> approach.
>>>
>>> Please find my comments inline.
>>>
>>> On 5 March 2016 at 23:50, Sachith Withana <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> From what I understand/was told, this happens once a day ( or
>>>> relatively infrequently), and you wanna avoid searching through all the geo
>>>> data per ip ( since you are grouping the requests by IP).
>>>>
>>>> IF that's the case, it would be better to use a separate DB table to
>>>> cache these data ( IP, geoID ..etc) with the IP being the primary key (
>>>> which would improve the lookup time), and even though there will be cache
>>>> misses, it would eventually reduce the (#cacheMisses/ Hits).
>>>>
>>>> Having a DB cache would be better since you do want to persist these
>>>> data to be used over time.
>>>>
>>>> BTW in a cache miss, if we can figure out a way to limit the search
>>>> range on the original table or at least stop the search once a match is
>>>> found, it would greatly improve the cache miss time as well.
>>>>
>>>> That's my two cents.
>>>>
>>>> Cheers,
>>>> Sachith
>>>>
>>>> On Sun, Mar 6, 2016 at 8:24 AM, Janaka Ranabahu <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Sanjeewa,
>>>>>
>>>>> On Sun, Mar 6, 2016 at 7:25 AM, Sanjeewa Malalgoda <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Implementing cache is better than having another table mapping IMO.
>>>>>> What if we query database and keep IP range and network name in memory.
>>>>>> Then we may do quick search on network name and then based on that
>>>>>> rest can load some other way.
>>>>>> WDYT?
>>>>>>
>>>>> We thought of having an in memory cache but we faced several issues
>>>>> along the way. Let me explain the situation as it is per now.
>>>>>
>>>>> The Max-Mind DB has the IP addresses with the IP and the netmask.
>>>>> Ex: 192.168.0.0/20
>>>>>
>>>>> The calculation of the IP address range would be like the following.
>>>>>
>>>>> Address:   192.168.0.1           11000000.10101000.0000 0000.00000001
>>>>> Netmask:   255.255.240.0 = 20    11111111.11111111.1111 0000.00000000
>>>>> Wildcard:  0.0.15.255            00000000.00000000.0000 1111.11111111
>>>>> =>Network:   192.168.0.0/20        11000000.10101000.0000 0000.00000000 
>>>>> (Class C)
>>>>> Broadcast: 192.168.15.255        11000000.10101000.0000 1111.11111111
>>>>> HostMin:   192.168.0.1           11000000.10101000.0000 0000.00000001
>>>>> HostMax:   192.168.15.254        11000000.10101000.0000 1111.11111110
>>>>> Hosts/Net: 4094                  (Private Internet 
>>>>> <http://www.ietf.org/rfc/rfc1918.txt>)
>>>>>
>>>>>
>>>>> Therefore what we are currently doing is to calculate the start and
>>>>> end IP for all the values in the max-mind DB and alter the tables with
>>>>> those values initially(this is a one time thing that will happen). When 
>>>>> the
>>>>> Spark script executes, we check whether the given IP is between any of the
>>>>> start and end ranges in the tables. That is the reason why it is taking a
>>>>> long time to fetch results for a given IP.
>>>>>
>>>>> As a solution for this, we discussed what Tharindu has mentioned.
>>>>> 1. Have a in memory caching mechanism.
>>>>> 2. Have a DB based caching mechanism.
>>>>>
>>>>> The only point that we have to highlight is the fact that in both the
>>>>> above mechanisms we need to cache the IP address(not the ip-netmask as it
>>>>> was in the max-mind db) against the Geo location.
>>>>>
>>>>> Ex:-
>>>>> For 192.168.0.1       - Colombo, Sri Lanka
>>>>> For 192.168.15.254 - Colombo, Sri Lanka
>>>>>
>>>>> So as per the above example I took, if there are requests form all the
>>>>> possible 4094 address we will be caching each IP with the Geo
>>>>> location(since introducing range queries in a cache is not a good 
>>>>> practice).
>>>>>
>>>>
>>> Since we are implementing a custom cache, won't we be doing a bitwise
>>> operation for the lookup with netmask and network IP? So basically, we
>>> would keep the network IP and the netmask in cache and simply do a bitwise
>>> AND to determine whether it is a match or not, right? Am thinking such an
>>> operation would not incur much of a performance hit and it won't be as
>>> prohibitive as a normal range query in a cache. If that is the case, I
>>> think we can go with the approach suggested by Sanjeewa.
>>>
>>> WDYT?
>>>
>>>
>>>>> Please find my comments about both the approaches.
>>>>>
>>>>> 1. Having an in-memory cache would speedup things but based on the IPs
>>>>> in the data set, there could be number of entries for IPs in the same
>>>>> range. One problem with this approach is that, if there is a server
>>>>> restart, the initial script execution would take a lots of time. Also 
>>>>> based
>>>>> on certain scenarios(high number of different IPs) the cache would not 
>>>>> have
>>>>> a significant effect on script execution performance.
>>>>>
>>>>> 2. Having a DB based cache would persist the data even on a restart
>>>>> and the data fetching query would be searching for an specific value(not a
>>>>> range query as against the max-mind DB). But the downside is that for a
>>>>> cache miss there would be minimum 3 DB queries (one for the cache table
>>>>> lookup and one for the max-mind db lookup and one for the
>>>>> cache persistence).
>>>>>
>>>>> That is why we have initiated this thread to finalize the caching
>>>>> approach we should take.
>>>>> 
>>>>> Thanks,
>>>>> Janaka
>>>>>
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> sanjeewa.
>>>>>>
>>>>>
>>> Thanks,
>>> Lasantha
>>>
>>>
>>>>
>>>>>> On Fri, Mar 4, 2016 at 3:12 PM, Tharindu Dharmarathna <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We are going to implement Client IP based Geo-location Graph in API
>>>>>>> Manager Analytics. When we go through the ways of doing in [1] , we
>>>>>>> selected [2] as the most suitable way to do.
>>>>>>>
>>>>>>>
>>>>>>> *Overview of max-mind's DB.*
>>>>>>>
>>>>>>> As the structure of the db (attached in image), They have two tables
>>>>>>> which incorporate to get the location.
>>>>>>>
>>>>>>> Find geoname_id according to network and get Country,City from
>>>>>>> locations table.
>>>>>>>
>>>>>>> *Limitations*
>>>>>>>
>>>>>>> As their database dump we couldn't directly process the ip from
>>>>>>> those tables. We need to check the given ip is in between the network 
>>>>>>> min
>>>>>>> and max ip. This query get some long time (10 seconds in indexed data). 
>>>>>>> If
>>>>>>> we directly do this from spark script for each and every ip which in
>>>>>>> summary table (regardless if ip is same from two row data) will query 
>>>>>>> from
>>>>>>> the tables. Therefore this will incur the performance impact on this 
>>>>>>> graph.
>>>>>>>
>>>>>>> *Solution*
>>>>>>>
>>>>>>> 1. Implement LRU cache against ip address vs location.
>>>>>>>
>>>>>>> This will need to implement on custom UDF in Spark. If ip querying
>>>>>>> from spark available in cache it will give the location from it , IF it 
>>>>>>> is
>>>>>>> not It will retrieve from DB and put into the cache.
>>>>>>>
>>>>>>> 2. Persist in a Table
>>>>>>>
>>>>>>> ip as the primary key and Country and city as other columns and
>>>>>>> retrieve data from that table.
>>>>>>>
>>>>>>>
>>>>>>> Please feel free to give us the most suitable way of doing this
>>>>>>> solution?.
>>>>>>>
>>>>>>> [1] - Implementing Geographical based Analytics in API Manager mail
>>>>>>> thread.
>>>>>>>
>>>>>>> [2] - http://dev.maxmind.com/geoip/geoip2/geolite2/
>>>>>>>
>>>>>>>
>>>>>>> *Thanks*
>>>>>>>
>>>>>>> *Tharindu Dharmarathna*
>>>>>>> Associate Software Engineer
>>>>>>> WSO2 Inc.; http://wso2.com
>>>>>>> lean.enterprise.middleware
>>>>>>>
>>>>>>> mobile: *+94779109091 <%2B94779109091>*
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> *Sanjeewa Malalgoda*
>>>>>> WSO2 Inc.
>>>>>> Mobile : +94713068779
>>>>>>
>>>>>> <http://sanjeewamalalgoda.blogspot.com/>blog
>>>>>> :http://sanjeewamalalgoda.blogspot.com/
>>>>>> <http://sanjeewamalalgoda.blogspot.com/>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Janaka Ranabahu*
>>>>> Associate Technical Lead, WSO2 Inc.
>>>>> http://wso2.com
>>>>>
>>>>>
>>>>> *E-mail: [email protected] <http://wso2.com>**M: **+94 718370861
>>>>> <%2B94%20718370861>*
>>>>>
>>>>> Lean . Enterprise . Middleware
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Sachith Withana
>>>> Software Engineer; WSO2 Inc.; http://wso2.com
>>>> E-mail: sachith AT wso2.com
>>>> M: +94715518127
>>>> Linked-In: <http://goog_416592669>
>>>> https://lk.linkedin.com/in/sachithwithana
>>>>
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>
>>>
>>> --
>>> *Lasantha Fernando*
>>> Senior Software Engineer - Data Technologies Team
>>> WSO2 Inc. http://wso2.com
>>>
>>> email: [email protected]
>>> mobile: (+94) 71 5247551
>>>
>>> _______________________________________________
>>> Architecture mailing list
>>> [email protected]
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>
>>>
>>
>>
>> --
>>
>> *Tharindu Dharmarathna*Associate Software Engineer
>> WSO2 Inc.; http://wso2.com
>> lean.enterprise.middleware
>>
>> mobile: *+94779109091 <%2B94779109091>*
>>
>> _______________________________________________
>> Architecture mailing list
>> [email protected]
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>


-- 
*Lasantha Fernando*
Senior Software Engineer - Data Technologies Team
WSO2 Inc. http://wso2.com

email: [email protected]
mobile: (+94) 71 5247551

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Re: [Architecture] [Analytics][APIM] - Implement Geo location graph in Analytics

Reply via email to