On Saturday, 13 July 2019, Maarten Dammers <[email protected]> wrote:
> Hi Huji,
>
> In my day job I'm a network engineer. Nothing smaller than a /24 gets
> routed on the internet. I would just do a quick and dirty approach: ignore
> the last octet. So cache based on /24. If you want to go more complicated,
> you can loosen the prefix length. An IPv4 address is 32 bits. A /24 says:
> the network is 24 bits and the host part is 8 bits. So for a /23 it's
> 23 bits of network and 9 bits of host. It's always on a bit boundary, so
> a /24 always runs from 0 (network) to 255 (broadcast). Just Google a bit
> to find posts like
> https://learningnetwork.cisco.com/blogs/vip-perspectives/2014/05/15/network-binary-math-explained .
> So comparison is very easy and very efficient.

IPv4 is easy: you can just go with a bitmap and be done with it on a decent
PC. The problems are IPv6 and usage fragmentation, as described below.

> How are you going to deal with providers that announce large chunks of IP
> space (like a /13) that are used for all sorts of things? I assume you want
> to use INET objects and not ROUTE objects? Be aware that mass harvesting of
> databases like RIPE isn't allowed. Also, the quality of these objects
> differs greatly depending on the LIR/country/RIR.

I suspect the other APIs used in the script are going to split these
networks a lot, hence my concern with running a trie at Wikipedia scale.
Maybe there's a way to split the IPv6 space just enough to make a bitmap
feasible as well?
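
For illustration, here is a minimal sketch of the "cache on the prefix" idea,
generalized from /24 to any prefix length. Because CIDR blocks are aligned on
bit boundaries, masking an address down to a given prefix length yields a
value that can be used directly as a dictionary key, so no inequality
comparisons are needed. The class name PrefixCache and its methods are
hypothetical and not part of findproxy.py; the sketch assumes Python's
standard ipaddress module and simply probes each prefix length already seen,
longest first. Whether this scales to heavily fragmented IPv6 space is an
open question that it does not answer.

```python
import ipaddress


class PrefixCache:
    """Hypothetical cache of per-network data, keyed by masked address."""

    def __init__(self):
        # (IP version, prefix length) -> {network address as int: data}
        self._tables = {}

    def add(self, cidr, data):
        """Store data for a CIDR block such as '8.4.4.0/24'."""
        net = ipaddress.ip_network(cidr, strict=False)
        table = self._tables.setdefault((net.version, net.prefixlen), {})
        table[int(net.network_address)] = data

    def lookup(self, ip):
        """Return data for the longest cached prefix containing ip, or None."""
        addr = ipaddress.ip_address(ip)
        value = int(addr)
        bits = addr.max_prefixlen  # 32 for IPv4, 128 for IPv6
        # Probe the longest prefixes first so the most specific block wins.
        for version, plen in sorted(self._tables, key=lambda k: -k[1]):
            if version != addr.version:
                continue
            # Keep the top `plen` bits and zero the host bits.
            mask = ((1 << plen) - 1) << (bits - plen)
            table = self._tables[(version, plen)]
            key = value & mask
            if key in table:
                return table[key]
        return None
```

Usage would look like cache.add("8.4.4.0/24", whois_result) after a WHOIS
call, then cache.lookup("8.4.4.9") before the next one. Each lookup costs at
most one dictionary probe per distinct prefix length in the cache, so at most
32 probes for IPv4 and 128 for IPv6.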
> Maarten
>
> On 12-07-19 04:43, Huji Lee wrote:
> > Hi all,
> >
> > I am working on a bot that fetches a list of anonymous editors on fawiki,
> > uses WHOIS to retrieve more info about that IP, and uses a number of online
> > APIs to check if the IP is a proxy or not.[1]
> >
> > I would like to improve the code by implementing a CIDR cache, so that if
> > I run WHOIS on 8.4.4.8 and determine that its ASN range is 8.4.4.0/24 and
> > then I encounter 8.4.4.9 in the next iteration of my for loop, I would
> > quickly determine this IP also belongs to the same range and skip the WHOIS
> > part for it.
> >
> > The search space would consist of IP ranges like "8.4.4.0 - 8.4.4.255"
> > (these are the beginning and end IP addresses of the 8.4.4.0/24 range).
> > Obviously, we can convert these IPs to hex to make comparisons easier.
> > Given an IP like 8.4.4.9, we need the object to efficiently determine if it
> > already has an IP range that encompasses this given IP and if so, return
> > the previously cached details for that IP range. If not, we will store that
> > in cache.
> >
> > The part that I am not fully clear about is the following: how can I avoid
> > having to loop through every range in the cache? Is there a way to create a
> > hash function that checks two inequality comparisons efficiently?
> >
> > Thanks!
> >
> > Huji
> >
> > [1] https://github.com/PersianWikipedia/fawikibot/blob/master/HujiBot/findproxy.py
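
On the original question of avoiding a loop over every cached range: when the
cache holds arbitrary start/end pairs rather than aligned CIDR blocks, one
common alternative to hashing is to keep the range starts sorted and
binary-search them. The sketch below assumes the cached ranges never overlap
and are all IPv4; the RangeCache name and its methods are made up for
illustration and do not come from findproxy.py.

```python
import bisect
import ipaddress


class RangeCache:
    """Hypothetical cache of non-overlapping IPv4 ranges, sorted by start."""

    def __init__(self):
        self._starts = []   # sorted range starts, as integers
        self._entries = []  # parallel list of (range end as int, data)

    def add(self, first_ip, last_ip, data):
        """Store data for the inclusive range first_ip..last_ip."""
        start = int(ipaddress.IPv4Address(first_ip))
        end = int(ipaddress.IPv4Address(last_ip))
        # Insert at the position that keeps the starts sorted.
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._entries.insert(i, (end, data))

    def lookup(self, ip):
        """Return data for the range containing ip, or None if not cached."""
        value = int(ipaddress.IPv4Address(ip))
        # Find the last range whose start is <= value.
        i = bisect.bisect_right(self._starts, value) - 1
        if i >= 0 and value <= self._entries[i][0]:
            return self._entries[i][1]
        return None
```

A lookup is a single binary search, so it takes O(log n) comparisons for n
cached ranges instead of a scan over all of them. Inserting into a Python
list is still O(n), which is likely acceptable only while lookups far
outnumber inserts.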
_______________________________________________ pywikibot mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/pywikibot
