Re: [analog-help] Saving time on DNS lookups

Sean Straw / PSE Wed, 28 Mar 2001 10:18:52 -0800
At 22:00 2001-03-27 +0100, Aengus wrote:
>From: <[EMAIL PROTECTED]>
>
> > Thanks for the suggestion of using jdresolve but I don't see how
> > that helps from the description on the web page.  It looks like it is a
>Unix
> > only tool and I am running on Windows 2K only.
>
>It's written in perl, so it probably runs on Win2K too. Unfortunately, I
>don't have tar or gzip

PKZIP (as in PKWare, the folks who created the ZIP format) for Windows will 
manipulate tar files (with the *nix LFNs too, unlike the DOS port of the 
Gnu tar/gzip utilities, which ARE available as well).

>make it available in .zip format.

.. or use a good unzipper which supports other formats.

>Analog is a Log Analysis tool. Stephen has devoted a lot of time to making

I concur with this POV - Analog should NOT have a bunch of extra code added 
to do rDNS.  If anything, having Analog automatically invoke some program 
with some options, specifying the logfile which Analog will be running 
against, but before Analog opens it, might be a nice option (to make the 
preprocessing basically transparent), but Analog itself should still focus 
on the actual log analysis -- not resolving data that perhaps should have 
been resolved by the webserver or by some other program intended to handle 
that sort of thing.

BTW, if you're running a webserver, also running an installation of BIND 
just to manage cached lookups (if you're not already running BIND for 
primary DNS) can significantly improve DNS lookup response, because once a 
specific host has been resolved, you're not pounding on the network looking 
the same host up just a moment later.

The suggested changes to Analog, those of adding a syntax for managing 
blocks of IPs associated with certain domains, is problematic -- for one, 
you have a not insignificant amount of administration just to manage the 
list of mappings, since this isn't automatic like rDNS.

OTOH, perhaps revising the format of the DNS cache file would be a good 
idea.  As I don't use jdresolve, and haven't poked into the cache file 
structure, I'm going off of the statements about how the cache file is read 
sequentially for each lookup, which implies a flat, unsorted 
database.  This is terribly inefficient, and we don't need a real database 
around to fix it - just a change in the approach.

For the time being, lets assume that having jdresolv and analog use an in 
memory AVL tree (or any of a number of other tree formats) would pose 
memory issues (though it'd make lookups REALLY fast).  Instead, why not 
bust the cache file into two separate files, one with IP data and the other 
with the hostnames (and whatever other data is to be stored).

ulong (4 bytes) IP ADDR (yes, perhaps it should be IPV6, whatever)
ulong (4 bytes) offset into hostname file

This makes the IP cache index a fixed number of bytes for each record 
(benefit one - any record can be indexed using the record number multiplied 
by the record size), AND because the IPs are stored as ulongs (actual 
binary representation of the numeric value, versus an ASCII one), the 
eventual comparison operations are really fast too, since we're not doing a 
string compare (benefit two - comparing two ulongs is very fast, whereas 
comparing two strings is not).  It also makes it VERY lightweight to sort, 
and because the hostname file is SEPARATE, the hostname file does not need 
to be sorted or rewritten (in fact, never even needs to be in memory) - it 
is merely appended to (which means the in-memory requirement for the index 
construction drops sharply, since the hostnames don't need to be in memory, 
just the IP table - and this applies only to jdresolve, not Analog, which 
doens't need to load it all into memory).  jdresolve loads this table into 
memory (quite possibly as the aforementioned tree), then resolves IPs in 
the logfile, appending the hostname to the hostname file and taking the 
offset and adding that with the IP to the in-memory table.  When complete, 
it outputs a SORTED (low to high) table to the IP index file.

Let be describe the operation of a binary search (not an in-memory tree, 
but a search of a sorted file).  Apologies to those on the list who already 
understand bhis, but I think it'd help some of the others:

Analog opens this sorted IP index file (statically, not for each individual 
query).  It'd get the record count (file_length/record_size) and initialize 
initial pointers.  It would also retrieve the high and low records (for 
first-level evaluations).  These it would preserve, so it wouldn't have to 
reevaluate them on each query.

You have three pointers: Low, High, and Midpoint.  Initially, Low = first 
record in the file (call it 0), and High = Last record in file (in our 
case, file_length/record_size).  Midpoint is computed from those: 
(High-Low)/2 + Low.

Okay now, so let's say we have a cache file with oh, 1 million entries 
(which, given the above record size info, is 8MB -- and this DOESN'T 
include the hostnames themseleves, which might be a file approx 20-30MB in 
size).  Gosh, a million entries - this no doubt is going to be a lot more 
than most people will need to deal with, but even so, the IP index size IS 
manageable.

For a decent and cheap optimization, Analog could produce a table that goes 
a few query levels into the file - midpoint (1 record, 1st level query), 
midpoints on either side of that (2 records, 2nd level), the midpoints on 
either side of those (4 more records, 3rd level), and for good measure, the 
midpoints around those (16 more records, 4th level).  Now, you have 23 
slices into your cake (=24 parts).  That's a trivially small table to have 
in memory (+ the high and low records) - a the record size stated above, 
that's a whopping 200 bytes of memory.  Now, this IN MEMORY table can be 
used to pre-determine at what point in the index file you'll need to focus 
your search on (as described, this will trim 4 queries against the index 
file off of most lookups - which doesn't SEEM like much, but go multiply 
that against the number of lookups you have to do, and you'll realize it's 
really a big number).  The pre-index will save you 'x' number of file seeks 
and record reads from the file - the first few which will be performed on 
virtually every record you look up (which arguably, if disk cacheing is 
working well, would be handled there).  The larger this pre-index is, the 
more savings you'll get when manipulating large logfiles needing lots of 
queries.

Anyway, back to our million record index.  Ignoring the above optimization 
suggestion, Analog takes the IP it needs to lookup and converts it from 
ASCII representation into a ulong (once), then it would compare the ulong 
IP addr to the low record, which if higher than our IP, it's not in the 
database, emit as unresolved.  Compare it against the high record, if this 
is lower than our IP, same thing.  Go to the midpoint record, compare.  If 
equal, emit as resolved.  If the returned record is higher than what we're 
looking for, then set the high pointer equal to this record and iterate for 
a new midpoint.  Similarly, if it looked up record is lower, set the low 
index, and do the same iteration.  When the new midpoint record index 
evaluates equal to either HIGH or LOW records, the search is completed 
(note some care should be placed in how the midpoint is computed, so you 
don't miss records on odd divisions).

When a match is found, we take the (presumably also statically open) hosts 
file and seek to the offset field from the IP index match, and retrieve 
that line from the file.

Okay, why did I mention a million hostname records?  Well, the above 
sequence of events, in a WORST case situation (where the match - or 
non-match is found at the last possible comparision - rather than right up 
front), is a whopping 20 queries.  And with the mentioned optimization (a 
preliminary division of the file), that goes down to 16 (for a cost of a 
200 byte table in memory!).  The average record will be found in 10 queries 
(or 6, with the optimization).  A total savings of 6 queries could be 
stripped for a 1K in-memory table (just 130 data points).

Multiply THAT by the number of hostname lookups you do, and the savings 
will be quite apparent - not just the boost from the optimization, but 
especially when compared to the sequential lookup method.


I think the above changes (and any number of variations) might be best 
served by a library which Analog (written in perl after all) would use -- 
say Analog initialized the resolution library by calling a function like 
hostresolve_init() to allow the resolution library to do initializations 
(whether this is loading the cache file in its entirety, or doing something 
like I outline above, or just opening the files), then when a resolution is 
needed, calling hosresolve( ipaddr ), and on program completion, doing the 
right thing can calling hostresolve_shutdown().

Analog could ship with a library module which provides functionality 
identical to what it does right now, but if someone wants to employ other 
methods (including a combination of them - such as netblock regexps 
followed by a binary lookup, or whatever), all they have to do is write 
their helper app and write a library module which provides the necessary 
functions which Analog would be using.  All the while keeping their fingers 
out of the core Analog code - so an update to Analog doesn't (necessarily) 
require changes to their resolution library.

This would allow a *LOT* of latitude with host resolution, make it fairly 
seamless with Analog, yet offload the programming support from Stephen 
(though he could still choose to integrate support for alternate resolution 
systems if he were to choose to do that, but it'd still be dereferenced 
from the core of the logfile analysis code).

My .02, adjusted for inflation.

---
  Please DO NOT carbon me on list replies.  I'll get my copy from the list.

  Sean B. Straw / Professional Software Engineering
  Post Box 2395 / San Rafael, CA  94912-2395

------------------------------------------------------------------------
This is the analog-help mailing list. To unsubscribe from this
mailing list, go to
http://lists.isite.net/listgate/analog-help/unsubscribe.html
List archives are available at
http://www.mail-archive.com/[email protected]/
and
http://lists.isite.net/listgate/analog-help/archives/
------------------------------------------------------------------------
Re: [analog-help] Saving time on DNS lookups

Reply via email to