At 22:00 2001-03-27 +0100, Aengus wrote:
>From: <[EMAIL PROTECTED]>
>
> > Thanks for the suggestion of using jdresolve but I don't see how
> > that helps from the description on the web page. It looks like it is a
>Unix
> > only tool and I am running on Windows 2K only.
>
>It's written in perl, so it probably runs on Win2K too. Unfortunately, I
>don't have tar or gzip
PKZIP (as in PKWare, the folks who created the ZIP format) for Windows will
manipulate tar files (with the *nix LFNs too, unlike the DOS port of the
Gnu tar/gzip utilities, which ARE available as well).
>make it available in .zip format.
.. or use a good unzipper which supports other formats.
>Analog is a Log Analysis tool. Stephen has devoted a lot of time to making
I concur with this POV - Analog should NOT have a bunch of extra code added
to do rDNS. If anything, having Analog automatically invoke some program
with some options, specifying the logfile which Analog will be running
against, but before Analog opens it, might be a nice option (to make the
preprocessing basically transparent), but Analog itself should still focus
on the actual log analysis -- not resolving data that perhaps should have
been resolved by the webserver or by some other program intended to handle
that sort of thing.
BTW, if you're running a webserver, also running an installation of BIND
just to manage cached lookups (if you're not already running BIND for
primary DNS) can significantly improve DNS lookup response, because once a
specific host has been resolved, you're not pounding on the network looking
the same host up just a moment later.
The suggested changes to Analog, those of adding a syntax for managing
blocks of IPs associated with certain domains, is problematic -- for one,
you have a not insignificant amount of administration just to manage the
list of mappings, since this isn't automatic like rDNS.
OTOH, perhaps revising the format of the DNS cache file would be a good
idea. As I don't use jdresolve, and haven't poked into the cache file
structure, I'm going off of the statements about how the cache file is read
sequentially for each lookup, which implies a flat, unsorted
database. This is terribly inefficient, and we don't need a real database
around to fix it - just a change in the approach.
For the time being, lets assume that having jdresolv and analog use an in
memory AVL tree (or any of a number of other tree formats) would pose
memory issues (though it'd make lookups REALLY fast). Instead, why not
bust the cache file into two separate files, one with IP data and the other
with the hostnames (and whatever other data is to be stored).
ulong (4 bytes) IP ADDR (yes, perhaps it should be IPV6, whatever)
ulong (4 bytes) offset into hostname file
This makes the IP cache index a fixed number of bytes for each record
(benefit one - any record can be indexed using the record number multiplied
by the record size), AND because the IPs are stored as ulongs (actual
binary representation of the numeric value, versus an ASCII one), the
eventual comparison operations are really fast too, since we're not doing a
string compare (benefit two - comparing two ulongs is very fast, whereas
comparing two strings is not). It also makes it VERY lightweight to sort,
and because the hostname file is SEPARATE, the hostname file does not need
to be sorted or rewritten (in fact, never even needs to be in memory) - it
is merely appended to (which means the in-memory requirement for the index
construction drops sharply, since the hostnames don't need to be in memory,
just the IP table - and this applies only to jdresolve, not Analog, which
doens't need to load it all into memory). jdresolve loads this table into
memory (quite possibly as the aforementioned tree), then resolves IPs in
the logfile, appending the hostname to the hostname file and taking the
offset and adding that with the IP to the in-memory table. When complete,
it outputs a SORTED (low to high) table to the IP index file.
Let be describe the operation of a binary search (not an in-memory tree,
but a search of a sorted file). Apologies to those on the list who already
understand bhis, but I think it'd help some of the others:
Analog opens this sorted IP index file (statically, not for each individual
query). It'd get the record count (file_length/record_size) and initialize
initial pointers. It would also retrieve the high and low records (for
first-level evaluations). These it would preserve, so it wouldn't have to
reevaluate them on each query.
You have three pointers: Low, High, and Midpoint. Initially, Low = first
record in the file (call it 0), and High = Last record in file (in our
case, file_length/record_size). Midpoint is computed from those:
(High-Low)/2 + Low.
Okay now, so let's say we have a cache file with oh, 1 million entries
(which, given the above record size info, is 8MB -- and this DOESN'T
include the hostnames themseleves, which might be a file approx 20-30MB in
size). Gosh, a million entries - this no doubt is going to be a lot more
than most people will need to deal with, but even so, the IP index size IS
manageable.
For a decent and cheap optimization, Analog could produce a table that goes
a few query levels into the file - midpoint (1 record, 1st level query),
midpoints on either side of that (2 records, 2nd level), the midpoints on
either side of those (4 more records, 3rd level), and for good measure, the
midpoints around those (16 more records, 4th level). Now, you have 23
slices into your cake (=24 parts). That's a trivially small table to have
in memory (+ the high and low records) - a the record size stated above,
that's a whopping 200 bytes of memory. Now, this IN MEMORY table can be
used to pre-determine at what point in the index file you'll need to focus
your search on (as described, this will trim 4 queries against the index
file off of most lookups - which doesn't SEEM like much, but go multiply
that against the number of lookups you have to do, and you'll realize it's
really a big number). The pre-index will save you 'x' number of file seeks
and record reads from the file - the first few which will be performed on
virtually every record you look up (which arguably, if disk cacheing is
working well, would be handled there). The larger this pre-index is, the
more savings you'll get when manipulating large logfiles needing lots of
queries.
Anyway, back to our million record index. Ignoring the above optimization
suggestion, Analog takes the IP it needs to lookup and converts it from
ASCII representation into a ulong (once), then it would compare the ulong
IP addr to the low record, which if higher than our IP, it's not in the
database, emit as unresolved. Compare it against the high record, if this
is lower than our IP, same thing. Go to the midpoint record, compare. If
equal, emit as resolved. If the returned record is higher than what we're
looking for, then set the high pointer equal to this record and iterate for
a new midpoint. Similarly, if it looked up record is lower, set the low
index, and do the same iteration. When the new midpoint record index
evaluates equal to either HIGH or LOW records, the search is completed
(note some care should be placed in how the midpoint is computed, so you
don't miss records on odd divisions).
When a match is found, we take the (presumably also statically open) hosts
file and seek to the offset field from the IP index match, and retrieve
that line from the file.
Okay, why did I mention a million hostname records? Well, the above
sequence of events, in a WORST case situation (where the match - or
non-match is found at the last possible comparision - rather than right up
front), is a whopping 20 queries. And with the mentioned optimization (a
preliminary division of the file), that goes down to 16 (for a cost of a
200 byte table in memory!). The average record will be found in 10 queries
(or 6, with the optimization). A total savings of 6 queries could be
stripped for a 1K in-memory table (just 130 data points).
Multiply THAT by the number of hostname lookups you do, and the savings
will be quite apparent - not just the boost from the optimization, but
especially when compared to the sequential lookup method.
I think the above changes (and any number of variations) might be best
served by a library which Analog (written in perl after all) would use --
say Analog initialized the resolution library by calling a function like
hostresolve_init() to allow the resolution library to do initializations
(whether this is loading the cache file in its entirety, or doing something
like I outline above, or just opening the files), then when a resolution is
needed, calling hosresolve( ipaddr ), and on program completion, doing the
right thing can calling hostresolve_shutdown().
Analog could ship with a library module which provides functionality
identical to what it does right now, but if someone wants to employ other
methods (including a combination of them - such as netblock regexps
followed by a binary lookup, or whatever), all they have to do is write
their helper app and write a library module which provides the necessary
functions which Analog would be using. All the while keeping their fingers
out of the core Analog code - so an update to Analog doesn't (necessarily)
require changes to their resolution library.
This would allow a *LOT* of latitude with host resolution, make it fairly
seamless with Analog, yet offload the programming support from Stephen
(though he could still choose to integrate support for alternate resolution
systems if he were to choose to do that, but it'd still be dereferenced
from the core of the logfile analysis code).
My .02, adjusted for inflation.
---
Please DO NOT carbon me on list replies. I'll get my copy from the list.
Sean B. Straw / Professional Software Engineering
Post Box 2395 / San Rafael, CA 94912-2395
------------------------------------------------------------------------
This is the analog-help mailing list. To unsubscribe from this
mailing list, go to
http://lists.isite.net/listgate/analog-help/unsubscribe.html
List archives are available at
http://www.mail-archive.com/[email protected]/
and
http://lists.isite.net/listgate/analog-help/archives/
------------------------------------------------------------------------