How many lines do you have in your daily logfiles

As you can see, this looks for clients who have visited the same site
within 20 min.  If there is no match, a unique sessionid is assigned
from a sequence.  If there is a visit, the session id assigned to them
is used.  I'm only able to process about 25 records per second with my
setup.  My window to do this job is 3-4 hours and the shorter the
better.

I'd say your function is flawed because if a client stays more than 20 minutes he'll get two sessions.
I'd propose the following :


        * solution with postgres (variant #1):
        - insert everything into big table,
        - SELECT make_session(...) FROM big table GROUP BY account_id

(you may or may not wish to use the ip address, using it will duplicate sessions for people using anonimyzing crowds-style proxies, not using it will merge sessions from the same user from two different ip's). I'd not use it.
use an index-powered GroupAggregate maybe.


Now it's well ordered, ie. all accesses from the same account are grouped, you just have to find 'gaps' of more than 20 minutes in the atimes to merge or make sessions. This is made by the aggregate 'make_session' which has an internal state consisting of a list of sessions of the form :
- session :
- session start time
- session end time


all the aggregate does is look if the atime of the incoming row is < (session end time + 20 min)
if <, update session to mark session end time to atime
if >, create a new session with session start time = session end time = atime
and append it to the session list


So you get a table of session arrays, you just have to assign them id's and trackback to the URLs to mark them.

If an aggregate can issue INSERT or UPDATE queries, it can even generate session ids on the fly in a table, which simplifies its internal state.

        * solution with postgres (variant #2):

- insert everything into raw_table,
- CREATE TABLE sorted_table
just like raw_table but with a "id SERIAL PRIMARY KEY" added.
- INSERT INTO sorted_table SELECT * FROM raw_table ORDER by account_id, atime;


the aggregate was basically comparing the atime's of two adjacent lines to detect a gap of more than 20 minutes, so you could also do a join between rows a and b
where b.id = a.id+1
AND (
b.account_id != a.account_id
OR (b.atime > a.atime+20 minutes)
OR b does not exist )


this will give you the rows which mark a session start, then you have to join again to update all the rows in that session (BETWEEN id's) with the session id.

        * solution without postgres

Take advantage of the fact that the sessions are created and then die to only use RAM for the active sessions.
Read the logfile sequentially, you'll need to parse the date, if you can't do it use another language, change your apache date format output, or write a parser.


Basically you're doing event-driven programming like in a logic simulator, where the events are session expirations.

As you read the file,
- keep a hash of sessions indexed on account_id,
- and also a sorted (btree) list of sessions indexed on a the session expiry time.
It's very important that this list has fast insertion even in the middle, which is why a tree structure would be better. Try a red-black tree.


For each record do:
- look in the hashtable for account_id, find expiry date for this session,
if session still alive you're in that session,
update session expiry date and btree index accordingly
append url and infos to a list in the session if you want to keep them
else
expire session and start a new one, insert into hash and btree
store the expired session on disk and remove it from memory, you dont need it anymore !


And, as you see the atime advancing, scan the btree for sessions to expire.
It's ordered by expiry date, so that's fast.
For all expired sessions found,
expire session
store the expired session on disk and remove it from memory, you dont need it anymore !



That'd be my preferred solution. You'll need a good implementation of a sorted tree, you can find that in opensource.


        * solution with postgres (variant #3)

just like variant #2 but instead of an aggregate use a plpgsql procedure which reads the logs ordered by account_id, atime, while keeping a copy of the last row, and detecting session expirations on the fly.











---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Reply via email to