Luke cc'd me on his reply, and you should definitely have a look at
Bizgres Clickstream. Even if the whole stack doesn't match your needs
(though it sounds like it would), the clickstream-focused ETL and Bizgres
enhancements could make your life a little easier.
Basically, the stack components that you might want to look at first are:
Bizgres flavor of PostgreSQL - enhanced for business intelligence and
data warehousing. The www.bizgres.com website can speak to this in more detail.
Clickstream Data Model - Pageview fact table surrounded by various
dimensions and 2 core staging tables for the cleansed weblog data.
ETL Platform - contains a weblog sessionizer, cleanser, and ETL
transformations, which can handle 2-3 million hits without any trouble.
It has native support for the COPY command, for even greater performance.
JasperReports - for pixel-perfect reporting.
Sorry for sounding like I'm in marketing or sales; I'm not.
Here are a couple of key features that might interest you, considering your email.
The weblog parsing component allows for relatively complex cleansing, so
less data is written to the DB, which increases throughput. In addition,
if you run every 5 minutes there is no need to truncate the day's data
and reload: the ETL knows how to connect to the data loaded before. The
COPY enhancement to PostgreSQL found in Bizgres makes a noticeable
improvement when loading data.
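To make the cleansing idea concrete, here is a minimal sketch of the technique (my own illustration, not Bizgres code): filter an Apache "combined" format log down to successful pageviews and emit tab-delimited rows suitable for feeding to COPY ... FROM STDIN. The regex, the skipped extensions, and the "pageview = non-asset URL" rule are all assumptions for the sketch.

```python
import re

# Apache "combined" log format (a common default; an assumption here).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hits we discard before loading -- cleansing cuts the data written to the DB.
SKIP_SUFFIXES = ('.gif', '.png', '.jpg', '.css', '.js', '.ico')

def cleanse(lines):
    """Yield tab-delimited pageview rows ready for COPY ... FROM STDIN."""
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue                      # drop malformed lines
        d = m.groupdict()
        if d['url'].lower().endswith(SKIP_SUFFIXES):
            continue                      # drop non-pageview hits (images, css, ...)
        if not d['status'].startswith('2'):
            continue                      # keep successful responses only
        yield '\t'.join((d['ip'], d['ts'], d['url'],
                         d['status'], d['referrer'], d['agent']))
```

You would pipe the generator's output into psql's COPY (or a driver's copy support) against a staging table; the point is that the filtering happens before any row reaches the database.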
The schema is basically:
Dimension tables: Session, Known Party (if cookies are logged), Page, IP
Address, Date, Time, Referrer, Referrer Page.
Fact tables: Pageview, Hit Subset (not everyone wants all hits).
Staging tables: Hits (cleansed hits, or just pageviews, without surrogate
keys) and Session (session data gathered while parsing the log).
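If it helps to visualize the star schema, here is a compressed sketch of a pageview fact keyed to a few of its dimensions. This is my own illustration (in SQLite for portability; the real thing is Bizgres/PostgreSQL), and the column names are invented; it only shows the shape of the model.

```python
import sqlite3

# Illustrative star schema: a pageview fact table with surrogate keys
# pointing at dimension tables. Column names are invented for the sketch.
DDL = """
CREATE TABLE dim_page     (page_id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE dim_session  (session_id INTEGER PRIMARY KEY, started_at TEXT);
CREATE TABLE dim_referrer (referrer_id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE fact_pageview (
    page_id     INTEGER REFERENCES dim_page(page_id),
    session_id  INTEGER REFERENCES dim_session(session_id),
    referrer_id INTEGER REFERENCES dim_referrer(referrer_id),
    viewed_at   TEXT
);
"""

def build_schema():
    """Create the sketch schema in an in-memory database."""
    conn = sqlite3.connect(':memory:')
    conn.executescript(DDL)
    return conn
```

Reports then become simple joins from the fact table out to whichever dimensions you need, e.g. pageviews per page or per session.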
From: Luke Lonergan [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 01, 2005 9:38 AM
To: Ulrich Wisser; email@example.com
Cc: Nicholas E. Wakefield; Barry Klawans; Daria Hutchinson
Subject: Re: [PERFORM] Need for speed 3
On 9/1/05 6:25 AM, "Ulrich Wisser" <[EMAIL PROTECTED]> wrote:
> My application basically imports Apache log files into a Postgres
> database. Every row in the log file gets imported in one of three (raw
> data) tables. My columns are exactly as in the log file. The import is
> run approx. every five minutes. We import about two million rows a
Bizgres Clickstream does this job using an ETL (extract, transform and
load) process to transform the weblogs into an optimized schema for reporting.
> After every import the data from the current day is deleted from the
> reporting table and recalculated from the raw data table.
This is something the optimized ETL in Bizgres Clickstream also does.
> What do you think of this approach? Are there better ways to do it? Is
> there some literature you recommend reading?
I recommend the Bizgres Clickstream docs; you can get them from Bizgres
CVS, and there will shortly be a live HTML link on the website.
Bizgres is free - it also improves COPY performance by almost 2x, among
other things.