Greg Smith wrote:
I got a chance to review this patch over the weekend. Basic API seems
good, met all my requirements, no surprises with how the GUC variable
controlled the feature.
The most fundamental issue I have with the interface is that using
COPY makes it difficult to put any unique index on the resulting
table. I like to have a unique index on my imported log table because
it rejects the dupe records if you accidentally import the same
section of log file twice. COPY tosses the whole thing if there's an
index violation, which is a problem during a regular import because
you will occasionally come across lines with the same timestamp that
are similar in every way except for their statment; putting an index
on the timestamp+statement seems impractical.
Does the format not include the per-process line number? (I know i
briefly looked at this patch previously, but I forget the details.) One
reason I originally included line numbers in log_line-prefix was to
handle this sort of problem.
I've had a preference for INSERT from the beginning here that this
reinforces.
COPY is our standard bulk insert mechanism. I think arguing against it
would be a very hard sell.
I'm planning to just work around this issue by doing the COPY into a
temporary table and then INSERTing from there. I didn't want to just
let the concern pass by without mentioning it though. It crosses my
mind that inserting some sort of unique log file line ID number would
prevent the dupe issue and make for better ordering (it's possible to
have two lines with the same timestamp show up in the wrong order
now), not sure that's a practical idea to consider.
I guess that answers my question. We should definitely provide a unique
line key.
The basic coding of the patch seemed OK to me, but someone who is much
more familiar than myself with the mechanics of pipes should take a
look at that part of the patch before committing; it's complicated
code and I can't comment on it. There are some small formatting
issues that need to be fixed, particularly in the host+port mapping.
I can fix those myself and submit a slightly updated patch. There's
some documentation improvements I want to make before this goes in as
well.
The patch is actually broken fairly hard right now because of the
switch from INSERT to COPY FROM CSV as the output format at the last
minute. It outputs missing fields as NULL (fine for INSERT) that
chokes the CSV import when the session_start timestamp is missing.
All of those NULL values need to be just replaced with nothing for
proper CSV syntax; there should just the comma for the next field. I
worked around this with
copy pglog from '/opt/pgsql/testlog.csv' with CSV null as 'NULL';
I missed that before - yes it should be fixed.
cheers
andrew
---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq