Hi!
On 26.9.2012 19:18, Jeff Janes wrote:
> On Wed, Sep 26, 2012 at 9:29 AM, Tom Lane <[email protected]> wrote:
>> Alvaro Herrera <[email protected]> writes:
>>> Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012:
>>>> On 26-09-2012 09:43, Tomas Vondra wrote:
>>>>> 5) splitting the single stat file into multiple pieces - e.g. per
>>>>> database,
>>>>> written separately, so that the autovacuum workers don't need to read all
>>>>> the data even for databases that don't need to be vacuumed. This might be
>>>>> combined with (4).
>>
>>>> IMHO that's the definitive solution. It would be one file per database
>>>> plus a
>>>> global one. That way, the check would only read the global.stat and process
>>>> those database that were modified. Also, an in-memory map could store that
>>>> information to speed up the checks.
>>
>>> +1
>>
>> That would help for the case of hundreds of databases, but how much
>> does it help for lots of tables in a single database?
>
> It doesn't help that case, but that case doesn't need much help. If
> you have N statistics-kept objects in total spread over M databases,
> of which T objects need vacuuming per naptime, the stats file traffic
> is proportional to N*(M+T). If T is low, then there is generally no
> problem if M is also low. Or at least, the problem is much smaller
> than when M is high for a fixed value of N.
I've done some initial hacking on splitting the stat file into multiple
smaller pieces over the weekend, and it seems promising (at least with
respect to the issues we're having).
See the patch attached, but be aware that this is a very early WIP (or
rather a proof of concept), so it has many rough edges (read "sloppy
coding"). I haven't even added it to the commitfest yet ...
The two main changes are these:
(1) The stats file is split into a common "db" file, containing all the
DB Entries, and per-database files with tables/functions. The common
file is still called "pgstat.stat", the per-db files have the
database OID appended, so for example "pgstat.stat.12345" etc.
This was a trivial hack of the pgstat_read_statsfile/pgstat_write_statsfile
functions, introducing two new functions:
pgstat_read_db_statsfile
pgstat_write_db_statsfile
that do the trick of reading/writing stat file for one database.
(2) The pgstat_read_statsfile has an additional parameter "onlydbs" that
says that you don't need table/func stats - just the list of db
entries. This is used for the autovacuum launcher, which does not need
to read the table/func stats (if I'm reading the code in autovacuum.c
correctly - it seems to be working as expected).
So what are the benefits?
(a) When a launcher asks for info about databases, something like this
is called in the end:
pgstat_read_statsfile(InvalidOid, false, true)
which means all databases (InvalidOid) and only db info (true). So
it reads only the one common file with db entries, not the
table/func stats.
(b) When a worker asks for stats for a given DB, something like this is
called in the end:
pgstat_read_statsfile(MyDatabaseId, false, false)
which reads the common stats file (with db entries) and only the
one file for that database.
With the current implementation (a single pgstat.stat file), all
the data has to be read (and silently skipped) in both cases.
That's a lot of CPU time, and we're seeing ~60% of CPU spent on
doing just this (writing/reading huge statsfile).
So with a lot of databases/objects, this "pgstat.stat split" saves
us a lot of CPU ...
(c) This should lower the space requirements too - with a single file,
you actually need at least 2x the disk space (or RAM, if you're
using tmpfs as we are), because you need to keep two versions of
the file at the same time (pgstat.stat and pgstat.tmp).
Thanks to this split you only need additional space for a copy of
the largest piece (with some reasonable safety reserve).
Well, it's a very early patch, so there are rough edges too:
(a) It does not solve the "many-schema" scenario at all - that'll need
a completely new approach I guess :-(
(b) It does not solve the writing part at all - the current code uses a
single timestamp (last_statwrite) to decide if a new file needs to
be written.
That clearly is not enough for multiple files - there should be one
timestamp for each database/file. I'm thinking about how to solve
this and how to integrate it with pgstat_send_inquiry etc.
One way might be adding the timestamp(s) into PgStat_StatDBEntry
and the other one is using an array of inquiries for each database.
And yet another one I'm thinking about is using a fixed-length
array of timestamps (e.g. 256), indexed by mod(dboid,256). That
would mean stats for all databases with the same mod(oid,256) would
be written at the same time. Seems like over-engineering, though.
(c) I'm a bit worried about the number of files - right now there's one
for each database and I'm thinking about splitting them by type
(one for tables, one for functions) which might make it even faster
for some apps with a lot of stored procedures etc.
But is the large number of files actually a problem? After all,
we're using one file per relation fork in the "base" directory, so
this seems like a minor issue.
And if it really is an issue, it might be solved by using the
mod(oid,256) trick to combine multiple files into one (which would
work neatly with the fixed-length array of timestamps).
kind regards
Tomas
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index be3adf1..226311c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -253,7 +253,9 @@ static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
Oid tableoid, bool create);
static void pgstat_write_statsfile(bool permanent);
-static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent);
+static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent, bool onlydbs);
+static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
static void backend_read_statsfile(void);
static void pgstat_read_current_status(void);
@@ -1408,13 +1410,14 @@ pgstat_ping(void)
* ----------
*/
static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time)
+pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
{
PgStat_MsgInquiry msg;
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
msg.clock_time = clock_time;
msg.cutoff_time = cutoff_time;
+ msg.databaseid = databaseid;
pgstat_send(&msg, sizeof(msg));
}
@@ -3063,7 +3066,7 @@ PgstatCollectorMain(int argc, char *argv[])
* zero.
*/
pgStatRunningInCollector = true;
- pgStatDBHash = pgstat_read_statsfile(InvalidOid, true);
+ pgStatDBHash = pgstat_read_statsfile(InvalidOid, true, false);
/*
* Loop to process messages until we get SIGQUIT or detect ungraceful
@@ -3435,11 +3438,7 @@ static void
pgstat_write_statsfile(bool permanent)
{
HASH_SEQ_STATUS hstat;
- HASH_SEQ_STATUS tstat;
- HASH_SEQ_STATUS fstat;
PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatFuncEntry *funcentry;
FILE *fpout;
int32 format_id;
	const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
@@ -3493,29 +3492,15 @@ pgstat_write_statsfile(bool permanent)
		(void) rc;				/* we'll check for error with ferror */
/*
- * Walk through the database's access stats per table.
+ * Write out the tables and functions into a separate file.
*/
-		hash_seq_init(&tstat, dbentry->tables);
-		while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-		{
-			fputc('T', fpout);
-			rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-			(void) rc;			/* we'll check for error with ferror */
-		}
-
- /*
- * Walk through the database's function stats table.
- */
-		hash_seq_init(&fstat, dbentry->functions);
-		while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-		{
-			fputc('F', fpout);
-			rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-			(void) rc;			/* we'll check for error with ferror */
-		}
+ pgstat_write_db_statsfile(dbentry, permanent);
/*
* Mark the end of this DB
+ *
+ * FIXME does it really make much sense, when the tables/functions
+ * are moved to a separate file (using those chars?)
*/
fputc('d', fpout);
}
@@ -3587,6 +3572,111 @@ pgstat_write_statsfile(bool permanent)
}
+
+/* ----------
+ * pgstat_write_db_statsfile() -
+ *
+ * Tell the news.
+ * If writing to the permanent file (happens when the collector is
+ * shutting down only), remove the temporary file so that backends
+ * starting up under a new postmaster can't read the old data before
+ * the new collector is ready.
+ * ----------
+ */
+static void
+pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent)
+{
+ HASH_SEQ_STATUS tstat;
+ HASH_SEQ_STATUS fstat;
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpout;
+ int rc;
+
+ /* FIXME Disgusting. Handle properly ... */
+	const char *tmpfile_x = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
+	const char *statfile_x = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+
+ char tmpfile[255];
+ char statfile[255];
+
+	/* FIXME Do some kind of reduction (e.g. mod(oid,255)) not to end
+	 * with thousands of files, one for each database */
+ snprintf(tmpfile, 255, "%s.%d", tmpfile_x, dbentry->databaseid);
+ snprintf(statfile, 255, "%s.%d", statfile_x, dbentry->databaseid);
+
+ /*
+ * Open the statistics temp file to write out the current values.
+ */
+ fpout = AllocateFile(tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+				 errmsg("could not open temporary statistics file \"%s\": %m",
+						tmpfile)));
+ return;
+ }
+
+ /*
+ * Walk through the database's access stats per table.
+ */
+ hash_seq_init(&tstat, dbentry->tables);
+	while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+ {
+ fputc('T', fpout);
+ rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+		(void) rc;				/* we'll check for error with ferror */
+ }
+
+ /*
+ * Walk through the database's function stats table.
+ */
+ hash_seq_init(&fstat, dbentry->functions);
+	while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+ {
+ fputc('F', fpout);
+ rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+		(void) rc;				/* we'll check for error with ferror */
+ }
+
+ /*
+ * No more output to be done. Close the temp file and replace the old
+ * pgstat.stat with it. The ferror() check replaces testing for error
+ * after each individual fputc or fwrite above.
+ */
+ fputc('E', fpout);
+
+ if (ferror(fpout))
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+				 errmsg("could not write temporary statistics file \"%s\": %m",
+						tmpfile)));
+ FreeFile(fpout);
+ unlink(tmpfile);
+ }
+ else if (FreeFile(fpout) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+				 errmsg("could not close temporary statistics file \"%s\": %m",
+						tmpfile)));
+ unlink(tmpfile);
+ }
+ else if (rename(tmpfile, statfile) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+				 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+						tmpfile, statfile)));
+ unlink(tmpfile);
+ }
+
+ // if (permanent)
+ // unlink(pgstat_stat_filename);
+}
+
/* ----------
* pgstat_read_statsfile() -
*
@@ -3595,14 +3685,10 @@ pgstat_write_statsfile(bool permanent)
* ----------
*/
static HTAB *
-pgstat_read_statsfile(Oid onlydb, bool permanent)
+pgstat_read_statsfile(Oid onlydb, bool permanent, bool onlydbs)
{
PgStat_StatDBEntry *dbentry;
PgStat_StatDBEntry dbbuf;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatTabEntry tabbuf;
- PgStat_StatFuncEntry funcbuf;
- PgStat_StatFuncEntry *funcentry;
HASHCTL hash_ctl;
HTAB *dbhash;
HTAB *tabhash = NULL;
@@ -3758,6 +3844,16 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
*/
tabhash = dbentry->tables;
funchash = dbentry->functions;
+
+				/*
+				 * Read the data from the file for this database. If
+				 * there was onlydb specified (!= InvalidOid), we would
+				 * not get here because of a break above. So we don't
+				 * need to recheck.
+				 */
+				if (!onlydbs)
+					pgstat_read_db_statsfile(dbentry->databaseid,
+											 tabhash, funchash,
+											 permanent);
+
break;
/*
@@ -3768,6 +3864,79 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
funchash = NULL;
break;
+ case 'E':
+ goto done;
+
+ default:
+				ereport(pgStatRunningInCollector ? LOG : WARNING,
+						(errmsg("corrupted statistics file \"%s\"",
+								statfile)));
+ goto done;
+ }
+ }
+
+done:
+ FreeFile(fpin);
+
+ if (permanent)
+ unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+
+ return dbhash;
+}
+
+
+/* ----------
+ * pgstat_read_db_statsfile() -
+ *
+ * Reads in an existing statistics collector db file and initializes the
+ * tables and functions hash tables (for the database identified by Oid).
+ * ----------
+ */
+static void
+pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent)
+{
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatTabEntry tabbuf;
+ PgStat_StatFuncEntry funcbuf;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpin;
+ bool found;
+
+ /* FIXME Disgusting. Handle properly ... */
+	const char *statfile_x = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+ char statfile[255];
+
+	/* FIXME Do some kind of reduction (e.g. mod(oid,255)) not to end
+	 * with thousands of files, one for each database */
+ snprintf(statfile, 255, "%s.%d", statfile_x, databaseid);
+
+ /*
+ * Try to open the status file. If it doesn't exist, the backends simply
+ * return zero for anything and the collector simply starts from scratch
+ * with empty counters.
+ *
+ * ENOENT is a possibility if the stats collector is not running or has
+ * not yet written the stats file the first time. Any other failure
+ * condition is suspicious.
+ */
+ if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errcode_for_file_access(),
+				 errmsg("could not open statistics file \"%s\": %m",
+						statfile)));
+ return;
+ }
+
+ /*
+ * We found an existing collector stats file. Read it and put all the
+ * hashtable entries into place.
+ */
+ for (;;)
+ {
+ switch (fgetc(fpin))
+ {
/*
* 'T' A PgStat_StatTabEntry follows.
*/
@@ -3853,10 +4022,11 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
done:
FreeFile(fpin);
- if (permanent)
- unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+// FIXME unlink permanent filename (with the proper Oid appended
+// if (permanent)
+// unlink(PGSTAT_STAT_PERMANENT_FILENAME);
- return dbhash;
+ return;
}
/* ----------
@@ -4006,7 +4176,7 @@ backend_read_statsfile(void)
pfree(mytime);
}
- pgstat_send_inquiry(cur_ts, min_ts);
+ pgstat_send_inquiry(cur_ts, min_ts, InvalidOid);
break;
}
@@ -4016,7 +4186,7 @@ backend_read_statsfile(void)
/* Not there or too old, so kick the collector and wait a bit */
if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
- pgstat_send_inquiry(cur_ts, min_ts);
+ pgstat_send_inquiry(cur_ts, min_ts, InvalidOid);
pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
}
@@ -4026,9 +4196,16 @@ backend_read_statsfile(void)
/* Autovacuum launcher wants stats about all databases */
if (IsAutoVacuumLauncherProcess())
- pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
+		/*
+		 * FIXME Does it really need info including tables/functions? Or
+		 * is it enough to read database-level stats? It seems to me the
+		 * launcher needs PgStat_StatDBEntry only (at least that's how I
+		 * understand rebuild_database_list() in autovacuum.c), because
+		 * pgstat_stattabentries are used in do_autovacuum() only, and
+		 * that's what's executed in workers ... So maybe we'd be just
+		 * fine by reading in the dbentries?
+		 */
+ pgStatDBHash = pgstat_read_statsfile(InvalidOid, false, true);
else
- pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false);
+		pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false, false);
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 613c1c2..8971002 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -205,6 +205,7 @@ typedef struct PgStat_MsgInquiry
PgStat_MsgHdr m_hdr;
TimestampTz clock_time; /* observed local clock time */
TimestampTz cutoff_time; /* minimum acceptable file timestamp */
+	Oid			databaseid;		/* requested DB (InvalidOid => all DBs) */
} PgStat_MsgInquiry;
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers