Re: [HACKERS] [BUGS] Routine analyze of single column prevents standard autoanalyze from running at all

2016-06-08 Thread Tom Lane
Tomasz Ostrowski  writes:
> W dniu 2016-06-08 o 05:04, Tom Lane pisze:
>> Yeah, my guess is that the OP's example where analyzing just one column
>> was significantly cheaper boiled down to some of the other columns being
>> mostly toasted data.  Otherwise it's hard to see how there's much more
>> expense in analyzing them all.

> Actually no - this volatile column has smaller "statistics" than most of 
> the table, so analyzing it is much faster when it's data is not in RAM. 

Oh, so you *will* be vulnerable to the side-effect we wondered about
earlier, wherein the manual ANALYZE scans fewer blocks and may therefore
derive a less-accurate tuple count estimate than auto-analyze gets.
It'll be interesting to see how the fix plays out for you --- please pay
attention to whether the pg_stat_all_tables.n_live_tup/n_dead_tup numbers
jump around in unexpected ways.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Routine analyze of single column prevents standard autoanalyze from running at all

2016-06-08 Thread Tomasz Ostrowski

W dniu 2016-06-08 o 05:04, Tom Lane pisze:

Jim Nasby  writes:

Is there any significant advantage to not analyzing all columns? Only
case I can think of is if you have a fair number of columns that have
been toasted; otherwise I'd think IO would completely swamp any other
considerations.


Yeah, my guess is that the OP's example where analyzing just one column
was significantly cheaper boiled down to some of the other columns being
mostly toasted data.  Otherwise it's hard to see how there's much more
expense in analyzing them all.


Actually no - this volatile column has smaller "statistics" than most of 
the table, so analyzing it is much faster when it's data is not in RAM. 
Here is a small exaggerated example showing a difference:


$ psql
tometzky=> create table test (id serial, data text);
tometzky=> insert into test(data)
 select 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. '
 ||generate_series(0,1000)::text;
tometzky=> alter table test alter column id set statistics 10;
tometzky=> alter table test alter column data set statistics 1000;
tometzky=> \q

# Drop OS page cache and restart postgres
# so the table data won't be in RAM anymore:
$ sudo sh -c 'echo 1 >/proc/sys/vm/drop_caches'
$ sudo systemctl restart postgresql;

# Test single column analyze:
$ psql
tometzky=> \timing
Timing is on.
tometzky=> analyze verbose test(id);
INFO:  analyzing "public.test"
INFO:  "test": scanned 3000 of 123457 pages, containing 243000 live rows 
and 0 dead rows; 3000 rows in sample, 2712238 estimated total rows

ANALYZE
Time: 422,521 ms
tometzky=> \q

# Drop OS page cache and restart postgres again
$ sudo sh -c 'echo 1 >/proc/sys/vm/drop_caches'
$ sudo systemctl restart postgresql;

$ psql
tometzky=> \timing
Timing is on.
tometzky=> analyze verbose test;
INFO:  analyzing "public.test"
INFO:  "test": scanned 123457 of 123457 pages, containing 1001 live 
rows and 0 dead rows; 30 rows in sample, 1001 estimated total rows

ANALYZE
Time: 9447,519 ms

--
Tomasz "Tometzky" Ostrowski


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Routine analyze of single column prevents standard autoanalyze from running at all

2016-06-07 Thread Tom Lane
Jim Nasby  writes:
> Is there any significant advantage to not analyzing all columns? Only 
> case I can think of is if you have a fair number of columns that have 
> been toasted; otherwise I'd think IO would completely swamp any other 
> considerations.

Yeah, my guess is that the OP's example where analyzing just one column
was significantly cheaper boiled down to some of the other columns being
mostly toasted data.  Otherwise it's hard to see how there's much more
expense in analyzing them all.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Routine analyze of single column prevents standard autoanalyze from running at all

2016-06-07 Thread Jim Nasby

On 6/6/16 3:23 PM, Josh berkus wrote:

On 06/06/2016 01:38 PM, Tom Lane wrote:


Also, I'd be a bit inclined to disable the counter reset whenever a column
list is specified, disregarding the corner case where a list is given but
it includes all the table's analyzable columns.  It doesn't really seem
worth the effort to account for that case specially (especially after
you consider that index expressions should count as analyzable columns).

Thoughts?


+1.  Better to err on the side of duplicate analyzes than none at all.

Also, I'm not surprised this took so long to discover; I doubt most
users are aware that you *can* analyze individual columns.


Is there any significant advantage to not analyzing all columns? Only 
case I can think of is if you have a fair number of columns that have 
been toasted; otherwise I'd think IO would completely swamp any other 
considerations.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Routine analyze of single column prevents standard autoanalyze from running at all

2016-06-06 Thread Josh berkus
On 06/06/2016 01:38 PM, Tom Lane wrote:

> Also, I'd be a bit inclined to disable the counter reset whenever a column
> list is specified, disregarding the corner case where a list is given but
> it includes all the table's analyzable columns.  It doesn't really seem
> worth the effort to account for that case specially (especially after
> you consider that index expressions should count as analyzable columns).
> 
> Thoughts?

+1.  Better to err on the side of duplicate analyzes than none at all.

Also, I'm not surprised this took so long to discover; I doubt most
users are aware that you *can* analyze individual columns.

-- 
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Routine analyze of single column prevents standard autoanalyze from running at all

2016-06-06 Thread Tom Lane
I wrote:
> Tomasz Ostrowski  writes:
>> I suppose this is a bug - an analyze, which does not analyze all 
>> columns, should not reset pg_stat_user_tables(n_mod_since_analyze). What 
>> do you think?

> I'm inclined to think that this is a reasonable complaint.  A usage
> pattern like that probably hasn't come up before; but now that it has,
> it's clear it shouldn't block auto-analyze from happening.

Attached is a draft patch for this, which I propose to apply and
back-patch.  Any objections?

regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index cf8c816..97059e5 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*** do_analyze_rel(Relation onerel, int opti
*** 611,620 
  	/*
  	 * Report ANALYZE to the stats collector, too.  However, if doing
  	 * inherited stats we shouldn't report, because the stats collector only
! 	 * tracks per-table stats.
  	 */
  	if (!inh)
! 		pgstat_report_analyze(onerel, totalrows, totaldeadrows);
  
  	/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
  	if (!(options & VACOPT_VACUUM))
--- 611,623 
  	/*
  	 * Report ANALYZE to the stats collector, too.  However, if doing
  	 * inherited stats we shouldn't report, because the stats collector only
! 	 * tracks per-table stats.  Reset the changes_since_analyze counter only
! 	 * if we analyzed all columns; otherwise, there is still work for
! 	 * auto-analyze to do.
  	 */
  	if (!inh)
! 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
! 			  (va_cols == NIL));
  
  	/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
  	if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 44237f9..439fc06 100644
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
*** pgstat_report_vacuum(Oid tableoid, bool 
*** 1340,1350 
   * pgstat_report_analyze() -
   *
   *	Tell the collector about the table we just analyzed.
   * 
   */
  void
  pgstat_report_analyze(Relation rel,
! 	  PgStat_Counter livetuples, PgStat_Counter deadtuples)
  {
  	PgStat_MsgAnalyze msg;
  
--- 1340,1354 
   * pgstat_report_analyze() -
   *
   *	Tell the collector about the table we just analyzed.
+  *
+  * Caller must provide new live- and dead-tuples estimates, as well as a
+  * flag indicating whether to reset the changes_since_analyze counter.
   * 
   */
  void
  pgstat_report_analyze(Relation rel,
! 	  PgStat_Counter livetuples, PgStat_Counter deadtuples,
! 	  bool resetcounter)
  {
  	PgStat_MsgAnalyze msg;
  
*** pgstat_report_analyze(Relation rel,
*** 1381,1386 
--- 1385,1391 
  	msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
  	msg.m_tableoid = RelationGetRelid(rel);
  	msg.m_autovacuum = IsAutoVacuumWorkerProcess();
+ 	msg.m_resetcounter = resetcounter;
  	msg.m_analyzetime = GetCurrentTimestamp();
  	msg.m_live_tuples = livetuples;
  	msg.m_dead_tuples = deadtuples;
*** pgstat_recv_analyze(PgStat_MsgAnalyze *m
*** 5263,5272 
  	tabentry->n_dead_tuples = msg->m_dead_tuples;
  
  	/*
! 	 * We reset changes_since_analyze to zero, forgetting any changes that
! 	 * occurred while the ANALYZE was in progress.
  	 */
! 	tabentry->changes_since_analyze = 0;
  
  	if (msg->m_autovacuum)
  	{
--- 5268,5278 
  	tabentry->n_dead_tuples = msg->m_dead_tuples;
  
  	/*
! 	 * If commanded, reset changes_since_analyze to zero, forgetting any
! 	 * changes that occurred while the ANALYZE was in progress.
  	 */
! 	if (msg->m_resetcounter)
! 		tabentry->changes_since_analyze = 0;
  
  	if (msg->m_autovacuum)
  	{
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index a078b61..19c8387 100644
*** a/src/include/pgstat.h
--- b/src/include/pgstat.h
*** typedef struct PgStat_MsgAnalyze
*** 383,388 
--- 383,389 
  	Oid			m_databaseid;
  	Oid			m_tableoid;
  	bool		m_autovacuum;
+ 	bool		m_resetcounter;
  	TimestampTz m_analyzetime;
  	PgStat_Counter m_live_tuples;
  	PgStat_Counter m_dead_tuples;
*** extern void pgstat_report_autovac(Oid db
*** 970,976 
  extern void pgstat_report_vacuum(Oid tableoid, bool shared,
  	 PgStat_Counter livetuples, PgStat_Counter deadtuples);
  extern void pgstat_report_analyze(Relation rel,
! 	  PgStat_Counter livetuples, PgStat_Counter deadtuples);
  
  extern void pgstat_report_recovery_conflict(int reason);
  extern void pgstat_report_deadlock(void);
--- 971,978 
  extern void pgstat_report_vacuum(Oid tableoid, bool shared,
  	 PgStat_Counter livetuples, PgStat_Counter deadtuples);
  extern void pgstat_report_analyze(Relation rel,
! 	  PgStat_Counter livetuples, PgStat_Counter deadtuples,
! 	  bool resetcounter);
  
  extern void pgstat_report_recovery_conflict(int reason);
  extern void 

Re: [HACKERS] [BUGS] Routine analyze of single column prevents standard autoanalyze from running at all

2016-06-06 Thread Tom Lane
[ redirecting to -hackers ]

Tomasz Ostrowski  writes:
> I'm routinely bulk inserting data to a PostgreSQL table and then 
> analyzing a single column of the table, because it contains data which 
> significantly changes histogram of this column values - for example 
> something like adding rows with "todo=true" column, when all rows before 
> bulk insert have "todo=false".

> But I've noticed that a standard automatic analyze, which should work in 
> background, never runs. I've noticed that this fast analyze of one 
> column resets pg_stat_user_tables(n_mod_since_analyze) counter.

> I suppose this is a bug - an analyze, which does not analyze all 
> columns, should not reset pg_stat_user_tables(n_mod_since_analyze). What 
> do you think?

I'm inclined to think that this is a reasonable complaint.  A usage
pattern like that probably hasn't come up before; but now that it has,
it's clear it shouldn't block auto-analyze from happening.

A cheap-and-dirty solution would be to not send a PgStat_MsgAnalyze
message at all, but I think that's probably undesirable: typically it
would be a good thing to accept the new n_live_tuples and n_dead_tuples
estimates.  What we could do instead is add a bool flag to
PgStat_MsgAnalyze saying whether or not to reset changes_since_analyze.

It would be safe enough to back-patch such a change, because the stats
collector messages are private to the backend (in fact, really private
to pgstat.c).

One question here is whether there is any connection between
changes_since_analyze and the tuple-count estimates that would make it
improper to update the latter and not the former.  I can't really see one;
in fact, changes_since_analyze is pretty squishy anyway because changes
made by an ANALYZE's own transaction are already accounted for in the new
pg_statistic entries but will be added to changes_since_analyze at commit
despite that.  So I'd go ahead and update all the fields other than
changes_since_analyze in this case.

Another interesting consideration is that if the table's columns have
different stats targets, a selective-column ANALYZE might possibly sample
fewer rows than an all-columns ANALYZE would.  This might mean that our
new tuple-count estimates are less accurate than autoanalyze would get.
So you could possibly argue that we shouldn't update the tuple-count
estimates after all.  But I don't particularly believe that, because it
disregards the fact that the new estimates are, well, new.  Even if they
have more statistical risk than autoanalyze would have, they could well
be better just by virtue of having seen whatever bulk updates might have
happened since the last autoanalyze.  So my inclination is to disregard
this fine point.  (Though it's interesting to ask whether Tomasz's use
case includes a lower-than-default stats target for his very volatile
column ...)

Also, I'd be a bit inclined to disable the counter reset whenever a column
list is specified, disregarding the corner case where a list is given but
it includes all the table's analyzable columns.  It doesn't really seem
worth the effort to account for that case specially (especially after
you consider that index expressions should count as analyzable columns).

Thoughts?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers