Re: [HACKERS] Is there any way to get the list of table names before the raw parse tree is analyzed?
On Sun, 2010-05-30 at 00:44 -0400, Andrew Dunstan wrote:
> Mohammad Heykal Abdillah wrote:
> > Hi all,
> >
> > Right now I am trying to understand how the SQL parser works.
> >
> > My question is: is there any way to get the list of table names and
> > their attributes before the raw parse tree is analyzed?
> >
> > Because I would like to understand how PostgreSQL "breaks down" the
> > asterisk in the target list, especially in the "natural join" case,
> > where PostgreSQL can give a query result that has unambiguous
> > attributes and matches the right join key.
>
> AIUI, expanding '*' is not done in the parser. The parser just makes an
> A_star node.
>
> See src/backend/parser/gram.y for the parser spec.
>
> Maybe you need to explain what exactly you're trying to do.
>
> cheers
>
> andrew

Yes, that much I know: expanding '*' is done in the analyzer. What I am
trying to do is move the expansion to before the raw parse tree produced
by gram.y is processed by the analyzer. Like this:

sql query -> gram.y -> raw_parse_tree -> (expand the '*') -> analyze

This way the analyzer only receives a parse tree in which '*' is already
expanded, so it won't expand the '*' again.

I am asking because I don't know how the analyzer scans the database so
that it can return the list of attributes for a table. I have read the
source code and tried manually to replicate the process (from what I can
tell, the RTE functions do this) with no result.

Thank You.
--
Mohammad Heykal Abdillah
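For reference, a minimal sketch of the first half of that pipeline — walking the raw parse tree that raw_parser() returns and collecting the table names referenced in the FROM clause. This is not PostgreSQL code: the function names collect_from_item and tables_in_raw_select are invented for illustration, and it assumes 8.4/9.0-era internals.

/*
 * Hedged sketch: collect the relation names referenced in the FROM
 * clause of a raw SELECT statement, before parse analysis runs.
 */
#include "postgres.h"
#include "nodes/parsenodes.h"
#include "nodes/pg_list.h"
#include "parser/parser.h"

static void
collect_from_item(Node *item, List **relnames)
{
	if (item == NULL)
		return;
	if (IsA(item, RangeVar))
	{
		/* A plain table reference: remember its name. */
		*relnames = lappend(*relnames, ((RangeVar *) item)->relname);
	}
	else if (IsA(item, JoinExpr))
	{
		/* JOIN (including NATURAL JOIN): recurse into both sides. */
		JoinExpr   *j = (JoinExpr *) item;

		collect_from_item(j->larg, relnames);
		collect_from_item(j->rarg, relnames);
	}
	/* Subqueries, functions in FROM, etc. would need more cases. */
}

static List *
tables_in_raw_select(const char *query_string)
{
	List	   *relnames = NIL;
	List	   *raw = raw_parser(query_string);
	ListCell   *lc;

	foreach(lc, raw)
	{
		Node	   *stmt = (Node *) lfirst(lc);

		if (IsA(stmt, SelectStmt))
		{
			ListCell   *flc;

			foreach(flc, ((SelectStmt *) stmt)->fromClause)
				collect_from_item((Node *) lfirst(flc), &relnames);
		}
	}
	return relnames;
}

Getting each table's attribute list, however, is exactly where the analyzer's catalog access comes in: the RTE-building code (addRangeTableEntry() in src/backend/parser/parse_relation.c) opens the relation and reads its tuple descriptor, which is why '*' expansion naturally lives after the raw parse.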
Re: [HACKERS] Is there any way to get the list of table names before the raw parse tree is analyzed?
Mohammad Heykal Abdillah wrote:
> Hi all,
>
> Right now I am trying to understand how the SQL parser works.
>
> My question is: is there any way to get the list of table names and
> their attributes before the raw parse tree is analyzed?
>
> Because I would like to understand how PostgreSQL "breaks down" the
> asterisk in the target list, especially in the "natural join" case,
> where PostgreSQL can give a query result that has unambiguous
> attributes and matches the right join key.

AIUI, expanding '*' is not done in the parser. The parser just makes an
A_star node.

See src/backend/parser/gram.y for the parser spec.

Maybe you need to explain what exactly you're trying to do.

cheers

andrew
Re: [HACKERS] small exclusion constraints patch
On Sat, May 29, 2010 at 11:40 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Sat, May 29, 2010 at 6:11 PM, Tom Lane wrote:
>>> Or, to put it differently: if nobody's done that in the past twenty
>>> years, why is it likely to happen before 9.1?
>
>> Hmm. Well suppose we bet a dollar on whether that will happen or not.
>> In fact, if you promise not to read
>> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01175.php I'll
>> make it two dollars.
>
> Yes, I've seen Jeff's example. It's a cute hack but somehow I doubt
> that there is going to be a land rush to implement such things.
> Can you point to any pre-existing example where anyone actually asked
> for the ability to do that?

I've often wished for the ability to constrain a table to hold just one
row, so I don't find that use case implausible at all. As to whether
Jeff's use case is a cute hack or something that people will really want
to do, I think the jury's still out on that one. But I don't think we
should make it not work unless we have a concrete reason, and I haven't
heard one yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Re: [RFC][PATCH]: CRC32 is limiting at COPY/CTAS/INSERT ... SELECT + speeding it up
Greg Stark writes:
> On Sun, May 30, 2010 at 3:56 AM, Greg Stark wrote:
>> This sounds familiar. If you search back in the archives around 2004
>> or so I think you'll find a similar discussion when we replaced the
>> crc32 implementation with what we have now.

> Fwiw here's the thread (from 2005):
> http://thread.gmane.org/gmane.comp.db.postgresql.devel.general/43811

I read through that thread and couldn't find much discussion of
alternative CRC implementations --- we spent all our time on arguing
about whether we needed 64-bit CRC or not.

			regards, tom lane
Re: [HACKERS] small exclusion constraints patch
Robert Haas writes:
> On Sat, May 29, 2010 at 6:11 PM, Tom Lane wrote:
>> Or, to put it differently: if nobody's done that in the past twenty
>> years, why is it likely to happen before 9.1?

> Hmm. Well suppose we bet a dollar on whether that will happen or not.
> In fact, if you promise not to read
> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01175.php I'll
> make it two dollars.

Yes, I've seen Jeff's example. It's a cute hack but somehow I doubt
that there is going to be a land rush to implement such things.
Can you point to any pre-existing example where anyone actually asked
for the ability to do that?

			regards, tom lane
[HACKERS] Is there any way to get the list of table names before the raw parse tree is analyzed?
Hi all,

Right now I am trying to understand how the SQL parser works.

My question is: is there any way to get the list of table names and their
attributes before the raw parse tree is analyzed?

Because I would like to understand how PostgreSQL "breaks down" the
asterisk in the target list, especially in the "natural join" case, where
PostgreSQL can give a query result that has unambiguous attributes and
matches the right join key.

Thank You.
--
Mohammad Heykal Abdillah
Re: [HACKERS] PG 9.0 release timetable
On Sat, May 29, 2010 at 5:58 PM, Tom Lane wrote:
> Robert Haas writes:
>> Thoughts on a few of the remaining items:
>
>> Should we revert the default output format for bytea to the old style
>> before shipping 9.0.0? - Consensus seems to be "no", thus no action is
>> required.
>
> I think we should leave that there for awhile, though I agree it's
> likely that the final decision will be "no change".
>
>> don't rename index columns behavior has already broken JDBC - As I
>> understand it, this is not a code issue, but just something that
>> driver authors need to be aware of.
>
> There had been a section on the page about information we needed to
> communicate to third-party authors. Someone seems to have removed
> that, but that seems like where this belongs.
>
>> Crash in buildfarm for Mac OS X 10.6.3 - Consensus seems to be that
>> the machine just ran out of disk space - not sure we need to do
>> anything here.
>
> It's a bit weird though, because UpdateControlFile should always update
> in place; why would there be any risk of out of disk space? I would
> like to find out exactly what happened, though I have no clear ideas
> how to investigate it.

Well, I think at a minimum the first two of these need to go into a
section that is not called "code": the first is just a decision we might
change our mind about, and the second is a communication issue, not a
code issue. I'd argue that the third one is probably not something we're
going to hold up the release for, either, and therefore while it might
belong on a list of known open bugs it doesn't really belong on a list of
9.0 open items.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Streaming Replication: Checkpoint_segment and wal_keep_segments on standby
On Fri, May 28, 2010 at 11:12 AM, Fujii Masao wrote: > On Thu, May 27, 2010 at 11:13 PM, Robert Haas wrote: >>> I guess this happens because the frequency of checkpoint on the standby is >>> too lower than that on the master. In the master, checkpoint occurs for >>> every >>> consumption of three segments because of "checkpoint_segments = 3". On the >>> other hand, in the standby, only checkpoint_timeout has effect, so >>> checkpoint >>> occurs for every 30 minutes because of "checkpoint_timeout = 30min". >>> >>> The walreceiver should signal the bgwriter to start checkpoint if it has >>> received more than checkpoint_segments WAL files, like normal processing? >> >> Is this also an issue when using log shipping, or just with SR? > > When using log shipping, checkpoint_segments always doesn't trigger a > checkpoint. So recovery after the standby crashes might take unexpectedly > long since redo starting point might be old. > > But in file-based log shipping, since WAL files don't accumulate in > pg_xlog directory on the standby, even if the frequency of checkpoint > is very low, pg_xlog will not be filled with many WAL files. That > accumulation occurs only when using SR. > > If we should avoid low frequency of checkpoint itself rather than > accumulation of WAL files, the bgwriter instead of the walreceiver > should check if we've consumed too much WAL, I think. Thought? I attached the patch, which changes the startup process so that it signals bgwriter to perform a restartpoint if we've already replayed too much WAL files. This leads checkpoint_segments to trigger a restartpoint. This patch is worth applying for 9.0? If not, I'll add it into the next CF. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** *** 508,513 static bool reachedMinRecoveryPoint = false; --- 508,516 static bool InRedo = false; + /* We've already launched bgwriter to perform restartpoint? */ + static bool bgwriterLaunched = false; + /* * Information logged when we detect a change in one of the parameters * important for Hot Standby. *** *** 550,555 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); --- 553,559 static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites, XLogRecPtr *lsn, BkpBlock *bkpb); static bool AdvanceXLInsertBuffer(bool new_segment); + static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg); static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch); static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath, bool find_free, int *max_advance, *** *** 1554,1567 AdvanceXLInsertBuffer(bool new_segment) /* * Check whether we've consumed enough xlog space that a checkpoint is needed. * ! * Caller must have just finished filling the open log file (so that ! * openLogId/openLogSeg are valid). We measure the distance from RedoRecPtr ! * to the open log file and see if that exceeds CheckPointSegments. * * Note: it is caller's responsibility that RedoRecPtr is up-to-date. */ static bool ! XLogCheckpointNeeded(void) { /* * A straight computation of segment number could overflow 32 bits. Rather --- 1558,1571 /* * Check whether we've consumed enough xlog space that a checkpoint is needed. * ! * Caller must have just finished filling or reading the log file (so that ! * the given logid/logseg are valid). We measure the distance from RedoRecPtr ! * to the log file and see if that exceeds CheckPointSegments. 
* * Note: it is caller's responsibility that RedoRecPtr is up-to-date. */ static bool ! XLogCheckpointNeeded(uint32 logid, uint32 logseg) { /* * A straight computation of segment number could overflow 32 bits. Rather *** *** 1577,1584 XLogCheckpointNeeded(void) old_segno = (RedoRecPtr.xlogid % XLogSegSize) * XLogSegsPerFile + (RedoRecPtr.xrecoff / XLogSegSize); old_highbits = RedoRecPtr.xlogid / XLogSegSize; ! new_segno = (openLogId % XLogSegSize) * XLogSegsPerFile + openLogSeg; ! new_highbits = openLogId / XLogSegSize; if (new_highbits != old_highbits || new_segno >= old_segno + (uint32) (CheckPointSegments - 1)) return true; --- 1581,1588 old_segno = (RedoRecPtr.xlogid % XLogSegSize) * XLogSegsPerFile + (RedoRecPtr.xrecoff / XLogSegSize); old_highbits = RedoRecPtr.xlogid / XLogSegSize; ! new_segno = (logid % XLogSegSize) * XLogSegsPerFile + logseg; ! new_highbits = logid / XLogSegSize; if (new_highbits != old_highbits || new_segno >= old_segno + (uint32) (CheckPointSegments - 1)) return true; *** *** 1782,1791 XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch) * update RedoRecPtr and recheck. */ if (IsUnderPostmaster && ! XLogCheck
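For readers skimming the (truncated) patch above, its core idea reduces to a check like the one below in the startup process's replay loop, run whenever a WAL segment boundary is crossed. This is a hedged paraphrase, not a verbatim excerpt: readId/readSeg are the startup process's current WAL read position, bgwriterLaunched is the flag the patch adds, and XLogCheckpointNeeded() is the patch's reworked, (logid, logseg)-taking form of the existing check.

/*
 * Sketch: in the startup process during recovery, after advancing to
 * WAL segment (readId, readSeg).
 */
if (bgwriterLaunched && XLogCheckpointNeeded(readId, readSeg))
{
	/*
	 * Update RedoRecPtr first, in case a restartpoint just completed
	 * and moved the redo pointer forward; then re-check before asking
	 * the bgwriter for a new restartpoint.
	 */
	(void) GetRedoRecPtr();
	if (XLogCheckpointNeeded(readId, readSeg))
		RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}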
[HACKERS] Re: [RFC][PATCH]: CRC32 is limiting at COPY/CTAS/INSERT ... SELECT + speeding it up
On Sun, May 30, 2010 at 3:56 AM, Greg Stark wrote:
> This sounds familiar. If you search back in the archives around 2004
> or so I think you'll find a similar discussion when we replaced the
> crc32 implementation with what we have now.

Fwiw here's the thread (from 2005):
http://thread.gmane.org/gmane.comp.db.postgresql.devel.general/43811

--
greg
[HACKERS] Re: [RFC][PATCH]: CRC32 is limiting at COPY/CTAS/INSERT ... SELECT + speeding it up
This sounds familiar. If you search back in the archives around 2004 or
so I think you'll find a similar discussion when we replaced the crc32
implementation with what we have now. We put a fair amount of effort into
searching for faster implementations so if you've found one 3x faster I'm
pretty startled.

Are you sure it's faster on all architectures and not a win sometimes and
a loss other times? And are you sure it's faster in our use case where
we're crcing small sequences of data often and not crcing a large block?
Re: [HACKERS] small exclusion constraints patch
On Sat, May 29, 2010 at 6:11 PM, Tom Lane wrote:
> Robert Haas writes:
>> The only disadvantage I see of just documenting this is that someone
>> might write a user-defined index opclass that works like this, and
>> they won't be able to use this until at least 9.1 (or at least, not
>> without patching the source).
>
> I don't actually think that anyone's very likely to write a <>-like
> index operator. It's approximately useless to use an index for such a
> query.
>
> Or, to put it differently: if nobody's done that in the past twenty
> years, why is it likely to happen before 9.1?

Hmm. Well suppose we bet a dollar on whether that will happen or not.
In fact, if you promise not to read
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01175.php I'll
make it two dollars.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] [RFC][PATCH]: CRC32 is limiting at COPY/CTAS/INSERT ... SELECT + speeding it up
Added to TODO: Consider a faster CRC32 algorithm * http://archives.postgresql.org/pgsql-hackers/2010-05/msg01112.php --- Andres Freund wrote: > Hi, > > I started to analyze XLogInsert because it was the major bottleneck when > creating some materialized view/cached tables/whatever. > Analyzing it I could see that content of the COMP_CRC32 macro was taking most > of the time which isn't immediately obvious when you profile because it > obviously doesn't show up as a separate function. > I first put it into functions to make it easier to profile. I couldn't > measure > any difference for COPY, CTAS and a simple pgbench run on 3 kinds of hardware > (Core2, older Xeon, older Sparc systems). > > I looked a bit around for faster implementations of CRC32 and found one in > zlib. After adapting it (pg uses slightly different computation (non- > inverted)) I found that it increases the speed of the CRC32 calculation > itself > 3 fold. > It does that by not only using one lookup table but four (one for each byte > of > a word). Those four calculations are independent and thus are considerably > faster on somewhat recent hardware. > Also it does memory lookups in 4 byte steps instead of 1 byte as the pg > version (thats only about ~8% benefit in itself). > > I wrote a preliminary patch which includes both, the original implementation > and the new one switchable via an #define. > > > I tested performance differences in a small number of scenarios: > - CTAS/INSERT ... SELECT (8-30%) > - COPY (3-20%) > - pgbench (no real difference unless directly after a checkpoint) > > Setup: > > CREATE TABLE blub (ai int, bi int, aibi int); > CREATE TABLE speedtest (ai int, bi int, aibi int); > > > INSERT ... SELECT: > > Statement: > INSERT INTO blub SELECT a.i, b.i, a.i *b.i FROM generate_series(1, 1) > a(i), generate_series(1, 1000) b(i); > > legacy crc: > > 11526.588 > 11406.518 > 11412.182 > 11430.245 > > zlib: > 9977.394 > 9945.408 > 9840.907 > 9842.875 > > > COPY: > Statement: > ('blub' enlarged here 4 times, as otherwise the variances were to large) > > COPY blub TO '/tmp/b' BINARY; > ... > CHECKPOINT;TRUNCATE speedtest; COPY speedtest FROM '/tmp/b' BINARY; > > legacy: > 44835.840 > 44832.876 > > zlib: > 39530.549 > 39365.109 > 39295.167 > > The performance differences are bigger if the table rows are significantly > bigger. > > Do you think something like that is sensible? If yes, I will make it into a > proper patch and such. > > Thanks, > > Andres > > INSERT ... SELECT profile before patch: > > 20.22% postgres postgres [.] comp_crc32 > 5.77% postgres postgres [.] XLogInsert > 5.55% postgres postgres [.] LWLockAcquire > 5.21% postgres [kernel. [k] copy_user_generic_string > 4.64% postgres postgres [.] LWLockRelease > 4.39% postgres postgres [.] ReadBuffer_common > 2.75% postgres postgres [.] heap_insert > 2.22% postgres libc-2.1 [.] memcpy > 2.09% postgres postgres [.] UnlockReleaseBuffer > 1.85% postgres postgres [.] hash_any > 1.77% postgres [kernel. [k] clear_page_c > 1.69% postgres postgres [.] hash_search_with_hash_value > 1.61% postgres postgres [.] heapgettup_pagemode > 1.50% postgres postgres [.] PageAddItem > 1.42% postgres postgres [.] MarkBufferDirty > 1.28% postgres postgres [.] RelationGetBufferForTuple > 1.15% postgres postgres [.] ExecModifyTable > 1.06% postgres postgres [.] RelationPutHeapTuple > > > After: > > 9.97% postgres postgres [.] comp_crc32 > 5.95% postgres [kernel. [k] copy_user_generic_string > 5.94% postgres postgres [.] LWLockAcquire > 5.64% postgres postgres [.] 
XLogInsert
>  5.11% postgres postgres [.] LWLockRelease
>  4.63% postgres postgres [.] ReadBuffer_common
>  3.45% postgres postgres [.] heap_insert
>  2.54% postgres libc-2.1 [.] memcpy
>  2.03% postgres postgres [.] UnlockReleaseBuffer
>  1.94% postgres postgres [.] hash_search_with_hash_value
>  1.84% postgres postgres [.] hash_any
>  1.73% postgres [kernel. [k] clear_page_c
>  1.68% postgres postgres [.] PageAddItem
>  1.62% postgres postgres [.] heapgettup_pagemode
>  1.52% postgres postgres [.] RelationGetBufferForTuple
>  1.47% postgres postgres [.] MarkBufferDirty
>  1.30% postgres postgres [.] ExecModifyTable
>  1.23% postgres postgres [.] RelationPutHeapTuple

[ Attachment, skipping... ]
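For readers who don't want to dig through the attachment, the structural difference Andres describes — one lookup table indexed a byte at a time versus four tables consumed a 32-bit word at a time — looks roughly like the sketch below. This is a hedged illustration of the zlib-style "slice-by-4" technique, assuming a reflected (little-endian) CRC-32 with tables precomputed elsewhere; it is not the actual patch.

#include <stddef.h>
#include <stdint.h>

/*
 * tab[0] is the classic byte-at-a-time table; tab[1..3] are derived
 * from it so four independent lookups can digest one word per step.
 */
extern uint32_t tab[4][256];	/* assumed precomputed at startup */

uint32_t
crc32_slice_by_4(uint32_t crc, const unsigned char *p, size_t len)
{
	/*
	 * Word-at-a-time main loop: the four table lookups do not depend
	 * on each other, which is what lets recent CPUs overlap them.
	 */
	while (len >= 4)
	{
		crc ^= (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
			((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
		crc = tab[3][crc & 0xff] ^
			tab[2][(crc >> 8) & 0xff] ^
			tab[1][(crc >> 16) & 0xff] ^
			tab[0][crc >> 24];
		p += 4;
		len -= 4;
	}
	/* Byte-at-a-time tail, same as the classic implementation. */
	while (len-- > 0)
		crc = tab[0][(crc ^ *p++) & 0xff] ^ (crc >> 8);
	return crc;
}

Note that the word-at-a-time loop only pays off once the input is at least a few words long, which is exactly why Greg's question above about CRCing many small WAL chunks versus one large block matters.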
Re: [HACKERS] Specification for Trusted PLs?
Robert Haas wrote:
> On Sat, May 22, 2010 at 4:53 PM, Cédric Villemain wrote:
> > 2010/5/21 Jan Wieck :
> >> The original idea was that a trusted language does not allow an
> >> unprivileged user to gain access to any object or data, he does not
> >> have access to without that language.
> >>
> >> This does not include data transformation functionality, like string
> >> processing or the like. As long as the user had legitimate access to
> >> the input datum, then every derived form thereof is OK.
> >
> > I find the current doc enough, add this prose from Jan as a comment
> > might help people perhaps.
>
> Yeah, Jan's description is very clear and to the point.

The attached, applied patch clarifies the meaning of "trusted language"
in the documentation using Jan's description.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

Index: doc/src/sgml/xplang.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/xplang.sgml,v
retrieving revision 1.37
diff -c -c -r1.37 xplang.sgml
*** doc/src/sgml/xplang.sgml 3 Apr 2010 07:22:56 - 1.37
--- doc/src/sgml/xplang.sgml 30 May 2010 02:21:53 -
*** 151,158
VALIDATOR validator_function_name ; The optional key word TRUSTED specifies that ! ordinary database users that have no superuser privileges should ! be allowed to use this language to create functions and trigger procedures. Since PL functions are executed inside the database server, the TRUSTED flag should only be given for languages that do not allow access to database server
--- 151,160
VALIDATOR validator_function_name ; The optional key word TRUSTED specifies that ! the language does not grant access to data that the user would ! not otherwise have. Trusted languages are designed for ordinary ! database users (those without superuser privilege) and allow them ! to safely create functions and trigger procedures. Since PL functions are executed inside the database server, the TRUSTED flag should only be given for languages that do not allow access to database server

Index: doc/src/sgml/ref/create_language.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/ref/create_language.sgml,v
retrieving revision 1.50
diff -c -c -r1.50 create_language.sgml
*** doc/src/sgml/ref/create_language.sgml 3 Apr 2010 07:22:58 - 1.50
--- doc/src/sgml/ref/create_language.sgml 30 May 2010 02:21:53 -
*** 104,114
!TRUSTED specifies that !the language is safe, that is, it does not offer an !unprivileged user any functionality to bypass access !restrictions. If this key word is omitted when registering the !language, only users with the PostgreSQL superuser privilege can use this language to create new functions.
--- 104,113
!TRUSTED specifies that the language does !not grant access to data that the user would not otherwise !have. If this key word is omitted !when registering the language, only users with the PostgreSQL superuser privilege can use this language to create new functions.
Re: [HACKERS] pg_trgm
> > Wait. This works fine for me with stock pg_trgm. Locale is C and
> > encoding is UTF8. What version of PostgreSQL are you using? Mine is
> > 8.4.4.
>
> This is in 9.0, because 8.4 doesn't recognize the \u escape syntax. If
> you run this in 8.4, you're just comparing a sequence of ASCII letters
> and digits.

Hum. Still I prefer 8.4's behavior since anything is better than
returning NaN. It seems 9.0 does not have any escape route for
multibyte + C locale users.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: [HACKERS] pg_trgm
> This is still ignoring the point: arbitrarily changing the module's
> longstanding standard behavior isn't acceptable. You need to provide
> a way for the user to control the behavior. (Once you've done that,
> I think it can be just either "alnum" or "!isspace", but maybe some
> other behaviors would be interesting.)

To be honest I don't know what the "module's longstanding standard
behavior" should be. It's not documented anywhere. If you mean it is
whatever the current implementation is, then any effort to touch the
module should be prohibited.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Regression testing for psql
On Wed, May 26, 2010 at 6:25 PM, Stephen Frost wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> There might be some value in psql backslash command tests that
>> are designed to depend on just one or a few tables (or other
>> appropriate objects).
>
> Updated, much much smaller, patch attached. Also available, again, at
> http://snowman.net/~sfrost/psql-regress-help.patch
>
> Basically, I removed anything that would produce data directly from
> the catalogs by trying to find a 'none' object which matched. This
> still goes through a lot of the same setup and query, it's just that
> there aren't any results.

Is this something to be added to the 2010-07 commitfest?

-selena

--
http://chesnok.com/daily - me
Re: [HACKERS] small exclusion constraints patch
On Sat, May 29, 2010 at 06:11:57PM -0400, Tom Lane wrote:
> Robert Haas writes:
> > The only disadvantage I see of just documenting this is that
> > someone might write a user-defined index opclass that works like
> > this, and they won't be able to use this until at least 9.1 (or at
> > least, not without patching the source).
>
> I don't actually think that anyone's very likely to write a <>-like
> index operator. It's approximately useless to use an index for such
> a query.
>
> Or, to put it differently: if nobody's done that in the past twenty
> years, why is it likely to happen before 9.1?

Because there's a fundamentally new way to use them now, namely with
exclusion constraints :)

Cheers,
David.
--
David Fetter http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] small exclusion constraints patch
Robert Haas writes:
> The only disadvantage I see of just documenting this is that someone
> might write a user-defined index opclass that works like this, and
> they won't be able to use this until at least 9.1 (or at least, not
> without patching the source).

I don't actually think that anyone's very likely to write a <>-like
index operator. It's approximately useless to use an index for such
a query.

Or, to put it differently: if nobody's done that in the past twenty
years, why is it likely to happen before 9.1?

			regards, tom lane
Re: [HACKERS] small exclusion constraints patch
On Fri, May 28, 2010 at 10:32 PM, Bruce Momjian wrote:
> Tom Lane wrote:
>> Jeff Davis writes:
>> > Currently, the check for exclusion constraints performs a sanity
>> > check that's slightly too strict -- it assumes that a tuple will
>> > conflict with itself. That is not always the case: the operator
>> > might be "<>", in which case it's perfectly valid for the search
>> > for conflicts to not find itself.
>>
>> > This patch simply removes that sanity check, and leaves a comment
>> > in place.
>>
>> I'm a bit uncomfortable with removing the sanity check; it seems like
>> a good thing to have, especially since this code hasn't even made it
>> out of beta yet. AFAIK the "<>" case is purely hypothetical, because
>> we have no index opclasses supporting such an operator, no? How about
>> just documenting that we'd need to remove the sanity check if we ever
>> did add support for such a case?
>
> Done, with attached, applied patch.

The only disadvantage I see of just documenting this is that someone
might write a user-defined index opclass that works like this, and they
won't be able to use this until at least 9.1 (or at least, not without
patching the source).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] PG 9.0 release timetable
Robert Haas writes:
> Thoughts on a few of the remaining items:

> Should we revert the default output format for bytea to the old style
> before shipping 9.0.0? - Consensus seems to be "no", thus no action is
> required.

I think we should leave that there for awhile, though I agree it's
likely that the final decision will be "no change".

> don't rename index columns behavior has already broken JDBC - As I
> understand it, this is not a code issue, but just something that
> driver authors need to be aware of.

There had been a section on the page about information we needed to
communicate to third-party authors. Someone seems to have removed
that, but that seems like where this belongs.

> Crash in buildfarm for Mac OS X 10.6.3 - Consensus seems to be that
> the machine just ran out of disk space - not sure we need to do
> anything here.

It's a bit weird though, because UpdateControlFile should always update
in place; why would there be any risk of out of disk space? I would
like to find out exactly what happened, though I have no clear ideas
how to investigate it.

> move 'long long' check to c.h - Is this perhaps addressed by Michael
> Meskes commits on May 25th?

> Mergejoin null handling - I think this is done:

Yup, both done, I moved them.

			regards, tom lane
Re: [HACKERS] PG 9.0 release timetable
On Sat, May 29, 2010 at 5:09 PM, Robert Haas wrote:
> This is a really short list.

Thoughts on a few of the remaining items:

Type Mismatch Error in Set Returning Functions - tgl says this is a
deliberate change per link I just added to the wiki. Do we think more is
required here to prevent cranky users?

Should we revert the default output format for bytea to the old style
before shipping 9.0.0? - Consensus seems to be "no", thus no action is
required.

don't rename index columns behavior has already broken JDBC - As I
understand it, this is not a code issue, but just something that driver
authors need to be aware of.

Crash in buildfarm for Mac OS X 10.6.3 - Consensus seems to be that the
machine just ran out of disk space - not sure we need to do anything
here.

move 'long long' check to c.h - Is this perhaps addressed by Michael
Meskes commits on May 25th?

Mergejoin null handling - I think this is done:
http://archives.postgresql.org/pgsql-committers/2010-05/msg00332.php

Timeline for removal of older than 7.4 links to docs - link on the wiki
page is broken and this doesn't seem like a 9.0 issue anyway. Suggest we
remove it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Performance problem in textanycat/anytextcat
Robert Haas writes:
> This is still on the 9.0 open items list, but ISTM you fixed it with
> two commits on May 27th. Is that correct?

Oh, sorry, forgot to update the open items. Done now.

			regards, tom lane
Re: [HACKERS] PG 9.0 release timetable
On Sat, May 29, 2010 at 4:19 PM, Bruce Momjian wrote:
> Assuming we want to release Postgres 9.0 by mid-August, here is how the
> timetable would look:
>
> Need RC release to be stable for 1-2 weeks before final
> RC must be released by August 1
> Beta must be stable for 2-3 weeks before RC
> Stable beta must be released by early July
>
> So, we have 5-6 weeks to get a stable beta. Looking at the open issues:
>
> http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items#Resolved_Issues
>
> it looks like we are doing OK, but we must continue progressing.

This is a really short list. Several of these items have already been
fixed, and others have been discussed extensively and are just a
question of making a final decision. The thorniest question we have yet
to resolve is what to do about max_standby_delay - I think we need Tom
and Heikki to review this patch by Simon:

http://archives.postgresql.org/pgsql-hackers/2010-05/msg01666.php

The real question in terms of release, I think, is how long we want to
wait for more bugs to be found, and/or how much time we want to allow
for Tom and others to do further review of the code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Performance problem in textanycat/anytextcat
On Mon, May 17, 2010 at 9:23 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Mon, May 17, 2010 at 4:01 PM, Tom Lane wrote:
>>> Perhaps this is a backpatchable bug fix. Comments?
>
>> I can't say whether this is safe enough to back-patch, but the way
>> this is set up, don't we also need to fix some catalog entries and, if
>> yes, isn't that problematic?
>
> The only catalog entries at issue, AFAICT, are the
> textanycat/anytextcat ones. I am not sure whether we should attempt to
> back-patch changes for them, but this patch wouldn't make the situation
> in the back branches worse. In particular, if we apply this patch but
> don't change the catalog entries, then nothing would change at all
> about the problematic cases, because the planner would decide it
> couldn't safely inline the function. The only cases where inlining will
> happen is where the expression's apparent volatility stays the same or
> decreases, so as far as that issue is concerned this patch will never
> make CREATE INDEX reject a case it would have accepted otherwise. The
> patch *will* make CREATE INDEX reject cases with volatile default
> arguments hiding under non-volatile functions, but that's got nothing
> to do with any built-in functions; and that's the case I claim is
> clearly a bug fix.

This is still on the 9.0 open items list, but ISTM you fixed it with
two commits on May 27th. Is that correct?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
[HACKERS] PG 9.0 release timetable
Assuming we want to release Postgres 9.0 by mid-August, here is how the
timetable would look:

	Need RC release to be stable for 1-2 weeks before final
	RC must be released by August 1
	Beta must be stable for 2-3 weeks before RC
	Stable beta must be released by early July

So, we have 5-6 weeks to get a stable beta. Looking at the open issues:

	http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items#Resolved_Issues

it looks like we are doing OK, but we must continue progressing.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
Jan Urbański writes:
> [ e of ] s/2 or s/3 look reasonable.

The examples in the LC paper seem to all use e = s/10. Note the stated
assumption e << s.

> So, should I just write a patch that sets the bucket width and pruning
> count using 0.07 as the assumed frequency of the most common word and
> epsilon equal to s/2 or s/3?

I'd go with s = 0.07 / desired-MCE-count and e = s / 10, at least for
a first cut to experiment with.

			regards, tom lane
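Worked through for a statistics target of K = 100 (a check of the arithmetic, not part of the original mail):

$$ s = \frac{0.07}{K} = 7\times 10^{-4}, \qquad e = \frac{s}{10} = 7\times 10^{-5}, \qquad w = \frac{1}{e} \approx 14286 $$

so the pruning step would run roughly every 14,000 lexemes, and with Jesper's sample (N ≈ 154 × 35017 ≈ 5.39M lexemes) the final cutoff f ≥ (s − e)·N would come out to about 3400 occurrences.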
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
On 29/05/10 17:34, Tom Lane wrote:
> Jan Urbański writes:
>> On 29/05/10 17:09, Tom Lane wrote:
>>> There is definitely something wrong with your math there. It's not
>>> possible for the 100'th most common word to have a frequency as high
>>> as 0.06 --- the ones above it presumably have larger frequencies,
>>> which makes the total quite a lot more than 1.0.
>
>> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
>> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014
>
> Um, apparently I can't do simple arithmetic first thing in the morning
> either, cause I got my number wrong too ;-)
>
> After a bit more research: if you use the basic form of Zipf's law
> with a 1/k distribution, the first frequency has to be about 0.07
> to make the total come out to 1.0 for a reasonable number of words.
> So we could use s = 0.07 / K when we wanted a final list of K words.
> Some people (including the LC paper) prefer a higher exponent, ie
> 1/k^S with S around 1.25. That makes the F1 value around 0.22 which
> seems awfully high for the type of data we're working with, so I think
> the 1/k rule is probably what we want here.

OK, I think we're getting somewhere :o)

I took the formula from Wikipedia's page on Zipf's law, assuming an
exponent of 1:

f(K) = 1 / (K * H(W))

where H(x) = 1 + 1/2 + 1/3 + ... + 1/x, and W is the number of words in
English.

Then I took the nth harmonic number expansion from the page on harmonic
numbers:

H(n) = ln(n) + 0.5772156649 + 1/2 * n^-1 - 1/12 * n^-2 + 1/120 * n^-4 + O(n^-6)

Assuming 1 million words in English and the big-O term in the harmonic
expansion to be negligible, we get H(1e6) = 14.3927, which would make
the frequency of the K'th word 1/(14.3927 * K), that is 0.06948 / K
(let's say 0.07 / K). Which brings me to the same result as yours, which
in turn reassures me a lot ;) My previous result was wrong because I
used the wrong logarithm base, go figure.

So with this, for a statistics target of 100 we would predict the
frequency of the 100th word to be 0.0007.

Assuming 154 * 35017 lexemes in the input, the bucket width and the
final pruning value depend only on the epsilon that we choose for the LC
algorithm. So, if we want e to be equal to s, we'd prune every 1/s =
1/0.0007 = 1428 lexemes and would not discard anything from the result.
If we want e to be s/2 we'd prune every 2857 lexemes and discard lexemes
with counts < 1887. For s/3, s/4 etc. the numbers look like this:

  e    prune every    discard below
 s/1       1428             0
 s/2       2857          1887
 s/3       4285          2516
 s/4       5714          2831
 s/5       7142          3019
 s/6       8571          3145
 s/7      10000          3235
 s/8      11428          3302
 s/9      12857          3355

s/2 or s/3 look reasonable.

So, should I just write a patch that sets the bucket width and pruning
count using 0.07 as the assumed frequency of the most common word and
epsilon equal to s/2 or s/3?

Cheers,
Jan
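The table rows follow from just two formulas, w = 1/e and cutoff = (s − e)·N. A toy C fragment (illustrative only — the variable names are invented, and it simply reproduces the numbers above) makes the relationship explicit:

#include <stdio.h>

int
main(void)
{
	double		s = 0.0007;			/* predicted frequency of the 100th word */
	double		N = 154.0 * 35017;	/* lexemes seen in the sample */

	for (int d = 1; d <= 9; d++)
	{
		double		e = s / d;						/* LC error bound */
		int			w = (int) (1.0 / e);			/* prune every w lexemes */
		int			cutoff = (int) ((s - e) * N);	/* keep f >= (s - e) * N */

		printf("e = s/%d: prune every %d, discard below %d\n", d, w, cutoff);
	}
	return 0;
}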
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
Jan Urbański writes:
> On 29/05/10 17:09, Tom Lane wrote:
>> There is definitely something wrong with your math there. It's not
>> possible for the 100'th most common word to have a frequency as high
>> as 0.06 --- the ones above it presumably have larger frequencies,
>> which makes the total quite a lot more than 1.0.

> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

Um, apparently I can't do simple arithmetic first thing in the morning
either, cause I got my number wrong too ;-)

After a bit more research: if you use the basic form of Zipf's law
with a 1/k distribution, the first frequency has to be about 0.07
to make the total come out to 1.0 for a reasonable number of words.
So we could use s = 0.07 / K when we wanted a final list of K words.
Some people (including the LC paper) prefer a higher exponent, ie
1/k^S with S around 1.25. That makes the F1 value around 0.22 which
seems awfully high for the type of data we're working with, so I think
the 1/k rule is probably what we want here.

			regards, tom lane
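Where the 0.07 comes from, spelled out (with W = 10^6 words and H_W the W'th harmonic number; a restatement of the arithmetic, not part of the original mail):

$$ \sum_{k=1}^{W} \frac{F_1}{k} = F_1 H_W \approx F_1 \left(\ln 10^6 + 0.5772\right) \approx 14.39\, F_1 = 1 \quad\Rightarrow\quad F_1 \approx 0.07 $$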
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
On 29/05/10 17:09, Tom Lane wrote:
> Jan Urbański writes:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>
>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will
>> most probably be stopwords, so we will never see them in the input.
>
>> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>
> There is definitely something wrong with your math there. It's not
> possible for the 100'th most common word to have a frequency as high
> as 0.06 --- the ones above it presumably have larger frequencies,
> which makes the total quite a lot more than 1.0.

Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

With regards to my other mail this means that top_stopwords = 10 and
error_factor = 10 would mean bucket_width = 7150 and a final prune value
of 6787.

Jan
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
Jan Urbański writes:
> Hm, I am now thinking that maybe this theory is flawed, because
> tsvectors contain only *unique* words, and Zipf's law is talking about
> words in documents in general. Normally a word like "the" would appear
> lots of times in a document, but (even ignoring the fact that it's a
> stopword and so won't appear at all) in a tsvector it will be present
> only once. This may or may not be a problem, not sure if such
> "squashing" of occurrences as tsvectors do skews the distribution away
> from Zipfian or not.

Well, it's still going to approach Zipfian distribution over a large
number of documents. In any case we are not really depending on Zipf's
law heavily with this approach. The worst-case result if it's wrong is
that we end up with an MCE list shorter than our original target. I
suggest we could try this and see if we notice that happening a lot.

			regards, tom lane
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
Jan Urbański writes:
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words H(W) is around 6.5. Let's assume the
> statistics target to be 100.

> I chose s as 1/(st + 10)*H(W) because the top 10 English words will
> most probably be stopwords, so we will never see them in the input.

> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

There is definitely something wrong with your math there. It's not
possible for the 100'th most common word to have a frequency as high
as 0.06 --- the ones above it presumably have larger frequencies,
which makes the total quite a lot more than 1.0.

For the purposes here, I think it's probably unnecessary to use the
more complex statements of Zipf's law. The interesting property is the
rule "the k'th most common element occurs 1/k as often as the most
common one". So if you suppose the most common lexeme has frequency
0.1, the 100'th most common should have frequency around 0.001. That's
pretty crude of course but it seems like the right ballpark.

			regards, tom lane
Re: [HACKERS] pg_trgm
Tatsuo Ishii writes:
> After thinking a little bit more, I think the following patch would not
> break existing behavior and would also handle the multibyte + C locale
> case. What do you think?

This is still ignoring the point: arbitrarily changing the module's
longstanding standard behavior isn't acceptable. You need to provide
a way for the user to control the behavior. (Once you've done that,
I think it can be just either "alnum" or "!isspace", but maybe some
other behaviors would be interesting.)

			regards, tom lane
Re: [HACKERS] pg_trgm
On Sat, May 29, 2010 at 9:13 AM, Tatsuo Ishii wrote:
> ! #define iswordchr(c) (lc_ctype_is_c()? \
> !     ((*(c) & 0x80)? !t_isspace(c) : (t_isalpha(c) || t_isdigit(c))) : \

Surely isspace(c) will always be false for non-ascii characters in C
locale?

Now it might be sensible to just treat any non-ascii character as a
word-character in addition to alphas and digits, so what might make
sense is

    (t_isalpha(c) || t_isdigit(c) || (lc_ctype_is_c() && (*(c) & 0x80)))

Though I wonder whether it wouldn't be generally more useful to users to
provide the non-space version as an option. I could see that being
useful for people in other circumstances aside from working around this
locale problem.

--
greg
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
On 29/05/10 12:34, Jesper Krogh wrote:
> On 2010-05-28 23:47, Jan Urbański wrote:
>> On 28/05/10 22:22, Tom Lane wrote:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>>
>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will
>> most probably be stopwords, so we will never see them in the input.
>>
> I think you should skip the assumption about stop-words, users may
> use something where they are needed in the index or have a language
> other than the typical. (And they don't seem to influence the math
> that much.)

Turns out it has nearly linear influence on the bucket width and the
frequency necessary to survive the final pruning. I put some data in a
spreadsheet, results below.

> Isn't it the same "type" of logic that is used for collecting
> statistics for "array-types", say integer-arrays and text arrays?

AFAIK statistics for everything other than tsvectors are built based on
the values of whole rows. ts_typanalyze is the only typanalyze function
that takes the trouble of looping over the actual contents of each cell;
all the others just compare whole arrays (which means that for a text[]
field you will probably get a quite useless MCV entry).

>> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>>
>> We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes
>>
> I'm not sure I get this one... does this mean that we prune every time
> we have collected 167 new datapoints? That would seem too often for me
> since that would roughly be once per "row".

Hm, if we pick s to be 0.06, we say that the K'th word in the English
language will have a frequency of 0.06, so if we want to have statistics
with an error of s/10, we can prune every 167 lexemes (K is the
statistics target, possibly + top_stopwords).

Hm, I am now thinking that maybe this theory is flawed, because
tsvectors contain only *unique* words, and Zipf's law is talking about
words in documents in general. Normally a word like "the" would appear
lots of times in a document, but (even ignoring the fact that it's a
stopword and so won't appear at all) in a tsvector it will be present
only once. This may or may not be a problem, not sure if such
"squashing" of occurrences as tsvectors do skews the distribution away
from Zipfian or not.

Anyway, figuring that out would require some more math and thinking, and
to fix the problem at hand we can say Zipf is good enough.

>> After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N
>>
>> So assuming that on average a tsvector has 154 elements and that we
>> went through 35017 rows (as it would be in Jesper's case, before he
>> raised the stats target from 100 to 1000), we will remove lexemes with
>> f < 0.054 * 35017 * 154, that is f < 291201.37
>>
>> I wonder what would happen in Jesper's case if we did that... And I
>> wonder how sound that maths is
>>
> If it means that I would get an accurate MCE-histogram for all
> things that have an occurrence of more than 5.4% of the rows
> (given the samples chosen), then I think that would be really
> reasonable.

Here's the spreadsheet spat out. The variables are:

* the statistics target
* top stopwords
* error factor

Where top stopwords is the number of top words in the English language
that would be stopwords. You can also think about it as the smudge
factor determining how well we trust that the distribution is Zipfian.
Theoretically, if you want to keep X values in the MCE array, you should
discard inputs with frequency lower than the frequency of the X'th value
in a Zipfian distribution. If you would write out all English words and
their frequencies (according to Zipf's law), the top Y of them would be
stopwords. We want to discard words with frequency that's lower than
X + Y, and then we probably want to have some breathing space as well.
That cutoff frequency is called s in the LC algorithm.

The error factor determines the relation between s and e, since
apparently we want e to be proportional to s (e is the error from the LC
algorithm). It directly determines the bucket width, since the larger
the bucket, the more accurate the results will be, as there will be less
pruning going on.

There are also constants: H(len(eng)) is the harmonic number from Zipf's
law, which, assuming 1e6 words in English, is 6.5. tsvector length and
rows in sample are just some values to get concrete numbers out. They
influence the final pruning frequency, because the rule is f < (s-e)N,
and N is the total number of lexemes seen.

The results are attached in a text (CSV) file, to preserve formatting.
Based on them I'd like to propose top_stopwords and error_factor to be
100. With your dataset this would mean pruning every 3076 lexemes and
discarding from the result all lexemes with < 173507 occurrences. With
statistics target set to 1000 it would change to 16923 and 31546,
respectively.

> I can "fairly easy" try out patches or do other kind of testing.
[HACKERS] Statistics for tsvector "wildcards". term*
Hi.

There seems to be an "unimplemented" area around getting statistics for
wildcard searches. Wildcards anchored to the left can be matched up by
the gin index and the ts_match_vq operator:

testdb=# select to_tsvector('project') @@ to_tsquery('proj:*');
 ?column?
----------
 t
(1 row)

Searching for "projects" gives me this estimate:

testdb=# explain select id,document_tsvector from efam.reference where
document_tsvector @@ to_tsquery('projects') order by id limit 50;
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3008.54 rows=50 width=76)
   ->  Index Scan using reference_pkey on reference  (cost=0.00..3762544.72 rows=62531 width=76)
         Filter: (document_tsvector @@ to_tsquery('projects'::text))
(3 rows)

whereas searching for proj:* gives:

testdb=# explain select id,document_tsvector from efam.reference where
document_tsvector @@ to_tsquery('proj:*') order by id limit 50;
                                          QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit  (cost=73.56..73.58 rows=6 width=76)
   ->  Sort  (cost=73.56..73.58 rows=6 width=76)
         Sort Key: id
         ->  Bitmap Heap Scan on reference  (cost=34.55..73.49 rows=6 width=76)
               Recheck Cond: (document_tsvector @@ to_tsquery('proj:*'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..34.54 rows=6 width=0)
                     Index Cond: (document_tsvector @@ to_tsquery('proj:*'::text))
(7 rows)

There are obvious challenges in getting statistics for submatches where
there is no real information in the pg_stats table, but there will also
be a huge number of cases where a fairly reliable guess can be
extracted.

--
Jesper
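One plausible shape for such a guess (a hypothetical sketch, not anything that exists in the planner: all names here are invented) is to sum the frequencies of the most-common-element entries whose lexemes share the prefix, instead of falling back to the default row estimate:

/*
 * Hypothetical estimator sketch: estimate the selectivity of a
 * left-anchored pattern such as 'proj:*' from a column's
 * most-common-element statistics.
 */
#include <string.h>

double
prefix_mce_selectivity(const char *prefix,
					   const char *const *mce_lexemes,
					   const float *mce_freqs,
					   int n_mce)
{
	size_t		plen = strlen(prefix);
	double		sel = 0.0;

	for (int i = 0; i < n_mce; i++)
	{
		if (strncmp(mce_lexemes[i], prefix, plen) == 0)
			sel += mce_freqs[i];
	}
	/* Some allowance for matches outside the MCE list would be needed. */
	return sel;
}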
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
On 2010-05-28 04:47, Tom Lane wrote:
> Cranking up the stats target actually makes it worse not better, since
> low-frequency items are then more likely to get into the MCV list

I should have been more precise in the wording. Cranking up the stats
target gave me an overall "better plan", but that is because the range
in the MCE histogram where the query plan for my sample query tipped
from a "Bitmap Index Scan" on the gin index to an "Index Scan" on a
btree index actually became reliable. This is due more to the nature of
my application and test queries than to the correctness of the MCE
histogram.

So cranking up the statistics target moved the problem to somewhere it
didn't matter that much to me.

--
Jesper
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
On 2010-05-28 23:47, Jan Urbański wrote:
> On 28/05/10 22:22, Tom Lane wrote:
>> The idea that I was toying with is to assume a Zipfian distribution of
>> the input (with some reasonable parameter), and use that to estimate
>> what the frequency of the K'th element will be, where K is the target
>> number of MCV entries or perhaps a bit more. Then use that estimate as
>> the "s" value, and set e = s/10 or so, and then w = 1/e and continue
>> as per the paper. If the eventual filtering results in a lot less than
>> the target number of MCV entries (because the input wasn't so
>> Zipfian), we lose, but at least we have accurate numbers for the
>> entries we kept.
>
> I see what you mean, so the idea would be:
>
> * assume some value of W as the number of all words in the language
> * estimate s as 1/(st + 10)*H(W), where H(W) is the W'th harmonic
>   number and st is the statistics target, using Zipf's law
> * set e = s/10 and w = 1/e, that is 10/s
> * perform LC using that value of w
> * remove all elements for which f < (s-e)N, that is f < 0.9*sN, where
>   N is the total number of lexemes processed
> * create the MCELEM entries as (item, f/N)
>
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words H(W) is around 6.5. Let's assume the
> statistics target to be 100.
>
> I chose s as 1/(st + 10)*H(W) because the top 10 English words will
> most probably be stopwords, so we will never see them in the input.

I think you should skip the assumption about stop-words; users may use
something where they are needed in the index, or have a language other
than the typical. (And they don't seem to influence the math that much.)

Isn't it the same "type" of logic that is used for collecting statistics
for "array-types", say integer-arrays and text arrays?

> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>
> We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes

I'm not sure I get this one... does this mean that we prune every time
we have collected 167 new datapoints? That would seem too often for me,
since that would roughly be once per "row".

> After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N
>
> So assuming that on average a tsvector has 154 elements and that we
> went through 35017 rows (as it would be in Jesper's case, before he
> raised the stats target from 100 to 1000), we will remove lexemes with
> f < 0.054 * 35017 * 154, that is f < 291201.37
>
> I wonder what would happen in Jesper's case if we did that... And I
> wonder how sound that maths is

If it means that I would get an accurate MCE-histogram for all things
that have an occurrence of more than 5.4% of the rows (given the samples
chosen), then I think that would be really reasonable.

I can "fairly easy" try out patches or do other kind of testing.

--
Jesper
Re: [HACKERS] pg_trgm
> > It's not a practical solution for people working with prebuilt
> > Postgres versions, which is most people. I don't object to finding a
> > way to provide a "not-space" behavior instead of an "is-alnum"
> > behavior, but as noted upthread a GUC isn't the right way. How do you
> > feel about a new set of functions with an additional flag argument of
> > some sort?
>
> Let me see how many functions we need to create...

After thinking a little bit more, I think the following patch would not
break existing behavior and would also handle the multibyte + C locale
case. What do you think?

*** trgm_op.c~	2009-06-11 23:48:51.0 +0900
--- trgm_op.c	2010-05-29 17:07:28.0 +0900
***************
*** 59,65 ****
  }

  #ifdef KEEPONLYALNUM
! #define iswordchr(c)	(t_isalpha(c) || t_isdigit(c))
  #else
  #define iswordchr(c)	(!t_isspace(c))
  #endif
--- 59,67 ----
  }

  #ifdef KEEPONLYALNUM
! #define iswordchr(c)	(lc_ctype_is_c()? \
! 			((*(c) & 0x80)? !t_isspace(c) : (t_isalpha(c) || t_isdigit(c))) : \
! 			(t_isalpha(c) || t_isdigit(c)))
  #else
  #define iswordchr(c)	(!t_isspace(c))
  #endif