Re: [HACKERS] Is there any way to get the list of table names before the raw parse tree is analyzed?
On Sun, 2010-05-30 at 00:44 -0400, Andrew Dunstan wrote:
> Mohammad Heykal Abdillah wrote:
> > Hi all,
> >
> > Right now I am trying to understand how the SQL parser works.
> >
> > My question is: is there any way to get the list of table names and
> > their attributes before the raw parse tree is analyzed?
> >
> > Because I would like to understand how PostgreSQL "breaks down" the
> > asterisk in the target list, especially in the "natural join" case,
> > where PostgreSQL can give a query result that has unambiguous
> > attributes and matches the right join key.
>
> AIUI, expanding '*' is not done in the parser. The parser just makes an
> A_star node.
>
> See src/backend/parser/gram.y for the parser spec.
>
> Maybe you need to explain what exactly you're trying to do.
>
> cheers
>
> andrew

Yes, that much I know: expanding '*' is done in the analyzer. What I am
trying to do is move the expansion to before the raw parse tree produced
by gram.y is processed by the analyzer. Like this:

sql query -> gram.y -> raw_parse_tree -> (expand the '*') -> analyze

This way the analyzer only receives a parse tree in which '*' is already
expanded, so it won't expand the '*' again.

I am asking because I don't know how the analyzer scans the database so
that it can return the list of attributes for a table. I have read the
source code and tried manually to replicate the process (from what I can
tell, the RTE functions do this) with no result.

Thank You.
--
Mohammad Heykal Abdillah
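For reference, a minimal sketch of the first half of that pipeline — walking the raw parse tree that raw_parser() returns and collecting the table names referenced in the FROM clause. This is not PostgreSQL code: the function names collect_from_item and tables_in_raw_select are invented for illustration, and it assumes 8.4/9.0-era internals.

/*
 * Hedged sketch: collect the relation names referenced in the FROM
 * clause of a raw SELECT statement, before parse analysis runs.
 */
#include "postgres.h"
#include "nodes/parsenodes.h"
#include "nodes/pg_list.h"
#include "parser/parser.h"

static void
collect_from_item(Node *item, List **relnames)
{
	if (item == NULL)
		return;
	if (IsA(item, RangeVar))
	{
		/* A plain table reference: remember its name. */
		*relnames = lappend(*relnames, ((RangeVar *) item)->relname);
	}
	else if (IsA(item, JoinExpr))
	{
		/* JOIN (including NATURAL JOIN): recurse into both sides. */
		JoinExpr   *j = (JoinExpr *) item;

		collect_from_item(j->larg, relnames);
		collect_from_item(j->rarg, relnames);
	}
	/* Subqueries, functions in FROM, etc. would need more cases. */
}

static List *
tables_in_raw_select(const char *query_string)
{
	List	   *relnames = NIL;
	List	   *raw = raw_parser(query_string);
	ListCell   *lc;

	foreach(lc, raw)
	{
		Node	   *stmt = (Node *) lfirst(lc);

		if (IsA(stmt, SelectStmt))
		{
			ListCell   *flc;

			foreach(flc, ((SelectStmt *) stmt)->fromClause)
				collect_from_item((Node *) lfirst(flc), &relnames);
		}
	}
	return relnames;
}

Getting each table's attribute list, however, is exactly where the analyzer's catalog access comes in: the RTE-building code (addRangeTableEntry() in src/backend/parser/parse_relation.c) opens the relation and reads its tuple descriptor, which is why '*' expansion naturally lives after the raw parse.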
Re: [HACKERS] Is there any way to get the list of table names before the raw parse tree is analyzed?
Mohammad Heykal Abdillah wrote:
> Hi all,
>
> Right now I am trying to understand how the SQL parser works.
>
> My question is: is there any way to get the list of table names and
> their attributes before the raw parse tree is analyzed?
>
> Because I would like to understand how PostgreSQL "breaks down" the
> asterisk in the target list, especially in the "natural join" case,
> where PostgreSQL can give a query result that has unambiguous
> attributes and matches the right join key.

AIUI, expanding '*' is not done in the parser. The parser just makes an
A_star node.

See src/backend/parser/gram.y for the parser spec.

Maybe you need to explain what exactly you're trying to do.

cheers

andrew
Re: [HACKERS] small exclusion constraints patch
On Sat, May 29, 2010 at 11:40 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Sat, May 29, 2010 at 6:11 PM, Tom Lane wrote:
>>> Or, to put it differently: if nobody's done that in the past twenty
>>> years, why is it likely to happen before 9.1?
>
>> Hmm. Well suppose we bet a dollar on whether that will happen or not.
>> In fact, if you promise not to read
>> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01175.php I'll
>> make it two dollars.
>
> Yes, I've seen Jeff's example. It's a cute hack but somehow I doubt
> that there is going to be a land rush to implement such things.
> Can you point to any pre-existing example where anyone actually asked
> for the ability to do that?

I've often wished for the ability to constrain a table to hold just one
row, so I don't find that use case implausible at all. As to whether
Jeff's use case is a cute hack or something that people will really want
to do, I think the jury's still out on that one. But I don't think we
should make it not work unless we have a concrete reason, and I haven't
heard one yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Re: [RFC][PATCH]: CRC32 is limiting at COPY/CTAS/INSERT ... SELECT + speeding it up
Greg Stark writes:
> On Sun, May 30, 2010 at 3:56 AM, Greg Stark wrote:
>> This sounds familiar. If you search back in the archives around 2004
>> or so I think you'll find a similar discussion when we replaced the
>> crc32 implementation with what we have now.

> Fwiw here's the thread (from 2005):
> http://thread.gmane.org/gmane.comp.db.postgresql.devel.general/43811

I read through that thread and couldn't find much discussion of
alternative CRC implementations --- we spent all our time on arguing
about whether we needed 64-bit CRC or not.

			regards, tom lane
Re: [HACKERS] small exclusion constraints patch
Robert Haas writes:
> On Sat, May 29, 2010 at 6:11 PM, Tom Lane wrote:
>> Or, to put it differently: if nobody's done that in the past twenty
>> years, why is it likely to happen before 9.1?

> Hmm. Well suppose we bet a dollar on whether that will happen or not.
> In fact, if you promise not to read
> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01175.php I'll
> make it two dollars.

Yes, I've seen Jeff's example. It's a cute hack but somehow I doubt
that there is going to be a land rush to implement such things.
Can you point to any pre-existing example where anyone actually asked
for the ability to do that?

			regards, tom lane
[HACKERS] Is there any way to get the list of table names before the raw parse tree is analyzed?
Hi all,

Right now I am trying to understand how the SQL parser works.

My question is: is there any way to get the list of table names and their
attributes before the raw parse tree is analyzed?

Because I would like to understand how PostgreSQL "breaks down" the
asterisk in the target list, especially in the "natural join" case, where
PostgreSQL can give a query result that has unambiguous attributes and
matches the right join key.

Thank You.
--
Mohammad Heykal Abdillah
Re: [HACKERS] PG 9.0 release timetable
On Sat, May 29, 2010 at 5:58 PM, Tom Lane wrote:
> Robert Haas writes:
>> Thoughts on a few of the remaining items:
>
>> Should we revert the default output format for bytea to the old style
>> before shipping 9.0.0? - Consensus seems to be "no", thus no action is
>> required.
>
> I think we should leave that there for awhile, though I agree it's
> likely that the final decision will be "no change".
>
>> don't rename index columns behavior has already broken JDBC - As I
>> understand it, this is not a code issue, but just something that
>> driver authors need to be aware of.
>
> There had been a section on the page about information we needed to
> communicate to third-party authors. Someone seems to have removed
> that, but that seems like where this belongs.
>
>> Crash in buildfarm for Mac OS X 10.6.3 - Consensus seems to be that
>> the machine just ran out of disk space - not sure we need to do
>> anything here.
>
> It's a bit weird though, because UpdateControlFile should always update
> in place; why would there be any risk of out of disk space? I would
> like to find out exactly what happened, though I have no clear ideas
> how to investigate it.

Well, I think at a minimum the first two of these need to go into a
section that is not called "code": the first is just a decision we might
change our mind about, and the second is a communication issue, not a
code issue. I'd argue that the third one is probably not something we're
going to hold up the release for, either, and therefore while it might
belong on a list of known open bugs it doesn't really belong on a list of
9.0 open items.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Streaming Replication: Checkpoint_segment and wal_keep_segments on standby
On Fri, May 28, 2010 at 11:12 AM, Fujii Masao wrote: > On Thu, May 27, 2010 at 11:13 PM, Robert Haas wrote: >>> I guess this happens because the frequency of checkpoint on the standby is >>> too lower than that on the master. In the master, checkpoint occurs for >>> every >>> consumption of three segments because of "checkpoint_segments = 3". On the >>> other hand, in the standby, only checkpoint_timeout has effect, so >>> checkpoint >>> occurs for every 30 minutes because of "checkpoint_timeout = 30min". >>> >>> The walreceiver should signal the bgwriter to start checkpoint if it has >>> received more than checkpoint_segments WAL files, like normal processing? >> >> Is this also an issue when using log shipping, or just with SR? > > When using log shipping, checkpoint_segments always doesn't trigger a > checkpoint. So recovery after the standby crashes might take unexpectedly > long since redo starting point might be old. > > But in file-based log shipping, since WAL files don't accumulate in > pg_xlog directory on the standby, even if the frequency of checkpoint > is very low, pg_xlog will not be filled with many WAL files. That > accumulation occurs only when using SR. > > If we should avoid low frequency of checkpoint itself rather than > accumulation of WAL files, the bgwriter instead of the walreceiver > should check if we've consumed too much WAL, I think. Thought? I attached the patch, which changes the startup process so that it signals bgwriter to perform a restartpoint if we've already replayed too much WAL files. This leads checkpoint_segments to trigger a restartpoint. This patch is worth applying for 9.0? If not, I'll add it into the next CF. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** *** 508,513 static bool reachedMinRecoveryPoint = false; --- 508,516 static bool InRedo = false; + /* We've already launched bgwriter to perform restartpoint? */ + static bool bgwriterLaunched = false; + /* * Information logged when we detect a change in one of the parameters * important for Hot Standby. *** *** 550,555 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); --- 553,559 static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites, XLogRecPtr *lsn, BkpBlock *bkpb); static bool AdvanceXLInsertBuffer(bool new_segment); + static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg); static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch); static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath, bool find_free, int *max_advance, *** *** 1554,1567 AdvanceXLInsertBuffer(bool new_segment) /* * Check whether we've consumed enough xlog space that a checkpoint is needed. * ! * Caller must have just finished filling the open log file (so that ! * openLogId/openLogSeg are valid). We measure the distance from RedoRecPtr ! * to the open log file and see if that exceeds CheckPointSegments. * * Note: it is caller's responsibility that RedoRecPtr is up-to-date. */ static bool ! XLogCheckpointNeeded(void) { /* * A straight computation of segment number could overflow 32 bits. Rather --- 1558,1571 /* * Check whether we've consumed enough xlog space that a checkpoint is needed. * ! * Caller must have just finished filling or reading the log file (so that ! * the given logid/logseg are valid). We measure the distance from RedoRecPtr ! * to the log file and see if that exceeds CheckPointSegments. 
* * Note: it is caller's responsibility that RedoRecPtr is up-to-date. */ static bool ! XLogCheckpointNeeded(uint32 logid, uint32 logseg) { /* * A straight computation of segment number could overflow 32 bits. Rather *** *** 1577,1584 XLogCheckpointNeeded(void) old_segno = (RedoRecPtr.xlogid % XLogSegSize) * XLogSegsPerFile + (RedoRecPtr.xrecoff / XLogSegSize); old_highbits = RedoRecPtr.xlogid / XLogSegSize; ! new_segno = (openLogId % XLogSegSize) * XLogSegsPerFile + openLogSeg; ! new_highbits = openLogId / XLogSegSize; if (new_highbits != old_highbits || new_segno >= old_segno + (uint32) (CheckPointSegments - 1)) return true; --- 1581,1588 old_segno = (RedoRecPtr.xlogid % XLogSegSize) * XLogSegsPerFile + (RedoRecPtr.xrecoff / XLogSegSize); old_highbits = RedoRecPtr.xlogid / XLogSegSize; ! new_segno = (logid % XLogSegSize) * XLogSegsPerFile + logseg; ! new_highbits = logid / XLogSegSize; if (new_highbits != old_highbits || new_segno >= old_segno + (uint32) (CheckPointSegments - 1)) return true; *** *** 1782,1791 XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch) * update RedoRecPtr and recheck. */ if (IsUnderPostmaster && ! XLogCheck
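For readers skimming the (truncated) patch above, its core idea reduces to a check like the one below in the startup process's replay loop, run whenever a WAL segment boundary is crossed. This is a hedged paraphrase, not a verbatim excerpt: readId/readSeg are the startup process's current WAL read position, bgwriterLaunched is the flag the patch adds, and XLogCheckpointNeeded() is the patch's reworked, (logid, logseg)-taking form of the existing check.

/*
 * Sketch: in the startup process during recovery, after advancing to
 * WAL segment (readId, readSeg).
 */
if (bgwriterLaunched && XLogCheckpointNeeded(readId, readSeg))
{
	/*
	 * Update RedoRecPtr first, in case a restartpoint just completed
	 * and moved the redo pointer forward; then re-check before asking
	 * the bgwriter for a new restartpoint.
	 */
	(void) GetRedoRecPtr();
	if (XLogCheckpointNeeded(readId, readSeg))
		RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}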
[HACKERS] Re: [RFC][PATCH]: CRC32 is limiting at COPY/CTAS/INSERT ... SELECT + speeding it up
On Sun, May 30, 2010 at 3:56 AM, Greg Stark wrote:
> This sounds familiar. If you search back in the archives around 2004
> or so I think you'll find a similar discussion when we replaced the
> crc32 implementation with what we have now.

Fwiw here's the thread (from 2005):
http://thread.gmane.org/gmane.comp.db.postgresql.devel.general/43811

--
greg
[HACKERS] Re: [RFC][PATCH]: CRC32 is limiting at COPY/CTAS/INSERT ... SELECT + speeding it up
This sounds familiar. If you search back in the archives around 2004 or
so I think you'll find a similar discussion when we replaced the crc32
implementation with what we have now. We put a fair amount of effort into
searching for faster implementations so if you've found one 3x faster I'm
pretty startled.

Are you sure it's faster on all architectures and not a win sometimes and
a loss other times? And are you sure it's faster in our use case where
we're crcing small sequences of data often and not crcing a large block?
Re: [HACKERS] small exclusion constraints patch
On Sat, May 29, 2010 at 6:11 PM, Tom Lane wrote:
> Robert Haas writes:
>> The only disadvantage I see of just documenting this is that someone
>> might write a user-defined index opclass that works like this, and
>> they won't be able to use this until at least 9.1 (or at least, not
>> without patching the source).
>
> I don't actually think that anyone's very likely to write a <>-like
> index operator. It's approximately useless to use an index for such a
> query.
>
> Or, to put it differently: if nobody's done that in the past twenty
> years, why is it likely to happen before 9.1?

Hmm. Well suppose we bet a dollar on whether that will happen or not.
In fact, if you promise not to read
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01175.php I'll
make it two dollars.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] [RFC][PATCH]: CRC32 is limiting at COPY/CTAS/INSERT ... SELECT + speeding it up
Added to TODO: Consider a faster CRC32 algorithm * http://archives.postgresql.org/pgsql-hackers/2010-05/msg01112.php --- Andres Freund wrote: > Hi, > > I started to analyze XLogInsert because it was the major bottleneck when > creating some materialized view/cached tables/whatever. > Analyzing it I could see that content of the COMP_CRC32 macro was taking most > of the time which isn't immediately obvious when you profile because it > obviously doesn't show up as a separate function. > I first put it into functions to make it easier to profile. I couldn't > measure > any difference for COPY, CTAS and a simple pgbench run on 3 kinds of hardware > (Core2, older Xeon, older Sparc systems). > > I looked a bit around for faster implementations of CRC32 and found one in > zlib. After adapting it (pg uses slightly different computation (non- > inverted)) I found that it increases the speed of the CRC32 calculation > itself > 3 fold. > It does that by not only using one lookup table but four (one for each byte > of > a word). Those four calculations are independent and thus are considerably > faster on somewhat recent hardware. > Also it does memory lookups in 4 byte steps instead of 1 byte as the pg > version (thats only about ~8% benefit in itself). > > I wrote a preliminary patch which includes both, the original implementation > and the new one switchable via an #define. > > > I tested performance differences in a small number of scenarios: > - CTAS/INSERT ... SELECT (8-30%) > - COPY (3-20%) > - pgbench (no real difference unless directly after a checkpoint) > > Setup: > > CREATE TABLE blub (ai int, bi int, aibi int); > CREATE TABLE speedtest (ai int, bi int, aibi int); > > > INSERT ... SELECT: > > Statement: > INSERT INTO blub SELECT a.i, b.i, a.i *b.i FROM generate_series(1, 1) > a(i), generate_series(1, 1000) b(i); > > legacy crc: > > 11526.588 > 11406.518 > 11412.182 > 11430.245 > > zlib: > 9977.394 > 9945.408 > 9840.907 > 9842.875 > > > COPY: > Statement: > ('blub' enlarged here 4 times, as otherwise the variances were to large) > > COPY blub TO '/tmp/b' BINARY; > ... > CHECKPOINT;TRUNCATE speedtest; COPY speedtest FROM '/tmp/b' BINARY; > > legacy: > 44835.840 > 44832.876 > > zlib: > 39530.549 > 39365.109 > 39295.167 > > The performance differences are bigger if the table rows are significantly > bigger. > > Do you think something like that is sensible? If yes, I will make it into a > proper patch and such. > > Thanks, > > Andres > > INSERT ... SELECT profile before patch: > > 20.22% postgres postgres [.] comp_crc32 > 5.77% postgres postgres [.] XLogInsert > 5.55% postgres postgres [.] LWLockAcquire > 5.21% postgres [kernel. [k] copy_user_generic_string > 4.64% postgres postgres [.] LWLockRelease > 4.39% postgres postgres [.] ReadBuffer_common > 2.75% postgres postgres [.] heap_insert > 2.22% postgres libc-2.1 [.] memcpy > 2.09% postgres postgres [.] UnlockReleaseBuffer > 1.85% postgres postgres [.] hash_any > 1.77% postgres [kernel. [k] clear_page_c > 1.69% postgres postgres [.] hash_search_with_hash_value > 1.61% postgres postgres [.] heapgettup_pagemode > 1.50% postgres postgres [.] PageAddItem > 1.42% postgres postgres [.] MarkBufferDirty > 1.28% postgres postgres [.] RelationGetBufferForTuple > 1.15% postgres postgres [.] ExecModifyTable > 1.06% postgres postgres [.] RelationPutHeapTuple > > > After: > > 9.97% postgres postgres [.] comp_crc32 > 5.95% postgres [kernel. [k] copy_user_generic_string > 5.94% postgres postgres [.] LWLockAcquire > 5.64% postgres postgres [.] 
XLogInsert
>  5.11% postgres postgres [.] LWLockRelease
>  4.63% postgres postgres [.] ReadBuffer_common
>  3.45% postgres postgres [.] heap_insert
>  2.54% postgres libc-2.1 [.] memcpy
>  2.03% postgres postgres [.] UnlockReleaseBuffer
>  1.94% postgres postgres [.] hash_search_with_hash_value
>  1.84% postgres postgres [.] hash_any
>  1.73% postgres [kernel. [k] clear_page_c
>  1.68% postgres postgres [.] PageAddItem
>  1.62% postgres postgres [.] heapgettup_pagemode
>  1.52% postgres postgres [.] RelationGetBufferForTuple
>  1.47% postgres postgres [.] MarkBufferDirty
>  1.30% postgres postgres [.] ExecModifyTable
>  1.23% postgres postgres [.] RelationPutHeapTuple

[ Attachment, skipping... ]
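For readers who don't want to dig through the attachment, the structural difference Andres describes — one lookup table indexed a byte at a time versus four tables consumed a 32-bit word at a time — looks roughly like the sketch below. This is a hedged illustration of the zlib-style "slice-by-4" technique, assuming a reflected (little-endian) CRC-32 with tables precomputed elsewhere; it is not the actual patch.

#include <stddef.h>
#include <stdint.h>

/*
 * tab[0] is the classic byte-at-a-time table; tab[1..3] are derived
 * from it so four independent lookups can digest one word per step.
 */
extern uint32_t tab[4][256];	/* assumed precomputed at startup */

uint32_t
crc32_slice_by_4(uint32_t crc, const unsigned char *p, size_t len)
{
	/*
	 * Word-at-a-time main loop: the four table lookups do not depend
	 * on each other, which is what lets recent CPUs overlap them.
	 */
	while (len >= 4)
	{
		crc ^= (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
			((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
		crc = tab[3][crc & 0xff] ^
			tab[2][(crc >> 8) & 0xff] ^
			tab[1][(crc >> 16) & 0xff] ^
			tab[0][crc >> 24];
		p += 4;
		len -= 4;
	}
	/* Byte-at-a-time tail, same as the classic implementation. */
	while (len-- > 0)
		crc = tab[0][(crc ^ *p++) & 0xff] ^ (crc >> 8);
	return crc;
}

Note that the word-at-a-time loop only pays off once the input is at least a few words long, which is exactly why Greg's question above about CRCing many small WAL chunks versus one large block matters.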
Re: [HACKERS] Specification for Trusted PLs?
Robert Haas wrote:
> On Sat, May 22, 2010 at 4:53 PM, Cédric Villemain wrote:
> > 2010/5/21 Jan Wieck :
> >> The original idea was that a trusted language does not allow an
> >> unprivileged user to gain access to any object or data, he does not
> >> have access to without that language.
> >>
> >> This does not include data transformation functionality, like string
> >> processing or the like. As long as the user had legitimate access to
> >> the input datum, then every derived form thereof is OK.
> >
> > I find the current doc enough, add this prose from Jan as a comment
> > might help people perhaps.
>
> Yeah, Jan's description is very clear and to the point.

The attached, applied patch clarifies the meaning of "trusted language"
in the documentation using Jan's description.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

Index: doc/src/sgml/xplang.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/xplang.sgml,v
retrieving revision 1.37
diff -c -c -r1.37 xplang.sgml
*** doc/src/sgml/xplang.sgml 3 Apr 2010 07:22:56 - 1.37
--- doc/src/sgml/xplang.sgml 30 May 2010 02:21:53 -
*** 151,158
VALIDATOR validator_function_name ; The optional key word TRUSTED specifies that ! ordinary database users that have no superuser privileges should ! be allowed to use this language to create functions and trigger procedures. Since PL functions are executed inside the database server, the TRUSTED flag should only be given for languages that do not allow access to database server
--- 151,160
VALIDATOR validator_function_name ; The optional key word TRUSTED specifies that ! the language does not grant access to data that the user would ! not otherwise have. Trusted languages are designed for ordinary ! database users (those without superuser privilege) and allow them ! to safely create functions and trigger procedures. Since PL functions are executed inside the database server, the TRUSTED flag should only be given for languages that do not allow access to database server

Index: doc/src/sgml/ref/create_language.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/ref/create_language.sgml,v
retrieving revision 1.50
diff -c -c -r1.50 create_language.sgml
*** doc/src/sgml/ref/create_language.sgml 3 Apr 2010 07:22:58 - 1.50
--- doc/src/sgml/ref/create_language.sgml 30 May 2010 02:21:53 -
*** 104,114
!TRUSTED specifies that !the language is safe, that is, it does not offer an !unprivileged user any functionality to bypass access !restrictions. If this key word is omitted when registering the !language, only users with the PostgreSQL superuser privilege can use this language to create new functions.
--- 104,113
!TRUSTED specifies that the language does !not grant access to data that the user would not otherwise !have. If this key word is omitted !when registering the language, only users with the PostgreSQL superuser privilege can use this language to create new functions.
Re: [HACKERS] pg_trgm
> > Wait. This works fine for me with stock pg_trgm. Locale is C and
> > encoding is UTF8. What version of PostgreSQL are you using? Mine is
> > 8.4.4.
>
> This is in 9.0, because 8.4 doesn't recognize the \u escape syntax. If
> you run this in 8.4, you're just comparing a sequence of ASCII letters
> and digits.

Hum. Still I prefer 8.4's behavior since anything is better than
returning NaN. It seems 9.0 does not have any escape route for
multibyte + C locale users.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: [HACKERS] pg_trgm
> This is still ignoring the point: arbitrarily changing the module's
> longstanding standard behavior isn't acceptable. You need to provide
> a way for the user to control the behavior. (Once you've done that,
> I think it can be just either "alnum" or "!isspace", but maybe some
> other behaviors would be interesting.)

To be honest I don't know what the "module's longstanding standard
behavior" should be. It's not documented anywhere. If you mean it is
whatever the current implementation is, then any effort to touch the
module should be prohibited.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Regression testing for psql
On Wed, May 26, 2010 at 6:25 PM, Stephen Frost wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> There might be some value in psql backslash command tests that
>> are designed to depend on just one or a few tables (or other
>> appropriate objects).
>
> Updated, much much smaller, patch attached. Also available, again, at
> http://snowman.net/~sfrost/psql-regress-help.patch
>
> Basically, I removed anything that would produce data directly from
> the catalogs by trying to find a 'none' object which matched. This
> still goes through a lot of the same setup and query, it's just that
> there aren't any results.

Is this something to be added to the 2010-07 commitfest?

-selena

--
http://chesnok.com/daily - me
Re: [HACKERS] small exclusion constraints patch
On Sat, May 29, 2010 at 06:11:57PM -0400, Tom Lane wrote:
> Robert Haas writes:
> > The only disadvantage I see of just documenting this is that
> > someone might write a user-defined index opclass that works like
> > this, and they won't be able to use this until at least 9.1 (or at
> > least, not without patching the source).
>
> I don't actually think that anyone's very likely to write a <>-like
> index operator. It's approximately useless to use an index for such
> a query.
>
> Or, to put it differently: if nobody's done that in the past twenty
> years, why is it likely to happen before 9.1?

Because there's a fundamentally new way to use them now, namely with
exclusion constraints :)

Cheers,
David.
--
David Fetter http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] small exclusion constraints patch
Robert Haas writes:
> The only disadvantage I see of just documenting this is that someone
> might write a user-defined index opclass that works like this, and
> they won't be able to use this until at least 9.1 (or at least, not
> without patching the source).

I don't actually think that anyone's very likely to write a <>-like
index operator. It's approximately useless to use an index for such
a query.

Or, to put it differently: if nobody's done that in the past twenty
years, why is it likely to happen before 9.1?

			regards, tom lane
Re: [HACKERS] small exclusion constraints patch
On Fri, May 28, 2010 at 10:32 PM, Bruce Momjian wrote:
> Tom Lane wrote:
>> Jeff Davis writes:
>> > Currently, the check for exclusion constraints performs a sanity
>> > check that's slightly too strict -- it assumes that a tuple will
>> > conflict with itself. That is not always the case: the operator
>> > might be "<>", in which case it's perfectly valid for the search
>> > for conflicts to not find itself.
>>
>> > This patch simply removes that sanity check, and leaves a comment
>> > in place.
>>
>> I'm a bit uncomfortable with removing the sanity check; it seems like
>> a good thing to have, especially since this code hasn't even made it
>> out of beta yet. AFAIK the "<>" case is purely hypothetical, because
>> we have no index opclasses supporting such an operator, no? How about
>> just documenting that we'd need to remove the sanity check if we ever
>> did add support for such a case?
>
> Done, with attached, applied patch.

The only disadvantage I see of just documenting this is that someone
might write a user-defined index opclass that works like this, and they
won't be able to use this until at least 9.1 (or at least, not without
patching the source).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] PG 9.0 release timetable
Robert Haas writes:
> Thoughts on a few of the remaining items:

> Should we revert the default output format for bytea to the old style
> before shipping 9.0.0? - Consensus seems to be "no", thus no action is
> required.

I think we should leave that there for awhile, though I agree it's
likely that the final decision will be "no change".

> don't rename index columns behavior has already broken JDBC - As I
> understand it, this is not a code issue, but just something that
> driver authors need to be aware of.

There had been a section on the page about information we needed to
communicate to third-party authors. Someone seems to have removed
that, but that seems like where this belongs.

> Crash in buildfarm for Mac OS X 10.6.3 - Consensus seems to be that
> the machine just ran out of disk space - not sure we need to do
> anything here.

It's a bit weird though, because UpdateControlFile should always update
in place; why would there be any risk of out of disk space? I would
like to find out exactly what happened, though I have no clear ideas
how to investigate it.

> move 'long long' check to c.h - Is this perhaps addressed by Michael
> Meskes commits on May 25th?

> Mergejoin null handling - I think this is done:

Yup, both done, I moved them.

			regards, tom lane
Re: [HACKERS] PG 9.0 release timetable
On Sat, May 29, 2010 at 5:09 PM, Robert Haas wrote:
> This is a really short list.

Thoughts on a few of the remaining items:

Type Mismatch Error in Set Returning Functions - tgl says this is a
deliberate change per link I just added to the wiki. Do we think more is
required here to prevent cranky users?

Should we revert the default output format for bytea to the old style
before shipping 9.0.0? - Consensus seems to be "no", thus no action is
required.

don't rename index columns behavior has already broken JDBC - As I
understand it, this is not a code issue, but just something that driver
authors need to be aware of.

Crash in buildfarm for Mac OS X 10.6.3 - Consensus seems to be that the
machine just ran out of disk space - not sure we need to do anything
here.

move 'long long' check to c.h - Is this perhaps addressed by Michael
Meskes commits on May 25th?

Mergejoin null handling - I think this is done:
http://archives.postgresql.org/pgsql-committers/2010-05/msg00332.php

Timeline for removal of older than 7.4 links to docs - link on the wiki
page is broken and this doesn't seem like a 9.0 issue anyway. Suggest we
remove it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Performance problem in textanycat/anytextcat
Robert Haas writes:
> This is still on the 9.0 open items list, but ISTM you fixed it with
> two commits on May 27th. Is that correct?

Oh, sorry, forgot to update the open items. Done now.

			regards, tom lane
Re: [HACKERS] PG 9.0 release timetable
On Sat, May 29, 2010 at 4:19 PM, Bruce Momjian wrote:
> Assuming we want to release Postgres 9.0 by mid-August, here is how the
> timetable would look:
>
> Need RC release to be stable for 1-2 weeks before final
> RC must be released by August 1
> Beta must be stable for 2-3 weeks before RC
> Stable beta must be released by early July
>
> So, we have 5-6 weeks to get a stable beta. Looking at the open issues:
>
> http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items#Resolved_Issues
>
> it looks like we are doing OK, but we must continue progressing.

This is a really short list. Several of these items have already been
fixed, and others have been discussed extensively and are just a
question of making a final decision. The thorniest question we have yet
to resolve is what to do about max_standby_delay - I think we need Tom
and Heikki to review this patch by Simon:

http://archives.postgresql.org/pgsql-hackers/2010-05/msg01666.php

The real question in terms of release, I think, is how long we want to
wait for more bugs to be found, and/or how much time we want to allow
for Tom and others to do further review of the code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] Performance problem in textanycat/anytextcat
On Mon, May 17, 2010 at 9:23 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Mon, May 17, 2010 at 4:01 PM, Tom Lane wrote:
>>> Perhaps this is a backpatchable bug fix. Comments?
>
>> I can't say whether this is safe enough to back-patch, but the way
>> this is set up, don't we also need to fix some catalog entries and, if
>> yes, isn't that problematic?
>
> The only catalog entries at issue, AFAICT, are the
> textanycat/anytextcat ones. I am not sure whether we should attempt to
> back-patch changes for them, but this patch wouldn't make the situation
> in the back branches worse. In particular, if we apply this patch but
> don't change the catalog entries, then nothing would change at all
> about the problematic cases, because the planner would decide it
> couldn't safely inline the function. The only cases where inlining will
> happen is where the expression's apparent volatility stays the same or
> decreases, so as far as that issue is concerned this patch will never
> make CREATE INDEX reject a case it would have accepted otherwise. The
> patch *will* make CREATE INDEX reject cases with volatile default
> arguments hiding under non-volatile functions, but that's got nothing
> to do with any built-in functions; and that's the case I claim is
> clearly a bug fix.

This is still on the 9.0 open items list, but ISTM you fixed it with
two commits on May 27th. Is that correct?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
[HACKERS] PG 9.0 release timetable
Assuming we want to release Postgres 9.0 by mid-August, here is how the
timetable would look:

	Need RC release to be stable for 1-2 weeks before final
	RC must be released by August 1
	Beta must be stable for 2-3 weeks before RC
	Stable beta must be released by early July

So, we have 5-6 weeks to get a stable beta. Looking at the open issues:

	http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items#Resolved_Issues

it looks like we are doing OK, but we must continue progressing.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
Jan Urbański writes:
> [ e of ] s/2 or s/3 look reasonable.

The examples in the LC paper seem to all use e = s/10. Note the stated
assumption e << s.

> So, should I just write a patch that sets the bucket width and pruning
> count using 0.07 as the assumed frequency of the most common word and
> epsilon equal to s/2 or s/3?

I'd go with s = 0.07 / desired-MCE-count and e = s / 10, at least for
a first cut to experiment with.

			regards, tom lane
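Worked through for a statistics target of K = 100 (a check of the arithmetic, not part of the original mail):

$$ s = \frac{0.07}{K} = 7\times 10^{-4}, \qquad e = \frac{s}{10} = 7\times 10^{-5}, \qquad w = \frac{1}{e} \approx 14286 $$

so the pruning step would run roughly every 14,000 lexemes, and with Jesper's sample (N ≈ 154 × 35017 ≈ 5.39M lexemes) the final cutoff f ≥ (s − e)·N would come out to about 3400 occurrences.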
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
On 29/05/10 17:34, Tom Lane wrote:
> Jan Urbański writes:
>> On 29/05/10 17:09, Tom Lane wrote:
>>> There is definitely something wrong with your math there. It's not
>>> possible for the 100'th most common word to have a frequency as high
>>> as 0.06 --- the ones above it presumably have larger frequencies,
>>> which makes the total quite a lot more than 1.0.
>
>> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
>> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014
>
> Um, apparently I can't do simple arithmetic first thing in the morning
> either, cause I got my number wrong too ;-)
>
> After a bit more research: if you use the basic form of Zipf's law
> with a 1/k distribution, the first frequency has to be about 0.07
> to make the total come out to 1.0 for a reasonable number of words.
> So we could use s = 0.07 / K when we wanted a final list of K words.
> Some people (including the LC paper) prefer a higher exponent, ie
> 1/k^S with S around 1.25. That makes the F1 value around 0.22 which
> seems awfully high for the type of data we're working with, so I think
> the 1/k rule is probably what we want here.

OK, I think we're getting somewhere :o)

I took the formula from Wikipedia's page on Zipf's law, assuming an
exponent of 1:

f(K) = 1 / (K * H(W))

where H(x) = 1 + 1/2 + 1/3 + ... + 1/x, and W is the number of words in
English.

Then I took the nth harmonic number expansion from the page on harmonic
numbers:

H(n) = ln(n) + 0.5772156649 + 1/2 * n^-1 - 1/12 * n^-2 + 1/120 * n^-4 + O(n^-6)

Assuming 1 million words in English and the big-O term in the harmonic
expansion to be negligible, we get H(1e6) = 14.3927, which would make
the frequency of the K'th word 1/(14.3927 * K), that is 0.06948 / K
(let's say 0.07 / K). Which brings me to the same result as yours, which
in turn reassures me a lot ;) My previous result was wrong because I
used the wrong logarithm base, go figure.

So with this, for a statistics target of 100 we would predict the
frequency of the 100th word to be 0.0007.

Assuming 154 * 35017 lexemes in the input, the bucket width and the
final pruning value depend only on the epsilon that we choose for the LC
algorithm. So, if we want e to be equal to s, we'd prune every 1/s =
1/0.0007 = 1428 lexemes and would not discard anything from the result.
If we want e to be s/2 we'd prune every 2857 lexemes and discard lexemes
with counts < 1887. For s/3, s/4 etc. the numbers look like this:

  e    prune every    discard below
 s/1       1428             0
 s/2       2857          1887
 s/3       4285          2516
 s/4       5714          2831
 s/5       7142          3019
 s/6       8571          3145
 s/7      10000          3235
 s/8      11428          3302
 s/9      12857          3355

s/2 or s/3 look reasonable.

So, should I just write a patch that sets the bucket width and pruning
count using 0.07 as the assumed frequency of the most common word and
epsilon equal to s/2 or s/3?

Cheers,
Jan
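The table rows follow from just two formulas, w = 1/e and cutoff = (s − e)·N. A toy C fragment (illustrative only — the variable names are invented, and it simply reproduces the numbers above) makes the relationship explicit:

#include <stdio.h>

int
main(void)
{
	double		s = 0.0007;			/* predicted frequency of the 100th word */
	double		N = 154.0 * 35017;	/* lexemes seen in the sample */

	for (int d = 1; d <= 9; d++)
	{
		double		e = s / d;						/* LC error bound */
		int			w = (int) (1.0 / e);			/* prune every w lexemes */
		int			cutoff = (int) ((s - e) * N);	/* keep f >= (s - e) * N */

		printf("e = s/%d: prune every %d, discard below %d\n", d, w, cutoff);
	}
	return 0;
}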
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
Jan Urbański writes:
> On 29/05/10 17:09, Tom Lane wrote:
>> There is definitely something wrong with your math there. It's not
>> possible for the 100'th most common word to have a frequency as high
>> as 0.06 --- the ones above it presumably have larger frequencies,
>> which makes the total quite a lot more than 1.0.

> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

Um, apparently I can't do simple arithmetic first thing in the morning
either, cause I got my number wrong too ;-)

After a bit more research: if you use the basic form of Zipf's law
with a 1/k distribution, the first frequency has to be about 0.07
to make the total come out to 1.0 for a reasonable number of words.
So we could use s = 0.07 / K when we wanted a final list of K words.
Some people (including the LC paper) prefer a higher exponent, ie
1/k^S with S around 1.25. That makes the F1 value around 0.22 which
seems awfully high for the type of data we're working with, so I think
the 1/k rule is probably what we want here.

			regards, tom lane
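Where the 0.07 comes from, spelled out (with W = 10^6 words and H_W the W'th harmonic number; a restatement of the arithmetic, not part of the original mail):

$$ \sum_{k=1}^{W} \frac{F_1}{k} = F_1 H_W \approx F_1 \left(\ln 10^6 + 0.5772\right) \approx 14.39\, F_1 = 1 \quad\Rightarrow\quad F_1 \approx 0.07 $$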
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
On 29/05/10 17:09, Tom Lane wrote:
> Jan Urbański writes:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>
>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will
>> most probably be stopwords, so we will never see them in the input.
>
>> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>
> There is definitely something wrong with your math there. It's not
> possible for the 100'th most common word to have a frequency as high
> as 0.06 --- the ones above it presumably have larger frequencies,
> which makes the total quite a lot more than 1.0.

Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

With regards to my other mail this means that top_stopwords = 10 and
error_factor = 10 would mean bucket_width = 7150 and a final prune value
of 6787.

Jan
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
Jan Urbański writes:
> Hm, I am now thinking that maybe this theory is flawed, because
> tsvectors contain only *unique* words, and Zipf's law is talking about
> words in documents in general. Normally a word like "the" would appear
> lots of times in a document, but (even ignoring the fact that it's a
> stopword and so won't appear at all) in a tsvector it will be present
> only once. This may or may not be a problem, not sure if such
> "squashing" of occurrences as tsvectors do skews the distribution away
> from Zipfian or not.

Well, it's still going to approach Zipfian distribution over a large
number of documents. In any case we are not really depending on Zipf's
law heavily with this approach. The worst-case result if it's wrong is
that we end up with an MCE list shorter than our original target. I
suggest we could try this and see if we notice that happening a lot.

			regards, tom lane
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
Jan Urbański writes:
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words H(W) is around 6.5. Let's assume the
> statistics target to be 100.

> I chose s as 1/(st + 10)*H(W) because the top 10 English words will
> most probably be stopwords, so we will never see them in the input.

> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

There is definitely something wrong with your math there. It's not
possible for the 100'th most common word to have a frequency as high
as 0.06 --- the ones above it presumably have larger frequencies,
which makes the total quite a lot more than 1.0.

For the purposes here, I think it's probably unnecessary to use the
more complex statements of Zipf's law. The interesting property is the
rule "the k'th most common element occurs 1/k as often as the most
common one". So if you suppose the most common lexeme has frequency
0.1, the 100'th most common should have frequency around 0.001. That's
pretty crude of course but it seems like the right ballpark.

			regards, tom lane
Re: [HACKERS] pg_trgm
Tatsuo Ishii writes:
> After thinking a little bit more, I think the following patch would not
> break existing behavior and would also handle the multibyte + C locale
> case. What do you think?

This is still ignoring the point: arbitrarily changing the module's
longstanding standard behavior isn't acceptable. You need to provide
a way for the user to control the behavior. (Once you've done that,
I think it can be just either "alnum" or "!isspace", but maybe some
other behaviors would be interesting.)

			regards, tom lane
Re: [HACKERS] pg_trgm
On Sat, May 29, 2010 at 9:13 AM, Tatsuo Ishii wrote:
> ! #define iswordchr(c) (lc_ctype_is_c()? \
> !     ((*(c) & 0x80)? !t_isspace(c) : (t_isalpha(c) || t_isdigit(c))) : \

Surely isspace(c) will always be false for non-ascii characters in C
locale?

Now it might be sensible to just treat any non-ascii character as a
word-character in addition to alphas and digits, so what might make
sense is

    (t_isalpha(c) || t_isdigit(c) || (lc_ctype_is_c() && (*(c) & 0x80)))

Though I wonder whether it wouldn't be generally more useful to users to
provide the non-space version as an option. I could see that being
useful for people in other circumstances aside from working around this
locale problem.

--
greg
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
On 29/05/10 12:34, Jesper Krogh wrote:
> On 2010-05-28 23:47, Jan Urbański wrote:
>> On 28/05/10 22:22, Tom Lane wrote:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>>
>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will
>> most probably be stopwords, so we will never see them in the input.
>>
> I think you should skip the assumption about stop-words, users may
> use something where they are needed in the index or have a language
> other than the typical. (And they don't seem to influence the math
> that much.)

Turns out it has nearly linear influence on the bucket width and the
frequency necessary to survive the final pruning. I put some data in a
spreadsheet, results below.

> Isn't it the same "type" of logic that is used for collecting
> statistics for "array-types", say integer-arrays and text arrays?

AFAIK statistics for everything other than tsvectors are built based on
the values of whole rows. ts_typanalyze is the only typanalyze function
that takes the trouble of looping over the actual contents of each cell;
all the others just compare whole arrays (which means that for a text[]
field you will probably get a quite useless MCV entry).

>> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>>
>> We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes
>>
> I'm not sure I get this one... does this mean that we prune every time
> we have collected 167 new datapoints? That would seem too often for me
> since that would roughly be once per "row".

Hm, if we pick s to be 0.06, we say that the K'th word in the English
language will have a frequency of 0.06, so if we want to have statistics
with an error of s/10, we can prune every 167 lexemes (K is the
statistics target, possibly + top_stopwords).

Hm, I am now thinking that maybe this theory is flawed, because
tsvectors contain only *unique* words, and Zipf's law is talking about
words in documents in general. Normally a word like "the" would appear
lots of times in a document, but (even ignoring the fact that it's a
stopword and so won't appear at all) in a tsvector it will be present
only once. This may or may not be a problem, not sure if such
"squashing" of occurrences as tsvectors do skews the distribution away
from Zipfian or not.

Anyway, figuring that out would require some more math and thinking, and
to fix the problem at hand we can say Zipf is good enough.

>> After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N
>>
>> So assuming that on average a tsvector has 154 elements and that we
>> went through 35017 rows (as it would be in Jesper's case, before he
>> raised the stats target from 100 to 1000), we will remove lexemes with
>> f < 0.054 * 35017 * 154, that is f < 291201.37
>>
>> I wonder what would happen in Jesper's case if we did that... And I
>> wonder how sound that maths is
>>
> If it means that I would get an accurate MCE-histogram for all
> things that have an occurrence of more than 5.4% of the rows
> (given the samples chosen), then I think that would be really
> reasonable.

Here's the spreadsheet spat out. The variables are:

* the statistics target
* top stopwords
* error factor

Where top stopwords is the number of top words in the English language
that would be stopwords. You can also think about it as the smudge
factor determining how well we trust that the distribution is Zipfian.
Theoretically, if you want to keep X values in the MCE array, you should
discard inputs with frequency lower than the frequency of the X'th value
in a Zipfian distribution. If you would write out all English words and
their frequencies (according to Zipf's law), the top Y of them would be
stopwords. We want to discard words with frequency that's lower than
X + Y, and then we probably want to have some breathing space as well.
That cutoff frequency is called s in the LC algorithm.

The error factor determines the relation between s and e, since
apparently we want e to be proportional to s (e is the error from the LC
algorithm). It directly determines the bucket width, since the larger
the bucket, the more accurate the results will be, as there will be less
pruning going on.

There are also constants: H(len(eng)) is the harmonic number from Zipf's
law, which, assuming 1e6 words in English, is 6.5. tsvector length and
rows in sample are just some values to get concrete numbers out. They
influence the final pruning frequency, because the rule is f < (s-e)N,
and N is the total number of lexemes seen.

The results are attached in a text (CSV) file, to preserve formatting.
Based on them I'd like to propose top_stopwords and error_factor to be
100. With your dataset this would mean pruning every 3076 lexemes and
discarding from the result all lexemes with < 173507 occurrences. With
statistics target set to 1000 it would change to 16923 and 31546,
respectively.

> I can "fairly easy" try out patches or do other kind of testing.
[HACKERS] Statistics for tsvector "wildcards". term*
Hi.

There seems to be an "unimplemented" area around getting statistics for
wildcard searches. Wildcards anchored to the left can be matched up by
the gin index and the ts_match_vq operator:

testdb=# select to_tsvector('project') @@ to_tsquery('proj:*');
 ?column?
----------
 t
(1 row)

Searching for "projects" gives me this estimate:

testdb=# explain select id,document_tsvector from efam.reference where
document_tsvector @@ to_tsquery('projects') order by id limit 50;
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3008.54 rows=50 width=76)
   ->  Index Scan using reference_pkey on reference  (cost=0.00..3762544.72 rows=62531 width=76)
         Filter: (document_tsvector @@ to_tsquery('projects'::text))
(3 rows)

whereas searching for proj:* gives:

testdb=# explain select id,document_tsvector from efam.reference where
document_tsvector @@ to_tsquery('proj:*') order by id limit 50;
                                          QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit  (cost=73.56..73.58 rows=6 width=76)
   ->  Sort  (cost=73.56..73.58 rows=6 width=76)
         Sort Key: id
         ->  Bitmap Heap Scan on reference  (cost=34.55..73.49 rows=6 width=76)
               Recheck Cond: (document_tsvector @@ to_tsquery('proj:*'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..34.54 rows=6 width=0)
                     Index Cond: (document_tsvector @@ to_tsquery('proj:*'::text))
(7 rows)

There are obvious challenges in getting statistics for submatches where
there is no real information in the pg_stats table, but there will also
be a huge number of cases where a fairly reliable guess can be
extracted.

--
Jesper
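One plausible shape for such a guess (a hypothetical sketch, not anything that exists in the planner: all names here are invented) is to sum the frequencies of the most-common-element entries whose lexemes share the prefix, instead of falling back to the default row estimate:

/*
 * Hypothetical estimator sketch: estimate the selectivity of a
 * left-anchored pattern such as 'proj:*' from a column's
 * most-common-element statistics.
 */
#include <string.h>

double
prefix_mce_selectivity(const char *prefix,
					   const char *const *mce_lexemes,
					   const float *mce_freqs,
					   int n_mce)
{
	size_t		plen = strlen(prefix);
	double		sel = 0.0;

	for (int i = 0; i < n_mce; i++)
	{
		if (strncmp(mce_lexemes[i], prefix, plen) == 0)
			sel += mce_freqs[i];
	}
	/* Some allowance for matches outside the MCE list would be needed. */
	return sel;
}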
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
On 2010-05-28 04:47, Tom Lane wrote:
> Cranking up the stats target actually makes it worse not better, since
> low-frequency items are then more likely to get into the MCV list

I should have been more precise in the wording. Cranking up the stats
target gave me an overall "better plan", but that is because the range
in the MCE histogram where the query plan for my sample query tipped
from a "Bitmap Index Scan" on the gin index to an "Index Scan" on a
btree index actually became reliable. This is due more to the nature of
my application and test queries than to the correctness of the MCE
histogram.

So cranking up the statistics target moved the problem to somewhere it
didn't matter that much to me.

--
Jesper
Re: [HACKERS] tsvector pg_stats seems quite a bit off.
On 2010-05-28 23:47, Jan Urbański wrote:
> On 28/05/10 22:22, Tom Lane wrote:
>> The idea that I was toying with is to assume a Zipfian distribution of
>> the input (with some reasonable parameter), and use that to estimate
>> what the frequency of the K'th element will be, where K is the target
>> number of MCV entries or perhaps a bit more. Then use that estimate as
>> the "s" value, and set e = s/10 or so, and then w = 1/e and continue
>> as per the paper. If the eventual filtering results in a lot less than
>> the target number of MCV entries (because the input wasn't so
>> Zipfian), we lose, but at least we have accurate numbers for the
>> entries we kept.
>
> I see what you mean, so the idea would be:
>
> * assume some value of W as the number of all words in the language
> * estimate s as 1/(st + 10)*H(W), where H(W) is the W'th harmonic
>   number and st is the statistics target, using Zipf's law
> * set e = s/10 and w = 1/e, that is 10/s
> * perform LC using that value of w
> * remove all elements for which f < (s-e)N, that is f < 0.9*sN, where
>   N is the total number of lexemes processed
> * create the MCELEM entries as (item, f/N)
>
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words H(W) is around 6.5. Let's assume the
> statistics target to be 100.
>
> I chose s as 1/(st + 10)*H(W) because the top 10 English words will
> most probably be stopwords, so we will never see them in the input.

I think you should skip the assumption about stop-words; users may use
something where they are needed in the index, or have a language other
than the typical. (And they don't seem to influence the math that much.)

Isn't it the same "type" of logic that is used for collecting statistics
for "array-types", say integer-arrays and text arrays?

> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>
> We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes

I'm not sure I get this one... does this mean that we prune every time
we have collected 167 new datapoints? That would seem too often for me,
since that would roughly be once per "row".

> After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N
>
> So assuming that on average a tsvector has 154 elements and that we
> went through 35017 rows (as it would be in Jesper's case, before he
> raised the stats target from 100 to 1000), we will remove lexemes with
> f < 0.054 * 35017 * 154, that is f < 291201.37
>
> I wonder what would happen in Jesper's case if we did that... And I
> wonder how sound that maths is

If it means that I would get an accurate MCE-histogram for all things
that have an occurrence of more than 5.4% of the rows (given the samples
chosen), then I think that would be really reasonable.

I can "fairly easy" try out patches or do other kind of testing.

--
Jesper
Re: [HACKERS] pg_trgm
> > It's not a practical solution for people working with prebuilt
> > Postgres versions, which is most people. I don't object to finding a
> > way to provide a "not-space" behavior instead of an "is-alnum"
> > behavior, but as noted upthread a GUC isn't the right way. How do you
> > feel about a new set of functions with an additional flag argument of
> > some sort?
>
> Let me see how many functions we need to create...

After thinking a little bit more, I think the following patch would not
break existing behavior and would also handle the multibyte + C locale
case. What do you think?

*** trgm_op.c~	2009-06-11 23:48:51.0 +0900
--- trgm_op.c	2010-05-29 17:07:28.0 +0900
***************
*** 59,65 ****
  }

  #ifdef KEEPONLYALNUM
! #define iswordchr(c)	(t_isalpha(c) || t_isdigit(c))
  #else
  #define iswordchr(c)	(!t_isspace(c))
  #endif
--- 59,67 ----
  }

  #ifdef KEEPONLYALNUM
! #define iswordchr(c)	(lc_ctype_is_c()? \
! 			((*(c) & 0x80)? !t_isspace(c) : (t_isalpha(c) || t_isdigit(c))) : \
! 			(t_isalpha(c) || t_isdigit(c)))
  #else
  #define iswordchr(c)	(!t_isspace(c))
  #endif