On Thu, Mar 14, 2013 at 9:40 PM, Alexander Korotkov <aekorot...@gmail.com>wrote:
> On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > >> Heikki Linnakangas <hlinnakan...@vmware.com> writes: >> > On 23.01.2013 09:36, Alexander Korotkov wrote: >> >> On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane<t...@sss.pgh.pa.us> wrote: >> >>> The biggest problem is that I really don't care for the idea of >> >>> contrib/pg_trgm being this cozy with the innards of regex_t. >> >> >> The only option I see now is to provide a method like "export_cnfa" >> which >> >> would export corresponding CNFA in fixed format. >> >> > Yeah, I think that makes sense. The transformation code in trgm_regexp.c >> > would probably be more readable too, if it didn't have to deal with the >> > regex guts representation of the CNFA. Also, once you have intermediate >> > representation of the original CNFA, you could do some of the >> > transformation work on that representation, before building the >> > "tranformed graph" containing trigrams. You could eliminate any >> > non-alphanumeric characters, joining states connected by arcs with >> > non-alphanumeric characters, for example. >> >> It's not just the CNFA though; the other big API problem is with mapping >> colors back to characters. Right now, that not only knows way too much >> about a part of the regex internals we have ambitions to change soon, >> but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of >> which should be known to the regex library IMO. So I'm not sure how we >> divvy that up sanely. To be clear: I'm not going to insist that we have >> to have a clean API factorization before we commit this at all. But it >> worries me if we don't even know how we could get to that, because we >> are going to need it eventually. >> > > Now I have following idea about API. > Put code of stage 2 (transform the original CNFA into an automaton-like > graph) into regex engine. It would use API which describes what exactly are > we going to extract from CNFA. This API could look like this. > > typedef char *Prefix; > typedef char *ArcLabel; > > typedef struct > { > Prefix newPrefix; > ArcLabel label; > } ArcInfo; > > typedef struct > { > Prefix (*getInitialPrefix) (); > bool (*prefixContains) (Prefix prefix1, Prefix prefix2); > Prefix * (*getPrefixes) (Prefix prefix, color c, int *n); > ArcInfo * (*getArcs) (Prefix prefix, color c, int *n); > void (*freePrefix) (Prefix prefix); > void (*freeArcLabel) (ArcLabel arcLabel); > } CFNATransformAPI; > > getInitialPrefix returns initial prefix value like now this code does: > > initkey.prefix.colors[0] = UNKNOWN_COLOR; > > initkey.prefix.colors[1] = UNKNOWN_COLOR; > prefixContains are exactly same as function with this name. > getPrefixes and getArcs cycle step work of addKeys an addArcs. > freePrefix and freeArcLabel frees used memory of Prefix and ArcLabel > strutures. > > Additionally regex engine should provide correct way to examine colormap. > int getColorCharsCount(colormap *cm, color c); > pg_wchar *getColorChars(colormap *cm, color c); > getColorCharsCount would return -1 if this color should be considered as > unexpandable. > Now I have working implemetation of this API. Comments still need rework. Could you give me any feedback? ------ With best regards, Alexander Korotkov.
trgm-regexp-0.13.patch.gz
Description: GNU Zip compressed data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers