On Mon, Dec 3, 2012 at 2:05 PM, Heikki Linnakangas <hlinnakan...@vmware.com>wrote:
> On 02.12.2012 20:19, Tom Lane wrote: > >> Alexander Korotkov<aekorot...@gmail.com> writes: >> >>> Nice idea to delay expanding colors to characters! Obviously, we should >>> delay expanding inly alphanumerical characters. Because >>> non-alphanumberical >>> characters influence graph structure. Trying to implement... >>> >> >> Uh, why would that be? Colors are colors. The regexp machinery doesn't >> care whether they represent alphanumerics or not. (Or to be more >> precise, if there is a situation where it makes a difference, separate >> colors will have been created for each set of characters that need to be >> distinguished.) >> > > The regexp machinery doesn't care, but the trigrams that pg_trgm extracts > only contain alphanumeric characters. So if by looking at the CNFA graph > produced by the regexp machinery you conclude that any matching strings > must contain three-letter sequences "%oo" and "#oo", you can just club them > together into " oo" trigram. > > I think you can run a pre-processing step to the colors, and merge colors > that are equivalent as far as trigrams are considered. For example, if you > have a color that contains only character '%', and another that contains > character '#', you can treat them as the same hcolor. You might then be > able to simplify the CNFA. Actually, it would be even better if you could > apply the pre-processing to the regexp before the regexp machinery turns it > into a CNFA. Not sure how easy it would be to do such pre-processing. > Treating colors as same should be possible only for colors which has no alphanumeric characters, because colors are non-overlapping. However, this optimization could be significant in some cases. BTW, why create the path matrix? You could check the "check" array of > trigrams in the consistent function directly against the graph. Consistent > should return true, iff there is a path through the graph following only > arcs that contain trigrams present in the check array. Finding a path > through a complex graph could be expensive, O(|E|), but if the path is > complex, the path matrix would be large as well, and checking against a > large matrix isn't exactly free either. It would allow you to avoid > "overflows" caused by having too many paths through the graph. Actually, I generally dislike path matrix for same reasons. But: 1) Output graphs could contain trigrams which are completely useless for search. For example, for regex /(abcdefgh)*ijk/ we need only "ijk" trigram while graph would contain much more.Path matrix is a method to get rid of all of them. 2) If we use color trigrams then we need some criteria for which color trigrams to expand into trigrams. Simultaneously, we shouldn't allow path from initial state to the final by unexpanded trigrams. It seems much harder to do with graph than with matrix. ------ With best regards, Alexander Korotkov.