2009/2/5 ChrisK <[email protected]>: > Eugene Kirpichov wrote: >>> >>> All in all, my question remains: what is the fastest way to do this >>> kind of parsing on a lazy bytestring? >>> > > Your example regular expression works the same in both Posix and Perl-ish > semantics. Do you know the difference? Posix libraries look for the > longest match of all possible matches. Perl-ish is left-bias and looks at > the left branch first and only looks at the right branch when the left fails > and it has to backtrack.
Thanks, I didn't know that about posix. > > So the "++" operator is a hack to try and control the backtracking of Perl > regular expressions. Such a things has no meaning in Posix where the > implementation details are totally different. > That's precisely what I put it in for :) > You might try this variant of the example pattern: > /foo.xml.*fooid=([0-9]+)[^0-9].*barid=([0-9]+) > > The [^0-9] can be used when you know that there is at least one junk > character before the barid, which I suspect will always occur in a URL. > > I expect regex-posix to be slower than regex-pcre. I have not used the new > pcre-light. I wrote regex-tdfa — it is pure haskell and not a C library > wrapper. There are patterns where regex-pcre will backtrack and take > exponentially more time than regex-tdfa's automaton (which is not yet ideal > and may get faster). You're right :) > > So what is the lazy bytestring with its multiple buffers doing for you when > using PCRE, PCRE-light, or regex-posix? Absolutely nothing. To run against > these C libraries the target text is converted to a single buffer, i.e. a > CStringLen in Haskell. Thus it is morally converted into a strict > bytestring. This may involve copying the logfile into a NEW strict > bytestring EVERY TIME you run a match. Please Please Please convert to a > strict bytestring and then run regex-pcre or pcre-light (or any of the > others). > I'm running the regex on lines of the logfile, not on the whole logfile itself; and I'm running it only once per line, so that doesn't matter much. Actually, that's what I did in the initial version of the program (strictify lines before matching), but then I experimentally got rid of the extra conversion function and performance didn't change at all (anyways, someone was converting that string exactly once, be it me or the package), so I left it that way. > regex-tdfa does not convert it into a strict bytestring, but is otherwise > much slower than pcre for your simple pattern. > > As for regex-pcre's interface....you should use the API in regex-base to get > a pure interface. The RegexLike functions are the pure interface for this, > and the RegexContext class offers a slew of instances with useful variants. Thanks, I didn't know that. > But if you have been getting to the "low level IO API" in regex-pcre then > you probably do not need or want the RegexContext transformations. Is that because of their performance? > > And BoyerMoore (which I think I helped optimize): this may be faster because > it does not copy your whole Lazy bytestring into a Strict ByteString for > each search. But you may wish to test it with a Strict ByteString as input > anyway. > > -- > Chris > _______________________________________________ Haskell-Cafe mailing list [email protected] http://www.haskell.org/mailman/listinfo/haskell-cafe
