Congratulations! This is really cool. And this also made me find a bug in GHC :)
http://hackage.haskell.org/trac/ghc/ticket/3059 2009/3/2 ChrisK <hask...@list.mightyreason.com>: > Announcing the version 1.0.0 release of regex-tdfa. > > I am proud of this release. > This is not just a bug fix release. > It is a serious improvement in the asymptotic running time. > > The previous versions allowed bad combinations of pattern and searched > text length to scale badly in the length of the text. Previously the > worst case for text of length N was O(N^3). > > The new worst case asymptotic runtime scaled as O(N). > There is never any backtracking. > And the worst case storage space is independent of N. > > The package is on hackage at > http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa > The darcs repository is at > http://darcs.haskell.org/packages/regex-unstable/regex-tdfa/ > > All non-overlapping matches are found and returned, along with all > captured (parenthesized) subexpressions. The result is precisely what > Posix extended regular expressions are supposed to return. > > To be concrete, consider example text with length of N of (2^n)+2: > >> longInput = replicate (2^n) 'a' ++ "bc" > > And define 4 worst-case-scenario patterns. I wrote the code and I > know how to kill it: > >> wcs0 = "a*b" >> wcs1 = "a*c" >> wcs2 = "(a|aa)*b" >> wcs3 = "(a|aa)*c" > > wcs0 is easy. > wcs1 causes the old code to backtrack. > wcs2 causes the old code's storage to scale as O(N). > wcs3 causes both backtracking and O(N) storage with the old code. > > The old code's time scales as O(N) for wcs0, O(N^2) for wcs1 and wcs2, > and O(N^3) for wcs3. The new code is always O(N). The actual timings > for the old code on my G4 laptop for wcs on 2^8 and 2^9 and 2^10 are: > > Reason:compare-tdfa chrisk$ time ./Test-TDFA-np wcs3 8 +RTS -sstderr 2>&1 | > head -4 > ./Test-TDFA-np wcs3 8 +RTS -sstderr > Test for [Char] > Right [array (0,1) [(0,(257,1)),(1,(-1,0))]] > 263,776,756 bytes allocated in the heap > user 0m1.017s > sys 0m0.058s > > Reason:compare-tdfa chrisk$ time ./Test-TDFA-np wcs3 9 +RTS -sstderr 2>&1 | > head -4 > ./Test-TDFA-np wcs3 9 +RTS -sstderr > Test for [Char] > Right [array (0,1) [(0,(513,1)),(1,(-1,0))]] > 1,998,647,452 bytes allocated in the heap > user 0m7.083s > sys 0m0.289s > > Reason:compare-tdfa chrisk$ time ./Test-TDFA-np wcs3 10 +RTS -sstderr 2>&1 > | head -4 > ./Test-TDFA-np wcs3 10 +RTS -sstderr > Test for [Char] > Right [array (0,1) [(0,(1025,1)),(1,(-1,0))]] > 15,653,277,200 bytes allocated in the heap > user 0m53.350s > sys 0m2.056s > > The doubling of length is causing a nearly eight-fold time increase. > The heap allocation is also increasing nearly eight-fold. > > The new code with the same input pattern and texts gives: > > Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 8 +RTS > -sstderr 2>&1 | head -4 > ./Test-TDFA2-0.99.19-np2 wcs3 8 +RTS -sstderr > Test for [Char] > Right [array (0,1) [(0,(257,1)),(1,(-1,0))]] > 2,135,324 bytes allocated in the heap > user 0m0.017s > sys 0m0.017s > > Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 9 +RTS > -sstderr 2>&1 | head -4 > ./Test-TDFA2-0.99.19-np2 wcs3 9 +RTS -sstderr > Test for [Char] > Right [array (0,1) [(0,(513,1)),(1,(-1,0))]] > 3,588,656 bytes allocated in the heap > user 0m0.024s > sys 0m0.017s > > Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 10 +RTS > -sstderr 2>&1 | head -4 > ./Test-TDFA2-0.99.19-np2 wcs3 10 +RTS -sstderr > Test for [Char] > Right [array (0,1) [(0,(1025,1)),(1,(-1,0))]] > 6,345,436 bytes allocated in the heap > user 0m0.038s > sys 0m0.018s > > Note that the heap allocation for the 1026 character example above is > 2466 times less than the old code. > > That was too fast to prove the scaling, so take more input: > > Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 20 +RTS > -sstderr 2>&1 | head -4 > ./Test-TDFA2-0.99.19-np2 wcs3 20 +RTS -sstderr > Test for [Char] > Right [array (0,1) [(0,(1048577,1)),(1,(-1,0))]] > 5,708,574,848 bytes allocated in the heap > user 0m26.023s > sys 0m0.985s > > Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 21 +RTS > -sstderr 2>&1 | head -4 > ./Test-TDFA2-0.99.19-np2 wcs3 21 +RTS -sstderr > Test for [Char] > Right [array (0,1) [(0,(2097153,1)),(1,(-1,0))]] > 11,416,354,056 bytes allocated in the heap > user 0m52.656s > sys 0m1.985s > > The length and time both doubled, as did the heap allocation. And the > new code has searched two million characters in the time the old code > searched one thousand. > > How about away from the worst case scenario? On the testing suite the > new code is running slightly slower: > > Reason:compare-tdfa chrisk$ time ./Test-TDFA-np -r 1 100 > /dev/null > user 0m4.841s > sys 0m3.019s > > Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 -r 1 100 > > /dev/null > user 0m5.970s > sys 0m3.012s > > So that is an increase of execution time of 14%. This small dip in > performance might be reclaimable with more optimization. I think the > gain in worst case performance already offsets the slight cost. > > The code for String is complete. The strict and lazy bytestrings and > the (Seq Char) are currently using the String code for matching. This > will be improved in a future release. > > Cheers, > Chris > > The small print: regex-tdfa will still behave badly in space and time > if given a pathological pattern, > e.g. "(((a|aa){0,100}){0,100}){0,100}". But, I am hopeful than I can > improve regex-tdfa to behave well with the patterns like this one. > That is my vague goal for a future version 2.0.0 release. > > The very small print: I have been using ghc-6.10.1, and I think I have > carried over cabal switches to make ghc-6.8.3 also work. The library > probably can work with ghc-6.6, but the cabal file will need editing > first as well as fixes to "Text.Regex.TDFA.NewDFA.copySTU". > > _______________________________________________ > Haskell-Cafe mailing list > Haskell-Cafe@haskell.org > http://www.haskell.org/mailman/listinfo/haskell-cafe > -- Eugene Kirpichov Web IR developer, market.yandex.ru _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe