Hi Daniel,

Thanks for presenting your interesting use case.
Your solution looks clever. Would you mind sharing some sample data with us
(an XML document and a string dictionary)?

Cheers,
Christian

Zimmel, Daniel <d.zim...@esvmedien.de> wrote on Wed, 29 Dec 2021, 12:13:

> Hi,
>
> Recently I ran into serious (as in SERIOUS) performance trouble with
> expensive regexes, and no wonder why.
>
> Here is my scenario:
>
> * XML text documents with a total of 1 million text nodes
> * a regex search string built from a huge string dictionary of 50,000
>   strings (some of them containing 50 words each)
> * a match must be wrapped in an element (= create a link)
>
> I could drink many cups of tea while waiting for the regex to complete...
> hours... I ran out of tea!
>
> Then I found the 2021 Balisage paper by Mary Holstege titled "Fast Bulk
> String Matching" [1], in which she explores the Aho-Corasick algorithm and
> implements it in XQuery - marvellous! Following this approach, I can build
> a data structure that gives me fast lookups, but building that structure
> is still very slow because of the amount of text it has to be built from.
> So this was not fast enough for my use case - or I may simply not be smart
> enough to apply it correctly :-|
>
> So I tried tinkering with maps, which turned out to give me extraordinary
> performance gains:
>
> * build a map from the string dictionary
> * walk through all text nodes one by one
> * for each text node, put every combination of adjacent words, in word
>   order, into another map (I need to find strings, not words)
> * strip punctuation (for word boundaries) and normalize whitespace in
>   both maps
> * compare the keys of both maps
> * give the new, reduced string dictionary to the regular regex search
>
> While comparing the maps, I do not know where in the text my strings are,
> but at least I know whether they are in there - to find out where exactly
> (and how they fit my other regex context) I can then use a massively
> reduced regular regex search. Fast!
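The word-combination trick quoted above (intersect a dictionary with every run of adjacent words from a text node) can be sketched outside XQuery. Here is a minimal Python sketch; the function names and sample data are mine, not from the thread:

```python
import re


def build_dictionary_set(strings_text):
    # The dictionary file is one pipe-separated list of search strings.
    return set(strings_text.split("|"))


def contiguous_phrases(text):
    # Every combination of adjacent words, in word order,
    # after whitespace normalization (as in the quoted steps).
    words = re.sub(r"\s+", " ", text).strip().split(" ")
    phrases = set()
    for start in range(len(words)):
        for length in range(1, len(words) - start + 1):
            phrases.add(" ".join(words[start:start + length]))
    return phrases


dictionary = build_dictionary_set("foo bar|baz qux|nope")
text = "some foo bar in a sentence"

# Set intersection plays the role of comparing the two maps' keys:
# it tells us WHICH dictionary strings occur, not WHERE they occur.
hits = dictionary & contiguous_phrases(text)
# hits == {"foo bar"}
```

The intersection is a hash lookup per phrase, which is why this scales so much better than matching one giant alternation against every text node.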
> I am quite happy, BUT I still cannot understand why this is so much
> faster for my sample data:
>
> * plain method: 51323.72 ms
> * maps method: 597.94 ms (!)
>
> Is this due to optimizations done by BaseX, or have I simply discovered
> the power of maps?
> How would you do it? Is there still room for improvement?
> Why does Aho-Corasick not help much with this scenario? Is it because the
> list of strings is simply too massive?
> Why is this so much faster with text-splitting-to-map?
>
> See below for the query examples to better understand what I am trying to
> do (bulk data not included) [2],[3].
> There is no normalization of punctuation in the examples, but that is
> only needed for completeness.
>
> Best, Daniel
>
> [1] http://www.balisage.net/Proceedings/vol26/print/Holstege01/BalisageVol26-Holstege01.html
>
> [2] plain method
>
> let $textnodes :=
>   fetch:xml(file:base-dir() || file:dir-separator() || 'example.xml')//text()
> let $strings :=
>   file:read-text(file:base-dir() || file:dir-separator() || 'strings.txt')
> let $regex := $strings
> for $textnode in $textnodes
> where matches($textnode, $regex)
> return $textnode
>
> [3] maps method
>
> (:~
>  : Create a map from the string dictionary
>  : (one entry per pipe-separated string)
>  :
>  : @param $string the pipe-separated string dictionary
>  : @return map
>  :)
> declare function local:createMapFromString(
>   $string as xs:string
> ) as map(*) {
>   map:merge(
>     for $token in tokenize($string, '\|')
>     return map:entry($token, $token),
>     map { 'duplicates': 'use-first' }
>   )
> };
>
> (:~
>  : Create a map from the text nodes:
>  : write every combination of adjacent words, in word order, to the map
>  :
>  : @param $textnodes
>  : @return map
>  :)
> declare function local:createMapFromTextnodes(
>   $textnodes as xs:string+
> ) as map(*) {
>   map:merge(
>     for $node in $textnodes
>     let $tokens := tokenize(normalize-space($node), ' ')
>     for $start in 1 to count($tokens)
>     for $length in 1 to count($tokens) - $start + 1
>     return map:entry(
>       string-join(subsequence($tokens, $start, $length), ' '), 'x')
>   )
> };
>
> (:~
>  : Compare two maps: return the values of $map1 whose keys
>  : also occur in $map2
>  :
>  : @param $map1
>  : @param $map2
>  : @return xs:string*
>  :)
> declare function local:reduceMaps(
>   $map1 as map(*),
>   $map2 as map(*)
> ) as xs:string* {
>   for $key in map:keys($map1)
>   where map:contains($map2, $key)
>   return map:get($map1, $key)
> };
>
> let $textnodes :=
>   fetch:xml(file:base-dir() || file:dir-separator() || 'example.xml')//text()
> let $strings :=
>   file:read-text(file:base-dir() || file:dir-separator() || 'strings.txt')
>
> let $map_words := local:createMapFromString($strings)
> let $map_textnodes := local:createMapFromTextnodes($textnodes)
> let $matches := local:reduceMaps($map_words, $map_textnodes)
>
> let $regex := string-join(distinct-values($matches), '|')
>
> for $textnode in $textnodes
> where matches($textnode, $regex)
> return $textnode
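For readers outside the thread, the final step of the maps method, running the massively reduced alternation over the text nodes, can be sketched in Python as well. Data and names below are invented; note the `re.escape` call, which the quoted XQuery examples do not have and which matters as soon as dictionary strings contain regex metacharacters:

```python
import re

# Hypothetical reduced dictionary: strings already confirmed (via the map
# intersection) to occur somewhere in the document.
matches = ["foo bar", "baz qux"]

# Escape each string so metacharacters are matched literally, and put
# longer strings first so the alternation prefers the longest match.
regex = re.compile(
    "|".join(re.escape(m) for m in sorted(matches, key=len, reverse=True))
)

textnodes = ["some foo bar here", "nothing to see", "baz qux!"]
hits = [t for t in textnodes if regex.search(t)]
# hits == ["some foo bar here", "baz qux!"]
```

Because the alternation now contains only strings known to occur, it stays small even when the original dictionary has 50,000 entries, which is where most of the speed-up in the quoted timings comes from.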