Catching up late, but thank you for your thoughts. I have grown so fond of using maps in XQuery that I now use them all the time, even with tiny data sets. I have learned that, in combination with a large number of long search strings, they can speed things up a lot where a regular expression can only go so far.
Thanks!

-----Original Message-----
From: Christian Grün <christian.gr...@gmail.com>
Sent: Friday, January 7, 2022 18:49
To: Zimmel, Daniel <d.zim...@esvmedien.de>
Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Questions about the performance of bulk string matching

Hi Daniel,

Thanks for your code. My reply is a fairly general one and not specific to XQuery: Regular expressions are very powerful, but they can get fairly expensive, so it's always a good idea to simplify or optimize them before running them on large datasets. Here is an example of a regex that looks fairly simple but takes many seconds to execute:

  matches( string-join((1 to 100000) ! 'x'), '.*y' )

Since matches() is unanchored, this expression can be sped up a lot by dropping the leading '.*' and simply using 'y' as the pattern. BaseX passes most regex strings on to Java without further changes. If an expression takes a long time to execute, chances are high that it's due to the JDK.

Hope this helps,
Christian

On Tue, Jan 4, 2022 at 3:50 PM Zimmel, Daniel <d.zim...@esvmedien.de> wrote:
>
> Hi Christian,
>
> thanks for your interest, please find attached a complete example. It is all about the length of the strings in my dictionary, really.
>
> test-reduce.xq: 170.82ms
> test.xq: 6572.49ms
>
> I noticed that if the data is larger, it is better performance-wise to build a comparison map for each text node (not as in the example).
> On the other hand, if the strings are short, performance might even be worse.
>
> But with my use case, I have to search for complete strings like "Erste Durchführungsverordnung zur Verordnung über Luftfahrtpersonal (Anwendungsbestimmungen zur Lizenzierung von Piloten Flugzeuge, von Piloten Hubschrauber, von Flugingenieuren, Freiballonführern, Flugdienstberatern und Flugtechnikern auf Hubschraubern bei den Polizeien des Bundes und der Länder)"; this does not work too well with bulk matches/replace.
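[Editor's note: Christian's point about the leading '.*' can be illustrated outside XQuery as well. Below is a small Python sketch (illustrative only; `re.search` is unanchored, just like `fn:matches`):]

```python
import re

# Haystack in the spirit of Christian's example: many 'x' characters, no 'y'.
haystack = "x" * 100_000

# fn:matches() in XQuery, like re.search() in Python, is unanchored: it looks
# for the pattern anywhere in the string. A leading '.*' therefore changes
# nothing about whether a string matches, but it can force a backtracking
# engine to re-scan the tail of the input from every start position.
for text in ("xxyxx", "no y here", ""):
    assert bool(re.search(r".*y", text)) == bool(re.search(r"y", text))

# The simplified pattern answers the same question in a single linear scan.
assert re.search(r"y", haystack) is None
print("'.*y' and 'y' are equivalent under unanchored matching")
```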
> So while I am quite happy with the results, I do not understand why this works so fast, or whether there is any feasible alternative to this approach (I get the feeling there is still much for me to learn in the mysteries of XQuery ;-)
> Does this work because BaseX simply handles it well? Could match/replace be optimized in some other way to handle bulk searches that include long strings?
>
> Daniel
>
> From: Christian Grün <christian.gr...@gmail.com>
> Sent: Thursday, December 30, 2021 20:15
> To: Zimmel, Daniel <d.zim...@esvmedien.de>
> Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
> Subject: Re: [basex-talk] Questions about the performance of bulk string matching
>
> Hi Daniel,
>
> Thanks for presenting your interesting use case.
>
> Your solution looks clever. Would you mind sharing some sample data with us (an XML document and a string dictionary)?
>
> Cheers,
> Christian
>
> Zimmel, Daniel <d.zim...@esvmedien.de> wrote on Wed, Dec 29, 2021, 12:13:
>
> Hi,
>
> recently I ran into serious (as in SERIOUS) performance trouble with expensive regexes, and no wonder why. Here is my scenario:
>
> * XML text documents with a total of 1m text nodes
> * a regex search string built from a huge dictionary of 50,000 strings (some of them containing 50 words each)
> * each match must be wrapped in an element (= create a link)
>
> I could drink many cups of tea while waiting for the regex to complete... hours... I ran out of tea!
>
> Then I found the 2021 Balisage paper by Mary Holstege titled "Fast Bulk String Matching" [1], in which she explores the Aho-Corasick algorithm and implements it in XQuery, marvellous! Following this, while I can build a data structure that gives me fast results, building that structure is still very slow due to the amount of text to build it from.
> So this was not fast enough for my use case, or I may simply not be smart enough to apply it correctly :-|
>
> So I tried tinkering with maps, which turned out to give me extraordinary performance gains:
>
> * build a map from the string dictionary
> * walk through all text nodes one by one
> * for each text node, put every combination of words in the text node, in word order (I need to find strings, not words), into another map
> * strip punctuation (for word boundaries) and normalize whitespace in both maps
> * compare the keys of both maps
> * feed the reduced string dictionary to the regular regex search
>
> While comparing the maps, I do not know where in the text my strings are, but at least I know whether they are in there; to find out where exactly (and how they fit my other regex context), I can then use a massively reduced regular regex search. Fast!
>
> I am quite happy, BUT I still cannot understand why this is so much faster for my sample data:
>
> * plain method: 51323.72ms
> * maps method: 597.94ms (!)
>
> Is this due to optimizations done by BaseX, or have I simply discovered the power of maps?
> How would you do it? Is there still room for improvement?
> Why does Aho-Corasick not help much with this scenario? Is it because the list of strings is simply too massive?
> Why is this so much faster with text-splitting-to-map?
>
> See below for the query examples to better understand what I am trying to do (bulk data not included) [2], [3]. There is no normalization of punctuation in the examples, but that is only necessary for completeness.
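[Editor's note: the map-based reduction described above can be sketched in a few lines of Python (names and sample data are illustrative, not from the original query). Every contiguous run of words in a text node goes into a set; intersecting that set with the dictionary yields the reduced dictionary for the final regex pass. The speed comes from hash lookups; the cost is the O(w²) phrases generated per node of w words.]

```python
def phrases(text: str) -> set[str]:
    """All contiguous word sequences of a text node, in word order."""
    words = text.split()
    return {
        " ".join(words[start:start + length])
        for start in range(len(words))
        for length in range(1, len(words) - start + 1)
    }

def reduce_dictionary(dictionary: set[str], textnodes: list[str]) -> set[str]:
    """Keep only the dictionary strings that occur somewhere in the text nodes."""
    seen: set[str] = set()
    for node in textnodes:
        seen |= phrases(node)      # one shared set; per-node sets also work
    return dictionary & seen       # hash-based key comparison, no regex yet

dictionary = {"string dictionary", "bulk string matching", "absent phrase"}
textnodes = ["a huge string dictionary of long strings",
             "questions about the performance of bulk string matching"]
print(reduce_dictionary(dictionary, textnodes))
```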
> Best, Daniel
>
> [1] http://www.balisage.net/Proceedings/vol26/print/Holstege01/BalisageVol26-Holstege01.html
>
> [2] plain method
>
> let $textnodes := fetch:xml(file:base-dir() || file:dir-separator() || 'example.xml')//text()
> let $strings := file:read-text(file:base-dir() || file:dir-separator() || 'strings.txt')
> let $regex := $strings
> for $textnode in $textnodes
> where matches($textnode, $regex)
> return $textnode
>
> [3] maps method
>
> (:~
>  : Create map from the string dictionary (tokenization).
>  :
>  : @param $string dictionary as a '|'-separated string
>  : @return map
>  :)
> declare function local:createMapFromString (
>   $string as xs:string
> ) as map(*) {
>   map:merge(
>     for $token in tokenize($string, '\|')
>     return map:entry($token, $token),
>     map { 'duplicates': 'use-first' })
> };
>
> (:~
>  : Create map from text nodes.
>  : Writes every combination of words in document order to the map.
>  :
>  : @param $textnodes
>  : @return map
>  :)
> declare function local:createMapFromTextnodes (
>   $textnodes as xs:string+
> ) as map(*) {
>   map:merge(
>     for $node in $textnodes
>     let $tokens := tokenize(normalize-space($node), ' ')
>     for $start in 1 to count($tokens)
>     for $length in 1 to count($tokens) - $start + 1
>     return map:entry(string-join(subsequence($tokens, $start, $length), ' '), 'x'))
> };
>
> (:~
>  : Compare two maps: return the values of $map1 whose keys also occur in $map2.
>  :
>  : @param $map1
>  : @param $map2
>  : @return xs:string*
>  :)
> declare function local:reduceMaps (
>   $map1 as map(*),
>   $map2 as map(*)
> ) as xs:string* {
>   for $key in map:keys($map1)
>   where map:contains($map2, $key)
>   return map:get($map1, $key)
> };
>
> let $textnodes := fetch:xml(file:base-dir() || file:dir-separator() || 'example.xml')//text()
> let $strings := file:read-text(file:base-dir() || file:dir-separator() || 'strings.txt')
>
> let $map_words := local:createMapFromString($strings)
> let $map_textnodes := local:createMapFromTextnodes($textnodes)
> let $matches := local:reduceMaps($map_words, $map_textnodes)
>
> let $regex := string-join(distinct-values($matches), '|')
>
> for $textnode in $textnodes
> where matches($textnode, $regex)
> return $textnode
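[Editor's note: for comparison with the approach in the Holstege paper [1], here is a minimal Aho-Corasick sketch (in Python, not the paper's XQuery implementation; all names are illustrative). The automaton is built once in time proportional to the total length of all patterns and then finds every pattern in a single pass over the text. With 50,000 multi-word dictionary strings, that one-time build is itself substantial, which is consistent with Daniel's observation that constructing the structure dominates.]

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick matcher: a trie plus BFS-computed failure links."""

    def __init__(self, patterns):
        self.goto = [{}]       # one transition dict per trie state
        self.out = [set()]     # patterns ending at (or reachable from) a state
        self.fail = [0]        # failure links, root = 0
        # Build phase: O(total length of all patterns). With a huge dictionary
        # this is the expensive part, but it is paid once up front.
        for pat in patterns:
            state = 0
            for ch in pat:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.out.append(set())
                    self.fail.append(0)
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].add(pat)
        # Failure links via breadth-first traversal of the trie.
        queue = deque(self.goto[0].values())
        while queue:
            s = queue.popleft()
            for ch, nxt in self.goto[s].items():
                queue.append(nxt)
                f = self.fail[s]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] |= self.out[self.fail[nxt]]

    def find(self, text):
        """Return (position, pattern) for every occurrence, in one pass."""
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pat in self.out[state]:
                hits.append((i - len(pat) + 1, pat))
        return hits

ac = AhoCorasick(["he", "she", "his", "hers"])
print(sorted(ac.find("ushers")))
```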