Catching up late, but thank you for your thoughts. I have grown so fond of using maps in XQuery that I now use them all the time, even with tiny data sets. I have learned that, in combination with a large number of long search strings, they can speed things up a lot where a regular expression can only go so far.
Thanks!

-----Original Message-----
From: Christian Grün <christian.gr...@gmail.com>
Sent: Friday, January 7, 2022 18:49
To: Zimmel, Daniel <d.zim...@esvmedien.de>
Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Questions about the performance of bulk string matching

Hi Daniel,

Thanks for your code. My reply is a fairly general one and not specific to XQuery: Regular expressions are very powerful, but they can get fairly expensive, so it's always a good idea to simplify or optimize them before running them on large datasets. Here is an example of a regex that looks fairly simple but takes many seconds to execute:

  matches( string-join((1 to 100000) ! 'x'), '.*y' )

Since matches() is unanchored, this expression can be sped up a lot by dropping the leading '.*' and simply using 'y' as the pattern. BaseX passes most regex strings on to Java without further changes. If an expression takes a long time to execute, chances are high that it's due to the JDK.

Hope this helps,
Christian

On Tue, Jan 4, 2022 at 3:50 PM Zimmel, Daniel <d.zim...@esvmedien.de> wrote:
>
> Hi Christian,
>
> thanks for your interest, please find attached a complete example. It is all about the length of the strings in my dictionary, really.
>
> test-reduce.xq: 170.82ms
> test.xq: 6572.49ms
>
> I noticed that if the data is larger, it is better performance-wise to build a comparison map for each text node (not as in the example).
> On the other hand, if the strings are short, performance might even be worse.
>
> But with my use case, I have to search for complete strings like "Erste Durchführungsverordnung zur Verordnung über Luftfahrtpersonal (Anwendungsbestimmungen zur Lizenzierung von Piloten Flugzeuge, von Piloten Hubschrauber, von Flugingenieuren, Freiballonführern, Flugdienstberatern und Flugtechnikern auf Hubschraubern bei den Polizeien des Bundes und der Länder)"; this does not work too well with bulk matches/replace.
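[Editor's note: Christian's point about the leading '.*' can be illustrated outside XQuery as well. Below is a small Python sketch (illustrative only; `re.search` is unanchored, just like `fn:matches`):]

```python
import re

# Haystack in the spirit of Christian's example: many 'x' characters, no 'y'.
haystack = "x" * 100_000

# fn:matches() in XQuery, like re.search() in Python, is unanchored: it looks
# for the pattern anywhere in the string. A leading '.*' therefore changes
# nothing about whether a string matches, but it can force a backtracking
# engine to re-scan the tail of the input from every start position.
for text in ("xxyxx", "no y here", ""):
    assert bool(re.search(r".*y", text)) == bool(re.search(r"y", text))

# The simplified pattern answers the same question in a single linear scan.
assert re.search(r"y", haystack) is None
print("'.*y' and 'y' are equivalent under unanchored matching")
```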
> So while I am quite happy with the results, I do not understand why this works so fast, or whether there is any feasible alternative to this approach (I get the feeling there is still much for me to learn in the mysteries of XQuery ;-)
> Does this work because BaseX simply handles it well? Could match/replace be optimized in some other way to handle bulk searches that include long strings?
>
> Daniel
>
> From: Christian Grün <christian.gr...@gmail.com>
> Sent: Thursday, December 30, 2021 20:15
> To: Zimmel, Daniel <d.zim...@esvmedien.de>
> Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
> Subject: Re: [basex-talk] Questions about the performance of bulk string matching
>
> Hi Daniel,
>
> Thanks for presenting your interesting use case.
>
> Your solution looks clever. Would you mind sharing some sample data with us (an XML document and a string dictionary)?
>
> Cheers,
> Christian
>
> Zimmel, Daniel <d.zim...@esvmedien.de> wrote on Wed, Dec 29, 2021, 12:13:
>
> Hi,
>
> recently I ran into serious (as in SERIOUS) performance trouble with expensive regexes, and no wonder why. Here is my scenario:
>
> * XML text documents with a total of 1m text nodes
> * a regex search string built from a huge dictionary of 50,000 strings (some of them containing 50 words each)
> * each match must be wrapped in an element (= create a link)
>
> I could drink many cups of tea while waiting for the regex to complete... hours... I ran out of tea!
>
> Then I found the 2021 Balisage paper by Mary Holstege titled "Fast Bulk String Matching" [1], in which she explores the Aho-Corasick algorithm and implements it in XQuery, marvellous! Following this, while I can build a data structure that gives me fast results, building that structure is still very slow due to the amount of text to build it from.
> So this was not fast enough for my use case, or I may simply not be smart enough to apply it correctly :-|
>
> So I tried tinkering with maps, which turned out to give me extraordinary performance gains:
>
> * build a map from the string dictionary
> * walk through all text nodes one by one
> * for each text node, put every combination of words in the text node, in word order (I need to find strings, not words), into another map
> * strip punctuation (for word boundaries) and normalize whitespace in both maps
> * compare the keys of both maps
> * feed the reduced string dictionary to the regular regex search
>
> While comparing the maps, I do not know where in the text my strings are, but at least I know whether they are in there; to find out where exactly (and how they fit my other regex context), I can then use a massively reduced regular regex search. Fast!
>
> I am quite happy, BUT I still cannot understand why this is so much faster for my sample data:
>
> * plain method: 51323.72ms
> * maps method: 597.94ms (!)
>
> Is this due to optimizations done by BaseX, or have I simply discovered the power of maps?
> How would you do it? Is there still room for improvement?
> Why does Aho-Corasick not help much with this scenario? Is it because the list of strings is simply too massive?
> Why is this so much faster with text-splitting-to-map?
>
> See below for the query examples to better understand what I am trying to do (bulk data not included) [2], [3]. There is no normalization of punctuation in the examples, but that is only necessary for completeness.
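[Editor's note: the map-based reduction described above can be sketched in a few lines of Python (names and sample data are illustrative, not from the original query). Every contiguous run of words in a text node goes into a set; intersecting that set with the dictionary yields the reduced dictionary for the final regex pass. The speed comes from hash lookups; the cost is the O(w²) phrases generated per node of w words.]

```python
def phrases(text: str) -> set[str]:
    """All contiguous word sequences of a text node, in word order."""
    words = text.split()
    return {
        " ".join(words[start:start + length])
        for start in range(len(words))
        for length in range(1, len(words) - start + 1)
    }

def reduce_dictionary(dictionary: set[str], textnodes: list[str]) -> set[str]:
    """Keep only the dictionary strings that occur somewhere in the text nodes."""
    seen: set[str] = set()
    for node in textnodes:
        seen |= phrases(node)      # one shared set; per-node sets also work
    return dictionary & seen       # hash-based key comparison, no regex yet

dictionary = {"string dictionary", "bulk string matching", "absent phrase"}
textnodes = ["a huge string dictionary of long strings",
             "questions about the performance of bulk string matching"]
print(reduce_dictionary(dictionary, textnodes))
```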
> Best, Daniel
>
> [1] http://www.balisage.net/Proceedings/vol26/print/Holstege01/BalisageVol26-Holstege01.html
>
> [2] plain method
>
> let $textnodes := fetch:xml(file:base-dir() || file:dir-separator() || 'example.xml')//text()
> let $strings := file:read-text(file:base-dir() || file:dir-separator() || 'strings.txt')
> let $regex := $strings
> for $textnode in $textnodes
> where matches($textnode, $regex)
> return $textnode
>
> [3] maps method
>
> (:~
>  : Create map from the string dictionary (tokenization).
>  :
>  : @param $string dictionary as a '|'-separated string
>  : @return map
>  :)
> declare function local:createMapFromString (
>   $string as xs:string
> ) as map(*) {
>   map:merge(
>     for $token in tokenize($string, '\|')
>     return map:entry($token, $token),
>     map { 'duplicates': 'use-first' })
> };
>
> (:~
>  : Create map from text nodes.
>  : Writes every combination of words in document order to the map.
>  :
>  : @param $textnodes
>  : @return map
>  :)
> declare function local:createMapFromTextnodes (
>   $textnodes as xs:string+
> ) as map(*) {
>   map:merge(
>     for $node in $textnodes
>     let $tokens := tokenize(normalize-space($node), ' ')
>     for $start in 1 to count($tokens)
>     for $length in 1 to count($tokens) - $start + 1
>     return map:entry(string-join(subsequence($tokens, $start, $length), ' '), 'x'))
> };
>
> (:~
>  : Compare two maps: return the values of $map1 whose keys also occur in $map2.
>  :
>  : @param $map1
>  : @param $map2
>  : @return xs:string*
>  :)
> declare function local:reduceMaps (
>   $map1 as map(*),
>   $map2 as map(*)
> ) as xs:string* {
>   for $key in map:keys($map1)
>   where map:contains($map2, $key)
>   return map:get($map1, $key)
> };
>
> let $textnodes := fetch:xml(file:base-dir() || file:dir-separator() || 'example.xml')//text()
> let $strings := file:read-text(file:base-dir() || file:dir-separator() || 'strings.txt')
>
> let $map_words := local:createMapFromString($strings)
> let $map_textnodes := local:createMapFromTextnodes($textnodes)
> let $matches := local:reduceMaps($map_words, $map_textnodes)
>
> let $regex := string-join(distinct-values($matches), '|')
>
> for $textnode in $textnodes
> where matches($textnode, $regex)
> return $textnode
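[Editor's note: for comparison with the approach in the Holstege paper [1], here is a minimal Aho-Corasick sketch (in Python, not the paper's XQuery implementation; all names are illustrative). The automaton is built once in time proportional to the total length of all patterns and then finds every pattern in a single pass over the text. With 50,000 multi-word dictionary strings, that one-time build is itself substantial, which is consistent with Daniel's observation that constructing the structure dominates.]

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick matcher: a trie plus BFS-computed failure links."""

    def __init__(self, patterns):
        self.goto = [{}]       # one transition dict per trie state
        self.out = [set()]     # patterns ending at (or reachable from) a state
        self.fail = [0]        # failure links, root = 0
        # Build phase: O(total length of all patterns). With a huge dictionary
        # this is the expensive part, but it is paid once up front.
        for pat in patterns:
            state = 0
            for ch in pat:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.out.append(set())
                    self.fail.append(0)
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].add(pat)
        # Failure links via breadth-first traversal of the trie.
        queue = deque(self.goto[0].values())
        while queue:
            s = queue.popleft()
            for ch, nxt in self.goto[s].items():
                queue.append(nxt)
                f = self.fail[s]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] |= self.out[self.fail[nxt]]

    def find(self, text):
        """Return (position, pattern) for every occurrence, in one pass."""
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pat in self.out[state]:
                hits.append((i - len(pat) + 1, pat))
        return hits

ac = AhoCorasick(["he", "she", "his", "hers"])
print(sorted(ac.find("ushers")))
```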