Hi Daniel,

Thanks for presenting your interesting use case.

Your solution looks clever. Would you mind sharing some sample data with us
(an XML document and a string dictionary)?

Cheers,
Christian



Zimmel, Daniel <d.zim...@esvmedien.de> schrieb am Mi., 29. Dez. 2021, 12:13:

> Hi,
>
> recently I ran into serious (as in SERIOUS) performance trouble with
> expensive regexes - and no wonder.
>
> Here is my scenario:
>
> * XML text documents with a total of 1m text nodes
> * a regex search string built from a huge string dictionary of 50,000
> entries (some of them 50 words long)
> * a match must be wrapped in an element (= create a link; see the sketch
> below)
>
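> (The wrapping step itself is not shown in the query examples below; a
> minimal sketch using fn:analyze-string, with a hypothetical <link>
> wrapper element, could look like this:)
>
> declare function local:wrap (
>   $text  as xs:string,
>   $regex as xs:string
> ) as node()* {
>   (: split the text into match/non-match parts and wrap the matches :)
>   for $part in analyze-string($text, $regex)/*
>   return
>     if ($part/self::fn:match)
>     then <link>{ string($part) }</link>
>     else text { $part }
> };
>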
> I could drink many cups of tea while waiting for the regex to complete...
> hours... I ran out of tea!
>
> Then I found the 2021 Balisage paper by Mary Holstege titled "Fast Bulk
> String Matching" [1], in which she explores the Aho-Corasick algorithm and
> implements it in XQuery - marvellous! Following her approach, I can build a
> data structure that gives me fast lookups, but constructing that structure
> is still very slow because of the amount of text it has to be built from.
> So it was not fast enough for my use case - or I may simply not be smart
> enough to apply it correctly :-|
>
> So I tried tinkering with maps, which turned out to give me extraordinary
> performance gains:
>
> * build a map from the string dictionary
> * walk through all text nodes one by one
> * for each text node, put every combination of consecutive words, in word
> order (I need to find strings, not single words), into another map - see
> the illustration after this list
> * strip punctuation (for word boundaries) and normalize whitespace in both
> maps
> * compare the keys of both maps
> * feed the new, reduced string dictionary to the regular regex search
>
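> (To illustrate the word-combination step: for a text node containing
> 'the quick brown fox', the map receives the keys 'the', 'the quick',
> 'the quick brown', 'the quick brown fox', 'quick', 'quick brown', ...,
> down to 'fox':)
>
> let $tokens := tokenize('the quick brown fox', ' ')
> for $start in 1 to count($tokens)
> for $length in 1 to count($tokens) - $start + 1
> return string-join(subsequence($tokens, $start, $length), ' ')
>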
> While comparing the maps, I do not know where in the text my strings are,
> but at least I know whether they are in there - to find out where exactly
> (and how they fit my other regex context), I can then run a massively
> reduced regular regex search. Fast!
>
> I am quite happy BUT I still cannot understand why this is so much faster
> for my sample data:
>
> * plain method: 51323.72 ms
> * maps method: 597.94 ms (!)
>
> Is this due to optimizations done by BaseX, or have I simply discovered
> the power of maps?
> How would you do it? Is there still room for improvement?
> Why does Aho-Corasick not help much in this scenario? Is it because the
> list of strings is simply too massive?
> Why is the text-splitting-to-map approach so much faster?
>
> See below for the query examples to better understand what I am trying to
> do (bulk data not included) [2],[3].
> The examples do not normalize punctuation; that is only needed for
> completeness.
>
> Best, Daniel
>
> [1]
> http://www.balisage.net/Proceedings/vol26/print/Holstege01/BalisageVol26-Holstege01.html
>
> [2] plain method
>
> let $textnodes :=
>   fetch:xml(file:base-dir() || file:dir-separator() || 'example.xml')//text()
> let $strings :=
>   file:read-text(file:base-dir() || file:dir-separator() || 'strings.txt')
> let $regex := $strings
> for $textnode in $textnodes
> where matches($textnode, $regex)
> return $textnode
>
> [3] maps method
>
> (:~
>  : Create a map from the string dictionary
>  : (one entry per '|'-separated token)
>  :
>  : @param $string the raw dictionary string
>  : @return map of dictionary entries
>  :)
> declare function local:createMapFromString (
>   $string as xs:string
> ) as map(*) {
>   map:merge(
>     for $token in tokenize($string, '\|')
>     return map:entry($token, $token),
>     map { 'duplicates': 'use-first' }
>   )
> };
>
> (:~
>  : Create a map from the text nodes
>  : Writes every combination of consecutive words, in word order, to the map
>  :
>  : @param $textnodes the text nodes to index
>  : @return map with one key per word combination
>  :)
> declare function local:createMapFromTextnodes (
>   $textnodes as xs:string+
> ) as map(*) {
>   map:merge(
>     for $node in $textnodes
>     let $tokens := tokenize(normalize-space($node), ' ')
>     (: enumerate every contiguous word sequence :)
>     for $start in 1 to count($tokens)
>     for $length in 1 to count($tokens) - $start + 1
>     return map:entry(
>       string-join(subsequence($tokens, $start, $length), ' '), 'x')
>   )
> };
>
> (:~
>  : Compare two maps: return the values of $map1 whose keys also
>  : exist in $map2
>  :
>  : @param $map1 dictionary map
>  : @param $map2 text-node map
>  : @return xs:string*
>  :)
> declare function local:reduceMaps (
>   $map1 as map(*),
>   $map2 as map(*)
> ) as xs:string* {
>   for $key in map:keys($map1)
>   where map:contains($map2, $key)
>   return map:get($map1, $key)
> };
>
> let $textnodes :=
>   fetch:xml(file:base-dir() || file:dir-separator() || 'example.xml')//text()
> let $strings :=
>   file:read-text(file:base-dir() || file:dir-separator() || 'strings.txt')
>
> let $map_words := local:createMapFromString($strings)
> let $map_textnodes := local:createMapFromTextnodes($textnodes)
> let $matches := local:reduceMaps($map_words, $map_textnodes)
>
> (: build a much smaller alternation from the matched dictionary entries :)
> let $regex := string-join(distinct-values($matches), '|')
>
> for $textnode in $textnodes
> where matches($textnode, $regex)
> return $textnode