Re: [pollen] Getting the first N words: speed comparison

Matthew Butterick Wed, 20 Mar 2019 22:39:36 -0700


> On Mar 20, 2019, at 8:19 PM, Joel Dueck <[email protected]> wrote:
> 
> The regular-expression version is slower, much more so, it seems, the larger 
> the first txexpr that you give it.
> 
> I am sure both functions could be made much faster. In particular the regex 
> version matches all the words in each string, there is probably a better 
> pattern that would stop after the first N words.


Your regexp-based function is slower because `regexp-match*` eagerly finds 
every match in the string, whether you need it or not. The port-based function 
is faster because it works incrementally and stops reading once it has enough 
words.
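
A minimal sketch of the difference, using a made-up sample string (the names 
`s`, `ip`, `all-words`, `first-match` and `second-match` are only for 
illustration and are not from the original functions):

```racket
#lang racket/base

;; Hypothetical sample input, standing in for the text of one txexpr.
(define s "one two three four five")

;; Eager: scans the entire string and returns every match as a string.
(define all-words (regexp-match* #px"\\w+" s))
;; all-words is '("one" "two" "three" "four" "five")

;; Incremental: each call on an input port reads only as far as the end
;; of the next match, so we can stop whenever we have enough words.
(define ip (open-input-string s))
(define first-match (regexp-match #px"\\w+" ip))   ; '(#"one")
(define second-match (regexp-match #px"\\w+" ip))  ; '(#"two")

;; Matches read from a port come back as byte strings, so they need
;; converting before they can be joined as ordinary strings.
(define first-word (bytes->string/utf-8 (car first-match)))  ; "one"
```

Note that the port version never looks at "three four five" unless you ask 
for more matches.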

But you can get both (a regexp and incremental processing) by passing an input 
port as the argument to `regexp-match`. In this example, the pattern is matched 
incrementally, and if we don't get enough words from one txexpr, we move on to 
the next txexpr and process it the same way.

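(require racket/string)

;; `tx-strs` is not defined here; it is assumed to be the helper from earlier
;; in this thread that returns the text of a txexpr as a single string.
(define (first-words-regex2 txs n)
  (define words
    (let loop ([txs txs] [n n])
      (cond
        ;; Ran out of txexprs before collecting n words: return what we have.
        [(null? txs) '()]
        [else
         ;; Matching against a port reads only as far as each match needs.
         (define ip (open-input-string (tx-strs (car txs))))
         (define words
           (for*/list ([i (in-range n)]
                       [bs (in-value (regexp-match #px"\\w+" ip))]
                       #:break (not bs))
             ;; Matches read from a port come back as byte strings.
             (bytes->string/utf-8 (car bs))))
         (if (= (length words) n)
             words
             (append words (loop (cdr txs) (- n (length words)))))])))
  (string-join words " "))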
