[pollen] Getting the first N words: speed comparison

Joel Dueck Wed, 20 Mar 2019 20:19:53 -0700

This isn’t a question, just sharing a rabbit hole I went down tonight that 
might be of use to other Pollenators.


I wrote a function to get the first N words out of the "visible text" of a 
tagged X-expression:

    (define doc
      '(root (p [[class "new-section"]]
                "She counted (" (em "one, two") "— silently, eyes unblinking")))

    (first-words (get-elements doc) 5) ; → "She counted one two silently"

In particular I wanted to a) ignore attribute strings, b) filter out some, 
but not all, punctuation, and c) examine only one txexpr at a time until I 
found enough words.

I thought about using regular expressions. But then I thought I could get 
it to run faster by manually examining each character one at a time, so I 
tried that first. Then I got curious, so I went back and wrote the 
regular-expression version and benchmarked them against each other.
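
For readers who want to see the character-by-character idea in code, here is a minimal sketch. It is not the code from the gist. The names first-words/scan, words-in-string, collect and word-char? are made up for illustration, elements-of is a stand-in for txexpr's get-elements so the snippet runs on its own, and the choice of which punctuation to keep (apostrophes and hyphens) is an assumption.

```racket
#lang racket/base
(require racket/string) ; for string-join

;; Assumption: letters, digits, apostrophes and hyphens are word characters;
;; all other punctuation is silently dropped.
(define (word-char? c)
  (or (char-alphabetic? c)
      (char-numeric? c)
      (memv c '(#\' #\’ #\-))))

;; Scan one string a character at a time; return a list of at most n words.
(define (words-in-string str n)
  (let loop ([chars (string->list str)] [current '()] [found '()])
    ;; found, plus the word in progress if there is one
    (define done-word
      (if (null? current)
          found
          (cons (list->string (reverse current)) found)))
    (cond
      [(= (length found) n) (reverse found)]   ; enough words: stop early
      [(null? chars) (reverse done-word)]      ; end of string
      [(char-whitespace? (car chars)) (loop (cdr chars) '() done-word)]
      [(word-char? (car chars)) (loop (cdr chars) (cons (car chars) current) found)]
      [else (loop (cdr chars) current found)]))) ; dropped punctuation

;; Stand-in for txexpr's get-elements: drop the tag and, if present,
;; the attribute list.
(define (elements-of tx)
  (define body (cdr tx))
  (if (and (pair? body)
           (list? (car body))
           (or (null? (car body)) (pair? (car (car body)))))
      (cdr body)
      body))

;; Visit elements in order, one at a time, until n words have been found.
(define (collect elems n)
  (cond
    [(or (null? elems) (<= n 0)) '()]
    [else
     (define x (car elems))
     (define here
       (cond
         [(string? x) (words-in-string x n)]
         [(and (pair? x) (symbol? (car x))) (collect (elements-of x) n)]
         [else '()]))
     (append here (collect (cdr elems) (- n (length here))))]))

(define (first-words/scan elems n)
  (string-join (collect elems n) " "))

(define doc
  '(root (p [[class "new-section"]]
            "She counted (" (em "one, two") "— silently, eyes unblinking")))

(first-words/scan (elements-of doc) 5) ; → "She counted one two silently"
```

Attribute strings are never visited because elements-of skips the attribute list, and later elements are not touched once the word count is reached.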

The functions and the benchmark results are at this gist: 
https://gist.github.com/otherjoel/4366960058983073ce01fa27f1a4d09d

The regular-expression version is slower, and the gap seems to widen the 
larger the first txexpr that you give it.

I am sure both functions could be made much faster. In particular, the regex 
version matches all the words in each string; there is probably a better 
pattern that would stop after the first N words.
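
One possible way to stop early, sketched below, is not a cleverer pattern but a loop that matches one word at a time with regexp-match-positions and quits once N words are found. The name first-words-in-string/rx and the word pattern are assumptions for illustration, and whether this actually beats matching everything would need benchmarking.

```racket
#lang racket/base

;; Assumption: a word is a run of letters, digits, apostrophes or hyphens.
(define word-rx #px"(?:\\p{L}|\\p{N}|['’-])+")

;; Match one word at a time, resuming after the previous match, and stop
;; as soon as n words have been collected.
(define (first-words-in-string/rx str n)
  (let loop ([start 0] [found '()])
    (define m (and (< (length found) n)
                   (regexp-match-positions word-rx str start)))
    (if m
        (loop (cdar m) (cons (substring str (caar m) (cdar m)) found))
        (reverse found))))

(first-words-in-string/rx "— silently, eyes unblinking" 2) ; → '("silently" "eyes")
```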
