This isn’t a question, just sharing a rabbit hole I went down tonight that
might be of use to other Pollenators.
I wrote a function to get the first N words out of the "visible text" of a
tagged X-expression:
(define doc
  '(root (p [[class "new-section"]] "She counted ("
            (em "one, two") "— silently, eyes unblinking")))
(first-words (get-elements doc) 5) ; → "She counted one two silently"
In particular I wanted to a) ignore attribute strings, b) filter out some,
but not all, punctuation, and c) examine only one txexpr at a time until I
found enough words.
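
To illustrate point (a), here is a minimal sketch of pulling only the visible
strings out of a tagged X-expression while skipping attribute values. It is not
the code from the gist. The names attrs?, elements-of and visible-strings are
hypothetical, and it uses plain racket/base rather than the txexpr library so
that it runs on its own. Unlike what I describe in (c), this version eagerly
collects every string instead of stopping early.

```racket
#lang racket/base
(require racket/list)

;; Hypothetical helper: #t when v looks like a txexpr attribute list,
;; e.g. '((class "new-section")). The empty list also counts.
(define (attrs? v)
  (and (list? v)
       (andmap (lambda (a)
                 (and (list? a)
                      (= (length a) 2)
                      (symbol? (first a))
                      (string? (second a))))
               v)))

;; Hypothetical helper: the elements of a tagged X-expression,
;; i.e. everything after the tag and the optional attribute list.
(define (elements-of tx)
  (define after-tag (cdr tx))
  (if (and (pair? after-tag) (attrs? (car after-tag)))
      (cdr after-tag)
      after-tag))

;; All visible strings in document order. Attribute strings are never
;; visited because we only recurse into elements. Non-string atoms
;; (symbols, numbers) are ignored.
(define (visible-strings x)
  (cond [(string? x) (list x)]
        [(pair? x) (append-map visible-strings (elements-of x))]
        [else '()]))
```

Applied to the doc above, this should yield the three strings "She counted (",
"one, two" and "— silently, eyes unblinking", with "new-section" left out.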
I thought about using regular expressions. But then I thought I could get
it to run faster by manually examining each character one at a time, so I
tried that first. Then I got curious, so I went back and wrote the
regular-expression version and benchmarked them against each other.
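
For a rough idea of what the character-at-a-time approach looks like, here is a
minimal sketch. Again, this is not the gist's code. The names word-char?,
scan-words and first-words-from are hypothetical. I am assuming that letters,
digits, apostrophes and hyphens belong to words, that every other character
separates words, and that a word never spans two strings.

```racket
#lang racket/base
(require racket/string)

;; Assumption: these characters are part of a word; all others separate words.
(define (word-char? c)
  (or (char-alphabetic? c)
      (char-numeric? c)
      (char=? c #\')
      (char=? c #\-)))

;; Collect at most n words from str, one character at a time,
;; stopping as soon as n words have been completed.
(define (scan-words str n)
  (define len (string-length str))
  (let loop ([i 0] [current '()] [words '()] [count 0])
    (cond
      [(= count n) (reverse words)]
      [(= i len)
       ;; End of string: flush a pending word, if any.
       (reverse (if (null? current)
                    words
                    (cons (list->string (reverse current)) words)))]
      [(word-char? (string-ref str i))
       (loop (add1 i) (cons (string-ref str i) current) words count)]
      [(null? current)
       ;; Separator with no word in progress: skip it.
       (loop (add1 i) '() words count)]
      [else
       ;; Separator that ends the word in progress.
       (loop (add1 i) '()
             (cons (list->string (reverse current)) words)
             (add1 count))])))

;; Take words from each string in turn until n have been found.
(define (first-words-from strs n)
  (let loop ([strs strs] [found '()])
    (define need (- n (length found)))
    (if (or (null? strs) (<= need 0))
        (string-join found " ")
        (loop (cdr strs) (append found (scan-words (car strs) need))))))
```

Given the three visible strings from the example doc and n = 5, this should
produce "She counted one two silently".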
The functions and the benchmark results are at this gist:
https://gist.github.com/otherjoel/4366960058983073ce01fa27f1a4d09d
The regular-expression version is slower, and the gap seems to widen the
larger the first txexpr that you give it.
I am sure both functions could be made much faster. In particular, the regex
version matches all the words in each string; there is probably a better
approach that would stop after the first N words.
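
One possible way to stop early, sketched under assumptions and not benchmarked:
instead of collecting every match (as something like regexp-match* would), call
regexp-match-positions in a loop, restarting the search at the end of the
previous match and quitting once N words are in hand. The name regex-words and
the ASCII-only word pattern are both hypothetical choices for illustration.

```racket
#lang racket/base

;; Assumed word pattern: a run of ASCII letters, digits, apostrophes or hyphens.
(define word-rx #rx"[A-Za-z0-9'-]+")

;; Return at most n words from str, searching for one match at a time.
(define (regex-words str n)
  (let loop ([start 0] [words '()] [count 0])
    ;; Only search again if we still need more words.
    (define m (and (< count n)
                   (regexp-match-positions word-rx str start)))
    (if m
        ;; (car m) is a (start . end) pair indexing into the whole string.
        (loop (cdar m)
              (cons (substring str (caar m) (cdar m)) words)
              (add1 count))
        (reverse words))))
```

With this, (regex-words "one, two" 1) only ever looks as far as the first word.
Whether it actually beats the character-by-character version is something I
have not measured.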