I would use enlive for this: https://github.com/cgrand/enlive

`re-pred` seems relevant: https://cljdoc.org/d/enlive/enlive/1.1.6/api/net.cgrand.enlive-html#re-pred

Here's someone doing something similar a while ago: https://stackoverflow.com/questions/18604049/clojure-enlive-a-selector-that-uses-regex

They were also having problems with text encoding; hopefully you won't.

hth,
-Harold

On Wednesday, February 2, 2022 at 1:22:53 PM UTC-7 lawrence...@gmail.com wrote:

> Assume I've been cursed to scrape HTML. If I convert the pages to Hickory
> I end up with a big mass of data which, sadly, lacks many "class" or "id"s
> that would let me easily pick out the data I need. However, for the most
> part, the only thing I really need off this page is the CVEs, which look
> like this:
>
> CVE-2021-40539
>
> I'm thinking I might write regex against the plain text of the page, but
> I'm also curious, is it common to take something like Hiccup or Hickory or
> a zipper and run regex through it? If yes, how is that done?
>
> A small part of the data looks like this:
>
> :content
> [{:type :element,
>   :attrs
>   {:class "tip-intro", :style "font-size: 15px;"},
>   :tag :p,
>   :content
>   [{:type :element,
>     :attrs nil,
>     :tag :em,
>     :content
>     ["This Joint Cybersecurity Advisory uses the MITRE
> Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK®) framework,
> Version 8. See the "
>      {:type :element,
>       :attrs
>       {:href
>        "https://attack.mitre.org/versions/v9/techniques/enterprise/"},
>       :tag :a,
>       :content ["ATT&CK for Enterprise"]}
>      " for referenced threat actor tactics and for
> techniques."]}]}
>  "\n\n"
>  {:type :element,
>   :attrs nil,
>   :tag :p,
>   :content
>   ["This joint advisory is the result of analytic efforts
> between the Federal Bureau of Investigation (FBI), United States Coast
> Guard Cyber Command (CGCYBER), and the Cybersecurity and Infrastructure
> Security Agency (CISA) to highlight the cyber threat associated with active
> exploitation of a newly identified vulnerability (CVE-2021-40539) in
> ManageEngine ADSelfService Plus—a self-service password management and
> single sign-on solution."]}
>  "\n\n"
>  {:type :element,
>   :attrs nil,
>   :tag :p,
>   :content
>   ["CVE-2021-40539, rated critical by the Common
> Vulnerability Scoring System (CVSS), is an authentication bypass
> vulnerability affecting representational state transfer (REST) application
> programming interface (API) URLs that could enable remote code execution.
> The FBI, CISA, and CGCYBER assess that advanced persistent threat (APT)
> cyber actors are likely among those exploiting the vulnerability. The
> exploitation of ManageEngine ADSelfService Plus poses a serious risk to
> critical infrastructure companies, U.S.-cleared defense contractors,
> academic institutions, and other entities that use the software. Successful
> exploitation of the vulnerability allows an attacker to place webshells,
> which enable the adversary to conduct post-exploitation activities, such as
> compromising administrator credentials, conducting lateral movement, and
> exfiltrating registry hives and Active Directory files."]}
>  "\n\n"
>

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To view this discussion on the web visit https://groups.google.com/d/msgid/clojure/078db869-2405-48da-a383-8c5ca187c5adn%40googlegroups.com.
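
On the original question of running a regex through Hickory data: besides enlive, one possible approach is to walk the tree with plain `clojure.core` and apply the regex to every text node. The following is only a rough sketch, not what was suggested in the thread. The names `sample-page`, `cve-pattern`, `text-nodes` and `find-cves` are made up for illustration, the sample data is a trimmed-down stand-in for the excerpt quoted in the question, and the pattern assumes CVE IDs of the form `CVE-YYYY-NNNN` with four or more trailing digits.

```clojure
;; Hypothetical, trimmed-down Hickory-style data modelled on the quoted excerpt.
(def sample-page
  {:type :element
   :attrs nil
   :tag :div
   :content
   [{:type :element
     :attrs nil
     :tag :p
     :content ["... a newly identified vulnerability (CVE-2021-40539) in ManageEngine ADSelfService Plus ..."]}
    "\n\n"
    {:type :element
     :attrs nil
     :tag :p
     :content ["CVE-2021-40539, rated critical by the Common Vulnerability Scoring System (CVSS), ..."]}]})

;; Assumed ID shape: "CVE-", a four-digit year, "-", then four or more digits.
(def cve-pattern #"CVE-\d{4}-\d{4,}")

(defn text-nodes
  "Returns every string found anywhere under a Hickory-style node.
  Element maps are treated as branches whose children live under :content."
  [node]
  (->> (tree-seq map? :content node)
       (filter string?)))

(defn find-cves
  "Returns the distinct CVE IDs mentioned in the text under node,
  in order of first appearance."
  [node]
  (->> (text-nodes node)
       (mapcat #(re-seq cve-pattern %))
       distinct
       vec))

(find-cves sample-page)
;; => ["CVE-2021-40539"]
```

Because this only looks at text nodes, it does not depend on "class" or "id" attributes at all. It would miss an ID that happens to be split across two adjacent text nodes, in which case running the regex over the page's plain text, as considered in the question, may be the simpler route.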