I would use enlive for this.
 - https://github.com/cgrand/enlive

`re-pred` seems relevant:
https://cljdoc.org/d/enlive/enlive/1.1.6/api/net.cgrand.enlive-html#re-pred

Here's someone doing something similar a while ago:
https://stackoverflow.com/questions/18604049/clojure-enlive-a-selector-that-uses-regex
 - They were also having problems with text encoding; hopefully you won't.
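
If you'd rather stay with the Hickory data you already have, here is a minimal sketch of running a regex over every text node. It uses only clojure.core, not enlive. The names `cve-pattern`, `text-nodes`, `find-cves` and `sample` are made up for illustration, the pattern is my guess at the CVE format from your example, and `sample` is an abbreviated copy of the data you posted.

```clojure
;; Assumed CVE shape: "CVE-", a 4-digit year, then 4 or more digits.
(def cve-pattern #"CVE-\d{4}-\d{4,}")

(defn text-nodes
  "Returns every string leaf of a Hickory-style tree, in document order."
  [tree]
  (->> (tree-seq #(or (map? %) (sequential? %))          ; elements and vectors have children
                 #(if (map? %) (:content %) (seq %))     ; an element's children live under :content
                 tree)
       (filter string?)))

(defn find-cves
  "Returns the distinct CVE ids found in the text nodes of tree."
  [tree]
  (->> (text-nodes tree)
       (mapcat #(re-seq cve-pattern %))
       distinct
       vec))

;; Abbreviated sample, shaped like the data in the quoted message.
(def sample
  [{:type :element, :attrs nil, :tag :p,
    :content ["a newly identified vulnerability (CVE-2021-40539) in ManageEngine"]}
   "\n\n"
   {:type :element, :attrs nil, :tag :p,
    :content ["CVE-2021-40539, rated critical by the Common Vulnerability Scoring System"]}])

(find-cves sample)
;; => ["CVE-2021-40539"]
```

One caveat: this only matches within a single text node, so an id that is split across two elements would be missed. Matching against the plain text of the page, as you suggested, avoids that.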

hth,
-Harold

On Wednesday, February 2, 2022 at 1:22:53 PM UTC-7 lawrence...@gmail.com wrote:

> Assume I've been cursed to scrape HTML. If I convert the pages to Hickory 
> I end up with a big mass of data which, sadly, lacks many "class" or "id"s 
> that would let me easily pick out the data I need. However, for the most 
> part, the only thing I really need off this page is the CVEs, which look 
> like this:
>
> CVE-2021-40539
>
> I'm thinking I might write regex against the plain text of the page, but 
> I'm also curious, is it common to take something like Hiccup or Hickory or 
> a zipper and run regex through it? If yes, how is that done? 
>
> A small part of the data looks like this:
>
>                 :content
>                 [{:type :element,
>                   :attrs
>                   {:class "tip-intro", :style "font-size: 15px;"},
>                   :tag :p,
>                   :content
>                   [{:type :element,
>                     :attrs nil,
>                     :tag :em,
>                     :content
>                     ["This Joint Cybersecurity Advisory uses the MITRE 
> Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK®) framework, 
> Version 8. See the "
>                      {:type :element,
>                       :attrs
>                       {:href
>                        "https://attack.mitre.org/versions/v9/techniques/enterprise/"},
>                       :tag :a,
>                       :content ["ATT&CK for Enterprise"]}
>                      " for  referenced threat actor tactics and for 
> techniques."]}]}
>                  "\n\n"
>                  {:type :element,
>                   :attrs nil,
>                   :tag :p,
>                   :content
>                   ["This joint advisory is the result of analytic efforts 
> between the Federal Bureau of Investigation (FBI), United States Coast 
> Guard Cyber Command (CGCYBER), and the Cybersecurity and Infrastructure 
> Security Agency (CISA) to highlight the cyber threat associated with active 
> exploitation of a newly identified vulnerability (CVE-2021-40539) in 
> ManageEngine ADSelfService Plus—a self-service password management and 
> single sign-on solution."]}
>                  "\n\n"
>                  {:type :element,
>                   :attrs nil,
>                   :tag :p,
>                   :content
>                   ["CVE-2021-40539, rated critical by the Common 
> Vulnerability Scoring System (CVSS), is an authentication bypass 
> vulnerability affecting representational state transfer (REST) application 
> programming interface (API) URLs that could enable remote code execution. 
> The FBI, CISA, and CGCYBER assess that advanced persistent threat (APT) 
> cyber actors are likely among those exploiting the vulnerability. The 
> exploitation of ManageEngine ADSelfService Plus poses a serious risk to 
> critical infrastructure companies, U.S.-cleared defense contractors, 
> academic institutions, and other entities that use the software. Successful 
> exploitation of the vulnerability allows an attacker to place webshells, 
> which enable the adversary to conduct post-exploitation activities, such as 
> compromising administrator credentials, conducting lateral movement, and 
> exfiltrating registry hives and Active Directory files."]}
>                  "\n\n"
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
To view this discussion on the web visit 
https://groups.google.com/d/msgid/clojure/078db869-2405-48da-a383-8c5ca187c5adn%40googlegroups.com.
