If you parse into a data structure compatible with clojure.xml, then you can
use an XML zipper to find the links in the document.
;; s is the page source; the namespaces below must be required first
(-> s
    pl.danieljanus.tagsoup/parse-xml
    clojure.zip/xml-zip
    (clojure.data.zip.xml/xml-> clojure.data.zip/descendants :a))
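
As a rough sketch of the same idea that runs without any extra dependencies, the snippet below walks the zipper by hand using only clojure.zip. The `doc` map is a hand-written stand-in for parser output in clojure.xml format, and `hrefs` is a hypothetical helper name; neither comes from tagsoup or data.zip.

```clojure
(require '[clojure.zip :as zip])

;; Hand-written stand-in for parsed HTML, in the clojure.xml map format
;; ({:tag ... :attrs ... :content [...]}). Not real parser output.
(def doc
  {:tag :body
   :attrs nil
   :content
   [{:tag :a :attrs {:href "/files/"} :content ["Files"]}
    {:tag :table
     :attrs nil
     :content
     [{:tag :tr
       :attrs nil
       :content
       [{:tag :td
         :attrs nil
         :content
         [{:tag :a :attrs {:href "http://reddit.com"} :content ["Reddit"]}]}]}]}]})

(defn hrefs
  "Walks the XML zipper depth-first and returns the :href of every :a element."
  [root]
  (->> (zip/xml-zip root)
       (iterate zip/next)                  ; depth-first traversal of locations
       (take-while (complement zip/end?))  ; stop once the walk is finished
       (map zip/node)
       (filter #(= :a (:tag %)))           ; text nodes are strings, so :tag is nil
       (map #(get-in % [:attrs :href]))))

(hrefs doc)
;; => ("/files/" "http://reddit.com")
```

The result is a flat lazy sequence, so there is nothing to flatten afterwards.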
- James
On 9 April 2016 at 20:15, Danny Freeman <[email protected]> wrote:
> I have been working on a program that will take a website, and extract all
> the links from the body of the HTML page. I am using tagsoup
> <https://github.com/nathell/clj-tagsoup> to create a tree structure from
> an html page.
>
> The current issue I am running into is traversing the tree structure and
> pulling out all the links. I have a function that parses the tree using a
> for loop and recursion, but it does not feel very idiomatic. The list it
> returns is filled with vectors of empty lists and nil values. I can flatten
> out the data structure and grab everything I need out of it, but it feels
> clunky. I was looking for some tips on how I could improve my code, since
> this is the first complicated Clojure program I have written.
>
> Here is the code I have written for extracting the a tags out of the html
> tree.
>
> (defn get-tags
>   ([tag html]
>    (get-tags tag [] html))
>
>   ([tag found html]
>    (if html
>      (for [el html]
>        (if (vector? el)
>          (if (= (soup/tag el) tag)
>            (conj found el)
>            (->> (soup/children el)
>                 (remove #(or (string? %) (nil? %)))
>                 (get-tags tag found)
>                 (conj found))))))))
>
> It gets called with something like this. Normally the site would be a
> lot bigger, but I deleted a lot of the tree for this post.
>
> (def html-tree
>   [:body
>    {}
>    [:a
>     {:href "conditionedtransiti.php", :shape "rect", :style "display: none;"}
>     "triangular-nordic"]
>    [:table
>     {}
>     [:tr
>      {}
>      [:td
>       {:colspan "1", :rowspan "1"}
>       [:a
>        {:href "/files/", :shape "rect"}
>        [:img {:src "truck.gif", :title "Slug's File Archive"}]]]
>      [:td
>       {:colspan "1", :rowspan "1"}
>       [:a {:href "/docs/", :shape "rect"} [:img {:src "magnify.gif"}]]]
>      [:td
>       {:colspan "1", :rowspan "1"}
>       [:a
>        {:href "
> http://forecast.weather.gov/MapClick.php?lat=35.045627427000454&lon=-85.30967786199966
> ",
>         :shape "rect"}
>        "Forecast"]
>       [:br {:clear "none"}]
>       [:a
>        {:href "
> http://radar.weather.gov/radar.php?rid=htx&product=N0R&overlay=11101111&loop=no
> ",
>         :shape "rect"}
>        "Radar"]
>       [:br {:clear "none"}]
>       [:a {:href "http://news.google.com/", :shape "rect"} "News"]
>       [:br {:clear "none"}]]
>      [:td
>       {:colspan "1", :rowspan "1"}
>       [:a {:href "http://reddit.com", :shape "rect"} "Reddit"]
>       [:br {:clear "none"}]
>       [:a {:href "http://digg.com", :shape "rect"} "Digg"]
>       [:br {:clear "none"}]]]]])
>
> (get-tags :a html-tree)
>
> This evaluates to
> (nil
> nil
> [[:a
> {:href "conditionedtransiti.php", :shape "rect", :style "display:
> none;"}
> "triangular-nordic"]]
> [([([([[:a
> {:href "/files/", :shape "rect"}
> [:img {:src "truck.gif", :title "Slug's File Archive"}]]])]
> [([[:a {:href "/docs/", :shape "rect"} [:img {:src "magnify.gif"}]]])]
> [([[:a
> {:href "
> http://forecast.weather.gov/MapClick.php?lat=35.045627427000454&lon=-85.30967786199966
> ",
> :shape "rect"}
> "Forecast"]]
> [()]
> [[:a
> {:href "
> http://radar.weather.gov/radar.php?rid=htx&product=N0R&overlay=11101111&loop=no
> ",
> :shape "rect"}
> "Radar"]]
> [()]
> [[:a {:href "http://news.google.com/", :shape "rect"} "News"]]
> [()])]
> [([[:a {:href "http://reddit.com", :shape "rect"} "Reddit"]]
> [()]
> [[:a {:href "http://digg.com", :shape "rect"} "Digg"]]
> [()])])])])
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to [email protected]
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>