Hello community,

here is the log from the commit of package ghc-tagsoup for openSUSE:Factory checked in at 2016-03-26 15:26:13
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/ghc-tagsoup (Old)
 and      /work/SRC/openSUSE:Factory/.ghc-tagsoup.new (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Package is "ghc-tagsoup"

Changes:
--------
--- /work/SRC/openSUSE:Factory/ghc-tagsoup/ghc-tagsoup.changes 2016-01-22 01:08:22.000000000 +0100
+++ /work/SRC/openSUSE:Factory/.ghc-tagsoup.new/ghc-tagsoup.changes 2016-03-26 15:26:19.000000000 +0100
@@ -1,0 +2,9 @@
+Wed Mar 16 09:27:33 UTC 2016 - mimi...@gmail.com
+
+- update to 0.13.9
+* fix a space leak
+* fix the demo examples
+* make IsString a superclass of StringLike
+* make flattenTree O(n) instead of O(n^2)
+
+-------------------------------------------------------------------

Old:
----
  tagsoup-0.13.8.tar.gz

New:
----
  tagsoup-0.13.9.tar.gz

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ ghc-tagsoup.spec ++++++
--- /var/tmp/diff_new_pack.l42AYE/_old 2016-03-26 15:26:21.000000000 +0100
+++ /var/tmp/diff_new_pack.l42AYE/_new 2016-03-26 15:26:21.000000000 +0100
@@ -19,7 +19,7 @@
 %global pkg_name tagsoup
 
 Name: ghc-tagsoup
-Version: 0.13.8
+Version: 0.13.9
 Release: 0
 Summary: Parsing and extracting information from (possibly malformed) HTML/XML documents
 License: BSD-3-Clause

++++++ tagsoup-0.13.8.tar.gz -> tagsoup-0.13.9.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tagsoup-0.13.8/CHANGES.txt new/tagsoup-0.13.9/CHANGES.txt
--- old/tagsoup-0.13.8/CHANGES.txt 2016-01-10 22:15:15.000000000 +0100
+++ new/tagsoup-0.13.9/CHANGES.txt 2016-03-15 13:07:28.000000000 +0100
@@ -1,5 +1,10 @@
 Changelog for TagSoup
 
+0.13.9
+    #50, fix a space leak
+    #36, fix the demo examples
+    #35, make IsString a superclass of StringLike
+    #33, make flattenTree O(n) instead of O(n^2)
 0.13.8
     #30, add parse/render functions directly to the Tree module
 0.13.7
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tagsoup-0.13.8/Main.hs new/tagsoup-0.13.9/Main.hs
--- old/tagsoup-0.13.8/Main.hs 2016-01-10 22:15:15.000000000 +0100
+++ new/tagsoup-0.13.9/Main.hs 2016-03-15 13:07:28.000000000 +0100
@@ -34,7 +34,7 @@
     ,("bench","Benchmark the parsing",Left time)
     ,("benchfile","Benchmark the parsing of a file",Right timefile)
     ,("validate","Validate a page",Right validate)
-    ,("hitcount","Get the Haskell.org hit count",Left haskellHitCount)
+    ,("lastmodifieddate","Get the wiki.haskell.org last modified date",Left haskellLastModifiedDateTime)
     ,("spj","Simon Peyton Jones' papers",Left spjPapers)
     ,("ndm","Neil Mitchell's papers",Left ndmPapers)
     ,("time","Current time",Left currentTime)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tagsoup-0.13.8/README.md new/tagsoup-0.13.9/README.md
--- old/tagsoup-0.13.8/README.md 2016-01-10 22:15:15.000000000 +0100
+++ new/tagsoup-0.13.9/README.md 2016-03-15 13:07:28.000000000 +0100
@@ -4,8 +4,8 @@
 
 The library provides a basic data type for a list of unstructured tags, a parser to convert HTML into this tag type, and useful functions and combinators for finding and extracting information. This document gives two particular examples of scraping information from the web, while a few more may be found in the [Sample](https://github.com/ndmitchell/tagsoup/blob/master/TagSoup/Sample.hs) file from the source repository. The examples we give are:
 
-* Obtaining the Hit Count from Haskell.org
-* Obtaining a list of Simon Peyton-Jones' latest papers
+* Obtaining the last modified date of the Haskell wiki
+* Obtaining a list of Simon Peyton Jones' latest papers
 * A brief overview of some other examples
 
 The intial version of this library was written in Javascript and has been used for various commercial projects involving screen scraping. In the examples general hints on screen scraping are included, learnt from bitter experience. It should be noted that if you depend on data which someone else may change at any given time, you may be in for a shock!
@@ -22,28 +22,33 @@
 
 There are two things that may go wrong with these examples:
 
-* _The Websites being scraped may change._ There is nothing I can do about this, but if you suspect this is the case let me know, and I'll update the examples and tutorials. I have already done so several times, its only a few minutes work.
+* _The Websites being scraped may change._ There is nothing I can do about this, but if you suspect this is the case let me know, and I'll update the examples and tutorials. I have already done so several times, it's only a few minutes work.
 * _The `openURL` method may not work._ This happens quite regularly, and depending on your server, proxies and direction of the wind, they may not work. The solution is to use `wget` to download the page locally, then use `readFile` instead. Hopefully a decent Haskell HTTP library will emerge, and that can be used instead.
 
-## Haskell Hit Count
+## Last modified date of Haskell wiki
 
-Our goal is to develop a program that displays the Haskell.org hit count. This example covers all the basics in designing a basic web-scraping application.
+Our goal is to develop a program that displays the date that the wiki at
+[`wiki.haskell.org`](http://wiki.haskell.org/Haskell) was last modified. This
+example covers all the basics in designing a basic web-scraping application.
 
 ### Finding the Page
 
-We first need to find where the information is displayed, and in what format. Taking a look at the [front web page](http://www.haskell.org/haskellwiki/Haskell), when not logged in, we see:
-
-    <ul id="f-list">
-        <li id="lastmod"> This page was last modified on 9 September 2013, at 22:38.</li>
-        <li id="viewcount">This page has been accessed 6,985,922 times.</li>
-        <li id="copyright">Recent content is available under <a href="/haskellwiki/HaskellWiki:Copyrights" title="HaskellWiki:Copyrights">a simple permissive license</a>.</li>
-        <li id="privacy"><a href="/haskellwiki/HaskellWiki:Privacy_policy" title="HaskellWiki:Privacy policy">Privacy policy</a></li>
-        <li id="about"><a href="/haskellwiki/HaskellWiki:About" title="HaskellWiki:About">About HaskellWiki</a></li>
-        <li id="disclaimer"><a href="/haskellwiki/HaskellWiki:General_disclaimer" title="HaskellWiki:General disclaimer">Disclaimers</a></li>
-    </ul>
+We first need to find where the information is displayed and in what format.
+Taking a look at the [front web page](http://wiki.haskell.org/Haskell), when
+not logged in, we see:
+
+```html
+<ul id="f-list">
+  <li id="lastmod"> This page was last modified on 9 September 2013, at 22:38.</li>
+  <li id="copyright">Recent content is available under <a href="/HaskellWiki:Copyrights" title="HaskellWiki:Copyrights">a simple permissive license</a>.</li>
+  <li id="privacy"><a href="/HaskellWiki:Privacy_policy" title="HaskellWiki:Privacy policy">Privacy policy</a></li>
+  <li id="about"><a href="/HaskellWiki:About" title="HaskellWiki:About">About HaskellWiki</a></li>
+  <li id="disclaimer"><a href="/HaskellWiki:General_disclaimer" title="HaskellWiki:General disclaimer">Disclaimers</a></li>
+</ul>
+```
 
-So we see the hit count is available. This leads us to rule 1:
+So, we see that the last modified date is available. This leads us to rule 1:
 
 **Rule 1:** Scrape from what the page returns, not what a browser renders, or what view-source gives.
 
@@ -53,43 +58,88 @@
 
 We can write a simple HTTP downloader with using the [HTTP package](http://hackage.haskell.org/package/HTTP):
 
-    import Network.HTTP
-
-    openURL x = getResponseBody =<< simpleHTTP (getRequest x)
-
-    main = do src <- openURL "http://www.haskell.org/haskellwiki/Haskell"
-              writeFile "temp.htm" src
+```haskell
+module Main where
+
+import Network.HTTP
+
+openURL :: String -> IO String
+openURL x = getResponseBody =<< simpleHTTP (getRequest x)
+
+main :: IO ()
+main = do
+    src <- openURL "http://wiki.haskell.org/Haskell"
+    writeFile "temp.htm" src
+```
 
 Now open `temp.htm`, find the fragment of HTML containing the hit count, and examine it.
 
 #### Using the `tagsoup` Program
 
-Tagsoup installs both as a library and a program. The program contains all the examples mentioned on this page, along with a few other useful functions. In order to download a URL to a file:
-
-    $ tagsoup grab http://www.haskell.org/haskellwiki/Haskell > temp.htm
+TagSoup installs both as a library and a program. The program contains all the
+examples mentioned on this page, along with a few other useful functions. In
+order to download a URL to a file:
+
+```bash
+$ tagsoup grab http://wiki.haskell.org/Haskell > temp.htm
+```
 
 ### Finding the Information
 
-Now we examine both the fragment that contains our snippet of information, and the wider page. What does the fragment has that nothing else has? What algorithm would we use to obtain that particular element? How can we still return the element as the content changes? What if the design changes? But wait, before going any further:
+Now we examine both the fragment that contains our snippet of information, and
+the wider page. What does the fragment have that nothing else has? What
+algorithm would we use to obtain that particular element? How can we still
+return the element as the content changes? What if the design changes? But
+wait, before going any further:
 
 **Rule 2:** Do not be robust to design changes, do not even consider the possibility when writing the code.
 
 If the user changes their website, they will do so in unpredictable ways. They may move the page, they may put the information somewhere else, they may remove the information entirely. If you want something robust talk to the site owner, or buy the data from someone. If you try and think about design changes, you will complicate your design, and it still won't work. It is better to write an extraction method quickly, and happily rewrite it when things change.
 
-So now, lets consider the fragment from above. It is useful to find a tag which is unique just above your snippet - something with a nice "id" property, or a "class" - something which is unlikely to occur multiple times. In the above example, "viewcount" as the id seems perfect.
-
-    haskellHitCount = do
-        src <- openURL "http://haskell.org/haskellwiki/Haskell"
-        let count = fromFooter $ parseTags src
-        putStrLn $ "haskell.org has been hit " ++ count ++ " times"
-        where fromFooter = filter isDigit . innerText . take 2 . dropWhile (~/= "<li id=viewcount>")
+So now, let's consider the fragment from above. It is useful to find a tag
+which is unique just above your snippet - something with a nice `id` or `class`
+attribute - something which is unlikely to occur multiple times. In the above
+example, an `id` with value `lastmod` seems perfect.
+
+```haskell
+module Main where
+
+import Data.Char
+import Network.HTTP
+import Text.HTML.TagSoup
+
+openURL :: String -> IO String
+openURL x = getResponseBody =<< simpleHTTP (getRequest x)
+
+haskellLastModifiedDateTime :: IO ()
+haskellLastModifiedDateTime = do
+    src <- openURL "http://wiki.haskell.org/Haskell"
+    let lastModifiedDateTime = fromFooter $ parseTags src
+    putStrLn $ "wiki.haskell.org was last modified on " ++ lastModifiedDateTime
+    where fromFooter = unwords . drop 6 . words . innerText . take 2 . dropWhile (~/= "<li id=lastmod>")
+
+main :: IO ()
+main = haskellLastModifiedDateTime
+```
 
 Now we start writing the code! The first thing to do is open the required URL, then we parse the code into a list of `Tag`s with `parseTags`. The `fromFooter` function does the interesting thing, and can be read right to left:
 
-* First we throw away everything (`dropWhile`) until we get to an `li` tag containing `id=viewcount`. The `(~==)` operator is different from standard equality, allowing additional attributes to be present. We write `"<li id=viewcount>"` as syntactic sugar for `TagOpen "li" [("id","viewcount")]`. If we just wanted any open tag with the given id we could have written `(~== TagOpen "" [("id","viewcount")])` and this would have matched. Any empty strings in the second element of the match are considered as wildcards.
-* Next we take two elements, the `<li>` tag and the text node immediately following.
-* We call the `innerText` function to get all the text values from inside, which will just be the text node following the `viewcount`.
-* We keep only the numbers, getting rid of the surrounding text and the commas.
+* First we throw away everything (`dropWhile`) until we get to an `li` tag
+  containing `id=lastmod`. The `(~==)` and `(~/=)` operators are different from
+standard equality and inequality since they allow additional attributes to be
+present. We write `"<li id=lastmod>"` as syntactic sugar for `TagOpen "li"
+[("id","lastmod")]`. If we just wanted any open tag with the given `id`
+attribute we could have written `(~== TagOpen "" [("id","lastmod")])` and this
+would have matched. Any empty strings in the second element of the match are
+considered as wildcards.
+* Next we take two elements: the `<li>` tag and the text node immediately
+  following.
+* We call the `innerText` function to get all the text values from inside,
+  which will just be the text node following the `lastmod`.
+* We split the string into a series of words and drop the first six, i.e. the
+  words `This`, `page`, `was`, `last`, `modified` and `on`
+* We reassemble the remaining words into the resulting string `9 September
+  2013, at 22:38.`
 
 This code may seem slightly messy, and indeed it is - often that is the nature of extracting information from a tag soup.
 
@@ -104,34 +154,53 @@
 
 First we spot that the page helpfully has named anchors, there is a current work anchor, and after that is one for Haskell. We can extract all the information between them with a simple `take`/`drop` pair:
 
-    takeWhile (~/= "<a name=haskell>") $
-    drop 5 $ dropWhile (~/= "<a name=current>") tags
+```haskell
+takeWhile (~/= "<a name=haskell>") $
+drop 5 $ dropWhile (~/= "<a name=current>") tags
+```
 
 This code drops until you get to the "current" section, then takes until you get to the "haskell" section, ensuring we only look at the important bit of the page. Next we want to find all hyperlinks within this section:
 
-    map f $ sections (~== "<A>") $ ...
+```haskell
+map f $ sections (~== "<A>") $ ...
+```
 
 Remember that the function to select all tags with name "A" could have been written as `(~== TagOpen "A" [])`, or alternatively `isTagOpenName "A"`. Afterwards we map each item with an `f` function. This function needs to take the tags starting just after the link, and find the text inside the link.
 
-    f = dequote . unwords . words . fromTagText . head . filter isTagText
+```haskell
+f = dequote . unwords . words . fromTagText . head . filter isTagText
+```
 
 Here the complexity of interfacing to human written markup comes through. Some of the links are in italic, some are not - the `filter` drops all those that are not, until we find a pure text node. The `unwords . words` deletes all multiple spaces, replaces tabs and newlines with spaces and trims the front and back - a neat trick when dealing with text which has spacing at the source code but not when displayed. The final thing to take account of is that some papers are given with quotes around the name, some are not - dequote will remove the quotes if they exist.
 
 For completeness, we now present the entire example:
-
-    spjPapers :: IO ()
-    spjPapers = do
-            tags <- fmap parseTags $ openURL "http://research.microsoft.com/en-us/people/simonpj/"
-            let links = map f $ sections (~== "<A>") $
-                        takeWhile (~/= "<a name=haskell>") $
-                        drop 5 $ dropWhile (~/= "<a name=current>") tags
-            putStr $ unlines links
-        where
-            f :: [Tag] -> String
-            f = dequote . unwords . words . fromTagText . head . filter isTagText
-
-            dequote ('\"':xs) | last xs == '\"' = init xs
-            dequote x = x
+
+```haskell
+module Main where
+
+import Network.HTTP
+import Text.HTML.TagSoup
+
+openURL :: String -> IO String
+openURL x = getResponseBody =<< simpleHTTP (getRequest x)
+
+spjPapers :: IO ()
+spjPapers = do
+        tags <- parseTags <$> openURL "http://research.microsoft.com/en-us/people/simonpj/"
+        let links = map f $ sections (~== "<A>") $
+                    takeWhile (~/= "<a name=haskell>") $
+                    drop 5 $ dropWhile (~/= "<a name=current>") tags
+        putStr $ unlines links
+    where
+        f :: [Tag String] -> String
+        f = dequote . unwords . words . fromTagText . head . filter isTagText
+
+        dequote ('\"':xs) | last xs == '\"' = init xs
+        dequote x = x
+
+main :: IO ()
+main = spjPapers
+```
 
 ## Other Examples
 
@@ -139,30 +208,54 @@
 
 ### My Papers
 
-    ndmPapers :: IO ()
-    ndmPapers = do
-            tags <- fmap parseTags $ openURL "http://community.haskell.org/~ndm/downloads/"
-            let papers = map f $ sections (~== "<li class=paper>") tags
-            putStr $ unlines papers
-        where
-            f :: [Tag] -> String
-            f xs = fromTagText (xs !! 2)
+```haskell
+module Main where
+
+import Network.HTTP
+import Text.HTML.TagSoup
+
+openURL :: String -> IO String
+openURL x = getResponseBody =<< simpleHTTP (getRequest x)
+
+ndmPapers :: IO ()
+ndmPapers = do
+        tags <- parseTags <$> openURL "http://community.haskell.org/~ndm/downloads/"
+        let papers = map f $ sections (~== "<li class=paper>") tags
+        putStr $ unlines papers
+    where
+        f :: [Tag String] -> String
+        f xs = fromTagText (xs !! 2)
+
+main :: IO ()
+main = ndmPapers
+```
 
 ### UK Time
 
-    currentTime :: IO ()
-    currentTime = do
-        tags <- fmap parseTags $ openURL "http://www.timeanddate.com/worldclock/city.html?n=136"
-        let time = fromTagText (dropWhile (~/= "<strong id=ct>") tags !! 1)
-        putStrLn time
+```haskell
+module Main where
+
+import Network.HTTP
+import Text.HTML.TagSoup
+
+openURL :: String -> IO String
+openURL x = getResponseBody =<< simpleHTTP (getRequest x)
+
+currentTime :: IO ()
+currentTime = do
+    tags <- parseTags <$> openURL "http://www.timeanddate.com/worldclock/uk/london"
+    let time = fromTagText (dropWhile (~/= "<span id=ct>") tags !! 1)
+    putStrLn time
+
+main :: IO ()
+main = currentTime
+```
 
-<h2>Related Projects</h2>
+## Related Projects
 
-<ul>
-    <li><a href="http://tagsoup.info/">TagSoup for Java</a> - an independently written malformed HTML parser for Java. Including <a href="http://tagsoup.info/#other">links to other</a> HTML parsers.</li>
-    <li><a href="http://www.fh-wedel.de/~si/HXmlToolbox/">HXT: Haskell XML Toolbox</a> - a more comprehensive XML parser, giving the option of using TagSoup as a lexer.</li>
-    <li><a href="http://www.fh-wedel.de/~si/HXmlToolbox/#rel">Other Related Work</a> - as described on the HXT pages.</li>
-    <li><a href="http://therning.org/magnus/archives/367">Using TagSoup with Parsec</a> - a nice combination of Haskell libraries.</li>
-    <li><a href="http://hackage.haskell.org/packages/tagsoup-parsec">tagsoup-parsec</a> - a library for easily using TagSoup as a token type in Parsec.</li>
-    <li><a href="http://hackage.haskell.org/packages/archive/wraxml/latest/doc/html/Text-XML-WraXML-Tree-TagSoup.html">WraXML</a> - construct a lazy tree from TagSoup lexemes.</li>
-</ul>
+* [TagSoup for Java](http://tagsoup.info/) - an independently written malformed HTML parser for Java. Including [links to other](http://tagsoup.info/#other) HTML parsers.
+* [HXT: Haskell XML Toolbox](http://www.fh-wedel.de/~si/HXmlToolbox/) - a more comprehensive XML parser, giving the option of using TagSoup as a lexer.
+* [Other Related Work](http://www.fh-wedel.de/~si/HXmlToolbox/#rel) - as described on the HXT pages.
+* [Using TagSoup with Parsec](http://therning.org/magnus/archives/367) - a nice combination of Haskell libraries.
+* [tagsoup-parsec](http://hackage.haskell.org/packages/tagsoup-parsec) - a library for easily using TagSoup as a token type in Parsec.
+* [WraXML](http://hackage.haskell.org/packages/archive/wraxml/latest/doc/html/Text-XML-WraXML-Tree-TagSoup.html) - construct a lazy tree from TagSoup lexemes.
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tagsoup-0.13.8/Setup.hs new/tagsoup-0.13.9/Setup.hs
--- old/tagsoup-0.13.8/Setup.hs 2016-01-10 22:15:15.000000000 +0100
+++ new/tagsoup-0.13.9/Setup.hs 2016-03-15 13:07:28.000000000 +0100
@@ -1,3 +1,2 @@
-#! /usr/bin/env runhaskell
 import Distribution.Simple
 main = defaultMain
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tagsoup-0.13.8/TagSoup/Sample.hs new/tagsoup-0.13.9/TagSoup/Sample.hs
--- old/tagsoup-0.13.8/TagSoup/Sample.hs 2016-01-10 22:15:15.000000000 +0100
+++ new/tagsoup-0.13.9/TagSoup/Sample.hs 2016-03-15 13:07:28.000000000 +0100
@@ -5,7 +5,6 @@
 
 import Control.Exception
 import Control.Monad
-import Data.Char
 import Data.List
 import System.Cmd
 import System.Directory
@@ -47,13 +46,14 @@
 
 
 {-
-<li id="viewcount">This page has been accessed 6,985,922 times.</li>
+<li id="lastmod"> This page was last modified on 9 September 2013, at 22:38.</li>
 -}
-haskellHitCount = do
-    src <- openItem "http://haskell.org/haskellwiki/Haskell"
-    let count = fromFooter $ parseTags src
-    putStrLn $ "haskell.org has been hit " ++ count ++ " times"
-    where fromFooter = filter isDigit . innerText . take 2 . dropWhile (~/= "<li id=viewcount>")
+haskellLastModifiedDateTime :: IO ()
+haskellLastModifiedDateTime = do
+    src <- openItem "http://wiki.haskell.org/Haskell"
+    let lastModifiedDateTime = fromFooter $ parseTags src
+    putStrLn $ "wiki.haskell.org was last modified on " ++ lastModifiedDateTime
+    where fromFooter = unwords . drop 6 . words . innerText . take 2 . dropWhile (~/= "<li id=lastmod>")
 
 
 googleTechNews :: IO ()
@@ -75,7 +75,7 @@
 
 spjPapers :: IO ()
 spjPapers = do
-        tags <- fmap parseTags $ openItem "http://research.microsoft.com/en-us/people/simonpj/"
+        tags <- parseTags <$> openItem "http://research.microsoft.com/en-us/people/simonpj/"
         let links = map f $ sections (~== "<A>") $
                     takeWhile (~/= "<a name=haskell>") $
                     drop 5 $ dropWhile (~/= "<a name=current>") tags
@@ -90,7 +90,7 @@
 
 ndmPapers :: IO ()
 ndmPapers = do
-        tags <- fmap parseTags $ openItem "http://community.haskell.org/~ndm/downloads/"
+        tags <- parseTags <$> openItem "http://community.haskell.org/~ndm/downloads/"
         let papers = map f $ sections (~== "<li class=paper>") tags
         putStr $ unlines papers
     where
@@ -100,9 +100,9 @@
 
 currentTime :: IO ()
 currentTime = do
-    tags <- fmap parseTags $ openItem "http://www.timeanddate.com/worldclock/city.html?n=136"
-    let res = fromTagText (dropWhile (~/= "<strong id=ct>") tags !! 1)
-    putStrLn res
+    tags <- parseTags <$> openItem "http://www.timeanddate.com/worldclock/uk/london"
+    let time = fromTagText (dropWhile (~/= "<span id=ct>") tags !! 1)
+    putStrLn time
 
 
 
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tagsoup-0.13.8/Text/HTML/TagSoup/Implementation.hs new/tagsoup-0.13.9/Text/HTML/TagSoup/Implementation.hs
--- old/tagsoup-0.13.8/Text/HTML/TagSoup/Implementation.hs 2016-01-10 22:15:15.000000000 +0100
+++ new/tagsoup-0.13.9/Text/HTML/TagSoup/Implementation.hs 2016-03-15 13:07:28.000000000 +0100
@@ -46,7 +46,7 @@
 
 
 expand :: Position -> String -> S
-expand p text = res
+expand p text = p `seq` res
     where res = S{s = res
                  ,tl = expand (positionChar p (head text)) (tail text)
                  ,hd = if null text then '\0' else head text
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tagsoup-0.13.8/Text/HTML/TagSoup/Render.hs new/tagsoup-0.13.9/Text/HTML/TagSoup/Render.hs
--- old/tagsoup-0.13.8/Text/HTML/TagSoup/Render.hs 2016-01-10 22:15:15.000000000 +0100
+++ new/tagsoup-0.13.9/Text/HTML/TagSoup/Render.hs 2016-03-15 13:07:28.000000000 +0100
@@ -1,4 +1,4 @@
-{-# LANGUAGE PatternGuards #-}
+{-# LANGUAGE PatternGuards, OverloadedStrings #-}
 {-|
     This module converts a list of 'Tag' back into a string.
 -}
@@ -29,7 +29,6 @@
 escapeHTML :: StringLike str => str -> str
 escapeHTML = fromString . escapeXML . toString
 
-
 -- | The default render options value, described in 'RenderOptions'.
 renderOptions :: StringLike str => RenderOptions str
 renderOptions = RenderOptions escapeHTML (\x -> toString x == "br") (\x -> toString x == "script")
@@ -50,34 +49,32 @@
 renderTagsOptions :: StringLike str => RenderOptions str -> [Tag str] -> str
 renderTagsOptions opts = strConcat . tags
     where
-        s = fromString
-        ss x = [s x]
-
+        ss x = [x]
+
         tags (TagOpen name atts:TagClose name2:xs)
-            | name == name2 && optMinimize opts name = open name atts (s " /") ++ tags xs
+            | name == name2 && optMinimize opts name = open name atts " /" ++ tags xs
         tags (TagOpen name atts:xs)
-            | Just ('?',_) <- uncons name = open name atts (s " ?") ++ tags xs
+            | Just ('?',_) <- uncons name = open name atts " ?" ++ tags xs
             | optRawTag opts name =
                 let (a,b) = break (== TagClose name) (TagOpen name atts:xs)
                 in concatMap (\x -> case x of TagText s -> [s]; _ -> tag x) a ++ tags b
         tags (x:xs) = tag x ++ tags xs
         tags [] = []
 
-        tag (TagOpen name atts) = open name atts (s "")
-        tag (TagClose name) = [s "</", name, s ">"]
+        tag (TagOpen name atts) = open name atts ""
+        tag (TagClose name) = ["</", name, ">"]
         tag (TagText text) = [txt text]
         tag (TagComment text) = ss "<!--" ++ com text ++ ss "-->"
         tag _ = ss ""
 
         txt = optEscape opts
-        open name atts shut = [s "<",name] ++ concatMap att atts ++ [shut,s ">"]
-        att (x,y) | xnull && ynull = [s " \"\""]
-                  | ynull = [s " ", x]
-                  | xnull = [s " \"",txt y,s "\""]
-                  | otherwise = [s " ",x,s "=\"",txt y,s "\""]
-            where (xnull, ynull) = (strNull x, strNull y)
+        open name atts shut = ["<",name] ++ concatMap att atts ++ [shut,">"]
+        att ("","") = [" \"\""]
+        att (x ,"") = [" ", x]
+        att ("", y) = [" \"",txt y,"\""]
+        att (x , y) = [" ",x,"=\"",txt y,"\""]
 
-        com xs | Just ('-',xs) <- uncons xs, Just ('-',xs) <- uncons xs, Just ('>',xs) <- uncons xs = s "-- >" : com xs
+        com xs | Just ('-',xs) <- uncons xs, Just ('-',xs) <- uncons xs, Just ('>',xs) <- uncons xs = "-- >" : com xs
         com xs = case uncons xs of
             Nothing -> []
             Just (x,xs) -> fromChar x : com xs
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tagsoup-0.13.8/Text/HTML/TagSoup/Tree.hs new/tagsoup-0.13.9/Text/HTML/TagSoup/Tree.hs
--- old/tagsoup-0.13.8/Text/HTML/TagSoup/Tree.hs 2016-01-10 22:15:15.000000000 +0100
+++ new/tagsoup-0.13.9/Text/HTML/TagSoup/Tree.hs 2016-03-15 13:07:28.000000000 +0100
@@ -14,6 +14,7 @@
 import Text.HTML.TagSoup (parseTags, parseTagsOptions, renderTags, renderTagsOptions, ParseOptions(..), RenderOptions(..))
 import Text.HTML.TagSoup.Type
 import Control.Arrow
+import GHC.Exts (build)
 
 
 data TagTree str = TagBranch str [Attribute str] [TagTree str]
@@ -57,11 +58,15 @@
 parseTreeOptions opts str = tagTree $ parseTagsOptions opts str
 
 flattenTree :: [TagTree str] -> [Tag str]
-flattenTree xs = concatMap f xs
+flattenTree xs = build $ flattenTreeFB xs
+
+flattenTreeFB :: [TagTree str] -> (Tag str -> lst -> lst) -> lst -> lst
+flattenTreeFB xs cons nil = flattenTreeOnto xs nil
     where
-        f (TagBranch name atts inner) =
-            TagOpen name atts : flattenTree inner ++ [TagClose name]
-        f (TagLeaf x) = [x]
+        flattenTreeOnto [] tags = tags
+        flattenTreeOnto (TagBranch name atts inner:trs) tags =
+            TagOpen name atts `cons` flattenTreeOnto inner (TagClose name `cons` flattenTreeOnto trs tags)
+        flattenTreeOnto (TagLeaf x:trs) tags = x `cons` flattenTreeOnto trs tags
 
 renderTree :: StringLike str => [TagTree str] -> str
 renderTree = renderTags . flattenTree
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tagsoup-0.13.8/Text/StringLike.hs new/tagsoup-0.13.9/Text/StringLike.hs
--- old/tagsoup-0.13.8/Text/StringLike.hs 2016-01-10 22:15:15.000000000 +0100
+++ new/tagsoup-0.13.9/Text/StringLike.hs 2016-03-15 13:07:28.000000000 +0100
@@ -5,8 +5,9 @@
 -- This module provides an abstraction for String's as used inside TagSoup. It allows
 -- TagSoup to work with String (list of Char), ByteString.Char8, ByteString.Lazy.Char8,
 -- Data.Text and Data.Text.Lazy.
-module Text.StringLike where
+module Text.StringLike (StringLike(..), fromString, castString) where
 
+import Data.String
 import Data.Typeable
 
 import qualified Data.ByteString.Char8 as BS
@@ -17,7 +18,7 @@
 
 -- | A class to generalise TagSoup parsing over many types of string-like types.
 -- Examples are given for the String type.
-class (Typeable a, Eq a) => StringLike a where
+class (Typeable a, Eq a, IsString a) => StringLike a where
     -- | > empty = ""
     empty :: a
     -- | > cons = (:)
@@ -28,8 +29,6 @@
 
     -- | > toString = id
     toString :: a -> String
-    -- | > fromString = id
-    fromString :: String -> a
     -- | > fromChar = return
     fromChar :: Char -> a
     -- | > strConcat = concat
@@ -49,7 +48,6 @@
     uncons [] = Nothing
     uncons (x:xs) = Just (x, xs)
     toString = id
-    fromString = id
     fromChar = (:[])
     strConcat = concat
     empty = []
@@ -60,7 +58,6 @@
 instance StringLike BS.ByteString where
     uncons = BS.uncons
     toString = BS.unpack
-    fromString = BS.pack
     fromChar = BS.singleton
     strConcat = BS.concat
     empty = BS.empty
@@ -71,7 +68,6 @@
 instance StringLike LBS.ByteString where
     uncons = LBS.uncons
     toString = LBS.unpack
-    fromString = LBS.pack
     fromChar = LBS.singleton
     strConcat = LBS.concat
     empty = LBS.empty
@@ -82,7 +78,6 @@
 instance StringLike T.Text where
     uncons = T.uncons
     toString = T.unpack
-    fromString = T.pack
     fromChar = T.singleton
     strConcat = T.concat
     empty = T.empty
@@ -93,7 +88,6 @@
 instance StringLike LT.Text where
     uncons = LT.uncons
     toString = LT.unpack
-    fromString = LT.pack
     fromChar = LT.singleton
     strConcat = LT.concat
     empty = LT.empty
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/tagsoup-0.13.8/tagsoup.cabal new/tagsoup-0.13.9/tagsoup.cabal
--- old/tagsoup-0.13.8/tagsoup.cabal 2016-01-10 22:15:15.000000000 +0100
+++ new/tagsoup-0.13.9/tagsoup.cabal 2016-03-15 13:07:28.000000000 +0100
@@ -1,6 +1,6 @@
 cabal-version: >= 1.6
 name: tagsoup
-version: 0.13.8
+version: 0.13.9
 copyright: Neil Mitchell 2006-2016
 author: Neil Mitchell <ndmitch...@gmail.com>
 maintainer: Neil Mitchell <ndmitch...@gmail.com>
@@ -11,7 +11,7 @@
 license-file: LICENSE
 build-type: Simple
 synopsis: Parsing and extracting information from (possibly malformed) HTML/XML documents
-tested-with: GHC==7.10.1, GHC==7.8.4, GHC==7.6.3, GHC==7.4.2, GHC==7.2.2
+tested-with: GHC==8.0.1, GHC==7.10.3, GHC==7.8.4, GHC==7.6.3, GHC==7.4.2
 description:
     TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification,
     and can be used to parse either well-formed XML, or unstructured and malformed HTML
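
Reviewer note on the `flattenTree` hunk in Tree.hs (changelog entry #33): the sketch below illustrates the accumulator idea behind the O(n^2) to O(n) change. It is not the library code. `Tag` and `TagTree` here are simplified stand-ins (no attributes, fewer constructors), and the names `flattenNaive`, `flattenOnto` and `flattenLinear` are made up for this note. The committed version additionally routes the result through `GHC.Exts.build` with abstract `cons`/`nil`, which this sketch leaves out.

```haskell
-- Simplified stand-ins for TagSoup's Tag and TagTree (assumption: attributes
-- and the remaining constructors are dropped to keep the sketch short).
data Tag = TagOpen String | TagClose String | TagText String
  deriving (Eq, Show)

data TagTree = TagBranch String [TagTree] | TagLeaf Tag
  deriving (Eq, Show)

-- Shape of the old code: each branch appends its closing tag with (++),
-- which walks the already flattened children again at every nesting level.
flattenNaive :: [TagTree] -> [Tag]
flattenNaive = concatMap f
  where
    f (TagBranch name inner) = TagOpen name : flattenNaive inner ++ [TagClose name]
    f (TagLeaf x) = [x]

-- Shape of the new code: the second argument is the output that follows the
-- current subtree, so every tag is consed onto the result exactly once.
flattenOnto :: [TagTree] -> [Tag] -> [Tag]
flattenOnto [] rest = rest
flattenOnto (TagBranch name inner : trs) rest =
  TagOpen name : flattenOnto inner (TagClose name : flattenOnto trs rest)
flattenOnto (TagLeaf x : trs) rest = x : flattenOnto trs rest

flattenLinear :: [TagTree] -> [Tag]
flattenLinear xs = flattenOnto xs []
```

Loaded in GHCi, both functions should agree on any input; the difference only shows up as running time on deeply nested trees, where the naive version re-traverses inner results once per enclosing branch.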