On Mon, Nov 22, 2010 at 02:20:22PM -0200, José Romildo Malaquias wrote:
>
> I am looking for some examples of code using Text.HTML.TagSoup.Parsec,
> from the tagsoup-parsec package.
In an attempt to learn how to use tagsoup together with parsec for web
scraping, I rewrote the application that displays the Haskell.org hit
count, explained in the "Drinking TagSoup by Example" tutorial [1]. The
source code is attached.

First I tried to use tagsoup-parsec [2], but it was not very helpful.
Therefore I wrote a few parser combinators myself, inspired by
tagsoup-parsec and the "TagSoup, meet Parsec!" blog post [3].

I am posting the program here so that other Haskell programmers can
comment on it.

I would also like to make some suggestions to the author of
tagsoup-parsec:

a) export more functions, like tagEater, which may be needed in order to
   define new parsers or parser combinators; I needed them, but they
   were not usable because they are not exported;

b) add more basic parsers and parser combinators (at least the ones I
   have defined in my program);

c) add some examples;

d) use parsec version 3.

[1] http://community.haskell.org/~ndm/darcs/tagsoup/tagsoup.htm
[2] http://hackage.haskell.org/package/tagsoup-parsec
[3] http://therning.org/magnus/archives/367

Regards,

Romildo
module Main (main) where

import Text.Parsec hiding (satisfy)
import Text.HTML.TagSoup (parseTags, Tag(TagText), (~==))
import Text.HTML.Download (openURL)
import Data.Char (isDigit)
import Data.List (findIndex)

main = do
  src <- openURL "http://www.haskell.org/haskellwiki/Haskell"
  let x = tagParse counter (parseTags src)
  putStrLn $ "haskell.org has been hit " ++ show x ++ " times"

counter = do
  skipTo (tag "<div class=printfooter>")
  count 2 (skipTo (tag "<p>"))
  s <- tagText ""
  let ss = words s
  case findIndex (== "times.") ss of
    Just i -> let num = ss !! (i - 1)
              in return (read (filter isDigit num) :: Int)
    Nothing -> parserZero

--
-- tag parser library
--

tagParse p ts = either (error . show) id $ parse p "tagsoup" ts

tagEater matcher = tokenPrim show (\pos t ts -> incSourceLine pos 1) matcher

anyTag = tagEater Just

satisfy f = tagEater (\t -> if f t then Just t else Nothing)

tag t = satisfy (~== t) <?> show t

tagText str = do
  TagText x <- tag (TagText str)
  return x

skipTo p = try p <|> (anyTag >> skipTo p)
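
For anyone who wants to try the combinators without touching the network, here is a minimal sketch of the same idea with explicit type signatures, run against an inline HTML string. The TagParser alias, the text and hitCount helpers, and the sample page are illustrative names and made-up data of my own, not part of tagsoup-parsec. It assumes the tagsoup and parsec (version 3) packages are installed.

```haskell
module Main (main) where

import Data.Char (isDigit)
import Text.HTML.TagSoup (Tag (TagText), parseTags, (~==))
import Text.Parsec

-- Hypothetical alias: a parser over a list of tags, with no user state.
type TagParser = Parsec [Tag String] ()

-- Consume one tag if the matcher accepts it; each tag counts as one "line".
tagEater :: (Tag String -> Maybe a) -> TagParser a
tagEater = tokenPrim show (\pos _ _ -> incSourceLine pos 1)

-- Accept any single tag.
anyTag :: TagParser (Tag String)
anyTag = tagEater Just

-- Match a tag against a TagSoup pattern written as a string, e.g. "<p>".
tag :: String -> TagParser (Tag String)
tag pat = tagEater (\t -> if t ~== pat then Just t else Nothing) <?> show pat

-- Hypothetical helper: accept a text node and return its contents.
text :: TagParser String
text = tagEater fromText <?> "text"
  where
    fromText (TagText s) = Just s
    fromText _           = Nothing

-- Skip tags until the given parser succeeds.
skipTo :: TagParser a -> TagParser a
skipTo p = try p <|> (anyTag >> skipTo p)

-- Hypothetical helper: digits of the word just before "times.", if any.
hitCount :: String -> Maybe Int
hitCount s = case break (== "times.") (words s) of
  (before@(_ : _), _ : _) -> case filter isDigit (last before) of
    [] -> Nothing
    ds -> Just (read ds)
  _ -> Nothing

-- Same navigation as the attached program: footer div, second <p>, its text.
counter :: TagParser Int
counter = do
  _ <- skipTo (tag "<div class=printfooter>")
  _ <- count 2 (skipTo (tag "<p>"))
  s <- text
  maybe parserZero return (hitCount s)

-- Made-up page fragment standing in for the real wiki page.
sample :: String
sample =
  "<html><body><div class='printfooter'><p>Retrieved from somewhere</p>"
    ++ "<p>This page has been accessed 1,234 times.</p></div></body></html>"

main :: IO ()
main = print (parse counter "sample" (parseTags sample))
```

Returning the result of parse directly, rather than calling error as tagParse does, leaves it to the caller to decide what to do with a ParseError.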
