On Mon, Nov 22, 2010 at 02:20:22PM -0200, José Romildo Malaquias wrote:
> 
> I am looking for some examples of code using Text.HTML.TagSoup.Parsec,
> from the tagsoup-parsec package.

In an attempt to learn how to use tagsoup together with parsec for web
scraping, I rewrote the application that displays the Haskell.org hit
count, explained in the "Drinking TagSoup by Example" tutorial [1]. The
source code is attached.

First I tried to use tagsoup-parsec [2], but it was not very
helpful. Therefore I wrote a few parser combinators myself (inspired by
tagsoup-parsec and the "TagSoup, meet Parsec!" blog post [3]).

I am posting the program here so that other Haskell programmers can
comment on it.

I would also like to make some suggestions to the author of
tagsoup-parsec:

a) export more functions, like tagEater, which may be needed in order to
define new parsers or parser combinators; I needed them, but they were
not usable because they are not exported;

b) add more basic parsers and parser combinators (at least the ones I
have defined in my program);

c) add some examples;

d) use parsec version 3.


[1] http://community.haskell.org/~ndm/darcs/tagsoup/tagsoup.htm
[2] http://hackage.haskell.org/package/tagsoup-parsec
[3] http://therning.org/magnus/archives/367

Regards,

Romildo

module Main (main) where

import Text.Parsec hiding (satisfy)
import Text.HTML.TagSoup (parseTags, Tag(TagText), (~==))
import Text.HTML.Download (openURL)
import Data.Char (isDigit)
import Data.List (findIndex)

-- Download the wiki front page and print the hit count found in its footer.
main :: IO ()
main =
  do src <- openURL "http://www.haskell.org/haskellwiki/Haskell"
     let hits = tagParse counter (parseTags src)
     putStrLn $ "haskell.org has been hit " ++ show hits ++ " times"


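-- Skip to the printfooter div, then to the second <p> after it, and read
-- the number that precedes the word "times." in the text that follows.
-- The parser fails (instead of crashing) if no such number is found.
counter :: Parsec [Tag String] () Int
counter =
  do skipTo (tag "<div class=printfooter>")
     count 2 (skipTo (tag "<p>"))
     s <- tagText ""          -- "" matches any text tag
     let ss = words s
     case findIndex (== "times.") ss of
       Just i
         | i > 0
         , let digits = filter isDigit (ss !! (i - 1))
         , not (null digits) -> return (read digits)
       _ -> parserZero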


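--
-- tag parser library
--

-- Parsers that consume a list of tags instead of characters.
type TagParser = Parsec [Tag String] ()

-- Run a tag parser, aborting with the parse error on failure.
tagParse :: TagParser a -> [Tag String] -> a
tagParse p ts =
  either (error . show) id $ parse p "tagsoup" ts

-- Primitive parser: consume one tag if the matcher accepts it.
-- Tags carry no source position, so each tag counts as one "line".
tagEater :: (Tag String -> Maybe a) -> TagParser a
tagEater matcher =
  tokenPrim show (\pos _ _ -> incSourceLine pos 1) matcher

-- Accept any single tag.
anyTag :: TagParser (Tag String)
anyTag = tagEater Just

-- Accept a tag that satisfies the predicate.
satisfy :: (Tag String -> Bool) -> TagParser (Tag String)
satisfy f =
  tagEater (\t -> if f t then Just t else Nothing)

-- Accept a tag matching the pattern t, where t is anything (~==)
-- accepts, such as the string "<p>" or a Tag value.
tag t = satisfy (~== t) <?> show t

-- Accept a text tag and return its text; "" matches any text.
tagText :: String -> TagParser String
tagText str = do TagText x <- tag (TagText str)
                 return x

-- Skip tags until p succeeds; fails if the input ends first.
skipTo :: TagParser a -> TagParser a
skipTo p = try p <|> (anyTag >> skipTo p)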
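
To try combinators like the ones above without a network connection, here
is a minimal sketch that runs them on a literal HTML fragment. It assumes
parsec 3 and a tagsoup version whose parseTags and (~==) work on String.
The names tagStr, anyText, sampleHtml and sampleCount are hypothetical
helpers invented for this illustration, and the fragment is made up; it
only imitates the shape of the footer the program looks for.

```haskell
module Main (main) where

import Text.Parsec
import Text.HTML.TagSoup (Tag(TagText), parseTags, (~==))
import Data.Char (isDigit)

-- Parsers over a list of tags rather than characters.
type TagParser = Parsec [Tag String] ()

-- Consume one tag if the matcher accepts it (same idea as tagEater above).
tagEater :: (Tag String -> Maybe a) -> TagParser a
tagEater = tokenPrim show (\pos _ _ -> incSourceLine pos 1)

-- Accept any single tag.
anyTag :: TagParser (Tag String)
anyTag = tagEater Just

-- Hypothetical helper: match a tag against a pattern string such as "<p>".
tagStr :: String -> TagParser (Tag String)
tagStr pat = tagEater (\t -> if t ~== pat then Just t else Nothing) <?> pat

-- Hypothetical helper: return the text of the next tag if it is a text tag.
anyText :: TagParser String
anyText = tagEater textOf <?> "text"
  where textOf (TagText s) = Just s
        textOf _           = Nothing

-- Skip tags until p succeeds.
skipTo :: TagParser a -> TagParser a
skipTo p = try p <|> (anyTag >> skipTo p)

-- Made-up fragment shaped like the footer the program searches for.
sampleHtml :: String
sampleHtml =
  "<div class=printfooter><p>a</p>" ++
  "<p>This page has been accessed 1,234 times.</p></div>"

-- Digits of the text that follows the second <p> in the printfooter div.
sampleCount :: Either ParseError Int
sampleCount = parse p "sample" (parseTags sampleHtml)
  where p = do _ <- skipTo (tagStr "<div class=printfooter>")
               _ <- count 2 (skipTo (tagStr "<p>"))
               s <- anyText
               return (read (filter isDigit s))

main :: IO ()
main = print sampleCount
```

With the fragment above this should print Right 1234. Returning Either
from parse, rather than calling error as tagParse does, lets the caller
decide what to do when the page layout changes.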

_______________________________________________
Haskell-Cafe mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/haskell-cafe
