On 13 November 2010 16:46, Neil Mitchell <[email protected]> wrote:
>> I've been working on a project that requires me to do screen scraping.
>
> If you are screen scraping HTML I think tagsoup is a very good choice.
> The use of tagsoup means that you have a real HTML 5 compliant parser
> underneath, and then you can use whatever technique you wish to split
> up the page text - and regular expressions/parsec might be a
> reasonable choice. I've written lots of screen scraping stuff with
> tagsoup, and it's usually very easy - the manual even walks you
> through a couple of examples:
> http://community.haskell.org/~ndm/darcs/tagsoup/tagsoup.htm

Agreed, the tagsoup library just works. I've used it plenty of times
for my scraping needs. E.g. scraping from paste sites:

https://github.com/chrisdone/amelie/blob/master/src/Amelie/Import.hs#L84
https://github.com/chrisdone/hpaste-feed/blob/master/main.hs#L65

You can always regex match on what tagsoup gives you, too.
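
For anyone who wants a starting point, here is a minimal sketch of the kind
of scraping discussed above: parse the page with tagsoup, then pick out what
you need from the tag list. It is not taken from the linked repositories. The
function names (extractLinks, linksUnder) and the sample markup are made up
for illustration, and a plain prefix check stands in for the regex step so
the example only depends on tagsoup and base.

```haskell
import Data.List (isPrefixOf)
import Text.HTML.TagSoup
  ( fromAttrib
  , innerText
  , isTagCloseName
  , isTagOpenName
  , parseTags
  , partitions
  )

-- | Hypothetical helper: collect (href, link text) pairs from an HTML string.
-- 'partitions' splits the tag list at every opening <a> tag, so each chunk
-- starts with the <a> itself; the link text is whatever precedes the next </a>.
extractLinks :: String -> [(String, String)]
extractLinks html =
  [ (fromAttrib "href" open, innerText (takeWhile (not . isTagCloseName "a") rest))
  | (open : rest) <- partitions (isTagOpenName "a") (parseTags html)
  ]

-- | Hypothetical helper: keep only links whose href starts with the given
-- prefix. A regex match on the href would slot in here instead of 'isPrefixOf'.
linksUnder :: String -> String -> [(String, String)]
linksUnder prefix = filter ((prefix `isPrefixOf`) . fst) . extractLinks
```

With the made-up input
"<p><a href=\"/paste/1\">first</a> and <a href=\"/about\">about</a></p>",
extractLinks returns both links, and linksUnder "/paste/" keeps only
("/paste/1","first"). Since parseTags accepts malformed markup, the same code
should cope with the messy HTML that real sites tend to serve.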
