On 13 November 2010 16:46, Neil Mitchell <[email protected]> wrote:
>> I've been working on a project that requires me to do screen scraping.
>
> If you are screen scraping HTML I think tagsoup is a very good choice.
> The use of tagsoup means that you have a real HTML 5 compliant parser
> underneath, and then you can use whatever technique you wish to split
> up the page text - and regular expressions/parsec might be a
> reasonable choice. I've written lots of screen scraping stuff with
> tagsoup, and it's usually very easy - the manual even walks you
> through a couple of examples:
> http://community.haskell.org/~ndm/darcs/tagsoup/tagsoup.htm

Agreed, the tagsoup library just works. I've used it plenty of times
for my scraping needs, for example to scrape paste sites:

https://github.com/chrisdone/amelie/blob/master/src/Amelie/Import.hs#L84

https://github.com/chrisdone/hpaste-feed/blob/master/main.hs#L65

You can always regex match on what tagsoup gives you, too.
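
As a rough sketch of that last point (not taken from the code linked
above): parse the page with tagsoup, keep the text nodes, then run a
regex over each one. The sample HTML, the name pasteIds and the pattern
are made up for illustration, and I'm assuming the regex-tdfa package
for (=~); any regex library with a similar operator would do.

```haskell
import Text.HTML.TagSoup (parseTags, isTagText, fromTagText)
import Text.Regex.TDFA ((=~))

-- Hypothetical sample input, standing in for a fetched page.
sampleHtml :: String
sampleHtml =
  "<ul><li>paste #123 by alice</li>\
  \<li>no number here</li>\
  \<li>paste #456 by bob</li></ul>"

-- Let tagsoup split the page into tags, keep only the text nodes,
-- and take the first regex match from each one. With a String result
-- type, (=~) yields "" when nothing matches, so those are dropped.
pasteIds :: String -> [String]
pasteIds html =
  [ m
  | t <- parseTags html
  , isTagText t
  , let m = fromTagText t =~ "#[0-9]+" :: String
  , not (null m)
  ]
```

Here pasteIds sampleHtml should pick out "#123" and "#456" and skip the
middle item. tagsoup deals with the markup, so the regex only ever sees
plain text.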
_______________________________________________
Haskell-Cafe mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/haskell-cafe
