[Haskell-cafe] Web processing

2008-08-02 Thread Rafael C. de Almeida
Hello,

I understand that nowadays there are several frameworks and wrapper
libraries for making sense of the XHTML documents you find on the
web -- that is, for making life easier for those who want to process
the semi-structured data found on websites.

I don't have much experience in that field myself, but I want to learn a
little more about how I can, for instance, associate information from
one site with information on another site, even though the data is
structured differently in the two places. Does anyone know of libraries
that would help me with that sort of work? I hope I'm being clear.

[]'s
Rafael
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Web processing

2008-08-02 Thread Jeremy Shaw
Hello,

I would recommend using TagSoup:

http://www-users.cs.york.ac.uk/~ndm/tagsoup/

The tutorial is easy to follow and has good advice:

http://www.cs.york.ac.uk/fp/darcs/tagsoup/tagsoup.htm

I would not bother trying to use a real XML parser, because I suspect
that many of the XHTML pages you want to parse are not actually valid
XHTML, which means an XML parser will fail on them. Also, some of the
sites you are interested in might not be XHTML at all. So using TagSoup
for everything seems simplest.
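
For example, TagSoup happily tokenises markup that any strict XML
parser would reject. A minimal sketch (assuming a recent TagSoup where
Tag is parameterised by the string type):

    import Text.HTML.TagSoup

    -- Unclosed and unbalanced tags are no problem: parseTags just
    -- produces a flat list of tokens instead of failing.
    broken :: [Tag String]
    broken = parseTags "<p>some <b>broken markup<p>with no closing tags"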

The process is very lo-fi. Write some code using TagSoup which scrapes
the data you care about from the web pages and turns it into Haskell
data structures. This code should not be clever, and it will need to
be updated whenever the site you are scraping changes enough to break
your code.
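
For instance, an untested sketch (the <a class=story> pattern and the
Story record are made up; adjust them to whatever the real page uses):

    import Text.HTML.TagSoup

    -- A hypothetical record for the data we care about.
    data Story = Story { storyTitle :: String, storyUrl :: String }
        deriving Show

    -- Collect every <a class="story"> link into a Story value,
    -- using TagSoup's sections and (~==) tag-matching helpers.
    stories :: String -> [Story]
    stories html =
        [ Story (innerText body) (fromAttrib "href" open)
        | open:rest <- sections (~== "<a class=story>") (parseTags html)
        , let body = takeWhile (~/= "</a>") rest
        ]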

This process should work fine if you are talking about scraping data
from some specific sites.

If you want to make a web crawler which automatically finds relevant
pages and scrapes the data, then that is a much bigger project. You
will still want to use something like TagSoup to do the initial
parsing, but extracting the data will be much trickier (though,
possibly worth billions of $$$ if done well).

j.

ps. I only have experience with TagSoup, so there may be other
libraries which are even better. The key feature of TagSoup is that it
allows you to process malformed, invalid HTML -- which is important if
you don't control the creation of the HTML you are parsing.



Re: [Haskell-cafe] Web processing

2008-08-02 Thread Don Stewart
jeremy:
> I would recommend using TagSoup:
>
> http://www-users.cs.york.ac.uk/~ndm/tagsoup/

There's also a wrapper for this that uses curl and bytestrings for the
download part and exposes tag, XML, and RSS/Atom parsers for the
content itself:

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/download-curl
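
Untested, but usage should look roughly like this (assuming the
openAsTags helper from Network.Curl.Download, which fetches a URL and
returns either an error string or a TagSoup token list):

    import Network.Curl.Download (openAsTags)
    import Text.HTML.TagSoup

    -- Fetch a page and print every link target found on it.
    main :: IO ()
    main = do
        result <- openAsTags "http://www.haskell.org/"
        case result of
            Left err   -> putStrLn ("download failed: " ++ err)
            Right tags -> mapM_ putStrLn
                [ fromAttrib "href" t | t <- tags, isTagOpenName "a" t ]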

-- Don