Hi!
I'm trying to extract HCards (http://microformats.org/wiki/hcard) from
HTML documents. HCard is a microformat. Microformats is an attempt to
add semantic information to XML documents without adding any new tags.
This is done by adding semantic information in class attributes
instead (see the 'testXml' string below).
I'm trying to find a good way to extract HCards into Haskell records.
To do this I need to map XML elements with certain attribute values
onto record fields. Some of the elements are optional in the XML and I
represent that using Maybe fields in my record. The order of the
elements is not guaranteed only the way they are nested. This makes it
more difficult to first extract the fields I want into a list of
Strings and then map that onto my record since I need to tag each
string with the value it represents.
So my question is. How can I write the function
'extractElementsIntoRecords' below. Or, perhaps HXT is the wrong tool
for the job and I should be trying to walk the DOM tree instead?
module HCard where
import Text.XML.HXT.Arrow
data HCard = HCard
{
familyName :: String,
givenName :: String
org :: Maybe String
url :: Maybe String
} deriving Show
parseHCards xml = runX $ parseXml xml
parseXml xml =
readString [(a_parse_html, v_1)] xml >>>
deep (hasClassName "vcard") >>>
extractElementsIntoRecords
extractElementsIntoRecords = undefined
hasClassName s = hasAttrValue "class" (elem s . words)
testXml =
"<div class=\"vcard red\">" ++
" <a class=\"url\" href=\"http://tantek.com/\">" ++
" <span class=\"n\" style=\"display:none\">" ++
" <span class=\"family-name\">Çelik</span>" ++
" <span class=\"given-name\">Tantek</span>" ++
" </span>" ++
" <span class=\"fn\">Tantek Çelik</span>" ++
" </a>" ++
" <div class=\"org\">" ++
" <span class=\"organization-name\">Technorati</span>" ++
" </div>" ++
"</div>"
main = parseHCards textXml
Cheers,
Johan Tibell
_______________________________________________
Haskell-Cafe mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/haskell-cafe