Re: [Haskell-cafe] Accumulating related XML nodes using HXT
Apologies if this is a duplicate; the original appears to have gone astray.

On Wednesday 01 November 2006 10:57, Albert Lai wrote:
> Daniel McAllansmith [EMAIL PROTECTED] writes:
> > Hello.
> >
> > I have some html from which I want to extract records. Each record is
> > represented within a number of tr nodes, and all records' tr nodes are
> > contained by the same parent node.
>
> This is very poorly written HTML. The original structure of the data is
> destroyed - the parse tree no longer reflects the data structure. (If a
> record is to be displayed in several rows, there are proper ways.) It is
> syntactically incorrect: nested tr, and color in hr. (Just ask
> http://validator.w3.org/ .)

Indeed. The original is even worse, with overlapping nodes and other such
treasures which make navigation in HXT tricky at times.

> I trust that you are parsing this because you realize it is all wrong and
> you want to programmatically convert it to proper markup.

Yep! I sure wouldn't be doing this if I had control of the original HTML.

> Since the file is unstructured, I choose not to sweat over restoring the
> structure in an HXT arrow. The HXT arrow will return a flat list, just as
> the file is a flat ensemble.

I was about to write a follow-up just as your mail came in... I've ended up
with the same solution as you've kindly suggested.

Another option I came across is Control.Arrow.ArrowTree.changeChildren, which
could be used to restore a more normalised structure ready for more
processing.

Thanks
Daniel
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
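For what it's worth, here is a minimal sketch of the kind of regrouping function one might hand to changeChildren. As far as I recall, changeChildren takes a function from a list of child trees to a list of child trees, but treat that as an assumption and check the HXT documentation. To stay self-contained the sketch uses a toy Node type rather than HXT's XmlTree, and the "record" wrapper element is a hypothetical name, not something from the thread.

```haskell
-- Toy stand-in for an XML node: a tag name plus children.
-- HXT's real XmlTree is richer; this type is an assumption for illustration.
data Node = Node String [Node]
  deriving (Show, Eq)

-- A row counts as a separator when it is a tr holding a single td
-- that holds a single hr, as in the fragment from the thread.
isSeparator :: Node -> Bool
isSeparator (Node "tr" [Node "td" [Node "hr" _]]) = True
isSeparator _                                     = False

-- Split the flat list of sibling rows at each separator and wrap every
-- non-empty run of rows in a hypothetical "record" node.
groupRows :: [Node] -> [Node]
groupRows rows =
  case break isSeparator rows of
    ([],  [])       -> []
    (run, [])       -> [Node "record" run]
    ([],  _ : rest) -> groupRows rest
    (run, _ : rest) -> Node "record" run : groupRows rest
```

With the table from the thread, the six sibling rows would become two "record" nodes of two rows each, and the hr rows would be dropped.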
Re: [Haskell-cafe] Accumulating related XML nodes using HXT
Daniel McAllansmith [EMAIL PROTECTED] writes:
> Hello.
>
> I have some html from which I want to extract records. Each record is
> represented within a number of tr nodes, and all records' tr nodes are
> contained by the same parent node.

This is very poorly written HTML. The original structure of the data is
destroyed - the parse tree no longer reflects the data structure. (If a
record is to be displayed in several rows, there are proper ways.) It is
syntactically incorrect: nested tr, and color in hr. (Just ask
http://validator.w3.org/ .)

I trust that you are parsing this because you realize it is all wrong and
you want to programmatically convert it to proper markup.

Since the file is unstructured, I choose not to sweat over restoring the
structure in an HXT arrow. The HXT arrow will return a flat list, just as
the file is a flat ensemble. The list looks like:

    ["/prod17", "Television", " (code: 17)", "A very nice telly.",
     "/prod24", "Cyclotron", " (code: 24)", "Mind your fillings."]

I then use a pure function to decompose this list four items at a time to
emit the desired records. This is trivial outside HXT arrows.

I use tuples, and every field is a string; you can easily change the code
to produce Prod's, turn (code: 17) into the number 17, etc.

Here is a complete, validated HTML 4 file containing the table, just so
that my program below actually has valid input.

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
    <title>Products</title>
    </head>
    <body>
    <table>
    <tr>
    <td><strong>Product:</strong></td>
    <td><strong><a href="/prod17">Television</a></strong> (code: 17)</td>
    </tr>
    <tr>
    <td><strong>Description:</strong></td>
    <td>A very nice telly.</td>
    </tr>
    <tr>
    <td><hr></td>
    </tr>
    <tr>
    <td><strong>Product:</strong></td>
    <td><strong><a href="/prod24">Cyclotron</a></strong> (code: 24)</td>
    </tr>
    <tr>
    <td><strong>Description:</strong></td>
    <td>Mind your fillings.</td>
    </tr>
    <tr>
    <td><hr></td>
    </tr>
    </table>
    </body>
    </html>

Here is my program:

    import Text.XML.HXT.Arrow

    main = do { unstructured <- runX (p "table.html")
              ; let structured = s unstructured
              ; print structured
              }

    -- Collect, in document order, every string of interest found in the
    -- td cells of the table.
    p filename =
        readDocument [(a_parse_html,"1")] filename >>>
        deep (isElem >>> hasName "table") >>>
        getChildren >>> isElem >>> hasName "tr" >>>
        getChildren >>> isElem >>> hasName "td" >>>
        getChildren >>>
        p1 <+> p2

    -- A link inside strong: emit its href, then its text.
    p1 = isElem >>> hasName "strong" >>> getChildren >>>
         isElem >>> hasName "a" >>>
         getAttrValue "href" <+> (getChildren >>> getText)

    -- Plain text directly inside a td.
    p2 = getText

    -- Decompose the flat list four items at a time.
    s (a:b:c:d: rest) = (a,b,c,d) : s rest
    s _ = []
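The remark about producing Prod's and turning (code: 17) into the number 17 could be sketched roughly as follows. This is only an illustration: the Prod field names, parseCode, toProd, chunk4 and toProds are all hypothetical names, and the field order is guessed from "Prod Television 17 /prod17 A very nice telly." in the original question. It is pure Haskell and needs no HXT.

```haskell
import Data.Char (isDigit)
import Data.Maybe (mapMaybe)

-- Hypothetical record type; the field names are assumptions.
data Prod = Prod
  { prodName :: String
  , prodCode :: Int
  , prodUrl  :: String
  , prodDesc :: String
  } deriving (Show, Eq)

-- Pull the first run of digits out of text such as " (code: 17)".
-- Returns Nothing when the text contains no digits.
parseCode :: String -> Maybe Int
parseCode str =
  case takeWhile isDigit (dropWhile (not . isDigit) str) of
    []     -> Nothing
    digits -> Just (read digits)

-- Build one record from an (href, name, code text, description) tuple.
toProd :: (String, String, String, String) -> Maybe Prod
toProd (url, name, codeText, desc) = do
  code <- parseCode codeText
  return (Prod name code url desc)

-- Same four-at-a-time decomposition as `s` in the program above.
chunk4 :: [a] -> [(a, a, a, a)]
chunk4 (a:b:c:d:rest) = (a, b, c, d) : chunk4 rest
chunk4 _              = []

-- Flat list of strings in, typed records out; tuples whose code text
-- has no digits are silently dropped.
toProds :: [String] -> [Prod]
toProds = mapMaybe toProd . chunk4
```

Dropping unparsable tuples is a design choice made for brevity; returning an Either with the offending tuple would be kinder when the input is as messy as described.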
[Haskell-cafe] Accumulating related XML nodes using HXT
Hello.

I have some html from which I want to extract records. Each record is
represented within a number of tr nodes, and all records' tr nodes are
contained by the same parent node.

The things I've tried so far end up giving me the cartesian product of
record fields, so for the html fragment included below I'd end up with:

    [ Prod "Television" 17 "/prod17" "A very nice telly."
    , Prod "Television" 17 "/prod17" "Mind your fillings."
    , Prod "Cyclotron" 24 "/prod24" "A very nice telly."
    , Prod "Cyclotron" 24 "/prod24" "Mind your fillings."
    ]

instead of:

    [ Prod "Television" 17 "/prod17" "A very nice telly."
    , Prod "Cyclotron" 24 "/prod24" "Mind your fillings."
    ]

How should I go about accumulating related tr nodes into individual records?

Thanks
Daniel

HTML fragment follows:

    ...
    <tr>
    <tr>
    <td><strong>Product:</strong></td>
    <td><strong><a href="/prod17">Television</a></strong> (code: 17)</td>
    </tr>
    <tr>
    <td><strong>Description:</strong></td>
    <td>A very nice telly.</td>
    </tr>
    <tr>
    <td><hr color="#0"></td>
    </tr>
    <tr>
    <td><strong>Product:</strong></td>
    <td><strong><a href="/prod24">Cyclotron</a></strong> (code: 24)</td>
    </tr>
    <tr>
    <td><strong>Description:</strong></td>
    <td>Mind your fillings.</td>
    </tr>
    <tr>
    <td><hr color="#0"></td>
    </tr>
    </tr>
    ...
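The code that produced the cartesian product is not shown, so the following is only a guess at the mechanism. HXT arrows are list arrows that may return many results, and combining two independent multi-result queries pairs every result of one with every result of the other. The plain-list sketch below mimics that behaviour without HXT; names, descs, crossed and paired are hypothetical names used only here.

```haskell
-- Field values as two separate "find all matches" queries might return
-- them for the fragment above.
names, descs :: [String]
names = ["Television", "Cyclotron"]
descs = ["A very nice telly.", "Mind your fillings."]

-- Combining two independent multi-result queries pairs every name with
-- every description: the cartesian product (four pairs here).
crossed :: [(String, String)]
crossed = [ (n, d) | n <- names, d <- descs ]

-- Pairing by position in document order keeps each description with
-- its own product (two pairs here).
paired :: [(String, String)]
paired = zip names descs
```

Pairing by position only works while every record contributes exactly one value to each list, which is why collecting one flat list in document order and chunking it is the more robust variant of the same idea.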