Re: [Haskell-cafe] Accumulating related XML nodes using HXT

2006-11-01 Thread Daniel McAllansmith
Apologies if this is a duplicate, the original appears to have gone astray.

On Wednesday 01 November 2006 10:57, Albert Lai wrote:
> Daniel McAllansmith [EMAIL PROTECTED] writes:
> > Hello.
> >
> > I have some html from which I want to extract records.
> > Each record is represented within a number of tr nodes, and all records
> > tr nodes are contained by the same parent node.
>
> This is very poorly written HTML.  The original structure of the data
> is destroyed - the parse tree no longer reflects the data structure.
> (If a record is to be displayed in several rows, there are proper
> ways.)  It is syntactically incorrect: nested tr, and color in hr.
> (Just ask http://validator.w3.org/ .)

Indeed.  The original is even worse, with overlapping nodes and other such
treasures which make navigation in HXT tricky at times.

> I trust that you are parsing
> this because you realize it is all wrong and you want to
> programmatically convert it to proper markup.

Yep!  I sure wouldn't be doing this if I had control of the original HTML.


> Since the file is unstructured, I choose not to sweat over restoring
> the structure in an HXT arrow.  The HXT arrow will return a flat list,
> just as the file is a flat ensemble.

I was about to write a follow-up just as your mail came in... I've ended up 
with the same solution as you've kindly suggested.

Another option I came across is Control.Arrow.ArrowTree.changeChildren which 
could be used to restore a more normalised structure ready for more 
processing.
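
As a rough sketch of that idea: changeChildren takes a function over the
list of child nodes, so the regrouping itself can be written as a plain
list function.  The code below is not HXT code.  It uses Data.Tree as a
simplified stand-in for HXT's XmlTree, and the names Node, isSeparator,
groupRows and the "record" label are all made up for illustration.

```haskell
import Data.Tree (Tree (..))

-- Simplified stand-in for an HXT node: only an element name is kept.
type Node = Tree String

-- Assumption: a row whose only cell holds an <hr> ends a record.
isSeparator :: Node -> Bool
isSeparator (Node "tr" [Node "td" [Node "hr" _]]) = True
isSeparator _ = False

-- Wrap each run of rows up to a separator in a hypothetical "record" node.
-- Separator rows are discarded; trailing rows without a separator still
-- form a final record.
groupRows :: [Node] -> [Node]
groupRows [] = []
groupRows rows =
    case break isSeparator rows of
        ([], _ : rest) -> groupRows rest
        (grp, rest)    -> Node "record" grp : groupRows (drop 1 rest)
```

With real HXT nodes, a function of this shape over XmlTree values is what
would be handed to changeChildren on the parent of the tr nodes.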


Thanks
Daniel
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Accumulating related XML nodes using HXT

2006-10-31 Thread Albert Lai
Daniel McAllansmith [EMAIL PROTECTED] writes:

> Hello.
>
> I have some html from which I want to extract records.
> Each record is represented within a number of tr nodes, and all records
> tr nodes are contained by the same parent node.

This is very poorly written HTML.  The original structure of the data
is destroyed - the parse tree no longer reflects the data structure.
(If a record is to be displayed in several rows, there are proper
ways.)  It is syntactically incorrect: nested tr, and color in hr.
(Just ask http://validator.w3.org/ .)  I trust that you are parsing
this because you realize it is all wrong and you want to
programmatically convert it to proper markup.

Since the file is unstructured, I choose not to sweat over restoring
the structure in an HXT arrow.  The HXT arrow will return a flat list,
just as the file is a flat ensemble.  The list looks like:

["/prod17", "Television", " (code: 17)", "A very nice telly.",
 "/prod24", "Cyclotron", " (code: 24)", "Mind your fillings."]

I then use a pure function to decompose this list four items at a time
to emit the desired records.  This is trivial outside HXT arrows.  I
use tuples, and every field is a string; you can easily change the
code to produce Prod's, turn " (code: 17)" into the number 17, etc.

Here is a complete, validated HTML 4 file containing the table, just
so that my program below actually has valid input.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<title>Products</title>
</head>
<body>

<table>
  <tr>
    <td><strong>Product:</strong></td>
    <td><strong><a href="/prod17">Television</a></strong> (code: 17)</td>
  </tr>
  <tr>
    <td><strong>Description:</strong></td>
    <td>A very nice telly.</td>
  </tr>

  <tr>
    <td><hr></td>
  </tr>

  <tr>
    <td><strong>Product:</strong></td>
    <td><strong><a href="/prod24">Cyclotron</a></strong> (code: 24)</td>
  </tr>
  <tr>
    <td><strong>Description:</strong></td>
    <td>Mind your fillings.</td>
  </tr>

  <tr>
    <td><hr></td>
  </tr>
</table>
</body>
</html>

Here is my program:

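import Text.XML.HXT.Arrow

main =
    do { unstructured <- runX (p "table.html")
       ; let structured = s unstructured
       ; print structured
       }

-- Walk table -> tr -> td and emit, for every child of a td,
-- whatever p1 or p2 extracts from it.
p filename =
    readDocument [(a_parse_html,"1")] filename >>>
    deep (isElem >>> hasName "table") >>>
    getChildren >>> isElem >>> hasName "tr" >>>
    getChildren >>> isElem >>> hasName "td" >>>
    getChildren >>>
    p1 <+> p2

-- <strong><a href=...>name</a></strong>: the href, then the link text.
p1 =
    isElem >>> hasName "strong" >>>
    getChildren >>> isElem >>> hasName "a" >>>
    getAttrValue "href" <+> (getChildren >>> getText)

-- Bare text directly inside a td.
p2 =
    getText

-- Group the flat list four items at a time; leftovers are dropped.
s (a:b:c:d: rest) = (a,b,c,d) : s rest
s _ = []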
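
The change to Prod values and numeric codes mentioned above could look
roughly like this.  It is only a sketch: the Prod type and its field order
are guessed from the output shown in the original question, and parseCode
and toProds are hypothetical names.

```haskell
import Data.Char (isDigit)
import Text.Read (readMaybe)

-- Hypothetical record type: name, code, href, description.
data Prod = Prod String Int String String
    deriving (Eq, Show)

-- Keep only the digits of a string such as " (code: 17)" and read them.
parseCode :: String -> Maybe Int
parseCode = readMaybe . filter isDigit

-- Consume the flat list four items at a time: href, name, code, description.
-- Groups whose code does not parse are skipped, and a trailing partial
-- group is dropped.
toProds :: [String] -> [Prod]
toProds (href : name : code : desc : rest) =
    case parseCode code of
        Just n  -> Prod name n href desc : toProds rest
        Nothing -> toProds rest
toProds _ = []
```

In the program above, toProds would take the place of s in main.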


[Haskell-cafe] Accumulating related XML nodes using HXT

2006-10-30 Thread Daniel McAllansmith
Hello.

I have some html from which I want to extract records.
Each record is represented within a number of tr nodes, and all records' tr
nodes are contained by the same parent node.

The things I've tried so far end up giving me the cartesian product of record 
fields, so for the html fragment included below I'd end up with:

[ Prod "Television" 17 "/prod17" "A very nice telly."
, Prod "Television" 17 "/prod17" "Mind your fillings."
, Prod "Cyclotron" 24 "/prod24" "A very nice telly."
, Prod "Cyclotron" 24 "/prod24" "Mind your fillings."
]

instead of:

[ Prod "Television" 17 "/prod17" "A very nice telly."
, Prod "Cyclotron" 24 "/prod24" "Mind your fillings."
]
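
The attempted code is not shown here, so the following is only a guess at
how such a product can arise: if each field is extracted independently over
the whole table and the multi-valued results are then combined, every value
of one field meets every value of the other.  Plain lists show the same
effect; all names below are made up for illustration.

```haskell
-- Hypothetical field lists, as independent extractions might return them.
names, descriptions :: [String]
names        = ["Television", "Cyclotron"]
descriptions = ["A very nice telly.", "Mind your fillings."]

-- Combining independent multi-valued results pairs every name with
-- every description.
allPairs :: [(String, String)]
allPairs = [(n, d) | n <- names, d <- descriptions]

-- Pairing by position keeps each name with its own description.
byPosition :: [(String, String)]
byPosition = zip names descriptions
```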


How should I go about accumulating related tr nodes into individual records?


Thanks
Daniel


HTML fragment follows:

...
<tr>
  <tr>
    <td><strong>Product:</strong></td>
    <td><strong><a href="/prod17">Television</a></strong> (code: 17)</td>
  </tr>
  <tr>
    <td><strong>Description:</strong></td>
    <td>A very nice telly.</td>
  </tr>

  <tr>
    <td><hr color="#0"></td>
  </tr>

  <tr>
    <td><strong>Product:</strong></td>
    <td><strong><a href="/prod24">Cyclotron</a></strong> (code: 24)</td>
  </tr>
  <tr>
    <td><strong>Description:</strong></td>
    <td>Mind your fillings.</td>
  </tr>

  <tr>
    <td><hr color="#0"></td>
  </tr>
</tr>
...