Hello [EMAIL PROTECTED]!
On 20-Jan-00, you wrote:
r> I still can't do what I need to do!
Tell me if this is of any use:
>> html-source: {
{ <TABLE>
{ <TR><TD>ALPHA</TD><TD>ONE</TD></TR>
{ <TR><TD>BETA</TD><TD>TWO</TD></TR>
{ <TR><TD COLSPAN=2>DUMMY LINE ONE</TD></TR>
{ <TR><TD>GAMMA</TD><TD>THREE</TD></TR>
{ <TR><TD>DELTA</TD><TD>FOUR</TD></TR>
{ <TR><TD COLSPAN=2>DUMMY LINE TWO</TD></TR>
{ <TR><TD>EPSILON</TD><TD>FIVE</TD></TR>
{ </TABLE>
{ }
== {
<TABLE>
<TR><TD>ALPHA</TD><TD>ONE</TD></TR>
<TR><TD>BETA</TD><TD>TWO</TD></TR>
<TR><TD COLSPAN=2>DUMMY LINE ONE</TD></TR>
<TR>...
>> parse-html html-source
== ["ALPHA" "ONE" "BETA" "TWO" "GAMMA" "THREE" "DELTA" "FOUR" "EPSILON" "FIVE"]
>> foreach [name value] parse-html html-source [
[ print [name "=" value]
[ ]
ALPHA = ONE
BETA = TWO
GAMMA = THREE
DELTA = FOUR
EPSILON = FIVE
This function has the advantage of being able to parse malformed HTML,
too:
>> malformed-html: {
{ I hope you don't have to cope with things like this!
{ <HTML>
{ <TR><TD>You don't want</TD><TD>this, do you?</TD></TR>
{
{ Some unwanted content...
{
{ <TABLE>
{ Bla bla bla
{ <TR> Hey, look: this is very bad HTML!
{
{ <TD>
{
{ ALPHA</TD><TD>
{
{ ONE</TD></TR><TR>
{ ...<TD>a</TD>b<TD>c</TD>d<TD>e</TD>
{ </TR>
{ <TR><TD>BETA</TD><TD>TWO</TD></TR>
{ and so on...
{ </TABLE>
{ </BODY>
{ </HTML>
{ }
== {
I hope you don't have to cope with things like this!
<HTML>
<TR><TD>You don't want</TD><TD>this, do you?</TD></TR>
Some unwan...
>> parse-html malformed-html
== ["^/^/ALPHA" "^/^/ONE" "BETA" "TWO"]
It will also accept other tags inside the cells, stripping them:
>> parse-html {<TABLE><TR><TD>Some tags <B>here</B></TD><TD>etc.</TD></TR></TABLE>}
== ["Some tags here" "etc."]
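The stripping comes from the way the rules are written: any tag the
parser does not care about is simply skipped up to its closing ">", and
the text on either side of it ends up in the same cell. A minimal,
stand-alone sketch of just that idea (the names chars, out and piece are
made up for this example and are not part of the script below):

    chars: complement charset "<>"
    out: make string! 40
    parse/all {Some tags <B>here</B>} [
        some [
            "<" thru ">"                                 ; skip any tag
            | copy piece some chars (append out piece)   ; keep the text
        ]
    ]
    print out    ; prints: Some tags here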
And now, here's the code. It is written as a state machine; there are
probably simpler ways to do this, but this approach is very flexible.
REBOL []
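; The input is a sequence of tags and runs of text
html-rule: [some [tag | text]]

; The handler words below may be set to none, in which case the tag is ignored
tag: [
    "<" [
        "TABLE"  (start-table) |
        "/TABLE" (end-table)   |
        "TD"     (start-cell)  |
        "/TD"    (end-cell)    |
        "TR"     (start-row)   |
        "/TR"    (end-row)     |
        none                        ; any other tag is skipped
    ]
    thru ">"
]

text: [
    copy content some characters
    (process content)
]

characters: complement charset "<>"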
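
result: make block! 10    ; values collected from all accepted rows
buffer: make block! 10    ; cells of the row currently being parsed

discard: func [
    "Discards unwanted content"
    content [string!]
] []

store: func [
    "Store content"
    content [string!]
] [
    append last buffer content
]

; PROCESS is the current text handler; text is ignored until a cell starts
process: :discard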

; Values for [start-cell end-cell] while inside a row
in-row: reduce [
    func [
        "Cell start"
    ] [
        append buffer make string! 100
        process: :store
    ]
    func [
        "Cell end"
    ] [
        process: :discard
    ]
]
not-in-row: reduce [none none]

; Values for [start-cell end-cell start-row end-row] while inside a table
in-table: reduce [
    none
    none
    func [
        "Row start"
    ] [
        set [start-cell end-cell] in-row
        clear buffer
        process: :discard
    ]
    func [
        "Row end"
    ] [
        ; only rows with exactly two cells are kept
        if 2 = length? buffer [
            append result buffer
        ]
        set [start-cell end-cell] not-in-row
        process: :discard
    ]
]
not-in-table: reduce [none none none none]

; Initial state: outside any table
set [start-cell end-cell start-row end-row] not-in-table

start-table: func [
    "Table start"
] [
    set [start-cell end-cell start-row end-row] in-table
]

end-table: func [
    "Table end"
] [
    set [start-cell end-cell start-row end-row] not-in-table
]

parse-html: func [
    "Parse the HTML source"
    html [string!]
] [
    clear result
    parse/all html html-rule
    result
]
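
A short usage sketch, in case it helps. The file names parse-html.r and
page.html and the word table-data are only placeholders; use whatever
matches where you saved the script and your page. Note that parse-html
always returns the same RESULT block and clears it on every call, so
take a copy if you want to keep the values from one page while parsing
another:

    do %parse-html.r                     ; hypothetical file holding the script above
    page: read %page.html                ; hypothetical file, read as a string
    table-data: copy parse-html page     ; COPY, because RESULT is reused on the next call
    foreach [name value] table-data [
        print [name "=" value]
    ]
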
Regards,
Gabriele.
--
o--------------------) .-^-. (----------------------------------o
| Gabriele Santilli / /_/_\_\ \ Amiga Group Italia --- L'Aquila |
| GIESSE on IRC \ \-\_/-/ / http://www.amyresource.it/AGI/ |
o--------------------) `-v-' (----------------------------------o