[REBOL] Pulling values from parsed HTML ... more REGEX trouble

runester Wed, 19 Jan 2000 17:19:37 -0800
Hello, I am new to the list. I have searched through the archive of previous
posts hoping to find the information I need, and I have read the various
documents at www.rebol.com, including the new user's guide.

I still can't do what I need to do!

If this material is covered somewhere, please let me know where and I'll go
look it up. Otherwise, I would appreciate some guidance.

~~~ ~~~ ~~~

Problem: I need to parse an HTML page and pull values out of certain fields for
later analysis. Can this be done with 'parse, and if so, how?

Sample Data:
<TABLE>
<TR><TD>ALPHA</TD><TD>ONE</TD></TR>
<TR><TD>BETA</TD><TD>TWO</TD></TR>
<TR><TD COLSPAN=2>DUMMY LINE ONE</TD></TR>
<TR><TD>GAMMA</TD><TD>THREE</TD></TR>
<TR><TD>DELTA</TD><TD>FOUR</TD></TR>
<TR><TD COLSPAN=2>DUMMY LINE TWO</TD></TR>
<TR><TD>EPSILON</TD><TD>FIVE</TD></TR>
</TABLE>

Desired output:
ALPHA = ONE
BETA = TWO
GAMMA = THREE
DELTA = FOUR
EPSILON = FIVE

How I would do it in PERL:
<PERL>
## Assumes the data is in a file named on the command line
## and that the output goes to STDOUT.

$pattern = '<tr><td>([\w\s]*)<\/td><td>([\w\s]*)<\/td><\/tr>';
while (<>)
{
   ## /i ignores case; $1 and $2 hold the contents of the two cells
   if ( $_ =~ m/$pattern/i ) { print "$1 = $2\n"; }
}
</PERL>

A Few Notes:
1) I only want to pull the cell contents out if there are two cells per row;
the other rows contain either needless data or section headers.

2) I actually need the values themselves: they have to be reformatted and
compared, so just printing them would not be enough in the script.

3) I know how to split the file into lines in REBOL if that would help, and I
know how to MATCH the data in REBOL ... but I do _NOT_ know how to pull the
values out of that data.

4) I have tried combinations of [thru <tr> <td> copy txt1 to </td> <td>] (which
works fine for pulling out ONE value), but I cannot write a syntactically
correct parse grammar that would pull out both values.

5) Also, could someone please explain the weird finding I outline below?

>> sample-text: "alpha#beta"
== "alpha#beta"
>> probe parse sample-text [copy txt1 to "#" "#" copy txt2 to end (print [txt1 txt2])]
alpha beta
true
== true


>> sample-text: "alpha<td>beta"
== "alpha<td>beta"
>> probe parse sample-text [copy txt1 to <td> <td> copy txt2 to end (print [txt1 txt2])]
alpha beta
true
== true


>> sample-text: "alpha</td><td>beta"
== "alpha</td><td>beta"
>> probe parse sample-text [copy txt1 to </td> <td> </td> <td> copy txt2 to end (print [txt1 txt2])]
false
== false


; So, why is my parse grammar correct for a single separator (whether text or
; tag) but incorrect for a double separator?


Thank you in advance for your assistance in this matter.


=====
Steve ~runester~ Jarjoura
"According to my calculations, that problem doesn't exist."