As Aparajita mentioned, there is a lot available on the Internet. The nice thing about regular expressions is that so many languages use PCRE, or a syntax compatible with it, that examples from one language can usually be applied to another with little or no tweaking.

A truly generic solution may not be possible because HTML content can vary infinitely. Also, having a single regular expression magically transform your input into the desired output is probably not practical. Most likely you'd write a script or method that works through the HTML in steps. For example, the first thing I would do is simplify the tags by removing all style information, e.g., transform <tr class="even" valign=top> to <tr>. I'd also remove any formatting tags that don't matter, such as <b>, <i>, <strong>, etc. Then strip out all anchors unless you need them. In other words, try to clean the input prior to parsing it.

From personal experience, I've found you always need to consider the data to parse and adapt the regular expressions and processing to fit. That said, I always consider a regex-based solution before trying to write something that relies on 4D's (or any other language's) simple string substitution commands.

In this particular case do you need a solution that will parse an HTML Table?

To strip the attributes out of specific tags you would match on something like this (which is by no means complete tag-wise):

<\s*(html|head|title|body|table|tr|th|td)\b[^>]*>

... and replace it with <\1>.

In the above we start the match on <
then we look for 0 or more whitespace characters
then we match on the strings html, head, title, body, table, tr, th or td (you'd modify the list as needed)
then we require a word boundary (\b) so that th does not also match <thead> and head does not match <header>
then we match any sequence of 0 or more characters that are not >
then we match the closing >
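
Here is a minimal Python sketch of the attribute-stripping step, just to show the match and replace in action. Python's re module is not PCRE itself, but the syntax is the same for a pattern this simple. The function name strip_attributes is my own, and the pattern uses a \b word boundary after the tag name so that <thead> is not mistaken for <th>. It assumes no attribute value contains a literal > character.

```python
import re

# Tags whose attributes should be dropped; extend the list as needed.
ATTR_TAGS = re.compile(
    r"<\s*(html|head|title|body|table|tr|th|td)\b[^>]*>",
    re.IGNORECASE,
)


def strip_attributes(html: str) -> str:
    """Replace an opening tag such as <tr class="even"> with a bare <tr>."""
    # \1 is the captured tag name, so the tag itself is preserved.
    return ATTR_TAGS.sub(r"<\1>", html)


print(strip_attributes('<tr class="even" valign=top><td >x</td></tr>'))
```

Closing tags such as </td> are left alone because the pattern does not allow a slash after the opening <.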

To strip out formatting tags (again not complete tag-wise) you'd start with something like:

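</?\s*(b|i|strong|u)\b[^>]*>

... and replace any matches with nothing. The [^>]* allows for any attributes inside the tag (or none at all), and the \b keeps b from also matching tags like <body> or <br>.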
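
A rough Python sketch of removing the formatting tags might look like the following. The name strip_formatting is hypothetical, and the tag list is as incomplete as the one above. The pattern includes a \b after the tag name and [^>]* before the closing >, so that plain tags like <b> match while <body> and <br> do not.

```python
import re

# Opening or closing formatting tags, with or without attributes.
FORMAT_TAGS = re.compile(r"</?\s*(b|i|strong|u)\b[^>]*>", re.IGNORECASE)


def strip_formatting(html: str) -> str:
    """Delete the listed formatting tags but keep the text between them."""
    return FORMAT_TAGS.sub("", html)


print(strip_formatting('<b>Tesch</b>, <i class="x">John</i><br>'))
```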

Given your example

<tr class="even" valign=top>
<td ><a href="/0035000000KWnfU">Tesch</a></td>
<td ><a href="/0035000000KWnfU">John</a></td>
<td class="nowrapCell">(911) 953-6555</td>
<td ><a href="mailto:[EMAIL PROTECTED]">[EMAIL PROTECTED]</a></td>
...
</tr>

after cleaning the row and pulling out the contents of each cell you would get:

Tesch
John
(911) 953-6555
[EMAIL PROTECTED]
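
To get from a row like that to an array of strings, one possible approach (a sketch only, in Python, with names I made up) is to match each <td>...</td> pair, throw away any tags left inside the cell, and collect what remains. It assumes cells are well formed and not nested, which real pages do not always guarantee.

```python
import re
from html import unescape

# One table cell: opening <td ...>, lazily matched contents, closing </td>.
CELL = re.compile(r"<td\b[^>]*>(.*?)</td\s*>", re.IGNORECASE | re.DOTALL)
# Any remaining tag inside a cell (anchors, formatting, and so on).
ANY_TAG = re.compile(r"<[^>]+>")


def row_cells(row_html: str) -> list[str]:
    """Return the text of each <td> cell in order, with inner tags removed."""
    cells = []
    for inner in CELL.findall(row_html):
        text = ANY_TAG.sub("", inner)
        # Decode entities such as &amp; and trim surrounding whitespace.
        cells.append(unescape(text).strip())
    return cells


row = """<tr class="even" valign=top>
<td ><a href="/0035000000KWnfU">Tesch</a></td>
<td ><a href="/0035000000KWnfU">John</a></td>
<td class="nowrapCell">(911) 953-6555</td>
</tr>"""
print(row_cells(row))
```

Because the cells come back in document order, working out which column holds the phone number would still need the header row or some pattern matching on the values.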

Or do you want to do something else?

Also, if you have a Windows or Linux machine, there is a nice free utility called the Regex Coach, which can be found at http://weitz.de/regex-coach/.

-- Brad

Mehboob Alam wrote:
I didn't get back many responses from the regular Nug,
so I'll try again..

Here's my original message, pardon the repetition if
you have already seen it

----
I'm still looking, but if anyone can point me to code
samples, or a RegEx that parses out data surrounded by
html tags, I'd be most grateful. Here's an example

<tr class="even" valign=top><td ><a
href="/0035000000KWnfU">Tesch</a></td><td ><a
href="/0035000000KWnfU">John</a></td><td
class="nowrapCell">(911) 953-6555</td><td ><a
href="mailto:[EMAIL PROTECTED]">[EMAIL PROTECTED]</a></td><td
<a href="/00530000000eqnP">Mehboob Alam</a></td><td
<a href="/0015000000F6Eeo">Reuters</a></td></tr>

Ideally, I'd like to get back an array of strings, or
anything close. Unfortunately, this is a
screen-scraping exercise, as the cost of accessing
the system using SOAP or ODBC is going to be prohibitive.

I'm looking for something generic, as the actual order
of fields may change, i.e. the phone field may not
necessarily be in the same location in the final
report to be parsed.

On another note, besides QFree (which requires
QuickTime?), has anyone tried "4D RegEx"
<http://4dplugin.com/en/products/8.htm>


sincerely,
mehboob alam



"There cannot be a crisis next week. My schedule is already full."
Henry Kissinger
_______________________________________________
Active4D-dev mailing list
[email protected]
http://mailman.aparajitaworld.com/mailman/listinfo/active4d-dev
Archives: http://mailman.aparajitaworld.com/archive/active4d-dev/

