Re: Regex help with invalid HTML

2009-11-17 Thread Peter Boughton

> I have no control over this code

The only time parsing HTML with RegEx might be remotely viable is when you know 
what that code will be - if the HTML is uncontrolled then using RegEx is a 
futile effort.

RegEx is for dealing with Regular text, and HTML is not a Regular language - 
even modern regex engines that implement non-Regular features *cannot* deal 
with the potential complexity of HTML.

The correct solution is to **use a tool designed for parsing HTML**.

There isn't one native to CF, but there are a number of Java ones available - 
take a look at:
http://java-source.net/open-source/html-parsers

I haven't used any of those. I'd probably start with TagSoup or NekoHTML, since 
they look promising, but any HTML parser that produces a DOM structure which 
you can run XPath expressions against will allow you to extract the specific 
information you want.

So yeah, it might involve a bit of effort getting one of those to work, but 
it's far more stable and reliable than attempting to use regex for something it 
simply isn't designed for. 
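
For illustration only, here is a minimal sketch of the XPath half of that suggestion, in Java since CF can call Java classes. It assumes a tidying parser such as TagSoup or NekoHTML has already turned the page into well-formed markup; that step is not shown, and the sample markup, the class name XPathStatsSketch and the method name extract are all made up for this example. Only JDK classes are used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library mentioned in this thread.
final class XPathStatsSketch {

    // Returns one "first-cell last-cell" string per <tr>.
    // Assumption: the input is ALREADY well-formed (e.g. output of a tidying parser).
    static List<String> extract(String wellFormedMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xpath.evaluate("//tr", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Node row = rows.item(i);
                // Text of the first cell (the link text) and of the last cell.
                String first = xpath.evaluate("normalize-space(td[1])", row);
                String last = xpath.evaluate("normalize-space(td[last()])", row);
                result.add(first + " " + last);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample of what a tidied stats table might look like.
        String tidied = "<table>"
                + "<tr><td class=\"l\"><a href=\"/\">abc.co.nz</a></td>"
                + "<td>52 363</td><td>471.47 MB</td></tr>"
                + "<tr><td class=\"l\"><a href=\"/\">xyz.co.nz</a></td>"
                + "<td>31 622</td><td>1.8 GB</td></tr>"
                + "</table>";
        for (String line : extract(tidied)) {
            System.out.println(line);
        }
    }
}
```

The point of the sketch is that once a DOM exists, "first cell" and "last cell" are expressed structurally (td[1], td[last()]) rather than by guessing where tags start and end.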

~|
Want to reach the ColdFusion community with something they want? Let them know 
on the House of Fusion mailing lists
Archive: 
http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:328460
Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm
Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=89.70.4


RE: Regex help with invalid HTML

2009-11-17 Thread Mark Henderson

List wrote at 17 November 2009 14:32:
> Andy matthews, you're welcome.

Ah hah, that's a name I'm more familiar with.

> testing

Roger.  And excuse the previously poorly formatted code (it looked ok at
my end before sending but occasionally in Outlook 2007 when I copy and
paste from external apps that happens).

Over and out.

Mark  




RE: Regex help with invalid HTML

2009-11-17 Thread Mark Henderson

Peter Boughton wrote on Wed 18/11/2009 at 03:12:

> The only time parsing HTML with RegEx might be remotely viable is when you
> know what that code will be - if the HTML is uncontrolled then using RegEx
> is a futile effort.
>
> RegEx is for dealing with Regular text, and HTML is not a Regular language -
> even modern regex engines that implement non-Regular features *cannot* deal
> with the potential complexity of HTML.
>
> The correct solution is to **use a tool designed for parsing HTML**.

Ok Peter, thanks for the heads-up.


> There isn't one native to CF, but there are a number of Java ones available -
> take a look at:
> http://java-source.net/open-source/html-parsers
>
> I haven't used any of those, I'd probably start with TagSoup or NekoHTML since
> they look promising, but any HTML parser that produces a DOM structure which
> you can run XPath expressions against will allow you to extract the specific
> information you want.

TagSoup it is.

Mark



RE: Regex help with invalid HTML

2009-11-16 Thread Mark Henderson

Azadi Saryev wrote on 16 November 2009 at 17:58

> you can do it with something like this:
> <cfset line='<tr><td class=l><a href=/>blah.com</a><td>31 622<td>25 623<td>193 645<td>840 642<td>1.9 GB'>
> <cfset cleanline = rereplace(line, '<t[^>]+>', '|', 'all')>
> <cfoutput>#listfirst(cleanline, '|')# #listlast(cleanline, '|')#</cfoutput>
>
> and if you do not want any html in final result (not even <a> tag), then
> use:
> <cfset cleanline = rereplace(line, '<[^>]+>', '|', 'all')>
 

Thanks Azadi. That's all I needed to get the thought processes rolling
in the right direction (it never occurred to me to check each entry was
on a new line, so thanks also to the individual I can only refer to as
list!). 

Here's the truncated code relevant to the question I asked that's
working:

<cfhttp url="http://localhost/statsmerged.html">

<cfset sStartString = cfhttp.filecontent>
<cfset sStartTag = FindNoCase("<td class='l'", sStartString)>
<cfset sTempString = RemoveChars(sStartString, 1, sStartTag-1)>
<cfset sEndTag = FindNoCase("</table>", sTempString)>
<cfset sFinalString = RemoveChars(sTempString, sEndTag, Len(sTempString))>

<cfloop index="thisLine" list="#sFinalString#" delimiters="#chr(10)##chr(13)#">
  <cfset cleanLine = ReReplace(thisLine, '<[^>]+>', '|', 'all')>
  <cfoutput>#listFirst(cleanLine, '|')# #listLast(cleanLine, '|')#</cfoutput>
</cfloop>
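
For anyone wanting the same flow outside CFML, the steps above (cut the page down to the table body, walk it line by line, swap tags for '|', keep the first and last non-empty items) can be sketched in plain Java roughly as follows. This is only an approximation of the CF list functions, which skip empty list elements; the class name StatsPageSketch, the method name summarize, the marker arguments and the sample page are all invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper mirroring the CFML steps; all names here are made up.
final class StatsPageSketch {

    // Keeps the text between startMarker and endMarker (case-insensitive search),
    // then reduces every line to "first-item last-item".
    static List<String> summarize(String html, String startMarker, String endMarker) {
        List<String> out = new ArrayList<>();
        String lower = html.toLowerCase(Locale.ROOT);
        int start = lower.indexOf(startMarker.toLowerCase(Locale.ROOT));
        if (start < 0) {
            return out; // start marker not found: nothing to report
        }
        int end = lower.indexOf(endMarker.toLowerCase(Locale.ROOT), start);
        if (end < 0) {
            end = html.length(); // no end marker: read to the end of the page
        }
        String body = html.substring(start, end);

        for (String line : body.split("[\\r\\n]+")) {
            // Replace every tag with '|' so the remaining text becomes a list.
            String[] parts = line.replaceAll("<[^>]+>", "|").split("\\|");
            String first = null;
            String last = null;
            for (String part : parts) {
                String item = part.trim();
                if (item.isEmpty()) {
                    continue; // skip empty elements, as CF list functions do
                }
                if (first == null) {
                    first = item;
                }
                last = item;
            }
            if (first != null) {
                out.add(first + " " + last);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up page with one row per line and no closing </td> or </tr> tags.
        String page = "<html><table>\n"
                + "<tr><td class=l><a href=/>abc.co.nz</a><td>52 363<td>471.47 MB\n"
                + "<tr><td class=l><a href=/>xyz.co.nz</a><td>31 622<td>1.8 GB\n"
                + "</table></html>";
        for (String row : summarize(page, "<td class=l", "</table")) {
            System.out.println(row);
        }
    }
}
```

Like the CFML, this only holds up while every row sits on its own line and the size is the last cell.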




RE: Regex help with invalid HTML

2009-11-16 Thread lists

Andy matthews, you're welcome. 



RE: Regex help with invalid HTML

2009-11-16 Thread lists

testing 



Regex help with invalid HTML

2009-11-15 Thread Mark Henderson

Calling all regex gurus. I've spent a little time on this so now it's
time to seek advice from the professionals. Here is an example of the
content I'm working with:

<tr><td class=l><a href=/>abc.co.nz</a><td>52 363<td>73 815<td>5 122 265<td>2 166 760<td>471.47 MB
<tr><td class=l><a href=/>xyz.co.nz</a><td>31 622<td>23 443<td>193 645<td>840 642<td>1.8 GB
<tr><td class=l><a href=/>blah.com</a><td>31 622<td>25 623<td>193 645<td>840 642<td>1.9 GB

And what I want to do is remove everything between the first <td> (after
the closing </a>) and the last <td> BEFORE the next <tr>.

E.G. This
<tr><td class=l><a href=/>abc.co.nz</a><td>52 363<td>73 815<td>5 122 265<td>2 166 760<td>471.47 MB

becomes

<tr><td class=l><a href=/>abc.co.nz</a> 471.47 MB

At that point I will then strip all the remaining HTML tags (which I can
do) and I should be good to go. Unfortunately I have no control over
this code as it is generated by a stats program, and if indeed it used
the correct closing tags and validated I could probably fumble around
and eventually achieve what I want, as I've done in the past.  And just
in case anyone out there can do all this in one hit, ultimately I want
the output from above to look like this:

abc.co.nz 471.47 MB
xyz.co.nz 1.8 GB
blah.com 1.9 GB
etc.

I hope that makes sense.


TIA
Mark
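
A note for later readers on the "one hit" part of the question: a single pattern with two capture groups can pull the link text and the last cell out of one row, provided each row sits on its own line and has exactly this shape. The sketch below is in Java rather than CFML, and the class name OneHitSketch, the method name summarize and the pattern itself are assumptions made up for illustration, not something posted in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern assumes one <a>...</a> per row and that
// the value wanted is the text after the LAST <td> on the line.
final class OneHitSketch {

    private static final Pattern ROW = Pattern.compile(
            "^.*?<a[^>]*>([^<]*)</a>.*<td>([^<]*)$",
            Pattern.CASE_INSENSITIVE);

    // Returns "link-text last-cell", or the row unchanged if it does not match.
    static String summarize(String row) {
        Matcher m = ROW.matcher(row.trim());
        if (!m.matches()) {
            return row; // leave unrecognised rows untouched
        }
        return m.group(1).trim() + " " + m.group(2).trim();
    }

    public static void main(String[] args) {
        String row = "<tr><td class=l><a href=/>abc.co.nz</a><td>52 363<td>73 815"
                + "<td>5 122 265<td>2 166 760<td>471.47 MB";
        System.out.println(summarize(row));
    }
}
```

The greedy `.*` before the final `<td>` is what skips the middle cells: it runs to the end of the line and backs off only as far as the last `<td>`.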



RE: Regex help with invalid HTML

2009-11-15 Thread lists

Will it always be a domain name you want to keep? And will the file size
always be at the very end of the line? 



RE: Regex help with invalid HTML

2009-11-15 Thread Mark Henderson

lists wrote:
> Will it always be a domain name you want to keep? And will the file size
> always be at the very end of the line?

Yes, and yes (confirmed all the TRs start on a new line).


Regards

Mark



Re: Regex help with invalid HTML

2009-11-15 Thread Azadi Saryev

you can do it with something like this:
<cfset line='<tr><td class=l><a href=/>blah.com</a><td>31 622<td>25 623<td>193 645<td>840 642<td>1.9 GB'>
<cfset cleanline = rereplace(line, '<t[^>]+>', '|', 'all')>
<cfoutput>#listfirst(cleanline, '|')# #listlast(cleanline, '|')#</cfoutput>

and if you do not want any html in final result (not even <a> tag), then
use:
<cfset cleanline = rereplace(line, '<[^>]+>', '|', 'all')>

Azadi Saryev
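
To make the difference between the two patterns concrete, here is a small Java sketch (Java only so it can be run stand-alone; the class name TagPatternSketch and the method name firstAndLast are invented for this example). It replaces whatever the given pattern matches with '|' and keeps the first and last non-empty items, which approximates what listfirst and listlast do, since CF list functions skip empty elements.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for comparing the two tag patterns; names are made up.
final class TagPatternSketch {

    // Replaces every match of tagRegex with '|' and returns
    // "first-item last-item" from the resulting '|'-delimited list.
    static String firstAndLast(String line, String tagRegex) {
        String cleaned = line.replaceAll(tagRegex, "|");
        List<String> items = new ArrayList<>();
        for (String part : cleaned.split("\\|")) {
            if (!part.trim().isEmpty()) {
                items.add(part.trim()); // skip empty elements, as CF lists do
            }
        }
        if (items.isEmpty()) {
            return "";
        }
        return items.get(0) + " " + items.get(items.size() - 1);
    }

    public static void main(String[] args) {
        String line = "<tr><td class=l><a href=/>blah.com</a><td>31 622<td>25 623"
                + "<td>193 645<td>840 642<td>1.9 GB";
        // Only tags whose name starts with "t" (<tr>, <td ...>) are replaced,
        // so the link markup stays in the first item.
        System.out.println(firstAndLast(line, "<t[^>]+>"));
        // Every tag is replaced, so only the link text is left.
        System.out.println(firstAndLast(line, "<[^>]+>"));
    }
}
```

With the first pattern the link markup survives in the first item; with the second only the text is left.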


