Re: Regex help with invalid HTML
> I have no control over this code

The only time parsing HTML with RegEx might be remotely viable is when you know what that code will be - if the HTML is uncontrolled then using RegEx is a futile effort.

RegEx is for dealing with Regular text, and HTML is not a Regular language - even modern regex engines that implement non-Regular features *cannot* deal with the potential complexity of HTML.

The correct solution is to **use a tool designed for parsing HTML**.

There isn't one native to CF, but there are a number of Java ones available - take a look at:
http://java-source.net/open-source/html-parsers

I haven't used any of those, I'd probably start with TagSoup or NekoHTML since they look promising, but any HTML parser that produces a DOM structure which you can run XPath expressions against will allow you to extract the specific information you want.

So yeah, it might involve a bit of effort getting one of those to work, but it's far more stable and reliable than attempting to use regex for something it simply isn't designed for.

~|
Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:328460
Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm
Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=89.70.4
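As a rough illustration of the "use a parser" advice in the message above, here is a minimal sketch in Python rather than CF or Java. It swaps in the standard-library html.parser module, which is an event-based tokenizer, not the DOM-plus-XPath setup that TagSoup or NekoHTML would give you. The sample markup, the class name StatsRowParser and the function domain_and_size are all assumptions made for the example; the real stats page is not shown in the thread.

```python
from html.parser import HTMLParser


class StatsRowParser(HTMLParser):
    """Collect the text of each table cell, row by row, tolerating
    cells and rows that are never explicitly closed."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new <tr> implicitly ends the previous row.
            self.rows.append([])
            self._in_cell = False
        elif tag == "td" and self.rows:
            # A new <td> implicitly ends the previous cell.
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def domain_and_size(html):
    """Return (first cell, last cell) for every row with at least two cells."""
    parser = StatsRowParser()
    parser.feed(html)
    parser.close()
    return [(cells[0].strip(), cells[-1].strip())
            for cells in parser.rows if len(cells) >= 2]


# Hypothetical sample with unclosed <td>/<tr> tags; the href values are assumptions.
sample = (
    "<table>\n"
    "<tr><td class=l><a href='http://abc.co.nz/'>abc.co.nz</a><td>52 363<td>471.47 MB\n"
    "<tr><td class=l><a href='http://xyz.co.nz/'>xyz.co.nz</a><td>31 622<td>1.8 GB\n"
    "</table>"
)
result = domain_and_size(sample)
print(result)
```

The parser never needs the closing tags: each opening tr or td simply starts a new row or cell, which is why the missing end tags in the stats output do not matter here.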
RE: Regex help with invalid HTML
List wrote at 17 November 2009 14:32:

> Andy matthews, you're welcome.

Ah hah, that's a name I'm more familiar with.

> testing

Roger. And excuse the previously poorly formatted code (it looked ok at my end before sending but occasionally in Outlook 2007 when I copy and paste from external apps that happens).

Over and out.
Mark

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:328477
RE: Regex help with invalid HTML
Peter Boughton wrote on Wed 18/11/2009 at 03:12:

> The only time parsing HTML with RegEx might be remotely viable is when you know what that code will be - if the HTML is uncontrolled then using RegEx is a futile effort.
>
> RegEx is for dealing with Regular text, and HTML is not a Regular language - even modern regex engines that implement non-Regular features *cannot* deal with the potential complexity of HTML.
>
> The correct solution is to **use a tool designed for parsing HTML**.

Ok Peter, thanks for the heads-up.

> There isn't one native to CF, but there are a number of Java ones available - take a look at:
> http://java-source.net/open-source/html-parsers
>
> I haven't used any of those, I'd probably start with TagSoup or NekoHTML since they look promising, but any HTML parser that produces a DOM structure which you can run XPath expressions against will allow you to extract the specific information you want.

TagSoup it is.

Mark

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:328478
RE: Regex help with invalid HTML
Azadi Saryev wrote on 16 November 2009 at 17:58

> you can do it with something like this:
>
>     <cfset line='<tr><td class=l><a href=...>blah.com</a><td>31 622<td>25 623<td>193 645<td>840 642<td>1.9 GB'>
>     <cfset cleanline = rereplace(line, '<t[^>]+>', '|', 'all')>
>     <cfoutput>#listfirst(cleanline, '|')# #listlast(cleanline, '|')#</cfoutput>
>
> and if you do not want any html in final result (not even a tag), then use:
>
>     <cfset cleanline = rereplace(line, '<[^>]+>', '|', 'all')>

Thanks Azadi. That's all I needed to get the thought processes rolling in the right direction (it never occurred to me to check each entry was on a new line, so thanks also to the individual I can only refer to as list!).

Here's the truncated code relevant to the question I asked that's working:

    <cfhttp url="http://localhost/statsmerged.html">
    <cfset sStartString = cfhttp.filecontent>
    <cfset sStartTag = FindNoCase("<td class='l'", sStartString)>
    <cfset sTempString = RemoveChars(sStartString, 1, sStartTag-1)>
    <cfset sEndTag = FindNoCase("</table>", sTempString)>
    <cfset sFinalString = RemoveChars(sTempString, sEndTag, Len(sTempString))>
    <cfloop index="thisLine" list="#sFinalString#" delimiters="#chr(10)##chr(13)#">
        <cfset cleanLine = ReReplace(thisLine, '<[^>]+>', '|', 'all')>
        <cfoutput>#listFirst(cleanLine, '|')# #listLast(cleanLine, '|')#</cfoutput>
    </cfloop>

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:328444
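For anyone following along outside ColdFusion, the steps in the message above (cut the page down to the region between the first class-l cell and the end of the table, walk it line by line, turn every tag into a pipe, keep the first and last pieces) might look roughly like this in Python. It is a sketch under assumptions, not Mark's code: the page text, its href values and the function name extract_rows are made up for the example, and it adds guards for missing markers that the CF version does not have.

```python
import re

# Hypothetical stand-in for cfhttp.filecontent; the real page is not in the thread.
page = """<html><body><table>
<tr><th>Domain<th>Hits<th>Size
<tr><td class='l'><a href='http://abc.co.nz/'>abc.co.nz</a><td>52 363<td>471.47 MB
<tr><td class='l'><a href='http://xyz.co.nz/'>xyz.co.nz</a><td>31 622<td>1.8 GB
</table></body></html>"""


def extract_rows(html):
    """Return (domain, size) pairs from the stats rows in html."""
    # Case-insensitive search, like FindNoCase in CF.
    start = re.search(r"<td class='l'", html, re.IGNORECASE)
    if start is None:
        return []
    region = html[start.start():]
    end = re.search(r"</table", region, re.IGNORECASE)
    if end is not None:
        region = region[:end.start()]

    rows = []
    for line in region.splitlines():
        # Every tag becomes a delimiter; empty pieces are dropped because
        # CF list functions ignore empty elements by default.
        parts = [p for p in re.sub(r"<[^>]+>", "|", line).split("|") if p.strip()]
        if len(parts) >= 2:
            rows.append((parts[0], parts[-1]))
    return rows


rows = extract_rows(page)
for domain, size in rows:
    print(domain, size)
```

As in the CF version, this only works because each row sits on its own line; if the stats program ever changes that, the line loop is the part that breaks.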
RE: Regex help with invalid HTML
Andy matthews, you're welcome.

-----Original Message-----
From: Mark Henderson [mailto:m...@cwc.co.nz]
Sent: Monday, November 16, 2009 4:29 PM
To: cf-talk
Subject: RE: Regex help with invalid HTML

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:328450
RE: Regex help with invalid HTML
testing

-----Original Message-----
From: Mark Henderson [mailto:m...@cwc.co.nz]
Sent: Monday, November 16, 2009 4:29 PM
To: cf-talk
Subject: RE: Regex help with invalid HTML

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:328451
Regex help with invalid HTML
Calling all regex gurus. I've spent a little time on this so now it's time to seek advice from the professionals. Here is an example of the content I'm working with:

    <tr><td class=l><a href=...>abc.co.nz</a><td>52 363<td>73 815<td>5 122 265<td>2 166 760<td>471.47 MB
    <tr><td class=l><a href=...>xyz.co.nz</a><td>31 622<td>23 443<td>193 645<td>840 642<td>1.8 GB
    <tr><td class=l><a href=...>blah.com</a><td>31 622<td>25 623<td>193 645<td>840 642<td>1.9 GB

And what I want to do is remove everything between the first <td> (after the closing </a>) and the last <td> BEFORE the next <tr>.

E.G. This

    <tr><td class=l><a href=...>abc.co.nz</a><td>52 363<td>73 815<td>5 122 265<td>2 166 760<td>471.47 MB

becomes

    <tr><td class=l><a href=...>abc.co.nz</a> 471.47 MB

At that point I will then strip all the remaining HTML tags (which I can do) and I should be good to go.

Unfortunately I have no control over this code as it is generated by a stats program, and if indeed it used the correct closing tags and validated I could probably fumble around and eventually achieve what I want, as I've done in the past.

And just in case anyone out there can do all this in one hit, ultimately I want the output from above to look like this:

    abc.co.nz 471.47 MB
    xyz.co.nz 1.8 GB
    blah.com 1.9 GB

etc.

I hope that makes sense.

TIA
Mark

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:328402
RE: Regex help with invalid HTML
Will it always be a domain name you want to keep? And will the file size always be at the very end of the line?

-----Original Message-----
From: Mark Henderson [mailto:m...@cwc.co.nz]
Sent: Sunday, November 15, 2009 8:38 PM
To: cf-talk
Subject: Regex help with invalid HTML

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:328403
RE: Regex help with invalid HTML
lists wrote:

> Will it always be a domain name you want to keep? And will the file size always be at the very end of the line?

Yes, and yes (confirmed all the TRs start on a new line).

Regards
Mark

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:328404
Re: Regex help with invalid HTML
you can do it with something like this:

    <cfset line='<tr><td class=l><a href=...>blah.com</a><td>31 622<td>25 623<td>193 645<td>840 642<td>1.9 GB'>
    <cfset cleanline = rereplace(line, '<t[^>]+>', '|', 'all')>
    <cfoutput>#listfirst(cleanline, '|')# #listlast(cleanline, '|')#</cfoutput>

and if you do not want any html in final result (not even a tag), then use:

    <cfset cleanline = rereplace(line, '<[^>]+>', '|', 'all')>

Azadi Saryev

On 16/11/2009 10:37, Mark Henderson wrote:

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:328405
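To see what the two patterns in the message above do differently, here is a small Python sketch of the same idea. It is an illustration only, not Azadi's code: the href value in the sample row is an assumption (the archive did not preserve it), and first_and_last is a made-up helper that drops empty pieces to mimic the way CF's listFirst and listLast ignore empty list elements by default.

```python
import re

# Hypothetical sample row; the href value is an assumption.
line = ("<tr><td class=l><a href='http://blah.com/'>blah.com</a>"
        "<td>31 622<td>25 623<td>193 645<td>840 642<td>1.9 GB")


def first_and_last(row, tag_pattern):
    """Replace every tag matching tag_pattern with '|' and return the
    first and last non-empty pieces of the resulting list."""
    cleaned = re.sub(tag_pattern, "|", row)
    parts = [p for p in cleaned.split("|") if p.strip()]
    return parts[0], parts[-1]


# Only tags whose name starts with "t" (tr, td) become delimiters,
# so the <a> element survives in the first piece.
keep_anchor = first_and_last(line, r"<t[^>]+>")

# Every tag becomes a delimiter, leaving plain text only.
text_only = first_and_last(line, r"<[^>]+>")

print(keep_anchor)
print(text_only)
```

The first pattern is the one to use if the link itself should be kept in the output; the second gives the plain "domain size" pair the original question asked for.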