RE: regex help for grabbing values of html tag attributes
Google doesn't put quotes around most attributes. The following works (it takes single quotes, double quotes, or no quotes at all into consideration). Watch out for line wrapping in the regular expressions. It lets you find the value of one attribute in one or more tags (see the examples).

<cfscript>
function GetAttributeValue(str,tag,attr){
    // Subexpression 1 is the tag name, subexpression 2 is the attribute value.
    var regexp = "<(#tag#)\s[^>]*#attr#=('.*?'|"".*?""|[^\s>]+)[^>]*>";
    var aReturn = ArrayNew(1);
    var start = 1;
    var stTmp = StructNew();
    while(true){
        stTmp = REFindNoCase(regexp,str,start,true);
        // pos[1] is 0 when there are no more matches.
        if(stTmp.pos[1] IS 0) break;
        // pos[3]/len[3] locate the attribute value; strip any surrounding quotes.
        ArrayAppend(aReturn,REReplace(Mid(str,stTmp.pos[3],stTmp.len[3]),"^['""](.*)['""]$","\1"));
        // Continue searching after the end of the current match.
        start = stTmp.pos[1] + stTmp.len[1];
    }
    return aReturn;
}
</cfscript>

<cfhttp url="http://www.google.com/" throwonerror="yes"></cfhttp>
<cfoutput>#HTMLCodeFormat(cfhttp.filecontent)#</cfoutput>
<cfdump var="#GetAttributeValue(cfhttp.filecontent,'a','href')#">
<cfdump var="#GetAttributeValue(cfhttp.filecontent,'img','src')#">
<cfdump var="#GetAttributeValue(cfhttp.filecontent,'a|td','class')#">

Pascal

Message: http://www.houseoffusion.com/lists.cfm/link=i:4:199743
Archives: http://www.houseoffusion.com/cf_lists/threads.cfm/4
Subscription: http://www.houseoffusion.com/lists.cfm/link=s:4
Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=11502.10531.4
Donations Support: http://www.houseoffusion.com/tiny.cfm/54
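For readers who want to try the idea outside ColdFusion, here is a minimal Python sketch of the same approach: accept single-quoted, double-quoted, or unquoted values, then strip the quotes. It is not Pascal's function. The name `get_attribute_values` and the sample markup are made up for illustration, and the pattern is slightly tightened (a word boundary after the tag name, whitespace required before the attribute). Like any regex-based approach, it can be fooled by attribute-like text inside another attribute's value.

```python
import re

def get_attribute_values(html, tag, attr):
    """Collect the values of `attr` on every `tag` element in `html`.

    `tag` may be an alternation such as "a|td"; surrounding quotes are stripped.
    (Hypothetical helper, written for illustration only.)
    """
    pattern = re.compile(
        r"<(?:%s)\b[^>]*?\s%s=('[^']*'|\"[^\"]*\"|[^\s>]+)" % (tag, re.escape(attr)),
        re.IGNORECASE,
    )
    values = []
    for match in pattern.finditer(html):
        raw = match.group(1)
        # Drop one pair of matching quotes, if present.
        if len(raw) >= 2 and raw[0] in "'\"" and raw[-1] == raw[0]:
            raw = raw[1:-1]
        values.append(raw)
    return values

# Made-up sample mixing unquoted, double-quoted and single-quoted attributes.
sample = ('<a href=/intl/en/about.html class="q">About</a>'
          "<img src='logo.gif' width=276><td class=h>")

print(get_attribute_values(sample, "a", "href"))
print(get_attribute_values(sample, "img", "src"))
print(get_attribute_values(sample, "a|td", "class"))
```

The three alternatives inside the capture group correspond to the three quoting styles, which is why a pattern that insists on double quotes finds nothing on a page that leaves most attributes unquoted.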
Re: regex help for grabbing values of html tag attributes
What version of CF?

--Ben

Burns, John D wrote:
> Does anyone have a regex already written (or would any of you regex gurus like to put something up) that could take the source code of an HTML file and grab the value of an attribute, given the tag and the attribute to be grabbed? For instance, if I wanted to get the value of any classes used on a span tag, it would search for span tags, look for a class attribute, and return the value within the quotes after class=. Or, for images, it would search for the img tag, find the src attribute, and return the URL listed there. I have tried a few things but haven't had a whole lot of luck. Any help would be great. Thanks!
>
> John Burns
> Certified Advanced ColdFusion MX Developer
> Wyle Laboratories, Inc. | Web Developer

Message: http://www.houseoffusion.com/lists.cfm/link=i:4:199700
RE: regex help for grabbing values of html tag attributes
6.1. I was looking at the archives and have come up with this, but it's erroring. I'm using the img instance because it's easier to test on pages that have multiple images...

#refindnocase("<img[^>]*src="([^>]*)*>",cfhttp.fileContent,0,true)#

John Burns
Certified Advanced ColdFusion MX Developer
Wyle Laboratories, Inc. | Web Developer

Message: http://www.houseoffusion.com/lists.cfm/link=i:4:199703
Re: regex help for grabbing values of html tag attributes
Well, I see a couple of problems with what you're using. First, you've not got a closing " on the attribute. Second, you've wrapped a regex that contains a " in "s, which will error out if you don't escape the inner "s. You can wrap it with single quotes to fix that. Also, the last * boggles me. I don't know why it's there.

Or, try this:

'<#tag#.*?#att#="(.*?)".*?>'

where (should be obvious) tag and att are defined as the tag and attribute you want. Please note that if you define them as span and class and you have this:

<span>stuff in between<span class="bob">

the whole tag match will return both span tags and the stuff in between. The attribute match will return bob. So, if this might be the case, lemme know and we'll tweak the regex.

Not tested, your mileage may vary, trix are for kids, etc.

--Ben

Message: http://www.houseoffusion.com/lists.cfm/link=i:4:199710
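The over-matching Ben warns about can be reproduced in a few lines. This is a Python sketch rather than CFML, run on the span example from the message. The "tight" pattern is only one possible tweak (it stops `.*?` from crossing a `>` by using `[^>]*` instead), not necessarily the fix Ben had in mind.

```python
import re

# The markup from the message: a span without a class, followed by one with it.
html = '<span>stuff in between<span class="bob">'

# Ben's pattern with tag=span and att=class: ".*?" is free to run across ">".
loose = re.search(r'<span.*?class="(.*?)".*?>', html, re.IGNORECASE)

# One possible tweak: never leave the current tag while looking for the attribute.
tight = re.search(r'<span\b[^>]*\sclass="(.*?)"[^>]*>', html, re.IGNORECASE)

print(loose.group(0))  # whole match of the loose pattern
print(tight.group(0))  # whole match of the tight pattern
print(loose.group(1), tight.group(1))  # captured attribute values
```

Both patterns capture the same attribute value. Only the whole-tag match differs, which matters if the position or length of the match is used to continue the search.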
RE: regex help for grabbing values of html tag attributes
Ben, I can see what you've got (I think) and it makes sense, but for some reason it's not working. I'm grabbing the HTML from www.google.com and running the regex against it, and this is what I've got in my code:

#refindnocase('<img.*?src="(.*?)".*?>',cfhttp.fileContent,0,true)#

I'm using cfdump to display the result, and what I see are 2 arrays (len and pos), each with a single entry: index 1, value 0. I thought that if the first value was 1, the second value would be the position of the occurrence of the search string. I know Google has an image, and I'm displaying cfhttp.filecontent in a textarea above so that I can ensure the results are coming back as expected. Any ideas? Am I doing something wrong?

John Burns
Certified Advanced ColdFusion MX Developer
Wyle Laboratories, Inc. | Web Developer

Message: http://www.houseoffusion.com/lists.cfm/link=i:4:199716
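A note on reading that dump: with the fourth argument set to true, REFindNoCase returns a structure whose pos and len arrays hold one entry for the whole match followed by one per subexpression. A lone 0 in each array means nothing matched. The Python sketch below is a rough analogue of that return shape, not the ColdFusion implementation. The helper name `refind_struct` and the sample tags are hypothetical. It also shows why a pattern that requires double quotes comes back empty on an unquoted attribute.

```python
import re

def refind_struct(pattern, text, start=1):
    """Rough analogue of REFindNoCase(pattern, text, start, true).

    Returns 1-based positions and lengths: index 0 of each list is the whole
    match, later entries are the subexpressions. No match gives [0] and [0].
    """
    regex = re.compile(pattern, re.IGNORECASE)
    match = regex.search(text, max(start, 1) - 1)  # convert 1-based start to 0-based
    if match is None:
        return {"pos": [0], "len": [0]}
    pos, length = [], []
    for i in range(regex.groups + 1):
        begin, end = match.span(i)
        if begin == -1:  # group did not take part in the match
            pos.append(0)
            length.append(0)
        else:
            pos.append(begin + 1)
            length.append(end - begin)
    return {"pos": pos, "len": length}

pattern = r'<img.*?src="(.*?)".*?>'

# Unquoted attribute: the pattern demands double quotes, so nothing matches.
print(refind_struct(pattern, "<img src=logo.gif width=276>"))

# Quoted attribute: whole match first, then the captured value.
print(refind_struct(pattern, '<img src="logo.gif">'))
```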
Re: regex help for grabbing values of html tag attributes
Try

refindnocase('<img.*?src="(.*?)".*?>',cfhttp.fileContent,1,'true')

I think the 0 and the non-quoted true are confusing it. Just a guess, though. Also, have you verified the contents of cfhttp.filecontent?

--Ben

Message: http://www.houseoffusion.com/lists.cfm/link=i:4:199718
Re: regex help for grabbing values of html tag attributes
What you're trying to do is far from trivial; however, I'm pretty sure that CF_REextract should help you a lot. See the link below.

--
REUSE CODE! Use custom tags; see http://www.contentbox.com/claude/customtags/tagstore.cfm
(Please send any spam to this address: [EMAIL PROTECTED])
Thanks.

Message: http://www.houseoffusion.com/lists.cfm/link=i:4:199723