Re: [R] regex - extracting src url
On 03/22/2016 12:44 AM, Omar André Gonzáles Díaz wrote:
> Hi,
>
> I have a DF with a column with "html", like this:
>
> <IMG SRC="https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?" BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">
>
> I need to get this:
>
> https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?
>
> I've got this so far:
>
> https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\" BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement
>
> This is the code I've used:
>
> carreras_normal$Impression.Tag..image. <- gsub("","\\1",carreras_normal$Impression.Tag..image., ignore.case = T)
>
> But I still need to get rid of this part:
>
> https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?*\" BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement*
>
> Thank you for your help.
>
> Omar Gonzáles.

You're querying an XML string, so use XPath, e.g., via the XML package:

    > as.character(xmlParse(y)[["//IMG/@SRC"]])
    [1] "https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"

`xmlParse()` translates the character string into an XML document. `[[` subsets the document to extract a single element. "//IMG/@SRC" follows the XPath specification (this section https://www.w3.org/TR/xpath-31/#abbrev of the specification provides a quick guide): starting from the 'root' of the document, it finds a node, at any depth, labeled IMG and selects its attribute labeled SRC.

A variation, if there were several IMG tags to be extracted, would be

    xpathSApply(xmlParse(y), "//IMG/@SRC", as.character)

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] regex - extracting src url
?strsplit  # I think

My "solution" assumes a fixed format for the URLs, as shown in your example. If that is not the case, it doesn't work.

    > y <- '<IMG SRC="https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"
    + BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">'
    > y  ## checking that the URL is as expected
    [1] "<IMG SRC=\"https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\"\nBORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement\">"
    > lapply(strsplit(y, "\""), "[", 2)  ## should work on a vector of URLs, y
    [[1]]
    [1] "https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
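The split-on-quotes idea above can be sketched for a whole column in base R, next to a regex variant in the spirit of the original `gsub(..., "\\1", ...)` attempt. Assumptions: `tags` is a hypothetical stand-in for the real column with shortened example URLs, each string sits on a single line, and SRC is the first quoted attribute in every tag. The regex pattern is an illustration, not the pattern from the original post.

```r
# Hypothetical sample column: shortened stand-ins for the real tags.
tags <- c(
  '<IMG SRC="https://example.com/a;ord=[timestamp];x=?" BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">',
  '<img src="https://example.com/b;ord=[timestamp];x=?" border="0" height="1" width="1" alt="Advertisement">'
)

# Variant 1: split on double quotes and keep the second piece.
# Only valid when SRC is the first quoted attribute in every tag.
by_split <- vapply(strsplit(tags, "\"", fixed = TRUE), `[`, character(1), 2)

# Variant 2: capture whatever sits between SRC=" and the next double quote.
# sub() returns non-matching strings unchanged, so flag those as NA instead.
pattern <- '.*SRC="([^"]*)".*'
by_regex <- ifelse(grepl(pattern, tags, ignore.case = TRUE),
                   sub(pattern, "\\1", tags, ignore.case = TRUE),
                   NA_character_)
```

Variant 1 returns a plain character vector rather than a list, which is easier to assign back to a data frame column. Variant 2 does not depend on the position of SRC within the tag, but it is still a regex on markup and will break on unusual quoting.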
[R] regex - extracting src url
Hi,

I have a DF with a column with "html", like this:

<IMG SRC="https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?" BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">

I need to get this:

https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?

I've got this so far:

https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\" BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement

This is the code I've used:

carreras_normal$Impression.Tag..image. <- gsub("","\\1",carreras_normal$Impression.Tag..image., ignore.case = T)

But I still need to get rid of this part:

https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?*\" BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement*

Thank you for your help.

Omar Gonzáles.