Hello, All:

      Thanks to Rasmus Liland, William Michels, and Luke Tierney for their help with my earlier web scraping question.  I've made progress, but I still have a problem:  One field contains "<br/>", which XML::readHTMLTable drops, so the two lines of the address run together:


sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
sosChars <- RCurl::getURL(sosURL)
MOcan <- XML::readHTMLTable(sosChars)
MOcan[[2]][1, 2]
[1] "4476 FIVE MILE RDSENECA MO 64865"


(Seneca <- regexpr('SENECA', sosChars))
substring(sosChars, Seneca-22, Seneca+14)


[1] "4476 FIVE MILE RD<br/>SENECA MO 64865"


      How can I get essentially the same result but without having XML::readHTMLTable suppress "<br/>"?
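
      One possible workaround, offered only as a minimal sketch:  replace each "<br/>" in the raw text with a newline before parsing, so the break survives as a separator inside the cell.  The small HTML string and the name "JANE DOE" below are made-up stand-ins for the real page, and the sketch assumes the XML package is installed; the same gsub call could presumably be applied to sosChars.

```r
# Made-up stand-in for the real page: one table cell containing <br/>.
html <- paste0(
  "<html><body><table>",
  "<tr><th>Name</th><th>Address</th></tr>",
  "<tr><td>JANE DOE</td>",
  "<td>4476 FIVE MILE RD<br/>SENECA MO 64865</td></tr>",
  "</table></body></html>")

# Replace <br>, <br/> and <br /> (any case) with a newline character.
htmlBr <- gsub("<br\\s*/?>", "\n", html, ignore.case = TRUE, perl = TRUE)

# Parse the modified text explicitly as HTML content, then read the tables.
doc  <- XML::htmlParse(htmlBr, asText = TRUE)
tabs <- XML::readHTMLTable(doc, stringsAsFactors = FALSE)

# The address cell should now contain the newline; split on it to
# recover the two address lines separately.
addr      <- tabs[[1]][1, 2]
addrLines <- strsplit(addr, "\n", fixed = TRUE)[[1]]
print(addrLines)
```

      A visible separator such as ", " could be used in place of "\n" if embedded newlines are awkward downstream.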


NOTE:  I get something very similar with xml2::read_html and rvest::html_table:


sosPointers <- xml2::read_html(sosChars)
MOcan2 <- rvest::html_table(sosPointers)
MOcan2[[2]][1, 2]
[1] "4476 FIVE MILE RDSENECA MO 64865"


      MOcan2 does not have names, and some of the fields are automatically converted to integers, which I think is not smart in this application.
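
      For the names and the integer conversion, a hedged sketch:  rvest::html_table has a "header" argument, and rvest 1.0.0 or later (an assumption about the installed version; older releases lack the argument) also has "convert = FALSE" to leave every column as character.  The table below is invented for illustration and is not the real page.

```r
# Made-up table whose second column would normally be read as integer.
html <- paste0(
  "<html><body><table>",
  "<tr><td>Name</td><td>District</td></tr>",
  "<tr><td>JANE DOE</td><td>007</td></tr>",
  "</table></body></html>")

doc <- xml2::read_html(html)

# header = TRUE: use the first row as column names.
# convert = FALSE: skip type conversion (assumes rvest >= 1.0.0).
tabs <- rvest::html_table(doc, header = TRUE, convert = FALSE)
tab  <- tabs[[1]]

print(names(tab))
print(tab$District)   # stays character, so leading zeros are kept
```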


      Thanks,
      Spencer Graves

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.