Re: [R] interval between specific characters in a string...
On 04/12/2022 00:25, Hadley Wickham wrote:
> On Sun, Dec 4, 2022 at 12:50 PM Hervé Pagès wrote:
> > On 03/12/2022 07:21, Bert Gunter wrote:
> > > Perhaps it is worth pointing out that looping constructs like lapply()
> > > can be avoided and the procedure vectorized by mimicking Martin Morgan's
> > > solution:
> > >
> > >   ## s is the string to be searched.
> > >   diff(c(0,grep('b',strsplit(s,'')[[1]])))
> > >
> > > However, Martin's solution is simpler and likely even faster as the
> > > regex engine is unneeded:
> > >
> > >   diff(c(0, which(strsplit(s, "")[[1]] == "b"))) ## completely vectorized
> > >
> > > This seems much preferable to me.
> >
> > Of all the proposed solutions, Andrew Hart's solution seems the most
> > efficient:
> >
> >   big_string <- strrep("abaaabbabaaabaaab", 50)
> >
> >   system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
> >   #   user  system elapsed
> >   #  0.736   0.028   0.764
> >
> >   system.time(diff(c(0, which(strsplit(big_string, "", fixed=TRUE)[[1]] == "b"))))
> >   #   user  system elapsed
> >   #  2.100   0.356   2.455
> >
> > The bigger the string, the bigger the gap in performance.
> >
> > Also, the bigger the average gap between 2 successive b's, the bigger
> > the gap in performance.
> >
> > Finally: always use fixed=TRUE in strsplit() if you don't need to use
> > the regex engine.
>
> You can do a bit better if you are willing to use stringr:
>
>   library(stringr)
>   big_string <- strrep("abaaabbabaaabaaab", 50)
>
>   system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
>   #>    user  system elapsed
>   #>   0.126   0.002   0.128
>
>   system.time(str_length(str_split(big_string, fixed("b"))[[1]]))
>   #>    user  system elapsed
>   #>   0.103   0.004   0.107
>
> (And my timings also suggest that it's time for Hervé to get a new
> computer :P)

LOL

Actually my timings were for

  big_string <- strrep("abaaabbabaaabaaab", 150)

but I mixed things up when I copy-pasted them into my email. That said, I do still need a new laptop, and I'm in the process of getting one ;-)

H.
-- Hervé Pagès Bioconductor Core Team hpages.on.git...@gmail.com __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
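For readers who want to confirm that the approaches compared in this thread return the same intervals, here is a minimal sketch on a short string. The variable names (`by_grep`, `by_which`, `by_split`, `tail_which`, `tail_split`) are illustrative and do not come from the thread. One caveat, shown in the second half: the split-based version appears to agree with the position-based ones only when the string ends in "b"; otherwise strsplit() returns one extra trailing piece.

```r
# Short test string; "b" occurs at positions 2, 6, 7, 9, 13 and 17.
s <- "abaaabbabaaabaaab"

# Split into single characters once and reuse the result.
chars <- strsplit(s, "", fixed = TRUE)[[1]]

# Position-based approaches: locate every "b", then take differences.
by_grep  <- diff(c(0, grep("b", chars, fixed = TRUE)))
by_which <- diff(c(0, which(chars == "b")))

# Split-based approach: length of each piece between "b"s, plus one.
by_split <- nchar(strsplit(s, split = "b", fixed = TRUE)[[1]]) + 1

print(by_which)

# Caveat: a string that does NOT end in "b" gives one extra value
# with the split-based approach (the piece after the last "b").
s2 <- "abaa"
tail_which <- diff(c(0, which(strsplit(s2, "", fixed = TRUE)[[1]] == "b")))
tail_split <- nchar(strsplit(s2, split = "b", fixed = TRUE)[[1]]) + 1
```

If the trailing piece matters for your data, the position-based form is the safer default; the split-based form trades that for speed.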
Re: [R-es] R-help-es Digest, Vol 166, Issue 1
Hi, attached is the process I use to identify and exclude outliers:

  library(data.table)
  library(smbinning)

  # analyze and filter outliers
  # disposable income
  ing_dsp_out1 <- dlookr::imputate_outlier(eda1, ing_dsp_vl, method = "capping")
  ing_dsp_out2 <- data.table(out_pos = attr(ing_dsp_out1, "outlier_pos"),
                             out_vl = attr(ing_dsp_out1, "outliers"))[order(out_vl)]
  # outlier statistics
  out_sta1 <- data.table(smbinning.eda(ing_dsp_out2, rounding = 3, pbar = 0)$eda) # Table with basic statistics
  # keep values at or below the Q50 (median) of the outliers
  eda2 <- eda1[ing_dsp_vl <= out_sta1[Field == "out_vl", Q50]]
  dlookr::plot_outlier(eda2[, .(ing_dsp_vl)])

On Sat, 3 Dec 2022 at 06:00, wrote:
> Send messages for the R-help-es list to
>   r-help-es@r-project.org
>
> To subscribe or unsubscribe via the web
>   https://stat.ethz.ch/mailman/listinfo/r-help-es
>
> Or by email, sending a message with the text "help" in the
> subject or in the body to:
>   r-help-es-requ...@r-project.org
>
> You can contact the list administrator by writing to:
>   r-help-es-ow...@r-project.org
>
> If you reply to any content of this message, please edit the
> subject line so that the text is more specific than:
> "Re: Contents of R-help-es digest...". Also, please include in
> the reply only those parts of the message you are replying to.
>
> Today's topics:
>
>   1. removing outliers in a tapply (Manuel Mendoza)
>
> -- Forwarded message --
> From: Manuel Mendoza
> To: Lista R
> Date: Sat, 3 Dec 2022 09:14:11 +0100
> Subject: [R-es] removing outliers in a tapply
>
> Good morning, I use:
>
>   max <- tapply (Data$varnum, Data$varcat, max)
>
> to get the maximum of varnum in each of the categories of varcat.
>
> How could I get the maxima, but without the outliers (Q75 + 1.5*IQR)?
>
> It is easy to remove the upper outliers of varnum as a whole, but that is
> not what I need to remove; I need to remove the outliers within each
> category of varcat.
>
> Thanks, as always,
> Manuel

___
R-help-es mailing list
R-help-es@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-help-es
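One possible base-R answer to the question quoted above is to pass tapply() a small function that drops the upper outliers of each group before taking the maximum. This is only a sketch under stated assumptions: the helper name `max_no_outliers` and the toy `Data` frame are made up for illustration, and "outlier" is taken to mean a value above Q75 + 1.5*IQR computed within the category, as in the question.

```r
# Hypothetical helper: maximum of x after removing upper outliers,
# where the fence Q75 + 1.5 * IQR is computed on x itself.
max_no_outliers <- function(x) {
  q75   <- quantile(x, 0.75, na.rm = TRUE, names = FALSE)
  upper <- q75 + 1.5 * IQR(x, na.rm = TRUE)
  max(x[x <= upper], na.rm = TRUE)
}

# Toy data (an assumption, not from the thread): category "a"
# contains one obvious upper outlier, category "b" contains none.
Data <- data.frame(
  varnum = c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14),
  varcat = rep(c("a", "b"), each = 5)
)

# Because tapply() calls the function once per category, the fence
# is computed within each category rather than on all of varnum.
res <- tapply(Data$varnum, Data$varcat, max_no_outliers)
print(res)
```

Storing the result under a name other than `max` also avoids masking the base function, which the original one-liner does.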
Re: [R] interval between specific characters in a string...
On 04/12/2022 10:25, Hadley Wickham wrote:
> On Sun, Dec 4, 2022 at 1:22 PM wrote:
> > This may be a fairly dumb and often asked question about some functions
> > like strsplit() that return a list of things, often a list of ONE thing
> > that may be another list or a vector and needs to be made into something
> > simpler.
> >
> > The examples shown below have used various methods to convert the result
> > to a vector but why is this not a built-in option for such a function to
> > simplify the result either when possible or always?
> >
> > Sure you can subset it with "[[1]]" or use unlist() or as.vector() to
> > coerce it back to a vector. But when you have a very common idiom and a
> > fact that many people waste lots of time figuring out they had a LIST
> > containing a single vector and debug, maybe it would have made sense to
> > have either a sister function like strsplit_v() that returns what is
> > actually wanted or allow strsplit(whatever, output="vector") or something
> > giving the same result.
> >
> > Yes, I understand that when there is a workaround, it just complicates
> > the base, but there could be a package that consistently does things like
> > this to make the use of such functions easier.
>
> The next version of stringr (currently being processed by CRAN) provides
> str_split_1() for exactly this purpose.

Thanks! Much appreciated...

> Hadley

--
Micha Silver
Ben Gurion Univ.
Sde Boker, Remote Sensing Lab
cell: +972-523-665918
Re: [R] interval between specific characters in a string...
On Sun, Dec 4, 2022 at 1:22 PM wrote:
>
> This may be a fairly dumb and often asked question about some functions like
> strsplit() that return a list of things, often a list of ONE thing that may be
> another list or a vector and needs to be made into something simpler.
>
> The examples shown below have used various methods to convert the result to a
> vector but why is this not a built-in option for such a function to simplify
> the result either when possible or always?
>
> Sure you can subset it with "[[1]]" or use unlist() or as.vector() to coerce
> it back to a vector. But when you have a very common idiom and a fact that
> many people waste lots of time figuring out they had a LIST containing a
> single vector and debug, maybe it would have made sense to have either a
> sister function like strsplit_v() that returns what is actually wanted or
> allow strsplit(whatever, output="vector") or something giving the same result.
>
> Yes, I understand that when there is a workaround, it just complicates the
> base, but there could be a package that consistently does things like this to
> make the use of such functions easier.

The next version of stringr (currently being processed by CRAN) provides
str_split_1() for exactly this purpose.

Hadley

--
http://hadley.nz
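To make the list-versus-vector point concrete, the sketch below shows what strsplit() returns for a single string and the two base-R ways of getting at the character vector mentioned in the question. The variable names are illustrative. The final, guarded call to str_split_1() assumes a stringr version that ships that function (1.5.0 or later, as far as I know); it is skipped when stringr is missing or older.

```r
s <- "abaaabbab"

# strsplit() is vectorized over its input, so it always returns a list
# with one element per input string, even for a single string.
parts_list <- strsplit(s, "b", fixed = TRUE)
str(parts_list)

# Two base-R ways to get the character vector itself.
parts_vec <- parts_list[[1]]    # extract the first (only) element
parts_unl <- unlist(parts_list) # flatten; also merges multiple inputs

# Optional: stringr's single-string variant returns a vector directly.
# Guarded so the sketch still runs without a recent stringr.
if (requireNamespace("stringr", quietly = TRUE) &&
    utils::packageVersion("stringr") >= "1.5.0") {
  parts_stringr <- stringr::str_split_1(s, stringr::fixed("b"))
  print(parts_stringr)
}
```

The list return type is what lets strsplit() handle a vector of several strings in one call; `[[1]]` is only safe when you know there is exactly one input.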
Re: [R] interval between specific characters in a string...
On Sun, Dec 4, 2022 at 12:50 PM Hervé Pagès wrote:
>
> On 03/12/2022 07:21, Bert Gunter wrote:
> > Perhaps it is worth pointing out that looping constructs like lapply() can
> > be avoided and the procedure vectorized by mimicking Martin Morgan's
> > solution:
> >
> >   ## s is the string to be searched.
> >   diff(c(0,grep('b',strsplit(s,'')[[1]])))
> >
> > However, Martin's solution is simpler and likely even faster as the regex
> > engine is unneeded:
> >
> >   diff(c(0, which(strsplit(s, "")[[1]] == "b"))) ## completely vectorized
> >
> > This seems much preferable to me.
>
> Of all the proposed solutions, Andrew Hart's solution seems the most
> efficient:
>
>   big_string <- strrep("abaaabbabaaabaaab", 50)
>
>   system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
>   #   user  system elapsed
>   #  0.736   0.028   0.764
>
>   system.time(diff(c(0, which(strsplit(big_string, "", fixed=TRUE)[[1]] == "b"))))
>   #   user  system elapsed
>   #  2.100   0.356   2.455
>
> The bigger the string, the bigger the gap in performance.
>
> Also, the bigger the average gap between 2 successive b's, the bigger
> the gap in performance.
>
> Finally: always use fixed=TRUE in strsplit() if you don't need to use
> the regex engine.

You can do a bit better if you are willing to use stringr:

  library(stringr)
  big_string <- strrep("abaaabbabaaabaaab", 50)

  system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
  #>    user  system elapsed
  #>   0.126   0.002   0.128

  system.time(str_length(str_split(big_string, fixed("b"))[[1]]))
  #>    user  system elapsed
  #>   0.103   0.004   0.107

(And my timings also suggest that it's time for Hervé to get a new
computer :P)

It feels like an approach that uses locations should be faster since
you wouldn't have to construct all the intermediate strings.

  system.time(pos <- str_locate_all(big_string, fixed("b"))[[1]][,1])
  #>    user  system elapsed
  #>   0.075   0.004   0.080

  # I suspect this could be optimised with a little thought, making this
  # approach faster overall
  system.time(c(0, diff(pos)))
  #>    user  system elapsed
  #>   0.022   0.006   0.027

Hadley

--
http://hadley.nz
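The location-based idea can also be tried without stringr. The following is a rough base-R sketch, not the method timed in this thread: it swaps in gregexpr() with fixed = TRUE to get the match positions, and the names `hits`, `pos` and `intervals` are illustrative. No timings are claimed for it.

```r
s <- "abaaabbabaaabaaab"

# gregexpr() returns a list with one element per input string; each
# element holds the start positions of all matches, or -1 if none.
hits <- gregexpr("b", s, fixed = TRUE)[[1]]

# Drop the attributes (match.length etc.) and handle the no-match case.
pos <- if (hits[1] == -1L) integer(0) else as.integer(hits)

# Prepend 0 BEFORE differencing so the first value is the distance
# from the start of the string to the first "b".
intervals <- diff(c(0L, pos))
print(intervals)

# For comparison: differencing first and then prepending 0 yields a
# leading 0 instead of the first position.
alt <- c(0L, diff(pos))
```

Note that `c(0, diff(pos))` and `diff(c(0, pos))` differ only in their first element, so which one to use depends on whether the distance from the start of the string to the first match should be counted.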