Re: [R] interval between specific characters in a string...

2022-12-04 Thread Hervé Pagès



On 04/12/2022 00:25, Hadley Wickham wrote:

On Sun, Dec 4, 2022 at 12:50 PM Hervé Pagès  wrote:

On 03/12/2022 07:21, Bert Gunter wrote:

Perhaps it is worth pointing out that looping constructs like lapply() can
be avoided and the procedure vectorized by mimicking Martin Morgan's
solution:

## s is the string to be searched.
diff(c(0,grep('b',strsplit(s,'')[[1]])))

However, Martin's solution is simpler and likely even faster as the regex
engine is unneeded:

diff(c(0, which(strsplit(s, "")[[1]] == "b"))) ## completely vectorized

This seems much preferable to me.
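
As a quick sanity check of the two one-liners above, here is a minimal sketch on a small made-up string (the value of s is an assumption for illustration, not taken from the thread). Both approaches should return the same gaps:

```r
# Hypothetical example string; not from the original thread.
s <- "abaaabbab"

# One element per character.
chars <- strsplit(s, "", fixed = TRUE)[[1]]

# Regex-based: grep() returns the indices of the elements matching "b".
gaps_grep <- diff(c(0, grep("b", chars)))

# Comparison-based: which() returns the indices where chars == "b".
gaps_which <- diff(c(0, which(chars == "b")))

gaps_grep
gaps_which
identical(gaps_grep, gaps_which)
```

The first element of each result is the position of the first "b", because 0 is prepended before diff() is taken.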

Of all the proposed solutions, Andrew Hart's solution seems the most
efficient:

big_string <- strrep("abaaabbabaaabaaab", 50)

system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
#    user  system elapsed
#   0.736   0.028   0.764

system.time(diff(c(0, which(strsplit(big_string, "", fixed=TRUE)[[1]] == "b"))))
#    user  system elapsed
#   2.100   0.356   2.455

The bigger the string, the bigger the gap in performance.

Also, the bigger the average gap between 2 successive b's, the bigger
the gap in performance.

Finally: always use fixed=TRUE in strsplit() if you don't need to use
the regex engine.
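
Besides speed, there is also a correctness angle to fixed=TRUE: without it, the split pattern is treated as a regular expression. A minimal sketch, using a made-up input string (an assumption for illustration only):

```r
# Hypothetical input; "." is a regex metacharacter that matches any character.
x <- "a.b.c"

# Without fixed = TRUE the pattern is a regular expression, so "." matches
# every character and none of the letters survive the split.
regex_split <- strsplit(x, ".")[[1]]

# With fixed = TRUE the pattern is taken literally.
fixed_split <- strsplit(x, ".", fixed = TRUE)[[1]]

regex_split
fixed_split
```

For a plain letter like "b" both forms give the same pieces, so in that case the choice only affects speed.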

You can do a bit better if you are willing to use stringr:

library(stringr)
big_string <- strrep("abaaabbabaaabaaab", 50)

system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
#>    user  system elapsed
#>   0.126   0.002   0.128

system.time(str_length(str_split(big_string, fixed("b"))[[1]]))
#>    user  system elapsed
#>   0.103   0.004   0.107

(And my timings also suggest that it's time for Hervé to get a new computer :P)


LOL

Actually my timings were for

  big_string <- strrep("abaaabbabaaabaaab", 150)

but I mixed things up when I copy-pasted them into my email.

That said, I do still need a new laptop, and I'm in the process of
getting one ;-)


H.

--
Hervé Pagès

Bioconductor Core Team
hpages.on.git...@gmail.com

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R-es] R-help-es Digest, Vol 166, Issue 1

2022-12-04 Thread patricio fuenmayor
Hi, attached is the process I use to identify and exclude outliers

library(data.table)  # data.table() and the [order()] / [i, j] syntax
library(smbinning)   # smbinning.eda()

# analyze and filter outliers
# disposable income
ing_dsp_out1 <- dlookr::imputate_outlier(eda1, ing_dsp_vl, method = "capping")
ing_dsp_out2 <- data.table(out_pos = attr(ing_dsp_out1, "outlier_pos"),
                           out_vl  = attr(ing_dsp_out1, "outliers"))[order(out_vl)]
# outlier statistics: table with basic statistics
out_sta1 <- data.table(smbinning.eda(ing_dsp_out2, rounding = 3, pbar = 0)$eda)
# keep values no greater than the Q50 (median) of the outliers
eda2 <- eda1[ing_dsp_vl <= out_sta1[Field == "out_vl", Q50]]

dlookr::plot_outlier(eda2[, .(ing_dsp_vl)])
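
For the question as originally asked (the maximum per category, ignoring upper outliers defined within each category), a base R sketch might look like the following. The data frame below is made up, and max_no_upper_outliers() is a hypothetical helper name; the sketch also assumes varnum has no missing values.

```r
# Made-up data standing in for Data$varnum and Data$varcat.
Data <- data.frame(
  varnum = c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14),
  varcat = rep(c("A", "B"), each = 5)
)

# Hypothetical helper: maximum of x after dropping values above
# Q75 + 1.5 * IQR, both computed from x itself (i.e. within one category).
max_no_upper_outliers <- function(x) {
  upper <- quantile(x, 0.75, names = FALSE) + 1.5 * IQR(x)
  max(x[x <= upper])
}

# tapply() calls the helper once per category, so the outlier threshold
# is recomputed for each level of varcat.
tapply(Data$varnum, Data$varcat, max_no_upper_outliers)
```

Because the threshold is computed inside the function that tapply() applies, the value 100 is dropped as an outlier of category A only, and category B is unaffected.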

On Sat, 3 Dec 2022 at 06:00,
wrote:

> Today's topics:
>
>    1. removing outliers in a tapply (Manuel Mendoza)
>
>
> -- Forwarded message --
> From: Manuel Mendoza 
> To: Lista R 
> Cc:
> Bcc:
> Date: Sat, 3 Dec 2022 09:14:11 +0100
> Subject: [R-es] removing outliers in a tapply
> Good morning, I use:
>
> max <- tapply (Data$varnum, Data$varcat, max)
>
> to get the maximum of varnum in each of the categories of varcat.
>
> How could I get the maxima, but without the outliers (Q75 + 1.5*IQR)?
>
> It is easy to remove the upper outliers of varnum as a whole, but that is
> not what I need to remove; I need to remove the outliers within each
> category of varcat.
>
> Thanks, as always,
> Manuel
>
>
> ___
> R-help-es mailing list
> R-help-es@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-help-es
>



Re: [R] interval between specific characters in a string...

2022-12-04 Thread Micha Silver



On 04/12/2022 10:25, Hadley Wickham wrote:

On Sun, Dec 4, 2022 at 1:22 PM  wrote:

This may be a fairly dumb and often-asked question about functions like
strsplit() that return a list of things, often a list of ONE thing that may
be another list or a vector and needs to be made into something simpler.

The examples shown below have used various methods to convert the result to a
vector, but why is there no built-in option for such a function to simplify
the result, either when possible or always?

Sure, you can subset it with "[[1]]" or use unlist() to coerce it back to a
vector. But when the idiom is this common, and many people waste lots of time
debugging before they figure out they had a LIST containing a single vector,
maybe it would have made sense to have either a sister function like
strsplit_v() that returns what is actually wanted, or to allow
strsplit(whatever, output="vector") or something giving the same result.

Yes, I understand that when there is a workaround, it just complicates the 
base, but there could be a package that consistently does things like this to 
make the use of such functions easier.

The next version of stringr (currently being processed by CRAN)
provides str_split_1() for exactly this purpose.



Thanks!

Well appreciated...




Hadley


--
Micha Silver
Ben Gurion Univ.
Sde Boker, Remote Sensing Lab
cell: +972-523-665918



Re: [R] interval between specific characters in a string...

2022-12-04 Thread Hadley Wickham
On Sun, Dec 4, 2022 at 1:22 PM  wrote:
>
> This may be a fairly dumb and often-asked question about functions like
> strsplit() that return a list of things, often a list of ONE thing that may
> be another list or a vector and needs to be made into something simpler.
>
> The examples shown below have used various methods to convert the result to a
> vector, but why is there no built-in option for such a function to simplify
> the result, either when possible or always?
>
> Sure, you can subset it with "[[1]]" or use unlist() to coerce it back to a
> vector. But when the idiom is this common, and many people waste lots of time
> debugging before they figure out they had a LIST containing a single vector,
> maybe it would have made sense to have either a sister function like
> strsplit_v() that returns what is actually wanted, or to allow
> strsplit(whatever, output="vector") or something giving the same result.
>
> Yes, I understand that when there is a workaround, it just complicates the 
> base, but there could be a package that consistently does things like this to 
> make the use of such functions easier.

The next version of stringr (currently being processed by CRAN)
provides str_split_1() for exactly this purpose.
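
To make the idiom concrete, here is a small sketch of the options mentioned above. The input string is made up, and the stringr part only runs if an installed stringr actually exports str_split_1(), so treat it as an illustration under that assumption:

```r
x <- "a-b-c"   # hypothetical input

# Base R: strsplit() is vectorised over its input, so it always returns a
# list with one element per input string, even when there is only one.
pieces_list <- strsplit(x, "-", fixed = TRUE)
pieces_vec  <- pieces_list[[1]]   # extract the single character vector

# For a single input string, unlist() gives the same result.
identical(pieces_vec, unlist(pieces_list))

# stringr: str_split_1() takes one string and returns a character vector.
# Guarded so the sketch still runs where stringr is missing or too old.
if (requireNamespace("stringr", quietly = TRUE) &&
    exists("str_split_1", envir = asNamespace("stringr"))) {
  print(stringr::str_split_1(x, stringr::fixed("-")))
}
```

The list return value is what allows several input strings to split into different numbers of pieces, which is why [[1]] is needed in the single-string case.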

Hadley

-- 
http://hadley.nz



Re: [R] interval between specific characters in a string...

2022-12-04 Thread Hadley Wickham
On Sun, Dec 4, 2022 at 12:50 PM Hervé Pagès  wrote:
>
> On 03/12/2022 07:21, Bert Gunter wrote:
> > Perhaps it is worth pointing out that looping constructs like lapply() can
> > be avoided and the procedure vectorized by mimicking Martin Morgan's
> > solution:
> >
> > ## s is the string to be searched.
> > diff(c(0,grep('b',strsplit(s,'')[[1]])))
> >
> > However, Martin's solution is simpler and likely even faster as the regex
> > engine is unneeded:
> >
> > diff(c(0, which(strsplit(s, "")[[1]] == "b"))) ## completely vectorized
> >
> > This seems much preferable to me.
>
> Of all the proposed solutions, Andrew Hart's solution seems the most
> efficient:
>
>    big_string <- strrep("abaaabbabaaabaaab", 50)
>
>    system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
>    #    user  system elapsed
>    #   0.736   0.028   0.764
>
>    system.time(diff(c(0, which(strsplit(big_string, "", fixed=TRUE)[[1]] == "b"))))
>    #    user  system elapsed
>    #   2.100   0.356   2.455
>
> The bigger the string, the bigger the gap in performance.
>
> Also, the bigger the average gap between 2 successive b's, the bigger
> the gap in performance.
>
> Finally: always use fixed=TRUE in strsplit() if you don't need to use
> the regex engine.

You can do a bit better if you are willing to use stringr:

library(stringr)
big_string <- strrep("abaaabbabaaabaaab", 50)

system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
#>    user  system elapsed
#>   0.126   0.002   0.128

system.time(str_length(str_split(big_string, fixed("b"))[[1]]))
#>    user  system elapsed
#>   0.103   0.004   0.107

(And my timings also suggest that it's time for Hervé to get a new computer :P)

It feels like an approach that uses locations should be faster since
you wouldn't have to construct all the intermediate strings.

system.time(pos <- str_locate_all(big_string, fixed("b"))[[1]][,1])
#>    user  system elapsed
#>   0.075   0.004   0.080
# I suspect this could be optimised with a little thought making this approach
# faster overall
system.time(c(0, diff(pos)))
#>    user  system elapsed
#>   0.022   0.006   0.027
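
For anyone who wants to try the location idea without stringr, here is a rough base R sketch. It swaps in gregexpr(fixed = TRUE) for str_locate_all(), and it uses diff(c(0, pos)) so that the first element is the position of the first "b", as in the earlier solutions. No timings are claimed for it.

```r
s <- "abaaabbabaaabaaab"   # the repeating unit used in the thread

# Positions of the literal "b"; as.integer() drops gregexpr()'s attributes.
# (With no match at all gregexpr() returns -1; that case is not handled here.)
pos <- as.integer(gregexpr("b", s, fixed = TRUE)[[1]])

gaps_from_pos   <- diff(c(0, pos))
gaps_from_split <- nchar(strsplit(s, "b", fixed = TRUE)[[1]]) + 1

gaps_from_pos
gaps_from_split
all(gaps_from_pos == gaps_from_split)

# Caveat: if characters follow the last "b", the split-based version returns
# one extra element, because the trailing piece is counted too.
s2 <- paste0(s, "aa")
n_from_pos   <- length(gregexpr("b", s2, fixed = TRUE)[[1]])
n_from_split <- length(strsplit(s2, "b", fixed = TRUE)[[1]])
c(n_from_pos, n_from_split)
```

The two approaches agree on the thread's test string because it ends in "b"; for arbitrary input the position-based version is the one that counts only actual occurrences.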

Hadley

-- 
http://hadley.nz
