Re: [Bioc-devel] Help understanding an R performance issue

2017-06-30 Thread juliosarmientota
66

Sent from my MetroPCS 4G LTE Android device

On Jun 30, 2017 5:32 AM, Bernat Gel wrote:
>
> Ok, so it seems more like a bug somewhere than something I failed to
> understand, then. One of the surprises for me is that shuffling the data so
> the misses do not happen one after the other seems to s
___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Bioconductor stats

2017-06-30 Thread Lluís Revilla
Hi Hervé,

I wasn't aware of the discrepancy between the monthly number of IPs and the
yearly number of IPs.
I didn't realize that my own package showed this distinction between
monthly and yearly number of IPs.
Thanks for pointing it out.

Yes, usually the effect of a package being in several categories is quite
low.
But for some packages it happens more often than once a year, and for
others each occurrence involves more than one download.

Cheers,

Lluís



On 30 June 2017 at 11:44, Hervé Pagès  wrote:

> Hi Lluís,
>
> As Sean already said, mirrors are not included in the stats. The
> monthly nb of distinct IPs is reset every month and the yearly
> nb of distinct IPs is reset every year.
>
> Some packages are indeed in two categories. Category assignment is
> based on the download URL only. For some mysterious reason the Apache
> logs contain some lines that indicate that AnnotationDbi was downloaded
> from a URL that points to a data experiment repository. These lines
> are very rare though (1 or 2 per year), so overall they don't have any
> significant impact on the stats. Anyway, that's something we'll need
> to dig into at some point.
>
> Cheers,
> H.
>
>
> On 06/27/2017 04:34 AM, Lluís Revilla wrote:
>
>> Hi,
>>
>> I have been looking at the stats of Bioconductor, and I would like to know
>> more about how they are calculated.
>>
>> Do these stats account for the mirror sites? Are there any stats of the
>> usage of mirrors?
>>
>> I found some packages that have downloads in two categories for the same
>> month. For instance, AnnotationDbi has some downloads as an experimental
>> data package:
>> http://bioconductor.org/packages/stats/data-experiment/AnnotationDbi/
>> while most of the downloads are in the software category (the right one).
>> It seems that nearly 500 packages have downloads in two categories.
>>
>> The "Nb of distinct IPs", if I understand correctly, is for each package and
>> month. So if the same IP downloads the package again, it is listed as a new
>> IP, isn't it? I assume that if mirrors are counted, either no one downloads
>> the same packages from different mirrors from the same IP, or this
>> information is shared across mirrors for these stats.
>>
>> Regards,
>>
>> Lluís
>>
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>


Re: [Bioc-devel] Help understanding an R performance issue

2017-06-30 Thread Bernat Gel

Ok, that makes sense.

In my current use case I think I'll be able to first filter out the
elements that will miss, so this behaviour is not triggered.

But it's good to know this happens so I can try to avoid it in the future.
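
A minimal sketch of that filtering step, assuming the `chrs` vector from the original post and a shortened version of `cc2`. The names `hits` and `res` are introduced here only for illustration:

```r
# Named vector from the original post.
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))

# Shortened subscript: "chr23" and "chr24" are not names in 'chrs'.
cc2 <- c(rep("chr21", 5), rep("chr22", 5), rep("chr23", 3), rep("chr24", 2))

# Flag the names that exist before subscripting, so the character
# subscript passed to `[` never contains a miss.
hits <- cc2 %in% names(chrs)

# Pre-fill with NA and look up only the names that are present.
res <- rep(NA_integer_, length(cc2))
res[hits] <- chrs[cc2[hits]]
res
```

The values match what `chrs[cc2]` would return (without the names), but the slow path for missing names is never reached.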

Thanks.

Bernat


*Bernat Gel Moreno*
Bioinformatician

Hereditary Cancer Program
Program of Predictive and Personalized Medicine of Cancer (PMPPC)
Germans Trias i Pujol Research Institute (IGTP)

Campus Can Ruti
Carretera de Can Ruti, Camí de les Escoles s/n
08916 Badalona, Barcelona, Spain

Tel: (+34) 93 554 3068
Fax: (+34) 93 497 8654
08916 Badalona, Barcelona, Spain
b...@igtp.cat 
www.germanstrias.org 









On 06/30/2017 at 03:20 PM, Michael Lawrence wrote:

The reason it's faster when shuffled vs. all at the end is that when a
miss happens R compares the string to all strings before it in the
subscript. So it's a lot worse to have a miss towards the end.

As Martin wrote, there are basically two possible improvements that
are somewhat complementary:
1) Tell stringSubscript() that it is not replacing so there is no need
to do that scan. This would require passing an argument down the call
stack.
2) Do a self match on the subscript like in Martin's patch, although
it should probably be done lazily on the first miss.

Michael

On Fri, Jun 30, 2017 at 3:32 AM, Bernat Gel  wrote:

Ok, so it seems more like a bug somewhere than something I failed to
understand, then.

One of the surprises for me is that shuffling the data so the misses do not
happen one after the other seems to solve the issue...

Thanks,

Bernat

On 06/30/2017 at 11:21 AM, Hervé Pagès wrote:

Hi Bernat, Michael,

FWIW I reported this issue on R-devel a couple of times. Last time was
in 2013:

   https://stat.ethz.ch/pipermail/r-devel/2013-May/066616.html

Cheers,
H.

On 06/29/2017 11:58 PM, Bernat Gel wrote:

Yes, that would explain part of the situation. But example cc5 shows
that hash misses would account only for part of the time.

Thanks for taking a look into it

Bernat

On 06/29/2017 at 08:48 PM, Michael Lawrence wrote:

Preliminary analysis suggests that this is due to hash misses. When
that happens, R ends up doing costly string comparisons that are on
the order of n^2 where 'n' is the length of the subscript. Looking
into it.

On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel  wrote:

Hi all,

This is not strictly a Bioconductor question, but I hope some of the
experts
here can help me understand what's going on with a performance issue
I've
found working on a package.

It has to do with selecting elements from a named vector.

If we have a vector with the names of the chromosomes and their order

  chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
  chrs

 chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13
    1     2     3     4     5     6     7     8     9    10    11    12    13
chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
   14    15    16    17    18    19    20    21    22    23    24

And we have a second vector of chromosomes (in this case, the
chromosomes
from SNP-array probes)
And we want to use the second vector to select from the first one by
name

  cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19",
14726),
  rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
  rep("chrX", 17498), rep("chrY", 1296))
  print(system.time(replicate(10, chrs[cc])))

user  system elapsed
0.136   0.004   0.141

It's 

Re: [Bioc-devel] Help understanding an R performance issue

2017-06-30 Thread Michael Lawrence
The reason it's faster when shuffled vs. all at the end is that when a
miss happens R compares the string to all strings before it in the
subscript. So it's a lot worse to have a miss towards the end.

As Martin wrote, there are basically two possible improvements that
are somewhat complementary:
1) Tell stringSubscript() that it is not replacing so there is no need
to do that scan. This would require passing an argument down the call
stack.
2) Do a self match on the subscript like in Martin's patch, although
it should probably be done lazily on the first miss.
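
A user-level analogue of the second idea can be sketched in plain R: deduplicate the subscript with a self match, resolve each distinct name once, then expand the result back to full length. This only illustrates the idea. It is not Martin's patch or any change to R itself, and `lookup_by_name` is a hypothetical helper:

```r
# Hypothetical helper: resolve names through a self match on the
# subscript, so each distinct name (hit or miss) is looked up only once.
lookup_by_name <- function(x, i) {
  u <- unique(i)       # distinct names in the subscript
  pos <- match(i, u)   # self match: position of each name within 'u'
  x[u][pos]            # look up distinct names, then expand by position
}

chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
cc2 <- c(rep("chr22", 4), rep("chr23", 3), rep("chr24", 2))

res <- lookup_by_name(chrs, cc2)
unname(res)
```

With this approach the character subscript given to `[` contains at most one copy of each missing name, so a miss has very few earlier strings to be compared against.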

Michael

On Fri, Jun 30, 2017 at 3:32 AM, Bernat Gel  wrote:
> Ok, so it seems more like a bug somewhere than something I failed to
> understand, then.
>
> One of the surprises for me is that shuffling the data so the misses do not
> happen one after the other seems to solve the issue...
>
> Thanks,
>
> Bernat
>
> On 06/30/2017 at 11:21 AM, Hervé Pagès wrote:
>>
>> Hi Bernat, Michael,
>>
>> FWIW I reported this issue on R-devel a couple of times. Last time was
>> in 2013:
>>
>>   https://stat.ethz.ch/pipermail/r-devel/2013-May/066616.html
>>
>> Cheers,
>> H.
>>
>> On 06/29/2017 11:58 PM, Bernat Gel wrote:
>>>
>>> Yes, that would explain part of the situation. But example cc5 shows
>>> that hash misses would account only for part of the time.
>>>
>>> Thanks for taking a look into it
>>>
>>> Bernat
>>>
>>> On 06/29/2017 at 08:48 PM, Michael Lawrence wrote:

 Preliminary analysis suggests that this is due to hash misses. When
 that happens, R ends up doing costly string comparisons that are on
 the order of n^2 where 'n' is the length of the subscript. Looking
 into it.

 On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel  wrote:
>
> Hi all,
>
> This is not strictly a Bioconductor question, but I hope some of the
> experts
> here can help me understand what's going on with a performance issue
> I've
> found working on a package.
>
> It has to do with selecting elements from a named vector.
>
> If we have a vector with the names of the chromosomes and their order
>
>  chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
>  chrs
>
>  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13
>     1     2     3     4     5     6     7     8     9    10    11    12    13
> chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
>    14    15    16    17    18    19    20    21    22    23    24
>
> And we have a second vector of chromosomes (in this case, the
> chromosomes
> from SNP-array probes)
> And we want to use the second vector to select from the first one by
> name
>
>  cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19",
> 14726),
>  rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>  rep("chrX", 17498), rep("chrY", 1296))
>  print(system.time(replicate(10, chrs[cc])))
>
> user  system elapsed
> 0.136   0.004   0.141
>
> It's fast.
>
> However, if I get the wrong names for the last two chromosomes (chr23
> and
> chr24 instead of chrX and chrY)
>
>   cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19",
> 14726),
>  rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),

Re: [Bioc-devel] Help understanding an R performance issue

2017-06-30 Thread Bernat Gel
Ok, so it seems more like a bug somewhere than something I failed to
understand, then.


One of the surprises for me is that shuffling the data so the misses do 
not happen one after the other seems to solve the issue...


Thanks,

Bernat

On 06/30/2017 at 11:21 AM, Hervé Pagès wrote:

Hi Bernat, Michael,

FWIW I reported this issue on R-devel a couple of times. Last time was
in 2013:

  https://stat.ethz.ch/pipermail/r-devel/2013-May/066616.html

Cheers,
H.

On 06/29/2017 11:58 PM, Bernat Gel wrote:

Yes, that would explain part of the situation. But example cc5 shows
that hash misses would account only for part of the time.

Thanks for taking a look into it

Bernat

On 06/29/2017 at 08:48 PM, Michael Lawrence wrote:

Preliminary analysis suggests that this is due to hash misses. When
that happens, R ends up doing costly string comparisons that are on
the order of n^2 where 'n' is the length of the subscript. Looking
into it.

On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel  wrote:

Hi all,

This is not strictly a Bioconductor question, but I hope some of the
experts
here can help me understand what's going on with a performance issue
I've
found working on a package.

It has to do with selecting elements from a named vector.

If we have a vector with the names of the chromosomes and their order

 chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
 chrs

 chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13
    1     2     3     4     5     6     7     8     9    10    11    12    13
chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
   14    15    16    17    18    19    20    21    22    23    24

And we have a second vector of chromosomes (in this case, the
chromosomes
from SNP-array probes)
And we want to use the second vector to select from the first one by
name

 cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
         rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
         rep("chrX", 17498), rep("chrY", 1296))
 print(system.time(replicate(10, chrs[cc])))

user  system elapsed
0.136   0.004   0.141

It's fast.

However, if I get the wrong names for the last two chromosomes (chr23
and
chr24 instead of chrX and chrY)

  cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
           rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
           rep("chr23", 17498), rep("chr24", 1296))
  print(system.time(replicate(10, chrs[cc2])))

user  system elapsed
144.672   0.012 144.675


It is MUCH slower. (1000x)


BUT, if I shuffle the elements in the second vector

 cc3 <- sample(cc2, length(cc), replace = FALSE)
 print(system.time(replicate(10, chrs[cc3])))

user  system elapsed
0.096   0.004   0.102

It's fast again!!!



The elapsed time is related to the number of elements BEFORE the failing
names,

 cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))
 print(system.time(replicate(10, chrs[cc4])))

user  system elapsed
17.332   0.004  17.336

 cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
 print(system.time(replicate(10, chrs[cc5])))

user  system elapsed
1.872   0.000   1.901


so my guess is that it might come from moving around the vector in
memory
for each "failed" selection or something similar...

Is it correct? Is there anything I'm missing?

Thanks a lot

Bernat


Re: [Bioc-devel] Bioconductor stats

2017-06-30 Thread Hervé Pagès

Hi Lluís,

As Sean already said, mirrors are not included in the stats. The
monthly nb of distinct IPs is reset every month and the yearly
nb of distinct IPs is reset every year.
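
To see why the monthly and yearly numbers of distinct IPs need not add up, here is a small sketch with a made-up download log. The IPs, months and the `dl` data frame are invented for illustration and have nothing to do with the real Apache logs:

```r
# Toy download log (made-up IPs and months, for illustration only).
dl <- data.frame(
  ip    = c("10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"),
  month = c("2017-01",  "2017-01",  "2017-02",  "2017-02",  "2017-03"),
  stringsAsFactors = FALSE
)

# Distinct IPs counted separately within each month
# (the count starts from scratch every month).
monthly <- tapply(dl$ip, dl$month, function(x) length(unique(x)))

# Distinct IPs counted once over the whole year.
yearly <- length(unique(dl$ip))

monthly       # 2, 2 and 1 distinct IPs for the three months
sum(monthly)  # 5: an IP seen in several months is counted in each of them
yearly        # 3: the same IP is counted only once for the year
```

So the yearly figure is generally smaller than the sum of the monthly figures, since an IP that comes back in a later month is new for that month but not for the year.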

Some packages are indeed in two categories. Category assignment is
based on the download URL only. For some mysterious reason the Apache
logs contain some lines that indicate that AnnotationDbi was downloaded
from a URL that points to a data experiment repository. These lines
are very rare though (1 or 2 per year), so overall they don't have any
significant impact on the stats. Anyway, that's something we'll need
to dig into at some point.
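
As a rough illustration of what assigning a category from the download URL alone could look like, here is a minimal sketch. The path fragments (`/bioc/`, `/data/annotation/`, `/data/experiment/`) follow the layout of the public Bioconductor repositories, but the function `category_from_url`, the example paths and the version numbers are hypothetical, and the actual stats scripts may well work differently:

```r
# Hypothetical helper: derive a stats category from the URL path only,
# without checking which repository the package really belongs to.
category_from_url <- function(url) {
  if (grepl("/data/experiment/", url, fixed = TRUE)) {
    "data-experiment"
  } else if (grepl("/data/annotation/", url, fixed = TRUE)) {
    "data-annotation"
  } else if (grepl("/bioc/", url, fixed = TRUE)) {
    "software"
  } else {
    NA_character_
  }
}

# Made-up request paths: same package file, two different repositories.
urls <- c(
  "/packages/3.5/bioc/src/contrib/AnnotationDbi_1.38.1.tar.gz",
  "/packages/3.5/data/experiment/src/contrib/AnnotationDbi_1.38.1.tar.gz"
)
cats <- vapply(urls, category_from_url, character(1), USE.NAMES = FALSE)
cats
```

Under such a rule, a single odd log line for a software package under a data experiment path is enough to make the package show up in two categories.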

Cheers,
H.


On 06/27/2017 04:34 AM, Lluís Revilla wrote:

Hi,

I have been looking at the stats of Bioconductor, and I would like to know
more about how they are calculated.

Do these stats account for the mirror sites? Are there any stats of the
usage of mirrors?

I found some packages that have downloads in two categories for the same
month. For instance, AnnotationDbi has some downloads as an experimental
data package:
http://bioconductor.org/packages/stats/data-experiment/AnnotationDbi/
while most of the downloads are in the software category (the right one).
It seems that nearly 500 packages have downloads in two categories.

The "Nb of distinct IPs", if I understand correctly, is for each package and
month. So if the same IP downloads the package again, it is listed as a new
IP, isn't it? I assume that if mirrors are counted, either no one downloads
the same packages from different mirrors from the same IP, or this
information is shared across mirrors for these stats.

Regards,

Lluís




--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


Re: [Bioc-devel] Help understanding an R performance issue

2017-06-30 Thread Hervé Pagès

Hi Bernat, Michael,

FWIW I reported this issue on R-devel a couple of times. Last time was
in 2013:

  https://stat.ethz.ch/pipermail/r-devel/2013-May/066616.html

Cheers,
H.

On 06/29/2017 11:58 PM, Bernat Gel wrote:

Yes, that would explain part of the situation. But example cc5 shows
that hash misses would account only for part of the time.

Thanks for taking a look into it

Bernat

On 06/29/2017 at 08:48 PM, Michael Lawrence wrote:

Preliminary analysis suggests that this is due to hash misses. When
that happens, R ends up doing costly string comparisons that are on
the order of n^2 where 'n' is the length of the subscript. Looking
into it.

On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel  wrote:

Hi all,

This is not strictly a Bioconductor question, but I hope some of the
experts
here can help me understand what's going on with a performance issue
I've
found working on a package.

It has to do with selecting elements from a named vector.

If we have a vector with the names of the chromosomes and their order

 chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
 chrs

 chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13
    1     2     3     4     5     6     7     8     9    10    11    12    13
chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
   14    15    16    17    18    19    20    21    22    23    24

And we have a second vector of chromosomes (in this case, the
chromosomes
from SNP-array probes)
And we want to use the second vector to select from the first one by
name

 cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
         rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
         rep("chrX", 17498), rep("chrY", 1296))
 print(system.time(replicate(10, chrs[cc])))

user  system elapsed
0.136   0.004   0.141

It's fast.

However, if I get the wrong names for the last two chromosomes (chr23
and
chr24 instead of chrX and chrY)

  cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
           rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
           rep("chr23", 17498), rep("chr24", 1296))
  print(system.time(replicate(10, chrs[cc2])))

user  system elapsed
144.672   0.012 144.675


It is MUCH slower. (1000x)


BUT, if I shuffle the elements in the second vector

 cc3 <- sample(cc2, length(cc), replace = FALSE)
 print(system.time(replicate(10, chrs[cc3])))

user  system elapsed
0.096   0.004   0.102

It's fast again!!!



The elapsed time is related to the number of elements BEFORE the failing
names,

 cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))
 print(system.time(replicate(10, chrs[cc4])))

user  system elapsed
17.332   0.004  17.336

 cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
 print(system.time(replicate(10, chrs[cc5])))

user  system elapsed
1.872   0.000   1.901


so my guess is that it might come from moving around the vector in
memory
for each "failed" selection or something similar...

Is it correct? Is there anything I'm missing?

Thanks a lot

Bernat


Re: [Bioc-devel] Help understanding an R performance issue

2017-06-30 Thread Bernat Gel
Yes, that would explain part of the situation. But example cc5 shows 
that hash misses would account only for part of the time.


Thanks for taking a look into it

Bernat

On 06/29/2017 at 08:48 PM, Michael Lawrence wrote:

Preliminary analysis suggests that this is due to hash misses. When
that happens, R ends up doing costly string comparisons that are on
the order of n^2 where 'n' is the length of the subscript. Looking
into it.
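
One rough way to check for this kind of quadratic behaviour is to time subscripts made only of misses at increasing lengths: if the cost is of order n^2, doubling n should roughly quadruple the elapsed time. The sizes and the helper `time_misses` below are arbitrary choices for illustration, and on R versions where the issue has been addressed the growth may look closer to linear:

```r
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))

# Hypothetical helper: elapsed time for a subscript of n missing names.
time_misses <- function(n) {
  idx <- rep("chr23", n)   # "chr23" is not a name in 'chrs'
  system.time(chrs[idx])[["elapsed"]]
}

# Arbitrary sizes, each double the previous one.
sizes <- c(5000, 10000, 20000)
timings <- vapply(sizes, time_misses, numeric(1))
data.frame(n = sizes, elapsed = timings)
```

Single runs at these sizes can be noisy, so wrapping the subscript in `replicate()` as in the original post gives more stable numbers.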

On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel  wrote:

Hi all,

This is not strictly a Bioconductor question, but I hope some of the experts
here can help me understand what's going on with a performance issue I've
found working on a package.

It has to do with selecting elements from a named vector.

If we have a vector with the names of the chromosomes and their order

 chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
 chrs

 chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13
    1     2     3     4     5     6     7     8     9    10    11    12    13
chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
   14    15    16    17    18    19    20    21    22    23    24

And we have a second vector of chromosomes (in this case, the chromosomes
from SNP-array probes)
And we want to use the second vector to select from the first one by name

 cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
 rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
 rep("chrX", 17498), rep("chrY", 1296))
 print(system.time(replicate(10, chrs[cc])))

user  system elapsed
0.136   0.004   0.141

It's fast.

However, if I get the wrong names for the last two chromosomes (chr23 and
chr24 instead of chrX and chrY)

  cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
 rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
 rep("chr23", 17498), rep("chr24", 1296))
  print(system.time(replicate(10, chrs[cc2])))

user  system elapsed
144.672   0.012 144.675


It is MUCH slower. (1000x)


BUT, if I shuffle the elements in the second vector

 cc3 <- sample(cc2, length(cc), replace = FALSE)
 print(system.time(replicate(10, chrs[cc3])))

user  system elapsed
0.096   0.004   0.102

It's fast again!!!



The elapsed time is related to the number of elements BEFORE the failing
names,

 cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))
 print(system.time(replicate(10, chrs[cc4])))

user  system elapsed
17.332   0.004  17.336

 cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
 print(system.time(replicate(10, chrs[cc5])))

user  system elapsed
1.872   0.000   1.901


so my guess is that it might come from moving around the vector in memory
for each "failed" selection or something similar...

Is it correct? Is there anything I'm missing?

Thanks a lot

Bernat
