This has been a very interesting and illuminating dialogue but it really should be moved to support.bioconductor.org so that a record is available to the general user community.
The solution here > library(Homo.sapiens) Error in library(Homo.sapiens) : there is no package called ‘Homo.sapiens’ is BiocManager::install("Homo.sapiens") On Sun, Mar 8, 2020 at 6:22 AM Mulder, R <r.mulde...@umcg.nl> wrote: > Dear Tim, > > > Thanks again for your (quick) reply. > > > I just ran it on my computer and it gave me some results but also several > errors. > > > The first part went perfect > > > > BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile)) > > names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."), > `[`, 1) > > show(BamFiles) > BamFileList of length 2 > names(2): here were my 2 files named without bam extension > > However the second part did give results but also several errors > > > library(Homo.sapiens) > Error in library(Homo.sapiens) : > there is no package called ‘Homo.sapiens’ > > gxs <- genes(Homo.sapiens) > Error in genes(Homo.sapiens) : could not find function "genes" > > names(gxs) <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") > Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") : > could not find function "mapIds" > > SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ] > Error: object 'gxs' not found > > > > sbparam <- ScanBamParam(which=SomeGenes) > Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found > > puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5, > + include_insertions=TRUE, min_minor_allele_depth=1) > > piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam, > PileupParam=puparam) > > > > filterInvariant <- function(x) { > + positions <- subset(x, duplicated(pos))$pos > + subset(x, pos %in% positions) > + } > > show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2))) > seqnames pos strand nucleotide count > myfile1 1 36931671 + A 251 > myfile1 1 36931671 + G 2 > myfile2 1 36931673 + - 1 > myfile2 1 36931673 + A 253 > > > Error in library(Homo.sapiens) : > there is no package called ‘Homo.sapiens’ > > For this error I installed the package BSgenome.Hsapiens.UCSC.hg19 and ran > it again, but the errors at the next operations remained. > > > gxs <- genes(Homo.sapiens) > Error in genes(Homo.sapiens) : could not find function "genes" > > Renaming Homo.sapiens as BSgenome.Hsapiens.UCSC.hg19 did not work. Should > I look up how this genome is called? > > > > names(gxs) <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") > Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") : > could not find function "mapIds" > > Where can I find this function? > > > SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ] > Error: object 'gxs' not found > > Does this have to do with the fact that IGF2 is not included in my panel? > > > sbparam <- ScanBamParam(which=SomeGenes) > Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found > > As a result of the previous error, right? > > Interestingly enough, I did get results. Why? > > > > ________________________________ > Van: Tim Triche, Jr. <tim.tri...@gmail.com> > Verzonden: zaterdag 7 maart 2020 22:28 > Aan: Mulder, R > CC: bioc-devel@r-project.org > Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using > NGS data in r > > something like this perhaps? > > library(Rsamtools) > BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile)) > names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."), > `[`, 1) > show(BamFiles) > #BamFileList of length 14 > #names(14): Normal_0813_S19_R1_001 Normal_2013_S18_R1_001 ... REMC_CD19 > REMC_CD3 > > library(Homo.sapiens) > gxs <- genes(Homo.sapiens) > names(gxs) <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") > SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ] > > sbparam <- ScanBamParam(which=SomeGenes) > puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5, > include_insertions=TRUE, min_minor_allele_depth=1) > piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam, > PileupParam=puparam) > > filterInvariant <- function(x) { > positions <- subset(x, duplicated(pos))$pos > subset(x, pos %in% positions) > } > show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2))) > # seqnames pos strand nucleotide count > # Normal_0813_S19_R1_001.43 chr11 1979973 - C 1 > # Normal_0813_S19_R1_001.44 chr11 1979973 + T 1 > # Normal_2013_S18_R1_001.16 chr11 1979889 + T 3 > # Normal_2013_S18_R1_001.17 chr11 1979889 - T 1 > # Normal_2714_S20_R1_001.4 chr11 1979933 + A 1 > # Normal_2714_S20_R1_001.5 chr11 1979933 - A 1 > # P01-00020-T06-P-cfDNA.21 chr11 1980045 + G 1 > # P01-00020-T06-P-cfDNA.22 chr11 1980045 - G 1 > # P01-00020-T06-P-cfDNA2.11 chr11 1979884 + T 1 > # P01-00020-T06-P-cfDNA2.12 chr11 1979884 - T 1 > # P01-00021-T06-P-cfDNA.5 chr11 1979879 + C 1 > # P01-00021-T06-P-cfDNA.6 chr11 1979879 - C 1 > # P01-00024-T06-P-cfDNA.9 chr11 1979990 + C 1 > # P01-00024-T06-P-cfDNA.10 chr11 1979990 + T 2 > # P01-00028-T06-N-P01.3 chr11 1979851 + A 1 > # P01-00028-T06-N-P01.4 chr11 1979851 + G 1 > # P01-00028-T06-P-cfDNA.10 chr11 1979948 + A 1 > # P01-00028-T06-P-cfDNA.11 chr11 1979948 + G 1 > # P01-00028-T06-T-P01.2 chr11 1979850 + C 3 > # P01-00028-T06-T-P01.3 chr11 1979850 - C 2 > # P01-00028-T06-T-P02.3 chr11 1979851 + A 1 > # P01-00028-T06-T-P02.4 chr11 1979851 + C 1 > > > Or whatever you like (GAlignments would get you the reads, GAlignmentPairs > the read pairs, and so on) > > > > --t > > > On Sat, Mar 7, 2020 at 3:51 PM Mulder, R <r.mulde...@umcg.nl<mailto: > r.mulde...@umcg.nl>> wrote: > > Dear Tim, > > > I installed Rsamtools and ran it on a bamfile using the following command > to get nucleotide count for 2 hotspot regions. > > > library(Rsamtools) > bamfile <- "test.bam" > bf <- BamFile(bamfile) > param <- ScanBamParam(which=GRanges(c("4", "9"), IRanges(start=c(55599321, > 5073770), end=c(55599321, 5073770)))) > > Now, I want to perform this on more than 1 bam file at once and on more > region. > > Do you perhaps have ideas on how I could do this? > > Best, > > Rene > > ________________________________ > Van: Tim Triche, Jr. <tim.tri...@gmail.com<mailto:tim.tri...@gmail.com>> > Verzonden: vrijdag 6 maart 2020 13:57:10 > Aan: Mulder, R > CC: bioc-devel@r-project.org<mailto:bioc-devel@r-project.org> > Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using > NGS data in r > > Oh hey, one last thing — if all you want is to get nucleotide counts per > region of interest, just use pileup() in Rsamtools, with bamWhich(GRanges) > holding a GRanges (Genomic Ranges) of your regions added to scanBamParams > for each BAM. It sounds awkward but in practice it is super fast and will > give you all the nucleotide and read level information you could want. One > of my interns implemented this for mitochondrial variant calling in > MTseeker when we got sick of using gmapR and being flagged for errors on > not-Linux. (We gutted the entire package recently and have new, insanely > deep examples from Oxford Nanopore direct RNA sequencing and from large > single cell datasets; I need to add those and get the package back out of > purgatory). > > That said, in the end you will want a LOT of validation material so this > is very much just a starting point. But still, it’s your starting point, in > R at least. And truthfully I much prefer R/Bioconductor idioms to (say) > pysam or the like. htsnim is nice but then you’ll be implementing the ML > bits from scratch, so I think your instincts to try R first are sensible. > > Good luck! Even if you use this for something else besides MRD, I think it > will become a useful exercise. > > --t > > On Mar 5, 2020, at 4:36 PM, Tim Triche, Jr. <tim.tri...@gmail.com<mailto: > tim.tri...@gmail.com>> wrote: > > > a few thoughts: > > 1) look into Shearwater ( > https://bioconductor.org/packages/release/bioc/html/deepSNV.html< > https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7C5dfaa267d461487c892408d7c2de972a%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192133587503301&sdata=3yQeLWUNk%2FLjDAOquRQf6wgbGvn06KZGUUzhIw%2BBUMk%3D&reserved=0>), > then > > 2) talk to Todd Druley @ WashU, Elli Pappaemanuil @ MSKCC, Ruud & Bob @ > Erasmus, the usual suspects > > 3) plan to validate w/ddPCR (at the absolute very least) and be aware that > most MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR > (lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto) > > not saying that ML-based methods might not help, but if you've got a > 30x-100x genome (or even 1000x FM1) and are trying to compete with existing > standard approaches that can detect molecules at 1e-6, it'll be rough. An > alternative approach (that has been used repeatedly) is to throw caution to > the wind, generate primers for numerous subject-specific somatic variants, > and use the ensemble to try and model MRD (speaking of ML). On the one > hand, that could give the clinic a "customer for life"; on the other hand, > it's not conducive to large-scale automation & deployment. As far as I > know, it never got much traction, in leukemia or anywhere else. (Consider > that flow cytometry is capable of detecting 1-in-10K to 1-in-a-million > cells in most clinical flow labs.) > > Best of luck! (and if you're not already working with UMI-tagged reads, > please talk to the people in #2 above; the reason most people won't go > below 5% VAF is that you get thwacked by error rates at that level, and the > reason most NGS-based MRD is based on UMIs is that existing PCR-based > methods have 6 logs sensitivity.) > > --t > > > On Thu, Mar 5, 2020 at 4:08 PM Mulder, R <r.mulde...@umcg.nl<mailto: > r.mulde...@umcg.nl>> wrote: > Hi, > > > I was wondering if anyone could help me with a script and support for the > above mentioned goal. > > For this I have several BAM files for which I want to determine de > nucleotide count per region of interest. The latter could be several > hotspot mutation sites. I would like to get an overall overview of all the > BAM files. Next I want to use these counts to determine for any new BAM > file if the count for a particular genomic position is higher than the > allowable range, hence could indicate if a mutation is present. For this I > would like to use the modified Thompson Tau test. I think machine learning > could be used for this. So, why do I want to do all this? Well, normal NGS > pipelines only deal with variants at a frequency of 5%. Mutatios below this > frequency are often missed. To know if a mutation is present below this > level, you showed dive into the alignment and most often manually > investigate the base calls. I know that this races some questions regarding > call qualities, but then again our conventional assays have actually > confirmed some of these low mutations. In addition, NGS can > be used to determine the presence of low frequent mutation which is of > great importance for determining the measurable residual disease after > treatment. > > > I am new to r and bioconductor so I would be very thankful if someone > could help me in my mission to setting up a script for this purpose. > > > Thanks, > > > Rene Mulder > > Laboratory Medicine > > University Medical Center Groningen > > The Netherlands > > ________________________________ > De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}} > > _______________________________________________ > Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org> mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel< > https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7C5dfaa267d461487c892408d7c2de972a%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192133587503301&sdata=jDiihphF1FySPCdvLtXHMsmlkQ7ta6Fo0AjBvMa5VIA%3D&reserved=0 > > > ________________________________ > De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de > geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken > van dit bericht, het niet openbaar maken of op enige wijze verspreiden of > vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een > incomplete aankomst of vertraging van dit verzonden bericht. > > The contents of this message are confidential and only intended for the > eyes of the addressee(s). Others than the addressee(s) are not allowed to > use this message, to make it public or to distribute or multiply this > message in any way. The UMCG cannot be held responsible for incomplete > reception or delay of this transferred message. > ________________________________ > De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de > geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken > van dit bericht, het niet openbaar maken of op enige wijze verspreiden of > vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een > incomplete aankomst of vertraging van dit verzonden bericht. > > The contents of this message are confidential and only intended for the > eyes of the addressee(s). Others than the addressee(s) are not allowed to > use this message, to make it public or to distribute or multiply this > message in any way. The UMCG cannot be held responsible for incomplete > reception or delay of this transferred message. > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > -- The information in this e-mail is intended only for the ...{{dropped:18}} _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel