If the chromosome name depends on the assembly, that makes GenomeInfoDb even more useful and necessary, provided it is supported, of course.
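
To make the assembly dependence concrete, here is a rough base-R sketch of what an assembly-aware lookup could look like. It is only an illustration: `refseq_map` and `refseq_to_ucsc()` are hypothetical names, not part of GenomeInfoDb, and the table holds just the accessions mentioned in this thread plus the GRCh37 accession for chromosome 1.

```r
# Hypothetical lookup table (not from GenomeInfoDb): one row per
# (assembly, chromosome) pair, because the RefSeq accession version
# changes between assemblies.
refseq_map <- data.frame(
  assembly = c("GRCh37", "GRCh37", "GRCh38", "GRCh38"),
  RefSeq   = c("NC_000001.10", "NC_000017.10", "NC_000001.11", "NC_000017.11"),
  UCSC     = c("chr1", "chr17", "chr1", "chr17"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: translate RefSeq accessions to UCSC-style names
# for a single assembly. Accessions unknown to that assembly map to NA.
refseq_to_ucsc <- function(accessions, assembly) {
  tab <- refseq_map[refseq_map$assembly == assembly, ]
  tab$UCSC[match(accessions, tab$RefSeq)]
}

# The same accession resolves under GRCh38 but not under GRCh37.
refseq_to_ucsc(c("NC_000017.11", "NC_000001.11"), "GRCh38")
refseq_to_ucsc("NC_000017.11", "GRCh37")
```

The point of keying on the assembly is that a single extra column in genomeStyles()$Homo_sapiens could hold only one of the two accession versions per chromosome.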

On Fri, Dec 13, 2019 at 11:45 AM Vincent Carey <st...@channing.harvard.edu> wrote:

> I tried an inline png but I think it was rejected by bioc-devel. Here's
> another try.
>
> On Fri, Dec 13, 2019 at 11:40 AM Vincent Carey <st...@channing.harvard.edu>
> wrote:
>
> > Thanks -- It is good to know more about the complications of adding
> > seqlevelsStyle elements.
> > I am not sure how pervasive this will be in SNP annotation in the future.
> > The "new API" for dbSNP references SPDI annotation conventions.
> >
> > https://api.ncbi.nlm.nih.gov/variation/v0/
> >
> > At least one dbSNP build 152 resource uses this nomenclature. The one
> > referenced below is the "go-to" resource for current rsid-coordinate
> > correspondence, as far as I know.
> >
> > > library(VariantAnnotation)
> > 0/0 packages newly attached/loaded, see sessionInfo() for details.
> > > mypar = GRanges("NC_000001.11", IRanges(100000,120000)) # note seqnames
> > > nn = readVcf("ftp://ftp.ncbi.nih.gov/snp/redesign/latest_release/VCF/GCF_000001405.38.gz",
> > +   genome="GRCh38", param=mypar)
> > > head(rowRanges(nn), 3)
> > GRanges object with 3 ranges and 5 metadata columns:
> >                    seqnames    ranges strand | paramRangeID            REF
> >                       <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
> >   rs1331956057 NC_000001.11    100000      * |         <NA>              C
> >   rs1252351580 NC_000001.11    100036      * |         <NA>              T
> >   rs1238523913 NC_000001.11    100051      * |         <NA>              T
> >                               ALT      QUAL      FILTER
> >                <DNAStringSetList> <numeric> <character>
> >   rs1331956057                  T      <NA>           .
> >   rs1252351580                  G      <NA>           .
> >   rs1238523913                  C      <NA>           .
> >   -------
> >   seqinfo: 1 sequence from GRCh38 genome; no seqlengths
> >
> > On Fri, Dec 13, 2019 at 11:01 AM Robert Castelo <robert.cast...@upf.edu>
> > wrote:
> >
> >> hi Hervé,
> >>
> >> i didn't know about this new sequence style until Vince posted his
> >> message and we briefly talked about it at the European BioC meeting this
> >> week in Brussels. however, i didn't know that the style was specific to
> >> a particular assembly. i have no use case for this at the moment,
> >> i.e., i have not encountered myself any annotation or BAM file with
> >> chromosome names written that way, so i don't know how pressing this
> >> issue is. maybe Vince can tell us how widespread such a chromosome naming
> >> style may become in the near future.
> >>
> >> naively, i'd think that it would be a matter of adding a
> >> reference-specific column, i.e., 'GRCh38.p13', 'GRCh37.p13', etc., but i
> >> can imagine that maybe the "reference style" concept might not be the
> >> appropriate placeholder to map all different chromosome names of all
> >> different individual human genomes uploaded to NCBI. maybe we should
> >> wait until we have a specific use case .. Vince?
> >>
> >> robert.
> >>
> >> On 12/11/19 10:06 PM, Pages, Herve wrote:
> >> > Hi Vince, Robert,
> >> >
> >> > Looks like Vince wants the RefSeq accession, e.g. NC_000017.11 for
> >> > chrom 17 in GRCh38.
> >> >
> >> > @Robert: Is this what you're also interested in?
> >> >
> >> > The problem is that the RefSeq accessions are specific to a particular
> >> > assembly (e.g. NC_000017.11 for chrom 17 in GRCh38 but NC_000017.10 for
> >> > the same chrom in GRCh37).
> >> >
> >> > Currently seqlevelsStyle() doesn't know how to distinguish between
> >> > different assemblies of the same organism. Not saying it couldn't, but it
> >> > would require some thinking and some significant refactoring. It
> >> > wouldn't be just a matter of adding a column to
> >> > genomeStyles()$Homo_sapiens.
> >> >
> >> > H.
> >> >
> >> > On 12/10/19 14:19, Robert Castelo wrote:
> >> >> I second this, and would suggest to name the style as 'GRC' for "Genome
> >> >> Reference Consortium".
> >> >>
> >> >> thanks Vince for bringing this up, being able to easily switch between
> >> >> genome styles is great.
> >> >>
> >> >> if 'paste0()' in R is one of the most influential contributions to
> >> >> statistical computing
> >> >>
> >> >> https://simplystatistics.org/2013/01/31/paste0-is-statistical-computings-most-influential-contribution-of-the-21st-century
> >> >>
> >> >> i think that 'seqlevelsStyle()' from the GenomeInfoDb package is one of
> >> >> the most influential contributions to human genetics, if you think about
> >> >> the time invested by researchers in parsing and changing between
> >> >> different styles of chromosome names :)
> >> >>
> >> >> robert.
> >> >>
> >> >> On 06/12/2019 15:03, Vincent Carey wrote:
> >> >>> I raised this issue previously with little response.
> >> >>>
> >> >>> I'd propose that we add a column or two to genomeStyles()$Homo_sapiens
> >> >>>
> >> >>> > head(genomeStyles()$Homo_sapiens, 2)
> >> >>>   circular auto   sex NCBI UCSC dbSNP Ensembl
> >> >>> 1    FALSE TRUE FALSE    1 chr1   ch1       1
> >> >>> 2    FALSE TRUE FALSE    2 chr2   ch2       2
> >> >>>
> >> >>> that includes the values for "NCBI reference sequence names"
> >> >>>
> >> >>> See
> >> >>> https://www.ncbi.nlm.nih.gov/nuccore/568815581
> >> >>> for one report on chr17, and
> >> >>> https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39
> >> >>> for a table that includes the GenBank labels.
> >> >>>
> >> >>> Should I just file a PR at
> >> >>> https://github.com/Bioconductor/GenomeInfoDb/
> >> >>> after testing?
> >> >>>
> >> >>
> >> >> _______________________________________________
> >> >> Bioc-devel@r-project.org mailing list
> >> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >> >>
> >> >
> >>
> >> --
> >> Robert Castelo, PhD
> >> Associate Professor
> >> Dept. of Experimental and Health Sciences
> >> Universitat Pompeu Fabra (UPF)
> >> Barcelona Biomedical Research Park (PRBB)
> >> Dr Aiguader 88
> >> E-08003 Barcelona, Spain
> >> telf: +34.933.160.514
> >> fax: +34.933.160.550
> >>
> >

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel