Hi Michael, tx_by_gene <- transcriptsBy(txdb, by='gene') returns a named list (or more precisely, a named GRangesList object) where the names are the gene ids. The show() method for these objects actually gives a clue that the list carries names:
> tx_by_gene GRangesList object of length 24467: $`100009600` GRanges object with 2 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr9 21062393-21067093 - | 74163 ENSMUST00000115494.2 [2] chr9 21062400-21073096 - | 74165 ENSMUST00000213826.1 ------- seqinfo: 66 sequences (1 circular) from mm10 genome $`100009609` GRanges object with 4 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr7 84935565-84964115 - | 60602 ENSMUST00000044583.9 [2] chr7 84935565-84964115 - | 60603 ENSMUST00000232823.1 [3] chr7 84935565-84964115 - | 60604 ENSMUST00000233024.1 [4] chr7 84935565-84964115 - | 60605 ENSMUST00000233469.1 ------- seqinfo: 66 sequences (1 circular) from mm10 genome $`100009614` GRanges object with 1 range and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr10 77711457-77712009 + | 79323 ENSMUST00000075421.6 ------- seqinfo: 66 sequences (1 circular) from mm10 genome ... <24464 more elements> Note the difference with: > unname(tx_by_gene) GRangesList object of length 24467: [[1]] GRanges object with 2 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr9 21062393-21067093 - | 74163 ENSMUST00000115494.2 [2] chr9 21062400-21073096 - | 74165 ENSMUST00000213826.1 ------- seqinfo: 66 sequences (1 circular) from mm10 genome [[2]] GRanges object with 4 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr7 84935565-84964115 - | 60602 ENSMUST00000044583.9 [2] chr7 84935565-84964115 - | 60603 ENSMUST00000232823.1 [3] chr7 84935565-84964115 - | 60604 ENSMUST00000233024.1 [4] chr7 84935565-84964115 - | 60605 ENSMUST00000233469.1 ------- seqinfo: 66 sequences (1 circular) from mm10 genome [[3]] GRanges object with 1 range and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr10 77711457-77712009 + | 79323 ENSMUST00000075421.6 ------- seqinfo: 66 sequences (1 circular) from mm10 genome ... <24464 more elements> This is similar to what happens with a named list vs unnamed list in base R: > list(`44`="A", `5`="B") $`44` [1] "A" $`5` [1] "B" > unname(list(`44`="A", `5`="B")) [[1]] [1] "A" [[2]] [1] "B" Like with any other named list in R, you should always use a subscript of type character when you lookup some particular genes with tx_by_gene[my_favorite_genes] or tx_by_gene[[my_favorite_gene]]. As you've learned, using a numeric subscript will get you the wrong answer because then the subsetting is positional. I don't know what type the Ensembl people use to store the ensemblgene_id in their db but note that the UCSC people use a character type: http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=765682895_SNRJX5rTEkoCtJclt55UGpncPIka&hgta_doSchemaDb=mm10&hgta_doSchemaTable=knownToLocusLink Also TxDb objects always use a character type to store external gene ids like Ensembl ids. H. On 10/3/19 03:45, Michael Shapiro wrote: > > Apologies for a previous email that seems content free. > > I've run into a cosmic mis-match between biomaRt and TxDb which is either a > bug or a bug waiting to happen. In brief, biomaRt reports entrezgene_id as a > numeric, but TxDb wants it as a character. What's deadly in this is that > TxDb doesn't fail from being supplied with the numeric, it simply accesses > the wrong gene. Here is a minimal example where I am trying to get from gene > name (Kcnj12) to gene location: > > ## Resolve the gene name: > ensembl = useMart('ensembl', dataset='mmusculus_gene_ensembl') > geneNames=getBM(c('entrezgene_id', 'external_gene_name'), mart= ensembl) > idx = geneNames$external_gene_name == 'Kcnj12' > entrezGeneId = geneNames$entrezgene_id[idx] > > ## Get gene locations: > txdb = TxDb.Mmusculus.UCSC.mm10.knownGene > tbg = transcriptsBy(txdb,by='gene') > > ## Shoot self in foot: > WRONG_LOCATION = tbg[[entrezGeneId]] > > ## Get email from biologist pointing out you've got the wrong gene: > ACTUAL_LOCATION = tbg[[as.character(entrezGeneId)]] > > I would argue that if entrezgene_id is used in some places as a numeric and > others as a character, it's safer if biomaRt returns it as a character. If > your code is wrong, you want it to fail, not quietly mis-perform. A vector > or list will always let you access it using a numeric even when this is > wrong. You will probably get an error if you try to access something with a > character when you should be using a numeric. > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=xxmO-W0aZN8W5TRjjzH4Ypv7FA6B4TkKXJaIJJ1_TVs&s=Ao9Zb1iHPmFD8QMfPu9sZIDSj-oYy5_MqLo7VRUP2ao&e= > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel