Hi,
I retrieved the exon records, and here are the exon IDs that appear more than
once, with different descriptions:
TSS[duplicated(TSS[,1]), 1]
[1] "AT1G68552.1-E12203" "AT1G64140.1-E14755" "AT1G64140.1-E14756"
"AT1G70780.1-E4116"
[5] "AT1G75390.1-E22428" "AT1G06149.1-E1988" "AT1G36730.1-E35050"
"AT1G36730.1-E35051"
[9] "AT1G29952.1-E5728" "AT1G29952.1-E5730" "AT1G29952.1-E5732"
"AT1G29970.2-E8863"
[13] "AT1G29970.2-E8864" "AT1G64628.1-E10574" "AT1G25470.1-E20679"
"AT1G58120.1-E18468"
[17] "AT1G29041.1-E15117" "AT1G23149.1-E13728" "AT1G29952.1-E5728"
"AT1G29952.1-E5732"
[21] "AT2G18162.1-E49029" "AT3G51632.1-E98183" "AT3G22970.1-E89708"
"AT3G45240.2-E86808"
[25] "AT3G18000.1-E98438" "AT3G59052.1-E77046" "AT3G62422.1-E76351"
"AT3G25570.1-E88575"
[29] "AT3G25570.1-E88576" "AT3G10910.1-E77164" "AT3G02468.1-E88931"
"AT3G12010.1-E78704"
[33] "AT3G01470.1-E92685" "AT3G53402.1-E93478" "AT3G26430.1-E85151"
"AT3G26430.1-E85154"
[37] "AT4G19110.1-E121565" "AT4G22592.1-E113550" "AT4G22592.1-E113551"
"AT4G22592.1-E113552"
[41] "AT4G12430.1-E113931" "AT4G12430.1-E113932" "AT4G12430.1-E113933"
"AT4G25670.1-E111076"
[45] "AT4G25670.1-E111077" "AT4G36990.1-E122859" "AT4G14620.1-E120308"
"AT4G34590.1-E116802"
[49] "AT5G09460.1-E136355" "AT5G09460.1-E136357" "AT5G50010.1-E151574"
"AT5G50010.1-E151576"
[53] "AT5G50010.1-E151574" "AT5G50011.1-E153108" "AT5G50011.1-E153110"
"AT5G09460.1-E136355"
[57] "AT5G09463.1-E151757" "AT5G09463.1-E151758" "AT5G52552.1-E136887"
"AT5G52552.1-E136888"
[61] "AT5G41992.1-E154552" "AT5G64341.1-E144370" "AT5G64341.1-E144371"
"AT5G64341.1-E144373"
[65] "AT5G64341.1-E144370" "AT5G64341.1-E144371" "AT5G64343.1-E148873"
"AT5G64341.1-E144373"
[69] "AT5G09460.1-E136355" "AT5G09463.1-E151757" "AT5G09460.1-E136357"
"AT5G09463.1-E151758"
[73] "AT5G49448.1-E171824" "AT5G05282.1-E152619" "AT5G53588.1-E159453"
"AT5G09670.2-E157563"
[77] "AT5G01710.1-E140929" "AT5G64341.1-E144370" "AT5G64343.1-E148873"
"AT5G61230.1-E153842"
[81] "AT5G61230.1-E153843" "AT5G60550.1-E140873" "AT5G64552.1-E148753"
"AT5G64552.1-E148754"
[85] "AT5G45430.1-E151338"
For example,
TSS[TSS[,1]=="AT1G68552.1-E12203",]
ensembl_exon_id chromosome_name exon_chrom_start exon_chrom_end strand
3125 AT1G68552.1-E12203 1 25727627 25727701 -1
15537 AT1G68552.1-E12203 1 25727627 25727701 -1
description
3125 CPuORF53 (Conserved peptide upstream open reading frame 53); Upstream
open reading frames (uORFs) are small open reading frames found in the 5' UTR
of a mature mRNA, and can potentially mediate translational regulation of the
largest, or major, ORF (mORF). CPuORF53 represents a conserved upstream opening
reading frame relative to major ORF AT1G68550.1
15537 AP2 domain-containing transcription factor, putative;
encodes a member of the ERF (ethylene response factor) subfamily B-6 of ERF/AP2
transcription factor family. The protein contains one AP2 domain. There are 12
members in this subfamily including RAP2.11.
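To check how many IDs are affected in the same way, something along these lines might help. It is only a sketch in base R on a toy stand-in for the TSS table: the third row and the shortened descriptions are made up for illustration, and the column names are taken from the output above.

```r
# Toy stand-in for the TSS data frame (third row is hypothetical,
# descriptions are shortened; only the column names follow the real table).
TSS <- data.frame(
  ensembl_exon_id  = c("AT1G68552.1-E12203", "AT1G68552.1-E12203",
                       "AT0G00000.1-E0"),
  chromosome_name  = c("1", "1", "1"),
  exon_chrom_start = c(25727627, 25727627, 100),
  exon_chrom_end   = c(25727701, 25727701, 200),
  strand           = c(-1, -1, 1),
  description      = c("CPuORF53 (shortened)",
                       "AP2 domain-containing transcription factor (shortened)",
                       "hypothetical unique exon"),
  stringsAsFactors = FALSE
)

ids <- TSS$ensembl_exon_id

# duplicated() alone skips the first copy of each ID; combining it with
# fromLast = TRUE returns every copy.
dup_rows <- TSS[duplicated(ids) | duplicated(ids, fromLast = TRUE), ]

# Count distinct descriptions per duplicated ID and keep the IDs
# whose copies disagree.
n_desc <- tapply(dup_rows$description, dup_rows$ensembl_exon_id,
                 function(x) length(unique(x)))
conflicting <- names(n_desc)[n_desc > 1]
print(conflicting)
```

IDs that end up in conflicting are the ones that need a manual decision; duplicated IDs whose copies agree in every column could simply be collapsed with unique().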
So I think the database contains errors: the same exon ID and coordinates
appear with two unrelated descriptions. Deciding which row to keep will require
manual curation. Have you contacted Ensembl about this?
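As a stopgap before any curation, the IDs could be made unique in a local copy of the table, since row names must be unique. The sketch below uses base R only and a toy table (the second ID and all descriptions are placeholders); it is not what getAnnotation does internally, and option 1 keeps whichever row happens to come first, which may be the wrong one.

```r
# Toy table with one duplicated exon ID (placeholder values).
TSS <- data.frame(
  ensembl_exon_id = c("AT1G68552.1-E12203", "AT1G68552.1-E12203",
                      "AT0G00000.1-E0"),
  description     = c("first description", "second description",
                      "hypothetical unique exon"),
  stringsAsFactors = FALSE
)

# Option 1: keep only the first row for each ID (arbitrary choice).
TSS_first <- TSS[!duplicated(TSS$ensembl_exon_id), ]
rownames(TSS_first) <- TSS_first$ensembl_exon_id

# Option 2: keep every row and let make.unique() append ".1", ".2", ...
# to the repeated IDs so that they can serve as row names.
TSS_all <- TSS
rownames(TSS_all) <- make.unique(TSS_all$ensembl_exon_id)

print(rownames(TSS_first))
print(rownames(TSS_all))
```

Option 2 loses nothing, but the suffixed names no longer match the Ensembl IDs exactly, so the original ensembl_exon_id column should be kept alongside them.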
Thanks!
Best regards,
Julie
*******************************************
Lihua Julie Zhu, Ph.D
Research Associate Professor
Program Gene Function and Expression
University of Massachusetts Medical School
364 Plantation Street, Room 613
Worcester, MA 01605
508-856-5256
http://www.umassmed.edu/pgfe/faculty/zhu.cfm
*******************************************
On 3/5/10 6:46 PM, "[email protected]" <[email protected]> wrote:
Dear bioc-sig-sequencing,
I would like to annotate ChIP-seq peaks for the Arabidopsis genome. "TSS" and
"Exon" are two of the values accepted by the featureType argument of the
'getAnnotation' function. The call with featureType="TSS" succeeded, but the
call with featureType="Exon" failed.
...
> arabdset<-useMart(biomart="plant_mart_4", dataset = "athaliana_eg_gene")
Checking attributes ... ok
Checking filters ... ok
> ExonArabAnno<-getAnnotation(arabdset, featureType="Exon")
Error in `rownames<-`(`*tmp*`, value = c("ATCG00010.1-E176369",
"ATMG00010.1-E176520", :
duplicate rownames not allowed
> sessionInfo()
R version 2.11.0 Under development (unstable) (2010-02-28 r51186)
x86_64-unknown-linux-gnu
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ChIPpeakAnno_1.3.4 org.Hs.eg.db_2.3.6
[3] GO.db_2.3.5 RSQLite_0.8-3
[5] DBI_0.2-5 AnnotationDbi_1.9.4
[7] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.15.11
[9] Biostrings_2.15.22 IRanges_1.5.51
[11] multtest_2.3.0 Biobase_2.7.4
[13] biomaRt_2.3.4
loaded via a namespace (and not attached):
[1] MASS_7.3-5 RCurl_1.3-1 splines_2.11.0 survival_2.35-8
[5] tools_2.11.0 XML_2.6-0
>
Can someone comment?
Thanks,
P. Terry
[email protected]
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing