Hi,

The set of RefSeqs for which 100% of the sequence aligned is quite small, because the mRNAs include a polyA tail. If you want to select a subset of RefSeq sequences based on alignment details, the information is contained in the 'refSeqAli' table. The table is in PSL format, described here: http://genome.ucsc.edu/FAQ/FAQformat#format2 .
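As a rough illustration of selecting on alignment details, here is a minimal Python sketch that reads PSL rows and keeps the RefSeqs whose aligned fraction meets a threshold. It is only a sketch under stated assumptions: the function names and the 0.95 default cutoff are hypothetical, and it assumes that a downloaded table dump may carry one extra leading 'bin' column ahead of the 21 standard PSL fields. A cutoff below 1.0 leaves room for the unaligned polyA tail.

```python
from collections import namedtuple

# Minimal view of one PSL alignment; field positions follow the 21-column PSL layout.
PslHit = namedtuple("PslHit", "q_name q_size t_name aligned_bases")


def parse_psl_line(line):
    """Parse one tab-separated PSL row into a PslHit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 22:
        # Assumption: database table dumps prepend a 'bin' column; drop it.
        fields = fields[1:]
    if len(fields) != 21:
        raise ValueError("expected 21 PSL fields, got %d" % len(fields))
    # blockSizes is a comma-separated list with a trailing comma.
    block_sizes = [int(x) for x in fields[18].split(",") if x]
    return PslHit(
        q_name=fields[9],        # qName: the RefSeq accession
        q_size=int(fields[10]),  # qSize: full mRNA length
        t_name=fields[13],       # tName: chromosome
        aligned_bases=sum(block_sizes),
    )


def query_coverage(hit):
    """Fraction of the mRNA sequence covered by aligned blocks."""
    return hit.aligned_bases / hit.q_size


def select_by_coverage(lines, min_coverage=0.95):
    """Return accessions whose alignment covers at least min_coverage.

    min_coverage=0.95 is a hypothetical default, not a UCSC setting.
    """
    selected = []
    for line in lines:
        if not line.strip():
            continue
        hit = parse_psl_line(line)
        if query_coverage(hit) >= min_coverage:
            selected.append(hit.q_name)
    return selected
```

To use it, pass the lines of a tab-separated dump of the table, for example `select_by_coverage(open("refSeqAli.txt"))`, where the file name is a placeholder for wherever you saved the table.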
One of our developers suggests another approach to selecting a set of high-quality RefSeq alignments: use only RefSeqs with "Reviewed" status that align uniquely to the genome (except if they align to the haplotype pseudochromosomes or pseudoautosomal regions). Information on the haplotype chromosomes and PAR regions is on the gateway page (http://genome.ucsc.edu/cgi-bin/hgGateway) for the Human, March 2006 assembly. The status of RefSeq genes (such as 'Reviewed') can be found in the 'refSeqStatus' table.

I hope this is helpful.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Dylan Bobby wrote:
> Hi Brooke,
>
> Very helpful response. I apologize for being somewhat redundant with
> previous questions, but I did not come across the emails you listed in
> my searches.
>
> In response to (2) then, is it possible to get a list of all the
> RefSeqs for which you successfully aligned 100% of sequence to the
> genome? Ideally I would like to work with just that subset of
> sequences. For now I just need this list for human, but a generalized
> approach would be better long term.
>
> Thanks!
>
> --- On *Fri, 11/14/08, Brooke Rhead /<[EMAIL PROTECTED]>/* wrote:
>
> From: Brooke Rhead <[EMAIL PROTECTED]>
> Subject: Re: [Genome] Relationship between refGene.txt and
> refMrna.fa.gz?
> To: [EMAIL PROTECTED]
> Cc: [email protected]
> Date: Friday, November 14, 2008, 7:47 PM
>
> Hello Dylan,
>
> Here are a couple of related previously-answered questions that are
> similar to yours:
> http://www.soe.ucsc.edu/pipermail/genome/2008-April/016256.html
> http://www.soe.ucsc.edu/pipermail/genome/2007-February/012843.html
>
> Note that you can search the mailing list archives on this page:
> http://genome.ucsc.edu/FAQ/ and browse or search from this page:
> http://genome.ucsc.edu/contacts.html .
>
> I will also try to answer each of your questions:
>
> (1) The refMrna.fa.gz file from this page (assuming you are using the
> latest human database):
> http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/
> contains mRNA sequences from the Reference Sequence collection at NCBI
> (http://www.ncbi.nlm.nih.gov/RefSeq/). It is updated once a week. The
> sequences in this file are incorporated into a track at UCSC called
> "RefSeq Genes". To see how the track is created, click on the track
> name on the main Genome Browser page
> (http://genome.ucsc.edu/cgi-bin/hgTracks). In the methods section you
> will see:
>
> RefSeq RNAs were aligned against the human genome using blat; those
> with an alignment of less than 15% were discarded. When a single RNA
> aligned in multiple places, the alignment having the highest base
> identity was identified. Only alignments having a base identity level
> within 0.1% of the best and at least 96% base identity with the
> genomic sequence were kept.
>
> (2) No. Not all of the sequence in refMrna.fa.gz is aligned to the
> reference genome by blat.
>
> (3) Click on the "View table schema" link from the track details
> page, or select the table in the Table Browser and hit the "describe
> table schema" button.
>
> (4) Sometimes sequence from the refMrna.fa.gz file aligns to the
> genome more than once -- see the methods section of RefSeq Genes.
>
> Since you are new to the Genome Browser, you might be interested in
> the online Genome Browser tutorials from Open Helix:
> http://www.openhelix.com/downloads/ucsc/ucsc_home.shtml
>
> Good luck with your research.
>
> --
> Brooke Rhead
> UCSC Genome Bioinformatics Group
>
>
> On 11/13/08 17:01, Dylan Bobby wrote:
> > Hi,
> >
> > I'm trying to understand the precise relationship between the RefSeq
> > annotations in refGene.txt and the sequences in refSeq_mRNA.fa.
> > I am new to mammalian genomes, RefSeq and the UCSC browser, so I
> > think I just have some basic misunderstanding and maybe answers to
> > these questions will help clear it up:
> >
> > (1) Are the refGene.txt and refSeq_mRNA.fa files synced with one
> > another?
> > (2) If I was to splice together all the exons annotated in
> > refGene.txt for each RefSeq, would I cover all the sequence found in
> > refSeq_mRNA.fa or would there still be some extra sequence in
> > refSeq_mRNA.fa? If so, what is the extra sequence?
> > (3) Is there a description of the columns in refGene.txt available?
> > I would like to better understand that table.
> > (4) Why are there multiple rows present for some RefSeq Ids in
> > refGene.txt?
> > (5) If these two files aren't meant to correspond, is there another
> > sequence file that corresponds better with the annotations in
> > refGene.txt?
> >
> > Thanks!
> >
> > _______________________________________________
> > Genome maillist - [email protected]
> > http://www.soe.ucsc.edu/mailman/listinfo/genome
>
_______________________________________________
Genome maillist - [email protected]
http://www.soe.ucsc.edu/mailman/listinfo/genome
