Hello Yunfei,

Only entries in refGene with annotated 5' UTRs (where txStart is not the same as cdsStart) appear in the upstream files. Since not all refGene entries have this annotation, the two will not match up perfectly.
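To check a local copy of refGene.txt against this criterion, a filter along the following lines may help. It is only a minimal sketch, not the script used to build the upstream files. It assumes refGene.txt is tab-separated with a leading bin column, so that name, strand, txStart, txEnd, cdsStart and cdsEnd sit at 0-based positions 1, 3, 4, 5, 6 and 7. The function names, the strand_aware option and the sample rows are hypothetical and exist only for illustration. The default comparison is the literal one stated above; the strand_aware variant (comparing txEnd with cdsEnd on the minus strand) is an assumption, not something the criterion above spells out.

```python
import csv
import io


def has_annotated_5utr(row, strand_aware=False):
    """Return True if a refGene row passes the txStart != cdsStart check.

    Column positions are an assumption about the refGene.txt layout:
    row[3] = strand, row[4] = txStart, row[5] = txEnd,
    row[6] = cdsStart, row[7] = cdsEnd.
    """
    strand = row[3]
    tx_start, tx_end, cds_start, cds_end = (int(row[i]) for i in (4, 5, 6, 7))
    if strand_aware and strand == "-":
        # Assumption: on the minus strand the 5' end is at txEnd/cdsEnd.
        return tx_end != cds_end
    # Literal criterion from the message: txStart differs from cdsStart.
    return tx_start != cds_start


def names_with_5utr(handle, strand_aware=False):
    """Collect accession names (row[1]) of rows that pass the check."""
    reader = csv.reader(handle, delimiter="\t")
    return [row[1] for row in reader
            if row and has_annotated_5utr(row, strand_aware)]


# Made-up rows for illustration only; these are not real refGene entries.
SAMPLE = (
    "0\tNM_EXAMPLE_A\tchr1\t+\t100\t900\t150\t800\t1\t100,\t900,\n"
    "0\tNM_EXAMPLE_B\tchr1\t+\t100\t900\t100\t800\t1\t100,\t900,\n"
    "0\tNM_EXAMPLE_C\tchr1\t-\t100\t900\t150\t900\t1\t100,\t900,\n"
)

if __name__ == "__main__":
    print(names_with_5utr(io.StringIO(SAMPLE)))
    print(names_with_5utr(io.StringIO(SAMPLE), strand_aware=True))
```

With a real file, the same function could be called as names_with_5utr(open("refGene.txt")) and the resulting names compared with the headers in upstream1000.fa.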
You should also keep in mind that refGene is updated daily, while upstream1000.fa is updated weekly, so some discrepancies can arise from that as well.

I hope this clears things up for you.

Best,
Antonio Coelho
UCSC Genome Bioinformatics Group

Li, Yunfei wrote:
> Hello
>
> I tried to generate a file like "upstream1000.fa.gz - Sequences 1000 bases
> upstream of annotated transcription starts for RefSeq genes with annotated 5'
> UTRs". By using "refGene.txt" to locate each refGene entry and the chromosome
> sequence file "chromFaMasked.tar.gz", I can get a file very similar to
> "upstream1000.fa", but I found that some NM names that appear in "refGene.txt"
> are not contained in "upstream1000.fa", such as "NM_001166752,NM_053230....." --
> why would this happen?
>
> Would you please tell me what criterion you use to select refGene entries,
> after locating each refGene and cutting its sequence from the chromosome?
>
> Best,
>
> Yunfei Li
> --------------------------------------------------------------------------------------
> Research Assistant
> Department of Statistics &
> School of Molecular Biosciences
> Biotechnology Life Sciences Building 427
> Washington State University
> Pullman, WA 99164-7520
> Phone: 509-339-5096
> http://www.wsu.edu/~ye_lab/people.html
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
