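Hello Namjin,

You are correct: the knownGene.txStart field is always the lower coordinate, and knownGene.txEnd is always the higher coordinate, regardless of strand. For a gene on the "-" strand this means txStart marks the 3' end of the transcript and txEnd marks the 5' end.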
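If it helps, here is a minimal Python sketch of how you might pull a transcript's sequence out of a chrN.fa file with those two fields. The function names read_single_fasta and transcript_sequence are made up for this example and are not part of any UCSC tool. It assumes the coordinates follow the convention used in the UCSC database tables (txStart is 0-based, txEnd is exclusive), and that you want the reverse complement for "-" strand genes. Upper and lower case are kept exactly as they appear in the file.

```python
# Complement table covering both cases, so upper/lower case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_single_fasta(path):
    """Return the sequence from a one-record FASTA file such as chr1.fa."""
    with open(path) as handle:
        # Skip the ">chrN" header line and join the wrapped sequence lines.
        return "".join(line.strip() for line in handle if not line.startswith(">"))


def transcript_sequence(chrom_seq, tx_start, tx_end, strand):
    """Slice [tx_start, tx_end) and reverse-complement it for '-' strand genes.

    Assumption: tx_start is 0-based and tx_end is exclusive, so the two
    values can be used directly as a Python slice.
    """
    if not 0 <= tx_start <= tx_end <= len(chrom_seq):
        raise ValueError("coordinates fall outside the chromosome sequence")
    seq = chrom_seq[tx_start:tx_end]
    if strand == "-":
        # Complement each base, then reverse to read 5' to 3' on the '-' strand.
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq
```

You would call it as transcript_sequence(read_single_fasta("chr1.fa"), txStart, txEnd, strand), using the values from one row of knownGene.txt. Reading a whole chromosome into memory this way is simple but not fast; it is only meant to show how the coordinates and strand fit together.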

The lower-case sequence in the chromFa file is sequence that corresponds to repeat regions. If you are using the human, March 2006 assembly from this page:

  http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/

you will find this description of the file:

  chromFa.zip - The assembly sequence in one file per chromosome.
  Repeats from RepeatMasker and Tandem Repeats Finder (with period
  of 12 or less) are shown in lower case; non-repeating sequence is
  shown in upper case. Repeat masking was done using the following
  RepeatMasker/RepBase versions: RepBase Update 9.11, RM database
  version 20050112.

  The main assembly is found in the chrN.fa files, where N is the
  name of the chromosome. The chrN_random.fa files contain clones
  that are not yet finished or cannot be placed with certainty at a
  specific place on the chromosome. In some cases, including the
  human HLA region on chromosome 6, the chrN_random.fa files also
  contain haplotypes that differ from the main assembly.

I hope this information is helpful.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

Namjin Koo wrote:
> Hi~ I wonder information fo downloaded data.
>
> There is information of gene position in knownGene.txt
>
> In case of gene "-" strand, position of txSt and txEnd are changed into
> each other.
>
> Is it right? And I want to extract sequence using the position from
> chroma.fa file.
>
> It represents big and small letter about A.T.G.C. What is difference?
>
> Thank you.
>
> Best Regard,
>
> - Namjin Koo -
>
> _______________________________________________
> Genome maillist
> http://www.soe.ucsc.edu/mailman/listinfo/genome
