Hello Namjin,

You are correct: the knownGene.txStart field is always the lower coordinate
and knownGene.txEnd is always the higher coordinate, regardless of strand.
For a gene on the "-" strand, transcription therefore begins at txEnd.
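
As a rough illustration of what this means when extracting sequence, here
is a minimal Python sketch. The function names are hypothetical, and it
assumes the chromosome sequence is already loaded into a string and that
txStart/txEnd follow the 0-based, half-open convention used by the UCSC
database tables. For a "-" strand gene the slice is the same, but the
result is reverse-complemented to read in the direction of transcription.

```python
# Hypothetical helpers, not part of any UCSC tool.
# Assumes txStart is 0-based and txEnd is exclusive (half-open interval).

# Translation table that complements bases while preserving case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string, keeping letter case."""
    return seq.translate(_COMPLEMENT)[::-1]


def transcript_sequence(chrom_seq, tx_start, tx_end, strand):
    """Extract the transcribed region from a chromosome string.

    tx_start is always the lower coordinate and tx_end the higher one,
    whatever the strand, so the slice never needs to be swapped.
    """
    region = chrom_seq[tx_start:tx_end]
    if strand == "-":
        # Minus-strand genes are read from tx_end back toward tx_start.
        return reverse_complement(region)
    return region


if __name__ == "__main__":
    demo = "ttACGGTAcc"  # made-up sequence for illustration
    print(transcript_sequence(demo, 2, 8, "+"))
    print(transcript_sequence(demo, 2, 8, "-"))
```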

Lower-case sequence in the chromFa files corresponds to repeat regions.
If you are using the human March 2006 assembly (hg18), the download page

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/

gives this description of the file:

chromFa.zip - The assembly sequence in one file per chromosome.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period
    of 12 or less) are shown in lower case; non-repeating sequence is
    shown in upper case. Repeat masking was done using the following
    RepeatMasker/RepBase versions: RepBase Update 9.11, RM database
    version 20050112. The main assembly is found in the chrN.fa
    files, where N is the name of the chromosome. The chrN_random.fa
    files contain clones that are not yet finished or cannot be placed
    with certainty at a specific place on the chromosome. In some
    cases, including the human HLA region on chromosome 6, the
    chrN_random.fa files also contain haplotypes that differ from the
    main assembly.
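
If the case distinction gets in the way of downstream processing, it can
be measured or removed. The following is only a sketch in Python with
hypothetical function names, assuming a single-record FASTA file such as
one chrN.fa: it reads the sequence, reports what fraction is lower case
(repeat-masked), and shows two ways of normalizing it.

```python
# Hypothetical helpers for working with soft-masked (lower-case) sequence.


def read_single_fasta(lines):
    """Join the sequence lines of a single-record FASTA, skipping the header."""
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def soft_masked_fraction(seq):
    """Fraction of bases shown in lower case, i.e. flagged as repeats."""
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base.islower())
    return masked / len(seq)


def unmask(seq):
    """Ignore the repeat annotation by converting everything to upper case."""
    return seq.upper()


def hard_mask(seq):
    """Replace every lower-case (repeat) base with N."""
    return "".join("N" if base.islower() else base for base in seq)


if __name__ == "__main__":
    # Made-up record for illustration; with a real file, pass an open
    # file handle instead, e.g. open("chr1.fa").
    record = [">chr_demo\n", "ACGTacgt\n", "NNnnAC\n"]
    seq = read_single_fasta(record)
    print(soft_masked_fraction(seq))
    print(unmask(seq))
    print(hard_mask(seq))
```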

I hope this information is helpful.

--
Brooke Rhead
UCSC Genome Bioinformatics Group


Namjin Koo wrote:
> Hi, I have a question about the downloaded data.
>
> knownGene.txt contains gene position information.
>
> For genes on the "-" strand, the positions of txStart and txEnd appear to
> be swapped.
>
> Is that right? Also, I want to extract sequence from the chromFa file
> using these positions.
>
> The file shows A, T, G, C in both upper and lower case. What is the
> difference?
>
> Thank you.
>
> Best regards,
>
> - Namjin Koo -
>
>   
_______________________________________________
Genome maillist  -  [email protected]
http://www.soe.ucsc.edu/mailman/listinfo/genome
