Hello,

Often, the primary sequence data is obtained from NCBI, and we recommend that you download the Genbank records there. The availability of quality scores will vary by data source and data type. For non-Genbank sources, read the track description, check the downloads area for data, and contact the source if you need more information than what they submitted for us to display in the UCSC Browser.

For EST sequence data, some sources provide quality scores and some do not. These are base-calling quality scores.

For mRNA/consensus data, some sources may provide quality scores but most will not (in fact, you may not even be able to find out which ESTs were used to create the composite mRNA assembly or where they were incorporated). If there are quality scores, they will represent the quality of the composite sequence. This is (usually) a calculated score based on the base-calling quality scores of the underlying ESTs plus the depth of coverage/base agreement of those sequences at each base position in the final assembled consensus sequence. You can follow the source information in the Genbank report and check with the creators of the data if you want more details; they may publish only some of the information at NCBI.

For RefSeq sequence data, there are no quality scores, but there is a status flag. The more finished a sequence is, the more confidence one would have in the correctness of the transcript's bases/reading frame for coding sequence.

Some tracks are not from Genbank but may still be based on Genbank data. An example of this is the "AceView Gene" track. Please read the track description for the details of processing and click out to the source for additional info.

The main point to be aware of when using this type of "consensus" data as an evidence line is that a single EST can be propagated into multiple sources. The same sequence data, from a single data point, can look like multiple layers of evidence, leading a researcher to believe that the data must be correct since it was "observed" multiple times. This is a false assumption. If you are willing to reduce this type of data to its components, the duplication can be removed. This is a non-trivial task, as some of the data is hard to find, but it is something to keep in mind. Often it is easy to spot when an error has been propagated.
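
As a rough illustration of the duplication point above, here is a minimal Python sketch. It assumes you have already traced each track item back to the EST accessions it was built from (the hard part). The track names, item IDs, and accessions are made up for the example and do not refer to real records.

```python
from collections import defaultdict

# Hypothetical evidence records: (track, item ID in that track, underlying ESTs).
# All names and accessions below are invented for illustration only.
evidence = [
    ("EST track", "est_hit_1", {"EST0001"}),
    ("mRNA consensus track", "cons_7", {"EST0001", "EST0002"}),
    ("gene model track", "model_3", {"EST0001"}),
]


def tracks_by_accession(records):
    """Map each underlying accession to the set of tracks that reuse it."""
    reuse = defaultdict(set)
    for track, _item_id, accessions in records:
        for accession in accessions:
            reuse[accession].add(track)
    return dict(reuse)


support = tracks_by_accession(evidence)

apparent_layers = len(evidence)     # items that look like separate evidence
independent_sources = len(support)  # distinct underlying sequences

print("apparent layers of evidence:", apparent_layers)
print("independent underlying sequences:", independent_sources)
for accession, tracks in sorted(support.items()):
    print(accession, "appears in", len(tracks), "track(s)")
```

In this made-up case, three track items reduce to two underlying sequences, and one of them accounts for all three "observations". The same grouping could be done by clone or by library instead of by accession if that information is available.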

A propagated error usually shows this pattern: the data is from a single read, multiple reads from the same clone, or even all reads from a particular EST clone library/data source. There will be conflicting evidence from other sources or, more often, a lack of evidence from other sources, which can "promote" an error (that would normally be overruled by correct data) to the top level of a track, making it possible to misinterpret an error as novelty.

We hope this helps you get started,
Jennifer Jackson
UCSC Genome Bioinformatics Group

Meenakshi Roy wrote:
> Hi,
> I want to look at raw sequencing reads. Ideally I would like to
> check both base calls and sequence quality scores (including
> information about number and quality of reads) for specific stretches
> of sequence.
> The reason for doing this is because on 28 way multi-genomic
> alignments--some of the aligned sequences from other species have stop
> codons within the coding exon, and I want to check how confident the
> sequence quality is.
>
> This is an example of my queries:
>
> I wanted to look for orthologs for the human IGLC1 gene within other
> species. I did this by searching on the human chromosomal location
> for this gene- chr22: 21567554-21567874 (Human March 06 assembly). I
> wanted only placental mammal orthologs, so I selected this option, and
> then obtained the vertebrate multiz alignment and phastcons
> conservation for 28 species. I then chose the options "Capitalize
> coding exons based on Ensembl genes" from the pulldown menu on top of
> the page.
>
> Here is the link:
> http://genome.ucsc.edu/cgi-bin/hgc?hgsid=132217768&o=21567553&t=21567874&g=multiz28way&i=multiz28way&c=chr22&l=21567553&r=21567874&db=hg18&pix=800
>
> I then take the capitalized nucleotide sequences for all the species
> and translate them into protein. When I do this, with the Bushbaby
> and Hamster sequences, I get several stop codons early on in the
> protein. Are these stop codons real or could they be due to the
> quality of the sequencing reads?
>
> My OS is MacOSX and I am using a Safari browser.
>
> I hope I have been clear and provided sufficient detail. I would
> really appreciate any help and/or advice.
>
> Thanks,
>
> Dr. Meenakshi Roy
> UCLA
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
