Re: [Genome] how do I look at raw sequencing reads? (and check sequence quality scores)

Jennifer Jackson Fri, 15 May 2009 12:46:57 -0700

Hello,

Often, the primary sequence data is obtained from NCBI, and we recommend 
that you download the Genbank records there. The availability of quality 
scores varies by data source and data type. For non-Genbank sources, 
read the track description, check the downloads area for data, and 
contact the source if you need more information than what they 
submitted for us to display in the UCSC Browser.

For EST sequence data, some sources provide quality scores and some do 
not. These are base-calling quality scores.
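
As a rough illustration of how such scores can be read, here is a small
Python sketch. It assumes the source reports Phred-scaled scores
(Q = -10 * log10 of the error probability), which is common but should be
confirmed with the data source. The function names, the example values,
and the cutoff of 20 are made up for illustration.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score Q to an error probability.

    Assumes the Phred convention: P = 10 ** (-Q / 10).
    """
    return 10 ** (-q / 10)


def low_quality_positions(quals, min_q=20):
    """Return 0-based positions whose quality score is below min_q.

    min_q=20 (about a 1 in 100 chance of a wrong call) is an
    illustrative cutoff, not a recommended standard.
    """
    return [i for i, q in enumerate(quals) if q < min_q]


# Made-up per-base scores for a short stretch of one read.
example_quals = [40, 38, 12, 35, 8, 41]

for pos in low_quality_positions(example_quals):
    q = example_quals[pos]
    print(pos, q, phred_to_error_prob(q))
```

A base flagged this way inside a codon of interest would be a reason to
look for additional reads covering the same position.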

For mRNA/consensus data, some sources may provide quality scores but 
most will not (in fact, you may not even be able to find out which ESTs 
were used to create the composite mRNA assembly or where they were 
incorporated). If there are quality scores, they will represent the 
quality of the composite sequence. This is (usually) a calculated score 
based on the base-calling quality score of the underlying ESTs plus the 
depth of coverage/base agreement of those sequences at each base 
position in the final assembled consensus sequence. It would be possible 
to follow the Genbank report's source information and check with the 
creators of the data if you want more details. They may only publish 
some of the info at NCBI.
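
To make the idea of a composite score concrete, here is a toy Python
sketch. It is not the formula used by any particular assembler or data
provider; it only shows how per-read base-calling scores and base
agreement at one position could be combined. The function name and the
scoring rule (summed quality of agreeing reads minus summed quality of
disagreeing reads) are assumptions made for illustration.

```python
from collections import defaultdict


def toy_consensus_call(column):
    """Combine the reads covering one position into (base, score).

    column: non-empty list of (base, quality) pairs, one per read.
    The consensus base is the one with the highest summed quality.
    The score is that sum minus the summed quality of all other
    bases, floored at zero. This is a toy rule, not a real
    assembler's formula.
    """
    totals = defaultdict(int)
    for base, qual in column:
        totals[base] += qual
    best = max(totals, key=totals.get)
    others = sum(q for b, q in totals.items() if b != best)
    return best, max(totals[best] - others, 0)


# Three reads cover this position: two agree on A, one says G.
print(toy_consensus_call([("A", 30), ("A", 20), ("G", 10)]))
# A position covered by a single read has no agreement to draw on.
print(toy_consensus_call([("A", 30)]))
```

Under this toy rule, deeper coverage with agreeing reads raises the
score, and disagreement lowers it, which is the general behavior
described above.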

For RefSeq sequence data, there are no quality scores, but there is a 
status flag. The more finished a sequence is, the more confidence one 
would have in the correctness of the transcript's bases/reading frame 
for coding sequence.

Some tracks are not from Genbank, but may still be based on 
Genbank data. An example of this is the "AceView Gene" track. Please 
read the track description for the details of processing and click out 
to the source for additional info.

The main point to be aware of when using this type of "consensus" data 
as an evidence line is that a single EST can be propagated into multiple 
sources. The same sequence data, from a single data point, can look like 
multiple layers of evidence, leading a researcher to believe that the 
data must be correct since it was "observed" multiple times. This is a 
false assumption. If you are willing to reduce this type of data to its 
components, the duplication can be removed. This is a non-trivial task, 
as some of the data is hard to find, but it is something to keep in mind. 
Often it is easy to spot when an error has been propagated. The data 
will be from a single read, multiple reads from the same clone, or even 
all reads from a particular EST clone library/data source. There will be 
conflicting evidence from other sources or, more often, a lack of 
evidence from other sources. That lack can "promote" an error (that 
would normally be overruled by correct data) to the top level of a 
track, where it may be misinterpreted as a novel feature.
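
The reduction to components can be sketched in Python as follows. This
is only an outline under stated assumptions: the record layout (the
"accession", "clone", and "library" keys) and the example values are
hypothetical, and in practice the clone and library for each sequence
would have to be collected by hand from the Genbank records or from the
data source.

```python
def independent_support(records):
    """Count how many independent sources back a set of evidence records.

    records: list of dicts with the hypothetical keys "accession",
    "clone", and "library". A record with no known clone is counted
    under its own accession; records with no known library are
    lumped together as "unknown" so they are not over-counted.
    """
    clones = set()
    libraries = set()
    for rec in records:
        clones.add(rec.get("clone") or rec["accession"])
        libraries.add(rec.get("library") or "unknown")
    return {
        "records": len(records),
        "clones": len(clones),
        "libraries": len(libraries),
    }


# Made-up example: three sequences that trace back to two clones
# from the same library.
evidence = [
    {"accession": "EST_1", "clone": "cloneA", "library": "lib1"},
    {"accession": "EST_2", "clone": "cloneA", "library": "lib1"},
    {"accession": "EST_3", "clone": "cloneB", "library": "lib1"},
]
print(independent_support(evidence))
```

If the number of records is much larger than the number of clones or
libraries, the apparent layers of evidence are probably not independent.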

We hope this helps you get started,
Jennifer Jackson
UCSC Genome Bioinformatics Group

Meenakshi Roy wrote:
> Hi,
>     I want to look at raw sequencing reads.  Ideally I would like to  
> check both base calls and sequence quality scores (including  
> information about number and quality of reads) for specific stretches  
> of sequence.
> The reason for doing this is because on 28 way multi-genomic  
> alignments--some of the aligned sequences from other species have stop  
> codons within the coding exon, and I want to check how confident the  
> sequence quality is.
>
>    This is an example of my queries:
>
> I wanted to look for orthologs for the human IGLC1 gene within other  
> species.  I did this by searching on the human chromosomal location  
> for this gene- chr22: 21567554-21567874 (Human March 06 assembly).   I  
> wanted only placental mammal orthologs, so I selected this option, and  
> then obtained the vertebrate multiz alignment and phastcons  
> conservation for 28 species.  I then chose the options "Capitalize  
> coding exons based on Ensembl genes" from the pulldown menu on top of  
> the page.
>
> Here is the link:
> http://genome.ucsc.edu/cgi-bin/hgc?hgsid=132217768&o=21567553&t=21567874&g=multiz28way&i=multiz28way&c=chr22&l=21567553&r=21567874&db=hg18&pix=800
>
> I then take the capitalized nucleotide sequences for all the species  
> and translate them into protein.  When I do this, with the Bushbaby  
> and Hamster sequences, I get several stop codons early on in the  
> protein.  Are these stop codons real or could they be due to the  
> quality of the sequencing reads?
>
> My OS is MacOSX and I am using a Safari browser.
>
>
> I hope I have been clear and provided sufficient detail.  I would  
> really appreciate any help and/or advice.
>
> Thanks,
>
> Dr. Meenakshi Roy
> UCLA
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>   