Re: [galaxy-user] Inquiry on FastQC report

Jennifer Jackson Tue, 06 Aug 2013 12:04:38 -0700

Hello,

Your post is very difficult to read with the formatting. The best place to find out more about the FastQC program is the tool documentation, linked from the tool form and also available here:

http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/


More below.

On 7/31/13 11:08 PM, Ng Kiaw Kiaw wrote:

Dear Galaxy Officer,

Good day.
I am a new user of the Galaxy main server. The tools provided are very user-friendly. Thanks for the establishment of these.
I am new to RNA-seq analysis and am now in the process of learning bioinformatics.
I would like to inquire about the FastQC report generated on my data.

For your information:

Samples: Plant (dicotyledon)

Type of data: RNA-seq (Illumina HiSeq 2000 with CASAVA v 1.8.2)

Paired ends
Adapter sequence: RPI 15 ( 5'CAAGCAGAAGACGGCATACGAGA*TTGACATG*TGACTGGAGTTCCTTGGCACCCGAGAATTCCA)
Main purpose of my analysis: Identification of novel transcripts and gene expression studies
I ran FastQC on my raw RNA-seq data, both forward and reverse. I attach the FastQC report to this email.
My questions are:
1) The basic statistics show that my data encoding is Sanger/Illumina 1.9. When I groom my data for downstream analysis in Galaxy, is it correct to choose "Sanger" for the input FASTQ quality score type?

Yes, if you choose to groom, Sanger is the correct input. Or you can just assign the datatype to .fastqsanger by clicking on the pencil icon. More help is in this screencast, "FASTQ Prep - Illumina":

https://main.g2.bx.psu.edu/u/galaxyproject/p/screencasts-usegalaxyorg
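
To illustrate why "Sanger" is the matching choice: Sanger-type FASTQ stores each quality score as a character with ASCII offset 33, while older Illumina files used offset 64. The following is only a rough sketch of how one might check this by hand; the function names are made up for illustration, and the rule of thumb (any character below ';' rules out offset 64) cannot distinguish the encodings if the file happens to contain only high scores.

```python
def guess_phred_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from FASTQ quality strings.

    Heuristic only: characters below ';' (ASCII 59) cannot occur in
    offset-64 files, so seeing one implies offset 33 (Sanger).
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    return 33 if lowest < 59 else 64


def decode_sanger(quality):
    """Convert a Sanger (offset 33) quality string to integer Phred scores."""
    return [ord(ch) - 33 for ch in quality]


if __name__ == "__main__":
    example = ["IIII#5"]  # made-up quality string, not from the poster's data
    print(guess_phred_offset(example))
    print(decode_sanger(example[0]))
```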

2) Based on the per base sequence quality, the quality scores are above 20.0 for both forward and reverse data. Do I still need to trim my data?

No, most likely not. This is a reasonable quality score to use as a baseline.
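
For context on what a score of 20 means: a Phred score Q corresponds to an estimated base-call error probability of 10^(-Q/10), so Q20 is roughly one error in 100 bases. Below is a small sketch of that arithmetic, plus a per-position average over quality strings. Note that this is a simplification: the FastQC plot shows medians and quartiles rather than only a mean, and the function names and the offset default are assumptions for illustration.

```python
def phred_to_error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def mean_quality_per_position(quality_strings, offset=33):
    """Average Phred score at each read position (reads may differ in length)."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                # First time this position is seen: start a new column.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


if __name__ == "__main__":
    print(phred_to_error_probability(20))
    print(mean_quality_per_position(["I5", "I"]))  # made-up quality strings
```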

3) The results for "Per base sequence content", "Per base GC content", and "sequence duplication level" are fail. What do these three results indicate? What are the solutions for these problems?

These are quality metrics and indicate that the data is skewed away from what would be expected in a normal distribution. You could investigate the library preparation methods if this is your own data.

4) What do the overrepresented sequences indicate? Do I need to trim off the overrepresented sequences?

Same as above. And yes, trim if it is a large portion of your data, is repetitive, or causes problems later on. An untrimmed overrepresented sequence effectively "shortens" the length of the sequence being aligned, even though the read itself is longer, and this could cause you to pick the wrong length parameters in Tophat.
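
To judge whether an overrepresented sequence is "a great portion" of the data, one could count identical reads directly. This is a minimal sketch, not how FastQC itself computes the module; the function name and the min_fraction cutoff are illustrative assumptions that you would set for your own data.

```python
from collections import Counter


def overrepresented(reads, min_fraction=0.001):
    """Return (sequence, fraction) pairs for sequences making up at least
    min_fraction of all reads, most frequent first."""
    counts = Counter(reads)
    total = len(reads)
    return [
        (seq, n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]


if __name__ == "__main__":
    # Made-up reads for illustration only.
    reads = ["AAAA", "AAAA", "AAAA", "CCCC"]
    print(overrepresented(reads, min_fraction=0.5))
```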

5) Based on the K-mer content, how could I analyse and justify whether this is good data or not?


Same as above.

6) In the reverse data FastQC report, "per sequence GC content" does not look good. What does this indicate?

Same as above.

7) How could I identify the adapter sequence in my RNA-seq data and how could I remove it?

Locating the methods associated with the preparation of the data is the first place to look. You could also just trim the reads: if the "overrepresented sequence" is localized to where the adapter is most likely to be, then trim based on that range.
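
As a rough illustration of what adapter clipping does, the sketch below cuts a read at the first exact match to the start of an adapter. It rests on several assumptions: the function names and min_match are hypothetical, the sequences in the example are made up, and depending on the read the adapter may appear as the reverse complement of the oligo listed in the kit documentation (hence the helper). It does not handle mismatches or a partial adapter at the very end of a read, which dedicated clipping tools do.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def trim_adapter(read, qual, adapter, min_match=8):
    """Cut read and quality string at the first exact occurrence of the
    first min_match bases of adapter; return them unchanged if not found."""
    seed = adapter[:min_match]
    pos = read.find(seed)
    if pos == -1:
        return read, qual
    return read[:pos], qual[:pos]


if __name__ == "__main__":
    # Made-up read and adapter for illustration only.
    read = "ACGTACGTTGGAATTCTCGG"
    qual = "I" * len(read)
    print(trim_adapter(read, qual, "TGGAATTCTCGGGTGCC"))
```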

8) After grooming the data, running FastQC, and adapter removal, are there any other pre-processing steps that need to be done before running Bowtie and Tophat?

Because quality is not an issue, no trimming is necessary. You could however filter out short sequences that will never be able to meet the alignment criteria. See the Tophat documentation about how to best tune parameters to match data based on the length of reads.
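
Filtering out short sequences amounts to dropping FASTQ records below a length cutoff. Here is a minimal sketch, assuming plain four-line FASTQ records; the function names and the min_length value of 20 are placeholders, not recommended settings. For paired-end data both mates of a pair would need to be dropped together so the two files stay in sync, which this single-file sketch does not do.

```python
def read_fastq(lines):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it).rstrip("\n")
        next(it)  # the '+' separator line is skipped
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def filter_min_length(records, min_length=20):
    """Keep only records whose sequence has at least min_length bases."""
    return [rec for rec in records if len(rec[1]) >= min_length]


if __name__ == "__main__":
    # Made-up records for illustration only.
    lines = [
        "@read1\n", "ACGTACGTAC\n", "+\n", "IIIIIIIIII\n",
        "@read2\n", "ACG\n", "+\n", "III\n",
    ]
    for header, seq, qual in filter_min_length(read_fastq(lines), min_length=5):
        print(header, seq, qual)
```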

All of this said, most of the time very little needs to be done. Poor reads will simply fall out and not align in the first steps of the pipeline. Trimming and setting Tophat parameters will have the greatest impact.


Take care,

Jen
Galaxy team


Many thanks in advance for your kind assistance and support.

Best regards
Ng Kiaw Kiaw
PhD student
RIKEN Yokohama Campus
Japan.


___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

   http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

   http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:

   http://galaxyproject.org/search/mailinglists/


--
Jennifer Hillman-Jackson
Galaxy Support and Training
http://galaxyproject.org