[galaxy-user] Inquiry on FastQC report

Ng Kiaw Kiaw Thu, 01 Aug 2013 11:00:43 -0700

Dear Galaxy Officer,

Good day.

I am a new user of the Galaxy main server. The tools provided are very user-friendly. Thank you for making them available.

I am new to RNA-seq analysis and am still learning bioinformatics.

I would like to ask about the FastQC reports generated from my data.

For your information:

Samples: Plant (dicotyledon)

Type of data: RNA-seq (Illumina HiSeq 2000 with CASAVA v 1.8.2)

Paired ends

Adapter sequence: RPI 15 (5’ CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA)

Main purpose of my analysis: Identification of novel transcripts and gene expression studies

I ran FastQC on my raw RNA-seq data, both the forward and the reverse reads. The FastQC reports are attached to this email.

Title: D1TB3ACXX_PR0055_02A15_H1_L007_R1.fastq FastQC Report

Basic Statistics

Measure	Value
Filename	D1TB3ACXX_PR0055_02A15_H1_L007_R1.fastq
File type	Conventional base calls
Encoding	Sanger / Illumina 1.9
Total Sequences	50140881
Filtered Sequences	0
Sequence length	100
%GC	43
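
The "Sanger / Illumina 1.9" encoding reported above stores each quality score as an ASCII character with an offset of 33 (Phred+33). As a minimal sketch of how that can be checked on the raw quality strings, the helper names and the example strings below are hypothetical and not taken from the report:

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) used by FASTQ quality strings."""
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    # Characters below ';' (ASCII 59) cannot occur in +64 encodings,
    # so seeing one of them implies Phred+33 (Sanger / Illumina 1.8+).
    if lowest < 59:
        return 33
    # Otherwise assume +64; very high-quality Phred+33 data could also
    # land here, so more reads should be inspected before trusting this.
    return 64


def decode_quality(quality, offset=33):
    """Convert a quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


# Hypothetical quality strings for illustration only.
example = ["CCCFFFFFHHHHHJJJJJ", "@@@DDDDDHHHHH#####"]
print(guess_phred_offset(example))
print(decode_quality("#5I"))
```

If the guessed offset is 33, choosing "Sanger" as the input quality score type is consistent with the encoding shown in the report.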

Per base sequence quality

Per base quality graph

Per sequence quality scores

Per Sequence quality graph

Per base sequence content

Per base GC content

Per base GC content graph

Per sequence GC content

Per sequence GC content graph

Per base N content

N content graph

Sequence Length Distribution

Sequence length distribution

Sequence Duplication Levels

Duplication level graph

Overrepresented sequences

No overrepresented sequences

Kmer Content

Kmer graph

Sequence	Count	Obs/Exp Overall	Obs/Exp Max	Max Obs/Exp Position
CTCCA	9595585	3.510654	6.3710628	60-64
TTCTC	12116685	3.2885883	5.2676396	45-49
TCTCG	10716155	3.2443743	5.334085	80-84
CCAAG	11237210	3.0798802	5.2831144	55-59
CTTCT	11243660	3.0516405	5.6570544	95-96
GCCAA	11103775	3.0433087	5.249829	55-59
TCTTC	10632210	2.8856869	5.412275	90-94
TGCCA	9846895	2.8365011	5.120373	50-54
CTGCT	9346185	2.8296084	6.1322436	95-96
ACTCC	7230155	2.6452348	5.5269346	60-64
AAAAA	24620425	2.5958595	8.515192	95-96
CCGTC	5590090	2.3977835	6.3717356	85-89
GTGCC	7063720	2.3855677	5.072632	50-54
TCTGC	7538220	2.282237	5.32954	95-96

Files created by FastQC

D1TB3ACXX_PR0055_02A15_H1_L007_R1_fastqc.zip (171.6 KB)

duplication_levels.png (18.5 KB)

fastqc_data.txt (7.8 KB)

fastqc_report.html (8.0 KB)

kmer_profiles.png (65.5 KB)

per_base_gc_content.png (10.2 KB)

per_base_n_content.png (7.7 KB)

per_base_quality.png (9.8 KB)

per_base_sequence_content.png (16.5 KB)

per_sequence_gc_content.png (26.7 KB)

per_sequence_quality.png (19.4 KB)

rgFastQC3Jm48N.log (1.7 KB)

sequence_length_distribution.png (17.2 KB)

summary.txt (751 B)
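
The summary.txt file listed above records the pass/warn/fail status of each module. The sketch below groups module names by status, assuming the file has three tab-separated columns (status, module name, file name). The example lines are illustrative and are not the actual contents of the attached file, and `parse_summary` is a hypothetical helper name:

```python
from collections import defaultdict


def parse_summary(text):
    """Group FastQC module names by their status (PASS, WARN or FAIL)."""
    by_status = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Assumed layout: status <TAB> module name <TAB> file name
        status, module, _filename = line.split("\t", 2)
        by_status[status].append(module)
    return dict(by_status)


# Illustrative lines only, not copied from the attached summary.txt.
fname = "D1TB3ACXX_PR0055_02A15_H1_L007_R1.fastq"
example = "\n".join([
    "PASS\tBasic Statistics\t" + fname,
    "FAIL\tPer base sequence content\t" + fname,
    "FAIL\tSequence Duplication Levels\t" + fname,
])

for status, modules in parse_summary(example).items():
    print(status, modules)
```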

FastQC documentation and full attribution is here

FastQC was run by Galaxy using the rgenetics rgFastQC wrapper - see http://rgenetics.org for details and licensing

Title: D1TB3ACXX_PR0055_02A15_H1_L007_R2.fastq FastQC Report

Summary

Basic Statistics
Per base sequence quality
Per sequence quality scores
Per base sequence content
Per base GC content
Per sequence GC content
Per base N content
Sequence Length Distribution
Sequence Duplication Levels
Overrepresented sequences
Kmer Content

Basic Statistics

Measure	Value
Filename	D1TB3ACXX_PR0055_02A15_H1_L007_R2.fastq
File type	Conventional base calls
Encoding	Sanger / Illumina 1.9
Total Sequences	50140881
Filtered Sequences	0
Sequence length	100
%GC	42

Per base sequence quality

Per base quality graph

Per sequence quality scores

Per Sequence quality graph

Per base sequence content

Per base GC content

Per base GC content graph

Per sequence GC content

Per sequence GC content graph

Per base N content

N content graph

Sequence Length Distribution

Sequence length distribution

Sequence Duplication Levels

Duplication level graph

Overrepresented sequences

Sequence	Count	Percentage	Possible Source
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT	199520	0.39791881598570233	No Hit
GGTAGTTCACTGTAAGATGACCTCGTTCATATCCTTTCCCATCTGTGTCC	107648	0.21469108211321616	No Hit
GATATCTCACTAAAGAGATAGCCGATCTTCCCAATAATCTTCACCAATAC	56585	0.1128520258748545	No Hit
GATTTCATAACGGAGAATAGAGGATTGAAGGAATCTCCACAATCAACAAA	51764	0.10323711703430181	No Hit
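
The Percentage column above is each sequence's count divided by the Total Sequences value from the Basic Statistics table (50140881). A short sketch that reproduces the reported figures from the counts in the table; the function name is only illustrative:

```python
TOTAL_SEQUENCES = 50140881  # "Total Sequences" from the Basic Statistics table


def percentage(count, total=TOTAL_SEQUENCES):
    """Share of all reads, in percent, represented by one sequence."""
    return 100.0 * count / total


# (count, percentage reported by FastQC) for the four rows of the table
rows = [
    (199520, 0.39791881598570233),
    (107648, 0.21469108211321616),
    (56585, 0.1128520258748545),
    (51764, 0.10323711703430181),
]

for count, reported in rows:
    print(count, round(percentage(count), 4), round(reported, 4))
```

Each of the four sequences therefore accounts for less than 0.4% of the reads in this file.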

Kmer Content

Kmer graph

Sequence	Count	Obs/Exp Overall	Obs/Exp Max	Max Obs/Exp Position
TTTTT	44893920	4.603159	9.396082	2
GGTGG	7636440	4.1987643	8.876761	75-79
AAAAA	33685255	4.0047812	15.362864	95-96
GTAGA	12217490	3.077795	5.6548786	65-69
TGTAG	12441590	3.0428424	5.5109844	60-64
GTCGG	6023890	2.6264553	5.7038784	45-49
CGGTG	5547925	2.4189312	6.189177	75-79
GTGGT	6598845	2.3842838	5.4463615	75-79
GGTCG	4643850	2.024749	5.8377094	80-84
CGCCG	4658255	1.9434872	5.8249826	80-84
AGGCA	4324605	1.3146408	6.3336577	1
GGTAG	3139310	1.1683644	6.0488625	1
AGGCT	3695365	1.090596	5.6976523	1

Files created by FastQC

D1TB3ACXX_PR0055_02A15_H1_L007_R2_fastqc.zip (176.0 KB)

duplication_levels.png (18.8 KB)

fastqc_data.txt (8.1 KB)

fastqc_report.html (8.5 KB)

kmer_profiles.png (65.7 KB)

per_base_gc_content.png (10.3 KB)

per_base_n_content.png (7.7 KB)

per_base_quality.png (9.9 KB)

per_base_sequence_content.png (18.2 KB)

per_sequence_gc_content.png (26.9 KB)

per_sequence_quality.png (19.6 KB)

rgFastQCOnPOyg.log (1.7 KB)

sequence_length_distribution.png (17.2 KB)

summary.txt (751 B)

FastQC documentation and full attribution is here

FastQC was run by Galaxy using the rgenetics rgFastQC wrapper - see http://rgenetics.org for details and licensing

My questions are:

1) The basic statistics show that my data encoding is Sanger / Illumina 1.9. When I groom my data for downstream analysis in Galaxy, is it correct to choose "Sanger" as the input FASTQ quality score type?

2) Based on the per base sequence quality, the quality scores are above 20 for both the forward and the reverse reads. Do I still need to trim my data?

3) The results for "Per base sequence content", "Per base GC content" and "Sequence Duplication Levels" are "fail". What do these three results indicate, and how can these problems be solved?

4) What do the overrepresented sequences indicate? Do I need to trim them off?

5) Based on the Kmer content, how can I assess whether this is good data or not?

6) In the FastQC report for the reverse reads, "Per sequence GC content" does not look good. What does this indicate?

7) How can I identify the adapter sequence in my RNA-seq data, and how can I remove it?

8) After grooming the data, running FastQC and removing adapters, are there any other pre-processing steps that need to be done before running Bowtie and TopHat?
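
To make question 7 concrete, here is a minimal sketch of what identifying and removing an adapter means at the level of a single read. It searches for an exact match to the first bases of the RPI 15 sequence given above, or of its reverse complement, and cuts the read there. The function names, the 12-base seed length and the example read are assumptions for illustration. Which orientation actually occurs in the reads is not established here.

```python
ADAPTER = "CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_adapter(read, adapter, min_match=12):
    """Index where the first min_match adapter bases start in read, or -1."""
    return read.find(adapter[:min_match])


def trim_read(read, quality, adapter, min_match=12):
    """Cut the read and its quality string at the adapter start, if found."""
    pos = find_adapter(read, adapter, min_match)
    if pos == -1:
        return read, quality
    return read[:pos], quality[:pos]


# Hypothetical read: 10 bases of insert followed by 20 adapter bases.
rc_adapter = reverse_complement(ADAPTER)
read = "ACGTACGTAC" + rc_adapter[:20]
quality = "I" * len(read)

print(find_adapter(read, ADAPTER), find_adapter(read, rc_adapter))
print(trim_read(read, quality, rc_adapter))
```

This sketch only finds exact matches. Dedicated trimming tools also handle mismatches and partial adapters at the end of a read, so they are the better choice for real data.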

Many thanks in advance for your kind assistance and support.

Best regards

Ng Kiaw Kiaw

PhD student

RIKEN Yokohama Campus

Japan.

___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:


  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:

  http://galaxyproject.org/search/mailinglists/
