Dear Galaxy Officer,

Good day. I am a new user of the Galaxy main server. The tools provided are very user-friendly, and I thank you for establishing them. I am new to RNA-seq analysis and am still learning bioinformatics. I would like to ask about the FastQC reports generated on my data.

For your information:
Samples: Plant (dicotyledon)
Type of data: RNA-seq (Illumina HiSeq 2000 with CASAVA v1.8.2), paired-end
Adapter sequence: RPI 15 (5' CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA)
Main purpose of my analysis: Identification of novel transcripts and gene expression studies

I ran FastQC on my raw RNA-seq data, both forward and reverse reads. The FastQC reports are attached to this email.
FastQC Report
Mon 22 Jul 2013
D1TB3ACXX_PR0055_02A15_H1_L007_R1.fastq
Summary
Basic Statistics
Measure | Value |
---|---|
Filename | D1TB3ACXX_PR0055_02A15_H1_L007_R1.fastq |
File type | Conventional base calls |
Encoding | Sanger / Illumina 1.9 |
Total Sequences | 50140881 |
Filtered Sequences | 0 |
Sequence length | 100 |
%GC | 43 |
Per base sequence quality
Per sequence quality scores
Per base sequence content
Per base GC content
Per sequence GC content
Per base N content
Sequence Length Distribution
Sequence Duplication Levels
Overrepresented sequences
No overrepresented sequences
Kmer Content
Sequence | Count | Obs/Exp Overall | Obs/Exp Max | Max Obs/Exp Position |
---|---|---|---|---|
CTCCA | 9595585 | 3.510654 | 6.3710628 | 60-64 |
TTCTC | 12116685 | 3.2885883 | 5.2676396 | 45-49 |
TCTCG | 10716155 | 3.2443743 | 5.334085 | 80-84 |
CCAAG | 11237210 | 3.0798802 | 5.2831144 | 55-59 |
CTTCT | 11243660 | 3.0516405 | 5.6570544 | 95-96 |
GCCAA | 11103775 | 3.0433087 | 5.249829 | 55-59 |
TCTTC | 10632210 | 2.8856869 | 5.412275 | 90-94 |
TGCCA | 9846895 | 2.8365011 | 5.120373 | 50-54 |
CTGCT | 9346185 | 2.8296084 | 6.1322436 | 95-96 |
ACTCC | 7230155 | 2.6452348 | 5.5269346 | 60-64 |
AAAAA | 24620425 | 2.5958595 | 8.515192 | 95-96 |
CCGTC | 5590090 | 2.3977835 | 6.3717356 | 85-89 |
GTGCC | 7063720 | 2.3855677 | 5.072632 | 50-54 |
TCTGC | 7538220 | 2.282237 | 5.32954 | 95-96 |
Files created by FastQC
FastQC documentation and full attribution is here. FastQC was run by Galaxy using the rgenetics rgFastQC wrapper - see http://rgenetics.org for details and licensing
FastQC Report
Mon 22 Jul 2013
D1TB3ACXX_PR0055_02A15_H1_L007_R2.fastq
Summary
Basic Statistics
Measure | Value |
---|---|
Filename | D1TB3ACXX_PR0055_02A15_H1_L007_R2.fastq |
File type | Conventional base calls |
Encoding | Sanger / Illumina 1.9 |
Total Sequences | 50140881 |
Filtered Sequences | 0 |
Sequence length | 100 |
%GC | 42 |
Per base sequence quality
Per sequence quality scores
Per base sequence content
Per base GC content
Per sequence GC content
Per base N content
Sequence Length Distribution
Sequence Duplication Levels
Overrepresented sequences
Sequence | Count | Percentage | Possible Source |
---|---|---|---|
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT | 199520 | 0.39791881598570233 | No Hit |
GGTAGTTCACTGTAAGATGACCTCGTTCATATCCTTTCCCATCTGTGTCC | 107648 | 0.21469108211321616 | No Hit |
GATATCTCACTAAAGAGATAGCCGATCTTCCCAATAATCTTCACCAATAC | 56585 | 0.1128520258748545 | No Hit |
GATTTCATAACGGAGAATAGAGGATTGAAGGAATCTCCACAATCAACAAA | 51764 | 0.10323711703430181 | No Hit |
Kmer Content
Sequence | Count | Obs/Exp Overall | Obs/Exp Max | Max Obs/Exp Position |
---|---|---|---|---|
TTTTT | 44893920 | 4.603159 | 9.396082 | 2 |
GGTGG | 7636440 | 4.1987643 | 8.876761 | 75-79 |
AAAAA | 33685255 | 4.0047812 | 15.362864 | 95-96 |
GTAGA | 12217490 | 3.077795 | 5.6548786 | 65-69 |
TGTAG | 12441590 | 3.0428424 | 5.5109844 | 60-64 |
GTCGG | 6023890 | 2.6264553 | 5.7038784 | 45-49 |
CGGTG | 5547925 | 2.4189312 | 6.189177 | 75-79 |
GTGGT | 6598845 | 2.3842838 | 5.4463615 | 75-79 |
GGTCG | 4643850 | 2.024749 | 5.8377094 | 80-84 |
CGCCG | 4658255 | 1.9434872 | 5.8249826 | 80-84 |
AGGCA | 4324605 | 1.3146408 | 6.3336577 | 1 |
GGTAG | 3139310 | 1.1683644 | 6.0488625 | 1 |
AGGCT | 3695365 | 1.090596 | 5.6976523 | 1 |
Files created by FastQC
FastQC documentation and full attribution is here. FastQC was run by Galaxy using the rgenetics rgFastQC wrapper - see http://rgenetics.org for details and licensing
My questions are:

1) The basic statistics show that my data encoding is Sanger / Illumina 1.9. When I groom my data for downstream analysis in Galaxy, is it correct to choose "Sanger" as the input FASTQ quality score type?
2) Based on the per base sequence quality, the quality scores are above 20 for both the forward and reverse data. Do I still need to trim my data?
3) The results for "Per base sequence content", "Per base GC content" and "Sequence Duplication Levels" are fail. What do these three results indicate? What are the solutions for these problems?
4) What do the overrepresented sequences indicate? Do I need to trim off the overrepresented sequences?
5) Based on the Kmer content, how can I analyse and judge whether this is good data or not?
6) In the reverse data FastQC report, "Per sequence GC content" does not look good. What does this indicate?
7) How can I identify the adapter sequence in my RNA-seq data, and how can I remove it?
8) After grooming the data, running FastQC on it and removing adapters, are there any other pre-processing steps that need to be done before running Bowtie and TopHat?

Many thanks in advance for your kind assistance and support.

Best regards,
Ng Kiaw Kiaw
PhD student
RIKEN Yokohama Campus
Japan
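To make questions 1 and 7 more concrete, here is a minimal Python sketch of the kind of check that could be run on a sample of reads. It is only an illustration under stated assumptions, not how Galaxy or FastQC do it. The function names (`read_fastq`, `guess_phred_offset`, `count_adapter_hits`) and the 12-base seed length are hypothetical choices. The offset guess relies on the general FASTQ convention that quality characters below ASCII 59 can only occur with the Sanger (Phred+33) offset. Because it is not clear in which orientation the RPI 15 sequence would appear in the reads, the sketch looks for both the sequence as given and its reverse complement.

```python
from collections import Counter

# Adapter as quoted in this email (RPI 15); orientation in the reads is unknown.
ADAPTER = "CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def read_fastq(lines):
    """Yield (sequence, quality) pairs from FASTQ lines, 4 lines per record."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield seq, qual


def guess_phred_offset(quals):
    """Guess the quality offset from the lowest quality character seen.

    Assumption: characters below ASCII 59 imply Phred+33 (Sanger).
    Otherwise 64 is returned, although a small, uniformly high-quality
    Phred+33 sample could also land here.
    """
    lowest = min(min(map(ord, q)) for q in quals if q)
    return 33 if lowest < 59 else 64


def count_adapter_hits(seqs, adapter, seed_len=12):
    """Count reads containing the first seed_len bases of the adapter
    or of its reverse complement (exact matches only)."""
    fwd_seed = adapter[:seed_len]
    rc_seed = reverse_complement(adapter)[:seed_len]
    hits = Counter()
    for s in seqs:
        if fwd_seed in s:
            hits["forward"] += 1
        if rc_seed in s:
            hits["reverse_complement"] += 1
    return hits


if __name__ == "__main__":
    # Tiny made-up records, for illustration only (not from my data).
    demo = [
        "@r1", "ACGT" + ADAPTER[:12] + "AC", "+", "I" * 18,
        "@r2", "ACGTACGT", "+", "#5?BDFHI",
    ]
    records = list(read_fastq(demo))
    print(guess_phred_offset(q for _, q in records))
    print(count_adapter_hits((s for s, _ in records), ADAPTER))
```

On real data, the same functions could be run on the first few hundred thousand records of each FASTQ file. Exact seed matching will miss adapters with sequencing errors, so a low count here would not prove that the reads are adapter-free.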