Dear Galaxy Officer, 

Good day. 

I am a new user of the Galaxy main server. The tools provided are very user-friendly; thank you for setting them up.

I am new to RNA-seq analysis and am still learning bioinformatics.

I would like to ask about the FastQC reports generated from my data.

For your information: 

Samples: Plant (dicotyledon)

Type of data: RNA-seq (Illumina HiSeq 2000 with CASAVA v 1.8.2) 

Paired ends

Adapter sequence: RPI 15 ( 5’ CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA)

Main purpose of my analysis: Identification of novel transcripts and gene expression studies

I ran FastQC on my raw RNA-seq data, both forward and reverse reads. The FastQC reports are included in this email.

FastQC Report: D1TB3ACXX_PR0055_02A15_H1_L007_R1.fastq
Mon 22 Jul 2013

[OK] Basic Statistics

Measure             Value
Filename            D1TB3ACXX_PR0055_02A15_H1_L007_R1.fastq
File type           Conventional base calls
Encoding            Sanger / Illumina 1.9
Total Sequences     50140881
Filtered Sequences  0
Sequence length     100
%GC                 43
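
For context on the Encoding line: Sanger and Illumina 1.8+ FASTQ files store qualities as Phred+33, meaning the ASCII code of each quality character minus 33 is the Phred score. The Python sketch below illustrates that decoding. The function names and the example quality string are hypothetical, and the offset guess is only a heuristic, not what Galaxy or FastQC actually run.

```python
PHRED33 = 33  # Sanger and Illumina 1.8+ offset
PHRED64 = 64  # older Illumina (1.3 to 1.7) offset


def guess_offset(quality_lines):
    """Guess the Phred offset from the lowest quality character seen.

    Characters below ';' (ASCII 59) cannot occur in +64 encodings,
    so they imply Phred+33. If only high characters are present the
    answer is ambiguous and this simple heuristic falls back to 64.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    return PHRED33 if lowest < 59 else PHRED64


def decode(quality, offset=PHRED33):
    """Convert one FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


# Hypothetical quality string, not taken from the real data.
example = "#5?I"
print(guess_offset([example]))  # offset implied by the characters
print(decode(example))          # per-base Phred scores
```

If the decoded scores of real reads fall in a plausible range (roughly 0 to 41) with offset 33, that is consistent with choosing "Sanger" as the quality score type.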

[OK] Per base sequence quality

[OK] Per sequence quality scores

[FAIL] Per base sequence content

[FAIL] Per base GC content

[OK] Per sequence GC content

[OK] Per base N content

[OK] Sequence Length Distribution

[FAIL] Sequence Duplication Levels

[OK] Overrepresented sequences

No overrepresented sequences

[WARN] Kmer Content

Sequence  Count     Obs/Exp Overall  Obs/Exp Max  Max Obs/Exp Position
CTCCA     9595585   3.510654         6.3710628    60-64
TTCTC     12116685  3.2885883        5.2676396    45-49
TCTCG     10716155  3.2443743        5.334085     80-84
CCAAG     11237210  3.0798802        5.2831144    55-59
CTTCT     11243660  3.0516405        5.6570544    95-96
GCCAA     11103775  3.0433087        5.249829     55-59
TCTTC     10632210  2.8856869        5.412275     90-94
TGCCA     9846895   2.8365011        5.120373     50-54
CTGCT     9346185   2.8296084        6.1322436    95-96
ACTCC     7230155   2.6452348        5.5269346    60-64
AAAAA     24620425  2.5958595        8.515192     95-96
CCGTC     5590090   2.3977835        6.3717356    85-89
GTGCC     7063720   2.3855677        5.072632     50-54
TCTGC     7538220   2.282237         5.32954      95-96

FastQC Report: D1TB3ACXX_PR0055_02A15_H1_L007_R2.fastq
Mon 22 Jul 2013

[OK] Basic Statistics

Measure             Value
Filename            D1TB3ACXX_PR0055_02A15_H1_L007_R2.fastq
File type           Conventional base calls
Encoding            Sanger / Illumina 1.9
Total Sequences     50140881
Filtered Sequences  0
Sequence length     100
%GC                 42

[OK] Per base sequence quality

[OK] Per sequence quality scores

[FAIL] Per base sequence content

[OK] Per base GC content

[WARN] Per sequence GC content

[OK] Per base N content

[OK] Sequence Length Distribution

[FAIL] Sequence Duplication Levels

[WARN] Overrepresented sequences

Sequence                                            Count   Percentage           Possible Source
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT  199520  0.39791881598570233  No Hit
GGTAGTTCACTGTAAGATGACCTCGTTCATATCCTTTCCCATCTGTGTCC  107648  0.21469108211321616  No Hit
GATATCTCACTAAAGAGATAGCCGATCTTCCCAATAATCTTCACCAATAC  56585   0.1128520258748545   No Hit
GATTTCATAACGGAGAATAGAGGATTGAAGGAATCTCCACAATCAACAAA  51764   0.10323711703430181  No Hit
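
As a sanity check on the Percentage column, each value appears to be the count divided by Total Sequences (50140881), multiplied by 100. The Python sketch below uses only numbers from the report above; the dictionary keys are abbreviated labels (the first 10 bases), not the full sequences.

```python
# Total Sequences from the Basic Statistics of the R2 report.
TOTAL_SEQUENCES = 50140881

# Counts from the Overrepresented sequences table; keys are
# abbreviated labels (first 10 bases), not the full sequences.
OVERREPRESENTED_COUNTS = {
    "TTTTTTTTTT...": 199520,
    "GGTAGTTCAC...": 107648,
    "GATATCTCAC...": 56585,
    "GATTTCATAA...": 51764,
}


def percent_of_total(count, total=TOTAL_SEQUENCES):
    """Return count as a percentage of the total number of reads."""
    return 100.0 * count / total


for label, count in OVERREPRESENTED_COUNTS.items():
    print(f"{label}  {count:>7}  {percent_of_total(count):.4f}%")

# Share of all reads taken up by the four sequences together.
combined = percent_of_total(sum(OVERREPRESENTED_COUNTS.values()))
print(f"combined: {combined:.4f}%")
```

Taken together, the four flagged sequences account for under 1% of the reverse reads.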

[WARN] Kmer Content

Sequence  Count     Obs/Exp Overall  Obs/Exp Max  Max Obs/Exp Position
TTTTT     44893920  4.603159         9.396082     2
GGTGG     7636440   4.1987643        8.876761     75-79
AAAAA     33685255  4.0047812        15.362864    95-96
GTAGA     12217490  3.077795         5.6548786    65-69
TGTAG     12441590  3.0428424        5.5109844    60-64
GTCGG     6023890   2.6264553        5.7038784    45-49
CGGTG     5547925   2.4189312        6.189177     75-79
GTGGT     6598845   2.3842838        5.4463615    75-79
GGTCG     4643850   2.024749         5.8377094    80-84
CGCCG     4658255   1.9434872        5.8249826    80-84
AGGCA     4324605   1.3146408        6.3336577    1
GGTAG     3139310   1.1683644        6.0488625    1
AGGCT     3695365   1.090596         5.6976523    1

My questions are: 

1) The basic statistics show that my data encoding is Sanger / Illumina 1.9. When I groom my data for downstream analysis in Galaxy, is it correct to choose "Sanger" as the input FASTQ quality score type?

2) Based on the per base sequence quality, the quality scores are above 20 for both the forward and reverse data. Do I still need to trim my data?

3) The results for "Per base sequence content", "Per base GC content" and "Sequence Duplication Levels" are FAIL. What do these three results indicate, and how can these problems be solved?

4) What do the overrepresented sequences indicate? Do I need to trim them off?

5) Based on the Kmer content, how can I analyse and judge whether the data are good or not?

6) In the FastQC report for the reverse data, "Per sequence GC content" does not look good. What does this indicate?

7) How can I identify the adapter sequence in my RNA-seq data, and how can I remove it?

8) After grooming the data, running FastQC and removing adapters, are there any other pre-processing steps needed before running Bowtie and TopHat?
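
One way to start on question 7 is to compare the reverse complement of the adapter with the k-mers FastQC reported for the forward reads. The Python sketch below is a rough illustration that uses only the adapter and k-mers quoted in this email. The helper names are hypothetical, and a plain substring match like this is only a hint, not the kind of evidence a dedicated adapter-trimming tool would give.

```python
# Adapter (RPI 15) as quoted earlier in this email, 5' to 3'.
ADAPTER = "CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"

# K-mers listed in the Kmer Content table of the R1 report.
R1_KMERS = [
    "CTCCA", "TTCTC", "TCTCG", "CCAAG", "CTTCT", "GCCAA", "TCTTC",
    "TGCCA", "CTGCT", "ACTCC", "AAAAA", "CCGTC", "GTGCC", "TCTGC",
]

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers_found_in(kmers, reference):
    """Return the k-mers that occur as substrings of reference."""
    return [kmer for kmer in kmers if kmer in reference]


adapter_rc = reverse_complement(ADAPTER)
hits = kmers_found_in(R1_KMERS, adapter_rc)
missing = [kmer for kmer in R1_KMERS if kmer not in hits]

print(adapter_rc)
print(f"{len(hits)} of {len(R1_KMERS)} k-mers occur in the adapter reverse complement")
print("not found:", missing)
```

With these inputs, 13 of the 14 k-mers (all except AAAAA) occur in the reverse complement of the adapter, which may point to adapter read-through in some of the forward reads.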

Many thanks in advance for your kind assistance and support.

Best regards
Ng Kiaw Kiaw
PhD student
RIKEN Yokohama Campus
Japan.