FASTQ is a format for storing DNA sequences together with their associated quality information, which is usually encoded as ASCII characters. Each entry is typically made of 4 lines. For example, 2 FASTQ entries would look like this:

@seq1
TTATTTTAAT
+
?+BBB/DHH@
@seq2
GACCCTTTGCA
+
?+BHB/DIH@

I do not have a lot of D experience, and I am writing a simple parser to help work with these files. Ideally it should be fast and have a low memory footprint. The files I work with are very large and can be up to 1GB.

module fastq;

import std.stdio;
import std.file;
import std.exception;
import std.algorithm;
import std.string;

struct Record{

    string sequence;
    string quals;
    string name;
}

auto Records(string filename){

    static auto toRecords(S)(S str){

        auto res = findSplitBefore(str,"+\n");

        auto seq = res[0];
        auto qual = res[1];

        return Record(seq,qual);
    }

    string text = cast(string)std.file.read(filename);

    enforce(text.length > 0 && text[0] == '@');
    text = text[1 .. $];

    auto entries = splitter(text,'@');

    return map!toRecords(entries);
}

The issue with this is that the "+" character can also be part of the quality information, and I am using it to split the quality information from the sequence information. As a result, the quality string itself can end up being split, which is wrong. Ideally I do not want to use regex. I have heard of Ragel for parsing but have never used it. Such a solution would also be welcome, since I read that it can be very fast.

What is the idiomatic way to capture the sequence name (line 1, which starts with the @ character), the sequence (line 2), and the quality scores (line 4)?
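
For reference, one possible direction is to stop treating "+" (or "@") as a delimiter at all and rely on line position instead. The following is only a minimal sketch, assuming every record is exactly four lines (no wrapped sequences) and that the file ends after a complete record. FastqRange is a hypothetical name, not something from Phobos. It reads one record at a time with File.readln, so the whole file is never held in memory.

```d
import std.exception : enforce;
import std.stdio : File;
import std.string : chomp;

struct Record
{
    string name;
    string sequence;
    string quals;
}

/// Hypothetical lazy input range over the records of a FASTQ file.
/// Assumption: each record occupies exactly four lines.
struct FastqRange
{
    private File file;
    private Record current;
    private bool done;

    this(string filename)
    {
        file = File(filename, "r");
        popFront(); // load the first record
    }

    @property bool empty() { return done; }
    @property Record front() { return current; }

    void popFront()
    {
        // readln returns an empty string at end of file.
        string header = file.readln();
        if (header.length == 0)
        {
            done = true;
            return;
        }
        header = header.chomp;
        enforce(header.length > 0 && header[0] == '@',
                "expected '@' at the start of a record");

        // Lines 2 to 4: sequence, separator, quality scores.
        string seq = file.readln().chomp;
        string plus = file.readln().chomp;
        string qual = file.readln().chomp;
        enforce(plus.length > 0 && plus[0] == '+',
                "expected '+' separator line");

        // Only line position decides what is quality data, so '+' and
        // '@' inside the quality string are harmless.
        current = Record(header[1 .. $], seq, qual);
    }
}
```

Something like foreach (rec; FastqRange("reads.fq")) would then visit the records one by one, where "reads.fq" is a placeholder path. Each readln call allocates a fresh string, so the fields of a Record stay valid after the range advances. That costs one allocation per line, which may or may not matter for 1GB inputs.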
