parsing fastq files with D

eastanon via Digitalmars-d-learn Wed, 23 Mar 2016 22:46:59 -0700

FASTQ is a format for storing DNA sequences together with their associated quality information, which is usually encoded as ASCII characters. Each entry typically consists of 4 lines. For example, 2 FASTQ entries would look like this:


@seq1
TTATTTTAAT
+
?+BBB/DHH@
@seq2
GACCCTTTGCA
+
?+BHB/DIH@

I do not have a lot of D experience and I am writing a simple parser to help work with these files. Ideally it should be fast with a low memory footprint. I am working with very large files of this type, which can be up to 1GB.


module fastq;

import std.stdio;
import std.file;
import std.exception;
import std.algorithm;
import std.string;

struct Record{

    string sequence;
    string quals;
    string name;
}

auto Records(string filename){

    static auto toRecords(S)(S str){

        auto res = findSplitBefore(str,"+\n");

        auto seq = res[0];
        auto qual = res[1];

        return Record(seq,qual);
    }

    string text = cast(string)std.file.read(filename);

    enforce(text.length > 0 && text[0] == '@');
    text = text[1 .. $];

    auto entries = splitter(text,'@');

    return map!toRecords(entries);
}

The issue with this is that the "+" character can be part of the quality information, and I am using it to split the quality information from the sequence information. It therefore ends up splitting the quality information itself, which is wrong. Ideally I do not want to use regex. I have heard of Ragel for parsing but have never used it. Such a solution would also be welcome, since I read it can be very fast.

What is the idiomatic way to capture the sequence name (the first line of an entry, starting with the @ character), the sequence (line 2) and the quality scores (line 4)?
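One possible direction, as a minimal sketch rather than a definitive answer: since "+" can appear inside a quality line, and "@" can too (both quality lines in the example above end with it), fields can be picked by line position instead of by splitting on those characters. The sketch assumes every entry is exactly 4 lines with no wrapped sequences. FastqRecords and fastqRecords are hypothetical names introduced here for illustration, and Record keeps the field layout from the code above.

```d
import std.conv : to;
import std.exception : enforce;
import std.range.primitives : empty, front, popFront;
import std.stdio : writeln;
import std.string : lineSplitter;

struct Record
{
    string sequence;
    string quals;
    string name;
}

// Hypothetical helper: wraps any input range of lines and lazily
// yields one Record per group of four lines.
struct FastqRecords(Lines)
{
    private Lines lines;
    private Record current;
    private bool done;

    this(Lines lines)
    {
        this.lines = lines;
        popFront(); // load the first record, if there is one
    }

    @property bool empty() const { return done; }
    @property Record front() { return current; }

    void popFront()
    {
        if (lines.empty)
        {
            done = true;
            return;
        }
        auto header = nextLine();
        enforce(header.length > 0 && header[0] == '@',
                "expected '@' at the start of a record header");
        current.name = header[1 .. $];
        current.sequence = nextLine();
        auto separator = nextLine();
        enforce(separator.length > 0 && separator[0] == '+',
                "expected a '+' separator line");
        // Taken by position, so '+' or '@' inside the scores is harmless.
        current.quals = nextLine();
    }

    private string nextLine()
    {
        enforce(!lines.empty, "truncated FASTQ record");
        // to!string copies when the range hands out a reused char[] buffer.
        auto line = lines.front.to!string;
        lines.popFront();
        return line;
    }
}

auto fastqRecords(Lines)(Lines lines)
{
    return FastqRecords!Lines(lines);
}

void main()
{
    // The two entries from the example above.
    auto text = "@seq1\nTTATTTTAAT\n+\n?+BBB/DHH@\n"
              ~ "@seq2\nGACCCTTTGCA\n+\n?+BHB/DIH@\n";
    foreach (rec; fastqRecords(text.lineSplitter))
        writeln(rec.name, " ", rec.sequence, " ", rec.quals);
}
```

For a large file, the same helper could be fed File("reads.fastq").byLine (the file name is a placeholder), which reads one line at a time instead of loading the whole file into memory the way std.file.read does. Whether this is fast enough for files around 1GB would need measuring.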
