Re: parsing fastq files with D

2016-03-29 Thread rikki cattermole via Digitalmars-d-learn

On 30/03/2016 1:46 AM, eastanon wrote:

On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole wrote:

void popFront() {
import std.string : indexOf;

if (source is null) {
isEmpty = true;
return;
}

void tidyInput() {
foreach(i, c; source) {
switch(c) {
case 0: .. case ' ':
break;
default:
source = source[i .. $];
return;
}
}

source = null;
}

tidyInput();


Do you mind to explain  what is really going on at popFront() and
tidyInput() functions?


popFront is a special function used with input ranges.
It has support in the language for e.g. foreach loops.
You also need empty and front to go along with it however.
With front returning a value and empty if there is any more items.

The if statement in popFront is just to make empty set correctly (delayed).

Now tidyInput basically just trims (on the left) the source and sets it 
to null if there is no more of it.

That way things like "   @Myid" will still work fine.

If you need an explanation about the rest of popFront, let me know!


Re: parsing fastq files with D

2016-03-29 Thread eastanon via Digitalmars-d-learn
On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole 
wrote:

void popFront() {
import std.string : indexOf;

if (source is null) {
isEmpty = true;
return;
}

void tidyInput() {
foreach(i, c; source) {
switch(c) {
case 0: .. case ' ':
break;
default:
source = 
source[i .. $];
return;
}
}

source = null;
}

tidyInput();


Do you mind to explain  what is really going on at popFront() and 
tidyInput() functions?


Re: parsing fastq files with D

2016-03-24 Thread eastanon via Digitalmars-d-learn

On Thursday, 24 March 2016 at 13:38:32 UTC, Marc Schütz wrote:
Yes, it's read into your processes memory. You can use 
std.mmfile [1] to make things a bit more efficient. It will, 
too, read the data into memory, but it will do so in a way 
(memory mapping) that only loads what is actually accessed 
(everything in your case), and that allows the operating system 
to efficiently release and reload parts of it if memory runs 
low.


Unfortunately there is no example in the documentation, but it 
works like this (untested):


import std.mmfile;
auto file = new MmFile(inputfile);
string text = cast(string) file[];
...

[1] http://dlang.org/phobos/std_mmfile.html


That is very clever. Thank you for the tip and I have implemented 
it and it works.  I feel like it is a safer way of reading a file.


Re: parsing fastq files with D

2016-03-24 Thread Marc Schütz via Digitalmars-d-learn

On Thursday, 24 March 2016 at 08:24:15 UTC, eastanon wrote:
On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole 
wrote:

As a little fun thing to do I implemented it for you.

It won't allocate. Making this perfect for you.
With a bit of work you could make Result have buffers for 
result instead of using the input array allow for the source 
to be an input range itself.


I made this up on dpaste and single quotes were not playing 
nicely there. So you'll see "\r"[0] as a workaround.


Thank you very much. I think you have exposed me to  a number 
of new concepts that I will go through and annotate the code 
with.  I read all input from file as follows.


string text = cast(string)std.file.read(inputfile);
foreach(record;FastQRecord.parse(text)){
   writeln(record);
}

Does this mean that text is allocated to 
memory? and is there a better way to read and process the 
inputfile? 


Yes, it's read into your processes memory. You can use std.mmfile 
[1] to make things a bit more efficient. It will, too, read the 
data into memory, but it will do so in a way (memory mapping) 
that only loads what is actually accessed (everything in your 
case), and that allows the operating system to efficiently 
release and reload parts of it if memory runs low.


Unfortunately there is no example in the documentation, but it 
works like this (untested):


import std.mmfile;
auto file = new MmFile(inputfile);
string text = cast(string) file[];
...

[1] http://dlang.org/phobos/std_mmfile.html


Re: parsing fastq files with D

2016-03-24 Thread rikki cattermole via Digitalmars-d-learn

On 24/03/16 9:24 PM, eastanon wrote:

On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole wrote:

As a little fun thing to do I implemented it for you.

It won't allocate. Making this perfect for you.
With a bit of work you could make Result have buffers for result
instead of using the input array allow for the source to be an input
range itself.

I made this up on dpaste and single quotes were not playing nicely
there. So you'll see "\r"[0] as a workaround.


Thank you very much. I think you have exposed me to  a number of new
concepts that I will go through and annotate the code with. I read all
input from file as follows.

string text = cast(string)std.file.read(inputfile);
foreach(record;FastQRecord.parse(text)){
writeln(record);
}

Does this mean that text is allocated to memory? and is
there a better way to read and process the inputfile? 


Yup, any string usage gets allocated in memory. Since the way I designed 
parse there, it won't allocate during returning of a record. Just don't 
go around modifying that memory or keeping copies (not critical but I 
wouldn't).


There are better ways to read the input file. For example byChunk would 
read it in smaller groups of say 4096 chars. But you would need to 
overload that with support for e.g. lines parsing.


The way I'd go is memory mapped files as an input range.
After all, let the OS help you out with deciding when to put the file 
and parts of it into memory.


After all, if you have 1gb files you want to parse. Could you just as 
easily have 2tb worth? Probably since it is DNA after all.

You kinda don't want that all in memory at once if you know what I mean.

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus



Re: parsing fastq files with D

2016-03-24 Thread eastanon via Digitalmars-d-learn
On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole 
wrote:

As a little fun thing to do I implemented it for you.

It won't allocate. Making this perfect for you.
With a bit of work you could make Result have buffers for 
result instead of using the input array allow for the source to 
be an input range itself.


I made this up on dpaste and single quotes were not playing 
nicely there. So you'll see "\r"[0] as a workaround.


Thank you very much. I think you have exposed me to  a number of 
new concepts that I will go through and annotate the code with.  
I read all input from file as follows.


string text = cast(string)std.file.read(inputfile);
foreach(record;FastQRecord.parse(text)){
   writeln(record);
}

Does this mean that text is allocated to memory? 
and is there a better way to read and process the inputfile? 







Re: parsing fastq files with D

2016-03-24 Thread rikki cattermole via Digitalmars-d-learn

As a little fun thing to do I implemented it for you.

It won't allocate. Making this perfect for you.
With a bit of work you could make Result have buffers for result instead 
of using the input array allow for the source to be an input range itself.


I made this up on dpaste and single quotes were not playing nicely 
there. So you'll see "\r"[0] as a workaround.


struct FastQRecord {
const(char)[] sequenceId;
const(char)[] sequenceLetters;
const(char)[] quality;

static auto parse(const(char)[] from) {
struct Result {
private {
const(char)[] source;
FastQRecord value;
bool isEmpty;
}

this(const(char)[] source) {
this.source = source;
popFront;
}

@property {
FastQRecord front() {
return value;
}

bool empty() {
return isEmpty;
}
}

void popFront() {
import std.string : indexOf;

if (source is null) {
isEmpty = true;
return;
}

void tidyInput() {
foreach(i, c; source) {
switch(c) {
case 0: .. case ' ':
break;
default:
source = 
source[i .. $];
return;
}
}

source = null;
}

tidyInput();

if (source is null)
return;

// sequenceId

assert(source[0] == '@');

ptrdiff_t len = source.indexOf("\n");
assert(len > 0);

value.sequenceId = source[1 .. len];
if (value.sequenceId[$-1] == "\r"[0])
value.sequenceId = value.sequenceId[0 
.. $-1];

source = source[len + 1 .. $];

// sequenceLetters

len = source.indexOf("\n");
assert(len > 0);

value.sequenceLetters = source[0 .. len];
if (value.sequenceLetters[$-1] == "\r"[0])
value.sequenceLetters = 
value.sequenceLetters[0 .. $-1];

source = source[len + 1 .. $];

// +sequenceId

len = source.indexOf("\n");
assert(len > 0);
source = source[len + 1 .. $];

// quality

len = source.indexOf("\n");
assert(len > 0);

value.quality = source[0 .. len];
if (value.quality[$-1] == "\r"[0])
value.quality = value.quality[0 .. $-1];

if (source.length > len + 1) {
source = source[len + 1 .. $];
  

parsing fastq files with D

2016-03-23 Thread eastanon via Digitalmars-d-learn
Fastq is a format for storing DNA sequences together with the 
associated quality information often encoded in ascii characters. 
It is typically made of 4 lines  for example 2 fastq entries 
would look like this.


@seq1
TTAAAT
+
?+BBB/DHH@
@seq2
GACCCTTTGCA
+
?+BHB/DIH@

I do not have a lot of D expirience and I am writing a simple 
parser to help work with  these files. Ideally it should be fast 
with low memory footprint. I am working with very large files of 
this type and can be  up to 1GB.


module fastq;

import std.stdio;
import std.file;
import std.exception;
import std.algorithm;
import std.string;

struct Record{

string sequence;
string quals;
string name;
}

auto Records(string filename){

static auto toRecords(S)(S str){

auto res = findSplitBefore(str,"+\n");

auto seq = res[0];
auto qual = res[1];

return Record(seq,qual);
}

string text = cast(string)std.file.read(filename);

enforce(text.length > 0 && text[0] == '@');
text = text[1 .. $];

auto entries = splitter(text,'@');

return map!toRecords(entries);
}

The issue with this is that the "+" character can be part of the 
quality information and I am using it to split the quality 
information from the sequence information. and ends up splitting 
the quality information which is wrong.
Ideally I do not want to use regex and I have heard of ragel for 
parsing but never used it. Such a solution would also be welcome, 
since I read it can be very fast.


Which is the idiomatic way to capture, sequence name (starts with 
@ character and the first entry) the sequence, (line2) the 
quality scores( line 4)