On Monday, 4 February 2013 at 19:30:59 UTC, Dejan Lekic wrote:
FG wrote:

On 2013-02-04 15:04, bioinfornatics wrote:
I am looking to parse a huge file efficiently, but I think D is lacking for this purpose. To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 min.

My code is maybe not easy, as it is not easy to parse a fastq file, and it is even harder when using a memory-mapped file.

Why are you using mmap? Don't you just go through the file sequentially?
In that case it should be faster to read in chunks:

     foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }

I would go even further and organise the file so that N Data objects fit in one page, then read the file page by page. The page size can easily be obtained from the system. IMHO that would beat this fastxtoolkit. :)
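For what it's worth, a rough POSIX-only sketch of that idea: query the page size via sysconf and feed it to byChunk. This is only an illustration of the suggestion, not fastxtoolkit's implementation, and "reads.fastq" is a placeholder file name.

    // Sketch only: read the file in page-sized chunks (POSIX page size).
    import std.stdio : File, writeln;
    import core.sys.posix.unistd : sysconf, _SC_PAGESIZE;

    void main()
    {
        immutable pageSize = cast(size_t) sysconf(_SC_PAGESIZE);  // typically 4096
        writeln("page size: ", pageSize);

        auto file = File("reads.fastq", "rb");          // placeholder input file
        foreach (ubyte[] page; file.byChunk(pageSize))
        {
            // ... parse the records contained in this page ...
        }
    }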

AFAIK, he is reading text data that needs to be parsed line by line, so byChunk may not be the best approach. Or at least, not the easiest approach.
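For illustration, here is a rough sketch of how line-oriented parsing could be layered on byChunk, with a carry-over buffer for lines that span chunk boundaries; processLine is a hypothetical placeholder, not code from this thread.

    import std.stdio : File;

    void processLine(const(ubyte)[] line)
    {
        // ... handle one fastq line here ...
    }

    void parseByChunk(string path, size_t chunkSize = 64 * 1024)
    {
        auto file = File(path, "rb");
        ubyte[] leftover;
        foreach (ubyte[] buffer; file.byChunk(chunkSize))
        {
            auto data = leftover ~ buffer;        // prepend the unfinished line
            size_t start = 0;
            foreach (i, b; data)
            {
                if (b == '\n')
                {
                    processLine(data[start .. i]);
                    start = i + 1;
                }
            }
            leftover = data[start .. $].dup;      // keep the incomplete tail
        }
        if (leftover.length)
            processLine(leftover);                // last line without trailing '\n'
    }

The carry-over buffer is exactly the bookkeeping that byLine does for you, which is why byChunk is faster but not the easiest approach.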

I'm just wondering if maybe the reason the D code is slow is simply because of:
- unicode.
- front + popFront.

Ranges in D are "notorious" for being slow when iterating over text, due to the "double decode": front decodes the current code point, and popFront then has to work out its length again just to advance.

If you are *certain* that the file contains nothing but ASCII (which should be the case for fastq, right?), you can get more bang for your buck if you attempt to iterate over it as an array of bytes, and convert the bytes to char on the fly, bypassing any and all unicode processing.
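Something along these lines (a minimal, ASCII-only sketch, not a full fastq parser; the file name and the '@' counting are purely illustrative):

    import std.stdio : File, writefln;

    void main()
    {
        auto file = File("reads.fastq", "rb");      // placeholder file name
        size_t markers;
        foreach (ubyte[] buffer; file.byChunk(64 * 1024))
        {
            foreach (b; buffer)                     // plain bytes: no UTF-8 decoding
            {
                immutable c = cast(char) b;         // safe if the data is pure ASCII
                if (c == '@')                       // '@' opens a fastq header line
                    ++markers;                      // naive: '@' can also occur in quality strings
            }
        }
        writefln("saw %s '@' bytes", markers);
    }

For strings already in memory, std.string.representation gives the same effect by exposing the underlying immutable(ubyte)[].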
