On Jun 21, 11:15 am, "C. Titus Brown" <[email protected]> wrote:

> OK, I've added a parse_attributes option.  It yields about a 35 point
> performance gain (48 seconds rather than 75) for my 1m-row GMAP input
> file.

Hi All,

The performance of GFF parsing interests me a great deal, as it is a
file type that users upload into the application that I am writing.
But up till now I had no baseline to help me understand just what
fast parsing would look like, and what we could expect from Python.

So I wrote a few lines of code to compare the performance. The job was
to parse a 250K line GFF3 file, select the lines that contain PCR
products then extract and sum up the attribute for amplification (each
of the attribute columns had 6 attributes and I needed one of them).
For both Python and C, simply reading this file line by line takes
around 200 msec.
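
For readers who want to picture the task, here is a minimal sketch of
a problem-specific version using the csv module. It is not the code
that was timed. The feature type "PCR_product", the attribute key
"amplification", and the assumption that its values are integers are
all illustrative guesses; sum_attribute is a hypothetical name.

```python
import csv
import io


def sum_attribute(handle, feature_type="PCR_product", key="amplification"):
    """Sum one numeric attribute over all rows of a given feature type.

    feature_type and key are assumed names, chosen for illustration.
    """
    total = 0
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, comments/directives and malformed rows.
        if len(row) != 9 or row[0].startswith("#"):
            continue
        # Column 3 of a GFF3 row is the feature type.
        if row[2] != feature_type:
            continue
        # Column 9 holds "key=value" pairs separated by semicolons.
        for pair in row[8].split(";"):
            name, _, value = pair.partition("=")
            if name == key:
                total += int(value)  # assumes integer values
                break
    return total


# Tiny in-memory example; a real run would pass open(path, newline="").
sample = (
    "##gff-version 3\n"
    "chr1\tlab\tPCR_product\t100\t200\t.\t+\t.\tID=p1;amplification=7;Name=a\n"
    "chr1\tlab\tgene\t100\t900\t.\t+\t.\tID=g1;amplification=99\n"
    "chr1\tlab\tPCR_product\t300\t400\t.\t-\t.\tID=p2;amplification=5\n"
)
print(sum_attribute(io.StringIO(sample)))  # prints 12
```

The point of the sketch is that only the one attribute of interest is
pulled out of column 9, and only for the rows that match, so very
little work is done per line.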

For the actual code there are three versions. The first is written
fully in C. The second uses the Python csv module; it is not a generic
parser but is written for this problem specifically. The third uses
this generic GFF parser (see 
http://temp.atlas.bx.psu.edu/temp/).
Now for the results:

Result GCC=13180 in 420 msec
Result CSV=13180 in 1281 msec
Result GFF=13180 in 30450 msec

If we take C as the baseline, then Python with the csv module is about
3 times slower, whereas the GFF parser is about 72 times slower.
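
As a quick check of that arithmetic, the ratios can be recomputed from
the three timings reported above (the variable names are just for
illustration):

```python
# Timings in msec, copied from the results above.
timings_msec = {"GCC": 420, "CSV": 1281, "GFF": 30450}

baseline = timings_msec["GCC"]
# Slowdown of each version relative to the C baseline.
slowdown = {name: t / baseline for name, t in timings_msec.items()}

for name, factor in slowdown.items():
    print(f"{name}: {factor:.2f}x")
```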

One of the lessons that I take away from this is something that has
been brewing in the back of my mind for a while. Python is not that
well suited for building generic tools when it comes to large-scale
data analysis. There is a substantial overhead for a lot of the
internal processes. Usually these are a small part of the overall
runtime, but they can get glaringly large when, say, we incur them for
each line of a file. In these cases (sadly) we may be better off
continuously reinventing the wheel. It is a somewhat somber
conclusion, because I really like how the GFF parser takes care of
everything.

cheers,

Istvan


