On Jun 21, 11:15 am, "C. Titus Brown" <[email protected]> wrote:
> OK, I've added a parse_attributes option. It yields about a 35 point
> performance gain (48 seconds rather than 75) for my 1m-row GMAP input
> file.

Hi All,

The performance of GFF parsing interests me a great deal, as it is a file type that users upload into the application that I am writing. But up till now I had no baseline to help me understand just what fast parsing would look like, and what we could expect from Python.

So I wrote a few lines of code to compare the performance. The job was to parse a 250K-line GFF3 file, select the lines that contain PCR products, then extract and sum up the attribute for amplification (each of the attribute columns had 6 attributes and I needed one of them).

Now, for both Python and C, simply reading this file line by line takes around 200 msec.

For the actual code there are three versions. One is written fully in C. The second uses the Python CSV module, but it is not a generic parser; it is written for this problem specifically. And finally there is a version that uses this generic GFF parser (see http://temp.atlas.bx.psu.edu/temp/).

Now for the results:

Result GCC=13180 in 420 msec
Result CSV=13180 in 1281 msec
Result GFF=13180 in 30450 msec

If we take C as the baseline, then Python with the CSV module is 3 times slower, whereas the GFF parser is 72 times slower.

One of the lessons that I take away from this is something that has been brewing in the back of my mind for a while. Python is not that well suited for building generic tools when it comes to large-scale data analysis. There is a substantial overhead for a lot of the internal processes. Usually these are a small part of the overall runtime, but they can get glaringly large when we, say, hit a file for each line etc. In these cases (sadly) we may be better off continuously reinventing the wheel.

A little bit of a somber conclusion, because I really like how the GFF parser takes care of everything.
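
To make the "problem-specific" CSV approach concrete, here is a minimal sketch of what such a version could look like. This is not the code that was timed above. The feature type name "PCR_product", the attribute key "amplification", the integer values, and the function name sum_attribute are all assumptions made for illustration; only the 9-column tab-separated layout and the ';'-separated key=value attribute column come from the GFF3 format itself.

```python
import csv
import io


def sum_attribute(lines, feature_type="PCR_product", key="amplification"):
    """Sum one integer attribute over all rows of one feature type.

    `lines` is any iterable of text lines (an open file works).
    The default feature type and attribute key are hypothetical.
    """
    total = 0
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, comments/directives, and malformed rows.
        if len(row) != 9 or row[0].startswith("#"):
            continue
        # Column 3 is the feature type; ignore everything else early.
        if row[2] != feature_type:
            continue
        # Column 9 holds ';'-separated key=value pairs. Only the one
        # key we need is looked at; no attribute dict is ever built.
        for pair in row[8].split(";"):
            name, _, value = pair.partition("=")
            if name.strip() == key:
                total += int(value)
                break
    return total


# Tiny made-up input, just to show the call.
SAMPLE = (
    "##gff-version 3\n"
    "chr1\tsrc\tPCR_product\t1\t100\t.\t+\t.\tID=p1;amplification=3\n"
    "chr1\tsrc\tgene\t1\t500\t.\t+\t.\tID=g1;amplification=7\n"
    "chr1\tsrc\tPCR_product\t200\t300\t.\t-\t.\tID=p2;amplification=4\n"
)

if __name__ == "__main__":
    print("Result CSV=%d" % sum_attribute(io.StringIO(SAMPLE)))
```

The point of the sketch is where the savings come from: rows of the wrong type are dropped before the attribute column is touched, and no per-row objects or attribute dictionaries are created, which is the kind of work a generic parser has to do for every line.
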
cheers,
Istvan
