On Tue, Jun 16, 2009 at 07:54:44PM -0700, Istvan Albert wrote:
-> On Jun 16, 10:22 am, "C. Titus Brown" <[email protected]> wrote:
->
-> > Questions & comments welcome!  Watch the github space for updates and
-> > bugfixes.
->
-> One possible issue with this approach is that it always unpacks all
-> fields, even if one has no interest in using them. Especially the
-> attribute columns are less frequently used but have a strong effect on
-> performance.
->
-> This can lead to somewhat sluggish performance - most data sources
-> distribute GFF files that happen to store a lot of attributes - but
-> all the user is interested is separating by strand or operating on
-> intervals (at least this is very common in the type of analyses that I
-> run). The parser will be substantially slower (possibly one or two
-> orders of magnitude) than just splitting manually. A quick test (6
-> attributes, 100K lines) finishes in 12 seconds vs 1 second a
-> csv.DictReader or 0.5 seconds for a csv.reader. As long as the GFF
-> files are short this is not really a problem, but for larger files it
-> will be noticeable.

OK, I've added a parse_attributes option. Turning attribute parsing off
cuts run time by about 35% (48 seconds rather than 75) for my
1-million-row GMAP input file.

cheers,
--titus
--
C. Titus Brown, [email protected]
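
For anyone following the thread, here is a rough sketch of what an option like this might look like. It is not the actual pygr code: the names iter_gff, GFFRow and parse_attribute_column are made up for illustration, and it assumes nine tab-separated columns with GFF3-style "key=value;" attributes. With parse_attributes=False the ninth column is passed through as a raw string, so the per-row cost is little more than a split on tabs.

```python
from collections import namedtuple

# Hypothetical record type; field names follow the nine standard GFF columns.
GFFRow = namedtuple(
    "GFFRow",
    "seqid source type start end score strand phase attributes",
)


def parse_attribute_column(text):
    """Split a GFF3-style 'key=value;key=value' string into a dict.

    Assumption: GFF3 syntax. GFF2/GTF files use 'key "value";' instead
    and would need a different splitter.
    """
    attrs = {}
    for field in text.strip().split(";"):
        if not field:
            continue  # tolerate a trailing ';'
        key, _, value = field.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


def iter_gff(lines, parse_attributes=True):
    """Yield one GFFRow per data line.

    When parse_attributes is False the attribute column is left as the
    raw string, skipping the most expensive per-row step.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            raise ValueError("expected 9 columns, got %d" % len(cols))
        attrs = parse_attribute_column(cols[8]) if parse_attributes else cols[8]
        yield GFFRow(
            cols[0], cols[1], cols[2],
            int(cols[3]), int(cols[4]),  # start and end as integers
            cols[5], cols[6], cols[7],
            attrs,
        )
```

Separating by strand, as in Istvan's use case, would then be something like `plus = [r for r in iter_gff(fh, parse_attributes=False) if r.strand == "+"]`, where fh is an open GFF file.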
