OK, I have a better idea. We can simply restrict this reindexing behavior to the specific operation of looking up IDs during a BLAST search. We only implemented this behavior to deal with BLAST's buggy mangling of sequence IDs, so there's no need to apply it in other situations. If it is not applied at any other time, looking up an ID that isn't in the database will simply fail (KeyError), with no delay.
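
To make the idea concrete, here is a minimal sketch of the split, not the actual Pygr code. The class names (PlainDB, IDIndexSketch), the rebuilds counter, and the "split on '|'" alias rule are all placeholders I made up for illustration; the real mangling that blastall does may well differ.

```python
class PlainDB:
    """Hypothetical stand-in for the plain database: exact-match lookup only."""

    def __init__(self, seqs):
        self._seqs = dict(seqs)

    def __getitem__(self, seq_id):
        # A miss raises KeyError immediately; no reindexing is ever triggered here.
        return self._seqs[seq_id]

    def __iter__(self):
        return iter(self._seqs)


class IDIndexSketch:
    """Hypothetical stand-in for the reindexing wrapper used only on BLAST result IDs."""

    def __init__(self, db):
        self.db = db
        self._index = None   # built lazily, on the first ID that fails an exact match
        self.rebuilds = 0    # counter kept only to show when indexing happens

    def _build(self):
        # ASSUMPTION: crude alias rule; every '|'-separated field maps to the full ID.
        index = {}
        for full_id in self.db:
            for part in full_id.split('|'):
                if part:
                    index.setdefault(part, full_id)
        self._index = index
        self.rebuilds += 1

    def __getitem__(self, blast_id):
        try:
            return self.db[blast_id]      # fast path: the ID was not mangled
        except KeyError:
            pass
        if self._index is None:
            self._build()                 # the slow step, paid once at most
        return self.db[self._index[blast_id]]  # still KeyError if truly unknown


db = PlainDB({'gi|123|ref|NP_1': 'ACGT', 'seq2': 'TTTT'})
blast_ids = IDIndexSketch(db)
```

In this sketch db['NP_1'] fails right away with KeyError, while blast_ids['NP_1'] builds the alias index once and then resolves to the full ID. Only the code that processes BLAST results would ever go through the wrapper.
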
I renamed the reindexing class from BlastDB to BlastIDIndex. It is now only used for looking up IDs while processing BLAST results in process_blast(). I renamed BlastDBbase to be the new BlastDB. Reindexing will never happen in normal usage; only when actually processing BLAST results. This resolves Issue 49.

Questions:

- Should we do the initial reindexing at the same time as the formatdb step? This might reduce user annoyance, since users expect formatdb to take some time to reindex the database.

- Should we print out a warning message explaining that we're reindexing the BLAST database? This might also reduce user annoyance / confusion, by clearing up the mystery of "why is Pygr so slow?".

- Should we allow the user to turn off reindexing (which means that BLAST will not work on NCBI databases with "mangled blob" IDs)?

- Can we auto-detect whether reindexing is needed (i.e. detect whether the sequence IDs are blobs that blastall will mangle)? Then we could dispense with it completely on non-NCBI databases (or more specifically, databases whose IDs blastall won't mangle).
