Good suggestion ;) However, I am not familiar with Groovy. I'll look for something similar in Java.
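For anyone looking for the Java counterpart of the parallel-parsing idea quoted below, here is a minimal sketch using only the standard `java.util.concurrent` package rather than GPars. It assumes the input is multi-record GenBank text in which each record ends with a line containing only `//`. The class name `ParallelGenbankSketch` and the method `parseRecord` are hypothetical; `parseRecord` is only a stand-in for the real per-record work (for example building a sequence object and filtering its features), which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelGenbankSketch {

    // Split multi-record GenBank text into single records.
    // Assumption: every record is terminated by a line holding only "//".
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\n")) {
            if (line.trim().equals("//")) {
                if (current.length() > 0) {
                    records.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(line).append('\n');
            }
        }
        return records;
    }

    // Hypothetical placeholder for the slow per-record step.
    // Here it just counts db_xref qualifier lines.
    static int parseRecord(String record) {
        int count = 0;
        for (String line : record.split("\n")) {
            if (line.contains("/db_xref=")) {
                count++;
            }
        }
        return count;
    }

    // Run parseRecord on every record using a fixed-size thread pool.
    // Results come back in the same order as the input records.
    static List<Integer> parseAll(List<String> records, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String record : records) {
                Callable<Integer> task = () -> parseRecord(record);
                futures.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Splitting first and parsing afterwards keeps the worker tasks independent of each other, so whether this helps depends on the per-record work being thread-safe and on how much of the time is spent in parsing rather than I/O.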
Regards,

khalil

On 17 Jun 2011, at 12:36, Martin Jones wrote:

> Yes, this approach won't be much use if you are interested in the
> contents of every genbank record.
>
> Have you thought about parsing the gb files in parallel? In my
> experience, parsing genbank files scales quite nicely when done in
> multiple threads. I have used the GPars library for this type of job
> and it is very nice to use:
>
> http://gpars.codehaus.org/Parallelizer
>
> M
>
> On 17 June 2011 11:33, Khalil El Mazouari <[email protected]> wrote:
>> Thanks Martin,
>>
>> I already tried the regex. The performance increase was < 10%.
>>
>> My situation is different in 2 points:
>> 1. info to extract from genbank file is always present.
>> 2. there are multiple features to extract from each record.
>>
>> I agree with you. Extracting a single field from a genbank file is done
>> much faster with a simple regex than with FeatureFilter.
>>
>> Regards,
>>
>> khalil
>>
>> On 17 Jun 2011, at 12:12, Martin Jones wrote:
>>
>>> Hi,
>>>
>>> I have had the same issue when parsing large sets of genbank files. In
>>> my case, the workaround was to first treat the whole genbank record as
>>> a string, and do a quick regex match to check if it contained
>>> something of interest (in my case I was searching for specific
>>> taxids):
>>>
>>>     // first do a quick pattern-match to extract the taxid so we can
>>>     // exit early without the overhead of parsing the whole file
>>>     private final Pattern taxidPattern =
>>>         Pattern.compile("db_xref=\\\"taxon:(\\d+)");
>>>     Matcher taxidMatcher = taxidPattern.matcher(currentRecord);
>>>     if (taxidMatcher.find()) {
>>>         def taxid = taxidMatcher[0][1].toInteger()
>>>         if (!taxidList.contains(taxid)) {
>>>             return
>>>         }
>>>         // here do the slow part of actually parsing all the features
>>>
>>> This is in Groovy so there are a few syntactical differences. If you
>>> are only interested in a subset of the GenBank records, then this
>>> approach might be of use.
>>>
>>> M
>>>
>>> On 17 June 2011 10:16, Khalil El Mazouari <[email protected]>
>>> wrote:
>>>> Hi,
>>>>
>>>> I am developing an app where features are extracted from a large genbank
>>>> file, and processed: multiple alignment, annotation....
>>>>
>>>> The feature extraction is a real bottleneck in my app. It consumes 87% of
>>>> total execution time.
>>>>
>>>> Feature extraction is done via:
>>>>
>>>>     FeatureFilter ff = new FeatureFilter.ByAnnotation(key, value);
>>>>     FeatureHolder fh = richSequence.filter(ff);
>>>>     Feature feat = fh.features().next();
>>>>     ...
>>>>
>>>> Any suggestion on how to improve the performance of feature extraction is
>>>> welcome.
>>>>
>>>> Thanks,
>>>>
>>>> khalil
>>>> _______________________________________________
>>>> Biojava-l mailing list - [email protected]
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
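For readers following the thread in Java rather than Groovy, the regex pre-filter described in the first reply could look roughly like the sketch below. It uses only `java.util.regex` from the standard library. The class name `TaxidPrefilter` and the methods `firstTaxid` and `worthParsing` are hypothetical names chosen for illustration. One assumption is made explicit: a record in which no `taxon` cross-reference is found is skipped, which is one reading of the Groovy fragment and may not match the original code exactly.

```java
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TaxidPrefilter {

    // Matches the qualifier text db_xref="taxon:NNNN and captures the digits.
    private static final Pattern TAXID =
            Pattern.compile("db_xref=\"taxon:(\\d+)");

    // Return the first taxid found in the raw record text, if any.
    static Optional<Integer> firstTaxid(String record) {
        Matcher m = TAXID.matcher(record);
        if (m.find()) {
            return Optional.of(Integer.parseInt(m.group(1)));
        }
        return Optional.empty();
    }

    // Cheap check to run before the expensive full parse.
    // Assumption: records without a taxid are not worth parsing.
    static boolean worthParsing(String record, Set<Integer> wantedTaxids) {
        return firstTaxid(record).map(wantedTaxids::contains).orElse(false);
    }
}
```

A caller would test `worthParsing(currentRecord, taxidList)` on the raw string and run the full feature parse only when it returns true. As noted in the thread, this only pays off when a subset of the records is of interest.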
