Re: [Biojava-l] How do I read a FASTA file containing protein sequences in lowercase?

Carl Mäsak Mon, 23 Nov 2009 11:02:56 -0800

Richard (>):
> Ah OK I see what's going on.
>
> The convenience method you're using, RichSequence.IOTools.readStream(), uses
> FastaFormat to try and guess the alphabet to use based on the first line of
> the input sequence.
>
> In FastaFormat, it does this by searching for matching non-DNA symbols. The
> search is case-sensitive:
>
>        protected static final Pattern aminoAcids =
> Pattern.compile(".*[FLIPQE].*");
>
> FastaFormat needs patching to make this pattern non-case-sensitive.


Patch attached.

I also took the opportunity to remove the occurrences of .* in the
Pattern above. Generally, one should be using Matcher.find() when one
is interested in matching a part of a string. This is more efficient
than using Matcher.matches() and surrounding the desired regular
expression with .*, since the latter will cause a lot of unnecessary
backtracking and make the search quadratic.
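
As a rough illustration of the change described above (not the attached
patch itself), the sketch below puts a case-insensitive pattern used with
Matcher.find() next to the original .* style used with Matcher.matches().
The class and method names are hypothetical and do not come from
FastaFormat; only Pattern, Matcher and Pattern.CASE_INSENSITIVE are real
java.util.regex API.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from FastaFormat.
class AminoAcidGuess {
    // Case-insensitive, with no surrounding .* because find() looks for a
    // match anywhere in the input.
    static final Pattern AMINO_ACIDS =
            Pattern.compile("[FLIPQE]", Pattern.CASE_INSENSITIVE);

    // The original style, for comparison: matches() must cover the whole
    // input, so the pattern needs .* on both sides.
    static final Pattern AMINO_ACIDS_DOT_STAR =
            Pattern.compile(".*[FLIPQE].*", Pattern.CASE_INSENSITIVE);

    static boolean looksLikeProtein(String line) {
        return AMINO_ACIDS.matcher(line).find();
    }

    static boolean looksLikeProteinDotStar(String line) {
        return AMINO_ACIDS_DOT_STAR.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeProtein("mkvlaaf"));   // lowercase protein: true
        System.out.println(looksLikeProtein("acgtacgt"));  // DNA symbols only: false
    }
}
```

For a single line of input both methods should give the same answer; the
difference lies only in how much work the regex engine does to get there.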

This effect only shows up for very long strings, but long strings can
and do happen in bioinformatics. The measurements below compare the .*
approach (WithDotStar) with the find() approach (WithoutDotStar); at the
longest input, the .* version takes roughly 2.6 times as long.

$ for length in 100 1000 10000 100000 1000000; do (time java WithDotStar $length) 2>&1 | grep real; done
real    0m0.371s
real    0m0.367s
real    0m0.577s
real    0m2.735s
real    0m25.275s

$ for length in 100 1000 10000 100000 1000000; do (time java WithoutDotStar $length) 2>&1 | grep real; done
real    0m0.309s
real    0m0.361s
real    0m0.468s
real    0m1.184s
real    0m9.703s
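
A comparison like the one above could be set up roughly as follows. This
is a hypothetical sketch, not the attached WithDotStar.java or
WithoutDotStar.java, whose contents are not shown here. It assumes a
DNA-only input, so neither pattern can match and the whole string has to
be scanned, and it times both approaches in a single process rather than
in two separate programs.

```java
import java.util.regex.Pattern;

// Hypothetical benchmark; not the attached WithDotStar/WithoutDotStar sources.
class DotStarTiming {
    static final Pattern WITH_DOT_STAR = Pattern.compile(".*[FLIPQE].*");
    static final Pattern WITHOUT_DOT_STAR = Pattern.compile("[FLIPQE]");

    // Builds a DNA-only string, so neither pattern can match and the
    // engine has to examine the whole input.
    static String dnaOfLength(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append("ACGT".charAt(i % 4));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Input length comes from the command line, defaulting to 100000.
        int length = args.length > 0 ? Integer.parseInt(args[0]) : 100000;
        String input = dnaOfLength(length);

        long t0 = System.nanoTime();
        boolean withDotStar = WITH_DOT_STAR.matcher(input).matches();
        long t1 = System.nanoTime();
        boolean withoutDotStar = WITHOUT_DOT_STAR.matcher(input).find();
        long t2 = System.nanoTime();

        System.out.printf("matches() with .*:    %b in %d us%n",
                withDotStar, (t1 - t0) / 1000);
        System.out.printf("find() without .*:    %b in %d us%n",
                withoutDotStar, (t2 - t1) / 1000);
    }
}
```

A single un-warmed run like this is noisy, and the wall-clock figures
above also include JVM startup, so the numbers from such a sketch are only
indicative.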

Kindly,
// Carl

Attachments: aminoAcids.patch, WithDotStar.java, WithoutDotStar.java
