Richard (>):
> Ah OK I see what's going on.
>
> The convenience method you're using, RichSequence.IOTools.readStream(), uses
> FastaFormat to try and guess the alphabet to use based on the first line of
> the input sequence.
>
> In FastaFormat, it does this by searching for matching non-DNA symbols. The
> search is case-sensitive:
>
> protected static final Pattern aminoAcids =
> Pattern.compile(".*[FLIPQE].*");
>
> FastaFormat needs patching to make this pattern non-case-sensitive.

Patch attached. I also took the opportunity to remove the occurrences of .* in
the Pattern above. Generally, one should use Matcher.find() when one is
interested in matching part of a string. This is more efficient than using
Matcher.matches() and surrounding the desired regular expression with .*,
since the latter causes a lot of unnecessary backtracking and makes the
search quadratic. This effect only shows up for very long strings, but long
strings can and do happen in bioinformatics.

The measurements below show the quadratic behaviour of the .* approach, with
the find() variant for comparison:

$ for length in 100 1000 10000 100000 1000000; do (time java WithDotStar $length) 2>&1 | grep real; done
real 0m0.371s
real 0m0.367s
real 0m0.577s
real 0m2.735s
real 0m25.275s

$ for length in 100 1000 10000 100000 1000000; do (time java WithoutDotStar $length) 2>&1 | grep real; done
real 0m0.309s
real 0m0.361s
real 0m0.468s
real 0m1.184s
real 0m9.703s

Kindly,
// Carl
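
For illustration only, here is a minimal sketch of the idea described above: a case-insensitive pattern without the surrounding .*, used with Matcher.find(). This is not the attached patch. The class name AminoAcidGuess and the method looksLikeProtein are hypothetical, and the character class is simply copied from the quoted FastaFormat snippet.

```java
import java.util.regex.Pattern;

class AminoAcidGuess {
    // Hypothetical stand-in for the FastaFormat pattern: no leading or
    // trailing .*, and CASE_INSENSITIVE so lower-case residues also match.
    static final Pattern AMINO_ACIDS =
        Pattern.compile("[FLIPQE]", Pattern.CASE_INSENSITIVE);

    // find() returns true as soon as one matching character is found
    // anywhere in the line; the whole line does not have to match.
    static boolean looksLikeProtein(String line) {
        return AMINO_ACIDS.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeProtein("mkvlaaqe"));  // true
        System.out.println(looksLikeProtein("ACGTACGT"));  // false
        System.out.println(looksLikeProtein("acgtnacgt")); // false
    }
}
```

With the case-sensitive pattern from the quoted snippet, the lower-case line "mkvlaaqe" would not be recognised as protein, which is the symptom discussed in this thread.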
Attachments: aminoAcids.patch, WithDotStar.java, WithoutDotStar.java
