Mark,
Thanks for the info. This is sort of a test project for us. We have a
few classes and data structures in C++ that handle operations like
sequence I/O and packing, and they are fairly fast. However, we've also
come to the realization that we've spent a lot of time dealing with
cross-platform and compiler-related problems, and if Java can give us
comparable performance then we might switch to it. If nothing else, the
opportunity costs would be lower, since we could write and test more
code in the same amount of time. The tools are a good deal better for
Java development than for C++.
We're at the point where we can either continue to invest time in our
library or rewrite what we have using BioJava and other libraries. I've
written a lot of Java code over the past 10 years, and I suggested that
we try Java with both the standard javac compiler and gcj to see if we
can get C++-like performance.
Thanks for your help,
Mark
[EMAIL PROTECTED] wrote:
There is probably not any performance benefit except in the case of very
large sequences, which are often compressed behind the scenes by BioJava.
The benefits may come from ease of use and object orientation.
For example, there is probably already a parser to read in and validate
your sequence. The windowing or nMer stuff is already figured out and has
been used by lots of people, so it's been "stress tested". Also, the
objects themselves have a lot of functionality built in that a character
stream does not. The downside of using objects is that they take up memory
and there is a certain amount of overhead in their construction. To help
overcome this, SymbolLists are actually lists of references to Symbols,
not lists of Symbols themselves. This makes them much smaller (although
not as small as char[] arrays).
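
The idea of a list of references to shared Symbols can be sketched in
plain Java. This is only an illustration of the sharing pattern, not
BioJava's actual implementation: the class SharedSymbols, the enum Base
and the method toBases are hypothetical names, and the enum stands in
for the real Symbol objects.

```java
/** Sketch of "a list of references to shared symbols" (hypothetical names, not BioJava API). */
final class SharedSymbols {

    /** Stand-in for a Symbol: the JVM holds exactly one instance per base. */
    enum Base { A, C, G, T }

    /**
     * Converts a DNA string into an array of references to the four shared
     * Base instances. Throws IllegalArgumentException for any character
     * other than A, C, G or T (in either case).
     */
    static Base[] toBases(String dna) {
        Base[] out = new Base[dna.length()];
        for (int i = 0; i < dna.length(); i++) {
            char upper = Character.toUpperCase(dna.charAt(i));
            out[i] = Base.valueOf(String.valueOf(upper));
        }
        return out;
    }

    public static void main(String[] args) {
        Base[] seq = toBases("ACTGACTG");
        // Both positions hold the same object, so no new symbol was constructed.
        System.out.println(seq[0] == seq[4]);
    }
}
```

Each array slot costs one reference rather than one full object, which
is the saving described above; a char[] is still smaller because a char
is narrower than a reference.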
If you want superfast performance then you should bit encode the data and
operate over it with memory pointers as in C or machine code. You should
be aware though that any intensive loop like the ones that would be used
to carry out this operation in BioJava will almost certainly be detected
and compiled into native code by the Java runtime on the fly. This might
make it hard to say whether the Java code would be much slower than the
C code.
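
As a rough illustration of bit encoding in Java itself, the sketch below
packs each base into 2 bits and keeps a rolling code for the current
window. It assumes a plain A/C/G/T alphabet and skips any window that
contains another character; the class TwoBitKmers and its methods are
hypothetical names, not part of BioJava.

```java
import java.util.Arrays;

/** Sketch: 2-bit packing of DNA with a rolling k-mer code (hypothetical names, not BioJava API). */
final class TwoBitKmers {

    /** Maps A, C, G, T (either case) to 0..3; returns -1 for anything else, e.g. N. */
    static int encode(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return -1;
        }
    }

    /** Decodes a packed k-mer back into its string form. */
    static String decode(long packed, int k) {
        char[] out = new char[k];
        for (int i = k - 1; i >= 0; i--) {
            out[i] = "ACGT".charAt((int) (packed & 3L));
            packed >>>= 2;
        }
        return new String(out);
    }

    /**
     * Returns the packed code of every overlapping k-mer, for 1 <= k <= 31.
     * Windows that contain a non-ACGT character are skipped.
     */
    static long[] packedKmers(CharSequence seq, int k) {
        if (k < 1 || k > 31) {
            throw new IllegalArgumentException("k must be in 1..31");
        }
        long mask = (1L << (2 * k)) - 1;   // keeps only the lowest 2*k bits
        long[] out = new long[Math.max(0, seq.length() - k + 1)];
        int count = 0;
        long window = 0;
        int valid = 0;                     // consecutive valid bases seen so far
        for (int i = 0; i < seq.length(); i++) {
            int code = encode(seq.charAt(i));
            if (code < 0) {                // ambiguous base: restart the window
                valid = 0;
                window = 0;
                continue;
            }
            window = ((window << 2) | code) & mask;
            if (++valid >= k) {
                out[count++] = window;
            }
        }
        return Arrays.copyOf(out, count);
    }

    public static void main(String[] args) {
        for (long code : packedKmers("ACTGACTGACTG", 3)) {
            System.out.println(code + " " + decode(code, 3));
        }
    }
}
```

Each step costs one shift, one OR and one mask regardless of k, and the
packed codes can be used directly as array indices for counting when k
is small.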
- Mark
Mark Fortner <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
12/16/2005 10:36 AM
Please respond to m.fortner
To: biojava-list <biojava-l@biojava.org>
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: Re: [Biojava-l] Sequence Iteration in BioJava(x)
Richard,
Thanks for the example. Your approach is very similar to a non-BioJava
approach that I had worked out earlier. I was wondering if the
BioJava(x) API provides any performance benefit over simply running a
window along a character stream?
The work that we're doing involves iterating through the human genome,
(and in a number of cases, metagenomic sequences) and we're trying to
squeeze as much performance out of it as possible while minimizing the
memory footprint.
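
For comparison, the non-BioJava approach of running a window along a
character stream might look roughly like the sketch below. It is not the
code referred to above; the class StreamWindow and its methods are
hypothetical, and it assumes the stream holds raw sequence characters
with optional line breaks (no FASTA header handling).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch: sliding window over a character stream (hypothetical names). */
final class StreamWindow {

    /**
     * Reads characters from in and passes every overlapping window of
     * length n to sink. Only n characters are buffered at any time.
     * Line breaks are skipped. Wrap the Reader in a BufferedReader for
     * file input.
     */
    static void forEachNmer(Reader in, int n, Consumer<String> sink) throws IOException {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        char[] ring = new char[n];   // circular buffer holding the current window
        long seen = 0;               // sequence characters consumed so far
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n' || c == '\r') {
                continue;
            }
            ring[(int) (seen % n)] = (char) c;
            seen++;
            if (seen >= n) {
                int start = (int) (seen % n);   // index of the oldest character
                StringBuilder sb = new StringBuilder(n);
                for (int i = 0; i < n; i++) {
                    sb.append(ring[(start + i) % n]);
                }
                // Allocating a String per window is convenient but not free.
                sink.accept(sb.toString());
            }
        }
    }

    /** Convenience wrapper for in-memory sequences. */
    static List<String> nmers(String seq, int n) {
        List<String> out = new ArrayList<>();
        try {
            forEachNmer(new StringReader(seq), n, out::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        nmers("ACTGACTGACTG", 3).forEach(System.out::println);
    }
}
```

Because the window lives in a fixed-size buffer, memory use does not
grow with the length of the sequence, which matters for genome-scale
input.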
Thanks,
Mark
Richard HOLLAND wrote:
orderNSymbolList splits the sequence into non-overlapping chunks. What
is required here is chunks that each start only one base further on
than the previous chunk.
The simplest way would be this:
SymbolList mySeq; // this is your sequence from somewhere else
for (int i = 1; i <= mySeq.length() - 2; i++) {
    // coords are inclusive, so i to i+2 = 3 bases
    SymbolList trimer = mySeq.subSeq(i, i + 2);
    // do something with your trimer here
}
Note that the index starts at 1 and the last window ends at, and
includes, position length(), as symbols in a SymbolList are 1-indexed,
not 0-indexed.
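
The difference between non-overlapping chunks and a one-base sliding
window, together with 1-based inclusive coordinates, can be tried out
without BioJava using the sketch below. It works on a plain String; the
class WindowCoords and its methods are hypothetical and only mimic the
coordinate convention, they do not reproduce orderNSymbolList itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: 1-based inclusive windows over a String (hypothetical names, not BioJava API). */
final class WindowCoords {

    /** Returns positions start..end of s, counting from 1 and including both ends. */
    static String subSeq(String s, int start, int end) {
        return s.substring(start - 1, end);
    }

    /**
     * Collects windows of length n taken every step positions.
     * step == n gives non-overlapping chunks; step == 1 gives the sliding
     * window asked for in this thread. A trailing partial window is dropped.
     */
    static List<String> windows(String s, int n, int step) {
        if (n < 1 || step < 1) {
            throw new IllegalArgumentException("n and step must be positive");
        }
        List<String> out = new ArrayList<>();
        for (int i = 1; i + n - 1 <= s.length(); i += step) {
            out.add(subSeq(s, i, i + n - 1));
        }
        return out;
    }

    public static void main(String[] args) {
        String seq = "ACTGACTGACTG";
        System.out.println(windows(seq, 3, 3)); // non-overlapping chunks
        System.out.println(windows(seq, 3, 1)); // one-base steps
    }
}
```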
cheers,
Richard
Richard Holland
Bioinformatics Specialist
GIS extension 8199
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of David Huen
Sent: Friday, December 16, 2005 7:34 AM
To: [EMAIL PROTECTED]
Cc: biojava-list
Subject: Re: [Biojava-l] Sequence Iteration in BioJava(x)
On Dec 15 2005, Mark Fortner wrote:
I think what you want is the SymbolListViews.orderNSymbolList method.
It will take a SymbolList and turn it into another where it is viewed
in another compound alphabet of the required order.
I'm looking for the best way to iterate through all
nmers within a given sequence. For example, given a
sequence that looks like this:
ACTGACTGACTG
If I extract all trimers from this I should get:
ACT
CTG
TGA
GAC
ACT
CTG
TGA
GAC
ACT
CTG
Is there an existing class that will allow me to
iterate through a sequence this way, or am I on my
own?
_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l