Re: [Biojava-l] Different implementation of Sequence?

Simon Foote Thu, 05 Jun 2003 22:08:30 -0700

Just to add my 2 cents worth.

I'm using the latest version of the BioSQL schema with MySQL, and the filters are quite fast. On a database containing 18 complete bacterial genomes, fetching a given gene by name, which in my case uses a combination of 5 filters, takes approx. 1-2 seconds.

Alas, the current version of BioJava doesn't support the latest schema, but I have modified all of the BioSQL classes to handle most of the new schema, and they add, remove and filter sequences correctly in all my tests so far. Now that I have CVS access (thanks Thomas, it works), I hope to check in these updates within the next day or two.

If you want them sooner, I can email them to you directly. Let me know.

Cheers,
Simon Foote

--
Bioinformatics Specialist
Institute for Biological Sciences
National Research Council of Canada
[T] 613-990-0561  [F] 613-952-9092
[EMAIL PROTECTED]

Thomas Down wrote:

Once upon a time, Y D Sun wrote:

Hi,

It seems that the implementation of a sequence read from a plain text
file (e.g. an Embl file) differs from that of one read from a BioSQL
database.

I apply a feature filter to a sequence seq like:

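           //imports needed by this snippet
           import org.biojava.bio.seq.FeatureFilter;
           import org.biojava.bio.seq.FeatureHolder;

           //make a Filter for "CDS" types
           FeatureFilter ff = new FeatureFilter.ByType("CDS");

           //get the filtered Features
           FeatureHolder fh = seq.filter(ff);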

Feature filtering takes much longer for a sequence from the database. In
my experiment, for example:

If seq is read from an Embl file, seq.filter(ff) takes 54 ms;
if the same seq is read from a BioSQL database, it takes 51518 ms
(nearly 1000 times as long).

The latter also uses more memory during execution.

Could anybody give some justification for this phenomenon?
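
For anyone wanting to reproduce timings like the ones quoted above, here is a minimal sketch of a wall-clock timer. The Timing class and its millis method are hypothetical helpers, not part of BioJava, and a single run like this includes JVM warm-up and any database connection overhead, so the numbers are only rough.

```java
// Hypothetical helper for rough wall-clock timings; not a BioJava class.
final class Timing {
    // Runs the task once and returns the elapsed time in whole milliseconds.
    static long millis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }
}
```

With the seq and ff variables from the snippet above, a call might look like Timing.millis(() -> seq.filter(ff)). Running it several times and discarding the first result would give a fairer comparison between the file-backed and database-backed cases.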


If you load a sequence from a file, it's all loaded into memory.
The filtering process is a simple in-memory operation.  When
a sequence is fetched from BioSQL, it's just a lazy reference
to the database.  The features are only being fetched when you
perform the filter operation.  This will be slower.  I'm
surprised it uses more memory, though -- certainly when you're
working with large numbers of sequences, BioSQL should be more
efficient.
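
The difference described above can be illustrated with a small, self-contained sketch. Feat and FeatureStore are hypothetical stand-ins, not BioJava or BioSQL classes, and the Supplier plays the role of the database query. Caching the result after the first fetch is a choice made for this sketch, not a claim about how the BioSQL binding behaves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a sequence feature; not a BioJava class.
final class Feat {
    final String type;
    Feat(String type) { this.type = type; }
}

// Hypothetical holder whose features are either in memory already
// or fetched from a source on the first filter call.
final class FeatureStore {
    private List<Feat> loaded;                  // null until features are available
    private final Supplier<List<Feat>> source;  // stands in for a database query
    int fetches = 0;                            // how many times the source was hit

    private FeatureStore(Supplier<List<Feat>> source) { this.source = source; }

    // Eager: everything is in memory from the start, as with a parsed file.
    static FeatureStore eager(List<Feat> all) {
        FeatureStore s = new FeatureStore(null);
        s.loaded = all;
        return s;
    }

    // Lazy: nothing is fetched until the first filter call.
    static FeatureStore lazy(Supplier<List<Feat>> source) {
        return new FeatureStore(source);
    }

    List<Feat> filterByType(String type) {
        if (loaded == null) {        // the first access pays the fetch cost
            loaded = source.get();
            fetches++;
        }
        List<Feat> out = new ArrayList<>();
        for (Feat f : loaded) {
            if (f.type.equals(type)) out.add(f);
        }
        return out;
    }
}
```

In the eager case the filter is only a loop over a list. In the lazy case the first filter call also includes whatever the source costs, which is where a slow query would show up.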

That said, the time you quote is very, very slow.  Where
did you get the BioSQL schema from?  Some versions are circulating
which seem to be missing some critical "CREATE INDEX" statements,
which makes feature-filtering substantially slower than it should
be...
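
One way to check for missing indexes from Java is to ask the JDBC driver which columns of a table are indexed. The sketch below uses the standard DatabaseMetaData.getIndexInfo call. The IndexCheck class is a hypothetical helper, and which tables and columns ought to be indexed depends on the schema version, so the caller has to supply them.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of BioJava or BioSQL.
final class IndexCheck {
    // Collects the lower-cased names of all columns that take part in any index on the table.
    static Set<String> indexedColumns(Connection conn, String table) throws SQLException {
        Set<String> cols = new HashSet<>();
        DatabaseMetaData meta = conn.getMetaData();
        // unique=false lists all indexes; approximate=true accepts stale statistics
        try (ResultSet rs = meta.getIndexInfo(null, null, table, false, true)) {
            while (rs.next()) {
                String col = rs.getString("COLUMN_NAME");  // null for statistics rows
                if (col != null) cols.add(col.toLowerCase(Locale.ROOT));
            }
        }
        return cols;
    }

    // Returns the wanted columns that no index covers, in their original order.
    static List<String> missing(Set<String> indexed, List<String> wanted) {
        List<String> out = new ArrayList<>();
        for (String w : wanted) {
            if (!indexed.contains(w.toLowerCase(Locale.ROOT))) out.add(w);
        }
        return out;
    }
}
```

A call could look like IndexCheck.missing(IndexCheck.indexedColumns(conn, "seqfeature"), wantedColumns), where the table name and the wantedColumns list are assumptions to be checked against the schema file actually in use. A non-empty result would suggest comparing that schema against a copy that includes the index definitions.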

Thomas.

_______________________________________________ Biojava-l mailing list - [EMAIL PROTECTED] http://biojava.org/mailman/listinfo/biojava-l
