Re: [Biojava-l] Large RichSequence collection

Khalil El Mazouari Thu, 01 Aug 2013 03:40:20 -0700

Hi,

Thanks for your proposal ;)

I have no problem reading the sequences from the FASTA file. The RichSequence 
iterator does that job very well. 
I am processing the input sequences one by one. Each RichSequence is annotated 
and added to a specific group (an ArrayList) based on the annotation results. 
All annotated sequences are kept in memory and re-processed later, which 
prevents the GC from reclaiming the heap.
I could serialize the processed sequences, but the extra I/O also has a 
performance cost.

I can inspect the heap with Eclipse Memory Analyzer. The SimpleRichSequence 
objects consume a lot of memory.
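
To make the pipeline above concrete, here is a minimal sketch of the 
"annotate, then add to a group" step that keeps only a compact record per 
sequence instead of a full sequence object. Everything in it is an assumption 
for illustration: CompactRecord, annotate() and group() are hypothetical names, 
the GC-content rule merely stands in for the real annotation, and the FASTA 
parsing is plain Java, not BioJava. Whether this saves enough memory would 
have to be measured on the real data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CompactGrouping {

    /** Hypothetical lightweight stand-in for a fully annotated sequence. */
    static final class CompactRecord {
        final String id;
        final byte[] bases; // one byte per base

        CompactRecord(String id, String sequence) {
            this.id = id;
            this.bases = sequence.getBytes(StandardCharsets.US_ASCII);
        }

        String sequence() {
            return new String(bases, StandardCharsets.US_ASCII);
        }
    }

    /** Placeholder annotation (assumption): classify by GC content. */
    static String annotate(String sequence) {
        if (sequence.isEmpty()) {
            return "empty";
        }
        int gc = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char c = Character.toUpperCase(sequence.charAt(i));
            if (c == 'G' || c == 'C') {
                gc++;
            }
        }
        return (2 * gc >= sequence.length()) ? "high-gc" : "low-gc";
    }

    /** Reads FASTA records one by one and files each under its annotation. */
    static Map<String, List<CompactRecord>> group(BufferedReader fasta) {
        Map<String, List<CompactRecord>> groups = new HashMap<>();
        String id = null;
        StringBuilder seq = new StringBuilder();
        try {
            String line;
            while ((line = fasta.readLine()) != null) {
                line = line.trim();
                if (line.startsWith(">")) {
                    if (id != null) {
                        add(groups, id, seq.toString());
                    }
                    id = line.substring(1);
                    seq.setLength(0);
                } else {
                    seq.append(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (id != null) {
            add(groups, id, seq.toString()); // flush the last record
        }
        return groups;
    }

    private static void add(Map<String, List<CompactRecord>> groups,
                            String id, String sequence) {
        groups.computeIfAbsent(annotate(sequence), k -> new ArrayList<>())
              .add(new CompactRecord(id, sequence));
    }

    public static void main(String[] args) {
        String fasta = ">a\nGGCC\n>b\nATAT\nAT\n>c\nGCGA\n";
        Map<String, List<CompactRecord>> groups =
                group(new BufferedReader(new StringReader(fasta)));
        groups.forEach((key, list) -> System.out.println(key + ": " + list.size()));
    }
}
```

The point of the sketch is only the shape: the parser's temporary objects 
become garbage after each record, and the groups hold byte arrays plus an id.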

Best

khalil

-----

Confidentiality Notice: This e-mail and any files transmitted with it are 
private and confidential and are solely for the use of the addressee. It may 
contain material which is legally privileged. If you are not the addressee or 
the person responsible for delivering to the addressee, please notify that you 
have received this e-mail in error and that any use of it is strictly 
prohibited. It would be helpful if you could notify the author by replying to 
it.

On 01 Aug 2013, at 07:58, Amr AL-HOSSARY wrote:

> If your problem is parsing/loading all the sequences into memory first,
> before managing them, I created a method, public LinkedHashMap<String,S>
> process(int max), in class FastaReader in BioJava 3.0.6. It parses at most
> max sequences per call, then reads the next sequences in a subsequent call.
> You can use it. If you need a similar one in BioJava 1, I can make it for
> you.
> 
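
The batch-wise reading described above can be sketched in plain Java. This is 
not BioJava's FastaReader: BatchFastaReader is a hypothetical stand-in that 
only mirrors the idea of a process(int max) call that parses at most max 
records and continues where it stopped on the next call. Returning an empty 
map at end of input is an assumption of this sketch, and the real method may 
signal the end differently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;

final class BatchFastaReader {
    private final BufferedReader in;
    private String pendingHeader; // header read while finishing the previous record

    BatchFastaReader(BufferedReader in) {
        this.in = in;
    }

    /** Parses at most max records; an empty map means no records were read. */
    LinkedHashMap<String, String> process(int max) {
        LinkedHashMap<String, String> batch = new LinkedHashMap<>();
        try {
            while (batch.size() < max) {
                String header = nextHeader();
                if (header == null) {
                    break; // end of input
                }
                StringBuilder seq = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.startsWith(">")) {
                        pendingHeader = line; // belongs to the next record
                        break;
                    }
                    seq.append(line);
                }
                batch.put(header.substring(1), seq.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return batch;
    }

    /** Returns the next header line, or null at end of input. */
    private String nextHeader() throws IOException {
        if (pendingHeader != null) {
            String header = pendingHeader;
            pendingHeader = null;
            return header;
        }
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.startsWith(">")) {
                return line;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String fasta = ">a\nAC\nGT\n>b\nTT\n>c\nGG\n";
        BatchFastaReader reader =
                new BatchFastaReader(new BufferedReader(new StringReader(fasta)));
        LinkedHashMap<String, String> batch;
        // Each batch can be annotated and released before the next one is read.
        while (!(batch = reader.process(2)).isEmpty()) {
            System.out.println("batch of " + batch.size() + ": " + batch.keySet());
        }
    }
}
```
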
> Otherwise, you will need to modify your algorithm to deal with smaller
> clusters, based on the task you are doing.
> 
> Amr
> 
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Khalil El
> Mazouari
> Sent: Thursday, August 01, 2013 1:17 AM
> To: [email protected]
> Subject: [Biojava-l] Large RichSequence collection
> 
> Hi,
> 
> I have to process a large dataset of DNA sequences (>= 120,000 sequences).
> The sequences are first annotated, then clustered ... I end up with a huge
> collection of SimpleRichSequence objects consuming a lot of RAM.
> 
> Any suggestions on how to deal effectively with a large collection of
> RichSequence objects are welcome.
> 
> Thanks in advance.
> 
> khalil
> 
> _______________________________________________
> Biojava-l mailing list  -  [email protected]
> http://lists.open-bio.org/mailman/listinfo/biojava-l
