> From: Lee Bergstrand [lee.h.bergstr...@gmail.com]
> Sent: Monday, July 21, 2014 1:41 AM
> To: Boisvert, Sebastien
> Subject: Ray Meta Memory Usage For Large Datasets
>
> Hello Sébastien,
>
> I was asked to contact you by one of my lab-mates, Roli Wilhelm, a graduate
> student at the University of British Columbia's Life Sciences Institute. I
> would like to ask some questions about one of your previous software
> projects, the Ray assembler, with reference to its memory usage on large
> metagenomic datasets.
OK.

> Roli's metagenomic dataset consists of the following:
>
> A paired-end file (20 GB; 143,384,708 reads)
> An unpaired-end file (4.2 GB; 34,404,832 reads)
>
> All data is from soil samples.

Cool.

> We are attempting to assemble this metagenome on a workstation with the
> following specs:
>
> Intel(R) Xeon(R) CPU E5-2670, 8 cores (16 threads), clocked at 2.60 GHz
> 128 gigabytes of ECC DRAM
> 128 gigabytes of dedicated swap on an SSD
>
> We have run into a problem with Ray's "excessive" memory usage. With our
> dataset, the memory used by Ray instances continually increases in a
> step-wise manner as the assembly proceeds, eventually consuming all of the
> RAM and swap in our workstation. We have run Ray with between 5 and 8
> mpiexec instances (mpiexec -n 10 Ray ...).

The thing with soil samples is that they contain a lot of unique kmers.

Also, I don't think it is worth it to use the hyperthreads. You are probably
better off using 8 MPI ranks ("-n 8"; an example invocation is appended at
the end of this message):

http://ark.intel.com/products/64595/Intel-Xeon-Processor-E5-2670-20M-Cache-2_60-GHz-8_00-GTs-Intel-QPI

Otherwise, you'll have threads competing for the same L1 cache lines, which
is called "cache thrashing".

Also, I don't think you'll get good performance once Linux starts to swap
pages in and out of an SSD. A page fault is very expensive.

http://en.wikipedia.org/wiki/Page_fault

> I noticed that the memory usage stabilized at different stages of the
> assembly; however, over time the memory usage increased. Is memory usage
> in excess of 200 GB typical for the Ray assembler when operating on
> datasets in excess of 20 GB?

It really depends on the nature of the data. Let's say you have 200 000 000
sequences of length 100 nucleotides (20 Gb) and that you're using a kmer
length of 43. Then, if all those fancy sequences are unique (that could
happen in soil samples if you are "under-sequencing"), you get an upper
bound of 11 600 000 000 canonical kmers (200 000 000 * (100 - 43 + 1)).
That's 498 800 000 000 nucleotides (11 600 000 000 * 43), or 124 700 000 000
bytes assuming 2 bits per nucleotide. So you are already at 124 GB of RAM
just for the kmers (a back-of-envelope script is appended below). Obviously,
a Bloom filter can filter out some of those (see the sketch appended below).

You should look into Compute Canada. Ray is targeted at supercomputers.
Sure, it can run on 1 computer, but you'll get better performance by having
everything in distributed RAM, without any of the page faults you get when
using a swap mount point.

> Thanks,
>
> Lee
>
> P.S. I will be checking out biosal.

Yeah. Some background on this (the biosal architecture is very exciting):
biosal is a research project at Argonne. The scope is very large: we aim to
create a library for analyzing sequences at scale. We use the actor model,
the same computation model used by Erlang (you might have heard of Erlang;
it is very popular on the interwebs). We started the code of biosal on
May 22, 2014. Right now, it has a distributed actor model engine called
"Thorium", a bunch of general examples, and a kmer counter called argonnite.
The main app will be "Spate", a metagenome assembler with integrated genome
isolation. A generic actor-model sketch is appended below.
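To make the "-n 8" suggestion concrete, the invocation could look roughly
like the line below. The file names are placeholders for your two files, and
I am assuming Ray's usual options here (-k for the kmer length, -i for an
interleaved paired-end file, -s for a single-end file, -o for the output
directory); check Ray's manual for the exact spelling on your version:

    mpiexec -n 8 Ray -k 31 \
        -i paired_end_reads.fastq \
        -s unpaired_reads.fastq \
        -o RayOutput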
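To make the memory arithmetic easy to replay, here is a back-of-envelope
Python sketch using the same numbers as above (the figures come from the
message; the script itself is only an illustration):

    # Upper bound on kmer memory, replaying the numbers above.
    reads = 200 * 10**6            # 200 000 000 sequences
    read_length = 100              # nucleotides per sequence
    k = 43                         # kmer length

    kmers_per_read = read_length - k + 1         # 58 kmers per sequence
    max_unique_kmers = reads * kmers_per_read    # 11 600 000 000
    nucleotides = max_unique_kmers * k           # 498 800 000 000
    bytes_for_kmers = nucleotides // 4           # 2 bits per nucleotide

    print("unique kmers (upper bound):", max_unique_kmers)
    print("kmer storage: %.1f GB" % (bytes_for_kmers / 10.0**9))
    # -> kmer storage: 124.7 GB, before any per-kmer overhead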
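On the Bloom filter point: the idea is that kmers seen only once (often
sequencing errors, or under-sequenced regions in soil) never get promoted
into the memory-hungry kmer table. This is a minimal generic sketch in
Python, not Ray's implementation; the bit-array size and hash count are
arbitrary:

    import hashlib

    class BloomFilter:
        """A minimal Bloom filter: a bit array plus several hash functions."""
        def __init__(self, size_in_bits, num_hashes):
            self.size = size_in_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_in_bits // 8 + 1)

        def _positions(self, item):
            # Derive num_hashes bit positions from salted SHA-1 digests.
            for salt in range(self.num_hashes):
                digest = hashlib.sha1(("%d:%s" % (salt, item)).encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    # Only promote a kmer to the real (expensive) table once it has been
    # seen before; singletons stay in the cheap Bloom filter.
    seen = BloomFilter(size_in_bits=8 * 10**6, num_hashes=3)
    table = {}
    for kmer in ["ACGT", "ACGT", "TTTT"]:       # toy input
        if kmer in seen:
            table[kmer] = table.get(kmer, 1) + 1   # count is 2 on 2nd sighting
        else:
            seen.add(kmer)
    print(table)   # {'ACGT': 2}; the singleton 'TTTT' never entered the table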
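And on the actor model: here is a toy Python illustration of what "actors"
are: independent state holders that interact only by sending messages to
each other's mailboxes. This is a generic sketch, not Thorium's actual API
(Thorium is written in C), and the KmerCounter is only a hypothetical
stand-in for something like argonnite:

    import queue
    import threading

    class Actor:
        """An object with a mailbox and a private thread draining it."""
        def __init__(self):
            self.mailbox = queue.Queue()
            self._thread = threading.Thread(target=self._run)
            self._thread.start()

        def send(self, message):       # the only way to talk to an actor
            self.mailbox.put(message)

        def stop(self):
            self.mailbox.put(None)     # poison pill ends the loop
            self._thread.join()

        def _run(self):
            while True:
                message = self.mailbox.get()
                if message is None:
                    break
                self.receive(message)

    class KmerCounter(Actor):
        """Counts kmers; its state is touched only by its own thread."""
        def __init__(self):
            self.counts = {}           # set state before the thread starts
            super().__init__()

        def receive(self, kmer):
            self.counts[kmer] = self.counts.get(kmer, 0) + 1

    counter = KmerCounter()
    for kmer in ["ACGT", "ACGT", "TTTT"]:
        counter.send(kmer)
    counter.stop()
    print(counter.counts)              # {'ACGT': 2, 'TTTT': 1}

Because actors share nothing and only pass messages, the same program shape
scales from threads on one machine to ranks spread across a cluster, which
is the point of running the engine on distributed RAM.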
--seb