> From: Lee Bergstrand [lee.h.bergstr...@gmail.com]
> Sent: Monday, July 21, 2014 1:41 AM
> To: Boisvert, Sebastien
> Subject: Ray Meta Memory Usage For Large Datasets
>
> Hello Sébastien,
>
> I was asked to contact you by one of my lab-mates, Roli Wilhelm, a graduate
> student at the University of British Columbia's Life Sciences Institute. I
> would like to ask some questions about one of your previous software
> projects, the Ray assembler, with reference to its memory usage on large
> metagenomic datasets.
OK.

> Roli's metagenomic dataset consists of the following:
>
> A paired-end file (20 GB; 143,384,708 reads)
> An unpaired-end file (4.2 GB; 34,404,832 reads)
>
> All data is from soil samples.

Cool.

> We are attempting to assemble this metagenome on a workstation with the
> following specs:
>
> Intel(R) Xeon(R) CPU E5-2670, 8 cores (16 threads), clocked at 2.60 GHz
> 128 gigabytes of ECC DRAM
> 128 gigabytes of dedicated swap on an SSD
>
> We have run into a problem with Ray's "excessive" memory usage. With our
> dataset, the memory used by Ray instances continually increases in a
> step-wise manner as the assembly proceeds, eventually consuming all of the
> RAM and swap in our workstation. We have run Ray with between 5 and 8
> mpiexec instances (mpiexec -n 10 Ray ...).

The thing with soil samples is that they contain a lot of unique kmers.

Also, I don't think it is worth it to use the hyperthreads. You are probably
better off using 8 MPI ranks ("-n 8"; an example invocation is appended at
the end of this message):

http://ark.intel.com/products/64595/Intel-Xeon-Processor-E5-2670-20M-Cache-2_60-GHz-8_00-GTs-Intel-QPI

Otherwise, you'll have threads competing for the same L1 cache lines, which
is called "cache thrashing".

Also, I don't think you'll get good performance once Linux starts to swap
pages in and out of an SSD. A page fault is very expensive.

http://en.wikipedia.org/wiki/Page_fault

> I noticed that the memory usage stabilized at different stages of the
> assembly; however, over time the memory usage increased. Is memory usage
> in excess of 200 GB typical for the Ray assembler when operating on
> datasets in excess of 20 GB?

It really depends on the nature of the data. Let's say you have 200 000 000
sequences of length 100 nucleotides (20 Gb) and that you're using a kmer
length of 43. Then, if all those fancy sequences are unique (that could
happen in soil samples if you are "under-sequencing"), you get an upper
bound of 11 600 000 000 canonical kmers (200 000 000 * (100 - 43 + 1)).
That's 498 800 000 000 nucleotides (11 600 000 000 * 43), or 124 700 000 000
bytes assuming 2 bits per nucleotide. So you are already at 124 GB of RAM
just for the kmers (a back-of-envelope script is appended below). Obviously,
a Bloom filter can filter out some of those (see the sketch appended below).

You should look into Compute Canada. Ray is targeted at supercomputers.
Sure, it can run on 1 computer, but you'll get better performance by having
everything in distributed RAM, without any of the page faults you get when
using a swap mount point.

> Thanks,
>
> Lee
>
> P.S. I will be checking out biosal.

Yeah. Some background on this (the biosal architecture is very exciting):
biosal is a research project at Argonne. The scope is very large: we aim to
create a library for analyzing sequences at scale. We use the actor model,
the same computation model used by Erlang (you might have heard of Erlang;
it is very popular on the interwebs). We started the code of biosal on
May 22, 2014. Right now, it has a distributed actor model engine called
"Thorium", a bunch of general examples, and a kmer counter called argonnite.
The main app will be "Spate", a metagenome assembler with integrated genome
isolation. A generic actor-model sketch is appended below.
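To make the "-n 8" suggestion concrete, the invocation could look roughly
like the line below. The file names are placeholders for your two files, and
I am assuming Ray's usual options here (-k for the kmer length, -i for an
interleaved paired-end file, -s for a single-end file, -o for the output
directory); check Ray's manual for the exact spelling on your version:

    mpiexec -n 8 Ray -k 31 \
        -i paired_end_reads.fastq \
        -s unpaired_reads.fastq \
        -o RayOutput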
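To make the memory arithmetic easy to replay, here is a back-of-envelope
Python sketch using the same numbers as above (the figures come from the
message; the script itself is only an illustration):

    # Upper bound on kmer memory, replaying the numbers above.
    reads = 200 * 10**6            # 200 000 000 sequences
    read_length = 100              # nucleotides per sequence
    k = 43                         # kmer length

    kmers_per_read = read_length - k + 1         # 58 kmers per sequence
    max_unique_kmers = reads * kmers_per_read    # 11 600 000 000
    nucleotides = max_unique_kmers * k           # 498 800 000 000
    bytes_for_kmers = nucleotides // 4           # 2 bits per nucleotide

    print("unique kmers (upper bound):", max_unique_kmers)
    print("kmer storage: %.1f GB" % (bytes_for_kmers / 10.0**9))
    # -> kmer storage: 124.7 GB, before any per-kmer overhead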
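On the Bloom filter point: the idea is that kmers seen only once (often
sequencing errors, or under-sequenced regions in soil) never get promoted
into the memory-hungry kmer table. This is a minimal generic sketch in
Python, not Ray's implementation; the bit-array size and hash count are
arbitrary:

    import hashlib

    class BloomFilter:
        """A minimal Bloom filter: a bit array plus several hash functions."""
        def __init__(self, size_in_bits, num_hashes):
            self.size = size_in_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_in_bits // 8 + 1)

        def _positions(self, item):
            # Derive num_hashes bit positions from salted SHA-1 digests.
            for salt in range(self.num_hashes):
                digest = hashlib.sha1(("%d:%s" % (salt, item)).encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    # Only promote a kmer to the real (expensive) table once it has been
    # seen before; singletons stay in the cheap Bloom filter.
    seen = BloomFilter(size_in_bits=8 * 10**6, num_hashes=3)
    table = {}
    for kmer in ["ACGT", "ACGT", "TTTT"]:       # toy input
        if kmer in seen:
            table[kmer] = table.get(kmer, 1) + 1   # count is 2 on 2nd sighting
        else:
            seen.add(kmer)
    print(table)   # {'ACGT': 2}; the singleton 'TTTT' never entered the table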
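And on the actor model: here is a toy Python illustration of what "actors"
are: independent state holders that interact only by sending messages to
each other's mailboxes. This is a generic sketch, not Thorium's actual API
(Thorium is written in C), and the KmerCounter is only a hypothetical
stand-in for something like argonnite:

    import queue
    import threading

    class Actor:
        """An object with a mailbox and a private thread draining it."""
        def __init__(self):
            self.mailbox = queue.Queue()
            self._thread = threading.Thread(target=self._run)
            self._thread.start()

        def send(self, message):       # the only way to talk to an actor
            self.mailbox.put(message)

        def stop(self):
            self.mailbox.put(None)     # poison pill ends the loop
            self._thread.join()

        def _run(self):
            while True:
                message = self.mailbox.get()
                if message is None:
                    break
                self.receive(message)

    class KmerCounter(Actor):
        """Counts kmers; its state is touched only by its own thread."""
        def __init__(self):
            self.counts = {}           # set state before the thread starts
            super().__init__()

        def receive(self, kmer):
            self.counts[kmer] = self.counts.get(kmer, 0) + 1

    counter = KmerCounter()
    for kmer in ["ACGT", "ACGT", "TTTT"]:
        counter.send(kmer)
    counter.stop()
    print(counter.counts)              # {'ACGT': 2, 'TTTT': 1}

Because actors share nothing and only pass messages, the same program shape
scales from threads on one machine to ranks spread across a cluster, which
is the point of running the engine on distributed RAM.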
--seb