On 20/10/11 12:51 PM, Ola Wallerman wrote:
> Hi Sebastien,

Hi Ola,

> I am trying out Ray for assembly of a human genome. I must say I am
> quite surprised by the results from my first try since it worked
> straight away without any problems, with only one program to run,
> which is not what one is used to in the NGS field... I installed v
> 1.7, ran it with 300 M HiSeq PE reads on 20 nodes and it finished
> without any errors after ~12 h, with ~1 Gbp assembled.

One thing our team is aiming for with Ray is ease of use for the user.
(It just works TM)
The complexity (like the various stages of the algorithm) is
encapsulated in Ray.

Just out of curiosity, what is the inter-node latency of your compute
resource? Ray tests the network before doing its deed, so the latency
is in the file NetworkTest.txt.

> I wonder if you could give me any advice on how to run it in the best
> way, e.g. should one use as many nodes as possible (we have 384 nodes
> with at least 24 GB) and should reads be quality filtered beforehand?

For the assemblathon, we did not filter reads at all. With Ray,
filtering reads only reduces memory usage, I believe.

If you know you have DNA contamination in your reads (non-human for
instance), then you should filter reads.

If you have mate pairs, the adaptors used to circularise long DNA
molecules during library construction may be present in the reads. So
far, it seems that the optical read markers in Ray deal with that.

> The dataset I have is around 200 M HiSeq paired reads (100 bp, inserts
> ~150 to 300 bp) and ~3 billion short single end reads (~36 bp). I
> tried now with k=27, but perhaps a higher k is better for the long
> reads? The reason for doing the assembly is to use the contigs to get
> a better precision in calling indels and rearrangements.

Did you provide all the reads?

You have to be careful with a k that is too large because Ray does not
attempt to correct the reads at all. The erroneous k-mers all go into
an abyss (not the assembler!) and are not really used.

In my experience, k=21 works well for bacteria. For larger genomes, I
usually use k=25 or k=31, although I don't have much experience with
large genomes aside from the assemblathon.

You can check some of the files generated by Ray for your first
assembly. In your assembly directory, the file
CoverageDistributionAnalysis.txt contains the peak coverage.

For your paired reads, the file LibraryStatistics.txt contains what Ray
detected in your reads. This step is very important, as paired reads
are the workhorse to go from reads to k-mer graph to seeds to
extensions. The importance of pairs is also highlighted by the recent
application note published by Illumina using the MiSeq and Ray.

http://www.illumina.com/documents/%5Cproducts%5Cappnotes%5Cappnote_miseq_denovo.pdf

There is also a file called SeedLengthDistribution.txt. This file
contains the distribution of seed lengths. In Ray, a seed is a region
of the genome that is unique. I like to say that a seed is conceptually
similar to a unitig in overlap-layout-consensus assemblers, although I
suspect there are some differences.

Increasing k increases the uniqueness of the sub-sequences extracted
from reads, but it also reduces the usable sub-sequence coverage
because of sequencing errors. As I said above, you can assess your
sub-sequence coverage (also known as k-mer coverage) by reading the
file CoverageDistributionAnalysis.txt.

> Best regards,

Let me know if you have any other questions.

> Ola

Sébastien

> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Ola Wallerman, PhD
> IGP, Uppsala Universitet
>
> waller...@gmail.com
> olawallerman@skype
> 0736400172
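To put rough numbers on the trade-off with k described in the reply,
here is a minimal back-of-the-envelope sketch in Python. It is not part
of Ray and does not read any Ray output. The function names are
hypothetical, and it assumes that every base has the same independent
error probability; the 1% value is only illustrative, not a measured
rate for this dataset.

```python
# Back-of-the-envelope sketch, not part of Ray.
# Assumption: each base is wrong independently with the same probability.


def kmers_per_read(read_length: int, k: int) -> int:
    """Number of k-mers that one read of the given length contributes."""
    return max(read_length - k + 1, 0)


def coverage_ratio(read_length: int, k: int) -> float:
    """k-mer coverage divided by base coverage for reads of one length."""
    return kmers_per_read(read_length, k) / read_length


def error_free_fraction(k: int, per_base_error: float) -> float:
    """Probability that a k-mer contains no sequencing error."""
    return (1.0 - per_base_error) ** k


if __name__ == "__main__":
    assumed_error = 0.01  # illustrative value only, not measured
    print("read_len  k  kmers/read  cov_ratio  error_free")
    for read_length in (100, 36):  # read lengths mentioned in the thread
        for k in (21, 25, 27, 31):  # k values mentioned in the thread
            print(
                f"{read_length:8d} {k:2d} "
                f"{kmers_per_read(read_length, k):11d} "
                f"{coverage_ratio(read_length, k):10.3f} "
                f"{error_free_fraction(k, assumed_error):11.3f}"
            )
```

For the 36 bp reads, going from k=27 to k=31 reduces each read from 10
k-mers to 6, and under the assumed error model a longer k-mer is also
less likely to be error-free. The real k-mer coverage is the one
reported in CoverageDistributionAnalysis.txt.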
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users