Hello,
> ________________________________________
> From: Aarti Desai [aarti_de...@persistent.co.in]
> Sent: September 12, 2011 08:55
> To: Sébastien Boisvert
> Subject: Question on RAY
>
> Hi Sébastien,
> My name is Aarti Desai and I am using Ray to assemble some E. coli and
> yeast Illumina data. The problem is that I have a high level of duplication
> in the datasets: 87% in the E. coli data (average depth is about 800) and
> about 56% in the yeast data (average depth about 275X), as estimated using
> FASTQC. From what I understand, FASTQC uses the first 50 bp of a read to
> identify duplicates in the dataset.
>
> I want to know how Ray handles duplicate reads in a dataset. Does it ignore
> duplicate reads?

Ray will count the k-mers in all the reads, build a distributed graph, and find paths in it. The maximum k-mer coverage in Ray is 65535, so technically the high coverage won't be a problem. You just have to be sure that the reads are mostly randomly distributed across the genome.

You can do some quality control with Ray too. Just run it on your data with:

    mpiexec -n 24 Ray -k 31 -p 1_1.fastq 1_2.fastq -p 2_1.fastq 2_2.fastq -o QualityControl

Then check this file:

    QualityControl/CoverageDistributionAnalysis.txt

Its content looks like this:

    k-mer length: 31
    Lowest coverage observed: 2
    MinimumCoverage: 33
    PeakCoverage: 108
    RepeatCoverage: 183
    Number of k-mers with at least MinimumCoverage: 9097604 k-mers
    Estimated genome length: 4548802 nucleotides
    Percentage of vertices with coverage 2: 33.1731 %
    DistributionFile: coli/CoverageDistribution.txt

If the estimated genome length is correct, then these duplications should not be a problem. This estimation is done on the distributed graph before attempting to assemble anything.

If most of the genome positions have no reads at all because most of the reads come from a few highly represented regions, then this will be a problem with any assembler.

What exactly does the 87% measure?

Sébastien
http://github.com/sebhtml/ray

> Thanks for your help.
> Aarti
>
> Dr.
> Aarti Desai | Domain Specialist – Life Sciences Domain
> aarti_de...@persistent.co.in | Cell: +91-9673009492 | Tel: +91-20-30236348
> Persistent Systems Ltd. | Partners in Innovation | www.persistentsys.com

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
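
The check suggested in the reply (compare the estimated genome length in CoverageDistributionAnalysis.txt with what you expect for the organism) can be scripted. The following is only a minimal sketch: it assumes the report keeps the "Key: value" layout of the sample quoted in the message, and the helper names, the 10% tolerance, and the expected length of roughly 4.6 Mb for E. coli are illustrative assumptions, not part of Ray. In the sample, the k-mer count is exactly twice the estimated length; the script only verifies that observation on the sample and does not claim it as documented Ray behavior.

```python
# Sketch: sanity-check the genome length reported by Ray's
# CoverageDistributionAnalysis.txt. Helper names and thresholds are
# hypothetical; only the report layout comes from the message above.

def parse_report(text):
    """Parse 'Key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def leading_int(value):
    """Return the integer at the start of a value such as '4548802 nucleotides'."""
    return int(value.split()[0])


def check_genome_length(fields, expected_length, tolerance=0.10):
    """Return (estimated_length, is_plausible) for an assumed expected length."""
    estimated = leading_int(fields["Estimated genome length"])
    relative_error = abs(estimated - expected_length) / expected_length
    return estimated, relative_error <= tolerance


# Sample report copied from the message; in practice, read the file instead:
#   text = open("QualityControl/CoverageDistributionAnalysis.txt").read()
SAMPLE = """\
k-mer length: 31
Lowest coverage observed: 2
MinimumCoverage: 33
PeakCoverage: 108
RepeatCoverage: 183
Number of k-mers with at least MinimumCoverage: 9097604 k-mers
Estimated genome length: 4548802 nucleotides
Percentage of vertices with coverage 2: 33.1731 %
DistributionFile: coli/CoverageDistribution.txt
"""

fields = parse_report(SAMPLE)

# Assumption: about 4.6 Mb for E. coli; replace with the size of your strain.
estimated, plausible = check_genome_length(fields, expected_length=4_600_000)

# Observation on the sample only: k-mer count is twice the estimated length.
kmers = leading_int(fields["Number of k-mers with at least MinimumCoverage"])
assert kmers == 2 * estimated

print(estimated, plausible)  # prints: 4548802 True
```

If the second value printed is False, the estimate is far from the expected size, which is the situation the reply warns about: reads concentrated in a few regions rather than spread across the genome.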