Hello,


> ________________________________________
> From: Aarti Desai [aarti_de...@persistent.co.in]
> Sent: September 12, 2011 08:55
> To: Sébastien Boisvert
> Subject: Question on RAY
> 
> Hi Sébastien,
> My name is Aarti Desai and I am using Ray to assemble some E. coli and 
> yeast Illumina data. The problem is that I have a high level of duplication 
> in the dataset: 87% in the E. coli data (average depth about 800X) and about 
> 56% in the yeast data (average depth about 275X), as estimated using FastQC. 
> From what I understand, FastQC uses the first 50 bp of a read to identify 
> duplicates in the dataset.
> 
> How does Ray handle duplicate reads in a dataset? Does it ignore duplicate 
> reads?

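Ray counts the k-mers in all the reads, builds a distributed graph, and finds 
paths in it. Duplicate reads are not ignored; they only raise the coverage of 
k-mers that are already in the graph.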



The maximum k-mer coverage that Ray can record is 65535, so technically an 
average depth of about 800X won't be a problem.
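
As a rough illustration of why duplicates only raise coverage, here is a minimal Python sketch of k-mer counting with a capped counter. It is not Ray's code: the function name `count_kmers` and the toy reads are made up for the example, and the cap of 65535 is simply the limit quoted above (it happens to be the largest value of an unsigned 16-bit integer).

```python
from collections import Counter

MAX_COVERAGE = 65535  # limit quoted above; also the largest unsigned 16-bit value


def count_kmers(reads, k):
    """Count every k-mer in every read, capping each count at MAX_COVERAGE."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Saturate instead of overflowing once the cap is reached.
            if counts[kmer] < MAX_COVERAGE:
                counts[kmer] += 1
    return counts


# Three identical (duplicate) reads add no new k-mers;
# they only increase the coverage of the existing ones.
reads = ["ACGTAC", "ACGTAC", "ACGTAC"]
counts = count_kmers(reads, k=4)
```

With these toy reads, the three distinct 4-mers each end up with a coverage of 3.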


You just have to be sure that the reads are mostly randomly distributed across 
the genome.


You can do some quality control with Ray too. Just run it on your data with:



mpiexec -n 24 Ray -k 31 -p 1_1.fastq 1_2.fastq -p 2_1.fastq 2_2.fastq -o QualityControl


Then check this file:


QualityControl/CoverageDistributionAnalysis.txt


k-mer length:   31
Lowest coverage observed:       2
MinimumCoverage:        33
PeakCoverage:   108
RepeatCoverage: 183
Number of k-mers with at least MinimumCoverage: 9097604 k-mers
Estimated genome length:        4548802 nucleotides
Percentage of vertices with coverage 2: 33.1731 %
DistributionFile: coli/CoverageDistribution.txt
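
If you want to read these values from a script, something like the following Python sketch would do. The parser `parse_analysis` is a hypothetical helper written for this example; it only assumes the "Key: value" layout shown above. Note that in this sample the estimated genome length is exactly half the number of k-mers with at least MinimumCoverage, which would be consistent with each k-mer being counted together with its reverse complement (that interpretation is an inference from the numbers, not something stated in the file).

```python
def parse_analysis(text):
    """Turn 'Key: value' lines into a dict, keeping the leading number of each value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep or not value.strip():
            continue  # skip lines without a key/value pair
        token = value.split()[0]
        try:
            fields[key.strip()] = float(token) if "." in token else int(token)
        except ValueError:
            fields[key.strip()] = value.strip()  # non-numeric, e.g. a file path
    return fields


sample = """k-mer length:   31
Lowest coverage observed:       2
MinimumCoverage:        33
PeakCoverage:   108
RepeatCoverage: 183
Number of k-mers with at least MinimumCoverage: 9097604 k-mers
Estimated genome length:        4548802 nucleotides
Percentage of vertices with coverage 2: 33.1731 %
DistributionFile: coli/CoverageDistribution.txt
"""

fields = parse_analysis(sample)
kmers = fields["Number of k-mers with at least MinimumCoverage"]
estimated_length = fields["Estimated genome length"]
```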


If the estimated genome length is correct, then these duplications should not 
be a problem.

This estimation is done on the distributed graph before attempting to assemble 
anything.
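
To make "correct" concrete, one could compare the estimate with the expected genome size, as in this Python sketch. The function name, the 5% tolerance, and the expected size of about 4.6 Mb for E. coli are all assumptions chosen for illustration; use the reference length of your own strain.

```python
def length_looks_right(estimated, expected, tolerance=0.05):
    """Return True if the estimate is within `tolerance` (relative) of the expected size."""
    return abs(estimated - expected) / expected <= tolerance


# 4548802 is the estimate from the sample output above;
# 4600000 is an assumed approximate genome size for E. coli.
ok = length_looks_right(4548802, 4600000)
```

An estimate far below the expected size would suggest that part of the genome has little or no read coverage.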


If most of the genome positions have simply no reads because most of the reads 
come from a few highly represented regions, then this will be a problem with 
any assembler.


What exactly does the 87% measure?
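
For instance, if it is the share of reads whose first 50 bases repeat those of an earlier read, it could be reproduced roughly as in the Python sketch below. This is a simplified stand-in and not FastQC's actual procedure; the function name `duplicate_fraction` and the toy reads are invented for the example.

```python
def duplicate_fraction(reads, prefix_length=50):
    """Fraction of reads whose first `prefix_length` bases were already seen."""
    seen = set()
    duplicates = 0
    total = 0
    for read in reads:
        total += 1
        prefix = read[:prefix_length]
        if prefix in seen:
            duplicates += 1
        else:
            seen.add(prefix)
    return duplicates / total if total else 0.0


# The second read shares its first 50 bases with the first one,
# so 1 read out of 4 is counted as a duplicate.
reads = ["A" * 60, "A" * 50 + "C" * 10, "C" * 60, "G" * 60]
fraction = duplicate_fraction(reads)
```

With a measure like this, very deep coverage alone produces many "duplicates" even when the reads are randomly distributed, because many reads start at the same position.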



                              Sébastien

             http://github.com/sebhtml/ray



> Thanks for your help.
> Aarti
> 
> 
> Dr. Aarti Desai | Domain Specialist – Life Sciences Domain
> aarti_de...@persistent.co.in<mailto:aarti_de...@persistent.co.in> | Cell: 
> +91-9673009492 | Tel: +91-20-30236348
> Persistent Systems Ltd. | Partners in Innovation | 
> www.persistentsys.com<http://www.persistentsys.com/>
> 
> 
> 
> 
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
