Hi, I have now run Ray with only single-end reads, and it appears that it still tries to estimate insert sizes. Is there any way I can turn this off?
Ola

*** Step: Estimation of outer distances for paired reads
Date: Sat Oct 22 13:25:25 2011
Elapsed time: 5 hours, 34 minutes, 13 seconds
Since beginning: 20 hours, 20 minutes, 17 seconds
***

Quoting Sébastien Boisvert <sebastien.boisver...@ulaval.ca>:

> On 21/10/11 12:39 PM, Ola Wallerman wrote:
>> Hi again,
>>
>> just for your information: the strange insert size was indeed a fault
>> on my side (I used the same file for both ends for that library...).
>
> ;)
>
>> I will try to run the whole dataset during the weekend. I also did a run
>> on the ION ecoli dataset with a longer k-mer (51); it improved the N50
>> quite substantially, from 8.1 to 16.9 kb. Perhaps it is due to the
>> high indel rate.
>
> Presumably it is.
>
>> The bacterial contamination is not unexpected: when making libraries
>> with a very low amount of input, the risk that bacterial DNA from some of
>> the reagents gets included increases.
>
> OK !
>
>> Ola
>>
>> Quoting Sébastien Boisvert <sebastien.boisver...@ulaval.ca>:
>>
>>> On 20/10/11 05:20 PM, Ola Wallerman wrote:
>>>
>>>> Hi,
>>>>
>>>> thanks for the quick reply. The latency was between 245 - 275 ms. I am
>>>> running a test with ION torrent data now (from the new 318 chip) and
>>>> latencies are now ~ 133 ms. What is expected / good numbers?
>>>
>>> I guess you mean microseconds, not milliseconds.
>>> The latency depends on your interconnect technology.
>>>
>>>> For the first test run I just used a subset of reads that was at hand
>>>> in the right format. Also, I wasn't sure how many nodes would be
>>>> needed to handle all reads. I actually got an error at first - I used
>>>> Illumina _sequence.txt files and had to change filenames to *.fastq. I
>>>> suppose the quality values will be off, but I am not sure if they are
>>>> used in Ray?
>>>
>>> Ray does not utilise the qualities.
>>>
>>>> One reason to do de novo was to find contaminations, and apparently we
>>>> have a bacterial contaminant in some of the libraries (ChIP-seq
>>>> libraries with low input).
>>>
>>> Is this because your data was bar-coded and multiplexed with some
>>> other experiments as well ?
>>>
>>>> There were two things that apparently did not work: insert size was
>>>> way off for one of the libraries (103 bp, sd 8, it should be ~ 280
>>>> with sd 40).
>>>
>>> This may indicate some problem with your reads.
>>>
>>>> This was the major library (93 M pairs), the other insert
>>>> sizes are ok. Next time I will set it manually. The results for
>>>> contigs and scaffolds were exactly the same, we don't have any
>>>> mate-pairs but I would think the PE reads would help some with
>>>> scaffolding?
>>>
>>> Mostly to go through small repeats although these are utilised for
>>> scaffolding too.
>>>
>>>> The ION assembly already finished, if you are interested these are
>>>> the stats from 5.5 M reads on 4 nodes:
>>>>
>>>> Network testing: 5 seconds
>>>> File partitioning: 23 seconds
>>>> Sequence loading: 2 minutes, 8 seconds
>>>> K-mer counting: 4 minutes, 50 seconds
>>>> Coverage distribution analysis: 1 seconds
>>>> Graph construction: 13 minutes, 17 seconds
>>>> Edge purge: 1 minutes, 31 seconds
>>>> Selection of optimal read markers: 5 minutes, 39 seconds
>>>> Detection of assembly seeds: 1 minutes, 56 seconds
>>>> Estimation of outer distances for paired reads: 11 seconds
>>>> Bidirectional extension of seeds: 4 minutes, 33 seconds
>>>> Merging of redundant contigs: 4 minutes, 39 seconds
>>>> Generation of contigs: 0 seconds
>>>> Scaffolding of contigs: 1 minutes, 28 seconds
>>>> Total: 40 minutes, 42 seconds
>>>>
>>>> Contigs >= 100 nt
>>>> Number: 1348
>>>> Total length: 4451596
>>>> Average: 3302
>>>> N50: 8156
>>>> Median: 1194
>>>> Largest: 40215
>>>> Contigs >= 500 nt
>>>> Number: 852
>>>> Total length: 4337720
>>>> Average: 5091
>>>> N50: 8305
>>>> Median: 3586
>>>> Largest: 40215
>>>>
>>>> Cheers,
>>>
>>> Nice. For 454, Ion Torrent and PacBio, I need to add something to
>>> handle the insertions and deletions.
>>>
>>>> Ola
>>>>
>>>> Quoting Sébastien Boisvert <sebastien.boisver...@ulaval.ca>:
>>>>
>>>>> On 20/10/11 12:51 PM, Ola Wallerman wrote:
>>>>>
>>>>>> Hi Sebastien,
>>>>>
>>>>> Hi Ola,
>>>>>
>>>>>> I am trying out Ray for assembly of a human genome. I must say I am
>>>>>> quite surprised by the results from my first try since it worked
>>>>>> straight away without any problems, with only one program to run,
>>>>>> which is not what one is used to in the NGS field... I installed v
>>>>>> 1.7, ran it with 300 M HiSeq PE reads on 20 nodes and it finished
>>>>>> without any errors after ~ 12h, with ~1 Gbp assembled.
>>>>>
>>>>> One thing our team is aiming for with Ray is ease of use for the
>>>>> user. (It just works TM)
>>>>> The complexity (like the various stages of the algorithm) is
>>>>> encapsulated in Ray.
>>>>>
>>>>> Just out of curiosity, what is the inter-node latency of your
>>>>> compute resource ?
>>>>>
>>>>> Ray tests the network before doing its deed so the latency is in the file
>>>>> NetworkTest.txt. I am just curious though.
>>>>>
>>>>>> I wonder if you could give me any advice on how to run it in the best
>>>>>> way, e.g. should one use as many nodes as possible (we have 384 nodes
>>>>>> with at least 24 GB) and should reads be quality filtered beforehand?
>>>>>
>>>>> For the assemblathon, we did not filter reads at all.
>>>>> With Ray, filtering reads only reduces memory usage, I believe.
>>>>>
>>>>> If you know you have DNA contamination in your reads (non-human for
>>>>> instance), then you should filter reads.
>>>>>
>>>>> Adaptors utilised for the construction of so-called mate-pairs
>>>>> through the circularisation of long DNA molecules may be present if you
>>>>> have mate pairs. So far, it seems that the optimal read markers in
>>>>> Ray deal with that.
>>>>>
>>>>>> The dataset I have is around 200 M HiSeq paired reads (100 bp, inserts
>>>>>> ~150 to 300 bp) and ~3 billion short single end reads (~36 bp). I
>>>>>> tried now with k=27, but perhaps a higher k is better for the long
>>>>>> reads? The reason for doing the assembly is to use the contigs to get
>>>>>> a better precision in calling indels and rearrangements.
>>>>>
>>>>> Did you provide all the reads ?
>>>>>
>>>>> You have to be careful with a k that is too large because Ray does not
>>>>> attempt at all to correct the reads. The erroneous k-mers all go in
>>>>> an abyss (not the assembler !) and are not really utilised at all.
>>>>>
>>>>> In my experience, k=21 works well for bacteria; for other, larger
>>>>> genomes I usually utilise k=25 or k=31, although I don't have that much
>>>>> experience on large genomes aside from the assemblathon.
>>>>>
>>>>> You can check some files generated by Ray for your first assembly.
>>>>>
>>>>> In your assembly directory, the file CoverageDistributionAnalysis.txt
>>>>> contains the peak coverage.
>>>>>
>>>>> For your paired reads, the file LibraryStatistics.txt contains what
>>>>> Ray detected in your reads.
>>>>>
>>>>> This step is very important as paired reads are the workhorse to go
>>>>> from reads to k-mer graph to seeds to extensions.
>>>>>
>>>>> The importance of pairs is also highlighted by the recent
>>>>> application note published by Illumina using the MiSeq and Ray.
>>>>>
>>>>> http://www.illumina.com/documents/%5Cproducts%5Cappnotes%5Cappnote_miseq_denovo.pdf
>>>>>
>>>>> There is also a file called SeedLengthDistribution.txt
>>>>> This file contains the distribution of seed lengths. In Ray, a seed
>>>>> is a region of the genome that is unique. I like to say that a seed is
>>>>> mostly similar conceptually to unitigs in overlap-layout-consensus
>>>>> assemblers although I suspect there are some differences.
>>>>>
>>>>> Increasing k increases the uniqueness of sub-sequences extracted
>>>>> from reads but also reduces the usable sub-sequence coverage because
>>>>> of the sequencing errors. As I said above, you can assess your
>>>>> sub-sequence coverage (also known as k-mer coverage) by reading the
>>>>> content of the file CoverageDistributionAnalysis.txt
>>>>>
>>>>>> Best regards,
>>>>>
>>>>> Let me know if you have any other questions.
>>>>>
>>>>>> Ola
>>>>>
>>>>> Sébastien
>>>>>
>>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>> Ola Wallerman, PhD
>>>>>> IGP, Uppsala Universitet
>>>>>>
>>>>>> waller...@gmail.com
>>>>>> olawallerman@skype
>>>>>> 0736400172

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
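
The trade-off discussed in this thread (a larger k makes sub-sequences more unique but lowers the usable k-mer coverage) can be illustrated with a rough back-of-the-envelope sketch. This is not taken from Ray; it assumes the commonly used approximation C_k = C * (L - k + 1) / L for k-mer coverage and independent per-base errors. The function names, the 30x base coverage and the 1% error rate are hypothetical values chosen only for illustration; the 100 bp and 36 bp read lengths are the ones mentioned above.

```python
# Rough sketch, not Ray code: how the choice of k affects k-mer coverage.
# Assumptions (hypothetical, for illustration only):
#   - k-mer coverage follows C_k = C * (L - k + 1) / L
#   - sequencing errors are independent with a fixed per-base rate


def kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage for reads of a given length."""
    if k < 1:
        raise ValueError("k must be positive")
    if k > read_length:
        # Reads shorter than k contribute no k-mers at all.
        return 0.0
    return base_coverage * (read_length - k + 1) / read_length


def error_free_fraction(per_base_error, k):
    """Probability that a k-mer contains no sequencing error."""
    return (1.0 - per_base_error) ** k


def usable_kmer_coverage(base_coverage, read_length, k, per_base_error):
    """Expected coverage by error-free k-mers only."""
    return (kmer_coverage(base_coverage, read_length, k)
            * error_free_fraction(per_base_error, k))


if __name__ == "__main__":
    # 30x base coverage and 1% error rate are made-up example values.
    for k in (21, 27, 31, 51):
        long_reads = usable_kmer_coverage(30.0, 100, k, 0.01)
        short_reads = usable_kmer_coverage(30.0, 36, k, 0.01)
        print(f"k={k}: 100 bp reads -> {long_reads:.1f}x, "
              f"36 bp reads -> {short_reads:.1f}x")
```

Under these assumptions, reads shorter than k (for example 36 bp reads with k=51) contribute nothing, which is one reason to compare the estimate against the peak reported in CoverageDistributionAnalysis.txt before settling on a k value.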