On 21/10/11 12:39 PM, Ola Wallerman wrote:
> Hi again,
>
> just for your information: the strange insert size was indeed a fault
> on my side (I used the same file for both ends for that library...).

;)

> I will try to run the whole dataset during the weekend. I also did a run
> on the Ion Torrent E. coli dataset with a longer k-mer (51); it improved the N50
> quite substantially, from 8.1 to 16.9 kb. Perhaps it is due to the
> high indel rate.

Presumably it is.

> The bacterial contamination is not unexpected: when making libraries
> with a very low amount of input, the risk that bacterial DNA from some of
> the reagents gets included increases.

OK !

> Ola
>
> Quoting Sébastien Boisvert <sebastien.boisver...@ulaval.ca>:
>
>> On 20/10/11 05:20 PM, Ola Wallerman wrote:
>>> Hi,
>>>
>>> thanks for the quick reply. The latency was between 245 - 275 ms. I am
>>> running a test with Ion Torrent data now (from the new 318 chip) and
>>> latencies are now ~ 133 ms. What is expected / good numbers?
>>
>> I guess you mean microseconds, not milliseconds.
>> The latency depends on your interconnect technology.
>>
>>> For the first test run I just used a subset of reads that was at hand
>>> in the right format. Also, I wasn't sure how many nodes would be
>>> needed to handle all reads. I actually got an error at first - I used
>>> Illumina _sequence.txt files and had to change filenames to *.fastq. I
>>> suppose the quality values will be off, but I am not sure if they are
>>> used in Ray?
>>
>> Ray does not utilise the qualities.
>>
>>> One reason to do de novo was to find contaminations, and apparently we
>>> have a bacterial contaminant in some of the libraries (ChIP-seq
>>> libraries with low input).
>>
>> Is this because your data was bar-coded and multiplexed with some
>> other experiments as well ?
>>
>>> There were two things that apparently did not work: insert size was
>>> way off for one of the libraries (103 bp, sd 8, it should be ~ 280
>>> with sd 40).
>>
>> This may indicate some problem with your reads.
>>
>>> This was the major library (93 M pairs), the other insert
>>> sizes are ok. Next time I will set it manually.
>>> The results for contigs and scaffolds were exactly the same. We don't have any
>>> mate-pairs, but I would think the PE reads would help some with
>>> scaffolding?
>>
>> Mostly to go through small repeats, although these are utilised for
>> scaffolding too.
>>
>>> The Ion Torrent assembly already finished; if you are interested, these are
>>> the stats from 5.5 M reads on 4 nodes:
>>>
>>> Network testing: 5 seconds
>>> File partitioning: 23 seconds
>>> Sequence loading: 2 minutes, 8 seconds
>>> K-mer counting: 4 minutes, 50 seconds
>>> Coverage distribution analysis: 1 seconds
>>> Graph construction: 13 minutes, 17 seconds
>>> Edge purge: 1 minutes, 31 seconds
>>> Selection of optimal read markers: 5 minutes, 39 seconds
>>> Detection of assembly seeds: 1 minutes, 56 seconds
>>> Estimation of outer distances for paired reads: 11 seconds
>>> Bidirectional extension of seeds: 4 minutes, 33 seconds
>>> Merging of redundant contigs: 4 minutes, 39 seconds
>>> Generation of contigs: 0 seconds
>>> Scaffolding of contigs: 1 minutes, 28 seconds
>>> Total: 40 minutes, 42 seconds
>>>
>>> Contigs >= 100 nt
>>> Number: 1348
>>> Total length: 4451596
>>> Average: 3302
>>> N50: 8156
>>> Median: 1194
>>> Largest: 40215
>>>
>>> Contigs >= 500 nt
>>> Number: 852
>>> Total length: 4337720
>>> Average: 5091
>>> N50: 8305
>>> Median: 3586
>>> Largest: 40215
>>>
>>> Cheers,
>>
>> Nice. For 454, Ion Torrent and PacBio, I need to add something to
>> handle the insertions and deletions.
>>
>>> Ola
>>>
>>> Quoting Sébastien Boisvert <sebastien.boisver...@ulaval.ca>:
>>>
>>>> On 20/10/11 12:51 PM, Ola Wallerman wrote:
>>>>> Hi Sebastien,
>>>>
>>>> Hi Ola,
>>>>
>>>>> I am trying out Ray for assembly of a human genome. I must say I am
>>>>> quite surprised by the results from my first try since it worked
>>>>> straight away without any problems, with only one program to run,
>>>>> which is not what one is used to in the NGS field...
>>>>> I installed v1.7, ran it with 300 M HiSeq PE reads on 20 nodes, and it finished
>>>>> without any errors after ~ 12h, with ~1 Gbp assembled.
>>>>
>>>> One thing our team is aiming for with Ray is ease of use for the
>>>> user. (It just works TM)
>>>> The complexity (like the various stages of the algorithm) is
>>>> encapsulated in Ray.
>>>>
>>>> Just out of curiosity, what is the inter-node latency of your
>>>> compute resource ?
>>>>
>>>> Ray tests the network before doing its deed so the latency is in the file
>>>> NetworkTest.txt. I am just curious though.
>>>>
>>>>> I wonder if you could give me any advice on how to run it in the best
>>>>> way, e.g. should one use as many nodes as possible (we have 384 nodes
>>>>> with at least 24 GB) and should reads be quality filtered beforehand?
>>>>
>>>> For the assemblathon, we did not filter reads at all.
>>>> With Ray, filtering reads only reduces memory usage, I believe.
>>>>
>>>> If you know you have DNA contamination in your reads (non-human for
>>>> instance), then you should filter reads.
>>>>
>>>> Adaptors utilised for the construction of so-called mate-pairs
>>>> through the circularisation of long DNA molecules may be present if you
>>>> have mate pairs. So far, it seems that the optimal read markers in
>>>> Ray deal with that.
>>>>
>>>>> The dataset I have is around 200 M HiSeq paired reads (100 bp, inserts
>>>>> ~150 to 300 bp) and ~3 billion short single end reads (~36 bp). I
>>>>> tried now with k=27, but perhaps a higher k is better for the long
>>>>> reads? The reason for doing the assembly is to use the contigs to get
>>>>> a better precision in calling indels and rearrangements.
>>>>
>>>> Did you provide all the reads ?
>>>>
>>>> You have to be careful with a k that is too large because Ray does not
>>>> attempt at all to correct the reads. The erroneous k-mers all go in
>>>> an abyss (not the assembler !)
>>>> and are not really utilised at all.
>>>>
>>>> In my experience, k=21 works well for bacteria, and for other
>>>> larger genomes I usually utilise k=25 or k=31, although I don't have
>>>> that much experience on large genomes aside from the assemblathon.
>>>>
>>>> You can check some files generated by Ray for your first assembly.
>>>>
>>>> In your assembly directory, the file CoverageDistributionAnalysis.txt
>>>> contains the peak coverage.
>>>>
>>>> For your paired reads, the file LibraryStatistics.txt contains what
>>>> Ray detected in your reads.
>>>>
>>>> This step is very important as paired reads are the workhorse to go
>>>> from reads to k-mer graph to seeds to extensions.
>>>>
>>>> The importance of pairs is also highlighted by the recent application note
>>>> published by Illumina using the MiSeq and Ray.
>>>>
>>>> http://www.illumina.com/documents/%5Cproducts%5Cappnotes%5Cappnote_miseq_denovo.pdf
>>>>
>>>> There is also a file called SeedLengthDistribution.txt.
>>>> This file contains the distribution of seed lengths. In Ray, a seed
>>>> is a region of the genome that is unique. I like to say that a seed
>>>> is mostly similar conceptually to unitigs in
>>>> overlap-layout-consensus assemblers, although I suspect there are
>>>> some differences.
>>>>
>>>> Increasing k increases the uniqueness of sub-sequences extracted
>>>> from reads but also reduces the usable sub-sequence coverage because
>>>> of the sequencing errors. As I said above, you can assess your sub-sequence
>>>> coverage (also known as k-mer coverage) by reading the content of the file
>>>> CoverageDistributionAnalysis.txt
>>>>
>>>>> Best regards,
>>>>
>>>> Let me know if you have any other questions.
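
The trade-off described above (a larger k gives more unique sub-sequences but less usable k-mer coverage) can be put into rough numbers. The following is only a minimal sketch, not anything Ray itself does: it uses the textbook relation C_k = C * (L - k + 1) / L for error-free reads and assumes independent per-base errors, so that a k-mer is error-free with probability (1 - e)^k. The function names and the example values (30x base coverage, 100 bp reads, 1% error rate) are hypothetical and chosen for illustration.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage for error-free reads: C * (L - k + 1) / L."""
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return base_coverage * (read_length - k + 1) / read_length


def error_free_fraction(error_rate, k):
    """Probability that a k-mer has no error, assuming independent per-base errors."""
    return (1.0 - error_rate) ** k


def usable_kmer_coverage(base_coverage, read_length, k, error_rate):
    """Expected coverage contributed by error-free k-mers only."""
    return kmer_coverage(base_coverage, read_length, k) * error_free_fraction(error_rate, k)


if __name__ == "__main__":
    # Hypothetical inputs: 30x base coverage, 100 bp reads, 1% per-base error rate.
    for k in (21, 27, 31, 51):
        print(k, round(usable_kmer_coverage(30.0, 100, k, 0.01), 2))
```

Under these assumptions the usable coverage falls steadily as k grows, which is why the peak reported in CoverageDistributionAnalysis.txt is worth checking before raising k. Real error profiles (for example indel-heavy reads) will not follow the independent-error model exactly.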
>>>>
>>>>> Ola
>>>>
>>>> Sébastien
>>>>
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>> Ola Wallerman, PhD
>>>>> IGP, Uppsala Universitet
>>>>>
>>>>> waller...@gmail.com
>>>>> olawallerman@skype
>>>>> 0736400172

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
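
For readers comparing the N50 values quoted in this thread (8156 nt with the default run, roughly 16.9 kb with k=51), here is a minimal sketch of how N50 is commonly computed from a list of contig lengths. It uses the usual definition (the length of the contig at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total); whether Ray applies exactly this convention is an assumption, and the function name `n50` is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is taken here as the length L such that contigs of length >= L
    together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy contig lengths, not taken from the assembly discussed above.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 depends on the minimum contig length used for filtering, values are only comparable when computed over the same cutoff, as with the separate ">= 100 nt" and ">= 500 nt" tables above.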