Re: [Denovoassembler-users] RAY questions

Sébastien Boisvert Fri, 21 Oct 2011 09:43:57 -0700

On 21/10/11 12:39 PM, Ola Wallerman wrote:
> Hi again,
>
> just for your information: the strange insert size was indeed a fault
> on my side (I used the same file for both ends for that library...).


;)

>   I
> will try to run the whole dataset during the weekend. I also did a run
> on the Ion Torrent E. coli dataset with a longer k-mer (k=51); it
> improved the N50 quite substantially, from 8.1 to 16.9 kb. Perhaps it
> is due to the high indel rate.
>
>    

Presumably it is.

> The bacterial contamination is not unexpected: when making libraries
> with a very low amount of input, the risk that bacterial DNA from some
> of the reagents is included increases.
>
>    

OK !


> Ola
>
>
> Quoting Sébastien Boisvert<sebastien.boisver...@ulaval.ca>:
>
>    
>> On 20/10/11 05:20 PM, Ola Wallerman wrote:
>>      
>>> Hi,
>>>
>>> thanks for the quick reply. The latency was between 245 - 275 ms. I am
>>> running a test with Ion Torrent data now (from the new 318 chip) and
>>> latencies are now ~ 133 ms. What are expected / good numbers?
>>>
>>>
>>>        
>> I guess you mean microseconds, not milliseconds.
>> The latency depends on your interconnect technology.
>>
>>      
>>> For the first test run I just used a subset of reads that was at hand
>>> in the right format. Also, I wasn't sure how many nodes would be
>>> needed to handle all reads. I actually got an error at first - I used
>>> Illumina _sequence.txt files and had to change filenames to *.fastq. I
>>> suppose the quality values will be off, but I am not sure if they are
>>> used in Ray?
>>>
>>>
>>>        
>> Ray does not utilise the qualities.
>>
>>      
>>> One reason to do de novo was to find contaminations, and apparently we
>>> have a bacterial contaminant in some of the libraries (ChIP-seq
>>> libraries with low input).
>>>
>>>
>>>        
>> Is this because your data was bar-coded and multiplexed with some
>> other experiments as well ?
>>
>>      
>>> There were two things that apparently did not work: insert size was
>>> way off for one of the libraries (103 bp, sd 8, it should be ~ 280
>>> with sd 40).
>>>        
>> This may indicate some problem with your reads.
>>
>>      
>>> This was the major library (93 M pairs), the other insert
>>> sizes are ok. Next time I will set it manually. The results for
>>> contigs and scaffolds were exactly the same, we don't have any
>>> mate-pairs but I would think the PE reads would help some with
>>> scaffolding?
>>>
>>>
>>>        
>> Mostly to go through small repeats although these are utilised for
>> scaffolding too.
>>
>>      
>>> The Ion Torrent assembly already finished, if you are interested these
>>> are the stats from 5.5 M reads on 4 nodes:
>>>
>>>    Network testing: 5 seconds
>>>    File partitioning: 23 seconds
>>>    Sequence loading: 2 minutes, 8 seconds
>>>    K-mer counting: 4 minutes, 50 seconds
>>>    Coverage distribution analysis: 1 seconds
>>>    Graph construction: 13 minutes, 17 seconds
>>>    Edge purge: 1 minutes, 31 seconds
>>>    Selection of optimal read markers: 5 minutes, 39 seconds
>>>    Detection of assembly seeds: 1 minutes, 56 seconds
>>>    Estimation of outer distances for paired reads: 11 seconds
>>>    Bidirectional extension of seeds: 4 minutes, 33 seconds
>>>    Merging of redundant contigs: 4 minutes, 39 seconds
>>>    Generation of contigs: 0 seconds
>>>    Scaffolding of contigs: 1 minutes, 28 seconds
>>>    Total: 40 minutes, 42 seconds
>>>
>>> Contigs>= 100 nt
>>>    Number: 1348
>>>    Total length: 4451596
>>>    Average: 3302
>>>    N50: 8156
>>>    Median: 1194
>>>    Largest: 40215
>>> Contigs>= 500 nt
>>>    Number: 852
>>>    Total length: 4337720
>>>    Average: 5091
>>>    N50: 8305
>>>    Median: 3586
>>>    Largest: 40215
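
For readers unfamiliar with the N50 figures quoted in these statistics, here is a minimal sketch of how the metric is commonly computed. This is not Ray's own code; the function name `n50` and the contig lengths used are hypothetical and chosen only for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths, not taken from the assembly above.
contigs = [100, 200, 300, 400, 500]
print(n50(contigs))
```

Because N50 weights contigs by length, a few long contigs can raise it sharply even when the median stays low, which is consistent with the gap between N50 and median in the statistics above.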
>>>
>>> Cheers,
>>>
>>>
>>>        
>> Nice. For 454, Ion Torrent and PacBio, I need to add something to
>> handle the insertions and deletions.
>>
>>
>>      
>>> Ola
>>>
>>>
>>>
>>> Quoting Sébastien Boisvert<sebastien.boisver...@ulaval.ca>:
>>>
>>>
>>>        
>>>> On 20/10/11 12:51 PM, Ola Wallerman wrote:
>>>>
>>>>          
>>>>> Hi Sebastien,
>>>>>
>>>>>
>>>>>            
>>>> Hi Ola,
>>>>
>>>>
>>>>          
>>>>> I am trying out Ray for assembly of a human genome. I must say I am
>>>>> quite surprised by the results from my first try since it worked
>>>>> straight away without any problems, with only one program to run,
>>>>> which is not what one is used to in the NGS field... I installed
>>>>> v1.7, ran it with 300 M HiSeq PE reads on 20 nodes and it finished
>>>>> without any errors after ~ 12h, with ~1 Gbp assembled.
>>>>>
>>>>>
>>>>>
>>>>>            
>>>> One thing our team is aiming for with Ray is ease of use for the
>>>> user. (It just works TM)
>>>> The complexity (like the various stages of the algorithm) is
>>>> encapsulated in Ray.
>>>>
>>>> Just out of curiosity, what is the inter-node latency of your
>>>> compute resource ?
>>>>
>>>> Ray tests the network before doing its deed so the latency is in the file
>>>> NetworkTest.txt. I am just curious though.
>>>>
>>>>
>>>>          
>>>>> I wonder if you could give me any advice on how to run it in the best
>>>>> way, eg should one use as many nodes as possible (we have 384 nodes
>>>>> with at least 24 GB) and should reads be quality filtered beforehand?
>>>>>
>>>>>            
>>>> For the assemblathon, we did not filter reads at all.
>>>> With Ray, filtering reads only reduces memory usage, I believe.
>>>>
>>>> If you know you have DNA contamination in your reads (non-human for
>>>> instance),
>>>> then you should filter reads.
>>>>
>>>> Adaptors utilised for the construction of so-called mate-pairs
>>>> through the circularisation of long DNA molecules may be present if you
>>>> have mate pairs. So far, it seems that the optimal read markers in
>>>> Ray deal with that.
>>>>
>>>>
>>>>
>>>>
>>>>          
>>>>> The dataset I have is around 200 M HiSeq paired reads (100 bp, inserts
>>>>> ~150 to 300 bp) and ~3 billion short single end reads (~36 bp). I
>>>>> tried now with k=27, but perhaps a higher k is better for the long
>>>>> reads? The reason for doing the assembly is to use the contigs to get
>>>>> a better precision in calling indels and rearrangements.
>>>>>
>>>>>
>>>>>
>>>>>            
>>>> Did you provide all the reads ?
>>>>
>>>> You have to be careful with a k that is too large because Ray does not
>>>> attempt at all to correct the reads. The erroneous k-mers all go in
>>>> an abyss (not the assembler !)
>>>> and are not really utilised at all.
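
To illustrate why uncorrected errors matter more as k grows, here is a small sketch, assuming toy reads and a plain substring count. It is not how Ray stores or filters k-mers (for instance, reverse complements are ignored here), and the names `kmers` and `count_kmers` are hypothetical. A single substitution in one read creates up to k k-mers that occur only once, while the true k-mers keep a high count.

```python
from collections import Counter


def kmers(read, k):
    """Yield every k-mer (substring of length k) in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads, k):
    """Count k-mer occurrences over all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Hypothetical toy data: five error-free copies of a read and one copy
# carrying a single substitution (A -> T at 0-based position 8).
good = "ACGTTGCAAGCTTAGC"
bad = "ACGTTGCATGCTTAGC"
reads = [good] * 5 + [bad]

k = 5
counts = count_kmers(reads, k)

# k-mers seen exactly once are the ones spanning the substitution.
erroneous = sorted(kmer for kmer, c in counts.items() if c == 1)
print(len(erroneous), erroneous)
```

With a larger k, each error spoils more k-mers, so more of the sequenced bases end up in low-coverage k-mers that an assembler working without read correction cannot use.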
>>>>
>>>> In my experience, k=21 works well for bacteria. For larger genomes,
>>>> I usually use k=25 or k=31, although I don't have much experience
>>>> with large genomes aside from the Assemblathon.
>>>>
>>>>
>>>> You can check some files generated by Ray for your first assembly.
>>>>
>>>> In your assembly directory, the file CoverageDistributionAnalysis.txt
>>>> contains the peak coverage.
>>>>
>>>> For your paired reads, the file LibraryStatistics.txt contains what
>>>> Ray detected
>>>> in your reads.
>>>>
>>>> This step is very important as paired reads are the workhorse to go
>>>> from reads
>>>> to k-mer graph to seeds to extensions.
>>>>
>>>> The importance of pairs is also highlighted by the recent application note
>>>> published by Illumina using the MiSeq and Ray.
>>>>
>>>> http://www.illumina.com/documents/%5Cproducts%5Cappnotes%5Cappnote_miseq_denovo.pdf
>>>>
>>>> There is also a file called SeedLengthDistribution.txt
>>>> This file contains the distribution of seed lengths. In Ray, a seed
>>>> is a region of the genome
>>>> that is unique. I like to say that a seed is mostly similar
>>>> conceptually to unitigs in
>>>> overlap-layout-consensus assemblers although I suspect there are
>>>> some differences.
>>>>
>>>>
>>>> Increasing k increases the uniqueness of sub-sequences extracted
>>>> from reads but also reduces the usable sub-sequence coverage because
>>>> of the sequencing errors. As I said above, you can assess your sub-sequence
>>>> coverage (also known as k-mer coverage) by reading the content of the file
>>>> CoverageDistributionAnalysis.txt
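
The trade-off described above can be put into rough numbers. The sketch below uses the standard approximation that a read of length L yields L - k + 1 k-mers, and assumes independent, uniformly distributed errors; the base coverage (30x) and error rate (1%) are assumed values for illustration, not figures from this thread, and the function names are hypothetical.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage for error-free reads of a fixed length."""
    if k > read_length:
        return 0.0  # reads shorter than k contribute no k-mers
    return base_coverage * (read_length - k + 1) / read_length


def error_free_fraction(error_rate, k):
    """Probability that a k-mer contains no sequencing error."""
    return (1.0 - error_rate) ** k


# Assumed values for illustration only.
BASE_COVERAGE = 30.0
ERROR_RATE = 0.01

for read_length in (100, 36):
    for k in (21, 27, 31, 51):
        usable = (kmer_coverage(BASE_COVERAGE, read_length, k)
                  * error_free_fraction(ERROR_RATE, k))
        print(f"L={read_length} k={k} usable k-mer coverage ~ {usable:.1f}")
```

Under these assumptions the 36 bp reads lose most of their coverage at k=27 (only 10 k-mers per read) and contribute nothing at k=51, whereas the 100 bp reads degrade much more gradually.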
>>>>
>>>>
>>>>          
>>>>> Best regards,
>>>>>
>>>>>
>>>>>
>>>>>            
>>>> Let me know if you have any other questions.
>>>>
>>>>
>>>>          
>>>>> Ola
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>            
>>>> Sébastien
>>>>
>>>>
>>>>          
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>> Ola Wallerman, PhD
>>>>> IGP, Uppsala Universitet
>>>>>
>>>>> waller...@gmail.com
>>>>> olawallerman@skype
>>>>> 0736400172
>>>>>
>>>>>
>>>>>
>>>>>            
>>>>
>>>>          
>>>
>>>
>>>
>>>        
>>
>>      
>
>
>
>    


_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
