[Denovoassembler-users] Insert size estimation for single end reads

Ola Wallerman Tue, 25 Oct 2011 08:10:51 -0700

Hi,

I have now run Ray with only single-end reads, and it appears that it
still tries to estimate insert sizes. Is there any way I can turn this
off?


Ola

***
Step: Estimation of outer distances for paired reads
Date: Sat Oct 22 13:25:25 2011
Elapsed time: 5 hours, 34 minutes, 13 seconds
Since beginning: 20 hours, 20 minutes, 17 seconds
***



Quoting Sébastien Boisvert <sebastien.boisver...@ulaval.ca>:

> On 21/10/11 12:39 PM, Ola Wallerman wrote:
>> Hi again,
>>
>> just for your information: the strange insert size was indeed a fault
>> on my side (I used the same file for both ends for that library...).
>
> ;)
>
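The mix-up described above (the same file supplied for both ends of a library) can be caught before launching a long run. A minimal sketch, assuming uncompressed read files that can be hashed in one streaming pass; the function names and file names are hypothetical and are not part of Ray:

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(left_path, right_path):
    """True if both files have byte-identical content."""
    return file_digest(left_path) == file_digest(right_path)


# Demo on two throwaway files with identical content (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    left = os.path.join(tmp, "lib1_1.fastq")
    right = os.path.join(tmp, "lib1_2.fastq")
    for path in (left, right):
        with open(path, "w") as handle:
            handle.write("@read1\nACGT\n+\nIIII\n")
    if same_content(left, right):
        print("warning: both ends of the library point to identical data")
```

If the check fires for a paired library, the two file arguments most likely name the same data and the estimated outer distance will not be meaningful.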
>> I will try to run the whole dataset during the weekend. I also did a run
>> on the Ion Torrent E. coli dataset with a longer k-mer (51); it improved the
>> N50 quite substantially, from 8.1 to 16.9 kb. Perhaps it is due to the
>> high indel rate.
>>
>>
>
> Presumably it is.
>
>> The bacterial contamination is not unexpected: when making libraries
>> with very low amounts of input, the risk that bacterial DNA from some of
>> the reagents is included increases.
>>
>>
>
> OK !
>
>
>> Ola
>>
>>
>> Quoting Sébastien Boisvert <sebastien.boisver...@ulaval.ca>:
>>
>>
>>> On 20/10/11 05:20 PM, Ola Wallerman wrote:
>>>
>>>> Hi,
>>>>
>>>> thanks for the quick reply. The latency was between 245 - 275 ms. I am
>>>> running a test with Ion Torrent data now (from the new 318 chip) and
>>>> latencies are now ~ 133 ms. What are expected / good numbers?
>>>>
>>>>
>>>>
>>> I guess you mean microseconds, not milliseconds.
>>> The latency depends on your interconnect technology.
>>>
>>>
>>>> For the first test run I just used a subset of reads that was at hand
>>>> in the right format. Also, I wasn't sure how many nodes would be
>>>> needed to handle all reads. I actually got an error at first - I used
>>>> Illumina _sequence.txt files and had to change filenames to *.fastq. I
>>>> suppose the quality values will be off, but I am not sure if they are
>>>> used in Ray?
>>>>
>>>>
>>>>
>>> Ray does not utilise the qualities.
>>>
>>>
>>>> One reason to do de novo was to find contaminations, and apparently we
>>>> have a bacterial contaminant in some of the libraries (ChIP-seq
>>>> libraries with low input).
>>>>
>>>>
>>>>
>>> Is this because your data was bar-coded and multiplexed with some
>>> other experiments as well ?
>>>
>>>
>>>> There were two things that apparently did not work: insert size was
>>>> way off for one of the libraries (103 bp, sd 8, it should be ~ 280
>>>> with sd 40).
>>>>
>>> This may indicate some problem with your reads.
>>>
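One simple way to spot a library like the one above is to compare the estimated mean outer distance with the value expected from library preparation. A rough sketch, assuming a cutoff of two expected standard deviations; the function name and the cutoff are illustrative choices, not something Ray does:

```python
def outer_distance_is_suspicious(observed_mean, expected_mean,
                                 expected_sd, n_sd=2.0):
    """Flag an estimate lying more than n_sd expected SDs from the target."""
    return abs(observed_mean - expected_mean) > n_sd * expected_sd


# Values quoted in this thread: estimated 103 bp, expected ~280 bp, sd 40.
print(outer_distance_is_suspicious(103, 280, 40))
```

With the numbers from this thread the estimate is 177 bp away from the expected value, well beyond the 80 bp tolerance, so the library is flagged.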
>>>
>>>> This was the major library (93 M pairs), the other insert
>>>> sizes are ok. Next time I will set it manually. The results for
>>>> contigs and scaffolds were exactly the same; we don't have any
>>>> mate-pairs but I would think the PE reads would help some with
>>>> scaffolding?
>>>>
>>>>
>>>>
>>> Mostly to go through small repeats although these are utilised for
>>> scaffolding too.
>>>
>>>
>>>> The Ion Torrent assembly already finished; if you are interested, these are
>>>> the stats from 5.5 M reads on 4 nodes:
>>>>
>>>>   Network testing: 5 seconds
>>>>   File partitioning: 23 seconds
>>>>   Sequence loading: 2 minutes, 8 seconds
>>>>   K-mer counting: 4 minutes, 50 seconds
>>>>   Coverage distribution analysis: 1 seconds
>>>>   Graph construction: 13 minutes, 17 seconds
>>>>   Edge purge: 1 minutes, 31 seconds
>>>>   Selection of optimal read markers: 5 minutes, 39 seconds
>>>>   Detection of assembly seeds: 1 minutes, 56 seconds
>>>>   Estimation of outer distances for paired reads: 11 seconds
>>>>   Bidirectional extension of seeds: 4 minutes, 33 seconds
>>>>   Merging of redundant contigs: 4 minutes, 39 seconds
>>>>   Generation of contigs: 0 seconds
>>>>   Scaffolding of contigs: 1 minutes, 28 seconds
>>>>   Total: 40 minutes, 42 seconds
>>>>
>>>> Contigs>= 100 nt
>>>>   Number: 1348
>>>>   Total length: 4451596
>>>>   Average: 3302
>>>>   N50: 8156
>>>>   Median: 1194
>>>>   Largest: 40215
>>>> Contigs>= 500 nt
>>>>   Number: 852
>>>>   Total length: 4337720
>>>>   Average: 5091
>>>>   N50: 8305
>>>>   Median: 3586
>>>>   Largest: 40215
>>>>
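For readers unfamiliar with the N50 figure reported above: it is the contig length L such that contigs of length L or more hold at least half of the total assembled bases. A minimal sketch in Python, not Ray's own code; the function name `n50` and the toy lengths are assumptions for illustration:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example with made-up lengths (not the Ion Torrent assembly above).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

Because the statistic is weighted by length, a few long contigs can raise it sharply, which is why the median contig length above is much lower than the N50.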
>>>> Cheers,
>>>>
>>>>
>>>>
>>> Nice. For 454, Ion Torrent and PacBio, I need to add something to
>>> handle the insertions and deletions.
>>>
>>>
>>>
>>>> Ola
>>>>
>>>>
>>>>
>>>> Quoting Sébastien Boisvert <sebastien.boisver...@ulaval.ca>:
>>>>
>>>>
>>>>
>>>>> On 20/10/11 12:51 PM, Ola Wallerman wrote:
>>>>>
>>>>>
>>>>>> Hi Sebastien,
>>>>>>
>>>>>>
>>>>>>
>>>>> Hi Ola,
>>>>>
>>>>>
>>>>>
>>>>>> I am trying out Ray for assembly of a human genome. I must say I am
>>>>>> quite surprised by the results from my first try since it worked
>>>>>> straight away without any problems, with only one program to run,
>>>>>> which is not what one is used to in the NGS field... I installed
>>>>>> v1.7, ran it with 300 M HiSeq PE reads on 20 nodes and it finished
>>>>>> without any errors after ~ 12h, with ~1 Gbp assembled.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> One thing our team is aiming for with Ray is ease of use for the
>>>>> user. (It just works TM)
>>>>> The complexity (like the various stages of the algorithm) is
>>>>> encapsulated in Ray.
>>>>>
>>>>> Just out of curiosity, what is the inter-node latency of your
>>>>> compute resource ?
>>>>>
>>>>> Ray tests the network before doing its deed so the latency is in the file
>>>>> NetworkTest.txt. I am just curious though.
>>>>>
>>>>>
>>>>>
>>>>>> I wonder if you could give me any advice on how to run it in the best
>>>>>> way, e.g. should one use as many nodes as possible (we have 384 nodes
>>>>>> with at least 24 GB) and should reads be quality filtered beforehand?
>>>>>>
>>>>>>
>>>>> For the assemblathon, we did not filter reads at all.
>>>>> With Ray, filtering reads only reduces memory usage, I believe.
>>>>>
>>>>> If you know you have DNA contamination in your reads (non-human for
>>>>> instance),
>>>>> then you should filter reads.
>>>>>
>>>>> Adaptors utilised for the construction of so-called mate-pairs
>>>>> through the circularisation of long DNA molecules may be present if you
>>>>> have mate pairs. So far, it seems that the optimal read markers in
>>>>> Ray deal with that.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> The dataset I have is around 200 M HiSeq paired reads (100 bp, inserts
>>>>>> ~150 to 300 bp) and ~3 billion short single end reads (~36 bp). I
>>>>>> tried now with k=27, but perhaps a higher k is better for the long
>>>>>> reads? The reason for doing the assembly is to use the contigs to get
>>>>>> a better precision in calling indels and rearrangements.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> Did you provide all the reads ?
>>>>>
>>>>> You have to be careful with a k that is too large because Ray does not
>>>>> attempt at all to correct the reads. The erroneous k-mers all go in
>>>>> an abyss (not the assembler !)
>>>>> and are not really utilised at all.
>>>>>
>>>>> In my experience, k=21 works well for bacteria; for other, larger genomes,
>>>>> I usually use k=25 or k=31, although I don't have much experience with
>>>>> large genomes aside from the Assemblathon.
>>>>>
>>>>>
>>>>> You can check some files generated by Ray for your first assembly.
>>>>>
>>>>> In your assembly directory, the file CoverageDistributionAnalysis.txt
>>>>> contains the peak coverage.
>>>>>
>>>>> For your paired reads, the file LibraryStatistics.txt contains what
>>>>> Ray detected in your reads.
>>>>>
>>>>> This step is very important as paired reads are the workhorse to go
>>>>> from reads to k-mer graph to seeds to extensions.
>>>>>
>>>>> The importance of pairs is also highlighted by the recent application
>>>>> note published by Illumina using the MiSeq and Ray.
>>>>>
>>>>> http://www.illumina.com/documents/%5Cproducts%5Cappnotes%5Cappnote_miseq_denovo.pdf
>>>>>
>>>>> There is also a file called SeedLengthDistribution.txt
>>>>> This file contains the distribution of seed lengths. In Ray, a seed
>>>>> is a region of the genome
>>>>> that is unique. I like to say that a seed is mostly similar
>>>>> conceptually to unitigs in
>>>>> overlap-layout-consensus assemblers although I suspect there are
>>>>> some differences.
>>>>>
>>>>>
>>>>> Increasing k increases the uniqueness of sub-sequences extracted
>>>>> from reads but also reduces the usable sub-sequence coverage because
>>>>> of the sequencing errors. As I said above, you can assess your
>>>>> sub-sequence coverage (also known as k-mer coverage) by reading the
>>>>> content of the file CoverageDistributionAnalysis.txt.
>>>>>
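The trade-off described above can be made concrete with two standard back-of-the-envelope relations. This is an illustrative sketch, not how Ray produces CoverageDistributionAnalysis.txt. Assuming reads of length L at base coverage C, each read yields L - k + 1 k-mers, so the expected k-mer coverage is C * (L - k + 1) / L; assuming independent per-base errors at rate e, a k-mer is error-free with probability (1 - e)^k. The 30x coverage and 1% error rate below are hypothetical; the read lengths (100 and 36) are the ones mentioned in this thread.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage for reads of a fixed length."""
    if k > read_length:
        return 0.0  # reads shorter than k contribute no k-mers
    return base_coverage * (read_length - k + 1) / read_length


def error_free_fraction(error_rate, k):
    """Probability that a k-mer has no sequencing error,
    assuming independent per-base errors."""
    return (1.0 - error_rate) ** k


# Hypothetical inputs: 30x base coverage, 1% per-base error rate.
for read_length in (100, 36):
    for k in (21, 27, 31, 51):
        usable = (kmer_coverage(30.0, read_length, k)
                  * error_free_fraction(0.01, k))
        print(f"reads={read_length} bp  k={k}  usable k-mer coverage={usable:.1f}")
```

Under these assumptions the 36 bp reads lose most of their coverage at k=27 and contribute nothing at k=51, while the 100 bp reads degrade much more slowly.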
>>>>>
>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> Let me know if you have any other questions.
>>>>>
>>>>>
>>>>>
>>>>>> Ola
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> Sébastien
>>>>>
>>>>>
>>>>>
>>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>> Ola Wallerman, PhD
>>>>>> IGP, Uppsala Universitet
>>>>>>
>>>>>> waller...@gmail.com
>>>>>> olawallerman@skype
>>>>>> 0736400172
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>
>




