Hello,

Thank you for the positive feedback.

This is how open innovation happens -- through collaborative efforts.

I think that the user experience with Ray is straightforward.

I am presently testing the change I made and it seems to solve the issue of
having duplicated sequences with exactly the same reads. The same bug can also
lead to misassemblies, in theory.

At the organization where I do my research, we are presently assembling bacterial communities
and single bacterial genomes.

One of my colleagues also reported a similar issue this week.

Because Ray is distributed, the same path in the distributed de Bruijn graph may
be assembled more than once by several processor cores.

Hence, this redundancy must be removed. The plugin that does this used a
shortcut, some sort of magical code that works in almost all cases, which
assumed that the overlaps are randomly distributed.

In the rare cases where it fails, however, the same code could lead to a
misassembly. This was not occurring in any of my ~20 system tests.
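
To illustrate the idea of removing redundant paths, here is a minimal sketch in
Python. It is not Ray's actual code and it is not the shortcut mentioned above:
it compares the complete k-mer sets of the paths exhaustively, with no
assumption about how overlaps are distributed. The function names and the 0.9
threshold are hypothetical, and reverse-complement strands are ignored.

```python
def kmers(seq, k):
    """Return the set of all k-mers found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def remove_redundant_paths(paths, k, min_shared=0.9):
    """Drop paths whose k-mers are mostly contained in a longer kept path.

    Hypothetical illustration only: min_shared is an arbitrary threshold,
    and reverse complements are not considered.
    """
    kept = []
    kept_kmers = []
    # Visit the longest paths first so that shorter copies are the ones dropped.
    for path in sorted(paths, key=len, reverse=True):
        current = kmers(path, k)
        if not current:
            continue  # path shorter than k, nothing to compare
        redundant = any(
            len(current & other) / len(current) >= min_shared
            for other in kept_kmers
        )
        if not redundant:
            kept.append(path)
            kept_kmers.append(current)
    return kept
```

A check like this only catches paths that are almost entirely contained in
another one. Two contigs that share a long identical start and then diverge
would both be kept, so that case needs separate handling.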


p.s.: I redacted some bits that may be sensitive from your message.




Mitchell Stanton-Cook wrote:
Hi Seb,

Thank you. We're still doing some more work this end. We really really really like Ray! We are more than happy to work through these problems.

We hope to be pushing 1000's of bacterial genomes through Ray in a few months. We just want to be sure that we are getting useful results.

If there is anything else we can do to help/bugfix please do not hesitate to let us know.


Regards

Mitch



On Fri, May 25, 2012 at 11:23 AM, Sébastien Boisvert <sebastien.boisver...@ulaval.ca> wrote:

    Just to let you know that I may have found the problem.

    See https://github.com/sebhtml/ray/issues/55



                                                        Sébastien
    ________________________________________
    From: Mitchell Stanton-Cook [m.stantonc...@gmail.com]
    Sent: May 17, 2012 22:32
    To: Sébastien Boisvert
    Subject: Re: Duplicate sequence in contigs

    Hi Seb,

    Thanks for the reply.


    A bit more background:
    -------------------------------
    This is 100 bp Illumina PE data (insert of ~300 bp s.d of about ~10%).

    ~ 1000X coverage. Ray (1.7) did not like this high coverage (you
    have previously commented on the user-list about this). We sampled
    this down to ~100X coverage.

    We also cleaned the reads (you have pointed out this is not necessary,
    but having a consistent cleaned set makes downstream analysis, e.g.
    SNP calling, much easier). All input bases have a Q score >= 30.
    Reads > 70bp after trimming were filtered. After this the mean
    read size is ~97 bp. We only consider read pairs (if one of the
    sequences in the pairs fails a cleaning criterion, both do) and
    hence no single end reads go into Ray.

    Ray was executed like this (we used Ray's internal
    estimations/calculations to determine the best parameters):

    mpiexec -n $PROC Ray -i XXXX.fastq -k $K -o $OUT

    We looked at kmers from 15-35 in increments of 2.
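
A sweep like this can be scripted. The sketch below only builds the command
lines and runs nothing. It reuses the mpiexec invocation shown above. The
function name, the output prefix and the process count are hypothetical
placeholders.

```python
def ray_commands(proc, reads, k_min=15, k_max=35, step=2, out_prefix="ray_k"):
    """Build one mpiexec command line per k-mer length (nothing is executed).

    out_prefix is a hypothetical naming scheme for the output directories.
    """
    commands = []
    for k in range(k_min, k_max + 1, step):
        commands.append([
            "mpiexec", "-n", str(proc),
            "Ray", "-i", reads,
            "-k", str(k),
            "-o", f"{out_prefix}{k}",
        ])
    return commands
```

Calling ray_commands(8, "XXXX.fastq") gives one command per odd k-mer length
from 15 to 35. Each list can be joined with spaces for inspection or passed to
subprocess.run.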

    The genome is ~1.8 Mb. From the assemblies it appears there are
    not a lot of repetitive elements in the genome. I have attached a csv
    of the results (XXXX_ALL.csv) ( abbreviations: c= contigs, s=
    scaffold, N = scaffolding character).

    Results for kmer 21 and kmer 23 are interesting.

    kmer 21) As previously mentioned we have a 63 Kb duplication. For
    63 Kb the start of the two contigs are almost identical (3
    mismatches and 1 gap):



     Score = 1.165e+05 bits (63086),  Expect = 0.0

     Identities = 63093/63096 (99%), Gaps = 1/63096 (0%)
     Strand=Plus/Plus

    I have attached the alignment (XXXX_b2s.aln)

    There is also a smaller 5 Kb duplication detected:


    Score = 9583 bits (5189),  Expect = 0.0

     Identities = 5192/5193 (99%), Gaps = 1/5193 (0%)
     Strand=Plus/Minus

    kmer 23) (Notice there is about 63 Kb difference between the "c
    100 bp total len" in the .csv file of kmer 21 and kmer 23). From
    the blast2seq there is no such 63 Kb duplication found.

    Now, once again focusing on the "c 100 bp total len"  in the .csv
    file. I propose kmer 21 duplicate, kmer 23 non-duplicate, kmer 25
    non-duplicate, kmer 27 duplicate (I verified this is true), kmer 29
    non-duplicate. Strangely kmer 31 is missing about 200 Kb in
    comparison to the other assemblies.

    What we are wondering is:

    1) Why are there large, almost identical duplicates present in the
    assemblies?
    2) Why do we see it in some kmers and not others? (I could
    understand if lower kmer = duplicates, higher kmers =
    non-duplicates if these duplicates are propagated by a sequencing
    error)
    3) Is this a bug or an inherent issue with graph based assembly?
    4) If possible, how can we fix this?
    5) How can I help you with this?


    I hope I have provided you with enough information. If not please
    let me know.

    p.s. I have no problem with this going to the list as long as
    the attachments are not included.



    Regards

    Mitch


    On Fri, May 18, 2012 at 12:25 AM, Sébastien Boisvert
    <sebastien.boisver...@ulaval.ca> wrote:
    Hi,

    Is the 63 kB perfectly duplicated in your assembly?


    On 2012-05-17 01:26, Mitchell Stanton-Cook wrote:

    Hi Seb,

    Hope all is well.

    I was wondering if you have ever seen duplicate sequence in contigs.

    We have ~63 kB duplicate at the start of two different contigs.
    Beyond this it's unique.

    This is Ray 1.7.

    I'm re-running with Ray2.0rc5.

    I came across these posts (ABySS specific):

    
    http://groups.google.com/group/abyss-users/browse_thread/thread/264e894b4ec0c96d/30dab75afa686878
    http://groups.google.com/group/abyss-users/browse_thread/thread/7a03ee033b11afc4
    http://groups.google.com/group/abyss-users/browse_thread/thread/f0f3a650bd12cf1e


    Any ideas?

    Regards

    Mitch






_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
