[Please CC the mailing list]

On 24/04/13 12:33 PM, Alison Cloutier wrote:
> Hi Danny & Sebastien-
>
>       Sorry for the delay in responding- all of your emails from yesterday 
> just appeared in my inbox (U of T mail seems to hang on to things 
> sometimes...).
>
>       I'm trying to assemble 5 HiSeq runs of paired end data.

With such a huge amount of data, you should not collapse everything into 3
huge files.
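
For example, you can split each big FASTQ file into chunks of a few million
reads before giving them to Ray. Here is a minimal Python sketch (the file
names and chunk size are only examples, and it assumes plain, uncompressed
FASTQ):

import sys

def split_fastq(path, reads_per_chunk=10000000, prefix="chunk"):
    # Write consecutive chunks of reads_per_chunk reads (4 lines per read).
    chunk_index = 0
    reads_in_chunk = 0
    out = open("%s_%03d.fastq" % (prefix, chunk_index), "w")
    with open(path) as handle:
        while True:
            record = [handle.readline() for _ in range(4)]  # one FASTQ record
            if not record[0]:
                break
            if reads_in_chunk == reads_per_chunk:
                out.close()
                chunk_index += 1
                reads_in_chunk = 0
                out = open("%s_%03d.fastq" % (prefix, chunk_index), "w")
            out.writelines(record)
            reads_in_chunk += 1
    out.close()

if __name__ == "__main__":
    # e.g. python split_fastq.py reads_1.fastq run1_left
    split_fastq(sys.argv[1], prefix=sys.argv[2] if len(sys.argv) > 2 else "chunk")

If you split paired files, use the same chunk size on both mates so that the
N-th chunk of the first file still pairs with the N-th chunk of the second
file.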

> In pre-processing the data, I first collapsed identical read pairs within each
>  run (i.e. PCR optical duplicates), then quality-trimmed the reads to a 
> cutoff of Q=13 and minimum length
>  threshold of 25 bp, and lastly collapsed any overlapping read pairs into a 
> single contiguous read
>  (this is fairly common in our dataset as it's ancient DNA so many of the 
> inserts are of small size).

Ray will perform better if you provide the overlapping pairs as paired reads,
because its heuristics for paired reads are much better than its heuristics
for single-end reads.

>
>      I then ran each dataset on its own with Ray 2.1.0 (and a more reasonable 
> number of cores, i.e. 10 nodes [80 processes] on the large memory 32g nodes), 
> and had no problems.

Do you have 5 independent, unrelated samples, or 5 HiSeq runs of the same
sample?

> I did this step so that I could map the raw reads back onto the scaffolds,
> and then remove reads that map to scaffolds that blast as bacterial (i.e. to
> remove contaminant sequences, which account for a substantial proportion of
> the data when working with ancient DNA). After this final pre-processing
> step, the full dataset consisted of:

Oh, I see. You did these small assemblies to weed out contamination, right?

> 151,284,953 paired reads (up to 101 bp each)

That's not a lot considering that you had 5 HiSeq runs (one HiSeq 2000 run
yields about 6 000 000 000 sequences), unless you are talking about partial
HiSeq runs.

> 23,258,458 single reads (up to 101 bp, cases where partner was removed in 
> quality trimming)
> 84,988,500 'overlap' single reads (where overlapping pairs were merged, so up 
> to ~200 bp length)

Again, Ray will perform better with the non-merged version of these pairs.

>
> In comparison, the largest of the single run datasets consisted of: 
> 111,497,923 paired and 15,938,968 single reads so
>  running a full assembly didn't seem unreasonable. However, in the full 
> assembly the program would stop partway through with an error that one of the 
> ranks
>  had exited with 'signal 9 (killed)'.  I tried another run with the 
> 'show-memory-usage' options to try to see if it really was a memory issue, 
> and I also found
>  that as I incrementally increased the number of cores, Ray would progress 
> further but then would just stall.

You should update to Ray v2.2.0, as it has an adaptive Bloom filter that
lowers the memory usage caused by sequencing errors.
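
To give an idea of why that helps (this is only a toy sketch, not Ray's actual
implementation): first-time k-mers only set a few bits in a Bloom filter, and
a k-mer gets a real entry in the k-mer table only once it has been seen more
than once, so most singleton error k-mers never consume full hash-table memory.

import hashlib

class BloomFilter(object):
    def __init__(self, size_bits=80000000, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def count_kmers(reads, k=31):
    seen_once = BloomFilter()          # cheap bits for first-time k-mers
    counts = {}                        # only k-mers seen at least twice get an entry
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
            elif kmer in seen_once:    # second observation: promote to the table
                counts[kmer] = 2
            else:
                seen_once.add(kmer)    # first observation: just set a few bits
    return counts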

> In reading through the denovoassembler postings, Sebastien had suggested to
> other users that routing could be necessary when using a large number of
> processes and/or that disabling read

It depends on whether you are getting bad latency.

What are the first few lines of NetworkTest.txt ?
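
If you prefer a quick summary over eyeballing the file, a small sketch like
this one prints the first lines and averages any latency values it finds (I am
assuming here that measurement lines carry a number followed by the word
"microseconds"; adjust the pattern to whatever the file actually contains):

import re
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "NetworkTest.txt"
latencies = []
with open(path) as handle:
    for line_number, line in enumerate(handle):
        if line_number < 10:
            print(line.rstrip())       # the "first few lines" asked about above
        match = re.search(r"([0-9]+(?:\.[0-9]+)?)\s*microseconds", line)
        if match:
            latencies.append(float(match.group(1)))

if latencies:
    print("%d latency values, mean = %.1f microseconds"
          % (len(latencies), sum(latencies) / len(latencies)))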


> recycling could avoid cases where the program gets stalled in a loop, so
> that's why I was trying out those options.

Where exactly is Ray stalling on your dataset ? (You can check that with the 
standard output file)
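
A quick way to see that is to look at the tail of the job's standard output
and at the last message printed by each rank, for example with a small sketch
like this one (it assumes the usual "Rank <number>" prefix on Ray's log lines;
adapt the pattern if your output looks different):

import re
import sys

log_path = sys.argv[1]                 # the scheduler's standard output file
with open(log_path) as handle:
    lines = handle.readlines()

print("--- last 20 lines ---")
print("".join(lines[-20:]))

last_by_rank = {}
for line in lines:
    match = re.match(r"Rank (\d+)", line)
    if match:
        last_by_rank[int(match.group(1))] = line.rstrip()

print("--- last message seen from each rank ---")
for rank in sorted(last_by_rank):
    print(last_by_rank[rank])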

> It seems like it was stupid of me to leave all the reads in single gigantic
> files instead of splitting them up again after the pre-processing,

It's better for the storage system to have many files.

> but I'm not sure that this is the only issue as the same was done for the 
> single-dataset runs (or is it just the huge files PLUS
>  a huge number of processes that really causes issues?).
> Could there also be issues caused by the data itself - i.e. with ancient DNA,

I don't know about ancient DNA.

> we might expect a lot of fairly short contigs and much missing data (due to 
> the stochastic nature of what DNA survives intact from an ancient sample),
>  and ancient DNA itself is often more error prone due to damage/lesions.  
> Could these features cause issues with the assembly algorithm?

No. You would just get a graph with missing parts.

>
> Any ideas would be greatly appreciated- it seems like using the new Ray 
> release and splitting up the input files would be good ideas,
> and I'm really, really, really sorry to have caused issues for SciNet
> (really!).
>
>                Alison


_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
