Hello,

I found why k-mers longer than 32 were slow to compute with Ray 1.6.0.

vertexRank [core/common_functions.cpp] takes a Kmer and maps it to an
MPI rank.

For k-mers longer than 32, at least two uint64_t (64-bit integers) are
needed to store the sequence. The hash for these longer k-mers was buggy
because it was computed only on the first uint64_t, which broke the
uniformity of the vertex distribution across MPI ranks.

This was fixed in https://github.com/sebhtml/ray/commit/e1309521f7c2e


So, yes, with Ray 1.6.0 k-mers longer than 32 were not uniformly
distributed across MPI ranks, and this led to slow assemblies because
communication between MPI ranks was no longer balanced.
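
For illustration, here is a minimal, hypothetical sketch (not the actual
Ray code; KmerSketch, NUM_WORDS and mix64 are invented names) of why a
hash that only reads the first uint64_t sends every k-mer sharing its
first 32 bases to the same rank, while hashing all the words keeps the
distribution uniform:

// Minimal sketch, not the actual Ray source.
#include <cstdint>
#include <cstdio>

static const int NUM_WORDS = 2;   // a k-mer with k > 32 needs >= 2 uint64_t words

struct KmerSketch {
    uint64_t words[NUM_WORDS];    // packed nucleotides, 2 bits per base
};

// A generic 64-bit mixing step (splitmix64-style finalizer).
static uint64_t mix64(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

// Buggy behaviour: only words[0] contributes, so two k-mers that share
// their first 32 bases always land on the same MPI rank.
static int vertexRankBuggy(const KmerSketch& k, int numberOfRanks) {
    return (int)(mix64(k.words[0]) % (uint64_t)numberOfRanks);
}

// Fixed behaviour: every word contributes to the hash, so the vertices
// are spread uniformly across the MPI ranks.
static int vertexRankFixed(const KmerSketch& k, int numberOfRanks) {
    uint64_t h = 0;
    for (int i = 0; i < NUM_WORDS; i++)
        h = mix64(h ^ k.words[i]);
    return (int)(h % (uint64_t)numberOfRanks);
}

int main() {
    // Two k-mers that differ only in the second word (i.e. beyond base 32).
    KmerSketch a = {{0x1234567890abcdefULL, 0x0000000000000001ULL}};
    KmerSketch b = {{0x1234567890abcdefULL, 0x0000000000000002ULL}};
    int ranks = 64;
    std::printf("buggy: %d %d\n", vertexRankBuggy(a, ranks), vertexRankBuggy(b, ranks));
    std::printf("fixed: %d %d\n", vertexRankFixed(a, ranks), vertexRankFixed(b, ranks));
    return 0;
}

With the buggy version, both k-mers get the same rank; with the fixed
version they are hashed independently, so the load spreads out again.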

This is now fixed.

Sébastien 
http://twitter.com/sebhtml

On Thu, 2011-06-16 at 07:39 -0400, Adrian Platts, Mr wrote:
> Hi Sebastien,
> 
> I'm replying to you directly rather than to the list, as I get rather annoyed 
> with lists where every minute detail of the debugging process goes to everyone
> on the list.
> 
> 
> You said it was still in 'computing vertices and edges'.
> 
> Do you get updates, or does it look hung?
> 
> Yep, after 16 hours it moved on to the next stage. It was going very slowly 
> (not frozen), so I killed it in order to adjust MAXKMERLENGTH down from 128 to 64.
> 
> OK, that must be something elsewhere.
> 
> Can you try with MAXKMERLENGTH=64 ?
> 
> I recompiled with 64 and reran. It was definitely a bit faster, finishing 
> the "computing vertices and edges" step in about 8 hours
> (with the short k-mers the whole assembly normally completes in under 3 hours 
> - see output below).
> 
> I left it running overnight and checked this morning.  The processes had 
> perhaps partially completed but only a few output files had been
> generated.  It seems to have halted abnormally at some point, perhaps while 
> computing contigs?
> 
> [aplatts@grandiflora Ray-1.6.0]$ ls -alt
> total 472
> -rw-rw-r--  1 aplatts aplatts   1523 Jun 15 19:17 SI_61_MP28065MPSI_Scaff.LibraryStatistics.txt
> drwxr-xr-x  8 aplatts aplatts   4096 Jun 15 19:17 .
> -rw-rw-r--  1 aplatts aplatts  42080 Jun 15 19:07 SI_61_MP28065MPSI_Scaff.SeedLengthDistribution.txt
> -rw-rw-r--  1 aplatts aplatts    329 Jun 15 16:10 SI_61_MP28065MPSI_Scaff.CoverageDistributionAnalysis.txt
> -rw-rw-r--  1 aplatts aplatts 110604 Jun 15 16:10 SI_61_MP28065MPSI_Scaff.CoverageDistribution.txt
> -rw-rw-r--  1 aplatts aplatts    517 Jun 15 11:32 SI_61_MP28065MPSI_Scaff.RayCommand.txt
> -rw-rw-r--  1 aplatts aplatts     19 Jun 15 11:32 SI_61_MP28065MPSI_Scaff.RayVersion.txt
> -rw-rw-r--  1 aplatts aplatts      9 Jun 15 11:31 TARGETS
> -rw-rw-r--  1 aplatts aplatts     17 Jun 15 11:31 PREFIX
> drwxr-xr-x 12 aplatts aplatts   4096 Jun 15 11:31 code
> -rw-rw-r--  1 aplatts aplatts      0 Jun 15 11:30 showOptions
> drwxrwxr-x  2 aplatts aplatts     99 Jun 14 13:22 Ray-Large-k-mers
> drwxrwxrwx 41 aplatts aplatts   4096 Jun 14 13:10 ..
> 
> Sorry, I wasn't collecting the output during this run - I guess I should 
> rerun it?  I have the exact same run with k=31 (1.4.0) where
> there weren't any problems:
> 
> -rw-rw-r-- 1 aplatts aplatts 223547921 May 31 10:21 SI_31_MP28065MPSI_Scaff.Scaffolds.fasta
> -rw-rw-r-- 1 aplatts aplatts       285 May 31 10:20 SI_31_MP28065MPSI_Scaff.OutputNumbers.txt
> -rw-rw-r-- 1 aplatts aplatts   1877068 May 31 10:20 SI_31_MP28065MPSI_Scaff.ScaffoldComponents.txt
> -rw-rw-r-- 1 aplatts aplatts    625703 May 31 10:20 SI_31_MP28065MPSI_Scaff.ScaffoldLengths.txt
> -rw-rw-r-- 1 aplatts aplatts    914872 May 31 10:20 SI_31_MP28065MPSI_Scaff.ContigLengths.txt
> -rw-rw-r-- 1 aplatts aplatts    511876 May 31 10:20 SI_31_MP28065MPSI_Scaff.ScaffoldLinks.txt
> -rw-rw-r-- 1 aplatts aplatts 220073016 May 31 10:15 SI_31_MP28065MPSI_Scaff.Contigs.fasta
> -rw-rw-r-- 1 aplatts aplatts      1523 May 31 09:45 SI_31_MP28065MPSI_Scaff.LibraryStatistics.txt
> -rw-rw-r-- 1 aplatts aplatts     54734 May 31 09:42 SI_31_MP28065MPSI_Scaff.SeedLengthDistribution.txt
> -rw-rw-r-- 1 aplatts aplatts       171 May 31 08:46 SI_31_MP28065MPSI_Scaff.CoverageDistributionAnalysis.txt
> -rw-rw-r-- 1 aplatts aplatts    189495 May 31 08:46 SI_31_MP28065MPSI_Scaff.CoverageDistribution.txt
> -rw-rw-r-- 1 aplatts aplatts       500 May 31 08:02 SI_31_MP28065MPSI_Scaff.RayCommand.txt
> -rw-rw-r-- 1 aplatts aplatts        19 May 31 08:02 SI_31_MP28065MPSI_Scaff.RayVersion.txt
> 
> I couldn't see many differences in the output files; the percentage of 
> vertices with coverage 1 was very slightly higher in the k=61 run
> (32% vs. 30% - both seem high?), but not by much.
> 
> I guess it could have run out of memory, but 256 GB was available, only 
> about 32 GB was being used by Ray when I left it, and I don't see anything 
> in the kernel messages about allocation failures or OOM-killer activity.
> 
> Adrian
> 
> 
> 
> 



