Ketil, 

Given that the dump looks fine, my gut feeling is that the histogram function 
does indeed operate on partially evaluated data and that memory somehow gets 
corrupted. Can you write an NFData instance (from deepseq) for your Judy 
array? After all, you could dump to a [(String,Int)] instead of an IO handle 
and re-parse the data. I guess that forcing the k-mer itself to be fully 
evaluated does no good (it's a Word64 anyway, isn't it?); you'd need to make 
sure the result of the insertWith (+) operation is fully evaluated. 
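
A small sketch of what I mean - this assumes the plain Data.Judy API
(new/lookup/insert), and the names are mine, not from your code:

  import qualified Data.Judy as J
  import Data.Maybe (fromMaybe)

  -- Increment the count for a key, forcing the new Int before it is
  -- stored, instead of relying on insertWith to evaluate the thunk.
  addCountStrict :: J.JudyL Int -> J.Key -> IO ()
  addCountStrict j k = do
    old <- fromMaybe 0 `fmap` J.lookup k j
    let new = old + 1
    new `seq` J.insert k new j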

Finally, can you rule out a plain integer overflow? I had sequencing runs 
where fastx_nucleotide_distribution_graph.sh simply overflowed the Int 
counter; using Haskell's Integer produced correct results (a quick GHCi 
illustration follows below). Still, the fact that the bit-width parameter 
influences the outcome sounds very much like data corruption. 
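
Just to show what I mean by overflow: a fixed-width Int wraps around
silently, while Integer keeps growing. An illustrative GHCi session,
assuming a 64-bit Int:

  ghci> maxBound :: Int
  9223372036854775807
  ghci> (maxBound :: Int) + 1
  -9223372036854775808
  ghci> fromIntegral (maxBound :: Int) + 1 :: Integer
  9223372036854775808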

Olaf

> On 09.05.2017 at 09:29, Ketil Malde <ke...@malde.org> wrote:
> 
> 
> On 2017-05-04 12:14 (+0200), Michał J Gajda <mjga...@gmail.com> wrote:
> 
>> Interesting. Do You have an error report filed anywhere to peruse?
> 
> Well, I wrote the below.  In the meantime, I've tried replacing
> "insertWith" with a manual lookup and update.  This changed the cases
> where the segfaults happen, but didn't eliminate them.
> 
> And I tried using the safe freeze function, to no avail.
> 
> And I implemented some of the functionality using a Vector, which gives
> correct results (but without reimplementing a real hash table, it can
> only be used for the histogram collection, not k-mer counting, since
> with large k that requires full 64-bit indexes into the vector).
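> 
> (A sketch of that approach with illustrative names - not the actual
> code: a mutable unboxed vector indexed by the count value.)
> 
>   import qualified Data.Vector.Unboxed.Mutable as VM
> 
>   -- Tally a list of counts into a histogram vector; the indexes here
>   -- are small Ints, unlike the full 64-bit k-mer keys.
>   mkHistogram :: Int -> [Int] -> IO (VM.IOVector Int)
>   mkHistogram maxCount cs = do
>     v <- VM.replicate (maxCount + 1) 0
>     mapM_ (\c -> VM.modify v (+1) c) cs
>     return v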
> 
> I've also run on several different computers, so I don't think it's a
> hardware error.
> 
> -k
> 
> ----------------------------------------
>  So: I have a program that indexes k-mers (fixed-size words) as Word64's,
>  storing the distinct ones along with their counts in a Judy array and, at
>  the end, writing them to a file.  So it's an associative map,
>  k-mer -> count.
> 
>  Then, the second stage re-reads this file and builds a histogram,
>  tallying how often each count occurs.  Let's call this count -> frequency.
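> 
>  (A sketch of that tallying step with illustrative names - not the
>  actual code - again using a JudyL, keyed by the count value, with
>  Data.Judy imported as J as in the code further down:)
> 
>    histogram :: [Int] -> IO (J.JudyL Int)
>    histogram cs = do
>      h <- J.new :: IO (J.JudyL Int)
>      mapM_ (\c -> J.insertWith (+) (fromIntegral c) 1 h) cs
>      return h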
> 
>  For some time, this just seemed to work.  But then I implemented an
>  expectation-maximization algorithm to model the frequency distributions,
>  and that failed to terminate in some cases.  Going back, I found the
>  following histogram to cause the trouble:
> 
>  ==> SRR901891_1.32.hist <==
>  865     809
>  866     775
>  867     757
>  868     781
>  869     725
>  870     778
>  871     795
>  872     774
>  873     771
>  48036128        2017612633061982208
> 
>  Notice that last value?  Apparently, there are 2.018e18 different
>  k-mers each with a count of 48036128.  This is clearly not right.
> 
>  Notice also that the latter number is 0x01c000...000.  Does this carry
>  any special meaning?
> 
>  One possibility is of course some garbage in the k-mer count index, but
>  I also have a dump command that prints the contents as text.  I can find
>  nothing wrong with that, and the highest-count k-mers look like this:
> 
>  % sort -nr -k2 *dump* | head         
>  tgggggtccagccatggagaagagtttagaca        1670602
>  gggggtccagccatggagaagagtttagacac        1653146
>  gtctaaactcttctccatggctggacccccaa        1638158
>  gaatattatttgggggtccagccatggagaag        1567591
>  ggaatattatttgggggtccagccatggagaa        1539378
>  actagtgcttaggaaatctattggaggcagaa        1532596
>  gcttctgcctccaatagatttcctaagcacta        1531091
>  aggaatattatttgggggtccagccatggaga        1528806
>  ctagtgcttaggaaatctattggaggcagaag        1527682
>  aaaggaatattatttgggggtccagccatgga        1518392
> 
>  One thing to look into is how the histogram is built.  I have wrapped
>  this in a function (mk_judy, code below) that returns a struct of
>  operations, called FreqCount.  mk_judy takes as an argument the number
>  of bits to use in the index - but this isn't really used for anything
>  (JudyL is JudyL).  Yet, when I changed this parameter from 16 to 32, the
>  last line changed to:
> 
>  17180896        2017612633061982208
> 
>  And with 64:
> 
>  37271776        2017612633061982208
> 
>  The latter number is always the same; the former ends with 0x8e0 but
>  appears to vary in the most significant bits.
> 
>  Everything else in the histograms is the same; I did a diff on the
>  outputs.
> 
>  -- Code ----------------------------------------
>  mk_judy :: Int -> IO FreqCount
>  mk_judy l = do
>    j <- J.new :: IO (J.JudyL Int)
>    let ac k = k `seq` J.insertWith (+) k 1 j     -- increment count for k
>        gc k = k `seq` fromMaybe 0 `fmap` J.lookup k j
>        sc k v = J.insert k v j
>        es = J.unsafeFreeze j >>= J.elems         -- read-only views via unsafeFreeze
>        ks = J.unsafeFreeze j >>= J.keys
>        as = J.unsafeFreeze j >>= J.toList
>    return $ FreqCount { key_bits = l
>                       , add_count = ac, get_count = gc, set_count = sc
>                       , keys = ks, counts = es, assocs = as
>                       }
>  -- Code ----------------------------------------
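> 
>  (FreqCount itself isn't shown here; it's a plain record of IO actions
>  roughly along these lines, with field types as implied by the usage
>  in mk_judy above:)
> 
>    data FreqCount = FreqCount
>      { key_bits  :: Int
>      , add_count :: J.Key -> IO ()         -- increment the count for a key
>      , get_count :: J.Key -> IO Int        -- current count for a key
>      , set_count :: J.Key -> Int -> IO ()  -- overwrite the count for a key
>      , keys      :: IO [J.Key]
>      , counts    :: IO [Int]
>      , assocs    :: IO [(J.Key, Int)]
>      }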
> 
>  If we're sure there is nothing wrong with the Judy library, the only
>  other thing I can think of is the unsafeFreeze.  I'm pretty sure I don't
>  modify the Judy array after accessing it through the "frozen" functions
>  - but could it e.g. be the case that j is frozen with some operations
>  still unevaluated?  I tried adding "j `seq`" before unsafeFreeze, and
>  got:
> 
>  33769696        2017612633061982208
> 
>  Again, 0x...8e0 and the usual 0x01c0...
> 
>  Anyway, that's where it stands at the moment.  I'm not sure what to do,
>  and it does worry me that this counting seems to fail on occasion.  Note
>  also that the histogram gets truncated (always at the same place, after
>  873), and it should really go up to 1.5 million.
> ----------------------------------------
> 
> -- 
> If I haven't seen further, it is by standing in the footprints of giants

_______________________________________________
Biohaskell mailing list
Biohaskell@biohaskell.org
http://biohaskell.org/cgi-bin/mailman/listinfo/biohaskell
