Ketil, Given that the dump looks fine, my gut feeling is that indeed the histogram function operates on partially evaluated data and somehow the memory is corrupted. Can you write a Deepseq instance for your Judy array? After all, you could dump to a [(String,Int)] instead of an IO handle and re-parse the data. I guess that forcing the k-mer itself to be fully evaluated does no good (it's a Word64, anyways?). You'd need to make sure the insertWith (+) operation is fully evaluated.
Finally, can you rule out a normal overflow? I had sequencing runs where fastx_nucleotide_distribution_graph.sh simply overflowed the Int counter. Using Haskell's Integer produced correct results. Yet the thing with the bit number influencing the outcome sounds very much like data corruption. Olaf > Am 09.05.2017 um 09:29 schrieb Ketil Malde <ke...@malde.org>: > > > On 2017-05-04 12:14 (+0200), Michał J Gajda <mjga...@gmail.com> wrote: > >> Interesting. Do You have an error report filed anywhere to peruse? > > Well, I wrote the below. In the mean time, I've tried replacing > "insertWith" with a manual lookup and update. This changed the cases > where the segfaults happen, but didn't eliminate them. > > And I tried using the safe freeze function, to no avail. > > And I implemented some of the functionality using a Vector, which gives > correct results (but without reimplementing a real hashtable, it can > only be used for the histogram collection, not k-mer counting, since > with large k that requires full 64-bit indexes into the vector) > > I've also run on several different computers, so I don't think it's a > hardware error. > > -k > > ---------------------------------------- > So: I have a program that indexes k-mers (fixed-size words) as Word64's, > storing the different ones along with their counts in a Judy array, and > at the end, writing to a file. So an associative map mapping > k-word -> count. > > Then, the second stage re-reads this file, and builds a histogram, > tallying how often each count occurs. let's call this count -> frequency > > For some time, this just seemed to work. But then I implemented an > expectation-maximization algorithm to model the frequency distributions, > and that failed to terminate in some cases. Going back, I found the > following histogram to cause the trouble: > > ==> SRR901891_1.32.hist <== > 865 809 > 866 775 > 867 757 > 868 781 > 869 725 > 870 778 > 871 795 > 872 774 > 873 771 > 48036128 2017612633061982208 > > Notice that last value? Apparently, there are 2.018e18 different > k-mers with a frequency of 48036128. This is clearly not right. > > Notice also that the latter number is 0x01c000...000. Does this carry > any special meaning? > > One possibility is of course some garbage in the k-mer count index, but > I also have a dump command that prints the contents as text. I can find > nothing wrong with that, and the highest count k-mers look like this: > > % sort -nr -k2 *dump* | head > tgggggtccagccatggagaagagtttagaca 1670602 > gggggtccagccatggagaagagtttagacac 1653146 > gtctaaactcttctccatggctggacccccaa 1638158 > gaatattatttgggggtccagccatggagaag 1567591 > ggaatattatttgggggtccagccatggagaa 1539378 > actagtgcttaggaaatctattggaggcagaa 1532596 > gcttctgcctccaatagatttcctaagcacta 1531091 > aggaatattatttgggggtccagccatggaga 1528806 > ctagtgcttaggaaatctattggaggcagaag 1527682 > aaaggaatattatttgggggtccagccatgga 1518392 > > One thing to look into is how the histogram is built. I have wrapped > this in a function (mk_judy, code below) that returns a struct of > operations, called FreqCount. mk_judy takes as an argument the number > of bits to use in the index - but this isn't really used for anything > (JudyL is JudyL). Yet, when I changed this parameter from 16 to 32, the > last line changed to: > > 17180896 2017612633061982208 > > And with 64: > > 37271776 2017612633061982208 > > The latter number is always the same, the former ends with 0x8e0, but > appears to vary in the most significant bits. > > Everything else is the same in the histograms, I did a diff on the > outputs. > > -- Code ---------------------------------------- > mk_judy :: Int -> IO FreqCount > mk_judy l = do > j <- J.new :: IO (J.JudyL Int) > let ac k = k `seq` J.insertWith (+) k 1 j > gc k = k `seq` fromMaybe 0 `fmap` J.lookup k j > sc k v = J.insert k v j > es = J.unsafeFreeze j >>= J.elems > ks = J.unsafeFreeze j >>= J.keys > as = J.unsafeFreeze j >>= J.toList > return $ FreqCount { key_bits = l > , add_count = ac, get_count = gc, set_count = sc > , keys = ks, counts = es, assocs = as > } > -- Code ---------------------------------------- > > If we're sure there is nothing wrong with the Judy library, the only > other thing I can think of is the unsafeFreeze. I'm pretty sure I don't > modify the Judy array after accessing it through the "frozen" functions > - but could it e.g. be the case that j is frozen with some operations > still unevaluated? I tried adding "j `seq`" before unsafeFreeze, and > got: > > 33769696 2017612633061982208 > > Again, 0x...8e0 and the usual 0x01c0... > > Anyway, that's where it stands at the moment. I'm not sure what to do, > it does worry me that this counting seems to fail on occasion. Note > also that the histogram gets truncated (always at the same place, after > 873), and it should really go up to 1.5 million. > ---------------------------------------- > > -- > If I haven't seen further, it is by standing in the footprints of giants > _______________________________________________ > Biohaskell mailing list > Biohaskell@biohaskell.org > http://biohaskell.org/cgi-bin/mailman/listinfo/biohaskell _______________________________________________ Biohaskell mailing list Biohaskell@biohaskell.org http://biohaskell.org/cgi-bin/mailman/listinfo/biohaskell